“Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it’s the data collection and labeling.” — Luke de Oliveira, Lawrence Berkeley National Laboratory
Access to AI technology is quickly becoming democratized. Developers today have access to frameworks and libraries such as TensorFlow, PyTorch, and Keras, while major platforms like Amazon, Facebook, and Google also provide useful services.
Consequently, it’s an exciting time for startups entering the field of machine learning and AI. While the opportunities for AI applications expand, the hype surrounding machine learning distracts investors and founders from a key hurdle: in many cases, it’s training data, not the algorithms themselves, that will determine whether the final AI product is a competitive asset.
Put simply, AI training data refers to input data for a machine-learning model alongside the desired output, which the AI should eventually learn to produce by itself. For example, a training data set can contain images of cats and dogs labeled as such, tweets marked as either positive or negative for a sentiment analysis algorithm, or audio recordings alongside transcripts for a machine-learning transcription engine.
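To make the idea of "input alongside desired output" concrete, here is a minimal Python sketch of a labeled sentiment dataset. The `LabeledExample` class, the helper function, and the sample tweets are all invented for illustration; real projects will have their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """One training example: a model input paired with its desired output."""
    text: str   # the input the model will see
    label: str  # the desired output, assigned by a human annotator


# Made-up examples for illustration only.
sentiment_data = [
    LabeledExample("I love this phone", "positive"),
    LabeledExample("The battery died after a day", "negative"),
    LabeledExample("Great service, will come back", "positive"),
]


def split_inputs_and_labels(examples):
    """Separate inputs from desired outputs, the shape most training code expects."""
    inputs = [ex.text for ex in examples]
    labels = [ex.label for ex in examples]
    return inputs, labels


inputs, labels = split_inputs_and_labels(sentiment_data)
```

The same pairing applies to the other examples: an image with a "cat" or "dog" label, or an audio file with its transcript.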
Training data is increasingly becoming a competitive edge for companies to stand out from other players in this sector. Without large volumes of high-quality, labeled data, machine-learning algorithms will never deliver a good ROI. Algorithms find relationships, develop understanding, make decisions, and evaluate their own confidence in their predictions — all based on the training data they’re given. The better the training data is, the better the model performs. This is especially true for tasks that involve language, where a large volume of rich, well-annotated data is essential to ensure a successful project outcome.
Humans Helping Humans Train Successful Applications
Through my work with Gengo, I have a bird's-eye view of this process, and I've developed an understanding of how developers can best structure their AI/ML projects for maximum success. Gengo started life as a crowd-powered translation service with a focus on quality and precision. We've built a multilingual crowd of more than 21,000 skilled linguists, and we provide a range of language services to large clients that include Amazon and Facebook. Since we often help businesses with services to improve their AI investments, we wondered: what if we optimized our offering for the needs of developers working on machine-learning projects that require specialized language datasets?
This led to the formation of Gengo.ai: a platform for any business or enterprise that needs fast access to top-quality multilingual data to succeed in its AI ambitions. With Gengo.ai, developers have access to a huge crowd platform that is 100-percent focused on language tasks — those that involve natural language, speech, communication, and multilingual projects.
Why Use Crowdsourcing to Gain Access to Language Training Data?
Like most outsourcing decisions, it comes down to defining your core competence. Most companies simply don’t have the in-house expertise to manage the collection and curation of a large language data set. Add to that the large opportunity cost of building a platform with their own engineering resources, and the ROI just doesn’t pencil out. By outsourcing, the development team also benefits from the cost and time efficiencies of a service that’s specifically designed to manage the process of defining, submitting, and gathering language-based data. Ultimately, it all boils down to an important outcome: faster time to market of a better-trained AI product.
What about Other Crowdsourcing Services?
Many developers are already familiar with Amazon Mechanical Turk, an early player in this field and a great resource, especially for smaller tasks. It's also cost-competitive because it draws on a very large pool of low-cost workers. However, with Mechanical Turk, you're less likely to find a concentration of specific experts in your crowd, which can be a determining factor for developers with specialized needs. Also, Amazon offers little hand-holding to ensure the quality of the data, which means the burden of quality control falls on the developer. Consequently, the development team may have to iterate through several submissions to the crowd in order to arrive at the optimal dataset, which can increase overall project duration and costs.
Other crowdsourcing solutions such as Upwork are great for finding 1-2 people with whom you can closely interact. However, these don’t scale well because they aren’t built with platform technology such as job-distribution and quality-management systems to scale tasks effectively across 10, 100, or more than 1,000 people.
The Stages of a Typical Training Project
There are four phases to consider as you integrate training into your overall project timeline:
- Vetting and planning phase. Vetting and planning are critical to ensure a successful machine-learning project. Often, we see that the development team has no idea what it will cost to acquire the data needed to train a machine-learning application. So, make sure you secure firm quotes from several service providers as early as possible. Also, plan your data acquisition phase early in the project. If you need a large volume of data, make sure you build sufficient time into your schedule for the data provider to properly assess your project and provide a detailed timeline.
- Pilot phase. This phase starts with initial data gathering: you and your data provider collaborate to define the scope and specifics of the data you need. From this, you define an initial dataset that is suitable for first tests.
- Calibration phase. Even if your dataset seems perfect, at times during training you will run into unexpected errors. Since it’s impossible to account for all of these obstacles, you should agree on clear metrics to evaluate the quality of your data. It’s also important to sample some of your data to confirm that it won’t skew towards certain results and outcomes, as this might prove costly later in the project’s development. Some things to check for could include labeling guidelines that don’t function as intended or bias introduced from certain data sources.
- At-scale acquisition phase. Once you’re calibrated and have everything in place, you should be able to mass-produce data. At this point, the time, effort and cost required for the project will drop significantly. Depending on the scale of your project, you may want to consider investing in further increasing efficiency. You could do this through directly integrating with an API to cut out any overhead costs from manual workflows.
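The calibration checks described above can be partly automated. The following Python sketch shows two simple checks on a sample of delivered data: whether one label dominates the sample, and how often two annotators agree on the same items. The function names, the 0.8 skew threshold, and the sample labels are assumptions made for illustration; the right metrics and thresholds are something to agree on with your data provider.

```python
import random
from collections import Counter


def label_shares(labels):
    """Return each label's share of a non-empty sample as a fraction of 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def is_skewed(labels, max_share=0.8):
    """Flag a sample in which one label exceeds max_share (an assumed threshold)."""
    return max(label_shares(labels).values()) > max_share


def agreement_rate(annotator_a, annotator_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(annotator_a) != len(annotator_b):
        raise ValueError("both annotators must label the same items")
    matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
    return matches / len(annotator_a)


# Made-up labels for illustration only.
delivered = ["positive"] * 9 + ["negative"] * 1

# Draw a repeatable random spot-check sample for manual review.
spot_check = random.Random(0).sample(delivered, k=5)

skewed = is_skewed(delivered)
agreement = agreement_rate(
    ["positive", "negative", "positive", "positive"],
    ["positive", "negative", "negative", "positive"],
)
```

A skewed sample or a low agreement rate does not prove the data is bad, but either is a reasonable prompt to revisit the labeling guidelines or the data sources before moving to at-scale acquisition.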
How Should a Developer Structure a Request for Language Training Data?
Based on my experience, here are five tips to maximize your project success:
- It’s crucial to define the scope of your project. Collection, labeling and cleaning of data all carry a different price, so be clear on which services you need from your data providers.
- Determine whether your data tasks require the workers to work in a specific tool (maybe even one proprietary to your company), or whether you will leave it up to the data provider to figure out the working environment.
- Figure out the specific instructions that the workers will need. For example, they will need to know the procedure for data points that don't apply: skipping them or marking them "N/A" are two possible options. You'll probably want to give your own guidance, then have the data provider's experts add further instructions, since they have seen more edge cases and might be able to anticipate specific problems.
- Align with your provider on timeline requirements. There can be a ramp-up phase to identify the right worker cohort, and sometimes the data provider might even need to bring additional workers into the crowd. Be careful with urgent requests and define the timeline early in the project.
- Of course, it's important to consider pricing. From the outset, decide which matters more: obtaining a certain number of data points or remaining within a fixed budget. This frees you to focus solely on the quality of the data when negotiating.
Sources of detailed, accurate data used to be hard to find. However, thanks to the growth and diversification of the industry, there is now a range of services offering crowd-based solutions to this problem. With several companies now offering cheap, efficient data creation and annotation, it's worth investigating how they might impact your project.
Training data is no longer a sticking point, but a golden opportunity to boost your ROI. With a bit of due diligence, it’s entirely possible to find competitively priced data that can lift the performance of your model to new heights.