McKinsey Global Institute recently reported that companies adopting all five forms of AI — computer vision, natural language, virtual assistants, robotic process automation, and advanced machine learning — stand to benefit disproportionately vs. their competitors. However, before organizations can enjoy the benefits of AI, they must ensure the data they use for their initiatives is usable and unbiased.
After all, Machine Learning (ML) algorithms are only as good as the data on which they are trained, and one worrying trend is the emergence of biased algorithms. To reduce algorithmic bias, organizations must first ensure the training data they use is as free as possible from bias.
Bias in ML training data can take many forms, but the end result is the same: the algorithm can miss the relevant relations between features and target outputs. Whether your organization is a small business, global enterprise, or governmental agency, it’s essential that you mitigate bias in your training data at every phase of your Artificial Intelligence (AI) initiatives.
Training Data Makes AI Work
A machine learning model is usually built in three phases: training, validation, and testing. In the training phase, a large amount of data is annotated (labeled by humans or by another method) and fed to a machine learning algorithm with a specific result in mind. The algorithm looks for patterns in the training data that map the input data attributes to the target, then outputs a model that captures these patterns. For the model to be useful, it needs to be accurate, and accuracy requires data that points to the requisite target or target attribute. Validation and testing help refine and prove the model.
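As a rough illustration of the three phases, the sketch below shuffles a set of annotated examples and divides it into training, validation, and test sets. The function name `split_dataset`, the 70/15/15 proportions, and the sample records are assumptions chosen for the example, not figures from any particular project.

```python
import random


def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(records)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


# Hypothetical annotated examples: (input text, label) pairs.
examples = [(f"query {i}", "boots" if i % 2 else "sandals") for i in range(100)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

The model is fitted on the training set, tuned against the validation set, and judged once on the test set, so a bias that is present in all three splits will not show up as a drop in test accuracy.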
High-Quality Training Data Must Be Unbiased Training Data
Machines need massive volumes of data to learn. Accurately annotating training data is as critical as the learning algorithm itself. A common reason that ML models fall short in terms of accuracy is that they were created based on biased training data.
Without high-quality, unbiased data to train machine learning models, investment in AI initiatives is money wasted. A recent study from Infosys found that 49% of IT decision-makers reported that their organization is unable to deploy the AI technologies they want because their data is not ready to support the requirements of AI technologies…
What Causes Training Data Bias and What Is the Consequence?
Engineers and data scientists, as well as executive roles such as the chief technology officer, should carefully consider the prejudices they inherently carry when building AI solutions and do what they can to correct for these prejudices.
Bias in ML models, or “machine bias,” can be a result of unbalanced data. Imagine a dataset for a search query classifier on an e-commerce website, built to predict relevant results for a given search term: “women’s shoes.” A typical example of data bias would be a dataset that consists mostly of high heels, sandals, and boots, with very few samples of athletic shoes. A classifier trained on this unbalanced dataset is going to lean heavily toward shoes that resemble the sample data, and fail to return relevant results to someone who is looking for women’s tennis shoes. This is bias in action. Straightforward, but critical to correct.
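One simple way to catch this kind of imbalance before training is to count how often each label appears. The sketch below is a minimal example under stated assumptions: the label counts are invented to mirror the shoe scenario, and the 10% threshold in `underrepresented` is an arbitrary illustrative cutoff, not a recommended standard.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.10):
    """List the labels whose share of the dataset falls below min_share."""
    shares = class_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Hypothetical label counts for the "women's shoes" training set.
labels = (
    ["high heels"] * 40
    + ["sandals"] * 30
    + ["boots"] * 27
    + ["athletic"] * 3
)
print(underrepresented(labels))  # ['athletic']
```

A flagged class is a prompt to collect or annotate more examples of it, or to reweight the data, before the classifier is trained.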
As machine learning projects get more complex, with subtle variants to identify, it becomes crucial to have training data that is human-annotated in a completely unbiased way. When annotating training data, human bias can wreak havoc on the accuracy of a machine learning model. Imagine creating an ML model intended to distinguish not only washers from dryers, but also the condition of each appliance.
If you have a team of in-house personnel annotating the images used as training data, it’s essential they adhere to a completely unbiased approach to classifying the images. Suppose they are classifying a variety of shoe styles by gender, which is a subjective judgment for many styles. Without diverse perspectives among the annotators, you risk creating a less-than-accurate machine learning model.
For example, if your mobile app depends on the ability to comb e-commerce sites for appliances in a particular condition within a specific price range, a biased, inaccurate ML model is not going to drive the adoption the app needs to succeed.
How Do We Ensure Our Training Data Isn’t Biased?
To help ensure optimal results, it’s essential that organizations have tech teams with diverse members in charge of both building models and creating training data. In addition to building a diverse team, organizations should also take the following suggestions into consideration when attempting to mitigate bias in their data.
- If training data comes from internal systems, try to find the most comprehensive data and experiment with different datasets and metrics.
- If training data is collected or processed by external partners, it is important to recruit a diverse crowd of annotators so the data is more representative.
- Design the data annotation tasks correctly and carefully communicate instructions so that the crowd correctly performs the tasks without knowing how the data will be used. Knowing what the data may be used for may impact the judgments an annotator makes.
- Once the training data is created, it’s important to check if the data has any implicit bias.
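One possible check on finished annotations is to measure how closely two annotators agree on the same items. Low agreement does not prove bias, but it marks the subjective judgments where individual bias is most likely to enter. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects for chance agreement; the two annotators and their labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # based on how often each annotator used each label.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators classifying six shoe styles by gender.
annotator_a = ["women", "women", "men", "unisex", "men", "women"]
annotator_b = ["women", "unisex", "men", "unisex", "men", "men"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

A kappa of 1 means perfect agreement and a value near 0 means agreement no better than chance. Items where annotators disagree are good candidates for clearer instructions or review by a more diverse group.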
It Is Up to Humans to Reduce Machine Bias
Organizations that perform data annotation internally have likely discovered that it can be difficult to visualize high-dimensional training data and check it for biases. ML teams should regularly validate machine learning models and test them for bias. At the end of the day, it’s important to remember that machine learning algorithms will be as biased as the people who collected, contextualized, and fed them their training data. While getting ahead of competitors in the race for AI adoption may be crucial for business success, humans must still oversee algorithms.
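One way to test a trained model for bias is to break its accuracy down by subgroup on held-out data instead of relying on a single overall score. The sketch below is a minimal example; the function `accuracy_by_group`, the predictions, and the subgroup names are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions within each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical held-out results for a search relevance classifier.
y_true = ["relevant", "relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
y_pred = ["relevant", "relevant", "irrelevant", "irrelevant", "irrelevant", "irrelevant"]
groups = ["heels", "heels", "heels", "athletic", "athletic", "athletic"]

scores = accuracy_by_group(y_true, y_pred, groups)
gap = max(scores.values()) - min(scores.values())
print(scores, round(gap, 2))
```

Here the overall accuracy looks acceptable, yet the model is right on every "heels" example and on only one of three "athletic" examples. A large gap between subgroups is a signal to revisit the training data.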
Ultimately, it is up to us — CTOs, CEOs, CIOs, data scientists, machine learning engineers, and product managers — to determine the path machine learning algorithms take. As AI practitioners, we should carefully consider the prejudices we inherently carry when creating these technologies and correct for them.