July 9, 2024

The Hidden Workforce of AI: Data Annotation Teams

The allure of AI's "magic" often obscures the human labor behind it.

7 min read

Meet our Editor-in-chief

Paul Estes

For 20 years, Paul struggled to balance his home life with fast-moving leadership roles at Dell, Amazon, and Microsoft, where he led a team of progressive HR, procurement, and legal trailblazers to launch Microsoft’s Gig Economy freelance program.

  • PwC estimates that AI will add over $15 trillion to the global economy, but businesses won’t see these numbers without investing in accurate data annotation.

  • Businesses rely on massive amounts of data to train AI models, sometimes a petabyte or more for a single large language model. This data needs human labeling for accuracy.

  • Data annotation teams are critical in ensuring AI models function correctly, especially in high-risk fields like medicine and self-driving cars.

Staff writer


When Amazon launched its Just Walk Out AI stores, where customers could pick up any products they wanted and simply walk out, the technology seemed truly magical. Fast-forward, and it turned out that the “AI model” was, in fact, over 1,000 employees in India who carefully monitored customers on cameras and then billed their accounts.

Artificial intelligence is far from magic. It often involves thousands of hours of human labor. Even fully functional AI systems wouldn’t exist if it weren’t for teams of data annotators that go through the painstaking process of constructing accurate training data.

Data annotation—labeling data sets for AI training—is at the core of every AI model. AI promises to add $15.7 trillion to the global economy by 2030. However, without precise data sets, AI models cannot be trained for a specific function or deliver the ROI that enterprises are being promised.

Human labor is still essential in developing and launching applicable AI models. Let’s explore AI's hidden workforce, delving into the critical role of data annotation teams.

The Hidden Workforce: The Rise of Annotation Teams

Artificial intelligence models rely on massive volumes of high-quality data for training. A single large language model (like those behind AI chatbots) can ingest petabytes of data and contain billions of parameters by the end of its training stages.

A model’s capacity is measured in parameters, while its training data is counted in tokens. The sheer quantity of data companies need to gather, especially for LLMs, involves drawing on millions of sources. GPT-2 (the predecessor to the widely available ChatGPT tools of today) used around 1.5 billion parameters. Although OpenAI doesn’t disclose where it draws its training data, the media theorizes that it scrapes internet data to train its models.

To feed these millions of articles and billions of individual data points into a model, researchers must first collect data, clean it (by removing duplicates, incorrect data, or unrelated information), and then feed it into their model. An AI model will then undergo a training process to understand the data context better, building up a tokenized representation of that specific data set. Tokenization breaks text into smaller units (tokens) that a model can process; the model then learns to predict the next token from previous context, which is vital to how AI chatbots function.
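The collect-clean-tokenize steps above can be sketched in a few lines. This is a deliberately minimal illustration: the function names are invented for this example, and the whitespace tokenizer stands in for the subword tokenizers real LLM pipelines use.

```python
# Minimal sketch of the clean-then-tokenize steps described above.
# Function and variable names are illustrative, not from any real pipeline.

def clean_corpus(documents):
    """Deduplicate and drop empty or malformed entries before training."""
    seen = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        if not text:       # drop empty/unusable entries
            continue
        if text in seen:   # remove exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

def tokenize(text):
    """Naive whitespace tokenization; real models use subword tokenizers."""
    return text.lower().split()

raw = ["The cat sat.", "The cat sat.", "", "  Dogs bark.  "]
corpus = clean_corpus(raw)
tokens = [tokenize(t) for t in corpus]
print(corpus)  # ['The cat sat.', 'Dogs bark.']
print(tokens)  # [['the', 'cat', 'sat.'], ['dogs', 'bark.']]
```

Production pipelines also apply near-duplicate detection, language filtering, and quality scoring, but the basic shape — filter, deduplicate, tokenize — is the same.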

Source: Process of data cleaning for AI models.

Training an AI model is highly intensive, requiring precise data collection and formatting, alongside budget-breaking processing fees to run the model during its training phase. According to OpenAI, training GPT-4 cost around $100 million.

The Importance of Data Quality

Every impressive benefit of artificial intelligence that enterprises pursue is locked behind high-quality data. With clean, structured, and applicable data sets for training, businesses can develop models that enable them to unlock the real value of AI. 

Behind every successful AI model is a team of data scientists. Beyond just creating, training, and configuring the model, teams of data scientists also need to capture data and transform it into a high-quality state for training. Dirty data, including inconsistencies, duplicates, or errors, can transfer those mistakes into an AI’s algorithm, inducing bias and leading to output errors.

Annotation teams must collect and prepare data to ensure that AI models can effectively use it to develop unique functionalities. In any deployment where even a tiny error could create significant issues, AI projects are even more reliant on data annotation teams:

  • Medical Diagnosis – AI models that aid in medical imagery and diagnosis must avoid false positives or false negatives at all costs. Annotation teams have to carefully label features within medical images for training to avoid errors in the model’s output. The Stanford Medical ImageNet provides over a petabyte of searchable, fully human-annotated radiology and pathology images, which serve as the foundation for many medical AI models.
  • Translation Technology – Lionbridge uses a human-in-the-loop annotation system in which linguist experts conduct language-based annotation for text and images. The platform uses these human annotation solutions to provide generative AI translation technology, with human input creating a highly accurate and high-performance AI translation model.
  • Facial Recognition – Human data annotators must work with facial data to label features, especially when creating detailed repositories of representative samples across ethnicities. Facial recognition technology, such as Face++, has notoriously had bias issues in the past, which human annotators are attempting to overcome. KeyLabs uses human annotators for facial recognition training, especially for pictures of people wearing masks. Their tireless work has provided the basis for AI facial recognition models with claimed accuracy of 99.9%.
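To make the labeling work concrete, here is a hypothetical shape for a single image-annotation record, loosely modeled on common bounding-box formats. Every field name here is illustrative, not drawn from any of the platforms mentioned above.

```python
# Hypothetical shape of one image-annotation record; field names are
# illustrative only, not from Stanford Medical ImageNet or any vendor.
annotation = {
    "image_id": "scan_00421",
    "annotator": "radiologist_07",
    "labels": [
        {
            "class": "nodule",
            "bbox": [132, 88, 41, 37],  # x, y, width, height in pixels
            "confidence": "high",
        }
    ],
}

def validate(record):
    """Basic sanity checks an annotation pipeline might run before training."""
    assert record["image_id"] and record["annotator"]
    for label in record["labels"]:
        x, y, w, h = label["bbox"]
        assert w > 0 and h > 0, "bounding box must have positive area"
    return True

print(validate(annotation))  # True
```

Automated validation like this catches malformed labels early, but it cannot judge whether a label is medically correct — that is exactly why human annotators remain essential.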

Effective labeling, categorization, and annotation of training data is the first and potentially most fundamental step in creating an AI model. Without teams of humans working behind the scenes on data annotation, AI as we know it today would not exist.

Commenting on this, one AI Sessions attendee stated, “100% of the time, we queue all of our outputs for review. We have AI doing things in real time, but then we have an annotation team that reviews every single thing that’s done and then corrects it. With our higher-confidence stuff, we need to figure out a process for how to migrate off that. We have some strategies, but we need to rebuild the tooling to support that.”

Building a High-Quality Data Annotation Pipeline

Before an enterprise builds a high-quality data annotation pipeline, it must first define the shape of the project. Depending on the deployment of an AI model, the complexity, volume, and labeling requirements will shift. An AI model that works as an internal chatbot will have far lighter annotation needs than a medical imaging model, for example.

After identifying the form of annotation a business needs, it can strategically source a team to handle data labeling. Enterprises have three central options to choose from:

  • Internal Teams—Businesses can form internal data scientist teams, enabling rapid data annotation. However, an internal team may lack the necessary skills to effectively annotate data for AI purposes. Typically, only enterprises with an extensive data science division and the budget to invest in upskilling can rely on internal teams.
  • Managed Service Providers (MSPs)—MSPs offer pre-built data annotation teams with expertise in specific domains. Teams may have previous experience with medical imagery, satellite imagery, or another form of data relevant to AI model training. SuperAnnotate, a widely used annotation platform, has over 400 highly trained annotators with professional backgrounds in everything from medicine to law.
  • Crowdsourcing Platforms – Finally, businesses can turn to crowdsourced data labeling via online marketplaces. For example, Amazon Mechanical Turk (MTurk) is a distributed workforce that handles data validation and annotation. This choice enhances efficiency, but crowdsourced workers may lack the specialist knowledge that an MSP offers.
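One common way to offset the variable quality of crowdsourced annotators is to have several workers label the same item and aggregate by majority vote. The sketch below illustrates the idea; the function name and data are invented for this example.

```python
from collections import Counter

# Majority-vote aggregation of crowdsourced labels — a common quality
# safeguard when individual annotators may be unreliable. Illustrative only.

def majority_label(labels):
    """Return the most common label and the fraction of workers who chose it."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)

# Three workers label the same image; two of three agree.
label, agreement = majority_label(["cat", "cat", "dog"])
print(label, round(agreement, 2))  # cat 0.67
```

A low agreement rate flags items for escalation to an expert reviewer, which is how crowdsourcing pipelines compensate for the specialist knowledge gap noted above.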

Source: Amazon Mechanical Turk Marketplace.

An example of internal teams in action comes from TELUS International.

TELUS International uses internal teams to develop volumes of data for AI models. By capturing and manually transcribing over 10,000 hours of audio, it created a transcription model with 95% accuracy, which was then used in AI chatbots and virtual assistants. As TELUS already had a team of data scientists and the project was highly technical, an internal team was a compelling choice.

To understand which of these data pipeline construction options may be best for a business, it should determine:

  • Domain Knowledge Requirements - The more complex your domain requirements, the more likely you will need to hire a specialist team. 
  • Labeling Guidelines Complexity - If the guidelines are complex, you may need a specialist team or give extensive training to your internal teams.
  • Quality Control Measures – Will your business have quality control measures in place, and are there regulations protecting the sensitive data involved? If your quality controls are strong and the data isn’t sensitive, you could opt for a lower-accuracy data annotation team via crowdsourcing.

Whichever format you choose, you should endeavor to provide a pleasant and fulfilling experience for your data annotators. 

AI Isn’t a Magical Solution – Humans Are Vital in the AI Lifecycle

AI isn’t – and never has been – a magical solution. Behind every output are hours of manual data annotation work, with dedicated teams honing and improving the functionality of the AI tools we use in business.

Businesses should strive to empower data annotators with better working conditions and opportunities. Data annotators are the foundational workers upon which AI is built. By improving the human elements of AI, we can create better, more precise models for enterprise use cases. 

Businesses must strategically plan and manage these human resources to ensure successful AI projects. 

Cut through the AI hype and join the thousands of business leaders getting practical enterprise insights delivered to their inbox.
