July 9, 2024

The Hidden Workforce of AI: Data Annotation Teams

The allure of AI's "magic" often obscures the human labor behind it.

7 min read

Meet our Editor-in-chief

Paul Estes

For 20 years, Paul struggled to balance his home life with fast-moving leadership roles at Dell, Amazon, and Microsoft, where he led a team of progressive HR, procurement, and legal trailblazers to launch Microsoft’s Gig Economy freelance program.

  • PwC estimates that AI will add over $15 trillion to the global economy, but businesses won’t see these numbers without investing in accurate data annotation.

  • Businesses rely on massive amounts of data to train AI models, sometimes a petabyte or more for a single large language model. This data needs human labeling for accuracy.

  • Data annotation teams are critical in ensuring AI models function correctly, especially in high-risk fields like medicine and self-driving cars.

Staff writer


When Amazon launched its Just Walk Out AI stores, where customers could pick up any products they wanted and simply walk out, the technology seemed truly magical. Fast-forward, and it turned out that the “AI model” was, in fact, over 1,000 employees in India who carefully monitored customers on cameras and then billed their accounts.

Artificial intelligence is far from magic. It often involves thousands of hours of human labor. Even fully functional AI systems wouldn’t exist if it weren’t for teams of data annotators that go through the painstaking process of constructing accurate training data.

Data annotation—labeling data sets for AI training—is at the core of every AI model. AI promises to add $15.7 trillion to the global economy by 2030. However, without precise data sets, AI models cannot be trained for a specific function or deliver the ROI that enterprises are being promised.

Human labor is still essential in developing and launching applicable AI models. Let’s explore AI's hidden workforce, delving into the critical role of data annotation teams.

The Hidden Workforce: The Rise of Annotation Teams

Artificial intelligence models rely on massive volumes of high-quality data for training. A single large language model (like those behind AI chatbots) can ingest petabytes of data and contain billions of parameters by the end of its training stages.

A model’s capacity is measured in parameters, while its training data is counted in tokens. The sheer quantity of data companies need to gather, especially for LLMs, involves drawing on millions of sources. GPT-2 (the predecessor to the widely available ChatGPT tools of today) used around 1.5 billion parameters. Although OpenAI doesn’t disclose where it draws its training data, the media theorizes that it scrapes internet data to train its models.

To feed these millions of articles and billions of individual data points into a model, researchers must first collect data, clean it (by removing duplicates, incorrect data, or unrelated information), and then feed it into their model. An AI model will then undergo a training process to understand the data context better, building up a tokenized representation of that specific data set. Tokenization breaks text into smaller units (tokens) that a model can process; the model then learns to predict the next token from previous context, which is vital to how AI chatbots function.
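The collect-clean-tokenize steps above can be sketched in a few lines. This is a deliberately minimal illustration: the function names are invented for this example, and the whitespace tokenizer stands in for the subword tokenizers real LLM pipelines use.

```python
# Minimal sketch of the clean-then-tokenize steps described above.
# Function and variable names are illustrative, not from any real pipeline.

def clean_corpus(documents):
    """Deduplicate and drop empty or malformed entries before training."""
    seen = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        if not text:       # drop empty/unusable entries
            continue
        if text in seen:   # remove exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

def tokenize(text):
    """Naive whitespace tokenization; real models use subword tokenizers."""
    return text.lower().split()

raw = ["The cat sat.", "The cat sat.", "", "  Dogs bark.  "]
corpus = clean_corpus(raw)
tokens = [tokenize(t) for t in corpus]
print(corpus)  # ['The cat sat.', 'Dogs bark.']
print(tokens)  # [['the', 'cat', 'sat.'], ['dogs', 'bark.']]
```

Production pipelines also apply near-duplicate detection, language filtering, and quality scoring, but the basic shape — filter, deduplicate, tokenize — is the same.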

Source: Process of data cleaning for AI models.

Training an AI model is highly intensive, requiring precise data collection and formatting, alongside budget-breaking processing fees to run the model during its training phase. According to OpenAI, training GPT-4 cost around $100 million.

The Importance of Data Quality

Every impressive benefit of artificial intelligence that enterprises pursue is locked behind high-quality data. With clean, structured, and applicable data sets for training, businesses can develop models that enable them to unlock the real value of AI. 

Behind every successful AI model is a team of data scientists. Beyond just creating, training, and configuring the model, teams of data scientists also need to capture data and transform it into a high-quality state for training. Dirty data, including inconsistencies, duplicates, or errors, can transfer those mistakes into an AI’s algorithm, inducing bias and leading to output errors.

Annotation teams must collect and prepare data to ensure that AI models can effectively use it to develop unique functionalities. In any deployment where even a tiny error could create significant issues, AI projects are even more reliant on data annotation teams:

  • Medical Diagnosis – AI models that aid in medical imagery and diagnosis must avoid false positives or false negatives at all costs. Annotation teams have to carefully label features within medical images for training to avoid errors in the model’s output. The Stanford Medical ImageNet provides over a petabyte of searchable, fully human-annotated radiology and pathology images, which serve as the foundation for many medical AI models.
  • Translation Technology – Lionbridge uses a human-in-the-loop annotation system in which linguist experts conduct language-based annotation for text and images. The platform uses these human annotation solutions to provide generative AI translation technology, with human input creating a highly accurate and high-performance AI translation model.
  • Facial Recognition – Human data annotators must work with facial data to label features, especially when creating detailed repositories of representative samples across ethnicities. Facial recognition technology, such as Face++, has notoriously had bias issues in the past, which human annotators are attempting to overcome. KeyLabs uses human annotators for facial recognition training, especially for pictures of people wearing masks. Their tireless work has provided the basis for AI facial recognition models with claimed accuracy of 99.9%.
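To make the labeling work concrete, here is a hypothetical shape for a single image-annotation record, loosely modeled on common bounding-box formats. Every field name here is illustrative, not drawn from any of the platforms mentioned above.

```python
# Hypothetical shape of one image-annotation record; field names are
# illustrative only, not from Stanford Medical ImageNet or any vendor.
annotation = {
    "image_id": "scan_00421",
    "annotator": "radiologist_07",
    "labels": [
        {
            "class": "nodule",
            "bbox": [132, 88, 41, 37],  # x, y, width, height in pixels
            "confidence": "high",
        }
    ],
}

def validate(record):
    """Basic sanity checks an annotation pipeline might run before training."""
    assert record["image_id"] and record["annotator"]
    for label in record["labels"]:
        x, y, w, h = label["bbox"]
        assert w > 0 and h > 0, "bounding box must have positive area"
    return True

print(validate(annotation))  # True
```

Automated validation like this catches malformed labels early, but it cannot judge whether a label is medically correct — that is exactly why human annotators remain essential.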

Effective labeling, categorization, and annotation of training data is the first and potentially most fundamental step in creating an AI model. Without teams of humans working behind the scenes on data annotation, AI as we know it today would not exist.

Commenting on this, one AI Sessions attendee stated, “100% of the time, we queue all of our outputs for review. We have AI doing things in real time, but then we have an annotation team that reviews every single thing that’s done and then corrects it. With our higher-confidence stuff, we need to figure out a process for how to migrate off that. We have some strategies, but we need to rebuild the tooling to support that.”

Building a High-Quality Data Annotation Pipeline

Before an enterprise builds a high-quality data annotation pipeline, it must first define the shape of the project. Depending on the deployment of an AI model, the complexity, volume, and labeling requirements will shift. An AI model that works as an internal chatbot will have far lighter annotation needs than a medical imaging model, for example.

After identifying the form of annotation a business needs, it can strategically source a team to handle data labeling. Enterprises have three central options to choose from:

  • Internal Teams—Businesses can form internal data scientist teams, enabling rapid data annotation. However, an internal team may lack the necessary skills to effectively annotate data for AI purposes. Typically, only enterprises with an extensive data science division and the budget to invest in upskilling can rely on internal teams.
  • Managed Service Providers (MSPs)—MSPs offer pre-built data annotation teams with expertise in specific domains. Teams may have previous experience with medical imagery, satellite imagery, or another form of data relevant to AI model training. SuperAnnotate, a widely used annotation platform, has over 400 highly trained annotators with professional backgrounds in everything from medicine to law.
  • Crowdsourcing Platforms – Finally, businesses can turn to crowdsourced data labeling via online marketplaces. For example, Amazon Mechanical Turk (MTurk) is a distributed workforce that handles data validation and annotation. This choice enhances efficiency, but crowdsourced workers may lack the specialist knowledge that an MSP offers.
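One common way to offset the variable quality of crowdsourced annotators is to have several workers label the same item and aggregate by majority vote. The sketch below illustrates the idea; the function name and data are invented for this example.

```python
from collections import Counter

# Majority-vote aggregation of crowdsourced labels — a common quality
# safeguard when individual annotators may be unreliable. Illustrative only.

def majority_label(labels):
    """Return the most common label and the fraction of workers who chose it."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)

# Three workers label the same image; two of three agree.
label, agreement = majority_label(["cat", "cat", "dog"])
print(label, round(agreement, 2))  # cat 0.67
```

A low agreement rate flags items for escalation to an expert reviewer, which is how crowdsourcing pipelines compensate for the specialist knowledge gap noted above.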

Source: Amazon Mechanical Turk Marketplace.

An example of internal teams in action comes from TELUS International.

TELUS International uses internal teams to develop volumes of data for AI models. By capturing and manually transcribing over 10,000 hours of audio, it created a transcription model with 95% accuracy, which was then used in AI chatbots and virtual assistants. As TELUS already had a team of data scientists and the project was highly technical, an internal team was a compelling choice.

To understand which of these data pipeline construction options may be best for a business, it should determine:

  • Domain Knowledge Requirements - The more complex your domain requirements, the more likely you will need to hire a specialist team. 
  • Labeling Guidelines Complexity - If the guidelines are complex, you may need a specialist team or give extensive training to your internal teams.
  • Quality Control Measures – Will your business have quality control measures in place, and are there regulations protecting the sensitive data involved? If your quality controls are strong and the data isn’t sensitive, you could opt for a lower-accuracy data annotation team via crowdsourcing.

Whichever format you choose, you should endeavor to provide a pleasant and fulfilling experience for your data annotators. 

AI Isn’t a Magical Solution – Humans Are Vital in the AI Lifecycle

AI isn’t – and never has been – a magical solution. Behind every output are hours of manual data annotation work, with dedicated teams honing and improving the functionality of the AI tools we use in business.

Businesses should strive to empower data annotators with better working conditions and opportunities. Data annotators are the foundational workers upon which AI is built. By improving the human elements of AI, we can create better, more precise models for enterprise use cases. 

Businesses must strategically plan and manage these human resources to ensure successful AI projects. 

Cut through the AI hype and join the thousands of business leaders getting practical enterprise insights delivered to their inbox.
