Chatbots charm us until AI hallucinations strike with plausible-sounding but misleading content. While experts believe these fabrications will persist with current Large Language Model (LLM) technology, they can be reduced through better training.
"Data deviation remains one of the primary causes of AI hallucinations," explained Zhang Shihao, head of the annotation service unit of Digital Tianma, a subsidiary of Ant Group in Liangjiang New Area that specializes in information and operation services.
Digital Tianma in Liangjiang Digital Economy Industrial Park. [Photo provided to english.liangjiang.gov.cn]
LLMs rely on training data to generate information. If a user prompts a model with a question beyond its training data, the model might generate a false response.
"Increasing training data volume and improving training data quality help mitigate hallucinations," Zhang said. His annotator team reduces data deviation through data collection, cleansing, analysis, and calibration.
"For instance, data cleansing involves refining massive raw datasets by correcting errors, removing duplicates, filling gaps, and ensuring consistency to improve usability."
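The cleansing steps Zhang describes can be pictured with a minimal sketch. The function, field names, and sample records below are invented for illustration and assume raw data arrives as simple text records; real annotation pipelines operate at far larger scale.

```python
# Illustrative data-cleansing pass: correct errors, remove duplicates,
# fill gaps, and enforce consistency in raw text records.

def clean_records(raw):
    seen = set()
    cleaned = []
    for rec in raw:
        text = (rec.get("text") or "").strip()   # fill gaps: treat missing text as empty
        text = " ".join(text.split())            # consistency: normalize whitespace
        if not text:
            continue                             # drop unusable records
        key = text.lower()
        if key in seen:
            continue                             # remove duplicates (case-insensitive)
        seen.add(key)
        label = rec.get("label", "unknown")      # fill gaps: default missing labels
        cleaned.append({"text": text, "label": label})
    return cleaned

raw = [
    {"text": " Hello  world ", "label": "greeting"},
    {"text": "hello world", "label": "greeting"},  # duplicate after normalization
    {"text": "", "label": "noise"},                # empty: dropped
    {"text": "Market update"},                     # missing label: filled
]
print(clean_records(raw))
```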
"The work of data annotators, whom we call AI trainers, is like that of teachers: imparting knowledge and building reasoning capabilities. AI training, computational power, and algorithms are the three pillars determining the quality of LLMs," Zhang explained.
Well-trained models deliver more accurate responses. For instance, if a query contains typos, models trained for user intent recognition can infer the purpose from context and respond accordingly.
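The intent-recognition behavior described above can be caricatured with simple fuzzy matching. This is not how an LLM works internally; the intents, keywords, and similarity cutoff below are invented purely to illustrate mapping a misspelled query to its likely purpose.

```python
import difflib

# Toy intent recognizer: map a possibly misspelled query word
# to the closest known intent keyword.
INTENT_KEYWORDS = {
    "balance": "check_balance",
    "transfer": "make_transfer",
    "statement": "get_statement",
}

def infer_intent(query):
    for word in query.lower().split():
        # Fuzzy-match each word against known keywords (cutoff is illustrative).
        match = difflib.get_close_matches(word, list(INTENT_KEYWORDS), n=1, cutoff=0.75)
        if match:
            return INTENT_KEYWORDS[match[0]]
    return "unknown"

print(infer_intent("show my balence"))  # typo for "balance"
```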
With eight years in AI training, Zhang has witnessed the field transform from labor-intensive to knowledge-driven: "Before 2022, training primarily focused on general knowledge. For example, in the autonomous driving field, annotators label street-view images, identifying features such as crosswalks or vehicles. Those tasks required minimal expertise but vast manpower."
"The AI landscape has been evolving rapidly since 2022, and we need subject-matter experts for enhanced LLM training. This shift is reflected in our hiring, which values multi-skilled professionals," said Zhang.
Li Wenyuan, who joined Zhang's team last August after nearly a decade in finance, said that LLM integration with domain-specific expertise relies on precise, specialized datasets.
Li works as a data annotator. [Photo by Li Wenyuan]
Li's work mainly supports Ma Xiaocai, an online financial model developed by Ant Group. "After initial training, the model has acquired basic financial knowledge. Now, trainers need advanced expertise to refine it further. Their educational backgrounds and industry experience are critical," Li explained.
Digital Tianma employs 5,000 annotators and has processed hundreds of millions of high-quality data entries since its establishment in 2023. As Zhang emphasizes, the AI revolution is about more than coding; trainers who blend technical skills with industry savvy are just as important.
The job market reflects this demand. Data from Zhilian Recruitment, a major recruitment platform in China, shows that demand for data annotators surged by over 50 percent year-on-year in February.