Generative AI – Importance of Human Evaluation for Domain-Specific Implementations





Generative AI made a grand entry into the AI world with the availability of ChatGPT. It is an extremely powerful tool in two ways. First, it effortlessly generates human-like content, such as writing a poem or summarising a document, to name just two examples. Second, it democratises AI by putting its power in the hands of mainstream users, not just techies. “Prompt engineering” can be as simple as a few commands in free-form text that produce some fantastic output. However, quite a few enterprise applications will need LLMs to understand data from the enterprise’s own domain, whether the application is used internally or by customers.

Training with Domain Knowledge


As we know, LLMs are foundation models trained on large amounts of text from the past, so they do not have access to current data or, more specifically, to domain-specific data held within enterprises. Giving an LLM the ability to understand enterprise data will open opportunities to build fantastic enterprise applications. How can this be achieved?

The most obvious choice is to train the LLM on the domain data. However, this involves updating all the weights of the LLM using documents and artefacts related to the domain, which is time-consuming and involves significant cost. Alternatively, we could use prompt engineering itself: a method called retrieval-augmented generation, used for question-answering applications, places relevant enterprise data in the prompt. A more commonly used method is parameter-efficient fine-tuning, which freezes the weights of the pre-trained LLM and trains only a small number of additional parameters. This is much more efficient than full training and is widely used. We will focus on this last approach to demonstrate the importance of human-in-the-loop in fine-tuning.
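To make the idea of freezing the pre-trained weights concrete, here is a minimal sketch in plain Python. It assumes a low-rank adapter in the style of LoRA, which is only one of several parameter-efficient techniques and is not named in this article. The layer width `d`, the adapter rank `r`, and all function names are hypothetical values chosen for illustration.

```python
# Illustrative only: a LoRA-style low-rank adapter is ONE parameter-efficient
# technique; it is an assumption here, not the method prescribed by the article.

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def adapted_forward(frozen_w, adapter_a, adapter_b, x):
    """Compute W x + B (A x). Only A and B would be updated during fine-tuning."""
    base = matvec(frozen_w, x)
    low_rank = matvec(adapter_b, matvec(adapter_a, x))
    return [b + l for b, l in zip(base, low_rank)]

def count_params(matrix):
    """Number of scalar weights in a matrix."""
    return sum(len(row) for row in matrix)

d, r = 64, 2  # hypothetical layer width and adapter rank

# Stand-in for one pre-trained layer (identity matrix); it stays frozen.
frozen_w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
adapter_a = [[0.1] * d for _ in range(r)]  # r x d, trainable
adapter_b = [[0.0] * r for _ in range(d)]  # d x r, trainable, starts at zero

x = [1.0] * d
y = adapted_forward(frozen_w, adapter_a, adapter_b, x)

frozen = count_params(frozen_w)
trainable = count_params(adapter_a) + count_params(adapter_b)
print(f"frozen weights: {frozen}, trainable weights: {trainable}")
```

Because `adapter_b` starts at zero, the adapted layer initially behaves exactly like the frozen one, and fine-tuning only has to learn the small matrices `adapter_a` and `adapter_b`.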

Need for Human Evaluation


Consider a fundamental paradigm shift that comes with generative AI: the model needs to generate very good text to address the needs of the user. But what counts as “good generated text” is very subjective; consider writing poems or generating code! Existing metrics such as BLEU and ROUGE compare generated text to reference texts using simple rules based on word overlap. You can imagine how limiting such metrics will be!
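A small sketch shows why overlap-based metrics can mislead. The function below is a simplified unigram recall in the spirit of ROUGE-1, not the official BLEU or ROUGE implementation, and the three sentences are invented for illustration.

```python
from collections import Counter

def unigram_recall(candidate, reference):
    """Fraction of reference words that also appear in the candidate.

    A simplified, ROUGE-1-style recall; real BLEU/ROUGE implementations
    add tokenisation rules, longer n-grams and other refinements.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())

# Invented example sentences.
reference = "the payment was declined by the bank"
paraphrase = "your bank refused the transaction"    # same meaning, different words
wrong = "the payment was approved by the bank"      # opposite meaning, similar words

print("paraphrase:", round(unigram_recall(paraphrase, reference), 2))
print("wrong answer:", round(unigram_recall(wrong, reference), 2))
```

The factually wrong sentence scores higher than the correct paraphrase because it shares more words with the reference. A human evaluator would rank the two the other way round.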

So, it is appropriate that human evaluation services are used to measure the performance of the model, and that the results are used to optimise the model. Human-in-the-loop services are important for two critical parts of a generative AI application, as described below.

Supervised Fine-Tuning


As indicated earlier, this approach involves freezing the weights of the pre-trained LLM and fine-tuning a much smaller set of weights, which can provide excellent results. The model is trained with data specific to a domain. To demonstrate, let us consider a question-answering application (similar to ChatGPT). Human-in-the-loop services are used in such applications to create what is called “demonstration data”. In a question-answering application, this is a conversation between an end-user and a language model. The human-in-the-loop team creates conversations for multiple scenarios around the specific domain. This labelled data is used to fine-tune the model in a process called supervised fine-tuning.
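As a rough illustration of what demonstration data might look like, the sketch below stores one human-written conversation and turns it into prompt and completion pairs for supervised fine-tuning. The class names, field names, the JSON Lines output, and the insurance-style conversation are all assumptions made for this example; real projects define their own schema.

```python
import json
from dataclasses import dataclass

@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str

@dataclass
class Demonstration:
    scenario: str  # hypothetical label chosen by the annotation team
    turns: list    # ordered list of Turn objects

def to_training_pairs(demo):
    """Each assistant turn becomes a target; all earlier turns form the prompt."""
    pairs = []
    for i, turn in enumerate(demo.turns):
        if turn.role == "assistant":
            context = "\n".join(f"{t.role}: {t.text}" for t in demo.turns[:i])
            pairs.append({"prompt": context, "completion": turn.text})
    return pairs

# An invented conversation written by a human annotator.
demo = Demonstration(
    scenario="claim status enquiry",
    turns=[
        Turn("user", "What is the status of my claim?"),
        Turn("assistant", "Could you share your claim number?"),
        Turn("user", "It is 12345."),
        Turn("assistant", "Thank you. Claim 12345 is under review."),
    ],
)

pairs = to_training_pairs(demo)
jsonl = "\n".join(json.dumps(pair) for pair in pairs)  # one record per line
print(jsonl)
```

Each assistant reply written by the human team becomes a labelled target, so the quality of the fine-tuned model depends directly on the quality of these conversations.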

Human Feedback for Reinforcement Learning

Given the subjectivity of generated text, reinforcement learning with a reward function is typically used to train the model. The reward function is itself a model, which is calibrated and trained on human preferences. Typically, the human-in-the-loop team labels the output of the LLM on various parameters that cover the scope of the application. These parameters are typically subjective, such as fluency, relevance, sensitive content, and contradiction. In addition, the team compares two different outputs (generated texts) and states a preference between the two. This data, called “preference data”, is then used to train the reward model.
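The sketch below illustrates how preference data could train a reward model. It assumes a common pairwise formulation in which the model is penalised whenever it scores the rejected output above the chosen one; the article does not specify the loss, so treat this as one plausible choice. The reward model here is a toy linear scorer over two hypothetical numeric features per output (a relevance score and a contradiction flag), and the preference pairs are invented.

```python
import math

def reward(weights, features):
    """Toy linear reward model: a weighted sum of output features."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """-log(sigmoid(margin)): small when the chosen output scores higher."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))

def train(pairs, dims, lr=0.5, epochs=200):
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0] * dims
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            grad_scale = -1.0 / (1.0 + math.exp(margin))  # d(loss)/d(margin)
            for i in range(dims):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Invented preference data: (features of preferred output, features of rejected output).
# Hypothetical features: [relevance score, contains-contradiction flag].
preference_pairs = [
    ([0.9, 0.0], [0.4, 1.0]),
    ([0.8, 0.0], [0.7, 1.0]),
    ([0.6, 0.0], [0.2, 0.0]),
]

weights = train(preference_pairs, dims=2)
for chosen, rejected in preference_pairs:
    print(round(reward(weights, chosen), 2), ">", round(reward(weights, rejected), 2))
```

After training, the toy model rewards relevance and penalises contradiction purely because human annotators preferred outputs with those properties. In a real system the reward model is usually a neural network that scores the generated text directly.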

These LLMs, trained on humongous amounts of data, tend to “hallucinate”. Only humans can put them on the right track!

Authored by Kannan Sundar, Chief Digital Transformation Officer (CDO) at NextWealth

DQ Online