Large Language Model (LLM) Augmenter

Detailed guide to using the LLM Augmenter. This page also serves as a proof of concept for the LLM Augmenter.

Introduction

The LLMAugmenter offers an advanced dataset augmentation technique that leverages large language models (LLMs) for paraphrasing and summarization. By generating diverse rephrasings and summaries of input data, this augmenter helps prevent overfitting in text-based sentiment classification models by adding rich variability to the training dataset.

Overview

The LLMAugmenter provides a robust solution to augment text data, reducing the risk of overfitting and enhancing model performance in NLP tasks. This augmenter relies on two distinct LLM-driven techniques to achieve variability in the dataset:

- Paraphrasing via Questioning
- Summarization

Paraphrasing via Questioning

The Paraphrasing via Questioning technique in LLMAugmenter uses a large language model (LLM) to rephrase sentences by framing the task as a question-answering exercise. The prompt asks the model to reword a sentence without repeating the same verbs or phrases, so the answer is a paraphrase that keeps the original meaning while differing in both vocabulary and structure.

By generating multiple paraphrased versions of the same text, LLMAugmenter introduces subtle variations that help the downstream model learn more general features of the language. This variation reduces the chance of overfitting, because the model is not exposed to identical sentences repeatedly.

Paraphrased sentences, with different structures and vocabulary, prepare the model to handle a broader range of linguistic patterns, improving its robustness in real-world applications.

Because the LLM avoids repeating verbs while rephrasing, the augmented data exposes the downstream model to synonyms and related terms, which is especially beneficial for tasks like sentiment analysis, where nuanced language is common.
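
As an illustration of the question framing described above, the following minimal sketch builds a question-style prompt around an input sentence. The function name `build_paraphrase_prompt` and the prompt wording are hypothetical, chosen only for illustration; the prompt that LLMAugmenter actually sends to its model may differ.

```python
def build_paraphrase_prompt(sentence: str) -> str:
    """Wrap a sentence in a question-answer style paraphrasing prompt.

    Hypothetical helper: the wording is an assumption, not NLarge's prompt.
    """
    cleaned = sentence.strip()
    if not cleaned:
        raise ValueError("sentence must not be empty")
    return (
        "Question: How would you say the following sentence in different "
        "words, without reusing its verbs or phrases, while keeping the "
        "same meaning?\n"
        f"Sentence: {cleaned}\n"
        "Answer:"
    )


prompt = build_paraphrase_prompt("This movie is a must-watch for all the family.")
print(prompt)
```

Ending the prompt with "Answer:" nudges a text-generation model to continue with the rephrased sentence rather than with commentary about the task.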

Parameters

sentence

(str) - Input text to augment
example: 'This is a test sentence.'

max_new_tokens

(int) - Maximum number of new tokens the LLM may generate for the output sentence.
default: 512

Usage Example

# Import and Initialize LLMAugmenter
from NLarge.llm import LLMAugmenter

llm_aug = LLMAugmenter()

sample_text = "This movie is a must-watch for all the family."
print(sample_text)

res = llm_aug.paraphrase_with_question(sample_text)
print(res)

Summarization

The Summarization technique in LLMAugmenter utilizes a transformer-based summarizer model, specifically the BART model, to condense longer texts or passages into shorter, yet semantically complete summaries. This technique is especially useful for extracting core information from large documents or verbose sentences.

In NLP tasks, long sentences or paragraphs may contain extraneous information. Summarization helps reduce this noise by retaining only the most important information.

Ultimately, training on both long-form text and its summaries helps the model develop a nuanced understanding of essential versus non-essential details. This ability to differentiate relevant information is invaluable in tasks that require prioritization of critical data, like summarization, classification, and even information extraction.
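
The idea of training on both a long text and its summary can be sketched as follows. This is a minimal illustration, not NLarge's internal code: `add_summaries` and `first_sentence` are hypothetical names, and `first_sentence` is only a stand-in so the example runs without downloading a model. In practice a callable such as `llm_aug.summarize_with_summarizer` could be passed instead, assuming it returns the summary as a string.

```python
from typing import Callable


def first_sentence(text: str) -> str:
    """Stand-in 'summarizer' (assumption): keep only the first sentence."""
    head, sep, _ = text.strip().partition(". ")
    return head + "." if sep else head


def add_summaries(
    rows: list[dict], summarize: Callable[[str], str]
) -> list[dict]:
    """Return the original rows plus one summarized copy of each row.

    Each summary inherits the label of the text it was derived from.
    """
    summarized = [
        {"text": summarize(row["text"]), "label": row["label"]} for row in rows
    ]
    return rows + summarized


rows = [
    {"text": "A moving film. The pacing drags in the middle act.", "label": 1},
    {"text": "Flat and forgettable. Even the score feels tired.", "label": 0},
]
augmented = add_summaries(rows, first_sentence)
for row in augmented:
    print(row)
```

Keeping the label attached to each summary is what lets the shortened text act as an additional training example for the same class.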

Parameters

text

(str) - Input text to augment.
example: 'This is a test sentence.'

max_length

(int) - Maximum length of the summary.
default: 100

min_length

(int) - Minimum length of the summary.
default: 30

Usage Example

# Import and Initialize LLMAugmenter
from NLarge.llm import LLMAugmenter

llm_aug = LLMAugmenter()

sample_text = """
Eternal Horizons is a masterpiece. It’s not just a film but an experience that lingers
with you long after the credits roll. I highly recommend it to anyone looking for a
deeply moving and visually enchanting cinematic experience.
"""
print(sample_text)

res = llm_aug.summarize_with_summarizer(sample_text, max_length=40, min_length=5)
print(res)

Full Example of LLM Summarizer Augmentation

For reference, below is a full example of applying the NLarge LLM Summarizer Augmentation to a dataset. It also serves as a proof of concept for the technique. The example trains an LSTM on the augmented datasets and evaluates it using the loss and accuracy metrics. We chose the 'rotten tomatoes' dataset because its small size makes models trained on it prone to overfitting.

Full Code:

import datasets
from datasets import Dataset, Features, Value, concatenate_datasets
from NLarge.dataset_concat import augment_data, MODE
from NLarge.pipeline import TextClassificationPipeline
from NLarge.model.RNN import TextClassifierLSTM                  

original_train_data, original_test_data = datasets.load_dataset(
    "rotten_tomatoes", split=["train", "test"]
)
features = Features({"text": Value("string"), "label": Value("int64")})
original_train_data = Dataset.from_dict(
    {
        "text": original_train_data["text"],
        "label": original_train_data["label"],
    },
    features=features,
)  

# Augment with LLM summaries, increasing the dataset size by 5% and by 200%
percentage = {
    MODE.LLM.SUMMARIZE: 0.05,
}
augmented_summarize_5 = augment_data(original_train_data, percentage)

percentage = {
    MODE.LLM.SUMMARIZE: 2.00,
}
augmented_summarize_200 = augment_data(original_train_data, percentage)

# Convert augmented data into Datasets
augmented_dataset_5 = Dataset.from_dict(
    {
        "text": [item["text"] for item in augmented_summarize_5],
        "label": [item["label"] for item in augmented_summarize_5],
    },
    features=features,
)
augmented_dataset_200 = Dataset.from_dict(
    {
        "text": [item["text"] for item in augmented_summarize_200],
        "label": [item["label"] for item in augmented_summarize_200],
    },
    features=features,
)

# Concatenate original and augmented datasets
augmented_train_data_5 = concatenate_datasets(
    [original_train_data, augmented_dataset_5]
)
augmented_train_data_200 = concatenate_datasets(
    [original_train_data, augmented_dataset_200]
)

# Initialize Pipelines
pipeline_augmented_5 = TextClassificationPipeline(
    augmented_data=augmented_train_data_5,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierLSTM,
)
pipeline_augmented_200 = TextClassificationPipeline(
    augmented_data=augmented_train_data_200,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierLSTM,
)

# Train Models
pipeline_augmented_5.train_model(n_epochs=10)
pipeline_augmented_200.train_model(n_epochs=10)

# Plot Loss 
pipeline_augmented_5.plot_loss(title="LSTM - LLM Summarizer Augment (5%)")
pipeline_augmented_200.plot_loss(title="LSTM - LLM Summarizer Augment (200%)")

# Plot Accuracy
pipeline_augmented_5.plot_acc(title="LSTM - LLM Summarizer Augment (5%)")
pipeline_augmented_200.plot_acc(title="LSTM - LLM Summarizer Augment (200%)")
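
To get a sense of the training-set sizes produced by the two settings above, the sketch below works through the arithmetic. It assumes the 'rotten_tomatoes' training split has 8,530 examples and that the number of generated samples is the percentage times the original size, rounded down. The helper `augmented_size` is hypothetical, and the exact rounding used by `augment_data` is an assumption.

```python
def augmented_size(n_original: int, percentage: float) -> tuple[int, int]:
    """Return (generated samples, total after concatenation).

    Assumes generated = floor(n_original * percentage); NLarge may round
    differently.
    """
    n_generated = int(n_original * percentage)
    return n_generated, n_original + n_generated


N_TRAIN = 8530  # assumed size of the rotten_tomatoes training split

for pct in (0.05, 2.00):
    generated, total = augmented_size(N_TRAIN, pct)
    print(f"{pct:.0%}: {generated} generated, {total} total")
```

Under these assumptions the 5% setting adds 426 summaries, while the 200% setting adds 17,060, so summaries outnumber the original samples two to one in the larger training set.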

Models' Loss

Models' Accuracy