Random Augmenter

A detailed guide to using the Random Augmenter. This page also serves as a proof of concept for the augmenter.

Introduction

We will explain the different modes of the Random Augmenter, including a full example using the 'rotten_tomatoes' dataset later on.

As the name suggests, the Random Augmenter modifies a sequence according to a predefined probability. The augmentation process iterates over each word in the sequence and performs the chosen Action with that probability. This introduces variability into the dataset, potentially improving the robustness and generalization of NLP models.
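
To make this concrete, here is a minimal, hypothetical sketch of such a per-word sampling loop. This is an illustration of the idea only, not NLarge's actual implementation; the function and its arguments are invented for this sketch.

import random

def randomly_augment(words, apply_action, aug_percent=0.3, skipwords=()):
    # Visit each word; with probability aug_percent, apply the action.
    # Words listed in skipwords are never selected.
    out = []
    for word in words:
        if word not in skipwords and random.random() < aug_percent:
            out.extend(apply_action(word))  # the action may drop, replace, or keep the word
        else:
            out.append(word)
    return out

# Example: a "substitute"-style action that replaces the sampled word
print(" ".join(randomly_augment(
    "This is a simple example sentence".split(),
    apply_action=lambda w: ["awesome"],
    skipwords={"is", "a"},
)))

Before we begin the explanation for each Action mode, let's first import and initialize the augmenter.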

Importing & initializing the library

Before using the library, you should first import and initialize the random augmenter. Since the Random Augmenter supports several modes, be sure to import the Action class too!

Import & Initialize NLarge Random Augmenter:

from NLarge.random import RandomAugmenter, Action
random_aug = RandomAugmenter()

Great! Now let us go through each Random Augment Mode.

Random Swap

The Swap Action randomly samples words from the target sequence with the predefined probability and swaps each sampled word with an adjacent word, provided the sampled word is not in the 'skipwords' argument.

Arguments:

data: Input text to augment
action: Action to perform; for Random Swap, use action=Action.SWAP
aug_percent: Percentage of words in the sequence to augment
aug_min: Minimum number of words to augment
aug_max: Maximum number of words to augment
skipwords: List of words to skip during augmentation

input = "This is a simple example sentence for testing."
random_aug(data=input, action=Action.SWAP, aug_percent=0.3, aug_min=1, aug_max=10, skipwords=['is','a','for'])
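
Because the sampling is random, each call can return a different variant: here at most 30% of the words (between 1 and 10 of them) are swapped with a neighbouring word, and 'is', 'a', and 'for' are never selected.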

Random Substitute

The Substitute Action randomly samples words from the target sequence with the predefined probability. It then replaces each sampled word with a word chosen at random from the 'target_words' argument, provided the sampled word is not in the 'skipwords' argument.

Arguments:

data: Input text to augment
action: Action to perform; for Random Substitute, use action=Action.SUBSTITUTE
aug_percent: Percentage of words in the sequence to augment
aug_min: Minimum number of words to augment
aug_max: Maximum number of words to augment
skipwords: List of words to skip during augmentation
target_words: List of candidate words to substitute for the sampled words

input = "This is a simple example sentence for testing."
random_aug(data=input, action=Action.SUBSTITUTE, aug_percent=0.3, aug_min=1, aug_max=10, skipwords=['is','a', 'for'], target_words=['great', 'awesome'])
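
Each sampled word is replaced by a random pick from target_words, so the output mixes 'great' and 'awesome' into the sentence while the skipwords stay untouched.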

Random Delete

The Delete Action randomly samples words from the target sequence with the predefined probability. It then deletes each sampled word if it is not in the 'skipwords' argument.

Arguments:

data: Input text to augment
action: Action to perform; for Random Delete, use action=Action.DELETE
aug_percent: Percentage of words in the sequence to augment
aug_min: Minimum number of words to augment
aug_max: Maximum number of words to augment
skipwords: List of words to skip during augmentation

input = "This is a simple example sentence for testing."
random_aug(data=input, action=Action.DELETE, aug_percent=0.3, aug_min=1, aug_max=10, skipwords=['is','a', 'for'] )
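
Each call removes a different random subset of eligible words, with the number of deletions bounded by aug_min and aug_max.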

Random Crop

The Crop Action randomly samples a starting and an ending index in the target sequence. The contiguous span of words between the two indices is then checked against 'skipwords': if the span contains none of them, it is deleted.

Arguments:

data: Input text to augment
action: Action to perform; for Random Crop, use action=Action.CROP
aug_percent: Percentage of words in the sequence to augment
aug_min: Minimum number of words to augment
aug_max: Maximum number of words to augment
skipwords: List of words to skip during augmentation

input = "This is a simple example sentence for testing."
random_aug(data=input, action=Action.CROP, aug_percent=0.3, aug_min=1, aug_max=10, skipwords=['is','a', 'for'] )
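
Here a contiguous span is dropped wholesale, yielding a shorter sentence; any candidate span that contains one of the skipwords is left intact.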

Example of Random Augmentation

For your reference, below is a full example of NLarge Random Augmentation on a dataset, which also serves as its proof of concept. We evaluate the augmented datasets on RNN and LSTM models using loss and accuracy metrics. We chose the 'rotten_tomatoes' dataset because its small size makes models prone to overfitting.

Importing libraries:

import datasets 
from datasets import Dataset, Features, Value, concatenate_datasets 
from NLarge.dataset_concat import augment_data, MODE 
from NLarge.pipeline import TextClassificationPipeline
from NLarge.model.RNN import TextClassifierRNN, TextClassifierLSTM

Downloading the 'rotten_tomatoes' dataset

Here, we download the dataset and ensure that the features are in the correct format for our dataset augmentation later on.

original_train_data, original_test_data = datasets.load_dataset(
"rotten_tomatoes", split=["train", "test"]
)  
features = Features({"text": Value("string"), "label": Value("int64")})
original_train_data = Dataset.from_dict(
    {
        "text": original_train_data["text"],
        "label": original_train_data["label"],
    },
    features=features,
)

Applying augmentation and enlarging dataset

We will perform a 10% Random Substitute augmentation and a 100% Random Substitute augmentation on the dataset, increasing its size by 10% and 100% respectively.

# Augment and increase size by 10% and 100%
percentages = {
    MODE.RANDOM.SUBSTITUTE: 0.1,  # 10% of data for random augmentation
}
augmented_data_list_10 = augment_data(original_train_data, percentages)

percentages = {
    MODE.RANDOM.SUBSTITUTE: 1.0,  # 100% of data for random augmentation
}
augmented_data_list_100 = augment_data(original_train_data, percentages)


# Convert augmented data into Datasets
augmented_dataset_10 = Dataset.from_dict(
    {
        "text": [item["text"] for item in augmented_data_list_10],
        "label": [item["label"] for item in augmented_data_list_10],
    },
    features=features,
)

augmented_dataset_100 = Dataset.from_dict(
    {
        "text": [item["text"] for item in augmented_data_list_100],
        "label": [item["label"] for item in augmented_data_list_100],
    },
    features=features,
)

# Concatenate original and augmented datasets
augmented_train_data_10 = concatenate_datasets(
    [original_train_data, augmented_dataset_10]
)

augmented_train_data_100 = concatenate_datasets(
    [original_train_data, augmented_dataset_100]
)
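
As a quick sanity check, you can compare the dataset sizes. Assuming the standard 'rotten_tomatoes' train split of roughly 8,530 rows, the enlarged sets should be about 10% and 100% larger:

# Sizes before and after enlargement
print(len(original_train_data))
print(len(augmented_train_data_10))   # roughly 1.1x the original
print(len(augmented_train_data_100))  # roughly 2x the original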

RNN: Loading the pipeline & model training

Here, we will initialize and train the pipeline using RNN and the augmented datasets.

pipeline_augmented_10 = TextClassificationPipeline(
    augmented_data=augmented_train_data_10,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierRNN,
)
pipeline_augmented_100 = TextClassificationPipeline(
    augmented_data=augmented_train_data_100,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierRNN,
)
pipeline_augmented_10.train_model(n_epochs=10)
pipeline_augmented_100.train_model(n_epochs=10)

RNN: Evaluating the models' performance

Plotting the loss and accuracy graphs, we can visualize the performance difference between the two levels of augmentation on the RNN.

pipeline_augmented_10.plot_loss(title="10% Random Substitute on RNN")
pipeline_augmented_100.plot_loss(title="100% Random Substitute on RNN")
pipeline_augmented_10.plot_acc(title="10% Random Substitute on RNN")
pipeline_augmented_100.plot_acc(title="100% Random Substitute on RNN")

Looking at the graphs, we can see a marked improvement in both loss and accuracy for the RNN model.

[Figure: models' loss]

[Figure: models' accuracy]

LSTM: Loading the pipeline & model training

Here, we will initialize and train the pipeline using LSTM and the augmented datasets.

pipeline_augmented_10_LSTM = TextClassificationPipeline(
    augmented_data=augmented_train_data_10,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierLSTM,
)
pipeline_augmented_100_LSTM = TextClassificationPipeline(
    augmented_data=augmented_train_data_100,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierLSTM,
)
pipeline_augmented_10_LSTM.train_model(n_epochs=10)
pipeline_augmented_100_LSTM.train_model(n_epochs=10)

LSTM: Evaluating the models' performance

Plotting the loss and accuracy graphs, we can visualize the performance difference between the two levels of augmentation on the LSTM.

pipeline_augmented_10_LSTM.plot_loss(title="10% Random Substitute on LSTM")
pipeline_augmented_100_LSTM.plot_loss(title="100% Random Substitute on LSTM")
pipeline_augmented_10_LSTM.plot_acc(title="10% Random Substitute on LSTM")
pipeline_augmented_100_LSTM.plot_acc(title="100% Random Substitute on LSTM")

Looking at the graphs, we can see a marked improvement in both loss and accuracy for the LSTM model.

[Figure: models' loss]

[Figure: models' accuracy]

Analysis of Results

The results of our experiment indicate that model performance keeps improving at higher levels of augmentation, suggesting that data augmentation provides a clear benefit for sentiment classification tasks. The findings also highlight the importance of data augmentation in enhancing the diversity and robustness of training datasets, leading to improved model performance.

Data augmentation mitigates overfitting by effectively increasing the size of the training dataset, which reduces the likelihood of the model memorizing specific examples and encourages it to learn general patterns instead. Introducing variation into the training data also makes the model more robust to noise and variation in real-world inputs, which is crucial for achieving good performance on unseen data.