Synonym Augmenter

A detailed guide to using the Synonym Augmenter. This page also serves as a proof of concept for the augmenter.

Introduction

The Synonym Augmenter enables data augmentation for text sentiment classification by introducing variability in text through synonym replacement. This augmenter enhances a dataset by replacing words with their synonyms, which can improve model robustness by introducing semantic variability without changing the sentiment.

The Synonym Augmenter samples words in the target sequence with a predefined probability and replaces each sampled word with a randomly chosen synonym from that word's synonym set.

In the current version of NLarge, the set of synonyms can also be drawn from WordNet, an extensive lexical database.
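The sampling-and-replacement procedure can be sketched as follows. This is an illustrative, self-contained sketch, not NLarge's internals: the hardcoded `SYNONYMS` table stands in for a WordNet lookup, and the `aug_p` / `aug_max` names mirror the augmenter's parameters described below.

```python
import random

# Illustrative synonym table; the real augmenter draws synonyms from WordNet.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "lazy": ["idle", "sluggish"],
}

def synonym_augment(text, aug_p=0.3, aug_max=10, rng=None):
    """Replace each word with a random synonym with probability aug_p,
    up to aug_max replacements in total."""
    rng = rng or random.Random()
    words = text.split()
    replaced = 0
    for i, word in enumerate(words):
        if replaced >= aug_max:
            break
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < aug_p:
            words[i] = rng.choice(candidates)
            replaced += 1
    return " ".join(words)
```

Words without a synonym entry are left untouched, so the sentence length and label-relevant structure are preserved.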

Key Components

WordNet

WordNet provides synonym and antonym lookup, with optional parts of speech (POS) filtering. The POS tagging functionality identifies relevant grammatical structures for more accurate augmentation.

PartsOfSpeech

Our POS functionality maps between POS tags and constituent tags to ensure compatibility with WordNet's POS requirements.

The current version supports noun, verb, adjective, and adverb classifications.
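The mapping idea can be sketched like this. The prefix-match table below is illustrative rather than NLarge's exact implementation; it collapses Penn Treebank constituent tags into WordNet's four single-letter POS codes.

```python
# Illustrative Penn Treebank -> WordNet POS mapping (prefix match).
# WordNet uses single-letter POS codes: n(oun), v(erb), a(djective), r (adverb).
def to_wordnet_pos(treebank_tag):
    if treebank_tag.startswith("NN"):
        return "n"   # noun: NN, NNS, NNP, NNPS
    if treebank_tag.startswith("VB"):
        return "v"   # verb: VB, VBD, VBG, VBN, VBP, VBZ
    if treebank_tag.startswith("JJ"):
        return "a"   # adjective: JJ, JJR, JJS
    if treebank_tag.startswith("RB"):
        return "r"   # adverb: RB, RBR, RBS
    return None      # unsupported class: skipped during augmentation
```

Tags that fall outside the four supported classes map to `None`, so those words are simply never considered for replacement.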

SynonymAugmenter

The augmenter uses the WordNet class to perform augmentation, replacing words with synonyms based on user-defined criteria. It uses POS tagging to determine which words are eligible for substitution, while skip lists (stop words and regex patterns) can prevent certain words from being replaced.
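The skip-list idea amounts to an eligibility check run before any replacement. A minimal sketch, with an assumed stopword set and regex pattern (the actual lists are supplied by the user via the parameters below):

```python
import re

STOPWORDS = {"the", "a", "is"}               # assumed example stopword list
SKIP_PATTERNS = [re.compile(r"^\d+$")]        # e.g. never replace bare numbers

def is_eligible(word):
    """Return True if a token may be replaced (illustrative skip-list check)."""
    if word.lower() in STOPWORDS:
        return False
    return not any(p.match(word) for p in SKIP_PATTERNS)
```

Only tokens that pass this check (and carry a supported POS tag) enter the random sampling step.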

Import & Initialize NLarge Synonym Augmenter

Before we proceed further, let us import and initialize a SynonymAugmenter instance.

from NLarge.synonym import SynonymAugmenter

syn_aug = SynonymAugmenter()

Parameters

data

(str) - Input text to augment
example: 'This is a test sentence.'

aug_src

(str) - Augmentation source, currently supports only "wordnet".
default: 'wordnet'

lang

(str) - Language of the input text.
default: 'eng'

aug_max

(int) - Maximum number of words to augment.
default: 10

aug_p

(float) - Probability of augmenting each word.
default: 0.3

stopwords

(list) - List of words to exclude from augmentation.
default: None

tokenizer

(function) - Function to tokenize the input text.
default: None

reverse_tokenizer

(function) - Function to detokenize the augmented text.
default: None
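The tokenizer and reverse_tokenizer parameters accept plain functions. As a sketch of what such a pair might look like (these helpers are illustrative, not part of NLarge), here is a regex tokenizer that keeps punctuation as separate tokens, with a matching detokenizer:

```python
import re

def simple_tokenizer(text):
    """Split into word tokens, keeping punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_detokenizer(tokens):
    """Rejoin tokens, attaching punctuation to the preceding word."""
    out = ""
    for tok in tokens:
        if re.fullmatch(r"[^\w\s]", tok):
            out += tok
        else:
            out += (" " if out else "") + tok
    return out
```

Assuming the augmenter applies these around its replacement step, they would be passed as `tokenizer=simple_tokenizer, reverse_tokenizer=simple_detokenizer`; leaving both as None falls back to the augmenter's defaults.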

Single Sentence Usage Example

sample_text = "The quick brown fox jumps over the lazy dog."
print(sample_text)
syn_aug(sample_text, aug_src='wordnet', aug_p=0.3, aug_max=20)

Full Example of Synonym Augmentation

For your reference, below is a full example of NLarge synonym augmentation on a dataset. This example also serves as a proof of concept for NLarge synonym augmentation: it evaluates the augmented datasets on an LSTM model using loss and accuracy metrics. We chose the 'rotten_tomatoes' dataset because its small size makes models prone to overfitting.

Full Code:

import datasets
from datasets import Dataset, Features, Value, concatenate_datasets
from NLarge.dataset_concat import augment_data, MODE
from NLarge.pipeline import TextClassificationPipeline
from NLarge.model.RNN import TextClassifierLSTM                  

original_train_data, original_test_data = datasets.load_dataset(
"rotten_tomatoes", split=["train", "test"]
)  
features = Features({"text": Value("string"), "label": Value("int64")})
original_train_data = Dataset.from_dict(
    {
        "text": original_train_data["text"],
        "label": original_train_data["label"],
    },
    features=features,
)  

# Augment and increase size by 5%, 10%
percentage = {
    MODE.SYNONYM.WORDNET: 0.05,
}
augmented_synonym_5 = augment_data(original_train_data, percentage)

percentage = {
    MODE.SYNONYM.WORDNET: 0.10,
}
augmented_synonym_10 = augment_data(original_train_data, percentage)

# Convert augmented data into Datasets
augmented_dataset_5 = Dataset.from_dict(
    {
        "text": [item["text"] for item in augmented_synonym_5],
        "label": [item["label"] for item in augmented_synonym_5],
    },
    features=features,
)
augmented_dataset_10 = Dataset.from_dict(
    {
        "text": [item["text"] for item in augmented_synonym_10],
        "label": [item["label"] for item in augmented_synonym_10],
    },
    features=features,
)

# Concatenate original and augmented datasets
augmented_train_data_5 = concatenate_datasets(
    [original_train_data, augmented_dataset_5]
)
augmented_train_data_10 = concatenate_datasets(
    [original_train_data, augmented_dataset_10]
)

# Initialize Pipelines
pipeline_augmented_5 = TextClassificationPipeline(
    augmented_data=augmented_train_data_5,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierLSTM,
)
pipeline_augmented_10 = TextClassificationPipeline(
    augmented_data=augmented_train_data_10,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierLSTM,
)

# Train Models
pipeline_augmented_5.train_model(n_epochs=10)
pipeline_augmented_10.train_model(n_epochs=10)

# Plot Loss 
pipeline_augmented_5.plot_loss(title="5% Synonym Augment on LSTM")
pipeline_augmented_10.plot_loss(title="10% Synonym Augment on LSTM")

# Plot Accuracy
pipeline_augmented_5.plot_acc(title="5% Synonym Augment on LSTM")
pipeline_augmented_10.plot_acc(title="10% Synonym Augment on LSTM")

Models' Loss

Models' Accuracy