Helper Functions/Classes

Here are some helper functions and classes we designed to aid the use of NLarge in deep neural network (DNN) workflows.

dataset_concat

You may import the augment_data function and the MODE class from this library file.

MODE class

The MODE class categorizes the available augmentation techniques in NLarge and is used together with the augment_data function; the sketch after the list below shows how its members are referenced.

Augmentation Modes

  • RANDOM
    • SWAP : Swaps words within the text.

    • SUBSTITUTE : Substitutes words with other words.

    • DELETE : Deletes certain words from the text.

    • CROP : Trims the text.

  • SYNONYM
    • WORDNET : Replaces words with synonyms from WordNet.

  • LLM
    • PARAPHRASE : Paraphrases the text using a language model.

    • SUMMARIZE : Summarizes the text using a language model.
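
As a quick illustration, these constants are referenced directly when building the percentages mapping that is passed to augment_data (a minimal sketch: the proportions are arbitrary, and the LLM modes assume a language-model backend is available in your setup).

from NLarge.dataset_concat import MODE

# Stack three techniques; each value is the fraction of the dataset
# to augment with that mode (the values here are illustrative).
percentages = {
    MODE.RANDOM.SWAP: 0.2,      # random word swaps
    MODE.SYNONYM.WORDNET: 0.3,  # WordNet synonym replacement
    MODE.LLM.PARAPHRASE: 0.1,   # LLM-based paraphrasing
}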

augment_data Function

The augment_data function generates new samples from an existing dataset using the augmentation techniques provided in NLarge, including random transformations, synonym replacement, and language model-based paraphrasing or summarization. Users specify the percentage of data each augmentation mode should receive and can stack multiple modes to diversify and enlarge the dataset, which helps improve model robustness and prevent overfitting.

Parameters

dataset (Dataset)

The original dataset to augment, structured with at least two fields: "text" for the input text and "label" for associated labels.
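
For reference, a minimal input could look like the following (a sketch assuming a Hugging Face datasets.Dataset with toy data; substitute your own loading code):

from datasets import Dataset

# Hypothetical toy dataset with the required "text" and "label" fields.
original_train_data = Dataset.from_dict({
    "text": ["a gripping, well-acted thriller", "dull and far too long"],
    "label": [1, 0],
})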

percentages (dict)

A dictionary specifying the augmentation techniques to apply and the percentage of data to be augmented by each technique. Keys are augmentation modes (from the MODE class) and values are floats giving the proportion of samples to augment with each mode (e.g., 0.5 for 50%).

Returns

The function returns a list of augmented dataset samples. Each sample is a dictionary with "text" (augmented text) and "label" (original label) fields.

Example Usage

from NLarge.dataset_concat import augment_data, MODE

# Augment and increase size by 100%
percentages = {
    MODE.RANDOM.SUBSTITUTE: 0.5,  # 50% of data for random augmentation
    MODE.SYNONYM.WORDNET: 0.5,  # 50% of data for synonym augmentation
}

augmented_data_list = augment_data(original_train_data, percentages)
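
Because the return value is a plain list of {"text", "label"} dictionaries, one convenient option is to convert it back into a Dataset and append it to the original split (a sketch assuming Hugging Face datasets; Dataset.from_list and concatenate_datasets come from that library, not NLarge):

from datasets import Dataset, concatenate_datasets

# Convert the augmented samples back into a Dataset and concatenate
# them with the original split, doubling its size in this example.
augmented_dataset = Dataset.from_list(augmented_data_list)
augmented_train_data = concatenate_datasets([original_train_data, augmented_dataset])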

pipeline

You may import the TextClassificationPipeline class from this library file.

TextClassificationPipeline

The TextClassificationPipeline class in the NLarge library is designed to streamline the process of setting up and training a text classification model. It handles data preprocessing, vocabulary creation, model instantiation, and evaluation, enabling users to initialize a complete pipeline with minimal setup.

Key Features

  1. Data Preparation
    • Tokenizes text, builds vocabulary, and numericalizes text data.
    • Supports handling both augmented and test datasets, and splits the augmented dataset into training and validation sets.
    • Automatically configures PyTorch DataLoaders to handle batching and padding.
  2. Model Initialization
    • Instantiates a text classification model based on the TextClassifierRNN class.
    • Loads pre-trained embeddings (GloVe) for initializing word vectors, improving performance for text representations.
    • Allows users to specify key hyperparameters such as embedding dimension, hidden dimension, number of layers, dropout rate, and learning rate.
  3. Training and Evaluation
    • Includes methods for training (train_model) and evaluation (evaluate) of the model, using accuracy and cross-entropy loss.
    • Tracks training and validation loss and accuracy over epochs.
    • Supports saving the best model weights during training based on validation performance.
  4. Visualization
    • Provides plot_loss and plot_acc methods to visualize the training and validation loss and accuracy over epochs, aiding in monitoring model convergence and generalization (demonstrated in the sketch at the end of this section).

Pipeline Initialization

To initialize a pipeline, users need to provide:

  • augmented_data: The training dataset, ideally augmented with techniques provided by NLarge to improve robustness
  • test_data: The test dataset for final evaluation
  • hyperparameters: Key parameters like batch_size, embedding_dim, hidden_dim, n_layers, dropout_rate, and lr

Example Initialization

# Import Libraries
from NLarge.pipeline import TextClassificationPipeline
from NLarge.model.RNN import TextClassifierRNN

# Initialize Pipeline
pipeline_augmented = TextClassificationPipeline(
    augmented_data=augmented_train_data,
    test_data=original_test_data,
    max_length=128,
    test_size=0.2,
    model_class=TextClassifierRNN,
)

# Train Pipeline
pipeline_augmented.train_model(n_epochs=10)
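
After training, the evaluation and plotting methods listed under Key Features can be called on the pipeline object (a sketch: the method names come from this section, but their exact signatures and return values may differ in your installed version):

# Evaluate accuracy and loss on the held-out test set
pipeline_augmented.evaluate()

# Visualize training/validation loss and accuracy across epochs
pipeline_augmented.plot_loss()
pipeline_augmented.plot_acc()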