Helper Functions/Classes
Here are some helper functions and classes designed to make NLarge easier to use in a deep neural network (DNN) workflow.
dataset_concat
You may import the augment_data function and the MODE class from this library file.
MODE class
The MODE class categorizes the available augmentation techniques in NLarge. It is used with the augment_data function.
Augmentation Modes
- RANDOM
  - SWAP: Swaps words within the text.
  - SUBSTITUTE: Substitutes words with other words.
  - DELETE: Deletes certain words from the text.
  - CROP: Trims the text.
- SYNONYM
  - WORDNET: Replaces words with synonyms from WordNet.
- LLM
  - PARAPHRASE: Paraphrases the text using a language model.
  - SUMMARIZE: Summarizes the text using a language model.
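To make the RANDOM modes more concrete, here is a minimal sketch of word-level swap, delete, and crop operations on whitespace-separated text. The function names, the `p` and `fraction` parameters, and the exact behavior are assumptions made for illustration; they are not NLarge's implementation, which may tokenize and select words differently.

```python
import random


def swap_words(text, rng):
    """Swap the words at two randomly chosen positions (illustrative only)."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def delete_words(text, rng, p=0.2):
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept or [rng.choice(words)])


def crop_words(text, rng, fraction=0.3):
    """Remove one contiguous run covering roughly `fraction` of the words."""
    words = text.split()
    if len(words) < 2:
        return text
    span = min(max(1, int(len(words) * fraction)), len(words) - 1)
    start = rng.randrange(0, len(words) - span + 1)
    return " ".join(words[:start] + words[start + span:])


rng = random.Random(0)  # fixed seed so repeated runs give the same output
sentence = "the quick brown fox jumps"
print(swap_words(sentence, rng))
print(delete_words(sentence, rng))
print(crop_words(sentence, rng))
```

Each helper preserves the label-independent nature of these augmentations: only the text changes, so the original label can be reused for the new sample.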
augment_data Function
The augment_data function generates new samples from an existing dataset using the augmentation techniques provided in NLarge, including random transformations, synonym replacement, and language model-based paraphrasing or summarization. Users can specify a percentage for each augmentation mode and stack multiple modes to diversify and enlarge the dataset, which can help improve model robustness and reduce overfitting.
Parameters
dataset (Dataset)
The original dataset to augment, structured with at least two fields: "text" for the input text and "label" for associated labels.
percentages (dict)
A dictionary specifying the augmentation techniques to apply and the percentage of data to be augmented by each technique. Keys are augmentation modes (from the MODE class) and values are floats representing the percentage of samples for each augmentation.
Returns
The function returns a list of augmented dataset samples. Each sample is a dictionary with "text" (augmented text) and "label" (original label) fields.
Example Usage:
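The sketch below illustrates the idea of stacking augmentation modes through a percentages dictionary. It is self-contained and does not call NLarge: `Mode`, `TRANSFORMS`, and `augment_data_sketch` are hypothetical stand-ins, the values are treated as fractions of the dataset size (an assumption), and samples are drawn with replacement (also an assumption). Check the library itself for the real `augment_data` behavior.

```python
import random
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for NLarge's MODE class."""
    SWAP = "swap"
    DELETE = "delete"


def _swap_adjacent(text, rng):
    # Swap one randomly chosen pair of neighbouring words.
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def _delete_one(text, rng):
    # Remove a single randomly chosen word, keeping at least one word.
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


TRANSFORMS = {Mode.SWAP: _swap_adjacent, Mode.DELETE: _delete_one}


def augment_data_sketch(dataset, percentages, seed=0):
    """Return new samples only; each keeps the label of its source sample."""
    rng = random.Random(seed)
    augmented = []
    for mode, fraction in percentages.items():
        if fraction < 0:
            raise ValueError(f"fraction for {mode} must be non-negative")
        n_new = round(len(dataset) * fraction)
        # Assumption: source samples are drawn with replacement.
        for sample in rng.choices(dataset, k=n_new):
            augmented.append({
                "text": TRANSFORMS[mode](sample["text"], rng),
                "label": sample["label"],
            })
    return augmented


data = [{"text": "a fine and moving film", "label": i % 2} for i in range(10)]
new_samples = augment_data_sketch(data, {Mode.SWAP: 0.5, Mode.DELETE: 0.2})
print(len(new_samples), new_samples[0])
```

Because the result is a plain list of "text"/"label" dictionaries, it can be concatenated with the original samples before training.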
pipeline
You may import the TextClassifierPipeline class from this library file.
TextClassificationPipeline
The TextClassifierPipeline class in the NLarge library is designed to streamline the process of setting up and training a text classification model. It handles data preprocessing, vocabulary creation, model instantiation, and evaluation, enabling users to initialize a complete pipeline with minimal setup.
Key Features
- Data Preparation
  - Tokenizes text, builds vocabulary, and numericalizes text data.
  - Supports handling both augmented and test datasets, and splits the augmented dataset into training and validation sets.
  - Automatically configures PyTorch DataLoaders to handle batching and padding.
- Model Initialization
  - Instantiates a text classification model based on the TextClassifierRNN class.
  - Loads pre-trained embeddings (GloVe) for initializing word vectors, improving performance for text representations.
  - Allows users to specify key hyperparameters such as embedding dimension, hidden dimension, number of layers, dropout rate, and learning rate.
- Training and Evaluation
  - Includes methods for training (train_model) and evaluation (evaluate) of the model, using accuracy and cross-entropy loss.
  - Tracks training and validation loss and accuracy over epochs.
  - Supports saving the best model weights during training based on validation performance.
- Visualization
  - Provides plot_loss and plot_acc methods to visualize the training and validation loss and accuracy over epochs, aiding in monitoring model convergence and generalization.
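The data preparation steps listed above (tokenize, build a vocabulary, numericalize, split, pad) can be sketched in plain Python as follows. This is an illustration of the general technique under stated assumptions, not the pipeline's own code: the helper names, the whitespace tokenizer, the `<pad>`/`<unk>` tokens, and the 80/20 split are all hypothetical, and the real class delegates batching and padding to PyTorch DataLoaders.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocab(token_lists, min_freq=1):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def numericalize(tokens, vocab):
    """Replace tokens with ids, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


def split_train_valid(samples, valid_fraction=0.2):
    """Hold out the last `valid_fraction` of samples for validation."""
    cut = len(samples) - round(len(samples) * valid_fraction)
    return samples[:cut], samples[cut:]


def pad_batch(id_lists, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(ids) for ids in id_lists)
    return [ids + [pad_id] * (width - len(ids)) for ids in id_lists]


texts = ["A good film", "bad"]
tokens = [tokenize(t) for t in texts]
vocab = build_vocab(tokens)
batch = pad_batch([numericalize(t, vocab) for t in tokens])
print(vocab)
print(batch)
```

Building the vocabulary from the training portion only, and mapping unseen test words to the unknown id, keeps test data from leaking into preprocessing.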
Pipeline Initialization
To initialize a pipeline, users need to provide:
- augmented_data: The training dataset, ideally augmented with techniques provided by NLarge to improve robustness.
- test_data: The test dataset for final evaluation.
- hyperparameters: Key parameters such as batch_size, embedding_dim, hidden_dim, n_layers, dropout_rate, and lr.
Example Initialization
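As a rough sketch, the hyperparameters listed above could be collected and validated in one place before they are handed to the pipeline. `PipelineHyperparameters` is a hypothetical helper, not part of NLarge, and the default values are placeholders chosen for illustration rather than the library's defaults. The constructor call shown in the final comment is likewise an assumption; check the class signature before relying on it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineHyperparameters:
    """Hypothetical container; field names follow the list above."""
    batch_size: int = 64        # placeholder value
    embedding_dim: int = 300    # placeholder value
    hidden_dim: int = 128       # placeholder value
    n_layers: int = 2           # placeholder value
    dropout_rate: float = 0.5   # placeholder value
    lr: float = 1e-3            # placeholder value

    def __post_init__(self):
        # Fail early on values that cannot describe a trainable model.
        for name in ("batch_size", "embedding_dim", "hidden_dim", "n_layers"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be a positive integer")
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in [0, 1)")
        if self.lr <= 0:
            raise ValueError("lr must be positive")


params = PipelineHyperparameters(batch_size=32, lr=5e-4)
kwargs = asdict(params)
print(kwargs)

# Assumed usage, not verified against the library:
# pipeline = <pipeline class>(augmented_data=..., test_data=..., **kwargs)
```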