Friday, April 26, 2024
Tensorflow: LSTM Text Classification with emoji [MASK]
Harrinson Arrubla
For more information visit the project's GitHub repository. This can be deployed directly in Jupyter Notebooks or by running the Streamlit app.
Why It All Started
Text classification is everywhere, and nowadays for most companies, including the one I am working for, it is crucial to gain a deep knowledge of the customer's experience. Indeed, this blog emerges from a sentiment analysis task over a large dataset. Here I outline the initial model used for classification, using a technique that I call emoji [MASK], inspired by the masking strategy TensorFlow uses for text classification.
Background
I had two prior experiences with text classification algorithms before this task, during the Data Mining and Warehousing graduate class at the University of Texas Rio Grande Valley, led by Professor Yifeng Gao. The objective was clear: to win a text classification competition. Our team decided to utilize a pre-trained BERT model, while one of our competitors, who ultimately emerged victorious with an accuracy rate 0.02% higher than ours and using only half of our computational and time resources, employed a hyper-parameterized Naive-Bayes model. Quite surprising!
Consequently, I opted to use the Naive-Bayes algorithm for this task. However, I gained invaluable insights from the bidirectional LSTM (Long Short-Term Memory) classification model using TensorFlow's sequential models, prompting me to write a brief blog post about it.
Context
Although I will not disclose the actual dataset utilized, we can employ a rather decent dataset found here. This Twitter dataset contains comments categorized into three classes: Positive, Negative, and Neutral. Messages that are irrelevant to the entity are considered Neutral. The objective is to assess the sentiment of the message regarding the specified entity.
In this blog post, we will perform binary classification (positive or negative). However, I encourage readers to customize the model for classifying more than two sentiments. Additionally, since the model was trained on short-length sentences, utilizing sentences ranging from approximately 110 to 160 characters might be advisable for optimal performance.
Cleaning the Data
Let's extract the datasets and create the directory paths for accessing documents in further steps. We'll use the zipfile library to extract the files from a zip archive. As usual, pandas provides powerful data frame management for large datasets.
import zipfile
import pandas as pd

zip_path = './media/datasets/archive.zip'
extract_dir = './media/datasets/csv/'
vocabulary_dir = './media/vocabulary/'
models_dir = './media/models/'

# Extract the CSV files from the archive
with zipfile.ZipFile(zip_path, 'r') as zip_file:
    zip_file.extractall(extract_dir)

# The CSVs have no header row, so we name the columns explicitly
dataframe_train = pd.read_csv(extract_dir + 'twitter_training.csv', names=['ID', 'user', 'SC', 'Comment'])
dataframe_test = pd.read_csv(extract_dir + 'twitter_validation.csv', names=['ID', 'user', 'SC', 'Comment'])
The data looks as follows:
ID | user | Sentiment | Comment |
---|---|---|---|
2619 | Borderlands | Positive | i maed dis bc i love maya (the siren) and im v... |
2619 | Borderlands | Positive | i do dis bc i am maya (the siren) and im very ... |
2619 | Borderlands | Positive | i maed is dis bc i love maya ( the siren ) and... |
2619 | Borderlands | Positive | i maed dis bc i love the (the beautiful) girl ... |
2620 | Borderlands | Positive | A reimagining of the Keep on the Borderlands. ... |
2620 | Borderlands | Positive | And he also makes the caves of chaos! This stu... |
2620 | Borderlands | Positive | And he's also doing "The Eagles"! This stuff w... |
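Before filtering anything, it can help to glance at how the sentiment labels are distributed. A minimal sanity-check sketch, assuming the columns are named as in the read_csv call above (the exact counts depend on the dataset version):

# Distribution of sentiment labels and overall sizes of both splits
print(dataframe_train['SC'].value_counts())
print(dataframe_train.shape, dataframe_test.shape)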
Reducing our Dataset
Next, we drop unnecessary columns such as 'ID' and 'user', and use list as the type format for inputting data to the model.
def dropRowValue(dataframe, column, values):
    return dataframe[~dataframe[column].isin(values)]

def sentimentFilter(sentence, category):
    """
    By default, the category passed to sentimentFilter is mapped to 1; every other label is mapped to 0.
    """
    sentiment_num_list = []
    for sentiment in sentence:
        if sentiment == category:
            sentiment_num_list.append(1)
        else:
            sentiment_num_list.append(0)
    return sentiment_num_list
# training
filter_dataframe_train = dropRowValue(dataframe_train,'SC',['Neutral','Irrelevant']).drop(['ID','user'], axis=1)
list_x_train = filter_dataframe_train['Comment'].to_list()
list_y_train = filter_dataframe_train['SC'].to_list()
y_train = sentimentFilter(list_y_train,'Positive')
# testing
filter_dataframe_test = dropRowValue(dataframe_test,'SC',['Neutral','Irrelevant']).drop(['ID','user'], axis=1)
list_x_test = filter_dataframe_test['Comment'].to_list()
list_y_test = filter_dataframe_test['SC'].to_list()
y_test = sentimentFilter(list_y_test,'Positive')
Emoji [MASK]
I believe emojis carry significant weight in everyday language. Why? More than ninety percent of online consumers use emojis every day, and emojis are multi-cultural and multi-lingual.
Here, I introduce the emoji [MASK]. Essentially, it applies a [MASK]-style strategy, similar to the masking used in some TensorFlow tokenization models, to the emojis across all the training data in order to improve accuracy. By doing so, the model does not perceive the raw Unicode character but the emotion it conveys instead.
There are other approaches to this same problem, such as training the model to identify each emoji and convert it into an emotion, which might be interesting. Nevertheless, the aim here is a simple model with low computational demands, so I decided to use a Python library called emoji.
import re
import emoji

def emojiMask(sentence):
    # Replace each emoji with its Unicode name, e.g. 🌊 becomes \N{WATER WAVE}
    emoji_mask_sentence = emoji.replace_emoji(
        sentence,
        replace=lambda chars, data_dict: chars.encode('ascii', 'namereplace').decode())
    # Strip the \N{...} wrapper so only the emoji's name remains
    emoji_mask_sentence = re.sub(r"\\N\{(.+?)\}", r"\1", emoji_mask_sentence)
    return emoji_mask_sentence

x_train_emojimask = [emojiMask(str(row)) for row in list_x_train]
x_test_emojimask = [emojiMask(str(row)) for row in list_x_test]
How Does the emojiMask Function Work?
Well, consider the following text:
- "The sound of the waves 🌊 and the warmth of the sun ☀️ make it the perfect day to relax in my swimsuit 👙 and listen to the birds 🦜 singing in the palm trees 🌴. Don't forget your sunglasses 🕶️!"
Then, applying the emojiMask function, the result is:
- "The sound of the waves WATER WAVE and the warmth of the sun BLACK SUN WITH RAYSVARIATION SELECTOR-16 make it the perfect day to relax in my swimsuit BIKINI and listen to the birds PARROT singing in the palm trees PALM TREE. Don't forget your sunglasses DARK SUNGLASSESVARIATION SELECTOR-16!"
Quite interesting, right?
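As a quick usage example, the snippet below runs that same sentence through emojiMask; a minimal sketch, noting that the exact Unicode names depend on the installed emoji library version:

sample = ("The sound of the waves 🌊 and the warmth of the sun ☀️ make it the perfect day "
          "to relax in my swimsuit 👙 and listen to the birds 🦜 singing in the palm trees 🌴. "
          "Don't forget your sunglasses 🕶️!")
# Each emoji is replaced by its Unicode name, so the model later sees plain words
print(emojiMask(sample))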
Before moving on, we need to decide whether removing stopwords is a priority and how this will affect our model, positively or negatively. In case you want to try another approach, you are very welcome to do so. Here, I decided to remove the stopwords using the nltk package together with the following function:
import re
import nltk
# nltk.download('stopwords')  # run once if the stopwords corpus is not yet available

def cleanSentence(sentence, stopwords=True):
    stopwords_vocabulary = nltk.corpus.stopwords.words('english')
    # Build a regex that matches any stopword as a whole word
    stopwords_pattern = r'\b(?:' + r'\s*|'.join(map(re.escape, stopwords_vocabulary)) + r')\b' if stopwords else ''
    lower_sentence = sentence.lower()
    # Remove stopwords and punctuation in a single pass
    clean_sentence = re.sub(stopwords_pattern + r'|[^\w\s]', '', lower_sentence)
    return clean_sentence

x_train_clean = [cleanSentence(row) for row in x_train_emojimask]
x_test_clean = [cleanSentence(row) for row in x_test_emojimask]
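To see the two preprocessing steps chained together, here is a short hedged sketch on a made-up comment (it assumes the nltk stopwords corpus has already been downloaded):

masked = emojiMask("I love this game 🎮 but the servers are so slow 😡!")
# Lowercased, stopwords and punctuation removed; the emoji names survive as regular words
print(cleanSentence(masked))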
Tensorflow Implementation
In TensorFlow, the inputs to the model must be set up beforehand. Using the pre-loaded models from TensorFlow Datasets makes running a model much easier because the tensor structure is already configured for this purpose.
Here, I decided to transform all data into lists and then into tensors. Additionally, we prepare the datasets for training and testing the final model. The shuffle function randomizes the dataset to prevent the model from learning the order of the data. The batch function divides the shuffled dataset into batches, improving training speed and efficiency by allowing the model to process multiple samples at once. The prefetch function overlaps preprocessing and model execution, reducing training time.
import tensorflow as tf

BUFFER_SIZE = 10000
BATCH_SIZE = 64

# tensor vector for model
def trainTensorSlice(xlist, ylist):
    x_tensor = tf.constant(xlist)
    y_tensor = tf.constant(ylist)
    dataset_tensor = tf.data.Dataset.from_tensor_slices((x_tensor, y_tensor))  # no batching here; batching happens later
    return dataset_tensor

train_ds = trainTensorSlice(x_train_clean, y_train)
test_ds = trainTensorSlice(x_test_clean, y_test)

train_dataset = train_ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
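It is worth pulling a single batch to confirm the tensors have the expected shapes before building the model; a minimal sketch:

# texts is a (BATCH_SIZE,) string tensor, labels a (BATCH_SIZE,) integer tensor
for texts, labels in train_dataset.take(1):
    print(texts.shape, labels.shape)
    print(texts[0].numpy(), labels[0].numpy())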
Vocabulary
I started with the simplest model found on the TensorFlow website; specifically, the issue referenced there gives a clear description of how to set up a simple text classification model. Here, I realized that if the tokenization needs are simple and the vocabulary is small and static, the approach using StaticVocabularyTable can work efficiently. However, for more complex tokenization needs or dynamic vocabularies, a higher-level API like TextVectorization or a text encoder might be more appropriate. These higher-level APIs offer more flexibility and are easier to integrate into TensorFlow models, making them suitable for a wider range of tokenization tasks.
I strongly encourage you to look into tf.lookup.KeyValueTensorInitializer, which provides an in-depth insight into TensorFlow models.
VOCAB_SIZE = 2000
SEQUENCE_LENGTH = 100

encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE,
                                            output_mode='int',
                                            output_sequence_length=SEQUENCE_LENGTH)
encoder.adapt(train_dataset.map(lambda text, label: text))
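A quick way to verify what the encoder learned is to print the first entries of the adapted vocabulary and encode a sample sentence; a sketch whose output depends on your training data:

# The first two entries are reserved: '' for padding and '[UNK]' for out-of-vocabulary tokens
print(encoder.get_vocabulary()[:20])
# Each word is mapped to an integer index, padded/truncated to SEQUENCE_LENGTH
print(encoder(tf.constant(['water wave warmth sun perfect day'])))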
Conclusion
This introductory model offers simplicity and efficiency through its straightforward architecture and use of a pre-trained TextVectorization layer. It benefits from the ability to handle variable-length sequences and capture long-range dependencies, thanks to the mask_zero=True setting and the Bidirectional LSTM layers.
However, its simplicity comes at the cost of limited complexity, which may hinder its performance on tasks requiring nuanced language understanding. Additionally, its relatively large number of parameters could lead to overfitting, particularly on smaller datasets, and its complexity might reduce interpretability and ease of debugging, especially for more complex tokenization and language processing needs.
In this case, the dataset used is quite good at avoiding overfitting. If you want to explore other vocabulary sizes, it could be a good learning experience; it will demonstrate, for example, that some sizes induce overfitting. For loading extra vocabularies:
import pickle

with open(vocabulary_dir + 'vocabulary2T.obj', 'rb') as vocabulary_file:
    loaded_vocabulary = pickle.load(vocabulary_file)
encoder.set_vocabulary(loaded_vocabulary)
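Conversely, if you want to persist the vocabulary you just adapted so it can be reloaded later as above, a small hedged sketch (reusing the vocabulary2T.obj name from this post's directory layout):

# Save the adapted vocabulary so the encoder can be rebuilt without re-adapting
with open(vocabulary_dir + 'vocabulary2T.obj', 'wb') as f:
    pickle.dump(encoder.get_vocabulary(), f)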
Model
The general LSTM model I used at this starting stage comes from following the TensorFlow Text Classification docs.
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
If we used the sigmoid activation function in the output layer of our model, converting logits (raw model outputs) to probabilities, the loss function should not be given from_logits=True. That flag is used with the BinaryCrossentropy loss when the model outputs raw logits and the labels are in the range [0, 1]. Since a sigmoid activation already squashes the outputs to the [0, 1] range, from_logits=False would be the appropriate choice, allowing the loss function to interpret the model's output as probabilities directly. In our case, the final Dense(1) layer has no activation, so the model outputs raw logits and we compile with from_logits=True.
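For completeness, here is a hedged sketch of that alternative pairing, a hypothetical sigmoid_model shown only to illustrate the point; it is not the configuration trained below:

# Alternative head: sigmoid output (probabilities) paired with from_logits=False
sigmoid_model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=len(encoder.get_vocabulary()), output_dim=64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  # probabilities instead of raw logits
])
sigmoid_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                      optimizer=tf.keras.optimizers.Adam(1e-4),
                      metrics=['accuracy'])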
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
model.summary()

history = model.fit(train_dataset,
                    epochs=15,
                    verbose=2)
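After training, a quick check of how the model generalizes to the held-out validation tweets; exact numbers will vary between runs:

# Evaluate on the prepared test split
test_loss, test_acc = model.evaluate(test_dataset)
print('Test loss:', test_loss, '- Test accuracy:', test_acc)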
Saving and Exporting the Model
According to the docs, the model can be saved using the save method. Since the model contains a TextVectorization layer, which has non-tf.Variable weights, I should use the TensorFlow format (save_format='tf') instead of the HDF5 format (save_format='h5'). Additionally, tf.keras.models.load_model will load the model checkpoints, weights, and complete features for use without training the model again.
Regarding the history object returned by model.fit(), which is not directly serializable, we cannot rely on the model.save(save_format='tf') command for it. Instead, we use the pickle package: the history is saved with pickle.dump() and loaded back with pickle.load().
- For saving:
model.save('./media/models/2T/model2T_15e/', save_format='tf')
with open('./media/models/2T/model2T_15e/training_history.obj', 'wb') as f:
    pickle.dump(history.history, f)
- For loading:
loaded_model = tf.keras.models.load_model('./media/models/2T/model2T_15e')
with open('./media/models/2T/model2T_15e/training_history.obj', 'rb') as f:
    loaded_history = pickle.load(f)
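Once the model is reloaded, classifying a new comment means repeating the same preprocessing (emoji mask plus cleaning) before calling the model. A minimal sketch with a hypothetical predictSentiment helper, assuming the functions defined earlier are still in scope:

def predictSentiment(comment, trained_model):
    # Apply the same preprocessing used during training
    clean_comment = cleanSentence(emojiMask(comment))
    # The model outputs a raw logit; values above 0 correspond to the positive class
    logit = trained_model.predict(tf.constant([clean_comment]), verbose=0)[0][0]
    return 'Positive' if logit > 0 else 'Negative'

print(predictSentiment("I can't stop playing this game 😍", loaded_model))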
Evaluating the Model
The conclusions we draw from the model output are:
- Decreasing Loss: The loss values are decreasing with each epoch, starting from 0.5434 and reaching 0.2184 at the end. A decreasing loss indicates that the model is improving and learning to better fit the training data.
- Increasing Accuracy: The accuracy values are increasing with each epoch, starting from 0.7283 and reaching 0.9047 at the end. An increasing accuracy means that the model is becoming better at making correct predictions on the given data.
- Convergence: Both the loss and accuracy curves appear to be converging, which suggests that the model is reaching a stable point and further training may not significantly improve the performance.
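One way to visualize these trends is to plot the saved history; a minimal sketch using matplotlib (an extra dependency not used elsewhere in this post):

import matplotlib.pyplot as plt

# loaded_history is the dictionary saved from history.history
plt.plot(loaded_history['loss'], label='loss')
plt.plot(loaded_history['accuracy'], label='accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()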
One way of analyzing the model performance is by running a Monte Carlo simulation. For doing that, there is a comments.json file in the datasets directory. This file has 250+ comments with their own categorization (positive or negative), and it will be used for estimating the accuracy of the model over repeated runs on short sentences with emojis.
Monte Carlo Simulation
Ever since I learned about Monte Carlo simulations, I have used them as much as I can. They are a great tool for measuring accuracy on randomly generated data. In this case, however, it was quite demanding to create the datasets needed to validate our Monte Carlo simulations. The dataset used contains 250+ comments specially written (via prompt engineering) for this task.
As a result, I found that the accuracy fluctuates between ~70% and 78%, which implies that the model correctly classifies roughly 7 out of every 10 comments, indicating a "good" performance of the model. It should be noted that the comments used in the comments.json file do not necessarily belong to the vocabulary. One possible feature that might increase the model's performance is including as many emojis as possible in the vocabulary, since emojis convey similar emotions even when the topic or the language spoken changes.
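To make the procedure concrete, here is a hedged sketch of such a Monte Carlo run. The file path and the 'comment' / 'sentiment' keys below are assumptions for illustration, since the exact structure of comments.json is not shown in this post:

import json
import random

with open('./media/datasets/comments.json', 'r') as f:
    labeled_comments = json.load(f)  # assumed: list of {'comment': ..., 'sentiment': 'Positive'/'Negative'}

accuracies = []
for _ in range(100):  # number of Monte Carlo repetitions
    sample = random.sample(labeled_comments, 50)  # random subset for this run
    texts = tf.constant([cleanSentence(emojiMask(item['comment'])) for item in sample])
    logits = loaded_model.predict(texts, verbose=0)[:, 0]
    predictions = ['Positive' if logit > 0 else 'Negative' for logit in logits]
    correct = sum(pred == item['sentiment'] for pred, item in zip(predictions, sample))
    accuracies.append(correct / len(sample))

print('Mean accuracy over all runs:', sum(accuracies) / len(accuracies))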
Final Comments
There are several exciting strategies we can explore to enhance our model further. Playing with the datasets, trying out different preprocessing techniques, and experimenting with new features can lead to better performance and insights.
Additionally, generating a validation set from our existing data can help us fine-tune our model and evaluate its performance more effectively. This will allow us to test our model on unseen data and make any necessary adjustments to improve its accuracy and robustness.
Remember, machine learning is all about experimentation and iteration. Let's continue to push the boundaries of what's possible and strive for excellence in our project!