The world is going digital in every possible way, and sharing has become a habit of modern internet users. When we do or see something special, we try to share it with those close to us right away. Likewise, when we see posts or comments from people we know (and sometimes from people we don't), we usually respond. All of these actions carry a sentiment, which can be POSITIVE or NEGATIVE.

Social media applications, marketing companies and digital media companies use these sentiments to promote their messages or products. This is where a sentiment classifier comes in.

Today we will build a “Sentiment Classifier” using a “Recurrent Neural Network”, one of the most powerful tools in deep learning for sequential data such as text. We will walk through the steps in as much detail as possible.

Tools Used:

  • TensorFlow
  • Keras
  • scikit-learn

Step 1: Data Aggregation

For our application, we used the pre-processed IMDB movie review dataset, as found here. We also plan to extend this dataset, which may improve performance.

Once the dataset is downloaded, let’s unpack it in our project directory. The dataset is divided into train and test sets. For our test, we merged them into a single dataset and let scikit-learn and Keras split it at runtime. To do so, we followed these steps:

# tar -xzf aclImdb_v1.tar.gz
# mkdir -p resources/dataset/pos
# mkdir -p resources/dataset/neg
# cp aclImdb/train/pos/*.txt resources/dataset/pos/
# cp aclImdb/test/pos/*.txt resources/dataset/pos/
# cp aclImdb/train/neg/*.txt resources/dataset/neg/
# cp aclImdb/test/neg/*.txt resources/dataset/neg/
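
The same merge can also be sketched in Python. This is a minimal sketch, not part of the original codebase: the function name merge_splits and its parameters are hypothetical, and the folder layout is assumed to match the commands above. One deliberate difference: each copied file is prefixed with its split name, because a plain copy overwrites the earlier file whenever a file name occurs in both the train and test folders.

```python
import shutil
from pathlib import Path


def merge_splits(src_root, dst_root, splits=("train", "test"), classes=("pos", "neg")):
    """Copy every review of each split into one folder per class; return file counts."""
    counts = {}
    for label in classes:
        target = Path(dst_root) / label
        target.mkdir(parents=True, exist_ok=True)
        for split in splits:
            source = Path(src_root) / split / label
            for review in sorted(source.glob("*.txt")):
                # Prefix with the split name (an addition in this sketch) so that
                # identically named train and test files do not overwrite each other.
                shutil.copy(review, target / f"{split}_{review.name}")
        counts[label] = sum(1 for _ in target.glob("*.txt"))
    return counts


# Example call, assuming the archive was unpacked into ./aclImdb:
# merge_splits("aclImdb", "resources/dataset")
```

The returned counts give a quick check that no review was lost during the merge.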

Step 2: Dataset Preparation

Next, we use the following code to tokenize the reviews and pad them to equal length.

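import os

import numpy as np
# import paths assume TensorFlow 2.x with its bundled Keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

sentence_list = []
labels = []

print("Reading Positive dataset: resources/dataset/pos/*.txt")
for filename in os.listdir("resources/dataset/pos/"):
    filepath = "resources/dataset/pos/{}".format(filename)
    # read as UTF-8 and turn HTML line breaks into newlines
    with open(filepath, "r", encoding="utf-8") as fp:
        txt = fp.read().replace("<br />", "\n")
        sentence_list.append(txt)
        labels.append(1)  # 1 = positive

print("Reading Negative dataset: resources/dataset/neg/*.txt")
for filename in os.listdir("resources/dataset/neg/"):
    filepath = "resources/dataset/neg/{}".format(filename)
    with open(filepath, "r", encoding="utf-8") as fp:
        txt = fp.read().replace("<br />", "\n")
        sentence_list.append(txt)
        labels.append(0)  # 0 = negative

# tokenize the dataset, keeping only the most frequent words
# note: max_features is used both as the vocabulary size and as the padded length
max_features = 100
tokenizer = Tokenizer(num_words=max_features, split=' ', lower=True)
tokenizer.fit_on_texts(sentence_list)
X = tokenizer.texts_to_sequences(sentence_list)
# pad or truncate every review to max_features tokens
X = pad_sequences(X, maxlen=max_features)
# Keras expects the labels as an array, not a Python list
Y = np.array(labels)

batch_size = 512
embed_dim = 300  # size of each word embedding vector
lstm_out = 100   # number of LSTM units
input_length = X.shape[1]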
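
To make the tokenize-and-pad step more concrete, here is a simplified pure-Python sketch of the idea. It is an illustration only, not the Keras implementation: the helper names are hypothetical, words are split on whitespace without the punctuation filtering that Tokenizer applies, and the padding mimics the default behaviour of pad_sequences (zeros added on the left, long sequences cut from the left).

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent, 0 is kept for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen items of longer sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


reviews = ["a great great movie", "a bad movie"]
word_index = fit_word_index(reviews)
sequences = texts_to_sequences(reviews, word_index, num_words=4)
padded = pad_pre(sequences, maxlen=5)
print(padded)  # [[0, 1, 2, 2, 3], [0, 0, 0, 1, 3]]
```

Note how the word "bad" disappears from the second review: it has index 4, which falls outside the vocabulary limit of num_words=4. The same thing happens in the tutorial code to every word outside the most frequent ones, so the vocabulary limit directly controls how much of each review the model can see.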

Step 3: Preparing the model

Let’s initialize a Sequential model, compile it and start the training.

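from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import LSTM, Dense, Embedding, SpatialDropout1D
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=input_length))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
# a single sigmoid unit outputs the probability of the positive class
# (softmax over one unit would always output 1.0)
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.001, random_state=None)
# directory for saved checkpoints (assumed value, adjust as needed)
model_path = "resources/"
# save the model whenever validation accuracy improves
# (the metric is logged as 'val_accuracy' in TensorFlow 2.x; older Keras releases use 'val_acc')
checkpoint = ModelCheckpoint(model_path + "models-0.75.bin.hdf5", monitor='val_accuracy', verbose=True, save_best_only=True, mode='max')
# start the training
model.fit(X_train, Y_train, epochs=150, batch_size=batch_size, validation_split=0.005, shuffle=True, callbacks=[checkpoint])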
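
A note on the output layer: with a single output unit, the choice of activation matters. Softmax normalizes over all units of the layer, so over one unit it always returns 1.0, whereas a sigmoid maps the raw score to a probability between 0 and 1. The following pure-Python check, independent of Keras, illustrates this; the function names are only for this sketch.

```python
import math


def softmax(logits):
    """Softmax over a list of raw scores (shifted by the maximum for stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function: maps any real score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for logit in (-3.0, 0.0, 3.0):
    # softmax over a single value is always 1.0, whatever the score is
    print(logit, softmax([logit])[0], round(sigmoid(logit), 4))
```

This is why a one-unit binary classifier trained with binary_crossentropy is normally paired with a sigmoid output.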

At this point, the training output should appear on screen as the LSTM model starts learning sentiment patterns from the labeled dataset.

We have attached the codebase with the merged dataset here. You can use it to kick-start your own tests. Please feel free to get back to us with questions.
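
It helps to know how many reviews end up in each part of the data with the split values used above. The sketch below is back-of-the-envelope arithmetic only. It assumes the merged folders hold 50,000 reviews, and the rounding rules (round up for test_size, round down for the training share) are assumptions about how scikit-learn and Keras compute the sizes, so the exact numbers may differ by one in practice.

```python
import math


def split_sizes(n_samples, test_size, validation_split):
    """Return (train, validation, test) sample counts for the two-stage split."""
    n_test = math.ceil(n_samples * test_size)  # assumed rounding for the test set
    n_fit = n_samples - n_test                 # samples passed on to model.fit
    n_train = math.floor(n_fit * (1.0 - validation_split))  # assumed rounding
    return n_train, n_fit - n_train, n_test


train, val, test = split_sizes(50_000, test_size=0.001, validation_split=0.005)
print(train, val, test)  # 49700 250 50
```

With so few validation and test samples, the reported accuracy can fluctuate noticeably between runs, so larger fractions may be worth trying.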

 

Note: In our next tutorial, we will explain how to combine a Convolutional Neural Network and a Recurrent Neural Network to classify the sentiment of text content.