Hands-On Project: Sentiment Analysis of Movie Reviews

Sentiment analysis involves determining whether a given piece of text conveys a positive, negative, or neutral sentiment. In this project, we will classify movie reviews as either positive or negative.

Objective

To build a machine learning model that can classify the sentiment of movie reviews using natural language processing (NLP) techniques.

Dataset

The IMDB Movie Reviews Dataset is a widely used dataset for sentiment analysis. You can download it from Kaggle; the code below expects the file to be named IMDB Dataset.csv.

Dataset Details:

• review: The text of the review.

• sentiment: The label (positive or negative).

Tools and Libraries Required

bash

pip install pandas numpy scikit-learn nltk matplotlib seaborn tensorflow

Step-by-Step Guide

1. Import Libraries

python


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

2. Load the Dataset

python

# Load dataset
data = pd.read_csv('IMDB Dataset.csv')
# Display dataset information
print(data.head())
print(data['sentiment'].value_counts())

3. Data Cleaning

  • Convert text to lowercase.
  • Remove punctuation, special characters, and numbers.
  • Tokenize the text and remove stop-words.
  • Optionally, apply stemming or lemmatization (the function below omits this step).

python

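nltk.download('stopwords')
nltk.download('punkt')
# Note: recent NLTK releases may also require nltk.download('punkt_tab')
# Build the stop-word set once; looking it up per word inside the loop is very slow
stop_words = set(stopwords.words('english'))
# Data cleaning function
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace HTML tags such as <br />, which occur in the raw reviews, with a space
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove punctuation, special characters, and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize and remove stopwords
    words = word_tokenize(text)
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)
data['cleaned_review'] = data['review'].apply(clean_text)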

4. Split the Dataset

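python

X = data['cleaned_review']
# Encode labels: positive -> 1, negative -> 0
y = data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
# Hold out 20% of the reviews for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)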

5. Feature Extraction with CountVectorizer

Convert text into numerical features using bag-of-words:

python

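# Keep only the 5,000 most frequent words as features
vectorizer = CountVectorizer(max_features=5000)
# Fit on the training set only. The output stays sparse, which MultinomialNB
# accepts directly; .toarray() on the full dataset would need far more memory.
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)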
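
To make the bag-of-words idea concrete, here is a minimal, dependency-free sketch of what a count-based vectorizer computes: a vocabulary of the most frequent words, then one count vector per document. The helper names build_vocabulary and to_count_vector are hypothetical, and the sketch splits on whitespace only; CountVectorizer's own tokenization and tie-breaking differ in detail.

```python
from collections import Counter

def build_vocabulary(documents, max_features=None):
    """Map each of the most frequent words to a column index."""
    counts = Counter(word for doc in documents for word in doc.split())
    most_common = [word for word, _ in counts.most_common(max_features)]
    # Sort alphabetically so the column order is deterministic
    return {word: idx for idx, word in enumerate(sorted(most_common))}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

docs = ["great movie great acting", "boring movie"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each row in vectors corresponds to one document and each column to one vocabulary word. Word order is discarded, which is why the approach is called a "bag" of words.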

6. Train the Model

We use a Multinomial Naive Bayes classifier for simplicity:

python

model = MultinomialNB()
model.fit(X_train_vectors, y_train)
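
For intuition about what the classifier learns, the following is a small from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions, not scikit-learn's implementation: train_nb and predict_nb are hypothetical names, the four training sentences are made up, and words unseen during training are simply skipped.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from whitespace-split text."""
    vocab = {word for doc in documents for word in doc.split()}
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter(labels)        # label -> number of documents
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c in doc_counts:
        # Smoothing adds alpha to every word count so no probability is zero
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(document, log_prior, log_likelihood):
    """Return the label with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in document.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Made-up training data for illustration only (1 = positive, 0 = negative)
train_docs = ["great fantastic acting", "great plot", "boring slow plot", "terrible boring"]
train_labels = [1, 1, 0, 0]
prior, likelihood = train_nb(train_docs, train_labels)
print(predict_nb("fantastic great movie", prior, likelihood))
print(predict_nb("slow boring", prior, likelihood))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.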

7. Evaluate the Model

python

# Predictions
y_pred = model.predict(X_test_vectors)
# Metrics
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
# Confusion Matrix Heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
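
The figures in the classification report can be derived from the confusion matrix. The sketch below shows that arithmetic for the positive class, assuming scikit-learn's layout (rows are actual labels, columns are predicted labels). The function name binary_metrics is hypothetical, and the counts are invented for illustration; they are not results from this project.

```python
def binary_metrics(conf_matrix):
    """Derive metrics for the positive class from a 2x2 confusion matrix.

    Layout follows scikit-learn: rows are actual labels, columns are
    predicted labels, ordered [negative, positive].
    """
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for illustration only, not results from this project
example = [[40, 10],
           [5, 45]]
for name, value in binary_metrics(example).items():
    print(f"{name}: {value:.3f}")
```

Precision measures how many reviews predicted positive were actually positive, while recall measures how many actual positives were found.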

8. Test with Custom Reviews

python

# Example review
new_review = ["The movie was fantastic! Amazing plot and great acting."]
new_review_cleaned = [clean_text(review) for review in new_review]
new_review_vectorized = vectorizer.transform(new_review_cleaned).toarray()
# Prediction
prediction = model.predict(new_review_vectorized)
print("Sentiment:", "Positive" if prediction[0] == 1 else "Negative")

Enhancements with Deep Learning

Deep learning models such as LSTMs or Transformers can often improve on the bag-of-words baseline.
Here’s a brief outline for a recurrent model:

  • Tokenize the text with the TensorFlow/Keras Tokenizer.
  • Convert each review to a padded sequence of integer word IDs, then map the IDs to embeddings (learned from scratch or pretrained, e.g., Word2Vec or GloVe).
  • Build an LSTM or GRU model in TensorFlow/Keras.
  • Train the model on the cleaned dataset.
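
The first two steps of this outline turn each review into a fixed-length list of integers. Here is a dependency-free sketch of that idea, assuming whitespace tokenization, ID 0 reserved for padding, and ID 1 for unknown words. The names fit_word_index and encode_and_pad are hypothetical; the Keras utilities play the same role but differ in defaults and details.

```python
from collections import Counter

PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for words outside the vocabulary

def fit_word_index(texts, num_words=10000):
    """Assign integer IDs to the most frequent words, starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(num_words))}

def encode_and_pad(text, word_index, max_len):
    """Turn a text into a fixed-length list of integer IDs."""
    ids = [word_index.get(word, OOV_ID) for word in text.split()]
    ids = ids[:max_len]                            # truncate long reviews
    return ids + [PAD_ID] * (max_len - len(ids))   # pad short reviews at the end

texts = ["great movie great acting", "boring movie"]
word_index = fit_word_index(texts)
print(word_index)
print([encode_and_pad(t, word_index, max_len=5) for t in texts])
```

Lists of this shape are what an embedding layer consumes: each integer selects one row of the embedding matrix, and the recurrent layer then reads those vectors in order.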

Use Cases of Sentiment Analysis

  1. Customer Feedback Analysis: Identify satisfaction levels from product reviews.
  2. Brand Monitoring: Analyze public sentiment about a brand on social media.
  3. Movie Recommendations: Predict user preferences based on past reviews.

Conclusion

Sentiment analysis of movie reviews is an excellent hands-on project to understand natural language processing and machine learning. By following these steps, you’ll not only build a working model but also gain practical knowledge of preprocessing, vectorization, and classification techniques.