Sentiment analysis involves determining whether a given piece of text conveys a positive, negative, or neutral sentiment. In this project, we will classify movie reviews as either positive or negative.
Objective
To build a machine learning model that can classify the sentiment of movie reviews using natural language processing (NLP) techniques.
Dataset
The IMDB Movie Reviews Dataset is a widely used benchmark for sentiment analysis. You can download it from Kaggle as a single CSV file (IMDB Dataset.csv) containing 50,000 labelled reviews.
Dataset Details:
• review: The review text.
• sentiment: The label (positive or negative).
Tools and Libraries Required
bash
pip install pandas numpy scikit-learn nltk matplotlib seaborn tensorflow
Step-by-Step Guide
1. Import Libraries
python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
2. Load the Dataset
python
# Load dataset
data = pd.read_csv('IMDB Dataset.csv')
# Display dataset information
print(data.head())
print(data['sentiment'].value_counts())
3. Data Cleaning
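- Convert text to lowercase.
- Strip HTML tags (the raw reviews contain markup such as <br />).
- Remove punctuation, special characters, and numbers.
- Tokenise the text and remove stop-words.
- Optionally apply stemming or lemmatization (not applied in the function below).
python
nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may also need nltk.download('punkt_tab')
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
# Data cleaning function
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace HTML tags such as <br /> with a space
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize and remove stopwords
    words = word_tokenize(text)
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)
data['cleaned_review'] = data['review'].apply(clean_text)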
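The cleaning list mentions stemming or lemmatization, but clean_text does not apply either. A minimal sketch of adding Porter stemming as a separate pass, assuming NLTK's PorterStemmer suits the task; the helper name stem_text is hypothetical and not part of the original pipeline:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    # Hypothetical helper: expects text that is already cleaned and space-separated
    return ' '.join(stemmer.stem(word) for word in text.split())

# Each word is reduced to its stem, which is not always a dictionary word
print(stem_text('movies running quickly'))
```

It could be applied with data['cleaned_review'].apply(stem_text). Stemming shrinks the vocabulary, which may or may not help accuracy on this dataset, so it is worth comparing results with and without it.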
4. Split the Dataset
python
X = data['cleaned_review']
y = data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
5. Feature Extraction with CountVectorizer
Convert text into numerical features using bag-of-words:
python
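vectorizer = CountVectorizer(max_features=5000)
# Keep the sparse matrices: MultinomialNB accepts them directly, and converting
# tens of thousands of reviews with .toarray() can exhaust memory
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)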
6. Train the Model
We use a Multinomial Naive Bayes classifier, which works well with word counts and trains quickly:
python
model = MultinomialNB()
model.fit(X_train_vectors, y_train)
7. Evaluate the Model
python
# Predictions
y_pred = model.predict(X_test_vectors)
# Metrics
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
# Confusion Matrix Heatmap
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
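The numbers in the classification report can be derived directly from the confusion matrix. A short worked sketch on made-up labels (toy_true and toy_pred are illustrative, not results from the IMDB data):

```python
from sklearn.metrics import confusion_matrix

toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are actual classes and columns are predicted classes,
# so for labels [0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
print(accuracy, precision, recall)
```

Reading the heatmap the same way, the diagonal cells are correct predictions and the off-diagonal cells are the two kinds of error.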
8. Test with Custom Reviews
python
# Example review
new_review = ["The movie was fantastic! Amazing plot and great acting."]
new_review_cleaned = [clean_text(review) for review in new_review]
new_review_vectorized = vectorizer.transform(new_review_cleaned)  # sparse input is accepted by predict
# Prediction
prediction = model.predict(new_review_vectorized)
print("Sentiment:", "Positive" if prediction[0] == 1 else "Negative")
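For scoring new reviews it can be convenient to bundle the vectorizer and classifier into one object and to report a probability rather than only a label. A minimal sketch using scikit-learn's make_pipeline on a made-up corpus (sentiment_pipeline and the toy data are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

toy_texts = [
    'great movie great acting', 'loved this great film',   # positive
    'terrible movie bad acting', 'hated this bad film',    # negative
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in a single fit/predict call
sentiment_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
sentiment_pipeline.fit(toy_texts, toy_labels)

review = ['great film']
label = sentiment_pipeline.predict(review)[0]
# predict_proba columns follow classes_, so column 1 is P(positive)
prob_positive = sentiment_pipeline.predict_proba(review)[0, 1]
print('Positive' if label == 1 else 'Negative', round(prob_positive, 2))
```

With the real data, reviews would still need to pass through clean_text before being handed to the pipeline, since the pipeline here only covers vectorization and classification.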
Enhancements with Deep Learning
For improved performance, you can use deep learning models like LSTMs or Transformers.
Here’s a brief outline:
- Preprocess text using TensorFlow/Keras Tokenizer.
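- Convert the text to padded integer sequences and map them to dense vectors with an embedding layer (optionally initialised from pretrained vectors such as Word2Vec or GloVe).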
- Build an LSTM or GRU model in TensorFlow/Keras.
- Train the model on the cleaned dataset.
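The outline above can be sketched roughly as follows. This is a toy illustration under several assumptions: it uses the TextVectorization layer in place of the older Keras Tokenizer class (which TensorFlow marks as deprecated), it learns embeddings from scratch instead of loading Word2Vec or GloVe, and the layer sizes, the made-up reviews, and names such as lstm_model are arbitrary choices for demonstration:

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for the cleaned reviews and their 0/1 labels
toy_texts = tf.constant([
    'great movie loved acting', 'terrible plot bad acting',
    'amazing film great story', 'boring film hated story',
])
toy_labels = np.array([1, 0, 1, 0], dtype='float32')

# Map each review to a fixed-length, zero-padded sequence of integer word ids
vectorize = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=10)
vectorize.adapt(toy_texts)
sequences = vectorize(toy_texts).numpy()

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),  # learned word vectors
    tf.keras.layers.LSTM(16),                                  # reads the sequence in order
    tf.keras.layers.Dense(1, activation='sigmoid'),            # outputs P(positive)
])
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.fit(sequences, toy_labels, epochs=2, verbose=0)

probs = lstm_model.predict(sequences, verbose=0)
print(probs.shape)
```

On the full dataset you would adapt the vectorizer on the training reviews only, use a larger vocabulary and sequence length, and train for more epochs while monitoring a validation split.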
Use Cases of Sentiment Analysis
- Customer Feedback Analysis: Identify satisfaction levels from product reviews.
- Brand Monitoring: Analyse public sentiment about a brand on social media.
- Movie Recommendations: Predict user preferences based on past reviews.
Conclusion
Sentiment analysis of movie reviews is an excellent hands-on project to understand natural language processing and machine learning. By following these steps, you’ll not only build a working model but also gain practical knowledge of preprocessing, vectorization, and classification techniques.