Introduction to Sentiment Analysis in Text Classification
This short piece is part of an assignment for the completion of Portfolio 3 in the Purwadhika Data Science Bootcamp, guided by the honorable Mas Shafanda Nabil Sembodo.
Previously, we learned from Mas Nabil about text classification: converting sentences into a numerical form that a computer can process, and then assigning each text to a category. Once text is machine-readable, it can be used for various purposes, one of which is sentiment analysis.
What is Sentiment Analysis?
Sentiment analysis is a way for computers to understand the emotions behind the words we type. It identifies whether a piece of text expresses a positive or negative sentiment towards something, and in some setups a neutral one as well.
Why Sentiment Analysis?
We can use sentiment analysis to measure the likability of a person, which might seem trivial to us but is crucial for celebrities or politicians. For instance, we can assess whether a politician is suitable for running in an election based on their likability among the public. By extracting and analyzing comments about the politician, we can determine if they are well-liked or not. This insight can help predict their chances of being elected based on public sentiment.
Similarly, this analysis can be applied to celebrities we are considering hiring to endorse our products. If it reveals negative sentiment or low likability, we might need to reconsider using that celebrity for the endorsement.
Social media platforms often have thousands to millions of comments about politicians or celebrities. It is challenging to gauge the overall sentiment from such a large volume of comments manually. Sentiment analysis helps us understand the emotions of each commenter regarding the individual. From this, we can infer the overall likability of the politician or celebrity.
Beyond assessing the likability of individuals, we can also gauge the likability of products we offer. Are the responses positive, negative, or just neutral? Insights from sentiment analysis can help us adjust our strategies to improve product appeal, aiming to boost sales. This might involve changing marketing strategies or product development approaches.
How to Perform Sentiment Analysis
There are several methods to conduct sentiment analysis. However, since this writing is introductory and the author is also a beginner, we will touch on two basic approaches in sentiment analysis: the Dictionary-Based Approach and the Machine Learning Approach.
Dictionary-Based Approach
In the dictionary-based approach, the computer identifies the emotions or sentiments in the text by rating each word in the sentence. These ratings are then summed up and classified to determine if the overall sentiment is negative or positive. Here’s how it works in detail:
- Word Tokenization: The first step is to break the sentence down into individual words (tokens). Don't forget to stem or lemmatize the words so that variants such as “loved” and “loves” match the same dictionary entry!
- Sentiment Dictionary: Build (or load) a dictionary in which each word is categorized as either positive or negative.
- Scoring: Each word in the sentence is scored using the dictionary. A positive word contributes +1, a negative word contributes -1, and a word that is not in the dictionary contributes 0.
- Summation: The scores are summed up to get an overall sentiment score for the sentence.
- Classification: The final score is then classified. If the score is above zero, the sentiment is classified as positive. If it is below zero, the sentiment is classified as negative. A score of exactly zero can be treated as neutral.
For example, consider the sentence “This laptop looks nice and has a cool screen, but the performance is bad.” The dictionary might assign a positive score to “nice” and “cool” and a negative score to “bad”. Adding these scores together gives (+1) + (+1) + (-1) = +1, so the sentence is classified as positive.
This approach is very simple compared to other solutions because it uses a plain word lookup and does not require training data. The main concern is that it only works well when we already have a dictionary large enough to cover the words in our data. The following code shows an example.
# Dictionary-based sentiment analysis
# Example sentiment dictionary: +1 for positive words, -1 for negative words
sentiment_dict = {
    "love": 1,
    "like": 1,
    "enjoy": 1,
    "good": 1,
    "great": 1,
    "happy": 1,
    "hate": -1,
    "dislike": -1,
    "bad": -1,
    "terrible": -1,
    "sad": -1,
}
def sentiment_analysis(text):
    # Lowercase and split on whitespace (punctuation stays attached to the words)
    words = text.lower().split()
    # Words that are not in the dictionary score 0
    score = sum(sentiment_dict.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
# Example usage: "love" (+1) and "hate" (-1) cancel out, so the result is "neutral"
text = "I love this movie but hate the ending"
print(f"Sentiment: {sentiment_analysis(text)}")
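Splitting on whitespace leaves punctuation attached to words, so a token like "bad." would not match the dictionary entry "bad". The sketch below is one possible refinement, not the only way to do it: it tokenizes with a regular expression and scores the laptop sentence from the earlier example. The names `LEXICON`, `LEMMAS`, `tokenize`, and `score_sentence` are made up for this illustration, and the tiny `LEMMAS` table is only a stand-in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical mini-dictionary; a real project would use a much larger one.
LEXICON = {"nice": 1, "cool": 1, "good": 1, "bad": -1, "slow": -1}

# Hypothetical stand-in for a real stemmer or lemmatizer.
LEMMAS = {"looks": "look", "looked": "look", "nicer": "nice", "worse": "bad"}


def tokenize(text):
    """Lowercase the text and keep only word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def score_sentence(text):
    """Return the per-word scores, the summed score, and the final label."""
    tokens = [LEMMAS.get(token, token) for token in tokenize(text)]
    per_word = {token: LEXICON[token] for token in tokens if token in LEXICON}
    total = sum(LEXICON.get(token, 0) for token in tokens)
    label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
    return per_word, total, label


sentence = "This laptop looks nice and has a cool screen, but the performance is bad."

per_word, total, label = score_sentence(sentence)
print(per_word, total, label)

# For comparison: whitespace splitting keeps "bad." with its period,
# so the negative word is missed and the score comes out too high.
naive_total = sum(LEXICON.get(word, 0) for word in sentence.lower().split())
print("Score with plain split():", naive_total)
```

With regex tokenization the sentence scores +1, which matches the hand calculation, while plain `split()` gives +2 because "bad." is not found in the dictionary.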
Machine Learning Approach
Another more advanced method is using machine learning for sentiment analysis. Here’s a basic overview of how it works:
- Data Collection: Collect a large dataset of sentences labeled with their sentiment (positive or negative).
- Text Vectorization: Convert the text data into numerical form using techniques like TF-IDF (Term Frequency-Inverse Document Frequency). This step transforms sentences into vectors that can be processed by machine learning algorithms.
- Model Training: Use the labeled data to train a machine learning model, such as a Support Vector Machine (SVM) or a Random Forest classifier. The model learns to associate certain word patterns with positive or negative sentiments.
- Prediction: Once the model is trained, it can predict the sentiment of new, unseen sentences by analyzing their vector representations.
- Evaluation: Evaluate the model’s performance using metrics such as accuracy, precision, recall, and F1-score to ensure it accurately classifies sentiments.
Support Vector Machine (SVM) Model on Sentiment Analysis
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression tasks, but it is mostly used for classification. In the context of text analysis, SVM is often used to classify texts into different categories (e.g., positive vs. negative sentiment).
Imagine a 2D plot with dots representing our text data. Red dots are negative reviews, and blue dots are positive reviews. The SVM algorithm will draw a straight line that best separates the red and blue dots while maximizing the distance between the line and the nearest dots of each color. In practice, TF-IDF vectors have many more than two dimensions, so the separating "line" becomes a hyperplane, but the idea is the same.
The following code applies an SVM to a small text example:
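The 2D picture can be reproduced with a few made-up points. This is only a sketch: the coordinates are invented for illustration, with label 0 standing for the red (negative) dots and label 1 for the blue (positive) dots.

```python
import numpy as np
from sklearn.svm import SVC

# Invented 2D points: three "red" dots (label 0) and three "blue" dots (label 1).
points = np.array(
    [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float
)
colors = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear")
svm.fit(points, colors)

# For a linear kernel the separating line is w[0]*x + w[1]*y + b = 0.
w = svm.coef_[0]
b = svm.intercept_[0]

# The support vectors are the dots closest to the line; the distance
# between the two margin edges is 2 / ||w||.
margin_width = 2 / np.linalg.norm(w)
print("Support vectors:")
print(svm.support_vectors_)
print("Line coefficients:", w, "intercept:", b)
print("Margin width:", margin_width)

# New dots are classified by which side of the line they fall on.
new_points = np.array([[0.0, 0.0], [6.0, 6.0]])
predictions = svm.predict(new_points)
print(predictions)
```

Only the dots nearest to the boundary (the support vectors) decide where the line goes, which is where the algorithm gets its name.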
# SVM for sentiment analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Example data (far too small for a real model; for illustration only)
texts = ["I love this movie", "I hate this movie", "This film is great", "This film is bad"]
labels = ["positive", "negative", "positive", "negative"]
y = [1 if label == "positive" else 0 for label in labels]
# Split data into training and test sets.
# stratify=y keeps both classes in each split; with a dataset this small, the
# training set could otherwise contain a single class, which makes SVC fail.
texts_train, texts_test, y_train, y_test = train_test_split(texts, y, test_size=0.5, random_state=42, stratify=y)
# Vectorize the text data using TF-IDF, fitted on the training texts only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)
# Train an SVM classifier with a linear kernel
svm_clf = SVC(kernel="linear")
svm_clf.fit(X_train, y_train)
# Predict and evaluate (zero_division=0 silences warnings when a class is never predicted)
y_pred = svm_clf.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
Decision Tree Model on Sentiment Analysis
A decision tree works by building a tree structure for decision-making based on word features. Because it is categorized as a basic model, it is really easy to interpret, but unfortunately it is prone to overfitting.
When using this model, similar to the earlier approach, we need to gather a dataset of labeled sentences. After that, we convert the sentences into numerical vectors using TF-IDF. Then, we train the model. The result of training is a tree structure where each node represents a decision based on word features. The following code shows an example:
# Decision Tree for sentiment analysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Example data (far too small for a real model; for illustration only)
texts = ["I love this movie", "I hate this movie", "This film is great", "This film is bad"]
labels = ["positive", "negative", "positive", "negative"]
y = [1 if label == "positive" else 0 for label in labels]
# Split data into training and test sets (stratify=y keeps both classes in each split)
texts_train, texts_test, y_train, y_test = train_test_split(texts, y, test_size=0.5, random_state=42, stratify=y)
# Vectorize the text data using TF-IDF, fitted on the training texts only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)
# Train a Decision Tree classifier (random_state makes the result reproducible)
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
# Predict and evaluate (zero_division=0 silences warnings when a class is never predicted)
y_pred = dt_clf.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
Cheat Code for Sentiment Analysis
In practice, when performing machine learning-based sentiment analysis, we need labeled training data. Sometimes, it is challenging to find pre-labeled data. Fortunately, we can use pre-trained models available on platforms like Hugging Face. Two recommended models for sentiment analysis are:
- IndoBERT Base Model: IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model is trained using a masked language modeling (MLM) objective and next sentence prediction (NSP) objective.
- Fine-tuned Indonesian Sentiment Classifier: This model is a fine-tuned version of indobenchmark/indobert-base-p1 on the IndoNLU’s SmSA dataset.
Both models have been trained using datasets from IndoNLU. Before using these models, make sure the text you want to analyze is similar in language and domain to these datasets.
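As a rough sketch, a fine-tuned classifier from Hugging Face can be called through the `pipeline` helper of the `transformers` library. The function name `classify_with_pretrained` and the `model_id` parameter are made up for this illustration, and `model_id` is a placeholder: fill in the Hub ID of the fine-tuned sentiment classifier you choose. Note that the base IndoBERT model on its own is a general language model, so for sentiment labels we need a version that has been fine-tuned for classification.

```python
def classify_with_pretrained(texts, model_id):
    """Label each text with a fine-tuned sentiment model from the Hugging Face Hub.

    model_id is a placeholder for the Hub ID of the fine-tuned sentiment
    classifier you picked; it is an assumption of this sketch.
    """
    # Imported here so the function can be defined even before
    # the transformers library is installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    results = classifier(texts)  # one dict per text, with "label" and "score"
    return [
        (text, result["label"], result["score"])
        for text, result in zip(texts, results)
    ]


# Example call (downloads the model on first use, so it needs internet access):
# classify_with_pretrained(["Filmnya bagus sekali"], model_id="<your-model-id>")
```

The label names in the output depend on how the chosen model was fine-tuned, so check its model card before interpreting them.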
Challenges in Sentiment Analysis
Sentiment analysis often faces several challenges, especially on social media data:
- Short Texts: Often, social media posts are very short and may not contain grammatically complete sentences.
- Noise: Social media data is full of noise like typos and slang words.
- High Dimensionality: The data is diverse and includes elements like memes or emojis.
- Large Volume: There is a massive amount of data.
- Difficulty in Filtering: Social media data from various users often covers topics unrelated to the intended subject. For example, comments on a celebrity’s post might include unrelated advertisements.
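Some of the noise described above can be reduced with a cleaning step before the analysis. The sketch below is a minimal example under simple assumptions: the `SLANG` table and the function name `clean_post` are made up for illustration, and real social media data would need a much larger slang list and more careful rules.

```python
import re

# Hypothetical slang table; a real one would be far larger.
SLANG = {"gr8": "great", "luv": "love", "u": "you"}


def clean_post(text):
    """Turn a noisy social media post into a list of normalized tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"[@#]\w+", " ", text)        # drop mentions and hashtags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "sooooo" -> "soo"
    tokens = re.findall(r"[a-z0-9']+", text)    # keep word-like tokens only
    return [SLANG.get(token, token) for token in tokens]


post = "Luv this phone sooooo much!!! @shop http://t.co/x #ad gr8"
cleaned = clean_post(post)
print(cleaned)
```

Here the link, the mention, and the hashtag are removed, the stretched word is shortened, and the slang words are mapped to their standard forms, so a dictionary or a trained model has a better chance of recognizing them.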
Conclusion
In summary, sentiment analysis is a powerful tool for understanding public opinion and emotions expressed in text. By using techniques like the dictionary-based approach and machine learning, we can classify sentiments accurately and gain valuable insights. These methods transform raw textual data into meaningful information that can guide decision-making in various domains, from marketing to political campaigns. Understanding these basic approaches provides a solid foundation for anyone new to the field of sentiment analysis, enabling them to explore more advanced techniques and applications in the future.