Text classification¶
In this exercise we will create a simple text classification system. We want our system to determine whether a sentence has a positive or a negative tone. For example, the sentence “I love this sandwich.” would be a positive sentence, whereas “I do not like this restaurant” is an example of a negative sentence.
The classification in this case will be based on a Naive Bayes classifier. The idea of the classification is the following:
1. We create a database of sentences that are known to be either positive or negative.
2. We create a classifier and train it using the data from our database.
3. We define the sentence or sentences we want to evaluate.
4. The sentence is evaluated based on its individual words appearing in either the positive category or the negative category.
Let’s say, for example, that we have a database like in the table below:
| Sentence | Category |
|---|---|
| I love this sandwich. | pos |
| this is an amazing place! | pos |
| I feel very good about these beers. | pos |
| this is my best work. | pos |
| what an awesome view | pos |
| I do not like this restaurant | neg |
| I am tired of this stuff. | neg |
| I can’t deal with this | neg |
| he is my sworn enemy! | neg |
| my boss is horrible. | neg |
This would be the training data and the sentences we would like to evaluate would be compared to these sentences. The more training data there is, the more accurate the results will be.
If you’re interested in the mathematics behind the algorithm, read this.
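To make the idea concrete, here is a minimal from-scratch sketch of the word-counting behind Naive Bayes, with add-one (Laplace) smoothing so unseen words don't zero out a score. This is an illustration only, not how the textblob library implements it, and the tiny word lists below are made up for the example.

```python
from collections import Counter

# Toy training data: lists of words with a known category
train = [
    (["love", "sandwich"], "pos"),
    (["awesome", "view"], "pos"),
    (["horrible", "boss"], "neg"),
    (["tired", "stuff"], "neg"),
]

# Count how often each word appears in each category
counts = {"pos": Counter(), "neg": Counter()}
for words, label in train:
    counts[label].update(words)

def score(words, label):
    """Unnormalized Naive Bayes score:
    P(label) * product over words of P(word | label),
    with add-one smoothing in each word probability."""
    total = sum(counts[label].values())
    vocab = len({w for c in counts.values() for w in c})
    s = 0.5  # prior: both categories equally common in this toy data
    for w in words:
        s *= (counts[label][w] + 1) / (total + vocab)
    return s

def classify(words):
    # Pick the category with the higher score
    return max(("pos", "neg"), key=lambda lbl: score(words, lbl))

print(classify(["awesome", "sandwich"]))  # prints: pos
```

The scores themselves are not probabilities until normalized, but for picking a category only their relative sizes matter.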
Creating and training the classifier¶
For this exercise, we will use a library called textblob, which contains practical tools for common natural language processing tasks. The textblob library has a number of different classifiers to choose from. We will be using the Naive Bayes classifier.
We will define a set of training data, which contains sentences and their category (either ‘pos’ or ‘neg’) and feed the training data to the classifier.
# import the classifier
from textblob.classifiers import NaiveBayesClassifier
import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/runner/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
# Define training data
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ('what an awesome view', 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')
]
# Create the classifier and feed the training data.
cl = NaiveBayesClassifier(train)
Testing the classifier¶
Now that we have created and trained the classifier, we can test how well it performs.
Let’s write a couple of sentences and test them using the classify() method.
# Define a test sentence and classify it
cl.classify("This is an amazing library!")
'pos'
It looks like the classifier works, at least in this case. Try another sentence yourself! Test, for example, whether the classifier can handle a negative sentence properly.
We can also define a larger set of test sentences and see how accurately the classifier classifies them. For this we can use the accuracy() method.
# Define test data similarly as the training data before
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ('I feel amazing!', 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]
# Test the accuracy of the classifier using test data.
cl.accuracy(test)
0.8333333333333334
It looks like the classifier got 5 of the 6 test sentences correct, about 83%. Not too bad considering our small set of training data. We can also train our classifier further and see if we get different results. The classifier can be trained with additional data using the update() method. Figure out how it works and train the classifier some more! You can also check whether the accuracy has improved after the additional training.
Playing with probabilities¶
We can dig deeper into the probabilities of a sentence being positive or negative by using the prob_classify() method. Let’s try it out!
# Create a probability distribution from a sentence
prob_dist = cl.prob_classify("This one's a doozy.")
# Check the probability of the sentence being positive
prob_dist.prob("pos")
0.6311475409836058
# Check the probability of the sentence being negative
prob_dist.prob("neg")
0.3688524590163936
Which category would be selected for the sentence ‘This one’s a doozy.’?
Text classification is also possible for longer texts. See the example below!
# Let's define a larger text blob and use our classifier on it.
from textblob import TextBlob
blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)
blob.classify()
'pos'
# We can also print all the sentences in the blob and classify them separately.
for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())
    print()
The beer is good.
pos
But the hangover is horrible.
neg
Are you interested in learning more about Naive Bayes classification? Try this tutorial:
https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn