# Text classification

In this exercise we will create a simple text classification system that determines whether a sentence has a positive or a negative tone. For example, "I love this sandwich." is a positive sentence, whereas "I do not like this restaurant" is an example of a negative sentence.

The classification in this case will be based on a [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier). The idea of the classification is the following:

1. We create a database of sentences that are known to be either positive or negative.
1. We create a classifier and train it using the data from our database.
1. We define the sentence or sentences we want to evaluate.
1. Each sentence is evaluated based on how often its individual words appear in the positive category and in the negative category.

Let's say, for example, that we have a database like the one in the table below:


| Sentence | Category | 
|:---------|:--------:| 
| I love this sandwich. | pos |
| this is an amazing place! | pos |
| I feel very good about these beers. | pos |
| this is my best work. | pos |
| what an awesome view | pos |
| I do not like this restaurant | neg |
| I am tired of this stuff. | neg |
| I can't deal with this | neg |
| he is my sworn enemy! | neg |
| my boss is horrible. | neg |

This would be the training data, and the sentences we want to evaluate would be compared against it. In general, the more training data there is, the more accurate the results tend to be.

If you're interested in the mathematics behind the algorithm, [read this](https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf).
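
To make the steps above concrete, here is a minimal from-scratch sketch of the word-counting idea in plain Python. It is a simplified multinomial Naive Bayes with add-one smoothing, not TextBlob's actual implementation. The helper names `tokenize`, `train_counts` and `classify_sentence` are made up for this illustration, and words that never appeared in the training data are simply ignored.

```python
import math
import re
from collections import Counter, defaultdict

# The same labelled sentences as in the table above.
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ('what an awesome view', 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg'),
]


def tokenize(text):
    """Lowercase the text and split it into words (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


def train_counts(data):
    """Count how often each word occurs in each category."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in data:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify_sentence(text, model):
    """Return the category with the highest log-probability score."""
    word_counts, label_counts, vocabulary = model
    total_sentences = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the prior: how common this category is overall.
        score = math.log(label_counts[label] / total_sentences)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # assumption: unseen words carry no evidence
            # Add-one smoothing keeps the probability above zero.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train_counts(train)
print(classify_sentence("what an amazing view", model))  # pos
print(classify_sentence("my boss is horrible", model))   # neg
```

Working with sums of logarithms instead of products of probabilities avoids very small numbers when sentences get long.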

## Creating and training the classifier

For this exercise, we will use a library called [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) that contains practical tools for common natural language processing tasks. TextBlob has a number of different classifiers to choose from. We will be using the [Naive Bayes Classifier](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.classifiers.NaiveBayesClassifier).

We will define a set of training data, which contains sentences and their category (either 'pos' or 'neg') and feed the training data to the classifier.

In [None]:
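# Import the classifier
from textblob.classifiers import NaiveBayesClassifier
import nltk
# Download the tokenizer data that TextBlob relies on
nltk.download('punkt')
# Assumption: newer NLTK releases look for 'punkt_tab' instead
nltk.download('punkt_tab')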

In [None]:
# Define training data
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ('what an awesome view', 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')
]

# Create the classifier and feed the training data.
cl = NaiveBayesClassifier(train)

## Testing the classifier

Now that we have created and trained the classifier, we can test how well it performs.
Let's write a couple of sentences and test them using the [`classify()`](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.classifiers.NaiveBayesClassifier.classify) method.

In [None]:
# Define a test sentence and classify it
cl.classify("This is an amazing library!")

It looks like the classifier works, at least in this case. Try another sentence yourself! For example, test whether the classifier can handle a negative sentence properly.

We can also define a larger set of test sentences and see how accurately the classifier can classify those sentences. For this we can use the [`accuracy()`](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.classifiers.NaiveBayesClassifier.accuracy) method.

In [None]:
# Define test data in the same format as the training data
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ('I feel amazing!', 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]

In [None]:
# Test the accuracy of the classifier using test data.
cl.accuracy(test)

It looks like the classifier got 83% correct (5 of the 6 test sentences). Not too bad considering our small set of training data. We can also train our classifier further and see if we get different results. The classifier can be trained with additional data using the [`update()`](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.classifiers.NaiveBayesClassifier.update) method. Figure out how it works and train the classifier with more data! You can then check whether the accuracy has improved.
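
To put an accuracy figure in context, it can help to see what the number measures: the fraction of test sentences whose predicted category matches the known one. The sketch below is plain Python and does not use TextBlob. The `accuracy` function and the `always_pos` stand-in classifier are hypothetical helpers written for this illustration.

```python
def accuracy(classify, labelled_sentences):
    """Return the fraction of sentences whose predicted label is correct."""
    correct = sum(
        1 for sentence, label in labelled_sentences if classify(sentence) == label
    )
    return correct / len(labelled_sentences)


def always_pos(sentence):
    """A deliberately naive stand-in classifier that ignores its input."""
    return "pos"


# The same test data as in the cell above.
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ('I feel amazing!', 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg'),
]

baseline = accuracy(always_pos, test)
print(baseline)  # 0.5
```

Because half of the test sentences are positive, always answering 'pos' scores 50% here, so any useful classifier should do better than that. With the classifier trained in this exercise, `cl.classify` could be passed in place of `always_pos`.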

## Playing with probabilities

We can dig deeper into the probabilities of a sentence being positive or negative by using the [`prob_classify()`](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.classifiers.NaiveBayesClassifier.prob_classify) method. Let's try it out!

In [None]:
# Create a probability distribution from a sentence
prob_dist = cl.prob_classify("This one's a doozy.")

In [None]:
# Check the probability of the sentence being positive
prob_dist.prob("pos")

In [None]:
# Check the probability of the sentence being negative
prob_dist.prob("neg")

Which category would be selected for the sentence 'This one's a doozy.'?
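
As a rough illustration of where such probabilities can come from, the sketch below turns per-category log scores into probabilities that sum to 1 and then picks the most probable category. The scores are made-up numbers, not output from the classifier above, and `to_probabilities` is a hypothetical helper rather than part of TextBlob.

```python
import math


def to_probabilities(log_scores):
    """Normalize log scores into probabilities that sum to 1."""
    # Subtracting the largest score first keeps exp() numerically stable.
    largest = max(log_scores.values())
    unnormalized = {
        label: math.exp(score - largest) for label, score in log_scores.items()
    }
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Hypothetical log scores for one sentence (illustration only).
log_scores = {"pos": -12.3, "neg": -11.1}

probabilities = to_probabilities(log_scores)
best_label = max(probabilities, key=probabilities.get)

print(probabilities)
print(best_label)  # neg
```

The selected category is simply the one with the larger probability, which is the comparison to make between the two `prob_dist.prob()` values above.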

Text classification is also possible for longer texts. See the example below!

In [None]:
# Let's define a larger text blob and use our classifier on it.
from textblob import TextBlob
blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)
blob.classify()

In [None]:
# We can also print all the sentences in the blob and classify them separately.
for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())
    print()

Are you interested in learning more about Naive Bayes classification? Try this tutorial:

[https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn](https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn)