Let me show how you can build your own simple text classification model and understand the underlying concepts around tf-idf and cosine similarity.
First we have to understand how TF-IDF works.
Every keyword in a piece of content can be scored with the TF-IDF formula to judge its importance. The formula is based on a logarithm and gives a score that is used to determine the most important terms in a document. Because it relies only on word counts, the TF-IDF formula can be applied to text in any language.
In information retrieval and text mining, tf-idf is a well-known method for evaluating how important a word is to a document. It is also a useful way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features.
Now we have to create a model to find similarity. Let’s build our own library functions:
The first function handles tokenizing. There are more advanced ways of doing it, but for now we will tokenize on whitespace. The function should look like the following:
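Here is a minimal sketch of a whitespace tokenizer. The function name `tokenize` and the lowercasing step are assumptions made for illustration, not necessarily what the original code does.

```javascript
// Split a string into tokens on runs of whitespace.
// Lowercasing is an added assumption so that "Live" and "live" match.
function tokenize(text) {
  return text
    .toLowerCase()
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0); // drops the empty token produced by blank input
}

console.log(tokenize("  Where do   you live "));
// [ 'where', 'do', 'you', 'live' ]
```

Punctuation is left attached to words here, which is one of the things a more advanced tokenizer would handle.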
After tokenization, we need to create a dictionary of unique words, which can be done with the following code.
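One possible way to build the dictionary is sketched below. The helper name `addToDictionary` is hypothetical, and the dictionary is assumed to be a plain array whose positions later serve as vector indices.

```javascript
// Append every token that has not been seen yet to the shared dictionary array.
function addToDictionary(tokens, dictionary) {
  for (const token of tokens) {
    if (!dictionary.includes(token)) {
      dictionary.push(token);
    }
  }
  return dictionary;
}

const dictionary = [];
addToDictionary(["where", "do", "you", "live"], dictionary);
addToDictionary(["do", "you", "know"], dictionary);
console.log(dictionary);
// [ 'where', 'do', 'you', 'live', 'know' ]
```

The linear `includes` lookup is fine for a handful of short documents; a `Map` from word to index would be a reasonable swap for a larger vocabulary.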
Now that we have an index vocabulary, we can convert the test document set into a vector space, an algebraic model that represents textual information as a vector. VSM is denoted as:
Code for the same would be:
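As a rough sketch, a document can be turned into a vector by counting how often each dictionary word occurs in it. The name `toVector` and the sample dictionary are assumptions for illustration.

```javascript
// Build a count vector with one slot per dictionary word.
// Tokens that are not in the dictionary are ignored.
function toVector(tokens, dictionary) {
  const vector = new Array(dictionary.length).fill(0);
  for (const token of tokens) {
    const index = dictionary.indexOf(token);
    if (index !== -1) {
      vector[index] += 1;
    }
  }
  return vector;
}

const sampleDictionary = ["where", "do", "you", "live", "know"];
const vector = toVector(["do", "you", "do", "fly"], sampleDictionary);
console.log(vector);
// [ 0, 2, 1, 0, 0 ]
```

Note that "fly" contributes nothing because the model has never seen it, which is also what happens to unknown words in a query.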
The components of this vector could represent the importance of a term (tf-idf) or simply its presence or absence (Bag of Words) in a document, where each position in the vector corresponds to a word in our index vocabulary. Now, we’re going to use the term frequency to represent each term in our vector space.
The tf of our documents is represented as:
TF: Term Frequency. This measures how frequently a term is used in a single document. The longer the document, the more likely it is that a raw count will be high, so the count is divided by the total number of terms in the document to normalize it.
TF = (Number of times the term appears in the document) / (Total number of words in the document)
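The TF formula can be sketched as a function over a count vector. The name `termFrequency` and its two-argument shape are assumptions; the total is passed in separately because the document may contain words that are not in the dictionary.

```javascript
// Divide each raw count by the total number of tokens in the document.
function termFrequency(countVector, totalTokens) {
  if (totalTokens === 0) {
    return countVector.map(() => 0); // avoid dividing by zero for an empty document
  }
  return countVector.map((count) => count / totalTokens);
}

// A 4-token document whose counts against a 5-word dictionary are [0, 2, 1, 0, 1].
const tf = termFrequency([0, 2, 1, 0, 1], 4);
console.log(tf);
// [ 0, 0.5, 0.25, 0, 0.25 ]
```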
Now that we have understood tf, we can move on to idf:
IDF: Inverse Document Frequency. This measures how informative a specific term is across the corpus. Commonly used terms, i.e. stop words such as “is”, “of” and “the”, carry less importance because they appear in nearly all documents within the corpus. The IDF can be calculated as follows:
IDF = log((Total number of documents) / (Number of documents containing the term))
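A minimal sketch of the idf calculation, assuming every document has already been converted to a count vector over the same dictionary. The name `inverseDocumentFrequency` is hypothetical, and the natural logarithm is used here as one common choice.

```javascript
// For each dictionary position, compute log(total documents / documents containing the word).
function inverseDocumentFrequency(countVectors) {
  const totalDocs = countVectors.length;
  const dictionarySize = totalDocs > 0 ? countVectors[0].length : 0;
  const idf = [];
  for (let i = 0; i < dictionarySize; i++) {
    const docsWithTerm = countVectors.filter((vector) => vector[i] > 0).length;
    // A word that appears in no document gets 0 instead of log(N / 0).
    idf.push(docsWithTerm === 0 ? 0 : Math.log(totalDocs / docsWithTerm));
  }
  return idf;
}

// Two documents over a 3-word dictionary: word 0 is in both, words 1 and 2 are in one each.
const idf = inverseDocumentFrequency([
  [1, 1, 0],
  [1, 0, 2],
]);
console.log(idf);
```

With this plain formula a word that appears in every document scores exactly 0, so it drops out of the comparison entirely. Some implementations add smoothing terms to soften that effect.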
Multiplying each term’s tf by its idf gives the tf-idf. The tf-idf of our documents will be:
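Combining the two is an element-wise product, sketched below. The name `tfIdf` is hypothetical, and the idf values in the example are made-up round numbers chosen only to keep the arithmetic easy to follow.

```javascript
// Multiply each term frequency by the idf of the same dictionary word.
function tfIdf(tfVector, idfVector) {
  return tfVector.map((tf, i) => tf * idfVector[i]);
}

// Illustrative values only: real idf values would come from the training corpus.
const weights = tfIdf([0.5, 0.25, 0], [0, 2, 3]);
console.log(weights);
// [ 0, 0.5, 0 ]
```

The first word is frequent in the document but has an idf of 0, so it ends up with no weight, while the rarer second word dominates.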
We should now be able to check the similarity of our query using cosine similarity. The cosine similarity between two vectors (or two documents in the vector space) is the cosine of the angle between them. This metric measures orientation, not magnitude: it can be seen as a comparison between documents on a normalized space, because we consider the angle between the documents rather than only the magnitude of each document’s word counts (tf-idf). To build the cosine similarity equation, we solve the dot product equation for the cosine of the angle:
a · b = ‖a‖ ‖b‖ cos θ, so cos θ = (a · b) / (‖a‖ ‖b‖)
By comparing the cosine of the angle between the query and each document, we can figure out which document is closest to our query.
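The cosine of the angle between two vectors can be sketched as follows. The name `cosineSimilarity` is an assumption, as is the choice to return 0 when either vector is all zeros (the angle is undefined in that case).

```javascript
// Dot product of a and b divided by the product of their lengths.
// Both vectors are assumed to have the same number of components.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // assumption: treat an all-zero vector as matching nothing
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [0, 1])); // no shared terms: 0
console.log(cosineSimilarity([1, 2], [2, 4])); // same direction, different length: close to 1
```

The second call shows why this is a measure of orientation: doubling every weight in a document does not change its similarity to the query.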
Let’s take the documents below to define our universe of training data:
And initialize a blank array to hold the dictionary.
Now let’s tokenize the documents we have.
Creating the dictionary for both documents.
Creating the VSM for both documents.
Creating the tf for both documents.
The idf of the entire training document set.
The tf-idf model of the entire data set.
Now our model is ready to be queried. All we need is the idf computed from the training data set, the tf-idfs of all the documents, and the dictionary of words it has been trained on.
Now let’s see how the model responds when queried with “which place you live”. To start with, let’s generate the tf for the query.
Let’s generate the tf-idf of the query.
Now, let’s see which document is closest to the query.
The output should be “Krypton”
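To tie the steps together, here is an end-to-end sketch of training and querying. The two documents and their `answer` labels are hypothetical stand-ins, since the original training documents are not reproduced here, and all function names are assumptions.

```javascript
// Hypothetical training data: stand-ins for the article's documents.
const documents = [
  { text: "what is your name", answer: "Superman" },
  { text: "where do you live", answer: "Krypton" },
];

const tokenize = (text) => text.toLowerCase().trim().split(/\s+/).filter(Boolean);

// Dictionary of unique words across the training documents.
const tokenized = documents.map((doc) => tokenize(doc.text));
const dictionary = [];
for (const tokens of tokenized) {
  for (const token of tokens) {
    if (!dictionary.includes(token)) dictionary.push(token);
  }
}

// Term frequency vector: count per dictionary word divided by the token count.
function tfVector(tokens) {
  const counts = new Array(dictionary.length).fill(0);
  for (const token of tokens) {
    const index = dictionary.indexOf(token);
    if (index !== -1) counts[index] += 1; // unknown words are ignored
  }
  return counts.map((count) => (tokens.length ? count / tokens.length : 0));
}

// idf per dictionary word: log(total documents / documents containing the word).
const idf = dictionary.map((word) => {
  const docsWithWord = tokenized.filter((tokens) => tokens.includes(word)).length;
  return Math.log(documents.length / docsWithWord);
});

const tfIdf = (tokens) => tfVector(tokens).map((tf, i) => tf * idf[i]);
const documentVectors = tokenized.map(tfIdf);

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the answer attached to the document most similar to the query.
function closestAnswer(query) {
  const queryVector = tfIdf(tokenize(query));
  let best = { score: -1, answer: null };
  documentVectors.forEach((vector, i) => {
    const score = cosineSimilarity(queryVector, vector);
    if (score > best.score) best = { score, answer: documents[i].answer };
  });
  return best.answer;
}

console.log(closestAnswer("which place you live")); // "Krypton" with these stand-in documents
```

The query shares “you” and “live” with the second document and nothing with the first, so the second document wins. A query that shares no words with any document scores 0 everywhere and falls back to the first document in this sketch, so a real application would probably want a minimum score threshold.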
This is how you can create a basic NLP model in the browser using tf-idf and cosine similarity.
You can find the code base here.
Stay tuned for further tech updates!