NLTK (Natural Language Toolkit) is a toolkit built specifically for working with NLP (Natural Language Processing) in Python. It ships with a number of test datasets, and a variety of tasks can be accomplished with it, such as tokenization, parse tree visualization, and more. In this article, we will look at how to set up NLTK on your system and how to perform the different steps of text processing with it.

What is Natural Language Processing?

According to IBM:

Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.

NLP drives computer programs that translate text from one language to another, respond to spoken commands, and summarize large volumes of text rapidly—even in real time. There is a good chance you have interacted with NLP in the form of voice-operated GPS systems, digital assistants, speech-to-text dictation software, customer service chatbots, and other consumer conveniences. But NLP also plays a growing role in enterprise solutions that help streamline business operations, increase employee productivity, and simplify mission-critical business processes.

Installing NLTK

NLTK can be installed easily with pip:

pip install nltk

Downloading the Datasets into Your System

This step is optional, but downloading the bundled test datasets is a good way to familiarize yourself with how NLP works.

import nltk
nltk.download()

Calling nltk.download() opens an interactive downloader from which you can install the required corpora. Once you have completed this step, let us dive into the different operations NLTK supports.


Tokenization

Breaking text down into smaller units produces what are referred to as tokens. A token is a small piece of the text, and tokenization is the process of separating text into tokens — for example, to build a dictionary in which every word is represented in a list. Numbers, special characters, etc. are all included in the tokenization process.

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural language processing is an exciting area. Huge budget have been allocated for this."

print(sent_tokenize(text))
# output: ['Natural language processing is an exciting area.', 'Huge budget have been allocated for this.']

print(word_tokenize(text))
# output: ['Natural', 'language', 'processing', 'is', 'an', 'exciting', 'area', '.', 'Huge', 'budget', 'have', 'been', 'allocated', 'for', 'this', '.']

Conversion to Lower Case

To avoid redundancy in the token list, we convert all letters to lower case. This prevents the model from getting confused and counting words like "Mouse" and "mouse" as different words. Such standardization not only keeps the model efficient but also avoids filling the list with useless duplicate data.

import re

# Keep only letters and digits, then lower-case the text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

words = text.split()
print(words)
# output -> ['natural', 'language', 'processing', 'is', 'an', 'exciting', 'area', 'huge', 'budget', 'have', 'been', 'allocated', 'for', 'this']

Stop Word Removal

When we process text, we encounter a lot of noise words. These are stop words such as "the", "he", and "her", which carry little information and must be removed for cleaner processing inside the model. Fortunately, NLTK makes it easy to eliminate the stop words of the English language.


from nltk.corpus import stopwords

print(stopwords.words("english"))
# output ->
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
print(words)
# output -> ['natural', 'language', 'processing', 'exciting', 'area', 'huge', 'budget', 'allocated']


Stemming

In our text we will encounter words that have similar meanings to each other, like 'playing', 'played', 'playful', etc. All these words share a root word and convey a similar sort of meaning, so it is often better to extract the root and discard the rest. The root form produced here is called a 'stem', and a stem does not necessarily need to exist as a real word with a meaning: stems are generated simply by stripping suffixes and prefixes.

NLTK has libraries like SnowballStemmer, LancasterStemmer, and PorterStemmer to tackle this problem.

from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
print(stemmed)
# output -> ['natur', 'languag', 'process', 'excit', 'area', 'huge', 'budget', 'alloc']
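The other stemmers mentioned above can produce different stems for the same word. A small sketch comparing the three on a few of our tokens:

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Lancaster is the most aggressive of the three and tends to
# produce the shortest, least readable stems
for w in ["natural", "language", "processing", "allocated"]:
    print(w, "->", porter.stem(w), "|", snowball.stem(w), "|", lancaster.stem(w))
```

Porter and Snowball (an improved Porter) usually agree on English text; trying all three on your own corpus is the easiest way to pick one.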


Lemmatization

If we want to extract the base form of a word, we use the process of lemmatization. The extracted word is called a lemma, and it is a real word found in the English dictionary. NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up the lemmas of words.

from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
print(lemmed)
# output -> ['natural', 'language', 'processing', 'exciting', 'area', 'huge', 'budget', 'allocated']

Stemming is much faster than lemmatization because it does not need to look anything up in a dictionary; it simply applies an algorithm to generate the root words.

Syntax Tree Generation or Parse Tree

We can define a grammar for English and then use the NLTK RegexpParser to extract the matching parts of speech from a sentence and draw the resulting tree to visualize it.

# Import required libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk import pos_tag, word_tokenize, RegexpParser

# Example text
sample_text = "The quick brown fox jumps over the lazy dog"

# Find all parts of speech in above sentence
tagged = pos_tag(word_tokenize(sample_text))

# Extract all parts of speech from any text
chunker = RegexpParser("""
    NP: {<DT>?<JJ>*<NN>}   # To extract Noun Phrases
    P: {<IN>}              # To extract Prepositions
    V: {<V.*>}             # To extract Verbs
    PP: {<P> <NP>}         # To extract Prepositional Phrases
    VP: {<V> <NP|PP>*}     # To extract Verb Phrases
""")

# Print all parts of speech in above sentence
output = chunker.parse(tagged)
print("After Extracting\n", output)



Part of Speech (POS) Tagging:

If we study English literature closely, we will find that many words are spelled the same yet have different meanings depending on context. POS tagging is a text processing step that avoids that confusion: based on its definition and context, each word is given a particular tag and processed accordingly. Two steps are used here:

  • Tokenize text (word_tokenize).
  • Apply the pos_tag from NLTK to the above step.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

txt = ("Natural language processing is an exciting area."
       " Huge budget have been allocated for this.")

# sent_tokenize is one of the instances of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)

for i in tokenized:
    # Word tokenizers are used to find the words
    # and punctuation in a string
    wordsList = nltk.word_tokenize(i)

    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]

    # Using a Tagger, which is a part-of-speech
    # tagger or POS-tagger
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

# output -> [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('exciting', 'JJ'), ('area', 'NN'), ('.', '.')]
#           [('Huge', 'NNP'), ('budget', 'NN'), ('allocated', 'VBD'), ('.', '.')]



Reference: What is Natural Language Processing? | IBM


