Python NLTK Tutorial: Natural Language Processing
Learn how to use Python's Natural Language Toolkit (NLTK) to perform real text analysis and NLP tasks. This tutorial includes installation instructions, sample code, and hands-on challenges for students interested in programming.
Prerequisites
Before starting this tutorial, students should have:
- Basic understanding of Python syntax (variables, functions, loops)
- Python 3.7 or higher installed on their computer
- A text editor or IDE (VS Code, PyCharm, IDLE, or Thonny recommended)
- Basic command line/terminal knowledge
- Completed the main Lesson 10 activities to understand NLP concepts
Estimated Time: 45-60 minutes for setup and all exercises
Step 1: Installation and Setup
Installing NLTK
Open your terminal or command prompt and run the following command:
# Install NLTK using pip
pip install nltk
# If your system keeps Python 2 and 3 separate, you may need:
pip3 install nltk
Downloading NLTK Data
NLTK requires additional data files for text processing. Run this Python script to download essential packages:
# Download required NLTK data
import nltk
# Download all essential packages (recommended for beginners)
nltk.download('popular')
# OR download specific packages individually:
# nltk.download('punkt') # For tokenization (newer NLTK versions may also need 'punkt_tab')
# nltk.download('averaged_perceptron_tagger') # For POS tagging
# nltk.download('maxent_ne_chunker') # For named entity recognition
# nltk.download('words') # English word list
# nltk.download('stopwords') # Common words to filter out
# nltk.download('vader_lexicon') # For sentiment analysis
Installation Tip
The nltk.download('popular') command downloads a curated set of the most commonly used data packages and prints its progress in the terminal. This is the easiest option for beginners. It may take 2-5 minutes to download all packages, so make sure you're connected to the internet!
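Once the download finishes, a quick sanity check (a minimal sketch) is to tokenize a short string; if it prints a list of words instead of raising a LookupError, your setup is ready:
# Quick setup check: should print the NLTK version and a list of tokens
import nltk
from nltk.tokenize import word_tokenize
print("NLTK version:", nltk.__version__)
print(word_tokenize("NLTK is ready to go!"))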
Exercise 1: Tokenization
Tokenization is the process of breaking text into individual words or sentences. Let's see how NLTK does this automatically.
Word Tokenization
import nltk
from nltk.tokenize import word_tokenize
# Sample text
text = "Natural language processing is fascinating! It helps computers understand human language."
# Tokenize into words
tokens = word_tokenize(text)
# Display results
print("Original text:")
print(text)
print("\nTokenized words:")
print(tokens)
print(f"\nTotal number of tokens: {len(tokens)}")
Expected Output
Original text:
Natural language processing is fascinating! It helps computers understand human language.

Tokenized words:
['Natural', 'language', 'processing', 'is', 'fascinating', '!', 'It', 'helps', 'computers', 'understand', 'human', 'language', '.']

Total number of tokens: 13
Sentence Tokenization
from nltk.tokenize import sent_tokenize
# Text with multiple sentences
paragraph = "AI is transforming education. Students can now learn at their own pace. Teachers have powerful new tools to help them."
# Tokenize into sentences
sentences = sent_tokenize(paragraph)
# Display results
print("Sentences found:")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")
Challenge 1: Your Turn!
Write a program that:
- Takes a paragraph of text as input (use input() or a variable)
- Counts the total number of words (tokens)
- Counts the total number of sentences
- Calculates the average words per sentence
- Prints all results in a formatted way
Hint: Use len() to count items in a list!
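If you are unsure where to begin, here is one possible skeleton (a sketch, not the only solution; replace the sample paragraph with your own input):
# Possible starting point for Challenge 1
from nltk.tokenize import word_tokenize, sent_tokenize
paragraph = "AI is everywhere. It even helps grade homework. What will it do next?"
words = word_tokenize(paragraph)
sentences = sent_tokenize(paragraph)
print(f"Total words (tokens): {len(words)}")
print(f"Total sentences: {len(sentences)}")
print(f"Average words per sentence: {len(words) / len(sentences):.1f}")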
Exercise 2: Part-of-Speech Tagging
Part-of-speech (POS) tagging identifies whether each word is a noun, verb, adjective, etc. This is crucial for understanding sentence structure.
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."
# Tokenize and tag parts of speech
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)
# Display results
print("Word → Part of Speech")
print("-" * 30)
for word, tag in tagged:
    print(f"{word:15} → {tag}")
# Count different types
print("\n" + "="*30)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]
print(f"Nouns: {nouns}")
print(f"Verbs: {verbs}")
print(f"Adjectives: {adjectives}")
Common POS Tags
- NN: Noun, singular (dog, computer)
- NNS: Noun, plural (dogs, computers)
- VB: Verb, base form (run, eat)
- VBD: Verb, past tense (ran, ate)
- VBG: Verb, gerund/present participle (running, eating)
- JJ: Adjective (quick, brown)
- RB: Adverb (quickly, very)
- DT: Determiner (the, a, an)
- IN: Preposition (in, on, over)
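If you forget what a tag means, NLTK can describe it for you. The lookup below uses nltk.help.upenn_tagset, which relies on the 'tagsets' data package (one extra download):
# Look up the meaning of a POS tag (requires the 'tagsets' data package)
import nltk
nltk.download('tagsets', quiet=True)  # One-time download of the tag documentation
nltk.help.upenn_tagset('NN')          # Prints the definition and examples for NN
nltk.help.upenn_tagset('VB.*')        # A regular expression matches a whole family of tags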
Challenge 2: Word Type Counter
Create a program that:
- Analyzes a paragraph of text
- Counts how many nouns, verbs, and adjectives are in the text
- Calculates the percentage of each type
- Displays the results in a formatted report
Bonus: Find the most common noun and verb in the text!
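For the bonus, collections.Counter works well with the tag lists from the example above. A minimal, self-contained sketch:
# Find the most frequent noun and verb in a sample sentence
from collections import Counter
from nltk import pos_tag
from nltk.tokenize import word_tokenize
tagged = pos_tag(word_tokenize("The dog chased the cat because the dog saw the cat run."))
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]
print("Most common noun:", Counter(nouns).most_common(1))
print("Most common verb:", Counter(verbs).most_common(1))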
Exercise 3: Named Entity Recognition
Named Entity Recognition (NER) identifies and classifies proper nouns like people, organizations, and locations.
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
# Sample text with named entities
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California. Microsoft was started by Bill Gates in Seattle."
# Process the text
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)
# Display the tree structure
print("Named Entity Tree:")
print(entities)
# Extract named entities
print("\nExtracted Named Entities:")
print("-" * 40)
for chunk in entities:
    if hasattr(chunk, 'label'):
        entity_name = ' '.join(c[0] for c in chunk)
        entity_type = chunk.label()
        print(f"{entity_name:25} → {entity_type}")
# Alternative: More detailed extraction
def extract_entities(text):
    """Extract and categorize named entities from text."""
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    entities = ne_chunk(tagged)
    results = {'PERSON': [], 'ORGANIZATION': [], 'GPE': [], 'OTHER': []}
    for chunk in entities:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            label = chunk.label()
            if label in results:
                results[label].append(entity_name)
            else:
                results['OTHER'].append(entity_name)
    return results
# Use the function
entities_dict = extract_entities(text)
print("\n" + "="*40)
print("Categorized Entities:")
print(f"People: {entities_dict['PERSON']}")
print(f"Organizations: {entities_dict['ORGANIZATION']}")
print(f"Locations (GPE): {entities_dict['GPE']}")
Common Entity Types
- PERSON: Names of people (Steve Jobs, Bill Gates)
- ORGANIZATION: Companies, institutions (Apple Inc., Microsoft)
- GPE: Geo-Political Entity - cities, states, countries (Cupertino, California, Seattle)
- DATE: Dates and time expressions
- MONEY: Monetary values
- PERCENT: Percentages
Challenge 3: News Article Analyzer
Write a program that:
- Reads a news article or paragraph (you can use a multi-line string)
- Extracts all named entities
- Counts how many of each type (PERSON, ORGANIZATION, GPE)
- Creates a summary report showing the main people, organizations, and places mentioned
Test it: copy a paragraph from a news website and see which entities the program finds!
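Counting entities per category is mostly a matter of measuring the lists that extract_entities (defined above) returns. A short sketch, assuming entities_dict was built as in the previous example:
# Count how many entities of each type were found (assumes entities_dict from above)
entity_counts = {label: len(names) for label, names in entities_dict.items()}
for label, count in entity_counts.items():
    print(f"{label:15} → {count}")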
Exercise 4: Sentiment Analysis with VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a tool specifically designed for analyzing sentiment in social media and text. It's particularly good at understanding modern language, including emojis and slang!
from nltk.sentiment import SentimentIntensityAnalyzer
# Create sentiment analyzer
sia = SentimentIntensityAnalyzer()
# Test different sentences
sentences = [
    "I love this product! It's absolutely amazing!",
    "This is the worst experience ever. Very disappointed.",
    "The movie was okay. Nothing special.",
    "I'm feeling pretty good about the test results.",
    "Ugh, another rainy Monday. Just great. 😒"
]
print("Sentiment Analysis Results")
print("="*60)
for sentence in sentences:
    # Get sentiment scores
    scores = sia.polarity_scores(sentence)
    # Determine overall sentiment
    if scores['compound'] >= 0.05:
        sentiment = "POSITIVE ✓"
    elif scores['compound'] <= -0.05:
        sentiment = "NEGATIVE ✗"
    else:
        sentiment = "NEUTRAL —"
    print(f"\nText: {sentence}")
    print(f" Positive: {scores['pos']:.2f}")
    print(f" Negative: {scores['neg']:.2f}")
    print(f" Neutral: {scores['neu']:.2f}")
    print(f" Compound: {scores['compound']:.2f}")
    print(f" Overall: {sentiment}")
# Analyzing longer text
def analyze_review(review_text):
    """Analyze a product review or longer text."""
    scores = sia.polarity_scores(review_text)
    compound = scores['compound']
    # Determine rating
    if compound >= 0.5:
        rating = "⭐⭐⭐⭐⭐ Highly Positive"
    elif compound >= 0.05:
        rating = "⭐⭐⭐⭐ Positive"
    elif compound <= -0.5:
        rating = "⭐ Highly Negative"
    elif compound <= -0.05:
        rating = "⭐⭐ Negative"
    else:
        rating = "⭐⭐⭐ Neutral"
    return {
        'scores': scores,
        'rating': rating,
        'confidence': abs(compound)
    }
# Example review analysis
review = """
This product exceeded my expectations! The quality is fantastic and
it arrived quickly. Customer service was helpful when I had questions.
Highly recommend to anyone looking for a reliable option.
"""
result = analyze_review(review)
print("\n" + "="*60)
print("Review Analysis:")
print(f"Rating: {result['rating']}")
print(f"Confidence: {result['confidence']:.2%}")
Understanding VADER Scores
- Positive: Proportion of positive words (0.0 to 1.0)
- Negative: Proportion of negative words (0.0 to 1.0)
- Neutral: Proportion of neutral words (0.0 to 1.0)
- Compound: Overall sentiment score (-1.0 to 1.0)
- Positive sentiment: compound score ≥ 0.05
- Neutral sentiment: -0.05 < compound score < 0.05
- Negative sentiment: compound score ≤ -0.05
Challenge 4: Review Analyzer Tool
Create a program that:
- Accepts multiple product reviews as input (use a list of strings)
- Analyzes the sentiment of each review
- Calculates the average sentiment score across all reviews
- Identifies the most positive and most negative review
- Generates a summary report with overall product rating
Bonus Challenge: Add a feature that detects reviews with "mixed" sentiment (both positive and negative words) and flags them for manual review!
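One way to approach the bonus challenge (a sketch, with an arbitrary threshold you should tune yourself): treat a review as "mixed" when both its positive and negative proportions are clearly above zero, regardless of the compound score.
# Flag reviews that contain both clearly positive and clearly negative language
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
review = "The battery life is fantastic, but the screen cracked within a week."
scores = sia.polarity_scores(review)
# 0.15 is an arbitrary cutoff for this sketch; experiment with your own reviews
if scores['pos'] >= 0.15 and scores['neg'] >= 0.15:
    print("Mixed sentiment - flag for manual review:", scores)
else:
    print("Not mixed:", scores)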
Exercise 5: Text Analysis - Stopwords & Word Frequency
Stopwords are common words (like "the", "is", "at") that usually don't carry much meaning. Removing them helps us focus on the important words in text analysis.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
# Download stopwords if not already done
# nltk.download('stopwords')
# Sample text
text = """
Artificial intelligence is transforming education. Teachers can use AI to
personalize learning for students. Students can learn at their own pace.
AI systems can provide immediate feedback. The future of education looks
very different with AI technology.
"""
# Tokenize and convert to lowercase
tokens = word_tokenize(text.lower())
# Get English stopwords
stop_words = set(stopwords.words('english'))
# Filter out stopwords and punctuation
filtered_words = [
    word for word in tokens
    if word.isalnum() and word not in stop_words
]
print("Original tokens:", len(tokens))
print("After filtering:", len(filtered_words))
print("\nFiltered words:", filtered_words[:20]) # Show first 20
# Count word frequency
word_freq = Counter(filtered_words)
print("\n" + "="*50)
print("Top 10 Most Common Words:")
print("="*50)
for word, count in word_freq.most_common(10):
    print(f"{word:20} → {count} times")
# Find keywords (words appearing 2+ times)
keywords = [word for word, count in word_freq.items() if count >= 2]
print(f"\nKeywords (appearing 2+ times): {keywords}")
# Calculate basic statistics
unique_words = len(word_freq)
total_words = len(filtered_words)
vocabulary_richness = unique_words / total_words
print(f"\nText Statistics:")
print(f" Unique words: {unique_words}")
print(f" Total words: {total_words}")
print(f" Vocabulary richness: {vocabulary_richness:.2%}")
Challenge 5: Compare Two Texts
Write a program that:
- Takes two different texts as input (e.g., two news articles, two book summaries)
- Extracts keywords from each (removing stopwords)
- Finds common keywords between both texts
- Identifies unique keywords in each text
- Determines which text has richer vocabulary
- Creates a comparison report
Hint: Use Python sets to find intersections and differences!
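The set operations behind that hint look like this (a minimal sketch with toy keyword sets; replace them with the keywords you extract from each text):
# Comparing two keyword sets with set operations (toy data for illustration)
keywords_a = {"ai", "education", "students", "feedback"}
keywords_b = {"ai", "healthcare", "patients", "feedback"}
print("Common keywords:", keywords_a & keywords_b)   # intersection
print("Only in text A:", keywords_a - keywords_b)    # difference
print("Only in text B:", keywords_b - keywords_a)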
Final Project: Text Analysis Dashboard
Comprehensive NLP Project
Combine everything you've learned to create a complete text analysis tool! Your program should:
Required Features:
- Input: Accept text input from the user (paste or type)
- Basic Stats: Count words, sentences, and average words per sentence
- Part-of-Speech Analysis: Count nouns, verbs, and adjectives
- Named Entities: Extract and display people, organizations, and places
- Sentiment Analysis: Determine if the text is positive, negative, or neutral
- Keywords: Identify the top 5-10 most important words (after removing stopwords)
- Report: Display all results in a clear, formatted report
Bonus Features (Choose 2+):
- Save results to a text file or CSV
- Create visualizations (word cloud, bar charts) using matplotlib (see the sketch after this list)
- Compare multiple texts side-by-side
- Add a simple GUI using tkinter
- Analyze sentiment over time in a series of texts
- Detect and highlight difficult or complex sentences
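For the visualization bonus mentioned above, a bar chart of the top keywords is a good first target. A minimal sketch (assumes matplotlib is installed with pip install matplotlib; the Counter here is toy data standing in for the word_freq you build in Exercise 5):
# Bar chart of the most common keywords (requires: pip install matplotlib)
from collections import Counter
import matplotlib.pyplot as plt
word_freq = Counter({"ai": 4, "education": 3, "students": 2, "learning": 2, "feedback": 1})
words, counts = zip(*word_freq.most_common(5))
plt.bar(words, counts)
plt.title("Top Keywords")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.show()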
# Starter code for your Text Analysis Dashboard
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag, ne_chunk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from collections import Counter
def analyze_text(text):
    """
    Comprehensive text analysis function.
    Returns a dictionary with all analysis results.
    """
    results = {}
    # TODO: Add your analysis code here
    # 1. Basic statistics
    # 2. POS analysis
    # 3. Named entities
    # 4. Sentiment
    # 5. Keywords
    return results
def display_report(results):
    """
    Display results in a formatted report.
    """
    print("="*60)
    print("TEXT ANALYSIS REPORT")
    print("="*60)
    # TODO: Format and display all results
    pass
def main():
    """Main program loop."""
    print("Welcome to the NLP Text Analysis Dashboard!")
    print("-"*60)
    # Get user input
    text = input("Enter or paste your text:\n")
    # Analyze
    results = analyze_text(text)
    # Display
    display_report(results)
    # Ask if user wants to analyze another text
    again = input("\nAnalyze another text? (y/n): ")
    if again.lower() == 'y':
        main()
    else:
        print("Thank you for using the Text Analysis Dashboard!")

if __name__ == "__main__":
    main()
Additional Resources & Next Steps
Documentation
- Official NLTK documentation: https://www.nltk.org
- The free online NLTK Book (Natural Language Processing with Python): https://www.nltk.org/book/
Video Tutorials
- Search YouTube: "NLTK Tutorial for Beginners"
- Look for: "Python Text Analysis with NLTK"
- Recommended: "Sentiment Analysis Python Tutorial"
Learning Path: What's Next?
After mastering NLTK basics, explore:
- spaCy: A more modern, faster NLP library (spacy.io)
- TextBlob: Simplified NLP for beginners (textblob.readthedocs.io)
- Transformers: State-of-the-art models like BERT and GPT (huggingface.co)
- Machine Learning: Train your own text classifiers with scikit-learn
- Deep Learning: Build neural networks for NLP with TensorFlow or PyTorch
Common Issues & Solutions
Problem: "ModuleNotFoundError: No module named 'nltk'"
Solution: NLTK is not installed. Run pip install nltk in your terminal/command prompt.
Problem: A LookupError saying a resource such as punkt was not found
Solution: You need to download the NLTK data. Run:
import nltk
nltk.download('punkt')
Or download all popular packages with: nltk.download('popular')
Problem: Processing is slow on long texts
Solutions:
- Process smaller chunks of text at a time
- Use more efficient libraries like spaCy for large-scale processing
- Cache results that don't need to be recalculated
- The first time you run NER or POS tagging can be slow as the models load
Remember: VADER works best with social media text and modern language. For formal or literary text, results may be less accurate. Sarcasm and complex irony are still difficult for automated sentiment analysis.