Coding Extension: Python + NLTK

Python NLTK Tutorial: Natural Language Processing

Learn how to use Python's Natural Language Toolkit (NLTK) to perform real text analysis and NLP tasks. This tutorial includes installation instructions, sample code, and hands-on challenges for students interested in programming.

Prerequisites

Before starting this tutorial, students should have:

  • Basic understanding of Python syntax (variables, functions, loops)
  • Python 3.7 or higher installed on their computer
  • A text editor or IDE (VS Code, PyCharm, IDLE, or Thonny recommended)
  • Basic command line/terminal knowledge
  • Completed the main Lesson 10 activities to understand NLP concepts

Estimated Time: 45-60 minutes for setup and all exercises

Step 1: Installation and Setup

Installing NLTK

Open your terminal or command prompt and run the following command:

# Install NLTK using pip
pip install nltk

# If pip is linked to Python 2 on your system, use pip3 instead:
pip3 install nltk

Downloading NLTK Data

NLTK requires additional data files for text processing. Run this Python script to download essential packages:

# Download required NLTK data
import nltk

# Download all essential packages (recommended for beginners)
nltk.download('popular')

# OR download specific packages individually:
# nltk.download('punkt')        # For tokenization
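# nltk.download('punkt_tab')    # Newer NLTK versions (3.8.2+) also need this for tokenization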
# nltk.download('averaged_perceptron_tagger')  # For POS tagging
# nltk.download('maxent_ne_chunker')  # For named entity recognition
# nltk.download('words')        # English word list
# nltk.download('stopwords')    # Common words to filter out
# nltk.download('vader_lexicon')  # For sentiment analysis

Installation Tip

The nltk.download('popular') command downloads a curated collection of the most commonly used data packages directly, which is the easiest option for beginners. It may take 2-5 minutes to download everything, so make sure you're connected to the internet! (Running nltk.download() with no arguments opens an interactive downloader instead, if you prefer to pick packages yourself.)
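
If you want to confirm everything downloaded correctly before moving on, a quick check like the one below can help. This is a minimal sketch; the resource paths shown are the standard locations NLTK uses for the punkt tokenizer and the stopword lists.

import nltk

# Ask NLTK where each resource lives; a LookupError means it is missing
try:
    nltk.data.find('tokenizers/punkt')    # sentence/word tokenizer models
    nltk.data.find('corpora/stopwords')   # stopword lists
    print("NLTK data looks good!")
except LookupError as error:
    print("Missing data:", error)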

Exercise 1: Tokenization

Tokenization is the process of breaking text into individual words or sentences. Let's see how NLTK does this automatically.

Word Tokenization

import nltk
from nltk.tokenize import word_tokenize

# Sample text
text = "Natural language processing is fascinating! It helps computers understand human language."

# Tokenize into words
tokens = word_tokenize(text)

# Display results
print("Original text:")
print(text)
print("\nTokenized words:")
print(tokens)
print(f"\nTotal number of tokens: {len(tokens)}")

Expected Output

Original text:
Natural language processing is fascinating! It helps computers understand human language.

Tokenized words:
['Natural', 'language', 'processing', 'is', 'fascinating', '!', 'It', 'helps', 'computers', 'understand', 'human', 'language', '.']

Total number of tokens: 13

Sentence Tokenization

from nltk.tokenize import sent_tokenize

# Text with multiple sentences
paragraph = "AI is transforming education. Students can now learn at their own pace. Teachers have powerful new tools to help them."

# Tokenize into sentences
sentences = sent_tokenize(paragraph)

# Display results
print("Sentences found:")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")

Challenge 1: Your Turn!

Write a program that:

  1. Takes a paragraph of text as input (use input() or a variable)
  2. Counts the total number of words (tokens)
  3. Counts the total number of sentences
  4. Calculates the average words per sentence
  5. Prints all results in a formatted way

Hint: Use len() to count items in a list!
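
If you get stuck, here is one possible approach. It is a sketch, not the only correct solution, and it reads the paragraph from input() as suggested above.

from nltk.tokenize import word_tokenize, sent_tokenize

# Read a paragraph from the user
text = input("Paste a paragraph of text: ")

words = word_tokenize(text)
sentences = sent_tokenize(text)

# Guard against empty input so we never divide by zero
average = len(words) / len(sentences) if sentences else 0

print(f"Total words: {len(words)}")
print(f"Total sentences: {len(sentences)}")
print(f"Average words per sentence: {average:.1f}")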

Exercise 2: Part-of-Speech Tagging

Part-of-speech (POS) tagging identifies whether each word is a noun, verb, adjective, etc. This is crucial for understanding sentence structure.

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize and tag parts of speech
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)

# Display results
print("Word → Part of Speech")
print("-" * 30)
for word, tag in tagged:
    print(f"{word:15} → {tag}")

# Count different types
print("\n" + "="*30)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]

print(f"Nouns: {nouns}")
print(f"Verbs: {verbs}")
print(f"Adjectives: {adjectives}")

Common POS Tags

  • NN: Noun, singular (dog, computer)
  • NNS: Noun, plural (dogs, computers)
  • VB: Verb, base form (run, eat)
  • VBD: Verb, past tense (ran, ate)
  • VBG: Verb, gerund/present participle (running, eating)
  • JJ: Adjective (quick, brown)
  • RB: Adverb (quickly, very)
  • DT: Determiner (the, a, an)
  • IN: Preposition (in, on, over)

To view the full Penn Treebank tag list, you can ask NLTK itself, as shown below.
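
This is a minimal sketch; it assumes the 'tagsets' help data has been downloaded (newer NLTK releases may name the package 'tagsets_json').

import nltk
# nltk.download('tagsets')   # one-time download of the tag documentation

nltk.help.upenn_tagset('NN')   # describe a single tag
nltk.help.upenn_tagset()       # print every Penn Treebank tag with examples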

Challenge 2: Word Type Counter

Create a program that:

  1. Analyzes a paragraph of text
  2. Counts how many nouns, verbs, and adjectives are in the text
  3. Calculates the percentage of each type
  4. Displays the results in a formatted report

Bonus: Find the most common noun and verb in the text!
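
If you need a starting point, one possible skeleton is sketched below. The sample paragraph is just a placeholder to make the code runnable; swap in your own text.

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk import pos_tag

paragraph = "The curious students wrote clever programs. The programs analyzed long sentences quickly."

# Tag every token, then group words by the start of their POS tag
tagged = pos_tag(word_tokenize(paragraph))
total = len(tagged)

nouns = [w for w, t in tagged if t.startswith('NN')]
verbs = [w for w, t in tagged if t.startswith('VB')]
adjectives = [w for w, t in tagged if t.startswith('JJ')]

print(f"Nouns: {len(nouns)} ({len(nouns)/total:.0%})")
print(f"Verbs: {len(verbs)} ({len(verbs)/total:.0%})")
print(f"Adjectives: {len(adjectives)} ({len(adjectives)/total:.0%})")

# Bonus: most common noun and verb
print("Most common noun:", Counter(nouns).most_common(1))
print("Most common verb:", Counter(verbs).most_common(1))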

Exercise 3: Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies proper nouns like people, organizations, and locations.

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# Sample text with named entities
text = "Apple Inc. was founded by Steve Jobs in Cupertino, California. Microsoft was started by Bill Gates in Seattle."

# Process the text
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)

# Display the tree structure
print("Named Entity Tree:")
print(entities)

# Extract named entities
print("\nExtracted Named Entities:")
print("-" * 40)
for chunk in entities:
    if hasattr(chunk, 'label'):
        entity_name = ' '.join(c[0] for c in chunk)
        entity_type = chunk.label()
        print(f"{entity_name:25} → {entity_type}")

# Alternative: More detailed extraction
def extract_entities(text):
    """Extract and categorize named entities from text."""
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    entities = ne_chunk(tagged)
    
    results = {'PERSON': [], 'ORGANIZATION': [], 'GPE': [], 'OTHER': []}
    
    for chunk in entities:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            label = chunk.label()
            
            if label in results:
                results[label].append(entity_name)
            else:
                results['OTHER'].append(entity_name)
    
    return results

# Use the function
entities_dict = extract_entities(text)
print("\n" + "="*40)
print("Categorized Entities:")
print(f"People: {entities_dict['PERSON']}")
print(f"Organizations: {entities_dict['ORGANIZATION']}")
print(f"Locations (GPE): {entities_dict['GPE']}")

Common Entity Types

  • PERSON: Names of people (Steve Jobs, Bill Gates)
  • ORGANIZATION: Companies, institutions (Apple Inc., Microsoft)
  • GPE: Geo-Political Entity - cities, states, countries (Cupertino, California, Seattle)
  • DATE: Dates and time expressions
  • MONEY: Monetary values
  • PERCENT: Percentages

Challenge 3: News Article Analyzer

Write a program that:

  1. Reads a news article or paragraph (you can use a multi-line string)
  2. Extracts all named entities
  3. Counts how many of each type (PERSON, ORGANIZATION, GPE)
  4. Creates a summary report showing the main people, organizations, and places mentioned

Test it with: Copy a paragraph from a news website and see what entities the program finds!

Exercise 4: Sentiment Analysis with VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a tool specifically designed for analyzing sentiment in social media and text. It's particularly good at understanding modern language, including emojis and slang!

from nltk.sentiment import SentimentIntensityAnalyzer

# Create sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Test different sentences
sentences = [
    "I love this product! It's absolutely amazing!",
    "This is the worst experience ever. Very disappointed.",
    "The movie was okay. Nothing special.",
    "I'm feeling pretty good about the test results.",
    "Ugh, another rainy Monday. Just great. 😒"
]

print("Sentiment Analysis Results")
print("="*60)

for sentence in sentences:
    # Get sentiment scores
    scores = sia.polarity_scores(sentence)
    
    # Determine overall sentiment
    if scores['compound'] >= 0.05:
        sentiment = "POSITIVE ✓"
    elif scores['compound'] <= -0.05:
        sentiment = "NEGATIVE ✗"
    else:
        sentiment = "NEUTRAL —"
    
    print(f"\nText: {sentence}")
    print(f"  Positive: {scores['pos']:.2f}")
    print(f"  Negative: {scores['neg']:.2f}")
    print(f"  Neutral:  {scores['neu']:.2f}")
    print(f"  Compound: {scores['compound']:.2f}")
    print(f"  Overall:  {sentiment}")

# Analyzing longer text
def analyze_review(review_text):
    """Analyze a product review or longer text."""
    scores = sia.polarity_scores(review_text)
    compound = scores['compound']
    
    # Determine rating
    if compound >= 0.5:
        rating = "⭐⭐⭐⭐⭐ Highly Positive"
    elif compound >= 0.05:
        rating = "⭐⭐⭐⭐ Positive"
    elif compound <= -0.5:
        rating = "⭐ Highly Negative"
    elif compound <= -0.05:
        rating = "⭐⭐ Negative"
    else:
        rating = "⭐⭐⭐ Neutral"
    
    return {
        'scores': scores,
        'rating': rating,
        'confidence': abs(compound)
    }

# Example review analysis
review = """
This product exceeded my expectations! The quality is fantastic and 
it arrived quickly. Customer service was helpful when I had questions. 
Highly recommend to anyone looking for a reliable option.
"""

result = analyze_review(review)
print("\n" + "="*60)
print("Review Analysis:")
print(f"Rating: {result['rating']}")
print(f"Confidence: {result['confidence']:.2%}")

Understanding VADER Scores

  • Positive: Proportion of positive words (0.0 to 1.0)
  • Negative: Proportion of negative words (0.0 to 1.0)
  • Neutral: Proportion of neutral words (0.0 to 1.0)
  • Compound: Overall sentiment score (-1.0 to 1.0)
    • Positive sentiment: compound score ≥ 0.05
    • Neutral sentiment: -0.05 < compound score < 0.05
    • Negative sentiment: compound score ≤ -0.05

Challenge 4: Review Analyzer Tool

Create a program that:

  1. Accepts multiple product reviews as input (use a list of strings)
  2. Analyzes the sentiment of each review
  3. Calculates the average sentiment score across all reviews
  4. Identifies the most positive and most negative review
  5. Generates a summary report with overall product rating

Bonus Challenge: Add a feature that detects reviews with "mixed" sentiment (both positive and negative words) and flags them for manual review!
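
For the bonus, one possible definition of a "mixed" review is one where both the positive and the negative scores are high at the same time. The 0.2 threshold below is an assumption you can tune, not an official VADER cutoff.

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def is_mixed(review, threshold=0.2):
    """Flag reviews that contain strong positive AND negative language."""
    scores = sia.polarity_scores(review)
    return scores['pos'] >= threshold and scores['neg'] >= threshold

print(is_mixed("The camera is fantastic, but the battery life is terrible."))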

Exercise 5: Text Analysis - Stopwords & Word Frequency

Stopwords are common words (like "the", "is", "at") that usually don't carry much meaning. Removing them helps us focus on the important words in text analysis.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

# Download stopwords if not already done
# nltk.download('stopwords')

# Sample text
text = """
Artificial intelligence is transforming education. Teachers can use AI to 
personalize learning for students. Students can learn at their own pace. 
AI systems can provide immediate feedback. The future of education looks 
very different with AI technology.
"""

# Tokenize and convert to lowercase
tokens = word_tokenize(text.lower())

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Filter out stopwords and punctuation
filtered_words = [
    word for word in tokens 
    if word.isalnum() and word not in stop_words
]

print("Original tokens:", len(tokens))
print("After filtering:", len(filtered_words))
print("\nFiltered words:", filtered_words[:20])  # Show first 20

# Count word frequency
word_freq = Counter(filtered_words)

print("\n" + "="*50)
print("Top 10 Most Common Words:")
print("="*50)
for word, count in word_freq.most_common(10):
    print(f"{word:20} → {count} times")

# Find keywords (words appearing 2+ times)
keywords = [word for word, count in word_freq.items() if count >= 2]
print(f"\nKeywords (appearing 2+ times): {keywords}")

# Calculate basic statistics
unique_words = len(word_freq)
total_words = len(filtered_words)
vocabulary_richness = unique_words / total_words

print(f"\nText Statistics:")
print(f"  Unique words: {unique_words}")
print(f"  Total words: {total_words}")
print(f"  Vocabulary richness: {vocabulary_richness:.2%}")

Challenge 5: Compare Two Texts

Write a program that:

  1. Takes two different texts as input (e.g., two news articles, two book summaries)
  2. Extracts keywords from each (removing stopwords)
  3. Finds common keywords between both texts
  4. Identifies unique keywords in each text
  5. Determines which text has richer vocabulary
  6. Creates a comparison report

Hint: Use Python sets to find intersections and differences!
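
As the hint suggests, Python sets make the comparison step straightforward. The keyword sets below are placeholders; in your program they would come from the stopword-filtering step in Exercise 5.

# Placeholder keyword sets for two texts
keywords_a = {'ai', 'education', 'students', 'learning', 'feedback'}
keywords_b = {'ai', 'robots', 'students', 'jobs', 'future'}

common = keywords_a & keywords_b    # intersection: keywords in both texts
only_a = keywords_a - keywords_b    # difference: unique to text A
only_b = keywords_b - keywords_a    # difference: unique to text B

print("Common keywords:", common)
print("Unique to text A:", only_a)
print("Unique to text B:", only_b)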

Final Project: Text Analysis Dashboard

Comprehensive NLP Project

Combine everything you've learned to create a complete text analysis tool! Your program should:

Required Features:

  1. Input: Accept text input from the user (paste or type)
  2. Basic Stats: Count words, sentences, and average words per sentence
  3. Part-of-Speech Analysis: Count nouns, verbs, and adjectives
  4. Named Entities: Extract and display people, organizations, and places
  5. Sentiment Analysis: Determine if the text is positive, negative, or neutral
  6. Keywords: Identify the top 5-10 most important words (after removing stopwords)
  7. Report: Display all results in a clear, formatted report

Bonus Features (Choose 2+):

  • Save results to a text file or CSV
  • Create visualizations (word cloud, bar charts) using matplotlib (see the bar-chart sketch after this list)
  • Compare multiple texts side-by-side
  • Add a simple GUI using tkinter
  • Analyze sentiment over time in a series of texts
  • Detect and highlight difficult or complex sentences
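
If you attempt the visualization bonus, a minimal bar-chart sketch could look like the one below. It assumes matplotlib is installed (pip install matplotlib) and that word_freq is the Counter you build in Exercise 5; the numbers here are just example values.

import matplotlib.pyplot as plt
from collections import Counter

# Example frequencies; in your project these come from your keyword analysis
word_freq = Counter({'ai': 4, 'education': 3, 'students': 2, 'learning': 2})

words, counts = zip(*word_freq.most_common(10))
plt.bar(words, counts)
plt.title("Top Keywords")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
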
# Starter code for your Text Analysis Dashboard

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag, ne_chunk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from collections import Counter

def analyze_text(text):
    """
    Comprehensive text analysis function.
    Returns a dictionary with all analysis results.
    """
    results = {}
    
    # TODO: Add your analysis code here
    # 1. Basic statistics
    # 2. POS analysis
    # 3. Named entities
    # 4. Sentiment
    # 5. Keywords
    
    return results

def display_report(results):
    """
    Display results in a formatted report.
    """
    print("="*60)
    print("TEXT ANALYSIS REPORT")
    print("="*60)
    
    # TODO: Format and display all results
    
    pass

def main():
    """Main program loop."""
    print("Welcome to the NLP Text Analysis Dashboard!")
    print("-"*60)
    
    # Get user input
    text = input("Enter or paste your text:\n")
    
    # Analyze
    results = analyze_text(text)
    
    # Display
    display_report(results)
    
    # Ask if user wants to analyze another text
    again = input("\nAnalyze another text? (y/n): ")
    if again.lower() == 'y':
        main()
    else:
        print("Thank you for using the Text Analysis Dashboard!")

if __name__ == "__main__":
    main()

Additional Resources & Next Steps

Video Tutorials

  • Search YouTube: "NLTK Tutorial for Beginners"
  • Look for: "Python Text Analysis with NLTK"
  • Recommended: "Sentiment Analysis Python Tutorial"

Learning Path: What's Next?

After mastering NLTK basics, explore:

  1. spaCy: A more modern, faster NLP library (spacy.io)
  2. TextBlob: Simplified NLP for beginners (textblob.readthedocs.io)
  3. Transformers: State-of-the-art models like BERT and GPT (huggingface.co)
  4. Machine Learning: Train your own text classifiers with scikit-learn
  5. Deep Learning: Build neural networks for NLP with TensorFlow or PyTorch

Common Issues & Solutions

Problem: You see "ModuleNotFoundError: No module named 'nltk'" when running your script.

Solution: NLTK is not installed. Run pip install nltk in your terminal/command prompt.

Problem: You see a "LookupError" saying a resource such as punkt was not found.

Solution: You need to download the NLTK data files. Run:

import nltk
nltk.download('punkt')

Or download all popular packages with: nltk.download('popular')

Problem: NLTK seems slow when processing large amounts of text.

Solutions:

  • Process smaller chunks of text at a time
  • Use more efficient libraries like spaCy for large-scale processing
  • Cache results that don't need to be recalculated
  • Expect the first run of NER or POS tagging in a session to be slow while the models load

Remember: VADER works best with social media text and modern language. For formal or literary text, results may be less accurate. Sarcasm and complex irony are still difficult for automated sentiment analysis.