TF-IDF Vectorizer: Turning Words into Meaningful Numbers
When dealing with text data, simply counting words (like CountVectorizer does) isn’t always enough. Some words like “the”, “is”, or “this” appear very frequently, but they don’t really help in understanding the meaning of the text. To solve this, we use TF-IDF (Term Frequency – Inverse Document Frequency).
What is TF-IDF?
TF-IDF is a method to convert text into numbers by giving more importance to rare but meaningful words and less importance to common words.
- TF (Term Frequency): how often a word appears in a document.
- IDF (Inverse Document Frequency): how rare a word is across all documents.
- TF × IDF: the final weight for a word. Common words get low scores; rare words get high scores.
Example (Step by Step)
Suppose we have 3 sentences:
- "I love my phone"
- "This phone has a great camera"
- "I love this camera"

Here's what happens:

- The word “phone” appears in two of the three sentences → it gets lower importance.
- The word “great” appears only once → it gets higher importance.
This way, TF-IDF highlights important keywords in documents.
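To make this concrete, the weights can be computed by hand. The sketch below uses scikit-learn's smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and a simplified tokenizer (lowercased words of two or more letters, mimicking the default behavior of dropping single-character tokens like “I” and “a”):

```python
import math

docs = [
    "I love my phone",
    "This phone has a great camera",
    "I love this camera",
]
# Lowercase and drop single-character tokens,
# roughly matching scikit-learn's default tokenization
tokenized = [[w for w in d.lower().split() if len(w) >= 2] for d in docs]
n = len(docs)

def idf(term):
    # Smoothed IDF, as used by TfidfVectorizer with smooth_idf=True
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(term, doc):
    # Raw (unnormalized) TF-IDF weight: count in this document × IDF
    return doc.count(term) * idf(term)

# In the second sentence, the rare word "great" outweighs the common word "phone"
print(round(tfidf("great", tokenized[1]), 3))  # 1.693
print(round(tfidf("phone", tokenized[1]), 3))  # 1.288
```

(The real `TfidfVectorizer` additionally L2-normalizes each document vector, but the ordering of the weights is the same.)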
Why Use TF-IDF?
- When to use: text classification, search engines, document similarity, sentiment analysis.
- Why use it: unlike CountVectorizer, it reduces the weight of common words and gives importance to unique, meaningful words.
- How to use it: replace `CountVectorizer` with `TfidfVectorizer` in your code.
TF-IDF in Python
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love my phone",
    "This phone has a great camera",
    "I love this camera"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform documents
X = vectorizer.fit_transform(documents)

# Show vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Show TF-IDF values
print("\nTF-IDF Matrix:\n", X.toarray())
```
Important Parameters of TF-IDF Vectorizer
1. max_features
- Limits the vocabulary to the most frequent words.
- Example: `max_features=5` keeps only the top 5 words (by frequency across the corpus).
2. ngram_range
- Considers sequences of words together (n-grams).
- Example: `ngram_range=(1, 2)` includes single words and pairs like “great camera”.
3. stop_words
- Removes common, low-information words.
- Example: `stop_words='english'` removes words like “the” and “is”.
4. min_df & max_df
- `min_df`: ignore words that appear in too few documents.
- `max_df`: ignore words that appear in too many documents.
- Example: `min_df=2` keeps only words that appear in at least 2 documents.
5. use_idf
- Whether to apply the IDF weighting at all.
- Example: `use_idf=False` keeps only term frequencies, making it similar to a normalized CountVectorizer.
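With `use_idf=False`, rarity no longer matters: every word that appears once in a document gets the same weight. A sketch on the sample sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love my phone", "This phone has a great camera", "I love this camera"]
vec = TfidfVectorizer(use_idf=False)
X = vec.fit_transform(docs).toarray()

# Every term in the second sentence appears exactly once,
# so all of its non-zero weights are equal (1/sqrt(5) after L2 normalization)
row = X[1]
print(row[row > 0])
```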
6. norm
- Normalizes each row vector, so longer documents don't get higher scores just because they contain more words.
- Options: `'l1'`, `'l2'` (the default), or `None`.
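One way to see the effect of the default `'l2'` norm: every document vector ends up with unit length, regardless of how long the document is.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love my phone", "This phone has a great camera", "I love this camera"]
X = TfidfVectorizer(norm='l2').fit_transform(docs).toarray()

# Each row vector has Euclidean length 1.0
print(np.linalg.norm(X, axis=1))
```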
7. smooth_idf
- Adds 1 to every document frequency, as if one extra document contained each term. This prevents division by zero for terms that appear in no training document (e.g., when using a fixed vocabulary).
- Default: `True`.
Comparison Table: CountVectorizer vs TfidfVectorizer
| Feature | CountVectorizer | TfidfVectorizer |
|---|---|---|
| Output | Word counts | Weighted scores |
| Handles common words | No | Yes (down-weights them) |
| Highlights unique words | No | Yes |
| Use case | Simple models | Search engines, classification |
Real-Life Analogy
Imagine a classroom discussion:
- Common words like “the” and “is” are spoken by everyone → less important.
- Unique terms like “neural networks” or “camera quality” are rare but meaningful → more important.
TF-IDF ensures we pay attention to these important terms.
Final Thoughts
TF-IDF is a powerful improvement over simple word counts. It’s widely used in:
- Document similarity (e.g., Google Search)
- Sentiment analysis
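Document similarity is a one-liner once the TF-IDF matrix exists. A small sketch, comparing the sample sentences with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["I love my phone", "This phone has a great camera", "I love this camera"]
X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between all documents (3x3 matrix, diagonal = 1.0)
sim = cosine_similarity(X)
print(sim.round(2))
```

The first and third sentences share the word “love”, so their similarity is non-zero even though they describe different things.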
If you’re working with text and want your model to focus on meaningful words rather than just frequent ones, TF-IDF Vectorizer is your go-to tool.