TF-IDF Vectorizer: Turning Words into Meaningful Numbers

When dealing with text data, simply counting words (like CountVectorizer does) isn’t always enough. Some words like “the”, “is”, or “this” appear very frequently, but they don’t really help in understanding the meaning of the text. To solve this, we use TF-IDF (Term Frequency – Inverse Document Frequency).


What is TF-IDF?

TF-IDF is a method to convert text into numbers by giving more importance to rare but meaningful words and less importance to common words.

  • TF (Term Frequency): How often a word appears in a document.

  • IDF (Inverse Document Frequency): How rare a word is across all documents.

  • TF × IDF: Final weight for a word. Common words get low scores, rare words get high scores.
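The three bullets above can be sketched in a few lines of plain Python using the classic textbook formulas (note that scikit-learn, used later in this post, applies a smoothed variant of IDF, so its exact numbers will differ slightly):

```python
import math

# Toy corpus: the same sentences used later in this post, pre-tokenized
docs = [
    ["i", "love", "my", "phone"],
    ["this", "phone", "has", "a", "great", "camera"],
    ["i", "love", "this", "camera"],
]

def tf(term, doc):
    # Term frequency: how often the term occurs in this document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: terms in fewer documents score higher
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # Final weight: frequent-in-document AND rare-in-corpus terms win
    return tf(term, doc) * idf(term, docs)

# "great" appears in one document, "phone" in two,
# so "great" gets the larger weight in document 2
print(tf_idf("great", docs[1], docs))
print(tf_idf("phone", docs[1], docs))
```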


Example (Step by Step)

Suppose we have 3 sentences:

  1. "I love my phone"

  2. "This phone has a great camera"

  3. "I love this camera"

  • Word “phone” appears often → gets lower importance.

  • Word “great” appears once → gets higher importance.

This way, TF-IDF highlights important keywords in documents.


Why Use TF-IDF?

  • It down-weights common words that add little meaning, without removing them entirely.

  • It highlights rare but discriminative words that set one document apart from the rest.

  • It is a simple, unsupervised drop-in upgrade over raw word counts.

TF-IDF in Python

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love my phone",
    "This phone has a great camera",
    "I love this camera"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform documents
X = vectorizer.fit_transform(documents)

# Show vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Show TF-IDF values
print("\nTF-IDF Matrix:\n", X.toarray())

Important Parameters of TF-IDF Vectorizer

1. max_features

  • Limits the number of words in vocabulary.

  • Example: max_features=5 keeps only top 5 words.

2. ngram_range

  • Considers multiple words together (n-grams).

  • Example: ngram_range=(1,2) includes single words and pairs like “great camera”.

3. stop_words

  • Removes common meaningless words.

  • Example: stop_words='english' removes words like “the”, “is”.

4. min_df & max_df

  • min_df: Ignore words that appear in too few documents.

  • max_df: Ignore words that appear in too many documents.

  • Both accept a document count (int) or a fraction of the documents (float). Example: min_df=2 keeps only words that appear in at least 2 documents.

5. use_idf

  • Whether to apply the IDF weighting.

  • Example: use_idf=False keeps only the term frequencies (still normalized), so the output behaves like a length-normalized CountVectorizer.
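A small sketch: with use_idf=False, the three content words of the first sentence each occur once, so they all get exactly the same weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love my phone",
    "This phone has a great camera",
    "I love this camera",
]

# TF only (no IDF); rows are still L2-normalized by default
vectorizer = TfidfVectorizer(use_idf=False)
X = vectorizer.fit_transform(documents)
print(X.toarray()[0])  # equal weight for each word in "I love my phone"
```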

6. norm

  • Normalizes row values (so that long sentences don’t get higher scores).

  • Options: 'l1', 'l2', or None.
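Comparing the two settings side by side: with the default 'l2' every row has unit length, while norm=None keeps the raw tf-idf products.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love my phone",
    "This phone has a great camera",
    "I love this camera",
]

# 'l2' (default) scales each row to unit length; None keeps raw tf-idf values
l2 = TfidfVectorizer(norm="l2").fit_transform(documents)
raw = TfidfVectorizer(norm=None).fit_transform(documents)

print((l2.toarray()[0] ** 2).sum())   # 1.0 for every row
print((raw.toarray()[0] ** 2).sum())  # unscaled, generally not 1
```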

7. smooth_idf

  • Adds 1 to every document frequency, as if one extra document contained each term; this prevents zero divisions for terms that appear in no training document.

  • Default: True.


Comparison Table: CountVectorizer vs TfidfVectorizer

Feature                 | CountVectorizer | TfidfVectorizer
------------------------|-----------------|-------------------------------
Output                  | Word counts     | Weighted scores
Handles common words    | No              | Yes (down-weights them)
Highlights unique words | No              | Yes
Use case                | Simple models   | Search engines, classification

Real-Life Analogy

Imagine a classroom discussion:

  • Common words like “the”, “is” are spoken by everyone → less important.

  • Unique terms like “neural networks” or “camera quality” are rare but meaningful → more important.

TF-IDF ensures we pay attention to these important terms.


Final Thoughts

TF-IDF is a powerful improvement over simple word counts. It’s widely used in search engines, document ranking, text classification, and keyword extraction.

If you’re working with text and want your model to focus on meaningful words rather than just frequent ones, TF-IDF Vectorizer is your go-to tool.
