What is Data Snooping?

Data snooping happens when information from the test set (or from future, unseen data) leaks into the training process. The model then appears to perform better than it really does and underperforms on genuinely new data.
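
One common way this leak happens is through preprocessing. The sketch below is a toy illustration, not a full pipeline: the data values and the helper `center` are made up for this example. It contrasts a statistic computed over all rows with one computed over the training rows only.

```python
from statistics import mean

def center(values, reference):
    """Subtract the mean of `reference` from every value (hypothetical helper)."""
    mu = mean(reference)
    return [v - mu for v in values]

# Made-up toy data: the last two points play the role of the held-out test set.
data = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
train, test = data[:4], data[4:]

# Snooped: the mean is computed over train AND test rows,
# so the test values influence what the model sees during training.
leaky_train = center(train, reference=data)

# Proper: only the training rows determine the statistic,
# and the same training statistic is reused on the test rows.
clean_train = center(train, reference=train)
clean_test = center(test, reference=train)

print("leaky train:", leaky_train)
print("clean train:", clean_train)
print("clean test: ", clean_test)
```

The two versions of the training data differ only because the test rows were allowed to shift the mean, which is exactly the kind of dependence that makes a later test score untrustworthy.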


Options Analysis:

  1. Leads to biased estimation on test sets

    • Correct → Since the test set is no longer independent, performance metrics become overly optimistic.

  2. Increases the risk of false positives

    • Correct → The model fits patterns that do not generalize (often chance patterns in the data it has peeked at), which leads to more false discoveries.

  3. Leads to better estimation on training sets

    • Wrong → Snooping does not improve how well training performance is estimated; training accuracy is often high regardless. The damage is to the test estimate, which becomes biased.

  4. Reduces the risk of false positives

    • Wrong → It actually increases the risk.
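
The two correct options can be seen together in a small simulation. The sketch below is an assumption-laden toy, not a benchmark: labels and features are pure random noise, the "model" is simply "predict the label from one binary feature", and the function name `run_experiment` and all sizes are chosen for illustration. Since nothing here is truly predictive, any feature that looks good is a false positive.

```python
import random

def accuracy(feature, labels):
    """Fraction of rows where the binary feature equals the binary label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)

def run_experiment(n_features=200, n_train=40, n_test=40, seed=0):
    rng = random.Random(seed)
    n = n_train + n_test
    # Labels and features are independent coin flips: there is no real signal.
    labels = [rng.randint(0, 1) for _ in range(n)]
    features = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_features)]
    train, test = slice(0, n_train), slice(n_train, n)

    # Honest: choose the best feature on the training rows, score it on the test rows.
    honest_pick = max(range(n_features),
                      key=lambda j: accuracy(features[j][train], labels[train]))
    honest_acc = accuracy(features[honest_pick][test], labels[test])

    # Snooped: choose the feature by looking at the test rows themselves.
    snooped_pick = max(range(n_features),
                       key=lambda j: accuracy(features[j][test], labels[test]))
    snooped_acc = accuracy(features[snooped_pick][test], labels[test])
    return honest_acc, snooped_acc

honest_acc, snooped_acc = run_experiment()
print(f"honest test accuracy:  {honest_acc:.2f}")
print(f"snooped test accuracy: {snooped_acc:.2f}")
```

By construction the snooped score can never be lower than the honest one, because it is the maximum over all features on the very rows it is scored on. With noise data the true accuracy of every feature is 50%, so whatever the snooped score exceeds that by is optimism bias in the test estimate, and the selected feature is a false discovery.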


✅ Correct Answers:

  • Leads to biased estimation on test sets

  • Increases the risk of false positives


