Spam Filter (1) -- Data Resource & ROC Curve

This is my project for CS559 Machine Learning @Stevens, 2013 Fall.

This project designs a spam filter based on the data set from the Discovery Challenge workshop at ECML/PKDD 2006 in Berlin.


  • I used two basic machine learning techniques learned in class: Naive Bayes and Logistic Regression.
  • I also made some improvements based on practical implementation.
  • Another improvement I used is Self-Learning, which augments the training set with evaluation data.


This competition was held in Germany in 2006. I completed Task A of the challenge. The evaluation criterion is the AUC value, which I explain later on. The first-ranked team achieved an average AUC of 0.95 during the challenge, updated to as high as 0.98 after the challenge. I also get an average AUC above 0.95.

Experiment Data

As an example of a mail, this is the first line of the first mail:

1 9:3 94:1 109:1 163:1

This line represents a spam email (starting with class label “1”) with four words. The word ID of the first token is 9 and this word occurs 3 times within this email, indicated by “:3”.
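The line format described above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_mail_line` is hypothetical, and it assumes every line is a class label followed by whitespace-separated `word_id:count` tokens, as in the example.

```python
from typing import Dict, Tuple


def parse_mail_line(line: str) -> Tuple[int, Dict[int, int]]:
    """Parse one mail of the form '<label> <word_id>:<count> ...'.

    Returns the class label and a mapping word ID -> occurrence count.
    """
    parts = line.split()
    label = int(parts[0])  # first field is the class label
    counts: Dict[int, int] = {}
    for token in parts[1:]:
        word_id, count = token.split(":")
        counts[int(word_id)] = int(count)
    return label, counts


label, counts = parse_mail_line("1 9:3 94:1 109:1 163:1")
print(label)      # class label of the mail
print(counts[9])  # how often word 9 occurs in the mail
```

Storing each mail as a dictionary keeps only the words that actually occur, which suits this kind of sparse bag-of-words data.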


ROC (Receiver Operating Characteristic) is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the total actual positives (TPR = true positive rate) vs. the fraction of false positives out of the total actual negatives (FPR = false positive rate), at various threshold settings.

Figure 1: The four possible labels of a prediction result

Figure 2 shows a sample of ROC curves. The X-axis is the false positive rate (equal to 1 − specificity), FPR = FP / (FP + TN); the Y-axis is the true positive rate (also called sensitivity), TPR = TP / (TP + FN).

Figure 2: Sample of ROC curves
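To make the construction of the curve concrete, here is a rough sketch of how the (FPR, TPR) points could be computed from classifier scores. The function name `roc_points` and the toy labels and scores are made up for illustration; the sketch assumes label 1 marks the positive (spam) class, that a higher score means "more likely spam", and that both classes are present.

```python
from typing import List, Tuple


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    # Visit the mails from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Mails with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


# Toy data, invented for this example.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2]
for fpr, tpr in roc_points(toy_labels, toy_scores):
    print(fpr, tpr)
```

Each distinct score acts as one threshold setting, so the curve always starts at (0, 0) and ends at (1, 1).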

AUC (Area Under Curve): The AUC value is the area under the ROC curve. It lies between 0 and 1, and a higher value means better performance.
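One way to compute the AUC without drawing the curve is the pairwise view: the area equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The sketch below follows that view; the function name `auc` and the toy data are illustrative only, and this is not necessarily how the challenge organizers computed their scores.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data, invented for this example: 5 of the 6 pairs are ranked correctly.
print(auc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2]))
```

The double loop is quadratic in the number of mails, which is fine for a small check; a sort-based version would be preferable on a large evaluation set.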

The Two Types of Classifiers

There are two types of machine learning classifiers: discriminative models (conditional models) and generative models.

A discriminative model models the conditional probability distribution P(y | x), which can be used to predict y from x. Examples include Logistic Regression, SVM, etc.

A generative model specifies a joint probability distribution over observations and labels. Examples include Naive Bayes, etc.

I have used Naive Bayes (NB) and Logistic Regression (LR) in my project.
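For reference, the two models can be written in their standard textbook forms. The notation is introduced here for illustration and is not taken from the project report: x = (x_1, ..., x_d) is the feature vector of a mail, y its label, and w, b are the weights and bias of the logistic regression.

```latex
% Discriminative: logistic regression models P(y | x) directly.
P(y = 1 \mid x) = \frac{1}{1 + \exp\!\big(-(w^{\top} x + b)\big)}

% Generative: Naive Bayes models the joint distribution P(x, y) = P(y)\,P(x \mid y),
% assuming the features are conditionally independent given the label.
P(x, y) = P(y) \prod_{j=1}^{d} P(x_j \mid y)

% Prediction with the generative model goes through Bayes' rule.
P(y \mid x) = \frac{P(y) \prod_{j=1}^{d} P(x_j \mid y)}
                   {\sum_{y'} P(y') \prod_{j=1}^{d} P(x_j \mid y')}
```

Both models end up producing a value of P(y | x) that can be thresholded, which is why the same ROC and AUC evaluation applies to each.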

Author: Renjie Weng
Project repository: GitHub Pages
Formal report: PDF & PPT