# Spam Filter (1) -- Data Resource & ROC Curve

This is my project of CS559 Machine Learning @Stevens 2013 Fall.

This project designs a spam filter based on the data set from the Discovery Challenge workshop at ECML/PKDD 2006 in Berlin.

## Abstract

- I used two basic machine learning techniques learned in class: **Naive Bayes** and **Logistic Regression**, and made some improvements based on practical implementation.
- Another improvement I used is **Self-Learning**, which augments the training set with the evaluation data.

## Introduction

This competition was held in Germany in 2006. I completed Task A of the challenge. The evaluation criterion is the AUC value, which I explain later on. The first-ranked team achieved an average AUC of 0.95 during the challenge, later improved to 0.98 after the challenge ended. I also achieved an average AUC above 0.95.

### Experiment Data

As an example, here is the first line of the first mail in task_labeled_train.tf:

```
1 9:3 94:1 109:1 163:1
```

This line represents a spam email (starting with class label “1”) with four words. The word ID of the first token is 9 and this word occurs 3 times within this email, indicated by “:3”.
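The format above can be read with a few lines of code. Here is a minimal sketch of a parser for the `label id:count` line format, assuming "1" marks spam as in the example (the exact ham label used by the data set is not shown here):

```python
def parse_line(line):
    """Parse one mail line into (label, {word_id: count})."""
    tokens = line.split()
    label = int(tokens[0])          # class label, e.g. 1 = spam
    counts = {}
    for token in tokens[1:]:
        word_id, count = token.split(":")
        counts[int(word_id)] = int(count)
    return label, counts

label, counts = parse_line("1 9:3 94:1 109:1 163:1")
# label is 1 (spam); counts maps word 9 to 3 occurrences, etc.
```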

### AUC & ROC

`ROC (Receiver Operating Characteristic)` is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. It is created by plotting the fraction of *true positives out of the total actual positives (TPR = true positive rate)* vs. the fraction of *false positives out of the total actual negatives (FPR = false positive rate)* at various threshold settings.

Figure 1: The four possible labels of a prediction result

Figure 2 gives a sample of ROC curves. The X-axis (1 − Specificity) is the `False Positive Rate = FP / (FP + TN)`; the Y-axis, called Sensitivity, is the `True Positive Rate = TP / (TP + FN)`.

Figure 2: Sample of ROC curves

`AUC (Area Under Curve)`: the AUC value is the area under the ROC curve. It lies between 0 and 1; a higher value means better performance.
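The TPR/FPR definitions above translate directly into code. Here is a minimal sketch that sweeps the threshold over sorted classifier scores and accumulates the AUC with the trapezoid rule; the labels and scores in the example call are made up, and tied scores are not handled specially:

```python
def roc_auc(labels, scores):
    """Compute AUC from binary labels (1 = positive) and classifier scores."""
    # Sort by score, highest first: lowering the threshold adds one example.
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    tp = fp = 0
    auc = 0.0
    prev_fpr = prev_tpr = 0.0
    for score, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        tpr = tp / pos                  # TP / (TP + FN)
        fpr = fp / neg                  # FP / (FP + TN)
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2   # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
    return auc

print(roc_auc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.2]))
```

A perfect classifier (every positive scored above every negative) gives AUC = 1.0, and random scoring gives about 0.5, which matches the intuition that higher AUC means better ranking of spam above ham.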

### The Two Types of Classifiers

There are two types of machine learning models: the `Discriminative Model (Conditional Model)` and the `Generative Model`.

A Discriminative Model models the conditional probability distribution P(y | x), which can be used directly to predict y from x. It includes Logistic Regression, SVM, etc.

A Generative Model specifies a joint probability distribution over observations and labels. It includes Naive Bayes, etc.

I have used Naive Bayes (NB) and Logistic Regression (LR) in my project.
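As a generative-model example, here is a minimal sketch of multinomial Naive Bayes on the `{word_id: count}` mail representation described earlier. The Laplace smoothing constant, the vocabulary size, and the tiny training set are illustrative assumptions, not the project's exact implementation:

```python
import math
from collections import defaultdict

def train_nb(mails, vocab_size, alpha=1.0):
    """mails: list of (label, {word_id: count}). Returns per-class log tables."""
    class_totals = defaultdict(int)                       # words per class
    word_totals = defaultdict(lambda: defaultdict(int))   # word counts per class
    class_docs = defaultdict(int)                         # documents per class
    for label, counts in mails:
        class_docs[label] += 1
        for w, c in counts.items():
            word_totals[label][w] += c
            class_totals[label] += c
    model = {}
    for label in class_docs:
        log_prior = math.log(class_docs[label] / len(mails))
        denom = class_totals[label] + alpha * vocab_size  # Laplace smoothing
        log_like = {w: math.log((word_totals[label][w] + alpha) / denom)
                    for w in word_totals[label]}
        log_unseen = math.log(alpha / denom)              # unseen-word fallback
        model[label] = (log_prior, log_like, log_unseen)
    return model

def predict_nb(model, counts):
    def score(label):
        log_prior, log_like, log_unseen = model[label]
        return log_prior + sum(c * log_like.get(w, log_unseen)
                               for w, c in counts.items())
    return max(model, key=score)

model = train_nb([(1, {9: 3, 94: 1}), (0, {5: 2, 7: 1})], vocab_size=200)
print(predict_nb(model, {9: 2}))   # word 9 only appears in the spam class
```

Working in log space avoids floating-point underflow when a mail contains many words, which matters on real mail data even though this toy example would survive without it.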

Author: Renjie Weng

Project repository: GitHub Pages

SlideShow

Formal report: PDF & PPT