Ranger's Way

Spam Filter (1) -- Data Resource & ROC Curve

January 21, 2014

This is my project for CS559 Machine Learning at Stevens, Fall 2013.

This project designs a spam filter based on the data set from the Discovery Challenge workshop at ECML/PKDD 2006 in Berlin.

Abstract

I used two basic machine learning techniques learned in class: Naive Bayes and Logistic Regression.
I also made some improvements based on practical implementation.
Another improvement I used is Self-Learning, which augments the training set with evaluation data.

Introduction

This competition was held in Germany in 2006. I completed Task A of the challenge. The evaluation criterion is the AUC value, which I explain below. The first-ranked team achieved an average AUC of 0.95 during the challenge, updated to as high as 0.98 after the challenge. I also reached an average AUC above 0.95.

Experiment Data

As an example of a mail, this is the first line of the first mail in task_labeled_train.tf:

1 9:3 94:1 109:1 163:1

This line represents a spam email (starting with class label “1”) with four words. The word ID of the first token is 9 and this word occurs 3 times within this email, indicated by “:3”.
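The format described above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_line` is my own choice for illustration, and it assumes every token after the label has the form `word_id:count`.

```python
def parse_line(line):
    """Parse one 'label id:count id:count ...' line into (label, {word_id: count})."""
    fields = line.split()
    label = int(fields[0])          # class label, e.g. 1 for spam
    counts = {}
    for token in fields[1:]:
        word_id, count = token.split(":")
        counts[int(word_id)] = int(count)
    return label, counts


# The sample line quoted above.
label, counts = parse_line("1 9:3 94:1 109:1 163:1")
print(label, counts)
```

A dictionary keeps only the words that actually occur in the mail, which suits this kind of sparse bag-of-words data.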

AUC & ROC

ROC (Receiver Operating Characteristic) is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the total actual positives (TPR = true positive rate) vs. the fraction of false positives out of the total actual negatives (FPR = false positive rate), at various threshold settings.

Figure 1: The four possible outcomes of a prediction

Figure 2 shows sample ROC curves. The X-axis is the false positive rate, FPR = FP / (FP + TN), which equals 1 - Specificity; the Y-axis is the true positive rate, also called Sensitivity, TPR = TP / (TP + FN).

Figure 2: Sample of ROC curves
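To make the threshold sweep concrete, here is a small sketch that computes the (FPR, TPR) points of a ROC curve from labels and classifier scores. The function name `roc_points` and the toy labels and scores are invented for illustration; label 1 is treated as the positive (spam) class and any other label as negative.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points of the ROC curve, from (0, 0) to (1, 1)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    # Lowering the threshold step by step is the same as walking through
    # the examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


# Toy data, made up for this example.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```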

AUC (Area Under Curve): The AUC value is the area under the ROC curve. It lies between 0 and 1, and a higher value means better performance.
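One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is only an illustration of that idea, not necessarily how the challenge computed it; the function name `auc` and the toy data are made up.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data, made up for this example.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))
```

The double loop is quadratic in the number of examples, which is fine for a small illustration; a sort-based version would be the better choice for a full inbox.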

The Two Types of Classifiers

Machine learning classifiers fall into two types: Discriminative Models (Conditional Models) and Generative Models.

A Discriminative Model models the conditional probability distribution P(y | x), which can be used for predicting y from x. Examples include Logistic Regression, SVM, etc.

A Generative Model specifies a joint probability distribution P(x, y) over observations and labels. Examples include Naive Bayes, etc.
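The two views are connected: once a generative model gives the joint distribution, the conditional P(y | x) that a discriminative model learns directly can be recovered by normalizing over the labels. The sketch below shows this on a toy joint table over a single binary feature; all the probabilities, and the names "present", "absent", "spam" and "ham", are invented for illustration.

```python
# Hypothetical joint distribution P(x, y): x says whether one word is
# present in the mail, y is the label. The numbers are made up and sum to 1.
joint = {
    ("present", "spam"): 0.30,
    ("present", "ham"): 0.05,
    ("absent", "spam"): 0.20,
    ("absent", "ham"): 0.45,
}


def posterior(joint, x):
    """Derive P(y | x) from the joint: P(x, y) divided by the sum of P(x, y') over all y'."""
    evidence = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / evidence for (xi, y), p in joint.items() if xi == x}


post = posterior(joint, "present")
print(post)
```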

I have used Naive Bayes (NB) and Logistic Regression (LR) in my project.

Author: Renjie Weng
Project repository: GitHub Pages
SlideShow
Formal report: PDF & PPT

Category: Technology

Tags: ML, spam, probability

