Bayesian filtering

Renier
Site Admin
Posts: 1957
Joined: Mon Oct 15, 2001 12:54 pm
Location: Cape Town, South-Africa

Bayesian filtering

Post by Renier » Mon Sep 08, 2003 10:38 am

This weekend I decided to spend my time on something related to PopTray 3.1 instead of PopTray 3.0 (although I also worked on a new beta).

I wrote my own Bayesian filter object in Delphi. I've implemented the technique described by Paul Graham in his Plan for Spam article.
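For reference, the scoring rule from Graham's article can be written roughly as below. The symbols are chosen here for illustration and are not from the thread: g_w and b_w are the number of occurrences of token w in the good and spam corpora, and N_g and N_b are the number of good and spam messages. Whether the Delphi object follows this exactly is not stated in the post.

```latex
p(w) = \max\!\left(0.01,\ \min\!\left(0.99,\
       \frac{\min(1,\, b_w / N_b)}
            {\min(1,\, 2 g_w / N_g) + \min(1,\, b_w / N_b)}\right)\right)

P(\text{spam}) = \frac{\prod_{i=1}^{15} p(w_i)}
                      {\prod_{i=1}^{15} p(w_i) + \prod_{i=1}^{15} \bigl(1 - p(w_i)\bigr)}
```

In the article, the 15 tokens w_i are the ones whose p(w) lies farthest from 0.5, tokens with 2 g_w + b_w < 5 are ignored, a token never seen before gets 0.4, and a message is treated as spam when P(spam) exceeds 0.9.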

My object seems to work quite well. I break up the message into tokens and save statistics for these tokens. I also have an option to trim HTML comments from the original message. I've trained it using 1000 emails (500 good and 500 spam), and the results seem to be pretty good. On my slow PC I can score about 100 emails per second, so performance seems fine.
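The steps described above (tokenize, optionally trim HTML comments, keep per-token statistics, then score) could be sketched like this in Python. This is only an illustration following Graham's article, not the actual Delphi object; the class name, method names and token pattern are all assumptions.

```python
import re
from collections import Counter

# Assumed token pattern: letters, digits, dashes, apostrophes and dollar signs.
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def tokenize(text, strip_comments=True):
    """Split a message into lowercase tokens, optionally trimming HTML comments."""
    if strip_comments:
        text = HTML_COMMENT.sub("", text)
    return [t.lower() for t in TOKEN.findall(text)]


class BayesFilter:
    """Hypothetical filter object holding per-token statistics."""

    def __init__(self):
        self.good = Counter()   # token -> occurrences in good mail
        self.bad = Counter()    # token -> occurrences in spam
        self.ngood = 0          # number of good messages trained
        self.nbad = 0           # number of spam messages trained

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.bad.update(tokens)
            self.nbad += 1
        else:
            self.good.update(tokens)
            self.ngood += 1

    def token_prob(self, token):
        g = 2 * self.good[token]        # good counts are doubled to bias against false positives
        b = self.bad[token]
        if g + b < 5:
            return 0.4                  # too little evidence: treat as unknown
        good_ratio = min(1.0, g / max(self.ngood, 1))
        bad_ratio = min(1.0, b / max(self.nbad, 1))
        return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))

    def score(self, text):
        """Combine the 15 most 'interesting' token probabilities."""
        probs = sorted((self.token_prob(t) for t in set(tokenize(text))),
                       key=lambda p: abs(p - 0.5), reverse=True)[:15]
        prod, inv = 1.0, 1.0
        for p in probs:
            prod *= p
            inv *= 1.0 - p
        return prod / (prod + inv)
```

Trimming comments matters because a spammer can write `V<!-- x -->iagra`; with the comment removed first, the tokenizer sees the single token `viagra` again.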

I still need to investigate some more on the Internet to see what other optimizations and tricks people have come up with since the original article. If you have any good links, please let me know.

The idea is to create a Bayesian plugin for PopTray 3.1 so people don't need another program like K9 to do the filtering. The PopTray 3.0 user interface already has the necessary "Mark/Unmark as Spam" buttons, so it should fit in with the existing PopTray UI.

Comments?

homaquebec
PopTray Family
Posts: 913
Joined: Tue May 27, 2003 6:47 pm
Location: Québec (Canada)

Re: Bayesian filtering

Post by homaquebec » Mon Sep 08, 2003 2:11 pm

Renier wrote: I wrote my own Bayesian filter object in Delphi. I've implemented the technique mentioned by Paul Graham in his Plan for Spam article.
[...]
Comments?
That makes me very, very happy. If you succeed with this plugin (which I don't doubt), it will be a great step in PopTray development! Spammers will have to watch out!
Last edited by homaquebec on Mon Sep 08, 2003 6:29 pm, edited 1 time in total.

quosego
Guru
Posts: 219
Joined: Mon Oct 15, 2001 11:42 pm
Location: The Netherlands

Post by quosego » Mon Sep 08, 2003 2:56 pm

Great idea!
If you manage to equal K9 (and why shouldn't you?), PopTray would be an extremely powerful tool for checking mail and filtering spam.

But why the plugin approach? I thought plugins were meant to give parties other than the original author an opportunity to make extensions to a program.

I think that a Bayesian spam filter in PT will move the focus from a mail-checking application to a spam-fighting application.
And therefore it will probably attract many new users.
The current PT design in combination with a Bayesian plugin will mean that a user who is interested in filtering and rating email has to find these functions in 3 different places:
Option tab for whitelists and blacklists
Rules tab for rules
Plugin for Bayesian characteristics.
This might be a problem for new users (or people who rate shareware/freeware :) )

As a suggestion for the Bayesian plugin, I would like to mention that I am very much interested in the gathered information on which an email's classification is based. K9, for example, is not very open in this matter; a lot of things happen out of my sight. Maybe you can provide more information to the users of the PT plugin.

Post by Renier » Mon Sep 08, 2003 3:06 pm

I'm doing things in plugins, because people don't want PopTray to grow too complicated with too many features. Maybe Bayesian filtering isn't important to a lot of people, and they don't have to install the plugin then.

I guess a question arises. What is PopTray's main function? It was originally written as a mail notifier, but it seems most people on this forum are using it as a spam fighting tool.

I could start a poll on the subject, but I'm not sure if the forum users are a good indication of all PopTray users. PopTray gets about 10,000 downloads every month, and we are only a small number of active people on the forum.

Curtz
Priceless
Posts: 552
Joined: Tue Nov 27, 2001 3:52 am
Location: A nice tree

Post by Curtz » Mon Sep 08, 2003 3:10 pm

I learned one thing at the university: let specialists take care of everything!

Renier made the best POP3 checker; keep up the good work!

K9 etc. are the best spam fighting tools. Let THEM code spam fighting tools.

Perhaps Renier could add support for these tools:

"Configure this account for K9", "Configure this account for SpamAssassin" etc.

But of course, if Renier makes a brilliant plugin... then I should just shut up! :)
Last edited by Curtz on Mon Sep 08, 2003 3:24 pm, edited 1 time in total.

Post by Renier » Mon Sep 08, 2003 3:22 pm

By that definition I should remove a lot of the rules, blacklist etc. features from PopTray.

vitoco
Veteran
Posts: 422
Joined: Wed Jul 09, 2003 9:22 pm
Location: Chile

Post by vitoco » Mon Sep 08, 2003 4:08 pm

Renier wrote:I'm doing things in plugins, because people don't want PopTray to grow too complicated with too many features. Maybe Bayesian filtering isn't important to a lot of people, and they don't have to install the plugin then.
I agree with this. My ISP parses all incoming mail with SpamAssassin (Bayesian), and messages with a high probability of being spam are changed in three ways:

1. *****SPAM***** is inserted in the subject
2. Explanation headers are inserted
3. The original offending mail is placed as an attachment and a warning is left as the message body.

Then a simple rule based on point 1 lets PopTray identify spam. I don't need a Bayesian filter on my PC. Another advantage of having the ISP run the Bayesian filter is that it can process mail (and spam) from all available accounts, so it can identify new spam just because it receives many copies of it, one for each account.
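The "simple rule" on the tagged subject amounts to a substring check on the Subject header. A minimal Python sketch of that idea follows; it is not how PopTray's rule engine is implemented, and the function name is made up for illustration.

```python
from email import message_from_string

SPAM_TAG = "*****SPAM*****"  # the marker the ISP's SpamAssassin inserts (point 1 above)


def is_tagged_spam(raw_headers, tag=SPAM_TAG):
    """Return True if the Subject header carries the server-side spam tag."""
    msg = message_from_string(raw_headers)
    subject = msg["Subject"] or ""   # a message may have no Subject at all
    return tag in subject
```

Because only the headers are needed, a check like this works without downloading the message body.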

I have another question about plugins: will a Bayesian filter plugin work with IMAP4 or webmail plugins? I have never tried to program Windows apps, so I'm not sure how they work. Is there a protocol? :?
Renier also wrote:I guess a question arises. What is PopTray's main function? It was originally written as a mail notifier, but it seems most people on this forum are using it as a spam fighting tool.
I found PT while I was looking for an email notifier. It got my attention because of its simplicity and the ability to preview messages and save attachments. Then I liked the way I could ignore messages or be warned about important ones. But I think that writing emails (or just sending acks), formatting and displaying HTML, and doing other tricky processing should not be in the core of PT (whereas a "Stop parsing following rules" rule option could be, to give PT a way to manage rule dependencies :twisted: ).
Renier at last wrote:I could start a poll on the subject, but I'm not sure if the forum users are a good indication of all PopTray users. PopTray gets about 10,000 downloads every month, and we are only a small number of active people on the forum.
But I think that we are doing quite well. 8)

:arrow: :arrow: :arrow: Spanish (Chile).ptlang for beta 12 is available at http://www.inf.utfsm.cl/~vparada/poptray/spanish.html

++Vitoco

Post by Renier » Mon Sep 08, 2003 4:21 pm

vitoco wrote:My ISP parses all incoming mail with SpamAssassin
I would really recommend reading the Paul Graham article mentioned above (and its follow-up, Better Bayesian Filtering) about the problems with using network-based filters instead of client-side ones. Also, if something becomes too popular then it can fail. Some spammers now run their spam through SpamAssassin before sending, to make sure it isn't detected as spam.
vitoco wrote:Will a bayesian filter plugin work with IMAP4 or webmail plugins?
Yes. The filtering will happen in PopTray, so it doesn't matter how you get the messages into PopTray. You would need to switch on the GetBody option though, otherwise the Bayesian filtering will only happen on the headers.

Using an extra program like K9 or SpamAssassin of course adds extra overhead to the whole checking process too.

I'm not a big fan of using spam blacklists (like SpamPal does). A few times now my redirecting mail server has ended up on one of these lists, and because my mail server bounces mail that comes from a blacklisted server, the mail would never reach me. Because I use a re-direct server, that would mean ALL my mail would never reach me. Very annoying.

Using a client filter means I can decide whether something is spam or not, instead of someone at the ISP making that decision, which could be wrong. ISP filters also don't usually have a whitelist feature.

Of course spammers will also find ways around Bayesian filters when they become very popular.

Post by vitoco » Mon Sep 08, 2003 5:42 pm

Renier wrote:if something becomes too popular then it can fail.
Sure, and that is one reason why I requested that stop-parsing-rules option, so I could catch some SA false positives before the following *****SPAM***** rule ignores them. :twisted:
Renier wrote:Some spammers now run their spam through SpamAssassin before sending, to make sure it isn't detected as spam.
I've received a whole spam message inside a JPG or GIF attachment, whatever the attachment was named (not an HTML message pointing to an image somewhere). Obviously it was not detected by SA.
Renier wrote:The filtering will happen in PopTray, so it doesn't matter how you get the messages into PopTray. You would need to switch on the GetBody option though, otherwise the Bayesian filtering will only happen on the headers.
Not good news for me. I wanted a fast new-email notifier. I do not want to download a whole mail to check if it's spam and then ignore it. With the SOBIG virus on the net, the PT stars will be rotating almost forever. I'm not downloading email to my PC at work, just trying to pick out the mails that are very important to me ASAP.
Renier wrote:Using a client filter means I can decide whether something is spam or not, instead of someone at the ISP making that decision, which could be wrong.
Well, my ISP is not really an ISP, just my undergraduate university. And it allows me to teach SA what is and is not spam in my own mailbox. I can also set up rules for a whitelist. :D BTW, trimming HTML comments is a good idea, Renier.

:idea: One more thing: I made a request some weeks ago to change the spam icons from red to blue, black or grey, as red is used for important things: exclamation, shine, red envelope.

++V

Post by Renier » Mon Sep 08, 2003 6:34 pm

vitoco wrote:I request that stop parsing rules option
You don't have to mention this in EVERY MESSAGE YOU POST. I heard you the first time.
vitoco wrote:I do not want to download a whole mail to check if it's spam and then ignore it. With SOBIG virus on the net, PT stars will be rotating almost forever.
That is why I have the extra size setting. Mine is set at 20KB, and I've yet to see a spam bigger than that, but it doesn't try to download the SoBig worms. And if you don't like the Bayesian plugin then it's easy to just not install it.
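The size setting described above can be pictured as a simple threshold: small messages are fetched whole so the filter sees the body, large ones are fetched as headers only. Below is a minimal Python sketch using the standard `poplib` calls `retr` and `top`; the function names and the way the limit is passed are assumptions for illustration, not PopTray's actual code.

```python
import poplib  # standard library; `conn` below is expected to be a poplib.POP3 object

MAX_BODY_BYTES = 20 * 1024  # the 20KB limit mentioned in the post


def wants_full_download(size_bytes, limit=MAX_BODY_BYTES):
    """Only messages at or below the limit are downloaded completely."""
    return size_bytes <= limit


def fetch_for_scoring(conn, msg_num, size_bytes, limit=MAX_BODY_BYTES):
    """Fetch the whole message if it is small, otherwise just its headers."""
    if wants_full_download(size_bytes, limit):
        _resp, lines, _octets = conn.retr(msg_num)    # full message
    else:
        _resp, lines, _octets = conn.top(msg_num, 0)  # headers plus 0 body lines
    return b"\r\n".join(lines)
```

The message sizes would come from the server's LIST response (`conn.list()` in `poplib`), so the decision can be made before anything is downloaded. Raising the second argument of `top` fetches that many body lines, which is one way to score part of a body without retrieving all of it.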

Post by Curtz » Mon Sep 08, 2003 6:37 pm

Renier wrote:By that definition I should remove a lot of the rules, blacklist etc. features from PopTray.
Not really. There is a little more work and understanding of the technology involved in making and maintaining a Bayesian plugin, but if I am mistaken, no problem! :)

But of course it would be best to make it as a plug-in. I would probably keep K9 running, as I like specialized tools (and THEIR further development and debugging) the most. Initially I was going to install PopFile, to make it also work as a tool in Outlook, but I am waiting for a real interface... :evil:

But generally I like to use specialized tools when it comes to technology. I don't know if K9 is the best, but for now it certainly was the smallest tool with an okay interface.

Waiting for PopFile to improve :(

My biggest wish for PopTray is... multi-thread it! And boy you must hate such a request... I suppose that then you'll have to re-code PopTray? :)

:)

ComputerBob
Guru
Posts: 278
Joined: Sat Jun 14, 2003 5:27 pm
Location: The Gulf Coast of the Sunshine State, USA

Post by ComputerBob » Wed Sep 10, 2003 9:33 pm

Renier wrote:I guess a question arises. What is PopTray's main function? It was originally written as a mail notifier, but it seems most people on this forum are using it as a spam fighting tool.
I'm a long, long-term PT user who first began using PT as a simple mail notifier, but who, because of the obscene amount of spam nowadays (pun intended), has come to count on PT as a spam-fighting tool. :?
ComputerBob - Making Geek-Speak Chic™
http://www.computerbob.com
One Of The Largest One-Person Sites On The Web
With Tons of Information, Software, Help, and Fun

Post by homaquebec » Thu Sep 11, 2003 1:08 am

ComputerBob wrote:I'm a long, long-term PT user who first began using PT as a simple mail notifier, but who, because of the obscene amount of spam nowadays (pun intended), has come to count on PT as a spam-fighting tool. :?
I found PT when I was looking to replace ICQ as a mail notifier, but I fully agree with you on the rest.

For some weeks now, almost all the spam I receive has been detected. Thanks PT, Renier and others. :P

SeViR

Post by SeViR » Fri Oct 10, 2003 12:18 pm

One important thing: how many lines does the plugin need to download to apply the Bayesian technique? If you download complete mails, you need to download them twice, once for PopTray and once for your mail client.

CJ

Feedback on Bayesian Filtering plugin

Post by CJ » Sat Oct 25, 2003 6:21 pm

Hi, Renier -

Firstly, congratulations on PopTray 3. Fabulous software.

I currently use Spambayes POP3Proxy together with PopTray 3, but would be quite interested in seeing Bayesian filtering/classification added to PopTray - as long as the implementation worked as well as Spambayes currently does. I'd therefore be much happier if that functionality was added as an optional plugin (that I can opt out of!)

I am using Spambayes to add a spam classification header to all mail it proxies, leaving the job of deleting/filtering/moving mail based on its classification header to the MUA. Could PopTray do the same, or would the classification only be used within the PopTray interface to delete spam from the server prior to retrieval? I'd like other options besides simply deleting mail classified as spam from the server. For example, it's useful to have a sample collection of spam (I think it's called a corpus) to be able to quickly retrain the Bayesian filter.
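The header-tagging approach described here, where a proxy labels each message and leaves the action to the mail client, might look roughly like this in Python. The header name and the cutoff values are hypothetical placeholders, not Spambayes' real settings.

```python
from email import message_from_string

HEADER = "X-Spam-Classification"  # hypothetical header name, chosen for this sketch


def tag_message(raw_message, score, spam_cutoff=0.9, ham_cutoff=0.2):
    """Add a classification header based on a spam score between 0 and 1."""
    if score >= spam_cutoff:
        label = "spam"
    elif score <= ham_cutoff:
        label = "ham"
    else:
        label = "unsure"
    msg = message_from_string(raw_message)
    del msg[HEADER]          # drop any header a sender may have forged; no error if absent
    msg[HEADER] = label
    return msg.as_string()
```

A mail client rule can then move or delete messages by matching on that header, and messages tagged as spam can be kept in a folder as a corpus for retraining.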

As for tips on how to best implement the filter, I think you'd be hard-pressed to find better algorithms than the ones used in Spambayes. It's open-source Python, if that's of any use.

CJ

Post by Renier » Mon Oct 27, 2003 9:22 am

Since PopTray is not a POP3 proxy it isn't possible to give new headers through to your mail client.

Guest

Post by Guest » Mon Oct 27, 2003 10:20 am

Renier wrote:Since PopTray is not a POP3 proxy it isn't possible to give new headers through to your mail client.
Yeah, I figured. Oh well, I guess I'll stick with Spambayes in that case.

CJ

ilNebbioso
PopTray Family
Posts: 773
Joined: Fri Feb 01, 2002 10:30 am
Location: Milan, Italy

Re: Bayesian filtering

Post by ilNebbioso » Mon Oct 27, 2003 10:57 pm

Renier wrote:Comments?
These are my two cents: I hope I'm not saying something silly or obvious... In such cases: sorry in advance!! (this is because your link was too technical for me).

The goals could be:
* to have a white/black list for addresses (this is already present in the main app);
* to have a bannedwords.txt/.ini (one word per line, possibly with a ";" for remarks) in order to give the user the possibility to customize the list or directly import one from other apps (I could share my mailserver spam .ini file :wink: );
* (very obvious?) to have the possibility to use addresses found inside HTML tags (i.e. mailto or href) as banned words;
* to have the filter working on headers/subjects (my mailserver spam filter doesn't work on the subject! :o );
* to have the possibility to decode base64 content before filtering: a lot of spammers actually send messages in base64 encoding, so normal filtering doesn't apply! :evil:
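A few of the points in the list above can be sketched in Python: reading a banned-words file with ";" remarks, pulling addresses out of mailto links, and decoding base64 before matching. The file format details and all function names here are assumptions made for illustration.

```python
import base64
import re

MAILTO = re.compile(r"mailto:([^\"'\s>?]+)", re.IGNORECASE)


def load_banned_words(text):
    """Parse one word per line; anything after ';' is a remark."""
    words = []
    for line in text.splitlines():
        word = line.split(";", 1)[0].strip()
        if word:
            words.append(word.lower())
    return words


def extract_mailto_addresses(html):
    """Collect addresses used in mailto: links so they can be checked too."""
    return [addr.lower() for addr in MAILTO.findall(html)]


def decode_body(body, transfer_encoding):
    """Undo base64 transfer encoding so the filter sees the real text."""
    if transfer_encoding.lower() == "base64":
        return base64.b64decode(body).decode("utf-8", errors="replace")
    return body


def find_banned(body, transfer_encoding, banned):
    text = decode_body(body, transfer_encoding).lower()
    return [word for word in banned if word in text]
```

Without the decoding step, a base64 body is just a run of letters and digits, so neither a banned-word list nor a Bayesian tokenizer would find anything meaningful in it.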

mig

SpamPal and Bayesian filter

Post by mig » Wed Mar 03, 2004 9:51 am

Sorry for my English...
I propose that the excellent PopTray cooperate better with SpamPal, which has a Bayesian filter. Right now I can only see what SpamPal (and its Bayesian filter) did when I preview a message: there I see the subject as modified by SpamPal (with the word SPAM added). In the normal view I see the original subject, so I can't tell what the message is. I would rather delete spam on the server with PopTray (the best!) than download the spam to my computer and filter it there.
My proposal: allow the subject modified by SpamPal to be shown in the mail window.
I salute the author of PopTray :!:

Alec_Burgess
Still here
Posts: 18
Joined: Thu Feb 27, 2003 6:04 am

Re: Bayesian filtering

Post by Alec_Burgess » Mon Mar 08, 2004 2:11 am

Renier wrote: I still need to investigate some more on the Internet to see what other optimizations and tricks people have come up with since the original article. If you have any good links, please let me know.
from the author of Popfile:
John Graham-Cumming : The Spammers' Compendium http://www.jgc.org/tsc/index.htm

A list of the various tricks currently in use by spammers. Most if not all are things that Popfile's Bayesian analysis either catches already with training or has been tweaked to handle.

FWIW: I use ISP --> Popfile --> PopTray for notifications primarily based on the bucket assignments that Popfile has made followed by .... ISP --> Popfile --> Outlook Express to dump SPAM (not a big problem for me fortunately) and to move email from INBOX to subject related sub-INBOXES.
Regards ... Alec
------------------
Win2K SP4 Poptray 3.1 beta 7 RC (PTLV-009a)
now
WinXP SP2 Poptray 3.2 beta 5 RC1
