The Rules & Regular Expressions Thread


Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Sun Mar 28, 2004 3:25 am

Here are a few RegExprs that I use to catch some spam and that shouldn't produce many false positives. Since I can't spend much time on this, I thought I'd post what I've got so far. I hope to expand on this and will post other successful rules as I come up with or find more. I'm also new to regular expressions, so some of these could be expanded or written in better ways, but they do work well.

Symbols and Accented Letters
(best used with languages that don't use accented letters, such as English)
For: To, From, CC, Subject, Body

[áàãåâåÂÅÁÂÀÄÃèëéêËÊÉÈïíìîÏÍÌÎñÑóòðÓÔÒÖøØùüûÜÚÛ¶§µ¥ßæÆм½¾ç±þ¡¿÷·ª¹²³¤«»]
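
For anyone who wants to try this class outside PopTray first, here is a minimal Python sketch. It assumes Python's re module treats a plain character class the same way PopTray's engine does, and the sample subjects are made up for the test.

```python
import re

# Character class copied from the rule above. Python's re module is only a
# stand-in for PopTray's own regex engine here (an assumption).
ACCENTED = re.compile(
    "[áàãåâåÂÅÁÂÀÄÃèëéêËÊÉÈïíìîÏÍÌÎñÑóòðÓÔÒÖøØùüûÜÚÛ¶§µ¥ßæÆм½¾ç±þ¡¿÷·ª¹²³¤«»]"
)


def has_accented_or_symbol(text):
    """Return True if any character from the class occurs in text."""
    return ACCENTED.search(text) is not None


# Hypothetical subjects, invented for this sketch.
spam_like = "Chéap mëds for yóu"
plain_english = "Meeting moved to 3pm"
spanish_name = "Señor García wrote to you"

for subject in (spam_like, plain_english, spanish_name):
    print(subject, "->", has_accented_or_symbol(subject))
```

The third subject shows why the rule is only suggested for mail in languages without accented letters: a perfectly normal Spanish name trips it too.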


Unusual strings common in some spam
(don't use for Body: the string 'iso-gb2312' is used in some normal HTML)
For: Subject, From, To, CC
(big5|atf\-8|windows\-1251|koi8|ISO\-|gb2312|us\-ascii)

Unusual strings common in some spam
(almost same as above but safe for Body)
Update WARNING: Some of this will flag on non-Western encoded languages and probably shouldn't be used if you may receive something encoded in these languages.
For: Body
(big5|atf\-8|windows\-1251|koi8|gb2312)
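
The difference between the header version and the Body version can be seen in a small Python sketch. It assumes Python's re with IGNORECASE is close enough to PopTray's case-insensitive matching; the patterns are written with balanced parentheses, and both sample strings are invented.

```python
import re

# Header rule (Subject, From, To, CC) and the Body-safe variant from above.
HEADER_RULE = re.compile(
    r"(big5|atf\-8|windows\-1251|koi8|ISO\-|gb2312|us\-ascii)", re.IGNORECASE
)
BODY_RULE = re.compile(r"(big5|atf\-8|windows\-1251|koi8|gb2312)", re.IGNORECASE)

# Hypothetical inputs: an ordinary HTML charset declaration and an encoded subject.
normal_html = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
encoded_subject = "=?koi8-r?B?8NLJ18XU?="

print("header rule on normal HTML:", bool(HEADER_RULE.search(normal_html)))
print("body rule on normal HTML:  ", bool(BODY_RULE.search(normal_html)))
print("header rule on subject:    ", bool(HEADER_RULE.search(encoded_subject)))
```

The header rule fires on the ordinary HTML line because of the 'ISO\-' alternative, which is why it should stay out of the Body area.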

Fonts too small to read
(Vitoco helped me with this one, even if he doesn't know it :wink: )
For: Body
font[^>]+size[^>](\-\d|1px)
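
A quick check of what this pattern does and does not catch, as a Python sketch. Python's re stands in for PopTray's engine (an assumption), and the HTML snippets are invented.

```python
import re

TINY_FONT = re.compile(r"font[^>]+size[^>](\-\d|1px)", re.IGNORECASE)

# Hypothetical HTML fragments, made up for this sketch.
samples = {
    "html_negative": "<font size=-2>hidden text</font>",
    "css_1px": '<span style="font-size:1px">hidden text</span>',
    "html_normal": "<font size=3>normal text</font>",
    "html_quoted": '<font size="-2">hidden text</font>',
}

results = {name: bool(TINY_FONT.search(html)) for name, html in samples.items()}
for name, hit in results.items():
    print(name, "->", hit)
```

As written, the pattern allows exactly one character between "size" and the value, so a quoted value such as size="-2" slips through.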

This will cause some false positives so use with caution!!!!! (pun intended)
If used with a good whitelist, false positives should be kept to a minimum.

7 or more non-alphanumeric characters in a row
For: Subject

\W{7,}
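
Here is a small Python sketch of this rule. Keep in mind that what counts as a "word" character can differ between regex engines, so Python's \W is only an approximation of PopTray's; the subjects are invented.

```python
import re

NON_ALNUM_RUN = re.compile(r"\W{7,}")

# Hypothetical subjects, made up for this sketch.
shouting = "Buy now!!!!!!!"
ordinary = "Re: lunch on Friday?"
decorated = "Status ======== report"

for subject in (shouting, ordinary, decorated):
    print(repr(subject), "->", bool(NON_ALNUM_RUN.search(subject)))
```

The last subject is the kind of false positive to expect: a legitimate sender who likes to decorate subjects with a row of symbols.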



That's it for now. I urge you to criticize these if you see something wrong or something that could be done better.

Enjoy Rdsok
Last edited by Rdsok on Sat Apr 24, 2004 2:14 am, edited 1 time in total.

Guest

1 old "dirty trick"

Post by Guest » Tue Apr 20, 2004 7:35 pm

This example isn't "pure" regexp, but I have found it very useful and, as you'll see, regexps are needed for it.

1. The main idea is that the (user)name of your mailbox must not be equal to your nickname, first or full name, or a combination of them. For example, my favorite nickname is Roman ShaRP, and my oldest mailbox is "determinator@mail.ru".
2. 'Determinator' is not a correct word at all, but many spammers use it in the "To" field (maybe they think that it is my nickname, heh). So, for filtering purposes I'm using the regexp
[Dd][Ee][Tt][Ee][Rr][Mm][Ii][Nn][Aa][Tt][Oo][Rr][^@]
It catches any use of this word in any letter case, but skips the email address.

vitoco
Veteran
Posts: 422
Joined: Wed Jul 09, 2003 9:22 pm
Location: Chile

Re: 1 old "dirty trick"

Post by vitoco » Wed Apr 21, 2004 2:31 am

Roman ShaRP wrote: So, for filtering purposes I'm using the regexp
[Dd][Ee][Tt][Ee][Rr][Mm][Ii][Nn][Aa][Tt][Oo][Rr][^@]
It catches any use of this word in any letter case, but skips the email address.

This regexp will also work for you:

determinator[^@]
because RegExpr test is CASE INSENSITIVE :-)

++V
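
The equivalence of the two forms can be checked with a short Python sketch. Here re.IGNORECASE plays the role of PopTray's case-insensitive test (an assumption about how closely the two engines agree), and the sample header lines are invented.

```python
import re

# Spelled-out version from the earlier post, matched case-sensitively.
SPELLED_OUT = re.compile(r"[Dd][Ee][Tt][Ee][Rr][Mm][Ii][Nn][Aa][Tt][Oo][Rr][^@]")
# Short version, matched case-insensitively as PopTray is said to do.
SHORT = re.compile(r"determinator[^@]", re.IGNORECASE)

# Hypothetical "To" lines, made up for this sketch.
samples = [
    "To: Determinator <determinator@mail.ru>",  # name used as if it were a nickname
    "To: determinator@mail.ru",                 # bare address only
    "To: DETERMINATOR, you won!",               # upper case
    "Hello Determinator",                       # word at the very end of the text
]

results = [(bool(SPELLED_OUT.search(s)), bool(SHORT.search(s))) for s in samples]
for sample, pair in zip(samples, results):
    print(sample, "->", pair)
```

Both patterns agree on every sample. The last one is worth noticing: since [^@] must consume a character, the word is not matched when it is the very last thing in the text.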

Roman ShaRP
First Timer
Posts: 4
Joined: Tue Apr 20, 2004 6:21 pm
Location: Ukraine

Re: 1 old "dirty trick"

Post by Roman ShaRP » Wed Apr 21, 2004 6:58 pm

vitoco wrote: because RegExpr test is CASE INSENSITIVE :-)
You should add that in PopTray, Master :wink: You know, it's a little bit unusual ... but very useful for filtering, I see :)

I should add something too. I think this filtering "regexp trick" could be more useful if you check the whole "Header" with it, not only the "To" field.
Knowledge itself is a power.

vitoco
Veteran
Posts: 422
Joined: Wed Jul 09, 2003 9:22 pm
Location: Chile

Re: 1 old "dirty trick"

Post by vitoco » Fri Apr 23, 2004 2:30 am

Roman ShaRP wrote: I think this filtering "regexp trick" could be more useful if you check the whole "Header" with it, not only the "To" field.

Well, just select Header instead of To in the area combo. :o

++V

Borgtex
Groupie
Posts: 52
Joined: Mon Mar 08, 2004 1:32 pm

Post by Borgtex » Fri Apr 23, 2004 5:13 pm

Rdsok wrote: Fonts too small to read
(Vitoco helped me with this one, even if he doesn't know it :wink: )
For: Body
font[^>]+size[^>](\-\d|1px)

I use this one for that:
font-size:\s*((\-[^(p|>)]*)|1)\s*(pt|px)

Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Fri Apr 23, 2004 5:38 pm

That is a good point: the 'pt' values should be filtered too. I've used them so rarely that I'd forgotten about them. I'll also look up and try to catch the very rarely used 'em' value.

I would like to point out that filtering on 'font-size:' only catches the CSS style of font sizing, whereas 'font[^>]+size[^>]' has caught both the CSS and HTML styles of font sizing that I tested against.

I'll post an updated version later that hopefully will catch all three values 'px', 'pt' and 'em' (when I find out what a small 'em' size is). :D
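
The coverage difference between the two font rules can be compared side by side in Python. This is only a sketch: Python's re stands in for PopTray's engine, and the HTML fragments are invented.

```python
import re

# The two patterns posted in this thread.
RDSOK = re.compile(r"font[^>]+size[^>](\-\d|1px)", re.IGNORECASE)
BORGTEX = re.compile(r"font-size:\s*((\-[^(p|>)]*)|1)\s*(pt|px)", re.IGNORECASE)

# Hypothetical HTML fragments, made up for this sketch.
samples = {
    "html_negative": "<font size=-2>tiny</font>",
    "css_1px": '<span style="font-size:1px">tiny</span>',
    "css_1pt": '<span style="font-size:1pt">tiny</span>',
    "html_normal": "<font size=3>normal</font>",
}

coverage = {
    name: (bool(RDSOK.search(html)), bool(BORGTEX.search(html)))
    for name, html in samples.items()
}
for name, (rdsok_hit, borgtex_hit) in coverage.items():
    print(f"{name:14} Rdsok={rdsok_hit} Borgtex={borgtex_hit}")
```

On these samples each rule catches something the other misses: the first one sees the HTML-style size=-2, the second one sees the 'pt' unit.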

Roman ShaRP
First Timer
Posts: 4
Joined: Tue Apr 20, 2004 6:21 pm
Location: Ukraine

Post by Roman ShaRP » Sat Apr 24, 2004 1:19 am

Rdsok wrote: Unusual strings common in some spam
(don't use for Body: the string 'iso-gb2312' is used in some normal HTML)
For: Subject, From, To, CC
(big5|atf\-8|windows\-1251|koi8|ISO\-|gb2312|us\-ascii)

What is "atf-8"? Maybe "utf-8"?

Rdsok wrote: Unusual strings common in some spam
(almost same as above but safe for Body)
For: Body
(big5|atf\-8|windows\-1251|koi8|gb2312)

In the ex-USSR (Russia, Ukraine, others...) "windows-1251" and "koi8-r" (or "koi8-u") are not unusual; they are the most popular (Cyrillic) e-mail encodings. In these encodings (or code pages) "pure English" letters don't change their positions (and utf-8 leaves them alone, too), so I can write you a letter in English in koi8-r and you can read it even if you don't have support for it.

I know, ex-USSR spammers are active and ugly guys, but really, use those filters with caution if you expect to receive something from ex-USSR people.
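
The point about English letters keeping their positions is easy to verify in Python, whose standard codecs include koi8_r and cp1251 (the Python name for windows-1251). The sample strings are arbitrary.

```python
# Plain English text encodes to the same bytes in ASCII, KOI8-R,
# windows-1251 and UTF-8, so it stays readable without Cyrillic support.
english = "Hello, this is plain English text."
encoded = {name: english.encode(name) for name in ("ascii", "koi8_r", "cp1251", "utf_8")}

same_bytes = len(set(encoded.values())) == 1
print("English bytes identical in all four encodings:", same_bytes)

# Cyrillic text, by contrast, gets different bytes in each code page.
cyrillic = "Привет"
differs = cyrillic.encode("koi8_r") != cyrillic.encode("cp1251")
print("Cyrillic bytes differ between koi8-r and windows-1251:", differs)
```

So a message can carry a koi8-r or windows-1251 label and still be ordinary English, which is exactly the kind of mail these rules would flag.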

Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Sat Apr 24, 2004 2:39 am

Roman,

To answer your question about 'atf-8': yes, that spelling is correct and it was not an error. I tested against over 4400 spam emails and actually found that exact string in well over 10% of them. The only drawback is that spammers will change tactics eventually, and you will probably find fewer and fewer instances of this or, for that matter, any string.

Thank you very much for pointing out the language encoding issues. When I posted that I wasn't even considering other languages. Sometimes I forget that this is a global community and I of all people should know better. :oops: I have already updated that post to include a warning about that.

Correct me if this is also wrong, but I think it should still be safe to look for encoding info like this in the subject line (I found it very often in the spam I was testing against), since I don't believe it would normally be found in a subject line no matter what language you view your email in.

Again, my apologies for my oversight; I promise not to let it happen again. (I hope). :D

--------------------------------------

I dislike all spammers equally no matter what language they speak, and I didn't intend to alienate anyone because of the language they speak. I've always seen any person as just a person; it just seems that the leaders/politicians of the world have a problem doing the same thing.

Borgtex
Groupie
Posts: 52
Joined: Mon Mar 08, 2004 1:32 pm

Exclude words?

Post by Borgtex » Mon Apr 26, 2004 5:14 pm

I'm working on a regexp to filter this spammer trick for fooling Bayesian filters:

<tag>dfgfds, fdgfds, ryrty, wqre, sfds, uiouio, adsa, bgfbf, saxsa, refre, gtred, sdsfs, khkhk, pfghgf, sfsrwwe, sfsdf,hfgdr uyiuy pojuijo fdtr iuhuk niuhi niuh yuguy uyuy mij uuig igy iug uyg</any tag>


and I ended up with this:
(([>,\.\s]+[bcdfghjklmnpqrstvwxyz]{3,}[,\.\s]+)[^</\d]*){5,}

It looks for at least five words that have at least 3 chars and no vowels.

the problem is that a sentence like:

"There are a lot of extensions nowadays: jpg and png for graphics, htm or html for web pages, also php and asp (active server pages) bmp,bitmap..."

gives a false positive. Although the possibility of finding a sentence like that is, with the help of protection rules and the white list, almost negligible, I would like to find a way to exclude words. The idea is something like [^(php|htm|html|png|bmp)], but that doesn't work, of course (a character class only excludes single characters, not whole words). Is there any solution, or can it just not be done?
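
In regex engines that support negative lookahead, whole words can be excluded with (?!...). Whether PopTray's engine supports that syntax is not confirmed in this thread, so treat the following Python sketch as an illustration of the idea only. The names ORIGINAL and WITH_EXCLUSIONS are made up here, and the exclusion list is the one from the post.

```python
import re

CONSONANT_WORD = r"[bcdfghjklmnpqrstvwxyz]{3,}"

# The rule as posted.
ORIGINAL = re.compile(
    r"(([>,\.\s]+" + CONSONANT_WORD + r"[,\.\s]+)[^</\d]*){5,}", re.IGNORECASE
)

# Hypothetical variant: a negative lookahead rejects the listed whole words
# before the consonant run is allowed to match.
EXCLUDED = r"(?!(?:php|html?|png|bmp)\b)"
WITH_EXCLUSIONS = re.compile(
    r"(([>,\.\s]+" + EXCLUDED + CONSONANT_WORD + r"[,\.\s]+)[^</\d]*){5,}",
    re.IGNORECASE,
)

spam = (
    "<tag>dfgfds, fdgfds, ryrty, wqre, sfds, uiouio, adsa, bgfbf, saxsa, refre, "
    "gtred, sdsfs, khkhk, pfghgf, sfsrwwe, sfsdf,hfgdr uyiuy pojuijo fdtr iuhuk "
    "niuhi niuh yuguy uyuy mij uuig igy iug uyg</any tag>"
)
legit = (
    "There are a lot of extensions nowadays: jpg and png for graphics, htm or html "
    "for web pages, also php and asp (active server pages) bmp,bitmap..."
)

for name, text in (("spam", spam), ("legit", legit)):
    print(name, bool(ORIGINAL.search(text)), bool(WITH_EXCLUSIONS.search(text)))
```

With the exclusions in place the sentence about file extensions no longer reaches five vowel-free words, while the gibberish sample is still caught.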

Roman ShaRP
First Timer
Posts: 4
Joined: Tue Apr 20, 2004 6:21 pm
Location: Ukraine

Post by Roman ShaRP » Fri Apr 30, 2004 8:02 am

Rdsok wrote: Correct me if this is also wrong, but I think it should still be safe to look for encoding info like this in the subject line (I found it very often in the spam I was testing against), since I don't believe it would normally be found in a subject line no matter what language you view your email in.

Believe me, it can. In fact, almost all messages from my friends contain this string in the subject. The "X-mailer" field tells me that Microsoft Outlook 5 and 6 can produce messages of this kind.
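
This is presumably about MIME encoded words: a subject with non-ASCII text is sent in a form that carries the charset name inside the raw header. A minimal sketch with Python's standard email.header module shows the effect; the subject text is invented, and the rule is the header pattern from earlier in the thread with balanced parentheses.

```python
import re
from email.header import Header, decode_header

HEADER_RULE = re.compile(
    r"(big5|atf\-8|windows\-1251|koi8|ISO\-|gb2312|us\-ascii)", re.IGNORECASE
)

# A perfectly legitimate Russian subject ("Hello"), encoded for transport.
raw_subject = Header("Привет", "koi8-r").encode()
print("raw subject header:", raw_subject)
print("rule fires on it:", bool(HEADER_RULE.search(raw_subject)))

# Decoding gives back the original text, so nothing about it is spam-like.
payload, charset = decode_header(raw_subject)[0]
decoded = payload.decode(charset)
print("decoded:", decoded)

# A plain ASCII subject carries no charset label at all.
print("rule fires on ASCII subject:", bool(HEADER_RULE.search("Hello there")))
```

If the rule is tested against the raw subject line, any mail with a Cyrillic subject would be flagged.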

Thank you very much for your kindness, but I think I should tell you something about our country, people and politics (later, in a private message, OK?).

Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Fri Apr 30, 2004 8:54 pm

Roman,

Thanks again for the information. Feel free to PM me anytime you wish, that also goes for anyone on this forum.

Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Sat May 08, 2004 6:41 pm

Just posting this to add it to the list
henchard wrote:I like the idea of being able to use regular expressions to help combat spam. Also I'm sure that I would need to sit down and think about this quite hard before being able to write some good ones.

In the course of looking for something else about Forte Agent I came across this archived page of expressions written to weed out Spam in Forte Agent

http://web.archive.org/web/200304082343 ... M#V1.03.01

and just wondered if some of these expressions might be usable (and how much amendment might be necessary) in PopTray? Perhaps Vitoco (or someone more knowledgeable than me) could comment as to whether some of them would translate OK for use in PopTray? Or is there too much of a difference between the syntaxes?

Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Sat May 08, 2004 11:10 pm

I've seen several spams missing the Subject and/or From fields in the headers lately. These rules can catch them.

Header | Contains | Subject: (NOT)
Header | Contains | From: (NOT)

PS: this won't catch the ones with just a blank Subject or From. The message must actually be missing the field completely for these to fire.

I've got an idea or two to try to catch the blank ones without giving false positives, but I will have to test them further. Most variations I've tried so far give false positives for me. After that I'll just have to catch the ones that contain only spaces, again without giving a false positive.
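
The three cases (field present, field blank, field missing) can be told apart outside PopTray with Python's standard email parser. This is only a sketch of the distinction, not a PopTray rule; header_state is a made-up helper name and the messages are invented.

```python
from email import message_from_string


def header_state(raw_message, field):
    """Classify a header field as 'missing', 'blank' or 'present'."""
    value = message_from_string(raw_message)[field]
    if value is None:
        return "missing"      # the field does not appear in the headers at all
    if value.strip() == "":
        return "blank"        # the field is there but empty or only spaces
    return "present"


# Hypothetical messages, made up for this sketch.
COMPLETE = "From: alice@example.com\nSubject: Hello\n\nBody text\n"
BLANK = "From: alice@example.com\nSubject:    \n\nBody text\n"
MISSING = "From: alice@example.com\n\nBody text\n"

for name, raw in (("complete", COMPLETE), ("blank", BLANK), ("missing", MISSING)):
    print(name, "->", header_state(raw, "Subject"))
```

Only the "missing" case corresponds to the two rules above; the "blank" case is the one that still needs a separate rule.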

henchard
Still here
Posts: 19
Joined: Fri Apr 16, 2004 9:54 am
Location: Dorset UK

Some Spam Rules for comment/adaption?

Post by henchard » Thu May 13, 2004 12:14 pm

Please excuse my inexperience with regular expressions (very much a beginner)! However, I've lifted these from elsewhere and am trying them out in PopTray. They may be useful to others to filter spam. I'm not saying I fully understand them or know how useful they will prove, so I would suggest trying them with caution (i.e. mark as spam rather than delete).

Since first writing this I've started testing them and it seems that the outside {} brackets shouldn't be there, or should be replaced by () in many cases (told you I didn't know much!). I still feel they may be useful to those with more knowledge than me! When I've tested them I may post an amended list.

Subject
{\b([fuc]((<[^>]*>)|(-))?){3}k}

Subject
{\b([se]((<[^>]*>)|(-))?){2}x}

Subject
{\b([puss]((<[^>]*>)|(-))?){4}y}

Subject
{\b([vi1|!a@gr]((<[^>]*>)|(-)|(\s))?){5}(a|@)}

Subject
{\b([peni1|!]((<[^>]*>)|(-))?){4}s}

Subject
{\b([si!1|ldena@fi]((<[^>]*>)|(-))?){9}l}

Subject
{\b([manhoo]((<[^>]*>)|(-))?){6}d}

Subject
{\b([te]((<[^>]*>)|(-))?){3}n}

Subject
{\b([hardcor]((<[^>]*>)|(-))?){7}e}

Subject
{\b([herba]((<[^>]*>)|(-)|(\s))?){5}l}

Subject
{\b([sp]((<[^>]*>)|(-)|(\s))?){2}y}

Subject
{\b([checku]((<[^>]*>)|(-)|(\s))?){6}p}

Subject
{\b([medicatio]((<[^>]*>)|(-)|(\s))?){9}n}

Subject
{\b([prescriptio]((<[^>]*>)|(-)|(\s))?){11}n}

Subject
{\b([VP\-R]((<[^>]*>)|(-)|(\s))?){4}X}

Subject
{\b([Vicodi]((<[^>]*>)|(-)|(\s))?){6}n}

Subject
{\bfuc?k'?n'?suc?k}

Body
{(([xz])([^xz]{0,300})){20}[xz]}

Body
Text20={http\:\/\/(s?rd|click\.shopping)\.yahoo\.com\/(.*?)\/[?*]{1,2}http\:}

Subject
{[Ss]ildenafil [Cc][i1]trate}

Subject
{[Vv][iI1!]cod[i1!I]n|Xenical|BustPro|Zoloft|Ultram|Celebrex|Levitra|Xanax|alprazolam}

Body
{a href\S+href\=http\:\/\/\S+\shref\=}

Subject
{[vV].?[i1I!|lXí].?[a@A].?[gG].?[rR].?[a@A]}

Body
{(?s)[^\s>]<!--.{0,64}?-->[^\s<]}

Subject
{\?.*!|!.*\?}

Body
{(?s)(<\!\-\-[a-z0-9\[\]]{8,14}\-\->.*?){19,}.*?(<\!\-\-[a-z0-9\[\]]{8,14}\-\->)}

Body
{<\!\-\-random\-\->}

Body
{\<[a-z]([a-z0-9])*\>[a-zA-Z0-9 \-]{1,10}\<[a-z]([a-z0-9])*\>[a-zA-Z0-9 \-]{1,10}<[a-z]([a-z0-9])*\>}

Borgtex
Groupie
Posts: 52
Joined: Mon Mar 08, 2004 1:32 pm

Re: Some Spam Rules for comment/adaption?

Post by Borgtex » Fri May 14, 2004 4:30 pm

henchard wrote:Please excuse my inexperience with Regular expressions (very much a beginner)!. However, I've lifted these from elsewhere and am trying them out in Poptray. They may be useful to others to filter spam. I'm not saying I fully understand them! or how useful they will prove - so would suggest try with caution (i.e. mark as spam rather than delete).

Most of those rules don't seem to make much sense, and they can also be dangerous to use; e.g., [hardcor] doesn't mean "find that word", but "find any of those letters", so it matches a lonely "a" or "r" or "o" at each position, not the whole word.

Better read vitoco's guide: viewtopic.php?t=1626

henchard
Still here
Posts: 19
Joined: Fri Apr 16, 2004 9:54 am
Location: Dorset UK

Spam Rules for Comment/adaption

Post by henchard » Fri May 14, 2004 8:37 pm

You will note that I acknowledged my inexperience at the start and made these available as perhaps being of help. I am currently testing these and others.


\b([hardcor]((<[^>]*>)|(-))?){7}e

catches the hidden word hardcore (e.g. !hardcore!) as well as things like

hardcore porn
hardcore sex
hardcore filth

etc. try it yourself at http://www.roblocher.com/technotes/regexp.aspx

some of the others are for specific spammer tricks eg spammers who use a lot of comments.

(([xz])([^xz]{0,300})){20}[xz]

for example is designed to find filler garbage with lots of z's and x's in the body

As I said, when I've tested them I'll see about posting more info, but IMHO they have given me some useful ideas for starting to write and adapt regular expressions.

Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Fri May 14, 2004 10:28 pm

I think what he was trying to get across to you is that, since in your examples you used '[]' instead of '()' around the 'word', you will get a match as long as there are 7 letters in a row that each match one of those inside the '[]', followed by the 'e' at the end.

Examples (not perfect ones, but they work as examples):
cardoore as well as aaaaaaae would fire the rule

Converting the '[]' to '()' makes the letters match as a literal word, but the '{7}' then has to be dropped too (otherwise the whole group would have to repeat seven times), like..
\b(hardcor)((<[^>]*>)|(-))?e

There are several of the examples you had given that have those types of problems that could produce a false positive with the rules. I doubt that he tested them all for that, he was just pointing out some of the shortcomings that do exist. I've made similar errors like that myself, so don't feel that you're the only one doing that. :D We all try to work together to hopefully catch some of our oversights.
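
The examples from this post can be checked with a short Python sketch (Python's re with IGNORECASE stands in for PopTray's engine, which is an assumption). CLASS_RULE is the rule as posted, GROUP_RULE is a variant that swaps in a literal group but keeps the {7}, and SPELLED_RULE is a hypothetical pattern, not from the thread, that spells the word letter by letter with an optional tag or hyphen after each one.

```python
import re

CLASS_RULE = re.compile(r"\b([hardcor]((<[^>]*>)|(-))?){7}e", re.IGNORECASE)
GROUP_RULE = re.compile(r"\b((hardcor)((<[^>]*>)|(-))?){7}e", re.IGNORECASE)

# Optional separator: an HTML tag or a hyphen (non-capturing group).
SEP = r"(?:<[^>]*>|-)?"
# Builds h SEP a SEP r SEP d SEP c SEP o SEP r SEP e.
SPELLED_RULE = re.compile(r"\b" + SEP.join("hardcor") + SEP + "e", re.IGNORECASE)

words = ["hardcore", "h-a-r-d-c-o-r-e", "cardoore", "a" * 7 + "e"]

table = {
    word: (
        bool(CLASS_RULE.search(word)),
        bool(GROUP_RULE.search(word)),
        bool(SPELLED_RULE.search(word)),
    )
    for word in words
}
for word, (by_class, by_group, by_spelling) in table.items():
    print(f"{word:16} class={by_class} group={by_group} spelled={by_spelling}")
```

The character-class rule fires on all four inputs, the grouped variant with {7} fires on none of them (it would need "hardcor" seven times in a row), and the spelled-out pattern fires only on the two real spellings.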

Borgtex
Groupie
Posts: 52
Joined: Mon Mar 08, 2004 1:32 pm

Post by Borgtex » Sat May 15, 2004 2:45 am

Of course I never wanted to patronize you. I just wanted to make clear that the regexps you posted should not be used, because they can cause a lot of trouble, and people may then think that using regular expressions is a bad idea.

another example: \b([vi1!a@gr]((<[^>]*>)|(-))?){5}a

would, in fact, give a false positive with, e.g., gr@vianetworks.com

also, this rule: (([xz])([^xz]{0,300})){20}[xz]

will give a false positive with this text:

"Xenofex is a powerful plugin for photoshop that includes the following filters:
Xenofex distort, Xenofex Barrel, Xenofex Flag, Xenofex Scatter, Xenofex Zig Zag, Xenofex Zoom

And much other. You can order Xenofex by mail, by fax or by phone.
Whe can send the software to your zone if you want"
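
Both false positives can be reproduced in Python. The sketch assumes case-insensitive matching, as PopTray is said to use, with Python's re standing in for its engine; the sample text is the one quoted above.

```python
import re

VIAGRA_RULE = re.compile(r"\b([vi1!a@gr]((<[^>]*>)|(-))?){5}a", re.IGNORECASE)
XZ_RULE = re.compile(r"(([xz])([^xz]{0,300})){20}[xz]", re.IGNORECASE)

address = "gr@vianetworks.com"

sample = """Xenofex is a powerful plugin for photoshop that includes the following filters:
Xenofex distort, Xenofex Barrel, Xenofex Flag, Xenofex Scatter, Xenofex Zig Zag, Xenofex Zoom

And much other. You can order Xenofex by mail, by fax or by phone.
Whe can send the software to your zone if you want"""

# The address starts with five characters from the class, followed by an "a".
hit = VIAGRA_RULE.search(address)
print("address match:", hit.group(0) if hit else None)

# The rule needs 21 x/z characters, each at most 300 characters apart.
xz_count = sum(ch in "xz" for ch in sample.lower())
print("x/z characters in sample:", xz_count)
print("xz rule fires:", bool(XZ_RULE.search(sample)))
```

The sample text contains exactly the 21 x/z characters the rule asks for, so it fires on an ordinary product announcement.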

Rdsok
PopTray Family
Posts: 1416
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Sat May 15, 2004 5:32 am

Borgtex,

I'm jealous, your examples were better than mine... :cry: I guess that's what I get for trying to answer a posting and do work at the same time. :D
