That regexp rule doesn't work with this message in PopTray

Borgtex
Groupie
Posts: 52
Joined: Mon Mar 08, 2004 1:32 pm

Post by Borgtex » Fri May 14, 2004 4:12 pm

that's the rule I use for porn mail:

(older|mature|lonely|teen|lovely|hot|hottest|young|amateur|College|drunk|beautiful|wild|cow|legal|moms)\s*.{0,10}(girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks)

and this is the source of the message:

Screw survivor, want to see some Ratex X reality shows?<br>
Check out triplxrealitytv, a hot new contest for all the amateur<br>
stars of the mature entertainment industry. Watch these girls<br>
strive to become a star among thousands, see them compete with <br>
each other for the top spot. See one go from amateur to queen.<br>
<br>

I tested it with TestRExp and it works fine; PopTray, however, fails to delete the message. Can anybody see a reason for that?

Rdsok
PopTray Family
Posts: 1419
Joined: Fri Mar 19, 2004 11:36 pm
Location: Norman, Oklahoma USA

Post by Rdsok » Fri May 14, 2004 5:19 pm

It's testing fine on my test account. I sent the message in 2 formats: the first one was text and the second was HTML (the <br> is an HTML line-break tag). You didn't mention which version of PopTray you were using, but I'm using PopTray 3.1 beta5 for testing here.

Here are a few possible reasons I can think of that would cause your rule not to fire. I'm trying to be complete, but there can be other reasons too, and some of these will seem obvious.

Make sure the email account you get the message from isn't whitelisted.
(I on purpose don't whitelist any of my email accounts so I can test with them.)

Make sure on the Advanced Options page you have Retrieve Body while Checking and that the area you are testing is the Body in the Rule.

Another obvious one: make sure you're using RegExpr as the criteria for checking the body.

Make sure you haven't selected the Important option on the Rule (I accidentally did this once :oops: )

Make sure you haven't selected the Protect against auto-delete (I know that is a 'duh' but after my previous accident who knows :D )

There are a few malformed headers that will mess up any rule, such as extra CR characters in the header; this one will show as a blank line in the header area. A NUL character in the header will also affect this, but then you also wouldn't be able to preview the message.

If you wish to test for a malformed email, just make sure you're not in your whitelist and email a test message to yourself with the body you mentioned.

I think that covers what I know will affect this. As a disclaimer for missing some, I haven't finished my morning coffee yet... :D

Hope that helps,
Rdsok

Post by Borgtex » Fri May 14, 2004 6:22 pm

Using Beta5. Also, all mails with things like "hot amateurs", "lovely vids" and all other possible combinations are correctly filtered & deleted, so the RegExp expression is applied. I'm able to preview the message from PopTray, so there is no malformed header or corruption, nor base64 encoding. And it is not protected in any way.

As you suggested, I sent the message to myself and it worked. It's very strange. Anyway, I'm posting the entire message (don't worry about the mail addresses, none of them are mine):

Return-Path: <Terrie@bar-fan.com>
Received: from arnet.arnetcafe.com.mx (201.129.231.125) by netmail.tiscali.es (6.7.018)
id 4083C9CC0050029B; Wed, 12 May 2004 08:38:19 +0200
Message-ID: <52870794077927.679dci80255ed@hotmail.com>
Received: from 68.104.112.104 by law8-qf07.law4.hotmail.com with DAV;
Wed, 12 May 2004 08:36:36 +0100
Reply-To: "Tracy Mack" <Lori@prey.co.uk>
From: "Tracy Mack" <Lori@gordon-bennett.co.uk>
To: <pachi69@tiscali.es>
Subject: oft taboo reality tv
Date: Wed, 12 May 2004 00:40:36 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="--95781726078460764756"

----95781726078460764756
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<head>
</head>

<body>
Screw survivor, want to see some Ratex X reality shows?<br>
Check out triplxrealitytv, a hot new contest for all the amateur<br>
stars of the mature entertainment industry. Watch these girls<br>
strive to become a star among thousands, see them compete with <br>
each other for the top spot. See one go from amateur to queen.<br>
<br>
<a href="http://ceres.95781726078460764756.meyer ... ty/">watch this thrilling rated x reality
show now!</a><br>
<br>
<a href="http://expound.95781726078460764756.jes ... .php">stop send me ads </a>
</body>
</html>

----95781726078460764756--


btw, thanks for your help
Last edited by Borgtex on Fri May 14, 2004 6:50 pm, edited 1 time in total.

Post by Rdsok » Fri May 14, 2004 6:47 pm

Borgtex,

I don't see anything odd in your cut & paste, so it sounds like something in the email in question is malformed. At this point, that is a big assumption. If you wish, I can attempt to find the problem. I'll need you to attach the email (not just forward it) and send it to poptray <at> oklahoma-isp <dot> net (please don't spam it). Just take out the extra spaces there and replace the appropriate symbols etc.

Send the email with 'For Rdsok to test' in the subject line somewhere. Also mention which post this refers to just in case my memory goes away. :D Then reply in this thread and I'll get it and check it out.

One additional bit of info. Even attaching the message may correct the corrupt area of the email. So don't get your hopes up. But sometimes I can see a hint just by looking at them and then I may be able to recreate it anyway. This is what happened when I found the CR problem, but I was still able to get lucky and figure it out with a few hints from Renier.

Post by Borgtex » Fri May 14, 2004 7:52 pm

Thanks Rdsok, I've already sent it to you

Post by Rdsok » Fri May 14, 2004 9:48 pm

I got the email, I'll try to check it out this evening and let you know if we get lucky.

Rdsok

Post by Rdsok » Sat May 15, 2004 5:21 am

Borgtex,

I'm wondering exactly what you meant by the '\s*.' part of your expression. To make it a bit easier to write, I'm going to shorten your expression a little to this

(word_group_1)\s*.{0,10}(word_group_2)

I can't get the rule to fire that way, but if I adjust it some to

(word_group_1)\s.*(word_group_2)

Basically, I've reversed the positions of the '.' and '*' (which makes '.*' shorthand for a wildcard that matches zero or more chars) and removed the {0,10} to make it work.

Since I'm not sure what you were attempting in that area, my changes may make quite a difference.

The way you wrote the rule, it wouldn't look past the new line that the <br> causes along with the hidden <CR><LF>. Basically, it was testing each line with the rule by itself, and your rule says that it requires at least one word from group 1, a whitespace char, and one word from group 2. Since the words 'amateur' and 'star' are on separate lines, the rule doesn't fire.

Let me know if that helps (or just confuses you as much as I am). :D

Rdsok
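
For readers who want to reproduce the line-break effect outside PopTray, here is a minimal sketch using Python's `re` module. That choice is an assumption made purely for illustration: PopTray has its own regular expression engine and may behave differently. In Python, `.` does not match a line feed unless `re.DOTALL` is set, so the original rule cannot bridge `amateur<br>` at the end of one line and `stars` at the start of the next.

```python
import re

# Word groups copied from the rule in the first post.
GROUP1 = ("older|mature|lonely|teen|lovely|hot|hottest|young|amateur|College|"
          "drunk|beautiful|wild|cow|legal|moms")
GROUP2 = "girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks"
RULE = rf"({GROUP1})\s*.{{0,10}}({GROUP2})"

# Message body as posted, with CR LF line endings after each <br>.
BODY = (
    "Screw survivor, want to see some Ratex X reality shows?<br>\r\n"
    "Check out triplxrealitytv, a hot new contest for all the amateur<br>\r\n"
    "stars of the mature entertainment industry. Watch these girls<br>\r\n"
    "strive to become a star among thousands, see them compete with <br>\r\n"
    "each other for the top spot. See one go from amateur to queen.<br>\r\n"
)

# Default mode: "." stops at the line feed, so nothing matches.
default_match = re.search(RULE, BODY)

# DOTALL mode: "." also matches the line feed, so "<br>\r\n" fits in {0,10}.
dotall_match = re.search(RULE, BODY, re.DOTALL)

print(default_match)
print(dotall_match.group(1), dotall_match.group(2))
```

Whether PopTray's engine treats `.` the same way is not established in this thread, so take this only as one plausible reading of the symptom.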

Post by Borgtex » Sat May 15, 2004 1:58 pm

It confuses me :wink:

the idea of the rule was to filter i.e.

amateur stars, as well as:

AmateurStars
amateur_stars
amateur          stars
amateur <b>stars</b>
amateur blonde stars
amateur&nbsp;stars
amateur           &nbsp;stars
and of course: amateur <br>stars

but not...
"amateur skiers where several injured while trying to imitate last hollywood's star Harrison Ford Movie"

the rule you suggested (word_group_1)\s.*(word_group_2) would give a false positive here.


To make it clear, my rule says: "one word from group 1" + "0 or more whitespaces" + "0 to 10 characters of any kind"+ "word from group 2". Note that the whitespaces and the "unidentified chars" are not required

I tested the rule with a Regular Expression test program, and it works fine. It's just that poptray does not want to do it :(
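
The intent described above can be checked with a short script. This is only a sketch in Python's `re` module (an assumption, not PopTray's engine), with case-insensitive matching assumed so that `AmateurStars` is covered. It runs the original rule and the `\s.*` variant against the listed samples and the skiers sentence.

```python
import re

GROUP1 = ("older|mature|lonely|teen|lovely|hot|hottest|young|amateur|College|"
          "drunk|beautiful|wild|cow|legal|moms")
GROUP2 = "girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks"

# Original rule: optional whitespace, then at most 10 arbitrary chars.
ORIGINAL = rf"({GROUP1})\s*.{{0,10}}({GROUP2})"
# Suggested variant: one whitespace char, then any number of chars.
LOOSE = rf"({GROUP1})\s.*({GROUP2})"

WANTED = [
    "AmateurStars",
    "amateur_stars",
    "amateur          stars",
    "amateur <b>stars</b>",
    "amateur blonde stars",
    "amateur&nbsp;stars",
    "amateur           &nbsp;stars",
    "amateur <br>stars",
]
UNWANTED = ("amateur skiers where several injured while trying to imitate "
            "last hollywood's star Harrison Ford Movie")

def fires(rule: str, text: str) -> bool:
    """Return True when the rule matches anywhere in the text."""
    return re.search(rule, text, re.IGNORECASE) is not None

for sample in WANTED:
    print(fires(ORIGINAL, sample), repr(sample))
print(fires(ORIGINAL, UNWANTED), fires(LOOSE, UNWANTED))
```

In this engine the original rule fires on every wanted sample and stays quiet on the skiers sentence, while the `\s.*` variant fires on it, which is the false positive mentioned above.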

Post by Rdsok » Sat May 15, 2004 4:30 pm

Ok, now I know what you were looking to do with your expression. I thought that may have been it, but I was reluctant to assume that and wanted you to tell me directly.

I'll get back to you, if I can get it to work, with a modified version trying to take into consideration your definition you just gave me. Until then.

Rdsok
Last edited by Rdsok on Sat May 15, 2004 8:32 pm, edited 1 time in total.

Post by Rdsok » Sat May 15, 2004 5:59 pm

Whew,

The trouble you are having has to do with the fact that HTML code is read and not ignored by PopTray, and in this example email the <br> (or, for that matter, &nbsp;) isn't considered whitespace the way it is in HTML.

Here is the code that will work using the HTML codes you've mentioned but not any other HTML at all.
(older|mature|lonely|teen|lovely|hot|hottest|young|amateur|College|drunk|beautiful|wild|cow|legal|moms)(\s|<br>|&nbsp|<b>|</b>){0,10}(girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks)
Basically, that is the hard way to do it. I was unable to quickly come up with a way to include anything that happens within '<>' tags, like say '<.*>', but there is probably a way to do it. I may play more with it, but for now I'll leave it up to you to devise the <wildcard> tag that will work for you. :D

PS Don't forget the old spammer trick of creating gibberish HTML tags like this one: '<hsidylkl>'. You can't see it in the HTML page, but PopTray will see it.
Last edited by Rdsok on Sat May 15, 2004 8:33 pm, edited 1 time in total.
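
A quick way to see what the explicit tag list does and does not cover is to try it on a few strings. The sketch below uses Python's `re` module, which is an assumption for illustration and not PopTray's engine. One detail shows up in this engine at least: because the rule lists `&nbsp` without the trailing semicolon, a properly terminated `&nbsp;` leaves a `;` that none of the alternatives can consume.

```python
import re

GROUP1 = ("older|mature|lonely|teen|lovely|hot|hottest|young|amateur|College|"
          "drunk|beautiful|wild|cow|legal|moms")
GROUP2 = "girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks"

# The rule from the post above: only whitespace and a fixed list of tags
# may appear between the two words, up to 10 times.
EXPLICIT = rf"({GROUP1})(\s|<br>|&nbsp|<b>|</b>){{0,10}}({GROUP2})"

def fires(text: str) -> bool:
    """Return True when the explicit-tag rule matches anywhere in the text."""
    return re.search(EXPLICIT, text) is not None

print(fires("amateur<br>\r\nstars"))      # listed tag plus CR LF
print(fires("amateur <b>stars</b>"))      # listed tag
print(fires("amateur <hsidylkl>stars"))   # gibberish tag, not in the list
print(fires("amateur&nbspstars"))         # entity without semicolon
print(fires("amateur&nbsp;stars"))        # entity with semicolon
```

If the semicolon case matters, writing the alternative as `&nbsp;?` is one possible adjustment; that is a suggestion, not something tested in PopTray here.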

Post by Rdsok » Sat May 15, 2004 6:07 pm

Borgtex,

One other thing to consider. In your example you're trying to catch the words 'amateur' and 'star'. One of the other spammer tricks will also make this break. Here is an example.

am<jibberish>ate<code>ur st<inbetween>ar

Again, HTML will ignore the nonsense code and just display the two words, but PopTray won't ignore it and the rule won't fire.

The only consolation here is that it won't create a false positive. :D

Rdsok
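
For anyone filtering mail with their own script rather than with PopTray rules, one way around the split-word trick is to strip the tags before matching. The sketch below is a different technique from anything PopTray is described as doing in this thread; the `strip_tags` helper is hypothetical, and Python's `re` module is assumed for illustration only.

```python
import re

GROUP1 = ("older|mature|lonely|teen|lovely|hot|hottest|young|amateur|College|"
          "drunk|beautiful|wild|cow|legal|moms")
GROUP2 = "girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks"
RULE = rf"({GROUP1})\s*.{{0,10}}({GROUP2})"

# Anything between "<" and the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    """Hypothetical helper: delete tag-like runs, keep the visible text."""
    return TAG.sub("", html)

raw = "am<jibberish>ate<code>ur st<inbetween>ar"
visible = strip_tags(raw)

print(repr(visible))
print(re.search(RULE, raw) is not None)      # rule applied to the raw source
print(re.search(RULE, visible) is not None)  # rule applied to the stripped text
```

This crude stripping ignores comments, scripts and entities, so it is a starting point rather than a full HTML-to-text conversion.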

Post by Rdsok » Sat May 15, 2004 10:00 pm

Borgtex,

I think I found a way to create a wildcard so it will include any HTML code tags used between the 2 word groups. It is still different than yours though. This will fire the rule if there are any number of whitespaces, &nbsp or <html tags or jibberish> for zero or more times but it won't fire if any other word is used that isn't in the 2 word groups.

(wordgroup1)(\s|&nbsp|<.*>)*(wordgroup2)

I used the '*' instead of '{0,10}' which enabled the '<.*>' to fire properly so that any html tag used will be taken into consideration. Trying to use the {0,10} to limit them would make the '<.*>' in the group not fire which really shouldn't be happening. According to any regular expression tool to test with, including TestRExp, both versions should be working, but only the one with the '*' fires the rule with PopTray. I'll post a bug report after I figure out what I want to say.

Rdsok
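
One caveat with `<.*>` is worth showing: `.*` is greedy, so a single `<.*>` can run from the first `<` to a much later `>` and swallow ordinary words in between. A negated character class such as `<[^>]*>` stops at the first `>`. The sketch below uses Python's `re` module and shortened word groups, both of which are assumptions for illustration; PopTray's engine may differ.

```python
import re

# Shortened word groups, just enough for the demonstration.
GREEDY = r"(amateur)(\s|&nbsp|<.*>)*(star)"
NEGATED = r"(amateur)(\s|&nbsp|<[^>]*>)*(star)"

gibberish = "amateur <hsidylkl>stars"
innocent = "amateur <b>skiers</b> and <i>stars</i>"

# Both forms cope with a single gibberish tag.
print(re.search(GREEDY, gibberish) is not None)
print(re.search(NEGATED, gibberish) is not None)

# Only the greedy form fires here: "<.*>" spans "<b>skiers</b> and <i>".
greedy_hit = re.search(GREEDY, innocent)
negated_hit = re.search(NEGATED, innocent)
print(greedy_hit.group(0) if greedy_hit else None)
print(negated_hit)
```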

Post by Borgtex » Sun May 16, 2004 3:28 am

You had a good idea. I modified it a bit:

(wordgroup1)(\s|&nbsp|<[^>]*>)*.{0,10}(wordgroup2)

and now you will say "Oh, no, the infamous .{0,10} again. This guy is obsessed with that" ;)

hehe. It's just that I find the {0,10} very useful to specify proximity: if two of the "forbidden" words are close enough to each other, it will fire the rule too, e.g.

hot & nice amateurs
hot russian girl
hot!!!! girls

I know that it can give some rare false positives, like "hot weekend topics", but I can live with that.

And it seems that the final rule worked! the damn spam message is gone forever, muhahaha:

(older|mature|teen|lo[vn]ely|hot(test)*|young|amateur|College|drunk|beautiful|wild|cow|legal|moms|wet)(\s|&nbsp|<[^>]*>)*.{0,10}(girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks)

Like you, I think that there must be some kind of bug in the way PopTray processes the rule, as the first one was perfectly correct and should have fired the action. Anyway, I think that now the rule is better :)

Thanks a lot for your dedication and your solutions, Rdsok
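
The proximity behaviour of `.{0,10}` in the final rule can be tried out with a few strings. This is a sketch in Python's `re` module (an assumption for illustration, not PopTray's engine), and the last sample is an invented sentence added only to show a pair of words that is too far apart.

```python
import re

# Final rule from the post above, copied as posted.
FINAL = (r"(older|mature|teen|lo[vn]ely|hot(test)*|young|amateur|College|drunk|"
         r"beautiful|wild|cow|legal|moms|wet)(\s|&nbsp|<[^>]*>)*.{0,10}"
         r"(girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks)")

def fires(text: str) -> bool:
    """Return True when the final rule matches anywhere in the text."""
    return re.search(FINAL, text) is not None

NEAR = [
    "hot & nice amateurs",
    "hot russian girl",
    "hot!!!! girls",
    "hot weekend topics",   # the admitted false positive ("to" + "pics")
]
FAR = "hot summer holidays for girls"  # invented sample, gap longer than 10 chars

for sample in NEAR:
    print(fires(sample), repr(sample))
print(fires(FAR), repr(FAR))
```

Raising or lowering the upper bound in `{0,10}` trades missed spam against false positives such as "hot weekend topics".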

Post by Rdsok » Sun May 16, 2004 3:51 am

I was happy to help. If it's any consolation, I like the idea of limiting the rule with the {0,10}. I just wasn't getting it to work with the wildcard inside the html tag part. The odd part was that it only seemed to break after adding the 2nd group of words.

Anyway, enjoy your new rule.

Rdsok

henchard
Still here
Posts: 19
Joined: Fri Apr 16, 2004 9:54 am
Location: Dorset UK

Can you explain to a newbie?

Post by henchard » Sun May 16, 2004 12:34 pm

Not sure if I should post because of my inexperience (apologies if I am missing something obvious).

Surely there are thousands of potential false matches with

(older|mature|teen|lo[vn]ely|hot(test)*|young|amateur|College|drunk|beautiful|wild|cow|legal|moms|wet)(\s|&nbsp|<[^>]*>)*.{0,10}(girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks)

such as

mature fifteen year old malt whisky
wine older than thirteen years is best
it is hot in the tropics
it was older than davids
The young Springsteen grew up in New Jersey
Springsteen weds his girlfriend Patti
older forum topics will be deleted
Springsteens stark album
Neil Young's stark album
they sang beautiful madrigals

or am I getting this regular expression stuff all wrong (quite probably)?

Post by Borgtex » Sun May 16, 2004 1:32 pm

No, you are right; the {0,10} allows all that. However, as my main language is Spanish and I also have whitelists and rules to protect good mails, it will rarely return a false positive in my case.

But I admit that for more safety it is better to use: (group1)(\s|\W|&nbsp|<[^>]*>)*(group2)
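
To compare the two rules on henchard's list, here is a sketch in Python's `re` module. Both the engine and the case-insensitive matching are assumptions for illustration (the latter is needed for "Neil Young's" to count); PopTray itself may behave differently.

```python
import re

G1 = ("older|mature|teen|lo[vn]ely|hot(test)*|young|amateur|College|drunk|"
      "beautiful|wild|cow|legal|moms|wet")
G2 = "girl|gals|amateur|teen|vids|pics|female|star|moms|mothers|chicks"

# Rule with the proximity window, and the stricter rule without it.
PROXIMITY = rf"({G1})(\s|&nbsp|<[^>]*>)*.{{0,10}}({G2})"
STRICT = rf"({G1})(\s|\W|&nbsp|<[^>]*>)*({G2})"

INNOCENT = [
    "mature fifteen year old malt whisky",
    "wine older than thirteen years is best",
    "it is hot in the tropics",
    "it was older than davids",
    "The young Springsteen grew up in New Jersey",
    "Springsteen weds his girlfriend Patti",
    "older forum topics will be deleted",
    "Springsteens stark album",
    "Neil Young's stark album",
    "they sang beautiful madrigals",
]

def fires(rule: str, text: str) -> bool:
    """Return True when the rule matches anywhere in the text."""
    return re.search(rule, text, re.IGNORECASE) is not None

for line in INNOCENT:
    print(fires(PROXIMITY, line), fires(STRICT, line), repr(line))

# The stricter rule still catches punctuation and tags between the words.
print(fires(STRICT, "hot!!!! girls"), fires(STRICT, "amateur <br>stars"))
```

In this engine the proximity rule fires on all ten innocent lines and the stricter rule on none of them, at the cost of no longer catching pairs such as "hot russian girl".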

vitoco
Veteran
Posts: 422
Joined: Wed Jul 09, 2003 9:22 pm
Location: Chile

Post by vitoco » Wed May 19, 2004 2:24 pm

Few comments:

(word_group_1)\s*.{0,10}(word_group_2)

will search for one of the words from the first group, then zero or more whitespace chars (including newlines and tabs), then up to ten chars of whatever, then a word from the second group.

Swapping the dot and asterisk will search for a totally different pattern that is not helping here.

I would say:

(word_group_1)\s*.{0,10}\s*(word_group_2)

to get many spaces before the second word, or just

(word_group_1).*(word_group_2)

or

(word_group_1).{0,10}(word_group_2)

to have a limit in the number of bytes between words.

Note also that "\W" won't fire at "_".

:idea: If none of the above works, I think that PopTray checks in a buffered (or line) way and not with the full message's body available (assuming that the option was set), and one word (or a part of it) appears at the end of one buffer and the other at the beginning of the next one.

:arrow: Renier should say something about this.

++Vitoco
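
Two of the points above are easy to confirm in isolation: `\s*` accepts zero whitespace characters, and `\W` does not match an underscore because `_` counts as a word character. The sketch uses Python's `re` module as an assumption for illustration; the `[\W_]` class is a common way to include the underscore and is offered here only as a suggestion.

```python
import re

# "\s*" means zero or more whitespace chars, so adjacent words still match.
adjacent = re.search(r"amateur\s*stars", "amateurstars")
print(adjacent is not None)

# "_" is a word character, so "\W" does not match it.
print(re.search(r"\W", "_"))

# A separator class built on "\W" therefore misses "amateur_stars" ...
without_underscore = re.search(r"(amateur)(\s|\W)*(star)", "amateur_stars")
# ... unless the underscore is added explicitly.
with_underscore = re.search(r"(amateur)[\W_]*(star)", "amateur_stars")
print(without_underscore, with_underscore is not None)
```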

Post by Rdsok » Wed May 19, 2004 5:09 pm

Hi vitoco,

I had hoped you would have spoken up much sooner on this one :D .

You are (of course) absolutely right about there being a difference between the '*.' and '.*' wildcards. Borgtex's original regexpr was set up to catch words from the 2 groups that were relatively close together, hence the '{0,10}' part that followed. It did seem originally that PopTray was working in 'singleline' mode like you mentioned, but in reality, what made it appear that way was that we (Borgtex and I) were overlooking the HTML codes like <br>, and that would make his regexpr not fire. :oops:

There was a problem, which I've already filed a bug report on, that did deal with the '{0,10}' iterator. I described it as fully as I could in the thread "Poptray RegExpr iterator problem". I found this problem after I started to catch (include) any HTML tags in Borgtex's original regular expression.

Renier had mentioned in response to ilNebbioso that he has been very busy lately with work. I suspect he has taken note of the recent posts but hasn't been able to spend much time on keeping us also up to date. Personally, I'm glad his paying job is doing good. I know he will address the issues brought up, he always has in the past.

Renier
Site Admin
Posts: 1957
Joined: Mon Oct 15, 2001 12:54 pm
Location: Cape Town, South-Africa

Post by Renier » Thu May 20, 2004 6:43 am

It doesn't help that I comment on the problems when I haven't had time to look into it yet.

Post by vitoco » Thu May 20, 2004 2:16 pm

Renier, I just wanted a comment about how the body is processed in rules, not about RegExprs.

Please select one of the following:

(a) the body is totally downloaded, then the rule is tested.

(b) the body is downloaded in chunks of X KiB, and the rule is run on every buffered chunk.

(c) after each downloaded line of the body the rule is tested.

(d) none of the above!

I think this is an architecture question... And you are The Architect ;)

I might look inside the code myself, but I'm used to neither Delphi nor objects, properties & methods!

Option (a) has a cost in RAM, (c) in CPU, (b) a lower cost both in RAM and CPU. I guess that (d) is only possible if Retrieve Body while Checking is not set or Preview Top Lines is set to 0 lines or so...

:arrow: Only (c) allows us to search multiline strings in a safe way.

++Vitoco
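
Why the processing strategy matters can be shown with a toy model. The sketch below is not PopTray's code: the three functions are hypothetical stand-ins for testing the whole body at once, testing each line in isolation, and testing fixed-size chunks in isolation, written in Python for illustration with a simplified rule.

```python
import re

# Simplified stand-in for a rule whose two words may sit on different lines.
RULE = re.compile(r"amateur\s*stars")
BODY = "for all the amateur\r\nstars of the industry\r\n"

def match_whole(body: str) -> bool:
    """Test the rule once against the complete body."""
    return RULE.search(body) is not None

def match_per_line(body: str) -> bool:
    """Test the rule against each line on its own."""
    return any(RULE.search(line) for line in body.splitlines())

def match_per_chunk(body: str, size: int) -> bool:
    """Test the rule against consecutive chunks of `size` characters."""
    chunks = (body[i:i + size] for i in range(0, len(body), size))
    return any(RULE.search(chunk) for chunk in chunks)

print(match_whole(BODY))
print(match_per_line(BODY))
print(match_per_chunk(BODY, 16))   # boundary falls inside "amateur"
print(match_per_chunk(BODY, 64))   # one chunk holds the whole body
```

In this model a pattern that spans a line or chunk boundary is only found when the pieces are tested together, which is the concern raised in the question above.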
