Author Topic: Suggestions on Spam Scanner (Read 1516 times)

parad0x · « **on:** August 13, 2013, 11:01:10 AM »

I'm planning to write a spam scanner, but I'm not sure how to rate an e-mail as spam.
I need some suggestions on how to score an e-mail against a sample spam mail. The program will pick up some words from a sample spam mail and then ask the user to input their e-mail; it will then scan that e-mail and rate how likely it is to be spam. I have no idea how to build the rating scale or what else the program should do. Please don't suggest that it should regularly scan the inbox folder and so on, as I haven't done the networking chapter yet. I'll add support for that once I've covered network programming.

proxx · « **Reply #1 on:** August 13, 2013, 11:14:43 AM »

I suggest you start out with a blacklist.
Think about words like "viagra"; compiled lists shouldn't be too hard to find online.
Then run the blacklist against the e-mail.
1 match = 1 point
if match:
points += 1
if the e-mail has more than x points -> blacklist it.

The sender's address could also be a viable entry point.
Either build a list of trusted senders or a list of untrusted senders.
Perhaps do both: say, if sender == *.x.com then points += 1, even before the blacklist check runs.
That way untrusted senders will automatically rank higher.

Once you have this point system in place, you can decide how strict you want the tool to be.
You could say:
low = less than 2
medium = 2 or more, less than 4
etc.
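
A minimal Java sketch of the point system described above, under a few assumptions: the three-word blacklist, the class and method names, and the "high" level are placeholders, and a score of exactly 2 is counted as "medium" because the post leaves that case open. It awards one point per blacklisted word found as a whole word, plus one point if the sender belongs to an untrusted domain.

```java
import java.util.List;
import java.util.regex.Pattern;

class BlacklistScorer {
    // Hypothetical sample blacklist; a real list would be much longer.
    static final List<String> BLACKLIST = List.of("viagra", "lottery", "winner");

    // One point per blacklisted word that occurs as a whole word (case-insensitive),
    // plus one point if the sender is on an untrusted domain or one of its subdomains.
    static int score(String body, String sender, List<String> untrustedDomains) {
        int points = 0;
        for (String word : BLACKLIST) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                        Pattern.CASE_INSENSITIVE);
            if (p.matcher(body).find()) {
                points += 1;
            }
        }
        String s = sender.toLowerCase();
        for (String domain : untrustedDomains) {
            if (s.endsWith("@" + domain) || s.endsWith("." + domain)) {
                points += 1;
                break; // count the sender penalty only once
            }
        }
        return points;
    }

    // Assumed cut-offs: below 2 is low, 2 or 3 is medium, 4 and up is high.
    static String rate(int points) {
        if (points < 2) return "low";
        if (points < 4) return "medium";
        return "high";
    }
}
```

For example, `BlacklistScorer.score("You are a WINNER! Buy viagra now", "promo@mail.x.com", List.of("x.com"))` gives 3 (two word matches plus the sender point), which `rate` maps to "medium".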

Deque · « **Reply #2 on:** August 13, 2013, 12:30:36 PM »

Get some ideas here:
https://en.wikipedia.org/wiki/Anti-spam_techniques

You might also want to look into computational linguistics, which is used for content-based spam detection.

parad0x · « **Reply #3 on:** August 13, 2013, 03:09:40 PM »

Quote from: proxx on August 13, 2013, 11:14:43 AM

1 match = 1 point
if match:
points +=
if email has more than x point > blacklist.

I had exactly the same thing in mind, but the question is: how many points should indicate that a mail is spam?

Btw, you've given me a nice idea of what to add to this. Thank you, Deque, for the link.

proxx · « **Reply #4 on:** August 13, 2013, 03:20:14 PM »

Quote from: parad0x on August 13, 2013, 03:09:40 PM

I had exactly the same thing in mind as you stated but the question is How many points should indicate it is a spam?

Btw you had given me a nice idea on what to add in this. Thank you Deque for the link.

I suggest you make the program adjustable based on user preference.
Find yourself a big, heavily spammed mailbox: just post its address everywhere on the web and wait for the spam to roll in.
Then you can start testing how strict to make it.
Or make it learn.
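
One way to do that testing, sketched in Java: score a set of mails you have already labelled as spam or not, then try each candidate threshold x and keep the one with the fewest mistakes. The class name, method name, and sample numbers are made up for illustration; "spam" here means score > x, as in the earlier post.

```java
class ThresholdTuner {
    // Returns the smallest threshold x in [0, maxThreshold] that misclassifies
    // the fewest labelled samples, where a mail is flagged when its score > x.
    static int bestThreshold(int[] scores, boolean[] isSpam, int maxThreshold) {
        int best = 0;
        int bestErrors = Integer.MAX_VALUE;
        for (int x = 0; x <= maxThreshold; x++) {
            int errors = 0;
            for (int i = 0; i < scores.length; i++) {
                boolean flagged = scores[i] > x;
                if (flagged != isSpam[i]) {
                    errors++; // either a false positive or a missed spam
                }
            }
            if (errors < bestErrors) { // strict '<' keeps the smallest x on ties
                bestErrors = errors;
                best = x;
            }
        }
        return best;
    }
}
```

With scores `{0, 1, 1, 3, 4, 5}` where only the last three are spam, thresholds 1 and 2 both make no mistakes, and the method returns 1.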

vezzy · « **Reply #5 on:** August 13, 2013, 07:27:08 PM »

In technical terms, the technique you're looking for should be Bayesian spam filtering, and even more advanced: Markovian discrimination (where entire chains of words rather than a single one are interpreted).

Of course you could make it a basic content-control filter, but with user prompts to mitigate false positives. A true Bayesian system needs to "learn" (machine learning), and most contemporary spam filters are based on Bayesian logic (nothing is objectively true or false, but the probability of something being one of the two increases with the onset of new data).

proxx's suggestion is to make a spamtrap. While this is often used to study spam, indeed, it's worth noting that dedicated spammers usually catch on to this stuff quickly and may start deliberately targeting it to hijack your research. Additionally, if someone forwards to or replies to a message sent by a spammer that has your spamtrap in To: or CC:, they will end up blacklisted themselves. It's a pretty dodgy thing, overall.

That said, if you're going to do a Bayesian filter, you need to be aware that avoiding it is as simple as carefully inserting innocent words ("ham") between the spam text, a technique known as Bayesian poisoning. This usually confuses most naive Bayesian filters, but with proper heuristics and machine learning, you could combat it.
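
To make the Bayesian idea concrete, here is a deliberately simplified naive Bayes sketch in Java. Everything in it is an assumption for illustration: the class name, the tokenizer (lower-case, split on non-alphanumerics), add-one smoothing, and the choice to look only at which words are present. It is not how any particular production filter works.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NaiveBayesFilter {
    // Number of training messages of each class that contain a given word.
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamMessages = 0;
    private int hamMessages = 0;

    // Distinct lower-cased words of a message.
    private static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // "Learning": record which words showed up in a labelled message.
    void train(String text, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spamCounts : hamCounts;
        for (String t : tokens(text)) {
            counts.merge(t, 1, Integer::sum);
        }
        if (isSpam) spamMessages++; else hamMessages++;
    }

    // Estimated P(spam | words present), computed in log space with add-one smoothing.
    double spamProbability(String text) {
        double total = spamMessages + hamMessages + 2.0;
        double logSpam = Math.log((spamMessages + 1.0) / total);
        double logHam = Math.log((hamMessages + 1.0) / total);
        for (String t : tokens(text)) {
            logSpam += Math.log((spamCounts.getOrDefault(t, 0) + 1.0) / (spamMessages + 2.0));
            logHam += Math.log((hamCounts.getOrDefault(t, 0) + 1.0) / (hamMessages + 2.0));
        }
        // Normalise the two scores into a probability without overflow.
        double max = Math.max(logSpam, logHam);
        double s = Math.exp(logSpam - max);
        double h = Math.exp(logHam - max);
        return s / (s + h);
    }
}
```

After training on a couple of spam and ham samples, a message made of spam-only words scores well above 0.5. Appending an equal number of ham-only words pulls the result back to about 0.5, which is the poisoning effect described above in miniature.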

Lastly, spam filters are a dime a dozen, and most of them are highly alike. You're probably just reinventing the wheel, unless you simply want the learning experience, or if you have something unorthodox planned.

Personally, I recommend fingerprinting and analyzing the actual requests. This is highly effective, because the automated software used by spammers usually isn't standards-compliant. Make a filter that checks for any deviations from the SMTP RFCs, or one that analyzes email headers. Shady user agents are always a giveaway.
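
As a small illustration of the header-analysis idea, the Java sketch below checks a raw header block for fields that the message format spec (RFC 5322) requires (Date and From) and for Message-ID, which the spec only recommends. The class name, method name, and finding strings are made up, and real fingerprinting would look at far more than this.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HeaderCheck {
    // Returns a list of findings for a raw header block (lines separated by \n or \r\n).
    static List<String> findings(String rawHeaders) {
        Set<String> names = new HashSet<>();
        for (String line : rawHeaders.split("\\r?\\n")) {
            // Folded continuation lines start with whitespace; they carry no new field name.
            if (line.isEmpty() || Character.isWhitespace(line.charAt(0))) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                names.add(line.substring(0, colon).trim().toLowerCase());
            }
        }
        List<String> out = new ArrayList<>();
        // Date and From are the required fields in RFC 5322.
        if (!names.contains("date")) out.add("missing Date");
        if (!names.contains("from")) out.add("missing From");
        // Message-ID is only recommended, so treat its absence as a weaker hint.
        if (!names.contains("message-id")) out.add("missing Message-ID");
        return out;
    }
}
```

Each finding could simply add points to the same score the word blacklist feeds into.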

Best of luck to you.

xC · « **Reply #6 on:** August 14, 2013, 02:59:32 AM »

+1 to vezzy, very good information. Trusted sites and blacklisted words seem a little counterproductive, though. The best option, as stated, is to authenticate headers, requests, etc.

parad0x · « **Reply #7 on:** August 15, 2013, 09:57:28 AM »

What you guys suggested are awesome things I should implement, but the spam scanner I'm going to build is an exercise problem from the chapter on regex and strings. It was meant to be a simple scanner which chooses 30 words found mostly in spam, searches for them in an email, and then rates it. I'm not experienced enough in Java to code what you described, vezzy, but I'll try my best to make it better than a simple spam scanner.
