Fighting spam using MySQL and PHP
Posted on 27 December 2006 (07:14 PM)
Spam is a terrible, terrible thing. And I'm not even talking about spam emails. Luckily, I don't receive a lot of Viagra, free offers or casino coupons in my mailbox, so when that happens, I just send them straight to the trash bin.
I'm talking about comment spam. This form of spamming is even worse, 'cause it's out in the open, where everyone can read it, polluting innocent websites that have got nothing to do whatsoever with that malicious content.
In this article I will provide a way to arm yourself against this criminal behaviour.
How spam bots work
I'm not exactly an expert, but it's obvious that spam machines target a specific script of yours, send their malicious variables through either POST or GET and repeat that action over and over again. You can't stop them from finding out what variables your script expects and under which names they go.
You can always check for header information, like this...
if ($_SERVER['HTTP_REFERER'] != 'http://www.foo.bar') {
    exit("That's wrong!");
}
...but that's unreliable. Spam bots are able to mimic these values and you will receive spam all the same.
You can't stop these machines sending stuff to your script, so what can you do?
Expect the unexpected
Spam machines are configured to send variables with names and patterns that they know your script will accept. The best you can do is undo this guesswork and make the variables your script expects highly dynamic.
I have developed a way to create random variables that are different on every page refresh. That way, spam bots can never guess how to formulate the right string. It's best if you use a database to do this, but an array of values will work just as well. The advantage of using a database is that it (most likely) will expand over time, so more and more string combinations become available just by updating your content.
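For readers without a database, the array variant mentioned above might look roughly like this. It is only a sketch: the `$trap_titles` array and the `make_trap_fields()` helper are names made up for illustration, and the sample strings stand in for whatever content you have at hand.

```php
<?php
// Hypothetical stand-in for a database table: any list of strings will do.
$trap_titles = array(
    1 => 'Merry Christmas everyone!',
    2 => 'Stop your page from shifting!',
    3 => 'Fighting spam using MySQL and PHP',
);

// Build the hidden-field names and values for one page view.
function make_trap_fields(array $titles)
{
    $id   = array_rand($titles);   // random key, the array equivalent of ORDER BY RAND() LIMIT 1
    $hash = md5($titles[$id]);
    return array(
        $hash          => strtoupper($hash . substr($hash, 11)), // hash plus its own tail
        md5('trap_id') => $id,                                    // tells the receiver which entry was used
    );
}

$fields = make_trap_fields($trap_titles);
foreach ($fields as $name => $value) {
    echo '<input type="hidden" name="' . $name . '" value="' . $value . '">' . "\n";
}
```

The number of possible combinations equals the number of entries in the array, which is why a growing database table is the more attractive source.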
PHP and MySQL to the rescue
So, here goes. Pick a table in your database; it doesn't matter what's in it. I'm using my article table in this example, the "articleTitle" column in particular:
// Pick one random article; its title seeds the trap fields.
$sql = "SELECT id, articleTitle FROM " . ARTICLE_TABLE . " ORDER BY RAND() LIMIT 1";
$result = $db->query($sql);
$row = $result->fetch();
$hash = md5($row['articleTitle']);
$trap_key = $hash;
// Value: the hash plus its own tail (from offset 11), upper-cased.
$trap_value = strtoupper($hash . substr($hash, 11));
$trap_key2 = md5('trap_id');
$trap_id = $row['id'];
As you can see, I fetch one random title from my database using MySQL's RAND() function. With that random string I construct a key and a value, using a little MD5. I obscure the value even more by concatenating a substring of itself.
This is the HTML I produce using these values:
<input type="hidden" name="<?php echo $trap_key;?>" value="<?php echo $trap_value;?>">
<input type="hidden" name="<?php echo $trap_key2;?>" value="<?php echo $trap_id;?>">
You can check the source of this page; the value is different on every page refresh. Now spam bots will never know what variables to pass to my script!
The receiving end
Of course, my script needs to figure out if the submitted string is a correct combination of identifier and value. On the receiving end, almost exactly the same code runs:
$trap_id_key = md5('trap_id');
// Cast the submitted id to an integer so it cannot inject SQL.
$trap_id = isset($_POST[$trap_id_key]) ? (int) $_POST[$trap_id_key] : 0;
$sql = "SELECT MD5(articleTitle) AS hash FROM " . ARTICLE_TABLE . " WHERE id = " . $trap_id;
$result = $db->query($sql);
$row = $result->fetch();
// No matching row means no valid hash, so the check below fails.
$hash = $row ? $row['hash'] : '';
if (!isset($_POST[$hash]) || $_POST[$hash] != strtoupper($hash . substr($hash, 11))) {
    exit("Your comment could not be submitted due to security measures.");
}
Using MySQL's MD5() function, I fetch the hash of the article's title where the id is the same as the one sent by the form. I then construct the value in the same way as was done on the other end of the form and check to see if they are the same:
if (!isset($_POST[$hash]) || $_POST[$hash] != strtoupper($hash . substr($hash, 11))) {
    exit("Your comment could not be submitted due to security measures.");
}
And now, we wait
The script has been running for a couple of days now, and so far I have not received a single spam comment. I'm keeping my fingers crossed, though: spam bots might need a couple of days to adapt to the new form, and maybe I will start receiving spam all the same in a couple of weeks. But I think not. Logically, this ought to work pretty well.
I will post an update as soon as I've found enhancements or bugs, and I would be happy to receive your feedback on this technique.
Filed under PHP, Industry and Culture
Comments:
Nice article.
We've talked about this earlier and it seems you've found a way to code this nice 'n easy.
Let's hope this will work fine for the current generation spambots and the articles' comments will remain clean!
I'm with you on the timeframe issue. There's always a good chance you didn't think (at least) one step ahead, but then again, the bots just cannot "guess" the variables.
We'll see what happens, maybe you left some space for the bots to use, maybe not.
Sam
A friend of mine linked me to this page after reading what I had done. Thought you might like to read it.
Pretty similar in concept, but mine doesn't require MySQL (so you can apply this to a form mailer or for people without DB access), and adds a keyword ban feature and another sneaky bit:
http://prostheticallyhip.com/?p=245
this comment has been quoted by Harmen Janssen
Hey thanks a lot, Will :)
I will certainly take a look, 'cause to be honest, the spam bots are slowly figuring things out and I am gradually receiving spam again.
I might write an update later.
But how does it actually work? Do you expect a different value from the two hidden fields on every post?
this comment has been quoted by Harmen Janssen
Yes, it is completely random; I cannot determine in advance which value ends up in the fields.
It is far from watertight, though: spam still trickles in. No idea how they do it, but unfortunately I am too busy at the moment to implement a more solid system.
I am going to implement either a Captcha system based on Cold Fusion, or the Cold Fusion version of the WordPress API AKISMET (http://akismet.com/). For a long time I maintained a spam filter of my own in my own CMS for my website, which filters on words and URLs, but as you already say, you are always playing catch-up. CFAKISMET is an OO implementation of the WordPress API and simply runs on the server. So when someone posts a comment on my website, it is passed through the API. If all is well, the comment is simply posted; otherwise the API returns an error message, which I then handle neatly myself. I don't know yet how much work this is, but judging by the explanation on the AKISMET website it seems manageable. There is also a PHP version available (http://akismet.com/development/ - for PHP4 and PHP5) ;-) Something for you?
Yes, that certainly looks interesting :)
Thanks for the links. I often find Captchas too unfriendly for users, so I don't want to go there yet..
Same here. I held off on Captcha as long as possible, because I find it an unnecessary hindrance for my visitors - and it is not really user-friendly or accessible. When I discovered Akismet through the grapevine, it seemed like just the thing.
This algorithm is fundamentally *flawed*.
You give spammers the ID of the table and the correct answer.
A spammer needs to load your page just once, look at trap_value and trap_key, and build a robot that just posts to your page over and over again, using the very same trap_value and trap_key all the time. It'll validate just fine.
Each page load produces a random trap_key+trap_value tuple, true.
Your code doesn't have the ability to verify if a trap_value+trap_key pair has been issued (and used) before. That's the problem.
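The replay problem described in this comment can be illustrated with a small standalone sketch. It assumes the same check as the article, but with the title passed in directly instead of looked up in the database; `trap_is_valid()` and the sample title are hypothetical names used only for this illustration.

```php
<?php
// Same comparison as in the article, minus the database lookup
// (trap_is_valid is a hypothetical helper, not the author's code).
function trap_is_valid(array $post, $title)
{
    $hash = md5($title);
    return isset($post[$hash])
        && $post[$hash] === strtoupper($hash . substr($hash, 11));
}

// What a bot captures after loading the form a single time.
$title    = 'Some article title';   // sample data
$hash     = md5($title);
$captured = array(
    $hash          => strtoupper($hash . substr($hash, 11)),
    md5('trap_id') => 7,            // arbitrary sample id
);

// Replaying the captured fields passes the check on every attempt.
$first  = trap_is_valid($captured, $title);
$second = trap_is_valid($captured, $title);
var_dump($first, $second);
```

Both calls return true, because nothing on the server records that the pair was already issued or used.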
Why don't you store both values in a session?
Much safer, because spambots can't read sessions. That is what I use in my sort-of captcha.
That's a fine idea, Daniel :)
I have to find some time to work on Whatstyle, but my schedule won't allow it these days :s
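The session idea suggested in the comments could be sketched roughly as follows. This is not code from the site: `issue_trap()` and `check_trap()` are hypothetical helpers, and a plain array stands in for `$_SESSION` so the sketch also runs from the command line. In a real page you would call `session_start()` and pass `$_SESSION` instead.

```php
<?php
// Issue a random key/value pair and remember it on the server.
// $session stands in for $_SESSION (assumption made for this sketch).
function issue_trap(array &$session)
{
    $key   = md5(uniqid(mt_rand(), true));
    $value = md5(uniqid(mt_rand(), true));
    $session['trap'] = array('key' => $key, 'value' => $value);
    return array($key => $value);   // render this pair as a hidden input
}

// Accept a submission only if it matches the stored pair, then forget
// the pair so the same fields cannot be replayed.
function check_trap(array &$session, array $post)
{
    if (!isset($session['trap'])) {
        return false;
    }
    $trap = $session['trap'];
    unset($session['trap']);        // one-time use
    return isset($post[$trap['key']])
        && $post[$trap['key']] === $trap['value'];
}

$session = array();                          // in a real page: $_SESSION
$fields  = issue_trap($session);
$ok      = check_trap($session, $fields);    // first submission
$replay  = check_trap($session, $fields);    // same fields sent again
var_dump($ok, $replay);
```

Here the first submission is accepted and the replay is rejected. Note that this only stops blind replays: a bot that keeps its session cookie and reloads the form before every post would still receive a valid pair each time.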