19th June, 2007

Injections, Spiders, Spam and phpPHP

Tuesday, 9:23 am in CodeGirl

So the other day I started getting obsessed over bots.  I think it started when I finally fixed my Apache error pages.  Because I’m somewhat obsessive-compulsive over some things, I always had this dream of making Apache error pages that logged referrer and browser information in a database.  Except it never worked; it logged okay, but never any data.  I couldn’t figure out why for ages (admittedly I wasn’t looking very hard), until I read somewhere that if you use a full URL in an ErrorDocument directive (as I was, to deal with my sub-domains), Apache sends the browser an external redirect and conveniently drops all the original request information.1  Thanks Apache!  So after goofing off with .htaccess for a while, I eventually set it so that it sends all errors to the local path /errors; this meant I had to have a different set of error page files for each subdomain, but at least it was now logging redirect data.  H’okay.
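
For the curious, here is a minimal sketch of what the logging end of such an error page might look like. It assumes PHP 7 or later and an error page reached through a local ErrorDocument path; the function name collect_error_info() and the use of error_log() in place of a real database insert are purely illustrative and not what sk.log actually does.

```php
<?php
// Hypothetical helper for an error page reached via a local ErrorDocument,
// e.g. "ErrorDocument 404 /errors/404.php" in .htaccess.
function collect_error_info(array $server): array
{
    return [
        // Apache sets the REDIRECT_* variables on internal redirects only;
        // with a full-URL ErrorDocument they never arrive.
        'status'     => $server['REDIRECT_STATUS'] ?? null,
        'url'        => $server['REDIRECT_URL'] ?? ($server['REQUEST_URI'] ?? null),
        'referrer'   => $server['HTTP_REFERER'] ?? '',
        'user_agent' => $server['HTTP_USER_AGENT'] ?? '',
        'ip'         => $server['REMOTE_ADDR'] ?? '',
    ];
}

// Usage inside the error page; swap error_log() for your own database insert.
$info = collect_error_info($_SERVER);
error_log(json_encode($info));
```

If 'status' comes back null on a live error page, that is a fair hint the ErrorDocument is still pointing at a full URL.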

So suddenly I’ve got all this meaningful error data coming in, and I’m absolutely floored by how many PHP injection hacks I’m getting hit with; generally either at Uncreative or my fanlistings.  Luckily I use a custom-hacked version of phpFanBase, sk.include isn’t fooled by cheap include tricks, and my server has remote include() turned off anyway (as it should; that shit is deadly), so it’s not an issue of vandalism.  Mostly I was just interested to note how many of these ‘dragnet hack’ attempts were done using spiders which were obviously identifying themselves as spiders (libwww-perl and PEAR mostly).  So after some scraping around, I set up the following in my .htaccess:

# bot blacklist
Order Allow,Deny
Allow from all

SetEnvIfNoCase User-Agent "^libwww-perl" bad_bot
SetEnvIfNoCase User-Agent "^TurnitinBot" bad_bot
SetEnvIfNoCase User-Agent "^dragonfly" bad_bot
SetEnvIfNoCase User-Agent "^OutfoxBot" bad_bot
SetEnvIfNoCase User-Agent "^yodaobot" bad_bot
SetEnvIfNoCase User-Agent "^larbin" bad_bot
SetEnvIfNoCase User-Agent "^Syntryx" bad_bot
SetEnvIfNoCase User-Agent "^OrangeSpider" bad_bot
SetEnvIfNoCase User-Agent "^Java" bad_bot
SetEnvIfNoCase User-Agent "^studybot" bad_bot
SetEnvIfNoCase User-Agent "^PHP version tracker" bad_bot
SetEnvIfNoCase User-Agent "^CaRP" bad_bot
SetEnvIfNoCase User-Agent "^genieBot" bad_bot
SetEnvIfNoCase User-Agent "^User-Agent:" bad_bot
SetEnvIfNoCase User-Agent "^DataCha0s" bad_bot
SetEnvIfNoCase User-Agent "^$" bad_bot
SetEnvIfNoCase User-Agent "^autokrawl" bad_bot
SetEnvIfNoCase User-Agent "^cfetch" bad_bot
SetEnvIfNoCase User-Agent "^voyager" bad_bot

Deny from env=bad_bot

In a nutshell, this will cause Apache to examine incoming user agent strings, and if they start with anything in the list (ignoring case), it will deny them access to the site.  Some of these bots are known to be ‘bad’ – address harvesters and so forth – and some of them I just don’t like, such as the bot for Turnitin.  Yeah, I have an ideological objection to a so-called ‘anti-plagiarism’ bot actively breaching my copyright to allegedly prevent copyright breach.  What can I say?
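
If you want to check which user agents a list like this would catch before unleashing it on a live site, the matching is easy enough to imitate. The following is only a rough sketch in PHP (7 or later), not anything Apache runs: is_bad_bot() is a made-up name, the prefix list is a shortened copy of the one above, and it treats each entry as a literal prefix rather than a full regular expression.

```php
<?php
// A few prefixes copied from the blacklist above (shortened for the example).
$bad_prefixes = ['libwww-perl', 'TurnitinBot', 'larbin', 'Java'];

// Returns true if the user agent is empty or starts with a listed prefix
// (case-insensitive), imitating the "^$" and "^name" SetEnvIfNoCase rules.
function is_bad_bot(string $user_agent, array $prefixes): bool
{
    if ($user_agent === '') {
        return true; // same idea as the "^$" rule
    }
    foreach ($prefixes as $prefix) {
        if (preg_match('/^' . preg_quote($prefix, '/') . '/i', $user_agent)) {
            return true;
        }
    }
    return false;
}

var_dump(is_bad_bot('libwww-perl/5.805', $bad_prefixes));                // bool(true)
var_dump(is_bad_bot('Mozilla/5.0 (compatible; larbin)', $bad_prefixes)); // bool(false)
```

The second call shows the catch with the leading ^: a bot that buries its name in the middle of a Mozilla-style string sails straight past.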

There’s just one problem; manual blocking of this sort is always going to be a losing battle.  While it will work for bots like Turnitin that don’t actively attempt to hide their identities, it ain’t gonna work for most scraper bots, which have constantly changing IPs and user agents.

Enter Project Honey Pot.  This site attempts to actively trap malicious bots by getting webmasters to lay out deliberate ‘honeypots’; tempting pages with email addresses, fake comment forms and login boxes that track any attempt by a bot to spam or brute-force.  The site also keeps a large directory of known ‘bad’ IP addresses; ah, now we’re talking.  Next problem; how to integrate the Honey Pot database of known bad IPs into protection for my website?  Project Honey Pot offers a service called http:BL, which essentially allows users to use DNS queries to check IP addresses against the database to determine if they are dangerous or not.  There is an Apache module integrating the feature for download, but of course that’s no good for those of us who are on shared hosting.  So, after some hemming and hawing, I came up with the following script:

<?php
$akey = "abcdefghijk";  // your http:BL Access Key
$redirect = "http://www.unspam.com/noemailcollection/";  // change to the URL of your site's honeypot

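$min_threat = 10;     // minimum http:BL threat score (0-255) before a visitor gets redirected
$banned_mask = 7;     // visitor types to block: 1 = suspicious, 2 = harvester, 4 = comment spammer; see http://www.projecthoneypot.org/httpbl_api.php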
$lookup = "dnsbl.httpbl.org";

//* END CONFIG *******************************************************//
// include class file
  require_once( "Net/DNS.php" );

  // reverse the ip
  $ip = explode( '.', $_SERVER['REMOTE_ADDR'] );
  $hp = $akey .'.'. $ip[3] .'.'. $ip[2] .'.'. $ip[1] .'.'. $ip[0] .'.'. $lookup;

  // create new Net_DNS object
  $ndr = new Net_DNS_Resolver();

  // debug output?
  // $ndr->debug = 1;
  
  // set nameservers
  // uncomment if you're having problems getting Net_DNS to work
  // $ndr->nameservers = array( '4.2.2.1', '4.2.2.2' );

  // query for IP address
  $r = $ndr->search( $hp );

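if( !empty( $r ) && !empty( $r->answer ) ){
  // a valid http:BL answer looks like 127.<days>.<threat>.<type>
  $c = explode( '.', $r->answer[0]->address );
  
  // if we look suspicious, redirect us to the specified page and kill all subsequent content
  if( $c[0] == 127 && ( $banned_mask & (int) $c[3] ) != 0 && (int) $c[2] >= $min_threat ) {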
    header( "Location: $redirect" );
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>phpPHP</title>
  <meta http-equiv="refresh" content="0;URL=<?php echo htmlspecialchars( $redirect ); ?>">
</head>
<body>
</body>
</html>
<?php
    die();
  }
}
?>

This script will query the IP address of any visitor against http:BL and if it is found to match the criteria for dangerousness (as per $min_threat and $banned_mask), it will redirect the browser to a specified URL, which you should probably set to your site’s honeypot.  To get it working, you will need to sign up at Project Honey Pot, get an http:BL access key, download Net_DNS, and bung that and this script somewhere up on your site.  Now just include() the above script at the very top of any page you wish to ‘protect’; it has to run before the page sends any output, or the header() redirect won’t go out.
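
To make the $banned_mask and $min_threat business a bit less magic, here is a rough sketch of how an http:BL answer breaks down, going by the API page linked in the config. It assumes PHP 7.1 or later. The names decode_httpbl() and httpbl_lookup() are made up for illustration, and the lookup helper swaps Net_DNS for PHP’s built-in gethostbyname(), which is a different route to the same DNS query rather than what the script above does.

```php
<?php
// Decode an http:BL answer such as "127.3.25.5".
// Returns null when the answer is not a valid http:BL response.
function decode_httpbl(string $answer): ?array
{
    $o = explode('.', $answer);
    if (count($o) !== 4 || $o[0] !== '127') {
        return null;
    }
    $type = (int) $o[3];
    return [
        'days_since_seen' => (int) $o[1],
        // for search engines (type 0) this octet identifies the engine
        // rather than giving a threat score
        'threat'          => (int) $o[2],
        'search_engine'   => $type === 0,
        'suspicious'      => ($type & 1) !== 0,
        'harvester'       => ($type & 2) !== 0,
        'comment_spammer' => ($type & 4) !== 0,
    ];
}

// Hypothetical lookup helper using the built-in resolver (IPv4 only).
function httpbl_lookup(string $access_key, string $ipv4): ?array
{
    $reversed = implode('.', array_reverse(explode('.', $ipv4)));
    $host = $access_key . '.' . $reversed . '.dnsbl.httpbl.org';
    $answer = gethostbyname($host);
    // gethostbyname() hands back the unmodified hostname when the lookup fails,
    // which is what happens for IPs that are not in the database
    return $answer === $host ? null : decode_httpbl($answer);
}

print_r(decode_httpbl('127.3.25.5'));
```

For that sample answer the visitor was last seen 3 days ago with a threat score of 25, and type 5 means both ‘suspicious’ and ‘comment spammer’; with the config above ($banned_mask of 7, $min_threat of 10) it would be redirected.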

And that’s pretty much it.

It’s by no means perfect, and it won’t keep all email harvesters and spam scrapers off your site, but it makes me feel like I’m doing my – very tiny – bit to bite back against the tide of bullshit spam rubbish that’s become a constant infection on the internet.  In case you’re wondering, it definitely fools at least some bots; I’ve had two new spam attempts in the time I’ve spent writing this post.  sk.log also keeps a record of things like CAPTCHA failures; since integrating http:BL it’s gone down from 400+ ‘hits’ a day to about 60.  That there’s results I tells ya!

And you know, it gives me another tracker to watch.  I sure do love watching trackers…

  1. It also breaks 401 password authentication.  So now I know why I haven’t been able to log in to admin my fanlistings for a zillion years.

  • v-s.net v0.6 and all content (unless noted) © Dee.