eZ Community » Forums » General » How to stop spiders?

How to stop spiders?

Friday 18 March 2005 4:28:27 pm - 10 replies

Any idea about how to stop bad spiders from entering the site? By bad spiders I mean the ones which harvest email addresses. The problem is that some of them crawl so fast that they can bring the server down if they strike during peak hours. It would also be good to protect people's email addresses, but masking them is a bit useless: spiders learn so fast that masking quickly becomes ineffective.

Any idea would be appreciated.
Thanks,
Luis.

Friday 18 March 2005 4:38:28 pm

Hello

You can use the wash operator on your emails to obfuscate them.

See this post: http://www.ez.no/community/forum/setup_design/obfuscate_email_addresses

Lex

Friday 18 March 2005 5:03:10 pm

My main problem isn't obfuscating email addresses. My main problem is spiders crawling the site at the maximum speed the server supports, which slows the site down or even blocks the server entirely.

Friday 18 March 2005 5:16:54 pm

Hi Luis,

Here you can find some info about controlling spiders:
http://www.searchengineworld.com/robots/robots_tutorial.htm

Modified on Friday 18 March 2005 5:17:20 pm by Ɓukasz Serwatka

Wednesday 23 March 2005 11:57:42 am

Have to second the robots.txt file idea. Basically, you're left with either:

1) hope that they stop
2) put a robots.txt file in there and hope that they stop.

I would do 2.
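For reference, a minimal robots.txt that asks all robots to stay out (the well-behaved ones, at least) looks like this:

```
User-agent: *
Disallow: /
```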

Jonathan

Wednesday 23 March 2005 2:42:58 pm

Unfortunately, I don't think the robots.txt would do anything.

If I were a spider programmer, the first steps my program would take would be:
- check if there is a robots.txt
- then immediately visit the "forbidden" folders, because they must be the most interesting ones ...

Probably if you obfuscate all your e-mail addresses, the spiders won't come back, since they won't find anything interesting on your site.

Wednesday 23 March 2005 6:32:44 pm

Hi Luis,

You might want to log the IP address of the "bad" spiders and then block them. This will only work, however, on spiders that have a fixed IP address.
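Once you have logged an address, blocking it could look something like this in .htaccess (a sketch using Apache's standard access directives; 192.0.2.10 is a placeholder, replace it with the address you logged):

```
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
```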

I hope this helps

Tony

Thursday 24 March 2005 11:02:57 pm

How do you recognize a bad spider? If you can recognize it from the HTTP information, add the following lines to your .htaccess file, or in the Apache httpd.conf, if you can access that one.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (badspider)
RewriteRule !^nocrawl\.html$ /nocrawl.html [L]

'badspider' is a regular expression matching the reported user agent of the bad spider. 'nocrawl.html' is simply a short html page with no links.

Alternatively, you can add the following code in the beginning of the index.php file:

if( isset( $_SERVER['HTTP_USER_AGENT'] ) &&
    preg_match( '/badspider/', $_SERVER['HTTP_USER_AGENT'] ) )
  die( 'Go away' );

Modified on Thursday 24 March 2005 11:05:16 pm by Harry Oosterveen

Friday 25 March 2005 1:44:01 pm

You can find more info on http://ezinearticles.com/?Invasion-of-the-Email-Snatchers&id=20846. That page also includes a list of bad spiders, formatted for the .htaccess method I mentioned above. To apply the list to the PHP code for the index.php file, use this:

$badspiders = array( 
  'EmailSiphon',
  'EmailWolf',
  'ExtractorPro',
  'Mozilla.*NEWT',
  'Crescent',
  'CherryPicker',
  '[Ww]eb[Bb]andit',
  'WebEMailExtrac.*',
  'NICErsPRO',
  'Telesoft',
  'Zeus.*Webster',
  'Microsoft.URL',
  'Mozilla/3.Mozilla/2.01',
  'EmailCollector' );
	
// use ~ as the delimiter, since some of the patterns contain a slash
$regex = '~^(' . join( '|', $badspiders ) . ')~';

if( preg_match( $regex, $_SERVER['HTTP_USER_AGENT'])) {
  die( 'Go away' );
}

Note that new robots will evolve, so you have to adapt this list regularly.

Monday 28 March 2005 12:14:07 pm

There is a much easier way...

Just list a directory at the top of your robots.txt file, and put a script in that directory that adds whoever visits it to a ban list. That way, if a spider doesn't respect the file, it gets locked out as soon as it starts to investigate. Your human traffic will be unaffected.

You could easily adapt that PHP code into a simple script to do it. Just add a database handler and a three-column table with id, name, and ip.
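The trap could be sketched like this, using a flat file instead of the database table (all file names and paths here are assumptions, and file_put_contents / FILE_IGNORE_NEW_LINES require PHP 5):

```php
<?php
// Hypothetical honeypot: save this as e.g. /trap/index.php and add
// "Disallow: /trap/" to robots.txt. Only robots that ignore the file
// will ever request the page.

function ban_ip( $banfile, $ip )
{
    // read the current ban list, if any
    $banned = file_exists( $banfile )
        ? file( $banfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES )
        : array();

    // append the offending IP only once
    if ( !in_array( $ip, $banned ) )
        file_put_contents( $banfile, $ip . "\n", FILE_APPEND );
}

// In /trap/index.php:
//   ban_ip( '/path/to/banned_ips.txt', $_SERVER['REMOTE_ADDR'] );
//   die( 'Go away' );
//
// And near the top of index.php:
//   if ( file_exists( '/path/to/banned_ips.txt' ) &&
//        in_array( $_SERVER['REMOTE_ADDR'],
//                  file( '/path/to/banned_ips.txt', FILE_IGNORE_NEW_LINES ) ) )
//       die( 'Go away' );
```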

J

Monday 28 March 2005 2:04:28 pm

Your problem is that spiders visit your site at the wrong hours of the day, draining your server's resources. How about a script that replaces the robots.txt file at different times of day? Letting them crawl your site at night and banning all robots during the day, for example.
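That swap could be done with a small cron script along these lines (paths, file names, and the 8:00-20:00 window are assumptions; you would prepare the two robots files yourself):

```shell
#!/bin/sh
# Hypothetical cron script: copy one of two prepared robots files
# into place depending on the hour of day.

swap_robots() {
    docroot=$1   # document root containing robots.daytime.txt / robots.night.txt
    hour=$2      # current hour, 0-23

    if [ "$hour" -ge 8 ] && [ "$hour" -lt 20 ]; then
        # daytime: strict rules, e.g. "Disallow: /" for everyone
        cp "$docroot/robots.daytime.txt" "$docroot/robots.txt"
    else
        # night: permissive rules, let well-behaved robots crawl
        cp "$docroot/robots.night.txt" "$docroot/robots.txt"
    fi
}

# Run hourly from cron, e.g.:  0 * * * * /usr/local/bin/swap-robots.sh
if [ -f "${DOCROOT:-/var/www/html}/robots.daytime.txt" ]; then
    swap_robots "${DOCROOT:-/var/www/html}" "$(date +%H)"
fi
```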
