I don’t know if anyone except me will need this script, so i put it in blog just not to loose it
Very simple function analyze $_SERVER[’HTTP_USER_AGENT’] variable and looking for crawler signature. If function founds crawler, it will return it’s name, otherwise – false.
Usage examples:
- save to database and output somethere in admin zone or on site
- save for indexing statistics and analyze it later
- use for cloacking or doorways :) ( i do not advise you to do it )
function crawlerDetect($USER_AGENT)
{
$crawlers = array(
array('Google', 'Google'),
array('msnbot', 'MSN'),
array('Rambler', 'Rambler'),
array('Yahoo', 'Yahoo'),
array('AbachoBOT', 'AbachoBOT'),
array('accoona', 'Accoona'),
array('AcoiRobot', 'AcoiRobot'),
array('ASPSeek', 'ASPSeek'),
array('CrocCrawler', 'CrocCrawler'),
array('Dumbot', 'Dumbot'),
array('FAST-WebCrawler', 'FAST-WebCrawler'),
array('GeonaBot', 'GeonaBot'),
array('Gigabot', 'Gigabot'),
array('Lycos', 'Lycos spider'),
array('MSRBOT', 'MSRBOT'),
array('Scooter', 'Altavista robot'),
array('AltaVista', 'Altavista robot'),
array('IDBot', 'ID-Search Bot'),
array('eStyle', 'eStyle Bot'),
array('Scrubby', 'Scrubby robot')
);
foreach ($crawler as $c)
{
if (stristr($USER_AGENT, $c[0]))
{
return($c[1]);
}
}
return false;
}
// example
$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);
if ($crawler )
{
// it is crawler, it's name in $crawler variable
}
else
{
// usual visitor
}UPDATE:
After reading this i decide to update my code a bit. Change is connected to usage of function on high volume website.
<?php
$crawlers = array(
'Google'=>'Google',
'MSN' => 'msnbot',
'Rambler'=>'Rambler',
'Yahoo'=> 'Yahoo',
'AbachoBOT'=> 'AbachoBOT',
'accoona'=> 'Accoona',
'AcoiRobot'=> 'AcoiRobot',
'ASPSeek'=> 'ASPSeek',
'CrocCrawler'=> 'CrocCrawler',
'Dumbot'=> 'Dumbot',
'FAST-WebCrawler'=> 'FAST-WebCrawler',
'GeonaBot'=> 'GeonaBot',
'Gigabot'=> 'Gigabot',
'Lycos spider'=> 'Lycos',
'MSRBOT'=> 'MSRBOT',
'Altavista robot'=> 'Scooter',
'AltaVista robot'=> 'Altavista',
'ID-Search Bot'=> 'IDBot',
'eStyle Bot'=> 'eStyle',
'Scrubby robot'=> 'Scrubby',
);
function crawlerDetect($USER_AGENT)
{
// to get crawlers string used in function uncomment it
// it is better to save it in string than use implode every time
// global $crawlers
// $crawlers_agents = implode('|',$crawlers);
$crawlers_agents = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
if ( strpos($crawlers_agents , $USER_AGENT) === false )
return false;
// crawler detected
// you can use it to return its name
/*
else {
return array_search($USER_AGENT, $crawlers);
}
*/
}
// example
$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);
if ($crawler )
{
// it is crawler, it's name in $crawler variable
}
else
{
// usual visitor
}







Great code dude ! do you know the complete list of web spiders ?? anyone ??
i’m sorry, but i dont think that anyone knows full list
How to detect all website that have my link in others website. Please
Your function is not working. I tryed like this:
function test($userAgent) { $crawlers = 'Opera|Mozilla|HostTracker|EasyDL|e-collector|EmailCollector|Telesoft|Twiceler|InternetSeer.com|MJ12bot|YahooFeedSeeker|Yahoo-MMCrawler|Yandex|findlinks|Bloglines subscriber|Dumbot|Sosoimagespider|QihooBot|FAST-WebCrawler|Superdownloads Spiderman|LinkWalker|msnbot|ASPSeek|WebAlta Crawler|Lycos|FeedFetcher-Google|Yahoo|YoudaoBot|AdsBot-Google|Googlebot|Scooter|Gigabot|Charlotte|eStyle|AcioRobot|GeonaBot|msnbot-media|Baidu|CocoCrawler|Google|Charlotte t|Yahoo! Slurp China|Sogou web spider|YodaoBot|MSRBOT|AbachoBOT|Sogou head spider|AltaVista|IDBot|Sosospider|Yahoo! Slurp|Java VM|DotBot|LiteFinder|Yeti|Rambler|Scrubby|Baiduspider|accoona'; if ( strpos($crawlers , $USER_AGENT) === false ) return false; } if (getIsCrawler($_SERVER['HTTP_USER_AGENT'])) die('spider');Tryed with mozilla and with Opera
you should really just delete this…. bad coding and slow. stick to designing ;)
There is an API available at http://www.atlbl.com that detects all forms of webcrawlers ( normal, stealth, evil )