I don’t know if anyone except me will need this script, so I put it on the blog just not to lose it.
This very simple function analyzes the $_SERVER['HTTP_USER_AGENT'] variable and looks for a crawler signature. If the function finds a crawler, it returns its name; otherwise it returns false.
Usage examples:
– save it to a database and output it somewhere in the admin zone or on the site
– save it for indexing statistics and analyze it later
– use it for cloaking or doorways :) (I do not advise you to do it)
function crawlerDetect($USER_AGENT) {
    $crawlers = array(
        array('Google', 'Google'),
        array('msnbot', 'MSN'),
        array('Rambler', 'Rambler'),
        array('Yahoo', 'Yahoo'),
        array('AbachoBOT', 'AbachoBOT'),
        array('accoona', 'Accoona'),
        array('AcoiRobot', 'AcoiRobot'),
        array('ASPSeek', 'ASPSeek'),
        array('CrocCrawler', 'CrocCrawler'),
        array('Dumbot', 'Dumbot'),
        array('FAST-WebCrawler', 'FAST-WebCrawler'),
        array('GeonaBot', 'GeonaBot'),
        array('Gigabot', 'Gigabot'),
        array('Lycos', 'Lycos spider'),
        array('MSRBOT', 'MSRBOT'),
        array('Scooter', 'Altavista robot'),
        array('AltaVista', 'Altavista robot'),
        array('IDBot', 'ID-Search Bot'),
        array('eStyle', 'eStyle Bot'),
        array('Scrubby', 'Scrubby robot')
    );

    foreach ($crawlers as $c) {   // iterate over the $crawlers array defined above
        if (stristr($USER_AGENT, $c[0])) {
            return $c[1];         // signature found: return the crawler's name
        }
    }
    return false;
}

// example
$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);
if ($crawler) {
    // it is a crawler, its name is in the $crawler variable
} else {
    // usual visitor
}
UPDATE:
After reading this I decided to update my code a bit. The change relates to using the function on a high-volume website.
<?php
$crawlers = array(
    'Google'          => 'Google',
    'MSN'             => 'msnbot',
    'Rambler'         => 'Rambler',
    'Yahoo'           => 'Yahoo',
    'AbachoBOT'       => 'AbachoBOT',
    'accoona'         => 'Accoona',
    'AcoiRobot'       => 'AcoiRobot',
    'ASPSeek'         => 'ASPSeek',
    'CrocCrawler'     => 'CrocCrawler',
    'Dumbot'          => 'Dumbot',
    'FAST-WebCrawler' => 'FAST-WebCrawler',
    'GeonaBot'        => 'GeonaBot',
    'Gigabot'         => 'Gigabot',
    'Lycos spider'    => 'Lycos',
    'MSRBOT'          => 'MSRBOT',
    'Altavista robot' => 'Scooter',
    'AltaVista robot' => 'Altavista',
    'ID-Search Bot'   => 'IDBot',
    'eStyle Bot'      => 'eStyle',
    'Scrubby robot'   => 'Scrubby',
);

function crawlerDetect($USER_AGENT) {
    // to build the crawlers string used below, uncomment these lines;
    // it is better to save it in a string than to call implode every time
    // global $crawlers;
    // $crawlers_agents = implode('|', $crawlers);
    $crawlers_agents = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';

    if (strpos($crawlers_agents, $USER_AGENT) === false)
        return false;

    // crawler detected
    // you can use this to return its name
    /*
    else {
        return array_search($USER_AGENT, $crawlers);
    }
    */
}

// example
$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);
if ($crawler) {
    // it is a crawler, its name is in the $crawler variable
} else {
    // usual visitor
}
Great code, dude! Do you know the complete list of web spiders? Anyone?
I’m sorry, but I don’t think anyone knows the full list.
How can I detect all websites that have my link on other websites? Please help.
Your function is not working. I tried it like this:
Tried with Mozilla and with Opera.
You should really just delete this… bad coding and slow. Stick to designing ;)
There is an API available at http://www.atlbl.com that detects all forms of web crawlers (normal, stealth, evil).
Seriously? Do you really think $_SERVER['HTTP_USER_AGENT'] will return ONLY "Google"?
Here is a real-life example of how Google identifies itself:
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
It won’t work unless you use a regexp like on the site you linked (preg_match there).
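For example, a minimal sketch of that approach (just the crawler names from the post, matched case-insensitively anywhere in the full user agent string; the function name is only an example):

<?php
// Sketch only: detect a crawler by matching known names inside the UA string.
function crawlerDetectRegex($USER_AGENT) {
    $pattern = '#(Googlebot|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcoiRobot|'
             . 'ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|'
             . 'MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby)#i';

    if (preg_match($pattern, $USER_AGENT, $matches)) {
        return $matches[1];   // the crawler name that matched
    }
    return false;
}

// "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" returns "Googlebot"
$crawler = crawlerDetectRegex($_SERVER['HTTP_USER_AGENT']);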
OMG, don’t you know what the function 'strpos' really does? It returns TRUE if the string is found…
Not true, Kameloul. It returns the index of the first character of the needle if found, and false otherwise. If the needle is found at the beginning of the string, it will return 0 (the 0th character is the start of the searched string). This is why it is important to do the strpos comparison with === and not ==.
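A quick illustration of that point (sketch, with a shortened user agent string):

<?php
$ua = 'Googlebot/2.1 (+http://www.google.com/bot.html)';

var_dump(strpos($ua, 'Googlebot'));           // int(0): found, at position 0
var_dump(strpos($ua, 'Googlebot') == false);  // bool(true): loose comparison wrongly says "not found"
var_dump(strpos($ua, 'Googlebot') !== false); // bool(true): strict comparison correctly says "found"
var_dump(strpos($ua, 'Yahoo'));               // bool(false): genuinely not found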
This code does not seem to work. Bots (like Google), in my experience so far, have a $_SERVER['HTTP_USER_AGENT'] that looks like
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This does not trigger your function…
May I ask something? How does a search engine (a local server search engine) work with robots?
Let me give you an example: say I have 3 localhost websites and 1 search engine website. How do I make the search engine website read the robots.txt file on each of my websites? I hope you understand my question. You can reply via my email.
I’m trying something like this in Perl:
my $client_agnt=$ENV{HTTP_USER_AGENT};
if($client_agnt=~/(libwww-perl)/i
or $client_agnt=~/(Robot)/i
or $client_agnt=~/(Spider)/i
or $client_agnt=~/(Crawler)/i
or $client_agnt=~/(Google)/i
or $client_agnt=~/(Fireball)/i
or $client_agnt=~/(Lycos)/i
or $client_agnt=~/(Eule)/i
or $client_agnt=~/(Northernlight)/i
or $client_agnt=~/(Aladin)/i
or $client_agnt=~/(Proxy)/i
or $client_agnt=~/(Minder)/i
or $client_agnt=~/(Accoona)/i
or $client_agnt=~/(Yahoo)/i
or $client_agnt=~/(MSN)/i
or $client_agnt=~/(Rambler)/i
or $client_agnt=~/(Seek)/i)
{
&ShowPage('user_agent.html',$cont_html,{'crawler'=>$1});
}
This generates a page with useless content; the crawler thinks it was successful and sends the content back to its origin. Do you guys know what other crawler types I could query?
function crawlerDetect($USER_AGENT) {
    $crawlers_agents = 'Google|GoogleBot|Googlebot|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
    $crawlers = explode("|", $crawlers_agents);
    foreach ($crawlers as $crawler) {
        if (strpos($USER_AGENT, $crawler) !== false)
            return true;
    }
    return false;
}

if (crawlerDetect($_SERVER['HTTP_USER_AGENT']) === true) {
    echo "go away bot";
} else {
    echo "welcome to my website";
}
For people who want to build a complete list of crawlers/bots, I suggest using a MySQL database. Use a table that stores your online visitors (information about the browser/IP of people who browse your website) with a field called “user_agent” and one called “page”. Run an insert for all visitors (including bots), but for bots insert “empty” into the “page” field, and for users insert a session variable. Then, after a day or two, check through the table: for any “page” fields that show up empty (no data), look at the submitted user agent and watch for anything irregular. Then add it to your list of bots.
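A minimal sketch of that idea (the table name, column names and database credentials below are only assumptions for illustration; crawlerDetect() is the function from the post):

<?php
/*
CREATE TABLE visitor_log (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    user_agent VARCHAR(255) NOT NULL,
    page       VARCHAR(255) NOT NULL,   -- 'empty' for suspected bots, session id otherwise
    visited_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
*/

session_start();

$db = new mysqli('localhost', 'db_user', 'db_password', 'my_database');

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$page      = crawlerDetect($userAgent) ? 'empty' : session_id();   // known bots get 'empty'

$stmt = $db->prepare('INSERT INTO visitor_log (user_agent, page) VALUES (?, ?)');
$stmt->bind_param('ss', $userAgent, $page);
$stmt->execute();

// After a day or two: SELECT user_agent FROM visitor_log WHERE page = 'empty'
// and look through the unfamiliar user agents to extend the crawler list.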
Actually hey, I did find this useful – thanks.
I used it in a simple view counter script, as I didn’t want the counter to include views from search engines / spiders.
Hey, thanks a lot for such a nice piece of code. It works nicely.
Thanks
Dhanesh Mane
Most cheapo web crawlers won’t take a cookie, so set one and retrieve it. If you don’t get the cookie back, you know it’s either a crawler or someone with cookies off.
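A minimal sketch of that cookie test (the cookie name is an assumption; the result is only meaningful from the second request onward, and it also flags humans who have cookies disabled):

<?php
if (isset($_COOKIE['bot_probe'])) {
    $likelyBrowser = true;    // the client sent the cookie back: probably a real browser
} else {
    setcookie('bot_probe', '1', time() + 3600, '/');   // try again on the next request
    $likelyBrowser = false;   // crawler, first visit, or cookies turned off
}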
Thanks! :)
Nice post! Thank you for sharing.
It should be noted that the original block of code will not function as posted. The foreach loop iterates over an array called $crawler which has not been initialized. It should read ‘foreach ($crawlers as $c)’.
The second script will never work because $_SERVER['HTTP_USER_AGENT'] gives the whole string and not just 'Google', so it cannot be compared that way; $crawlers_agents does not contain the whole $_SERVER['HTTP_USER_AGENT'], only a part of it.
Great one… I’m using this function.
It is not accurate. There are hundreds of spiders and crawlers. Better is (see the sketch below):
– if the whole user agent string is missing = robot
– if we’re doing extended tracking and the browser type was unidentifiable, then it’s most likely a bot (browser signatures: http://browscap.org/)
– some bots actually say they’re bots right up front = if (preg_match("#(bot|Bot|spider|Spider|crawl|Crawl)#", $browser))
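A minimal sketch combining those checks (the browscap part is only hinted at in a comment, since get_browser() needs browscap.ini configured in php.ini; the function name is just an example):

<?php
function looksLikeBot($userAgent) {
    // 1. Missing user agent string: almost certainly not a normal browser.
    if ($userAgent === '') {
        return true;
    }

    // 2. Many bots identify themselves openly.
    if (preg_match('#(bot|spider|crawl)#i', $userAgent)) {
        return true;
    }

    // 3. Optional: let browscap decide (requires browscap.ini, see http://browscap.org/).
    // $info = get_browser($userAgent, true);
    // if (!empty($info['crawler'])) { return true; }

    return false;
}

// usage
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (looksLikeBot($ua)) {
    // treat the visitor as a crawler
}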
Thanks for your script.
I would like to block bad bots, but I found that some bad bots fake their user agent as Googlebot.
Do you know how to check crawlers by the IP’s hostname, not by the user agent?
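A sketch of the usual answer: verify a claimed Googlebot with a forward-confirmed reverse DNS lookup (reverse-resolve the IP, check that the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and make sure it points back to the same IP). The function name is only an example, and DNS lookups are slow, so cache the result in real use:

<?php
function isRealGooglebot($ip) {
    $host = gethostbyaddr($ip);                         // reverse DNS lookup
    if ($host === false || $host === $ip) {
        return false;                                   // no PTR record
    }
    if (!preg_match('#\.(googlebot|google)\.com$#i', $host)) {
        return false;                                   // hostname is not in Google's domains
    }
    return gethostbyname($host) === $ip;                // forward-confirm the hostname
}

// usage
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false && !isRealGooglebot($_SERVER['REMOTE_ADDR'])) {
    // claims to be Googlebot but fails the reverse DNS check: likely a fake
}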