Cult-Foo

Detect crawlers with PHP

I don't know if anyone except me will need this script, so I'm putting it on the blog just so I don't lose it.
This very simple function analyzes the $_SERVER['HTTP_USER_AGENT'] variable and looks for a crawler signature. If the function finds a crawler, it returns its name; otherwise it returns false.

Usage examples:
– save it to a database and output it somewhere in the admin zone or on the site (see the logging sketch after the example below)
– save it for indexing statistics and analyze it later
– use it for cloaking or doorway pages :) (I do not advise you to do this)

function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
    array('Google', 'Google'),
    array('msnbot', 'MSN'),
    array('Rambler', 'Rambler'),
    array('Yahoo', 'Yahoo'),
    array('AbachoBOT', 'AbachoBOT'),
    array('accoona', 'Accoona'),
    array('AcoiRobot', 'AcoiRobot'),
    array('ASPSeek', 'ASPSeek'),
    array('CrocCrawler', 'CrocCrawler'),
    array('Dumbot', 'Dumbot'),
    array('FAST-WebCrawler', 'FAST-WebCrawler'),
    array('GeonaBot', 'GeonaBot'),
    array('Gigabot', 'Gigabot'),
    array('Lycos', 'Lycos spider'),
    array('MSRBOT', 'MSRBOT'),
    array('Scooter', 'Altavista robot'),
    array('AltaVista', 'Altavista robot'),
    array('IDBot', 'ID-Search Bot'),
    array('eStyle', 'eStyle Bot'),
    array('Scrubby', 'Scrubby robot')
    );

    // check the user agent against each known signature
    foreach ($crawlers as $c)
    {
        if (stristr($USER_AGENT, $c[0]))
        {
            return($c[1]);
        }
    }

    return false;
}

// example

$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);

if ($crawler)
{
   // it is a crawler; its name is in the $crawler variable
}
else
{
   // regular visitor
}
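
For the first two usage examples above, here is a minimal logging sketch. The crawler_visits table, its columns, and the PDO connection details are illustrative assumptions, not part of the original script:

// log detected crawlers to a database so they can be reviewed in the admin zone later
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'password');

$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);

if ($crawler)
{
    $stmt = $pdo->prepare('INSERT INTO crawler_visits (crawler, url, visited_at) VALUES (?, ?, NOW())');
    $stmt->execute(array($crawler, $_SERVER['REQUEST_URI']));
}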

UPDATE:
After reading this, I decided to update my code a bit. The change has to do with using the function on a high-volume website.

<?php
  $crawlers = array(
    'Google' => 'Google',
    'MSN' => 'msnbot',
    'Rambler' => 'Rambler',
    'Yahoo' => 'Yahoo',
    'AbachoBOT' => 'AbachoBOT',
    'accoona' => 'Accoona',
    'AcoiRobot' => 'AcoiRobot',
    'ASPSeek' => 'ASPSeek',
    'CrocCrawler' => 'CrocCrawler',
    'Dumbot' => 'Dumbot',
    'FAST-WebCrawler' => 'FAST-WebCrawler',
    'GeonaBot' => 'GeonaBot',
    'Gigabot' => 'Gigabot',
    'Lycos spider' => 'Lycos',
    'MSRBOT' => 'MSRBOT',
    'Altavista robot' => 'Scooter',
    'AltaVista robot' => 'Altavista',
    'ID-Search Bot' => 'IDBot',
    'eStyle Bot' => 'eStyle',
    'Scrubby robot' => 'Scrubby',
  );

function crawlerDetect($USER_AGENT)
{
    // to rebuild the crawlers string used below, uncomment these lines
    // (it is better to keep a ready-made string than to call implode on every request)
    // global $crawlers;
    // $crawlers_agents = implode('|', $crawlers);
    $crawlers_agents = 'Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcoiRobot|ASPSeek|CrocCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';

    if ( !preg_match('~(' . $crawlers_agents . ')~i', $USER_AGENT, $matches) )
        return false;

    // crawler detected
    // if you need its name rather than just true, uncomment this block
    // (note: array_search is case-sensitive)
    /*
    global $crawlers;
    return array_search($matches[1], $crawlers);
    */
    return true;
}

// example

$crawler = crawlerDetect($_SERVER['HTTP_USER_AGENT']);

if ($crawler)
{
   // it is a crawler (enable the name lookup in crawlerDetect if you need its name)
}
else
{
   // regular visitor
}


24 Responses to "Detect crawlers with PHP"

  1. Great code, dude! Do you know the complete list of web spiders? Anyone?

  2. elPas0 says:

    I'm sorry, but I don't think anyone knows the full list.

  3. crohole says:

    How can I detect all the other websites that have my link on them? Please help.

  4. FDisk says:

    Your function is not working. I tried it like this:

    function test($userAgent) {
    	$crawlers = 'Opera|Mozilla|HostTracker|EasyDL|e-collector|EmailCollector|Telesoft|Twiceler|InternetSeer.com|MJ12bot|YahooFeedSeeker|Yahoo-MMCrawler|Yandex|findlinks|Bloglines subscriber|Dumbot|Sosoimagespider|QihooBot|FAST-WebCrawler|Superdownloads Spiderman|LinkWalker|msnbot|ASPSeek|WebAlta Crawler|Lycos|FeedFetcher-Google|Yahoo|YoudaoBot|AdsBot-Google|Googlebot|Scooter|Gigabot|Charlotte|eStyle|AcioRobot|GeonaBot|msnbot-media|Baidu|CocoCrawler|Google|Charlotte t|Yahoo! Slurp China|Sogou web spider|YodaoBot|MSRBOT|AbachoBOT|Sogou head spider|AltaVista|IDBot|Sosospider|Yahoo! Slurp|Java VM|DotBot|LiteFinder|Yeti|Rambler|Scrubby|Baiduspider|accoona';
    	if ( strpos($crawlers  , $USER_AGENT) === false )
    		return false;
    }
    if (getIsCrawler($_SERVER['HTTP_USER_AGENT']))
    	die('spider');
    

    Tried with Mozilla and with Opera.

  5. Nathaniel says:

    you should really just delete this…. bad coding and slow. stick to designing ;)

  6. Mark says:

    There is an API available at http://www.atlbl.com that detects all forms of webcrawlers ( normal, stealth, evil )

  7. rozwell says:

    Seriously? Do you really think $_SERVER['HTTP_USER_AGENT'] will return ONLY "Google"?
    Here is a real-life example of how Google identifies itself:
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    It won't work unless you use a regexp like on the site you linked (preg_match there).

    • karmeloul says:

      OMG, don't u know what the function 'strpos' really does? It returns TRUE if the string is found…

      • Carl says:

        Not true, karmeloul. It returns the index of the first character of the match if found, and false otherwise. If the match is at the beginning of the string, it returns 0 (the 0th character is the start of the searched string). This is why it is important to do the strpos comparison with ===, not ==.
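
        A tiny illustration of that distinction (the output shown is what PHP actually prints):

        var_dump(strpos('Googlebot/2.1', 'Google'));             // int(0)       -- found at position 0
        var_dump(strpos('Googlebot/2.1', 'msnbot'));             // bool(false)  -- not found
        var_dump(strpos('Googlebot/2.1', 'Google') == false);    // bool(true)   -- misleading!
        var_dump(strpos('Googlebot/2.1', 'Google') === false);   // bool(false)  -- correct check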

  8. Rick says:

    This code does not seem to work. Bots (like Google), in my experience so far, have a $_SERVER['HTTP_USER_AGENT'] that looks like

    “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”

    this does not trigger your function …

  9. Izzuddin says:

    May I ask something? How does a search engine (a local-server search engine) work with robots?

    Let me give you an example: say I have 3 localhost websites and 1 search engine website. How do I make the search engine website read the robots.txt file on each of my websites? I hope you understand my question. You can reply via my email.

  10. PerlProgrammer says:

    I’m trying something like this in Perl:

    my $client_agnt=$ENV{HTTP_USER_AGENT};

    if($client_agnt=~/(libwww-perl)/i
    or $client_agnt=~/(Robot)/i
    or $client_agnt=~/(Spider)/i
    or $client_agnt=~/(Crawler)/i
    or $client_agnt=~/(Google)/i
    or $client_agnt=~/(Fireball)/i
    or $client_agnt=~/(Lycos)/i
    or $client_agnt=~/(Eule)/i
    or $client_agnt=~/(Northernlight)/i
    or $client_agnt=~/(Aladin)/i
    or $client_agnt=~/(Proxy)/i
    or $client_agnt=~/(Minder)/i
    or $client_agnt=~/(Accoona)/i
    or $client_agnt=~/(Yahoo)/i
    or $client_agnt=~/(MSN)/i
    or $client_agnt=~/(Rambler)/i
    or $client_agnt=~/(Seek)/i)
    {
    &ShowPage('user_agent.html',$cont_html,{'crawler'=>$1});
    }

    This generates a page with useless content; the crawler thinks it was successful and sends the content back to where it came from. Do you guys know what other crawler types I could query?

  11. k1ng says:

    function crawlerDetect($USER_AGENT) {
        $crawlers_agents = 'Google|GoogleBot|Googlebot|msnbot|Rambler|Yahoo|AbachoBOT|accoona|AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';
        $crawlers = explode("|", $crawlers_agents);
        foreach ($crawlers as $crawler) {
            if ( strpos($USER_AGENT, $crawler) !== false)
                return true;
        }
        return false;
    }

    if (crawlerDetect($_SERVER['HTTP_USER_AGENT']) === true) {
        echo "go away bot";
    } else {
        echo "welcome to my website";
    }

  12. Leo Myers says:

    For people who want to build a more complete list of crawlers/bots, I suggest using a MySQL database. Use a table that stores your online users (information about the browser/IP of people browsing your website) with a field called "user_agent" and one called "page". Run a MySQL insert for every visitor, including bots, but for bots insert "empty" into the "page" field, while for real users insert a session variable. Then, after a day or two, go through the table: for any rows where "page" is empty, look at the submitted user agent and check for anything irregular. Then add it to your list of bots.
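
    A rough sketch of that logging approach (the visitors table, its columns, and the mysql_* calls are illustrative assumptions, and an open MySQL connection is assumed):

    // log every request; visitors without session data end up with an empty "page" field,
    // which over time surfaces the user agents of bots that never start a session
    session_start();
    $ua   = mysql_real_escape_string($_SERVER['HTTP_USER_AGENT']);
    $page = isset($_SESSION['page']) ? mysql_real_escape_string($_SESSION['page']) : '';
    mysql_query("INSERT INTO visitors (user_agent, page) VALUES ('$ua', '$page')");

    // a day or two later, review the suspicious user agents:
    // SELECT DISTINCT user_agent FROM visitors WHERE page = '';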

  13. atomiku says:

    Actually hey, I did find this useful – thanks.
    I used it in a simple view counter script, as I didn’t want the counter to include views from search engines / spiders.
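
    In a view counter that boils down to something like this (the pages table, its views column, $pageId, and the open database connection are illustrative assumptions):

    // only count the view when the visitor is not a recognised crawler
    if (!crawlerDetect($_SERVER['HTTP_USER_AGENT'])) {
        mysql_query('UPDATE pages SET views = views + 1 WHERE id = ' . (int) $pageId);
    }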

  14. Hey, thanks a lot for such a nice piece of code. It works nicely.

    Thanks
    Dhanesh Mane

  15. Kevin V says:

    Most cheapo web crawlers won’t take a cookie, so set one and retrieve it. If you don’t get the cookie back, you know it’s either a crawler or someone with cookies off.
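
    A minimal sketch of that cookie test (the cookie name is arbitrary; remember that real visitors with cookies disabled will also fail it):

    // first request: hand out a cookie and note that it has not come back yet
    if (!isset($_COOKIE['chk'])) {
        setcookie('chk', '1', time() + 3600, '/');
        $maybeCrawler = true;   // no cookie returned: a crawler, a first visit, or cookies off
    } else {
        $maybeCrawler = false;  // the cookie came back, so this client accepts cookies
    }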

  16. Nice post! thank you for sharing.

  17. Carl says:

    It should be noted that the block of code as originally posted will not function. The foreach loop iterates over an array called $crawler which has not been initialized. It should read 'foreach ($crawlers as $c)'.

  18. John says:

    The second script as posted will never work, because $_SERVER['HTTP_USER_AGENT'] gives the whole string, not just 'Google', so it cannot be compared: $crawlers_agents does not contain the whole $_SERVER['HTTP_USER_AGENT'], only a part of it.

  19. budyk_ir says:

    Great one… I'm using this function.

  20. birkof says:

    It is not accurate. There are hundreds of spiders and crawlers. Better is:

    if the whole user agent string is missing = robot
    if we're doing extended tracking and the browser type was unidentifiable, then it's most likely a bot (browser signatures: http://browscap.org/)
    some bots actually say they're bots right up front = if(preg_match("#(bot|Bot|spider|Spider|crawl|Crawl)#",$browser))
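
    A short sketch combining those heuristics (variable names are illustrative):

    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    if ($ua === '') {
        $isBot = true;                                  // no user agent at all: treat it as a robot
    } elseif (preg_match('#(bot|spider|crawl)#i', $ua)) {
        $isBot = true;                                  // self-identifying bots and crawlers
    } else {
        $isBot = false;                                 // otherwise assume a normal browser
    }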

  21. Jake says:

    Thanks for your script.
    I would like to block bad bots, but I have found that some bad bots fake their user agent as Googlebot.
    Do you know how to check crawlers by their IP's hostname rather than by user agent?
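
    A sketch of the hostname check Jake asks about, using a forward-confirmed reverse DNS lookup (the domain pattern below covers Google; other engines publish their own crawler hostnames):

    // verify that a client claiming to be Googlebot really resolves back to Google
    function looksLikeRealGooglebot($ip)
    {
        $host = gethostbyaddr($ip);                                    // reverse lookup
        if (!$host || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
            return false;                                              // not a Google hostname
        }
        return gethostbyname($host) === $ip;                           // forward-confirm it
    }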
