Known Spider Information and References

         

incrediBILL

8:47 pm on Jul 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The purpose of this thread is to collect a list of common search engine user agent names and domains, with links to each crawler's information page and/or robots.txt page.

Hopefully this list will help people find information about known spiders quickly and gain some insight into each spider's purpose, as well as serve as a quick reference in the future.

Let me kick off this list with a few entries of my own:

Google

  • How to Verify Googlebot [googlewebmastercentral.blogspot.com] and most major search engines with round trip DNS
  • Blocking Googlebot and other Google robots with robots.txt [google.com]
  • Googlebot: main spider for the web and news index
  • Googlebot-Mobile: spiders the mobile index
  • Googlebot-Image: crawls for the image index
  • Mediapartners-Google: AdSense spider, used only if AdSense ads are displayed on your site.
  • Adsbot-Google: AdWords landing page quality spider, used only when your site is advertised through Google AdWords.
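
The round trip DNS check mentioned in the first link above can be sketched roughly as follows. This is a minimal Python sketch, not code from Google or any search engine: the function name `verify_crawler`, the `reverse` and `forward` hooks, and the suffix list are assumptions made for illustration. The idea is a reverse lookup on the visiting IP, a check that the hostname falls under a domain the search engine documents, and then a forward lookup to confirm that hostname maps back to the same IP.

```python
import socket


def verify_crawler(ip, allowed_suffixes, reverse=None, forward=None):
    """Return True if ip passes a round trip (reverse + forward) DNS check."""
    # Default resolvers use the system DNS (IPv4 only). The hooks exist
    # only so the logic can be exercised without network access.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: reverse lookup, IP -> hostname.
        host = reverse(ip).rstrip(".").lower()
        # Step 2: the hostname must end in one of the documented domains.
        if not host.endswith(tuple(allowed_suffixes)):
            return False
        # Step 3: forward lookup, hostname -> IPs, must include the original IP.
        return ip in forward(host)
    except OSError:
        # Lookup failures (no PTR record, unknown host) count as unverified.
        return False
```

A call might look like `verify_crawler(visitor_ip, (".googlebot.com", ".google.com"))`, where `visitor_ip` is a placeholder for the address from your logs. The leading dot in each suffix keeps a hostname such as `fakegooglebot.com` from matching. Check each search engine's own documentation for the domains it actually uses.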

Yahoo

  • Slurp [help.yahoo.com]: main spider for web, images, and more.
  • YahooSeeker/M1A1-R2D2 [help.yahoo.com]: crawler that collects documents from the mobile web

Bing

  • MSNBot [help.live.com]: the Bing web spider that finds text, documents, images, and links for the index.

Ask

  • Teoma [about.ask.com]: the Ask web spider that locates the text, images and links that appear in the Ask index.
  • Firefox Minefield: Ask has been making screenshots of its indexed pages using the user agent "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1".

Feel free to contribute IP ranges for spiders that don't support round trip DNS validation.

However, IPs should not be submitted for distributed crawlers that run on volunteer computers.

    wilderness

    8:50 pm on Jul 16, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Bill,
    This list, some six years old [webmasterworld.com], was created by bull.
    At one time there was an active link in the forum Library.

    Perhaps this list needs updating?

    incrediBILL

    6:49 am on Jul 19, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    That's ancient; we need some current data.

    GaryK

    5:34 pm on Jul 19, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Isn't current data obtained by simply looking at recent forum posts?

    incrediBILL

    6:15 pm on Jul 19, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Not everything, and digging through the entire forum for each bot gets time-consuming. If nothing else, dropping links in a single thread like Bull did will help everyone.

    I was hoping to create a resource like we did with the default programming library thread.

    If there's no interest, I can just kill the thread and forget about it.

    tangor

    6:18 pm on Jul 19, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    I'm interested!

    GaryK

    6:32 pm on Jul 19, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Bill, since you know I keep an extensive list of bots and other user agents, perhaps it makes sense for me to add a database field with a link to the WebmasterWorld forum thread for each one. Then every so often I can generate an update and send it to you for posting here. I often tweet about new UAs and include a shortened URL to the thread here where we're discussing them, so it wouldn't be a huge effort to fill in more links and code the report.

    incrediBILL

    4:00 am on Jul 20, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    If someone can help automate it, even better. Thanks for volunteering, Gary!

    I'm sure others wouldn't mind helping to find all the threads initially if you can dump out a list of bots.

    This could easily be the best bot index on the web.

    Ideally we would want to know the bot name, site, crawler page and thread, and hopefully the thread will have the last 3 if possible.
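
One possible shape for such an entry, sketched in Python. The class name `BotEntry`, its field names, and the `missing_fields` helper are hypothetical and not taken from anyone's actual database; the sample values are only illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BotEntry:
    """One row of the bot index: name plus three optional links."""
    name: str
    site: Optional[str] = None          # home page of the bot's operator
    crawler_page: Optional[str] = None  # page describing the crawler
    thread: Optional[str] = None        # forum thread discussing the bot

    def missing_fields(self) -> List[str]:
        """List the links that still need to be tracked down."""
        return [f for f in ("site", "crawler_page", "thread")
                if getattr(self, f) is None]


# Illustrative entry; the thread link is left unset on purpose.
entry = BotEntry(
    name="Gigabot",
    site="http://www.gigablast.com/",
    crawler_page="http://www.gigablast.com/spider.html",
)
print(entry.missing_fields())
```

A helper like `missing_fields` would make it easy to dump a list of bots that still need a thread link, which is the part volunteers could fill in.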

    GaryK

    4:25 am on Jul 20, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    I'll get to work on it in the morning starting with a dump of bot names.

    GaryK

    5:58 pm on Jul 22, 2009 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Bill, did you ever get my second PM with the link to the file on my server?

    BradleyT

    6:47 pm on Sep 4, 2009 (gmt 0)

    10+ Year Member



    Any updates on this?

    phranque

    9:15 am on Sep 5, 2009 (gmt 0)

    WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



    i created this list of bot-like user agent strings.
    admittedly it was quite a few months ago and from a relatively small sample but it could be useful for the purposes of kicking this thread again:

    AdsBot-Google+(+http://www.google.com/adsbot.html)
    Gigabot/3.0+(G75)
    Gigabot/3.0+(http://www.gigablast.com/spider.html)
    Googlebot-Image/1.0
    Java/1.6.0_10
    Mozilla/4.0+(compatible;+BOTW+Spider;++http://botw.org)
    Mozilla/5.0+(Twiceler-0.9+http://www.cuil.com/twiceler/robot.html)
    Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+fr;+rv:1.8.1)+VoilaBot+BETA+1.2+(support.voilabot@orange-ftgroup.com)
    Mozilla/5.0+(compatible;+Ask+Jeeves/Teoma;++http://about.ask.com/en/docs/about/webmasters.shtml)
    Mozilla/5.0+(compatible;+Charlotte/1.1;+http://www.searchme.com/support/)
    Mozilla/5.0+(compatible;+DBLBot/1.0;++http://www.dontbuylists.com/)
    Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
    Mozilla/5.0+(compatible;+ScoutJet;++http://www.scoutjet.com/)
    Mozilla/5.0+(compatible;+Yahoo!+Slurp/3.0;+http://help.yahoo.com/help/us/ysearch/slurp)
    Mozilla/5.0+(compatible;+Yahoo!+Slurp;+http://help.yahoo.com/help/us/ysearch/slurp)
    Sosospider+(+http://help.soso.com/webspider.htm)
    SurveyBot/2.3+(Whois+Source)
    Yandex/1.01.001+(compatible;+Win16;+I)
    ia_archiver+(+http://www.alexa.com/site/help/webmasters;+crawler@alexa.com)
    msnbot-media/1.0+(+http://search.msn.com/msnbot.htm)
    msnbot/1.1+(+http://search.msn.com/msnbot.htm)

    i could do it again from a newer sample but i'm guessing GaryK is way ahead of me on something like this...
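
A rough Python sketch for pulling the crawler information URL out of strings like the ones listed above. It assumes the list came from a log format that writes spaces as "+" (the source of the list isn't stated here, so that is a guess), and the names `decode_log_ua` and `info_url` are made up for illustration. The decoding is lossy: a literal "+" in the original user agent, as in "(+http://...)", can't be told apart from an encoded space.

```python
import re

# Matches an http(s) URL up to the next whitespace, ";" or ")".
URL_RE = re.compile(r"https?://[^\s;)]+")


def decode_log_ua(raw):
    """Turn '+'-separated log text back into a space-separated string."""
    # Replace every '+' with a space, then collapse runs of spaces.
    return re.sub(r" {2,}", " ", raw.replace("+", " ")).strip()


def info_url(raw):
    """Return the first URL embedded in the user agent, or None."""
    match = URL_RE.search(decode_log_ua(raw))
    return match.group(0) if match else None


samples = [
    "Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)",
    "SurveyBot/2.3+(Whois+Source)",
]
for ua in samples:
    print(decode_log_ua(ua), "->", info_url(ua))
```

Entries with no URL, such as SurveyBot or Java/1.6.0_10, come back as `None` and would need their crawler page looked up by hand.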