
Charter - Search Engine Spider and User Agent Identification

Forum: Search Engine Spider and User Agent Identification

Category: The Search Engine World

Moderator: incrediBILL & Ocean10000

Previous Moderator: volatilegx (founding moderator: littleman)

Founded: Nov 2, 1999

Overview:
Spiders are small independent programs that go out and download websites. They take the website data (the same data that is viewed in a browser) and use it for various purposes. Our theme here is mainly search engine promotion, so we are mostly concerned with search engine spiders.

PREMODERATED FORUM
Every thread must be approved by a moderator before it is published. Please see the guidelines below for reasons why posts may not be approved. We try to make pre-moderation decisions in a timely manner, but because we are a volunteer staff and not always available, a decision can take as long as 12-24 hours.

The moderators often edit post titles and may not always send a note to explain. Title edits are made to attract more clicks to your thread, to clarify differences between similar topics, and to help similar discussions appear as clearly non-duplicate to the search engines.

Topics Covered:
Spiders, spider IPs, and other spider topics. Spider design, care, and feeding are also welcome.

Additionally, some spiders hide behind the default user agents of various programming libraries [webmasterworld.com] or behind common browser user agents. The scope of the forum has therefore expanded to include generic user agent identification and elimination as part of the spider identification process.

Posting Guidelines:
The WebmasterWorld Terms of Service [webmasterworld.com] remain in full effect in this forum.

IP addresses tend to change ownership over time, so unless an IP address is expressly owned by a search engine, such as Google or Yahoo, it needs to be obfuscated in the D block (the last octet) of the address.

Any IP address or reverse DNS information not expressly belonging to a search engine should be masked as follows:

  • Example IP: 111.222.333.nnn
  • Example DNS: nnn.333.222.111.example.com

    Additionally, the IPs should be obscured when discussing distributed crawlers that are run from volunteer computers.
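
The masking convention above can be applied mechanically before posting log excerpts. The following is a minimal sketch, not an official forum tool: the function names `mask_ip` and `mask_reverse_dns` are hypothetical, only IPv4 dotted-quad addresses are handled, and the reverse DNS helper assumes the host name starts with the octets in reversed order, as in the example above.

```python
import re

# Matches dotted-quad IPv4 text. It does not validate octet ranges, so it
# also catches placeholder addresses such as 111.222.333.444.
_IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def mask_ip(text: str) -> str:
    """Replace the D block (last octet) of every IPv4 address with 'nnn'."""
    return _IPV4.sub(r"\1.\2.\3.nnn", text)


def mask_reverse_dns(hostname: str, ip: str) -> str:
    """Mask the D block in a reverse DNS name of the form d.c.b.a.domain.

    Host names that do not start with the reversed octets are returned
    unchanged and still need to be masked by hand.
    """
    a, b, c, d = ip.split(".")
    prefix = f"{d}.{c}.{b}.{a}."
    if hostname.startswith(prefix):
        return f"nnn.{c}.{b}.{a}." + hostname[len(prefix):]
    return hostname


if __name__ == "__main__":
    print(mask_ip("hit from 192.0.2.45 at 10:02"))
    print(mask_reverse_dns("45.2.0.192.example.com", "192.0.2.45"))
```

Because the pattern matches any four dot-separated numbers, it will also mask things like four-part version strings, so it is worth rereading the output before posting.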

    Links that are allowed to be posted within the Spider ID forum:

  • Links contained within Search Engine User Agent strings are allowed
  • Links to the Search Engine home page
  • Links to the Search Engine crawler page or robots.txt page
  • Educational material and standards documents - Microsoft, Apache, Google Guidelines, etc.
  • Authoritative news stories - NY Times, Wall Street Journal, PC World, Wired, BBC, CNN, NBC, etc.

    Please do not link to other forums or blogs.

    In addition, it is never appropriate to link to any website that you operate or that hosts your own content - no matter how authoritative that content may be.

    Information from WHOIS should be limited to the host name and IP range, leaving out specific names, addresses, or other personal information.

    References:

    Google

  • How to Verify Googlebot [googlewebmastercentral.blogspot.com] and most major search engines with round trip DNS
  • Blocking Googlebot and other Google robots with robots.txt [google.com]
  • Googlebot: main spider for the web and news index
  • Googlebot-Mobile: spiders the mobile index
  • Googlebot-Image: crawls for the image index
  • Mediapartners-Google: AdSense spider only used if AdSense ads are displayed on your site.
  • Adsbot-Google: AdWords landing page quality spider, used only when your site is advertised through Google AdWords.
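
The round trip DNS check mentioned in the first reference can be sketched roughly as follows: do a reverse lookup on the visiting IP, confirm the resulting host name belongs to the search engine's domain, then do a forward lookup on that host name and confirm it returns the original IP. This is an illustration under assumptions, not Google's own procedure: the function name `verify_crawler` and the default suffix list are choices made for this example, and `socket.gethostbyname_ex` resolves IPv4 only. The resolver functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List, Sequence


def _reverse(ip: str) -> str:
    # Reverse (PTR) lookup: returns the primary host name for the IP.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # Forward (A) lookup: returns the IPv4 addresses for the host name.
    return socket.gethostbyname_ex(host)[2]


def verify_crawler(
    ip: str,
    allowed_suffixes: Sequence[str] = (".googlebot.com", ".google.com"),
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True only if the IP passes a reverse-then-forward DNS check."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # The leading dot in each suffix prevents matches such as
    # "notgooglebot.com" or "googlebot.com.example.net".
    if not host.endswith(tuple(allowed_suffixes)):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The forward lookup is the step that matters: anyone who controls the reverse DNS for their own IP range can make it claim any host name, but they cannot make the search engine's own zone point back at their address. For other engines, pass the suffixes given in that engine's own validation page.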

    Yahoo

  • How to validate Yahoo! crawlers [ysearchblog.com]
  • Slurp [help.yahoo.com]: main spider for web, images, and more.
  • YahooSeeker/M1A1-R2D2 [help.yahoo.com]: crawler that collects documents from the mobile web


    Bing

  • How to validate Bing crawlers [bing.com]
  • MSNBot [help.live.com]: the Bing web spider that finds text, documents, images, and links for the index.

    Ask

  • How to validate Ask crawlers [about.ask.com]
  • Teoma [about.ask.com]: the Ask web spider that locates the text, images and links that appear in the Ask index.
  • Firefox Minefield: Ask has been making screenshots of its indexed pages using the user agent "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"
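
As a starting point for log analysis, the spider names in the reference lists above can be matched against the User-Agent header. The sketch below is illustrative only: the table `SPIDER_TOKENS` and the function `claimed_spider` are hypothetical names, and the labels paraphrase the descriptions above. A user agent string is only a claim and is easy to forge, so a match says which spider a client claims to be, not which spider it is.

```python
from typing import List, Optional, Tuple

# (token, label) pairs taken from the spider names listed above.
# More specific tokens come first so that "Googlebot-Image" is not
# reported as plain "Googlebot".
SPIDER_TOKENS: List[Tuple[str, str]] = [
    ("Googlebot-Mobile", "Google (mobile index)"),
    ("Googlebot-Image", "Google (image index)"),
    ("Mediapartners-Google", "Google (AdSense)"),
    ("AdsBot-Google", "Google (AdWords landing page quality)"),
    ("Googlebot", "Google (web and news index)"),
    ("YahooSeeker", "Yahoo (mobile web)"),
    ("Slurp", "Yahoo (web, images, and more)"),
    ("msnbot", "Bing"),
    ("Teoma", "Ask"),
    ("Minefield", "Ask (screenshots)"),
]


def claimed_spider(user_agent: str) -> Optional[str]:
    """Return a label for the first known spider token found, else None."""
    ua = user_agent.lower()
    for token, label in SPIDER_TOKENS:
        if token.lower() in ua:
            return label
    return None


if __name__ == "__main__":
    print(claimed_spider(
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    ))
```

Matching is case-insensitive because the same spider name can appear with different capitalization in logs. Any hit is best confirmed against the engine's published validation method before acting on it.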



    All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
    WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
    © Webmaster World 1996-2014 all rights reserved