vspider not adhering to robots.txt
tonynoriega (msg:3857510) - 3:39 pm on Feb 25, 2009 (gmt 0)

After I stole... uhmm... borrowed a robots.txt template from pageoneresults, I came across a new problem.

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /js/
Disallow: /_includes/

User-agent: *
Disallow: /

Our website search tool is Verity; its spider is called "vspider".

I realized I was blocking it, so I thought I could just add:

User-agent: vspider

and be all good...

Nope, it's still being blocked.

Has anyone run into an instance like this before?

If I can't allow this spider and have to remove the lines:

User-agent: *
Disallow: /

am I going to have to manually add every other possible spider or bot?

 

jdMorgan (msg:3857519) - 3:59 pm on Feb 25, 2009 (gmt 0)

The problem, which I warned about in Pageoneresults' thread, is likely that vspider does not handle the multiple-user-agent policy-record format defined by the original Standard for Robot Exclusion. Although that format is part of the original specification, many spiders don't handle it correctly -- Cuil's twiceler robot being a recent example.

Try adding a duplicate of the record you have, placed above it, listing only the vspider user-agent.

When creating a multiple-user-agent policy record, you should carefully test that each robot recognizes it and behaves accordingly. If you cannot test, check each robot's Webmaster Help page to see whether it claims to handle that format. If not, define separate policy records instead, as sketched below.
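
In separate-record form, the policy shown earlier would look like this, one record per spider (the remaining spiders follow the same pattern):

User-agent: googlebot
Disallow: /js/
Disallow: /_includes/

User-agent: slurp
Disallow: /js/
Disallow: /_includes/

...and so on for msnbot, teoma, W3C-checklink, and WDG_SiteValidator...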

Another approach is to use mod_rewrite or ISAPI Rewrite to rewrite (not redirect!) robot requests for robots.txt to one of two robots.txt files: one that allows access to all spiders, and another that denies access to all spiders. You could also rewrite robots.txt requests to a script that dynamically generates the proper robots.txt directives for each spider.
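
For example, a minimal .htaccess sketch of the two-file approach might look like this (the filenames robots-allow.txt and robots-deny.txt are placeholders, not names from this thread):

RewriteEngine On

# Welcome spiders get the permissive file...
RewriteCond %{HTTP_USER_AGENT} (googlebot|slurp|msnbot|teoma|vspider) [NC]
RewriteRule ^robots\.txt$ /robots-allow.txt [L]

# ...everyone else gets the restrictive one. These are internal
# rewrites, not redirects: every client still requests /robots.txt.
RewriteRule ^robots\.txt$ /robots-deny.txt [L]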

When using either dynamic robots.txt delivery approach, be careful what you do with unrecognized spiders -- whether you allow or deny them. Allowing them means you have to maintain the script for new unwelcome spiders, while denying them risks the chance that some major 'bot might change its user-agent string and be unrecognized and denied.

Jim

tonynoriega (msg:3857524) - 4:06 pm on Feb 25, 2009 (gmt 0)

I will try adding the duplicate record. Like this, right?

User-agent: vspider
Disallow:

User-agent: googlebot
...etc...etc...

jdMorgan (msg:3857547) - 4:28 pm on Feb 25, 2009 (gmt 0)

User-agent: vspider
Disallow: /js/
Disallow: /_includes/

User-agent: googlebot
...etc...etc...

Jim

tonynoriega (msg:3857664) - 6:57 pm on Feb 25, 2009 (gmt 0)

That still kills the Verity search engine...

I have tried various multiple-record setups, but my web admin says there is no way to use:

User-agent: *
Disallow: /

in the robots.txt at all, even if you allow vspider in a separate record...

I'll just find a lengthy list of bots and spiders and add them manually.

jdMorgan (msg:3858736) - 11:07 pm on Feb 26, 2009 (gmt 0)

As suggested above, you can use mod_rewrite or ISAPI Rewrite to serve one of two robots.txt files, or a dynamically generated one, depending on which spider is asking.

Or you could even use a PHP script to generate your robots.txt "file."
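
A minimal sketch of that idea; it assumes requests for /robots.txt are rewritten internally to this script, and the welcome-spider list and policy below are illustrative assumptions, not anything Verity requires:

<?php
// Dynamic robots.txt: emit a permissive or restrictive policy
// depending on who is asking. Illustrative sketch only.
header('Content-Type: text/plain');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
$welcome = array('googlebot', 'slurp', 'msnbot', 'teoma', 'vspider');

$allowed = false;
foreach ($welcome as $bot) {
    if (strpos($ua, $bot) !== false) {
        $allowed = true;
        break;
    }
}

if ($allowed) {
    // A welcome spider sees only this output, so one wildcard record is safe.
    echo "User-agent: *\n";
    echo "Disallow: /js/\n";
    echo "Disallow: /_includes/\n";
} else {
    // Everyone else is denied the whole site.
    echo "User-agent: *\n";
    echo "Disallow: /\n";
}
?>

Because each spider sees only the output generated for it, a single wildcard record per variant is enough -- no multiple-user-agent records are needed, which sidesteps the parsing problem entirely.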

In the meantime, file a complaint with Verity... Obviously, their robot does not conform to the Standard.

Jim
