
Sitemaps, Meta Data, and robots.txt Forum

    
vspider not adhering to robots.txt
tonynoriega

Msg#: 3857508 posted 3:39 pm on Feb 25, 2009 (gmt 0)

After I stole... uhmm... borrowed a robots.txt template from pageoneresults, I came across a new problem.

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /js/
Disallow: /_includes/

User-agent: *
Disallow: /

Our website search tool is Verity; its spider is called "vspider".

I realized I was blocking it. OK, so I thought I could just add:

User-agent: vspider

and be all good...

Nope, it's still being blocked.

Has anyone run into something like this before?

If I can't allow this spider and have to remove the lines:

User-agent: *
Disallow: /

am I going to have to manually add every other possible spider or bot?

 

jdMorgan

Msg#: 3857508 posted 3:59 pm on Feb 25, 2009 (gmt 0)

The problem, which I warned about in Pageoneresults' thread, is likely that vspider does not handle the multiple-user-agent policy-record format, as defined by the original Standard for Robot Exclusion. Although it was part of the original document, many spiders don't handle it correctly -- Cuil's twiceler robot being a recent example.

Try adding a duplicate of the record you already have, placed above it, but listing only the vspider user-agent in it.

When creating a multiple-user-agent policy record, you should carefully test that each robot recognizes it and behaves accordingly. If you cannot test, then go to each robot's Webmaster Help page and see whether it indicates support for that format. If not, then defining separate policy records is indicated.

Another approach is to use mod_rewrite or ISAPI Rewrite to rewrite (not redirect!) robot requests for robots.txt to one of two robots.txt files: one that allows access to all spiders, and one that denies access to all spiders. You could also rewrite robots.txt requests to a script that dynamically generates the proper robots.txt directives for each spider.
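
For illustration only, a minimal .htaccess sketch of the two-file approach might look like the following, assuming Apache with mod_rewrite enabled. The file names robots-allow.txt and robots-deny.txt and the user-agent list are placeholders, not anything Apache or Verity requires:

RewriteEngine On
# Welcome robots get the permissive file
RewriteCond %{HTTP_USER_AGENT} (googlebot|slurp|msnbot|teoma|vspider) [NC]
RewriteRule ^robots\.txt$ /robots-allow.txt [L]
# Everything else that asks for robots.txt gets the deny-all file
RewriteRule ^robots\.txt$ /robots-deny.txt [L]

Because the RewriteCond applies only to the first rule, unmatched user-agents fall through to the second rule, and the rewritten file names no longer match the pattern, so there is no rewrite loop.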

When using either dynamic robots.txt delivery approach, be careful what you do with unrecognized spiders -- whether you allow or deny them by default. Allowing them means you have to keep the script updated for new unwelcome spiders, while denying them risks some major 'bot changing its user-agent string and being unrecognized and denied.

Jim

tonynoriega

Msg#: 3857508 posted 4:06 pm on Feb 25, 2009 (gmt 0)

I will try adding the dup record. As follows, right?

User-agent: vspider
Disallow:

User-agent: googlebot
...etc...etc...

jdMorgan

Msg#: 3857508 posted 4:28 pm on Feb 25, 2009 (gmt 0)

User-agent: vspider
Disallow: /js/
Disallow: /_includes/

User-agent: googlebot
...etc...etc...

Jim

tonynoriega

Msg#: 3857508 posted 6:57 pm on Feb 25, 2009 (gmt 0)

That still kills the Verity search engine...

I have tried various combinations of records, but my web admin says there is no way to use:

User-agent: *
Disallow: /

in the robots.txt at all, even if you allow it or add a separate record...

I'll just find a lengthy list of bots and spiders and add them manually...

jdMorgan

Msg#: 3857508 posted 11:07 pm on Feb 26, 2009 (gmt 0)

As I suggested above, another approach is to use mod_rewrite or ISAPI Rewrite to rewrite (not redirect!) robot requests for robots.txt to one of two robots.txt files: one that allows access to all spiders, and one that denies access to all spiders. You could also rewrite robots.txt requests to a script that dynamically generates the proper robots.txt directives for each spider.

Or you could even use a PHP script to generate your robots.txt "file."
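
As a rough sketch only, assuming the server is set up to hand robots.txt requests to the script (for instance via a rewrite like the one above), and with a purely illustrative list of welcome robots, such a PHP script might look like:

<?php
// Sketch of a dynamic robots.txt generator (illustrative only).
header('Content-Type: text/plain');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Placeholder list of welcome robots; adjust to your own policy.
$welcome = array('googlebot', 'slurp', 'msnbot', 'teoma', 'vspider');

$allowed = false;
foreach ($welcome as $bot) {
    if (stripos($ua, $bot) !== false) {
        $allowed = true;
        break;
    }
}

if ($allowed) {
    // Welcome robots: block only the script and include directories.
    echo "User-agent: *\n";
    echo "Disallow: /js/\n";
    echo "Disallow: /_includes/\n";
} else {
    // Everything else: block the whole site.
    echo "User-agent: *\n";
    echo "Disallow: /\n";
}
?>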

In the meantime, file a complaint with Verity... Obviously, their robot does not conform to the Standard.

Jim
