We are currently coding a web crawler, and I am searching this forum for problems and solutions regarding robots.txt and the robots meta tag.
One question I still could not answer myself: what does a question mark (?) in Allow/Disallow lines stand for?
At this point we are thinking of ignoring it, because it is not in the robotstxt.org specification.
Can somebody give me a hint?
An example from Google's robots.txt would be:
User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
etc...
This example was downloaded as plain text, so there should be no problem with encoding standards, I guess. As long as we don't know what the question mark means, we ignore all lines that contain one.
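For comparison, here is a minimal sketch of the alternative to dropping those lines. It assumes the rule value is a plain prefix of the URL's path plus query string, which is how the robotstxt.org text describes Disallow ("any URL that starts with this value"), so the question mark is just a literal character. The function names are made up for illustration, and Allow lines are not handled.

```python
from urllib.parse import urlsplit


def path_and_query(url):
    """Return the path plus query string, the part compared against rules."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def is_disallowed(url, disallow_values):
    """True if path+query starts with any non-empty Disallow value.

    Assumption: '?' has no special meaning and is matched literally.
    """
    target = path_and_query(url)
    return any(value and target.startswith(value) for value in disallow_values)


# A few values taken from the example above.
rules = ["/search", "/?", "/addurl/image?"]
```

Under that assumption, `Disallow: /?` would block the home page only when a query string is attached, and leave the bare `/` crawlable.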
Thanks for any help on that, and yes, I have another question:
What do webmasters mean by putting an asterisk after the name of the user-agent?
example:
User-agent: Xbot*
At this point we would just trim the asterisk away and regex-match the remainder against our bot name.
read you
Wildcard * is only valid on its own in:
User-agent: *
If it's used in any other context, it should be treated as a normal symbol. Any webmaster who assumes that it will pattern-match the user-agent is mistaken.
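A rough sketch of that rule in Python, in case it helps. The function name `group_applies` is made up for illustration. The case-insensitive substring comparison is an assumption based on the matching the robotstxt.org text recommends, not something a crawler is required to do.

```python
def group_applies(user_agent_value, bot_name):
    """Decide whether a User-agent line applies to our crawler.

    '*' is a wildcard only when it is the whole value. Anywhere else it is
    kept as an ordinary character, so 'Xbot*' does not match 'Xbot'.
    """
    value = user_agent_value.strip()
    if value == "*":
        return True
    # Assumption: case-insensitive substring match of the record's value
    # against our own bot name.
    return value.lower() in bot_name.lower()
```

With this, `group_applies("*", "Xbot")` and `group_applies("xbot", "Xbot")` are true, while `group_applies("Xbot*", "Xbot")` is false because the asterisk is compared literally rather than trimmed.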