Forum Moderators: open
Still ignoring robots.txt.
And haven't you also wondered how certain files can be requested even though no link to them appears anywhere on your site?
a bot just doesn't need a list to get a file
But it needs some way of knowing that the file exists

It can get all that by requesting the files under each level (folder). It doesn't need the file name *prior* to the request: it tells the server to open each folder and list each type of file (or all files), and in doing so it learns the file names.
Command=GetFolders&Type=File&CurrentFolder=...
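That query string looks like a file-browser connector of the FCKeditor style. The enumeration idea can be sketched roughly as follows; the `list_folders` callback stands in for the actual HTTP request, and the folder tree in the usage example is purely illustrative:

```python
from collections import deque
from urllib.parse import urlencode

def walk_folders(list_folders, root="/"):
    """Yield every folder reachable from `root`, breadth-first.
    `list_folders(folder)` returns that folder's subfolders (hypothetical
    stand-in for one connector request)."""
    queue, seen = deque([root]), {root}
    while queue:
        folder = queue.popleft()
        yield folder
        for sub in list_folders(folder):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)

def connector_query(folder, command="GetFolders", type_="File"):
    """Build the connector-style query string for one folder request."""
    return urlencode({"Command": command, "Type": type_, "CurrentFolder": folder})
```

Each response names more folders, so the bot discovers the whole tree, and every file name in it, without a single link on the site.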
it takes my program about 8 seconds to crawl each 1000
It does not detect just "Experibot", so in the regex which parses the robots.txt file, I look for either
User-agent: * OR User-agent: Experibot_v1
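A minimal sketch of such a check; the regex here is an assumption, not the poster's actual code, though "Experibot_v1" is the bot name from the thread:

```python
import re

# Match a User-agent line naming either the wildcard or our bot.
# MULTILINE lets ^/$ anchor to each line of the robots.txt text.
UA_LINE = re.compile(r"^User-agent:\s*(\*|Experibot_v1)\s*$",
                     re.IGNORECASE | re.MULTILINE)

def applies_to_us(robots_txt: str) -> bool:
    """True if any User-agent record covers this bot."""
    return UA_LINE.search(robots_txt) is not None
```

With that, `User-agent: Googlebot` records are skipped while both `User-agent: *` and `User-agent: Experibot_v1` trigger the parser's Disallow handling.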
it's 80 seconds between each two calls to the same site

Yup, that seems more than reasonable. Still, bonus points if the robot understands and obeys
Crawl-Delay: 120
(Mine doesn't actually say this; it's "Crawl-Delay: 3".) If you go by the official robots.txt standard, the only sine qua non is the "Disallow:" directive. But some things are so ubiquitous there's really no excuse. Nope, not even if Google Itself ignores it ;)
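Obeying the directive takes only a few lines. A sketch under the assumptions in this thread (a 3-second default, and taking the site's value only when it is stricter); the function name is mine:

```python
import re

# Crawl-Delay isn't in the original robots.txt standard, so parse it
# leniently: any capitalization, integer seconds.
CRAWL_DELAY = re.compile(r"^Crawl-delay:\s*(\d+)", re.IGNORECASE | re.MULTILINE)

def effective_delay(robots_txt: str, default: float = 3.0) -> float:
    """Seconds to wait between requests to this host: the site's
    Crawl-Delay when it is stricter than our own default."""
    m = CRAWL_DELAY.search(robots_txt)
    return max(default, float(m.group(1))) if m else default
```

A site publishing `Crawl-delay: 120` then gets one request at most every two minutes, while sites that say nothing still get the bot's own throttle.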