Bots Crawling Pages Blocked by Robots.txt


incrediBILL

7:43 pm on Apr 4, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have one theory of why bots attempt to crawl pages they are forbidden to access: link checking.

Assume all bots are blocked from the entire site with "Disallow: /" in robots.txt.
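For reference, a whole-site block in robots.txt uses a forward slash for the path:

```
User-agent: *
Disallow: /
```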

Assume that real search engines honor robots.txt and don't crawl your site.

Now assume you see entries in your log files showing those same spiders accessing various pages on your site, in defiance of the robots.txt block.

The only explanation I can come up with is that bad bots, the scrapers that ignore robots.txt, scraped the site and those links were indexed on the scraper sites.

Assuming the search engines then crawled the scraper sites, they may be accessing your pages, despite the robots.txt block, to verify the links found on the scraped pages.

That's my theory about why search engines might appear to defy robots.txt, and I'm thinking about setting up a honeypot site just to test it.
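A minimal sketch of how such a honeypot's logs could be scanned, assuming combined-format access logs; the `blocked_hits` helper and the disallow list here are made up for illustration:

```python
import re

# Prefixes from the hypothetical robots.txt; "Disallow: /" blocks everything.
DISALLOWED = ["/"]

# Rough match for a combined-log-format request line:
# captures client IP, HTTP method, and request path.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*"')

def blocked_hits(log_lines, disallowed=DISALLOWED):
    """Return (ip, method, path) tuples for requests to disallowed paths."""
    hits = []
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that aren't request entries
        ip, method, path = m.groups()
        if any(path.startswith(prefix) for prefix in disallowed):
            hits.append((ip, method, path))
    return hits
```

Cross-referencing the matched IPs against the engines' published crawler ranges would then show whether it really is the big spiders fetching the blocked pages.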

At a minimum, it would be nice if the search engine told us why it was on our site. Giving us a simple referrer to where it found that link so we could diagnose the situation would be the simplest solution.

Any thoughts on this?

Samizdata

8:58 pm on Apr 4, 2014 (gmt 0)




Giving us a simple referrer to where it found that link so we could diagnose the situation would be the simplest solution.

I thought the days of search engines giving referrer information to webmasters were effectively over.

...

keyplyr

10:31 pm on Apr 4, 2014 (gmt 0)






There's nothing in the robots.txt standard that says a disallowed file/directory/site isn't to be crawled... only that it is not to be indexed.

I see googlebot, msnbot, bingbot, Yandex, et al. crawling disallowed files all the time, but *almost never see these files in their index.

* A few years ago I did have some issues when Inktomi was crawling for Yahoo.

incrediBILL

11:07 pm on Apr 4, 2014 (gmt 0)




I see googlebot, msnbot, bingbot, Yandex, et al. crawling disallowed files all the time, but *almost never see these files in their index.


But I think it's just checking files referenced from other locations.

Most bot code I've seen from publicly available bots drops the site from the crawl entirely when it is blocked in robots.txt, though sometimes the home page is still fetched but not indexed. I've never seen a bot that actually honors robots.txt go any further than that, except for link checking.
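For what it's worth, that drop-the-site behavior is easy to reproduce with Python's standard urllib.robotparser; the bot name and URLs below are placeholders, and this is only a sketch of how the public bot code I've seen tends to behave:

```python
from urllib import robotparser

def filter_frontier(urls, robots_lines, user_agent="ExampleBot"):
    """Keep only URLs that robots.txt allows this agent to fetch;
    a polite crawler drops the rest from its crawl frontier."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)  # robots.txt supplied as a list of lines
    return [u for u in urls if rp.can_fetch(user_agent, u)]
```

With "Disallow: /" the frontier empties immediately, which is why a compliant bot never gets past robots.txt.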

FWIW, the SEs have taken real heat for crawling blocked paths before, because the resource-intensive load of some database-driven services was bringing servers down, which is why those paths were blocked in the first place.

keyplyr

11:39 pm on Apr 4, 2014 (gmt 0)





But I think it's just checking files referenced from other locations.

That and other reasons, I guess. I'm just pointing out that a robots.txt Disallow instruction is not really a crawl block... it's just a request not to index.

I have a disallowed directory of HTML files that I use as tool-tip-style bubbles. The big SEs crawl them all the time, even though I use Ajax to call them. This used to stop the bots but no longer since the SEs learned to parse javascript. However, I never see these HTML pages indexed in the SERPs.

incrediBILL

11:42 pm on Apr 4, 2014 (gmt 0)




This used to stop the bots but no longer since the SEs learned to parse javascript.


Drifting slightly off topic, but SEs and scrapers can easily access anything in JavaScript these days using tools like PhantomJS to load the page.

Done it myself, it's trivial.

As far as robots.txt goes, they're not supposed to crawl or index the page, but that doesn't stop direct access to specifically referenced pages, which technically isn't a crawl; they're just not supposed to index them.
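That distinction could be sketched like this, again using urllib.robotparser; `fetch_policy` and the agent name are hypothetical, just to illustrate the crawl vs. link-check split:

```python
from urllib import robotparser

def fetch_policy(url, robots_lines, user_agent="ExampleBot"):
    """Hypothetical policy: 'crawl' when robots.txt permits fetching
    and link-following; 'link-check only' when the URL is disallowed,
    meaning at most a single direct request to verify an external
    reference, with no links extracted and nothing indexed."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)  # robots.txt supplied as a list of lines
    return "crawl" if rp.can_fetch(user_agent, url) else "link-check only"
```

A disallowed URL can still get a single direct request when something else references it; it just never enters the crawl frontier or the index.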

Hobbs

6:52 am on Apr 5, 2014 (gmt 0)




From a search engine's POV:
If I am going to send you my visitors,
if I am going to display my ads on your pages,
then I want to, and will, discover all your public content;
if the visitors I send you can see it, so should I.
Yet I will respect your desire not to index, so I agree with keyplyr on the indexing part.

Imagine if I asked you to link to my site but to look only at the top half of my pages, for example; you simply wouldn't. Otherwise it would be possible for SEs to send traffic to phishing and p0rn-type sites.

So I am OK with SEs crawling robots-denied public sections of my site at a "reasonable" rate, and using them ONLY to verify compliance with TOS and nothing else. That, of course, includes link checking.