

Google is spidering everything, including blocked pages

     
6:49 pm on Dec 27, 2011 (gmt 0)

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member



This year I have seen a LOT of pages spidered which were never meant to be spidered: pages I did not even know existed, and pages that have been blocked in robots.txt for months but still show up. It's like Google is spidering everything, even if it's not a real page, or a link to your site whose URL is not correct. I also see a lot of admin and wp-content pages in Google.
7:08 pm on Dec 27, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



It's difficult to keep the blighters out. They are extremely nosey.

Use everything available: not just meta robots noindex and robots.txt, but also deny directives and RewriteRules detecting IPs and UAs.
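
For instance, a rough .htaccess sketch combining the two (the path, IP range, and patterns here are placeholders, not recommendations; substitute whatever your own logs show):

RewriteEngine On
# forbid the private area to anything calling itself a bot...
RewriteCond %{HTTP_USER_AGENT} [Bb]ot [OR]
# ...or coming from a crawler IP range (placeholder 192.0.2.x shown)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule ^private/ - [F]

# Apache 2.2-style deny directive for one sensitive file
<Files "stats.php">
Order Allow,Deny
Deny from all
</Files>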
7:45 pm on Dec 27, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The introduction of Google Web Preview sounded the death-knell for robots.txt compliance.

It is no consolation that Microsoft bots are even worse.

Only enforceable restrictions seem to work these days.

...
10:31 pm on Dec 27, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Preview isn't a robot. You and I may think it is, but g### has decided otherwise. After a few weeks of "Which part of 'Disallow: /piwik' did you not understand?" I caved in and went to something more robust:

# requests from any of these IP ranges...
RewriteCond %{REMOTE_ADDR} ^(207\.46|157\.5[4-9]|157\.60|209\.8[45])\. [OR]
# ...or with "Bot"/"bot" anywhere in the user-agent...
RewriteCond %{HTTP_USER_AGENT} [Bb]ot
# ...get a 403 for piwik.js
RewriteRule piwik\.js - [F]

I couldn't tell you offhand what each of those IPs belongs to, but Preview must be in there somewhere.

I think the idea is that any linked javascript affects the appearance of the page, so an accurate preview has to include it, even if it doesn't actually affect the page. D'you suppose they know enough to stay out of GA?

For a while I was simply blocking preview, because I didn't see it doing any good. When you're small, previews are loaded up on the spot, not pulled out of an existing archive. I've tested this. And if the Preview isn't immediately followed by a visit to the page, why bother?

Food for thought, there. What types of pages benefit from a preview? Different thread.
10:36 pm on Dec 27, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Without looking at my notes, if I remember right, some of those are Bing or Yahoo IPs.
12:53 am on Dec 28, 2011 (gmt 0)

5+ Year Member



@lucy: That's MSN, I think.
1:36 am on Dec 28, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



... and 209.84-85 is the tentacle of Google used by Preview and Translate.

Urk. For a moment there I thought I'd locked translations out of the piwik logs, but a quick detour to the raw logs reminds me that all associated requests come in with the user's IP. The bad news is that the same detour tells me I have overlooked a few Preview IPs, including the vanilla 74.125 range that I didn't realize was even used for Preview. Maybe I'd better switch to matching on the UA.

###.
1:53 am on Dec 28, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



some of those are Bing

And they don't necessarily have "Bot" or "bot" in the UA (neither does Google Web Preview).

My point was that since the introduction of Google Web Preview, the lip-service formerly paid to robots.txt compliance by major search engines seems to have been deprecated.

Compliance was always voluntary; now they are volunteering to take anything they want.

There is no law against it, after all.

...
3:04 am on Dec 28, 2011 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



There is no law against it, after all.

Like the man said, an htaccess file beats robots.txt every time.* Maybe Preview thinks of itself as a sort of honorary human.

... and another quick detour tells me that the same applies to Translate. You can feed in the address of a roboted-out file and the translation comes back, large as life. I don't mind that so much, though, because it starts with actual humans requesting the real page. Or does Translate, like Preview, keep cached versions of the most popular sites? Hmmm.

Corollary thought: I am going to need to spend some time studying obscure HTML markup, because I need something that says "Do not translate this part".
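
(A starting point, if memory serves: Google documents a "notranslate" convention, both as a page-level meta tag and as a class on individual elements. The element shown is just a placeholder.)

<!-- keep the whole page out of Google Translate -->
<meta name="google" content="notranslate">

<!-- or keep just this bit untranslated -->
<span class="notranslate">piwik.example.com</span>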

And they don't necessarily have "Bot" or "bot" in the UA

... which is why I added "Web\ Preview" ;)
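
i.e. something along these lines (a sketch: the same rule as above, with the extra UA pattern):

# Google Web Preview doesn't say "bot", so match its UA explicitly
RewriteCond %{HTTP_USER_AGENT} Web\ Preview [OR]
RewriteCond %{HTTP_USER_AGENT} [Bb]ot
RewriteRule piwik\.js - [F]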


* Or maybe it was "A loaded .45 beats four aces." Something like that, anyway.
1:59 am on Dec 30, 2011 (gmt 0)

10+ Year Member



>There is no law against it, after all.

That remains to be seen...

I think the Computer Misuse Act in the UK prohibits unauthorised access. You could argue that anything on a web server is fair game, but if the webmaster has correctly implemented industry-standard files and tags to keep robots out, his intention is pretty clear. If the robots choose to ignore his instructions and access the files anyway, then perhaps they fall foul of that act.

I'm not a lawyer, but I reckon it might be arguable at least. Just because you CAN get something off a computer doesn't mean you're allowed to.
10:53 am on Dec 30, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



pages I did not even know existed

That's strange. Is this your site or not?

robots.txt directives are guidelines. Bots follow links they find externally or internally on your domain. If you don't want pages to be found, you need to protect them or not expose them at all. And by the way, in robots.txt you expose them by restricting them: you cannot guarantee that others won't read robots.txt and deliberately display its contents on some external page with hard-coded links. Guess what happens next.
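
To make that concrete: a robots.txt like the sketch below (hypothetical paths) broadcasts exactly where the sensitive material lives to anyone who fetches it, compliant or not.

User-agent: *
# every Disallow line is also a signpost for misbehaving bots
Disallow: /admin/
Disallow: /private-reports/

Anything genuinely private belongs behind authentication, not in a publicly readable exclusion list.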

If the robots choose to ignore his instructions and access the files anyway, then perhaps they fall foul of that act.

I would say fix your code instead of blaming spiders.

also deny directives and RewriteRules detecting IPs and UAs.

That's basically cloaking: serving different content to different visitors. You don't know what will happen if the IP is reassigned or the UA changes.
11:09 am on Dec 30, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



I've noticed a couple of times in the past that Google WMT will suddenly report hundreds of 404s for pages which no longer exist on my server. They were spiderable in the past, but dropped out of the index when I deleted them (months and months ago).

And these weren't the kind of pages to attract backlinks either, so I presume that Google must occasionally try to spider non-existent URLs from one of their old indexes.
1:27 pm on Dec 30, 2011 (gmt 0)

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I got 28,000 404s in Google Webmaster Tools. None of those pages exist anymore. Some are linked to by other sites, but others have no links to them at all, yet they still show up in Google's tools.
3:12 pm on Dec 30, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



It takes a while, but invalid links do drop out of the Google index.

If you keep seeing them in GWT, it is possible the old links are still referenced inside your domain's pages. In that case you won't see them indexed with site:example.com, since accessing them now returns an error code, but you will keep seeing the errors.

Normally GWT won't show errors just because an external site still has references to non-existent pages on your domain.
3:21 pm on Dec 30, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Preview isn't a robot. You and I may think it is, but g### has decided otherwise. After a few weeks of "Which part of 'Disallow: /piwik' did you not understand?" I caved in and went to something more robust:


Can we confirm that blocking the Preview bot won't hurt search rankings, now or in the future? If I were Google, I wouldn't want to build a Preview tool and then have a slew of sites on the first page not support it; that makes Google look bad. Just something to consider.
9:23 pm on Dec 31, 2011 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I told Google via robots.txt that it was not allowed to take my pics. Web Preview is their way around that. I block Web Preview. If that hurts the sites, then so be it.

I also have a notice on my sites saying "You cannot use my content for commercial gain", which Google ignores: it is definitely using my content for commercial gain.

Back to the OP: they spider not only all existing pages but never-existed pages as well, for which they get a 404. If they persist, they may find some of my sites blocked to them entirely. Shame I'm in the UK, where most idiots still use Google as a browser URL entry field. :(
 
