
Home / Forums Index / Google / Google SEO News and Discussion
Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Google is spidering everything, including blocked pages
zeus




msg:4401752
 6:49 pm on Dec 27, 2011 (gmt 0)

This year I have seen A LOT of pages spidered which were never meant to be spidered: pages I did not even know existed, and pages which have been blocked for months in robots.txt but still show up. It's like Google is spidering everything, even when it's not a real page, or when a link to your site has an incorrect URL. I also see a lot of admin and wp-content pages on Google.

 

g1smd




msg:4401758
 7:08 pm on Dec 27, 2011 (gmt 0)

It's difficult to keep the blighters out. They are extremely nosey.

Use everything available: not just meta robots noindex and robots.txt, but also Deny directives and RewriteRules that detect IPs and UAs.
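For illustration, here is what "everything available" might look like in one .htaccess, in Apache 2.2 syntax. All filenames, paths, and IP ranges below are placeholders, not recommendations:

```apache
# Layer 1: robots.txt and meta robots noindex (polite requests, set elsewhere)

# Layer 2: deny an address range outright for one sensitive file
# (192.0.2.0/24 is a documentation-only range; substitute real bot ranges)
<Files "stats.js">
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24
</Files>

# Layer 3: rewrite rules keyed on user-agent or IP (placeholder patterns)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC,OR]
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule ^admin/ - [F]
```

The [F] flag returns 403 Forbidden, so a non-compliant crawler gets nothing even after it has ignored robots.txt.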

Samizdata




msg:4401771
 7:45 pm on Dec 27, 2011 (gmt 0)

The introduction of Google Web Preview sounded the death-knell for robots.txt compliance.

It is no consolation that Microsoft bots are even worse.

Only enforceable restrictions seem to work these days.

...

lucy24




msg:4401807
 10:31 pm on Dec 27, 2011 (gmt 0)

Preview isn't a robot. You and I may think it is, but g### has decided otherwise. After a few weeks of "Which part of 'Disallow: /piwik' did you not understand?" I caved in and went to something more robust:

RewriteCond %{REMOTE_ADDR} ^(207\.46|157\.5[4-9]|157\.60|209\.8[45])\. [OR]
RewriteCond %{HTTP_USER_AGENT} [Bb]ot
RewriteRule piwik\.js - [F]

I couldn't tell you offhand what each of those IPs belongs to, but Preview must be in there somewhere.

I think the idea is that any linked JavaScript affects the appearance of the page, so an accurate preview has to include it, even if in this case it doesn't affect the page at all. D'you suppose they know enough to stay out of GA?

For a while I was simply blocking preview, because I didn't see it doing any good. When you're small, previews are loaded up on the spot, not pulled out of an existing archive. I've tested this. And if the Preview isn't immediately followed by a visit to the page, why bother?

Food for thought, there. What types of pages benefit from a preview? Different thread.

g1smd




msg:4401812
 10:36 pm on Dec 27, 2011 (gmt 0)

Without looking at my notes, if I remember right, some of those are Bing or Yahoo IPs.

zerillos




msg:4401843
 12:53 am on Dec 28, 2011 (gmt 0)

@lucy: that's MSN, I think.

lucy24




msg:4401857
 1:36 am on Dec 28, 2011 (gmt 0)

... and 209.84-85 is the tentacle of Google used by Preview and Translate.

Urk. For a moment there I thought I'd locked translations out of the piwik logs, but a quick detour to raw logs reminds me that all associated requests come in with the user's IP. The bad news is that the same detour tells me I have overlooked a few Preview IPs, including the vanilla 74.125 that I didn't realize was even used for Preview. Maybe I'd better go to UA.
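If the IP list keeps going stale, a UA-based variant of the earlier rule might look like this. It is a sketch: it assumes the preview fetcher identifies itself with the substring "Google Web Preview" in its user-agent string, which is what it sent at the time:

```apache
# Block the preview fetcher by UA rather than by IP (sketch)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Google\ Web\ Preview" [NC]
RewriteRule piwik\.js - [F]
```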

###.

Samizdata




msg:4401861
 1:53 am on Dec 28, 2011 (gmt 0)

some of those are Bing

And they don't necessarily have "Bot" or "bot" in the UA (neither does Google Web Preview).

My point was that since the introduction of Google Web Preview, the lip-service formerly paid to robots.txt compliance by major search engines seems to have been deprecated.

Compliance was always voluntary, now they are volunteering to take anything they want.

There is no law against it, after all.

...

lucy24




msg:4401865
 3:04 am on Dec 28, 2011 (gmt 0)

There is no law against it, after all.

Like the man said, an htaccess file beats robots.txt every time.* Maybe Preview thinks of itself as a sort of honorary human.

... and another quick detour tells me that the same applies to Translate. You can feed in the address of a roboted-out file and the translation comes back, large as life. I don't mind that so much, though, because it starts with actual humans requesting the real page. Or does Translate, like Preview, keep cached versions of the most popular sites? Hmmm.

Corollary thought: I am going to need to spend some time studying obscure HTML markup, because I need something that says "Do not translate this part".
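For what it's worth, Google's translator has long been documented to skip elements carrying the notranslate class, and HTML5 later added a translate attribute for the same purpose. A sketch (compliance is up to the translator, of course):

```html
<p>This sentence may be translated.</p>
<p class="notranslate">This part is left in the original language.</p>
<!-- the HTML5 equivalent, standardized later -->
<p translate="no">Also exempt, for translators that honour the attribute.</p>
```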

And they don't necessarily have "Bot" or "bot" in the UA

... which is why I added "Web\ Preview" ;)


* Or maybe it was "A loaded .45 beats four aces." Something like that, anyway.

7_Driver




msg:4402305
 1:59 am on Dec 30, 2011 (gmt 0)

>There is no law against it, after all.

That remains to be seen...

I think the Computer Misuse Act in the UK prohibits unauthorised access. You could argue that anything on a web server is fair game - but if the webmaster has correctly implemented industry-standard files and tags to keep robots out, his intention is pretty clear. If the robots choose to ignore his instructions and access the files anyway, then perhaps they fall foul of that act.

I'm not a lawyer - but I reckon it might be arguable at least. Just because you CAN get something off a computer, doesn't mean you're allowed to.

enigma1




msg:4402363
 10:53 am on Dec 30, 2011 (gmt 0)

pages I did not even know existed

That's strange. Is this your site or not?

robots.txt rules are guidelines. Bots follow links they find externally or internally in your domain. If you don't want pages to be found, you need to protect them or not expose them at all. And by the way, in robots.txt you expose them - by restricting them. You cannot guarantee that others won't read robots.txt and deliberately display its contents on some external page with hard-coded links. Guess what happens next.
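To make that concrete: a robots.txt like the one below (paths invented for illustration) is itself a public URL, so every Disallow line advertises exactly where the supposedly hidden files live:

```
User-agent: *
Disallow: /admin/
Disallow: /wp-content/backups/
Disallow: /piwik/
```

Anyone, compliant or not, can fetch /robots.txt and walk straight down that list.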

If the robots choose to ignore his instructions and access the files anyway, then perhaps they fall foul of that act.

I would say fix your code instead of blaming spiders.

also deny directives and RewriteRules detecting IPs and UAs.

That's basically cloaking: serving different content to different visitors. You don't know what will happen if the IP is reassigned or the UA changes.

londrum




msg:4402364
 11:09 am on Dec 30, 2011 (gmt 0)

I've noticed a couple of times in the past that Google WMT will suddenly report hundreds of 404s for pages which no longer exist on my server. They were spiderable in the past, but dropped out of the index when I deleted them (months and months ago).

And these weren't the kind of pages to attract backlinks either, so I presume that Google must occasionally try to spider non-existent URLs from one of its old indexes.

zeus




msg:4402393
 1:27 pm on Dec 30, 2011 (gmt 0)

I got 28,000 404s in Google Webmaster Tools. None of those pages exist anymore; some are linked to by other sites, but others have no links to them at all, yet they still show up in Webmaster Tools.

enigma1




msg:4402402
 3:12 pm on Dec 30, 2011 (gmt 0)

It takes a while, but invalid links drop out of the Google index.

If you keep seeing them in GWT, it's possible the old links are still referenced inside your domain's pages. In that case you won't see them indexed with site:example.com, as accessing them now returns an error code, but you will keep seeing the errors.

Normally GWT won't show errors just because an external site still has references to non-existing pages in your domain.

StoutFiles




msg:4402404
 3:21 pm on Dec 30, 2011 (gmt 0)

Preview isn't a robot. You and I may think it is, but g### has decided otherwise. After a few weeks of "Which part of 'Disallow: /piwik' did you not understand?" I caved in and went to a more robust


Can we confirm that blocking the Preview bot won't hurt search rankings, now or in the future? If I were Google, I wouldn't want to build a Preview tool and then have a slew of sites on the first page not support it; it makes Google look bad. Just something to consider.

dstiles




msg:4402722
 9:23 pm on Dec 31, 2011 (gmt 0)

I told Google via robots.txt that it was not allowed to take my pics. Web Preview is their way around that. I block Web Preview. If that hurts the sites, then so be it.

I also have a notice on my sites saying "You cannot use my content for commercial gain". Which Google ignores: it is definitely using my content for commercial gain.

Back to the OP: they spider not only all existing pages but never-existed pages as well, for which they get a 404. If they persist, they may find some of my sites blocked to them entirely. Shame I'm in the UK, where most idiots still use Google as a browser URL entry field. :(


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved