Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 42 message thread spans 2 pages.
Having Google +1 button on a page will override robots.txt blocks!
indyank




msg:4358035
 5:16 pm on Sep 2, 2011 (gmt 0)

Now this one is interesting. I just found this answer from a google employee in one of their forum threads - [google.com...]

The +1 Button is only intended to be used on pages that contain all public content. By putting the button on a page we're taking it as an indication from you that this page is public content. This means that we will fetch your page even if crawler directives indicate otherwise.


What is interesting to me is that until now, I was thinking that if you block a page in robots.txt, google wouldn't even crawl it. But it looks like google will fetch the page, see whether it has a +1 button and then disregard the robots.txt block. Wow, this is great!

Looks like Googlebot will get into every corner of your site.

 

indyank




msg:4358037
 5:26 pm on Sep 2, 2011 (gmt 0)

I think white listing wouldn't work with google anymore. From what I gather by going through that thread and the OP's observations, it looks like some of their bots might not even identify themselves with a proper useragent.

These guys are getting more and more evil with every passing day!

netmeg




msg:4358043
 5:36 pm on Sep 2, 2011 (gmt 0)

um, yea, that's pretty nasty.

tedster




msg:4358044
 5:36 pm on Sep 2, 2011 (gmt 0)

My assumption was a bit different. I'm thinking that when a +1 vote is registered for a page, then that URL gets added to Google's crawling queue if it wasn't already there - and it then gets crawled even if there's a Disallow rule.

For one, that process would be a lot less resource intensive than taking a full inventory of every URL on the web. And this process would only be a minor ignoring of the robots.txt protocol instead of a major violation.

Nevertheless, we'll need to be very cautious about any automated adding of +1 buttons across a website.

CritterNYC




msg:4358063
 6:29 pm on Sep 2, 2011 (gmt 0)

There's nothing nefarious about it. Google doesn't need to crawl all your pages (regardless of robots.txt) to find the +1 buttons. You've got the JavaScript, from Google, right on the page. They just check referrers for the +1 to ensure it's been crawled. That's all.

netmeg




msg:4358073
 6:57 pm on Sep 2, 2011 (gmt 0)

Yes, but what about the various things I block because I don't want a lot of sort=, pagination=, display= and other types of potential duplicate content cluttering up the joint? I could rely on GWT's parameter exclusion (yea right) or rely on Google to figure it out (because THAT always works) or figure out a way to return a noindex (hunh?) Easier just to not use +1, except they're pretty much saying you *gotta* use +1 - at least that's what the clients hear.

They're messing with my systems. Do. Not. Want.

tedster




msg:4358074
 6:59 pm on Sep 2, 2011 (gmt 0)

but what about the various things I block because I don't want a lot of sort=, pagination=, display= and other types of potential duplicate content cluttering up the joint?

Exactly right. I don't even want to see a crawler requesting those URLs or to spend the bandwidth responding.

MrFewkes




msg:4358082
 7:32 pm on Sep 2, 2011 (gmt 0)

Netmeg - Hard.Luck.On.You.

Another google trick - they are really showing their true colours el-rapido these days.

Sgt_Kickaxe




msg:4358084
 7:43 pm on Sep 2, 2011 (gmt 0)

they are really showing their true colours el-rapido these days.


I agree, 110%. I'm sure Google has a plan in place for when webmasters revolt and say hey - stop making billions off my content yo!

Ignoring robots.txt directives sounds like grounds for a lawsuit, why should you foot the bandwidth?

Worse is that analytics and some web hosts hide googlebot activity, as if they know it shouldn't be there...

update: from John Mu
Just to follow up on the Instant Preview questions.. We fetch the content for Instant Previews (provided it's not cached yet) on demand when the user requests it. When we do that, we need to be able to fetch the page the way that the user would see it, and for that we may fetch content that's otherwise disallowed by the robots.txt file.


Yup, lawsuit incoming. The explanation doesn't hold water since it should not be possible to request an instant preview on a disallowed page. Chicken before the egg problem John.

freejung




msg:4358091
 8:14 pm on Sep 2, 2011 (gmt 0)

but what about the various things I block because I don't want a lot of sort=, pagination=, display= and other types of potential duplicate content cluttering up the joint?

Right, but the +1 button uses the rel="canonical" link if it exists to identify the target of the button, so all you have to do is properly canonicalize those pages and the bot should request the URL that the button actually points to, which will be the canonical unless specified otherwise.

Or you can explicitly specify the URL in the code for the button.

You still have a fair amount of control over this.

I can see how it's kind of rude in principle, but in actual practice why would you want a Google +1 button pointing at a URL that you don't want Google to see? Just make sure you're pointing the button at a URL that you like, which you probably ought to do anyway, otherwise when people share it on Google+ they'll be sharing all sorts of random parametrized URLs that you don't like.
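To make the canonicalization idea concrete, here is a minimal sketch in Python of deriving one canonical URL from the parametrized variants. The parameter names come from netmeg's post; the `DUPLICATE_PARAMS` set, the function names and the example.com URLs are assumptions made for illustration, not anything Google or a particular CMS provides.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that only re-sort or re-page the same content.
DUPLICATE_PARAMS = {"sort", "pagination", "display"}


def canonical_url(url):
    """Return url with the duplicate-content parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in DUPLICATE_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Render the <link rel="canonical"> element for the page head."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


if __name__ == "__main__":
    print(canonical_link_tag("http://example.com/widgets?sort=price&color=red"))
```

The same computed value could also be used wherever the button's target URL is set explicitly, so every variant of a page points the button at one address. Whether a crawler then requests only that address is exactly the point still awaiting clarification in this thread.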

freejung




msg:4358094
 8:23 pm on Sep 2, 2011 (gmt 0)

Hmmm... I just realized that it's not entirely certain that my interpretation of the statement is correct. I just asked for clarification, we'll see what they say.

londrum




msg:4358102
 8:46 pm on Sep 2, 2011 (gmt 0)

you can understand their thinking though... google+1 is not for users, not really. its not like the facebook button where people click it to share stuff with their friends... the whole point of google+1 is to let google know that you think the page is worthy of being in their index. everything else is just fluff. so why put the button on a page if you dont want it boosted in the index? there's no point.

the only reason to put it on a noindexed page is if google spreads some of the benefit throughout the rest of the site. but is that the way it works? i dont think it is. a click only counts towards that specific URL.

tedster




msg:4358109
 8:59 pm on Sep 2, 2011 (gmt 0)

I do understand the thinking behind this, but I also think they haven't thought it through very well. Now, if the crawling uses a canonical link (we're waiting for that clarification) that would handle a lot of these edge cases that look so troubling.

However, Google does rush things into production without looking at cross-discipline ramifications. Remember how the first AJAX SERPs broke Analytics?

netmeg




msg:4358115
 9:08 pm on Sep 2, 2011 (gmt 0)

Actually you can program the +1 button now to share to your google+ page, as I understand it.

I am not saying I would *intentionally* put the button on a noindexed page, I am talking about where I put the button on pages where alternative URLs (to the same content) can be generated, and where I usually block off those alternative URLs from being indexed. The canonical might help in some cases, but I already have canonical set and I still get crap in the index if I don't specifically block it out. Google is not 100% reliable in this.

freejung




msg:4358120
 9:12 pm on Sep 2, 2011 (gmt 0)

Actually you can program the +1 button now to share to your google+ page

That is currently the default behavior.

Sgt_Kickaxe




msg:4358125
 9:32 pm on Sep 2, 2011 (gmt 0)

The canonical might help in some cases, but I already have canonical set and I still get crap in the index if I don't specifically block it out. Google is not 100% reliable in this.


Google is 100% reliable in that they will crawl all available data, the question is will they obey webmasters and not crawl what we say don't crawl. The answer appears to be no. I've set up several honeypot pages to see what Googlebot does in reality, the only unbiased answers come from testing for yourself.
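For anyone wanting to run a similar test, a minimal sketch of checking an access log for requests to a honeypot might look like the following. It assumes the server writes the common "combined" log format and that the honeypot lives under a hypothetical `/honeypot/` path that robots.txt disallows; the path, the function name and the regular expression are illustrative only.

```python
import re

# Hypothetical honeypot path, assumed to be disallowed in robots.txt.
HONEYPOT_PREFIX = "/honeypot/"

# Picks the client IP, timestamp, request path and user agent out of a
# combined-format access log line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def honeypot_hits(lines, prefix=HONEYPOT_PREFIX):
    """Return (ip, time, path, agent) for every request under the honeypot prefix."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(prefix):
            hits.append(
                (
                    match.group("ip"),
                    match.group("time"),
                    match.group("path"),
                    match.group("agent"),
                )
            )
    return hits
```

A user agent string can be spoofed, so a hit that claims to be Googlebot is only a lead; the IP address still has to be checked before drawing conclusions.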

freejung




msg:4358141
 9:51 pm on Sep 2, 2011 (gmt 0)

the only unbiased answers come from testing for yourself

Good point, except that they say that they may crawl the page, which means even if they don't crawl your test pages, they may still crawl others in other contexts -- however, it would be interesting to see the results. Maybe I'll set up a test too, in my vast spare time...

Marshall




msg:4358142
 9:52 pm on Sep 2, 2011 (gmt 0)

Maybe I'm oversimplifying, but isn't anything that helps your ranking a good thing even if it has some quirks? Remember, that old adage applies here: you get what you pay for and Google+ is free.

Marshall

Dan01




msg:4358190
 1:48 am on Sep 3, 2011 (gmt 0)

Most sites now use a Database to basically pull a page together. Back in the days that I used HTML to create every page, it would have been easy to delete something like the +1 from a specific page.

Now though, with a DB driven site, it is nearly impossible.

On another thread I mentioned I was deleting pages to try to improve my page rank. Every page on the DB portion of the site has +1. It wouldn't have made sense to use no-index or disallow to remove the page from the index, perhaps.

loner




msg:4358200
 2:42 am on Sep 3, 2011 (gmt 0)

No +1, no problem. Sounds needy.

DirigoDev




msg:4358208
 3:34 am on Sep 3, 2011 (gmt 0)

Now though, with a DB driven site, it is nearly impossible.


Not so! Suppressing +1 should be easy peasy lemon squeezy on most any DB driven site. Just suppress +1 on pages you don't want in the index. Too much complaining here. Personally, I think that Google could honor the noindex command on a +1 page. They need to step up to the plate on this one. Mr. Cutts, why can't Google honor the noindex?

indyank




msg:4358209
 3:37 am on Sep 3, 2011 (gmt 0)

robots.txt has been designed to block some or all the disciplined bots from crawling any page.

Why should a recently introduced +1 button take precedence to jeopardise what has been well understood by all? I do agree that one shouldn't use the button on a page they don't want to share publicly. But, any good bot should obey the most restrictive instruction when there are conflicts.

When you have two robot meta tags on a page by mistake, one telling the bots to "index" and the other telling the bot to "noindex", doesn't google say they will apply the most restrictive tag? Why should it be different in this case?

I sincerely feel that it would be better if they apply the same logic here.
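As an illustration of the "most restrictive wins" rule described above, here is a minimal sketch in Python. It only shows the logic being argued for; it is not Google's implementation, and the function name is made up for the example.

```python
def effective_robots(meta_contents):
    """Combine several robots meta values, letting the restrictive directive win.

    meta_contents is a list of content strings such as "index, follow".
    """
    tokens = set()
    for content in meta_contents:
        tokens.update(t.strip().lower() for t in content.split(",") if t.strip())

    # "none" is shorthand for "noindex, nofollow".
    if "none" in tokens:
        tokens.update({"noindex", "nofollow"})

    index = "noindex" if "noindex" in tokens else "index"
    follow = "nofollow" if "nofollow" in tokens else "follow"
    return f"{index}, {follow}"


if __name__ == "__main__":
    # Two conflicting tags on the same page: the restrictive one prevails.
    print(effective_robots(["index, follow", "noindex"]))
```

Applied to this thread, the same rule would treat a robots.txt Disallow as outranking the presence of a +1 button.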

graeme_p




msg:4358220
 5:04 am on Sep 3, 2011 (gmt 0)

except they're pretty much saying you *gotta* use +1 - at least that's what the clients hear.


Why? It only affects SERPS for people who follow those who +1. Are these sites going to get lots of +1s from people who are followed by potential customers?

Having a +1 button does not seem to greatly change the number of +1s you get.

If it's just the clients' perception and you disagree but cannot change their minds, that is their problem - much the same as if they insisted on link exchanges with dodgy sites.

@indyank. This is different. It is more like what Facebook does with links - do they follow robots.txt? Incidentally FB also censor links in comments, even to fairly mild material (Wikipedia entry on a racy text adventure game, IMDB page on a frightening but not particularly offensive film).

I agree that ideally Google would index them separately, but that would be expensive. It does not seem unreasonable to say that you should not put +1 buttons on pages you do not want indexed. I think what they need to do is document it better so people are clear about it when they add the button.

indyank




msg:4358227
 6:11 am on Sep 3, 2011 (gmt 0)

graeme_p, what is different? Do not mix up facebook with a search engine bot. AFAIK facebook doesn't have any bot to crawl websites or pages on websites like the search robots.

Yes, google does mix up social and search a lot these days to confuse several webmasters like you. However, whenever googlebot crawls a page on a site, it is supposed to check the robots.txt to find what is disallowed and obey the rules there. It is googlebot which is supposed to determine whether a page can be crawled and indexed.

[edited by: indyank at 6:34 am (utc) on Sep 3, 2011]
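The check a well-behaved bot is expected to make before fetching can be sketched with Python's standard `urllib.robotparser`. The rules, the bot name and the example.com URLs below are hypothetical and only illustrate the expected behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/report.html"):
        print(url, may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

A compliant crawler skips any URL for which this check fails, whatever markup the page itself contains, because it never fetches the page to look.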

indyank




msg:4358229
 6:18 am on Sep 3, 2011 (gmt 0)

I don't think canonical tags are a good solution to this issue for several reasons. Moreover, they are not widely used and they are just hints for googlebot.

robots.txt should take precedence over anything else.

indyank




msg:4358230
 6:25 am on Sep 3, 2011 (gmt 0)

Google, if you don't like users using the +1 button on pages they have blocked, help them remove it, warn them or ban them from using that button.

It isn't nice to find workarounds for bypassing robots.txt.

Dan01




msg:4358231
 6:36 am on Sep 3, 2011 (gmt 0)

Just suppress +1 on pages you don't want in the index.


Suppress? I'd love to know more about suppress. If you have the same sidebar for the whole site, how do you suppress something on that sidebar on one page? I don't see suppress in Drupal, Wordpress or any of the other DB driven CMSs.

I am not saying it can't be done, but how?

g1smd




msg:4358242
 8:25 am on Sep 3, 2011 (gmt 0)

A few lines of code can detect what page the user requested and then decide to show or not show the button.

I recently did this for the FaceBook "like" button on a MediaWiki-driven website. It should be equally simple to deploy on any script-driven site just as long as you know the URL format for all pages that should display the button or for all pages that should not display the button.

In this case the rule was "don't show the button on pages where URL contains index.php or begins Talk: or...". Additionally, we don't bother showing the button if it is a searchengine bot requesting the page.
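A minimal sketch of that kind of rule, written in Python rather than whatever language the site itself uses, might look like this. The function name and the list of bot substrings are assumptions for illustration; the two URL rules are the ones quoted above.

```python
import re

# Illustrative substrings found in some search engine crawler user agents.
BOT_PATTERN = re.compile(r"googlebot|bingbot|slurp", re.IGNORECASE)


def show_button(path, user_agent=""):
    """Decide whether the share button should be rendered for this request."""
    # Rule 1: script-style URLs (edit, history and so on) never get the button.
    if "index.php" in path:
        return False

    # Rule 2: discussion pages, whose page name begins with "Talk:".
    page_name = path.rsplit("/", 1)[-1]
    if page_name.startswith("Talk:"):
        return False

    # Rule 3: do not bother rendering the button for search engine bots.
    if BOT_PATTERN.search(user_agent):
        return False

    return True
```

The page template then wraps the button markup in a single `if show_button(...)` test, which is the same approach whether the button sits in a shared sidebar or in the page body.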

nippi




msg:4358244
 9:20 am on Sep 3, 2011 (gmt 0)

and with regard to another thread - a +1 button on any page in a spider trap folder will catch and ban google.

Pfui




msg:4358309
 3:00 pm on Sep 3, 2011 (gmt 0)

The question is will they obey webmasters and not crawl what we say don't crawl. The answer appears to be no.

That's been true a long time, unfortunately. You'll find scores of reports in WW's "Search Engine Spider and User Agent Identification [webmasterworld.com]" forum.

For example, here's fresh info about non-obvious Twitter-mining:

Resolving "urlresolver" | Google IPs repeat no-robots runs
Recap post: [webmasterworld.com...]

And more GWT news:

Google Web Preview | Not just from bare IPs anymore... [webmasterworld.com...]

After spending too much unrecompensed time 'accommodating' GWT before G worked out their own bugs, I will no longer kick their tires for them via +1 or anything else. I'm seriously weary, and increasingly wary, of jumping through their we-cloak-but-you-can't hoops.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About
© Webmaster World 1996-2014 all rights reserved