Forum Moderators: martinibuster


How much should Google be allowed to spider WRT AdSense?

TOS says "the Site(s), or any portion thereof"

         

claus

9:24 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is continued from Google's 302 Redirect Problem [webmasterworld.com] msg #74

The issue is: In some instances, a webmaster might want to exclude certain parts of a site from spidering altogether. One such part could be links that forward via a 302 server header, to protect innocent target websites against accidental "hijacking", but there are other, more "normal" situations (examples below).

It seems like: If you want to run AdSense you must allow Google unlimited access to your site. I.e., you must allow access to parts of your site that you otherwise would allow nobody (including Google) access to. This can't be intentional, can it?

Background

In the other thread, Reid replied to the above (302 link) suggestion:

Claus, this could be a real problem for people running AdSense because you're not allowed to exclude googlebot from any part of your site.

... and elaborated with this:

Last month I read in the AdSense guidelines 'do not use a robots.txt file', and then buried in the optimization tips there is a line:

If you have a robots.txt file, remove the file or add the following two lines to the top of the file:

User-agent: Mediapartners-Google*
Disallow:

This change will allow our bot to crawl the content of your site, so that we may provide you with the most relevant Google ads.

The mediapartners bot is not the same as Googlebot, so with a rule like that Googlebot does not have the same rights as mediapartners. Apart from that, the quote above is from the "optimization tips", which - by nature - are tips, not terms. This is what i found in the AdSense TOS, part 16:

In addition, You grant Google the right to access, index and cache the Site(s), or any portion thereof, including by automated means including Web spiders or crawlers.

URL: [google.com...]

So, i believe Reid is right, although it seems very wrong to me.
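The point about the two-line tip can be checked with a small sketch using Python's standard urllib.robotparser. This is only an illustration under assumptions: the /admin/ path, the helper names and the sample file are hypothetical, and the trailing "*" from the quoted tip is dropped because Python's parser matches user-agent tokens by substring. It shows how Python's parser reads such a file, not how Google's crawlers are implemented.

```python
from urllib.robotparser import RobotFileParser

# The two lines from the quoted tip. The trailing "*" is left out here because
# Python's parser matches user-agent tokens by substring (an adaptation for this sketch).
MEDIAPARTNERS_RULE = "User-agent: Mediapartners-Google\nDisallow:\n"


def prepend_mediapartners_rule(existing_robots_txt: str) -> str:
    """Return robots.txt text with the AdSense crawler group placed at the top."""
    return MEDIAPARTNERS_RULE + "\n" + existing_robots_txt


def can_fetch(robots_txt: str, user_agent: str, path: str) -> bool:
    """Evaluate robots.txt text for one user agent and one path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical existing file: every crawler is kept out of /admin/.
existing = "User-agent: *\nDisallow: /admin/\n"
combined = prepend_mediapartners_rule(existing)

print(can_fetch(combined, "Mediapartners-Google", "/admin/login"))  # True
print(can_fetch(combined, "Googlebot", "/admin/login"))             # False
```

In this reading, the empty Disallow line opens everything to the Mediapartners group only, while Googlebot still falls under the general "*" group.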

Examples of situations where this seems odd:

If i have a page, link, or section that:

a) no real users will ever be able to access directly
b) does not carry adsense ads
c) should not be included in search results

Then, why should i allow anyone to spider it? It could be, eg. my super secret admin section or personal section with all my bank passwords or whatever (not that i have either of those, but some might have).

Other examples are eg. a graphics folder, a folder with stylesheets, or a folder such as "cgi-bin" that only contains scripts.
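As a rough sketch of what excluding such folders looks like, the rules below keep all crawlers out of three hypothetical folders and are checked with Python's standard urllib.robotparser. The folder names and test paths are assumptions made for the illustration, not anything taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical folder names; substitute the paths used on the real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /css/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Report how a crawler that honors robots.txt would treat each path.
for path in ("/cgi-bin/form.pl", "/images/logo.gif", "/articles/widgets.html"):
    verdict = "allowed" if parser.can_fetch("Googlebot", path) else "blocked"
    print(f"{path}: {verdict}")
```

Note that robots.txt is only a request to well-behaved crawlers. It does not restrict access, so it is no substitute for password protection on genuinely secret material.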

Questions:

Q1: Is this a flaw in the TOS that nobody has been paying attention to, or discovered?

Q2: Or - if i want to have AdSense, does that really mean that i should grant Google access to stuff that i would otherwise grant nobody (including Google) access to?

Q3: Could it really be right that i should grant the "Mediapartners bot" unlimited access - ie. more access than any other spider, including Googlebot?

[edited by: claus at 9:45 am (utc) on April 5, 2005]

Reid

9:31 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



'optimisation tips' URL
[google.com ]

sorry about that.

I knew I read that somewhere and when I signed up I carefully went through the terms and did come out with the impression that they wanted full access for bots. I imagine that this is to prevent hiding any sneaky tactics from an adsense point of view.

To get this thread back on topic, considering the 302 indexing problem with googlebot, this could be a problem for adsense publishers, esp. what you pointed out, claus:

In addition, You grant Google the right to access, index and cache the Site(s),

does this prevent adsense publishers from using the NOARCHIVE?

diamondgrl

9:54 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can't be serious, can you?

It seems like it is reasonable language and only if you interpret Google as the most maniacal business partner on the planet would you be concerned with language like that. Their robots.txt instruction is just a helpful tip, not an ultimatum that you must follow or you go straight to Adsense hell.

Am I missing something here? Anyone?

claus

9:54 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> reasonable language

Diamondgrl, TOS'es are very serious and wording is important. The phrase "the Site(s), or any portion thereof" means "everything and no exceptions". I don't think this is reasonable, and i am serious about this.

I just can't believe that this would be intentional. Of course, the AdSense bot should be allowed access to every single page that contains AdSense ads, that's no issue - but access to everything?

I'd think that this was an error. Some minor adjustments in these TOS should be made here.

Also a good point Reid, if you don't want your documents cached, should you allow it anyway if you run AdSense? I think quite a few news sites would not subscribe to that.

Also, i would personally object to any spidering of any secret material i had on my sites a long time before i even considered caching.

Reid

10:03 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



and only if you interpret Google as the most maniacal business partner on the planet

The truth is out there LOL

Reid

10:11 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could this be the problem here?
[webmasterworld.com...]
'googlebot ignoring robots.txt'

If googlebot sees adsense code, does it ignore robots.txt? It IS in the TOS after all.

diamondgrl

10:11 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, there are plenty of things that cause me to lose sleep. This won't be one of them.

claus

11:01 am on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



*bump*

So, do you all think this is a non-issue?

I.e., should we just go on doing business as usual even though the AdSense TOS explicitly disallows it?

Has anybody ever got a robots.txt file (which disallows sections of their site from spidering) validated and approved by the AdSense team?

jomaxx

3:53 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What's the big deal? If nobody can access a given directory then it's not part of your site. And if you only allow the Mediapartners spider to have access to the site, then it won't end up in Google's index anyway.

IMO if you have super-secret files in publicly available directories, Google is not your biggest problem.

DamonHD

4:43 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

I try never to have a robots.txt file at all, and make sure that I am happy for a robot to try to look at anything.

I have other ways of dealing with broken/stupid/kleptomaniac robots...

I do *very* occasionally have a limited robots.txt on a few vanity domains where all the actual content is available (to robots et al) through a canonical domain.

Kids: just say no to robots.txt!

Rgds

Damon

claus

4:52 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> What's the big deal? If nobody can access a given directory then it's not part of your site

I can think of several unfortunate examples.

Eg. an "add to cart" button. What this is is essentially a link. I wouldn't like any spiders to visit such a link, but of course the users should be able to - it is part of the site, just not for spiders.

However, the "big deal" is that i should be able to let any page i choose be disallowed from spidering, as long as that page does not serve AdSense ads.

Well, the general consensus seems to be that this is a non-issue, at least judging from the limited response. I don't know if this is because people just allow the mediabot access to everything, or because they do what they think is most reasonable, regardless of Mediabot and AdSense TOS.

So, let me rephrase once again:

Has anyone, ever, changed their "robots.txt" rules to something they would not have if they did not run AdSense? If so, how and why?

[edited by: claus at 5:07 pm (utc) on April 6, 2005]

jdMorgan

5:02 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree with diamondgrl that there are better things to lie awake at night over.

I see their posted access requirement as an over-simplification, so that they don't have to include a 14-page webmaster primer along with it.

It's obvious that Google needs access to the pages on your site that would be useful in classifying your site for the purpose of serving relevant ads on your pages. No more, no less.

As far as Google demanding full access to your cgi-bin, password-protected directories, etc., that's simply ridiculous. There's no reason they'd need it, and even if they did issue such an unreasonable ultimatum, it would be rather easy to cloak those resources anyway.

Don't worry about their exact words (again, I see them as over-simplified) without trying to understand what it is that they want. They simply want to be able to classify your site accurately. (You can also take away an important lesson in usability here: Always explain the basic motivation behind your policies. It makes interpreting them without legal counsel a lot easier, and removes a lot of fear and unreasonable doubt.)

Jim

Rodney

5:12 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd have to agree with diamondgrl on this one. It doesn't seem like that big of a deal at all unless you stretch to interpret the TOS to mean something sinister.

The phrase "the Site(s), or any portion thereof" means "everything and no exceptions".

I'm not so sure that's what that phrase in the TOS means, especially taken out of context. I *might* be more worried if it said AND any portion thereof, but "or" doesn't seem to mean "everything".

I think it's just business as usual. If it really concerns you, you might be able to contact the adsense team with your specific question and maybe they can shed some light that will set your mind at ease.

oddsod

5:33 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jdMorgan, I don't know about "over simplified". I'm sure they had a legal team check it over thoroughly. Besides, the agreement isn't designed - like the Adsense tips - to be understood by everyone; it is designed to cover all the relevant legal issues.

If it occurred only in the guidelines to disallow: <nothing> then it could be read as a simplification. The fact that it appears there, in the TOS and elsewhere suggests that Google does really want access to all parts of the site. True, they may not access it, they may not have any malicious intent for accessing parts that robots.txt blocks, but it seems they clearly want the right to do it.

It would be interesting to know what the reason behind this is (apart from the usual "serving more appropriate ads" bit).

HughMungus

6:08 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



a) no real users will ever be able to access directly
b) does not carry adsense ads
c) should not be included in search results

This is one example where Google's vagueness can benefit you. I assume it to mean pages where AdSense ads are displayed, and "website" to mean pages that are publicly available. It would be nice to be able to block out certain parts of the site from mediabot, though. I have a website that's about widgets, but a list of my usernames on the front page has caused AdSense to show ads based on a word that one of my users is using for his username (i.e., my page is about widgets but it's showing ads for wadgets).

claus

8:05 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll have to pull back a little on #1 above - it's really not always the case that no real users have access to stuff that i don't want spidered. Still, it's always the case that i don't put ads on stuff that i don't want spidered.

Anyway, i'm glad that most of you think this issue is little more than a spelling error on Google's behalf. I would have thought that it was really to be interpreted quite strictly, since it is in the TOS.

In that sense i find it hard to interpret "or any portion thereof" as anything else than "any portion" with "any" being "any" as in "any" - ie. not "some", "just some", or "selected portions".

(perhaps i should send a sample "robots.txt" file to the AdSense team after all)

Reid

8:44 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



other parts of the TOS say things like
not put adsense on error pages
not put adsense on login pages
not put adsense on form confirmation pages.
and the like.

Perhaps mediabot only wants to check these things and other 'checkable' TOS violations.
Since googlebot does the indexing then you can limit googlebot but allow mediabot full rights.

It does say 'either remove the file or add (mediabot full rights) at the top of the file'.

At least that's how I understood it.

[edited by: Reid at 8:50 pm (utc) on April 6, 2005]

buckworks

8:47 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I think that Google should revise that wording.

You simply can't approach a legally binding document and say, yeah, that's what it says but they can't really mean it.

Reid

9:02 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In addition, You grant Google the right to access, index and cache the Site(s), or any portion thereof, including by automated means including Web spiders or crawlers.

Maybe this isn't even about the google directory. Maybe there is another index/cache going on with mediabot in relation to equating adsense blocks with keywords etc.

Jenstar

9:02 pm on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you are putting AdSense code on a page, AdSense wants to be able to see what is there so it can provide targeted ads, so you need to give it access. As for the terms, I am willing to bet the majority of publishers allow the mediabot full access to the site because they do not have sections they want to disallow from being indexed.

The mediabot and the regular googlebot do not interact at all, and both honor robots.txt independently of each other. I see many requests every month from both bots, each downloading robots.txt at the beginning of a session.

Some people were disallowing the mediabot (either deliberately or accidentally) then complaining about PSA issues, which is probably why the specific robots.txt information is included in the tips, so people can see what it *should* look like to allow the mediabot to properly target ads.

I wouldn't worry at all about excluding the mediabot from certain pages/directories, as long as you aren't also running AdSense on them or complaining about a PSA issue ;)
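To illustrate the idea that each bot is matched against its own robots.txt group, here is a minimal sketch with Python's standard urllib.robotparser. The directory names, the access_table helper and the sample file are hypothetical, and the result reflects how Python's parser reads the file rather than any documented behaviour of Google's crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group per crawler, each with its own exclusions.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow: /members/

User-agent: Googlebot
Disallow: /members/
Disallow: /print/
"""


def access_table(robots_txt, agents, paths):
    """Map each (agent, path) pair to True if that agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, path)
        for agent in agents
        for path in paths
    }


table = access_table(
    ROBOTS_TXT,
    agents=("Mediapartners-Google", "Googlebot"),
    paths=("/members/list.html", "/print/page1.html", "/articles/widgets.html"),
)

for (agent, path), allowed in table.items():
    print(f"{agent:22} {path:26} {'allowed' if allowed else 'blocked'}")
```

In this sketch the mediabot is kept out of /members/ only, while Googlebot is also kept out of /print/. Pages under /articles/, where ads would run, stay open to both.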