
Forum Moderators: Robert Charlton & andy langton & goodroi

Googlebot ignoring robots.txt and nofollow

     
2:49 pm on Apr 2, 2017 (gmt 0)

Junior Member from GB 

10+ Year Member

joined:July 12, 2006
posts: 71
votes: 9


I have blocked all robots in robots.txt from crawling the affiliate links placed on my website. I have also nofollowed them all, but I have noticed that Googlebot is now ignoring these directives and crawling them anyway.

Is there anything more I can do to stop Googlebot from crawling them, especially now that Google appears to have declared war on affiliate links?
5:57 pm on Apr 2, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3451
votes: 181


If you truly do not want Google to see those links, you need to nofollow your navigation links to those pages, because Google will follow your navigation links even if the page is blocked in robots.txt. You should also remove those pages from any sitemaps you submit to Google. If there are incoming links to those pages that are not nofollowed, Google will still visit them. Keep in mind that if you blocked those pages in robots.txt at the same time you added nofollow to the links, Google can't crawl the pages to see that those links are nofollow. I'm guessing you have also noindexed the pages you do not want crawled?

I have not bothered to no-follow affiliate links and have no problems. It does not look like they have anything against affiliate links per se - just when there is no content or thin content to support those links. (Or maybe more links than the content supports?)
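In concrete markup, the combination described above looks something like this (a sketch; the page path is invented):

```html
<!-- in the <head> of the page you don't want indexed:
     still crawlable, but kept out of the index -->
<meta name="robots" content="noindex">

<!-- on internal navigation links pointing at it, so Google
     isn't asked to pass anything along -->
<a href="/affiliate-deals/" rel="nofollow">Deals</a>
```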
6:43 pm on Apr 2, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14031
votes: 521


I have also nofollowed them all, but I have noticed that Googlebot is now ignoring these directions

You may have misunderstood what "nofollow" means. (Been there. Done that.) It doesn't mean "pretend you haven't seen this link". It only means "don't tell them I sent you", as applied to that nebulous quantity known as Link Juice.

Every now and then someone reports seeing the Googlebot (by that name) ignoring robots.txt directives. But so far there has always turned out to be some other explanation.
10:56 pm on Apr 2, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10980
votes: 84


I have blocked all robots from crawling affiliate links placed on my website in robots.txt .

help me understand...
your affiliate links go through an external script and you are using robots.txt to exclude googlebot or all bots from crawling that external script?
12:56 am on Apr 3, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member editorialguy is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:June 28, 2013
posts:2998
votes: 527


I have not bothered to no-follow affiliate links and have no problems. It does not look like they have anything against affiliate links per se - just when there is no content or thin content to support those links. (Or maybe more links than the content supports?)

Also, a few years ago (when a lot of people were trying to figure out how to use nofollow), Matt Cutts said something like "We're pretty good at recognizing affiliate links," the implication being that Google knew enough not to send link juice to Amazon or Booking dot com or whatever based on an affiliate link.
7:41 am on Apr 3, 2017 (gmt 0)

Junior Member from GB 

10+ Year Member

joined:July 12, 2006
posts: 71
votes: 9


help me understand...
your affiliate links go through an external script and you are using robots.txt to exclude googlebot or all bots from crawling that external script?


My affiliate links go through '/folder/file.php?123', 'folder/file.php?1234' etc. I have blocked all bots from crawling /folder/, yet looking through my server logs I can see that googlebot is still looking at the links.
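For that layout, the relevant robots.txt rule would be something like this (a sketch; the live file may differ):

```
User-agent: *
Disallow: /folder/
```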
7:51 am on Apr 3, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10980
votes: 84


I have blocked all bots from crawling /folder/, yet looking through my server logs I can see that googlebot is still looking at the links.

have you checked the IP addresses for those /folder/ requests or just the UA string?
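For anyone checking: Google's documented verification method is a reverse DNS lookup on the requesting IP, followed by a forward lookup to confirm. A Python sketch (the function names are my own, not from any library):

```python
import socket

# Hostname suffixes Google documents for its verified crawlers
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_google_host(hostname: str) -> bool:
    """Pure string check: is this reverse-DNS hostname in Google's crawl space?"""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the suffix, then forward-confirm the IP.

    A User-Agent string alone proves nothing; anyone can send one.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)   # reverse lookup
    except OSError:
        return False
    if not looks_like_google_host(host):
        return False
    try:
        # forward lookup of the hostname must include the original IP
        return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
```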
8:19 am on Apr 3, 2017 (gmt 0)

Junior Member from GB 

10+ Year Member

joined:July 12, 2006
posts: 71
votes: 9


Yes, definitely gbot.
8:21 am on Apr 3, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:9651
votes: 483


robots.txt doesn't block bots from accessing & following links... it asks that the disallowed files not be indexed.
8:59 am on Apr 3, 2017 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:24529
votes: 577


Googlebot has found the links somewhere. It probably won't stop crawling those links, as they're part of Google's knowledge of the web. Its knowledge may not mean indexing, but once it finds a link it has a voracious appetite to follow it, even if you have the directive set to noindex, nofollow.

I would look for any external links to those pages: that's how they are being found. It kind of defeats the object of an affiliate link, however.

You could also move the pages to a new URL, ensuring they are free of any inbound links.
9:52 am on Apr 3, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10980
votes: 84


it shouldn't matter how Google discovers a url.
(I'm talking about the /folder/ URLs here.)
a well-behaved bot will respect exclusions specified in robots.txt and as far as I know googlebot has always been "well-behaved" and as such it wouldn't request a /folder/ url.
have you tested your robots.txt file in GSC?

[edited by: phranque at 11:06 am (utc) on Apr 27, 2017]

10:08 am on Apr 3, 2017 (gmt 0)

Junior Member from GB 

10+ Year Member

joined:July 12, 2006
posts: 71
votes: 9


have you tested your robots.txt file in GSC?


Yes I have. Everything is OK with the robots.txt. I have no doubt that the googlebot is misbehaving.
9:49 pm on Apr 3, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14031
votes: 521


robots.txt doesn't block bots from accessing & following links... it asks that the disallowed files not be indexed

Say what now?
10:07 am on Apr 27, 2017 (gmt 0)

New User from FR 

joined:Apr 28, 2015
posts: 25
votes: 2


Hi everyone,
We've noticed the same thing: starting on March 28th, Googlebot has been crawling & indexing redirect links that have always been blocked in the robots.txt. The links have also been encrypted for a while and he's never crawled them, until recently.

From March 28th until yesterday it was an easy crawl, 200-300 URIs a day tops; so far today we've got 3k hits on these page types, he's going all out..

The weird part is that these newly indexed pages have different descriptions: either "A description for this result is not available because of this site's robots.txt", or the merchant page description. It's kind of like he considers they've always been indexed and have only just been blocked by the robots.txt, but they've literally been there for years (if not at least a decade..)
He also mixed up part of our breadcrumb and part of the merchant's breadcrumb.

Do you guys think he could have changed something within his crawler? If so, is this a side effect, or exactly what they wanted to do..?

Has anyone seen anything related? We've noticed that competitors have the same issue; I'm wondering if this is specific to e-commerce/affiliate or global.
10:57 am on Apr 27, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10980
votes: 84


Googlebot has been crawling & indexing redirect links that have always been blocked in the robots.txt

if you exclude googlebot from crawling it won't see the redirect or any response for that matter since it won't request the excluded url.
this is why the SERP is showing the "A description for this result is not available because of this site's robots.txt" text instead of a description.
1:39 pm on Apr 27, 2017 (gmt 0)

New User from FR 

joined:Apr 28, 2015
posts: 25
votes: 2


if you exclude googlebot from crawling it won't see the redirect or any response for that matter since it won't request the excluded url.

Yeah exactly, so why has he started? ^^ We never removed it from the robots.txt.
5:59 pm on Apr 27, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14031
votes: 521


A description for this result is not available because of this site's robots.txt

Now, wait. This message means that the search engine (G is not the only one) has indexed but not crawled the page. Why do you say it has been crawling? Do you see Googlebot requests in logs? If a law-abiding robot is denied in robots.txt, the request will not be made in the first place. Have you tried Fetch As Googlebot in GSC to see the response?

:: detour to check something ::

As you may know, Fetch As Googlebot is now two separate fetches--one as Googlebot, one with a humanoid UA. If you say “Fetch and Render”, and the specified page is denied in robots.txt, it still does the humanoid fetch, but it doesn't show the "what a human sees" render. (If you say Fetch, without Render, for a roboted-out page it doesn't do the humanoid part at all.)

The only way to know what has been happening is by looking at raw logs.

You may choose to take the opposite approach: permit crawling but deny indexing by putting a robots meta on each page. (Some search engines also recognize the X-Robots header, which can be applied globally. I don't remember if Google does.) But that's a different issue.
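For what it's worth, Google does recognize the X-Robots-Tag response header. A sketch of applying it section-wide in Apache server config (the path is invented; requires mod_headers):

```apache
# Send a noindex header for everything under /private-offers/,
# while leaving the URLs crawlable (don't also Disallow them in
# robots.txt, or the crawler will never see the header)
<Location "/private-offers/">
    Header set X-Robots-Tag "noindex"
</Location>
```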
10:34 pm on Apr 27, 2017 (gmt 0)

Moderator

WebmasterWorld Administrator ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8510
votes: 228


robots.txt doesn't block bots from accessing & following links... it asks that the disallowed files not be indexed.


From robotstxt.org
The "Disallow: /" tells the robot that it should not visit any pages on the site.

-- [robotstxt.org...]

And from the original official spec

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

-- [robotstxt.org...]
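That prefix behavior can be confirmed with Python's standard-library robots.txt parser, using the spec's own /help example:

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /help" blocks any path that *starts with* /help ...
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help"])
assert not rp.can_fetch("*", "https://example.com/help.html")
assert not rp.can_fetch("*", "https://example.com/help/index.html")

# ... while "Disallow: /help/" blocks only the directory, not /help.html
rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
assert rp2.can_fetch("*", "https://example.com/help.html")
assert not rp2.can_fetch("*", "https://example.com/help/index.html")
```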
1:50 am on Apr 28, 2017 (gmt 0)

New User

joined:Mar 14, 2017
posts:10
votes: 0


Maybe it will be better if you remove those pages from the sitemap. This method with 100% accuracy will stop crawling them.
3:55 am on Apr 28, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14031
votes: 521


This method with 100% accuracy will stop crawling them.

Why would it? A sitemap doesn't mean "crawl only these pages". It means "be sure not to overlook these pages".
3:47 pm on May 3, 2017 (gmt 0)

New User from FR 

joined:Apr 28, 2015
posts: 25
votes: 2


Now, wait. This message means that the search engine (G is not the only one) has indexed but not crawled the page.

To my knowledge Google does need to crawl in order to index; you disagree, lucy24?
But regardless yeah I see hits on in within the blog..
He's clearly ignoring the robots.txt directives and I don't really get why; that directive hasn't changed in at least 6 years, it's valid (checked with GWT), and they've never been crawled or indexed before. I wonder why he just started ignoring it. I looked at logs over the past 2 years: no hits until last month...

You may choose to take the opposite approach: permit crawling but deny indexing by putting a robots meta on each page. (Some search engines also recognize the X-Robots header, which can be applied globally. I don't remember if Google does.


Yep, exactly what I did.. no real choice, since I need him to get them out of his index... (added X-Robots noindex and unavailable_after)
But it's still very weird to me that Google decided to stop respecting the robots.txt directives...
4:26 pm on May 3, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14031
votes: 521


To my knowledge Google does need to crawl in order to index, you disagree

Yes, they are two separate processes. A page can be
-- crawled but not indexed (the page's robots meta, and/or x-robots header, says noindex)
-- indexed but not crawled (shows up in SERP with blurb about "this page's robots.txt")

Obviously they would prefer to index pages that they have actually seen. But if 800 different pages on good sites say "example.com/pagename.html is the world's single best source for information on widgets" then that may be enough to put your page into the SERPs even if they're not allowed to crawl the page.

So far in this thread you have not given hard evidence that the Googlebot (funny, I never thought of it as having a gender) is in fact crawling--and you've shown some evidence that it is not crawling.

I see hits on in within the blog

Typo? Can we see a sample line from access logs showing the Googlebot requesting a page it isn't supposed to be requesting?

Now, if we wanted to go into tinfoil-hat territory, we could postulate that every time an URL shows up in a SERP with the blahblah about "this site's robots.txt", it really means that the Googlebot--or an agent operating under a pseudonym--has in fact crawled the page and considers it to be worthwhile, but they won't admit to having crawled it, because they've got a reputation to maintain. But I've never got a clear sense of a "plainclothes Googlebot" analogous to the ones sent out by bing and yandex. (Admittedly this could just mean they're a lot better at it :()
7:32 pm on May 3, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7782
votes: 530


I'm with lucy24 ... in that the IP ranges of "googlebot" will be the tell all.

As for g hitting pages that have been disallowed in robots.txt... we know g never forgets a url it has met, either on your site or someone else's. We also know g tests these urls, even with a robots.txt directive in place.

.htaccess is one place to enforce ... it's not like we don't know the ip ranges the g bot uses. The OFFICIAL g bot.
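A sketch of that kind of enforcement in .htaccess (mod_rewrite; keyed here on the UA string, though RewriteCond %{REMOTE_ADDR} works the same way for IP ranges):

```apache
# Return 403 Forbidden for any request into /folder/ whose
# User-Agent claims to be Googlebot
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^folder/ - [F]
```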
1:06 am on May 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month

joined:Sept 14, 2011
posts:836
votes: 70


Wrap the link in a div called "affiliate", then in PHP cloak the link from Google.
5:48 am on May 31, 2017 (gmt 0)

New User

joined:May 16, 2017
posts:2
votes: 0


Googlebot honors robots.txt very well.

I can't believe people here with 10+ year experience says, robots.txt does not block from crawling. Are you out of your mind? Just check Google's own advertising network [googleadservices.com...]: they block all polite bots in their robots.txt, and if you have placed AdSense ads, do a Fetch and Render from Search Console and you will see those scripts are blocked and Google can't execute them.

@OP: You can't block crawling of individual links; you can only block by directory in robots.txt, and everything under that directory will then go uncrawled by Googlebot.
6:13 am on May 31, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:9651
votes: 483


Hi Arjunsinh and welcome to WebmasterWorld [webmasterworld.com]
I can't believe people here with 10+ year experience says, robots.txt does not block from crawling
Believe it. Robots.txt doesn't *block* anything. Hundreds of bots ignore it altogether. Robots.txt only works on those bots that support it.
5:57 pm on May 31, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14031
votes: 521


Robots.txt only works on those bots that support it.

Well, of course. It's like a “No Admittance” sign: burglars don’t (or can’t) read, so you need a deadbolt to back it up. But the question was whether the Googlebot, specifically, honors robots.txt.
6:03 pm on May 31, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:9651
votes: 483


But the question was whether the Googlebot, specifically, honors robots.txt.
Hard week Lucy24? :)

I was responding to Arjunsinh (quoted)