Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Consequences of blocking robots or just using noindex,follow

         

gn_wendy

11:40 am on Aug 26, 2009 (gmt 0)

10+ Year Member



I am working on a site with over 6 million unique pages in total. All 6 million pages are available in over 10 languages, all on the same domain.

Currently we have been setting content not relevant to a specific language to 'noindex'. 'Not relevant' is probably the wrong term ... it is not relevant for users who come through the search engines.

We have decided to try out what would happen if we instead blocked the irrelevant content using robots.txt. There are two reasons for this: the number of pages bots can crawl is reduced, so crawling/indexing of important pages should be a lot faster, and we save a lot of server power.
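For readers comparing the two mechanisms, they look roughly like this (the path is hypothetical):

```text
<!-- Option 1: robots meta tag on each page. The page is still crawled,
     is kept out of the index, and its links are still followed. -->
<meta name="robots" content="noindex, follow">

# Option 2: robots.txt. The page is never crawled at all.
User-agent: *
Disallow: /irrelevant-language-section/
```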

The thing is, all the links on the pages that were 'noindex, follow' will now cease to exist as far as bots are concerned.

Any thoughts?
...as I said, we are currently testing, so I'll probably post an update here when I've had a chance to get some results.

tedster

5:16 am on Aug 27, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The biggest difference is that when you use noindex,follow robots meta tags, some PageRank still circulates through the links on those pages. By prohibiting spidering altogether, those links are dropped from the webgraph and no longer pass along any PageRank or other link influence.
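This effect can be illustrated with a toy PageRank computation. The graph and page names below are invented purely for illustration: "tagged" is the noindexed page, and "deep" is a page reachable only through it.

```python
# Toy PageRank via power iteration, to illustrate the difference between
# noindex,follow (page stays in the link graph) and a robots.txt block
# (page and its outgoing links vanish from the graph). Graph is invented.

DAMPING = 0.85

def pagerank(graph, iters=100):
    """graph: {page: [pages it links to]} -> {page: score}."""
    pages = list(graph)
    n = len(pages)
    pr = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iters):
        new = dict.fromkeys(pages, (1.0 - DAMPING) / n)
        for page, outlinks in graph.items():
            targets = outlinks or pages        # dangling page: spread evenly
            share = DAMPING * pr[page] / len(targets)
            for target in targets:
                new[target] += share
        pr = new
    return pr

# With noindex,follow: "tagged" is crawled (though not indexed), so its
# link to "deep" still passes PageRank.
with_noindex = {"home": ["tagged"], "tagged": ["deep"],
                "deep": [], "stray": []}

# With a robots.txt block: "tagged" is never crawled, so "deep" loses its
# only inbound link and is no better off than the unlinked "stray".
blocked = {"home": [], "deep": [], "stray": []}

pr_meta = pagerank(with_noindex)
pr_blocked = pagerank(blocked)
print(pr_meta["deep"] > pr_meta["stray"])                 # deep still receives equity
print(abs(pr_blocked["deep"] - pr_blocked["stray"]) < 1e-9)  # now it doesn't
```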

setting content not relevant to a specific language to content=noindex

This is confusing to me. Why not let Google index every language version you have, and then return whatever is appropriate, based on the actual query and the person who is searching? Why try to decide ahead of time what is or is not relevant -- and then deny that page even the chance to be in the index?

internetheaven

9:16 am on Aug 28, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've always found all search engines to be slightly incompetent when it comes to robots.txt -- a large portion of my sites have ended up with hundreds of URL listings in the search results. Now, I always use noindex (although I hear Bing has been screwing up noindex just as badly recently).

Robert Charlton

7:06 pm on Aug 28, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I've always found all search engines to be slightly incompetent when it comes to robots.txt -- a large portion of my sites have ended up with hundreds of URL listings in the search results. Now, I always use noindex....

Note that while robots.txt will keep Google from spidering a page, it will not prevent Google from indexing other references to that page if they appear on pages which Google does spider. This is how those URL listings can end up in the serps.

If you want to keep both the page and references to the page from being indexed, then use the noindex,follow robots meta tag on the page (at least for Google).

There's a twist to this, though. If you use the robots meta tag on a page, don't also use robots.txt to block the spidering of the page. The reason?... if Google doesn't spider the page, it won't see the robots noindex meta tag.
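The conflict described here, sketched concretely (the /es/ path is just an illustration): with the robots.txt rule in place, Googlebot never fetches the page, so the meta tag on it is never read.

```text
# robots.txt
User-agent: *
Disallow: /es/

<!-- On /es/page.html. Never seen by Googlebot while the rule above is in
     place, so the URL can still surface via external references. -->
<meta name="robots" content="noindex, follow">
```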

leadegroot

9:24 am on Aug 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Although it can be legitimate to back up a robots.txt exclusion with a meta noindex -- I have occasionally had robots.txt-excluded files end up in the index anyway.

gn_wendy

7:26 am on Aug 31, 2009 (gmt 0)

10+ Year Member



This is confusing to me. Why not let Google index every language version you have, and then return whatever is appropriate, based on the actual query and the person who is searching? Why try to decide ahead of time what is or is not relevant -- and then deny that page even the chance to be in the index?

Having all our pages indexed would be great! ...but Google will never index all ~80 million pages. We can't even get Google to keep all 6 million we haven't set to 'noindex' in the index. The reason we set the pages to 'noindex' was that Google was indexing our pages badly. After we added the noindex tag, traffic has been increasing.

The question still remains - how much link power do those internal links count for?

bwnbwn

7:42 pm on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



how much link power do those internal links count for?
This all depends on your interlinking within the site and on who has linked to the pages because the information was deemed worth passing along.

Is it 80 million pages of unique content in 10 languages, or 6 million pages of the same content in 10 different languages on the same website?

I assume each language is in a folder with identical navigation -- or does each language have different navigation?
Are page URLs rewritten into the language of the folder?
Are there different images in each language, with correct alt tags to reflect the language?

gn_wendy

7:42 am on Sep 1, 2009 (gmt 0)

10+ Year Member



I assume each language is in a folder with identical navigation -- or does each language have different navigation?
Are page URLs rewritten into the language of the folder?
Are there different images in each language, with correct alt tags to reflect the language?

It's a total of ~80 million pages. However, the content is only truly unique for roughly 6 million pages divided over 10 languages, plus another 5 languages with a limited amount of content. There are also a lot of pages for [widgets] relevant only to English speakers, for example, and then [other widgets] relevant only to Spanish speakers.

The point is that we allow access to all pages in all languages for users. The [other widgets] pages can be viewed in English also, i.e. the pages exist when a user visits the website in English even if those pages are targeted at users from Spain and Latin America.

It makes sense to not have these pages indexed, because it steals 'focus' from the pages that we want indexed.

Like I said, we are currently testing to see the effects of 'noindex' vs. a robots.txt disallow. So far the results are not really positive with regard to the disallow solution, though, other than the fact that we are saving lots of server power.

Yes. Each has its own folder, but the navigation is identical. URLs, folders and sub-domains are different in each language, same with alts and other META. Some languages (like Chinese) have a totally different domain hosted on a local server.

tedster

7:49 am on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It sounds like robots.txt might be your best approach. That way Google will allocate your crawl budget over just the URLs you want to see indexed.

If you use a noindex meta tag, then googlebot must retrieve the page in order to see the noindex directive, and there goes a bit of the crawl budget that was allocated to your site.
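For the folder-per-language setup described earlier in the thread, the robots.txt approach might look like this (the folder names are invented):

```text
# Block crawling of sections not relevant to each language folder,
# so crawl budget is spent only on the pages meant for that audience.
# Note: these URLs can still be indexed from external references.
User-agent: *
Disallow: /en/spanish-only-widgets/
Disallow: /es/english-only-widgets/
```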

gn_wendy

8:51 am on Sep 1, 2009 (gmt 0)

10+ Year Member



Thanks for the input, tedster. Allocating the crawl budget is exactly what I was aiming for.

The question is how the link graph will change, and what the effects will be vs. the noindex approach. I guess only time and testing will tell.

bwnbwn

2:33 pm on Sep 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



gn_wendy, I see now what you are doing -- I was thrown off by the 6 million and then the 80 million. I agree with your approach as well.

One quick question: how did you get the 6 million pages into 10 languages?

I have looked into this and find that getting one of our sites translated into another language is beyond our company's budget. I have worked with different programs to try to reduce the cost, but find the programs do an OK job -- just not good enough.

gn_wendy

7:30 am on Sep 3, 2009 (gmt 0)

10+ Year Member



One quick question: how did you get the 6 million pages into 10 languages?

I have looked into this and find that getting one of our sites translated into another language is beyond our company's budget. I have worked with different programs to try to reduce the cost, but find the programs do an OK job -- just not good enough.

The website was started long before my involvement, but the basis is a string-based system. That is to say, any page created is coded 'language neutral', which allows any page to be viewed in every language.

...but everything has been translated. Rome was not built in a day.
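A minimal sketch of such a string-based system -- the string table, keys and fallback behaviour here are invented for illustration, not the site's actual implementation:

```python
# "Language neutral" page rendering via a string table: templates reference
# string keys, so the same page can be rendered in any language for which
# translations exist, with English as the fallback.

STRINGS = {
    "page.title": {"en": "Widgets", "de": "Widgets", "es": "Aparatos"},
    "page.intro": {"en": "All about widgets.",
                   "de": "Alles über Widgets.",
                   "es": "Todo sobre los aparatos."},
}

def t(key: str, lang: str, fallback: str = "en") -> str:
    """Look up a string key, falling back to English if untranslated."""
    entry = STRINGS[key]
    return entry.get(lang, entry[fallback])

def render_page(lang: str) -> str:
    # A real system would run the keys through a template engine instead.
    return f"<h1>{t('page.title', lang)}</h1><p>{t('page.intro', lang)}</p>"

print(render_page("es"))
```

Adding an eleventh language is then a matter of translating the string table, not creating new pages.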

johnser

4:31 pm on Oct 20, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Tedster posted

Why not let Google index every language version you have, and then return whatever is appropriate, based on the actual query and the person who is searching? Why try to decide ahead of time what is or is not relevant -- and then deny that page even the chance to be in the index?

A related Q to that of the thread, but if you had 100 pages of content in each of French, German & Italian on .fr/.de/.it (i.e. 300 pages on each site), would it not be more sensible to block Gbot from seeing the German content on the fr/it sites & just encourage Gbot to list the German content on the de site?

Surely this would eliminate risks of dupe content and ensure that users are getting the right content?

gn_wendy

5:33 pm on Oct 20, 2009 (gmt 0)

10+ Year Member




A related Q to that of the thread, but if you had 100 pages of content in each of French, German & Italian on .fr/.de/.it (i.e. 300 pages on each site), would it not be more sensible to block Gbot from seeing the German content on the fr/it sites & just encourage Gbot to list the German content on the de site?

Surely this would eliminate risks of dupe content and ensure that users are getting the right content?

Yes. That is exactly it.

Now here's what happened ;)

We basically (using the quoted example) excluded all the "German pages from the Italian website"-ish. It's not as simple as that, but for the purposes of the example that will suffice. Bear in mind, these pages were already set to 'noindex', so G would never return them in the SERPs (except as URLs if you did a site: search) and there was no risk of duplicate content.

Since Google had far fewer pages to crawl, the idea was that G would cover the nice and juicy content faster, and thus improve indexing.

Indexing has improved, but linkbuilding has been going on at the same time - so there may be an influence there.

The Webmaster Tools indexed-page count (based on sitemaps containing only the 'relevant content') went up ~12% over the course of a month, which is a lot higher than I have seen before.

The pages in the index based on a "site:" search increased by a whopping 34%. I do find the site: search a bit dodgy at times, though -- i.e. it displays high (and unexplained?) volatility.

Noteworthy is also that Google didn't allot the same crawl budget: G picked up on the robots.txt change and crawled at a much lower rate. The net result, though, was that more pages were crawled faster.

Number of pages / pages crawled per day = up 6.5%

Number of 'relevant' pages / pages crawled per day = ~70%

I am not a math guru, and the stats are based on figures out of Webmaster Tools, so they are not 100% accurate due to rounding errors and averages, but I am confident enough in their accuracy to post them here.

Now, the interesting thing was what the effect of removing all the internal links would be -- answering the question of how much power all those links in the 'noindex' longtail had. That is to say, G no longer crawls the 'noindex' pages and doesn't 'know' about all those links any more.
The good news is it didn't affect rankings for the linked-to pages in any major way. There was some definite movement (down - up a bit - down again - then back up) for a few weeks, but the higher-level links pointing to those pages seem to be what actually helped the pages rank well in the first place.

Altogether this was a good move. There is a lot less load on the servers, the rankings weren't affected to any great extent, the freshness of pages in the G index has improved, and the number of pages in the index has increased.

Hope this helps someone, and if you have any questions I'll be happy to answer to the extent that I can.

johnser

9:46 pm on Oct 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's a really informative post gn_wendy
Many thanks.

Does anyone else have a view on that?