Forum Moderators: Robert Charlton & goodroi
If there's a way to disallow those "sort" URL patterns in robots.txt, I prefer that approach even more. Googlebot sees that disallow rule and doesn't even request the URL. With a noindex meta tag, it still needs to spider the URL just to read it, and that can be a drain on the crawl budget they've allocated to your domain.
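If the sort links append a query parameter, a wildcard Disallow can cover all of them in one rule. A sketch, assuming the sort URLs use a `?sort=` parameter (your parameter name may differ):

```
User-agent: *
Disallow: /*?sort=
Disallow: /*&sort=
```

The second rule catches URLs where `sort` isn't the first parameter in the query string. Google supports the `*` wildcard in robots.txt, though not every crawler does.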
crawl budget
Never heard that term before. Nice one though!
I used it and google crawled my pages anyway. Use noindex. It is better.
Aye, this is where it gets silly. I've always had problems with google indexing robots.txt excluded files so went with noindex meta tags ... but then Bing started indexing all my noindexed pages!
I've added sorting capability to my site which has created many new URLs, but I've noindex'ed all of them.
I have a site with around 550 pages and only 13 of them are indexed by Google. The others are all noindexed and I have not seen any problems from that. The 13 pages that are indexed rank fine.
Due to the number of variants and customisations websites have these days, I doubt google would be stupid enough to think there is something wrong with a site that noindexes even 90% of its content. Even if you only have a "list alphabetically" and a "list by price" button on your product pages, you've just tripled your URL count, and the extra two-thirds are all, essentially, duplicates. noindex is really the best thing.
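For anyone new to this: the standard robots meta tag goes in the `<head>` of each sort-variant page, e.g.:

```html
<meta name="robots" content="noindex">
```

Googlebot still has to fetch the page to see the tag, but the URL gets dropped from the index.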
I used it and google crawled my pages anyway. Use noindex. It is better.
Did G' actually crawl the pages, or did you see the URL in the SERPs?
There is a major difference between these two. If G' finds the URL as a link from somewhere (which is usually the case), then G' will return that URL if your search is specific enough - but without any page title or snippet.
If you see that G' has actually crawled a page (through your server stats), it may well be a problem with your robots.txt.
Robots.txt has been a great help in steering G' away from large portions of 'noindex' sites in order to get more out of the crawl budget.
The "detected" date next to each disallowed URL is updated frequently so Google is requesting these URLs again and again. But I think they just request the URL without reading the content.
What I'm not sure about is whether thousands of URLs disallowed in robots.txt will have a negative effect on your crawling or indexing budget. It might, since the URLs are still being "detected".
I've been using "parameter handling" in WMT to stop Google visiting the disallowed URLs over and over but it did not help.
Disallowing URLs in robots.txt should keep you from spending your crawl budget on pages not in the index, but I personally use noindex,follow (follow being the default, so unnecessary in the actual robots meta tag) because I want credit for any links pointing to those pages and want the content to be known to exist on the site. I usually do this for content that is similar but not identical, rather than for sort pages.
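Worth noting that the same directive can also be sent as an HTTP response header instead of a meta tag, which is handy for non-HTML files (PDFs etc.) where you can't add a tag to the markup. The response would carry something like:

```http
HTTP/1.1 200 OK
X-Robots-Tag: noindex, follow
```

Google documents `X-Robots-Tag` as equivalent to the robots meta tag; how you set the header depends on your server or application.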
I think it comes down to which is more valuable in the situation: crawl budget or inbound links to the pages. Personally, in my situation(s) the pages do attract links and I want the content to be known by G, so I use noindex, which gives them access to the content and keeps some of the link weight being passed into the site.
Noindex or robots.txt are really the best answers to this question.