Forum Moderators: Robert Charlton & goodroi


Googlebot found an extremely high number of URLs on your site


realmaverick

10:17 pm on Apr 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Got this in my Webmaster tools today.

Yup, they're right, they have listed lots of junk URLs that "may be problematic". They're pages that have a meta noindex and are forbidden in robots.txt, and have been for several weeks. An example is forum URLs with params such as &prune_day=100&sort_by=Z-A&sort_key=last_post&topicfilter=all&st=1146

What do I do about this? It seems they're now not even honoring a simple robots.txt

Gah, Google is driving me insane. ABSOLUTELY INSANE.
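
For what it's worth, the robots.txt rules for parameterized URLs like those usually look something like this (the exact patterns below are just a sketch, not my real file; Googlebot supports the * wildcard in Disallow):

```
User-agent: *
Disallow: /*prune_day=
Disallow: /*sort_by=
Disallow: /*sort_key=
```

Each pattern matches any URL containing that parameter anywhere in the path or query string.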

tedster

3:03 am on Apr 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Others have been reporting something similar. For now, I'd just ignore it - as long as you are 100% certain those URLs are disallowed and they are not linked to directly from within your site.

realmaverick

3:08 am on Apr 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Tedster, what's the implication of noindexing member profiles, for example, but still having links going to them? They're actually noindex,follow.

I realise it's sending a slightly conflicting message: on one hand I'm asking Google not to index a page, and on the other I'm sending the spider to it.

It's a difficult situation: I want users to be able to view one another's profiles, but there are 2 million of them, and to Google and the searcher they're completely worthless. That's far too many pages to let Google index.

There need to be better methods to help in situations like this.

tedster

3:15 am on Apr 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You're doing what you can do - and the PR still circulates through the pages because of the "follow".

Another approach would be to allow all the profiles to be indexed, and just let Google sort it out. Sometimes certain prominent members may attract lots of external backlinks and Google might prefer to keep those pages in the index. You might even remove the noindex just for profiles that are particularly active.

realmaverick

3:45 am on Apr 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You might even remove the noindex just for profiles that are particularly active.


Interesting thought. I'll take a look and see how possible that is.

The idea of leaving Google to its own devices leaves me feeling a tad nauseous haha.

buckworks

5:01 am on Apr 13, 2011 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



pages that have a meta noindex and forbidden in the robots.txt and have been for several weeks


It seems they're now not even honoring a simple robots.txt


I suspect the problem might be that they ARE honoring robots.txt, and because of that they don't even know that you want those pages out of the index.

Think about it: If Google had already indexed certain pages, which were later changed to noindex, how would Google know about the noindex directive if robots.txt instructs them not to ever check those pages again?

Google can only respect a directive if it can detect the directive.

Not everyone would agree with me on this, but I don't think it's wise to use both noindex and robots.txt together, ever, and especially not before you have verified that Google has discovered the noindex directives and the pages are gone from the index.

FWIW, I use "noindex,follow" on hundreds of pages I don't want indexed and the combination has always worked as expected, with no known problems.
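
The catch-22 can be demonstrated with Python's standard-library robots.txt parser. Once a path is disallowed, a well-behaved crawler never fetches the page, so an on-page noindex there is invisible to it. (The rules and URLs below are made up for illustration.)

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks profile pages outright.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profiles/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot is told not to fetch this URL at all...
print(rp.can_fetch("Googlebot", "https://example.com/profiles/12345"))  # False

# ...so any <meta name="robots" content="noindex"> inside that page
# is never downloaded, and Google can never act on it.
print(rp.can_fetch("Googlebot", "https://example.com/"))  # True
```

That's exactly why a noindex only takes effect on pages the crawler is still allowed to request.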

crobb305

6:42 am on Apr 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



BuckWorks: I house affiliate links in a PHP redirect file. For several years, I had that file denied to Googlebot, until I realized the redirect links were still being indexed long after I deleted them from the file. Google didn't know they were 404 or 410, because Googlebot didn't have access. So, over a two-year period, affiliate links came and went (through various networks, affiliate link testing, etc.) and those links were accumulating in the Google index (URL-only, since Google had no data about them; they were also blocked at the final redirect on the merchant's end).

Since they are blocked at the merchant end as well, Google still can't determine what is on the page. So, my redirect URLs could be getting pegged as unknown/thin pages. This is why it was crucial for me to eliminate the thirty dead/404/410 URLs from the index and keep them out.

Essentially, I have 3 active links, but there were 30 dead ones that had been deleted from the redirect file, which Google kept in the index. So, I removed the restriction in robots.txt and forced 410/Gone on the dead parameters. Now only the 3 active links remain.
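
The logic is roughly this (sketched in Python purely for illustration; the real file is PHP, and the slugs and target URLs here are invented):

```python
# Active affiliate slugs mapped to their tracking URLs (made-up examples).
ACTIVE_LINKS = {
    "merchant-a": "https://affiliate.example.com/track?id=123",
}

def handle(slug):
    """Return (status_code, headers) for a requested redirect slug."""
    if slug in ACTIVE_LINKS:
        # Live link: redirect the visitor, and tell crawlers via an
        # HTTP header not to index the redirect URL itself.
        return 302, {
            "Location": ACTIVE_LINKS[slug],
            "X-Robots-Tag": "noindex",
        }
    # Deleted link: 410 Gone, so search engines drop it from the index.
    return 410, {}

print(handle("merchant-a")[0])     # 302
print(handle("old-dead-link")[0])  # 410
```

The X-Robots-Tag header is the header-based equivalent of a meta noindex, which matters for a redirect script that never outputs any HTML.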

My question now is: what do I do from this point forward? My fear is that if I put the robots.txt block back (to conceal the affiliate links from other search engines), I will have this problem all over again (if inbound links to those old dead URL parameters still exist somewhere and can be discovered). Should I just put noindex on the link? I have never used noindex, so I obviously fear starting anything brand new.

I see a lot of affiliate sites ranking well who merely put noindex on their redirected affiliate links.

Sorry I typed this very fast as my sleeping potion is kicking in :)

I know this was long and wordy, but basically I am wondering:
A) Should I put noindex on each redirect link so it doesn't follow through to the merchant (via all their redirects)?

B) Is there a way to put a noindex directive inside a PHP script?