Forum Moderators: Robert Charlton & goodroi

Google indexing large volumes of (unlinked?) dynamic pages

Receptional Andy

8:48 pm on Oct 28, 2007 (gmt 0)



Here's an odd one for a small site (around 300 pages) with medium pagerank.

In the last week or so Google has indexed a succession of URLs that appear to be unlinked from anywhere. These are in two categories:

- Search result pages

Google is up to 2,130 of these. They are all single-word searches for words that do actually appear somewhere on the site. The search itself is simple and does not link to any search results other than next/previous pages.

- Results for an online tool

This involves a user-entered URL (using GET). I've tracked down a few hundred of these that Google has requested, for a bizarre mix of URLs, from massive sites to individual blog posts.

I'm only at the start of my detective work for this (I'm going to grab all of the search keywords indexed and the URLs checked and see if that throws up any clues, and do a bit more in-depth log analysis). I can't find any links to any of the pages indexed on Google or Yahoo.
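
As a rough illustration of the log analysis described above, here is a minimal Python sketch that counts the search keywords requested by anything identifying itself as Googlebot. It assumes a combined-format access log, a search page at /search and a keyword parameter named q; all three are placeholders for illustration, not details of the actual site.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Matches the request and user-agent fields of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_search_terms(lines, search_path="/search", param="q"):
    """Count search keywords requested by user agents claiming to be Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        url = urlsplit(match.group("url"))
        if url.path != search_path:
            continue  # not a search result page (path is an assumption)
        for term in parse_qs(url.query).get(param, []):
            counts[term.lower()] += 1
    return counts
```

Feeding it an open log file, e.g. googlebot_search_terms(open("access.log")), gives a Counter whose most_common() list can be compared against the search URLs that appear in the index. The user agent string can be faked, so this only shows what claimed to be Googlebot.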

Here are my initial speculations:

- Someone may be linking to these pages deliberately, perhaps with a bit of noindex/follow. That would seem to be a bit pointless.

- Google might be indexing the pages based solely on the toolbar or another mechanism.

- These pages have either been indexed for some time, or have built up over time. It is some change at Google that has made them visible now. This would also explain why the two very different types of page both suffer from the same problem now.

- I've screwed something up so that the pages are being linked to from the site, via some misbehaving script.

I can easily block the content from search engines, but for now I'm interested in tracking down the source, and I may as well see what the effect of thousands of junk pages on the site's performance is! ;)

Anyone have any suggestions as to what may have happened here?

One aside: Google really seems to like making troubleshooting difficult these days. The amount of hacking around just to get a complete list of indexed pages is starting to be an annoyance!

Receptional Andy

3:21 pm on Mar 1, 2008 (gmt 0)



Oliver:

It's an annoyance, not least because of the potential for damage to search engine performance. It also creates a number of other tasks that under 'natural' circumstances would be unnecessary. I also dislike spidering that delivers no benefit - it's a waste of resources for everyone.

> decide them from mere paranoia

Cheeky bugger! ;)

I don't think I've been at all irrational in this thread. Indeed, I just won't accept one explanation where there is little compelling evidence in favour of it. At best, there may be a most likely explanation.

theBear: Of course such a system would be straightforward to compile with freely available software. But even if we assume that's the cause, there are still a number of unanswered questions:

Why undertake this activity? Is this a recent phenomenon or has it been going on for some time? If it's new, is there now some software purely for this purpose that's been made available, and hence we are likely to see an upsurge across other sites?

Identifying the cause with good empirical evidence will go some way to answering these questions, which as far as I'm concerned remain unresolved.

theBear

3:36 pm on Mar 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Andy,

Ever hear of Google Bombing or Google Washing?

All of which are ways of manipulating search placement in Google for various forms of "profit".

In short, if you can't beat them fair and square, take the other path. There's $$$$ in them thar SERPs; just ask most members of this forum.

I can't believe that you aren't aware of the games that get played. There are threads all over WebmasterWorld dealing with the effects of funny goings on. A large number of them reduce to a few simple forms.

pageoneresults

4:58 pm on Mar 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A little OT, I'm watching a bot right now generate hundreds of 404s on one of my client sites. It is trying to generate query strings from various areas of the site where a query "might" be generated. Thing is, we don't allow that type of stuff, so it just sits there generating hundreds of 404s and invalid URI strings until it decides to leave or we kick its arse out!

If you have areas of a site that are open to this type of vulnerability, and I think it is a valid one, then that needs to be locked down to prevent this from happening. Someone or something found it and is doing something with it right now. You say it isn't having an impact but are you 100% certain that is the case? How do you determine whether or not it is?

Oliver Henniges

9:26 pm on Mar 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I don't think I've been at all irrational in this thread.

I didn't mean to say you were. I think pageoneresults will agree that paranoia is far from irrational these days; it's rather vitally important for anyone managing his own server on the web.

> it's a waste of resources for everyone.

Sure.

The IPs you gave obviously came from one of the Google PCs. First of all, I'm sure the Googlers reading here will well appreciate you pointing to any "wrong" behaviour of their machines, but of course won't comment on it. And if those GET requests were part of a planned strategy, they won't comment either.
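
For anyone wanting to confirm that an address from the logs really belongs to Google, the usual check is a reverse DNS lookup followed by a forward lookup on the returned hostname. A minimal Python sketch follows; the function name is made up for illustration, the resolver arguments exist only so the logic can be exercised without a network, and gethostbyname_ex covers IPv4 only.

```python
import socket

# Hostname suffixes Google's crawlers are expected to resolve to.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_crawler(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        host = reverse(ip)[0]
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must lead back to the same address,
        # otherwise the PTR record could simply be forged.
        return ip in forward(host)[2]
    except OSError:
        # No PTR record or a failed lookup: treat as unverified.
        return False
```

A user agent string alone proves nothing, so a check along these lines is worth running before concluding that requests came from Google's own machines.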

Another thing that came to my mind: it is estimated that Google is running half a million PCs now, maybe the second largest cluster after the Pentagon. Assuming that many of these PCs are running at maximum CPU load most of the time, they may have an average life-span of 1-3 years, which means that every single day some 500-1000 Google PCs break down, with all the unplannable technical consequences. No one knows what happens in those last seconds, particularly if the machine has a sort of hub role within the cluster, so I assume Google has learned to live with masses of dirty data, more or less.

Of course it is annoying, but I wouldn't really worry. For myself I take this topic as a kick in the arse to get my .htaccess and GET-cgis a little bit safer.

@pageoneresults:
>Thing is, we don't allow that type
How? via htaccess?
Do you also 404 or 301 an address like
www.yourdomain.com/page-123.htm?q=nonsense

Receptional Andy

9:44 pm on Mar 1, 2008 (gmt 0)



> paranoia is far from irrational these days

Paranoia is by definition irrational, otherwise it requires a different name.

> Do you also 404 or 301 an address like
> www.yourdomain.com/page-123.htm?q=nonsense

Or a 403/401 etc. I'm thinking that this is going to have to be a part of my default site template: only allowed variables would be accepted (though this is still open to this kind of abuse, incidentally). Gah. The internet is no fun any more ;)
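
To make the "only allowed variables" idea concrete, here is a minimal Python sketch of an allow-list check. The paths and parameter names are hypothetical, and a real site template would hook this into whatever handles the request.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical allow-list: path -> query parameters that page accepts.
ALLOWED_PARAMS = {
    "/search": {"q", "page"},
}

def status_for(url, allowed=ALLOWED_PARAMS):
    """Return 200 if every query parameter is expected for the path, else 404."""
    parts = urlsplit(url)
    # Pages not listed accept no query parameters at all.
    expected = allowed.get(parts.path, set())
    supplied = {name for name, _ in
                parse_qsl(parts.query, keep_blank_values=True)}
    return 200 if supplied <= expected else 404
```

With this table, /page-123.htm?q=nonsense is answered with a 404 while /search?q=widget passes. It only checks parameter names, so an allowed parameter can still be stuffed with arbitrary values, which is the remaining abuse mentioned above.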

theBear

10:12 pm on Mar 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"Gah. The internet is no fun any more ;)"

The internet is still fun.

No matter what you do, you have to remember, there really are "bad" folks out there.

pageoneresults

12:09 am on Mar 2, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> How? via htaccess?

Yes but we use ISAPI_Rewrite (on Windows).

> Do you also 404 or 301 an address like www.yourdomain.com/page-123.htm?q=nonsense

Due to the rewrite procedures in place, that would most likely get a 404, unless it matches one of our patterns, in which case it will get 301'd to the appropriate destination.
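
The decision described here can be sketched in a few lines of Python. This is not ISAPI_Rewrite configuration, and the single rule shown is invented for illustration; it only shows the shape of the logic for a request that does not resolve to a real page: match a known pattern and redirect, otherwise return a 404.

```python
import re

# Hypothetical rule: old dynamic product URLs map to static pages.
REDIRECT_RULES = [
    (re.compile(r"^/product\.asp\?id=(\d+)$"), r"/products/\1.htm"),
]

def resolve(url, rules=REDIRECT_RULES):
    """Return (301, destination) for a URL matching a rule, else (404, None)."""
    for pattern, target in rules:
        match = pattern.match(url)
        if match:
            # Fill the captured groups into the destination template.
            return 301, match.expand(target)
    return 404, None
```

Anything a bot invents that does not fit a rule falls through to the 404, so made-up query strings never produce an indexable page.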

fishfinger

10:56 am on Mar 2, 2008 (gmt 0)

10+ Year Member



> one example (which expects a URL input) is also being populated with a whole load of valid URLs

I don't know if this is related, but I thought I'd mention it just in case. We had something a bit like this on one of our sites last Autumn.

We found one morning that our main rankings had disappeared because Google had de-indexed our home page. The inurl: search to try to find it in the index revealed one interesting page on another person's site with our full URL in its URL, i.e.

somesite.com/script.php?site=www.oursite.com

Someone was running a joke script on their site that 'urbanised' pages. In other words you give it a url, it spiders the page and reproduces it word for word with 'da' instead of 'the', 'wiv' instead of 'with' and similar.

They had (innocently as it turned out) generated several hundred of these pages for hundreds of sites - basically every page they had tried this script on.

We never found out how it happened, but Google had somehow then actually spidered all of these pages. It then saw the particular one aping our site to be more important than our actual homepage! Result - the original page gets kicked for duplicating the joke page!

We tracked down the site owner and once we explained the problem they were quite shocked and took the script and all the pages it had generated down that day. Google crawled both sites soon after and within a few days we popped straight back into the index as it dumped all the pages from the other site.

What worried me is that this seemed to sidestep Google's rules on duplication and authority. Our domain is 5-6 years old and whilst only a PR3, was certainly of higher PR than a PR0 unlinked page.

This 68 message thread spans 3 pages.