In the last week or so Google has indexed a succession of URLs that appear to be unlinked from anywhere. These are in two categories:
- Search result pages
Google is up to 2,130 of these. They are all single word searches for words that do actually appear somewhere on the site. The search itself is simple and does not link to any search results other than next/previous pages.
- Results for an online tool
This involves a user-entered URL (using GET). I've tracked down a few hundred of these that Google has requested, for a bizarre mix of URLs, from massive sites to individual blog posts.
I'm only at the start of my detective work for this (I'm going to grab all of the search keywords indexed and the URLs checked and see if that throws up any clues, and do a bit more in-depth log analysis). I can't find any links to any of the pages indexed on Google or Yahoo.
Here are my initial speculations:
- Someone may be linking to these pages deliberately, perhaps with a bit of noindex/follow. Would seem to be a bit pointless.
- Google might be indexing the pages based solely on the toolbar or another mechanism
- These pages have either been indexed for some time, or have built up over time. It is some change at Google that has made them visible now. This would also explain why the two very different types of page both suffer from the same problem now.
- I've screwed something up so that the pages are being linked to from the site, via some misbehaving script.
I can easily block the content from search engines, but for now I'm interested in tracking down the source, and I may as well see what the effect of thousands of junk pages on the site's performance is! ;)
Anyone have any suggestions as to what may have happened here?
One aside: Google really seems to like making troubleshooting difficult these days. The amount of hacking around just to get a complete list of indexed pages is starting to be an annoyance!
The search results you mentioned are a bit disturbing to me. There is a kind of "database spamming" that involves intentionally getting Google to spider many search results pages - and when such pages show in the SERPs, they are not usually very helpful at all. In fact, last year it seemed like Google intentionally took action to remove such pages from the results.
So why would they try to find pages that the webmaster has NOT intentionally exposed to indexing? Makes me wonder what sources are being tapped for each spider run. Recently a client was developing a new group of materials and they put it on a new unprotected (but also unlinked) directory.
As far as I can tell, Google would have had only two resources to find this new directory - gmail and toolbar. And yet within days the full directory was indexed and I was using Google's URL Removal Tool. However, the project did involve a lot of people, and just 'maybe' someone put a link to the url on a wiki or something. I can't be 100% sure about that.
There is another possibility. Once google has indexed any dynamic url, might they be trying to spider it with changed parameters? If so, those parameters would be taken from where, exactly? There have been start-up search engines who proposed spidering the web and submitting to search boxes and other types of forms as part of their campaign.
Regarding the search results pages, I've confirmed from server log files that Google started spidering these pages during August. Which makes it harder to track down unfortunately. However, it looks like the pages have only started appearing in results recently (algo change/passed some kind of threshold?).
Once google has indexed any dynamic url, might they be trying to spider it with changed parameters
It's a possibility, but would seem like pretty insane behaviour for a commercial spider. Clearly the majority of such spidering would result in undesirable pages.
I've actually discovered another dynamic page that is also suffering from this behaviour: this time a tool expecting numeric input is being spidered with the same single words as the search results pages. From the last six months' logs I've been unable to find any human traffic requesting any of these pages.
So I think I'm left with two possibilities:
- Someone linking to these pages for whatever reason
- Inexplicable spidering behaviour by Google.
I'm tending to favour the former, but I'm not worried about the site's performance so I'm planning on seeing what happens until/if I track down the cause. I think I can rule out internal links to these pages being accidentally created since they no longer even follow logical patterns. No other search engine has spidered any of these pages.
I think I can also (largely) rule out the toolbar/gmail since it's pretty much impossible that visitors would have ended up at many of these URLs. They've got 'machine generated' written all over them, frankly!
18.104.22.168 - - [19/Jul/2007:23:48:56 +0100] "GET /tools/example?var=validate HTTP/1.1" 200 2619 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This tool expects numerical input. The word passed in var ("validate") is one of a selection Google used that are present somewhere on the site.
22.214.171.124 - - [19/Jul/2007:23:49:26 +0100] "GET /search/?query=oblique HTTP/1.1" 200 2891 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This is the site search. Google now makes thousands of requests/month for single word searches. Again, words present on the site somewhere.
126.96.36.199 - - [19/Jul/2007:23:49:42 +0100] "GET /tools/example2?var=&var2=depends&var3=&var4=preview HTTP/1.1" 200 3360 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This is the strangest one. It's a form that uses POST. Google has requested this using GET and including the variable names from the form in the request, two of which are populated with single words.
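For anyone wanting to do the same kind of log digging, here is a minimal Python sketch that pulls the query parameters out of Googlebot requests in log lines shaped like the three above. It assumes the combined log format shown in those lines and simply trusts the user-agent string; the function name `googlebot_params` is made up for the example.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Combined log format: host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def googlebot_params(lines):
    """Count (path, parameter, value) triples in requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m or "Googlebot" not in m.group("agent"):
            continue  # unparseable line, or not a Googlebot user agent
        url = urlsplit(m.group("target"))
        # keep_blank_values=True so empty parameters such as "var=" are counted too
        for name, value in parse_qsl(url.query, keep_blank_values=True):
            counts[(url.path, name, value)] += 1
    return counts

# The three log lines quoted in this thread.
sample = [
    '18.104.22.168 - - [19/Jul/2007:23:48:56 +0100] "GET /tools/example?var=validate HTTP/1.1" 200 2619 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '22.214.171.124 - - [19/Jul/2007:23:49:26 +0100] "GET /search/?query=oblique HTTP/1.1" 200 2891 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '126.96.36.199 - - [19/Jul/2007:23:49:42 +0100] "GET /tools/example2?var=&var2=depends&var3=&var4=preview HTTP/1.1" 200 3360 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for (path, name, value), n in sorted(googlebot_params(sample).items()):
        print(path, name, repr(value), n)
```

Sorting the resulting counter by value makes it easy to compare the words Googlebot is submitting against the words that actually appear on the site.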
Also, I'm sure you don't, but do you make your server logs publicly available? Just a thought that Google is picking up/crawling from log files...hehe
Other than that it seems someone would have to be linking to the results pages on the site.
crawl/download your own site
Already done - even aggressive spidering that ignores robots rules does not visit any of these pages (accidentally got myself bot-trapped too!).
server logs publicly available
Not even accessible via http so no dice there unfortunately.
someone would have to be linking to the results pages on the site
This is still a strong possibility, but I've not ruled out this being odd spider behaviour yet!
- Googlebot is spidering GET forms by getting the form variables and either leaving them blank or assigning values to them (sometimes taken from options in the form itself)
- Google has a list of words present on the site
- This list of words is being used to populate the form variables, and the URL requested via GET
Pure speculation, obviously!
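To make the speculation above concrete, here is a minimal Python sketch of what that kind of URL generation could look like. It is only an illustration of the hypothesis, not a description of how Googlebot actually works; the function name `candidate_urls` and the word list are made up for the example, while the field names are taken from the form URL quoted earlier.

```python
from urllib.parse import urlencode

def candidate_urls(action, fields, words):
    """Yield GET URLs that fill one form field at a time with a word, leaving the rest blank."""
    for field in fields:
        for word in words:
            params = [(name, word if name == field else "") for name in fields]
            yield f"{action}?{urlencode(params)}"

if __name__ == "__main__":
    # Hypothetical inputs: a form action, its field names, and words found on the site.
    for url in candidate_urls("/tools/example2", ["var", "var2", "var3", "var4"], ["depends", "preview"]):
        print(url)
```

Even this naive version produces fields x words URLs per form, so a site with a handful of forms and a few thousand distinct words gets to tens of thousands of URLs very quickly.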
There are several thousand affected URLs in search results for this site, covering a number of distinct forms.
It would be an odd thing to do, but would allow Google to access data that would previously be hidden to spidering, I suppose.
Certainly, I can find no links to any of the affected pages, despite Google's requests for them dating back several months. In addition, there are no human requests for the URLs Googlebot is spidering. I was hoping I would be able to track down a person if that is what's behind it, since it seems likely they would have tested some of these URLs if they linked to them from somewhere.
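A check like the one described above (URLs that only Googlebot has ever requested) can be sketched in a few lines of Python. This assumes the log has already been reduced to (URL, user agent) pairs and that the user-agent string can be trusted; `googlebot_only` and the sample data are hypothetical.

```python
from collections import defaultdict

def googlebot_only(requests):
    """Return the URLs that were requested by Googlebot and by no other user agent."""
    kinds_by_url = defaultdict(set)
    for url, agent in requests:
        kinds_by_url[url].add("googlebot" if "Googlebot" in agent else "other")
    return sorted(url for url, kinds in kinds_by_url.items() if kinds == {"googlebot"})

# Hypothetical (url, user_agent) pairs, e.g. extracted from an access log.
sample_requests = [
    ("/search/?query=oblique", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("/search/?query=widgets", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
    ("/search/?query=widgets", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB) Firefox/2.0"),
    ("/", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB) Firefox/2.0"),
]

if __name__ == "__main__":
    print(googlebot_only(sample_requests))
```

Any URL in the result that also has an empty referrer in the logs is a good candidate for "nobody human ever asked for this".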
Have you tried some inurl: searches to see what links might be out there pointing to you?
seeing those urls show up in the index would not make much sense unless the spidering results went through some heavy backend testing at Google first
Yes - such a test would likely weigh in favour of pages that displayed substantially unique text - like search result pages ;)
But then, I doubt Google really wants to index search results pages, and these are very obviously such ('search' in url, title, heading etc.).
I've tried as many different types of search and engines as I can think of, in addition to checking server logs for suspicious behaviour/requests, with nothing found. I guess I'm getting a bit more dubious about the possibility that this is due to external links. To do so would require compiling a list of all words on the site minus stopwords, and then linking to a succession of these pages, probably from pages with noindex,follow so they can't be found. Which is all possible but kind of pointless, even if it was just a test.
I think the two parties most likely to be responsible are either me or Google. I certainly haven't ruled myself out yet ;)
As for Google spidering get URLs with added parameters, there seems to be reasonable evidence for this based on some URLs I've been looking at - particularly if values are set within the form.
Content being developed and tweaked all through the last seven months.
No links to site. No Toolbar usage. No "Add URL" submissions.
Registered for Webmastertools some three months ago, and robots.txt file tested several times since then.
Site not indexed in Google until three days after the first link was added from another site (about two weeks ago).
WebmasterTools also reported zero pages indexed up until then.
I'm still struggling to find an evidence-based conclusion for Google's behaviour here. Google is continuing to spider these pages at the rate of around a thousand a day, more than three times the rate last month. I've seen sites with solid links struggle to get indexed at this rate so it's very puzzling that Googlebot is doing this. Stranger still that no other search engine has a single example of such pages whereas Google has 3,000-5,000.
Friend IDs and video IDs are valid. Interestingly, some of the videos appear to be about cricket, and there was one other URL tested which is not youtube, but a site containing cricket videos.
I'm sure this means something, but I'm at a bit of a loss to know exactly what at the moment!
The symptoms are Google posting GET data to forms during its spidering. The data used appears to be mainly based on words extracted from the site itself.
For the site I originally posted about, Google has requested 519,777 (!) such URLs during February alone (for a site of a couple of hundred pages). I can't see that this activity is as a result of external links to these URLs, particularly as no such links have shown up since I originally posted this thread.
I still believe that Googlebot may be adding the GET data itself, either as a result of a misconfiguration or in order to attempt to discover new content. Frankly, this seems the most reasonable theory to me at this stage.
I'll see if I can collect any further examples, as I put this on the back burner after my post in October last year.
Furthermore if you have a Url found by Google in a format like this:
...it is quite normal and likely for a spider to also try ".../folder/file.ext" and ".../folder/" and "...example.com/". And of course if you have links on such a page which populate the parameters, they will be added to the watch/crawl list as well. I'm not sure about form fields, but I would expect form action Urls.
Make sure you have the pages denied in your robots.txt and/or with robot tags.
[edited by: tedster at 4:46 pm (utc) on Feb. 19, 2008]
[edit reason] switch to example.com - it can never be owned [/edit]
The problem is not accounted for by traditional methods of content discovery, though, as I'm talking about hundreds of thousands of URLs, accessible by humans only via GET forms (or direct manipulation of the URLs). None of these URLs has ever been visited by anyone other than Googlebot (according to server logs).
There are only two possibilities that adequately explain this:
- Someone has put up a massive amount of links to this content somewhere, on pages that are of sufficient importance to result in the massively high number of pages spidered. I pretty much rule this out now, as I'm seeing the same Googlebot activity on a totally unrelated site
- Or, this is a product of some odd spidering behaviour on the part of Google.
Of course, there may be something that I've missed, but, again, the fact that I now have two examples to look at where there is no link between the two sites is pretty suggestive.
Maybe other people's sites are also affected? If you have a site search, perhaps search Google for site:www.example.com/search or wherever your site search pages are.
[edited by: Receptional_Andy at 4:22 pm (utc) on Feb. 19, 2008]
@Receptional Andy: Well I'm not sure that you can really say for sure that no other human being has visited those pages with a Google Toolbar.
But the visitors would then show up in server log files (and they don't). And besides, I'm talking up to half a million rogue URLs.
I'll clarify the type of URL. A GET form that is expecting a URL as input. Google has spidered it thousands of times, but instead of a URL, a random word is passed instead. I say random, but they are in fact words that exist on the site itself somewhere. These are dynamic pages that to all intents and purposes do not exist before Google requests them, and have never existed before.
I can categorically say that visitors have not visited all of these such URLs in the past. In fact, the amount of such pages Google has spidered likely equates to more visitors than the site has ever received.
The amount of such pages spidered has reached a peak in February of hundreds of thousands of pages. Typical Google spidering activity was a handful of pages a day.
Anyhow - somehow Google managed to get content from your pages and now starts crawling it. I'm pretty sure you would also find this piece in your server logs, but if you are talking about hundreds of thousands of pages it might be hard to pick out of the mass.
[edited by: engine at 6:44 pm (utc) on Feb. 19, 2008]
[edit reason] examplified [/edit]
How can you rule out gmail? All Googlebot has to do is find one URL and then the rest is just normal spidering through navigation.
How can you rule out gmail
I think I must have failed to explain the problem clearly enough. I will exemplify.
A site of 100 pages. One of the pages is a tool that expects numeric input, e.g.
There are no links to results pages for this tool internally. Perhaps occasionally someone else might link to the result from an external site. Bleh.
Google has indexed as below:
Repeat this tens of thousands of times, for every word that is present anywhere on any of the 100 pages of the site.
No-one using gmail, toolbars or any other discovery mechanism has visited these URLs. Only Googlebot has ever requested the URLs. If there are links to such URLs, they do not appear in link searches at Google, Yahoo or MSN or any of the various webmaster tools offerings.
The actual number of such URLs spidered is now in the hundreds of thousands. It is highly improbable that someone has actually linked to this volume of URLs, via gmail or anywhere else. It is even more improbable that I can see the same effect on another, unrelated site that I happen to have access to.
Someone has put up a massive amount of links to this content somewhere, on pages that are of sufficient importance to result in the massively high number of pages spidered. I pretty much rule this out now, as I'm seeing the same Googlebot activity on a totally unrelated site.
Ah, you bring up a very important point. I've mentioned numerous times in related topics that I believe there is competitive sabotage that takes place at the technical level and few are aware of it until it is too late. I am really convinced that this can be, and is being, done. Ting, ting, ting...
The bots eventually figure things out but a quick feed of this type of data into the Gorg and things happen, they have to, the science of it all dictates that something is going to happen.
It is highly improbable that someone has actually linked to this volume of URLs.
I think it is highly probable. :(
We implemented a neat little feature not long ago that tracks 404s and 500s from our sites. I'll tell you what, the first 72 hours of reviewing those errors opened my eyes to something that I was aware of but not at this micro level. I mean, I get a report for these errors that lists everything the server generates, and I do mean everything. I asked my programmer to just give me everything so I could then subtract what I don't want to see. I've not subtracted anything yet. :)
Just the other day, I had to ping my server admin to help me figure out a barrage of 404s. Something latched onto the site and was attempting to generate valid queries through a host of broken URIs all returning a 404. For four hours the 404s were generated at a rate of 3 per minute, a bot of some sort. So, we banned the pesky little bugger at the Firewall for now.
But, if you look closely at those logfiles, you may see some peculiar activity at different spots in the graph. Possibly something spidering the site and generating all of these queries without you really knowing it. From that spidering, the pages were developed to begin the sabotage campaign. They could have indexed over a million pages in who knows how long of a period. It could have been slow or you might have gotten hammered one day and didn't really know it. If your bandwidth is unlimited, you probably are not micro-managing these types of things.
So, they take the spidered results and cloak them to the SE's from PR3+ sites that are somewhat established. You'll never find those pages because they are truly cloaked. Oh, some of you may be able to find them, but for the most part, they are invisible. They go up for a week or two, get indexed, and then come down. Google and the others now have a fresh set of URIs to explore on your site. If your site is returning a 200 for those queries and content is being generated, guess what? I don't think I need to explain any further. There is a bit more that can happen too. This is just part of it "from my tin hat perspective". Ting, ting, ting...
Anytime you are dealing with a dynamic site, the foundation needs to be locked down from the get go. If there are any potential loopholes in the technical implementation, they may be found and possibly exploited. And, it may not be intentional believe it or not. You could be sabotaging yourself. :(
Have you been able to return a 404 error for those bad query strings?
This has affected any GET form on the site, so some query strings are 'bad' (in the sense of undesired input) while others are OK in the general scheme of things (although not especially great content for a search engine). As mentioned previously, the performance of this particular site is irrelevant, and so I've let it run to see what happens.
Perhaps surprisingly, the site going from a few hundred pages to a few hundred thousand additional, low quality pages in a matter of months has had no noticeable effect whatsoever. If I didn't watch logfiles and Googlebot activity I may well never have noticed. Google's done a pretty good job of 'filtering' the new URLs and hiding them from most searches as it is wont to do with various types of content recently.
I robots excluded the relevant URLs a few days ago, when the numbers started to get ridiculous, even with crawl rate set to 'slower'.
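For anyone wanting to double-check that this kind of exclusion covers the affected URLs, Python's standard `urllib.robotparser` can evaluate a set of rules locally. This is a minimal sketch: the rules and the `is_blocked` helper are hypothetical, the paths are the ones used in the examples in this thread, and the parser's matching may differ in details from Googlebot's own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules covering the site search and the tool pages.
RULES = """\
User-agent: *
Disallow: /search/
Disallow: /tools/example
"""

def is_blocked(rules_text, url, agent="Googlebot"):
    """Return True if the given robots.txt text disallows `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return not parser.can_fetch(agent, url)

if __name__ == "__main__":
    for url in (
        "http://www.example.com/search/?query=oblique",
        "http://www.example.com/tools/example?var=validate",
        "http://www.example.com/",
    ):
        print(url, "blocked" if is_blocked(RULES, url) else "allowed")
```

Note that Disallow rules are prefix matches, so `/tools/example` also covers `/tools/example2`.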
I think it is highly probable
You're probably right that I've been too quick to discount 'nefarious' activity. That's because the site doesn't really have any competitors, nor does it have much performance to speak of. So there's nothing to sabotage, but that doesn't rule out spammer activity, of course. Human nature leads me to seek reasonable motivation which is probably a mistake.
The second site with the same problem is very curious though. Someone else actually asked to look at a problem they could see with their indexed pages: yep, the same problem. I'm still trying to figure out what the coincidence part is!
I guess I'm back to mining logfiles. Bah.
One other curiosity is that most of the URLs affected are being populated with single words, whereas one example (which expects a URL input) is also being populated with a whole load of valid URLs. I guess that implies human or spammer input somewhere along the line.
However I think it is important to return a valid 404 header for non-existent pages. The same goes for disallowed queries (404 or better 403). Maybe it's a good idea to put a Web Application Firewall (WAF) in front of your webserver and define a good whitelist of Urls and parameters. If you say that the one parameter normally only holds integers and gets queries with strings instead from robots, you should simply configure your application or WAF to check for integers and return 404s or 403s if this criterion is not met.
However it's hard to say something about a website that you can't have a look at. Reading of hundreds and thousands of pages, I am not sure if I want to see this website - sounds a little spammy to me... :-)
it is important to return a valid 404 header for not existing pages
But it isn't an appropriate response to pages that return content based on GET parameters. The case of unexpected/invalid input is the exception rather than the rule.
The same goes for not allowed queries (404 or better 403)
But those aren't appropriate responses at all for the example I've given. One suggests that the server could not locate content, the other that the server does not wish to serve the document. The only reasonable possibility is 400 Bad Request, but even that is a bit iffy if you ask me.
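For what it's worth, the 400 option for a tool that expects numeric input could be sketched like this in Python. It is just one possible approach, not a recommendation; the parameter name `var` comes from the example URL earlier in the thread and the function name `status_for_numeric_tool` is made up.

```python
from urllib.parse import parse_qs

def status_for_numeric_tool(query_string, param="var"):
    """Return 200 if `param` is present exactly once as ASCII digits, otherwise 400."""
    values = parse_qs(query_string, keep_blank_values=True).get(param, [])
    if len(values) == 1 and values[0].isascii() and values[0].isdigit():
        return 200
    return 400  # Bad Request: missing, repeated, blank or non-numeric input

if __name__ == "__main__":
    for qs in ("var=42", "var=validate", "var=", ""):
        print(repr(qs), status_for_numeric_tool(qs))
```

This only covers the numeric tool; it does nothing for the site search, where a single word is perfectly valid input.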
a good whitelist of Urls and parameters
But this isn't what I'm talking about. They're valid URLs, valid parameters and valid content. They're just intended for human input, not spiders. Aside from artificial factors (either a human deliberately links to them for 'sabotage' reasons, or a spider chooses to spider them via an internal mechanism) spiders never request such content.
In any case, fine, I can exclude content via robots exclusion or even go for overkill and use a firewall. I could deliver misleading response codes in order to avoid search engine performance problems. But this entirely misses the point of the thread. I already know how to fix the problem: I want to know why it is occurring.
Here's some news about Google crawling via HTML forms:
[edited by: tedster at 1:59 am (utc) on April 12, 2008]