Forum Moderators: Robert Charlton & goodroi


pagewanted=all parameter added to URLs in Google's index


enigma1

10:18 am on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It seems Googlebot is indexing pages from my domain with the pagewanted=all parameter appended. The title/link format looks like:

index in Example
www.example.com/?pagewanted=all

and brings up the domain's home page. It shows up elsewhere too, including on this forum. Does anyone have any information about it?

deadsea

2:04 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Based on this thread, it may be a WordPress-related parameter:
[google.com...]

enigma1

2:29 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep, but I don't run WordPress, and it seems to be present on various sites whose web application is anything other than WordPress.

I did find some info about its use, though. Some sites lock out visitors after a few requests to conserve resources and then require the visitor to log in. Having this particular parameter in place, along with a Google referrer on the request, seems to bypass that restriction. That's one way it's described: a cheap alternative to the dual rDNS/IP lookup for identifying Googlebot.

I've also noticed that it tracks a site's general ranking: popular sites have more pages with that parameter in Google's index. Although, from the little I checked, the pages listed with this parameter aren't necessarily the most visited ones.

Still, I don't know how Googlebot concluded that I must have this parameter somewhere. There is no way it can be generated from within my domain by any means.

The first entry, from Googlebot, appears on 2/12/2011. A few days later, automated requests for the page with the parameter appended start coming in with fake referrers. After that, there are just sparse visits by Googlebot to that particular URL.

deadsea

3:14 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just grepped my logs for this week and don't see pagewanted in them even once. All I can confirm is that Googlebot isn't crawling all sites like that.

tedster

4:53 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is no way it can be generated from within my domain by any means.

Some other site is probably linking to you with that parameter included. You could 301 redirect to the URL without the parameter, and a canonical link tag in the head might also help the situation.
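
One way to sketch that 301 is an Apache .htaccess fragment like the following (an illustrative sketch only, assuming mod_rewrite is enabled; it strips only the pagewanted parameter's query string, so adapt the condition to your own setup):

```apache
# Sketch: 301-redirect any request carrying pagewanted= in the
# query string to the same path with the query string removed.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)pagewanted= [NC]
# The trailing "?" on the substitution drops the query string
# (on Apache 2.4+ the QSD flag does the same thing).
RewriteRule ^ %{REQUEST_URI}? [R=301,L]
```

The canonical hint would be a line such as <link rel="canonical" href="http://www.example.com/"> in the head of the page, pointing at the parameter-free URL.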

aakk9999

5:44 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have also seen this on one of the sites I oversee. It is not a WordPress site, and no such parameter exists anywhere on the site. These URLs started to be reported sometime at the end of March.

In these URLs, apart from the added ?pagewanted=all, there are also other Google parameters appended:

sa=
ei=
ved=

I have currently blocked URLs with pagewanted= via robots.txt, to see how many there are and to get a better picture before I decide whether to redirect them, return a 404, or just leave them blocked via robots.txt.
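
For reference, that kind of block might look like this in robots.txt (a sketch; the wildcard Disallow pattern is a Google extension rather than part of the original robots.txt standard, so other crawlers may ignore it):

```
User-agent: Googlebot
Disallow: /*pagewanted=
```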

tedster

5:52 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One other available tool is to block the parameter in Google Webmaster Tools.

g1smd

8:09 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I got so mad at the number of incoming links with appended junk on the end that, for several years now, any such request has received a 301 redirect to the same URL path without any of the appended parameters. It seems to have worked very well so far. On a few sites, a few selected parameters are allowed for very specific purposes, and everything else is redirected.
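
That whitelist-and-redirect idea can be sketched as follows (Python purely for illustration; the ALLOWED names are made up, and a real site would do this in server config or the application's front controller):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters allowed through; everything else triggers a 301 to the
# cleaned URL. "page" and "sort" are hypothetical examples of
# parameters a site might deliberately support.
ALLOWED = {"page", "sort"}

def clean_url(url):
    """Return (needs_redirect, cleaned_url) for an incoming request URL."""
    parts = urlsplit(url)
    # Keep only whitelisted query parameters, preserving their order.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in ALLOWED]
    cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path,
                          urlencode(kept), ""))
    return cleaned != url, cleaned

# Junk parameters like the ones in this thread get stripped:
print(clean_url("http://www.example.com/?pagewanted=all&page=2"))
# → (True, 'http://www.example.com/?page=2')
```

A True first element means the server should answer with a 301 to the cleaned URL; False means the request is already canonical.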

enigma1

9:11 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Some other site is probably linking to you with that parameter included

There isn't anything in Google's index that shows that.

You could 301 redirect to the URL without the parameter, and a canonical link tag in the head might also help the situation.

No, I won't. I don't think it has anything to do with URL poisoning, and frankly I don't care how external sites link to my domain, or whether they link at all. The application I have in place is pretty good at filtering incoming requests, and links are generated in a controlled way that can't be infiltrated by external factors.

At this point I believe it has to do with Google, and that they append it themselves. But if you have something concrete that shows otherwise, by all means, I'm listening.

I also don't use rel=canonical tags, since that only hides a problem instead of fixing it.

I got so mad at the number of incoming links with appended junk on the end

We covered that part pretty thoroughly in our previous discussion. If the application doesn't validate parameters and accepts whatever comes in, there is a problem with the application. But I don't see why that would be the case here.

The basic principle is that if a link is not generated from within the site, then it is not indexed by SEs. Otherwise there would be chaos.

tedster

9:20 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I hear you, but no search engine or third-party application can ever see what URLs your system intends to generate. All there is to go by is how your server responds when a particular URL is requested.

g1smd

9:53 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The basic principle is that if a link is not generated from within the site, then it is not indexed by SEs.

I would like to believe that was 100% true, but I just can't. Certainly, links within a site carry more weight, but they are not the only factor.

aakk9999

3:09 am on Apr 27, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




The basic principle is that if a link is not generated from within the site, then it is not indexed by SEs.

I would like to believe that was 100% true, but I just can't. Certainly, links within a site carry more weight, but they are not the only factor.


Shouldn't this be fairly easy to test? Just link to a page on your site from an external site with extra parameters added to the URL. If your server ignores the extra parameters, parses the rest, and responds with 200 OK, you may get a duplicate-content issue despite the link not being generated from within the site.

Almost certainly you will see this second URL, with the extra parameters, reported under duplicate titles/descriptions in WMT. Whether that means both URLs are in fact indexed, I do not know. Perhaps if you throw enough external links at it, it might be.

tedster

4:49 am on Apr 27, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Returning to the opening post - I'm now wondering whether, rather than an attempted Google exploit, this may instead be a direct probe of website security. It could either be part of a plan looking for an SQL vulnerability or some kind of cross-site scripting. Not every dark trick is a Google algo trick.

enigma1

10:01 am on Apr 27, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've tested URL poisoning for years. As far as I can tell, SEs won't fall for it unless the site you try to inject the link into has problems with its web engine and propagates the parameters internally, regenerating duplicate links with the extra parameters.

It's quite logical. If an SE started indexing whatever page returns 200, it wouldn't know what to list in its index amid the havoc out there. Doing redirects on every parameter someone injects for SEs to pick up isn't going to work either - there are endless combinations from the separators alone (/ ? & = are all valid).

And there are no reports in WMT about any problem.


In these URLs, apart from adding ?pagewanted=all, there are also other google parameters appended:

sa=
ei=
ved=

I don't think they are related. These are parameters used by Google's own search pages, not by the sites indexed. sa, for example, comes up as sa=N when you click the Next link.

It could either be part of a plan looking for an SQL vulnerability

That's possible: if they know of a loophole in Google's algorithm, they can infiltrate the bot and then use the indexed links to bypass security measures. I wouldn't be surprised if there is some "popular" code out there - even posted on purpose - that someone would pick up and integrate into his own code. A parameter like pagewanted seems quite innocuous.

Another possibility, though, is that Google has enough feedback, and the necessary code in place for this particular parameter, to fetch the full content of a page without counting it as a duplicate.

Another case that comes to mind is Google deploying it for verification and cross-referencing purposes. They may have specific crawling code based on some popular open-source applications.

It may also be triggered by certain keywords or segments found on a site (e.g. keywords related to blogs or forums, like "articles" and "topics").

What happens if a bot appends an extra parameter to a URL and the server returns a different response or different content? I remember other SEs using methods along those lines: passing non-existent parameters and then checking whether the response changes.

aakk9999

11:32 am on Apr 27, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




In these URLs, apart from the added ?pagewanted=all, there are also other Google parameters appended:

sa=
ei=
ved=

I don't think they are related. These are parameters used by Google's own search pages, not by the sites indexed. sa, for example, comes up as sa=N when you click the Next link.

Exactly - these are Google search parameters. So what are they doing appended to the site's URL after the pagewanted=all parameter? E.g. the URL that appeared in WMT was something like:

www.example.com/?pagewanted=all&sa=something&ei=something2&ved=something3

This is why I have been thinking that adding all these parameters to the URL was done by Google.

enigma1

11:53 am on Apr 27, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



At this point I think the pagewanted parameter is added by Google and is not some injection. There is one other element from my original post: the Google index lists the page with the title I posted,

"index in Example"

where Example is the name of the site. There is no such title in my meta tags, and there is no such phrase in the page's content.

Also, the short description listed in the index is taken from the main content of the page, you see? Whereas for the normal version of the page, the meta description is listed as the short description.

This is why I have been thinking that adding all these parameters to the URL was done by Google.

If you check the indexed page with these parameters, what do you see as the title and description? Are they the same as for the page without the parameter, or are they different?

Sgt_Kickaxe

9:45 pm on Apr 27, 2011 (gmt 0)



You can add parameters to any URL unless you block them via .htaccess. "webmasterworld.com/?is-it-the-best-webmaster-site-around=yes" will even resolve.

Since your site doesn't link to that, and Google can recognize the duplicate, it's not really a huge deal.

enigma1

8:00 am on Apr 28, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it's not a big deal. I was just trying to figure out whether there is some meaning behind the appended parameter, or it's just a random incident.

deadsea

7:38 am on May 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just found out where this parameter is actually used: www.nytimes.com uses it on their articles for the "single page" version. Instead of the article being spread out across four pages, this parameter fetches the whole article on a single page.

From what I can gather, the Times uses a custom CMS, but maybe some standard content management system supports this parameter as well. For some reason Googlebot must think your site is running such a system, and it has logic built in to try this parameter when it thinks it has found a site where it can get articles all on one page.