Googlebot question - a "?" on the end of the URL request - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi



Anyone seen this?

atlrus

7:50 pm on Dec 9, 2006 (gmt 0)

I was looking at our logs and I saw something I don't understand:

"GET /news/../some-none-news-page-here.html? HTTP/1.0"

My question is: what is the "?" at the end of the URL? We do not have any pages other than core HTML...

And - that page is not in the folder "news" as the log shows it - could it mean that the bot came from the folder "news" to that page?

Any clarity would be greatly appreciated.

tedster

11:50 pm on Dec 9, 2006 (gmt 0)

What response did your server give googlebot for that request? If it was not a 200 OK but some kind of error code instead, then you should have no concerns.

jd01

12:06 am on Dec 10, 2006 (gmt 0)

The ? says 'query_string' to your server, and with the indexing of dynamic pages, could result in duplicate content.

Most sites not using query_string URLs neglect to strip them, making it possible to duplicate content on URLs, because a query_string is information passed to the receiving page.

In a dynamic setting the content of the page (URL) associated with a query_string URL will change respective to the query_string variables.

When a query_string is passed to a static page, the variable information is not processed, so the 'root' .html page is presented to the requesting agent 'as is'.

Try this:
Open the index.html page on your site, then add ?=anything-you-want to the end of the URL in the address bar. Your page will (in most cases) open with the information from index.html regardless of what you type in the query_string.

By serving a 200 OK on this request, the URL is/will be considered a 'valid resource' and could be considered 'duplicate content'.
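To make the point concrete, here is a minimal Python sketch of why a static page answers the same way for all of these URLs. It assumes a file server that looks up only the path part of the URL and ignores the query string; the helper name file_for and the example.com URLs are made up for illustration and are not taken from any particular server.

```python
from urllib.parse import urlsplit


def file_for(url: str) -> str:
    """Return the path a simple static file server would look up.

    Assumption: the server ignores everything after the "?".
    """
    return urlsplit(url).path


# Three distinct URLs as far as a crawler is concerned...
urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/index.html?",
    "http://www.example.com/index.html?=anything-you-want",
]

# ...but all of them map to the same file on disk.
for url in urls:
    print(url, "->", file_for(url))
```

Each URL string is different, yet every one resolves to /index.html, which is how one file can end up indexed under several addresses.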

I believe the safest route is to strip them prior to opening a page.

There is a recent thread with some examples of how to do this in the Apache Forum.

How to Strip the ? from a URL [webmasterworld.com]

Justin

atlrus

6:24 am on Dec 10, 2006 (gmt 0)

Thanks guys, I wish it was that simple.

I do get 200 OK (but no duplicates yet: quite a few pages have shown up in the log like this, and still there is no change in the results of our site: command), but here is the weird part:
We do not have any dynamic pages. The entire website is in HTML. We work a little harder, but we always felt safe, until now.

From the log:

"GET /news/../some-none-news-page-here.html? HTTP/1.0"

this is exactly the way it shows (except different page name).
And here is how our website is set up: we have a main folder called "news" and within that folder there are pages, but the page "some-none-news-page-here.html" is actually NOT in the folder "news" or any other subfolder. It's in the main dir.

And the part "/../" is also quite worrying. It looks to me like a FrontPage-style relative link, "../some-none-news-page-here.html", which would go one folder up to that page. That would be correct inside FrontPage, but a bot should automatically resolve it to the whole URL.

And we have no links from folder "news" to that page...

Just very weird. If any bot experts could let me know what is going on...
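For reference, the logged request does point at the page in the main directory once the "/../" segment is collapsed. The Python sketch below shows that resolution under the assumption that the server normalizes dot segments the way posixpath.normpath does; the helper name resolve is hypothetical, and this is an illustration rather than the server's actual code.

```python
import posixpath
from urllib.parse import urlsplit


def resolve(target: str) -> str:
    """Drop the query string, then collapse "." and ".." path segments."""
    path = urlsplit(target).path
    return posixpath.normpath(path)


# The request target exactly as it appears in the log line above.
print(resolve("/news/../some-none-news-page-here.html?"))
# prints: /some-none-news-page-here.html
```

So "/news/.." cancels out, and the request lands on the page in the main directory even though nothing by that name lives under "news".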

tedster

7:17 am on Dec 10, 2006 (gmt 0)

You cannot control what URL a bot or a person asks for. You can and should search your pages to be sure that you don't have that kind of typo in any links on your own pages. But beyond that, it's out of your direct control.

What you can control is how your server responds to such requests, and I strongly suggest you follow Justin's advice and strip those "?" -- or else (second best) disallow such URLs in your robots.txt

Trouble from such things is not usually quick, it just builds up gradually. So definitely take some preventative measures now that you see the URLs showing up in your logs. They should not be getting a 200 OK response.

atlrus

6:19 pm on Dec 10, 2006 (gmt 0)

I would strip the "?" URLs if they existed...I have no pages other than the format "page-name.html" - that's it.

I guess I am just going to wait it out and see what happens.

atlrus

6:50 pm on Dec 10, 2006 (gmt 0)

One of those pages just went supplemental...:(

theBear

6:55 pm on Dec 10, 2006 (gmt 0)

atlrus,

You should install proper rewrite rules to strip off the query string, or make certain that those kinds of URLs serve up a 404.

Justin doesn't just mean for you to go on a search and destroy mission for bad links in your site's HTML.

What you are seeing is called page sniping or a query string attack and it is aimed at causing duplicate content issues for your site.

[edited by: theBear at 7:17 pm (utc) on Dec. 10, 2006]

jd01

6:56 pm on Dec 10, 2006 (gmt 0)

It is your site, and you can obviously do what you like, but what you may be missing is I can link to your site with:

your-site.com/page.html?var=1
your-site.com/page.html?var=2
your-site.com/page.html?var=3

and effectively 'duplicate' your content.

A search engine can request:

your-site.com/page.html?
your-site.com/page.html?var=Im-Yahoo-requesting-urls-that-dont-exist

and effectively 'duplicate' your content.

The issue is not the technology you use, but rather the response your server generates when a URL, which should not be returned as valid, is requested and returns a 200 OK response.

Justin
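As a rough illustration of the point that the server's response is what matters, the Python sketch below decides what a static site might return for a given request target. The function name respond_to and the ".html means static" check are assumptions made for this example; on a real site the same policy would normally live in the server configuration, such as the Apache rewrite rules mentioned in this thread.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def respond_to(target: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect_location) for a request target.

    Hypothetical policy: a static .html page never takes a query string,
    so any "?" (even a bare trailing one) gets a 301 to the clean path.
    """
    path = urlsplit(target).path
    # urlsplit() reports an empty query for both "/page.html" and
    # "/page.html?", so check the raw target for the "?" itself.
    if "?" in target and path.endswith(".html"):
        return 301, path
    return 200, None


for target in ("/page.html", "/page.html?", "/page.html?var=1"):
    print(target, "->", respond_to(target))
```

With a policy like this, only the clean URL ever answers 200 OK; every query-string variant points back to it.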

CainIV

9:31 pm on Jan 11, 2007 (gmt 0)

Hi guys. I have noticed the same issue recently on one of my websites. Maybe you can help a bit.

The pages with a plain .html ending:

http://www.example.com/some-page.html

have a different pagerank than:

http://www.example.com/some-page.html?

Most inner pages share this trait. What's interesting is that a cache:http://www.thissite.com/some-page.html? simply shows the exact cache for:

http://www.example.com/some-page.html

A header check for both urls shows 200 OK. Is there a dupe issue here in your estimation?

Regards,
Todd

[edited by: tedster at 9:51 pm (utc) on Jan. 11, 2007]
[edit reason] use example.com [/edit]

theBear

9:49 pm on Jan 11, 2007 (gmt 0)

Todd,

In all probability the answer is yes, yup, of course, sure thing, etc ....

CainIV

9:58 pm on Jan 11, 2007 (gmt 0)

Is there a definitive way of testing whether this is the case?

theBear

10:14 pm on Jan 11, 2007 (gmt 0)

Is this close enough for you?

--17:15:36-- [google.com...]
=> `index.html.1'
Resolving www.google.com... 216.239.37.99, 216.239.37.104
Connecting to www.google.com|216.239.37.99|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

[ <=> ] 3,849 --.--K/s

17:15:36 (901.69 KB/s) - `index.html.1' saved [3849]

linux:~ # wget [google.com...]
--17:15:43-- [google.com...]
=> `index.html.2'
Resolving www.google.com... 216.239.37.99, 216.239.37.104
Connecting to www.google.com|216.239.37.99|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

[ <=> ] 3,849 --.--K/s

17:15:43 (909.47 KB/s) - `index.html.2' saved [3849]

linux:~ # diff index.html.1 index.html.2
linux:~ #

CainIV

10:23 pm on Jan 11, 2007 (gmt 0)

Hi Bear, please excuse my lack of fluency in server code, but can you please put the above in layman's terms for me?

Much appreciated!

theBear

10:33 pm on Jan 11, 2007 (gmt 0)

Todd,

I used a simple command line tool, wget, to ask Google's server for the contents of the URL [google.com...]. Google provided the content, and the tool stored it as index.html.1 in my current working directory. I then gave the tool the URL [google.com...]. Google's server provided the content again, and the tool stored it as index.html.2 in my current working directory.

I then used another command line tool called diff that will list all differences between the two files (index.html.1 and index.html.2) that were created from the use of wget.

Diff found no differences so the two files are identical.

[edited by: theBear at 10:36 pm (utc) on Jan. 11, 2007]
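The same fetch-and-compare check can be sketched in Python for anyone without wget and diff at hand. This is only an approximation of the test described above: the function names fetch_body and same_content are made up for this example, and www.example.com with /some-page.html stands in for your own host and page.

```python
import http.client


def fetch_body(host: str, target: str) -> bytes:
    """GET the target from the host over plain HTTP and return the body.

    http.client sends the request target as given, so a trailing "?"
    reaches the server just as it would from wget.
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", target)
        return conn.getresponse().read()
    finally:
        conn.close()


def same_content(first: bytes, second: bytes) -> bool:
    """True when two response bodies are byte-for-byte identical."""
    return first == second


if __name__ == "__main__":
    # Placeholder host and page; substitute your own.
    plain = fetch_body("www.example.com", "/some-page.html")
    with_q = fetch_body("www.example.com", "/some-page.html?")
    print("identical" if same_content(plain, with_q) else "different")
```

If this reports the bodies as identical and both requests return 200 OK, the two URLs are serving duplicate content.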

CainIV

10:47 pm on Jan 11, 2007 (gmt 0)

I see now, Bear. Thanks a lot for the information!

Regards,
Todd

Ride45

6:36 pm on Jan 16, 2007 (gmt 0)

We have put a 301 redirect from any instance of a query string page to the actual page.

For example: www.widgets.com/widget.html?page=blahblahblah
now 301 redirects to www.widgets.com/widget.html

I was considering sending the query strings to the 404 error document. The problem is that we have deprecated pages that once used a query string and still appear in the main index, and we don't want them in the index anymore. So any query string attached to an HTML file now gets a 301 redirect to the plain static HTML file.

Is this acceptable?

ashear

8:10 pm on Jan 16, 2007 (gmt 0)

To be safe you could use a conditional redirect based upon user agent. Within your .htaccess file, write a 301 redirect based on a list of active user agents. With this you could search for the ? and, after stripping it, redirect to the original page. It's perfectly safe and legal. If you happen to buy traffic to your pages, a question mark could be necessary for tracking, so doing this for normal user agents would be silly.
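The conditional logic described here can be sketched in Python as follows. This is an illustration only, not an .htaccess rule: the crawler token list, the function name redirect_target, and the sample user-agent strings are all assumptions, and a real list of user agents would need to be maintained by the site owner.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical list of crawler user-agent substrings (lowercase).
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def redirect_target(target: str, user_agent: str) -> Optional[str]:
    """Return the clean path to 301 to, or None to serve the request as is.

    Only requests from listed crawlers are redirected, so tracking
    parameters on bought traffic from ordinary browsers stay intact.
    """
    if "?" not in target:
        return None
    ua = user_agent.lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return urlsplit(target).path
    return None


bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
print(redirect_target("/widget.html?page=blahblahblah", bot))
print(redirect_target("/widget.html?src=ad", browser))
```

If no tracking parameters are in use, an unconditional redirect like the one Ride45 describes is simpler and needs no user-agent list.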

Patrick Taylor

9:27 pm on Jan 16, 2007 (gmt 0)

I had this recently. How to deal with the lone trailing question mark -> [webmasterworld.com...]

ashear

11:10 pm on Jan 16, 2007 (gmt 0)

Good catch, Patrick, that mod_rewrite rule is just the answer needed!

ferfer

4:34 am on Jan 17, 2007 (gmt 0)

Adding a ? at the end of a static URL is the simplest way to avoid the ISP and local cache and force the real current version. I use the trick to see my own pages sometimes, because my ISP otherwise serves a recently cached copy.

Maybe Google is trying to bypass some cache too...

theBear

12:52 am on Jan 22, 2007 (gmt 0)

"May be google is trying to bypass some cache too... "

Not likely, since they treat it exactly as they should: as a different URL from the one without the ?.

If they thought it was a means of bypassing a caching system, they would at least not index it separately from the one without the ?.

From a systems viewpoint it would make zero sense to assume something of that nature.