Forum Moderators: Robert Charlton & goodroi
"GET /news/../some-none-news-page-here.html? HTTP/1.0"
My question is: what is the "?" at the end of the URL? We do not have any pages other than core HTML...
Also, that page is not in the folder "news" as the log shows it. Could it mean that the bot came from the folder "news" to that page?
Any clarity would be greatly appreciated.
Most sites that do not use query_string URLs neglect to strip them, which makes it possible to duplicate content across URLs, because a query_string is simply information passed to the receiving page.
In a dynamic setting, the content of the page (URL) associated with a query_string URL changes according to the query_string variables.
When a query_string is passed to a static page, the variable information is not processed, so the 'root' .html page is presented to the requesting agent 'as is'.
Try this:
Open the index.html page on your site, then add ?=anything-you-want to the end of the URL. Your page will (in most cases) open with the information from index.html regardless of what you type in the query_string.
By serving a 200 OK on this request, the URL is/will be considered a 'valid resource' and could be considered 'duplicate content'.
I believe the safest route is to strip them prior to opening a page.
There is a recent thread with some examples of how to do this in the Apache Forum.
How to Strip the ? from a URL [webmasterworld.com]
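As a rough illustration of what "stripping" means (this is only a sketch in Python, not the Apache rules from the linked thread), the idea is to notice any "?" in the requested URL and compute the clean URL to redirect to. The function names strip_query and needs_redirect are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string (or a bare trailing '?') removed."""
    parts = urlsplit(url)
    # Rebuild the URL with an empty query component; any fragment is kept.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def needs_redirect(url: str) -> bool:
    """True when the URL carries a '?' before any '#', even with nothing after it."""
    before_fragment = url.split("#", 1)[0]
    return "?" in before_fragment


if __name__ == "__main__":
    for url in ("http://your-site.com/page.html?var=1",
                "http://your-site.com/page.html?",
                "http://your-site.com/page.html"):
        print(url, "->", strip_query(url), "| redirect:", needs_redirect(url))
```

Note that a bare "?" has an empty query string, so the check looks for the "?" character itself rather than for a non-empty query.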
Justin
I do get a 200 OK (but no duplicates yet: quite a few pages showed up in the log like this, and yet there is no change in our site: command results), but here is the weird part:
We do not have any dynamic pages - we have the entire website in html - we work a little harder, but always felt safe, until now.
From the log:
"GET /news/../some-none-news-page-here.html? HTTP/1.0"
this is exactly the way it shows (except different page name).
Here is how our website is laid out: we have a main folder called "news" and within that folder there are pages, but the page "some-none-news-page-here.html" is actually NOT in the folder "news" or any other subfolder. It is in the main directory.
The "/../" part is also quite worrying. It looks to me like a FrontPage-style relative link, "../some-none-news-page-here.html", which goes up one folder to that page. That would be correct for a link followed from inside "news", but a bot should automatically resolve it to the whole URL before requesting it.
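To make the "/../" part concrete, here is a small Python sketch of how such a request path resolves. It assumes the server collapses ".." segments in the usual way; resolve_request_path is a hypothetical helper name, and the input is the request target copied from the log line above.

```python
import posixpath


def resolve_request_path(raw_target: str):
    """Split off the query string and collapse '..' segments in the path.

    Returns (resolved_path, had_question_mark).
    """
    path, question_mark, _query = raw_target.partition("?")
    # normpath collapses "news/.." so the path points at the parent folder.
    return posixpath.normpath(path), bool(question_mark)


if __name__ == "__main__":
    print(resolve_request_path("/news/../some-none-news-page-here.html?"))
```

Under that assumption, the logged request ends up at the page in the main directory, which matches where the file really lives.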
And we have no links from folder "news" to that page...
Just very weird. If any bot experts could let me know what is going on...
What you can control is how your server responds to such requests, and I strongly suggest you follow Justin's advice and strip those "?" -- or else (second best) disallow such URLs in your robots.txt
Trouble from such things is not usually quick, it just builds up gradually. So definitely take some preventative measures now that you see the URLs showing up in your logs. They should not be getting a 200 OK response.
You should install proper rewrite rules to strip off the query string or make certain that those kinds of urls serve up a 404.
Justin doesn't just mean for you to go on a search and destroy mission for bad links in your site's HTML.
What you are seeing is called page sniping or a query string attack and it is aimed at causing duplicate content issues for your site.
[edited by: theBear at 7:17 pm (utc) on Dec. 10, 2006]
your-site.com/page.html?var=1
your-site.com/page.html?var=2
your-site.com/page.html?var=3
and effectively 'duplicate' your content.
A search engine can request:
your-site.com/page.html?
your-site.com/page.html?var=Im-Yahoo-requesting-urls-that-dont-exist
and effectively 'duplicate' your content.
The issue is not the technology you use, but rather the response your server generates when a URL, which should not be returned as valid, is requested and returns a 200 OK response.
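One way to see the effect in your own logs is to group requested URLs by their path with the query string ignored. This is only a sketch in Python; group_duplicates is a made-up name, and the sample URLs are the ones listed above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_duplicates(urls):
    """Group URLs that differ only by query string; keep groups with 2+ members."""
    groups = defaultdict(list)
    for url in urls:
        # Scheme-less URLs such as "your-site.com/page.html" need a "//" prefix
        # so that urlsplit treats the first segment as the host name.
        parts = urlsplit(url if "://" in url else "//" + url)
        groups[parts.netloc.lower() + parts.path].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    requested = [
        "your-site.com/page.html?var=1",
        "your-site.com/page.html?var=2",
        "your-site.com/page.html?var=3",
        "your-site.com/page.html?",
        "your-site.com/page.html?var=Im-Yahoo-requesting-urls-that-dont-exist",
    ]
    for key, members in group_duplicates(requested).items():
        print(key, "is reachable under", len(members), "different URLs")
```

Every URL in a group that returns 200 OK is a separate address for the same content.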
Justin
The pages with simply html ending:
http://www.example.com/some-page.html
have a different pagerank than:
http://www.example.com/some-page.html?
Most inner pages share this trait. What's interesting is that a cache:http://www.thissite.com/some-page.html? simply shows the exact cache for:
http://www.example.com/some-page.html
A header check for both urls shows 200 OK. Is there a dupe issue here in your estimation?
Regards,
Todd
[edited by: tedster at 9:51 pm (utc) on Jan. 11, 2007]
[edit reason] use example.com [/edit]
--17:15:36-- [google.com...]
=> `index.html.1'
Resolving www.google.com... 216.239.37.99, 216.239.37.104
Connecting to www.google.com|216.239.37.99|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 3,849 --.--K/s
17:15:36 (901.69 KB/s) - `index.html.1' saved [3849]
linux:~ # wget [google.com...]
--17:15:43-- [google.com...]
=> `index.html.2'
Resolving www.google.com... 216.239.37.99, 216.239.37.104
Connecting to www.google.com|216.239.37.99|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 3,849 --.--K/s
17:15:43 (909.47 KB/s) - `index.html.2' saved [3849]
linux:~ # diff index.html.1 index.html.2
linux:~ #
I used a simple command-line tool, wget, to ask Google's server for the contents of the URL [google.com...]. Google's server returned the content and wget stored it as index.html.1 in my current working directory. I then gave wget the URL [google.com...]. Google's server again returned the content, and wget stored it as index.html.2 in my current working directory.
I then used another command line tool called diff that will list all differences between the two files (index.html.1 and index.html.2) that were created from the use of wget.
Diff found no differences so the two files are identical.
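If diff is not available, the same check can be sketched in Python by comparing hashes of the two saved responses. The helper names below are made up for this example, and the file names are the ones wget created above.

```python
import hashlib
from pathlib import Path


def body_digest(body: bytes) -> str:
    """Return a SHA-256 hex digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def same_content(body_a: bytes, body_b: bytes) -> bool:
    """True when two response bodies are byte-for-byte identical."""
    return body_digest(body_a) == body_digest(body_b)


def same_files(path_a: str, path_b: str) -> bool:
    """Compare two saved downloads, e.g. index.html.1 and index.html.2."""
    return same_content(Path(path_a).read_bytes(), Path(path_b).read_bytes())


if __name__ == "__main__":
    first, second = Path("index.html.1"), Path("index.html.2")
    if first.exists() and second.exists():
        print("identical" if same_files(str(first), str(second)) else "different")
```

Identical digests mean the server returned the same bytes for both URLs, which is the same conclusion diff gives.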
[edited by: theBear at 10:36 pm (utc) on Jan. 11, 2007]
For example: www.widgets.com/widget.html?page=blahblahblah
now 301 redirects to www.widgets.com/widget.html
I was considering sending the query strings to the 404 error document. The problem is that we have deprecated pages that once used a query string and still appear in the main index, and we don't want them in the index anymore. So we set up a 301 redirect from any query string attached to an HTML file to the plain static HTML file.
Is this acceptable?
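For anyone comparing the two options discussed in this thread (301 to the static page versus 404), the decision can be sketched like this in Python. This is an illustration of the logic only, not a server configuration; respond and its policy values are hypothetical names.

```python
def respond(request_target: str, policy: str = "redirect"):
    """Decide how to answer a raw request target such as '/widget.html?page=x'.

    Returns (status_code, location), where location is set only for a 301.
    """
    path, question_mark, _query = request_target.partition("?")
    # Requests without a "?" and requests for non-.html paths are left alone.
    if not question_mark or not path.endswith(".html"):
        return 200, None
    if policy == "redirect":
        return 301, path   # send the client to the plain static page
    if policy == "not_found":
        return 404, None   # refuse the query-string variant outright
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    print(respond("/widget.html?page=blahblahblah"))
    print(respond("/widget.html?", policy="not_found"))
    print(respond("/widget.html"))
```

Either way, the query-string variant stops returning 200 OK, which is the point made earlier in the thread.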
Not likely, since they treat it exactly as they should: as a different URL from the one without the "?".
If they thought it was a means of bypassing a caching system, they would at least not index it separately from the one without the "?".
From a systems viewpoint it would make zero sense to assume something of that nature.