|The page that Google has indexed does not exist on my server space |
when you say "doesn't exist" does that mean a request for the indexed url returns a 404 Not Found or a 410 Gone status code response?
No I mean I can't find it with my FTP manager.
Sounds like a parameter issue.
Where would the parameter issue reside?
Ah. I see that the moderator's edit has caused some confusion. The inserted word "widget" will lead you to believe it is a word that I use on my site, even a keyword. It is in fact the surname of a mass murderer and is not mentioned anywhere on my site and bears no relation to my subject matter.
put www.yoursite.com/mypage.htm?words=widget into a browser or a server header checker and find out what response you are getting. There is a header checker on this site I think, but I can't remember where, so try this: [urivalet.com...]
If it returns a 200 OK then the problem is on your server side and google is seeing the different urls as duplicates (most probably picked up via a link pointing to your site).
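If you'd rather check from a terminal than a web-based header checker, a quick curl sketch does the same job (the URL here is the thread's example placeholder, not a real site -- substitute your own page and parameter string):

```shell
# Print only the HTTP status code the server returns for a given URL.
# The example URL is the hypothetical one from this thread.
url='http://www.yoursite.com/mypage.htm?words=widget'
check_status() { curl -s -o /dev/null -w '%{http_code}' "$1"; }
# check_status "$url"  -> 200 means the server serves the page anyway,
#                         so the parameterized URL is a live duplicate;
#                         404/410 means the server rejects it.
```

A static file server will almost always answer 200 here, because the query string is simply ignored and the page is served as-is.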
In a browser this returns the page duplicated from my original with the words= suffix.
so the page does exist if it is called,
try going to
where oneofyourpages is actually one of your pages. Assuming the page is displayed, the first thing i'd do is implement the canonical tag on your pages
and then look into using htaccess to block all pages with parameters
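If you do go the htaccess route, here is a minimal sketch (assuming Apache with mod_rewrite enabled; this only touches .htm/.html requests, and the pattern is an illustration rather than a drop-in rule for every site):

```apache
# Sketch only: 301 any request for a .htm/.html page that arrives with
# a query string back to the same URL minus the query string.
RewriteEngine On
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.+\.html?)$ /$1? [R=301,L]
# The trailing ? on the substitution is what drops the query string
# (on Apache 2.4+ you could use the QSD flag instead).
```

That way the canonical tag stays as a hint while the redirect actually consolidates the duplicate URLs.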
Most extensions do not use parameters, so any query string after .html or .jpg or what-have-you is simply ignored and the page is served up as-is. In theory you can change how Apache treats extra path data with the AcceptPathInfo setting (strictly that governs trailing path segments like /page.html/extra rather than query strings) -- but why the bleep should you have to?
Your search engine is willfully and wantonly attaching parameters to URLs that by their nature cannot have parameters. My own wmt parameters page begins with the line
newwindow=true
attached to an html URL. (They'll only show you one, though they claim more.) I have to assume that g### picked it up via some linking site, where "newwindow" refers to the site's internal behavior; it's obviously meaningless in isolation.
Remember that indexing and crawling are separate functions; a search engine will happily index a page it has never seen. All you can do is go into gwt, pull up the parameters page and edit to say explicitly "no effect on page content".
Well, that's not really all you can do. You can add a rel="canonical" tag to the page, which would generally resolve the issue, or you could set up a 301 redirect. Then just use Webmaster Tools to do a Fetch As Googlebot, followed by a submit.
Google is pretty good at detecting and eventually correcting this issue on its own. But these corrective steps do speed up the process.
There's actually a bunch more that goes into the decision making when you're dealing with 1,000,000,000,000+ URIs and running a business.
First, so many people thought .html was better than .php (and some may still) that eliminating .html URLs with parameters would be silly: plenty of site owners figured they would do better by parsing .htm and .html pages as php and using parameters on them.
Third, when you run a major search engine with the insane number of pages and URIs they have to deal with, you hit a point of diminishing returns worrying about coding for minutiae like newwindow=true. What's way more cost effective than trying to figure out all the parameters you don't need to crawl is to spider the URI and see if it returns a 200 OK header. If it does, you do what Google does: group the URIs with the same content together and 'give value to/return in the actual SERPs' whichever you determine to be 'the best/most authoritative one' you find.
When you really get into running a search engine and trying to figure out what to do with 1,000,000,000,000+ URIs/pages, there are plenty of reasons to spider and 'let slide' things that those of us who don't deal with those numbers might think are silly or easy to fix. They're really not 'that important' at the scale they have to code for, especially when you consider how time consuming finding and coding solutions for some of those things must be, and how much better that time could be spent doing something else.
(For example, and I'm not picking on you Lucy24, I would never have thought about coding for newwindow=true, and with the number of URIs they deal with they might not have a clue someone was silly enough to even use it. So is the time spent digging through an insane number of URIs to find the 'goofiness' some people erroneously link with worth it, when they'll probably find a (relative) few at most? Or is the time of some search engineer with a doctorate and a $1,000-an-hour salary better spent somewhere else? IOW: how much would they profit by "eliminating" newwindow=true from the index, and how could that possibly exceed what they would spend finding and coding for it and other silliness on the part of webmasters? I don't see how they could really be bothered with it, personally.)
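The "group the URIs with the same content together" step described above can be sketched crudely: hash each response body so that URIs serving identical content sort into the same bucket. (The commented URLs are the thread's placeholders, not a real site; a real engine obviously does something far more sophisticated than byte-identical hashing.)

```shell
# Crude sketch of content grouping: print "hash uri" for each URI,
# sorted, so identical bodies land next to each other.
group_by_content() {
  for url in "$@"; do
    printf '%s %s\n' "$(curl -s "$url" | sha256sum | cut -d' ' -f1)" "$url"
  done | sort
}
# e.g. group_by_content 'http://www.yoursite.com/mypage.htm' \
#                       'http://www.yoursite.com/mypage.htm?words=widget'
# would print the same hash for both lines on the server described
# in this thread, marking them as one duplicate group.
```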
|You can add a rel="canonical" tag to the page |
Adding "rel='canonical'" won't do a particle of good if it was a static html page to start with, meaning that the parameters just go along for the ride and now every version of the page is calling itself canonical.
|First, there are so many who thought .html was better than .php (and may still be) eliminating .html with a parameter would be silly since people thought they would do better by parsing .htm and .html pages as php and using parameters on them. |
Contrariwise, calling php html and then turning around and attaching parameters is itself so silly, why would the search engine make extra work for itself by playing along? Take the html at face value, request htm(l) pages without parameters and index what you get. If the site owner ends up with un-indexed pages, surely that's their problem and not the search engine's.
|Adding "rel='canonical'" won't do a particle of good if it was a static html page to start with, meaning that the parameters just go along for the ride and now every version of the page is calling itself canonical. |
it's first thing in the morning for me so i might have misunderstood you lucy24 - but from what you are saying you misunderstand the canonical tag.
it should be used like this:
<link rel="canonical" href="http://www.example.com/mypage.html">
and it should tell google that if the page has parameters, to treat the page as though it were without parameters.
|calling php html and then turning around and attaching parameters is itself so silly |
i would agree with this though, i would have thought a basic rule could be to ignore parameters on an htm/l extension
|request htm(l) pages without parameters |