Google SEO News and Discussion Forum

    
Non-existent pages indexed, and no 404 page is displayed
helenp
msg:4545890 - 12:39 pm on Feb 15, 2013 (gmt 0)

Phew,
I just discovered that Google keeps mixing the directories up, and the 404 page does not work.
I mean, it works normally, but on URLs like this, where the folders are mixed up, it does not show up; instead a page is served which is of course duplicate content:

http://www.mysite.com/page.htm/espanol/sales/sales/locationmaps.htm

Please help, I am desperately trying to climb back up;
I am sure the drop is due to these issues.
Thanks

 

helenp
msg:4545961 - 7:20 pm on Feb 15, 2013 (gmt 0)

Hi again,
this is the answer I got from the host, and I don't know how to do what he says. As a matter of fact, the number of ever-crawled pages keeps increasing and the site keeps losing position:

"This issue would have always existed on your application. The reason is that a URL like index.php/my/directory/is/here/index.html would be a valid way to visit index.php. Thus it is not a 404, just as your variation page.htm/espanol/sales/sales/locationmaps.htm would always exist in the same manner.
You'll see that any parameter you attach to this file or any of your other files will produce the same thing: http://www.example.com/page.htm/here/is/another/page/to/my/and/it/is/not/a/404.html

I'm guessing someone linked to a page like that at some point. If you're looking to stop this you would need to build something on your actual application to detect these bad URLs and do a 301 redirect to the correct one."
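A minimal sketch of the application-level guard the host describes, assuming plain PHP at the top of each page (the code and the example domain are illustrative, not the host's):

<?php
// Hypothetical guard for the top of each page (or a shared include):
// if the request arrived with extra path segments after the script name
// (PATH_INFO), 301-redirect to the script's own canonical URL.
if (!empty($_SERVER['PATH_INFO'])) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com' . $_SERVER['SCRIPT_NAME']);
    exit;
}
?>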

Please help

helenp
msg:4545966 - 7:49 pm on Feb 15, 2013 (gmt 0)

At the moment, in Webmaster Tools under HTML Improvements,
I have 2 pages with the same title (and of course the same content).
One of them has 3 different patterns like this:
/page1.htm/espanol/sales/sales/locationmaps.htm
/page1.htm/espanol/sales/sales/site_map.htm
/page1/maps/espanol/maps/transports.htm

And the other has one different pattern:
/page2/maps/svenska/espanol/espanol/sales/svenska/maps/alcazaba_apartment_puerto_banus.htm

I doubt somebody created those links, and I'm afraid there are more pages like that indexed.
What should I do: tell WMT to remove those pages from the index, or add a canonical URL to every page on my site?
There are more than 500 pages.....
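For reference, a canonical link element goes in the <head> of every page; the URL here is just illustrative:

<link rel="canonical" href="http://www.example.com/page1.htm">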

helenp
msg:4545973 - 8:19 pm on Feb 15, 2013 (gmt 0)

Phew, now I am thinking: didn't somebody say that HTML Improvements in Google Webmaster Tools shows old information?
Anyway, this is scary, as I have more and more ever-crawled pages every day, and the indexed count is also higher than the indexed count on the sitemap page.
How can one see all the pages indexed by Google?

lucy24
msg:4546007 - 10:55 pm on Feb 15, 2013 (gmt 0)

The reason is a URL like index.php/my/directory/is/here/index.html would be a valid way to visit index.php

This is true, sort of, for pages in .php or any other dynamic-page extension. (Look at the first extension, not the final one.) It is NOT true for static pages in .html and similar.

What are your current IgnorePathInfo settings? If your own htaccess doesn't contain this word, you're using the host's default. If they won't tell you, we can figure it out.

helenp
msg:4546010 - 11:02 pm on Feb 15, 2013 (gmt 0)

What are your current IgnorePathInfo settings? If your own htaccess doesn't contain this word, you're using the host's default. If they won't tell you, we can figure it out.

Where do I find that out? I am on a shared host.
I have this in htaccess, to tell the server to parse all html as php:

AddType application/x-httpd-php5 .htm .html
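That AddType line maps .htm and .html onto the PHP5 handler, which is what lets PHP such as this hypothetical include run inside an .htm page:

<?php include 'menu.php'; ?>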

lucy24
msg:4546029 - 1:59 am on Feb 16, 2013 (gmt 0)

AddType application/x-httpd-php5 .htm .html

This line means that the PathInfo business is irrelevant and you can forget I asked ;) (Note however that I was typing from memory, so I goofed: the directive is AcceptPathInfo, not Ignore...)

Do you actually need the line? That is, do you have pages that are "really" php but their names end in .html? If your setup is like mine, you don't need the "AddType" line if all you've got is html files with php includes.

Do not run out and remove the line! You will need more information.

Answer I would have given if you hadn't revealed the AddType part:

If you don't have an AcceptPathInfo statement in your own htaccess, you're using the host's default. They in turn are probably using the Apache default, which is, uhm, "Default". (Really.)

The treatment of requests with trailing pathname information is determined by the handler responsible for the request. The core handler for normal files defaults to rejecting PATH_INFO requests. Handlers that serve scripts, such as cgi-script and isapi-handler, generally accept PATH_INFO by default.

In English: the "pathname" is the part of the URL after the first extension (the .html in the middle, not the one at the end). If you request a valid filename in .php with more garbage after the "php", you will get the page.* If you do the same thing with a page in .html, you will get a 404.

* But relative links on the page will no longer work. I found this out while experimenting, and will now scurry over and fix it ;)
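A quick way to watch that split from inside PHP, as a hypothetical test script (request it as /test.php/extra/stuff):

<?php
// test.php: shows how the server divides a request once PATH_INFO is accepted
echo 'SCRIPT_NAME: ' . $_SERVER['SCRIPT_NAME'] . "\n";   // prints /test.php
echo 'PATH_INFO:   ' . (isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : '(none)') . "\n";   // prints /extra/stuff
?>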

helenp
msg:4546080 - 9:01 am on Feb 16, 2013 (gmt 0)

Do you actually need the line? That is, do you have pages that are "really" php but their names end in .html? If your setup is like mine, you don't need the "AddType" line if all you've got is html files with php includes.

Do not run out and remove the line! You will need more information.

Thanks,
Oh no, I won't remove it. The site is from 2001, when everything was in html; then years later I started with php and didn't want to change the names of the files. So yes, most pages end in .htm, but there are some that end in .php (as it is easier to debug with the correct extension).

The server is Apache, but with LiteSpeed.

I have changed the relative links to root-relative, though I'm not sure it's 100% done, as I used find-and-replace. Also, my slideshow does not work with root-relative links, so I had to leave those relative.

Yesterday I became very nervous; later I saw that 2 of the 3 pages reported by GWT were old pages I had renamed, after which I started to get many 404s. The third page is a third-level page, so I renamed it as well to force a 404. But I am sure there must be more pages like that.

helenp
msg:4546085 - 10:20 am on Feb 16, 2013 (gmt 0)

Phew,
if I understand you correctly, one can always test to see if it works.
I should add this to my htaccess file:

AcceptPathInfo Off

And if I understand correctly what I'm reading about it, that will stop the includes from working, so then one should also add this:

<Files "mypaths.shtml">
Options +Includes
SetOutputFilter INCLUDES
AcceptPathInfo On
</Files>

However, I have no idea what I should do with "mypaths.shtml".

helenp
msg:4546090 - 10:30 am on Feb 16, 2013 (gmt 0)

I just don't understand: why is the default set to On?
I mean, what would make anybody want page.htm/folder/ to be treated as a valid URL?

lucy24
msg:4546118 - 1:16 pm on Feb 16, 2013 (gmt 0)

It's got something to do with filters, but if I try reading more closely I get a headache :(

This is a horrible copout but it seems to work:

RewriteCond %{PATH_INFO} .
RewriteRule ^((?:[^./]+/)*[^./]+\.(?:html?|php)) http://www.example.com/$1 [R=301,L]

Unlike your ordinary RewriteRule, do not use a closing anchor $ after the extension. The extra path stuff, unlike a query string, counts as part of the URL.

You can even do it without a Condition, as:

RewriteRule ^((?:[^./]+/)*[^./]+\.(?:html?|php))/ http://www.example.com/$1 [R=301,L]

again with no closing anchor. All it looks for is a slash after the extension; it doesn't matter whether there's anything more.
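Piece by piece, the second rule's pattern reads like this (annotation added for clarity; it is the same rule as above):

# ^                 start of the requested path
# (?:[^./]+/)*      any number of directory segments (no dots or slashes inside a segment)
# [^./]+            the filename proper
# \.(?:html?|php)   a literal dot followed by htm, html, or php
# /                 the telltale extra slash after the extension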

I did a lot of experimenting with AcceptPathInfo On and Off, but no joy. Setting it to On makes it possible to attach garbage after ".html". But setting it to Off didn't seem to do anything at all. That is, it didn't affect include files or simple php activity, but it also didn't change the treatment of filenames with more stuff after the .php.

I also learned that if you have a RewriteEngine On statement in one directory, without any RewriteRules after it, then incoming rewrites/redirects from higher directories won't work. Don't ask me to explain this; I'm just reporting what I found. I wasn't looking for it. It was an unlucky accident that cost me a lot of time worrying that my htaccess had simply stopped working.

helenp
msg:4546175 - 5:48 pm on Feb 16, 2013 (gmt 0)

Unlike your ordinary RewriteRule, do not use a closing anchor $ after the extension. The extra path stuff, unlike a query string, counts as part of the URL.

Thousands of thanks; I'm afraid I don't understand this part.

Will try it and let you know.

helenp
msg:4546192 - 6:07 pm on Feb 16, 2013 (gmt 0)

Wow, that works like a charm, a trillion thanks!

helenp
msg:4546218 - 8:04 pm on Feb 16, 2013 (gmt 0)

Oops,
I've been doing more tests, and this page does exist:
http://www.example.com/sales/properties_for_sale_marbella.htm?id=113/maps/
and it has the same content as:
http://www.example.com/sales/properties_for_sale_marbella.htm?id=113
However, as far as I know, I never had problems with those pages.

lucy24
msg:4546235 - 11:35 pm on Feb 16, 2013 (gmt 0)

Eeuw. You mean the path stuff might get attached to the end of the query string?

That part should be easier to deal with, though. You just need a preliminary php function that checks all query strings for a path component, and bounces back a 404. Or just cross your fingers and hope that g### continues not to notice ;)
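A minimal sketch of such a check, assuming plain PHP run before any output (the function name is made up for illustration):

<?php
// Hypothetical early check: if the query string contains a slash
// (e.g. ?id=113/maps/), treat the request as bogus and send a 404
// instead of serving duplicate content.
function bounce_path_in_query() {
    if (isset($_SERVER['QUERY_STRING']) && strpos($_SERVER['QUERY_STRING'], '/') !== false) {
        header('HTTP/1.1 404 Not Found');
        exit;   // or include your normal 404 page before exiting
    }
}
bounce_path_in_query();
?>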

I'm afraid I don't understand this part

Normally when a RewriteRule looks at requests for a particular type of extension like .jpg or .html, you put a closing anchor $ to make sure you are not redirecting requests that have added garbage after the extension. But here you want the with-garbage requests, so you have to leave off the anchor.
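In other words, with illustrative patterns (targets elided):

# anchored: /page.html matches, /page.html/garbage does not
RewriteRule \.html$ ...
# unanchored: /page.html/garbage matches too, which is what you want here
RewriteRule \.html ...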

Hm. Sounds like a restaurant order. "Give me one URL with double garbage, to go."

Oh, and...

Now that you have plugged the hole, go back and see if you can figure out where those bogus URLs are coming from. Usually the first place you look is anything with relative links.

helenp
msg:4546240 - 12:28 am on Feb 17, 2013 (gmt 0)

Now that you have plugged the hole, go back and see if you can figure out where those bogus URLs are coming from. Usually the first place you look is anything with relative links.

I remember back in December one school in Sweden had about 3000 links to one page of mine, and a parked domain had about 300 links to another page. As the site had dropped, I thought those links were strange; it could look like crosslinking or bought links, so I renamed those 2 pages, since I never got an answer from their webmaster about what the links were for.

Also, in November I got a dedicated IP, and for a day some people got another domain when trying to enter our site. I remember I saw that site somewhere in GWT back in December-January.

Not sure, but I think it was a mix of many things. Anyway, the links are now root-relative, so I hope there won't be any more.
Thanks for everything.
I will have a look at that function and see if I can manage it. I tested some sites and saw that WebmasterWorld returns a 404 if you add garbage to the URL.
