Forum Moderators: Robert Charlton & goodroi


Googlebot is requesting non-existent urls

         

GeorgeFive

12:58 am on Nov 18, 2008 (gmt 0)

10+ Year Member



I'm back with another question... remember how I mentioned that my pagerank took a massive drop [webmasterworld.com] a couple of days ago? Well, I've been doing some digging in Google's Webmaster Tools to see if I noticed anything screwy. Everything seems alright, except for the fact that I noticed it was trying to access some invalid pages (pages which are not linked on my site and which - according to a Google search - aren't linked elsewhere on the internet).

The thing about this is that these pages aren't just typo'd addresses or URLs with invalid parameters tacked on... they're page numbers that don't exist.

As an example... on this forum, you can change page numbers by clicking the links, links which point to:

http://www.webmasterworld.com/forum30/page1.htm
http://www.webmasterworld.com/forum30/page2.htm
http://www.webmasterworld.com/forum30/page3.htm

I have something similar on my site, and for the category in question, there's a total of 36 pages. According to the "Unreachable URLs" section of Webmaster Tools, Google is trying to get page 37, 38, 39, and everything else up to 9721 (not every broken page is listed there, but some are and the highest is 9721). Obviously, these are broken links, so I'm wondering if this is related to my PR drop.

Again, those pages shouldn't be in there - if you go to my site, there are no links pointing to those pages (I just double and triple checked). Unless they're not showing up in a Google search, no other sites are linking to these invalid pages. Short of manually changing the URL, you cannot get to those pages - so why is Google spidering them?

I obviously want Google to continue spidering the legit pages, but is there any way to tell it, you know, not to look on pages that I don't have linked?

[edited by: tedster at 1:14 am (utc) on Nov. 18, 2008]
[edit reason] de-link the url examples [/edit]

tedster

1:16 am on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How does your server respond to those requests? If the http header for the server's response is 404 (or a 301 redirect to a 404) then this is not the source of any problem, it's just googlebot kicking your tires. Google does check to see how servers handle all kinds of requests - it doesn't just follow links.

GeorgeFive

1:42 am on Nov 18, 2008 (gmt 0)

10+ Year Member



The pages are dynamically built, so it's not returning anything except for a blank page (well, blank aside from the header and such - no actual content). I can actually give you a demo of this, since this site handles it the same way that mine does:

[webmasterworld.com...]

Imagine if everything between the top and bottom "Home / Forums Index / The Google World / Google Search News" was gone, and you'll see what my site does.

I don't think I could set a 404 on these invalid pages, since you access them through something like:

www.example.com/page/1
which, through the magic of htaccess, is transparently rewritten to
www.example.com/myscript.php?page=1
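
His actual rules aren't shown, but the rewrite he describes typically looks something like this in .htaccess (a hypothetical sketch; the path and script name follow his example):

```apache
# Hypothetical version of the rewrite described above:
# a request for /page/1 is handled internally by myscript.php?page=1
RewriteEngine On
RewriteRule ^page/([0-9]+)$ myscript.php?page=$1 [L]
```

Note that this is an internal rewrite, not a redirect the client ever sees - so myscript.php still fully controls the status code it sends, including a 404.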

So, not an issue here?

tedster

2:07 am on Nov 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I do understand - and I do NOT recommend that your site emulate what WebmasterWorld does.

If a url should not exist, then a request for that url should return a 404 in the http header sent from the server. How you code that will depend on your own server technology, but one way that you can verify your server headers is to use Firefox and install the LiveHTTPHeaders add-on.
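
Since the site already runs PHP (the thread mentions myscript.php), a small script is another way to check what status a server really sends - `get_headers()` makes the request and returns the raw response header lines, with the status code in the first one. A hypothetical sketch; the URL below is a stand-in:

```php
<?php
// Pull the numeric status code out of a status line like
// "HTTP/1.1 404 Not Found" -> 404 (works for HTTP/1.0 too).
function status_code($status_line) {
    return (int) substr($status_line, 9, 3);
}

// Ask the server for just the headers of a (bogus) URL.
// The @ suppresses the warning if the host is unreachable.
$headers = @get_headers('http://www.example.com/page/9999');
if ($headers !== false) {
    echo status_code($headers[0]) . "\n"; // a non-existent page should say 404
}
```

LiveHTTPHeaders shows the same information interactively; a script like this just makes it easy to check a whole batch of URLs at once.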

If you consistently return a 404 status, the spurious url requests should slow way down over time. Among other benefits, this will help Google make more economical use of whatever crawl budget they allocate to your site.

However, I don't think this is the reason behind your home page PR now showing zero - although who knows. If your home page PR really fell from 5 to 0, your search traffic would also have fallen off dramatically.

TerrCan123

9:04 am on Nov 18, 2008 (gmt 0)

10+ Year Member



I think it sounds like Google is finding template pages, so you need to fix the website software or script.

GeorgeFive

9:12 am on Nov 18, 2008 (gmt 0)

10+ Year Member



Yeah, I'm not too concerned about the PR now... my traffic hasn't fallen a bit. Of course, I say this today, and it'll plummet tomorrow.

TerrCan123 - I wrote the code myself, and trust me, I double checked both the code and the actual output. It shouldn't be putting out screwy pages, and if you look at the actual site, it's not. Oh, how I wish I could link it here ;)

TerrCan123

9:21 am on Nov 18, 2008 (gmt 0)

10+ Year Member



The WebmasterWorld page you linked to is an actual page, even though it has no content other than the template. If that is what's happening on your site, then those are real pages - or at least, Googlebot would consider them to be pages.

GeorgeFive

9:41 am on Nov 18, 2008 (gmt 0)

10+ Year Member



Ohhh... ok, my apologies, I misunderstood you - yes, you are correct there. The problem is that it's difficult to compensate for this; I could put an "invalid page" note on there, but this wouldn't do much good in terms of pleasing their spider.

On the flipside, there are 236 pages on here right now... if the webmaster of this site were in my shoes and put in some code change, what would happen when the site has 237 pages tomorrow or next week?

I think I'm going to have to do what tedster suggested and serve up a 404 error, but this is sort of tricky with my setup - I can use the header command to give that error status, but then I can't redirect to the proper error page... and if I can't redirect, then any legitimate 404 will serve up a blank white page.
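
The redirect isn't actually needed: since the rewrite hands every request to the script, the script can send the 404 status and then include the error template directly in the same response. A hypothetical sketch - the page count, script layout, and error file name are all placeholders, not his actual code:

```php
<?php
// Hypothetical check at the top of myscript.php: validate the
// requested page number before rendering anything.
function page_is_valid($page, $total_pages) {
    return $page >= 1 && $page <= $total_pages;
}

$total_pages = 36; // however the script already computes the real count
$page = isset($_GET['page']) ? (int) $_GET['page'] : 1;

if (!page_is_valid($page, $total_pages)) {
    header('HTTP/1.1 404 Not Found'); // status goes out with this response
    include '404.php';                // render the error template directly
    exit;                             // no redirect, so the page still shows
}
// ...normal page rendering continues here...
```

Because $total_pages is computed each time, the check keeps working when the site grows from 236 pages to 237 - no code change needed.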

Ack... I think it'd be best to just see if they knock it off eventually.