Forum Moderators: open
This seems to indicate that Googlebot is having trouble crawling php pages. Is it indexing the links on these pages at all? Any help or input would be appreciated.
Whatever the problem is, I doubt it's caused by using PHP. Googlebot doesn't even know your site uses PHP (aside from the PHP file extension if you're using it) because all it sees is the HTML that is generated after the page is parsed.
That's what I thought. Google is only showing cache information for a few of my php pages. I'm trying to determine just why this is.
-- URL has a session ID: most definitely!
-- content is not changing much from one page to another without a session ID in the URL. Many people who use ready-made CMSes/blogs often complain about this. They create pages with fancy templates and then put in barely any content to make the pages look different.
-- hardly any content on the page --> feed your bots words!
-- you're passing variables in the URL but they're getting stripped so content isn't changing or is showing nothing to the bot. This is where a bot simulator will help check whether this is occurring.
-- server or site goes down when he comes crawling
I just converted the site over from html to php. Out of 18 files that weren't cached, only 3 or 4 of them are .php files. The rest are .html files. Funny thing: http://www.widgets.com/ is cached but [widgets.com...] isn't. They are the same file. Any clue?
I made a mistake when I posted. Only 3 or 4 of the non-cached files are .php. The rest are .html files from the old example-site.com. Searching for "site:widget-city.com widget" at Google shows all the pages that they have indexed for the site. There are a bunch with content but no "cached copy" link or "similar pages" link. The latter is generally worthless anyway but if Google is missing my content I'd like to know why. The referring links don't help the home page and the lack of indexed content buries them at the bottom of the search results.
-- URL has a session ID: most definitely!
Q - Where is the session id?
A - If they are being used you will either see it in the URL (usually a long string of random numbers and/or letters) or in a cookie. Unless you have installed something that uses sessions, or wrote your own session code, then no session is created.
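If you want to check your own links in bulk, a quick sanity check is to scan each URL's query string for the parameter names session code typically uses. A minimal Python sketch (the parameter list is an assumption: PHPSESSID is PHP's default; the others are guesses based on common packages):

```python
from urllib.parse import urlparse, parse_qs

# Likely session-ID parameter names. PHPSESSID is PHP's default;
# the rest are assumptions based on popular scripts of the era.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid", "session_id", "oscsid"}

def has_session_id(url: str) -> bool:
    """Return True if the URL's query string carries a likely session ID."""
    query = parse_qs(urlparse(url).query)
    return any(name.lower() in SESSION_PARAMS for name in query)

print(has_session_id("http://widgets.com/page.php?PHPSESSID=a1b2c3d4"))  # True
print(has_session_id("http://widgets.com/page.php?category=blue"))       # False
```

If this returns True for URLs you expect bots to crawl, that's the first thing to fix: configure your session handling to stop putting the ID in the URL for cookieless visitors.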
-- you're passing variables in the URL but they're getting stripped so content isn't changing or is showing nothing to the bot. This is where a bot simulator will help check whether this is occurring.
Q - huh?
A - I don't know about bot simulators; where could one get one of those? Can you trust that some person's homebrew simulator actually behaves like the real thing? Hmmm.
-- server or site goes down when he comes crawling
Q - Not likely is it?
A - Definitely not ... pushing 322 days of uptime at this moment!
The pages that I have with urls like [widgets.com...] are indexed and cached files. There just doesn't seem to be a real reason for Google not indexing the content of most of the pages in the site. Especially the static html files.
[edited by: Woz at 2:23 am (utc) on Oct. 10, 2003]
[edit reason] fixed quotation [/edit]
Yeah, Googlebot found [widgets.com...], realised that it was exactly the same as [widgets.com...], and only listed the original one.
I had a bunch of links to
mysite.com/path that produced a 301 to
mysite.com/path/ with the trailing slash.
Google put both URLs with and without the slash into the index. Only the correct ones with the slash had any information cached.
I can't say if this is your problem, but you might want to look into whether your links are generating 301s to get to the files that are not showing as cached.
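A simple way to look into it is to request the no-slash URL without following redirects and inspect the status code. A minimal Python sketch, assuming plain HTTP (the widgets.com URLs in this thread are placeholders; point it at your own links):

```python
import http.client
from urllib.parse import urlparse

def check_redirect(url: str):
    """Issue a HEAD request WITHOUT following redirects;
    return (status code, Location header or None)."""
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    resp = conn.getresponse()
    location = resp.getheader("Location")
    conn.close()
    return resp.status, location

# e.g. check_redirect("http://widgets.com/path")
# A (301, ".../path/") result means your link is feeding the bot a redirect.
```

If your internal links return 301s, change them to point straight at the trailing-slash form so the bot never sees the redirect in the first place.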
The trailing slash, the lack of the www sub domain, the space in the URL all seem to throw Google into a spin. Typically my sites seem to benefit from this "bug" as they get multiple listings, like it thinks they are different when they are not......interesting twist!
It might be worth investigating this "feature" more for some competitive advantage......but on the other hand I guess they will fix it sooner or later......although they don't seem too keen on correcting the problem ;)
Yeah, Googlebot found [widgets.com...], realised that it was exactly the same as [widgets.com...], and only listed the original one.
BS. Google will list [widgets.com...] and [widgets.com...] Besides, both URLs are listed and neither is cached. Two different things.
It actually sounds like they are combined, and they only have one copy, but they don't have the results completely cleaned up this month, which is why they are showing the redirect pages I mentioned.
Anyway, what does it matter to you, as long as one version of the page is listed? Isn't that all you want?
By the way, if you are having both / and /index.php showing up, you need to fix your links to be consistent. Always link to / and let the server figure out the filename.
[widgets.com...]
[widgets.com...]
are the exact same page if index.php is your default page
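When links are generated from a template, the advice above can be applied mechanically. A hypothetical helper in Python (the index filenames are an assumption; match the list to your server's DirectoryIndex setting):

```python
from urllib.parse import urlparse, urlunparse

# Filenames the server treats as the directory index. This list is
# an assumption -- align it with your own DirectoryIndex config.
INDEX_NAMES = ("index.php", "index.html", "index.htm")

def canonical_link(url: str) -> str:
    """Strip a trailing index filename so links always point at the directory."""
    parts = urlparse(url)
    path = parts.path
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    return urlunparse(parts._replace(path=path))

print(canonical_link("http://widgets.com/index.php"))  # http://widgets.com/
```

Run every internal link through something like this and the / vs. /index.php duplication never reaches the bot at all.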
It's been said time and time again that the file extension doesn't help/harm a page's listing. However, if your URLs are too deep (many subdirectories), or contain variables (like a session ID), that can hurt the chances of the pages ranking well or being cached. But again, that has nothing to do with the file extension.
Whatever the cause, seems like a pretty safe bet that it's not because you are using PHP -- there are thousands and thousands of PHP pages in Google's cache (Cf. "Google's cache of http:// www.php.net/" [google.com]).
Jordan
Whatever the problem is, I doubt it's caused by using PHP. Googlebot doesn't even know your site uses PHP (aside from the PHP file extension if you're using it) because all it sees is the HTML that is generated after the page is parsed.
Well, this isn't necessarily true. Unless you have your server set up to suppress it, the fact that you are using PHP will be announced to the world in the response headers sent back for every GET request.
I'm not suggesting that Google is doing anything with that info, but they certainly could determine that your site uses PHP if they wanted to; again, unless you take steps to prevent it.
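For reference, the header in question is typically X-Powered-By: PHP/x.y.z. PHP can be told not to send it via the expose_php directive in php.ini (a config sketch, not the only way to do it; mod_headers can also strip the header at the Apache level):

```ini
; php.ini -- stop PHP announcing itself in the X-Powered-By header
expose_php = Off
```

Restart the web server after changing it, then check your response headers to confirm the advertisement is gone.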
Google has been known to dump cached pages from time to time. I wouldn't worry about that too much.
I'm still looking for an answer. If Google "dumps" cached pages and displays no information other than a link whose anchor text is the URL, is it indexing the tags, text, and links in the page, or just dumping it to the bottom of relevant search results?
It doesn't.
Once it crawls, it knows what was on that page. When it indexes that page again later, it will update what it knows about that page.
The cached page is for your benefit, but Google itself does not need it to function. I think it's safe to say that when you do a search, Google does not scan the cached pages for results; it would take far too long to do it that way.
All it means is that Google found a link to this page; therefore, it knows about it, but hasn't had a chance to spider it yet.
Be patient and the next time you get a visit, it should visit that page too, and you will see it fully indexed and cached.