Forum Moderators: open
This seems to indicate that Googlebot is having trouble crawling php pages. Is it indexing the links on these pages at all? Any help or input would be appreciated.
Whatever the problem is, I doubt it's caused by using PHP. Googlebot doesn't even know your site uses PHP (aside from the PHP file extension if you're using it) because all it sees is the HTML that is generated after the page is parsed.
That's what I thought. Google is only showing cache information for a few of my php pages. I'm trying to determine just why this is.
-- URL has a session ID: most definitely!
-- content is not changing much from one page to another without a session ID in the URL. Many people who use ready-made CMSes/blogs often complain about this. They create pages with fancy templates and then put in barely any content to make the pages look different.
-- hardly any content on the page --> feed your bots words!
-- you're passing variables in the URL but they're getting stripped so content isn't changing or is showing nothing to the bot. This is where a bot simulator will help check whether this is occurring.
-- server or site goes down when he comes crawling
I just converted the site over from html to php. Out of 18 files that weren't cached, only 3 or 4 of them are .php files. The rest are .html files. Funny thing: http://www.widgets.com/ is cached but [widgets.com...] isn't. They are the same file. Any clue?
I made a mistake when I posted. Only 3 or 4 of the non-cached files are .php. The rest are .html files from the old example-site.com. Searching for "site:widget-city.com widget" at Google shows all the pages that they have indexed for the site. There are a bunch with content but no "cached copy" link or "similar pages" link. The latter is generally worthless anyway but if Google is missing my content I'd like to know why. The referring links don't help the home page and the lack of indexed content buries them at the bottom of the search results.
-- URL has a session ID: most definitely!
Q - Where is the session id?
A - If they are being used you will either see it in the URL (usually a long string of random numbers and/or letters) or in a cookie. Unless you have installed something that uses sessions, or wrote your own session code, then no session is created.
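If you want to check your own links in bulk, a quick sanity check is to scan each URL's query string for the parameter names session code typically uses. A minimal Python sketch (the parameter list is an assumption: PHPSESSID is PHP's default; the others are guesses based on common packages):

```python
from urllib.parse import urlparse, parse_qs

# Likely session-ID parameter names. PHPSESSID is PHP's default;
# the rest are assumptions based on popular scripts of the era.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid", "session_id", "oscsid"}

def has_session_id(url: str) -> bool:
    """Return True if the URL's query string carries a likely session ID."""
    query = parse_qs(urlparse(url).query)
    return any(name.lower() in SESSION_PARAMS for name in query)

print(has_session_id("http://widgets.com/page.php?PHPSESSID=a1b2c3d4"))  # True
print(has_session_id("http://widgets.com/page.php?category=blue"))       # False
```

If this returns True for URLs you expect bots to crawl, that's the first thing to fix: configure your session handling to stop putting the ID in the URL for cookieless visitors.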
-- you're passing variables in the URL but they're getting stripped so content isn't changing or is showing nothing to the bot. This is where a bot simulator will help check whether this is occurring.
Q - huh?
A - I don't know about bot simulators; where could one get one of those? Can you trust that some person's homebrew simulator actually behaves like the real thing? Hmmm.
-- server or site goes down when he comes crawling
Q - Not likely is it?
A - Definitely not ... pushing 322 days of uptime at this moment!
The pages that I have with urls like [widgets.com...] are indexed and cached files. There just doesn't seem to be a real reason for Google not indexing the content of most of the pages in the site. Especially the static html files.
[edited by: Woz at 2:23 am (utc) on Oct. 10, 2003]
[edit reason] fixed quotation [/edit]
Yeah, Googlebot found [widgets.com...], realised that it was exactly the same as [widgets.com...], and only listed the original one.
I had a bunch of links to
mysite.com/path that produced a 301 to
mysite.com/path/ with the trailing slash.
Google put both URLs with and without the slash into the index. Only the correct ones with the slash had any information cached.
I can't say if this is your problem, but you might want to look into whether your links are generating 301s to get to the files that are not showing as cached.
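A simple way to look into it is to request the no-slash URL without following redirects and inspect the status code. A minimal Python sketch, assuming plain HTTP (the widgets.com URLs in this thread are placeholders; point it at your own links):

```python
import http.client
from urllib.parse import urlparse

def check_redirect(url: str):
    """Issue a HEAD request WITHOUT following redirects;
    return (status code, Location header or None)."""
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    resp = conn.getresponse()
    location = resp.getheader("Location")
    conn.close()
    return resp.status, location

# e.g. check_redirect("http://widgets.com/path")
# A (301, ".../path/") result means your link is feeding the bot a redirect.
```

If your internal links return 301s, change them to point straight at the trailing-slash form so the bot never sees the redirect in the first place.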
The trailing slash, the lack of the www sub domain, the space in the URL all seem to throw Google into a spin. Typically my sites seem to benefit from this "bug" as they get multiple listings, like it thinks they are different when they are not......interesting twist!
It might be worth investigating this "feature" more for some competitive advantage......but on the other hand I guess they will fix it sooner or later......although they don't seem too keen on correcting the problem ;)
Yeah, Googlebot found [widgets.com...], realised that it was exactly the same as [widgets.com...], and only listed the original one.
BS. Google will list [widgets.com...] and [widgets.com...] Besides, both URLs are listed and neither is cached. Two different things.
It actually sounds like they are combined, and they only have one copy, but they don't have the results completely cleaned up this month, which is why they are showing the redirect pages I mentioned.
Anyway, what does it matter to you, as long as one version of the page is listed? Isn't that all you want?
By the way, if you are having both / and /index.php showing up, you need to fix your links to be consistent. Always link to / and let the server figure out the filename.
[widgets.com...]
[widgets.com...]
are the exact same page if index.php is your default page
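When links are generated from a template, the advice above can be applied mechanically. A hypothetical helper in Python (the index filenames are an assumption; match the list to your server's DirectoryIndex setting):

```python
from urllib.parse import urlparse, urlunparse

# Filenames the server treats as the directory index. This list is
# an assumption -- align it with your own DirectoryIndex config.
INDEX_NAMES = ("index.php", "index.html", "index.htm")

def canonical_link(url: str) -> str:
    """Strip a trailing index filename so links always point at the directory."""
    parts = urlparse(url)
    path = parts.path
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    return urlunparse(parts._replace(path=path))

print(canonical_link("http://widgets.com/index.php"))  # http://widgets.com/
```

Run every internal link through something like this and the / vs. /index.php duplication never reaches the bot at all.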
It's been said time and time again that the file extension doesn't help/harm a page's listing. However, if your URLs are too deep (many subdirectories), or contain variables (like a session ID), that can hurt the chances of the pages ranking well or being cached. But again, that has nothing to do with the file extension.
Whatever the cause, seems like a pretty safe bet that it's not because you are using PHP -- there are thousands and thousands of PHP pages in Google's cache (Cf. "Google's cache of http:// www.php.net/" [google.com]).
Jordan
Whatever the problem is, I doubt it's caused by using PHP. Googlebot doesn't even know your site uses PHP (aside from the PHP file extension if you're using it) because all it sees is the HTML that is generated after the page is parsed.
Well, this isn't necessarily true. Unless you have your server set up to suppress it, the fact that you are using PHP will be announced to the world in the response headers sent back for every GET request.
I'm not suggesting that Google is doing anything with that info, but they certainly could determine that your site uses PHP if they wanted to; again, unless you take steps to prevent it.
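For reference, the header in question is typically X-Powered-By: PHP/x.y.z. PHP can be told not to send it via the expose_php directive in php.ini (a config sketch, not the only way to do it; mod_headers can also strip the header at the Apache level):

```ini
; php.ini -- stop PHP announcing itself in the X-Powered-By header
expose_php = Off
```

Restart the web server after changing it, then check your response headers to confirm the advertisement is gone.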
Google has been known to dump cached pages from time to time. I wouldn't worry about that too much.
I'm still looking for an answer. If Google "dumps" cached pages and displays no information other than a link whose anchor text is the URL, is it indexing the tags, text, and links in the page, or just dumping it to the bottom of relevant search results?
It doesn't.
Once it crawls, it knows what was on that page. When it indexes that page again later, it will update what it knows about that page.
The cached page is for your benefit, but Google itself does not need it to function. I think it's safe to say that when you do a search, Google does not scan the cached pages for results; it would take far too long to do it that way.
All it means is that Google found a link to this page; therefore, it knows about it, but hasn't had a chance to spider it yet.
Be patient and the next time you get a visit, it should visit that page too, and you will see it fully indexed and cached.