Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
.php and bots
WSQuant




msg:1277373
 9:29 pm on Jun 4, 2005 (gmt 0)

I have a question about using .php and how that affects the way that search engine bots crawl your site. I've heard that using variables in the URL affects the ability of bots to crawl your site.

Is this true? Can anyone expand on this subject a bit for me?

Thanks in advance.

 

badone




msg:1277374
 11:53 pm on Jun 4, 2005 (gmt 0)

This used to be true more than it is today.

If you have a simple url like http://www.example.com/myphppage?id=12 then it is probably OK. It is possible to use mod_rewrite in Apache to have this URL show up as http://www.example.com/myphppage/id/12 to avoid any possibility of search engines not being able to index your link but, unless your URLs are complex (i.e. pass many variables at a time) I don't think this is necessary these days.

On the other hand, look at this very forum. I'll almost guarantee that Brett does not keep each thread in a separate file, so the link

[webmasterworld.com...]

is most likely a mod_rewrite of

[webmasterworld.com?forum=88&thread=8501...]

or something along those lines.

If it's good enough for WebmasterWorld it's probably the right way for you to jump too.

HTH,
BAD

[edited by: jatar_k at 9:09 pm (utc) on June 15, 2005]
[edit reason] examplified [/edit]

WSQuant




msg:1277375
 5:31 pm on Jun 5, 2005 (gmt 0)

Thanks a lot for the response... great news.

base64




msg:1277376
 5:54 pm on Jun 5, 2005 (gmt 0)

Yep, mod_rewrite rules. I use this feature on all my websites, so the pages look like 100% static "*.html" files. And from what I see in my web server logs, Google and other search engines are crawling this kind of site perfectly.

PumpkinHead




msg:1277377
 7:45 pm on Jun 5, 2005 (gmt 0)

I've been thinking about this. Does anyone know of any good guides?

JamShady




msg:1277378
 3:32 am on Jun 6, 2005 (gmt 0)

I find the problem with using mod_rewrite is the regex limitations of Apache. I wrote an alternative in PHP which works perfectly :)

When you access a file in the format file.php/var/val, PHP creates a $_SERVER['PATH_INFO'] entry holding the extra path (here, /var/val). All you need to do is split it up and store it in an array as key/value pairs. You can then merge this into the $_GET superglobal array and voila, it's like mod_rewrite, only it doesn't need specific rules :)
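As a rough illustration of the PATH_INFO approach described above, here is a minimal sketch. The function name parse_path_info is hypothetical, not JamShady's actual code. It assumes the path alternates name/value segments and simply ignores a trailing name that has no value.

```php
<?php
// Hypothetical helper (not from the original post): turn "/var/val/var2/val2"
// into ['var' => 'val', 'var2' => 'val2'].
function parse_path_info(string $pathInfo): array
{
    // Drop empty segments produced by leading, trailing, or doubled slashes.
    $segments = array_values(array_filter(
        explode('/', $pathInfo),
        static function (string $segment): bool {
            return $segment !== '';
        }
    ));

    $params = [];
    // Walk the segments two at a time: name, then value.
    // A trailing name without a value is skipped.
    for ($i = 0; $i + 1 < count($segments); $i += 2) {
        $params[$segments[$i]] = $segments[$i + 1];
    }
    return $params;
}

// Merge into $_GET. With the array union operator, values already present
// in the real query string win if a name appears in both places.
$_GET = $_GET + parse_path_info($_SERVER['PATH_INFO'] ?? '');
```

With this in place, a request for file.php/id/12 should leave $_GET['id'] set to '12', much as file.php?id=12 would. Treat the merged values like any other user input and validate them before use.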

WSQuant




msg:1277379
 3:36 pm on Jun 14, 2005 (gmt 0)

> This used to be true more than it is today.
>
> If you have a simple url like http://www.example.com/myphppage?id=12 then it is probably OK. It is possible to use mod_rewrite in Apache to have this URL show up as http://www.example.com/myphppage/id/12 to avoid any possibility of search engines not being able to index your link but, unless your URLs are complex (i.e. pass many variables at a time) I don't think this is necessary these days.

Can someone give a brief explanation of how to do this? I started looking into it and found this thread on WebmasterWorld that says you cannot rewrite the URL in the way described above.

[webmasterworld.com...]

Thanks

[edited by: jatar_k at 9:09 pm (utc) on June 15, 2005]
[edit reason] examplified [/edit]

Sarah Atkinson




msg:1277380
 4:13 pm on Jun 14, 2005 (gmt 0)

How do you do a mod_rewrite?

Dijkgraaf




msg:1277381
 9:42 pm on Jun 14, 2005 (gmt 0)

> http://www.example.com/myphppage?id=12

Well, actually, if you are going to use parameters, I would avoid the parameter name id. Some bots avoid it because it is often used as a session ID.

[edited by: jatar_k at 9:08 pm (utc) on June 15, 2005]
[edit reason] examplified [/edit]

Span




msg:1277382
 9:57 pm on Jun 14, 2005 (gmt 0)

Also this, from Google's new guidelines [google.com]:
Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index.

Dijkgraaf




msg:1277383
 10:17 pm on Jun 14, 2005 (gmt 0)

Well, funnily enough, I have thousands of pages in Google's index that have id= as one of the parameters.
But yes, avoid it.
If you can, try to build your dynamic pages so that they use only one parameter (some bots will only crawl single-parameter links).
<added>
Also from the URL mentioned above

If you decide to use dynamic pages (i.e., the URL contains a "?" character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.
</added>

WSQuant




msg:1277384
 8:56 pm on Jun 15, 2005 (gmt 0)

Thanks for the responses. I was aware that I should keep PHP variables short and to a minimum, but I am trying to avoid them altogether.

I've seen on this forum that it is possible to use mod_rewrite to change a URL from say http://www.example.com/index.html?var=bluewidget

into:
http://www.example.com/bluewidget/

BUT, I've also read on this forum that mod_rewrite can't rewrite a URL that doesn't exist. So if I wanted to do this, then www.mysite.com/bluewidget/index.html would still have to exist as a file.

The two explanations I've read, as you can see, very much contradict each other. So I'm at a bit of a loss as to whether this can or cannot be done. I'll be researching it myself as soon as I find time, but until then I'll hold out hope that someone here will help me out :)

[edited by: jatar_k at 9:07 pm (utc) on June 15, 2005]
[edit reason] changed to example.com [/edit]

anshul




msg:1277385
 6:58 am on Jun 16, 2005 (gmt 0)

I have a related but different question:
Which is better, a .php or a .htm web page? (Neither page has query strings, session IDs, or other URL information.) The PHP page is still dynamic, though, since it displays different content and links from the database each time it is refreshed. How do search engines treat this?

Another question: if URLs are passing session data, how bad is that, and how can I keep session variables out of URLs without changing application code? ;)

mincklerstraat




msg:1277386
 7:46 am on Jun 16, 2005 (gmt 0)

WSQuant: mod_rewrite can be, and usually is, used to serve requests for files that don't 'really' exist in the filesystem in the way they appear in the URL. The rewrite rule just sends the request to another page specified in the rule, with parameters (if these are included in the rule). However, if the request and the rule don't result in a valid request, the server will give some kind of error - this is maybe what you've read. Your bluewidget example should work just fine, as long as the rule is well-written and you have some kind of valid response for $_GET['var'] = 'bluewidget'.
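For those asking how to do a mod_rewrite, here is a minimal sketch of a rule for the bluewidget example. It assumes the rule lives in an .htaccess file in the document root, that mod_rewrite is enabled and overrides are allowed, and that a script called index.php (a hypothetical name) reads $_GET['var']. Adjust the pattern and target to suit your own site.

```apache
# .htaccess in the document root (assumes mod_rewrite is enabled and
# AllowOverride permits these directives). index.php is a hypothetical name.
RewriteEngine On

# Leave requests for files and directories that really exist alone.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

# /bluewidget/ is served internally by index.php?var=bluewidget.
# QSA keeps any query string already on the request; L stops further rules.
RewriteRule ^([A-Za-z0-9_-]+)/?$ index.php?var=$1 [L,QSA]
```

No bluewidget directory or file has to exist for this to work. The visitor and the bot only ever see http://www.example.com/bluewidget/, because the rewrite happens inside the server.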

anshul: how a page is refreshed depends on the server headers. This is a part of the page you can't see just by viewing source - search for 'live http headers' and you'll find a number of sites that let you see the HTTP headers of whatever page you want; the Firefox webdev toolbar also has this facility. Very important when you get further into web development.

Normally, Apache does a great job returning info on .html pages without you knowing anything about all the hard work it's doing. It sends a bit of data out in an 'ETag' header, and sends out the date the page was last modified in a 'Last-Modified' header. When you want to look again at a page that's in your browser's cache, your browser sends back either this ETag value (in an 'If-None-Match' header) or the last-modified date (in an 'If-Modified-Since' header). Apache looks at this and can tell whether the page has been modified. If not, it just sends back a tiny bit of info, 'no, your cache is still valid' (a 304 response) - this isn't a whole HTML page, so it happens real nice and fast.

PHP won't give you any of this information unless you ask it to. Since the browser doesn't have any information on how long the page stays fresh, it just asks for the whole page again, and PHP serves up the whole entire page, again. You can add smart cache-headers to your scripts, but this takes some thinking. There are also drop-in cache options like jpcache.
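To make the "smart cache headers" idea a bit more concrete, here is a minimal sketch of answering a conditional request from PHP. The function names are hypothetical, and it assumes you know when the content last changed (for example, from a database timestamp). It also simplifies things: it compares a single ETag exactly and does not handle weak validators or lists of ETags.

```php
<?php
// Hypothetical helper: can this conditional request be answered with a 304?
// $server is passed in (normally $_SERVER) so the logic is easy to test.
function is_not_modified(int $lastModified, string $etag, array $server): bool
{
    // If the client sent an ETag back, it takes precedence over the date.
    if (isset($server['HTTP_IF_NONE_MATCH'])) {
        return trim($server['HTTP_IF_NONE_MATCH']) === $etag;
    }
    // Otherwise fall back to comparing the last-modified date.
    if (isset($server['HTTP_IF_MODIFIED_SINCE'])) {
        $since = strtotime($server['HTTP_IF_MODIFIED_SINCE']);
        return $since !== false && $since >= $lastModified;
    }
    // No conditional headers at all: send the full page.
    return false;
}

// Hypothetical wrapper: send validators, then either a 304 or the full body.
function send_with_cache_headers(int $lastModified, string $body): void
{
    $etag = '"' . md5($body) . '"';
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
    header('ETag: ' . $etag);

    if (is_not_modified($lastModified, $etag, $_SERVER)) {
        http_response_code(304); // "your cache is still valid", no body sent
        return;
    }
    echo $body;
}
```

A page would build its output into a string, then call send_with_cache_headers($timestampOfLastChange, $html) before anything else has been echoed. Note that hashing the body for the ETag still means PHP builds the whole page each time. This saves bandwidth, not server work, unless you can decide from the timestamp alone.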

If you use mod_rewrite to make your .php pages look like .htm pages, or make PHP parse .htm pages, this situation doesn't change - the .htm page is still served by PHP, and won't give any significant cache headers. This can be good or bad, depending on the page - if your pages don't change much, it's bad, if they're always updated with fresh stuff, it's good.

Search engines are getting much better at indexing pages with parameters.

Session IDs: see the ini settings session.use_cookies and session.use_only_cookies.

Dijkgraaf




msg:1277387
 12:17 am on Jun 17, 2005 (gmt 0)

Hi anshul

You can put cache control into .php pages so that they return a 304 (Not Modified) status code. However, many search bots don't even seem to send the If-Modified-Since header, though some don't send it when getting .htm files either.
Googleguy said it would help with Googlebot. The first page I tested it on didn't work with Googlebot, but I now suspect that was because it is my main page, which is normally reached via a 301 redirect. I'm trying another page now, but results aren't in yet ;-)


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved