Does Googlebot have trouble crawling sites built with PHP? It shouldn't, as PHP is server-side, not client-side, programming...
The other thing you can do to encourage spidering is to increase your inbound links. Also, try spidering the site yourself to be sure bots can navigate the site (use Xenu, for example). Check your robots.txt, too, to be sure you aren't inadvertently telling Googlebot to go away.
Totally disagree.
Google does not discriminate between file extensions. It does prefer simple structures, though. mod_rewrite should be used to simplify complex query-string navigation.
Example:
www.example.com/index.php?section=computers&subsection=monitors&brand=hp
may not be crawled as well as
www.example.com/computers/monitors/hp/
which can be served by the same file using mod_rewrite.
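For anyone who wants to see what that looks like, here is a rough .htaccess sketch. It assumes Apache with mod_rewrite enabled, an index.php in the same directory, and the three parameter names from the example URL; adjust the pattern to your own URL layout.

```apache
# Sketch only: assumes mod_rewrite is enabled and .htaccess overrides are allowed.
RewriteEngine On

# Leave real files and directories alone.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

# /computers/monitors/hp/ -> index.php?section=computers&subsection=monitors&brand=hp
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/?$ index.php?section=$1&subsection=$2&brand=$3 [L,QSA]
```

The rewrite is internal, so the visitor (and the spider) only ever sees the short URL.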
Use mod_rewrite to change the URL.
Everyone keeps suggesting this but I think it is bad advice. This goes in line with the 'creating pages for search engines rather than users' problem. Google doesn't have a problem indexing .php pages, and if your linking structure is sound (internal and inbound) then you've got nothing to worry about.
The only reason you really need to use mod_rewrite is if you have terrible inbound links and the only other thing you can think of is to change what Google sees as your URL. It doesn't affect whether Google lists your page or how it ranks it.
Some Google aficionados will probably notice that we're doing better on dynamic urls too. We always crawled dynamic urls with one or two parameters, but recently we've started to loosen those restrictions a little more.
Yes, I did use Google spellcheck for aficionado. I've gotten to where I just type something close like aficiando cuz I know it'll find it. :)
.cgi & .cgi?param= are in the Google index right now and show on results pages.
The reason you won't see many is because most people have a templated robots.txt file which excludes the cgi-bin or they use the cgi bin for external programming which will not be crawled by Googlebot ... yet ...
widget1234.php?widgetshopcode=4321&session=ef25afda4d8b437182ada0d82a60d68e
Google is picking them up, no problem. It's just dropping the session id variable at the end, which is fine.
If you don't have a session id, my site generates one for you and adjusts all the URLs accordingly.
There is no need to use mod_rewrite. One of my new widget product pages was placed around 2 weeks ago, and Googlebot has grabbed it. No problems.
ThreeQuarks, that's interesting - based on your report, it sounds like Google is learning how to distinguish session IDs from real query parameters. Nevertheless, based on past experience, I'd still avoid taking chances by feeding GB session IDs.
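If you would rather not rely on Google dropping the session ID for you, one option is to strip it yourself when building links. A minimal sketch follows, assuming the parameter is called session as in ThreeQuarks' URL. The function name strip_query_param is made up for illustration.

```php
<?php
// Hypothetical helper: remove one query parameter (e.g. a session ID) from a URL.
// Note: any #fragment after the query string is dropped by this simple version.
function strip_query_param($url, $param)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // nothing to strip
    }

    // Split the query string into an array and drop the unwanted key.
    parse_str($parts['query'], $query);
    unset($query[$param]);

    // Everything before the first "?" is kept unchanged.
    $base = substr($url, 0, strpos($url, '?'));

    return $query ? $base . '?' . http_build_query($query) : $base;
}
```

Called as strip_query_param($url, 'session'), it returns the same URL without the session variable and leaves the other parameters alone.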
There is no need to use mod_rewrite.
Yes there is. Think about the user first. Those long URI strings are not friendly at all. They usually break in certain email programs.
In addition to that, you don't want the user bookmarking session IDs or sending those session IDs as links to someone else.
Also, Google is not the only search engine. Other bots are not as smart as Googlebot is and those long query strings are going to be major roadblocks for getting content indexed.
Think outside the G! ;)
There are many good online tutorials that explain a simple way to make your URLs SE friendly with PHP... sticky me if you want the address...
On another note, it’s also a good idea to turn off PHP sessions (if you use them) for spiders. I’ve found a simple way to do this is by using the following code.
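<?php
// Start a session only for clients whose User-Agent contains "Mozilla".
// $_SERVER is used so the check does not depend on register_globals.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match('/Mozilla/i', $userAgent)) {
    session_start();
}
?>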
In other words, sessions will only start if the browser's HTTP_USER_AGENT contains “Mozilla” somewhere in the identifier. All versions of Netscape/Mozilla and Internet Explorer (and even Opera) do this.
All versions of Netscape/Mozilla and Internet Explorer (and even Opera) do this.
It depends on the User-Agent setting within Opera. If Opera is set to report itself as Opera (like mine is), the User-Agent string is something like
Opera/7.23 (X11; Linux i686; U) [es]
or
Opera/7.23 (Windows NT 5.1; U) [en]
Jon.
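Given the Opera case, it may be safer to turn the test around: start sessions for everyone except clients that look like spiders. Keep in mind that some crawlers also put "Mozilla" in their User-Agent (Googlebot's current string does), so matching on "Mozilla" alone can let spiders through as well. A rough sketch follows. The function name and the token list are illustrative assumptions, not a complete list of bots.

```php
<?php
// Hypothetical helper: true when the User-Agent contains a typical crawler token.
// The token list is illustrative only; extend it for the bots you care about.
function looks_like_spider($userAgent)
{
    $botTokens = array('bot', 'crawler', 'spider', 'slurp');
    foreach ($botTokens as $token) {
        // stripos() is a case-insensitive substring search.
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Assumption: a missing User-Agent is treated like a spider (no session).
if ($userAgent !== '' && !looks_like_spider($userAgent)) {
    session_start();
}
```

With this version an Opera string such as the ones above still gets a session, while anything identifying itself as a bot does not.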
If your php code looks like this:
ht*p://www.yourdomain.com/store/index.php?action=item&substart=0&id=85
Google won't follow it.
Uh, yes they will, unless your parameters are crazy long.
Actually, what happens is that they will follow only a certain number of pages that have parameters, to avoid ending up in an endless loop.
Good point, pageoneresults. If you want your site to be crawled by as many search engines as possible, things like static urls always help, and if you can't do that then fewer parameters probably help.
Long GET strings are not very spider friendly.
Long GET strings are not very user friendly.
Spider friendly == user friendly.
GET strings are just easy and make life easy for programmers who have trouble understanding the tenets of spiderability or just don't care. Those may be the same people who think "If I build it they will come".
Most of the time GET strings can be avoided, or used only on pages you don't care about being spidered. If not, rewrite them; it helps your users and helps the spiders.
your choice