|weird Google search URLs and dynamic pages|
things are just messed up
First, my site is entirely dynamic pages done through PHP and MySQL. Right now, only the main pages are being indexed. Also, my PR is 3. Now, I have about 40,000+ pages of content on books/authors/series. Everything is in the format of /book.php?id=18881. It seems that none of these pages are getting scanned by the Googlebot. Should I generate a tiny link at the bottom of index.php that says "listing" or such that contains 40,000 links to all the books? Currently, there are links to all the books, but they are PHP driven as well, so they are not being indexed.
Second (and related):
I have 40,000 pages but only 300 or so indexed by Google. It looks like the bot is attempting to pull up dynamic pages, but they have no content listed on Google, and they are in the format of search.php?item%3DGustaf%2BFr%F6ding
If it does want to pull up searches, they need to be in the format of search.php?item=whatever.
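For what it's worth, an over-encoded link like that typically comes from URL-encoding the whole query string instead of just the parameter value. A minimal PHP sketch (reusing the author name from the example above) of the difference:

```php
<?php
// Sketch: encode only the parameter *value*, never the "=" separator.
// Encoding the whole "item=..." string is what produces item%3D... links.
$item = 'Gustaf Fröding';                            // example author from the thread
$good = 'search.php?item=' . urlencode($item);       // search.php?item=Gustaf+Fr%C3%B6ding
$bad  = 'search.php?' . urlencode('item=' . $item);  // search.php?item%3DGustaf+Fr%C3%B6ding
```

Links built the first way keep the `item=` parameter intact, so the bot sees a normal query string rather than one opaque blob.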
I am looking into using mod_rewrite, but since I don't understand it well, it will be a long while before that is done.
Change the 'id' argument name to something else (like 'title' or whatever). It's probably blocking spidering.
Seconded. Had exactly the same problem in my site, was tearing my hair out playing with session variables and detection scripts but nothing was making a difference.
Turns out that ?id= in the URL was killing me even though it wasn't a session variable. Changed it to ?get= and everything's back to normal.
So the entire problem is having ?id=
I could just change it to ?blah and have no problems?
I'd wager so, although not too sure if it'll solve the second part.
As long as there's no session variable passed in the URL (which your example doesn't have, so I'm assuming they're not passed this way), the solution for me was really that simple. Didn't occur to me for ages that Google might be getting it 'wrong', not me.
I notice it does index many sites that use the .php?id= format, I assume the Bot's a little more aggressive with higher page rank sites.
Also not sure if there's a limit on the number of pages that'll be indexed depending on the PR, but I'm sure that's been speculated on by wiser people than I in the past on this here forum.
The ?id= thingy has been discussed to death since Google started indexing dynamic pages.
This is the kind of information that starts an urban legend.
The characters "?id=" do not prevent Google from spidering your site. As proof, one of my sites has hundreds of pages in that format, and some of those even get the 'fresh' tag from time to time.
What Google doesn't like is a 'session id', and that does not necessarily mean '?id='.
You can have a URL that says ?s=123245454df34df45, and Google may analyze it and presume that the value of that parameter is a session ID, thus it won't spider it.
However, you can have a URL that says ?id=2134, and it would be considered a spiderable page.
How Google determines whether the 'value' of the parameter is a 'session id' is beyond me, but it has nothing to do with '?id=' in the URL. It just happens that, in the past, session ID values were commonly assigned to an "id=" variable.
Thus the misconception that anything starting with 'id=' is a session ID.
If Google started penalizing pages, or refusing to spider pages, because of the 'id=' thingy, a lot of pages would disappear from the index, and those pages have nothing to do with session IDs.
I hope that makes things clearer.
It makes it clearer, but doesn't present me with a solution. I suppose it is back to figuring out the mod_rewrite thing... this might just take a while.
Our company went back and forth on this subject for the past four months... We did so many things to try to make them SE-friendly pages. We analyzed the logs to death, changed some of the links, changed the names, took out ID, etc., etc.
The only thing that worked was rewriting the ? mark to be a / instead. We now have over 40k pages indexed and cached and are very happy now.
So what used to be:
The really weird thing was that another directory was indexed and cached but used the same format.
Those links are still indexed fine in Google. It was always a mystery why those other sections were not friendly to Google, but it's working now!
We are using ASP.net so I don't think the program we used would be right for you. Let me know if you want the name of it though.
Here are the common things observed about dynamic sites being deep crawled, assuming you are not using session IDs.
1. How old is the site? This is possibly related to the PR of the site too. Older sites have a better chance of having their dynamic pages crawled. It does take time to crawl every dynamic page.
The other reason that Google doesn't just unleash gbot on a dynamic site is that it might crash your server, simply because a dynamic page requires extra server processing as opposed to a static page. Imagine if you have thousands of dynamic pages: on a deep crawl, Google can send more than 50 spiders a few seconds apart, continuously. You would really need a dedicated server for such a thing, powerful enough to handle that load.
2. The simpler your dynamic URL, the better the chance it will get spidered: .com?id=123 as opposed to .com?id=123&des=keyword&title=the_best_site&var=whatever. You get what I mean. Too many parameters seem to dissuade gbot from following it through.
If you are using long dynamic URLs, your best bet, if you are running your own server or just have a few dynamic pages, is to use mod_rewrite or its equivalent on Windows.
So everything would look like yoursite.com/param1/param2/param3/etc. This would force gbot to treat your pages as static.
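As a minimal sketch of what that rewrite could look like (assuming Apache with mod_rewrite enabled, and using the /book.php?id= URLs from earlier in the thread; the /book/ path is just an illustration):

```apache
# Sketch: serve static-looking URLs from the real dynamic script.
# A request for /book/18881 is handled internally by /book.php?id=18881,
# so the bot only ever sees a URL without a "?" in it.
RewriteEngine On
RewriteRule ^book/([0-9]+)$ /book.php?id=$1 [L]
```

The browser (and the bot) never see the query string; Apache translates the path back into the `id` parameter before PHP runs.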
Hope that helps
The site is over a year old and has a PR of 3 (ick!). The entire site is dynamic. I am looking at mod_rewrite, but have yet to actually figure its use out. As far as only having a few pages, the site has over 40,000 pages.
by the way, how the heck do you quote on this forum?
I've studied this closely, as I have a site w/ many pages too.. php/mysql - most of them .php?id=something. 39K in Google's index.
Here's what I've observed:
If you have a search page, or search tool, none of your search result pages will ever get indexed unless you link to them yourself from elsewhere with a GET query in the link, and often. So set up your search so that it will accept both a $_GET and a $_POST.
Even then... let's say you have a page/script that shows all the info for one widget but needs that widget ID # as a variable argument. If that page isn't linked to AND relevant in its default state, without the query string, it will never develop PR, although it should get indexed WITH the query string.
Strong internal linking will overcome a lot of the difficulty in getting dynamic pages indexed. I don't rewrite any of my URLs to get rid of the query string, and I do fine in Google. Looks to me like if you have a major page with PR5 and you add a query string to it, it's going to be a PR3 no matter how many times YOU link to it with that query in the link.
If you want to add a page, make the page itself and then make it respond to the $_GET request. Then link to the default state of that page in your main menu, or as much as you can, and link to the page?id=query only when you need to; all those ?id= pages should end up PR2 < Parent page.
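The "accept both $_GET and $_POST" advice above could be sketched like this in PHP (a hypothetical helper; "item" is the parameter name from the thread's search.php URLs):

```php
<?php
// Sketch: let one search script answer both form posts and crawlable links.
function search_term(array $get, array $post): string
{
    // Prefer the GET value so spiderable links like search.php?item=foo work,
    // then fall back to POST so the same script still handles form submissions.
    if (isset($get['item']) && $get['item'] !== '') {
        return $get['item'];
    }
    if (isset($post['item']) && $post['item'] !== '') {
        return $post['item'];
    }
    return '';
}

// In search.php itself you would call: search_term($_GET, $_POST);
```

With this in place, a plain link to search.php?item=foo produces the same page as the search form, so the bot has something it can actually follow.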
Why don't people ever read what Google says on their webmaster pages?
They limit the number of dynamic pages they crawl. These are pages with a ? in them.
The only way you are going to get them indexed is if they don't have a ?.
I mean, the only way to get them ALL indexed.
I've read it a bunch of times...
"Your pages are dynamically generated. We are able to index dynamically generated pages. However, because our web crawler can easily overwhelm and crash sites serving dynamic content, we limit the amount of dynamic pages we index."
That's very nice of them not to fry my server. It also says...
"Fiction: Sites are not included in Google's index if they use ASP (or some other non-html file-type).
Fact: At Google, we are able to index most types of pages and files with very few exceptions. File types we are able to index include: pdf, asp, jsp, hdml, shtml, xml, cfm, doc, xls, ppt, rtf, wks, lwp, wri."
Notice - no php. But I promise you they DO index php files. They also say...
"Make pages for users, not for search engines. Don't deceive your users, or present different content to search engines than you display to users."
Near as I can tell, users can use pages with ?id= in the URL just fine... so if Google cannot, it's their problem to solve. Putting a / in there does not make the page any less dynamic. Google MUST index dynamic content if it is going to be a real search engine. If it didn't, this forum would not be here, and nobody would be wringing their hands over what Google does next.
People here give google too much weight. Would you continue to operate your site if Google dropped you from the index entirely? I would. I operated it before there was google too. So if they aren't indexing my pages, and they should be... it's not really my problem - it's theirs.
|let me be a bit clearer:|
ANYTHING WITH A ? IS NOT GETTING INDEXED!
Now that we are past that point...
Is there any program out there that will design the mod_rewrite commands for me? Yeah, it is probably too much to ask, but I might as well try.
Any explanation for Google not indexing any of my pages with ?id=XXX (never more than 3 digits) until I changed them to ?get=XXX?
I wouldn't have thought that Google would take such a short dynamic URL for a session var, but lo and behold, after changing it, everything works.
I'd certainly not use ?id= again unless I can possibly help it, urban legend or otherwise.
PCG - I don't know of a program that will do it for you. My hosting svc is really friendly, and when I've needed a regex for htaccess or something, they've always been helpful with that. I do have a couple of rewrite rules running in htaccess, but not ones that get rid of the ? - or I'd give you one.
That said, Google will pick up your pages with ? or even ?this=this&that=that in the URL IF the page is valid without the query string, AND you link to it internally that way, AND you link to it with the query string as well.
|People here give google too much weight. Would you continue to operate your site if Google dropped you from the index entirely? I would. I operated it before there was google too. So if they aren't indexing my pages, and they should be... it's not really my problem - it's theirs. |
Right on sister ..... ummmm, brother?
This tool will rewrite your URLs if you want to do them one by one.
I really don't have a clue. It could be just a time factor or it could be something else.
But facts are facts... there are probably millions of pages with ?id= in Google's index; a few of those are mine.
So the argument that '?id= is not spiderable' is without basis.
ID=1234 is seen as a session ID by Google; don't use ID, full stop!
Use mod_rewrite. If you go to the website listed above, it actually tells you how to set up your .htaccess file and upload it to your server; it takes 5 minutes!
Just make sure you are on an Apache server and you have the access rights to do this.
What I have done to make mine SEO friendly is that /product/grp=124 appears as /product/group124.htm
Before, I used .asp pages, and they all crawled well; I got more than 1,500 links in Google, and each .asp link had its own title and description.
Then I changed to .php and did all the same: each single page has its own description, title, and keywords, with no session IDs, no .php?ID=****x, etc. The server, I believe, is very Google friendly, since I have another website on it with plain HTML files that produces over 3,000 links in Google.
Then Google started an update around the time I uploaded the new site, back in Nov 2003, and the links dropped to only 79! To this day they still are not changing. The robots.txt and everything else, I am very sure, are OK; they do crawl the .php pages, but not as much and not as deep as the .asp pages.
The last few weeks seem weird: one day I get to the top (#1 and #7) in the SERPs, then last week they updated the pages again (it says the last time was 20 Feb 2004) and the PR dropped way down to hell! Not even in the top 100 or top 60.
Google even updates the pages in their SERPs every 2 days, but doesn't crawl deep, just the index pages. I read some postings in this forum; others are experiencing the same.
My old pages and websites, which I have not updated or changed for the last 6 months and which were listed at #1 for my search terms, are not affected at all. Neither is my 3-year-old website.
I think Google is experimenting with several new algorithms in their machines; I can't see why else they would dance in all 10 data centers.
I also found a new website that was ranked under mine; they went up to #1, BUT GOOGLE FAILS TO RECOGNIZE that their website uses hidden text at the very top of each of their pages.
I just hope everything gets back to stable, so that if the sites need keyword improvements, everyone can update their sites based on the SERPs.
Just hoping and waiting... (desperately!)
The ID part just might be mistaken by Google as a Session ID or an Affiliate ID, so try to avoid ID.
>> Currently, there are links to all the books, but they are PHP driven as well so are not being indexed. <<
Being PHP driven, in itself, does not make any difference (as opposed to static HTML pages). The PHP script should simply be outputting a page of Valid HTML code. Make sure that your pages are being served with the correct MIME type, and that the HTML code is validated. When pages are assembled from HTML fragments and dynamic content, it is really easy for nesting errors, and so on, to creep into the code. Check out the pages at [validator.w3.org...] for more.
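The MIME-type point above can be made explicit in the script itself. A tiny PHP fragment (only meaningful when run under a web server, so treat it as a sketch rather than a standalone program):

```php
<?php
// Sketch: send an explicit MIME type before any HTML output,
// so the dynamically assembled page is served as ordinary HTML.
// ISO-8859-1 is just an example charset; use whatever your pages actually use.
header('Content-Type: text/html; charset=ISO-8859-1');
// ...then echo the assembled, validated HTML...
```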
madman21, you are my hero. That tool does EXACTLY what I needed. I am now using the following .htaccess:
RewriteEngine On
RewriteRule ^book([0-9]+)\.htm$ /book.php?id=$1 [L]
RewriteRule ^author([0-9]+)\.htm$ /author.php?id=$1 [L]
RewriteRule ^series([0-9]+)\.htm$ /series.php?id=$1 [L]
I am finding this conversation fascinating. I also at one time considered using a mod_rewrite because of difficulty getting pages indexed.
I currently use .asp, .php, .tcf, .taf and .xml.
I ended up getting rid of all session variables and arguments passed until I absolutely needed them (i.e. basket sessions and SSL for form inputs). I didn't need Googlebot to follow these anyway.
Everything I have gets indexed. You do not need a mod_rewrite. You need to pay attention to the args you are trying to pass. Limit as much as you can.
Just my 2 cents.