|Can Google crawl into a database?|
I am currently building a website that will be mainly database driven. It will contain forums, message boards and blogs. As this site is going to be open to the public, the client has requested that all information within the database relating to forums, message boards and blogs can be found using search engines such as Google, AltaVista, Yahoo, etc.
What I really need to know is: can Google and the other search engines crawl into the database to get at the information being searched for?
If they can, how do they do it, and is there anything I need to do to make it easier?
The client "wants" all dynamically generated pages containing forums, message boards and blogs to also be stored as HTML (I am suddenly seeing millions of web pages! argh!).
This is, as you can imagine, a mind-blowing thought, but he has managed to convince my managers that this is the way to do it.
I really need a good, definitive answer to this, so I can go back to my managers and tell them.
Any other help and advice would be great, thanks.
Googlebot does not have direct access to your database, only to the pages that are generated by server-side code (ASP, PHP, etc.) that pulls data from the database.
In order to ensure that these pages are spidered effectively, you need to take a crash course in basic Search Engine Optimisation, as this is the key to ensuring that Google and other spiders are able to index all your pages.
Make sure that you read the search engine's webmaster guidelines thoroughly before creating your site, think long and hard about navigation, and once built, create an XML sitemap and submit it to Google.
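To make the sitemap step concrete, here is a minimal sketch in Python (the URLs and function name are invented for illustration, not taken from any real site):

```python
# Minimal sketch of generating an XML sitemap from a list of page URLs.
# A real site would pull the URL list from its database of threads,
# profiles, blog posts, etc.
from xml.etree import ElementTree as ET

def build_sitemap(urls):
    # Namespace required by the sitemaps.org protocol
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for u in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = u
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([
    "http://www.example.com/topics/92834.html",
    "http://www.example.com/topics/92835.html",
])
```

The resulting string can be written out as sitemap.xml and submitted through Google's webmaster tools.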
Then get some decent, on-theme links from related sites, and wait for the search engine spiders to find your site.
Bear in mind that it can take some time to index millions of pages, so be very patient. It can take weeks, months or even years to index a site in full.
Of course, bear in mind usability throughout this process. Ensure you design your site for your users, and not for the search engines. Put on a white hat and don't take it off. Ever ;)
I would venture that most if not all of the large sites out there are database-driven... So yes, it's definitely possible...
For starters, while you can store a physical page for every page on your site, you'll probably just want to use some sort of page template with database calls. It's all quite easy and elegant. You can easily host "millions" of pages with just a handful of page templates. If there's something you want to change about a page - you change it in one place and it applies to all the instances of that page immediately.
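A hedged sketch of that template approach, using an in-memory SQLite table to stand in for the real forum database (the table, column and function names are all invented):

```python
# One template, many pages: the page is assembled on request by
# filling database values into a single template string.
import sqlite3

TEMPLATE = "<html><head><title>{title}</title></head><body>{body}</body></html>"

def render_thread(conn, thread_id):
    row = conn.execute(
        "SELECT title, body FROM threads WHERE id = ?", (thread_id,)
    ).fetchone()
    if row is None:
        return None  # real code would serve a 404 here
    return TEMPLATE.format(title=row[0], body=row[1])

# In-memory stand-in for the site's database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute("INSERT INTO threads VALUES (1, 'Hello', 'First post')")
page = render_thread(conn, 1)
```

Changing TEMPLATE once restyles every thread page on the site, which is the point being made above.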
If you want to enable robots to crawl your site just think like a robot and try to avoid things that would be obstacles for robots. The biggest thing is that all pages should be visible - nothing should be hidden behind a form submission or require a POST.
If the structure of your site is search-based for humans, then create index pages on your site specifically for robots, but make sure to keep them human readable or else you might get in trouble. You'll also want to create an XML-based sitemap for Google. In certain cases the index should also include the URLs for likely search terms.
Matt Cutts has also said to avoid 'id' as a parameter (apparently Google won't index pages that use id as a parameter) and keep the number of parameters passed to the page to 1 or 2.
So, in the end, your URL for one of your database-driven pages will look something like these:
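A small illustrative helper along those lines (the base URL and parameter names are made up), which enforces the one-or-two-parameter, no-'id' advice when building links:

```python
# Sketch of the URL advice above: keep to at most two query
# parameters and avoid calling any of them 'id'.
from urllib.parse import urlencode

def page_url(base, **params):
    assert len(params) <= 2, "keep the parameter count low for spiders"
    assert "id" not in params, "avoid 'id' as a parameter name"
    # Sort for a stable, canonical parameter order
    return base + "?" + urlencode(sorted(params.items()))

url = page_url("http://www.example.com/thread.php", t=92834)
```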
Thanks for this, guys. I'm still not clear on how to go about it, but I will find out through research.
Once again thanks.
|Can Google crawl into a database? |
I hope not...
Basically I'd say there are two ways of doing it. One would be to use dynamic, idempotent URLs like the examples given by jay5r. You'd only have scripts like profile.php or thread.php to maintain, plus, as suggested, some files like sitemap.php or sitemap.xml to guide the spiders to all the pages to be indexed. If you really are looking at millions of pages, you'd probably have to split those sitemaps (for length, performance, and PR inheritance).
Another way would be to perform file operations after every user action and generate an .html file for the thread or other information in question.
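A rough sketch of that second approach (paths and markup are invented): each user action simply rewrites the thread's static file, so spiders only ever fetch plain HTML:

```python
# After every post, regenerate the thread's static .html file.
import os
import tempfile

def publish_thread(out_dir, thread_id, title, posts):
    body = "".join("<p>%s</p>" % p for p in posts)
    html = "<html><head><title>%s</title></head><body>%s</body></html>" % (title, body)
    path = os.path.join(out_dir, "%d.html" % thread_id)
    with open(path, "w") as f:  # overwrite on every update
        f.write(html)
    return path

out_dir = tempfile.mkdtemp()
path = publish_thread(out_dir, 92834, "Hello", ["First post", "Reply"])
```

The trade-off is disk usage and write traffic on every user action, in exchange for serving plain files with no parser involved.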
I think there is a limit on the number of parameters (concatenated by '&') beyond which search engines refuse to index "dynamic" URLs.
[edited by: Oliver_Henniges at 6:55 pm (utc) on July 18, 2006]
There are many pitfalls for the unwary.
Make sure that every page of content has only one URL that can get to it. Too many packages have multiple URLs for the same content (see my posts here about popular forum software just a few months ago, for example).
Make sure that all parameters are always in the same order, that there are fewer than three of them, and that the URL does not contain anything that is (or even merely looks like) a session ID.
Make sure that all login, private message, and administration URLs are prevented from being indexed. You do not want thousands of "Error. You are not logged in." pages appearing in the index.
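One way to do that last part, sketched here with invented paths (adjust to the actual script names), is a robots.txt file that disallows those areas. Note that robots.txt only discourages crawling; adding a noindex meta tag to those pages as well is safer:

```
User-agent: *
Disallow: /login.php
Disallow: /privmsg.php
Disallow: /admin/
```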
I understand exactly what you need, and the solution is simple. I do it successfully with my database-driven sites. However, it's a lot of work, and typical forum scripts like phpBB are not entirely suitable.
The point is that you need all the dynamically generated pages of the site indexed in Google. So, in order to obtain perfect results, each page should have:
- a unique TITLE tag properly describing its content
- a unique META description tag relevant to its content (or no META description at all; but avoid the duplicate META descriptions some scripts generate)
- a static-looking URI: instead of /viewtopic.php?t=92834 it should be at least something like /topics/92834.html
- no duplicate content: no more than one URI for each page
- links from several other pages, so that in terms of site structure each page is no more than a few levels deep from the main page
- but avoid putting too many outbound links on one page
- external links pointing not only to the main page but also to some of the more important deep pages
To achieve all this on a dynamic site, you probably need to write a custom script engine for it. I did it this way, as I am a CGI/PHP developer, so it's my job. Forcing non-optimised scripts to behave like this can be difficult, but is perhaps possible.
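As one hedged illustration of such a custom engine (all names invented), the static-looking URI from the list above can be mapped back to a database key with a single routing rule:

```python
# Map /topics/92834.html back to a database lookup, so every
# thread has exactly one canonical, static-looking address.
import re

TOPIC_RE = re.compile(r"^/topics/(\d+)\.html$")

def route(path):
    m = TOPIC_RE.match(path)
    if m:
        return ("show_topic", int(m.group(1)))
    return ("not_found", None)
```

The handler named by the first tuple element would then pull the row from the database and render the page, exactly as a viewtopic.php?t=92834 URL would have.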
What I have said resembles SEO, but I didn't say anything about keywords in titles, paths and content, because what is to be achieved here is getting the site indexed, not ranking highly on specific phrases. The game begins when you need to add keyword optimisation for dynamic content, and that is the point where the algorithms of the engine become even more sophisticated.
[edited by: Wizard at 5:23 am (utc) on July 19, 2006]
Good feedback Wizard...
I developed and administer a site that sells rights-managed medical illustrations. It has over 8000 images for sale and about 15000 unique search terms. It's a bit of a study in what can go right and what can go wrong when you put up a dynamic site.
Two page templates get something like 93% of the site's traffic - the search results page and the image detail page. The URL for the image detail page isn't quite as optimized as you suggested, but I'm not sure how much I'd gain by changing it and putting 301s in place to migrate the old URLs. It currently looks something like:
The search results page URL is where things get a bit tricky... Since the site is search-based from the user's perspective, all I can do is give robots and spiders a list of URLs for search terms, but there are logically subtle yet valid ways to request search terms that might return the same content...
The first two and the last four are (or should be) identical, but the difference between the first two and the others is whether I do an exact search on the phrase or a starts-with search on the phrase. The latter might make more sense with a phrase like 'heart', where a starts-with search would also return results for 'heart attack' and 'heart disease' as well as 'heart'.
Doing http://www.example.com/image/brain-tumor.htm hadn't really occurred to me, but it would seriously complicate things given the search feature set I've set up to support users (which allows them to put + and - in front of terms and double quotes around phrases). Since Google has all those terms indexed the way I set them up originally, and I can't do 301s to real users who search for things the old way, I think I'm stuck with things the way they are. Though it is a good idea as I set up new sites for other collections of images.
One of the things I'm struggling with at the moment is how to avoid the appearance of duplicate content in the situations above. In some cases I'd think the search engines would know that + is the same as %20 or a " the same as %22, but I'm starting to think I need to spell it out for them with 301s... And then there's the question of whether to give them a list of exact searches or starts with searches - both are valid, but in some cases they result in "duplicate content"...
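A sketch of one way to do that spelling-out (the function name is invented): decode each query string, then re-encode it in exactly one canonical form, and 301 any variant to the canonical URL:

```python
# Collapse the '+', '%20' and '%22' variants of a query string
# into a single canonical encoding, so duplicates can 301 to it.
from urllib.parse import parse_qsl, urlencode

def canonical_query(query):
    # parse_qsl treats '+' and '%20' identically; urlencode then
    # re-encodes everything one consistent way (spaces as '+').
    pairs = sorted(parse_qsl(query, keep_blank_values=True))
    return urlencode(pairs)
```

An incoming request whose raw query differs from canonical_query(raw) would get a 301 to the canonical form; requests already in canonical form are served normally.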
Google goes crazy indexing the site - they report over 330,000 indexed pages (I honestly don't know how they got that many pages off the site). Googlebot pulled down over a half million pages and images last month taking nearly 14GB of bandwidth. By comparison, Yahoo! (about 20,000 indexed pages) only pulled down a tenth of that - about 50,000 pages and images taking up 1 GB of bandwidth. MSN has not picked up the site much at all (350 indexed pages), and only pulled down 4,000 pages and images last month (125 MB).
In terms of visitors from the big three - Google Images gives us 75% of our organic traffic, Google 19%, Yahoo! 4%, and MSN 0.2%.
But in our case it's a bit of a needle in a haystack scenario - more organic traffic doesn't give us all that many more paying customers - it gives us more kiddies doing homework assignments. What gives us additional customers are some of the marketing efforts of my client's sales and marketing teams. At the same time some of those doing homework assignments are med students who will become customers in time and the organic search is helping with branding the product for that group.
So obviously Google loves us, Yahoo! is OK with us, and MSN barely knows we exist (though that's about to change - MSNDude has seen the site recently, likes it, and is trying to figure out why it didn't get indexed properly).
All that said, here's what I have done to get those results (good or bad)...
1) Pages are titled properly
2) Images are titled properly
3) There are search terms on every page and the terms are linked to search results, but with some of the problems mentioned above.
4) I provided a crawlable index of images and search terms, and a sitemap for Google the last few months.
5) We do have some deep external links into the site, though we haven't really tried all that hard to get them.
6) The site structure is pretty flat.
7) There's lots of cross-linking within the site.
I don't use a meta description tag (yet).
One humorous example of something that went wrong with our SEO (the site isn't letting me use the term that makes this sorta funny - so you'll have to use your imagination)... Our image detail pages used to be titled something like "Site Title - Image Title - Image ##" (e.g. "<Example> Medical Illustrations - <private body part> - Image 2969"), but when we did that we were suddenly getting a lot of traffic on the phrase "<private body part> image". It was traffic we didn't really want, since we were sure that people searching on that term were, more often than not, looking for something completely different than the dissected <private body part> illustration we were giving them. So we dropped the "Image ##" portion from the title (it wasn't contributing anything anyway) and searches on that term went down to a more acceptable level.
Wizard, did you come across any performance issues letting all your /topics/12345.html pages run through the PHP parser? Or do you generate them as static files?
|Can Google crawl into a database? |
No, but roaches can. Make sure you have a good pest control service spray your server room every three months.
>> In some cases I'd think the search engines would know that + is the same as %20 or a " the same as %22 <<
Don't count on it. If the URL is different (as in, it is not exactly the same) then that is yet another potential source of duplicate content.