Forum Moderators: phranque

Message Too Old, No Replies

Search Engine technology and topology

Multiple database servers


trillianjedi

11:58 am on Aug 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bit complicated this one, but I need some advice regarding building a system similar to a search engine (think crawlers dumping data to a DB, and a front end performing searches for users).

No, I'm not actually building a search engine - what I'm doing is application-centric rather than web-centric, but the topology is near identical so design considerations are essentially the same.

I need something scalable as this will start small, so I don't want to go out and buy an entire datacentre just yet ;-)

It's the database side that has me a little stuck. I can spread the "search" (front end) side load quite easily with load-balancing. Replicating data amongst the load-balanced machines, I assume, means every machine sees an identical "write" load if real-time replication is used, or periodic high load as data is dumped across machines in bulk.

Would I be absolutely nuts to attempt something like this with MySQL, or is there a preferred database type specifically designed for this type of activity?

Thanks,

TJ

iamlost

6:02 pm on Aug 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I strongly recommend PostgreSQL.

Read the About [postgresql.org] and the Advantages [postgresql.org] pages. If this overview looks promising then read further, download and read the documentation, etc.

There is a very active support community if the very good documentation is not quite enough. Unless a client has specified MySQL, I use PostgreSQL.

trillianjedi

6:59 pm on Aug 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks iamlost.

Is that recommendation based on speed considerations, or something else?

TJ

iamlost

8:41 pm on Aug 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As I do not know your actual requirements I can only be general. Should you have a very specific concern by all means shout.

Just like IE vs. FF many people believe in one or the other ... that aside:
* both are great open source relational DBs.
* MySQL has a larger user base and so is generally supported by a greater number of applications.
* PostgreSQL is close to an open-source Oracle, and has for years supported capabilities that MySQL is only now addressing.
* in its basic default mode MySQL is faster. However, once loaded with a similar "advanced" feature set its speed becomes comparable to PostgreSQL's.
* for "stand alone" dynamic HTML (i.e. using PHP) I follow the KISS principle and use MySQL (the additional features of PostgreSQL are unlikely ever to be needed).
* for application use, PostgreSQL's BSD-style license lets it be used however you want, including in closed-source (proprietary) applications. MySQL, under the GPL, carries greater usage restrictions.
* The more complex the DB requirements the greater the value of PostgreSQL.

The two are drawing closer together. To determine which is best, list all your requirements, both now and in future; read both DB specs and see which best supports each. Also read the specs with an eye for previously unthought-of capabilities/requirements that would be useful.

My personal preference (and the complexity of some DBs I design) has led me to love PostgreSQL's capabilities/scalability. In the end, however, what matters is which DB will do the specific job best over the long haul.

plumsauce

5:38 am on Aug 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




iamlost,

that was a very good synopsis of postgresql.

do you have any thoughts when you bring mssql and db2 into the mix? i am particularly partial to mssql, but would like to hear what you have to say.

thanks,

plumsauce

iamlost

4:13 pm on Aug 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The big two DBs are Oracle (Oracle Corp.) and DB2 (IBM). Followed probably by MS-SQL (Microsoft). All three are proprietary systems.

Twenty years ago I played a little with both Oracle and DB2. They have changed greatly since then! Ask me no questions and I will give no silly answers. I have never done anything but port from MS-SQL so I can give little personal opinion there either.

I have used Open Source because it was/is largely free code and I could learn it myself and code-hack to my heart's delight. (Try that with proprietary systems and see what happens ;-))

I have made myself knowledgeable and commercially viable since I went IT-solo a decade ago thanks to WWW:HTML/CSS, Apache, Unix/Linux, MySQL/PostgreSQL, etc. If the world had gone all closed shop I likely would now be a poor, but happy, stained glass artisan.

If you haven't seen it, the Wikipedia article Comparison of relational database management systems gives a useful snapshot comparison (I use it for data to baffle clients when selling my version of possibilities). Note: most DBs are constantly evolving - read their documentation, as they may well have added capabilities since any article was published. And as always - follow the links.

trillianjedi

4:20 pm on Aug 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the help iamlost.

I'm sure txbakers won't mind me linking this, it's an excellent resource and I found it really helpful:-

[en.wikipedia.org...]

In terms of my specific requirement, it's hard to describe without getting into specifics, but it really is almost identical to a search engine (in terms of DB requirement) if you imagine that the SE spiders were crawling and adding data in realtime.

Something that could scale to say 500,000 UPDATE type queries per minute, and about 50,000 SQL type searches per minute (return RESULT SET where somefield = x) would be ideal.

So speed is an issue. The level of traffic can be distributed amongst several machines, although obviously it will start on a single box. So really I need something that will scale from single box to load-balanced cluster fairly easily.
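Put another way, as a rough back-of-envelope (just arithmetic on the numbers above; the node counts are hypothetical):

```python
# Convert the target rates to per-second figures and see what each
# node would need to sustain if the load were spread evenly.
updates_per_min = 500_000
searches_per_min = 50_000

updates_per_sec = updates_per_min / 60    # ~8,333 writes/sec sustained
searches_per_sec = searches_per_min / 60  # ~833 reads/sec

# Hypothetical cluster sizes, from single box upward:
for nodes in (1, 4, 8):
    print(nodes, round(updates_per_sec / nodes), round(searches_per_sec / nodes))
```

Even split eight ways, that is still over a thousand writes per second per machine, so the write path matters more than the read path here.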

TJ

iamlost

9:21 pm on Aug 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Something you might not have considered:

The maximum number of concurrent connections to a database server (often defaulting to 100, set at server start) is usually limited by the operating system, not the DB (certainly not by PostgreSQL; I am uncertain about MySQL). The limiting factor is usually the amount of System V shared memory or semaphores the OS will/can allow.

A useful formula to help estimate maximum shared memory (in kB):
250 kB + 8.2 kB * shared_buffers + 14.2 kB * max_connections
where shared_buffers is an integer at least twice max_connections, and max_connections is the maximum number of concurrent connections required.
Note: as far as PostgreSQL is concerned any total less than infinity is fine!
Note: allow for any other applications that may draw on shared memory.
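A quick sketch of that formula (constants taken from the formula above; the real values vary by PostgreSQL version and platform, so treat this as an estimate only):

```python
# Rough shared-memory estimate in kB for a PostgreSQL server,
# using the rule-of-thumb constants from the formula above.
def shared_memory_kb(max_connections, shared_buffers=None):
    if shared_buffers is None:
        # shared_buffers should be at least twice max_connections
        shared_buffers = 2 * max_connections
    return 250 + 8.2 * shared_buffers + 14.2 * max_connections

# The common default of 100 connections, minimum recommended buffers:
print(shared_memory_kb(100))  # 250 + 8.2*200 + 14.2*100 = 3310.0 kB
```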

The hardware (esp. memory) is often the limiting factor in a DB because it is not usually thought of as part of the "DB design".

The network, the security, the servers, the UI, the DB design, the scripting calls, etc. are each part of the DB system. And each of those parts is composed of equally important sub-parts. Always think system. Always look at identifying requirements and bottlenecks.

Also:

Note: As said previously, I now rarely use MySQL except with web servers, so it may have gained a better transaction-locking process than I remember.

With the high number of updates/queries you expect, the likelihood of collision is high. Like Oracle, PostgreSQL uses multi-version row storage (they call it MVCC: Multi-Version Concurrency Control) to remain responsive in high-volume multi-user environments:


Unlike most other database systems which use locks for concurrency control, Postgres maintains data consistency by using a multiversion model. This means that while querying a database each transaction sees a snapshot of data (a database version) as it was some time ago, regardless of the current state of the underlying data. This protects the transaction from viewing inconsistent data that could be caused by (other) concurrent transaction updates on the same data rows, providing transaction isolation for each database session.

The main difference between multiversion and lock models is that in MVCC locks acquired for querying (reading) data don't conflict with locks acquired for writing data and so reading never blocks writing and writing never blocks reading.
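A toy sketch of the snapshot idea (a hypothetical minimal version store; nothing like PostgreSQL's actual implementation, just the shape of "readers see a snapshot, writers append new versions"):

```python
# Toy multi-version store: each write appends a new version tagged
# with a transaction id; a reader sees the newest version at or
# before its snapshot, so reads never block writes.
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (txid, value)
        self.txid = 0

    def write(self, key, value):
        self.txid += 1
        self.versions.setdefault(key, []).append((self.txid, value))

    def snapshot(self):
        # A reader captures the current txid as its snapshot.
        return self.txid

    def read(self, key, snapshot_txid):
        # Newest version visible to this snapshot, scanning backwards.
        for txid, value in reversed(self.versions.get(key, [])):
            if txid <= snapshot_txid:
                return value
        return None

store = MVCCStore()
store.write("url", "v1")
snap = store.snapshot()
store.write("url", "v2")        # a "concurrent" writer
print(store.read("url", snap))  # the reader still sees "v1"
```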

trillianjedi

9:45 pm on Aug 11, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The limiting factor is usually the amount of System V shared memory or semaphores the OS will/can allow.

Effectively max threads per process? I think it's possible to change that with a custom kernel build. I've heard of kernels built specifically to handle 2048 for custom applications. I think the standard *nix model is 256.

I think I get this MVCC model. I'll look into it a little more.

Thanks for the help - it looks like postgreSQL is high up on the list.

The application itself is actually really very simple; it's purely the scalability that's an issue. I don't want to end up creating something that can't scale. Of course, if the application doesn't work for the client, it won't get used and it will probably remain on a small cluster of machines, or even a single machine. But I'd like the opportunity to scale it, if required, without having to redesign and rebuild the whole thing from scratch in order to obtain that scale.

TJ

plumsauce

4:53 am on Aug 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is one of the better discussions in recent history here. I hope it goes on a little bit longer :)

Something that could scale to say 500,000 UPDATE type queries per minute, and about 50,000 SQL type searches per minute (return RESULT SET where somefield = x) would be ideal.

You might want to look at tpcc.org for some comparative results. When I was looking for something that I thought was impossible, I went there and saw that things had changed while I wasn't looking. Found it, but it cost $7MIL.

The trick is to read the disclosure reports and look carefully at the tuning. You can often pick up a trick or two.

I'd really like to see a hardware vendor sponsor a postgresql test at tpcc.

sja65

3:37 pm on Aug 15, 2005 (gmt 0)

10+ Year Member



If you want to maintain full scalability, make sure you can cluster writes as well. With OSS (either Postgres or MySQL) the easiest way is simply to plan on being able to have multiple database servers. Instead of A-Z in one database, put A-M and N-Z in two different databases (and design your software so it is easy to add more divisions later). Then, as your load increases, you can split these onto two different machines. If you design your software so the data is easy to split, you should be able to avoid maxing out your machine with updates. I think DB2 can do this automagically without you having to modify your application.
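A minimal sketch of that kind of range split (the shard map and server names here are made up for illustration; adding capacity later means splitting a range and updating the map):

```python
# Route each record key to a database server by its leading character.
# Start with two ranges (A-M, N-Z); splitting a range later only
# requires updating this map, not rewriting the application.
SHARD_MAP = [
    ("a", "m", "db-server-1"),
    ("n", "z", "db-server-2"),
]

def shard_for(key):
    first = key[0].lower()
    for lo, hi, server in SHARD_MAP:
        if lo <= first <= hi:
            return server
    raise ValueError(f"no shard covers key {key!r}")

print(shard_for("apple"))   # db-server-1
print(shard_for("orange"))  # db-server-2
```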