Forum Moderators: phranque
No, I'm not actually building a search engine - what I'm doing is application-centric rather than web-centric, but the topology is near identical so design considerations are essentially the same.
I need something scalable as this will start small, so I don't want to go out and buy an entire datacentre just yet ;-)
It's the database side that has me a little stuck. I can spread the "search" (front-end) load quite easily with load-balancing. Replicating data amongst the load-balanced machines, I assume, means that every machine carries an identical "write" load if real-time replication is used, or comes under periodic high load as data is dumped across machines in bulk.
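That intuition can be sketched with a back-of-envelope model (the figures below are illustrative, not from any real deployment): under full real-time replication every node applies every write, so adding nodes only divides the read load.

```python
# Toy model of full (real-time) replication: each write is applied on
# every node, so per-node write load does not shrink as nodes are added;
# only the read load is balanced across the cluster.

def per_node_load(total_writes_per_s, total_reads_per_s, nodes):
    """Return (writes/s, reads/s) that each node must handle when all
    writes are replicated everywhere and reads are load-balanced."""
    return total_writes_per_s, total_reads_per_s / nodes

# Doubling the cluster halves per-node reads, but writes stay constant:
print(per_node_load(8000, 800, 2))  # (8000, 400.0)
print(per_node_load(8000, 800, 4))  # (8000, 200.0)
```

This is why write-heavy workloads are usually the hard part of scaling a replicated cluster: the write side doesn't get cheaper per node.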
Would I be absolutely nuts to attempt something like this with MySQL, or is there a preferred database type specifically designed for this type of activity?
Thanks,
TJ
Read the About [postgresql.org] and the Advantages [postgresql.org] pages. If this overview looks promising then read further, download and read the documentation, etc.
Very active support community if the very good documentation is not quite good enough. Unless a client has specified MySQL I use PostgreSQL.
Just like IE vs. FF many people believe in one or the other ... that aside:
* both are great open source relational DBs.
* MySQL has a larger user base and so is generally supported by a greater number of applications.
* PostgreSQL is, in essence, an open-source Oracle - it has, for years, supported capabilities that MySQL is only addressing now.
* in basic default mode MySQL is faster. However, when loaded with a similar "advanced" feature set its speed becomes similar to PostgreSQL's.
* for "stand alone" dynamic HTML (i.e. using PHP) I follow the KISS principle and use MySQL (the additional features of PostgreSQL are unlikely ever to be needed).
* for application use PostgreSQL uses a BSD-style license so can be used as wanted, including in closed-source (proprietary) applications. MySQL, under the GPL, has greater usage restrictions.
* The more complex the DB requirements the greater the value of PostgreSQL.
The two are drawing closer together. To determine which is best list all your requirements, both now and in future; read both DB specs and see which best supports each. Also read the specs with an eye for previously unthought of capabilities/requirements that would be useful.
My personal preference (and the complexity of some DBs I design) has led me to love PostgreSQL's capabilities/scalability. In the end, however, what matters is which DB will do the specific job best, forever.
Twenty years ago I played a little with both Oracle and DB2. They have changed greatly since then! Ask me no questions and I will give no silly answers. I have never done anything but port from MS-SQL so I can give little personal opinion there either.
I have used Open Source because it was/is largely free code and I could learn it myself and code-hack to my heart's delight. (Try that with proprietary systems and see what happens ;-))
I have made myself knowledgeable and commercially viable since I went IT-solo a decade ago thanks to WWW: HTML/CSS, Apache, Unix/Linux, MySQL/PostgreSQL, etc. If the world had gone all closed-shop I would likely now be a poor, but happy, stained-glass artisan.
If you haven't seen it, the Wikipedia article Comparison of relational database management systems gives a useful snapshot comparison (I use it for data to baffle clients when selling my version of possibilities). Note: most DBs are constantly evolving - read their documentation, as they may well have added capabilities since any article was published. And as always - follow the links.
I'm sure txbakers won't mind me linking this, it's an excellent resource and I found it really helpful:-
[en.wikipedia.org...]
My exact requirement is hard to describe without getting into specifics, but it really is almost identical to a search engine (in terms of DB requirement) if you imagine the SE spiders crawling and adding data in real time.
Something that could scale to say 500,000 UPDATE type queries per minute, and about 50,000 SQL type searches per minute (return RESULT SET where somefield = x) would be ideal.
So speed is an issue. The level of traffic can be distributed amongst several machines, although obviously it will start on a single box. So really I need something that will scale from single box to load-balanced cluster fairly easily.
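For a sense of scale, it helps to convert those per-minute targets into the per-second rates most benchmarks and tuning guides use:

```python
# Convert the stated per-minute throughput targets to per-second rates.
updates_per_min = 500_000
searches_per_min = 50_000

updates_per_s = updates_per_min / 60    # sustained write rate
searches_per_s = searches_per_min / 60  # sustained read rate

print(round(updates_per_s), round(searches_per_s))  # 8333 833
```

Roughly 8,300 writes/s against 830 reads/s - a 10:1 write-heavy workload, which is unusual (most web workloads are read-heavy) and worth stating up front when comparing databases.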
TJ
The maximum number of concurrent connections to a database server (default often 100 - set at server start) is usually limited by the operating system, not the DB (certainly not by PostgreSQL; I'm uncertain about MySQL). The limiting factor is usually the amount of System V shared memory or semaphores the OS will/can allow.
A useful formula to help determine maximum shared memory (in bytes):
250 kB + 8.2 kB * shared_buffers + 14.2 kB * max_connections
where max_connections = maximum concurrent connections required, and shared_buffers is an integer of at least twice max_connections.
Note: as far as Postgre is concerned any total less than infinity is fine!
Note: allow for any other applications that may draw on shared memory.
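The formula above is easy to turn into a quick calculator (a sketch; the constants are exactly those quoted, and the default of shared_buffers = 2 × max_connections is the minimum the formula's note recommends):

```python
# Estimate the System V shared memory a PostgreSQL server needs, in kB,
# using the formula quoted above:
#   250 kB + 8.2 kB * shared_buffers + 14.2 kB * max_connections
def shared_memory_kb(max_connections, shared_buffers=None):
    if shared_buffers is None:
        # minimum recommended: at least twice max_connections
        shared_buffers = 2 * max_connections
    return 250 + 8.2 * shared_buffers + 14.2 * max_connections

# e.g. the common default of 100 connections:
print(round(shared_memory_kb(100)))  # 3310
```

So a default-ish server wants on the order of 3-4 MB of shared memory - tiny by modern standards, but remember the note above about other applications drawing on the same pool.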
The hardware (esp. memory) is often the limiting factor in a DB because it is not usually thought of as part of the "DB design".
The network, the security, the servers, the UI, the DB design, the scripting calls, etc. are each part of the DB system. And each of those parts is composed of equally important sub-parts. Always think system. Always look at identifying requirements and bottlenecks.
Also:
Note: As said previously, I now rarely use MySQL except with web servers, so they may have added a better transaction-locking process than I remember.
With the high number of updates/queries you expect, the likelihood of collision is high. Like Oracle, PostgreSQL uses multi-version row storage (they call it MVCC: Multi-Version Concurrency Control) to remain responsive in high-volume multi-user environments:
Unlike most other database systems which use locks for concurrency control, Postgres maintains data consistency by using a multiversion model. This means that while querying a database each transaction sees a snapshot of data (a database version) as it was some time ago, regardless of the current state of the underlying data. This protects the transaction from viewing inconsistent data that could be caused by (other) concurrent transaction updates on the same data rows, providing transaction isolation for each database session. The main difference between multiversion and lock models is that in MVCC locks acquired for querying (reading) data don't conflict with locks acquired for writing data and so reading never blocks writing and writing never blocks reading.
The limiting factor is usually the amount of System V shared memory or semaphores the OS will/can allow.
Effectively max threads per process? I think it's possible to change that with a custom kernel build. I've heard of kernels built specifically to handle 2048 for custom applications. I think the standard *nix model is 256.
I think I get this MVCC model. I'll look into it a little more.
Thanks for the help - it looks like PostgreSQL is high up on the list.
The application itself is actually very simple; it's purely the scalability that's an issue. I don't want to end up creating something that can't scale. Of course, if the application doesn't work for the client, it won't get used and it will probably remain on a small cluster of machines, or even a single machine. But I'd like the opportunity to scale it, if required, without having to redesign and rebuild the whole thing from scratch.
TJ
Something that could scale to say 500,000 UPDATE type queries per minute, and about 50,000 SQL type searches per minute (return RESULT SET where somefield = x) would be ideal.
You might want to look at tpcc.org for some comparative results. When I was looking for something that I thought was impossible, I went there and saw that things had changed while I wasn't looking. I found it, but it cost $7 million.
The trick is to read the disclosure reports and look carefully at the tuning. You can often pick up a trick or two.
I'd really like to see a hardware vendor sponsor a PostgreSQL test at tpcc.