I'm also more interested in the technical aspects of your server performance, Markus.
Any threads you wanted to start on that subject would, I'm sure, be well supported...
TJ
"Results 0 to 20 out of 95 results are shown below"
I can't stop laughing, lol. Really, do you get it?
Please, tell me you see the funny part.
This guy is zero-based, like a real programmer! :) :) :)
(BTW, there are 20 results shown, not 21, just as there would be in a zero-based array.)
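For anyone who missed the joke: in the zero-based, half-open convention programmers use, a range written "0 to 20" covers exactly 20 items (indices 0 through 19). A quick sketch with a hypothetical 95-item result set:

```python
# Hypothetical 95-item result set, sliced the zero-based way
results = [f"result_{i}" for i in range(95)]

page = results[0:20]       # "0 to 20": start inclusive, end exclusive
print(len(page))           # 20 items, indices 0 through 19
print(page[0], page[-1])   # result_0 result_19
```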
1. Approving and editing 20,000 images/day
2. Blocking 1,000 Nigerian/Russian scams, escorts, etc. per day
3. Blocking fake accounts, troublemakers, etc.
4. 100k+ edits/modifications per day
5 to 10% of Yahoo's and the industry's total signups are scams, escorts, etc. I'd guess these guys steal on the order of $100-400 million per year from the industry.
One question from me:
AI, as in artificial intelligence?
The only bit I know about this is that a PHP-based forum site can install an AI system where an artificial "member" answers questions, etc.
I'm showing my ignorance on the matter, I know....
Can anyone enlighten us on what AI means in Markus' case?
There are tools around which can, for example, recognise flesh tones in images; with those you can spot porn images with about 60% accuracy. If you have a long memory, you may remember an academic project about 10 years back that scouted the Web for porn images to see whether it could be done automatically, and it was quite good!
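A minimal sketch of that flesh-tone idea, using a classic rule-of-thumb RGB skin test; the thresholds and the 40% flagging cutoff here are illustrative assumptions, not what any real filter uses:

```python
def looks_like_skin(r, g, b):
    """Crude RGB skin-tone test; thresholds are illustrative, not tuned."""
    return (r > 95 and g > 40 and b > 20
            and r > g and r > b
            and (r - min(g, b)) > 15)

def skin_ratio(pixels):
    """Fraction of (r, g, b) pixels that pass the skin-tone test."""
    if not pixels:
        return 0.0
    hits = sum(1 for (r, g, b) in pixels if looks_like_skin(r, g, b))
    return hits / len(pixels)

def flag_for_review(pixels, threshold=0.4):
    # Flag images whose skin ratio exceeds a (hypothetical) cutoff
    return skin_ratio(pixels) > threshold
```

A real detector would combine colour with shape and texture features, which is presumably why pure colour tests top out around the accuracy Damon mentions.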
Another clue: most scammers are bloody stupid. BTW, you can recognise them not by the content of their posts, etc., but by the great big boots and clown costume they are wearing when they come to your door. In technical terms, a big clue about dodgy input is "out of band", i.e. not within the message at all.
For example, my spam filter rejects 10,000 spams a day and lets in the ten-or-so ham messages by checking simple things like valid return addresses and whether the senders are on any block lists, not by checking the content of the mail at all. The few that get through I can dump very easily by eye!
This is what I call "AI", though once you have turned it into working code it seems much less "I".
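A sketch of that "out of band" idea: judge the envelope, not the content. The block list here is a hypothetical stand-in for a real DNSBL lookup:

```python
import re

# Hypothetical stand-in for a real DNS block list lookup
BLOCKLIST = {"spamhost.example", "bulkmail.example"}

def envelope_looks_valid(sender):
    """Cheap out-of-band checks: a syntactically valid return address
    and a sending domain that is not on a block list. The message body
    is never inspected at all."""
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", sender):
        return False
    domain = sender.rsplit("@", 1)[1].lower()
    return domain not in BLOCKLIST

print(envelope_looks_valid("alice@example.org"))        # True
print(envelope_looks_valid("bounce@spamhost.example"))  # False: blocklisted
print(envelope_looks_valid("no-return-address"))        # False: malformed
```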
Rgds
Damon
If it would not compromise your competitive position, would you be so kind as to share which database you are using on your DB server? PostgreSQL, Microsoft's SQL Server, Oracle, DB2, or something else/homegrown?
I'm especially interested in what you think of PostgreSQL 8.x (and on which OS), since I'm planning to pick one database to spend a lot of my free time really learning the nuances of. I want a good one that I can afford to deploy, and one with as little grief and as few limits as I can luck into for some "find the needle in the haystack" (hobby) research projects I'm planning.
And one thing I was really wondering about is if your queries to the database are in some version of SQL or use some other access language/mechanism? Again, I ask only if this wouldn't affect your competitive position...else please don't answer.
Lastly, what operating system is your database server running on? I think your web server is a Microsoft OS (according to netcraft.com), but I'm kind of an OS junkie (and like the BSDs), so I was wondering. BTW, have you checked out Solaris 10's ZFS yet? It sounds very promising to me.
And on the subject of optimization, have you ever played around with DTrace? If not, you may find it a very useful tool to see **exactly** where your code, while running live in production and without affecting runtime performance, is using the various resources of the hardware and OS. I'll bet it would be right up your alley, Markus.
IIRC, DTrace currently only runs on Solaris 10, but there is an effort underway to port it to FreeBSD. Some of the Sun engineers have pretty interesting blogs on the new features of Solaris 10 that sound very interesting to me. IIRC, ZFS is supposed to be included in the version of Solaris 10 scheduled to ship in May of this year.
But of all the above, DTrace sounds like something you might absolutely LOVE, so I wanted to mention it on the off chance you had not checked it out yet.
Here are some links you may find interesting:
On DTrace:
[sun.com...]
Bryan Cantrill's blog at Sun; he is one of the DTrace developers:
[blogs.sun.com...]
A couple of overviews/reviews of some of the new features of Solaris 10 (I'm not connected to Sun or Solaris in any way, and I don't benefit from any of these links, BTW):
[softwareinreview.com...]
and an older, but still thought-provoking, review:
[thejemreport.com...]
Again, thanks, Markus, for sharing as you have. I really appreciate it and am very grateful.
Thank you,
Louis
P.S. Again, I hope these questions are OK.
8,000 images per day are not able to be checked programmatically
You know, Google engineers like to automate everything and to solve problems not by hand but by continually improving their algorithms.
Another example is Bayesian filtering for spam: email filters that learn from you a little more every day, simply from the small amount of human input you give by marking the few doubtful messages. That's a basic kind of AI (software that learns), and it works wonderfully.
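A toy sketch of that learn-from-your-marks loop, i.e. a naive Bayes word filter with add-one smoothing. It is illustrative only; real filters such as SpamAssassin or SpamBayes do considerably more:

```python
from collections import Counter
import math

class TinyBayes:
    """Toy Bayesian spam filter: learns word frequencies from labelled mail."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, text, is_spam):
        # The "human input": each message you mark updates the word counts
        words = text.lower().split()
        if is_spam:
            self.spam.update(words)
            self.n_spam += 1
        else:
            self.ham.update(words)
            self.n_ham += 1

    def spam_score(self, text):
        # Log-odds with add-one smoothing; positive means "more like spam"
        score = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for w in text.lower().split():
            p_spam = (self.spam[w] + 1) / (sum(self.spam.values()) + 2)
            p_ham = (self.ham[w] + 1) / (sum(self.ham.values()) + 2)
            score += math.log(p_spam / p_ham)
        return score

f = TinyBayes()
f.train("free money win now", is_spam=True)
f.train("meeting notes at noon", is_spam=False)
print(f.spam_score("free money") > 0)   # True: looks spammy
```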
Perhaps Markus is doing some things in a similar line to those, given what he said:
I have a sort of an AI that I built that handles the site for me. When you've got 2 years of steady growth, you can build something super fancy to automate problems as they come up.
I have a site that does about 80,000 dynamic page views a day, and there's no way I could run gzip compression on them on the fly without totally taking out the processors (1 dual-proc web server hitting 1 dual-proc DB). I know IIS 6 has gzip embedded in its core, but this is something I've researched high and low, and I even had an MSFT employee try to help, but no dice. I even lowered the compression level to 3 and it still totally taxed the procs.
What's the secret?
As far as the success of the site goes, it's truly impressive and a good source of inspiration. Plus it's actually a legitimate site helping people better their lives... and it's free. No better model than that.
I'm just amazed at the low hardware usage. The DB Server I can understand, but the web servers, that's just impressive. What load balancer are you using?
Thanks,
Chip-
[edit]Just checked: if I have your domain correct, your gzip isn't turned on, FYI[/edit]
My gzip is turned on; when I turn it off, my bandwidth doubles.
As for why it doesn't work for you, I don't know what to tell you. I went into the metabase, added aspx and html to the compression settings, and then just followed the rest of the tutorial for turning on compression on Microsoft's site. You're not trying to compress images on the fly, are you?
It's such a bandwidth saver; I hate not being able to run with it. MSFT said there was nothing wrong with the setup, just that too much dynamic content running through gzip will eat CPU cycles. Thanks, MSFT, lol. ;) Maybe it's time to look into some newer CPUs with more L2 cache. I'm still running on the good ole P3s.
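The level-vs-CPU tradeoff here is easy to see in miniature. This sketch just times Python's gzip at a few compression levels on a made-up dynamic page; it stands in for IIS's compression setting rather than showing how IIS does it, and the numbers will vary by machine:

```python
import gzip
import time

# Stand-in for a dynamic HTML page: repetitive markup compresses well
page = b"<html><body>" + b"<tr><td>profile row</td></tr>" * 2000 + b"</body></html>"

for level in (1, 3, 6, 9):
    start = time.perf_counter()
    packed = gzip.compress(page, compresslevel=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(page)} -> {len(packed)} bytes in {elapsed_ms:.2f} ms")
```

On highly repetitive HTML the size difference between level 1 and level 9 is often small, which is why dropping the level (as Chip did) mostly trades away CPU time, not bandwidth.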
Chip-
I know you said you have 70-100 Mb/sec of traffic and several servers you built and maintain. I was wondering: do you run them from your house, or did you build them and send them to a company that provides the bandwidth you need?
While you are doing well and have gotten by on the cheap with your hardware, you must be paying a few bucks for that bandwidth!
I don't use a load balancer; I've only got one web server serving up the site.
You may want to continually mirror/clone your web/DB/image servers, if not from a load balancing/sharing point of view, then at least for improving your fault tolerance.
Unless I have missed something obvious, your setup (as you've described it above) does seem to have several single points of failure. Not being critical of your setup, just trying to give back to this excellent thread you have contributed so much to.
Apologies if this has been discussed already and I've missed it.
It must be hard being a one-man show however rich. No vacations, no long weekends and no illness.
I mirror all my images across my servers; I found that using RAID is the worst possible thing you can do. When you have millions of images, you can't read more than a few hundred a second and it all crashes. If you just mirror the images via software, you can serve a few thousand a second. This is using Windows. I suspect RAID can't handle bidirectional IO on small files under heavy load.
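A minimal sketch of software mirroring in the spirit Markus describes: write every upload to two independent disks, then spread reads across them. The paths and the pick-a-mirror rule are illustrative assumptions:

```python
import shutil
from pathlib import Path

# Hypothetical mount points for two independent image disks
MIRRORS = [Path("/mnt/disk1/images"), Path("/mnt/disk2/images")]

def store_image(src_path, name):
    """Write the image to every mirror so either disk can serve it."""
    for root in MIRRORS:
        root.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src_path, root / name)

def read_image(name, request_id):
    # Spread read load: pick a mirror by request id (round-robin in spirit)
    root = MIRRORS[request_id % len(MIRRORS)]
    return (root / name).read_bytes()
```

Unlike a RAID rebuild, losing one disk here just means reads fall back to the surviving mirror while files are re-copied in the background.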
It seems like most here are aware of the site, and I would really appreciate a chance to take a look myself.
Thanks in advance, if possible.
One strategy to make sites fast is to put a proxy in front of the PHP/Python/whatever backend and have the proxy cache pages for non-members with a 5-10 min TTL.
In general, caching either complete pages or objects can bring the load down enormously. We also delay many writes to the DB servers so they can be made either in batch or at night.
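A minimal in-process sketch of that proxy-side page cache with a TTL (illustrative only; a real setup would use something like Squid in front of the backend):

```python
import time

class TTLCache:
    """Tiny page cache with a fixed TTL, the way a front proxy might serve
    anonymous visitors (logged-in members would bypass it entirely)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # url -> (expires_at, body)

    def get(self, url, render):
        now = time.time()
        hit = self.store.get(url)
        if hit and hit[0] > now:
            return hit[1]                 # fresh cached copy: backend untouched
        body = render(url)                # miss or stale: render once, reuse
        self.store[url] = (now + self.ttl, body)
        return body

calls = []
def render(url):
    calls.append(url)                     # stands in for the PHP backend
    return f"<html>page for {url}</html>"

cache = TTLCache(ttl_seconds=300)
cache.get("/browse", render)
cache.get("/browse", render)              # second request served from cache
print(len(calls))                         # 1: backend rendered only once
```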
Modern servers are also extremely powerful: a high-end box with 4 cores, 8GB of RAM, and fast SCSI disks can handle 150-200M dynamic page views a month of PHP (WordPress, a forum, a photo gallery, or similar) without much optimization beyond tuning PHP with Zend or eAccelerator.
RAID: I agree, often not worth it. Once a disk fails, the whole server usually becomes so slow that web processes start piling up, and rebuilding a RAID can take a long time. Best is to mirror servers and DBs.
I mirror all my images across my servers; I found that using RAID is the worst possible thing you can do. When you have millions of images, you can't read more than a few hundred a second and it all crashes. If you just mirror the images via software, you can serve a few thousand a second. This is using Windows. I suspect RAID can't handle bidirectional IO on small files under heavy load.
If you drop a drive and it takes out your CPU, your RAID controller sucks. Rebuilding an array also shouldn't take out your server; that should be handled by the RAID controller. If need be, lower the weight of the affected server on the load balancer so it handles a smaller percentage of the workload until the rebuild is complete.
But losing a whole server just because of a bad hard drive... I don't think that's good practice to be preaching here. It's costly, and it's easily preventable.
Chip-
freeflight: I only say it's a slow solution for me because I would have to write several hundred thousand files to disk per day, and the constant lookups against disks containing millions of static files would be a lot slower than just going straight to the DB. My number one concern with anything I do is that it is fast and easy, and that very little can go wrong.
All the tech stuff does not sound too overwhelming, but I guess you have spent a lot of years learning everything necessary. Some degree of nerdiness is required to boil problem solving down to almost no time.
Also, comparing yourself to the industry does not really make sense, even though it is fascinating. Hired hands are never as efficient as oneself: they need to communicate with each other, which is one of the most time-consuming parts of running a business.
And my question: Do you think it is possible to do the same thing without using a single Microsoft product?
And my question: Do you think it is possible to do the same thing without using a single Microsoft product?
Java, PHP, Apache, Linux, MySQL, and maybe a Google Mini if you need a robust search engine. None of those require anything from MSFT. :)
Chip-