Forum Moderators: phranque


The Technical Aspects of Running a One-Person, High-Traffic Site.

One person's experience.

         

trillianjedi

12:00 pm on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Technical Discussion continued from here [webmasterworld.com].

I'm also more interested in the technical aspects of your server performance, Markus.

Any threads you wanted to start on that subject I'm sure would be well supported....

TJ

[edited by: Woz at 10:46 pm (utc) on Mar. 17, 2006]

[edited by: tedster at 4:39 pm (utc) on May 28, 2007]

fischermx

8:58 pm on Mar 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yikes! This Markus guy really is so technical...
I just entered my zip code on his site and I got:

"Results 0 to 20 out of 95 results are shown below"

I can't stop laughing, lol. Really, do you get it?
Please tell me you see the funny part.

This guy is zero-based, like a real programmer! :) :) :)

(BTW, there are 20 results shown, not 21 as there would be in a zero-based range)
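For anyone who missed the joke, a quick sketch shows why a zero-based slice from 0 to 20 (exclusive) really holds 20 items, numbered 0 through 19 (the numbers mirror the quoted message):

```python
results = list(range(95))     # stand-in for the 95 search results
page = results[0:20]          # zero-based slice: items 0 through 19

print(len(page))              # 20 items on the page
print(page[-1])               # last item shown is number 19, not 20
```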

markus007

11:47 pm on Mar 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Keeping my site fast and running is one of my smallest issues.

1. Approving and editing 20,000 images/day
2. Blocking 1,000 Nigerian/Russian scams, escorts, etc. per day
3. Blocking fake accounts, troublemakers, etc.
4. 100k+ edits/modifications/day

5 to 10% of Yahoo's and the industry's total signups are scams, escorts, etc. I'd guess these guys steal on the order of $100-400 million per year from the industry.

ogletree

12:23 am on Mar 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That is why other sites have so many employees. You're telling us you can do all that by yourself? Or is it like here, where we have moderators who work for free?

limitup

5:39 am on Mar 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, there's no way to do those things programmatically, so I assume he has volunteers or something. That's why Amazon came up with Mechanical Turk. Maybe Markus is using them, LOL.

markus007

6:55 am on Mar 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



99% is done by the site AI.

Heartlander

7:20 am on Mar 19, 2006 (gmt 0)

10+ Year Member



Just fascinating.
After thanking Markus for a great motivational tool here, I've also got to thank ogletree for asking the hard questions that the rest of us are thinking.

One question from me:
AI, as in artificial intelligence?
The only bits and pieces I know about this are that a PHP-based forum website can install an "AI" system where an artificial "member" answers questions, etc.
I'm showing my ignorance on the matter, I know...
Can anyone enlighten us on what AI means in Markus' case?

limitup

7:43 am on Mar 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I'd love to hear more about this "AI". I realize you probably can't respond for competitive reasons, but my position is that there is no way to programmatically prevent 99% of the problems you referred to. The images alone are a huge problem. If you've figured out how to programmatically analyze images to make sure they are "appropriate", you'd be a very rich man, because no one else in the world has figured out how to do it.

DamonHD

9:07 am on Mar 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

There are tools around which can, for example, recognise flesh tones in images; with those you can spot porn images with about 60% accuracy. If you have a long memory, you may remember an academic project from about 10 years back that scouted for porn images on the Web to see if it could be done automatically, and it was quite good!
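A toy sketch of that kind of flesh-tone check; the RGB thresholds here are illustrative values from published skin-detection heuristics, and the 40% flag threshold is my own made-up number, not from any particular tool:

```python
def is_flesh_tone(r, g, b):
    # A classic RGB rule of thumb for skin-coloured pixels.
    return (r > 95 and g > 40 and b > 20 and
            r > g and r > b and abs(r - g) > 15)

def flesh_fraction(pixels):
    """Fraction of (r, g, b) pixels that look like skin."""
    if not pixels:
        return 0.0
    hits = sum(1 for p in pixels if is_flesh_tone(*p))
    return hits / len(pixels)

def looks_suspect(pixels, threshold=0.4):
    # Flag an image for human review when too much of it is skin-toned.
    return flesh_fraction(pixels) >= threshold
```

This is exactly the kind of "about 60% accuracy" filter Damon describes: cheap, wrong often, but good enough to shrink the pile a human has to look at.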

Another clue: most scammers are bloody stupid. You can recognise them not by the content of their posts, etc., but by the great big boots and clown costume they are wearing when they come to your door. In technical terms, a big clue about dodgy input is "out of band", i.e. not within the message at all.

For example, my spam filter is able to reject the 10,000 spams a day and let in the 10-or-so ham by checking simple things like valid return addresses, and whether the senders are on any block lists, not by checking the content of the mail at all. The few that get through I can dump very easily by eye!
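That envelope-only idea can be sketched in a few lines; the block list, the helper names, and the HELO flag are all mine, invented for illustration, and a real MTA would check these things against live DNS:

```python
BLOCKLIST = {"spamhost.example"}   # hypothetical snapshot of a block list

def sender_domain(addr):
    """Domain part of a return address, or '' if there is none."""
    return addr.rsplit("@", 1)[-1].lower() if "@" in addr else ""

def reject_out_of_band(sender, helo_matches_dns=True):
    """Reject mail on envelope metadata alone, never reading the body."""
    domain = sender_domain(sender)
    if not domain or "." not in domain:
        return True            # malformed or missing return address
    if domain in BLOCKLIST:
        return True            # sender is on a block list
    if not helo_matches_dns:
        return True            # forged HELO / DNS mismatch
    return False
```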

This is what I call "AI", though once you have turned it into working code it seems much less "I".

Rgds

Damon

midwestguy

9:29 am on Mar 19, 2006 (gmt 0)

10+ Year Member



Hi Markus,

If it would not compromise your competitive position, would you be so kind as to share which database you are using on your DB server? PostgreSQL, Microsoft's SQL Server, Oracle, DB2, or something else/homegrown?

I'm especially interested in what you think of PostgreSQL 8.X (and on which OS), since I'm planning to pick one database to spend a lot of my free time with to really learn the nuances of. I want to pick a good one that I can afford to deploy. But one with hopefully as little grief and limits as I can luck into for some "find the needle in the haystack" type (hobby) research projects I'm planning.

And one thing I was really wondering about is if your queries to the database are in some version of SQL or use some other access language/mechanism? Again, I ask only if this wouldn't affect your competitive position...else please don't answer.

Lastly, what operating system is your database server running upon? I think your webserver is a Microsoft OS (according to netcraft.com), but I'm kind of an OS junkie (and like the BSDs), so I was wondering. BTW, have you checked out Solaris 10's ZFS yet? Sounds very promising to me.

And on the subject of optimization, ever played around with Dtrace? If not, you may find it a very useful tool to see **exactly** where your code -- while running real time in production, but without affecting run time performance -- is using the various resources of the hardware and OS capabilities. I'll bet it would be right up your alley, Markus.

IIRC, Dtrace currently only runs on Solaris 10, but there is an effort underway to port it to FreeBSD. Some of the SUN engineers have some pretty interesting blogs on the new features of Solaris 10 that sound very interesting to me. IIRC, ZFS is supposed to be included in the version of Solaris 10 scheduled to ship in May of this year.

But of all the above, Dtrace sounds like something you might absolutely LOVE. So I wanted to mention it to you in the off chance you had not checked it out yet.

Here are some links you may find interesting:

On Dtrace:

[sun.com...]

Bryan Cantrill's blog at Sun, who is one of the Dtrace developers:

[blogs.sun.com...]

A couple of overview/reviews of some of the new features of Solaris 10 (and I'm not connected in any way to Sun, Solaris, or in any way benefit from any of these links, BTW):

[softwareinreview.com...]

and an old, but still good food for thought review:

[thejemreport.com...]

Again, thanks, Markus, for sharing as you have. I really appreciate it and am very grateful.

Thank you,

Louis

P.S. Again, I hope these questions are OK.

limitup

6:42 pm on Mar 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can spot porn images with about 60% accuracy with those.

Right, which means 40%, or in his case 8,000 images per day, cannot be checked programmatically. Now what? And that's just the images we're talking about...

Juan_G

11:40 pm on Mar 19, 2006 (gmt 0)

10+ Year Member



8,000 images per day are not able to be checked programmatically

Well, for example Google's SafeSearch filtering works quite well for text and images. In the case of Image Search, possibly by looking at the accompanying text. They say: "No filter is 100% accurate, but SafeSearch should eliminate most inappropriate material".

You know, Google engineers like to automate everything and to solve problems not by hand, but by permanently improving algorithms, etc.

Another example is Bayesian filtering for spam: email filters that learn from you a little more every day, simply from your human input marking the few doubtful messages. That's a kind of basic AI (software that learns), and it works really wonderfully.
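A bare-bones sketch of that learning idea. The tokenization (split on whitespace) and the add-one smoothing are my own simplifications; real Bayesian filters such as the ones Juan_G alludes to are considerably more careful:

```python
import math
from collections import Counter

class TinyBayes:
    """Minimal Bayesian spam filter: learns from hand-labelled messages."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, text):
        # The "little human input": each marked message updates word counts.
        self.counts[label].update(text.lower().split())
        self.totals[label] += 1

    def score(self, text):
        # Log-probability ratio; > 0 means "more likely spam".
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        s = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for w in text.lower().split():
            p_spam = (self.counts["spam"][w] + 1) / (
                sum(self.counts["spam"].values()) + len(vocab) + 1)
            p_ham = (self.counts["ham"][w] + 1) / (
                sum(self.counts["ham"].values()) + len(vocab) + 1)
            s += math.log(p_spam / p_ham)
        return s
```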

Perhaps Markus is doing some things in a similar line to those, given what he said:

I have a sort of a AI, that i built that handles the site for me. When you've got 2 years of steady growth you can build something super fancy to automate problems as they come up.

carguy84

1:33 am on Mar 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1 million dynamic page views per hour across a dual-proc web server with gzip compression on? That seems pretty far-fetched (not as in "you're lying", but as in I think you need to actually make sure your compression is turned on), even for the most streamlined HTML. How do you pull it off?

I have a site that does about 80,000 dynamic page views a day, and there's no way I could run gzip compression on them on the fly without totally taking out the processors (1 dual-proc web server hitting 1 dual-proc DB). I know IIS 6 has gzip embedded into its core, but this is something I've researched high and low, and I've even had an MSFT employee try to help, but no dice. I even lowered the compression level to 3 and it still totally taxed the procs.

What's the secret?

As far as the success of the site goes, it's truly impressive and a good source of inspiration. Plus it's actually a legitimate site helping people better their lives... and it's free. No better model than that.

I'm just amazed at the low hardware usage. The DB Server I can understand, but the web servers, that's just impressive. What load balancer are you using?

Thanks,
Chip-

[edit]Just checked, if I have your domain correct, your gzip isn't turned on, fyi[/edit]

markus007

1:58 am on Mar 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't use a load balancer; I've only got one web server serving up the site. I figure in the future round-robin DNS will allow me to scale further and still use session data.

My gzip is turned on; when I turn it off, my bandwidth doubles.

As for why it doesn't work for you, I don't know what to tell you. I went into the metabase, added aspx and html to the compression settings, and then just followed the rest of the tutorial for turning on compression on Microsoft's site. You're not trying to compress images on the fly, are you?
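For anyone following along, a quick way to see why turning gzip off doubles bandwidth is to deflate some repetitive markup yourself. The sample "HTML" below is made up, and this uses Python's zlib rather than IIS, but the compression trade-off it shows (level vs. output size) is the same one being discussed:

```python
import zlib

# Made-up, repetitive markup standing in for a dynamic listing page.
html = ("<tr><td class='row'>profile link</td></tr>\n" * 200).encode()

for level in (1, 6, 9):
    packed = zlib.compress(html, level)
    print(f"level {level}: {len(html)} -> {len(packed)} bytes")
```

Templated HTML is so repetitive that even the cheapest compression level shrinks it dramatically, which is why the CPU cost carguy84 saw is the surprising part, not the bandwidth savings.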

carguy84

7:33 am on Mar 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Those are the same changes I made. My images are served from an image server (Linux running lighttpd), so I don't think it's because of the images. I'll have to play around with it some more, I guess, and find out what part of the process is killing the CPU.

It's such a bandwidth saver, I hate not being able to run with it. MSFT said there was nothing wrong with the setup, just that too much dynamic content running through gzip will eat CPU cycles. Thanks, MSFT, lol. ;) Maybe it's time to look into some newer CPUs with more L2 cache. I'm still running on the good ole P3s.

Chip-

Juan_G

10:02 am on Mar 20, 2006 (gmt 0)

10+ Year Member



(...) Markus pages indexed by Google (almost a million and a half from his main site)

I correct/update myself: it's now about four million on Google (with the query site:...). Surprising to see so much change in a few days.

Still a million and a half pages on Yahoo.

Juan_G

12:08 pm on Mar 20, 2006 (gmt 0)

10+ Year Member



I'm being asked by stickymail how he has so many pages. Well, this has been explained already on this forum by Markus: users create the content.

midwestguy

12:43 pm on Mar 20, 2006 (gmt 0)

10+ Year Member



Hi Markus,

If it's OK to ask, which database are you using on your DB server? PostgreSQL, SQL Server, Oracle, or something else (like a homebrew "database")?

I'm especially interested in what you think of PostgreSQL 8.X, if you have any experience with it.

Thanks,

Louis

ogletree

6:19 pm on Mar 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How many bots have you got slamming your site after this publicity? There are a lot of lurkers here who do some nasty stuff.

Moosetick

4:33 pm on Mar 21, 2006 (gmt 0)

10+ Year Member



Again, congrats to what you have done.

I know you said you have 70-100 Mb/sec of traffic and several servers you built and maintain yourself. I was wondering: do you run them from your house, or did you build them and send them to a company that provides the bandwidth you need?

While you're doing well and have gotten by on the cheap with your hardware, you must be paying a few bucks for that bandwidth!

bose

6:55 pm on Mar 21, 2006 (gmt 0)

10+ Year Member



I don't use a load balancer, i've only got one web server serving up the site.

You may want to continually mirror/clone your web/DB/image servers, if not from a load balancing/sharing point of view, then at least to improve your fault tolerance.

Unless I have missed something obvious, your setup (as you've described it above) does seem to have several single points of failure. Not being critical of your setup, just trying to give back to this excellent thread you have contributed so much to.

Apologies if this has already been discussed and I've missed it.

darrat

11:55 pm on Mar 22, 2006 (gmt 0)



Talking of single points of failure: Markus is a single point of failure.

It must be hard being a one-man show, however rich. No vacations, no long weekends, and no illness.

markus007

4:16 am on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I spent 3 of the last 12 months on vacation or traveling...

I mirror all my images across my servers. I found that using RAID is the worst possible thing you can do: when you have millions of images, you can't read more than a few hundred a second and it all crashes. If you just mirror the images via software, you can serve a few thousand a second. This is using Windows. I suspect RAID can't handle bidirectional I/O on small files under heavy load.

randle

7:59 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hope I'm not out of line here, but can someone sticky me the site? Fascinating thread about an incredible achievement. The candor and patience Markus007 is displaying in answering these questions is admirable.

Seems like most are aware of the site, and I would really appreciate taking a look.

Thanks in advance if possible.

freeflight2

8:12 pm on Mar 23, 2006 (gmt 0)

10+ Year Member



I have to disagree with the "static files are far too slow" argument: we have several sites generating 100M+ pv/mo, and a dual Xeon + Linux + thttpd can easily saturate a 1 Gbit line doing nothing other than serving static pages and pics from a pool of a couple million files.

One strategy to make sites fast is to put a proxy in front of the PHP/Python/whatever backend and have the proxy cache pages for non-members with a 5-10 minute TTL.

In general, caching of either complete pages or objects can bring down the load enormously. We also delay many writes to the DB servers so they can be made either in batches or at night.

Modern servers are also extremely powerful: a high-end box with 4 cores, 8 GB RAM, and fast SCSI disks can handle 150M-200M pv/mo of dynamic PHP (WordPress, a forum, a photo gallery, or similar) without much optimizing besides tuning PHP with Zend or eAccelerator.

RAID: I agree, often not worth it. Once a disk fails, the whole server usually becomes so slow that web processes start piling up, and rebuilding a RAID can take a long time. Best is to mirror servers and DBs.
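The non-member cache idea above can be sketched in a few lines. This is an in-process toy with an injectable clock, not a real reverse proxy like the setup being described, and all the names are mine:

```python
import time

class TTLPageCache:
    """Serve anonymous visitors a cached page for a few minutes."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}            # url -> (expires_at, body)

    def get(self, url, render):
        now = self.clock()
        hit = self.store.get(url)
        if hit and hit[0] > now:
            return hit[1]          # cache hit: the backend is never touched
        body = render(url)         # cache miss: render once and remember
        self.store[url] = (now + self.ttl, body)
        return body
```

With a 5-minute TTL, a page hammered 1,000 times a minute costs the backend one render per 5 minutes instead of 5,000, which is where the "enormous" load reduction comes from.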

carguy84

9:13 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I mirror all my images across my servers. I found that using RAID is the worst possible thing you can do: when you have millions of images, you can't read more than a few hundred a second and it all crashes. If you just mirror the images via software, you can serve a few thousand a second.

Then I suspect you need to find a better RAID controller. Reading from multiple disks at once will be faster than reading from one disk at a time; SCSI will take whichever drive gets the info to you fastest. Combine this with a load-balanced array of image servers and you can serve up tens of thousands per second, AND keep your RAID intact.

If you drop a drive and it takes out your CPU, your RAID controller sucks. Rebuilding an array also shouldn't take out your server; it should be handled by the RAID controller. If need be, lower the weight of the affected server on the load balancer so it handles a smaller percentage of the workload until the rebuild is complete.

But losing a whole server just because of a bad hard drive... I don't think that's good practice to be preaching here. It's costly and easily preventable.

Chip-

freeflight2

9:58 pm on Mar 23, 2006 (gmt 0)

10+ Year Member



carguy84: sometimes a defective RAID card can corrupt the data on *all* drives.
In any case, you have to anticipate that a server can go down for other reasons (RAM failure, etc.).

markus007

11:54 pm on Mar 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



carguy84, I write 4 copies of my files across multiple drives. I am using a high-end RAID card and have zero issues on other servers; the issue only shows up when serving hundreds of files. One of the copies is even saved on a 4-drive array, I just would never serve data from it.

freeflight, I only say it's a slow solution for me because I would have to write several hundred thousand files to disk per day, and the constant disk lookups across millions of static files would be a lot slower than just going straight to the DB. My number one concern with anything I do is that it's fast and easy, and that very little should be able to go wrong.

philaweb

12:24 am on Mar 24, 2006 (gmt 0)

10+ Year Member



Congrats on the success. Would love to go down the same road. :)

All the tech stuff does not sound too overwhelming, but I guess you have spent a lot of years learning everything necessary. Some degree of nerdiness is required to boil problem-solving down to almost no time.

Also, comparing yourself to the industry does not really make sense, even though it is fascinating. Hired hands are never as efficient as oneself; they need to communicate with each other, which is one of the most time-consuming parts of running a business.

And my question: Do you think it is possible to do the same thing without using a single Microsoft product?

fischermx

3:41 am on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




And my question: Do you think it is possible to do the same thing without using a single Microsoft product?

Markus might be invited to the MIX '07 ASP.NET conference next year! ;)

carguy84

4:40 am on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



And my question: Do you think it is possible to do the same thing without using a single Microsoft product?

I won't answer for Markus, but from experience, yes. MSFT may make it easier, but when you get to the level Markus works at, it's a matter of what you're most proficient in; that's what gets the job done.

Java, PHP, Apache, Linux, MySQL, and maybe a Google Mini if you need a robust search engine. None of those require anything from MSFT. :)

Chip-

This 94-message thread spans 4 pages.