Forum Moderators: phranque


The Technical Aspects of running a One Person High Traffic site.

One person's experience.


trillianjedi

12:00 pm on Mar 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Technical Discussion continued from here [webmasterworld.com].

I'm also interested in the technical aspects of your server performance, Markus.

Any threads you wanted to start on that subject I'm sure would be well supported....

TJ

[edited by: Woz at 10:46 pm (utc) on Mar. 17, 2006]

[edited by: tedster at 4:39 pm (utc) on May 28, 2007]

arbitrary

6:42 am on Mar 24, 2006 (gmt 0)

10+ Year Member



markus, did you realize things would get this big for you from the start?

What I mean to say is: Did you design the system from day one to eventually handle these loads or did things gradually (or suddenly) get big and you realized this kind of load handling would be required?

What you have done is incredible (but I do believe it); congratulations. If you knew it would be this big from the start, that would be even more impressive.

stef25

10:25 am on Mar 24, 2006 (gmt 0)

10+ Year Member



fascinating, thanks for posting all this info. congratulations to you :-)

domrep

12:24 pm on Mar 24, 2006 (gmt 0)

10+ Year Member



Outstanding stuff! Kind of makes my 9 million+ monthly page views look light, but hey, you have to start someplace.

Can someone please sticky me the URL of markus007's site?

the_nerd

12:25 pm on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Markus,

some posters have already inquired about your database software. I understand if you don't want to let us know, but maybe you can tell us whether it's open source or not.

nerd

P.S. I'm really impressed by the performance of your hardware/software setup. But that COULD possibly be handled by a couple of "vaxes behind the curtain" - you should definitely write a book about marketing, though. I'd buy 10 copies right away.

and1c

2:07 pm on Mar 24, 2006 (gmt 0)

10+ Year Member



I am struggling to believe that a site as big and busy as this could be overseen in only one hour a day....

Unless Markus is superhuman?!

And three months of holiday out of twelve!

I just can't see how it's possible, knowing how much time my sites take :)

Disbelievers +1

finer9

2:15 pm on Mar 24, 2006 (gmt 0)

10+ Year Member



pretty amazing stuff. I personally think anyone who doesn't 'believe' this is possible just isn't open-minded enough.

markbaa

2:30 pm on Mar 24, 2006 (gmt 0)

10+ Year Member



and1c, he has already answered this question. He has AI doing the bulk of the work for him.

markus007

4:31 pm on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I got my first hundred members in March, and then in July I was up to a few thousand and AdSense came out. The day I added it to my site is the day I knew I was going to be a mega site. The main reason was that I was growing at a consistent, accelerating rate; I could plot my growth on a graph and predict exactly where I would be in five months. I knew the only way to succeed was to fly under the radar. So I did what no one else has ever done: I blocked Alexa toolbar users from being able to sign up, as well as comScore. The public takes Alexa rankings as a bible, and corporations spend 50 grand plus a year on comScore, and that is their bible. If you aren't on comScore or Alexa you are a nobody. I used that and grew huge. I only lifted my ban on Alexa users yesterday, so my Alexa ranking currently only represents new traffic per day.
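Markus never describes how he actually detected panel members, so here is a purely hypothetical sketch of one plausible mechanism: refusing signups when the browser's User-Agent header carries a measurement-toolbar token. The marker strings are assumptions, not anything the thread confirms.

```python
# Hypothetical sketch: block signups from visitors whose User-Agent suggests
# a ranking-panel toolbar. The token strings below are assumed, not confirmed.

TOOLBAR_TOKENS = ("alexa toolbar", "comscore")  # assumed marker substrings

def is_measured_visitor(user_agent: str) -> bool:
    """Return True if the User-Agent looks like it carries a panel toolbar."""
    ua = user_agent.lower()
    return any(token in ua for token in TOOLBAR_TOKENS)

def allow_signup(user_agent: str) -> bool:
    # Block signups only (not browsing), so the site stays invisible
    # to the ranking panels while regular visitors are unaffected.
    return not is_measured_visitor(user_agent)
```

The point of the design is that panel users can still browse; they simply never become members, so the site never registers on the panels' per-member metrics.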

As for designing the system to handle that much load from the start: impossible. As you add more data, the way the database works changes (I'm not saying what I'm using). Also, I constantly added new features every week or two, making the site more resource-intensive. I just spent all my time trying to scale the DB on only one server and stick to using only one web server. Constant optimization until I reached the level I'm at now. The law of big numbers works on my side now; I will only run into issues at 1.6x my current traffic.

I think you can accomplish the same thing in nearly any language except PHP, which does not scale at all when it comes to heavy loads.

priidik

5:10 pm on Mar 24, 2006 (gmt 0)

10+ Year Member



Yahoo is using PHP - 3 billion page views per day. PHP is scalable, dear Markus.

[public.yahoo.com...]

markus007

5:47 pm on Mar 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Anything is scalable horizontally if you have a couple hundred thousand servers. From what everyone in the industry tells me, and from other developers, PHP cannot scale to tens of millions of pageviews on a single machine.

Scaling up, especially via good code, is far more cost-effective and maintainable than scaling out.

arbitrary

5:56 pm on Mar 24, 2006 (gmt 0)

10+ Year Member



markus, thanks for answering my question.

and then in July I was up to a few thousand and AdSense came out. The day I added it to my site is the day I knew I was going to be a mega site. The main reason was that I was growing at a consistent, accelerating rate; I could plot my growth on a graph and predict exactly where I would be in five months. I knew the only way to succeed was to fly under the radar.

I blocked Alexa toolbar users from being able to sign up, as well as comScore.

Not only did you do something amazing, you showed great vision on many fronts.

Congratulations!

Tastatura

9:37 pm on Mar 24, 2006 (gmt 0)

10+ Year Member



markus007: I blocked Alexa toolbar users from being able to sign up, as well as comScore

This might be trivial to some, but how is this done? How do you know (detect) whether a visitor is using the Alexa toolbar or comScore (or the Google toolbar, or Yahoo's, etc.)?

markus007: I think you can accomplish the same thing in nearly any language except PHP, which does not scale at all when it comes to heavy loads.
...
PHP ...cannot scale to tens of millions of pageviews on a single machine.

Why is this? What specifically prevents code written in PHP from scaling (vertically)? I am new to PHP, so I am very curious.

carguy84

2:15 am on Mar 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting slideshow, priidik, thanks for the link!

As far as PHP not being able to handle the same amount of RPS as other scripting languages - it really boils down to what you are doing with the language and what your data access layer looks like.

Also, there's a huge community developing open-source projects in PHP, and with that comes a bunch of great tools us MSFT folks never get to see :(

Chip-

lorax

6:31 am on Mar 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From what everyone in the industry tells me, and from other developers, PHP cannot scale to tens of millions of pageviews on a single machine

Evidence?

amznVibe

6:39 am on Mar 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I assure you the load of PHP is far lighter than ASP's at any scale.
I consult for a site that I believe has heavier traffic than yours, on a single server using PHP,
and it's super smooth with page sizes twice yours.

I think you are using a page data cache on your site and you can do the same with PHP
to make dynamic pages semi-static and radically reduce loads.

Some large scale PHP presentations:
[talks.php.net...] (doesn't work right in my Firefox)
[public.yahoo.com...]
[public.yahoo.com...]

ASP.NET is compiled code, so if you are going to compare against that, you'd take the most critical parts of the site, write them as C CGIs (which are even faster than ASP.NET), and easily integrate them into PHP.

On a different note, what I am curious about is whether your customer-uploaded images are stored statically or inserted into an SQL database. I never understand programmers who convert images and store them in a database; it adds incredible overhead and virtually no advantages. I suspect yours are static?
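The page-data-cache idea mentioned above (render a dynamic page once, serve the cached copy until it expires) can be sketched in a few lines. This is an illustrative minimal version, not anyone's actual implementation; the TTL and function names are assumptions.

```python
# Minimal sketch of a "semi-static" page cache: serve a rendered page from
# memory for a short TTL and only re-render on a miss or after expiry.
import time

CACHE_TTL = 60   # seconds a rendered page stays valid (illustrative)
_cache = {}      # url -> (rendered_html, expiry_timestamp)

def get_page(url, render):
    """Serve `url` from cache, calling `render(url)` only on a miss/expiry."""
    entry = _cache.get(url)
    now = time.time()
    if entry and entry[1] > now:
        return entry[0]                      # cache hit: no DB work at all
    html = render(url)                       # miss: do the expensive render
    _cache[url] = (html, now + CACHE_TTL)
    return html
```

Even a TTL of a minute collapses thousands of identical renders into one, which is exactly how dynamic pages become "semi-static" and radically reduce load.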

freeflight2

6:59 am on Mar 25, 2006 (gmt 0)

10+ Year Member



amznVibe: storing images in a DB lets you replicate the data across multiple servers much more easily, create consistent hot backups, etc.

amznVibe

7:03 am on Mar 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Grrr, darn editing timeouts on this board. Brett you need to increase the time limit! (at least for senior members?)

In any case, I believe gaiaonline is one of the top 10 largest forums in the world, and it's PHP. In 2004 they claimed to have 70 million messages, 9,000 simultaneous users, and 750,000 registered members, on four dual-core servers. I'm limited in what I can link to on there, but try here [big-boards.com...]

ah I just found this from October 2005:

We now can easily support 30K simultaneous users signed in with no problem. After a few more optimizations we should be able to double and triple our load, without even adding any more hardware. We currently have about 300M posts on the site, with a few million more being added every day.
But of course they don't qualify anymore as a "one person high traffic site" so I digress.

freeflight2: but if the DB corrupts, you've got virtually all images down, and a single point of failure in any case. What about unique filenames based on some kind of directory substructure, which is even more easily mirrored (copy only files that are new or changed)? Honestly, I've never worked with anything more than a dual-server setup, so I don't have real-world experience with anything massive. But it still perplexes me why someone would make it that much more complicated, unless it's an absolute ton of tiny images.
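The "unique filenames based on a directory substructure" idea can be sketched concretely: hash the image ID into a stable two-level directory so no single directory accumulates millions of files, and incremental mirroring only copies new or changed files. The root path and naming scheme here are illustrative assumptions.

```python
# Sketch of a filename-based directory substructure for static images:
# a hash prefix shards files across 256*256 directories deterministically.
import hashlib

def image_path(image_id: str, root: str = "/var/www/images") -> str:
    """Map an image ID to a sharded path like root/ab/cd/<id>.jpg."""
    digest = hashlib.md5(image_id.encode()).hexdigest()
    # Two 2-hex-char levels keep per-directory file counts manageable.
    return f"{root}/{digest[:2]}/{digest[2:4]}/{image_id}.jpg"
```

Because the mapping is deterministic, every mirror computes the same path from the same ID, with no shared database needed just to locate a file.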

markus007

4:45 pm on Mar 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Forums tend to be trivial to scale to huge sizes, as the database load is so light. Calls to the database execute much faster than my typical queries do, meaning you can serve more pages per second because all your threads aren't used up.

As for PHP dying under big loads, I think it had to do with it being unable to run well with multiple threads.

[simon.incutio.com...]

As for storing images in a database, I wouldn't dream of it. Creating a terabyte database is not something I'm keen on doing, and even then, streaming the images from a database would max out my I/O damn fast. I would need a huge SAN with hundreds of drives.

amznVibe

3:24 pm on Mar 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Markus, can you share with us any of your techniques to spot and block bad bots?
Do you just look for IPs with heavy pageviews and keep a whitelist?
Do you use spider traps? I think bad bots are definitely the #1 problem of any large site.

freeflight2

3:47 pm on Mar 26, 2006 (gmt 0)

10+ Year Member



Hammering out 500 to 1k PHP requests per second is not an issue for PHP + Apache + Linux - we are doing it all the time, with uptimes of several months.
It really doesn't matter whether images (or any bigger objects) are in a DB or in a filesystem: usually the speed of the disks is the bottleneck that determines how many requests per second are possible, while the processor stays 60%+ idle.
MySQL can handle several TB of data with 100M+ rows per table. We only need 4 disks per terabyte server to deliver a continuous 150 Mbit JPEG stream, not 100 ;)
The beauty of the "everything in a DB" solution is that there's no need to deal with NFS, timeouts, and things like that.
amznVibe: a monster like this should always be replicated over 2-4 servers (ideally in different DCs). In case a box goes down, the frontend should automatically take it out of the rotation while a table check is performed on the DB.

bhopkins

3:44 am on Mar 27, 2006 (gmt 0)

10+ Year Member



Okay, on this whole coding-everything-yourself: I take some issue with it. The dating part of the site probably was custom-coded, but the forums are a version of the ASP.NET Forums, which were coded by Rob Howard and others on the ASP.NET team. The latest version can be obtained from Telligent over at CommunityServer.org. So not everything was coded by one person.

Bruce

graeme_p

4:40 am on Mar 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Now that markus has turned off his blocking of Alexa, the Alexa traffic graph for his site looks interesting.

I know Alexa numbers are very distorted. Even so, I cannot understand why there was a huge spike in reach a week or so ago, which reversed very quickly.

I wondered if the publicity has had an effect on traffic, but that does not seem likely given how busy the site is anyway.

markus007

5:19 am on Mar 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The ASP.NET Forums are completely worthless. I used only the layout and threw everything else away. I have never seen such bloatware before. It's a miracle anyone can run a forum using that code.

As for Alexa, I only lifted the ban on Thursday; the surge was from an extra 120 toolbar users from Slashdot on the Monday. My Alexa reach rank should move from 400 to 1,200 within four weeks.

Cheeser

1:52 pm on Mar 27, 2006 (gmt 0)

10+ Year Member



All this talk about huge server farms and custom programming got me thinking of my own site.

I run quite a large Harry Potter fan site which does 8-10 million pageviews monthly. I *did* write every line of code from scratch, including the forums, membership database, news aggregation, content administration, visitor tracking and much more.

With good caching, the site blazes even with 3,000+ concurrent users. It runs on 3 servers: one for mail/external apps, one for DB (Microsoft SQL Server), and one for ColdFusion and IIS.

I'm not just patting myself on the back here; I'm adding my voice to the true fact that you can run a very high-traffic site on very reasonably-priced hardware. My site started at my apartment on a Pentium III 450MHz in 2002 with 1,000 visitors/month. Now, at over 1.5 million visitors a month, all it took was a relocation to a datacenter ($300/mo, instead of $289/mo for SDSL at my apartment) and two new servers (both Dell PowerEdges, for a total of less than $4,000).

I couldn't be happier with the hardware, and there is so much room left to grow on it that I am not at all worried for the next several years (until the Potter franchise is all but done in 2010 - by then, I'll be looking for "other work" anyways). ;)

freeflight2

8:46 pm on Mar 27, 2006 (gmt 0)

10+ Year Member



huge server farms

Web servers are usually not the issue - you can run 100 web servers for $10k-$20k/mo (Ticketmaster has some 1,000+ Perl web servers), while 4-5 top-of-the-line DB servers might cost as much as that (especially if you have to license Oracle or similar).
So the important part is scaling the DB backend, and how data is efficiently stored and distributed.

lovethecoast

9:06 pm on Mar 27, 2006 (gmt 0)

10+ Year Member



I have a site that does about 80,000 dynamic page views a day, and there's no way I could run gzip compression on them on the fly without totally taking out the processors (one dual-proc web server hitting one dual-proc DB server). I know IIS 6 has gzip embedded in its core, but this is something I've researched high and low, and I even had an MSFT employee try to help, but no dice. I even lowered the compression level to 3 and it still totally taxed the procs.

I know each of our web servers does a minimum of 250k dynamic pages every day (1-2 inserts and 4-20 selects on every page) with no problems, just dual Xeon 2.8s with 2 gigs of RAM. We use XCompress instead of IIS gzip; maybe you could try that - they have a trial version, I believe. We also use a custom caching solution which caches about 50% of each page (more if it's mostly static content). We then update each page every 18 hours or so automatically. Pack that with an image server, and we figure we can double (or more) the amount of traffic we're sending through each server. We're running Win2k3 and 99% classic ASP.

S

Juan_G

9:10 pm on Mar 27, 2006 (gmt 0)

10+ Year Member



Cheeser wrote:

With good caching, the site blazes even with 3,000+ concurrent users.


Lovethecoast wrote:

We also use a custom caching solution (...)


Interesting... Thank you for sharing it.

carguy84

2:10 am on Mar 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just implemented some "custom" (I'm sure others do it the exact same way) classic ASP caching, and while it obviously makes a huge difference in server load and time to first byte, it also has a secondary bonus: since it caches to a static HTML file, I can now gzip that up and not worry about the processors getting taken out, as the compressed file only changes when the cache changes!
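The trick described here (compress the cached file once per cache refresh instead of per request) can be sketched briefly. This is an illustrative version; paths and names are assumptions.

```python
# Sketch: whenever the page cache is (re)written, also write a .gz sibling,
# so requests serve precompressed bytes instead of gzipping on the fly.
import gzip

def write_cache(path: str, html: str) -> None:
    """Write the cached page and its precompressed twin in one go."""
    data = html.encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)
    with gzip.open(path + ".gz", "wb") as f:
        f.write(data)   # compressed once per cache refresh, not per request
```

The web server then just checks `Accept-Encoding` and picks whichever file to stream; CPU cost of compression drops from per-request to per-cache-update.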

As for not being able to gzip 80,000 dynamic requests a day: don't forget, I'm on old hardware :)

Chip-

mojomike

2:45 am on Mar 28, 2006 (gmt 0)

10+ Year Member



Dear markus007,

You have done an excellent job. I would like to summarise the lessons you have shared with us.

a) If you are skillful enough, write your own code. After you have written it, test it, then optimize it. When it gets too slow, find the bottleneck, rewrite it and re-optimize it. Tune it to minimize user wait.

b) When you have the money, invest in RAM (I noted that you hinted at huge RAM). Funny that no one ever mentions that option (which is the first thing I try to do with extra cash).

c) Invest when you can in purchasing good links, and work on developing others.

d) Invest time in developing content.

e) Invest your time in being stealthy. Interesting, but this is only effective for the person who really wants to make a solid effort.

The only thing I really did not see (or you are hiding it) is the redundancy issue. From building five-nines machines back in 2000 to 2003, some of it might still be valid. (For those who don't know, five nines means 99.999% uptime; mission-critical is six nines, and we would laugh because I came up with nuke_six-nines, a 6-mirror system of six-nines machines placed in different geopolitical areas.) Anyway, I would mention that a simple round robin would offload a ton of CPU and/or I/O. And if I recall correctly, MS Server 2000 has a fail-over round-robin system that kicks in when CPU, I/O, or heat reach the fail-over trigger point.

I prefer round robin over anything else in your case. When you have a mirrored system with optimized code, with that code releasing the I/O and CPU quickly, you will never have a CPU/I/O bog. (A bog is when you have repeated calls to an intensive process; you can end up with a few people running those routines on the same server (bad luck, but it happens) and the system slows down.) When you have someone consistently optimizing the code to release the CPU or I/O, you will rarely hit the bog. The question now is how to mirror the drive system to replicate the content with little to no lag, and do it cheaply.
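The round-robin dispatch preferred above is simple enough to sketch: rotate requests across mirrored servers in strict order so no single box absorbs every intensive routine. Server names are placeholders.

```python
# Minimal round-robin dispatcher across mirrored servers.
from itertools import cycle

servers = cycle(["web1", "web2", "web3"])  # placeholder mirror names

def next_server() -> str:
    """Pick the next mirror in strict rotation."""
    return next(servers)
```

Real load balancers add health checks on top of this, dropping a mirror from the rotation when it stops responding, which is the fail-over behavior the post describes.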

A question does come into play: how the heck do you do your backups?

Juan_G

8:11 am on Mar 28, 2006 (gmt 0)

10+ Year Member



Mojomike wrote:

markus007, (...) I noted that you hinted about huge ram


Maybe with the cache in RAM rather than on disk, including a RAM cache of frequent queries? (Just a guess at one of the many possible performance measures, apart from database optimization, etc.)
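The guess above (keeping the results of frequent queries in RAM so hot reads never touch the disk) can be sketched with a bounded in-memory cache. The query function and its data are purely illustrative placeholders.

```python
# Sketch of a RAM cache for frequent queries: functools.lru_cache keeps the
# hottest results in memory and evicts the least-recently-used entries.
from functools import lru_cache

@lru_cache(maxsize=10_000)   # bound RAM use via LRU eviction
def top_profiles(region: str):
    # Placeholder for a real (expensive) database read.
    return tuple(f"{region}-profile-{i}" for i in range(3))
```

Repeated calls with the same key return the cached object directly, so frequent read queries cost a dictionary lookup instead of disk I/O.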