
My first billion page crawl

     
4:11 pm on Oct 16, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


I posted here a while back when someone was asking about the YioopBot user-agent, so I figured I'd do a follow-up. I am excited because my crawler has just completed its first billion-page crawl. Here is my blog post about the crawl:
[yioop.com...]
The index can be found at
[yioop.com...]
The crawler, the indexer, and the web app are all written in PHP and do not rely on any other crawling or indexing project. The software is GPLv3 and can be downloaded from
[seekquarry.com...]
My search engine currently runs on six 2011 Mac Minis in my home over a Comcast business connection. If you read the original PageRank paper by Brin and Page, they mention that they imagined prices would eventually come down to the point where pretty much anyone could do a web-scale crawl. Currently, I would say such a crawl costs 3 to 4 thousand dollars in equipment and internet, excluding labor. My guess is this price will continue to fall. It will probably be a few months before I do another crawl, as I want to improve my indexer and web app first.
3:22 pm on Oct 19, 2015 (gmt 0)

Administrator from CA 

WebmasterWorld Administrator bakedjake is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 8, 2003
posts:3883
votes: 61


Neat accomplishment. Have you compared costs doing it locally vs. in the cloud on AWS or Azure?
7:15 pm on Oct 20, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


It would probably be a lot cheaper to do things on AWS. I just prefer doing things on my own hardware. Both approaches though are getting cheaper with time.
9:56 pm on Oct 20, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15704
votes: 811


It would probably be a lot cheaper to do things on AWS.

Yah, but then you'd have to deal with all those 403s slammed in your face from sites that care enough to block the range, but not enough to poke holes.
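
For anyone curious, that sort of block is usually just a range-based deny in the site's Apache config, along these lines (the CIDR ranges here are documentation placeholders, not real cloud allocations):

<RequireAll>
    # let ordinary visitors through, refuse the hosting provider's address space
    Require all granted
    Require not ip 203.0.113.0/24
    Require not ip 198.51.100.0/24
</RequireAll>

Anything arriving from those ranges gets the 403 without the site owner ever looking at the user-agent.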
12:40 am on Oct 21, 2015 (gmt 0)

Preferred Member

10+ Year Member

joined:June 15, 2007
posts:452
votes: 18


Is there any way to estimate what this cost the other people, the ones who had their sites crawled?
6:26 am on Oct 21, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


It really depends on how the person is paying for their website. For example, I am not charged for the bandwidth of people downloading from my site; on the other hand, if you are on shared hosting, you might be. My crawler does respect robots.txt and has some code to slow down on, or entirely skip, sites that seem slow to respond, so hopefully I am not too much of a burden. I am pretty sure my crawler was hitting smaller sites a lot less frequently than Google does. You may or may not be surprised at the number of sites that basically forbid anyone but Google from crawling them (not even allowing Bing). To me, in the long run that might be riskier than any short-term savings in bandwidth.
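
To give a concrete picture, the robots.txt on that kind of site boils down to something like this (a composite example, not any particular site's file):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Googlebot is allowed everything; every other crawler, mine and Bing's included, is shut out entirely.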
4:31 pm on Oct 21, 2015 (gmt 0)

Full Member

5+ Year Member

joined:Aug 16, 2010
posts:257
votes: 21


I am interested in the source code, but I can't find it on the website you mentioned... am I missing something?

What kind of storage software are you using?

P.S. You know you can just download the complete Common Crawl database of 145TB?
9:26 pm on Oct 21, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


The download link on the page:
[seekquarry.com...]
gives you the source code. If you scroll down that page, you can also see how to clone the repository with git.

For the crawl, I am using 8 x 4TB drives -- the cheapest I could buy on Amazon; this was a low-budget operation, personally financed. They are configured as RAID 0. That has actually caused some issues: I didn't have enough cash to keep backups of the crawl itself, and with RAID 0 a single drive crashing loses me two drives' worth of crawl. The index data is in its own format, since I coded this project from scratch. On disk, the format uses a sequence of files, each less than 2GB, so it works with older Linux systems. Within a file, the raw data is a sequence of compressed web pages, similar to a WARC file. The index and dictionary structures are more complicated; the dictionary portion is kept on SSDs.
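
If it helps to picture the on-disk layout, the idea is roughly the following much-simplified sketch (not the actual Yioop code; the class name, constant, and record format here are made up for illustration):

<?php
// Sketch: append gzip-compressed pages to a sequence of files,
// rolling over to a new file before any file reaches 2GB.
class PageArchiveSketch
{
    const MAX_FILE_SIZE = 2000000000; // stay under the 2GB-per-file limit

    private $dir;
    private $fileNum = 0;
    private $handle = null;

    public function __construct($dir)
    {
        $this->dir = $dir;
        $this->openNextFile();
    }

    private function openNextFile()
    {
        if ($this->handle) {
            fclose($this->handle);
        }
        $path = sprintf("%s/archive_%05d.bin", $this->dir, $this->fileNum++);
        $this->handle = fopen($path, "ab");
    }

    public function writePage($url, $html)
    {
        $record = gzcompress(serialize(["url" => $url, "page" => $html]));
        // length-prefix each record so the file can be read back sequentially
        $data = pack("N", strlen($record)) . $record;
        if (ftell($this->handle) + strlen($data) > self::MAX_FILE_SIZE) {
            $this->openNextFile();
        }
        fwrite($this->handle, $data);
    }
}

Reading a file back is just the reverse: read the 4-byte length, read that many bytes, gzuncompress, unserialize, and repeat until end of file.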

Non-crawl data (users, wikis, group feeds, news crawls) is stored in a Postgres database, which is backed up.

P.S. I didn't think Common Crawl was that big. In any case, it would take a while to download that much data from Common Crawl. After downloading it, you would still need to index it, and of course you would be subject to whatever agreements Common Crawl has. I thought most people just operated on Common Crawl data in situ.
9:36 pm on Oct 21, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


Oh, I meant 12 x 4TB drives in the last post.
6:03 am on Oct 22, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:9908
votes: 971


Unless you have it doubled and mirrored off site (in case one location burns down or is flooded), you are a power surge away from nothing. Meanwhile, check your frontend... it says you are running 8TB drives. :)

What is the purpose of this crawl? How in-depth is it? Just for fun, I did a few searches. Do you know your auto-complete fails on way too many searches? It will not accept a search as typed when it thinks the query is something else.

Otherwise, congrats (and you've been blocked for some time; I checked my .htaccess and robots.txt, but with no date I don't know how long). No offense intended! I just don't know who you are or what you are doing, and I never saw a referral before blocking.

But from a tech and personal-expense standpoint, I am a bit impressed, though I wonder how many of those billion pages are Wikipedia, as that was just about all I saw for any given search.
3:26 pm on Oct 22, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


Yeah, I am being imprecise: I was counting 2 x 4TB drives in a single RAID 0 enclosure, connected via USB 2, as a single 8TB drive. I do intend to back up my web crawls as I get money for more equipment and prices come down. When Samsung's 16TB SSDs become cheap, I think we will truly be in an age where this is practical. I think Google actually keeps its indexes in RAM. I seem to remember blekko (now part of IBM) getting venture funding to buy 100TB of SSD; that would allow billion-page indexes and crawls to live entirely on SSD, with redundancy.

The purpose of my crawling is to make a usable search engine. I think 1 billion pages would be sufficient for starters (much smaller than Google); it is roughly the size commercial systems had in 2004, but this is a personal system in 2015. I think you would agree I haven't achieved my goal yet. I am using it as my default search engine, though, before trying searches on other engines -- this lets me see how it's failing (or sometimes not) so that I can improve it in the next iteration.

Having more hardware means you can do more preprocessing passes over the index before presenting it to the world. My index is being presented after just one pass, which partially explains the skew in results you are noticing.

Also, unlike Google et al., I don't have the infrastructure to handle auto-complete server-side, so auto-complete is done client-side, based on local-storage tricks and a single trie.
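
The trie part is nothing exotic; the idea is roughly the following (sketched in PHP for readability -- the real thing is JavaScript in the browser working out of local storage, and this is not the actual code):

<?php
// Sketch of prefix completion with a trie built from suggestion terms.
class TrieSketch
{
    private $root = [];

    public function insert($term)
    {
        $node = &$this->root;
        foreach (str_split($term) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['$'] = true; // sentinel marking the end of a complete term
    }

    public function complete($prefix, $limit = 10)
    {
        $node = $this->root;
        foreach ($prefix === "" ? [] : str_split($prefix) as $ch) {
            if (!isset($node[$ch])) {
                return []; // no stored term starts with this prefix
            }
            $node = $node[$ch];
        }
        $results = [];
        $this->collect($node, $prefix, $results, $limit);
        return $results;
    }

    private function collect($node, $prefix, &$results, $limit)
    {
        if (count($results) >= $limit) {
            return;
        }
        if (isset($node['$'])) {
            $results[] = $prefix;
        }
        foreach ($node as $ch => $child) {
            if ($ch === '$') {
                continue;
            }
            $this->collect($child, $prefix . $ch, $results, $limit);
        }
    }
}

You insert the terms you want to be able to suggest, then call complete() with whatever the user has typed so far.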