
My first billion page crawl

     
4:11 pm on Oct 16, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


I posted here a while back when someone was asking about the YioopBot user-agent, so I figured I'd do a follow-up. I am excited because my crawler has just completed its first billion-page crawl. Here is my blog post about the crawl:
[yioop.com...]
The index can be found at
[yioop.com...]
The crawler, the indexer, and the web app are all written in PHP and do not rely on any other crawling or indexing project. The software is GPLv3 and can be downloaded from
[seekquarry.com...]
My search engine currently runs on six 2011 Mac Minis in my home over a Comcast business connection. If you read the original PageRank paper by Brin and Page, they mention that they imagined prices would eventually come down to the point where pretty much anyone could do a web-scale crawl. Currently, I would say such a crawl costs 3 to 4 thousand dollars in equipment and internet, excluding labor. My guess is this price will continue to fall. It will probably be a few months before I do another crawl, as I want to improve my indexer and web app first.
3:22 pm on Oct 19, 2015 (gmt 0)

Administrator from CA 

WebmasterWorld Administrator bakedjake is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 8, 2003
posts:3883
votes: 61


Neat accomplishment. Have you compared costs doing it locally vs. in the cloud on AWS or Azure?
7:15 pm on Oct 20, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


It would probably be a lot cheaper to do things on AWS. I just prefer doing things on my own hardware. Both approaches though are getting cheaper with time.
9:56 pm on Oct 20, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15704
votes: 811


It would probably be a lot cheaper to do things on AWS.

Yah, but then you'd have to deal with all those 403s slammed in your face from sites that care enough to block the range, but not enough to poke holes.
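
For anyone curious, that sort of block is usually just a range-based deny in the site's Apache config, along these lines (the CIDR ranges here are documentation placeholders, not real cloud allocations):

<RequireAll>
    # let ordinary visitors through, refuse the hosting provider's address space
    Require all granted
    Require not ip 203.0.113.0/24
    Require not ip 198.51.100.0/24
</RequireAll>

Anything arriving from those ranges gets the 403 without the site owner ever looking at the user-agent.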
12:40 am on Oct 21, 2015 (gmt 0)

Preferred Member

10+ Year Member

joined:June 15, 2007
posts:452
votes: 18


Is there any way to estimate what this cost the other people, the ones who had their sites crawled?
6:26 am on Oct 21, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


It really depends on how the person is paying for their website. For example, I am not charged for the bandwidth of people downloading from my site; on the other hand, if you are on shared hosting, you might be. My crawler does respect robots.txt and has some code to slow down on, or entirely skip, sites that seem slow to respond, so hopefully I am not too much of a burden. I am pretty sure my crawler was hitting smaller sites a lot less frequently than Google does. You may or may not be surprised at the number of sites that basically forbid anyone but Google from crawling them (not even allowing Bing). To me, in the long run that might be riskier than any short-term savings in bandwidth.
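
To give a concrete picture, the robots.txt on that kind of site boils down to something like this (a composite example, not any particular site's file):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Googlebot is allowed everything; every other crawler, mine and Bing's included, is shut out entirely.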
4:31 pm on Oct 21, 2015 (gmt 0)

Full Member

5+ Year Member

joined:Aug 16, 2010
posts:257
votes: 21


I am interested in the source code, but I can't find it on the website you mentioned... am I missing something?

What kind of storage software are you using?

P.S. You know you can just download the complete Common Crawl database of 145TB?
9:26 pm on Oct 21, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


The download link on the page:
[seekquarry.com...]
gives you the source code. If you scroll down that page, you can also see how to clone the repository with git.

For the crawl, I am using 8 x 4TB drives -- the cheapest I could buy on Amazon; this was a low-budget operation, personally financed. They are configured as RAID 0. That has actually caused some issues: I didn't have enough cash to keep backups of the crawl itself, and with RAID 0 a single drive crashing loses me two drives' worth of crawl. The index data is in its own format, since I coded this project from scratch. On disk, the format uses a sequence of files, each less than 2GB, so it works with older Linux systems. Within a file, the raw data is a sequence of compressed web pages, similar to a WARC file. The index and dictionary structures are more complicated; the dictionary portion is kept on SSDs.
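
If it helps to picture the on-disk layout, the idea is roughly the following much-simplified sketch (not the actual Yioop code; the class name, constant, and record format here are made up for illustration):

<?php
// Sketch: append gzip-compressed pages to a sequence of files,
// rolling over to a new file before any file reaches 2GB.
class PageArchiveSketch
{
    const MAX_FILE_SIZE = 2000000000; // stay under the 2GB-per-file limit

    private $dir;
    private $fileNum = 0;
    private $handle = null;

    public function __construct($dir)
    {
        $this->dir = $dir;
        $this->openNextFile();
    }

    private function openNextFile()
    {
        if ($this->handle) {
            fclose($this->handle);
        }
        $path = sprintf("%s/archive_%05d.bin", $this->dir, $this->fileNum++);
        $this->handle = fopen($path, "ab");
    }

    public function writePage($url, $html)
    {
        $record = gzcompress(serialize(["url" => $url, "page" => $html]));
        // length-prefix each record so the file can be read back sequentially
        $data = pack("N", strlen($record)) . $record;
        if (ftell($this->handle) + strlen($data) > self::MAX_FILE_SIZE) {
            $this->openNextFile();
        }
        fwrite($this->handle, $data);
    }
}

Reading a file back is just the reverse: read the 4-byte length, read that many bytes, gzuncompress, unserialize, and repeat until end of file.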

Non-crawl data (users, wikis, group feeds, news crawls) is stored in a Postgres database, which is backed up.

P.S. I didn't think Common Crawl was that big. In any case, it would take a while to download that much data from Common Crawl. After downloading it, you would still need to index it, and of course you would be subject to whatever agreements Common Crawl has. I thought most people just operated on Common Crawl data in situ.
9:36 pm on Oct 21, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


Oh, I meant 12 x 4TB drives in the last post.
6:03 am on Oct 22, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:9908
votes: 971


Unless you have it doubled and mirrored off site (in case one location burns down or is flooded), you are a power surge away from nothing. Meanwhile, check your frontend... it says you are running 8TB drives. :)

What is the purpose of this crawl? How in-depth is it? Just for fun, I did a few searches. Do you know your auto-complete fails on way too many searches? It will not accept a search as typed when it thinks the query is something else.

Otherwise, congrats (and you've been blocked for some time; I checked my .htaccess and robots.txt, but with no date I don't know how long). No offense intended! I just don't know who you are or what you are doing, and I never saw a referral before blocking.

But from a tech and personal-expense standpoint, I am a bit impressed, though I wonder how many of those billion pages are Wikipedia, as that was just about all I saw for any given search.
3:26 pm on Oct 22, 2015 (gmt 0)

New User

5+ Year Member

joined:Jan 22, 2012
posts: 33
votes: 2


Yeah, I am being imprecise: I was counting 2 x 4TB drives in a single RAID 0 enclosure, connected via USB 2, as a single 8TB drive. I do intend to back up my web crawls as I get money for more equipment and prices come down. When Samsung's 16TB SSDs become cheap, I think we will truly be in an age where this is practical. I think Google actually keeps its indexes in RAM. I seem to remember blekko (now part of IBM) getting venture funding to buy 100TB of SSD; that would allow billion-page indexes and crawls to live entirely on SSD, with redundancy.

The purpose of my crawling is to make a usable search engine. I think 1 billion pages would be sufficient for starters (much smaller than Google); it is roughly the size commercial systems had in 2004, but this is a personal system in 2015. I think you would agree I haven't achieved my goal yet. I am using it as my default search engine, though, before trying searches on other engines -- this lets me see how it's failing (or sometimes not) so that I can improve it in the next iteration.

Having more hardware means you can do more preprocessing passes over the index before presenting it to the world. My index is being presented after just one pass, which partially explains the skew in results you are noticing.

Also, unlike Google et al., I don't have the infrastructure to handle auto-complete server-side, so auto-complete is done client-side, based on local-storage tricks and a single trie.
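
The trie part is nothing exotic; the idea is roughly the following (sketched in PHP for readability -- the real thing is JavaScript in the browser working out of local storage, and this is not the actual code):

<?php
// Sketch of prefix completion with a trie built from suggestion terms.
class TrieSketch
{
    private $root = [];

    public function insert($term)
    {
        $node = &$this->root;
        foreach (str_split($term) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['$'] = true; // sentinel marking the end of a complete term
    }

    public function complete($prefix, $limit = 10)
    {
        $node = $this->root;
        foreach ($prefix === "" ? [] : str_split($prefix) as $ch) {
            if (!isset($node[$ch])) {
                return []; // no stored term starts with this prefix
            }
            $node = $node[$ch];
        }
        $results = [];
        $this->collect($node, $prefix, $results, $limit);
        return $results;
    }

    private function collect($node, $prefix, &$results, $limit)
    {
        if (count($results) >= $limit) {
            return;
        }
        if (isset($node['$'])) {
            $results[] = $prefix;
        }
        foreach ($node as $ch => $child) {
            if ($ch === '$') {
                continue;
            }
            $this->collect($child, $prefix . $ch, $results, $limit);
        }
    }
}

You insert the terms you want to be able to suggest, then call complete() with whatever the user has typed so far.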