I'm quite new to these forums, so I hope I'm writing this in the right forum.
I am currently writing my master's thesis in computer science at the University of Oslo, Norway. Part of this thesis is writing a spider that, for all practical purposes, acts like a search engine spider.
My master's project is creating a software package that can perform scientific measurement of the internet. (Yeah, I know it sounds abstract, but bear with me.) The data that this software collects will not be publicly available.
The main idea is to create a spider that starts with just a few hand-picked pages and then just let it spread. The pages it visits will be cached locally. After a while the spider should start to revisit all the pages in the cache and compare the new copies to the cached ones. Then it can start measuring different things about the websites in the cache.
After reading a lot of posts in this forum, I've come to understand that webmasters are quite paranoid and protective of their content, which is perfectly understandable. But that is why I'm asking you: which rules should I follow when I create my spider?
I have picked up a few things:
But I still have quite a few questions that I don't have the answer to:
Phew, this was a long post. To sum things up, the main question is: Do you have any guidelines to follow when creating a good spider?
Best Regards
Vidar Johansen
How often does the page change? And how much?
Use an RSS feed to get new content, or use the Sitemaps XML file that all the major search engines support. Otherwise, wait a week or more before crawling the same page again, note which pages have substantially changed, and crawl those pages more often than all the others going forward.
Pages that never seem to change can just float to the bottom of the stack, and in theory the webmaster will use Sitemaps to notify the search engine when a page changes, if ever.
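To make that concrete, here is a rough, untested Python sketch of the idea - the interval bounds and function name are just made up for illustration:

```python
# Hypothetical sketch: shrink the revisit interval for pages that changed,
# grow it for pages that didn't, within sensible bounds.

MIN_INTERVAL = 24 * 3600        # never revisit more than once a day
MAX_INTERVAL = 30 * 24 * 3600   # pages that never change float to the bottom

def next_interval(previous_interval, page_changed):
    """Halve the interval when a page changed, double it when it didn't."""
    if page_changed:
        interval = previous_interval / 2
    else:
        interval = previous_interval * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))

# Example: a page that changed on the last visit gets revisited sooner.
print(next_interval(7 * 24 * 3600, page_changed=True))   # ~3.5 days
```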
Set up a website with information about the spider and what it does and why. (Include this in the UA?)
Yes, in the UA is preferred.
Respect the site's bandwidth and don't pull too much data too fast
Honor the crawl-delay directive of robots.txt if you find one.
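For what it's worth, recent versions of Python's standard robotparser handle both the allow/disallow rules and Crawl-delay; a minimal sketch (the user agent string and URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

UA = "MyResearchSpider"
if rp.can_fetch(UA, "http://www.example.com/some/page.html"):
    # crawl_delay() returns None when no Crawl-delay applies to this UA,
    # so fall back to a conservative default.
    delay = rp.crawl_delay(UA) or 5
    print("allowed, waiting", delay, "seconds between requests")
else:
    print("disallowed by robots.txt")
```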
How long should I wait between each page pull from a site? And how long should I wait between each time I revisit a site to compare pages?
I would suggest from 5-10 seconds per page by default.
Don't read the robots.txt, 'cause if you're not Google, you're blocked
If you don't read robots.txt, you'll definitely get blocked.
How to deal with redirects and other HTTP error pages? How many pages can return a 404 before I realise I'm banned? How long until I could retry to see if I'm unbanned?
404s are "page not found"; blocked bots typically get a 403 Forbidden instead.
However, you'll learn real fast that not all error pages return error status codes - plenty of sites give "200 OK" for actual errors - and you'll find you need a list of hundreds of fingerprints of error pages.
One way to test a site's error handling is to request a couple of randomly named pages, such as http://www.example.com/gibberishtotestthe404error.html, and see what you get. Some sites redirect to the index page instead, so you have to remember that landing on the index page for that request really means a 404 error.
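A rough Python sketch of that probe (example.com, the user agent and the single-hash "fingerprint" are placeholders; a real crawler would collect many more fingerprints):

```python
import hashlib
import urllib.error
import urllib.request
import uuid

def error_fingerprint(base_url):
    # Request a page that cannot exist and remember what the site serves.
    probe = base_url + "/" + uuid.uuid4().hex + "-no-such-page.html"
    req = urllib.request.Request(probe, headers={"User-Agent": "MyResearchSpider"})
    try:
        with urllib.request.urlopen(req) as resp:
            # "200 OK" for a gibberish URL means this site serves soft errors;
            # keep a hash of the body (and the final URL, in case it redirects
            # to the index page) so real pages can be compared against it.
            return resp.geturl(), hashlib.md5(resp.read()).hexdigest()
    except urllib.error.HTTPError as e:
        return None, e.code     # a proper 404/410 - nothing special needed

print(error_fingerprint("http://www.example.com"))
```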
It's complicated, good luck!
I would suggest from 5-10 seconds per page by default.
A 1 second delay seems a reasonable default; I state this on the basis of 85 billion crawled pages. Those webmasters who are sensitive to this can and do use Crawl-delay, which would probably cover 2-3% of all URLs.
If-Modified-Since, unfortunately, is not supported well by dynamic pages, and these days the majority of pages are likely to be dynamic. It does not hurt to use it, however.
ETags are fairly rare - they only apply to static pages - but it is a good idea to support them if you do recrawls.
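A rough sketch of a conditional re-fetch in Python - the stored ETag and date are placeholders standing in for values saved from an earlier crawl:

```python
import urllib.error
import urllib.request

stored_etag = '"abc123"'                             # from the previous crawl
stored_last_modified = "Sat, 28 Jun 2008 00:00:00 GMT"

req = urllib.request.Request(
    "http://www.example.com/page.html",
    headers={
        "User-Agent": "MyResearchSpider",
        "If-None-Match": stored_etag,
        "If-Modified-Since": stored_last_modified,
    },
)
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()          # page changed (or validators were ignored)
except urllib.error.HTTPError as e:
    if e.code == 304:
        body = None                 # Not Modified: reuse the cached copy
    else:
        raise
```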
Recrawling should ideally be based on how frequently a page changes, so that if it does not change you don't recrawl it too often. This is a bigger problem than it may sound, because you can't just calculate an MD5 hash of the page: a lot of pages change a little (say, by printing the current date/time) but are not substantially changed.
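A naive Python illustration of why a plain hash isn't enough - this one just strips markup, digits and whitespace before hashing, which is far cruder than what real crawlers do (shingling, simhash and the like), but it shows the idea:

```python
import hashlib
import re

def content_fingerprint(html):
    text = re.sub(r"<[^>]+>", " ", html)      # drop markup
    text = re.sub(r"\d+", "", text)           # drop digits (dates, counters)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()

old = "<p>Updated 2008-06-27 13:05</p><p>Same article text.</p>"
new = "<p>Updated 2008-06-28 09:41</p><p>Same article text.</p>"
# Only the timestamp changed, so the fingerprints match - no recrawl needed.
assert content_fingerprint(old) == content_fingerprint(new)
```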
Also support gzipped content - this saves you and webmasters valuable bandwidth; you can reasonably expect to crawl 30-35% more URLs (it can be 50%, depending on the URLs) for the same bandwidth.
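A minimal gzip-aware fetch might look something like this (the URL and user agent are placeholders):

```python
import gzip
import urllib.request

req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "MyResearchSpider", "Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    raw = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        html = gzip.decompress(raw)   # decompress only if the server complied
    else:
        html = raw
print(len(raw), "bytes on the wire,", len(html), "bytes of HTML")
```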
And last but not least: never argue with wilderness here, you won't win...
A 1 second delay seems a reasonable default; I state this on the basis of 85 billion crawled pages.
That's why you aren't allowed to crawl my servers because it's not reasonable.
One size does not fit all.
My primary site is VERY dynamic, very popular, and even the big search engines honor my crawl-delay. However, every now and then, once or twice a month, everything converges with a high traffic day, too many bots at once, and the server backlogs, the server alarms go off, and all my phones start ringing to alert me the site isn't responding.
If it wasn't for my own homebrew traffic control software the problem would go critical and remain that way for hours, like it used to do. Now it usually takes 2-3 minutes for the backlog to flush after locking out overly aggressive spiders (2s per page or less), but it recovers on its own.
That's why I said 5-10s because I didn't want to see the poor guy get booted for abusing a server on his first attempt.
Don't forget, your spider isn't the only spider on the server as I'm hosting up to 20 crawlers of some sort at a time plus 60-100 visitors average so it's very easy to push a dual XEON box over the edge without some level of control.
Besides, anyone who knows anything about queues can queue up thousands of pages to crawl across a variety of domains on multiple threads, yet keep the impact per domain spread out enough in the queue that it doesn't cause problems, and 5-10s per domain out of thousands of pages (millions?) in a queue seems pretty simple.
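Something along these lines would do it - a simplified, untested sketch where fetch() stands in for whatever download code the spider uses, and the 10 second spacing is just the figure suggested above:

```python
import heapq
import time
from urllib.parse import urlsplit

PER_HOST_DELAY = 10  # seconds between requests to any single host

def crawl(urls, fetch):
    # One global priority queue keyed by the earliest time a URL may be
    # fetched; many hosts are crawled in parallel in wall-clock terms while
    # each individual host only sees one request every PER_HOST_DELAY seconds.
    next_allowed = {}                      # host -> earliest next request time
    queue = [(0, url) for url in urls]     # (ready_time, url)
    heapq.heapify(queue)
    while queue:
        ready, url = heapq.heappop(queue)
        host = urlsplit(url).netloc
        now = time.time()
        earliest = max(ready, next_allowed.get(host, 0))
        if earliest > now:
            # Not this host's turn yet: push it back and let other hosts run.
            heapq.heappush(queue, (earliest, url))
            time.sleep(min(0.1, earliest - now))
            continue
        fetch(url)
        next_allowed[host] = time.time() + PER_HOST_DELAY
```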
[edited by: incrediBILL at 2:32 am (utc) on June 28, 2008]
That's why you aren't allowed to crawl my servers because it's not reasonable.
Fair enough; we support robots.txt (including Crawl-delay) so that you can make exactly that choice. Because we support Crawl-delay (and have since the time when only MSN supported it), the webmaster has control over how slowly they want to be crawled (up to a limit that we consider reasonable).
Your view, Bill, is that of a dug-in webmaster who thinks he is under some kind of siege. You are in a very small minority as far as crawling is concerned; statistically, what you do is irrelevant when it comes to crawling the web. That does not mean your conditions for crawling should not be respected - obviously a crawler should respect them on the basis of your robots.txt - but it would be a big mistake for a bot builder to assume that your views somehow represent the majority of webmasters. They don't.
The main job of a bot builder is to ensure that people like you (and some others on this site) are satisfied that their robots.txt is honoured - that's why good support for robots.txt is essential, even while you are just debugging your bot.
If you have got a slow server then just use Crawl-delay, and if you think too much bandwidth is being used then support gzip to compress pages.
Very kind of you to tribute me with such a compliment.
;)
3.) Please don't read this robots.txt every minute:
User-agent: *
Disallow: /
Your view, Bill, is that of a dug-in webmaster who thinks he is under some kind of siege...
Well, you're statistically wrong this time.
The site I speak about gets enough traffic that it's one of the sites you can check in Google's website trends - not nearly as much as WebmasterWorld, but statistically enough to rank.
I would call having the 3 majors indexing and crawling my site to the tune of 40K-50K pages per day fairly significant, not to mention some of the other little SEs that I do allow to ramble through my site. Then we have a ton of spiders I boot from all over the world - I usually get about 50-100 requests per day from things that never returned traffic, and those get denied. Don't forget all the spybots, botnets, link checkers, data mining, scraping and everything else you can imagine that my bot traps automatically detect and kick to the curb.
So yes, my site is under siege and many others are as well which is why I told the OP to be very kind with a slow crawl limit so as not to burden sites like mine, and there are quite a few, that are being overrun with automated activity.
Any questions?
How long should I wait between each page pull from a site? And how long should I wait between each time I revisit a site to compare pages?
This will give webmasters time to see your visit in their log files and decide whether or not to include you in their robots.txt.
Newly arriving on a site and just pulling pages is not good etiquette; of course your bot was not disallowed in the robots.txt, because it was not previously known.
Well, you're statistically wrong this time.
No, I am not, Bill - unlike you, I can see the big picture of all sites on the web, and I know how many of them use robots.txt at all, how many use crawl-delay, and other things.
You are, like a fair few people in this forum, making a rather typical human mistake of judging the world from the view present in your own small (or big) place, but this view is very biased and does not show the big picture. You need to rise above your personal circumstances and think big, something that you understandably might not be interested in doing, since to you your site is the most important thing. There are a fair few sites that ought to be much bigger than yours that don't allow crawling by anyone but the big 3 SEs; those sites are a problem for any upstart, but even so, they are in a minority, and this should not adversely affect indexing.
Crawl-delay was created for just this case, and I endorse it fully. However, to state that a 10-15 second delay should be the default (for all domains, even those without robots.txt) is totally wrong from a crawling point of view. If YOU think your site should be crawled slowly, then use Crawl-delay; this is nowadays an acceptable practice, just like using robots.txt to disallow URLs you don't want crawled.
Anyway, I think this is turning into a discussion whose value is diminishing with each post; I think I've made my points pretty clear.
Here's a link for you, Vidar. Make sure your robot comprehends the meaning of each server response code and acts accordingly: Hypertext Transfer Protocol -- HTTP/1.1 [w3.org]
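For example, the reaction to each class of code might be sketched like this in Python - the action names are invented for illustration, but the redirect, 304 and Retry-After handling follow the spec linked above:

```python
# Illustrative dispatch only; a real crawler would do more per code.

def handle_response(url, status, headers):
    if status == 301:
        return ("update-url-and-follow", headers.get("Location"))
    if status in (302, 303, 307):
        return ("follow-but-keep-original-url", headers.get("Location"))
    if status == 304:
        return ("keep-cached-copy", None)
    if status in (404, 410):
        return ("drop-from-cache", None)
    if status == 403:
        return ("back-off-from-host", None)      # quite possibly blocked
    if status == 503:
        # Server overloaded or down for maintenance; honour Retry-After.
        return ("retry-later", headers.get("Retry-After"))
    if 200 <= status < 300:
        return ("store-and-parse", None)
    return ("log-and-skip", None)

print(handle_response("http://www.example.com/old.html", 301,
                      {"Location": "http://www.example.com/new.html"}))
```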
Jim
Just offering a simple fact about avoiding spider traps that stop crawlers that hit servers too fast. Not everyone runs bot traps, but if they do, it's the difference between crawling and eating 403s.
With the large number of crawlers these days, a new crawler that is less conspicuous and less aggressive simply improves its chances of not being blocked by webmasters or spider traps, because it is not perceived as a nuisance or a threat.
[edited by: incrediBILL at 5:59 pm (utc) on June 28, 2008]
It would greatly increase crawl time if the default delay for all sites was 10-15 seconds, especially for sites with a lot of content. Some sites have literally hundreds of millions of URLs; even with a 1 second delay it would take years to crawl them (100 million URLs at one request per second is over three years of continuous crawling), so what do you think will happen if the delay is 10 times higher? If you have got a lot of pages on your site and big traffic from visitors, then you should not be surprised if bots come to you en masse - it's the price of success, and some visitors will always be a cost to business. Use gzip to save on bandwidth (applies to bot and site owners equally) and optimise your backend to handle pages quicker.
Some people will never be pleased with anything, that's true for all walks of life, so in this case a good spider will obey robots.txt to avoid unnecessary confrontation with a small but very vocal minority who takes things way too personally. IMHO.
use gzip to save on bandwidth (applies to bot and site owners equally) and optimise your backend to handle pages quicker.
For many large dynamic sites the bandwidth of data isn't really the issue; it's the bandwidth of CPU that's more critical. The problem with gzip is that it adds more CPU usage to an already strained server and makes your site seem even slower to visitors, because their browsers don't get any content until the entire page is generated, gzipped, downloaded and decoded in the browser, as opposed to the normal interactive streaming download of the page.
IMO the best way to regulate the speed of the spider, short of a specific crawl-delay, would be to simply spread out the request time per page based on the response time of each page. For instance, if a request takes 3 seconds to return the page, then delay that amount of time before asking for the next page. That way, if the server speeds up or slows down, the crawler paces itself to the server's performance instead of driving the server further into the red by pounding on it.
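A small sketch of that pacing idea (the minimum delay, URL list and user agent are placeholders):

```python
import time
import urllib.request

def paced_crawl(urls, minimum_delay=5.0):
    for url in urls:
        start = time.time()
        req = urllib.request.Request(url, headers={"User-Agent": "MyResearchSpider"})
        with urllib.request.urlopen(req) as resp:
            resp.read()
        elapsed = time.time() - start
        # Wait at least `minimum_delay`, or the server's own response time
        # if that is longer - a struggling server gets extra breathing room.
        time.sleep(max(minimum_delay, elapsed))
```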
takes things way too personally
Well, we do tend to take it personally when our livelihood is threatened. I never paid much attention to spiders until a few years back, when some overly aggressive Asian mirroring sites would come and literally knock my site offline for up to 30 minutes, asking for up to a hundred pages a second. When visitors can't get to the site due to the actions of bots and you can't make any money, it does tend to get a little personal.
It's the entitlement mentality of all the spybots, scrapers and offline readers (and there are a lot of them), like the Asian mirroring sites mentioned above, that causes the real problems.
If those types of bots didn't exist this conversation would be pretty moot.
Besides, you run one of the legit bots that tries to do it right, so don't take it personally that we're tainted by the actions of others.
At your first visit, pull one page and leave. Then come back no sooner than 24 hrs later.
Beyond that, you have to find some way to guess what kind of bandwidth strain a website may be able to take, in order to figure out how deeply you should crawl it. One measure you can take is how quickly the pages are served, taking into account how large they are.
--
What Makes A Nice Spider?
1. It asks for permission to access my site.
2. It takes "No" for an answer if that is what I tell it.
3. It is properly configured and does not make idiotic requests.
4. It honestly states its identity and (in a reference URL) its purpose.
5. It explains in that reference URL how it benefits me or humanity in general.
Both Microsoft and Yahoo send robots that fail this test, and if I blocked both I would save a lot of bandwidth and lose very little human traffic. But I allow them because I welcome all human traffic, and because lack of competition for Google can only be unhealthy.
--
What Makes A Good Spider?
Good presumably meaning successful. The job is to access and index other people's stuff.
I have some sympathy for new entrants in the field, but the reality is that good spiders are far outnumbered by scrapers, scammers, spammers, thieves, probes, database hackers and sundry bandwidth-wasting robots operated by otherwise legitimate companies who somehow think that being dishonest and attempting to deceive webmasters is acceptable behaviour.
So a spider must convince me that allowing access would be a good idea. Most fail.
Commercial projects usually fail because they want to use my resources to make a profit and cannot convince me that I will ever get anything in return.
Academic projects usually fail because they cannot convince me that humanity in general will benefit from what they do or that they will not subsequently monetize the data collected.
Spiders want my stuff. What's in it for me?
--
Conclusion
If I were submitting a thesis to the University of Oslo it would say that a "good/nice spider" is one that can gain the trust, respect and goodwill of the owners of the stuff it needs and feeds on.
Only that road can lead to popular success.
...
For many large dynamic sites the bandwidth of data isn't really the issue, it's the bandwidth of CPU that's more critical.
Well, that's what multi-core CPUs are for - this task of gzipping on the fly is perfect for multi-threading. Even though I agree gzip itself is rather slow, many sites could do with more optimal HTML - say, Google's is a very lightweight one.
I agree about the crawler auto-increasing its delay if it detects that it is taking too long to fetch a page from the server, thus indicating that the server is possibly overloaded.
I do appreciate that you guys get hit by all sorts of nasty stuff; I was hit myself earlier this year by a fake bot, and that is something that is very bad, as there is not a lot that can be done about it :(
create a spider that starts with just a few hand picked pages, and then just let it spread.
Efficient Crawling Through URL Ordering [dbpubs.stanford.edu] (Junghoo Cho, Hector Garcia-Molina, Lawrence Page)
This is an old one but goes into detail about how important URL ordering can be on larger scales.
I read it a while back, and looking at the numbers (i.e. "the web containing around 1.5TB of data") I'd imagine you could replicate the test quite easily given 2008 resources.
[edited by: brotherhood_of_LAN at 8:18 pm (utc) on June 28, 2008]
perceived throughput is still slower with gzip, tried it, not good.
I think it depends on the size of the page: if it is fairly small, so that the gzipped part is pretty small too, then it should all be quick; but if it's a huge page then yes, maybe the perceived rendering won't be good. You can't have it both ways - either save on bandwidth or have fast perceived response. Maybe you could always use gzip for bots? Bots don't have perceived-response issues, so using gzip for them will reduce the impact of crawlers on your bandwidth. We started supporting gzip (again) recently, and this allowed us to crawl 50% more data for the same bandwidth while saving a lot of bytes for webmasters who support it. IMO all good bots should support gzip.
save on bandwidth
That's really not an issue as I have 2,000GB allocation per month per server and my busiest server is only using 200GB+ per month at the moment, so I have some room to grow.
It's really all CPU burn with the dynamic pages.
However, I can see where gzip would help a spider immensely, because compressed pages mean faster downloading, which means even more pages - so theoretically you're increasing your crawl capability if the webmaster is willing to gzip the pages.
Maybe someone else knows but I'm unclear on whether, or how, you can selectively gzip in Apache based on the user agent.
It is certainly good for bots that support gzip (like us) that sites support it too - it helps crawl a lot more for less bandwidth on both sides; a win-win, really, if there ever was one.
Jim