Forum Moderators: open
robots.txt? NO
We can't link to blogs, so if you search for --
cugillion.com scrape
-- you'll find its origins and current conduct, plus this quote:
"Page scraping with spiders really is an effective technique that has given me a new perspective as well as a tremendous amount of data."
Not mine.
Automated bots of various forms are a scourge on the Internet. Collectively they are becoming a serious burden on server resources and driving up the operating costs of running a website.
If you don't want your bot on universal ban lists, then you had better A) make sure it obeys the robots.txt file and B) give it a real purpose that helps drive human visitors to the sites you crawl.
Thank you for voicing your concerns. I completely understand them, and I mean no harm by spidering websites. It is true that other people use spidering for reasons that do not directly help anybody. I do, however, intend to use the data I've collected to develop a search engine and other related tools that benefit the community.
I apologize for any concern this caused. I have temporarily suspended my spidering while I write code to read robots.txt. I did not expect Cugillion to be as successful as it has been, and I have been working in an expedited fashion. If you wish to prevent your site from being indexed, you may specify it in your robots.txt:
User-agent: Cugillionbot
Disallow: /
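For anyone curious, a minimal sketch of such a robots.txt check in Python, using the standard library's urllib.robotparser; the per-host cache, fetch logic, and example URL are illustrative, not Cugillionbot's actual code:

import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "Cugillionbot"
_parsers = {}  # cache one parsed robots.txt per host

def can_fetch(url):
    """Return True only if the host's robots.txt permits fetching url."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if root not in _parsers:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(root + "/robots.txt")
        rp.read()  # download and parse the host's robots.txt once
        _parsers[root] = rp
    return _parsers[root].can_fetch(USER_AGENT, url)

if can_fetch("http://example.com/some/page.html"):
    pass  # polite to fetch; otherwise skip the URL entirely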
I expect my development to take several months. In the meantime, I have set up a Gmail account to answer any questions that arise while I configure my server. You may reach me at cugillion@gmail.com and I will respond promptly to any concern that arises.
Aaron
1.) Here's hoping you'll also (re)program your bot to respect ALL aspects of the Robot Exclusion Standard [en.wikipedia.org]. For example, in robots.txt --
User-agent: *
Disallow: /
-- and in HTML, ALL Robots META Tags [NoArchive.net] (see the sketch after this post):
<META NAME="ROBOTS" CONTENT="NONE, NOINDEX, NOFOLLOW, NOARCHIVE, NOSNIPPET">
2.) Pardon me, but I'm confused. You state on your blog that your program is "a spider for page scraping". You also state that an early version was "collecting domains faster than it could view them". And later, also in the OP: "Page scraping with spiders really is an effective technique that has given me a new perspective as well as a tremendous amount of data."
You're in the SEO business, so I know you know that data mining/harvesting/extracting/indexing/spidering may or may not be scraping per se. The former commonly means retrieving 'plain' .html files; the latter takes everything down to the studs: .html, .js, .css, .txt, .pdf, .doc, all dynamic files, all graphic file types, etc. (A content-type filter sketch also follows this post.)
(FWIW: I abhor all robots.txt fundamentals-ignoring bots because their use wastes my and thus my clients' resources. Their use also raises issues ranging from copyright infringement to trespass and theft.)
So anyway, precisely what is your bot doing, please? Spidering or scraping? TIA for your reply.
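On point 1.), a minimal sketch of honoring the Robots META tag, again in Python with the standard library's html.parser; the directive handling here is deliberately simplified, not a complete implementation of the standard:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

def may_index(html_text):
    """Return False when the page opts out via NOINDEX or NONE."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return not ({"noindex", "none"} & parser.directives)

assert not may_index('<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">')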
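And on point 2.), one way a spider that only wants 'plain' .html files could draw that line is to issue a HEAD request first and skip anything not served as text/html. A sketch, with an illustrative URL and the bot's user-agent assumed:

import urllib.request

def is_plain_html(url):
    """HEAD the URL and report whether the server calls it text/html."""
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "Cugillionbot"})
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get_content_type() == "text/html"

if is_plain_html("http://example.com/article.html"):
    pass  # a spider fetches this; a scraper would also take .js, .css, .pdf, images, etc.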
I'm just going to give a short background before I answer questions. I started Cugillion independently as a "for fun" project. We have coded spiders and used them on a much smaller scale for various small projects before. I engineered a spider and sent it off thinking it would just eventually fail, and it didn't. After talking about its success at work and eventually showing it to my executives, we decided that with some funding we might be able to turn this into something beneficial for everyone.
1) I will follow all standards and take a couple of extra steps.
2) I do work at an SEO firm. We have been extremely successful. The amount of data we collect will depend on how successful this really is. We're labeling this a “research project” for a reason. The first goal is to collect enough data to mimic other engines. At <snip> we spend countless hours researching ways to dissect engines to learn their secrets.
After we have the data, we'll use it to reverse engineer the ranking strategies used by popular engines. These tools will probably be shared with the community. So far, I am the only engineer working on this project, and I work on it after hours. After we clean our slate a little more, we may devote a team to it.
Currently, we are collecting page text and HTML. You know as well as I do: Google, Bing, and Yahoo aim to take over the world. I will be writing several kinds of spiders/scrapers for several different tasks, and we plan to maintain a “fresh internet” as much as we can. SEO goes way further than simple page text and HTML, but this is how we're entering. I also have interests in other world search engines, such as Baidu. We'll eventually mirror most search engine trends. Building a great search engine requires money and power; we have invested very little thus far.
Tangor,
The ultimate goal is to end up mirroring the major engines. Maybe, if time permits and success grants it, we can compete; we have no intention of competing right now. My time is already booked until March, so I have to work hard in my spare time to grow the project. By then, I'm positive we'll have something to play with. A six-month commitment at least. If it is successful enough, I will lead a team developing it further.
Thank you guys for your interest. Your questions have made my day.
Aaron
[edited by: incrediBILL at 2:46 pm (utc) on Jan. 15, 2010]
[edit reason] No self-promotion, company name removed [/edit]
Sorry, but no deal.
There's no intention to harm anybody.
That doesn't fly, because the purpose of SEO is always to boost your site/customer's search engine rankings at someone else's expense.
Only 10 people can exist in the top 10, and whoever you push to #11 has been harmed.
Considering this is an SEO research project, designed to learn how to manipulate search engines, harm is the obvious intent and the end result.
Not to mention you're using the bandwidth and server resources of people unaware of your crawler, along with the thousands of other crawlers that are "doing no harm," because the average webmaster doesn't know to block you in the first place, nor even know you exist.
So not only do you intend to ultimately harm some of the sites you crawl via SEO manipulation, but you also make them pay for the privilege of being harmed!
Research indeed.
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Teoma
Disallow: /cgi-bin
User-agent: *
Disallow: /
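For anyone wondering how that grouping parses: the four named bots share the single /cgi-bin rule, and everyone else hits the blanket Disallow. A quick check with Python's standard urllib.robotparser (the example.com URLs are placeholders):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Teoma
Disallow: /cgi-bin

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse the rules directly rather than fetching them
assert rp.can_fetch("Googlebot", "http://example.com/page.html")
assert not rp.can_fetch("Googlebot", "http://example.com/cgi-bin/search")
assert not rp.can_fetch("Cugillionbot", "http://example.com/anything.html")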