Forum Moderators: open
robots.txt? NO
We can't link to blogs, so if you search for --
cugillion.com scrape
-- you'll find its origins and current conduct, plus this quote:
"Page scraping with spiders really is an effective technique that has given me a new perspective as well as a tremendous amount of data."
Not mine.
Automated bots of various forms are a scourge on the Internet. Collectively they are becoming a serious burden on server resources and driving up the operating costs of running a website.
If you don't want your bot on universal ban lists, then you had better A) make sure it obeys the robots.txt file and B) give it a real purpose that helps drive human visitors to the sites you crawl.
Thank you for voicing your concerns. I completely understand them, and I mean no harm by spidering websites. It is true that other people use spidering for reasons that do not directly help anybody. I do, however, intend to use the data I've collected to develop a search engine and other related tools that benefit the community.
I apologize for any concern this caused. I have temporarily suspended my spidering while I write code to read robots.txt. I did not expect Cugillion to be as successful as it has been, and I have been working in an expedited fashion. If you wish to prevent your site from being indexed, you may specify it in your robots.txt:
User-agent: Cugillionbot
Disallow: /
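For anyone curious, a minimal sketch of such a robots.txt check in Python, using the standard library's urllib.robotparser; the per-host cache, fetch logic, and example URL are illustrative, not Cugillionbot's actual code:

import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "Cugillionbot"
_parsers = {}  # cache one parsed robots.txt per host

def can_fetch(url):
    """Return True only if the host's robots.txt permits fetching url."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if root not in _parsers:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(root + "/robots.txt")
        rp.read()  # download and parse the host's robots.txt once
        _parsers[root] = rp
    return _parsers[root].can_fetch(USER_AGENT, url)

if can_fetch("http://example.com/some/page.html"):
    pass  # polite to fetch; otherwise skip the URL entirely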
I expect my development to take several months. In the meantime, I have set up a Gmail account to answer any questions that arise while I configure my server. You may reach me at cugillion@gmail.com and I will respond promptly to any concern that arises.
Aaron
1.) Here's hoping you'll also (re)program your bot to respect ALL aspects of the Robot Exclusion Standard [en.wikipedia.org]. For example, in robots.txt --
User-agent: *
Disallow: /
-- and in HTML, ALL Robots META Tags [NoArchive.net] (see the sketch after this post):
<META NAME="ROBOTS" CONTENT="NONE, NOINDEX, NOFOLLOW, NOARCHIVE, NOSNIPPET">
2.) Pardon me, but I'm confused. You state on your blog that your program is "a spider for page scraping". You also state that an early version was "collecting domains faster than it could view them". And later, also in the OP: "Page scraping with spiders really is an effective technique that has given me a new perspective as well as a tremendous amount of data."
You're in the SEO business, so I know you know that data mining/harvesting/extracting/indexing/spidering may or may not be scraping per se. The former commonly means retrieving 'plain' .html files; the latter takes everything down to the studs: .html, .js, .css, .txt, .pdf, .doc, all dynamic files, all graphic file types, etc. (A content-type filter sketch also follows this post.)
(FWIW: I abhor all robots.txt fundamentals-ignoring bots because their use wastes my and thus my clients' resources. Their use also raises issues ranging from copyright infringement to trespass and theft.)
So anyway, precisely what is your bot doing, please? Spidering or scraping? TIA for your reply.
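On point 1.), a minimal sketch of honoring the Robots META tag, again in Python with the standard library's html.parser; the directive handling here is deliberately simplified, not a complete implementation of the standard:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

def may_index(html_text):
    """Return False when the page opts out via NOINDEX or NONE."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return not ({"noindex", "none"} & parser.directives)

assert not may_index('<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">')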
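And on point 2.), one way a spider that only wants 'plain' .html files could draw that line is to issue a HEAD request first and skip anything not served as text/html. A sketch, with an illustrative URL and the bot's user-agent assumed:

import urllib.request

def is_plain_html(url):
    """HEAD the URL and report whether the server calls it text/html."""
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "Cugillionbot"})
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get_content_type() == "text/html"

if is_plain_html("http://example.com/article.html"):
    pass  # a spider fetches this; a scraper would also take .js, .css, .pdf, images, etc.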
I'm just going to give a short background before I answer questions. I started Cugillion independently as a "for fun" project. We have coded spiders and used them on a much smaller scale for various small projects before. I engineered a spider and sent it off thinking it would just eventually fail, and it didn't. After talking about its success at work and eventually showing it to my executives, we decided that with some funding we might be able to turn this into something beneficial for everyone.
1) I will follow all standards and take a couple of extra steps.
2) I do work at an SEO firm. We have been extremely successful. The amount of data we collect will depend on how successful this really is. We're labeling this a “research project” for a reason. The first goal is to collect enough data to mimic other engines. At <snip> we spend countless hours researching ways to dissect engines to learn their secrets.
After we have the data, we'll use it to reverse engineer the ranking strategies used by popular engines. These tools will probably be shared with the community. So far, I am the only engineer working on this project, and I work on it after hours. After we clean our slate a little more, we may devote a team to it.
Currently, we are collecting page text and HTML. You know as well as I do: Google, Bing, and Yahoo aim to take over the world. I will be writing several kinds of spiders/scrapers for several different tasks, and we plan to maintain a “fresh internet” as much as we can. SEO goes way further than simple page text and HTML, but this is how we're entering. I also have interests in other world search engines, such as Baidu. We'll eventually mirror most search engine trends. Building a great search engine requires money and power; we have invested very little thus far.
Tangor,
The ultimate goal is to end up mirroring the major engines. Maybe, if time permits and success grants it, we can compete; we have no intention of competing right now. My time is already booked until March, so I have to work hard in my spare time to grow the project. By then, I'm positive we'll have something to play with. A six-month commitment at least. If it is successful enough, I will lead a team developing it further.
Thank you guys for your interest. Your questions have made my day.
Aaron
[edited by: incrediBILL at 2:46 pm (utc) on Jan. 15, 2010]
[edit reason] No self-promotion, company name removed [/edit]
Sorry, but no deal.
There's no intention to harm anybody.
That doesn't fly, because the purpose of SEO is always to boost your site/customer's search engine rankings at someone else's expense.
Only 10 people can exist in the top 10, and whoever you push to #11 has been harmed.
Considering this is an SEO research project, designed to learn how to manipulate search engines, harm is the obvious intent and the end result.
Not to mention you're using the bandwidth and server resources of people unaware of your crawler, along with the thousands of other crawlers that are "doing no harm," because the average webmaster doesn't know to block you in the first place, nor even know you exist.
So not only do you intend to ultimately harm some of the sites you crawl via SEO manipulation, but you also make them pay for the privilege of being harmed!
Research indeed.
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Teoma
Disallow: /cgi-bin
User-agent: *
Disallow: /
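For anyone wondering how that grouping parses: the four named bots share the single /cgi-bin rule, and everyone else hits the blanket Disallow. A quick check with Python's standard urllib.robotparser (the example.com URLs are placeholders):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: Teoma
Disallow: /cgi-bin

User-agent: *
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse the rules directly rather than fetching them
assert rp.can_fetch("Googlebot", "http://example.com/page.html")
assert not rp.can_fetch("Googlebot", "http://example.com/cgi-bin/search")
assert not rp.can_fetch("Cugillionbot", "http://example.com/anything.html")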