I got it from this IP - 220.127.116.11 - It did NOT visit robots.TXT
Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] firstname.lastname@example.org)
To ban or not to ban. This is the question!
It's Amazon's Elastic Compute Cloud (EC2) again.
Anonymous virtual machines for cheap crawling. No, thanks!
(Yes, it's really Amazon.com who offers hosting!)
Thank you very much for letting us crawl your website.
However, if our crawl was unwanted, we apologize for the intrusion. Please check our website for more information on how to limit our spider's access to your website.
We have made sure that our crawler complies with W3 standards
According to w3 standard:
"Some tips: URI's are case-sensitive, and "/robots.txt" string must be all lower-case. Blank lines are not permitted within a single record in the "robots.txt" file."
so robots.TXT will not be checked. However we will definitely give feedback to our technical team to address these exceptions as well.
By allowing Teemer and other non-malicious spiders to crawl your site, you are already helping to advance the state-of-the-art in Internet services. You are also supporting competition and allowing new ideas to be explored in the world's largest electronic playground. This type of innovation and cooperation is what led to the creation of the Internet and the World Wide Web. We sincerely hope that you will work with us to help create the Internet of tomorrow!
|By allowing Teemer and other non-malicious spiders to crawl your site, you are already helping to advance the state-of-the-art in Internet services. You are also supporting competition and allowing new ideas to be explored in the world's largest electronic playground. This type of innovation and cooperation is what led to the creation of the Internet and the World Wide Web. We sincerely hope that you will work with us to help create the Internet of tomorrow! |
Many thanks for taking the time to offer your insights to this forum at Webmaster World.
Could you possibly answer a question?
Why on earth would a non-malicious spider (Teemer or otherwise, and supposedly credible) associate itself with a backbone provider (Amazon Development), when said backbone has a history of harboring both unidentified and non-compliant crawls?
72.44.46.zz - - [30/Aug/2007:12:21:27 -0500] "GET /Myfolder/MyPage.html HTTP/1.0" 200 32726 "-" "Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)"
And recently, from the same Class B that Teemer is using (which, BTW, has a recently added User-Agent per your explanation; of course, in all fairness, twenty ranges in a Class C may in fact be WITHOUT affiliation to Teemer):
67.202.4.zz - - [07/Oct/2007:12:13:48 -0500] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
|To ban or not to ban. This is the question! |
Many have multiple ranges of the backbone denied.
Thanks for bringing this to our attention. As you may know, large-scale crawling of the web is a server- and storage-intensive task. Unlike giants like Google or Yahoo with 300,000+ server farms, many startups and university research centers do not have resources of such scale.
EC2 is an interesting offering by Amazon through which many research centers and startup companies like us can lease a thousand servers, crawl the web, and release the servers for future processing. Of course, such a cheap and efficient service is an excellent opportunity for malicious activities as well. But at the same time, this service is used by many cutting-edge startups and research centers around the world that are trying to come up with innovative new ways to index and understand the web. NetSeer is actively working with other startups and university research centers to improve the open source crawler Nutch and to start a consortium to generate and maintain an updated crawl of the web, to answer the valid concerns that you brought up.
In the meantime, we appreciate your comments and would be happy to answer any further concerns.
To read more about Amazon EC2 and their offerings please check
few startups using their services:
What your team should know -- what all crawler teams should know -- are the following points.
As Webmasters, our sites are subject to constant abuse by e-mail address harvesters, content scrapers who copy our sites, plaster pay-per-click ads all over their copies, and then compete with us for our own search engine rankings, criminals intent on injecting malicious code into our servers, competitors who mount (or hire others to mount) denial-of-service attacks to bring down our servers, and all manner of other abuses.
As a result, we have little tolerance for robots that don't honor robots.txt or which consume excessive bandwidth from our servers while providing no 'return' for us in the form of traffic to our sites.
You could say that many of us here are a bit trigger-happy on returning 403-Forbidden responses as a direct result of the above.
So, in simple terms, we want to know exactly what the purpose of any new crawler is, and how allowing it to crawl our sites will benefit us and/or our sites' visitors or members. There are plenty of sites that allow only Google, MSN, and Yahoo to crawl them, and suffer not in the least.
We also want to see an on-line page with all of this information, without having to individually send e-mails to some unknown organization to inquire, while hoping for a truthful response, or any response -- Some fake robots have published e-mail addresses in their user-agent strings for the sole purpose of collecting Webmaster e-mail addresses, with the predictable result that the Webmasters who did inquire using that e-mail address were soon subjected to a flood of unsolicited hosting service offers.
In short, it's a jungle out here, and full disclosure and standards compliance is the only way to stay off widely-distributed robot ban lists. Have a look at some of the threads here in this forum for a snapshot of our world.
With those threads in mind, consider updating your /crawler page to tell us how the data harvested from our site will be used to benefit our sites. The phrase, "we build up a body of knowledge that helps us to provide better targeted advertising" is somewhat obtuse -- How does this benefit our sites or our businesses? Or does it?
One further comment: The description of NetSeer as a "start-up company" sounds rather odd, since many of us have been using their server monitoring service for many years. Also, that user-agent string is just a bit too long, IMO; The company name and /crawler.html URL will do, I think. :)
Thanks for replying to some of our inquiries. We welcome your participation.
With as many markets as Amazon is now exploiting, I am banning their keyword harvesting and possibly image harvesting little bot. Bye bye. No more competitor snatching my content.
Oh, and it grabbed a page every 3 seconds while it was on our site and never once asked for robots.txt
used ip 67.202.25.#*$!
[edited by: Bewenched at 5:42 am (utc) on Nov. 15, 2007]
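For anyone who wants to check claims like "a page every 3 seconds" and "never once asked for robots.txt" against their own logs, here is a minimal sketch. It assumes access-log lines in the same layout as the ones quoted earlier in this thread. The `summarize` function, the sample lines, the `ExampleBot` name, and the 192.0.2.10 address (a documentation-only IP) are all made up for illustration.

```python
import re
from datetime import datetime

# Matches the start of a combined-format access log line:
# IP, two ignored fields, [timestamp], "METHOD PATH ...", status code.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) ')

def summarize(lines, ip):
    """Return (asked_for_robots, mean_seconds_between_requests) for one IP."""
    times = []
    asked_for_robots = False
    for line in lines:
        match = LOG_RE.match(line)
        if not match or match.group(1) != ip:
            continue
        times.append(datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z"))
        # Exact, case-sensitive comparison: only "/robots.txt" counts.
        if match.group(4) == "/robots.txt":
            asked_for_robots = True
    times.sort()
    if len(times) < 2:
        return asked_for_robots, None
    span = (times[-1] - times[0]).total_seconds()
    return asked_for_robots, span / (len(times) - 1)

# Hypothetical sample data, not taken from any real log.
SAMPLE = [
    '192.0.2.10 - - [15/Nov/2007:05:00:00 -0500] "GET /a.html HTTP/1.0" 200 1234 "-" "ExampleBot"',
    '192.0.2.10 - - [15/Nov/2007:05:00:03 -0500] "GET /b.html HTTP/1.0" 200 2345 "-" "ExampleBot"',
    '192.0.2.10 - - [15/Nov/2007:05:00:06 -0500] "GET /c.html HTTP/1.0" 200 3456 "-" "ExampleBot"',
]

print(summarize(SAMPLE, "192.0.2.10"))
```

On the sample above, the bot never requests /robots.txt and averages one request every 3 seconds. A real log would need the same check run per IP or per range.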
I fully agree with Jim on this matter. I saw this one coming for a few days now and just 10 minutes before I read this thread I 403-ed them.
It won't give me any traffic in return, so why allow it to harvest my site?
|The description of NetSeer as a "start-up company" sounds rather odd, since many of us have been using their server monitoring service for many years. |
Is NetSeer the same company (or run by the same people) as InternetSeer? I assumed the name was a rip off, and that it is a different mob running it.
|... and never once asked for robots.txt |
For what it's worth, I haven't had any problem with Teemer not requesting robots.txt. On every visit to the several sites I monitor, it not only asks for it but obeys it too! It doesn't ask for any files but robots.txt.
If a crawler will obey robots.txt, I much prefer to use that method rather than 403 every request they make - less load on the server.
NetSeer, by all means, I am pretty sure that you guys are working on something big and bright, but as the webmaster/gatekeeper of my site, if I visited your careers page, I would 403 every request that I could track right after I read that stuff.
Just for giggles, if I were to apply for the Office Assistant position, would I truly be responsible for all aspects of systems administration for the company, including networks, applications, databases, and telecommunications? Wouldn't I be doing a Systems Administrator job, or is that the same?
This crawler gets robots.txt but doesn't respect it. I say BAN!
Thanks everybody for the feedback. We constantly check our crawl logs to make sure we honor robots.txt instructions. We have actually checked with our network of webmasters to make sure it works fine and have not seen any instance where it has caused a problem. Having said that, we are sorry if it has caused you any inconvenience. Could you please email us a copy of your robots.txt or your domain name so that we can investigate more? I think our efforts to actively answer your concerns speak of our respect for the webmaster community.
Some notes we think can help:
- Zerillos: "I got it from this IP - 18.104.22.168 - It did NOT visit robots.TXT "
Thanks for your feedback. Please fix your file by renaming it to robots.txt.
-jdMorgan: "As a result, we have little tolerance for robots that don't honor robots.txt or which consume excessive bandwidth from our servers while providing no 'return' for us in the form of traffic to our sites."
We couldn't agree with you more. But our crawler does honor robots.txt and limits crawling speed for each domain to ensure low bandwidth consumption. We think the easiest way for you to block us is to follow our instructions and change your robots.txt.
-Mokita: "Is NetSeer the same company (or run by the same people) as InternetSeer"
Thanks for the feedback. No, we are not related to InternetSeer. NetSeer is a new indexing and conceptual matching technology. Most of the core group are from UCLA. We will have more information on our website soon.
-blend27: Thanks for the feedback and good catch :) The webpages currently online are only templates provided by designers. We just put them online for the crawler information page. We should have our live website up sometime soon. Check back for our real career page then. We have junior and senior sys admin positions available.
- NemoNemo: "This crawler gets robots.txt but doesn't respect it. I say BAN!" Please provide us more information to check our logs. Teemer is checking for robots.txt for each and every domain. Please email us your robots.txt and we would be happy to give you feedback.
We want to emphasize our goal again. As a legitimate crawler with long-term plans, we want to be your friendly crawler. We know that for many reasons you might not want us to crawl your website, and we respect that. That's why the best way to stop us crawling your website is to follow the instructions and change your robots.txt.
Having read [netseer.com...] , I do not see any possibility to specifically disallow Teemer. So only '403' is left ;)
Netseer bot hit us last night around 2AM sending around 200+ reqs/sec for a duration of an hour before we blocked their IP range. Enabled again this morning and their crawl was *still* running at full speed (even though we were rejecting at the firewall and the ICMP unreachable should have halted their bot).
Going to try a robots exclusion (I don't really want to firewall off Amazon's EC2 IP space), but I am not hopeful that these people will do the right thing here.
On Nov 14th (and after my replies), "Teemer" was added to the robots.txt of my websites.
The next day and after reading robots.txt the bot has thus far remained compliant.
|I do not see any possibility to specifically disallow Teemer. |
I see your emoticon, so my answer is aimed at other people, who are not regular readers of these forums. This thread comes up at No 1 for "Netseer" in Google.
I structure robots.txt such that it allows a tiny handful of crawlers and disallows everything else, known and unknown.
There are so many crawlers around now that it is impossible to keep up with them all. Plus, doing it this way means that when a previously unknown bot visits for the first time, you can see at a glance if they obey robots.txt and 403 them if necessary.
I used to have a lengthy robots.txt that required constant maintenance adding new crawlers. Now it is short and sweet and requires no maintenance.
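As a rough illustration of that allow-list structure, the sketch below embeds a hypothetical robots.txt in a Python string and checks it with the standard library's `urllib.robotparser`. The three allowed crawler names are examples only, not a recommendation, and `is_allowed` is a helper invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allow-list robots.txt: a handful of named crawlers get an
# empty Disallow (crawl everything), and every other bot is disallowed.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""

def is_allowed(user_agent, url="/index.html"):
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

for agent in ("Googlebot", "Teemer", "SomeUnknownBot"):
    print(agent, "allowed" if is_allowed(agent) else "disallowed")
```

With this layout, a new crawler needs no entry of its own: anything not named falls through to the `*` record. Whether a given bot actually honors the file is a separate question that only the server logs can answer.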
[edited by: Mokita at 12:37 am (utc) on Nov. 19, 2007]
Would the whole 67.202.x.x belong to the same entity?
What prompted me to get onto this one are 404s that come from Google AdWords as a source, no referring link.
They all say: User Agent = PRCrawler/Nutch-0.9 (data mining development project)
All IPs belong to Amazon.
Now, this is something I never thought about before: Does Google charge us for “clicks” from spiders, bots, whatever?
The reason for the 404s is the fact that the “?” somehow gets evaded, so any variable appended to the end of the URL creates a non-existent page. If you have Google’s tagging turned on, no link will work in such cases.
|Would the whole 67.202.x.x belong to the same entity? |
Just the 0-63-Class C of the 202 Class B.
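In CIDR terms, "the 0-63 Class C's of the 202 Class B" works out to 67.202.0.0 through 67.202.63.255, i.e. a single /18. The sketch below uses Python's standard `ipaddress` module to test membership. The range is taken from the post above as stated, not independently verified, and the full addresses in the example are hypothetical, since the ones quoted in this thread are masked.

```python
import ipaddress

# 67.202.0.x through 67.202.63.x, expressed as one CIDR block.
# The range itself is the poster's claim, not a verified allocation.
BLOCK = ipaddress.ip_network("67.202.0.0/18")

def in_block(ip):
    """Return True if ip falls inside BLOCK."""
    return ipaddress.ip_address(ip) in BLOCK

# First and last address covered by the block.
print(BLOCK[0], BLOCK[-1])

# Hypothetical addresses: the last octets are invented for illustration.
print(in_block("67.202.25.1"))   # third octet 25 is within 0-63
print(in_block("67.202.64.1"))   # third octet 64 is outside 0-63
```

A single /18 is also what a deny rule or firewall entry would take, which is easier to maintain than 64 separate /24 lines.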
|They all say: User Agent = PRCrawler/Nutch-0.9 (data mining development project) |
Nutch-Anything is basically bad; however, I seem to recall some reference somewhere that Google (or perhaps another) was using a Nutch from THIS Amazon for something.
|All IPs belong to Amazon. |
Don't confuse this Amazon with the other Amazon.
This Amazon is basically a reseller.
Amazon Development Centre South Africa
|Now, this is something I never thought about before: Does Google charge us for “clicks” from spiders, bots, whatever? |
No clue from me; however, I believe there is a forum at Webmaster World focused on Google's click-through program.