67.202.21.216 - - [09/Nov/2007:02:32:26 -0500] "GET /robots.txt HTTP/1.0" 200 5329 "-" "Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; http://www.netseer.com/crawler.html; crawler@netseer.com)"
IP address belongs to Amazon Development Centre South Africa.
Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
To ban or not to ban. This is the question!
(Yes, it's really Amazon.com who offers hosting!)
We have made sure that our crawler complies with W3 standards
According to w3 standard:
[w3.org...]
"Some tips: URI's are case-sensitive, and "/robots.txt" string must be all lower-case. Blank lines are not permitted within a single record in the "robots.txt" file."
so robots.TXT will not be checked. However we will definitely give feedback to our technical team to address these exceptions as well.
By allowing Teemer and other non-malicious spiders to crawl your site, you are already helping to advance the state-of-the-art in Internet services. You are also supporting competition and allowing new ideas to be explored in the world's largest electronic playground. This type of innovation and cooperation is what led to the creation of the Internet and the World Wide Web. We sincerely hope that you will work with us to help create the Internet of tomorrow!
Many thanks for taking the time to offer your insights to this forum at Webmaster World.
Could you possibly answer a question?
Why on earth would a non-malicious spider (Teemer or otherwise, and supposedly credible) associate itself with a backbone provider (Amazon Development), when said backbone has a history of harboring both unidentified and non-compliant crawlers?
Old threads
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
Same backbone
72.44.46.zz - - [30/Aug/2007:12:21:27 -0500] "GET /Myfolder/MyPage.html HTTP/1.0" 200 32726 "-" "Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)"
And recently, the same Class B that Teemer is using (which, BTW, has a recently added User-Agent per your explanation; of course, in all fairness, twenty ranges in a Class C may in fact be WITHOUT affiliation to Teemer):
67.202.4.zz - - [07/Oct/2007:12:13:48 -0500] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
Thanks
NetSeer Team
As Webmasters, our sites are subject to constant abuse by e-mail address harvesters, content scrapers who copy our sites, plaster pay-per-click ads all over their copies, and then compete with us for our own search engine rankings, criminals intent on injecting malicious code into our servers, competitors who mount (or hire others to mount) denial-of-service attacks to bring down our servers, and all manner of other abuses.
As a result, we have little tolerance for robots that don't honor robots.txt or which consume excessive bandwidth from our servers while providing no 'return' for us in the form of traffic to our sites.
You could say that many of us here are a bit trigger-happy on returning 403-Forbidden responses as a direct result of the above.
So, in simple terms, we want to know exactly what the purpose of any new crawler is, and how allowing it to crawl our sites will benefit us and/or our sites' visitors or members. There are plenty of sites that allow only Google, MSN, and Yahoo to crawl them, and suffer not in the least.
We also want to see an on-line page with all of this information, without having to individually send e-mails to some unknown organization to inquire, while hoping for a truthful response, or any response -- Some fake robots have published e-mail addresses in their user-agent strings for the sole purpose of collecting Webmaster e-mail addresses, with the predictable result that the Webmasters who did inquire using that e-mail address were soon subjected to a flood of unsolicited hosting service offers.
In short, it's a jungle out here, and full disclosure and standards compliance is the only way to stay off widely-distributed robot ban lists. Have a look at some of the threads here in this forum for a snapshot of our world.
With those threads in mind, consider updating your /crawler page to tell us how the data harvested from our site will be used to benefit our sites. The phrase, "we build up a body of knowledge that helps us to provide better targeted advertising" is somewhat obtuse -- How does this benefit our sites or our businesses? Or does it?
One further comment: The description of NetSeer as a "start-up company" sounds rather odd, since many of us have been using their server monitoring service for many years. Also, that user-agent string is just a bit too long, IMO; The company name and /crawler.html URL will do, I think. :)
Thanks for replying to some of our inquiries. We welcome your participation.
Jim
Oh, and it grabbed a page every 3 seconds the whole time it was on our site, and never once asked for robots.txt.
used ip 67.202.25.#*$!
[edited by: Bewenched at 5:42 am (utc) on Nov. 15, 2007]
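For anyone who wants to check a claim like this against their own logs, here is a rough Python sketch. It assumes Apache-style combined log lines like the ones quoted in this thread; the function name `crawler_summary` and the regular expression are made up for illustration, not taken from any tool. It reports whether a given user-agent ever requested /robots.txt (exact lower-case match only) and how many seconds passed between its requests.

```python
import re
from datetime import datetime

# Combined Log Format, as in the log lines quoted in this thread.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>.*)"$'
)


def crawler_summary(lines, agent_substring):
    """Return (asked_for_robots, seconds_between_requests) for one crawler."""
    times = []
    asked_for_robots = False
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or agent_substring.lower() not in match["agent"].lower():
            continue
        # Exact, case-sensitive comparison: "/robots.TXT" does not count.
        if match["path"] == "/robots.txt":
            asked_for_robots = True
        # Assumes English month abbreviations ("Nov") in the timestamps.
        times.append(datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"))
    times.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return asked_for_robots, gaps
```

Feed it the lines of your access log and a fragment of the user-agent, such as "Teemer". A run of gaps around 3.0 with `asked_for_robots` False would match what is described above. It only sees one log file, so a robots.txt fetch from another day or another IP would be missed.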
The description of NetSeer as a "start-up company" sounds rather odd, since many of us have been using their server monitoring service for many years.
Is NetSeer the same company (or run by the same people) as InternetSeer? I assumed the name was a rip off, and that it is a different mob running it.
Bewenched wrote:
... and never once asked for robots.txt
For what it's worth, I haven't had any problem with Teemer not requesting robots.txt. Every visit in several sites I monitor, it not only asks for it but obeys it too! It doesn't ask for any files but robots.txt.
If a crawler will obey robots.txt, I much prefer to use that method rather than 403 every request they make - less load on the server.
Just for giggles: if I were to apply for the Office Assistant position, would I truly be responsible for all aspects of systems administration for the company, including networks, applications, databases, and telecommunications? Wouldn't I be doing a Systems Administrator job, or is that the same?
Some notes we think can help:
- Zerillos: "I got it from this IP - 67.202.26.75 - It did NOT visit robots.TXT "
Thanks for your feedback. Please fix your file by renaming it to robots.txt.
-jdMorgan: "As a result, we have little tolerance for robots that don't honor robots.txt or which consume excessive bandwidth from our servers while providing no 'return' for us in the form of traffic to our sites."
We couldn't agree with you more. But our crawler does honor robots.txt and limits crawling speed for each domain to ensure low bandwidth consumption. We think the easiest way for you to block us is to follow our instructions and change your robots.txt.
-Mokita: "Is NetSeer the same company (or run by the same people) as InternetSeer"
Thanks for the feedback. No, we are not related to InternetSeer. NetSeer is a new indexing and conceptual matching technology. Most of the core group are from UCLA. We will have more information on our website soon.
-blend27: Thanks for the feedback and good catch :) The webpages currently online are only templates provided by designers. We just put them online for the crawler information page. We should have our live website online soon. Check back then for our real career page. We have junior and senior sys admin positions available.
- NemoNemo: "This crawler gets robots.txt but doesn't respect it. I say BAN!" Please provide us more information to check our logs. Teemer is checking for robots.txt for each and every domain. Please email us your robots.txt and we would be happy to give you feedback.
We want to emphasize our goal again. As a legitimate crawler with long term plans we want to be your friendly crawler. We know for many reasons you might not want us to crawl your website, and we respect that. That's why the best way to stop us crawling your website is to follow instructions and change your robots.txt.
Thanks
NetSeer Team
Going to try a robots exclusion (I don't really want to firewall off Amazon's EC2 IP space), but I am not hopeful that these people will do the right thing here.
I do not see any possibility to specifically disallow Teemer.
I see your emoticon, so my answer is aimed at other people, who are not regular readers of these forums. This thread comes up at No 1 for "Netseer" in Google.
I structure robots.txt such that it allows a tiny handful of crawlers and disallows everything else, known and unknown.
There are so many crawlers around now it is impossible to keep up with them all. Plus doing it this way means that when a previously unknown bot visits for the first time, you can see at glance if they obey robots.txt and 403 them if necessary.
I used to have a lengthy robots.txt that required constant maintenance adding new crawlers. Now it is short and sweet and requires no maintenance.
[edited by: Mokita at 12:37 am (utc) on Nov. 19, 2007]
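A rough sketch of what such a whitelist-style robots.txt might look like, with a quick check using Python's standard `urllib.robotparser`. The three crawler tokens (Googlebot, Slurp, msnbot) are an assumption based on the Google, Yahoo, and MSN mention earlier in the thread, not the actual list used here, and `can_crawl` is a made-up helper name. Of course, this only shows what a compliant parser would conclude; whether a given bot actually obeys is a separate matter.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: named crawlers get an empty
# Disallow (everything allowed); every other crawler is shut out.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str = "/index.html") -> bool:
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `can_crawl("Googlebot")` comes back allowed while `can_crawl("Teemer")` falls through to the catch-all record and is refused, with no need to list Teemer by name. Note that each record is separated by a blank line and contains none inside it.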
What prompted me to get onto this one are 404s that come from Google AdWords as a source, no referring link.
They all say: User Agent = PRCrawler/Nutch-0.9 (data mining development project)
All IPs belong to Amazon.
Now, this is something I never thought about before: Does Google charge us for “clicks” from spiders, bots, whatever?
P.S.
The reason for the 404s is that the “?” somehow gets evaded, so any variable appended to the end of the URL creates a non-existent page. If you have Google’s tagging turned on, no link will work in such cases.
Would whole 67.202.x.x belong to the same entity?
NO.
Just the 0-63-Class C of the 202 Class B.
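If you want to test addresses from your logs against that range, here is a minimal Python sketch using the standard `ipaddress` module. It assumes "the 0-63 Class C of the 202 Class B" means 67.202.0.0 through 67.202.63.255, which is 67.202.0.0/18 in CIDR notation. That reading, and the names `REPORTED_RANGE` and `in_reported_range`, are mine; check current WHOIS data before acting on it.

```python
import ipaddress

# Assumption: "the 0-63 Class C of the 202 Class B" means
# 67.202.0.0 through 67.202.63.255, i.e. the CIDR block 67.202.0.0/18.
REPORTED_RANGE = ipaddress.ip_network("67.202.0.0/18")


def in_reported_range(ip: str) -> bool:
    """True if ip falls inside the range described in this thread."""
    return ipaddress.ip_address(ip) in REPORTED_RANGE
```

The Teemer addresses quoted earlier (67.202.21.216 and 67.202.26.75) fall inside this block, while anything from 67.202.64.0 upward does not.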
They all say: User Agent = PRCrawler/Nutch-0.9 (data mining development project)
Nutch-anything is basically bad; however, I seem to recall some reference somewhere that Google (or perhaps another) was using a Nutch from THIS Amazon for something.
All IPs belong to Amazon.
Don't confuse this Amazon with the other Amazon.
This Amazon is basically a reseller.
Amazon Development Centre South Africa
Now, this is something I never thought about before: Does Google charge us for “clicks” from spiders, bots, whatever?
No clue from me; however, I believe there is a forum at Webmaster World focused on Google's click-through program.