I have a page on one of my websites where people can download files related to user agents. This page can also be used to check for updates. However, it's a heavy page, so I've set up tools that let people check for updates without putting any noticeable load on my server.
Despite my pleas that people use those tools, a lot of people still check the main page. So many that I added a clause to my Terms of Service stating that checking the main page more than once a day is a violation that will result in one or more IP Addresses being added to my ban list.
Let me try and be brief. Something that's hard for me to do. ;)
The above user agent has been hitting that page multiple times per day.
So I finally wrote to the webmaster and politely but firmly told him to stop it or I'll ban his entire company's range of IP Addresses. He wrote back to tell me the page isn't in my robots.txt file and that he can crawl it as often as he wants.
My reply to him made it clear my problem wasn't with him indexing downloads.asp. In fact I want search engines to index it. It's the top ranking page in all the majors using my keyword(s). My problem, as I told him, was that he was violating my Terms of Use.
Here's where it gets interesting.
He wrote back and told me that bots don't have to abide by a site's Terms of Use.
The other thing he said, and I really do not know why, is that he often crawls the web [b]spoofed as Google[/b] and that maybe I was seeing duplicate entries because both bots crawl from the same IP Address. I'm not. When he crawls as Google he uses IP Addresses from an ISP in Queensland, AU. When he crawls using the above user agent the IP Addresses are from a company called ATMLINK, INC. in Los Angeles.
Between his abuse of my Terms of Use and his admission that he spoofs Google user agents, I have enough information to consider him worthy of being banned.
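For anyone who wants to tell a real Googlebot from a spoofed one without eyeballing IP ranges, the usual approach is a reverse DNS lookup followed by a forward lookup. Below is a minimal Python sketch of that check. The function name and the injectable lookup parameters are my own for illustration, and the two hostname suffixes are the ones Google documents as far as I know, so treat them as an assumption.

```python
import socket

# Hostname suffixes assumed to identify genuine Google crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to a Google hostname
    that forward-resolves back to the same IP."""
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:
        return False                       # no PTR record or lookup failure
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False                       # e.g. an ISP hostname
    try:
        addresses = forward(hostname)[2]   # list of IPs for that hostname
    except OSError:
        return False
    return ip in addresses
```

A request that claims a Google user agent but fails this check, such as one coming from an ordinary ISP address, is at best not the real thing.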
What arrogance!
I agree... and certainly a site/IP range to be deemed worthy of the 403 ban.
This was the first time I've had a problem with arrogance in the extreme.
BTW this guy just doesn't know when to stop typing. He claims he used to own another bot that I used to have problems with: bdncentral, an Australian company that used to have a search engine, but now seems to simply be a registrar, site designer and host. He encouraged me to Google him to see what a great person he is. Bah!
That's all I have to say about this. I've relayed the facts he shared with me. As always it's up to each of us to make our own decisions about what to do next.
216.240.159.** - - [30/Jun/2005:19:27:45 -0700] "GET / HTTP/1.0" 200 9106 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
[edited by: volatilegx at 3:00 am (utc) on Oct. 4, 2006]
[edit reason] obfuscated ip address [/edit]
The truth of the whole story is in the original email, listed below. The second email is included as well, because you chose not to reply to the first.
Google terms for a crawler
http://www.google.com/intl/en/terms_of_service.html
From Google's site, a very interesting fact: there are no search engines on the web that read your terms of use.
_______________________________________________________
Content Linked to by Google
The sites displayed as search results or linked to by Google Services are developed by people over whom Google exercises no control. The search results that appear from Google's indices are indexed by Google's automated machinery and computers, and Google cannot and does not screen the sites before including them in the indices from which such automated search results are gathered. A search using Google Services may produce search results and links to sites that some people find objectionable, inappropriate, or offensive. We cannot guarantee that a Google search will not locate unintended or objectionable content and assume no responsibility for the content of any site included in any search results or otherwise linked to by the Google Services.
________________________________________________________________
Hello
Your robots text file
User-Agent: *
Disallow: /contact-me/
Disallow: /error/
Disallow: /template/
Disallow: /tools/
Disallow: /versions/
Disallow: /stream.asp
Your robots.txt file is not going to stop any crawler from fetching downloads.asp.
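That point can be checked mechanically. The sketch below feeds the rules quoted above to Python's standard `urllib.robotparser` and asks whether a crawler may fetch a given path. The agent name `ExampleBot` and the helper `is_allowed` are made up for illustration, and `/tools/check.asp` is only a sample path under a disallowed directory.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules as quoted in the email.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /contact-me/
Disallow: /error/
Disallow: /template/
Disallow: /tools/
Disallow: /versions/
Disallow: /stream.asp
"""


def is_allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for path in ("/downloads.asp", "/tools/check.asp", "/stream.asp"):
    print(path, is_allowed(ROBOTS_TXT, path))
```

Under these rules nothing matches `/downloads.asp`, so a compliant crawler is free to fetch it. That says nothing about how often it should.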
Full info on how to set up your robots.txt file is here:
http://pcaccessoriesparts.com/Spider.php
Or add this to your robots.txt file. Full block on our bot only, for your whole site:
User-Agent: Nelian Pty Ltd - Spider v2.1 ( http://pcaccessoriesparts.com )
Disallow: /
Block on that page for all robots
User-Agent: *
Disallow: /downloads.asp
Block on our robot only
User-Agent: Nelian Pty Ltd - Spider v2.1 ( http://pcaccessoriesparts.com )
Disallow: /downloads.asp
Add these to your robots.txt file for a crawl delay:
User-agent: *
Crawl-delay: 17
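Here is a rough sketch of how a crawler might read the page block and the crawl delay suggested above, again using Python's standard `urllib.robotparser`. I have merged the two `User-agent: *` groups into one, since as far as I can tell Python's parser keeps only the first `*` group it sees. The agent name `ExampleBot` is hypothetical. Keep in mind that `Crawl-delay` is an extension that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# The suggested rules, merged into a single catch-all group.
SUGGESTED = """\
User-agent: *
Disallow: /downloads.asp
Crawl-delay: 17
"""

parser = RobotFileParser()
parser.parse(SUGGESTED.splitlines())

# Seconds a polite crawler should wait between requests, or None if unset.
delay = parser.crawl_delay("ExampleBot")

# True if the page is now off limits to a compliant crawler.
blocked = not parser.can_fetch("ExampleBot", "/downloads.asp")

print(delay, blocked)
```

Note that blocking the page this way also keeps the search engines out, which is the opposite of what the site owner said he wants.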
Your meta tags. You have two robots meta tags on this page, which is not a compliant way to stop indexing of this page. You have two strings for robots:
<meta name="robots" content="noarchive">
<meta name="robots" content="index,follow">
It should look like this for no indexing and no following of links:
<meta name="robots" content="noindex,nofollow">
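How a crawler treats two robots meta tags is up to the crawler. The sketch below shows one plausible reading, assuming the crawler simply merges the directives from every robots meta tag on the page. The class and function names are made up for illustration; only Python's standard `html.parser` is used.

```python
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # content is a comma-separated list such as "index,follow"
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the merged set of robots directives found in `html`."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    return collector.directives


# The two tags quoted in the email.
PAGE = """<meta name="robots" content="noarchive">
<meta name="robots" content="index,follow">"""

print(sorted(robots_directives(PAGE)))
```

Read this way, the two tags on the page ask for indexing and link following without a cached copy, and neither of them asks for `noindex`.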
Now, after you make the necessary changes to prevent your page from being indexed, changes that comply with "The Robots Exclusion Protocol" adopted worldwide as a standard, I can assure you our spider won't have a hope of indexing material you don't want indexed.
Full details on "The Robots Exclusion Protocol" are located at the bottom of our page here:
http://pcaccessoriesparts.com/Spider.php
Thank You
Brian Neilen
Nelian Pty Ltd
<snip: no emails can be posted anywhere on this system>
Nelian Pty Ltd - Spider v2.1 ( http://pcaccessoriesparts.com )
216.240.157.3
beaver.unixbsd.info
-----
09/27/2006 08:31:03 200 GET browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
09/27/2006 08:31:03 200 HEAD browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
09/29/2006 06:56:36 200 GET browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
09/29/2006 06:56:36 200 HEAD browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
09/30/2006 06:08:43 200 GET browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
09/30/2006 06:08:43 200 HEAD browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
09/30/2006 06:10:44 200 GET browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
09/30/2006 06:10:44 200 HEAD browsers.garykeith.com
/downloads.asp browsers.garykeith.com
Nelian+Pty+Ltd+-+Spider+v2.1+(+http://pcaccessoriesparts.com+)
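For anyone wanting to check a once-a-day rule against an excerpt like the one above, here is a rough Python sketch that counts GET requests per day. It assumes each record starts with date, time, status and method, as in the excerpt, and it only looks at those four fields. The function name `gets_per_day` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

# First line of each record from the excerpt: date, time, status, method.
LOG_RECORDS = """\
09/27/2006 08:31:03 200 GET
09/27/2006 08:31:03 200 HEAD
09/29/2006 06:56:36 200 GET
09/29/2006 06:56:36 200 HEAD
09/30/2006 06:08:43 200 GET
09/30/2006 06:08:43 200 HEAD
09/30/2006 06:10:44 200 GET
09/30/2006 06:10:44 200 HEAD
"""


def gets_per_day(text):
    """Count GET requests per calendar day."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue                      # skip blank or partial lines
        date_str, _time, _status, method = fields[:4]
        if method != "GET":
            continue                      # HEAD requests are not counted
        day = datetime.strptime(date_str, "%m/%d/%Y").date()
        counts[day] += 1
    return counts


for day, hits in sorted(gets_per_day(LOG_RECORDS).items()):
    print(day.isoformat(), hits)
```

Whether the paired HEAD requests should count toward the limit is a policy choice. This sketch leaves them out.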
Second email
<snip>
___________________________________________________________________
I very rarely use Internet Explorer and, as stated above, I use K-Meleon, and it's usually set to Google. It's easy to serve up exploits based on the user agent string, though that technique is not as widely used any more. At times, when I'm testing my scripts for detection of user agents, I will add my user agent in K-Meleon, and it can stay that way for a few days. Within a week or two my testing will be complete, and I won't be on the net as actively as I am while testing.
Well, if you truly believe my engine should be banned, that does not worry me in the least. You just need to convince the billion-odd webmasters in the world to do it, while my engine will probably only get to 20,000 indexed docs.
All the best Gary, you need it. Too highly strung or stressed over the internet being used.
Thank You
Brian Neilen
[edited by: Brett_Tabke at 12:36 pm (utc) on Oct. 6, 2006]
3 things:
1- when you post someone's crawler agent with an IP and URL - they are 99.9% of the time going to come here and see it.
2- if you run a bot - do so respectfully - it is your responsibility to make sure it runs in a respectful manner. Don't be surprised when website owners get upset with you.
3- this thread has come to a useful end. You guys want to talk/discuss it more - I invite you to do so in email.