I just dug a bit more and found that
SlySearch is the robot of http://www.Plagiarism.org and http://www.Turnitin.com -
a company that charges for document retrieval or something. We are ALL paying $$ for bandwidth just so another company can re-sell our material? It's a bad joke. My site is an educational site with over 100K of free domain articles; I am not running a commercial site. I am going to block them, I don't know about you guys.
hanuman
To block 209.10.169.24 (PortalBSpider) and 64.140.48.30 (SlySearch),
I added these lines to my .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(SlySearch.*|PortalBSpider.*) [NC]
RewriteRule ^(.*) block.htm [L]
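I would also recommend adding the following lines above the user-agent condition (each ends in [OR], so neither can be the last condition before the RewriteRule):
RewriteCond %{HTTP_USER_AGENT} ^(-?|[A-Z]{10})$ [OR]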
RewriteCond %{REMOTE_HOST} ^private$ [NC,OR]
Thanks to the group for the kind help!
Hanuman
TurnitinBot/1.4 [turnitin.com...]
However, I think that they [the TurnitinBot/SlySearch bot owners] might actually read this forum. Every time they change the user agent on their bot, there is a lag before website owners find out what the bot owners have done, and in the meantime they can deep crawl away to their heart's content.
I think the most effective way to ban these unwanted intrusions is to ban this bot's known IP addresses as well.
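For anyone who wants to try the IP approach, here is a minimal sketch in the same mod_rewrite style used earlier in this thread. It assumes mod_rewrite is enabled and that the only addresses to block are the two 64.140.48.x addresses that show up in this thread (64.140.48.30 and 64.140.48.24); whether the bot uses other addresses is not known here, so treat the pattern as a starting point.

```apache
RewriteEngine On
# Assumption: only the two addresses reported in this thread are matched.
# Widen the pattern yourself if your own logs show other addresses.
RewriteCond %{REMOTE_ADDR} ^64\.140\.48\.(24|30)$
# "-" means no substitution; [F] answers with 403 Forbidden.
RewriteRule .* - [F]
```

Returning 403 with [F] avoids needing a separate block page, and because the check is on the address it keeps working even if the user agent string changes.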
Here is my robots.txt
---------------
User-agent: ia_archiver
Disallow: /
User-agent: SlySearch
Disallow: /
------------------
But after requesting my robots.txt this time, the bot tried to download my site.
Here is the request for robots.txt:
64.140.48.24 - - [13/Aug/2002:18:14:15 -0400] "GET /robots.txt HTTP/1.0" 302 294 "-" "TurnitinBot/1.4 (http://www.turnitin.com/robot/crawlerinfo.html)"
And here is an example of how it requested my files after getting the robots.txt:
64.140.48.24 - - [13/Aug/2002:18:15:13 -0400] "GET /example.htm HTTP/1.0" 302 294 "-" "TurnitinBot/1.4 http://www.turnitin.com/robot/crawlerinfo.html"
Luckily, I had banned the bot using .htaccess the same day I added the robots.txt, so the response it received was a 302 redirect.
And how was your day?
Q: How can I completely exclude TurnitinBot from my site?
To exclude TurnitinBot from all or portions of your site, all you have to do is create a file called robots.txt and put it in the topmost directory of your web site.
Below is an example of a robots.txt file which excludes ONLY our robot from a portion or all of your site.
#This is an example robots.txt file
User-agent: SlySearch
Disallow: /hide/ #Will disallow any url starting with /hide/
#This is an example robots.txt file
User-agent: SlySearch
Disallow: / #Will disallow all urls on your site
I will try your method and let you know what happens!
[turnitin.com...]
This spider won't GO away. It has been hitting my server all night.
This bot has been hitting my site continuously for the past few days with no regard for the robots.txt:
User-agent: ia_archiver
Disallow: /
User-agent: turnitinbot
Disallow: /
User-agent: SlySearch
Disallow: /
Finally, I had to email the company to get it to stop today.
1 16.67% TurnitinBot/1.5 [turnitin.com...]
1 16.67% TurnitinBot/1.5 (http://www.turnitin.com/robot/crawlerinfo.html)
They are now a plagiarism system?
It used to be 1.4, before that just Turnitin, and before that several SlySearch bots.
I do not think adding just "Turnitin" will stop all the Turnitin bots. In fact that is in my robots.txt and it did not stop them last night.
It makes me wonder if they are changing it on purpose just to see how much more they get.
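Since the name keeps changing (SlySearch, Turnitin, TurnitinBot/1.4, TurnitinBot/1.5), one option is a server-side match that ignores the version number. The sketch below assumes mod_rewrite is enabled and that future versions keep "turnitin" or "slysearch" somewhere in the user agent, which is only a guess based on the names seen so far.

```apache
RewriteEngine On
# No ^ anchor and no version number, so "TurnitinBot/1.4",
# "TurnitinBot/1.5", "Turnitin" and "SlySearch" all match.
# [NC] makes the comparison case-insensitive.
RewriteCond %{HTTP_USER_AGENT} (turnitin|slysearch) [NC]
# "-" means no substitution; [F] answers with 403 Forbidden.
RewriteRule .* - [F]
```

Unlike robots.txt, this does not depend on the bot choosing to obey it, but it still fails if the bot switches to a completely different name.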