SlySearch?

         

hanuman

7:40 am on Jun 12, 2002 (gmt 0)

10+ Year Member


Does anyone know this one? It spiders my website on a daily basis now, grabbing 10-20K documents per day! It comes from 64.140.48.30.

The robot FAQ page at http://www.slysearch.com/ does not explain why it crawls.

Should I block this agent?
thanks
hanuman

Dpeper

8:23 am on Jun 12, 2002 (gmt 0)

10+ Year Member



As long as it doesn't use up your bandwidth, I think it's safe and has a good explanation for spidering.

hanuman

1:14 pm on Jun 12, 2002 (gmt 0)

10+ Year Member


Hi,

I did a bit more digging and found that SlySearch is the robot of http://www.Plagiarism.org and http://www.Turnitin.com, a company that charges for document retrieval or something. We are ALL paying $$ for bandwidth just so another company can resell our material? It's a bad joke. My site is an educational site with over 100K free domain articles; I am not running a commercial site. I am going to block them, don't know about you guys.

hanuman

korkus2000

1:19 pm on Jun 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I changed my robots.txt to ban their bot using their syntax. It has been a month and they have still not obeyed it. I get them once a week draining bandwidth. I will be banning them through other measures.

Axacta

1:34 pm on Jun 12, 2002 (gmt 0)

10+ Year Member



On my site SlySearch has been used by www.plagiarism.org which is a site fighting plagiarism in the education system. This sounds good to me. I don't think I want to block this one. It seems to be on our side.

hanuman

4:50 am on Jun 13, 2002 (gmt 0)

10+ Year Member



Ref: [webmasterworld.com...]

To block 209.10.169.24 - PortalBSpider, and 64.140.48.30 Slysearch

I added these lines to my .htaccess file

RewriteEngine On
# The last RewriteCond before the rule must not carry [OR], or the rule matches every request
RewriteCond %{HTTP_USER_AGENT} ^(SlySearch.*|PortalBSpider.*) [NC]
RewriteRule ^(.*) block.htm [L]

I would also recommend adding the following lines above the last RewriteCond, since each of them ends in [OR]:

RewriteCond %{HTTP_USER_AGENT} ^(-?|[A-Z]{10})$ [OR]

RewriteCond %{REMOTE_HOST} ^private$ [NC,OR]

Thanks to the group for the kind help!
Hanuman
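
For anyone who wants to check which user-agent strings the condition above would catch before putting it live, here is a minimal Python sketch. It is only an approximation of what mod_rewrite does: the pattern is written with a plain | for alternation, re.IGNORECASE stands in for the [NC] flag, and "SlySearch/1.0" is a made-up test string, not one taken from a real log.

```python
import re

# Same alternation as the RewriteCond pattern; re.IGNORECASE mirrors [NC].
BLOCKED_UA = re.compile(r"^(SlySearch.*|PortalBSpider.*)", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent string would match the condition."""
    return BLOCKED_UA.search(user_agent) is not None


# "SlySearch/1.0" is a hypothetical string used only for illustration.
print(is_blocked("SlySearch/1.0"))
print(is_blocked("portalbspider"))
# The renamed bot does not start with either name, so this pattern misses it.
print(is_blocked("TurnitinBot/1.4 (http://www.turnitin.com/robot/crawlerinfo.html)"))
```

Because the pattern is anchored with ^, it only matches agents whose string begins with one of the two names, so a renamed bot needs its own entry.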

Key_Master

12:02 pm on Aug 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



SlySearch is now:

TurnitinBot/1.4 [turnitin.com...]

korkus2000

12:07 pm on Aug 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wonder if they changed the name because so many people were banning the bot. I don't mind what they are doing but it is very aggressive. It acts like googlebot with no advantage to my site.

frontpage

2:11 pm on Aug 10, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Turnitinbot was hitting my site very hard today, and I caught them in time to add "Turnitinbot" to my .htaccess to ban them.

However, I think that they [the Turnitinbot/Slysearch bot owners] might actually read this forum. So every time they change the user agent on their bot, it gives them lag time before website owners know what the bot owners have done. They can deep crawl away to their hearts' content.

I think the most effective way to ban these unwanted intrusions is to ban this bot's known IP addresses as well.

volatilegx

2:57 pm on Aug 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've had some correspondence with the administrator for this bot and I don't think they would have changed the name in order to overcome being banned by user agent. They seemed very forthright in explaining what their bot was about and were ready to help me with the problem I was having with their bot. Just my gut feeling...

frontpage

11:46 am on Aug 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


Well, it looks like Turnitin does not respect robots.txt. After getting hit on their last visit, I banned them using both robots.txt and the user agent.

Here is my robots.txt
---------------
User-agent: ia_archiver
Disallow: /

User-agent: SlySearch
Disallow: /

------------------

But after requesting my robots.txt this time, the bot tried to download my site.

Here is the request for robots.txt:

64.140.48.24 - - [13/Aug/2002:18:14:15 -0400] "GET /robots.txt HTTP/1.0" 302 294 "-" "TurnitinBot/1.4 (http://www.turnitin.com/robot/crawlerinfo.html)"

And here is an example of how it requested my files after getting the robots.txt:

64.140.48.24 - - [13/Aug/2002:18:15:13 -0400] "GET /example.htm HTTP/1.0" 302 294 "-" "TurnitinBot/1.4 http://www.turnitin.com/robot/crawlerinfo.html"

Luckily, I had banned the bot using .htaccess the same day I added the robots.txt, and the response it received was a 302.

And how was your day?

frontpage

11:41 am on Aug 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The bot that would not go away came back.

64.140.48.24 - - [15/Aug/2002:18:19:10 -0400] "GET /robots.txt HTTP/1.0" 302 294 "-" "TurnitinBot/1.4 (http://www.turnitin.com/robot/crawlerinfo.html)"

Robots.txt won't stop him; only a ban in .htaccess seems effective.

Key_Master

11:52 am on Aug 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Frontpage,

Are you using this in your robots.txt file?

User-agent: turnitinbot
Disallow: /
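
The difference between the two User-agent lines can be checked offline. Below is a minimal sketch using Python's standard urllib.robotparser, fed with the robots.txt rules posted earlier in this thread. Treat it as an illustration only: it shows how this particular parser matches user-agent tokens, and a real crawler may match them differently.

```python
from urllib.robotparser import RobotFileParser

# User-agent string as it appears in the access log quoted in this thread.
UA = "TurnitinBot/1.4 (http://www.turnitin.com/robot/crawlerinfo.html)"

# Rules that only name the old bot.
OLD_RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: SlySearch
Disallow: /
"""

# Same rules plus a record for the new name.
NEW_RULES = OLD_RULES + """
User-agent: turnitinbot
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed(OLD_RULES, UA, "/example.htm"))  # True: no record names this bot
print(allowed(NEW_RULES, UA, "/example.htm"))  # False: the new record applies
```

With this parser, a record for SlySearch says nothing about an agent that announces itself as TurnitinBot, which is why the extra record matters.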

frontpage

10:29 pm on Aug 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have not tried that. I was relying upon the information provided by Slysearch/Turnitin's website.

Q: How can I completely exclude TurnitinBot from my site?

To exclude TurnitinBot from all or portions of your site all you have to do is create a file called robots.txt and put it in the top most directory of your web site.
Below is an example of a robots.txt file which excludes ONLY our robot from a portion or all of your site.

#This is an example robots.txt file
User-agent: SlySearch
Disallow: /hide/ #Will disallow any url starting with /hide/

#This is an example robots.txt file
User-agent: SlySearch
Disallow: / #Will disallow all urls on your site

I will try your method and let you know what happens!

Key_Master

11:14 pm on Aug 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, it's not my method; it belongs to turnitin.com.

[turnitin.com...]

frontpage

12:35 pm on Aug 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



64.140.48.24 - - [16/Aug/2002:21:04:13 -0400] "GET /cgi-bin/odp/index.cgi?/Computers/CAD/Computer_Aided_Manufacturing/ HTTP/1.0" 302 294 "-" "TurnitinBot/1.4 [turnitin.com...]

This spider won't GO away. It has been hitting my server all night.

frontpage

12:53 pm on Aug 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



64.140.48.24 - - [21/Aug/2002:18:19:01 -0400] "GET /cgi-bin/odp/index.cgi?/Home/Cooking/Beverages/Smoothies/ HTTP/1.0" 302 294 "-" "TurnitinBot/1.4 [turnitin.com...]

This bot has been hitting my site continuously for the past few days with no regard to the robots.txt:

User-agent: ia_archiver
Disallow: /

User-agent: turnitinbot
Disallow: /

User-agent: SlySearch
Disallow: /

Finally, I had to email the company to get it to stop today.

jdMorgan

10:26 pm on Aug 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



frontpage,

Just noticed this... I wouldn't have expected them to go away if you are returning a server code of
302 - Moved Temporarily. Most bots *will* go away after a while if you return 403 - Forbidden.

Jim
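
To see at a glance which status code a bot is actually getting, the access log lines quoted earlier in the thread can be picked apart with a short script. This is only a sketch: it assumes the log format shown in those lines, and the regular expression and the function name status_of are my own, not part of any server tool.

```python
import re
from typing import Optional, Tuple

# Matches the leading fields of the access log lines quoted in this thread:
# client IP, two placeholder fields, timestamp, request line, status, size.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def status_of(line: str) -> Optional[Tuple[str, int]]:
    """Return (requested path, HTTP status) for a log line, or None if unparsable."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("status"))


SAMPLE = (
    '64.140.48.24 - - [13/Aug/2002:18:14:15 -0400] '
    '"GET /robots.txt HTTP/1.0" 302 294 "-" '
    '"TurnitinBot/1.4 (http://www.turnitin.com/robot/crawlerinfo.html)"'
)

print(status_of(SAMPLE))  # ('/robots.txt', 302)
```

Note that in the sample line the request for /robots.txt itself received the 302, so it may be worth checking what the bot was served in place of the robots.txt file.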

EliteWeb

11:02 pm on Sep 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This one came up for me today:

1 16.67% TurnitinBot/1.5 [turnitin.com...]
1 16.67% TurnitinBot/1.5 (http://www.turnitin.com/robot/crawlerinfo.html)

They are now a plagiarism system?

Rugles

12:58 pm on Oct 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I really hate that they constantly change the User Agent for this company. I think they have used 4 or 5 different agents in the last year.
It is like they do it on purpose just to get around my robots.txt file.
I just fired off an angry e-mail in hopes that they will stop crawling my sites on a near daily basis.

korkus2000

1:02 pm on Oct 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have they changed from TurnitinBot?

Rugles

3:21 pm on Oct 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ya.. it is TurnitinBot/1.5 now.

It used to be 1.4, and before that just Turnitin, and before that several SlySearch bots.

I do not think adding just "Turnitin" will stop all the Turnitin bots. In fact that is in my robots.txt and it did not stop them last night.

It makes me wonder if they are changing it on purpose just to see how much more they get.