Forum Moderators: Ocean10000 & keyplyr

Convera Crawler is Posting links.

     
7:55 pm on Jun 15, 2005 (gmt 0)

New User

10+ Year Member

joined:May 16, 2005
posts:4
votes: 0


I have a php-nuke site, and I was tracking down whoever had posted a huge number of links in the comments area of my review section. Every review has the exact same links in the comments section.
I deleted most but here are 2 examples:
[****.***...]
[****.***...]
Looking through my security logs (the Protector System), I was able to determine that this crawler was the only visitor on my site at that specific time that could have posted those links.
UNITED STATES
Last here: 2005.06.13 06:53:57
Ip: 63.241.***.***
Isp/Host: 8-9745.san2.***.***
Last Referer: Direct Hit
Total Hits: 301
Was last on: /modules.php?name=Stories_Archive&sa=show_all
Agent info: ConveraCrawler/0.8 (+http://www.authoritativeweb.com/crawl)

Other visitors that were on the site around that time only have one or two hits and they do not include my reviews section. Notice the 301 hits. Among those 301 hits are links like this:
[****.***...]

Has anyone seen this type of activity before from a bot or crawler?

[edited by: volatilegx at 9:16 pm (utc) on June 15, 2005]
[edit reason] removed specifics [/edit]

9:19 pm on June 15, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 22, 2001
posts:2450
votes: 0


Hi Jabba and welcome to WebmasterWorld :)

I haven't seen this type of behaviour from the Convera Crawler before, but this is very interesting.

10:04 pm on June 15, 2005 (gmt 0)

New User

10+ Year Member

joined:May 16, 2005
posts:4
votes: 0


Hi Jabba and welcome to WebmasterWorld :)
I haven't seen this type of behaviour from the Convera Crawler before, but this is very interesting.

Well thanks for the welcome.
As I said in my PM to you, I am a lurker but I thought this would be very worthy of posting.
Frankly, I've never seen a bot/crawler visit the places this one did. From the evidence I have, I have no doubt that this crawler was responsible for posting those links.
This crawler seemed to be on the hunt for places where "anonymous" could post a comment, as it hit Reviews, Stories, News, Sections, Topics and Forums only. The comments section for my reviews and one forum category are the only places that allow anonymous comments.
Sorry I didn't read the TOS. I assumed posting the evidentiary links would be permitted.

If more info is needed I will try to provide it.

12:04 am on June 16, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5493
votes: 3


I have the backbones (CERFnet) range denied.
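For readers wondering what denying a range amounts to in practice, here is a minimal Python sketch of the check. It assumes a hypothetical deny list: the two blocks below are reserved documentation ranges used as placeholders, not the actual CERFnet or Convera allocations, which this thread does not give.

```python
import ipaddress

# Hypothetical deny list: reserved documentation ranges, NOT real crawler ranges.
DENIED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_denied(ip_text, denied=DENIED_RANGES):
    """Return True if ip_text falls inside any network on the deny list."""
    ip = ipaddress.ip_address(ip_text)
    return any(ip in net for net in denied)
```

On a live site the same rule would normally sit in the web server or firewall configuration rather than in application code; the function only illustrates the membership test.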

Don

8:15 pm on June 16, 2005 (gmt 0)

New User

10+ Year Member

joined:May 16, 2005
posts:4
votes: 0


I spoke with an admin contact for this crawler and he sounded genuinely interested in finding out what may have caused this crawler to post those links.
Maybe he'll post a comment, as I pointed him to this thread.

7:22 pm on June 17, 2005 (gmt 0)

New User

10+ Year Member

joined:May 16, 2005
posts:4
votes: 0


Digging a little deeper into this issue, it seems that their crawler may have been hijacked by an IP addy in Belarus, which I found in my server logs.
Their crawler was logged by Protector, but the Belarus IP was not.
My server-side logs show ConveraCrawler GET my robots.txt, and then immediately the Belarus IP began to POST the links.

Trojan or virus? Don't know but I do know these guys are new and they initially brought up the suggestion that they could have a malicious script embedded in the crawler code.
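
The pattern described above (a crawler fetching robots.txt, followed right away by POSTs from a different IP) can be searched for in an access log. The following is a rough Python sketch, not the method actually used in this thread. It assumes the Apache "combined" log format; the function names, the `agent_substring` parameter, and the 60-second window are all illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def parse(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_LINE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return rec

def posts_after_robots_fetch(lines, agent_substring, window_seconds=60):
    """Yield (fetch, post) record pairs where a POST from a different IP
    follows the named agent's robots.txt fetch within window_seconds."""
    window = timedelta(seconds=window_seconds)
    records = [r for r in map(parse, lines) if r]
    fetches = [r for r in records
               if r["path"] == "/robots.txt" and agent_substring in r["agent"]]
    for fetch in fetches:
        for rec in records:
            delta = rec["ts"] - fetch["ts"]
            if (rec["method"] == "POST" and rec["ip"] != fetch["ip"]
                    and timedelta(0) <= delta <= window):
                yield fetch, rec
```

Calling `list(posts_after_robots_fetch(open("access.log"), "ConveraCrawler"))` (with `access.log` standing in for whatever the server writes) would list the suspicious pairs. A match only shows that the two events were close in time; it does not prove the two hosts are connected.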

11:09 pm on June 17, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts:2646
votes: 96


I've just been paged to come into the office at 23:30 on a Friday evening because Convera was hammering my directory and was not obeying the robots.txt exclusion. The IP in question is a Convera IP and it is going in the permanent deepsix list.

Looks like just another maggot until proven otherwise. I've emailed the contact address for an explanation. But I don't buy that hijack theory.
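
To confirm from the logs that a bot really ignored an exclusion, each logged request can be tested against the site's robots.txt rules. Below is a small Python sketch using the standard library's `urllib.robotparser`. The rules and the `/directory/` path are hypothetical stand-ins, since the actual robots.txt in question is not shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the real file from this thread is not shown.
ROBOTS_TXT = """\
User-agent: ConveraCrawler
Disallow: /directory/

User-agent: *
Disallow:
"""

def violates_robots(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt disallows `agent` from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)
```

Running every logged path for the crawler's user agent through `violates_robots` would give a list of requests that should never have been made if the exclusion had been honoured.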

Regards...jmcc

12:14 am on June 18, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5493
votes: 3


[convera.com...]

Funny thing?
I don't see any search option offered. Rather, there are mentions of data retrieval and third-party product references using websites as their resources.

2:38 am on June 18, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts:2646
votes: 96


Funny thing?
I don't see any search option offered. Rather, there are mentions of data retrieval and third-party product references using websites as their resources.

Or in other words, a more organised webscraper? :)

Regards...jmcc

4:46 am on June 18, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5493
votes: 3


jmcc,
There are actually many sources across the internet collecting data to sell or administer to third parties.

My gripe is "the why" of allowing them to utilize private resources (websites and bandwidth) without expecting compensation from either the "so-called scraper" or their customers.

After all they are collecting the data to be utilized in a non-internet capacity. More like an intranet.

I feel the same way about universities. And I do realize that much research (such as Google and other projects) begins at universities. However, they have valid resources in the way of grants, with paid staff (professors) and students doing the majority of the work to further their careers beyond the data they mine from privately owned web sites.

Another good example is Archive Org. It's an excellent resource and concept. The moot point, IMO, is that they will sell terabytes of collected data to anybody that wants to pay.
That payment concept is not in keeping with the terms under which most webmasters allow their sites to be spidered.

The term "third party" is very broad and entails many companies not offering search engines. IBM Almaden is another example that only collects data to display in a closed environment to paid customers.

I have no desire to allow these types of bots or software into my sites, UNLESS they are willing to send some compensation my way.
Of course they'd have to change their entire concept of doing business before that would happen ;)

Don