Anyone know the name of Wayback Machine's robot?
One site has not been crawled all year... wondering why.
JeffOstroff
msg:4530514 - 1:27 am on Dec 23, 2012 (gmt 0)

For years we have had dozens of crawls of our site from the Internet Wayback Machine over at archive.org.

Normally we find screen shots every few weeks for all the pages of our site, going back to 1998, so we can see how the site looked in the past.

But there have been no screen shots at all this year, only a few in 2011, and I'm wondering if we are inadvertently blocking their robot for some reason.

Does anyone know what the Internet Archive robot looks like? I want to make sure it's not excluded in our robots.txt file.
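
For anyone who wants to rule out robots.txt as the culprit, a minimal check with Python's standard-library robot parser might look something like this (the domain and paths are placeholders for your own site):

from urllib import robotparser

# Placeholder domain; substitute your own site.
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# ia_archiver is the user agent the Internet Archive documents for its crawler.
for path in ("/", "/some-page.html"):
    print(path, rp.can_fetch("ia_archiver", "https://www.example.com" + path))

If this prints False for pages that used to be archived, the robots.txt rules are the likely reason.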

 

incrediBILL
msg:4530703 - 3:22 am on Dec 24, 2012 (gmt 0)

According to their site [archive.org...] it's still ia_archiver, which is also used by a ton of scrapers, many from China.

I have no clue why anyone would want to allow the Wayback Machine to archive a site, as it's a really bad idea IMO; see NOARCHIVE.NET [noarchive.net...] for a few good reasons.

lucy24
msg:4530708 - 5:16 am on Dec 24, 2012 (gmt 0)

What's the IP? The newer one is

IP: 207.241.224.41
UA: ia_archiver(OS-Wayback)

You can count on the fingers of one hand the number of places that answer "please identify your robot" e-mails. TIA (The Internet Archive) is one of them. (Toshiba is another.)

They currently seem to be working on a very long time lag, up to 2 years before things get posted.
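
If you want to confirm whether that crawler is reaching your server at all, a rough check is to scan the raw access log for either the IP or the user-agent string above. A quick Python sketch, assuming a standard combined-format log at a path you would adjust for your own server:

LOG_PATH = "/var/log/apache2/access.log"  # assumed location; change as needed

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # Match the IP mentioned above or the ia_archiver user agent.
        if "207.241.224.41" in line or "ia_archiver" in line:
            print(line.rstrip())

No hits at all would suggest the crawler never gets as far as your server; hits followed by 403s or redirects would point at a server-side block rather than robots.txt.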

JeffOstroff
msg:4530709 - 5:26 am on Dec 24, 2012 (gmt 0)

Yes, I also just discovered they do everything on the new beta system now. When you get to their home page, you click on the Wayback logo to go to the Wayback form, and there is a green button that says Try Beta Version. Once you go through that, it shows captures of our site throughout 2012.

Looks like they are not supporting the old archive any more, and they have even disabled the form where you could submit your site. They prefer to happen across your site by natural means. We are back in business!

JeffOstroff
msg:4530797 - 3:46 pm on Dec 24, 2012 (gmt 0)

Wayback is useful when we find someone who scraped our content, say 2 years ago, and we have to show their webhost what our site looked like 2 years ago, and also that we had the content up first.

It kills the scammer's argument when they claim they had it up first, when you can show it in the Wayback archive 2 years before the offending web site's domain name was created.

incrediBILL
msg:4530872 - 1:08 am on Dec 25, 2012 (gmt 0)

> Wayback is useful when we find someone who scraped our content, say 2 years ago, and we have to show their webhost what our site looked like 2 years ago, and also that we had the content up first.

Wayback is also where the scrapers SCRAPE your site from two years ago! Letting them use your data is also letting the scrapers abuse that data, so you're creating a problem that goes round and round and never ends. It's also where, if you're infringing anything, lawyers go to get proof of how long you've been doing it, and much more. Overall, bad idea.

I keep my own site archives, I don't need them.

If you haven't read NOARCHIVE.NET, go there now and find out more.

wilderness
msg:4530897 - 3:08 am on Dec 25, 2012 (gmt 0)

> Wayback is useful when we find someone who scraped our content, say 2 years ago, and we have to show their webhost what our site looked like 2 years ago, and also that we had the content up first.

> I keep my own site archives, I don't need them.

Ditto.
Any person (webmaster or otherwise) who uses their computer extensively for storing data had better have a regular backup method in place.
I have two, which are run on the first of each month.

> If you haven't read NOARCHIVE.NET [noarchive.net], go there now and find out more.


This bears repeating.

not2easy
msg:4530908 - 4:18 am on Dec 25, 2012 (gmt 0)

I will second (third?) the advice to block both ia bots. If someone wants to scrape, let them come to the source where it can be minimized. I have the originals stored in backups and on the original hard drives back to 2000 anyway.

JeffOstroff
msg:4531029 - 4:48 pm on Dec 25, 2012 (gmt 0)

I think they are scraping our site directly already, not via the Internet Archive. They have scraped pages before they ever appeared in the Internet Archive, that's how I know. We ranked high on lots of popular keywords, so they come to the sites that are ranking high and steal their content. We also get thousands of those Alexa-wannabe web sites that post useless whois and ranking data as an excuse to post your title, description, and a few paragraphs scraped from your site.

Anyway, you guys all mentioned backups, and I keep backups too, but you have a flaw in your theory. These backups won't help you with your DMCA notices to the web hosts, as they either want to confirm the content is on your site now, or they want an independent 3rd-party snapshot to prove your content was there first. That's where Wayback has helped us in shutting down several hundred sites in the last few months. Think of it as a necessary evil. Your fears about the IA robot seem more conspiracy theory than actual practice. They steal my blog entries off my site that are not even in the Wayback archive. I'm more concerned about my own site being the source of scraping than the Wayback.

Also guys, just because you have backups of your site from 2 years ago on your PC does not in any way prove to the web host that you are the copyright owner. I don't understand where you're coming from when you say just use your local PC backups. We hit brick walls when we cannot show somewhere online where our copyright content currently exists.

GoDaddy, for example, won't even accept Wayback Machine screen shots! You have to supply them with a URL on YOUR SITE that currently shows the same content that the scrapers stole from you. If you cannot produce it, too bad, the scraper site stays up with your stolen content that you cannot prove was yours, simply because it is no longer on your site. So you have it backed up on your PC? So what, they don't care.

keyplyr
msg:4531055 - 10:56 pm on Dec 25, 2012 (gmt 0)

> Wayback has helped us in shutting down several hundred sites in the last few months.

If I had several hundred sites scrape my content in the last few months, I would first look to see what I was doing wrong. Maybe allowing IA or translators to post copies of your site on their servers is actually your problem? Maybe you allow caching by sites such as Blekko, Google or Bing? Maybe you allow various download tools to rip your content?

Stopping intellectual property from being stolen takes an on-going, pro-active and comprehensive approach.

incrediBILL
msg:4531091 - 8:03 am on Dec 26, 2012 (gmt 0)

> I think they are scraping our site directly already, not via the Internet Archive.

If your site is hard to scrape, which mine is, they will scrape from Google Cache, Bing Cache, the Internet Archive or anywhere else that has lighter security.

I know this because I've tracked it using tracking bugs in my content.
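
One way to implement that kind of tracking, for what it's worth, is to embed a short per-copy token in the markup whose value depends on which visitor or crawler the page was rendered for, so a scraped copy later reveals where it was lifted from. A rough Python sketch; the secret, URL and source labels are all placeholders:

import hashlib

SECRET = "replace-with-a-private-secret"  # placeholder; keep it out of public files

def tracking_token(page_url: str, source: str) -> str:
    # 'source' labels the copy being served, e.g. "live", "google-cache", "archive".
    digest = hashlib.sha256(f"{SECRET}|{page_url}|{source}".encode()).hexdigest()
    return digest[:12]

# Emit the token as an unobtrusive HTML comment when rendering the page.
token = tracking_token("https://www.example.com/article.html", "live")
print(f"<!-- tb:{token} -->")

When the same marker shows up on a scraper's site, the token tells you which copy of the page they took it from.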

wilderness
msg:4531105 - 11:09 am on Dec 26, 2012 (gmt 0)

> Anyway, you guys all mentioned backups, and I keep backups too, but you have a flaw in your theory. These backups won't help you with your DMCA notices to the web hosts, as they either want to confirm the content is on your site now, or they want an independent 3rd-party snapshot to prove your content was there first. That's where Wayback has helped us in shutting down several hundred sites in the last few months. Think of it as a necessary evil. Your fears about the IA robot seem more conspiracy theory than actual practice. They steal my blog entries off my site that are not even in the Wayback archive. I'm more concerned about my own site being the source of scraping than the Wayback.

There's another forum at WebmasterWorld where copyright is on topic. I'm sure the participants there would be willing to debate these issues until the cows come home.

> Also guys, just because you have backups of your site from 2 years ago on your PC does not in any way prove to the web host that you are the copyright owner. I don't understand where you're coming from when you say just use your local PC backups. We hit brick walls when we cannot show somewhere online where our copyright content currently exists.

You really need to explore backup procedures.
The thought of backing up your data to the source machine you're trying to protect is absurd.
Standard practice is two external media devices, with the second stored in a secondary location.

JeffOstroff
msg:4531127 - 2:08 pm on Dec 26, 2012 (gmt 0)

You guys are completely missing the point here. I'm not talking about backing up my site. We back everything up regularly. I have backups.

I'm not talking about using the Wayback Machine as a backup, I'm talking about using it as proof that our site was online with the content before the scraper site had it.

It's like playing whack-a-mole; you'll never stop them from coming onto your site, even if you update your robots.txt daily, because they keep changing names to fool your robots.txt file. Furthermore, many of the scrapers who steal our content are regular Joes and businesses who grabbed a paragraph. That has nothing to do with robots, when they come and manually cut and paste.

Furthermore, as web hosts are switching to automated DMCAs, they are requiring that you supply them with a URL that has your copyright content, as well as the URL of the offending site. If you cannot supply them with a URL that shows your content matching what the offending site has, they kick it back to you.

I don't know how you guys turned this into an issue about backing up data and my lack of being a responsible backup person, and then started talking about multiple hard drives.
This whole thread was about using the Wayback Machine to PROVE, understand me, prove that I had the content up first. We have a popular auto-related web site, and many people come and manually steal stuff; there is also robot scraping, but I doubt the threat from Wayback is as big as the conspiracy theory people are suggesting. They are coming to our site and grabbing stuff BEFORE it appears on Wayback. Our problem has not been with Wayback. Wayback has been the solution for us.

keyplyr
msg:4531198 - 8:38 pm on Dec 26, 2012 (gmt 0)

> Standard practice is two external media devices, with the second stored in a secondary location.

I've got mine buried under a chestnut tree.

incrediBILL
msg:4531215 - 10:01 pm on Dec 26, 2012 (gmt 0)

> I'm not talking about using the Wayback Machine as a backup, I'm talking about using it as proof that our site was online with the content before the scraper site had it.


Nope, we didn't miss anything.

You're playing with a double-edged sword.

Yes, it can be used as 'proof' that your files were online at that time, but likewise it can be used as the source of the scraping that caused you to need that proof in the first place.

Haven't you asked yourself that simple question of where two year old content is coming from?

It's not from your site, it's not from the search engines, where would old content happen to be?

Only one place I can think of.

The best way to prove copyright is to periodically send a CD of your site to the copyright office and spend the small amount it costs to legally protect its contents. If people then don't believe the content was yours when you file a DMCA request, they lose their safe harbor and are horribly exposed. You really need to start a discussion in the copyright forum on how to do it right, because the Wayback Machine is problematic at best IMO.

Besides, there are places where you can make online archives of sites privately, without exposing your content to uncontrolled scraping, but that's another discussion.

wilderness
msg:4531224 - 10:35 pm on Dec 26, 2012 (gmt 0)

> Nope, we didn't miss anything.

You're wasting your time and effort.

lucy24
msg:4531270 - 1:18 am on Dec 27, 2012 (gmt 0)

:: idly wondering if it's time for the Wayback Machine to join domain-name dragons in the land of Things We Do Not Discuss Here ::

dstiles
msg:4531538 - 8:21 pm on Dec 27, 2012 (gmt 0)

> keep changing names to fool your robots.txt file

Bad bots never even look at robots.txt, and even if they did they would ignore its directions. Only good bots (sometimes) obey the directions "suggested" by robots.txt. Bad bots do their own thing, and the only way to block them is to detect certain access parameters and return a 403 with no content. This requires eternal vigilance to detect new methods, but a good detector can foil 99% of scrapers.
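
As a very rough illustration of that kind of blocking, here is a minimal WSGI sketch that returns an empty 403 to blank or blacklisted user agents; the fragment list is purely illustrative, and real detection usually looks at far more than the user-agent string:

BAD_UA_FRAGMENTS = ("ia_archiver", "python-requests", "wget")  # illustrative list only

def block_bad_bots(app):
    # Wrap a WSGI app: blank or blacklisted user agents get an empty 403,
    # everything else is passed through to the real application.
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if not ua or any(frag in ua for frag in BAD_UA_FRAGMENTS):
            start_response("403 Forbidden", [("Content-Length", "0")])
            return [b""]
        return app(environ, start_response)
    return middleware

The same idea is more often expressed in .htaccess or server config rules; the sketch just makes the logic explicit.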

incrediBILL
msg:4531550 - 8:46 pm on Dec 27, 2012 (gmt 0)

> Bad bots never even look at robots.txt


Actually they do at times. There's one bot that uses a blank user agent when reading the robots.txt to see which spider names you've allowed. Then it switches to one of those spider names to make sure it gets access to your site.

That's another reason I use a dynamic robots.txt file that tells everyone to go away except my whitelist, and serves up a custom file per request so that I don't expose the bot names that are allowed.

In case you're wondering, ia_archiver is NOT allowed ;)
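
A stripped-down sketch of that sort of dynamic robots.txt, in Python; the whitelist entries are placeholders rather than anyone's actual list:

WHITELIST = {
    "googlebot": "User-agent: Googlebot\nDisallow:\n",
    "bingbot": "User-agent: Bingbot\nDisallow:\n",
}

DENY_ALL = "User-agent: *\nDisallow: /\n"

def robots_txt_for(user_agent: str) -> str:
    # A whitelisted bot sees only its own permissions; nobody is shown the full list.
    ua = (user_agent or "").lower()
    for name, rules in WHITELIST.items():
        if name in ua:
            return rules
    # Everyone else, ia_archiver and blank user agents included, is told to go away.
    return DENY_ALL

Hooked up to whatever serves /robots.txt, this keeps the allowed bot names private while still telling unknown agents to stay out.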
