
Forum Moderators: Ocean10000 & phranque


Inadvertently blocking Archive.org bot

     
11:47 pm on Jul 13, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


The main way to clear out archive.org's records of your site is simply to disallow their bot in your robots.txt file, and then wait for the bot to fetch it.

So I added to robots.txt:

User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver-web.archive.org
Disallow: /
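
As an offline sanity check, Python's standard urllib.robotparser can confirm what these directives say (a sketch; the sample URL paths are arbitrary):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt groups from above, as posted.
robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("ia_archiver", "/index.html"))    # False: fully disallowed
print(rp.can_fetch("SomeOtherBot", "/index.html"))   # True: no matching group
```

Of course this only proves the file says what it should; the bot still has to be able to fetch it with a 200.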

The bot did come and fetch it, but it is served a 403 instead of a 200:

207.241.229.169 - - [10/Jul/2008:05:34:49 -0400] "GET /robots.txt HTTP/1.1" 403 306 "-" "ia_archiver-web.archive.org"

209.234.171.43 - - [11/Jul/2008:05:55:19 -0400] "GET /robots.txt HTTP/1.0" 403 290 "-" "ia_archiver"
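
To keep an eye on this, the relevant hits and their status codes can be pulled out of a combined-format log with standard tools. A self-contained sketch, using the two log lines above as sample data (a real log path would replace the temp file):

```shell
# Sample data standing in for the real access log.
cat > /tmp/sample_access.log <<'EOF'
207.241.229.169 - - [10/Jul/2008:05:34:49 -0400] "GET /robots.txt HTTP/1.1" 403 306 "-" "ia_archiver-web.archive.org"
209.234.171.43 - - [11/Jul/2008:05:55:19 -0400] "GET /robots.txt HTTP/1.0" 403 290 "-" "ia_archiver"
EOF

# Print the IP and status code of every Archive.org bot request.
grep -E 'ia_archiver|archive\.org_bot' /tmp/sample_access.log \
  | awk '{print $1, $9}'
```

Both sample lines come back 403, matching the logs quoted above.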

I double-checked my main htaccess, and neither of those IPs are denied, so there must be something in my Conditions still blocking it. Can anybody spot where please?
Here's the likely grouping involved... I've removed some Rules related to my domain and www etc.

Do I need a caret in front of iaea\.org in this Condition?

RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]

I should add...
There's no "vertical pipe" problem, and this code has been stable and working for months. Each RewriteCond is a single line in the actual file.

Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^WebStripper/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTMLParser [OR]
RewriteCond %{HTTP_USER_AGENT} ^ZipppBot [OR]
RewriteCond %{HTTP_USER_AGENT} iSiloX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Convera [OR]
RewriteCond %{HTTP_USER_AGENT} nameprotect [OR]
RewriteCond %{HTTP_USER_AGENT} ^gazz [NC,OR]
RewriteCond %{HTTP_REFERER} vnc/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper/ [OR]
RewriteCond %{REQUEST_URI} (.?mail.?form|form|(GM)?form.?.?mail|.?mail)(2|to)?\.?(asp|cgi|exe|php|pl|pm)?$ [NC,OR]
# MSOffice
RewriteCond %{REQUEST_URI} ^/(MSOffice|_vti) [NC,OR]
# Nimda
RewriteCond %{REQUEST_URI} /(admin|cmd|httpodbc|nsiislog|root|shell)\.(dll|exe) [NC,OR]
# CodeRed
RewriteCond %{REQUEST_URI} ^/default\.(ida|idq) [NC,OR]
RewriteCond %{REQUEST_URI} ^/.*\.printer$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|Crescent|Disco.?|ExtractorPro|HTML.?Works|Franklin.?Locator|HLoader|http.?generic|Industry.?Program|IUPUI.?Research.?Bot|Mac.?Finder|NetZIP|NICErsPRO|NPBot|Planty|Production.?Bot|Program.?Shareware|Teleport.?Pro|TE|VoidEYE|WebBandit|WebCopier|WebZIP|WinMHT|WEP.?Search|Zeus) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (collector|extractor|magnet|reaper|siphon|sweeper|harvest|collect|wolf|WebDAV) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Educate.?Search|Full.?Web.?Bot|IUFW.?Web [NC,OR]
RewriteCond %{HTTP_USER_AGENT} httrack|heritrix|twiceler|mirar|larbin|NaverRobot|SURF [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.?URL.?Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Miss.*g.*.?Locat.* [NC,OR]
# Bad requests
RewriteCond %{REQUEST_METHOD} !^(GET|HEAD|OPTIONS|POST) [NC,OR]
RewriteCond %{THE_REQUEST} ^CONNECT.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.06\ \(Win95;\ I\) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 5\.00;\ Windows\ 98$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\ ;\ MSIE.? [NC,OR]
RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule .* - [F]

7:27 pm on July 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I don't see anything in there that would block ia_archiver. I do note, however, that your block on *any* user-agent that starts with the two letters "TE" is rather dangerous. Use a longer pattern if at all possible, because that very short one is a trap waiting to catch you in the future.
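
The danger is easy to demonstrate. Python's `re` stands in here for mod_rewrite's regex engine; the pattern is a trimmed slice of the block list above, and the sample user-agent strings are illustrative (Teoma, for instance, was Ask's legitimate crawler):

```python
import re

# [NC] in Apache means case-insensitive, so the anchored two-letter
# alternative "TE" matches any UA beginning with "te" in any case.
pattern = re.compile(r"^(Teleport.?Pro|TE|Zeus)", re.IGNORECASE)

for ua in ["TE/1.0", "Teoma/1.0", "TencentTraveler 2.0", "Mozilla/5.0"]:
    verdict = "blocked" if pattern.search(ua) else "allowed"
    print(f"{ua}: {verdict}")
```

Everything starting with "te", in any case, gets swept up along with the intended target.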

Note that you can "comment out" this whole rule-set for testing by preceding it with "RewriteEngine off", or by changing the rule to "RewriteRule .* -", which essentially says "rewrite all requested URL-paths to that same URL-path and continue." Or you could add a URL-pattern that will never match, such as "RewriteRule ^no-such-path-on-my-server-mate\.html$ - [F]". Since that URL is unlikely ever to be requested, it essentially disables the rule. Then you can use a user-agent-spoofer browser plug-in, or one of the online UA-spoofer tools, to see whether ia_archiver is still blocked when your rule is disabled.

Jim

2:30 am on July 15, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


Thank you Jim, I really appreciate your advice.

I've removed the TE|; I can't recall why I had that in there anyway.

I'm mystified why I'm blocking IA, but I'll follow your tip, learn how to use a UA spoofer, and comment out a line at a time.

2:53 am on July 15, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Save yourself some time -- disable the whole routine first. It's quite possible that the problem is not in this code.

If disabling the whole thing *does* allow ia_archiver to get in, then proceed with commenting-out by halves: comment out the whole first half, then uncomment that and comment out the whole second half. Within whichever half fixes it when commented out, uncomment one half and comment out the other, and so on. This binary-search approach can save you a lot of uploading/testing/editing time.
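
The halving procedure amounts to a binary search over the condition list. A sketch, where `blocks` is a hypothetical stand-in for "upload this subset of conditions and check whether ia_archiver still gets a 403", assuming exactly one condition is responsible:

```python
def find_blocking_rule(rules, blocks):
    """Return the single rule that makes blocks(subset) come up True."""
    lo, hi = 0, len(rules)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Test the first half; the culprit is wherever the block persists.
        if blocks(rules[lo:mid]):
            hi = mid
        else:
            lo = mid
    return rules[lo]

# 16 conditions, one troublemaker: found in 4 tests instead of 16.
rules = [f"RewriteCond #{i}" for i in range(16)]
print(find_blocking_rule(rules, lambda subset: "RewriteCond #11" in subset))
# RewriteCond #11
```

In practice each `blocks` call is an upload-and-test cycle, so cutting 16 cycles down to 4 is the whole point.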

Beware this waiting trap: remember while editing that the last uncommented RewriteCond must not have an [OR] flag on it! (I'd suggest you test the very last RewriteCond by itself, then leave it active and exclude it from the subsequent commenting and uncommenting.)
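
A minimal sketch of the trap (assuming the user-agents shown; the point is the flags, not the patterns):

```apache
# Wrong: the true last condition was commented out during testing, so
# the last *active* condition still carries [OR] -- a dangling [OR]
# here is a classic way to end up denying every visitor.
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
#RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule .* - [F]

# Right: whichever subset is active, the last active RewriteCond
# carries no [OR].
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule .* - [F]
```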

Jim

1:14 am on July 16, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


Thanks Jim,

I installed the UA switcher FF add-on, tried entering my site as ia_archiver, and was blocked.

I temporarily turned off the RewriteEngine, and was still blocked. So the Conditions and Rules are OK.

Next I removed all my IP denies, and was still denied, so they are not the culprit.

There's nothing else in my www htaccess likely to cause this.

I restored the full normal htaccess and tried my sub-domains. I could gain entry when spoofing ia_archiver in the browser, which confirms that the main and sub-domain htaccess files are not the cause.

Next I removed my .js browser detect script, but that made no difference.

Currently I've no idea why I'm blocking IA from my main site, yet allowing it into my sub-domains.

If my host was blocking it at server level, IA couldn't get into my sub-domains.

Any ideas anyone?

1:48 am on July 16, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5507
votes: 5


Currently I've no idea why I'm blocking IA from my main site, yet allowing it into my sub-domains.

For some years, I had the archiver denied.
More than a year ago, I removed the denies in an attempt to get my sites archived.

It has never succeeded. Although the archiver does occasionally crawl one of my sites, both sites, when requested through the archive website, still show "access denied".
Thus I concluded it was an archive.org problem.

A few months back, I attempted some communication with the archive folks, and it didn't go well.
The answers I received were standard replies (close to automated, or third-party tech support), as opposed to answers which reflected what my inquiry actually addressed.

One answer explained to me that adjustments in access via the archive website were on a six-month delay.

Since my communications went so well (tongue-in-cheek), I haven't bothered to contact them any further.

Don

12:14 am on July 17, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


Thanks Don, I know what you mean about poor communication, but I suppose they get swamped.

In my case, if I use my browser pretending to be ia_archiver, I +am+ allowed into my sub-domains, but the CSS file they call from my www is blocked. Since (I assume) this type of visit doesn't actually involve my site contacting archive.org, it must be something on my site doing the blocking. (In this case just the CSS call.)

When I try the same thing on my www, I get a 403.

I've not asked archive.org about this yet, but last weekend I emailed them (through the address on their site) asking how to remove my +other old+ sites that I cannot put a robots.txt file in. They are long dead.

I got a personal reply, and I've responded with a list of the old sites to be removed. Hopefully, it will go okay.

I'll not confuse them by asking about this blocking problem yet, I'll let them deal with the list of old sites first.

It is most peculiar.

4:19 am on July 17, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Lacking a mod_access or mod_rewrite rule that acts on some particular characteristic of ia_archiver requests, there is nothing that archive.org can do that will affect whether they can access your site or not. *You* control your server response, not them.

Since your tests indicate that you don't have any such rules in your .htaccess file, I'd be looking through your control-panel options for your main domain, and asking your host whether they are doing anything higher up at the server-config or firewall level -- successful ia_archiver accesses to your subdomains notwithstanding.

Jim

12:11 pm on July 17, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


Wilco Jim,

My host is good; I doubt they'd block IA unless I asked. Sounds more like it's my hamfistedness somewhere. I'll ask them to have a look and report back here.

12:19 pm on July 17, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5507
votes: 5


Lacking a mod_access or mod_rewrite rule that acts on some particular characteristic of ia_archiver requests, there is nothing that archive.org can do that will affect whether they can access your site or not.

Jim,
Perhaps clarification is required for my previous reply?

The archiver was long ago removed from robots.txt.

As were the archiver references and IPs from htaccess.

Their bot visits my sites, however no pages are listed in their website archives.
In addition, my websites, when requested through the archive website, result in a page which reads "Blocked site error".

12:22 pm on July 17, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


If they are like most hosts, you will have to emphasize that you removed your .htaccess code and still had the problem. In fact, you may actually have to remove it again while they look around. Many hosts hear ".htaccess", panic, and give you the "We don't support .htaccess" response -- end of discussion.

I can understand it, I guess, since supporting user configurations can add a lot of support cost, but it is quite annoying when you have a problem and have already done your homework to eliminate your .htaccess code from the possible-cause list...

Jim

1:19 pm on July 17, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


You were right Jim; my host just confirmed they block IA and MSIECrawler in the core.

They must be really bad bots to get that treatment.

Good host, baaad bot!

I'll have to email IA, and ask them to pull our site.

Many thanks for your advice.

Spot-on as usual.

1:00 am on July 18, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


Update: My host acknowledges that their core block needs fine-tuning to stop IA bots getting into sub-domains as well as main directories. They blocked them "for abusive behaviour when crawling sites and driving up the server loads to unacceptable levels."

wilderness:
Since I recently got a sensible human reply from Archive.org, you may have success if you try emailing them again, with a simple clear request for your site to be included.

You want in, I want out :)

Why do you want to be in... proof of copyright evidence?

I want to be out, in case the SEs count the site and content duplicates stored in Archive.org against me.
I'm not convinced they are clever enough not to.

5:55 am on July 18, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5507
votes: 5


Why do you want to be in... proof of copyright evidence?

That was precisely my logic when I allowed access; however, with the lapse of time (a year, perhaps two), I'm no longer as convinced of the necessity.

I want to be out, in case the SEs count the site and content duplicates stored in Archive.org against me.
I'm not convinced they are clever enough not to.

Widgets have loads of pages and sites that have disappeared from the active web (wish I hadn't deleted all those dead links years ago), and in some instances I utilize in-content links to archive.org pages, which the active SEs are NOT picking up on.
Whether the above is any help in your dupe inquiry, I've no clue.

1:24 pm on July 18, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Oct 13, 2003
posts:705
votes: 0


wilderness said:
I utilize in-content links to archive.org pages, which the active SEs are NOT picking up on. Whether the above is any help in your dupe inquiry, I've no clue.

I made a list of all the copyright infringements I have DMCA'd over the years, ready to locate in Archive.org and DMCA them over too, if necessary.

If it comes to that, they are going to be very busy -- and so am I!

I may well ask them about this, after I've got my sites out, I don't want to overload them. I emailed them details of all my old dead sites today, with proof of ownership. So we will see how efficient they are.

My most infringed pages, with hundreds of illegal copies in IA, are actually still top ranked in both G and Y!

I conclude from this that the +SEs are+ ignoring dupes in IA, as you suggest. But that could change instantly (in error), and I'd rather they were out of IA. The copies have all long since been removed from the infringing sites, of course.

Looking at your old sites in IA is exquisite torture.
And we thought they were the bee's knees.