Archive.org

Forum Moderators: goodroi

for sites not crawled since 2006

Neil_McRae

5:44 pm on Jun 28, 2011 (gmt 0)

Does anyone know any tricks for getting a more recent entry to appear for a site that hasn't been crawled since 2006? It seems there are a lot of websites out there that were crawled more frequently up until then. I contacted IA asking whether they take payments to update a website's entry. They wrote back saying that if the site is well linked it is likely to be crawled again, and that they do not charge for their services. I don't currently have a robots.txt file in the root directory of my site. Do I need one, or could I have something else enabled that might be preventing crawlers from retrieving copies of my website?

Thanks for any information you are able to give me about this.

-Neil

lucy24

7:39 pm on Jun 28, 2011 (gmt 0)

Obvious question: has the site in fact changed since 2006?

Rumor has it that some robots get anxious if they can't find a robots.txt at all, and go on to assume the worst. ("I'm not allowed in here.") It may be safer to put a robots.txt in place, even if all it says is

User-Agent: *
Disallow:

That is, let everyone crawl everywhere-- but you will soon decide you didn't really mean this!
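If you want to confirm how a parser reads those two lines, here is a minimal sketch using Python's standard urllib.robotparser. The example.com URL and the helper name can_crawl are made up for illustration, and IA's own crawler may not interpret the file exactly the way this parser does.

```python
from urllib.robotparser import RobotFileParser

# The allow-all file from the post above.
ALLOW_ALL = """\
User-Agent: *
Disallow:
"""

# A contrasting file that shuts out ia_archiver only (illustrative).
BLOCK_IA = """\
User-agent: ia_archiver
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/index.html"  # placeholder URL
    print("allow-all file:", can_crawl(ALLOW_ALL, "ia_archiver", page))
    print("blocking file: ", can_crawl(BLOCK_IA, "ia_archiver", page))
```

An empty Disallow: value is treated as "nothing is disallowed", which is why the two-line file lets every robot in.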

Is the site getting human visitors? What's in the htaccess? It's hard not to think that back around 2006 you absent-mindedly locked them out. This is easy to do, because the ia_archiver is squarely in the middle of the amazonaws block, which a lot of people lock out on principle.

My current .htaccess says, complicatedly,
[deny from]
174.128.0.0/16
174.129.117.0/24
174.130.0.0/15
174.132.0.0/15
Fortunately my father taught me base 2 in early childhood ;)
This is still not precise: I'm allowing the whole 129 range except 117 which contains a robot I don't care for.* But so far I haven't met anyone but the ia_archiver elsewhere in 129.

* Nothing wrong with the robot itself, I just don't like their bosses.
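For anyone who didn't learn base 2 in early childhood, the four ranges above can be checked with Python's standard ipaddress module. This is only a sketch: the sample addresses are arbitrary picks to show which side of the fence they land on, not known crawler addresses, and is_denied is a made-up helper name.

```python
import ipaddress

# The four "deny from" ranges quoted in the post above.
DENIED = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "174.128.0.0/16",
        "174.129.117.0/24",
        "174.130.0.0/15",
        "174.132.0.0/15",
    )
]


def is_denied(addr: str) -> bool:
    """Return True if addr falls inside any of the denied ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DENIED)


if __name__ == "__main__":
    # Show the first and last address covered by each range.
    for net in DENIED:
        print(f"{str(net):18} covers {net[0]} to {net[-1]}")

    # Arbitrary sample addresses, chosen only for illustration.
    for sample in ("174.129.1.1", "174.129.117.5", "174.131.200.1", "174.134.0.1"):
        print(sample, "denied" if is_denied(sample) else "allowed")
```

Each /15 spans two adjacent second-octet values, so the list blocks 174.128 and 174.130 through 174.133 entirely, while only the 117 slice of 174.129 is blocked.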

dstiles

10:26 pm on Jun 28, 2011 (gmt 0)

We have blocked IA for years, both in robots.txt and using UA and IP detection.

Neil_McRae

11:26 pm on Jun 28, 2011 (gmt 0)

Originally posted by lucy24
Obvious question: has the site in fact changed since 2006?

Yes it has changed since then.

I made a robots.txt and I uploaded it to my root directory. That's the same directory in which I have my index.html file that my domain points to. The htaccess for that directory is empty.

The robots.txt file I made says:

User-Agent: *
Disallow:

No spaces after the asterisk on first line.
One space on second line after disallow.

Hoople

12:14 am on Jun 29, 2011 (gmt 0)

I recovered a customer's site in IA that was deleted by its past webhost in 2004. Well, most of it.

Fast forward to recently. They have a beta link to their archive that allowed me to recover even more. Not all, but a LOT more.

See http://wayback.archive.org/web/ [wayback.archive.org]

tangor

1:36 am on Jun 29, 2011 (gmt 0)

Neil_McRae, welcome to WebmasterWorld! IA, for some (most?), is an undesired archive of copyrighted websites, so blocking and/or removal requests are more common. However, getting them to index has never been a problem: just have an active website and they will EVENTUALLY get there.

As for site modifications, the question is whether many or all of the pages have had their LAST DATE MODIFIED headers changed. If most are still 2006, it is unlikely IA's bot will spend much time, even with a new robots.txt. Their resources are nowhere near those of the big SEs and, judging by their name and mission, they are more interested in HISTORICAL SNAPSHOTS of websites than in keeping 100% up-to-date.
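To see what date a server is actually reporting, you can read the Last-Modified HTTP response header, which the server sends and which is not the same thing as a tag in the page's HTML. Below is a rough sketch using only the Python standard library. The function names and the example.com URL are placeholders, and whether IA's crawler really weighs this header is the assumption made in this post, not something confirmed.

```python
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def fetch_last_modified(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the raw Last-Modified header, if any."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def last_modified_year(header_value: Optional[str]) -> Optional[int]:
    """Parse an HTTP date such as 'Wed, 28 Jun 2006 17:44:00 GMT' into a year."""
    if not header_value:
        return None
    try:
        return parsedate_to_datetime(header_value).year
    except (TypeError, ValueError):
        # Header present but not a parseable date.
        return None


if __name__ == "__main__":
    url = "http://www.example.com/"  # placeholder: substitute your own page
    raw = fetch_last_modified(url)
    print("Last-Modified:", raw)
    print("Year:", last_modified_year(raw))
```

Some servers send no Last-Modified header at all for a given page, in which case both values print as None.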

phranque

9:09 am on Jun 30, 2011 (gmt 0)

welcome to WebmasterWorld, neil!

you might want to check out Did The Wayback Machine Die in 2010?:
http://www.webmasterworld.com/search_engine_spiders/4257186.htm [webmasterworld.com]

Neil_McRae

6:04 pm on Jul 1, 2011 (gmt 0)

Originally posted by tangor
As for site modifications, that question is in regards to many or all of the pages having LAST DATE MODIFIED headers changed. If most are still 2006, it is unlikely IA's bot will spend much time, even with a new robots.txt ...

I hadn't even been using "date modified" tags in my headers, so I added some for all of my pages that I want to be indexed. And my headers on those pages now look like this:

http://www.stuntsillusion.com/pics/image001.jpg

Not every one of my pages has a "description" or a "keywords" tag; on the pages that lack them, I still put the new "date modified" tag at the bottom of the header (the line before the page title).
