Home / Forums Index / WebmasterWorld / Webmaster General
Forum Library, Charter, Moderators: phranque & physics

Webmaster General Forum

    
Archive.org
how to get more pages of your site archived?
vitaplease




msg:366253
 4:34 pm on Dec 30, 2002 (gmt 0)

Does anyone know what criteria the Wayback Machine (archive.org) uses to decide how many pages of your site it archives?

from their page: [web.archive.org...]

....http://web.archive.org/200109*/http://www.mysite.com*

This returns all URLs that begin with ...http://www.mysite.com which were archived in September 2001.
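For anyone who wants to generate these wildcard queries for several months, here is a minimal sketch in Python. It only follows the URL pattern quoted above; the function name `wayback_wildcard_url` is made up for illustration, and whether archive.org still accepts this exact form is an assumption.

```python
def wayback_wildcard_url(site_prefix: str, year: int, month: int) -> str:
    """Build a wildcard query URL following the pattern quoted above.

    site_prefix is the start of the URLs you want listed,
    e.g. "http://www.mysite.com" (hypothetical site).
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # YYYYMM followed by "*" matches any capture date within that month.
    timestamp = f"{year:04d}{month:02d}*"
    # The trailing "*" asks for every URL that begins with site_prefix.
    return f"http://web.archive.org/{timestamp}/{site_prefix}*"


if __name__ == "__main__":
    # Reproduces the September 2001 example from the quoted page.
    print(wayback_wildcard_url("http://www.mysite.com", 2001, 9))
```

Looping over a range of months would give one query per month, which might help when looking for the earliest capture of a page.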

Also, anyone know why they lag so far behind in showing their archives?
(they state 6 to 12 months, but why so stale?)

I am hoping to use archive.org to show copycat webmasters that my content was there before they copied it.

 

duncan12




msg:366254
 7:13 pm on Dec 30, 2002 (gmt 0)

Archive.org gets its archived pages from Alexa, so to get into the archive, you need to get Alexa to crawl your site.

Alexa doesn't deep crawl sites with minimal traffic. So, you will need to install the Alexa toolbar and visit all the pages you want the crawler to capture... you can check and you will see that the crawler visits within 24 hours.

Alexa donates archived pages to the archive after six months... apparently this is to reduce copyright risk.

vitaplease




msg:366255
 7:30 pm on Dec 30, 2002 (gmt 0)

thanks duncan,

I am just wondering if visiting all your own pages is enough for full indexing, or do you also need enough popularity?

ken_b




msg:366256
 7:49 pm on Dec 30, 2002 (gmt 0)

I just recently checked archive.org for my website and was surprised to see that the last few dates they had for me were just my robots.txt.

They are no more excluded than any other search engine, so I was a bit confused.

duncan12




msg:366257
 10:43 pm on Dec 30, 2002 (gmt 0)

Alexa gauges popularity via the toolbar. So if somebody with an Alexa toolbar visits a page on the net, it is deemed to be relatively important, and will be crawled the same day.

Regarding the crawler and your robots.txt, I believe that the Wayback Machine automatically checks your robots.txt file every time somebody does a search on your site, with minimal caching. This way you can exclude your site's content from the Wayback Machine in real time by editing your robots.txt file.
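If you want to check whether your own robots.txt is shutting out the archive's crawler (which could explain seeing only robots.txt in the archive), a rough sketch using Python's standard `urllib.robotparser` is below. The robots.txt content and the site URL are made-up examples, and I am assuming the crawler matches rules under the `ia_archiver` user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks ia_archiver everywhere, allows everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.mysite.com/some-page.html"  # hypothetical URL
    for agent in ("ia_archiver", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Pasting your real robots.txt into `ROBOTS_TXT` shows whether `ia_archiver` is treated differently from other crawlers. This only reflects how Python's parser reads the file; the crawler's own interpretation may differ.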

If the ia_archiver hasn't visited your site in over two months (it takes them two months to complete each crawl), it is probably because none of the Alexa toolbar users visited your site.
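To see when ia_archiver last came by, you can scan your server's access log. The sketch below assumes an Apache-style log with a bracketed timestamp such as `[12/Oct/2002:08:15:02 +0000]` and the user agent string somewhere on the line; the sample lines, the function name, and the 60-day threshold (my reading of "two months") are illustrative assumptions, not anything Alexa documents.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# Matches the bracketed timestamp of an Apache-style access log line.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def last_crawler_visit(lines: Iterable[str],
                       agent: str = "ia_archiver") -> Optional[datetime]:
    """Return the most recent timestamp of a log line mentioning agent."""
    latest = None
    for line in lines:
        if agent not in line:
            continue
        match = TIMESTAMP.search(line)
        if not match:
            continue
        try:
            seen = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip lines whose timestamp is in another format
        if latest is None or seen > latest:
            latest = seen
    return latest


# Made-up sample log lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Oct/2002:10:00:00 +0000] "GET / HTTP/1.0" 200 5120 "-" "ia_archiver"',
    '192.0.2.10 - - [12/Oct/2002:08:15:02 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "ia_archiver"',
    '192.0.2.55 - - [29/Dec/2002:21:40:11 +0000] "GET / HTTP/1.0" 200 5120 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    last = last_crawler_visit(SAMPLE_LOG)
    now = datetime(2002, 12, 30, tzinfo=timezone.utc)  # fixed date for the example
    if last is None:
        print("ia_archiver never seen in this log")
    else:
        overdue = now - last > timedelta(days=60)
        print("last visit:", last.isoformat(), "| over two months ago:", overdue)
```

In practice you would pass an open log file instead of `SAMPLE_LOG` and use the current time for `now`.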

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved