Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Archive.org
for sites not crawled since 2006
Neil_McRae




msg:4332174
 5:44 pm on Jun 28, 2011 (gmt 0)

Does anyone know any tricks for getting a more recent entry to appear for a site that hasn't been crawled since 2006? It seems there are a lot of websites out there that were crawled more frequently up until then. I contacted IA asking them if they take payments to update a website's entry. They wrote back saying that if the site is well linked it is likely to be crawled again and they do not charge for their services. I don't currently have a robots.txt file in the root directory of my website. Do I need to have one, or could I possibly have something else enabled that might be preventing crawlers from retrieving copies of my website?

Thanks for any information you are able to give me about this.

-Neil

 

lucy24




msg:4332215
 7:39 pm on Jun 28, 2011 (gmt 0)

Obvious question: has the site in fact changed since 2006?

Rumor has it that some robots get anxious if they can't find a robots.txt at all, and go on to assume the worst. ("I'm not allowed in here.") It may be safer to put a robots.txt in place, even if all it says is

User-Agent: *
Disallow:

That is, let everyone crawl everywhere-- but you will soon decide you didn't really mean this!
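
One way to sanity-check a robots.txt before uploading it is to run it through Python's standard-library parser. This is only a sketch: the example.com URL is a placeholder, the BLOCK_IA text is a hypothetical contrast case and not a file anyone posted here, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The allow-everything file quoted above.
ALLOW_ALL = """\
User-Agent: *
Disallow:
"""

# Hypothetical contrast case: shut out only the Internet Archive's crawler.
BLOCK_IA = """\
User-agent: ia_archiver
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"  # placeholder URL
    print(can_fetch(ALLOW_ALL, "ia_archiver", page))
    print(can_fetch(BLOCK_IA, "ia_archiver", page))
    print(can_fetch(BLOCK_IA, "Googlebot", page))
```

An empty Disallow value is read as "nothing is disallowed", which is why the two-line file opens the whole site.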

Is the site getting human visitors? What's in the htaccess? It's hard not to think that back around 2006 you absent-mindedly locked them out. This is easy to do, because the ia_archiver is squarely in the middle of the amazonaws block, which a lot of people lock out on principle.

My current .htaccess says, complicatedly,
[deny from]
174.128.0.0/16
174.129.117.0/24
174.130.0.0/15
174.132.0.0/15
Fortunately my father taught me base 2 in early childhood ;)
This is still not precise: I'm allowing the whole 129 range except 117 which contains a robot I don't care for.* But so far I haven't met anyone but the ia_archiver elsewhere in 129.


* Nothing wrong with the robot itself, I just don't like their bosses.
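
For anyone who did not learn base 2 in early childhood, the deny list above can be checked with Python's ipaddress module. A minimal sketch, assuming the four CIDR blocks exactly as posted; the sample addresses are made up for illustration and do not identify any particular robot.

```python
import ipaddress

# The four "deny from" ranges exactly as listed in the post above.
DENY_FROM = [
    "174.128.0.0/16",
    "174.129.117.0/24",
    "174.130.0.0/15",
    "174.132.0.0/15",
]
DENIED = [ipaddress.ip_network(cidr) for cidr in DENY_FROM]


def is_denied(address: str) -> bool:
    """Return True if address falls inside any denied range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in DENIED)


def span(cidr: str) -> tuple:
    """Return the first and last address covered by a CIDR block."""
    network = ipaddress.ip_network(cidr)
    return (str(network[0]), str(network[-1]))


if __name__ == "__main__":
    for cidr in DENY_FROM:
        print(cidr, span(cidr))
    # Made-up sample addresses: one in the open part of 174.129, one in .117.
    print(is_denied("174.129.1.1"), is_denied("174.129.117.5"))
```

A /15 covers two adjacent second-octet values, so 174.130.0.0/15 spans 174.130 and 174.131, and 174.132.0.0/15 spans 174.132 and 174.133. Only 174.129 is left open, apart from the .117 block.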

dstiles




msg:4332277
 10:26 pm on Jun 28, 2011 (gmt 0)

We have blocked IA for years, both in robots.txt and using UA and IP detection.

Neil_McRae




msg:4332301
 11:26 pm on Jun 28, 2011 (gmt 0)

Originally posted by lucy24
Obvious question: has the site in fact changed since 2006?


Yes it has changed since then.

I made a robots.txt and I uploaded it to my root directory. That's the same directory in which I have my index.html file that my domain points to. The htaccess for that directory is empty.

The robots.txt file I made says:

User-Agent: *
Disallow:

No spaces after the asterisk on first line.
One space on second line after disallow.

Hoople




msg:4332320
 12:14 am on Jun 29, 2011 (gmt 0)

I recovered a customer's site in IA that was deleted by its past webhost in 2004. Well, most of it.

Fast forward to recently. They have a beta link to their archive that allowed me to recover even more. Not all, but a LOT more.

See http://wayback.archive.org/web/ [wayback.archive.org]

tangor




msg:4332335
 1:36 am on Jun 29, 2011 (gmt 0)

Neil_McRae, welcome to WebmasterWorld! IA, for some (most?), is an undesired archive of copyrighted websites, so blocking and/or removal requests are more common. However, getting them to index has never been a problem: just have an active website and they will EVENTUALLY get there.

As for site modifications, that question is in regards to many or all of the pages having LAST DATE MODIFIED headers changed. If most are still 2006, it is unlikely IA's bot will spend much time, even with a new robots.txt. Their resources are not anywhere near those of the big SEs, and their name and mission indicate they are more interested in HISTORICAL SNAPSHOTS of websites than in keeping 100% up-to-date.
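
If "LAST DATE MODIFIED headers" here means the HTTP Last-Modified response header (an assumption; the post does not spell it out), the value a server sends can be inspected with a short script. This is a sketch only: fetch_last_modified needs network access, and the function names and the 2006 cutoff are illustrative choices taken from this thread, not anything IA documents.

```python
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def fetch_last_modified(url: str) -> Optional[str]:
    """Send a HEAD request and return the raw Last-Modified value, if any."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get("Last-Modified")


def last_modified_year(header_value: str) -> int:
    """Parse an HTTP date such as 'Sat, 21 Oct 2006 07:28:00 GMT'."""
    return parsedate_to_datetime(header_value).year


def looks_unchanged_since(header_value: str, cutoff_year: int = 2006) -> bool:
    """True if the header claims no change after cutoff_year (illustrative)."""
    return last_modified_year(header_value) <= cutoff_year


if __name__ == "__main__":
    sample = "Sat, 21 Oct 2006 07:28:00 GMT"  # made-up header value
    print(last_modified_year(sample), looks_unchanged_since(sample))
```

For static files this header is normally set by the web server from the file's modification time, so it is separate from anything written inside the HTML of the page.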

phranque




msg:4332958
 9:09 am on Jun 30, 2011 (gmt 0)

welcome to WebmasterWorld, neil!

you might want to check out Did The Wayback Machine Die in 2010?:
http://www.webmasterworld.com/search_engine_spiders/4257186.htm [webmasterworld.com]

Neil_McRae




msg:4333755
 6:04 pm on Jul 1, 2011 (gmt 0)

Originally posted by tangor
As for site modifications, that question is in regards to many or all of the pages having LAST DATE MODIFIED headers changed. If most are still 2006, it is unlikely IA's bot will spend much time, even with a new robots.txt ...


I hadn't even been using "date modified" tags in my headers, so I added some for all of my pages that I want to be indexed. And my headers on those pages now look like this:

[img]http://www.stuntsillusion.com/pics/image001.jpg[/img]

I don't have a "description" or a "keywords" tag on every one of my pages, but I still put the new "date modified" tag at the bottom of my header (the line before the page title) on all the pages that don't have them.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved