Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum


wilderness (WebmasterWorld Senior Member)

Msg#: 4625824 posted 3:52 pm on Nov 25, 2013 (gmt 0)

Not sure why I'm posting this.
Perhaps I'm just confused by the capability and the lack of non-standard procedures.

This creature (for lack of a better word) had offered snapshots of fewer than a dozen out of hundreds of pages. Each snapshot offered a reference date. I explored the most recent addition.

Their FAQ offers the following explanation:
"What software do you run and how data is stored ?
The archive runs Apache Hadoop and Apache Accumulo. All data is stored on HDFS, textual content is duplicated 3 times among servers in different datacenters and images are duplicated 2 times. All datacenters are in Europe."

Their FAQ also provides a brief explanation as to why they do not comply with robots.txt

The FAQ also provides that the tool used is a browser plug-in.

The initial visitor used my page to create a Wiki page, and Wiki Media was the next visitor (see #2 below).
The Wiki page does credit the active web page; unfortunately, I generally deny Wiki referers, as well as the folks who create those Wiki pages (as I did in this instance; a MediaCom IP range).

1) A standard visit to the page using a standard browser ("Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)")

2) followed up by the following (note the trailing comma in lines # 4 & 5, which would have resulted in a 404 even if the 403 didn't exist.):
- - [21/Sep/2013:09:06:43 -0600] "GET /MyFolder/MyPage.html, HTTP/1.1" 403 573 "-" "LinkSaver/2.0"
- - [21/Sep/2013:09:06:43 -0600] "HEAD /SameFolder/SamePage.html, HTTP/1.1" 403 143 "-" "LinkParser/2.0"
- - [21/Sep/2013:09:18:54 -0600] "GET /SameFolder/SamePage HTTP/1.1" 403 644 "http://www.google.com/" "Mozilla/5.0 (compatible; Windows NT 5.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/535.19"
- - [21/Sep/2013:09:22:15 -0600] "HEAD /SameFolder/SamePage, HTTP/1.1" 403 143 "-" "COIParser/2.0"
69.208.90.zz - - [21/Sep/2013:13:24:19 -0600] "GET SameFolder/SamePage, HTTP/1.1" 403 644 "http://en.wikipedia.org/wiki/SimilarName" "Mozilla/5.0 (Windows NT 6.0; rv:23.0) Gecko/20100101 Firefox/23.0"
69.208.90.zz - - [21/Sep/2013:13:24:20 -0600] "GET /favicon.ico HTTP/1.1" 200 419 "-" "Mozilla/5.0 (Windows NT 6.0; rv:23.0) Gecko/20100101 Firefox/23.0"
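For anyone wanting to pick entries like these out of a raw access log, here is a minimal Python sketch. It assumes the log uses the Apache "combined" format (which is what the lines above look like) and flags requests whose path ends in a stray comma. The names `LOG_RE`, `parse_line`, `has_trailing_comma` and `SAMPLE_LINES` are made up for this illustration. Entries whose host field was stripped before posting will not match the pattern and come back as None.

```python
import re

# Apache "combined" log format. The host is matched loosely (any run of
# non-space characters) because the address above is masked as "69.208.90.zz".
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def has_trailing_comma(entry):
    """True when the requested path ends with a stray comma."""
    return entry["path"].endswith(",")


# The two complete entries quoted above (illustrative variable name).
SAMPLE_LINES = [
    '69.208.90.zz - - [21/Sep/2013:13:24:19 -0600] "GET SameFolder/SamePage, HTTP/1.1" 403 644 "http://en.wikipedia.org/wiki/SimilarName" "Mozilla/5.0 (Windows NT 6.0; rv:23.0) Gecko/20100101 Firefox/23.0"',
    '69.208.90.zz - - [21/Sep/2013:13:24:20 -0600] "GET /favicon.ico HTTP/1.1" 200 419 "-" "Mozilla/5.0 (Windows NT 6.0; rv:23.0) Gecko/20100101 Firefox/23.0"',
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        entry = parse_line(line)
        if entry is None:
            continue  # line is not in combined format
        flag = "trailing comma" if has_trailing_comma(entry) else "ok"
        print(entry["status"], entry["method"], entry["path"], flag)
```

Grouping the parsed entries by the `agent` field would be one way to spot the LinkSaver / LinkParser / COIParser family at a glance.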

The 69.208. range is an ATT/SBC PPPoX Pool, many of which I've had denied for more than a decade, due to repeated non-standard practices. I do make some exceptions for known associates.
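The "deny the range, but make exceptions" logic can be modelled in a few lines of Python. This is only a sketch of the decision itself, not the server configuration actually in use. It takes "69.208." literally as 69.208.0.0/16, which is an assumption (the real pool boundaries are not given here), and the exception address is a made-up placeholder.

```python
import ipaddress

# Assumption: "69.208." read literally as a /16. The real pool may differ.
DENIED_NETWORKS = [ipaddress.ip_network("69.208.0.0/16")]

# Hypothetical "known associate" address, invented for this example.
ALLOWED_ADDRESSES = {ipaddress.ip_address("69.208.1.10")}


def is_denied(addr):
    """Return True if addr falls in a denied range and is not an exception."""
    ip = ipaddress.ip_address(addr)
    if ip in ALLOWED_ADDRESSES:
        return False
    return any(ip in net for net in DENIED_NETWORKS)


if __name__ == "__main__":
    for addr in ("69.208.90.17", "69.208.1.10", "192.0.2.5"):
        print(addr, "denied" if is_denied(addr) else "allowed")
```

Checking the exception list first keeps the rule order explicit: a known associate is let through even though the surrounding range is blocked.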

My question is if the THREE snapshot requests were denied, why is there not a corresponding log of the initial page request that was mirrored in their archives?

For "my widgets", there is absolutely no benefit to having materials duplicated (even reworded materials) on a Wiki page. In fact, some "widget Wiki pages" draw traffic away from the existing "widget orgs' own pages".



Msg#: 4625824 posted 3:42 pm on Feb 1, 2014 (gmt 0)

This outfit (apparently started by one Denis Petrov in Prague) makes copies of sites' homepages (main pages) and places them on the web in such a way that they directly compete with the original page or whichever new page is at the archived URL. The server appears to belong to OVH Systems (AS16276) in France.

Unlike the Wayback Machine archive, this outfit does not give owners of pages the option to have them removed from the archive. According to a discussion page at Wikipedia, the related spider (which explicitly ignores robots.txt) has visited from countless IP blocks all over the world, so it seems impossible to contain this scraper. (It seems to me this could be a tool to effectively destroy Wikipedia or any other website it takes a fancy to...) I see huge trouble ahead here...


lucy24 (WebmasterWorld Senior Member)

Msg#: 4625824 posted 9:04 pm on Feb 1, 2014 (gmt 0)

My question is if the THREE snapshot requests were denied, why is there not a corresponding log of the initial page request that was mirrored in their archives?

Wasn't that your #1?

why they do not comply with robots.txt

Same rationale as search-engine previews, right? They're acting in response to a human request, so they don't count as robots. Even if the site vigorously disagrees, and/or the original human doesn't know it's happening.*

* Took me quite a while to grasp that all those thumbnails littering my Webmaster Tools pages aren't constructed from information gleaned in recent crawls. They're built on the fly via a robot with Preview in its name.


Msg#: 4625824 posted 3:24 am on Feb 2, 2014 (gmt 0)

More about archive.is

If I understand their approach correctly, it is this: "if someone decides that they want to keep a copy of a given page, instead of them saving it on their computer or in a private bookmark file, we save it for them by placing it on the web where everybody, including search engine spiders, can see it". This is very different from the way archive.org works, and it has the undesirable consequence that webmasters can no longer remove outdated information (which, in many cases, is the same as "junk") related to their own sites from the web, and in the search engine listings their own "junk" will compete with the relevant information.

One illustration: about half a year ago I used (for a short time, while building a new site) a certain subdomain, but this subdomain has not been active for many months. It had also disappeared from the search engines until, via a search in G, I discovered that archive.is is making available a file whose URL refers to that subdomain. This page not only competes in G's search results with the relevant page, it even got a better placement. So I had to reactivate this subdomain in the DNS and set it as an alias to the actual site. No further harm done at this point, since the information on that "junk" page is not yet outdated, just the layout ;) but I can foresee that at some time in the future this "junk" page can seriously interfere with the purpose of the site in question.

Imagine Microsoft were to decide that copies of files in the "recycle bin" are placed in an archive when you decide to empty the bin and that every time you look for some information on your computer, the search function would also check this archive and add what it found there to the result list...

It seems to me that Mr. Petrov's concept, as implemented right now, has a fatal flaw...
