New archive.org UA

includes heritrix :)

         

caribguy

5:38 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I guess they forgot to read the WebmasterWorld library threads.

207.241.235.49 - - [11/May/2010:11:22:33 +0600] "GET /robots.txt HTTP/1.0" 301 - "-" "Mozilla/5.0 (compatible; archive.org_bot/heritrix-1.15.4 +http://www.archive.org)"

NetRange: 207.241.224.0 - 207.241.239.255
CIDR: 207.241.224.0/20
NetName: INTERNET-ARCHIVE-1
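
For anyone who wants to check other log entries against that range, here is a minimal Python sketch. It assumes only the CIDR shown in the whois output above; the function name is_archive_org_ip is made up for illustration.

```python
import ipaddress

# Range taken from the whois output above (NetName: INTERNET-ARCHIVE-1).
ARCHIVE_ORG_NET = ipaddress.ip_network("207.241.224.0/20")


def is_archive_org_ip(ip):
    """Return True if the address lies inside 207.241.224.0/20."""
    return ipaddress.ip_address(ip) in ARCHIVE_ORG_NET


print(is_archive_org_ip("207.241.235.49"))  # address from the log line above -> True
print(is_archive_org_ip("207.241.240.1"))   # first block past the range -> False
```

A match only says the request came from that netblock, not that the user agent string is genuine.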

Staffa

9:01 pm on May 11, 2010 (gmt 0)

At least while visiting your site they kept the UA short. I had these two visits today (first time ever) coming from the same range:

2010-05-11 17:08:16 GET /robots.txt - 80 - 207.241.235.47 HTTP/1.0
Mozilla/5.0 (compatible; archive.org_bot/heritrix-1.15.4 +http://www.archive.org) CCBHP=HPAttributed=/keyword/2010/A750/homepage.aspx; CFCLIENT_CART=cartid%3D3328202%23; CFGLOBALS=urltoken%3DCFID
%23%3D17569%26CFTOKEN%23%3D9c172d90d66b9c77%2D4324C455%2D3048%
2D6168%2DA8FCF58432017ADD%23lastvisit%3D%7Bts%20%272010%2D05%
2D11%2011%3A33%3A26%27%7D%23timecreated%3D%7Bts%20%272010%2D05%2
D11%2008%3A21%3A35%27%7D%23hitcount%3D9%23cftoken%3D9c172d90d66b
9c77%2D4324C455%2D3048%2D6168%2DA8FCF58432017ADD%23cfid%3D17569%23; CFID=17569; FTOKEN=9c172d90d66b9c77-4324C455-
3048-6168-A8FCF58432017ADD; FRLANCOMEAUTH=A24A8F427990B876B9E1D
76CDBDF05EAACCC6B7BC0EDC3BC276631BA687E9942274586F6243607A673110
78DCEC0CBB1B0712AD6EAA9A2CD23FC54D60243DDA7C8433342AA7B31C525171
D8B021F2177BD234D8A713558D55932A89234D4D9594ED8B639AEA40F5634F5A0
452FAAC26E6EC91428878C5B150B4964BB2CA04DC15AE1DFDD230B27FADDCEF6B
BF4E753C58D3884F6D6DFD76B7D3DD05348348FBE;
LTLocalizationCookie=Localization=en-US&CurrentLocalization=en-US;
PHPSESSID=3qg7gu1k88fmll3p8phrnijrd2; UserAuthentication=authenticated=no&user_id=e1578d36-260f-4563-
92c6-01561470f945&EZ_id=2239fc88-f02f-4e2c-bef8-7ff78ca66d51;
XTCsid=519939919295be572dc731e0df083f3c

2010-05-11 17:08:19 GET / - 80 - 207.241.235.47 HTTP/1.0
Mozilla/5.0 (compatible; archive.org_bot/heritrix-1.15.4 +http://www.archive.org) CCBHP=HPAttributed=/keyword/2010/A750/homepage.aspx; CFCLIENT_CART=cartid%3D3328202%23; CFGLOBALS=urltoken%3DCFID%23%3D17569%
26CFTOKEN%23%3D9c172d90d66b9c77%2D4324C455%2D3048%2D6168%2DA8FCF
58432017ADD%23lastvisit%3D%7Bts%20%272010%2D05%2D11%2011%3A33%3A2
6%27%7D%23timecreated%3D%7Bts%20%272010%2D05%2D11%2008%3A21%3A35%
27%7D%23hitcount%3D9%23cftoken%3D9c172d90d66b9c77%2D4324C455%2D30
48%2D6168%2DA8FCF58432017ADD%23cfid%3D17569%23; CFID=17569; CFTOKEN=9c172d90d66b9c77-4324C455-3048-6168-A8FCF58432017ADD;
FRLANCOMEAUTH=A24A8F427990B876B9E1D76CDBDF05EAACCC6B7BC0EDC3BC276
631BA687E9942274586F6243607A67311078DCEC0CBB1B0712AD6EAA9A2CD23FC5
4D60243DDA7C8433342AA7B31C525171D8B021F2177BD234D8A713558D55932A89
234D4D9594ED8B639AEA40F5634F5A0452FAAC26E6EC91428878C5B150B4964BB2
CA04DC15AE1DFDD230B27FADDCEF6BBF4E753C58D3884F6D6DFD76B7D3DD053483
48FBE; LTLocalizationCookie=Localization=en
-US&CurrentLocalization=en-US; PHPSESSID=3qg7gu1k88fmll3p8phrnijrd2;
UserAuthentication=authenticated=no&user_id=e1578d36-260f-4563-92c6-
01561470f945&EZ_id=2239fc88-f02f-4e2c-bef8-7ff78ca66d51;
XTCsid=519939919295be572dc731e0df083f3c

The word "keyword" in the path is my substitution; in the original it is a keyword related to my site. I have no page called homepage, no such URL construction, and no aspx pages.
I also have no idea what the gobbledegook is after archive dot org (it was one continuous line, which I broke so as not to mess up the layout of this page), but it is certainly not something either my site or its log files create.

Mods, if this gobbledegook has any meaning please delete.

Pfui

3:42 am on May 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If the shoe fits... Google the following phrase, dated yesterday:

Internet Archive Releases Amazon S3 Like API

Backstory:

1.) Archive.org "[receives] data donations from Alexa Internet..." -FYI [archive.org]

"The Internet Archive Wayback Machine is a service created by Alexa..." -FYI [web.archive.org]

2.) Alexa Internet is an Amazon.com company.

3.) "Amazon S3" = "Amazon Simple Storage Service" a.k.a. one of the "Amazon Web Services" a.k.a. amazonaws.com a.k.a. --

amazonaws.com plays host to wide variety of bad bots
[webmasterworld.com...]

P.S.

I've blocked archive.org's bots from everything except robots.txt for years, ever since one too many scrapers made massive error_log and other messes. To me, the Wayback Machine is akin to Google translation -- a wide-open side door for troublemakers.

Staffa

5:49 am on May 12, 2010 (gmt 0)

Pfui, thank you for this information, I wasn't aware of their "new product" offering.

In any case I'm not bothered, since they have been blocked for years, as have amazonaws and G translate. At this visit they didn't even get robots.txt because of the non-www > www redirect.

aristotle

6:16 pm on May 14, 2010 (gmt 0)

Can someone please give me the code for blocking Google Translation in robots.txt?
Thank you

wilderness

7:09 pm on May 14, 2010 (gmt 0)

Can someone please give me the code for blocking Google Translation in robots.txt?


nothing like hijacking a topic ;)

Translators (Google or otherwise) don't request or comply with robots.txt.
Should you wish to forcibly deny them entry to your site(s), you'll need to add their IPs to .htaccess.

keyplyr

7:13 pm on May 14, 2010 (gmt 0)

@ aristotle

You can't block anything in robots.txt. All you can do is "request" that a User Agent follow your directives, but ultimately many will not.

You can, however, block using other means. If your server is Apache/Unix and you have access to an .htaccess file, you can use mod_rewrite to stop all requests containing the term "translate" in their referrer string:


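# Refuse (403 Forbidden) any request whose Referer contains "translate",
# except requests for robots.txt itself.
RewriteEngine On
RewriteCond %{HTTP_REFERER} translate
RewriteRule !^robots\.txt$ - [F]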


If you wish to only block the requests from Google's translation service, use this instead:


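# Same rule, but only when the Referer contains translate.googleusercontent
# (the pattern is an unanchored, case-sensitive regex).
RewriteEngine On
RewriteCond %{HTTP_REFERER} translate\.googleusercontent
RewriteRule !^robots\.txt$ - [F]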
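
To see which requests the two rule sets above would refuse, here is a rough Python model of the same logic. It is only an illustration under simple assumptions, not how Apache evaluates rules internally; the function name is_forbidden and the sample referrers are made up.

```python
import re

# The two RewriteCond patterns from the rule sets above.
ANY_TRANSLATOR = r"translate"
GOOGLE_ONLY = r"translate\.googleusercontent"


def is_forbidden(referer, path, pattern=ANY_TRANSLATOR):
    """Model of: RewriteCond %{HTTP_REFERER} <pattern> + RewriteRule !^robots\\.txt$ - [F]."""
    # In .htaccess context the rule is matched against the path
    # without its leading slash.
    local_path = path.lstrip("/")
    referer_matches = re.search(pattern, referer) is not None
    is_robots = re.fullmatch(r"robots\.txt", local_path) is not None
    # Forbidden only when the referrer matches AND the target is not robots.txt.
    return referer_matches and not is_robots


# Hypothetical referrers, for illustration only.
print(is_forbidden("http://translate.googleusercontent.com/", "/index.html"))
print(is_forbidden("http://translate.googleusercontent.com/", "/robots.txt"))
print(is_forbidden("http://www.example.com/translate-tips.html", "/page.html"))
print(is_forbidden("http://www.example.com/translate-tips.html", "/page.html", GOOGLE_ONLY))
```

The third call shows the catch with the broad pattern: any referring URL that merely contains the word "translate" is refused too, which is what the narrower second rule set avoids.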

aristotle

9:27 pm on May 14, 2010 (gmt 0)

Thanks for the help. Also, I really wasn't trying to hijack the thread. I just saw a chance to get some information I had been wanting.

Thank you

blend27

4:21 pm on May 15, 2010 (gmt 0)

CFCLIENT_CART=cartid%3D3328202%23; CFGLOBALS=urltoken%3DCFID
%23%3D17569%26CFTOKEN%23%3D9c172d90d66b9c77%2D4324C455%2D3048%
2D6168%2DA8FCF58432017ADD%23lastvisit%3D%7Bts%20%272010%2D05%
2D11%2011%3A33%3A26%27%7D%23timecreated%3D%7Bts%20%272010%2D05%2
D11%2008%3A21%3A35%27%7D%23hitcount%3D9%23cftoken%3D9c172d90d66b
9c77%2D4324C455%2D3048%2D6168%2DA8FCF58432017ADD%23cfid%3D17569%23; CFID=17569; FTOKEN=9c172d90d66b9c77-4324C455-
3048-6168-A8FCF58432017ADD; FRLANCOMEAUTH=A24A8F427990B876B9E1D
76CDBDF05EAACCC6B7BC0EDC3BC276631BA687E9942274586F6243607A673110
78DCEC0CBB1B0712AD6EAA9A2CD23FC54D60243DDA7C8433342AA7B31C525171
D8B021F2177BD234D8A713558D55932A89234D4D9594ED8B639AEA40F5634F5A0
452FAAC26E6EC91428878C5B150B4964BB2CA04DC15AE1DFDD230B27FADDCEF6B
BF4E753C58D3884F6D6DFD76B7D3DD05348348FBE;
LTLocalizationCookie=Localization=en-US&CurrentLocalization=en-US;
PHPSESSID=3qg7gu1k88fmll3p8phrnijrd2; UserAuthentication=authenticated=no&user_id=e1578d36-260f-4563-
92c6-01561470f945&EZ_id=2239fc88-f02f-4e2c-bef8-7ff78ca66d51;
XTCsid=519939919295be572dc731e0df083f3c

I also have no idea what the gobbledegook is after archive dot org ......


Seems like they are sending you/your-app 3 cookies to chew on..

CFGLOBALS - Coldfusion Based
LTLocalizationCookie - ASP.NET Based
PHPSESSID - PHP Based

Do you run all 3 technologies on your site?
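
A quick way to list the cookie names in a string like that is sketched below in Python. The sample is a subset of the logged cookies (the long CFGLOBALS and FRLANCOMEAUTH values are left out for readability), and the name-to-technology table is just the attribution given above plus ColdFusion's CFID/CFTOKEN; it is an assumption, not an authoritative list.

```python
# Subset of the cookie string from the log entry quoted above.
SAMPLE = (
    "CFCLIENT_CART=cartid%3D3328202%23; CFID=17569; "
    "LTLocalizationCookie=Localization=en-US&CurrentLocalization=en-US; "
    "PHPSESSID=3qg7gu1k88fmll3p8phrnijrd2; "
    "XTCsid=519939919295be572dc731e0df083f3c"
)

# Attribution as given in this post (illustrative, not exhaustive).
KNOWN_COOKIES = {
    "CFGLOBALS": "ColdFusion",
    "CFID": "ColdFusion",
    "CFTOKEN": "ColdFusion",
    "LTLocalizationCookie": "ASP.NET",
    "PHPSESSID": "PHP",
}


def parse_cookie_string(raw):
    """Split 'name=value; name=value' into a dict, cutting at the first '='."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


for name in parse_cookie_string(SAMPLE):
    print(name, "->", KNOWN_COOKIES.get(name, "unknown"))
```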

Staffa

8:38 pm on May 15, 2010 (gmt 0)

Thank you for this clarification, blend27.

I do not run any of these technologies.
Just out of interest, do you happen to know what the purpose of these cookies could be?

blend27

1:39 pm on May 16, 2010 (gmt 0)

They are for storing client data in a cookie, at least in the CF case as far as I know. I am going to take the CFGLOBALS part (up to the semicolon that ends it):
CFGLOBALS=urltoken%3DCFID%23%3D17569%26CFTOKEN%23%3D9c172d90d66b9c77%2D4324C455%2D3048%
2D6168%2DA8FCF58432017ADD%23lastvisit%3D%7Bts%20%272010%2D05%2D11%2011%3A33%3A26%27%7D%23timecreated%3D%7Bts%20%272010%2D05%2D11%2008%3A21%3A35%27%7D%23hitcount%3D9%23cftoken%3D9c172d90d66b9c77%2D4324C455%2D3048%2D6168%2DA8FCF58432017ADD%23cfid%3D17569%23;

Use client variables for data that is associated with a particular client and application and that must be saved between user sessions. Use client variables for long-term information such as user display or content preferences. ColdFusion uses two of these cookies for the CFID and CFToken identifiers, and also creates a cookie named cfglobals to hold global data about the client, such as HitCount, TimeCreated, and LastVisit.

There is more on this at help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0c35c-7fd5.html. The first time you visit a CF site, it sets the appropriate cookies on your machine.

As to why they are sending that data: that is really interesting. If you don't have CF installed on your site, I'd say they are faking the request. Or maybe you do and don't even know it.
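
The CFGLOBALS value quoted above is just URL-encoded text, so it can be made readable with a few lines of Python. This is a minimal sketch: it assumes the pieces are separated by "#" once decoded, which is how the logged value looks, and the function name cfglobals_fields is made up.

```python
from urllib.parse import unquote

# The CFGLOBALS value from the log, rejoined into one string.
CFGLOBALS = (
    "urltoken%3DCFID%23%3D17569%26CFTOKEN%23%3D9c172d90d66b9c77%2D4324C455"
    "%2D3048%2D6168%2DA8FCF58432017ADD%23lastvisit%3D%7Bts%20%272010%2D05"
    "%2D11%2011%3A33%3A26%27%7D%23timecreated%3D%7Bts%20%272010%2D05%2D11"
    "%2008%3A21%3A35%27%7D%23hitcount%3D9%23cftoken%3D9c172d90d66b9c77"
    "%2D4324C455%2D3048%2D6168%2DA8FCF58432017ADD%23cfid%3D17569%23"
)


def cfglobals_fields(raw):
    """URL-decode the value and split it into key/value pairs on '#'."""
    fields = {}
    for chunk in unquote(raw).split("#"):
        key, sep, value = chunk.partition("=")
        # The urltoken piece contains '#' itself, so this naive split keeps
        # only its first fragment and skips the key-less leftovers.
        if sep and key:
            fields[key] = value
    return fields


fields = cfglobals_fields(CFGLOBALS)
for key in ("lastvisit", "timecreated", "hitcount"):
    print(key, "=", fields[key])
```

Decoded, the value carries exactly the LastVisit, TimeCreated and HitCount data described above, with hitcount at 9 and both timestamps dated 2010-05-11.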

Staffa

9:50 pm on May 16, 2010 (gmt 0)

I know what cookies are, but why would they append them to the UA?

I checked and CF is not installed on the server.
.NET and PHP are, but I don't use either of them.

blend27

2:52 am on May 17, 2010 (gmt 0)

Staffa, that is not the UA, unless I am reading your log file format the wrong way. It is a QS (query string) or cookie field, at least.

If that string is part of the UA, boot it.
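
One way to test that is to check whether anything follows the closing parenthesis of the bot's UA. The Python sketch below assumes the UA has exactly the archive.org_bot/heritrix shape seen in this thread; the function name split_ua is made up, and the sample is only the start of the logged line.

```python
import re

# Matches the archive.org_bot UA as it appears in the log entries above,
# then captures whatever trails after it.
UA_RE = re.compile(
    r"^(Mozilla/5\.0 \(compatible; archive\.org_bot/heritrix-[\d.]+ "
    r"\+http://www\.archive\.org\))\s*(.*)$",
    re.DOTALL,
)


def split_ua(field):
    """Return (user_agent, trailing_data); trailing_data is '' if none."""
    match = UA_RE.match(field.strip())
    if match is None:
        # Not the archive.org_bot UA: leave the field untouched.
        return field, ""
    return match.group(1), match.group(2)


# Start of the logged line only; the rest of the cookie data is omitted.
logged = (
    "Mozilla/5.0 (compatible; archive.org_bot/heritrix-1.15.4 "
    "+http://www.archive.org) CCBHP=HPAttributed=/keyword/2010/A750/homepage.aspx; "
    "CFCLIENT_CART=cartid%3D3328202%23"
)
ua, extra = split_ua(logged)
print("UA:   ", ua)
print("extra:", extra)
```

If extra comes back non-empty for a field the server really logged as the user agent, the request is carrying junk in its UA header; if the log format simply prints the cookie column next to the UA, it will look the same, so check the log's field list first.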