Google-Fake Directories testing 404

wilderness

3:05 am on Dec 14, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In recent weeks, Google has been testing for 404s using nonsense characters as non-existent leading directory names.

A few of the major SE's have done the same in the past, and it's nothing new.

Unfortunately, yesterday I had a regular visitor from an otherwise normal IP do the same thing: nonsense characters as a non-existent leading directory name.

This visitor requested the same page on three successive days using two different UAs from the same IP (down to the Class D). All requests were accompanied by full supporting files.

On the fourth day, with the nonsense-character non-existent leading directory, no supporting files were requested, and the invalid request was followed up by a favicon request.


Has anybody seen such practices from a non-SE?

dstiles

7:28 pm on Dec 14, 2014 (gmt 0)

Is this a single URL or several? Are URL(s) possibly in another site's pages, which G has indexed and others are following? Have you checked the URL in G?

keyplyr

7:46 pm on Dec 14, 2014 (gmt 0)

In my experience, Bing has been doing it daily for over 3 years, despite their explicit denial. Google does it intermittently, and I've even seen it occasionally in their Webmaster Tools as Crawl Errors (i.e. blaming it on me). The danger, of course, is these convoluted URLs escaping into the wild, which may well be what your "regular visitor" is an instance of.

lucy24

9:23 pm on Dec 14, 2014 (gmt 0)

nonsense-characters for non-existent leading directory names

Do you mean something like
/plausible-directory-name/fjkl4j3ifj.html

where the directory name is not in the cat-on-the-keyboard form they use for "soft 404" testing? That does sound like someone else with a bad link to your site, and the search engine is trying to work out whether the directory exists at all. Have they requested any of the trio
/plausible-directory-name/
/plausible-directory-name/index.html
/plausible-directory-name
?

Or did you mean the other way around, like
/fjkl4j3ifj/plausible-file-name.html

?

I've never seen the cat-on-the-keyboard business with directory names. Only files. But then, I may not have looked closely enough.

wilderness

10:12 pm on Dec 14, 2014 (gmt 0)

dstiles,
It was the same URL in all four visits, with the exception of the leading phony directory on the 4th.

lucy,

Here are the first two instances from Google in this month's logs.


66.249.75.48 - - [06/Dec/2014:07:57:35 -0700] "GET /STQiZ/MySub/sub-sub/MyPage.html HTTP/1.1" 404 451 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.48 - - [06/Dec/2014:08:00:02 -0700] "GET /STMhZ/DifferentSub/DifferentSub-Sub/DifferentPage.html HTTP/1.1" 404 439 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

And from the visitor in question:

69.161.111.zzz - - [12/Dec/2014:19:14:17 -0700] "GET /VNLmZ/DifferentSub/DifferentPage.html HTTP/1.1" 404 451 "-" "Mozilla/5.0 (Linux; Android 4.0.4; BNTV400 Build/IMM76L) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.59 Safari/537.36"

FWIW, I've denied the IP to the Class C.

lucy24

11:50 pm on Dec 14, 2014 (gmt 0)

Huh. Interesting. That's definitely not Google's fake-URL pattern. The soft-404 tests come in the form
[a-z]{8,16}

using only lower case.

Are yours always in the form
[A-Z][A-Z][A-Z][a-z][A-Z]

? It sure looks like someone following a bad link. Exasperating that you're not getting any unambiguous human requests with a referer that you could go investigate.

:: detour to pore over raw logs ::

Huh again. Wonder what prompted the googlebot to ask for
/i18n/english/

-- a form I have never used anywhere? It jumped out at me because it's the only time they ever got a 404 on a bare directory name. Other than that it's always
/dir/subdir/cat-on-keyboard-stuff-here

using actual URLpaths, with or without leading (sub)directory names.

Odd corollary discovery: I said "/dir/subdir/" but in fact the only time they've made requests in the form
/(\w+/)+[a-z]{8,16}\.html

(i.e. one or more leading directories before the garbage) was a short spell in the spring of 2013. Almost two years ago, so I've no way to guess what was going on-- something unusual on their part, or a different redirect pattern (unlikely!) on mine.
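The two shapes being contrasted here can be told apart mechanically. A minimal Python sketch, assuming only the patterns quoted in this thread (the sample paths are invented placeholders, not anything Google has documented):

```python
import re

# Google's soft-404 probe as lucy24 describes it: a final segment
# of 8-16 lower-case letters, optionally with an .html extension.
SOFT_404 = re.compile(r"^(/[\w-]+)*/[a-z]{8,16}(\.html)?$")

# The five-letter fake leading directory from wilderness's logs,
# first and last characters upper-case (e.g. /STQiZ/, /VNLmZ/).
FAKE_DIR = re.compile(r"^/[A-Z][A-Za-z]{3}[A-Z]/")

def classify(path):
    if FAKE_DIR.match(path):
        return "fake-leading-dir"
    if SOFT_404.match(path):
        return "soft-404-probe"
    return "other"

print(classify("/STQiZ/MySub/sub-sub/MyPage.html"))           # fake-leading-dir
print(classify("/plausible-directory-name/fjkljeifjq.html"))  # soft-404-probe
print(classify("/MySub/MyPage.html"))                         # other
```

Run against a batch of logged 404 paths, a classifier like this quickly shows whether a visitor's requests follow the search-engine pattern or something else.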

wilderness

1:26 am on Dec 15, 2014 (gmt 0)

What's their 'trip'?

66.249.75.80 - - [14/Dec/2014:11:27:59 -0700] "GET /TbUQZ/MWjXZ/NoZVZ/VLLSZ/KRQWZ/XNbnZ/KaKbZ/MySub/MySubSub/MyPage.html HTTP/1.1" 404 451 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

lucy24

2:58 am on Dec 15, 2014 (gmt 0)

Yikes. But always exactly five characters, with no more than two lower-case? Let's see how long it takes them to throw out more variations.


In other news, Google has suddenly (i.e. TODAY) started asking for assorted URLs in
/paintings//etcetera

(Only one directory.) I swear I can't find anything wrong on my end. I did add a subdirectory a few days ago, but its links are clean.

Fortunately it is not many days since I figured out how to redirect multiple slashes. But sheesh, how annoying. Can you say "Duplicate Content"? Ignoring duplicate slashes seems to be an Apache default. MAMP does it too.

Is Google simply trying new things at random? Here an /ABcDE/ directory, there a double slash?

wilderness

8:24 am on Dec 15, 2014 (gmt 0)

It must be something broken or gone astray.
Why in Sam's hell would they want an affirmed 404 for robots.txt?

66.249.75.64 - - [14/Dec/2014:18:54:53 -0700] "GET /SPbjZ/robots.txt HTTP/1.1" 404 451 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

lucy24

8:39 am on Dec 15, 2014 (gmt 0)

It must be something broken or gone astray.

I think with this latest specimen you've eliminated all other possibilities ;) Google has simply gone bonkers. (Are you even allowed to have a robots.txt for anything but the root? Who would know to ask for it?)

:: detour to check whether they're still asking for /paintings// since it would be pretty aggravating to set up a redirect if it was just a passing whim on their part ::

Yup. Why in Sam's hell would they ask for an URL format that isn't even recognized by the server as aberrant?

keyplyr

9:54 am on Dec 15, 2014 (gmt 0)

For years Bing's reply was that my server was creating that nonsense. I couldn't be sure, as I had some iffy redirect code in my htaccess at that time (forwarding non-www to www, changing dynamic DB paths, etc.). So I removed all that code, but Bing still requested about 100 nonsense directories/files daily.

Further emails with Bingbot support produced the same run-around, this time saying that in time Bing would drop those non-existent URLs from their crawl. They never did. They continue still, on a daily basis.

A year ago Google's reply was basically that they were just following links (one of the few times Google has ever replied directly to my complaints regarding Googlebot.)

In conclusion - I trust Google's crawl more than I do Bing, who IMO is pretty stupid... so it is possible that Google is somehow following the Bing 404s (GTB/Chrome) and that is how they both have them.

JMO, YMMV

lucy24

10:20 am on Dec 15, 2014 (gmt 0)

this time saying that in time Bing would drop those non-existent URLs from their crawl. They never did.

"Well, it's all well and fine to say that today
/directory
and
/directory/index.html
both redirect to
/directory/
... but how can we be sure that this will still be the case tomorrow? We'd better check again to be sure."

At least that's my best guess as to the thought process of whoever gives the bingbot its marching orders.

wilderness

4:19 pm on Dec 15, 2014 (gmt 0)

Fortunately it is not many days since I figured out how to redirect multiple slashes. But sheesh, how annoying. Can you say "Duplicate Content"? Ignoring duplicate slashes seems to be an Apache default. MAMP does it too.


lucy,
Might you provide the syntax for this?

Some two+ years ago I had some issues due to my own errors on new pages (unnecessary extra slashes in paths).
Found this old syntax in an old double-slash thread [webmasterworld.com].

Unfortunately, I was never able to implement the elimination of multiple slashes in /sub/subDirectories/ (or even deeper directory structures below the sub-sub level).

lucy24

5:26 pm on Dec 15, 2014 (gmt 0)

You have to put the // in a RewriteCond under %{REQUEST_URI}, meaning that if you need to capture, this has to be the last Condition. I found by experiment that multiple slashes in the pattern-- where they would logically belong-- are simply ignored. At least on my server, and also on MAMP-- which I have to assume uses default settings except where I've personally changed things.

The rule I'm using says simply

RewriteCond %{REQUEST_URI} /paintings//+(.*)
RewriteRule ^paintings http://example.com/paintings/%1 [R=301,L]


Fortunately the problem only happened in one directory, so we don't (yet) have to evaluate conditions on every single request. If the googlebot stops asking I'll remove the rule in a few days.
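The single-directory rule above is all that's needed server-side, but the normalization itself is just a collapse of slash runs. A hedged Python sketch of that logic (the paths are placeholders from this thread):

```python
import re

def collapse_slashes(path):
    """Collapse any run of two or more slashes into one slash,
    the same normalization the 301 above performs for /paintings/."""
    return re.sub(r"/{2,}", "/", path)

# One pass handles any run length, so no repeated redirects are
# needed however many slashes a crawler piles onto the URL.
print(collapse_slashes("/paintings//etcetera"))     # /paintings/etcetera
print(collapse_slashes("/paintings/////etcetera"))  # /paintings/etcetera
```

Because the whole run collapses in a single substitution, a crawler that keeps adding slashes still gets one clean 301 to the canonical URL each time.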

wilderness

5:34 pm on Dec 15, 2014 (gmt 0)

lucy,
When I had this issue, Google kept multiplying the number of slashes on subsequent visits as they were chasing their own tail.

The slashes eventually increased to 5 or 6; then they finally caught the 301 and it took a long while to simmer down. Eventually the requests stopped.

lucy24

9:04 pm on Dec 15, 2014 (gmt 0)

Google kept multiplying the number of slashes

You fill me with dread :)

wilderness

7:04 pm on Dec 16, 2014 (gmt 0)

It would seem that Google and my-other-lone-IP are NOT the only ones testing these waters.

It's certainly an odd occurrence.

Verizon Data
grnsvr4.bellatlantic.
West Orange, NJ


8.28.16.254 - - [16/Dec/2014:06:55:36 -0700] "GET /MySub/MyPage.html HTTP/1.1" 403 832 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; MS-RTC LM 8; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
198.23.5.73 - - [16/Dec/2014:06:55:36 -0700] "GET /SameSub/SamePage.html HTTP/1.1" 403 610 "-" "Mozilla/4.0 (compatible;)"
8.28.16.254 - - [16/Dec/2014:06:55:36 -0700] "GET /KKOPZ/SameSub/SamePage.html HTTP/1.1" 403 832 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; MS-RTC LM 8; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
198.23.5.73 - - [16/Dec/2014:06:55:36 -0700] "GET /KKOPZ/SameSub/SamePage.html HTTP/1.0" 403 832 "http://www.google.com/url?sa=example.com-with-encrypted-search-string" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

wilderness

8:40 pm on Dec 16, 2014 (gmt 0)

more 'bonkers'!

Googlebot-Image has been in my robots.txt for more than a decade!

Over the same period as these 404-dirs there have been some very scattered image requests (NOT full crawls).

66.249.75.64 - - [16/Dec/2014:12:48:51 -0700] "GET /MySub/MyImage.jpg HTTP/1.1" 403 647 "-" "Googlebot-Image/1.0"
66.249.75.64 - - [16/Dec/2014:12:48:52 -0700] "GET /MySub/DifferentImage.jpg HTTP/1.1" 403 647 "-" "Googlebot-Image/1.0"
66.249.75.64 - - [16/Dec/2014:12:48:59 -0700] "GET /QfKWZ/MySub/DifferentImage.jpg HTTP/1.1" 403 635 "-" "Googlebot-Image/1.0"

lucy24

8:56 pm on Dec 16, 2014 (gmt 0)

Oh, ###, now they've gone to a full [A-Z]{5} option.

Do you have any real-life directories with names in this form? If you omit the spurious directory, is what's left over a valid URL? Or is the spurious directory replacing a real one? You could potentially do something like

RewriteRule ^[A-Z][A-Za-z]{3}[A-Z]/(.+) http://www.example.com/$1 [R=301,L]


if-and-only-if you can isolate a pattern. Exceptions if necessary:

RewriteCond %{REQUEST_URI} !/MySuB
RewriteRule ^[A-Z][A-Za-z]{3}[A-Z]/(.+) http://www.example.com/$1 [R=301,L]


I don't know whether there's a difference between
[A-Za-z]{3}
and
[A-Za-z][A-Za-z][A-Za-z]
but I can't imagine the difference would be vast. (People reading along: note that we're here talking about .htaccess, where Regular Expressions are recompiled from scratch on every request.)
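As a sanity check, the rewrite above can be exercised offline before it goes in .htaccess. A Python sketch of the same pattern (the real_dirs exception reuses the hypothetical /MySuB from the RewriteCond above; substitute your actual directories):

```python
import re

# Five-letter leading directory, first and last characters upper-case.
# Group 1 captures the fake directory, group 2 the rest of the path.
SPURIOUS = re.compile(r"^/([A-Z][A-Za-z]{3}[A-Z])/(.+)$")

def strip_spurious(path, real_dirs=("MySuB",)):
    """Return the redirect target with the fake leading directory
    removed, or None when the path doesn't match the pattern or the
    first directory is a real one that happens to fit it."""
    m = SPURIOUS.match(path)
    if not m or m.group(1) in real_dirs:
        return None
    return "/" + m.group(2)

# The fake directory names quoted in this thread's logs all match:
for fake in ("STQiZ", "VNLmZ", "KKOPZ", "ZWfRZ", "QfKWZ"):
    assert strip_spurious("/" + fake + "/MySub/MyPage.html") == "/MySub/MyPage.html"
print(strip_spurious("/MySub/MyPage.html"))  # None - real directory, no match
```

The same trade-off applies as in the htaccess version: the tighter the body of the pattern, the fewer requests ever reach the exception check.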

wilderness

9:19 pm on Dec 16, 2014 (gmt 0)

Do you have any real-life directories with names in this form?


Sort of ;)
Early on with my websites, I used valid names for directories.
However, I recognized almost immediately that names (especially those of more than a few characters) were not ideal for directory names, and began using 3-4 letters, generally a made-up abbreviation of a name.

I recall some vague protocol reference which suggested NOT using two letters or fewer (i.e., 1) for directory names. I do, however, use some two-letter names for sub-sub-directories.
EX.
12345/12/12/

I'm wondering if a variation of the old random UA code would work for these?

RewriteCond %{HTTP_USER_AGENT} [0-9A-Za-z]{15,} [OR]

(Note: somewhere I have another variation of this, modified to use the absence of verbs.)

FWIW, I'm not too worried about these requests (they haven't been really excessive in quantity). Rather, I'm concerned about 'google going bonkers' and the appearance of these other two solitary IPs using a similar technique: have they gone bonkers as well, and is the process something that's 'catchy' (for lack of a better term)?
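The RewriteCond above flags any UA containing a run of 15 or more unbroken alphanumerics, which legitimate browser strings rarely have because their tokens are separated by slashes, dots and spaces. A hedged Python sketch of the same test (the gibberish sample is made up):

```python
import re

# Same test as the RewriteCond above: a run of 15+ consecutive
# alphanumeric characters anywhere in the User-Agent string.
RANDOM_UA = re.compile(r"[0-9A-Za-z]{15,}")

def looks_random(ua):
    return bool(RANDOM_UA.search(ua))

print(looks_random("fjkl4j3ifjqpwoeiruty"))  # True - keyboard mash
print(looks_random("Mozilla/5.0 (compatible; Googlebot/2.1; "
                   "+http://www.google.com/bot.html)"))  # False
```

Worth tuning the threshold against your own logs first: a UA with an unusually long product token would trip a lower bound.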

lucy24

11:33 pm on Dec 16, 2014 (gmt 0)

So far, it looks as if all your spurious requests fit into the pattern
^[A-Z][A-Za-z]{3}[A-Z]/

i.e. first directory is exactly five letters, of which the first and fifth (at least) are capitals. So don't say anything about [0-9] (or, as I put it, \d) and definitely don't use the [NC] flag. You can then constrain your exceptions to any real directories of yours that fit this exact pattern. The more tightly constrained you can make the body of the rule, the less often your server has to stop and evaluate conditions.

Edit:
You've got your raw logs, right? Start by just searching for " 404 \d" with leading space. (The \d is to eliminate any files you might have whose size happens to be 404 bytes.) Then search for
GET /[A-Z][A-Za-z]{3}[A-Z]/\S* HTTP/1\.[01]" 404

and see if the number of results is identical. If the number is smaller, we'll take another look at possible patterns.
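The two searches can be scripted rather than run by hand. A hedged Python sketch (the sample lines are the anonymized entries quoted earlier in this thread, with UA fields shortened; point it at your raw access log in practice):

```python
import re

# lucy24's two searches: all 404s, then 404s whose path starts with
# the five-letter fake directory. If the counts differ, some 404s
# follow a different pattern and deserve a second look.
ANY_404 = re.compile(r'" 404 \d')
FAKE_DIR_404 = re.compile(r'GET /[A-Z][A-Za-z]{3}[A-Z]/\S* HTTP/1\.[01]" 404')

def tally(lines):
    total = sum(1 for line in lines if ANY_404.search(line))
    fake = sum(1 for line in lines if FAKE_DIR_404.search(line))
    return total, fake

sample = [
    '66.249.75.48 - - [06/Dec/2014:07:57:35 -0700] "GET /STQiZ/MySub/MyPage.html HTTP/1.1" 404 451 "-" "Googlebot"',
    '66.249.75.64 - - [14/Dec/2014:18:54:53 -0700] "GET /SPbjZ/robots.txt HTTP/1.1" 404 451 "-" "Googlebot"',
    '1.2.3.4 - - [15/Dec/2014:00:00:00 -0700] "GET /typo-page.html HTTP/1.1" 404 300 "-" "Mozilla/5.0"',
]
print(tally(sample))  # (3, 2) - one 404 falls outside the fake-dir pattern
```

In real use, replace the sample list with the lines of the raw access log; equal counts would mean every 404 fits the fake-directory pattern.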

wilderness

3:31 am on Dec 18, 2014 (gmt 0)

I get occasional requests from DHS (216.81.80.0/20); however, I've never seen any reason to allow them on my widget sites.

The only reason for adding them here was the use of invalid nonsense characters as a non-existent leading directory.

Note four successive requests and the separate UAs.

8.28.16.254 - - [17/Dec/2014:11:25:20 -0700] "GET /MySub/MySubSub/MyPage.html HTTP/1.1" 403 832 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; MS-RTC LM 8; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
216.81.94.71 - - [17/Dec/2014:11:25:21 -0700] "GET /SameSub/SameSubSub/SamePage.html HTTP/1.1" 403 832 "-" "Mozilla/4.0 (compatible;)"
8.28.16.254 - - [17/Dec/2014:11:25:21 -0700] "GET /ZWfRZ/SameSub/SameSubSub/SamePage.html HTTP/1.1" 403 832 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; MS-RTC LM 8; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
216.81.94.71 - - [17/Dec/2014:11:25:21 -0700] "GET /ZWfRZ/SameSub/SameSubSub/SamePage.html HTTP/1.1" 403 647 "http://en.wikipedia.org/wiki/SameNameAsMYPage" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"

lucy24

5:10 am on Dec 18, 2014 (gmt 0)

Do you suppose "zwfrz" is an actual Polish-or-possibly-Czech word, or is it just google throwing in the towel? :) At least it doesn't appear to be a common word in any standard legacy font.

Matter of fact, I'm getting curious. Not curious enough to visit a site explicitly flagged as "this site may harm your browser" (query: if that is the case, why is it still on page 2 instead of way down at the bottom?). And I probably don't need to know which medication used for treating dropsy in fish would scan as "zwfrz". But do you have any straightforward way of pulling up the specific names of any and all bogus directories requested to date? I'm wondering if there is any rhyme or reason beyond the [A-Z][A-Za-z]{3}[A-Z] pattern.

Edit after checking records: Good heavens. I'd forgotten all about 216.81.80.0/20 (DHS). But they must have visited me at some time, because I've got an ID on the range. Also a notation that they came in with Bluecoat. Which in fact would be your 8.28.