
Forum Moderators: phranque

Odd Anomaly with Googlebot-Image/1.0

     
2:18 pm on June 5, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


I was perusing the new Search Console and found a huge number of soft 404s for image files served through a PHP script.

They are listed in the search console as:

http://example.com/forum/download/file.php?id=123456&mode=view


That will not return a 404. However, in the access log the request from Googlebot-Image/1.0 is coming in as:

GET /forum/download/file.php?id\\u003d123456\\u0026mode\\u003dview


Further searching of the logs shows a few sites hotlinking to the images using a URL scheme like that. Is that even valid? Secondly, why the discrepancy between the Search Console and the logs?
2:26 pm on June 5, 2018 (gmt 0)

Full Member

joined:May 21, 2018
posts:276
votes: 72


Looks like JSON encoding. Maybe the sites that are hotlinking to you are messing up their database / feed.
4:24 pm on June 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


That will not return a 404
Well, that's the point. If it did return a 404 it wouldn't be listed as a “soft 404”. (Editorial comment: Is g### really being that stupid? If an URL that doesn't require parameters is redirected to the parameterless version, that's a correct and appropriate response; the alternative is Duplicate Content. What the ### do they expect you to do?)

id\\u003d123456\\u0026mode\\u003dview
That looks like an encoding of non-ASCII, or non-alphanumeric, characters. A quick lookup confirms that it translates to
id=123456&mode=view

The interesting part is that the original (non-google) requests for URLs in this format are hotlinks, which you are presumably either blocking or serving some other content. So nobody except search engines will ever get as far as a successful request.
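That translation can be checked by treating the logged characters as JSON-style escapes. A minimal sketch in Python (the query string is the one from the log above):

```python
import json

# Query string exactly as it appears in the access log
# (literal \uXXXX escape sequences, not real characters).
raw = r"id\u003d123456\u0026mode\u003dview"

# Wrapping it in quotes lets a JSON parser resolve the escapes.
decoded = json.loads('"' + raw + '"')
print(decoded)  # id=123456&mode=view
```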
10:59 pm on June 5, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


Lucy, G is listing it accurately in the console, and I can even go through the process of having it fetch as Googlebot successfully. The image bot, however, is requesting the wrong version, presumably because of these pages, and it appears to be only the image bot doing it.

Here is the offending code from the site hotlinking to it. There are numerous domains doing it, but they all appear to be using the same site format.


//Same thing before for other sites.
<li>
<div class="ktw_img" itemprop="associatedMedia" itemscope itemtype="http://schema.org/ImageObject" itemid="https://example.com/forum/download/file.php?id\u003d123456\u0026mode\u003dview">
Some relevant text
<a href="https://example.com/forum/download/file.php?id\u003d123456\u0026mode\u003dview" rel="nofollow" class="fancybox">
<img src="https://example.com/forum/download/file.php?id\u003d42565\u0026mode\u003dview" alt="file.php?id=123456&mode=view text relevant to image at offendingsite.co" title="Text relative to the image" height="798" itemprop="contentURL" onError="this.onerror=null;this.src='https://encrypted-tbn0.gstatic.com/images?q=random_string_for_thumbnail_on_google';" />
</a>
Relevant text
</div>
</li>
//Same thing after for other sites.


Backslash isn't even a valid character in a URL, is it? I guess my next step here is contacting Google?

which you are presumably either blocking or serving some other content


The script doesn't know what this request is so it returns a friendly 404. "file attachment does not exist..."

11:37 pm on June 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


Backslash isn't even a valid character in a URL is it?
That was my point about encoding. The \u or \\u element, plus the four following hexadecimal digits, is how the offending site's code deals with equals signs and ampersands.
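For reference, some JSON serializers apply this kind of "HTML-safe" escaping by default (Gson, for instance, escapes =, &, <, and > to \uXXXX sequences unless HTML escaping is disabled), which would explain how the hotlinking site's markup ended up this way. A minimal sketch of the idea; the function name is illustrative, not the actual serializer:

```python
# Sketch of HTML-safe JSON escaping (an assumption about what the offending
# site's serializer does); html_safe_escape is an illustrative name.
def html_safe_escape(s: str) -> str:
    return "".join(
        "\\u%04x" % ord(ch) if ch in "=&<>'" else ch
        for ch in s
    )

url = "file.php?id=123456&mode=view"
print(html_safe_escape(url))  # file.php?id\u003d123456\u0026mode\u003dview
```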
11:58 pm on June 5, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


I know why it's appearing that way but my question is why Google would follow it to begin with.

I sent a message through the Feedback link, I won't hold my breath. :P
8:25 pm on July 25, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


They stopped listing new entries as "last crawled" on the 5th, but entries were listed daily before that. I posted this and sent G a feedback message about it on the 5th; not sure if it's a coincidence. I still have requests in the access logs, and now I'm seeing them from Bing too.

I'm wondering if I should do a redirect?
2:00 am on Aug 30, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


They are still listing these and the list is growing; I'm at about 35K now, which is about half the available files. I really don't know how to proceed with fixing this. A redirect would effectively be a redirect to the same URL, right? I'm guessing the only realistic solution is to edit the script to accommodate this so it serves the file?

Do I then have URLs in Google's index showing up as:

example.com/forum/download/file.php?id\\u003d123456\\u0026mode\\u003dview


Any other suggestions?
2:54 am on Aug 30, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4160
votes: 262


If you want Googlebot-Image/1.0 to access the files you could edit the script. If you don't want them asking for the files you can add a few lines to robots.txt (note the rule must match the full path prefix, and Google matches the user agent token without the version number):
User-agent: Googlebot-Image
Disallow: /forum/download/
but that won't help for Bing and others.

2:58 am on Aug 30, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


The errors are coming from 3rd parties hotlinking, correct? So the errors are not on your site. In that case, there is nothing to fix... ignore the GSC report.
3:35 pm on Aug 30, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


These are files I want indexed, or that could be indexed. I'm not sure what is occurring here is being understood, so I'll try to explain again. The URLs excluded in the Search Console because of a soft 404 are listed as the valid URL. This is how Google has it listed as being a soft 404:

http://example.com/forum/download/file.php?id=123456&mode=view


If the bot makes that request it will not get a 404 and will be served the file. If I fetch as Google, I'll get a success. However, that is not how the requests are being made; they are being made like this:

GET /forum/download/file.php?id\\u003d123456\\u0026mode\\u003dview


A 404 is expected for this, and if that were the end of it I couldn't care less; there is nothing I can do about it anyway.

What is not expected is Google marking the valid, indexable URL as a 404. It appears to me that if you wanted to sabotage someone's listing in Google's image index, this would do it, and for all I know that is exactly the intent. The pages hotlinking to mine belong to numerous sites in the same niche, all with screwed-up links like this where files are served through a PHP script. Those pages don't actually appear to be intended for anyone to use; there is a login overlay, and the numerous domains share the same format.

If you don't want them asking for the files you can add a few lines to robots.txt:


I do want them indexed, but this just gave me an idea: I can exclude those two strings with robots.txt, which will prevent the bot from trying to access the invalid URL. Not sure why I didn't think of that before. I'll just have to let Google do its thing and reindex the valid URLs over time.

--edit---
Disallow: /forum/download/file.php?*\\u003d
Disallow: /forum/download/file.php?*\\u0026


Added this to my robots.txt file and tested the valid URL, which passed.
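Those Disallow patterns can also be sanity-checked offline with a minimal Google-style wildcard matcher. This is a sketch, and it assumes the doubled backslash shown in the log excerpts is display escaping of a single literal backslash in the actual request path:

```python
import re

# Minimal Google-style robots.txt rule matcher: rules are anchored at the
# start of the path, and '*' matches any run of characters.
def blocked(path: str, pattern: str) -> bool:
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

# Patterns follow the Disallow lines above (single literal backslash assumed).
rules = [r"/forum/download/file.php?*\u003d",
         r"/forum/download/file.php?*\u0026"]

valid = "/forum/download/file.php?id=123456&mode=view"
bad = r"/forum/download/file.php?id\u003d123456\u0026mode\u003dview"

print(any(blocked(valid, r) for r in rules))  # False -- valid URL stays crawlable
print(any(blocked(bad, r) for r in rules))    # True  -- escaped URL is blocked
```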
 
