
Forum Moderators: phranque

Odd Anomaly with Googlebot-Image/1.0

     
2:18 pm on June 5, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


I was perusing the new Search Console and found a huge number of soft 404s for image files served through a PHP script.

They are listed in the search console as:

http://example.com/forum/download/file.php?id=123456&mode=view


That will not return a 404. However, in the access log the request from Googlebot-Image/1.0 is coming in as:

GET /forum/download/file.php?id\\u003d123456\\u0026mode\\u003dview


Further searching of the logs shows a few sites hotlinking to the images using a URL scheme like that. Is that even valid? Secondly, why the discrepancy between the Search Console and the logs?
2:26 pm on June 5, 2018 (gmt 0)

Full Member

joined:May 21, 2018
posts:276
votes: 72


Looks like JSON encoding. Maybe the sites that are hotlinking to you are messing up their database / feed.
4:24 pm on June 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


That will not return a 404
Well, that's the point. If it did return a 404 it wouldn't be listed as a “soft 404”. (Editorial comment: Is g### really being that stupid? If an URL that doesn't require parameters is redirected to the parameterless version, that's a correct and appropriate response; the alternative is Duplicate Content. What the ### do they expect you to do?)

id\\u003d123456\\u0026mode\\u003dview
That looks like an encoding of non-ASCII, or non-alphanumeric, characters. A quick lookup confirms that it translates to
id=123456&mode=view

The interesting part is that the original (non-google) requests for URLs in this format are hotlinks, which you are presumably either blocking or serving some other content. So nobody except search engines will ever get as far as a successful request.
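That translation can be checked by treating the logged characters as JSON-style escapes. A minimal sketch in Python (the query string is the one from the log above):

```python
import json

# Query string exactly as it appears in the access log
# (literal \uXXXX escape sequences, not real characters).
raw = r"id\u003d123456\u0026mode\u003dview"

# Wrapping it in quotes lets a JSON parser resolve the escapes.
decoded = json.loads('"' + raw + '"')
print(decoded)  # id=123456&mode=view
```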
10:59 pm on June 5, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


Lucy, G is listing it accurately in the console, and I can even go through the process of having it fetch as Googlebot successfully. The image bot, however, is requesting the wrong version, presumably because of these pages, and it appears to be only the image bot doing it.

Here is the offending code from the site hotlinking to it. There are numerous domains doing it, but they all appear to be using the same site format.


//Same thing before for other sites.
<li>
<div class="ktw_img" itemprop="associatedMedia" itemscope itemtype="http://schema.org/ImageObject" itemid="https://example.com/forum/download/file.php?id\u003d123456\u0026mode\u003dview">
Some relevant text
<a href="https://example.com/forum/download/file.php?id\u003d123456\u0026mode\u003dview" rel="nofollow" class="fancybox">
<img src="https://example.com/forum/download/file.php?id\u003d42565\u0026mode\u003dview" alt="file.php?id=123456&mode=view text relevant to image at offendingsite.co" title="Text relative to the image" height="798" itemprop="contentURL" onError="this.onerror=null;this.src='https://encrypted-tbn0.gstatic.com/images?q=random_string_for_thumbnail_on_google';" />
</a>
Relevant text
</div>
</li>
//Same thing after for other sites.


Backslash isn't even a valid character in a URL, is it? I guess my next step here is contacting Google?

which you are presumably either blocking or serving some other content


The script doesn't know what this request is so it returns a friendly 404. "file attachment does not exist..."

11:37 pm on June 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


Backslash isn't even a valid character in a URL is it?
That was my point about encoding. The \u or \\u element, plus the four following hexadecimal digits, is how the offending site's code deals with equals signs and ampersands.
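For reference, some JSON serializers apply this kind of "HTML-safe" escaping by default (Gson, for instance, escapes =, &, <, and > to \uXXXX sequences unless HTML escaping is disabled), which would explain how the hotlinking site's markup ended up this way. A minimal sketch of the idea; the function name is illustrative, not the actual serializer:

```python
# Sketch of HTML-safe JSON escaping (an assumption about what the offending
# site's serializer does); html_safe_escape is an illustrative name.
def html_safe_escape(s: str) -> str:
    return "".join(
        "\\u%04x" % ord(ch) if ch in "=&<>'" else ch
        for ch in s
    )

url = "file.php?id=123456&mode=view"
print(html_safe_escape(url))  # file.php?id\u003d123456\u0026mode\u003dview
```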
11:58 pm on June 5, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


I know why it's appearing that way but my question is why Google would follow it to begin with.

I sent a message through the Feedback link, I won't hold my breath. :P
8:25 pm on July 25, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


They stopped listing new entries as "last crawled" on the 5th, but entries were listed daily before that. I posted this and sent G a feedback message about it on the 5th; not sure if it's a coincidence. I still have requests in the access logs, and now I'm seeing them from Bing too.

I'm wondering if I should do a redirect?
2:00 am on Aug 30, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


They are still listing these and the list is growing; I'm at about 35K now, which is about half the available files. I really don't know how to proceed with fixing this. A redirect would effectively be a redirect to the same URL, right? I'm guessing the only realistic solution is to edit the script to accommodate this so it serves the file?

Do I then have URLs in Google's index showing up as:

example.com/forum/download/file.php?id\\u003d123456\\u0026mode\\u003dview


Any other suggestions?
2:54 am on Aug 30, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4160
votes: 262


If you want Googlebot-Image/1.0 to access the files you could edit the script. If you don't want them asking for the files you can add a few lines to robots.txt (note the rule must match the full path prefix, and Google matches the user agent token without the version number):
User-agent: Googlebot-Image
Disallow: /forum/download/
but that won't help for Bing and others.

2:58 am on Aug 30, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


The errors are coming from 3rd parties hotlinking, correct? So the errors are not on your site. In that case, there is nothing to fix... ignore the GSC report.
3:35 pm on Aug 30, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 4, 2004
posts:891
votes: 3


These are files I want indexed, or that could be indexed. I'm not sure what is occurring here is being understood, so I'll try to explain again. The URLs excluded in the Search Console because of a soft 404 are listed as the valid URL. This is how Google has it listed as being a soft 404:

http://example.com/forum/download/file.php?id=123456&mode=view


If the bot makes that request it will not get a 404 and will be served the file. If I fetch as Google, I'll get a success. However, that is not how the requests are being made; they are being made like this:

GET /forum/download/file.php?id\\u003d123456\\u0026mode\\u003dview


A 404 is expected for this, and if that were the end of it I couldn't care less; there is nothing I can do about it anyway.

What is not expected is Google marking the valid, indexable URL as a 404. It appears to me that if you wanted to sabotage someone's listing in Google's image index, this would do it, and for all I know that is exactly the intent. The pages hotlinking to mine belong to numerous sites in the same niche, all with screwed-up links like this where files are served through a PHP script. Those pages don't actually appear to be intended for anyone to use; there is a login overlay, and the numerous domains share the same format.

If you don't want them asking for the files you can add a few lines to robots.txt:


I do want them indexed, but this just gave me an idea: I can exclude those two strings with robots.txt, which will prevent the bot from trying to access the invalid URL. Not sure why I didn't think of that before. I'll just have to let Google do its thing and reindex the valid URLs over time.

--edit---
Disallow: /forum/download/file.php?*\\u003d
Disallow: /forum/download/file.php?*\\u0026


Added this to my robots.txt file and tested the valid URL, which passed.
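Those Disallow patterns can also be sanity-checked offline with a minimal Google-style wildcard matcher. This is a sketch, and it assumes the doubled backslash shown in the log excerpts is display escaping of a single literal backslash in the actual request path:

```python
import re

# Minimal Google-style robots.txt rule matcher: rules are anchored at the
# start of the path, and '*' matches any run of characters.
def blocked(path: str, pattern: str) -> bool:
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

# Patterns follow the Disallow lines above (single literal backslash assumed).
rules = [r"/forum/download/file.php?*\u003d",
         r"/forum/download/file.php?*\u0026"]

valid = "/forum/download/file.php?id=123456&mode=view"
bad = r"/forum/download/file.php?id\u003d123456\u0026mode\u003dview"

print(any(blocked(valid, r) for r in rules))  # False -- valid URL stays crawlable
print(any(blocked(bad, r) for r in rules))    # True  -- escaped URL is blocked
```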
 
