Forum Moderators: phranque

Google testing for index types?

No5needinput

3:59 pm on Mar 11, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



From error log:

(index.php,index.php5,index.php4,index.php3,index.perl,index.pl,index.plx,index.ppl,index.cgi,index.jsp,index.jp,index.phtml,index.shtml,index.xhtml,index.html,index.htm,index.wml,Default.html,Default.htm,default.html,default.htm,home.html,home.htm,index.js) found, and server-generated directory index forbidden by Options directive, referer: https://www.google.com/webmasters/tools/crawl-errors?hl=en&siteUrl=https://www.example.com/


Just found it interesting, probably already been discussed - I assume it's just Google testing for index types?

topr8

10:48 pm on Mar 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



i would have thought most sites would block access to the actual 'page' as they would want access only to '/'
so i can't see the purpose of this.

in the same way, a large number of sites are extensionless these days (perhaps the majority), so access to filename.php/.asp/.htm/etc. would also be blocked.

phranque

11:46 pm on Mar 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



From error log

did you use that timestamp to check the server access log for the equivalent request?

to me that looks like someone with access to GSC clicked on the crawl errors report (in the "old" search console) and then navigated to the home page url for that site.
you'll probably find that request originated from your IP using your user agent to request the resource.
(or hopefully someone else within your organization)

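the message is a normal "error" level message from mod_autoindex.
i.e. your effective log level for that module is "error" or more verbose (the default "LogLevel warn" already covers it, so there doesn't need to be an explicit "LogLevel autoindex:error" anywhere).
if you don't want these errors logged you must raise the threshold for that module above error, e.g.: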
LogLevel autoindex:crit
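
for context, a minimal sketch of where that line could live, assuming Apache 2.4 and access to the main server or vhost config. the global "warn" level shown here is just the Apache default, not something taken from the OP's setup:

```apache
# Allowed in server config, <VirtualHost> or <Directory> context;
# LogLevel cannot be set from .htaccess.
# The global level stays at the Apache default ("warn"); only mod_autoindex
# is raised to "crit", so its error-level messages are no longer written.
LogLevel warn autoindex:crit
```

the request itself is unaffected: the client still gets the same response, only the error log entry goes away.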

(index.php,index.php5,index.php4,index.php3,index.perl,index.pl,index.plx,index.ppl,index.cgi,index.jsp,index.jp,index.phtml,index.shtml,index.xhtml,index.html,index.htm,index.wml,Default.html,Default.htm,default.html,default.htm,home.html,home.htm,index.js)

i'm guessing this is extracted from the list of index files provided in the DirectoryIndex directive.

phranque

11:51 pm on Mar 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i would have thought most sites would block access to the actual 'page' as they would want access only to '/'
so i can't see the purpose of this.

sites should 301 redirect requests for directory index documents to the (trailing slash) directory path.
e.g. https://www.example.com/index.php should be 301 redirected to https://www.example.com/

mod_autoindex doesn't come into play until a (trailing slash) directory url path is requested.
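
a rough sketch of such a redirect using mod_rewrite in the site's root .htaccess, assuming mod_rewrite is enabled and that only index.php, index.html and index.htm need handling (the extension list is illustrative, not taken from the OP's config). the THE_REQUEST condition restricts the rule to what the client literally asked for, so the internal DirectoryIndex lookup for "/" doesn't cause a redirect loop:

```apache
RewriteEngine On

# Only act when the client itself requested .../index.php|html|htm,
# not when mod_dir internally maps a directory URL to its index file.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/(?:[^/?\s]+/)*index\.(?:php|html?)[?\s]
# /index.php -> /   and   /dir/index.html -> /dir/   (query string is kept)
RewriteRule ^((?:[^/]+/)*)index\.(?:php|html?)$ /$1 [R=301,L]
```

the substitution is written for a root-level .htaccess; in a subdirectory .htaccess the target path would need adjusting.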

tangor

12:29 am on Mar 12, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google tests everything. Period. As phranque suggests, much of this can be controlled on your side with a few redirects ... but that won't change Google's behavior. Bing is not quite as aggressive, but does something similar.

Meanwhile, your error log is working just fine. :)

lucy24

12:47 am on Mar 12, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



if you want to not show these errors
But only if it’s your own server; you can’t set it in htaccess. Anyway, this type of request doesn’t require consulting the Error Log, since they should each show up in access logs as a 404 as well.

OP, you left out the single most important piece of information: were these files requested by the googlebot? Otherwise it’s just another malign robot sending a bogus referer. (And since when does Google itself give GSC as a referer?)

You can have multiple index files in the same directory, and set more than one to be the DirectoryIndex--but once the server has found one on the list, it stops looking. Others then have to be requested by name. I remember once playing with this on my test site; in fact I’ve still got the directory, with a slew of index.htm and index.php and so on. Unlike LogLevel, DirectoryIndex can easily be changed in htaccess. You just have to remember it's there, if you make different settings for one directory.
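
A small sketch of the DirectoryIndex behaviour described above, assuming a per-directory .htaccess and that AllowOverride Indexes (or All) is in effect for it; the filenames are illustrative:

```apache
# mod_dir tries these left to right and serves the first one that exists.
# If index.php is present, an index.html in the same directory is only
# reachable when it is requested by name.
DirectoryIndex index.php index.html index.htm
```

The list applies to the directory holding the .htaccess and everything below it, unless a deeper .htaccess sets its own DirectoryIndex.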

I get heaps of requests for index.php, but that's just malign robots doing their thing--looking for WP vulnerabilities and the like. Never seen the weird extensions.

:: detour to recent logs ::

Nope, nothing but the occasional index.html. In fact, G must have been doing a periodic spot-check pretty exactly a year ago; on one date in March 2018 I find requests for /index.html of almost every directory I own--on a site where these have never been visible URLs.

No5needinput

1:58 pm on Mar 12, 2019 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks for the answers

Directory and index pages all redirect to trailing slash. File names have .php extensions - without trailing slash

I found that particular error in cPanel's last errors page, or whatever it's called. It's not somewhere I usually look for problems, as I have a dedicated server, but I was just browsing around and saw it. I did go to GSC that morning, the old one, so I probably triggered it.

Here is the error again as it was, including the lines immediately above and below it:

[Mon Mar 11 11:33:26.040422 2019] [core:info] [pid NUMBER:tid NUMBER] [client BINGBOT IP] AH00128: File does not exist: /home/example/public_html/dir/dir/non-existent-page.php
(index.php,index.php5,index.php4,index.php3,index.perl,index.pl,index.plx,index.ppl,index.cgi,index.jsp,index.jp,index.phtml,index.shtml,index.xhtml,index.html,index.htm,index.wml,Default.html,Default.htm,default.html,default.htm,home.html,home.htm,index.js) found, and server-generated directory index forbidden by Options directive, referer: https://www.google.com/webmasters/tools/crawl-errors?hl=en&siteUrl=https://www.example.com/
[Mon Mar 11 11:29:07.358159 2019] [:error] [pid NUMBER:tid NUMBER] [client BINGBOT IP:0] File does not exist: /home/example/public_html/dir/dir/non-existent-page.php


No major drama, I just thought the possible number of index. files was interesting :-)

lucy24

6:26 pm on Mar 12, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do the parentheses mean that (one or more items in this long list) was found? The error would make more sense if (none-of-the-above) were found, which is the only reason a server would need to check for auto-generated index permission.

client BINGBOT IP
For a couple of years now, the bingbot has had an irritating habit of asking for lowercase.html when the actual filename (which it also asks for) is CamelCase.html. This leads to a fair number of bingbot 404s.

phranque

11:34 pm on Mar 12, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



bingbot has had an irritating habit of asking for lowercase.html when the actual filename (which it also asks for) is CamelCase.html

i would call this corporate guilt for past misdeeds.
afaik the windows server os default setting was and still is for case-insensitive file names:
Configure Case Sensitivity for File and Folder Names [docs.microsoft.com]