Strange 404s showing up in GSC

     
1:57 pm on Apr 5, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


For two days in a row now I am getting the same 404 error in GSC.

The errors are for the URLs /km and /mile. These coincide with the two positions of the toggle button on the main input form on my site; the toggle is the first input on the form. This suggests to me that Googlebot is trying to crawl the form but is not succeeding.

The site has 5 pages indexed. I have not submitted a sitemap, as there are only three relevant pages to index: the home page and the two input forms. The other two indexed pages are ones I submitted manually, one result page for each form, so that Google could see the results without necessarily indexing an infinite number of potential outputs.

Clearly something is wrong here, or is it? How does one go about getting a dynamic website indexed? Do I need to set something up in GSC under URL parameters?
2:12 pm on Apr 5, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


this might contain some useful information - GET, POST, and safely surfacing more of the web:
https://webmasters.googleblog.com/2011/11/get-post-and-safely-surfacing-more-of.html [webmasters.googleblog.com]
3:04 pm on Apr 5, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


@phranque thanks for the reply, I quickly read through the link. There is no issue with Google rendering the result page: I have submitted a result page via Fetch and Render and everything worked fine, so I know that Google can see the content if it gets to it.

I just took a look at my logs to see what Googlebot is actually crawling. Mostly it is simply requesting my home page. The only attempts at crawling other dynamic pages seem to have ended at the /km or /mile URLs, which throw the 404.
3:43 pm on Apr 5, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3719
votes: 205


You can use parameters, but I would just add /km and /mile to robots.txt disallows. I've not found parameters to be as reliable. As long as they can parse the script, at some point it should sink in.
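
For reference, the robots.txt entries for that approach would look like this (note that Disallow matches by prefix, so /km would also block any longer path beginning with /km):

User-agent: *
Disallow: /km
Disallow: /mile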
5:13 pm on Apr 5, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14707
votes: 613


Sometimes a 404 is just a 404. Although generally the googlebot distinguishes between 410 and 404, I've also found that if a given URL has never returned anything but a 404 (most likely if they followed a bad link) they will not request it very often. So you may not need to take any action at all.

It does occur to me to wonder if it's another form of entrapment. (There probably exists a nicer term.) That is: if a given URL returns content, they will try certain related URLs to see if they return the same content. For example--this one's universal--if there is content at
/blahblah/
they will then also request
/blahblah
and
/blahblah/index.html

(I don't know what they try if the default URL is /blahblah without extension. Is it /blahblah/ with slash or would it be /blahblah.html with extension?)

And if there is content at
sub.example.com/
they will probably also try
example.com/sub/

So with the proliferation of CMS yielding "friendly" URLs, it would not surprise me if every time they see
example.com/blahblah/?parameter
they also try
example.com/blahblah/parameter/
just for the heck of it.
6:59 pm on Apr 5, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


Let me just say I am not concerned about the 404 itself. My concern is that these 404s are symptomatic of a deeper problem with Google's ability or willingness to crawl the site, so I see no real need to block /km or /mile with robots.txt.

I tend to think that the situation is closer to what @lucy24 describes:
example.com/blahblah/parameter


But my worry is that Google is trying to crawl the parameters and can't. I get no organic traffic to the website at all. It has no links pointing to it, but I have had other sites with no inbound links that were still getting some traffic. So I am not sure if this is a niche thing (what I have suspected up to this point) or a Google crawl/indexing thing.

This in turn raises another question. Consider a dynamic site with the design pattern: fill in a form, submit to the server, the server calculates or looks something up and returns a results page, and each results page is identical save for the result calculated. Does Google need to crawl all the results pages to understand the content? How else can Google understand, index, rank and serve it? In my case, with the permutations and combinations of 5 to 8 inputs, the number of nearly identical pages would be tremendous, so indexing every results page seems like the wrong solution. How do you balance these two extremes?
7:02 pm on Apr 5, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


so I know that Google can see the content if it gets to it.

exactly!
i wasn't suggesting it was a rendering issue but rather a discovery issue.

i wouldn't worry about a couple of 404s...
7:10 pm on Apr 5, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


Is your form submitting POST requests or GET requests?
7:18 pm on Apr 5, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


The form submits a GET request and returns a URL with all the query parameters, in the format:
example.com/get_result?param1=1&param2=2...
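
For illustration, a GET form that produces URLs in that format might look like the sketch below; the field names are placeholders, not the site's actual inputs:

<form action="/get_result" method="get">
  <input type="radio" name="unit" value="km" checked> km
  <input type="radio" name="unit" value="mile"> mile
  <input type="number" name="param1">
  <input type="number" name="param2">
  <button type="submit">Calculate</button>
</form>

Submitting it requests /get_result?unit=km&param1=...&param2=..., which is the kind of URL a crawler can only discover by experimenting with the form (per the GET/POST post linked above) or by finding it linked or listed somewhere.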
7:50 pm on Apr 5, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


Is googlebot making any GET requests in that parameter format?
8:03 pm on Apr 5, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


No, the only requests past the home page I have seen in my logs thus far are GET /?, /km or /mile.
2:37 am on Apr 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


So I decided to give the URL parameters feature a go. I entered all the parameters but set them all to "let Googlebot decide". The only thing is that I haven't submitted a sitemap, as I initially did not think these pages should be indexed. Should I submit one now?
3:56 am on Apr 6, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8164
votes: 608


Check your logs and see what other search engines are doing.
3:59 am on Apr 6, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


I initially did not think these pages should be indexed. Should I submit one now?

1. why did you not think so before, and what has changed with those pages or your thinking?
2. if they aren't worth linking to, why should G expend resources looking for them?

(these are general questions you should be asking, not a judgement of you or your specifics.)
4:05 am on Apr 6, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


The form submits a GET request and returns a url with all the query parameters

if you have any concerns about privacy and especially if you are using secure protocol you should consider the implications of exposing your visitors' parameter selections in the urls they are requesting.
4:21 am on Apr 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


@phranque
why did you not think so before, and what has changed with those pages or your thinking?

I assumed that Google would be able to determine what the site was about simply from the few pages that were indexed and would rank it accordingly. I am now questioning that assumption, as the site gets absolutely no organic traffic in its current state. I am not expecting it to rank number one, but the content is unique, and I cannot imagine that over the course of 6 to 9 months not a single person has Googled something sufficiently similar for me to rank for. Unless, of course, Google's interpretation of the content is very narrow.

As you know, I am preparing to add a bunch of new content that I am very confident will attract traffic. But before I go there, the fact that this site has failed so completely leaves me worried that it failed not on its merits but on something I overlooked from a purely technical perspective. I have created crappy sites in the past (and learned my lesson), but even when the content was horrible and there were no links, they still managed to attract some traffic. It is the no-traffic part that gets me here.

Today there was a post on SER [seroundtable.com...] saying you should link to your content. Technically, since the site is dynamic, there are no links to any of the results pages; the user only gets to those pages by submitting the forms. Is this the issue? Again, I feel like I am missing something.
4:26 am on Apr 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


@tangor when I reviewed my logs earlier, I focused mainly on Googlebot, but in passing I noticed some bingbot activity, and it looked very much the same as Googlebot's.

@phranque there is no sensitive data passed in the URL parameters. The new section will have a login, and all of that data will be passed using POST requests.
4:53 am on Apr 6, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8164
votes: 608


From what you've described, you have a self-defeating website as far as search engines are concerned. They are not designed to query forms to get data; they just find data that is accessible. You might want to package more "static" pages which preview or describe the content of the dynamic side for the search engines, and trust that this is enough to get traffic to your "index form" so users can further explore your content.
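
One sketch of that idea: a static overview page that describes the tool and links to a handful of representative result URLs (the path and parameters below are placeholders, not the site's actual ones):

<h1>Example conversions</h1>
<p>A few sample results from the calculator:</p>
<ul>
  <li><a href="/get_result?unit=km&amp;seconds=60">1 minute in km</a></li>
  <li><a href="/get_result?unit=mile&amp;seconds=60">1 minute in miles</a></li>
  <li><a href="/get_result?unit=km&amp;seconds=3600">1 hour in km</a></li>
</ul>

Crawlers can follow plain links like these even though they will never fill in the form themselves.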
6:02 am on Apr 6, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


in general you can't rank for content that isn't indexed; content won't be indexed if a url containing that content isn't crawled; and a url won't be crawled if it isn't discovered.
if nobody is linking to this content, google won't try very hard to find content hidden behind your forms unless something makes google "hungry" for more of your content.
(usually external forces like searcher behavior or references to your content)
even though you accept GET requests for form submits, your dynamic content is minimally discoverable because you can't expect google to generate GET requests for all the combinations of parameter values in your form(s).
you have to either refer them directly or motivate them to find it.

the fact that this site has failed so completely leaves me worried that it failed not on its merits but on something I overlooked from a purely technical perspective

if there's a technical issue with crawling, it should show up in the access&error logs and/or your GSC.
you can see if your content is indexed by doing a site:example.com search perhaps with some appropriate keywords.
is the content in the index sufficient to rank given how competitive the search term is?
what happens with "brand" searches?
7:03 pm on Apr 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


If I understand correctly, and given that direct links are impossible, then a sitemap is the only option.

So I started generating a sitemap.
Question: can I generate a sitemap for only a representative subsection of pages, or should I generate one for every single potential URL? For example, one of my form inputs is a time in seconds; every second the user enters yields a different results-page URL and result. The difference between adjacent values is marginal, so there is no real interest in the result for, say, 2 min 00 sec as compared to 2 min 01 sec. To reduce the number of URLs, I have generated them only for every 30-second interval. Is this okay?
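
A minimal sketch of that sampling approach in Python; the URL pattern, parameter name and range are assumptions for illustration, not the site's actual ones:

from xml.sax.saxutils import escape

# Hypothetical result-URL pattern; the real parameter names are not known here.
BASE = "https://example.com/get_result?unit=km&seconds={s}"

def sample_urls(max_seconds=7200, step=30):
    # Emit one URL per 30-second interval instead of one per possible second.
    for s in range(0, max_seconds + 1, step):
        yield BASE.format(s=s)

with open("sitemap-times.xml", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in sample_urls():
        f.write("  <url><loc>%s</loc></url>\n" % escape(url))
    f.write("</urlset>\n")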
7:07 pm on Apr 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


One more sitemap question...
Does order matter? In other words, is Google more likely to crawl and index the first URLs provided in a sitemap, or is it random?
11:29 pm on Apr 6, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8164
votes: 608


How many entries in your sitemap? Less than 10,000 and I doubt there's any difference.
11:36 pm on Apr 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


On the order of 500k entries when taking only a sample of all possible URLs; otherwise it would be well into the tens of millions.
11:45 pm on Apr 6, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8164
votes: 608


You will need multiple sitemaps to accommodate that many URLs.

According to the Latest Update: 01 Jan 2017

Any Sitemap file is limited to 50MB (uncompressed) with a maximum of 50,000 URLs.

And a Sitemap index file (not to be confused with a Sitemap file) can include up to 50,000 Sitemaps.

So, for a single Sitemap index file, the maximum capacity of URLs and storage could be calculated as described below:

in terms of URLs:

50,000 sitemaps = ( 50,000 * 50,000 ) URLs = 2,500,000,000 URLs


[stackoverflow.com...]
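
For reference, a sitemap index file that ties multiple sitemap files together looks like this (file names are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
  <!-- ...one <sitemap> entry per file, each file holding up to 50,000 URLs -->
</sitemapindex>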
11:57 pm on Apr 6, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


Thanks for the heads-up; I was aware of this. I will be limited by the 50k URLs per file, so I am looking at 10 sitemap files.
12:17 am on Apr 7, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:8164
votes: 608


While it is possible to list all URLs in sitemaps, the question remains whether that is good practice. G and other SEs have crawl budgets, i.e. x amount of time for y URLs, then they move on to the next site, and on return they do a different chunk of URLs while possibly revisiting some previous ones to make sure they are still there. Too many URLs might mean a longer time before the site is fully indexed or, more likely, that the site will never be fully indexed. TMI can get in the way of site indexing, as can non-friendly forms for accessing dynamic sites.

I've experienced both outcomes over the years, and the compromise (this is what I tell clients) is a sitemap index with 10 or fewer category sitemaps, each limited to the "top 10,000" likely URLs, just to get the spigot flowing. Internal linking thereafter should fill in the holes over time. The "full index" happens much quicker, and SE exploration of links seems to happen at a greater rate. YMMV.
1:07 am on Apr 7, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


Tangor, that sounds like very good advice. To some extent less is more, but as with most things the key is finding the optimal balance. The issue in my case remains the dynamic nature of the site: there are no internal links to fill the gaps, so it will be up to Google to interpolate.
12:35 pm on Apr 7, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11276
votes: 133


On the order of 500k entries when taking only a sample of all possible URLs

from what you've described so far i doubt you have enough unique and useful content to support that type of crawl budget.
to the algo it's probably going to look duplicative and/or low quality.

you are probably leading googlebot into the equivalent of a faceted navigation nightmare.

here are some of the concepts from the search point of view (Official Google Webmaster Central Blog).
Google, duplicate content caused by URL parameters, and you:
https://webmasters.googleblog.com/2007/09/google-duplicate-content-caused-by-url.html [webmasters.googleblog.com]
Faceted navigation best (and 5 of the worst) practices:
https://webmasters.googleblog.com/2014/02/faceted-navigation-best-and-5-of-worst.html [webmasters.googleblog.com]
What Crawl Budget Means for Googlebot:
https://webmasters.googleblog.com/2017/01/what-crawl-budget-means-for-googlebot.html [webmasters.googleblog.com]

from the user experience point of view - Filters vs. Facets: Definitions:
https://www.nngroup.com/articles/filters-vs-facets/ [nngroup.com]


if you decide to go with the sitemap you should read everything here - Manage your sitemaps - Search Console Help:
https://support.google.com/webmasters/topic/4581190 [support.google.com]
4:13 pm on Apr 7, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1863
votes: 470


@phranque.

This is not a duplicate content issue. For each URL the page shown is based on the same template, so the templated content is repeated; that will be treated as boilerplate and ignored. The content that is calculated changes for each and every page. I asked this question explicitly some time ago in a Google hangout, because my large static site has tens of millions of such pages, and I was told that Google can easily differentiate boilerplate content from actual content. I take the relative success of my static site as confirmation of this.

The risk, if any, is counter-intuitively that if too few pages are submitted to the index, Google may not be able to tell the boilerplate apart from the real content. But I'm certain this would only be a problem if, say, fewer than ten pages were submitted.

The content is absolutely not faceted or filtered. Every combination of form inputs produces a unique result, not a filtered or refined one. It is not as though changing the toggle from km to mile will show all the results in miles: all the form inputs are required, and a unique result is returned for that exact combination.

But I do not see how this can be avoided. I haven't invented anything new: fill in form -> submit -> get result on a results page. This seems like a pretty straightforward design pattern. The only improvement I can see would be to show the result directly on the page with AJAX, but I am not sure that would solve the problem; it would probably just make things technically more complex.

The bottom line is that somehow I need to tell Google what content is on the other side of the form. My options are:
- create a new page to link to some of the content statically;
- submit a sitemap of possible URLs;
- manually submit a few pages in GSC.
5:46 pm on Apr 7, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14707
votes: 613


Some sites have a generic or default version of a page, accessible even if you don't enter any search terms. (What happens currently if one or more of your parameters is missing?) That could be a starting point for crawls.

If there are millions of possible results, you can't realistically expect all of them to be crawled and indexed. So you may want to think about a different entry point. And then you get returning users who start out on the search page and proceed from there to wherever they want to be. Some users may even figure out shortcuts. (Real-life examples: if I want to know what ABC stands for, I go to a familiar look-up-the-initials site and finish off the URL manually: example.com/abbr/ABC because I know that's how their URLs work. Similarly for IP address lookups, at of course a different site: example.com/11.22.33.44.)

What's most important: to rank in search engines, or to get traffic? Obviously they are not mutually exclusive, but neither are they synonymous.
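
One way to implement the generic or default version described above, sketched here in Flask purely for illustration; the parameter names, defaults and calculation are placeholders, not the site's actual code:

from flask import Flask, request

app = Flask(__name__)

def compute_result(unit, seconds):
    # Stand-in for the site's real calculation.
    km = seconds * 0.005
    return km if unit == "km" else km * 0.621371

@app.route("/get_result")
def get_result():
    # Fall back to representative defaults when parameters are missing,
    # so the bare /get_result URL returns a normal, crawlable page
    # instead of an error, giving crawlers a stable entry point.
    unit = request.args.get("unit", "km")
    seconds = request.args.get("seconds", default=120, type=int)
    return "<p>%d seconds = %.2f %s</p>" % (seconds, compute_result(unit, seconds), unit)

if __name__ == "__main__":
    app.run()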