Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Should I noindex my results pages?

         

NickMNS

7:03 pm on May 16, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am working on a new website/webapp. It will be a tool where the user inputs information into a form; the app looks up the database, makes a calculation and returns a result. It is an equivalency calculator.

The structure of the site will be a homepage, with an explanation of the purpose of the tool and with a start button. The start button leads to the input form, then once the form is submitted a results page is returned with the result. Each result page will have a unique url that will allow the user to bookmark, share and return to the page at a later time.

Now the question: should the results page(s) be "no-indexed"? There are a multitude of permutations and combinations of possible inputs, leading to tens of millions of potential unique results pages. The value provided is not on the page itself but in the actual calculation made by the app.

With the results pages no-indexed, this would mean that the website would have only 2 pages.

The results pages would be set to "follow", e.g.:

<meta name="robots" content="noindex, follow" />

Andy Langton

7:09 pm on May 16, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've done something very similar to this, and gone through various iterations. Happy to share anecdotal results:

- Let Google retrieve results - quickly grew out of control; few, low-ranking pages and a small amount of traffic to results pages. The volume of low-quality pages became a concern
- Noindex, nofollow result pages - better results for the indexed pages (fewer than 10 of them)
- Noindex, follow result pages (because people linked to them) - no noticeable difference from noindex,nofollow

Either you would need to carefully control which results pages are spiderable (and even then, results may be poor) or go for your noindexing method of choice. Google dislikes auto-generated pages.

Note that noindex,follow is often worse than robots exclusion (via robots.txt), unless those pages genuinely lead to discoverable, new content, or (in theory) if the noindexed pages have links to them. My own feeling is that this leads to a site with lots of crawling for little reward (as far as Google is concerned).

NickMNS

7:28 pm on May 16, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Andy thanks for the quick reply.

I didn't think of the crawling aspect, and what you describe makes sense. There is no point getting Google to crawl the site just to find no-indexed pages. It would be a waste of my server resources as well as a waste of Google's.

The only question I have then is: will the results pages still pass PageRank if they are blocked by robots.txt and have external links pointing to them?

Andy Langton

7:49 pm on May 16, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The only option to try to pass pagerank or other link value from an excluded page (whether robots.txt or noindex,nofollow) is noindex,follow. Even with that, whether value is passed is open to question. Google will certainly discover URLs from a noindexed page, and there are some signs that value is passed with noindex,follow, but, to be frank, it's hit and miss at best.

aakk9999

9:13 am on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just a thought - if the pages are practically the same, with only the calculation result being different, you could try rel=canonical instead of noindex. Theoretically this should consolidate PageRank on a single result page you declare to be canonical (maybe one calculation you make yourself just to create a canonical page).

I believe you cannot pass PageRank from a page blocked by robots.txt, as Google would not know where to pass it to (where the page links to), since it is not allowed to crawl it.

NickMNS

1:01 pm on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



According to this post by John Mueller on Stack Exchange, PageRank can be passed to pages that are robotted. Once passed to the domain, I doubt it matters whether it goes any further, especially in this case where the domain will have very few possible landing pages.

[webmasters.stackexchange.com...]

Andy Langton

1:22 pm on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Look at it this way - if an external link points to a URL, that URL receives PageRank, regardless of whether the URL exists, is excluded, etc. If it's excluded, the value has no effect, since the URL itself will not be evaluated beyond being flagged as excluded. If it's robots.txt excluded, Google cannot know whether there are any links on the page.

If it's excluded with noindex,follow, theoretically, Google assigns PageRank from the excluded page in the usual manner. YMMV.

NickMNS

2:41 pm on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Correct me if I am wrong: what I understood from the Stack Exchange post is that if there is a link pointing to a page, it passes PageRank. Blocking the page with robots.txt does not change that, but any links pointing out of the blocked page would not pass PageRank. PageRank is assigned to the domain, correct? Thus the site would benefit from the links.

Andy Langton

3:07 pm on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No, PageRank is per URL, so a link to a robots.txt-excluded URL has no benefit. Theoretically, using a meta robots element with noindex,follow would allow the excluded page to pass its PageRank on to the pages it links to.

In general, links help your whole site, because any page linked to also links to your other pages, thus "spreading" the benefit of links (depending on your site's link structure).

aakk9999

3:34 pm on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I did say exactly that :)
... you cannot pass the pagerank from page blocked by robots.txt

Using canonical or noindex avoids the PageRank circulation issue, but it does result in excessive crawling, as Andy pointed out above.

Theoretically, you could create URLs in such a way that only URLs that get outside links get crawled by Googlebot. To do this, you would need to create URLs that cannot be "guessed" by Googlebot, and make sure that you properly respond with 404 Not Found for URLs that do not exist. One way of doing this is to use some sort of encryption that your web application understands and decrypts into the query string required to generate the page, e.g. it could be something like this:

http://www.example.com/result/9a79037c1fe931bdd886b517

This should prevent excessive crawling, as only URLs that can bring in PageRank would be crawled (unless they are nofollow-ed from the linking domain), since Googlebot would not know any better and it would be difficult to guess one (unless they use the Chrome address bar for "discovery").
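A minimal sketch of this unguessable-URL idea (all names and the secret are hypothetical): derive a 24-hex-character slug from the calculator's query string with a server-side secret, keep a token-to-query mapping instead of decrypting, and return a real 404 for any token not in the mapping.

```python
import hmac
import hashlib

SECRET = b"change-me"          # hypothetical server-side secret
RESULTS: dict[str, str] = {}   # token -> query string (stand-in for the db)

def result_token(query: str) -> str:
    """24 hex chars, matching the /result/9a79... style above."""
    return hmac.new(SECRET, query.encode(), hashlib.sha256).hexdigest()[:24]

def save_result(query: str) -> str:
    """Store the query under its token and return the shareable path."""
    token = result_token(query)
    RESULTS[token] = query
    return f"/result/{token}"

def serve(token: str) -> str:
    # Unknown tokens (bots probing guessed slugs) must get a real 404
    if token not in RESULTS:
        raise LookupError("404 Not Found")
    return RESULTS[token]

url = save_result("score=42&age=30")
print(url)
```

Because the slug is keyed by a secret, a crawler cannot enumerate valid result URLs; only slugs that someone has actually shared (and linked to) resolve to a page.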

aristotle

4:45 pm on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If I understand correctly, Googlebot doesn't often crawl pages that never change, especially if they don't have any backlinks or get much traffic.

iamlost

7:07 pm on May 17, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google and other SEs don't just use identifiable bots or IP ranges; they also use a number of stealth bots, headless browsers, etc. to confirm site behaviours. So just to be sure that site search results are not ranked (they index regardless, if they can identify a URL), I not only disallow in robots.txt but meta noindex as well.

Note: it is a fun game identifying the SE stealth behaviours...

tangor

10:07 am on May 18, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Each result page will have a unique url that will allow the user to bookmark, share and return to the page at a later time.

I've avoided this thread three times, but I can't help myself. I have to ask (given the OP's premise of a CALCULATOR) why would there ever be a URL in the first place?

How many unique URLs to 2+2=4 (per user) can there be before, if nothing else, extreme duplicate/thin content is achieved? What value as a LINK is that result to other users or the web in general? Why would anyone expect such pages to have any value at all?

NONE of my calc widgets store more than a cookie visit. If the user didn't copy/paste, then let 'em come back and input the numbers again.

Sorry, have to get my head around this part before I can address the second part....

aristotle

11:36 am on May 18, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I was thinking the same thing as tangor. I don't see any reason why these kinds of "pages" need to be kept permanently. If they are intended to "beef up" the website's content, but you then noindex or block them, that doesn't work either.

aakk9999

12:00 pm on May 18, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can see why the OP would like to save the calculation results page for the user's future use. This may happen, for example, in these cases:

- if the calculation needs many parameters to be entered (to save the time required for the form being filled in again)
- if the calculation is a snapshot that happens at certain point of time or from certain location, and therefore repeat calc may change the result
- if the calculation takes lots of time/processing power/resources to be performed, so storing result is faster than re-calculating again

A good example of this is the webpagetest website, where after testing a web page's speed, a user can email the results link or link to it.

However, if you take webpagetest as an example and check their robots.txt, you can see that all test results have /result/ in the URL, and that /result/ is blocked via robots.txt.

This would be a better approach if results are to be saved as URLs to be linked to.

NickMNS

12:55 pm on May 18, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Tangor thank you for your input, I sincerely appreciated the feedback. First let me start by saying that this is my first time creating a "calculator" site. All the work I have done in the past has been with simple static information, where any and all calculations have been pre-processed before the site was ever created.

The goal of the site is to determine an equivalence between two people's performance. The user enters information about him/herself and their score (for lack of a better term). Then the user inputs information about another person. The app calculates what the user's score would be if they had the same characteristics as this other person.

The thinking is that the other person will most likely not be present when the user uses the app. So I believe that the user will have a high propensity to share the result. It is true that I could create a "card" with the data that could then be shared. The advantage of sharing a link is that it encourages the people who receive the shared content to engage with it, that is, to use the app again to make another comparison. Using the URL with the data embedded in it then allows the subsequent entry form to be partially filled in, a definite plus for mobile users.

@aristotle, the pages are not kept permanently; I am simply using the URL to store the input data. When a request for a URL is made, the app parses the URL, uses the data to make a request to the db, and returns a new results page. If the data in the db has changed between the time the URL was first accessed and when it is accessed again, then the page changes.
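That round trip (inputs packed into the shareable URL, then parsed back out to pre-fill the form) can be sketched with the standard library; the parameter names here are made up for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_result_url(inputs: dict) -> str:
    """Pack the calculator inputs into the shareable URL's query string."""
    return "http://www.example.com/result?" + urlencode(inputs)

def parse_inputs(url: str) -> dict:
    """Recover the inputs so the entry form can be pre-filled on a return visit."""
    return {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}

url = build_result_url({"score": "42", "age": "30"})
print(parse_inputs(url))  # {'score': '42', 'age': '30'}
```

Since the URL carries only the inputs, the result itself is always recomputed from the current db, which is exactly the "page changes when the data changes" behaviour described.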

As it stands now, I am leaning strongly towards the blocking-by-robots.txt approach.

Andy Langton

1:04 pm on May 18, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Based on your update, I think this is the right approach. I would robots exclude the result pages, and keep an eye on external links. If you do pick up decent links to excluded pages, consider making the content as unique as possible, and using allow or another method to let Google crawl/index. Robots.txt is the easiest on/off switch approach.

tangor

8:42 pm on May 18, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have used the "URL as data input" approach before and it works a treat. I don't, however, create an actual PAGE on the site; I merely print to screen and leave everything in the database. Done that way there's no page to be noindexed or robots.txt-excluded. That said, the new direction outlined above should do what you want.

JS_Harris

3:01 am on May 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<meta name="robots" content="noindex" /> is the correct tag, you don't need to specify follow since that's the default behavior.

I would not recommend using robots.txt to block pages anymore; not only do they still receive PageRank, but they count towards your link graph, and Google displays them in SERPs anyway, without a description.

If the results page being created offers no additional content besides a link to somewhere, especially if the page title doesn't change, and it's all user-generated, I'd use the nofollow tag on them. Keep the actual search URI in the index, however; don't block that.