Webmaster Tools, ignore an already indexed variable
GoNC




msg:4641399
 2:28 am on Jan 31, 2014 (gmt 0)

A lot of the pages on my site have a variable in the URL for the last post date; this way, the user gets a cached version if the page hasn't had any updates since their last visit.

E.g., http://www.example.com/board/1234/?h=20140130213214

So technically, the variable is irrelevant for search engines.

Looking in Google Webmaster Tools, though, I see that 8,667,936 pages are indexed with the h parameter.

If I mark it to be ignored, though, am I going to lose 8 million indexed pages?

 

netmeg




msg:4641411
 3:00 am on Jan 31, 2014 (gmt 0)

You don't want 8 million indexed pages like that.

lucy24




msg:4641420
 4:23 am on Jan 31, 2014 (gmt 0)

Looking in Google Webmaster Tools, though, I see that 8,667,936 pages are indexed with the h parameter.

Do you mean that the URL is valid even if you leave off the h= part entirely?

If I mark it to be ignored, though, am I going to lose 8 million indexed pages?

With any luck, yes. That's assuming you don't have eight million pages of Forum thread, so you've currently got tons of Duplicate Content.

I think the real question is whether you want the search engine to disregard the parameter's value (the usual approach) ... or ignore any URLs that even contain the parameter. Is the h= part present in links?

GoNC




msg:4641423
 4:44 am on Jan 31, 2014 (gmt 0)

Yeah, I mean that these are always identical:

example.com/board/1234/
example.com/board/1234/?h=20140130213214
example.com/board/1234/?h=20140130235029

The only use is for caching. I use the h parameter consistently throughout the site for this, so it should never mean anything else.

What I DON'T know, though, is whether example.com/board/1234/ is indexed at all, or if it's only indexed using the h parameter.

It's definitely possible to have more than one indexed copy of the same page. My concern, though, is that Google might have indexed example.com/board/1234/?h=20140130213214 but not example.com/board/1234/ ; if it hasn't, and I tell Google to ignore the h parameter, will I be accidentally removing the entire page from the index?

Shai




msg:4641440
 6:27 am on Jan 31, 2014 (gmt 0)

Is it possible for you to dynamically create a canonical tag for each cached version of the page, pointing at example.com/board/1234/ ? Then use robots.txt to block Google from any URL with a ? in it.

Disallow: /*?

Like others said, I would not want 8 million pages of dupe content in the index.
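
For what it's worth, a narrower variant that blocks only URLs carrying the h parameter, rather than every query string, would look something like this (the pattern is my assumption, not tested against your URL structure):

User-agent: *
Disallow: /*?h=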

lucy24




msg:4641445
 7:06 am on Jan 31, 2014 (gmt 0)

What I DON'T know, though, is whether example.com/board/1234/ is indexed at all, or if it's only indexed using the h parameter.

Yes, that's why I asked about links. Obviously the ideal would be to have only the no-h-at-all versions indexed. But if there's no way for search engines to find out about them, then a solid second choice is to tell the search engine* that the parameter has no effect on page content, so they can just crawl & index a representative one.

If there's no other way to find out, try some sample searches. Select a solid chunk of text from a few random forum threads and search for it. But you should be able to tell just by looking at your page code. If you click something like "next" or "previous" --or whatever navigation options the search-engine robot has-- what does the actual link say?

Edit: Once you've told the search engine that a parameter has no effect on page content, there's a further decision to make. You can either tell them to pick some "representative" URL, or tell them to ignore any URL that contains the parameter at all. The second approach is probably most common with things like "printer-friendly" where the parameter isn't present at all in an ordinary URL.


* I realize this is the Google subforum. But really, not all questions are google-specific. That Other Search Engine has an ignore-parameters option too.

aakk9999




msg:4641526
 12:36 pm on Jan 31, 2014 (gmt 0)

What I DON'T know, though, is whether example.com/board/1234/ is indexed at all, or if it's only indexed using the h parameter.

As it seems you are internally linking to URLs with the ?h parameter, it is possible that the URL without the h parameter is not indexed at all. You can try to check this using a combination of the site: and inurl: commands:

site:example.com inurl:/board/1234/

and see what versions of URLs are returned.

What currently happens if your date changes? I presume internal links are replaced with a new value of the ?h parameter, but what happens with the old one? Are you redirecting it to the new ?h, does it return 404, or something else? If you are not redirecting it, then over time you will have a ton of duplicate content.

However, this is a very unusual and not recommended way of forcing the cached page to be served. Are your pages so big that you need this kind of mechanism for page caching? Or do you have a problem with the amount of traffic to your server? Ideally you would stop using this mechanism of interlinking.

If you do decide to keep this way of handling URLs, then either Lucy's suggestion on parameter handling or Shai's suggestion to use the canonical link element are good ones. In the case of the canonical link element, just make sure that you do not block the ? in robots.txt, because then Google will not be able to see the canonical link element. And since you are always internally linking using the URL with the ?h parameter, you could add an HTML sitemap page to your website that lists all your URLs without the ?h parameter, so that each canonical URL is actually linked from somewhere within the site.
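
To make that concrete, a minimal sketch (the thread ID and sitemap file name below are just examples, not taken from your site). In the <head> of every thread page, regardless of the ?h= value it was requested with:

<link rel="canonical" href="http://www.example.com/board/1234/">

And on a plain HTML sitemap page (say, /sitemap.html), links to the parameter-less versions:

<a href="http://www.example.com/board/1234/">Thread 1234</a>
<a href="http://www.example.com/board/1235/">Thread 1235</a>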

netmeg




msg:4641533
 1:20 pm on Jan 31, 2014 (gmt 0)

I don't think you want Google crawling those ?h parameter pages either. Let G find the canonicals.

GoNC




msg:4641709
 10:18 am on Feb 1, 2014 (gmt 0)

As it seems you are internally linking to URLs with the ?h parameter, it is possible that the URL without the h parameter is not indexed at all. You can try to check this using a combination of the site: and inurl: commands:

site:example.com inurl:/board/1234/

and see what versions of URLs are returned.


Awesome! I wasn't familiar with the inurl: command.

I searched for a more recent, larger thread, and found that it is indexed 7 times, all with the h parameter. So, yes, there is a ton of duplicate content there.

What currently happens if your date changes? I presume internal links are replaced with a new value of the ?h parameter, but what happens with the old one? Are you redirecting it to the new ?h, does it return 404, or something else? If you are not redirecting it, then over time you will have a ton of duplicate content.

However, this is a very unusual and not recommended way of forcing the cached page to be served. Are your pages so big that you need this kind of mechanism for page caching? Or do you have a problem with the amount of traffic to your server? Ideally you would stop using this mechanism of interlinking.


The h parameter is completely irrelevant to the content, so it doesn't redirect or anything; the server just ignores it. You could change it to anything, and it would show the same content.

Since the site is mostly dynamic content from the users, caching is a big deal; I can't have someone read a thread that's loaded from cache, or they wouldn't see the most recent posts. I tried disabling cache altogether at one point, but it caused a pretty significant drain on the server, so I went with this method 4 or 5 years ago.

Now, the list of topics isn't cached, but the threads themselves are. The logic is simple: the thread page is cached under its full URL, including the h parameter, which is the timestamp of the last post. When you return to the list of topics, if a new post has been made then the h parameter has changed, the link points to a new URL, and you bypass the cache.
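
To illustrate the idea (this is only a sketch, not my actual config, and assumes Apache with mod_rewrite and mod_headers):

# Mark thread URLs that carry the ?h= timestamp
RewriteEngine On
RewriteCond %{QUERY_STRING} ^h=\d+
RewriteRule ^board/ - [E=THREAD_CACHE:1]
# Let browsers keep those responses for a long time; a new post changes
# the ?h= value, so the link becomes a new URL and bypasses the cached copy
Header set Cache-Control "public, max-age=31536000" env=THREAD_CACHE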

Ideally, Google would completely ignore the h parameter, but since all of the threads they've indexed have it, would telling them to ignore it mean that ALL of those pages are dropped? Or would it be wise enough to keep one and just drop the duplicates? I don't see a way in Webmaster Tools to tell them how to handle it.

netmeg




msg:4641756
 3:02 pm on Feb 1, 2014 (gmt 0)

I would put in canonical tags and tell them to ignore anything with ?h in it. It still may take quite a while to get all those duplicate pages out.

And I would tell them not to *crawl* anything with ?h in it either. If your non-parameter pages aren't being indexed now, it might be because Google is using up your crawl budget on pages with parameters. That's not good for Google and it's not good for you. So keep 'em out of there.

Robert Charlton




msg:4641788
 8:31 pm on Feb 1, 2014 (gmt 0)

And I would tell them not to *crawl* anything with ?h in it either

If robots.txt is also used, would Google be able to crawl these pages to see the canonical tag?

lucy24




msg:4641792
 8:53 pm on Feb 1, 2014 (gmt 0)

If your non-parameter pages aren't being indexed now

I got the impression the without-parameter pages simply don't exist. That is, they can be created, but they never occur in links. So the only way for a search engine to learn of the canonical URLs' existence is by crawling the non-canonical form.

Now, since the parameter doesn't affect the content, would it be OK to redirect based on user-agent? Send the googlebot and selected others to the h-less form. Or would google simply get annoyed if every single link, everywhere, always led to a 301?

GoNC




msg:4641800
 9:29 pm on Feb 1, 2014 (gmt 0)

Lucy is correct, every link to these pages will have an h parameter in it. So I can't just tell it to ignore anything with an h, or that kills the majority of the site.

netmeg




msg:4641802
 10:49 pm on Feb 1, 2014 (gmt 0)

Wow. Ok.

aakk9999




msg:4641815
 12:42 am on Feb 2, 2014 (gmt 0)

I would imagine that all forums have this particular caching problem. Have you checked what others do?

Could another solution be to get more powerful servers?

With regards to parameter handling in WMT - there are two options there to choose from: one is "parameter does not change content" and the other is "changes content (reorders/narrows/etc.)".

In your case, the first option is applicable, and when it is selected you have no further choices (such as Every URL, No URLs, etc., which exist for the "Changes content" option).

So "parameter does not change content" shows on the WMT parameters list as "One representative URL".

What I am not sure about is whether Google in this case just drops the parameter and indexes that page, or decides on one particular value of the parameter to index. If it is the former, then this would give you what you want: the URL without the ?h= parameter being indexed.

Perhaps you could do a little test: create a (static) page on the site with a parameter z=, link to that page with three different values of the z= parameter, set up Parameter Handling in WMT for this parameter to say "Does not change page content", and after a while see what Google has indexed - the URL without the z= parameter, or one of the URLs with a particular value of the z parameter.
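
The test links themselves could be as simple as this (page name and z values made up for illustration):

<a href="/z-test-page.html?z=1">test link one</a>
<a href="/z-test-page.html?z=2">test link two</a>
<a href="/z-test-page.html?z=3">test link three</a>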

lucy24




msg:4641838
 4:30 am on Feb 2, 2014 (gmt 0)

In your case, the first option is applicable, and when it is selected you have no further choices (such as Every URL, No URLs, etc., which exist for the "Changes content" option).

Oops. You are right and I had it backward. It's because google's definition of "changes page content" includes things like reordering, where all the words on the page are the same, but they're in different places. To me that wouldn't count as changing content. (Makes it look different, yes. Changes, no.)

It's when you say "Yes, it changes content" that you get a list of four further options.

I guess "crawl only one representative URL" means that if g### finds a link to blahblah.php?h=12345 it checks to see whether it has recently crawled blahblah.php?h=any-number-at-all. And if it has, it doesn't schedule a fresh crawl.

GoNC, when you talk about caching, are you simply referring to the browser's own cache? "I saw this URL yesterday, no need to reload it". Or do you serve up static versions of each URL from copies on your server?

GoNC




msg:4641854
 8:34 am on Feb 2, 2014 (gmt 0)

GoNC, when you talk about caching, are you simply referring to the browser's own cache? "I saw this URL yesterday, no need to reload it". Or do you serve up static versions of each URL from copies on your server?


No, just the first one (browser's cache).

I upgrade the server about once every other year, and redesign regularly with efforts to increase speed (using sprites, minimizing images, tweaking Apache, etc.), so it's blazing along pretty well right now. A few years ago, we had significant load problems on high-traffic days (which was when I implemented this caching method), but I think I have that pretty well under control now.

Sgt_Kickaxe




msg:4641856
 9:10 am on Feb 2, 2014 (gmt 0)

As others have pointed out above, the solution is to use a canonical tag so that any page with an h parameter points to its non-h version. This will tell search engines which page is important, although it will not stop the other pages from being indexed.

Don't block Google with robots.txt; the pages will still be indexed but will have no value for you.

I just wanted to add that, perhaps of greater concern, is the fact that all of these URLs actually do exist on your site and that you link to them. By linking to them you share some value with them, which is ultimately wasted. You can add parameters to any URL and the page will still appear the same; test it on this page, for example: add ?h=123421341234 and WebmasterWorld will still find this page. What you want to avoid is links pointing to pages with parameters, because doing that says "hey search engine, I know this page is nearly identical, but it exists".

In GWT there is an option to tell Google which parameters do not have any impact; I'd add h to that group.

netmeg




msg:4641905
 2:19 pm on Feb 2, 2014 (gmt 0)

Sgt_Kickaxe, it sounds like he doesn't have any non-h-parameter pages. And yeah, you can't block with robots.txt, but that's a whole 'nother can of worms as regards crawl efficiency.

How's your search traffic been since you implemented this caching solution?

aakk9999




msg:4641989
 11:33 pm on Feb 2, 2014 (gmt 0)

Sgt_Kickaxe, it sounds like he doesn't have any non-h-parameter pages.

I got that the same way as netmeg did.

Now the question (not sure if anybody tested it):

So the site is not linking to pages without the ?h= parameter, but identical page content can obviously be retrieved if a request for a page without the ?h= parameter hits the server.

If the page has a canonical pointing to the page without the ?h= parameter, then if Google crawls that URL, it would obviously be returned by the server. Is it then necessary for the site to link to it explicitly? Or would the fact that it is not internally linked make Google ignore the canonical?

GoNC, you could perhaps pick a recent thread and do the following test:

1) use site: and inurl: to see what URLs Google has indexed from this thread
2) add a canonical link element to this page:
<link rel="canonical" href="http://www.example.com/board/1234/">
3) verify the above canonical link element is shown when the page is requested with the ?h= parameter, by requesting the page and then using view source
4) monitor whether the page (with any of the ?h= values) has been re-crawled by Googlebot
5) monitor whether the page without the ?h= parameter has been crawled by Google - or even better, use Fetch as Googlebot and Submit to index
6) watch over time whether the pages with ?h= parameters for /board/1234/ start to disappear and whether new ones stop being indexed

If this works, then you may want to implement the canonical across your site.

[Edit reason]Corrected typo, re= should have been rel= [/edit reason]

[edited by: aakk9999 at 12:34 am (utc) on Feb 4, 2014]

lucy24




msg:4641991
 12:00 am on Feb 3, 2014 (gmt 0)

Now I am curious about a more general question.

When a search engine meets URLs in the form /dir1/dir2/pagename.html they will eventually ask for /dir1/ and /dir1/dir2/ even if these pages don't actually exist and nobody links to them. Do they do the same with parameters? That is, if they habitually see something with h=some-value, do they eventually ask for the same URL without the h= parameter? Seems like sooner or later they would.

GoNC




msg:4641992
 12:05 am on Feb 3, 2014 (gmt 0)

How's your search traffic been since you implemented this caching solution?


Unfortunately, I have no idea when I implemented the h parameter :-(

The site is 13 years old, and I've made thousands of tweaks along the way! I began using Analytics in January 2010, though. Looking at the stats from that first month, I see that the h parameter was there at that time, so I implemented it sometime before then.

Since Jan 2010, traffic has steadily increased, so I don't think it's had a negative impact. But... maybe.

I do know that if I compare 2/1/2011-2/1/2012 to 2/1/2012-2/1/2013, I see a 102% *increase* in unique visitors. But if I compare 2/1/2012-2/1/2013 to 2/1/2013-2/1/2014, I see a 26% *decrease* in unique visitors (with the decrease starting in May 2013). Pageviews have gone up, though, so I don't know if this is related to the h parameter, the alleged "pre-Penguin 2.0" "Phantom" update around 5/2013 (which we're discussing in another thread), or possibly Google counting unique visitors differently. Or, since our sites are localized, it could have something to do with a local change (competing FB groups, a big local office complex might have blocked us, or even the weather can have an impact).

Here's the thread where we're discussing the Phantom update, if you're interested:

[webmasterworld.com...]

GoNC




msg:4641993
 12:40 am on Feb 3, 2014 (gmt 0)

2) add a canonical link element to this page:
<link re="canonical" href="http://www.example.com/board/1234/">


I think you guys are right on this. I'll add the canonical tag to remove the h parameter, then give it a little time and see how Google is indexing the new pages. If it stops duplicating new pages, I'll give it some time and then tell WM Tools to ignore the h parameter.

I quoted the link, though, just in case someone reads this later; it should be <link rel=...>. (aakk9999 accidentally left off the "l")


When a search engine meets URLs in the form /dir1/dir2/pagename.html they will eventually ask for /dir1/ and /dir1/dir2/ even if these pages don't actually exist and nobody links to them. Do they do the same with parameters? That is, if they habitually see something with h=some-value, do they eventually ask for the same URL without the h= parameter? Seems like sooner or later they would.


I'm guessing, no. If they did, it seems like they would have indexed my pages both with and without the h parameter. I can't find any of my pages indexed without it, though.

Robert Charlton




msg:4642045
 10:09 am on Feb 3, 2014 (gmt 0)

Any other way to accomplish the caching?

If so, is there an easy way to also replace all of your nav links with canonical versions? And is there any reason not to then 301 redirect the URLs with h= parameters to the canonical version? Seems it's pattern-match friendly and could be done in a way that wouldn't involve a lot of .htaccess code.
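
As a rough .htaccess sketch of the kind of rule I mean (this assumes the h parameter is the only one on these URLs; if other parameters can appear, the rule would need to preserve them):

RewriteEngine On
RewriteCond %{QUERY_STRING} ^h=\d+$
RewriteRule ^(board/.*)$ /$1? [R=301,L]

The trailing ? in the substitution is what drops the query string from the redirect target.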

Otherwise, you're still sending visitors to a huge number of pages with incorrect URLs whenever they click on a nav link, and that's lots of extra work for Google as well.

lucy24




msg:4642137
 4:52 pm on Feb 3, 2014 (gmt 0)

Any other way to accomplish the caching?

I think the whole point is to prevent the browser from even making a request to the server in the first place. The question is whether the request itself places that big a load on the server. Usually the choice is between serving a cached static copy of a dynamic page vs. building the page from scratch. Here it's something different.

Otherwise, you're still sending visitors to a huge number of pages with incorrect urls

The browser doesn't know that part of the URL is meaningless. So nobody gets redirected.

GoNC, the part I'm trying to figure out is:

User clicks on a link leading to post # suchandsuch. Browser says "Oh, I've been there before, I'll just serve up my cached copy." But what if new posts have come in to the same page of the thread? Wouldn't the user then miss out on those new posts because the browser doesn't know they exist and therefore doesn't put in a new request?

aakk9999




msg:4642152
 6:07 pm on Feb 3, 2014 (gmt 0)

Good point, Lucy, I hadn't thought of this.

I am guessing that the ?h= parameter is used on thread pages only, and not on the page that lists all threads in that particular forum (presumably the browser always goes to the server for this?).

So the problem would be with a thread page refresh, which would be served locally because of the ?h= parameter?

GoNC




msg:4642166
 7:09 pm on Feb 3, 2014 (gmt 0)

Any other way to accomplish the caching?


Lucy and aakk9999 are both absolutely correct; the point is to prevent the browser from making the request, there is no redirection or anything like that, and the h parameter is just on the threads (not the list of threads).

Honestly, I've been using this method for so long (more than 4 years, for sure) that I don't really even know if it's an issue any more. I believe that I was doing this before I discovered mod_deflate, and my server was constantly having problems with high load. Back then, I would load 30 different images for emoticons (I was still of the belief that a lot of smaller pictures were better than one larger picture), had numerous images in the header for spacing, etc., so I was caching as much as possible.

Now, the site has 6-8 images that load on each page (I'm using CSS sprites for emoticons), and of course, the site is compressed, so I could probably try to disable caching for the site entirely and drop the h parameter without any major server problems*. But that's still scary; every once in a while, there will be a big news event locally that causes a major spike in traffic (our highest ever was 507,000 pageviews within about 8 hours), and I'd hate to start having server problems on those days.

Either way, though, that wouldn't really resolve the issue with all of the pages that are indexed to date.

User clicks on a link leading to post # suchandsuch. Browser says "Oh, I've been there before, I'll just serve up my cached copy." But what if new posts have come in to the same page of the thread? Wouldn't the user then miss out on those new posts because the browser doesn't know they exist and therefore doesn't put in a new request?


In theory, the only way this would happen is if they click on "Back", follow a bookmark, or simply haven't left the page and are clicking on refresh.

In those cases, you're right, they would potentially miss out on new posts. I don't know if there's really a lot that we could do about it, though, other than rely on the browser to handle it.

*In retrospect, though, there is a direct correlation between page load time and pages per visit; the faster I can make the site load, the more pages people seem to visit. Since we sell ads at PPI, saving an average of 1 second per page can increase our monthly revenue by $1000+.

aakk9999




msg:4642174
 7:48 pm on Feb 3, 2014 (gmt 0)

Thanks for the great, detailed explanation! I can see your dilemma clearly:

- on one side, you have speed/caching issues you want to address, and you are not sure what would happen if you removed your current ?h= solution

- on the other side, you cannot be sure whether this particular solution has resulted in Google sending you less traffic; although your traffic has increased, would it have increased even more if there were not so much duplicate content?

Have you thought of removing the ?h= parameter on one small section of the forum and monitoring the traffic / page loads / time on page / number of visitors Google sends?

Technology has improved in the last 4 years, and if you test the removal of the h parameter on a sub-section of the forum (adding a canonical or implementing a redirect on the pages where you have removed the h parameter, to solve the duplicate content), then after a while you would probably have some metrics to decide the best way forward.

If you decide to do this, please do return and report your findings, as this is an interesting case of duplicate content.

lucy24




msg:4642179
 8:09 pm on Feb 3, 2014 (gmt 0)

I don't think the requests for images and supporting files are a major issue. You can set a longer caching period for those -- months and months, if you like. Then the browser will only put in a fresh request for the HTML itself, not all the other stuff.
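
Something along these lines, if mod_expires is available (the file types and lifetimes here are just examples):

ExpiresActive On
ExpiresByType image/png "access plus 6 months"
ExpiresByType image/gif "access plus 6 months"
ExpiresByType text/css "access plus 6 months"
ExpiresByType application/javascript "access plus 6 months"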

Someone, somewhere, probably has data on the two aspects of server load: the mere fact of a request, vs. the size of the material sent out. The HTML page is presumably bigger than all supporting files put together, unless you've got image-intensive forums or a wildly inefficient stylesheet. But the raw number of requests comes down on the non-page side.

GoNC




msg:4642185
 8:24 pm on Feb 3, 2014 (gmt 0)

Have you thought of removing the ?h= parameter on one small section of the forum and monitoring the traffic / page loads / time on page / number of visitors Google sends?


That's not a bad idea. I've been working on setting up the rel="canonical" tags, so this would be a good time to try this.

Re: the rel="canonical" tags, do you guys know whether it will hurt if the canonical tag is on every page (linking back to itself)? Or should it just be on those that need to be redirected?

Meaning, if I have both:

example.com/board/1234/
example.com/board/1234/?h=12345

If both have <link rel="canonical" href="http://www.example.com/board/1234/">, is that OK? I know it's not necessary on the page without the h parameter, but there's a question of how to code it.

Someone, somewhere, probably has data on the two aspects of server load: the mere fact of a request, vs. the size of the material sent out.


I can tell you that there was a HUGE difference in server load after I changed my images to CSS sprites. I dropped from an average peak of 300 Apache requests to < 40.

I do think that browsers will render the larger image a little more slowly than parallel requests for multiple images, but now that most people are using high-speed connections, that's not such a concern.
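
For anyone who hasn't tried sprites: the idea is one combined image, with each icon selected by a background offset. A generic example (class names and offsets are made up):

.emoticon { background: url(/images/emoticons.png) no-repeat; display: inline-block; width: 16px; height: 16px; }
.emoticon-smile { background-position: 0 0; }
.emoticon-wink { background-position: -16px 0; }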
