Google SERPs link directly to my sitemap.xml.gz . What should I do?

     
3:25 pm on Apr 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 841
votes: 0


I wonder if this is common or if there is something particularly wrong with my sitemap files.

I have a site that receives traffic from SERPs directly to the sitemap (in gzipped format) rather than to actual content pages. The queries are usually long tail (3+ word phrases or uncommon words such as industrial part numbers), but the result is pretty darn bad for everyone involved (Google, users and my sites):

  • Google gets very strange looking SERPs (and some people might find it spammy, too).
  • People get an empty browser page and a *.gz file starts downloading, which can frighten some people, especially if their PCs don't know what to do with a *.gz file. They might think I'm loading some malware onto their PC.
  • I get zero page views but 400KB of download bandwidth gets wasted each time they needlessly download the gzipped XML sitemap.

If it were any other type of page or file, I would probably just noindex it and forget about it. But I can't do that with a sitemap, can I? Does anyone have any idea what sort of issue this is? Could it be caused by a particular format of my sitemap (Google seems to parse it just fine), by Google's propensity to look inside gzipped files, or by something else?
Thanks!
6:03 pm on Apr 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 7, 2003
posts: 753
votes: 0


That is horrible.

You could use the X-Robots-Tag header to noindex the file. Here is the code for your .htaccess file:

<FilesMatch "sitemap\.xml\.gz$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
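
Once the rule is in place, it may be worth checking that the header is actually being sent. Below is a minimal Python sketch, assuming the response headers have already been fetched into a dict. The helper name has_noindex and the sample header values are hypothetical, and the sketch ignores bot-specific forms such as "googlebot: noindex".

```python
def has_noindex(headers):
    """Return True if an X-Robots-Tag header carries a noindex directive.

    `headers` is a plain dict of response headers (hypothetical input).
    Header names are compared case-insensitively, as HTTP requires.
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            # The value may hold several comma-separated directives.
            directives = [d.strip().lower() for d in value.split(",")]
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                return True
    return False


# Sample headers a server with the rule above might return (assumed values).
sample = {"Content-Type": "application/x-gzip", "X-Robots-Tag": "noindex"}
print(has_noindex(sample))
```

The dict could come from any HTTP client; the check itself does not depend on how the headers were obtained.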
11:10 pm on Apr 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 841
votes: 0


Thanks for the suggestion, deadsea. Like I said, I would have no-indexed it in a heartbeat but this is the site's sitemap we are talking about.

I guess what I'm struggling with is this: I don't want the sitemap.xml.gz file in the index, but I sure want bots to access it and read/parse it. So, what would happen if it's marked as noindex? Do they still read and process files marked as noindex (presumably, to see if there are links in them)?

I may actually be able to try it: I've discovered that another site of mine has the same trouble. It's not a major site and I think I'd like to try your suggestion on it and just see what happens.
11:17 pm on Apr 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 7, 2003
posts: 753
votes: 0


I wouldn't think that Google would index any zipped files, let alone sitemap files. I can't imagine a specific noindex directive would hurt. You can check Webmaster Tools after you do so to make sure that they can still parse your sitemap.
11:33 pm on Apr 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 841
votes: 0


I've added the noindex robot header and tested in WMT -> Test Sitemap. It says "no errors found", so I guess I'm on the right track.
Thanks again for the suggestion, deadsea. I hope this will rectify the issue.
4:25 pm on Apr 24, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 841
votes: 0


This gets really weird.

I've reviewed pretty much all my sites using Google WMT->Your site on the Web->Search Queries->Top Pages

It appears that EVERY site that makes the sitemap available in gzipped format receives what I would call a considerable amount of traffic, or at the very least impressions, directly to the sitemap.xml.gz files (CTR is really bad because both the title and the description snippet look terrible).

Google seems to be insisting on indexing *.gz files even though they are listed as sitemaps in WMT. What gives?

Does anyone else see their sitemaps listed among the top pages in WMT Search Queries reports?
4:35 pm on July 27, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 841
votes: 0


Just wanted to post a quick update: this gzipped sitemap ranking issue is still going on. The most disturbing part to me is that the sitemap.xml.gz file is ranking for some of the terms that are completely made up of bits and pieces of the un-gzipped content of the file. In other words, it looks like it un-gzips the sitemap but then indexes it not as a sitemap but rather as a content file. And then Google goes even further than that - it chops the content up by the word-stop symbols (slash, space, dot, dash, new line), re-hashes it and creates keyword matches that do not even exist on my site at all!
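
To make the "chopping" concrete, here is a rough Python sketch of splitting a sitemap entry on the stop symbols listed above. It only illustrates the behaviour as I describe it and is not Google's actual tokenizer; the function name split_terms and the example.com URL are made up.

```python
import re

# Delimiters named in the post: slash, space, dot, dash, new line.
# \s covers both spaces and newlines.
STOP_SYMBOLS = re.compile(r"[/\s.\-]+")


def split_terms(text):
    """Split text on the stop symbols and drop empty pieces."""
    return [piece for piece in STOP_SYMBOLS.split(text) if piece]


# Hypothetical sitemap entry; the domain and part number are made up.
print(split_terms("http://www.example.com/parts/ab-123.html"))
```

Any combination of the resulting pieces could then look like a keyword match, even though no page on the site contains that phrase.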

Also, another important detail I missed when this was posted originally: I should have really been talking about files called sitemap1.xml.gz, sitemap2.xml.gz etc. They are linked from sitemap.xml which is the index for these linked sitemaps.
I know that Google gets this index right because when you open the settings for this sitemap.xml in WMT, all the sub-sitemaps are listed as sitemaps in their own right. So, it's the sub-sitemaps that are ranking in Google for some good long tail terms.
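
For anyone unfamiliar with this layout, the following Python sketch shows how an index file of this kind lists its sub-sitemaps. The index content is a hypothetical stand-in (example.com replaces the real site), and child_sitemaps is just an illustrative helper name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical index file; example.com stands in for the real site.
INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/sitemap1.xml.gz</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap2.xml.gz</loc></sitemap>
</sitemapindex>"""


def child_sitemaps(index_xml):
    """Return the <loc> URLs listed in a sitemap index document."""
    root = ET.fromstring(index_xml)
    if root.tag != SITEMAP_NS + "sitemapindex":
        raise ValueError("not a sitemap index: " + root.tag)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]


for url in child_sitemaps(INDEX_XML):
    print(url)
```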

The sitemaps (when un-gzipped) parse as XML just fine and I have verified their format many times over. Each one is, for all intents and purposes, a sitemap. So, why does it rank as if it's content? Is there anyone else here who has this happen to their sitemap files?
9:31 pm on July 27, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15935
votes: 889


YES. There are threads going back at least a year or two about google's firm belief that "If we can reach it, we can index it". They've got your robots.txt indexed too.

Crawling and indexing are entirely different processes. Marking something as "noindex" doesn't prevent it from being crawled. Here you probably want a FilesMatch that simply says <FilesMatch "\.gz"> around the noindex header, meaning: don't index anything, anywhere, with the .gz extension.
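
A quick way to see what a pattern like "\.gz" would catch is to try it against a few file names. The Python sketch below uses the re module as a stand-in for Apache's regex engine, which is an assumption that should hold for patterns this simple. The file names are hypothetical.

```python
import re

UNANCHORED = re.compile(r"\.gz")   # matches ".gz" anywhere in the name
ANCHORED = re.compile(r"\.gz$")    # matches only names that end in ".gz"

# Hypothetical file names for illustration.
names = ["sitemap1.xml.gz", "sitemap.xml", "notes.gzip.txt"]

for name in names:
    # Columns: file name, unanchored match, anchored match.
    print(name, bool(UNANCHORED.search(name)), bool(ANCHORED.search(name)))
```

The unanchored form also catches names that merely contain ".gz", so adding a trailing $ is the stricter choice if only real .gz files should be affected.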
3:03 am on July 29, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11873
votes: 245


what Content-Type: header is returned with those .gz files?
3:23 am on July 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 841
votes: 0


It's Content-Type: application/x-gzip

@lucy24: I have serious reservations about no-indexing my sitemaps. I guess no-index is not the same as no-follow, but doing anything of this nature with a sitemap sounds like messing with the basics at such a low level that something is just bound to blow up in my face.

Has anyone ever tried to no-index their sitemaps?
3:40 am on July 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 15, 2003
posts:960
votes: 34


I just did some poking around and discovered that Google will index files compressed with either .gz or .zip as long as the archive contains a file in a format it supports. If you do some sample searches with either 'filetype:gz' or 'filetype:zip' you'll see what I mean.
3:49 am on July 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 17, 2006
posts: 841
votes: 0


@rainborick: their desire to index everything they get their hands on is understandable, but I cannot understand why they would show the internals of an XML file, specifically a sitemap that they know about via WMT (it's shown there, no errors) and via the urlset tag in the XML file, which in my case says


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">


Maybe there's something in my sitemap definition that makes it look like a different XML file, like one with actual content.
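
One way to rule that out is to decompress the file and confirm the root element really is urlset in the sitemap namespace. Here is a small Python sketch under that assumption. The one-URL sample and the helper name sitemap_root_tag are hypothetical; only the urlset declaration is copied from above.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_root_tag(gz_bytes):
    """Decompress gzipped sitemap bytes and return the root tag without its namespace."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    if not root.tag.startswith(SITEMAP_NS):
        raise ValueError("root element is not in the sitemap namespace: " + root.tag)
    return root.tag[len(SITEMAP_NS):]


# Hypothetical one-URL sitemap using the same urlset declaration as in the post.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/</loc></url>
</urlset>"""

print(sitemap_root_tag(gzip.compress(SAMPLE)))
```

A real check would read the bytes of sitemap1.xml.gz from disk instead of compressing a sample in memory.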
4:17 am on July 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


A search for [inurl:"sitemap.xml.gz"] turns up a couple thousand results, but only that many. Some of those results are a handful of sites that seem to specialize in listing the sitemaps of other sites. What a bizarre thing to do! My guess would be that someone, somewhere is linking to your sitemap file.

I have serious reservations about no-indexing my sitemaps. I guess no-index is not the same as no-follow, but doing anything of this nature with a sitemap sounds like messing with the basics at such a low level that something is just bound to blow up in my face.

The same thing can happen with robots.txt files, by the way, and the best answer I know of in that case is to use the x-robots directive. This approach does NOT stop googlebot from fetching and obeying the file - I know of files that have done this.

I would have no reservations about using the x-robots directive if this were happening to my site. FWIW, Google's John Mueller also suggested exactly this approach earlier this year.