Johan, I have other sites with hundreds or thousands of pages, but I'm not going to make the effort on those sites till I find out if it's worth the time it takes to do it, and won't damage my current listings/spidering. I don't necessarily expect that the sitemap feature will do any better; what I DON'T expect is that it will do worse than before I created and submitted the file, which is what seems to be happening.
I have a problem with my robots.txt and sitemaps.
Today I tried to upload my XML sitemap and then discovered that my robots.txt file was banning all bots!
So I replaced it with a known-good robots.txt file that allows crawling.
But when I try to submit to Google Sitemaps again, I get the same error:
"We were unable to access the URL you provided due to a restriction in robots.txt. Please make sure the robots.txt file and the Sitemap URL are correct and resubmit your Sitemap"
I already uploaded the new robots.txt file ... what can I do now?
How do I invite all bots again to crawl my site?
Check for rewrites (.htaccess...). Try to delete the robots.txt for a while and wait for the bots. Should work if offsite links point to your site. Also, use a sitemap validator to check your XML (http://code.google.com/sm_thirdparty.html links to free sitemap tools) and the HTTP response (should be 200).
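For reference, a minimal robots.txt that invites all bots back in looks like this (a generic sketch, not the poster's actual file; the empty Disallow line means "nothing is disallowed"):

```
User-agent: *
Disallow:
```

Make sure the server returns it with a 200 status and no redirect, or Googlebot may still report a restriction.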
I have a free Google Sitemap Validation service at [nodemap.com...]
I know there are excellent web-based XML validators available, but I built this software specifically for validating your sitemap.xml *or* compressed sitemap.xml.gz file. Hopefully it is straightforward and easy to use.
This service allows you to validate your Google Sitemap XML files. Your file may optionally be gzip compressed. Each report you generate may be stored in your account. You have the option to send each report via email to the recipient you specify. There is also a quick-help feature that allows you to ask a technical question, or make a comment, about the service.
+ works with text/xml content-type
+ checks and reports UTF-8 Byte Order Mark
+ converts the file to unix line terminators if necessary
+ re-encodes the xml file to UTF-8 if the file isn't UTF-8 (*see note)
+ gzips xml files
+ better error handling on web server redirects.
+ shows line numbers against your xml file if the file doesn't validate
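The checks in that feature list can be sketched in a few lines of Python. This is a generic illustration of the same clean-ups, not the service's actual code; the namespace shown is the 0.84 schema Google's sitemap docs used:

```python
import gzip
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"

def normalize_sitemap_bytes(raw: bytes) -> bytes:
    """Apply the clean-ups the validator describes: gunzip if needed,
    strip a UTF-8 BOM, convert line terminators to Unix newlines."""
    # Transparently decompress .xml.gz payloads (gzip magic bytes 1f 8b).
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    # Check for and strip a UTF-8 Byte Order Mark.
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # Normalize CRLF / CR line terminators to "\n".
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def sitemap_url_count(raw: bytes) -> int:
    """Parse the cleaned XML and count <loc> entries; raises if the
    file is not well-formed XML."""
    root = ET.fromstring(normalize_sitemap_bytes(raw))
    ns = "{http://www.google.com/schemas/sitemap/0.84}"
    return len(root.findall(f"{ns}url/{ns}loc"))
```

A real validator would also check the HTTP response code (should be 200) and the Content-Type header, which require a live fetch and are omitted here.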
This was posted on their discuss sitemaps section:
Added Site Map, Entire Site Dropped from Google
Don't think I will touch this one. I have a comprehensive sitemap onsite that has been picked up by google numerous times.
The way I read it, even minor site changes mean you have to remake and/or resubmit a new sitemap... who needs the trouble?
Ok, here's the short story. One of my sites was down for 4 days a while ago when gbot decided to crawl it. I lost over 10k pages and was down to about 300 in the index. I did the sitemap, and now I'm back up and having more pages added, and traffic is increasing.
Here's a shorter story... I created my site_map.asp and gave it to G. Now my MSN visitors have increased by several 1000%. Could MSN be looking at my site_map.asp?
is this the next goldrush now? google sitemaps turn MSN into Google2?
If Google introduced an autodiscovery mechanism like RSS, there would be much less work for webmasters and Google in identifying the sitemaps.
Of course, we will miss the stats in that case.
Sorry if this is a dumb question. I am a marketer more than a computer "geek" so I am sending google the URLs of my main pages in a text file format, and creating a sitemap page www.mydomain.com/sitemap.html.
Is that ok?
Also, do I have to list every URL on each page that I want google to spider? If I have deep links, will google see them if I give the URL of the page the deep links are on?
Is google asking me to give it every URL I want spidered, even those several levels down?
>I am sending google the URLs of my main pages in a text file format
Use this form (has nothing to do with Google SiteMaps):
>creating a sitemap page www.mydomain.com/sitemap.html
Always a good idea (but has nothing to do with Google SiteMaps)
Learn more about Google SiteMaps here:
(The link in my profile leads you to a Google SiteMap tutorial)
>Also, do I have to list every URL on each page that I want google to spider?
Yes, put the URLs of all pages you want Google to crawl in your sitemap XML file
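To make that concrete, a minimal sitemap XML file listing a few URLs might look like this (example.com, the paths, and the dates are placeholders; the namespace is the one Google's 0.84 sitemap schema used):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-06-20</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/deep/page.html</loc>
  </url>
</urlset>
```

Every URL you want crawled gets its own `<url>` entry, including pages several levels down; Google does not expand the list for you.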
>Is google asking me to give it every URL I want spidered, even those several levels down?
Here are some links to sitemap generators:
Any ideas on how a Themed Canonical Structure can be correctly represented with google sitemaps?
Flat. You can export your URLs recursively to keep them in hierarchical order, but the given output format is flat. Thus I won't bother with the hierarchical listing, which burns more resources than a sequential output ordered by lastModification descending, if that attribute is indexed.
They don't seem to like subdomains.
I've uploaded 48,500 URLs (oh, before you call me a spammer - this is a local professional directory site) in xml.gz Google sniffed around - then replied: "Denied URLs" - and listed [subdomain.mysite.com...]
I made my XML site map manually.
2 hours later the bot went nuts spidering anything and everything on the FTP. I know some other people who only mapped out the real major pages in their site and submitted the map; the same thing happened to them, but the crawl took a little longer.
..18 hours...still pending...is that normal?
Yes. Sometimes Google updates the status report after a delay.
And you might feel shocked to see an update now saying submitted 18 hours ago, downloaded 18 hours ago ;)
|They don't seem to like subdomains. |
Was the sitemap on the subdomain?
I would assume that you could only have the sitemap on the exact domain being served, just the same as a robots.txt.
Since there is no "physical" subdomain directory, I placed the sitemap in the root for domain mysite.com. The sitemap included the URLs for both www.mysite.com and [subdomain.mysite.com....] They denied [subdomain.mysite.com....]
At the moment I'm planning to add a sitemap for each folder of my domain.
This does seem to work for the sitemaps I've done, with each folder having its own sitemap which is submitted to Google.
Anyone know if this is o.k to do or would it be better to combine all these sitemaps into one big file in the root directory?
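One way to keep the per-folder files while giving Google a single entry point is a sitemap index file in the root directory. A sketch, with hypothetical folder names:

```
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
  <sitemap>
    <loc>http://www.example.com/folder1/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/folder2/sitemap.xml</loc>
  </sitemap>
</sitemapindex>
```

You submit only the index; Google then fetches each listed sitemap on its own.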
I've decided to re-do my front page so it acts as a site map and is easier for surfers and robots to follow. It seems like some have had success so far, but the idea of being "flagged" for some reason by Google still runs thru the back of my mind....
I think I am just going to submit each URL manually (I'm up to about 50...made 40 in the last few days)...
It took them 22 hours and my sitemap was still pending, so I'll give it another shot when I hear that 30 to 50%+ of webmasters submit sitemaps routinely. I've never been an early adopter.
I still think the best way is to create a sitemap yourself using ASP or PHP etc… unless you have a non-db site. It's very easy to do and I'm not even a programmer!
There are a couple of points regarding the simple text file sitemap I want to clear up before I submit my sitemap. Google states:
1. Your URLs must not include embedded newlines.
2. You must fully specify URLs, because Google tries to crawl the URLs exactly as you provide them.
3. Your sitemap files must use UTF-8 encoding.
4. Each sitemap file must have no more than 50,000 URLs.
What do they mean by embedded new lines and UTF-8 encoding? How do I ensure i have UTF-8 encoding?
"What do they mean by embedded new lines and UTF-8 encoding? How do I ensure i have UTF-8 encoding? "
The URL cannot span more than one line: one line per URL, and one URL per line.
UTF-8 encoding is Unicode, which makes more characters available than other encodings (windows-1258, etc.). If you are a graphics person, it can be thought of "like" the difference between a 16-color palette and a 65,536-color palette.
You should use an editor that allows you to save in Unicode/UTF-8 such as Notepad (Windows) or BBEdit (Mac). Notepad will insert a Byte Order Mark (BOM) at the beginning of the file to signify that it is UTF-8, which may appear to be odd characters if you look at it in something else.
If you are creating a script then just use an encoding function on your output.
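Those four rules can be sketched as a small script. This is shown in Python rather than ASP/PHP, and the URLs and filename are hypothetical:

```python
# Write a plain-text sitemap: one fully-specified URL per line, UTF-8
# output with no BOM, Unix newlines, at most 50,000 URLs per file
# (Google's stated limit for a single sitemap file).
MAX_URLS = 50_000

def write_text_sitemap(urls, path="sitemap.txt"):
    if len(urls) > MAX_URLS:
        raise ValueError("too many URLs; split into multiple sitemap files")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for url in urls:
            # "No embedded newlines": a URL must not wrap onto a second line.
            if "\n" in url or "\r" in url:
                raise ValueError("URLs must not contain embedded newlines")
            # "Fully specify URLs": Google crawls them exactly as given.
            if not url.startswith(("http://", "https://")):
                raise ValueError("URLs must be fully specified")
            f.write(url + "\n")
```

Since the file is opened with `encoding="utf-8"`, the output is guaranteed UTF-8 regardless of your editor.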
I still think the best way is to create a sitemap yourself using ASP or PHP
Google has a link to a PHP third party solution that for me is way better than the Python script. [code.google.com...]
|Since there is no "physical" subdomain directory, I placed the sitemap in the root for domain mysite.com. |
Well then, you have to figure out some way to serve up a separate sitemap file for each subdomain.
As far as the rest of the world is concerned, different subdomains are different machines, possibly under the control of different people. You only let a machine tell you about itself.
|How do I ensure i have UTF-8 encoding? |
The simple answer is that if you are using only characters generally available in English (which is what is used for the vast majority of URLs), you can use any old text editor.
If you want to use any other languages or special symbols, you have to verify that your editor outputs in UTF-8.
> As far as the rest of the world is concerned, different subdomains are different machines, possibly under the control of different people. You only let a machine tell you about themselves.
It's a very distrustful world we live in. :) I've split up the universal sitemap into separate subdomain specific sitemaps and submitted them separately under each subdomain. Thanks BigDave.
I'm using a 301 redirect from www.mydomain.net to sub.mydomain.net. I've got a dir called "sub.mydomain.net" (which acts as a subdomain along with an .htaccess file). Do I need to place my sitemap index in the root dir or in sub.mydomain.net?
And the 2nd thing: I'm using php and mysql to generate the sitemap. Php creates sitemap files in all my directories - but I don't know if they are utf-8 encoded. Since it's all done automatically (fopen() etc..), I don't save those files in an editor. So, how to make sure my sitemaps use utf-8 encoding?
|do I need to place my sitemap index in a root dir or in the sub.mydomain.net? |
If the URLs that you want crawled are on sub.mydomain.net, then you have to serve that to Google from sub.mydomain.net.
If you have URLs that you want crawled on both sub.mydomain.net and www.mydomain.net, then you will need 2 different sitemaps.
|And the 2nd thing: I'm using php and mysql to generate the sitemap. Php creates sitemap files in all my directories - but I don't know if they are utf-8 encoded. Since it's all done automatically (fopen() etc..), I don't save those files in an editor. So, how to make sure my sitemaps use utf-8 encoding? |
PHP strings are byte strings, effectively ASCII by default. ASCII and UTF-8 overlap for the first 128 characters.
If you are using only characters in that ASCII range, you will have no problems.
Otherwise, you can take a look at the utf8_encode function.
You would have been able to find this all out for yourself in a few minutes by checking the PHP manual, and doing a search on UTF-8 on the web.
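The ASCII/UTF-8 overlap is easy to demonstrate; here is the same point in Python (shown instead of PHP, with a hypothetical URL). The PHP counterpart for the second case would be `utf8_encode` on a Latin-1 string:

```python
# For characters in the 0-127 range, ASCII and UTF-8 produce identical
# bytes, so a plain-English URL needs no re-encoding at all.
url = "http://www.example.com/page.html"
assert url.encode("ascii") == url.encode("utf-8")

# A URL containing a character outside that range encodes differently,
# which is why non-English sitemaps must be explicitly UTF-8 encoded.
accented = "http://www.example.com/caf\u00e9.html"
assert accented.encode("utf-8") != accented.encode("latin-1")
```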
Interesting - I just found that sitemap.xml supersedes robots.txt. For example:
included in sitemap:
included in robots.txt:
Googlebot still grabs /testdir/index.html and a search on site:mysite.com shows /testdir/index.html.
In my opinion, robots.txt should take preference.
|In my opinion, robots.txt should take preference. |
While I can see your point, there is a perfectly valid opposing viewpoint that you have now specifically told the search engine to index that URL.
The real problem is that it is an undocumented situation, that should be documented no matter whether robots.txt or sitemap takes precedence.
I would recommend that you post it to the sitemap newsgroup where a Google engineer is more likely to spot it.