Google Webmaster Tools and Sitemap Crawling

         

HollyWMarshall

7:59 pm on Sep 21, 2020 (gmt 0)

10+ Year Member



We've posted this a few times over at the Google webmaster help community and can't seem to get any help.

Having an ongoing issue with sitemaps in Search Console. Old sitemaps submitted prior to July 2018 show in GSC as Success, but the last read date is July 2018. Even after removing and resubmitting those sitemaps, the result is the same. There is no indication that the sitemaps are currently being accessed by Google.

Submitting a new sitemap shows "Couldn't fetch" and "Sitemap could not be read", even though the sitemap checks out with Lighthouse and the URL Inspection tool with no issues.

This problem started when we had a momentary bug with some auto-ping code for sitemap updates. That code has been disabled since July 2018, yet we cannot get any sitemaps to be read by Google at this point.

One thing of note: the site in question has yet to move to https, so all URLs (including sitemaps) are http. Not sure if this is part of the problem.

Thanks in advance.

JesterMagic

11:02 am on Sep 22, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sounds like you might be blocking Google somehow from reading your sitemaps...

HollyWMarshall

5:56 pm on Sep 22, 2020 (gmt 0)

10+ Year Member



Checked and there are no blocks in place and we do not even see request attempts on the sitemap file from Google.
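
For anyone wanting to repeat the "no request attempts" check, here is a minimal Python sketch that scans access-log lines for sitemap requests whose user agent claims to be Googlebot. It assumes the server writes the common "combined" log format; the function name and the paths are made up for illustration, and a user-agent match alone does not prove a request really came from Google.

```python
import re

# Assumed "combined" log format:
# host ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_sitemap_hits(lines, sitemap_paths):
    """Return (time, path, status) for each request to one of sitemap_paths
    whose user-agent string contains "Googlebot"."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if match.group("path") in sitemap_paths and "Googlebot" in match.group("agent"):
            hits.append(
                (match.group("time"), match.group("path"), int(match.group("status")))
            )
    return hits
```

Feeding it the lines of an access log, for example `googlebot_sitemap_hits(open("access.log"), {"/sitemap.txt"})` with a hypothetical log path, gives an empty list when no such requests were logged.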

engine

6:18 pm on Sep 22, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Have you checked your robots.txt file to make sure you've not blocked it from there?
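
One way to test this without eyeballing the file is Python's standard `urllib.robotparser`. A rough sketch follows, with a made-up robots.txt and example.com URLs standing in for the real site. Python's parser does not implement every Google extension (wildcards, for example), so treat a pass here as a hint rather than proof.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "http://www.example.com/sitemap.txt"))
print(is_allowed(SAMPLE_ROBOTS, "http://www.example.com/private/sitemap.txt"))
```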

not2easy

6:23 pm on Sep 22, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Can you use your GSC Inspect URL tools to see what they report for the sitemap file?

HollyWMarshall

3:22 pm on Sep 30, 2020 (gmt 0)

10+ Year Member



Hi Engine and not2easy

Double-checked, and none of the sitemaps are blocked in robots.txt.

Checked one under the URL Inspection tool, and the live test says:

URL is available to Google
If it gets indexed and selected as canonical, it could appear in Google Search results with all relevant enhancements.

But under Sitemaps, the exact same URL shows "Couldn't fetch" as its status.

I'm honestly thinking that this might be some kind of bug that was triggered in the Search Console sitemaps section back in July 2018, when we tried experimenting with the ping tool. Ever since then, it seems that the sitemaps are not crawled or fetched, and the existing ones show a Success status but a last read date of July 2018. If you remove them and add them back, it says the same. If you try adding new sitemaps that weren't submitted before, they show the current submit date but a "Couldn't fetch" status.

HollyWMarshall

10:39 pm on Sep 30, 2020 (gmt 0)

10+ Year Member



Just a small update. Tried the Search Console API (Sitemaps: get) today on the sitemap in question.

I had tried submitting this new sitemap earlier today via the regular Search Console, and it showed the same thing I previously mentioned. In Search Console it shows:
Sitemap: The name of the sitemap is listed there
Type is unknown
Submitted Sep 30, 2020
Last read - is blank
Status: Couldn't fetch
Discovered is 0

The sitemap is a plain text sitemap, UTF-8 encoded per Google's instructions: one URL per line, with the correct headers sent.
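
As a sanity check on the file itself, a short Python sketch along these lines can confirm the basics of a text sitemap: valid UTF-8, one absolute http(s) URL per line, and within the 50,000 URL and 50 MB per-file limits of the sitemap protocol. The function name is made up, and this only checks the file's shape, not how Google will treat it.

```python
from urllib.parse import urlsplit

MAX_URLS = 50_000               # per-file URL limit in the sitemap protocol
MAX_BYTES = 50 * 1024 * 1024    # per-file size limit (uncompressed)


def check_text_sitemap(raw: bytes) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(raw) > MAX_BYTES:
        problems.append("file is larger than 50 MB")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8: {exc}")
        return problems
    urls = [line.strip() for line in text.splitlines() if line.strip()]
    if len(urls) > MAX_URLS:
        problems.append(f"{len(urls)} URLs, over the {MAX_URLS} limit")
    for number, url in enumerate(urls, start=1):
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"line {number}: not an absolute http(s) URL: {url!r}")
    return problems
```

Calling `check_text_sitemap(open("sitemap1.txt", "rb").read())` on a local copy (the filename is a placeholder) returns an empty list when none of these checks fail.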

Checked this sitemap with the URL inspection tool and it checks out with no issues.

So going to the API, I checked the status of the sitemap and it gave the following.

(Removed the actual URL for this posting.)

{
"path": "http://www.xxxxxxx.xxx/xxxxsitemap1.txt",
"lastSubmitted": "2020-09-30T16:40:45.623Z",
"isPending": true,
"isSitemapsIndex": false,
"warnings": "0",
"errors": "0"
}

Very strange, and inconsistent with what is shown in Search Console.

Tried one of the old sitemaps that shows a Success status with a July 2018 last read date but a more recent submitted date, and here is what the API said:

{
"path": "http://www.xxxx.xxx/xxxxsitemap.xml",
"lastSubmitted": "2020-09-10T23:18:32.441Z",
"isPending": true,
"isSitemapsIndex": false,
"type": "sitemap",
"lastDownloaded": "2018-07-06T21:07:28.191Z",
"warnings": "0",
"errors": "0",
"contents": [
{
"type": "web",
"submitted": "334",
"indexed": "315"
}
]
}

So this one shows the correct last submitted date, but a 2018 last downloaded date, what appear to be the results of that 2018 download, and a pending status (since 2018?).

Checked several more on the API and they all show as pending. So it seems that, for some reason related to the issue back in 2018, all the sitemaps are stuck in a pending status.
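
To compare many sitemaps at once, the API responses can be summarized with a few lines of Python. This is only a sketch: it uses the field names visible in the responses quoted above (`isPending`, `lastSubmitted`, `lastDownloaded`), the function names are made up, and the example.com paths are placeholders.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse timestamps shaped like 2020-09-10T23:18:32.441Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def describe_pending(entry: dict):
    """Return (path, note, gap_in_days) for a pending sitemap, else None.

    gap_in_days is the whole number of days between the last download and
    the last submission, or None if the sitemap was never downloaded.
    """
    if not entry.get("isPending"):
        return None
    downloaded = entry.get("lastDownloaded")
    if downloaded is None:
        return (entry["path"], "pending, never downloaded", None)
    gap = (parse_ts(entry["lastSubmitted"]) - parse_ts(downloaded)).days
    return (entry["path"], "pending, last download predates submission", gap)


# Entries shaped like the API output quoted in this thread (paths are placeholders).
entries = [
    {"path": "http://www.example.com/sitemap1.txt",
     "lastSubmitted": "2020-09-30T16:40:45.623Z", "isPending": True},
    {"path": "http://www.example.com/sitemap.xml",
     "lastSubmitted": "2020-09-10T23:18:32.441Z", "isPending": True,
     "lastDownloaded": "2018-07-06T21:07:28.191Z"},
]

for entry in entries:
    print(describe_pending(entry))
```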

Not sure what to do......

tangor

10:27 am on Oct 1, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Unless you can get a person (human) at g on telephone there's probably not a lot you can do. If there's an error in their programming it will be up to them to correct it, you can't FORCE it.

HollyWMarshall

3:21 pm on Oct 1, 2020 (gmt 0)

10+ Year Member



Tangor - thanks for the input.

Anyone have any connections at google that might be able to help out with this? Much appreciation in advance for any assistance anyone might offer.

lucy24

4:22 pm on Oct 1, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Type is unknown
Could there be a content-type mismatch? It seems unlikely if you've re-uploaded the file many times, but this does seem like grasping-at-straws time.
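
A quick way to rule this out is to compare the Content-Type header against the file extension. The Python sketch below does that with the standard library; the table of expected types is an assumption for illustration (text/plain for .txt, an XML type for .xml), not a statement of what Google requires.

```python
from email.message import Message

# Assumed mapping of sitemap extension to acceptable media types.
EXPECTED_TYPES = {
    ".txt": {"text/plain"},
    ".xml": {"application/xml", "text/xml"},
}


def inspect_content_type(path: str, header_value: str):
    """Return (media_type, charset, matches_extension) for a Content-Type value."""
    message = Message()
    message["Content-Type"] = header_value
    media_type = message.get_content_type()
    charset = message.get_content_charset()
    dot = path.rfind(".")
    extension = path[dot:].lower() if dot != -1 else ""
    matches = media_type in EXPECTED_TYPES.get(extension, set())
    return media_type, charset, matches


print(inspect_content_type("/sitemap.txt", "text/plain; charset=utf-8"))
```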

What does your robots.txt say about the sitemap? I don't mean a Disallow, I mean the line that typically reads
SITEMAP: https://example.com/sitemap.xml

(or perhaps sitemap.txt if it's a tiny site, and in your case http:// instead) Unlike some robots.txt extras, Google does recognize this line.
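
To see which Sitemap lines a parser actually picks up, and whether their scheme and host match what was submitted in GSC, something like this Python sketch can help. It relies on `RobotFileParser.site_maps()`, which needs Python 3.8 or later; the helper names and the example.com URLs are placeholders.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def sitemap_lines(robots_txt: str) -> list:
    """Return the URLs from all Sitemap: lines (the key is matched case-insensitively)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


def scheme_or_host_mismatches(robots_txt: str, submitted_url: str) -> list:
    """Return listed sitemap URLs whose scheme or host differs from submitted_url."""
    wanted = urlsplit(submitted_url)
    mismatched = []
    for url in sitemap_lines(robots_txt):
        found = urlsplit(url)
        if (found.scheme, found.netloc) != (wanted.scheme, wanted.netloc):
            mismatched.append(url)
    return mismatched


# Hypothetical robots.txt that lists an https URL while the site is still on http.
robots = "SITEMAP: https://example.com/sitemap.txt\n"
print(scheme_or_host_mismatches(robots, "http://www.example.com/sitemap.txt"))
```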

HollyWMarshall

7:23 pm on Oct 1, 2020 (gmt 0)

10+ Year Member



This site has both a sitemap.txt and a sitemap.xml file. The robots.txt file only lists the sitemap.txt file. We have this particular sitemap also uploaded in the search console and this is what the API returns:

{
"path": "http://www.xxxx.xxx/sitemap.txt",
"lastSubmitted": "2020-04-09T20:23:23.312Z",
"isPending": true,
"isSitemapsIndex": false,
"type": "urlList",
"lastDownloaded": "2018-07-05T05:33:51.050Z",
"warnings": "0",
"errors": "0",
"contents": [
{
"type": "web",
"submitted": "576",
"indexed": "0"
}
]
}

We also have a group of dynamically generated sitemap files for Google only (.txt), and the response headers return:
Content-Length: 52000
Content-Type:text/plain; charset=utf-8

along with the response date

Each of these sitemaps returns a large group of individual URLs in text format, one per line. These sitemaps were all working up to the 2018 date, were crawled frequently by Google, and showed as updated properly in GSC. They all now show a Success status with July 2018 last read dates, along with the number of discovered URLs. The API shows the same information for these sitemaps, most of them with pending status TRUE and some with pending status FALSE, but all with July 2018 lastDownloaded dates.

Also, there are several RSS sitemaps in exactly the same situation as the last group I just wrote about. All were working up to the 2018 date, and all show the same status in both GSC and the API.

A little background on the problem we were having when this whole thing started in 2018. The dynamic txt sitemap files work by an automated script grabbing a group of recently updated URLs from the database and writing them to a text file. When the txt file is requested by Google, those URLs are automatically removed from the file, and the script then refills it with a new group of URLs. In 2018 we decided to try to ping Google with the sitemap file, using the old ping method

[google.com...]

when there were new URLs available in the sitemap file. During the test (it ran for maybe less than 10 minutes) there was a bad piece of code in the script which resulted in the ping firing in a loop. Once it was realized that this had happened, the test was abandoned and no further attempts were made. The sitemaps issue in GSC started at that moment. No other negative effect was seen from Google, and all crawling and indexing activity continued as normal. It seems only the sitemaps function was affected.
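
For anyone automating pings, a guard like the following Python sketch would keep a runaway loop from sending more than one ping per sitemap within a chosen interval. Everything here is hypothetical: the class name, the one-hour default, and the `send` callable, which stands in for whatever code performs the actual request. It illustrates the throttling idea and is not the script that was used in 2018.

```python
import time


class PingThrottle:
    """Allow at most one ping per sitemap URL every `min_interval_s` seconds."""

    def __init__(self, send, min_interval_s=3600.0, clock=time.monotonic):
        self._send = send              # caller-supplied function doing the real ping
        self._min_interval_s = min_interval_s
        self._clock = clock            # injectable so the behaviour can be tested
        self._last_sent = {}           # sitemap URL -> time of last ping

    def ping(self, sitemap_url: str) -> bool:
        """Send a ping unless one went out too recently; return True if sent."""
        now = self._clock()
        last = self._last_sent.get(sitemap_url)
        if last is not None and now - last < self._min_interval_s:
            return False               # suppressed: still inside the interval
        self._last_sent[sitemap_url] = now
        self._send(sitemap_url)
        return True
```

With this in place, a buggy loop calling `ping()` thousands of times in ten minutes would result in a single outgoing request per sitemap.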

JesterMagic

8:57 pm on Oct 1, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sounds like you may need to experiment a little as Google is not going to help you.

Assuming Google is having Fetch Issues:

Have you checked your server logs? Do you see Google attempting to read your sitemap file when you submit it?

Have you tried a really small sitemap file with just a few items in it? (to see if by chance something in your sitemap is causing Google to report these weird issues)

Try the sitemap file on another domain on a different server? (to see if it is your server causing the problem)

Try the sitemap file on another domain on the same server?

Assuming Google Console has a Bug:

You could delete your current Google Console Account and create a new one?

Or add a new account and then use another domain on the same server to see if the issue is still there?

HollyWMarshall

10:48 pm on Oct 1, 2020 (gmt 0)

10+ Year Member



JesterMagic

At no time do I see an attempt by Google to retrieve the sitemap. Nothing is in the logs. Seems to be consistent with the "Pending" status since 2018 for the older existing sitemaps and the newly submitted sitemaps as reported by GSC and the GSC API

I've tried having just a couple items in the sitemap - it makes no change.

Have verified that Bing Search Console will immediately crawl the same exact sitemap when submitted on their panel. So this appears to be a Google issue and not a server issue. Will try to see if Google will crawl a sitemap for another domain that is located on the same server just to make sure... I'll report back my finding.

I did try removing the property and adding it back to GSC. This resulted in no change.

Thanks

lucy24

12:09 am on Oct 2, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One final question to ponder:

OK, so Google has been pretending not to find the sitemap. Has this in any way affected their ability to find actual pages? If no, I’d say ### ’em and think no more about it.

tangor

5:35 am on Oct 2, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Echoing lucy24 with an aside: a sitemap is NOT required for site functionality. If g can't/won't play nice with your sitemap, then ignore them----

AS LONG AS THEY ARE INDEXING/CRAWLING your site on a regular basis, including all NEW content.

Note: I have never used a sitemap since 1996...

YMMV

JesterMagic

11:11 am on Oct 2, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually I meant to mention what lucy24 and tangor brought up in my last post. Since your issue is over 2 years old and you have just recently brought it up on WebmasterWorld I have a feeling it has not really affected your site.

Sitemaps are not required and shouldn't affect rankings, as Googlebot spiders all sites anyway. The advantage of sitemaps is that Google will find new content right away, since it is notified of the new content immediately (if you use ping).

Still obviously you want it fixed (as I would). Maybe go one step further and create a whole new Account and then add the domain. Hopefully any of the old info that Google has cached about your domain will not be brought over to your new Google Console Account.

HollyWMarshall

5:21 pm on Oct 2, 2020 (gmt 0)

10+ Year Member



Hi all. Once again, thank you for all the replies. Yes, I've been trying to solve this off and on (in between other obligations) for 2 years on the Google Support community. Since all of those threads seem to lead nowhere, I decided to try here after a long absence from this site. Sincerely do appreciate everyone's help here. I have also used the GSC feedback button on the Sitemaps page to send Google a request, with a screenshot attached.

Honestly, I would have to say that this issue probably has had an effect on this particular site, as we've noticed that most of the indexed pages (there are quite a few of them) date from before the 2018 date. Pages after that date are being indexed, but at a disproportionately low ratio compared to those before it. Google is constantly crawling those older pages and some new ones, but there is a distinct difference between old and new. We see the majority of Google organic search traffic going to pages from before the 2018 date. Recently some structured data was added to pages, and we have noticed that the indexed pages count has slowly started to increase, but that's about it. Really hoping to solve this, because sitemaps were always a very successful way to have pages discovered before this issue.

Is the ping tool for Google still working? I've been doing some research on the API and see they have functionality for submitting pages, but it is documented as limited to JobPosting or "BroadcastEvent embedded in a VideoObject". I've seen a few blog posts where webmasters were able to successfully submit URLs outside these categories and see results, but I found at least one where the webmaster's site was totally delisted due to the action. [developers.google.com...]

Thank You