Why would bingbot be requesting atom.xml?

What is the purpose of that file anyways?

SumGuy

2:36 pm on Dec 27, 2020 (gmt 0)

I stumbled across this today, don't know how often it happens but it seems somewhat new. Bingbot (13.66.139.66) requested atom.xml from my site. Naturally it got a 404. Would this atom thing be a generic file? Or is it a customized file, different for every site that has it? Is it useful for a search engine to have and analyze?

Jonesy

5:50 pm on Dec 27, 2020 (gmt 0)

Drop +"atom.xml" into DuckDuckGo's search box.
Yes, all of the characters: +"atom.xml"

lucy24

6:03 pm on Dec 27, 2020 (gmt 0)

> don't know how often it happens but it seems somewhat new

It seems to have started in May (2020). In my case, at least half of the requests get a 301, either because they're sent to HTTP or because they use a non-canonical form of the hostname. So far it's nobody but bingbot, and the requests aren't frequent enough to be worth returning a manual 404 before canonicalization.
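For anyone wondering what "a manual 404 before canonicalization" amounts to, here is a rough Python sketch of the ordering only. The hostname, the `KNOWN_MISSING` set and the `respond` function are made up for illustration, and a real site would do this in its server config rather than in application code.

```python
# Sketch of the decision order only; all names here are hypothetical.
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"  # assumed canonical hostname
KNOWN_MISSING = {"/atom.xml"}       # paths we know we will never serve


def respond(scheme, host, path):
    """Return (status, location); location is None unless redirecting."""
    # 1. Answer known-missing paths immediately, whatever the scheme or host,
    #    so the bot gets one 404 instead of a 301 followed by a 404.
    if path in KNOWN_MISSING:
        return 404, None
    # 2. Everything else is canonicalized as usual.
    if scheme != CANONICAL_SCHEME or host.lower() != CANONICAL_HOST:
        return 301, f"{CANONICAL_SCHEME}://{CANONICAL_HOST}{path}"
    # 3. Simplification: pretend every other canonical request succeeds.
    return 200, None
```

Without step 1, a request for http://example.com/atom.xml would first be redirected and only then get its 404.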

Why does bing, and only bing, think we have--or ought to have--this file? And why can't they be bothered to canonicalize?

SumGuy

1:24 am on Dec 28, 2020 (gmt 0)

> Drop +"atom.xml" into DuckDuckGo's search box.

Ok, so bingbot is just blindly requesting atom.xml even though there is no hint of its existence on my site. I wonder if I can construct atom.xml and simply have it point back at some of my existing html or pdf files and leverage this as another way for them to show up in a bing search.
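As a rough idea of what constructing such a file could look like, here is a minimal Python sketch that writes an Atom 1.0 feed pointing at existing pages. The function name, the example URLs and the choice to use each page URL as its entry id are assumptions for illustration. Whether Bing would actually use a feed like this for discovery is not something the sketch can tell you.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_atom_feed(site_url, title, author, pages, updated=None):
    """Return a minimal Atom 1.0 document as a string.

    pages is a list of (url, title) tuples for existing HTML or PDF files.
    updated should be a timezone-aware datetime; it defaults to now (UTC).
    """
    if updated is None:
        updated = datetime.now(timezone.utc)
    # Atom timestamps are RFC 3339; normalise to UTC and use the "Z" suffix.
    stamp = updated.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    def tag(name):
        return f"{{{ATOM_NS}}}{name}"

    # Serialise Atom as the default namespace (no "ns0:" prefixes).
    ET.register_namespace("", ATOM_NS)

    feed = ET.Element(tag("feed"))
    # A feed needs title, id and updated, plus an author here or per entry.
    ET.SubElement(feed, tag("title")).text = title
    ET.SubElement(feed, tag("id")).text = site_url
    ET.SubElement(feed, tag("updated")).text = stamp
    author_el = ET.SubElement(feed, tag("author"))
    ET.SubElement(author_el, tag("name")).text = author
    ET.SubElement(feed, tag("link"), rel="self",
                  href=site_url.rstrip("/") + "/atom.xml")

    for url, page_title in pages:
        entry = ET.SubElement(feed, tag("entry"))
        ET.SubElement(entry, tag("title")).text = page_title
        ET.SubElement(entry, tag("id")).text = url  # assumption: URL as id
        ET.SubElement(entry, tag("updated")).text = stamp
        ET.SubElement(entry, tag("link"), href=url)

    body = ET.tostring(feed, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + body
```

Calling `build_atom_feed("https://example.com/", "Example site", "Site Owner", [("https://example.com/manual.pdf", "Manual")])` and saving the result as atom.xml in the site root would give bingbot something other than a 404. A real feed would carry each page's own modification date rather than one shared timestamp.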

How exactly would someone construct a search query such that content referenced by an atom.xml file would be brought up by (in this case) bing? I can't recall ever getting a link to an RSS feed (or Atom) in response to a google search query. Google doesn't even break out syndication feeds as a search category (i.e. like they do for images, video, etc). I don't know about bing, don't use it.

iamlost

4:10 am on Dec 28, 2020 (gmt 0)

Mere supposition:
Bing has long recommended/used RSS/Atom feeds for picking up new site URLs (backup to XML Sitemaps for comprehensive snapshot).

Option 1 might be a third party playing silly idiots.

Option 2 might be bingbot has decided to simply query each site on the off chance they have an RSS/Atom feed; perhaps they found many sites with RSS/Atom feeds not knowing (or in a G-centric webverse not caring) that Bing results can be updated this way.

Log files: "Curiouser and curiouser!" cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English); "now I'm opening out like the largest telescope that ever was!"

@lucy24: And why can't they be bothered to canonicalize?

And miss out?
The data hoarder's "but I might need it" mantra alive and well.

lucy24

5:37 pm on Dec 28, 2020 (gmt 0)

bingbot seems to be very fond of asking for nonexistent files.

Googlebot: 404s are generally limited to ads.txt (I happen not to have it) and the abcblahblah.html that they use to verify that the site sends out 404s when appropriate. (Programmatically triggered whenever a site passes some threshold of 301s, such as when you move to HTTPS.) Once in a blue moon there's a 404 that can be blamed on a typo or punctuation glitch on some linking site, but they don't obsess over these.

bingbot: very nearly as many, by raw numbers, as google. But what a difference: In addition to atom.xml, there's mountains of

/sitemap.xml (on a site that doesn't have one, with robots.txt specifying sitemap.txt)
/sitemap.xml.gz
/sitemap_index.xml
/sitemaps.xml

/directory/subdir/blahblah where the first part is a legitimate URL, while "blahblah" is any random garbage, apparently mixed in from some other site's URLs.

and finally any number of URLs that don't exist on my site at all, and are clearly mixed in from assorted other sites. Especially popular sources over the past year are a Norwegian site with distinctly weird URL paths and what appears to be a Canadian used-car dealer.
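For anyone who wants to run the same comparison on their own logs, here is a small Python sketch that tallies 404s per path for one crawler. It assumes the Apache/nginx "combined" log format; the regex and the `count_404s` name are illustrative, not a standard tool. Matching on the User-Agent string alone does not prove a request really came from bingbot.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_404s(lines, agent_substring="bingbot"):
    """Count 404 responses per requested path for one user agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if match.group("status") != "404":
            continue
        if agent_substring.lower() not in match.group("agent").lower():
            continue
        counts[match.group("path")] += 1
    return counts
```

Typical use would be `count_404s(open("access.log", encoding="utf-8", errors="replace")).most_common(20)`, where access.log is a placeholder for wherever your server writes its log. Running it once with "bingbot" and once with "Googlebot" gives a side-by-side list like the one above.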