lucy24 - 10:12 pm on May 29, 2012 (gmt 0)
I've got two text files in front of me. One based on Bing's Index Explorer from 19 May, one from 27 May. Drop between the two, 21 pages or about 25% -- just to show that they appear to be doing this to everyone, not only the Big Boys. More exactly, it's 23 pages out, 2 pages in. No idea why they added those two pages; they've been around forever. Not complaining, though.
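For anyone who wants to run the same comparison, here is a rough sketch of how the "pages out, pages in" count can be worked out. It assumes each snapshot has been saved as a plain text file with one URL per line; that layout, the file names, and the function names are my own assumptions, not anything Index Explorer produces on its own.

```python
def load_urls(path):
    """Read one URL per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def compare_indexes(old_urls, new_urls):
    """Return (removed, added) as sorted lists of URLs."""
    old_set, new_set = set(old_urls), set(new_urls)
    removed = sorted(old_set - new_set)  # in the earlier snapshot only
    added = sorted(new_set - old_set)    # in the later snapshot only
    return removed, added


def net_drop(old_urls, new_urls):
    """Net number of pages lost between the two snapshots."""
    removed, added = compare_indexes(old_urls, new_urls)
    return len(removed) - len(added)


if __name__ == "__main__":
    # Hypothetical file names; substitute your own saved lists.
    old = load_urls("index-2012-05-19.txt")
    new = load_urls("index-2012-05-27.txt")
    removed, added = compare_indexes(old, new)
    print(f"{len(removed)} out, {len(added)} in, net {net_drop(old, new)}")
```

Sets keep the comparison independent of the order in which the pages are listed, so it makes no difference how each file is sorted.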
It seems to have leveled off; I just wish I had known beforehand so I could have started checking on May 3, which was the spike date for me. The no-longer-indexed pages are really gone; Bing WMT isn't just saying so to alarm us. I tried a few unique-phrase searches and came up cold.
First discovery: There's a big lag between crawling and indexing. This may be proportional rather than absolute, so ymmv. The "last crawled date" on the earlier index varied, but never less than 13 days before the index date. The "last crawled date" on the later index is newer and-- here's the interesting part-- for many pages the crawl date given is before 19 May. That is, the "last crawl date" isn't really the most recent date; there's some kind of limbo in between. (I hadn't the energy to check raw logs and see how they compare-- in particular, how many pages are they crawling but not indexing?)
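The lag check can be sketched like this, assuming the "last crawled" dates have already been pulled out into a dict of URL to date. The data layout and the function names are hypothetical; the real export would need its own parsing.

```python
from datetime import date


def crawl_lags(index_date, last_crawled):
    """Days between each page's last-crawled date and the index snapshot date."""
    return {url: (index_date - crawled).days
            for url, crawled in last_crawled.items()}


def crawled_before(last_crawled, cutoff):
    """URLs whose reported last-crawled date falls before the cutoff date."""
    return sorted(url for url, crawled in last_crawled.items()
                  if crawled < cutoff)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    earlier = {"/ebooks": date(2012, 5, 6), "/fonts/a.html": date(2012, 4, 30)}
    lags = crawl_lags(date(2012, 5, 19), earlier)
    print("smallest lag in days:", min(lags.values()))

    later = {"/ebooks": date(2012, 5, 10), "/fonts/a.html": date(2012, 5, 22)}
    print("crawl date older than the previous snapshot:",
          crawled_before(later, date(2012, 5, 19)))
```

The second function is the interesting one for the limbo question: it lists the pages in the later snapshot whose reported crawl date is still older than the earlier snapshot.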
Even the current list of indexed pages isn't completely accurate, because it includes pages that I've explicitly removed from SERPs due to redirects and so on. They don't get de-indexed; they just sit in a back room somewhere.
So what got removed? Some are legitimate. For example, a few pairs like
:: cough-cough, ahem ::
http://example.com/ebooks [this is Bing's naming format for a directory's index file]
A stray case of
where they've finally dumped oldestname-- but not yet oldername. Likewise a few-- but by no means all-- files so old, I've gone from a year of redirecting to an unequivocal Gone.
One no-longer-indexed file is in a roboted-out directory. Well, thanks, Bing. To make up for it, two still-indexed files are explicitly labeled noindex-- and located in public directories, so no falling back on "Well, but how were we to know it's noindex? You wouldn't let us see!"
Others are more puzzling. A few very specialized pages from the /fonts/ directory. Two pages so new, they can only just have been indexed-- and then Bing turns right around and de-indexes them. Both are almost entirely in a language (and script) Bing doesn't know, which may be relevant. But surely they'd have noticed in the first place?
There's one specific deletion I can't figure out at all. No way, no how. Over a year ago I completely rearranged one directory. All pages are still there; they just have different paths and slightly different filenames. No particular change to title or text. The number of pages is out of all proportion to the weight of the content, so humans got a special 404/410 page and robots got a simple Gone. Let them find the new locations from scratch; the top level of the directory is unchanged.
A week ago, one matching pair was indexed both ways: Old URL (in spite of a steady diet of 410) and new URL. A week later, the new URL is gone and the old URL-- the 410 version-- is still in the index.* Huh what?
:: insert "noidea" emoticon here ::
* And still getting crawled. At this point I threw in the towel and added redirects for a few specific long-gone pages that the search engines persist in looking for. Well, if they want it that badly-- but not badly enough to realize it's directly linked from a page they crawl regularly--