Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Google No Longer Indexes all The Web

General Public Notices Google Quality Decline


Brett_Tabke

12:27 pm on Apr 10, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



[lifehacker.com...]

Tim Bray and Marco Fioretti noted that Google seems to have stopped indexing the entirety of the internet for Google Search. As a result, certain old websites—those more than 10 years old—did not show up through Google search. DuckDuckGo and Bing both still seem to offer more complete records of the internet, specifically showing web pages that Google stopped indexing for search.

NickMNS

12:39 pm on Apr 10, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Side note from the article
...Pinboard, a minimalist bookmarking service similar to Pocket, which has a key feature for archivists: If you sign up for its premium service—$11 per year—Pinboard will make a web archive of every page you save.

So the "service" is to copy copyrighted content and then charge for it.

iamlost

3:32 pm on Apr 10, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There is a critical difference between indexing and showing in search results. Google has long had several indices, only one or two of them public.

Note: there were similar discussions elsewhere over a year ago.

Only G knows whether they have begun deleting old content. Or what their criteria might be for so doing.

However, given (1) the examples from a year ago, (2) the increasing drawdown in site: command results, and (3) the steadily increasing need (as high-value advertisers leave) for current content on subjects that attract/retain advertisers, imo what we see is simply a shift in how the search index is populated, NOT that G has actually purged data.

engine

3:41 pm on Apr 10, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm pretty sure, over time, I've heard the odd word here or there from Google saying it doesn't index everything.
No surprise there.
Whether it shows all the content it indexes is another question.

broccoli

6:37 pm on Apr 10, 2019 (gmt 0)

5+ Year Member Top Contributors Of The Month



I guess that explains why they’re only acknowledging 4K of the backlinks from my 300K backlink profile to my old website.

This is very harmful to creators who got in first and invented things before anyone else.

Robert Charlton

9:58 pm on Apr 10, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



...I've heard the odd word here or there from Google saying it doesn't index everything.
No surprise there.
Whether it shows all the content it indexes is another question.

Most recently, John Mueller posted this on Twitter, I believe from April 7 or 8....

[twitter.com...]

One thing to add here - we don't index all URLs on the web, so even once it's reprocessed here, it would be normal that not

It's not clear from the above what meaning John assigns to "indexed" after he discusses reprocessing. "De-indexed" (earlier in John's comments) does suggest that Google does save the data.

Robert Charlton

10:02 pm on Apr 10, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



PS: I've just been reading that Google will allow users to specify date ranges that include a "before:" qualifier. This suggests that Google may be aware of the criticism about dropping old results, and may be giving those who are looking for these vanished results a way to find them once again.

tangor

11:22 pm on Apr 10, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not surprised ... the web these days is riddled with duplicate/scraped content, low value, MFA, PBN, and all other sorts of "not content". And it is only getting worse by the minute!

With the tremendous amount of data/stuff out there, I would not be surprised if AT SOME FUTURE time only "home pages" are displayed.

The amount of processing and horsepower required to do that must be reaching a point of diminishing returns!

Sounds like g has called in a team of janitors to clean the public facilities. :)

aristotle

1:22 am on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google No Longer Indexes all The Web


Well, "no longer" suggests that Google used to "index" everything but "no longer" does. I was a bit confused when I saw that. Google has never "indexed" "all the web", nor would it make sense to try.

Better to just throw out all the spam and worthless garbage as the first step of the ranking process. That would likely leave less than 10% of the web's pages as eligible for inclusion in the search results. Not only would this save a lot of processing power, it would greatly improve the quality of the results.

rustybrick

11:02 am on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google has been saying this for many, many years. They haven't tried to index everything since the old days, when Yahoo and Google were competing for the largest index size. Back in the GooglePlex Google Dance days and Great America parties. :)

Brett_Tabke

1:01 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



@NickMNS: Google has long had several indices only one or two being public.

Where has Google said this recently?

@engine: from Google saying it doesn't index everything.

This is the first major-publication story I've seen where the depth of Google's commitment to indexing is questioned. They have stated 'all the world's information' as their mission statement since day one. This flat out questions their "search" credibility.

Mueller on Twitter is the only place I've ever seen a Google rep talk about their failure to index the entire web. They have regularly boasted about the quality and depth of their index. Nowhere has that record been officially corrected by a Google spokesperson.

Remember, this isn't a random blog post, this is LIFEHACKER - with a massive daily network reach (quoted as 100m daily, and sold just this week [nbcnews.com...]).
It also has a majority-female staff that does not smoke the G dope on a regular basis: [lifehacker.com...] . People are going to notice it. I expect more major stories soon.

NickMNS

1:11 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Brett I'm not referring to Google but to the service linked in the article you posted, which not only saves a bookmark but, for a small fee, will copy the webpage (copyrighted material) and save it on their server for you to view at a later date.

robzilla

1:58 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's not really a story, though, is it? Let alone a major one. It's a few blog posts from early 2018 serving as context for another blog post with tips on finding old websites (so that LifeHacker now ranks #1 for "find old websites"). That's not news, that's just content marketing. And tech-savvy bloggers and HackerNews users are hardly the "general public", so this interpretation feels a bit overblown. (And isn't the gender of the staff irrelevant?)

Google's corporate mission is "to organize the world's information" [about.google], but not necessarily all of it. Many competitors have tried to use that to their advantage, but without much success (anyone remember cuil.com [webmasterworld.com]?).

After reviewing the blog [inessential.com] posts [tbray.org] mentioned by Bray, I find it hard to make a case of why those should be retrievable, or how not being able to retrieve them signifies a "quality decline". Once blogging became a thing, everyone and their mother was using their blogs as their personal diaries. I imagine there's a limit to what's interesting (and on-topic) enough to process.

[edited by: robzilla at 2:36 pm (utc) on Apr 11, 2019]

Brett_Tabke

2:32 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>It's not really a story, though, is it?

A huge one. It questions Google's commitment to search (i.e., their only profitable endeavor). We live in the insulated bubble of search marketing; we need to pop our heads up every once in a while to see what the general public is thinking. Even a tech trade rag like Lifehacker has no idea what goes on in search. For them to notice Google is no longer indexing all the content is significant.

Meanwhile they are abandoning product after product.
[arstechnica.com...]

"It's only April, and 2019 has already been an absolutely brutal year for Google's product portfolio. "


Maybe we should have a Google Funeral forum where we eulogize all the slain Google Products

In fact, Google has closed, abandoned, and obfuscated so many products recently that they are having to reassure people that their products will stay. This from a Google Fiber email:

We love being your neighbor and being part of the community. As sure as breakfast tacos are awesome, Google Fiber is here for good.

Our network is built to last. And with Google Fiber, you get all the Internet we can give you, all the time.

aristotle

2:42 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"to organize the world's information"


The information on the web is only a tiny percentage of the "world's information".

There's an enormous amount of information in old books, newspaper archives, non-english publications, orally-transmitted traditions, etc, which isn't on the web and dwarfs what is on the web.

robzilla

3:26 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For them to notice Google is no longer indexing all the content is significant.

LifeHacker isn't exactly investigative journalism; the next article posted by the same author is "How to (Finally) Change Your Name on PlayStation 4". "They" didn't "notice" anything; the author (male) just mashed up a few blog and forum posts, sprinkled some tips on top, et voilà, another how-to was born. I wouldn't be surprised if the post sprouted from a list of key-phrases they hadn't yet targeted, because that's essentially their business model. (Apparently not a very profitable one, given the $32.5 million loss reported in Q4.)

And Alphabet having to shut down products all the time is a problem for them, certainly, but what's that got to do with Search? As noted throughout the thread, Google has never committed to indexing everything, so how does their continuing on that path suddenly signify a lack of commitment? I'd sooner argue the opposite: that their advancements in information retrieval increasingly allow them to separate the wheat from the chaff.

Shaddows

4:10 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There's an enormous amount of information in old books, newspaper archives, non-english publications, orally-transmitted traditions, etc, which isn't on the web and dwarfs what is on the web.

Not true!

Various studies suggest that online "data" is bigger than offline data, and growing exponentially. For example, more data was produced in the last two years than in the whole of prior human history.

Mostly, this is because of uncompressed or less-compressed video files. But also because billions of people spend time on SM or... vintage message boards(!)

Some resources before I get flamed (I have not read them, just searched for 3rd party verification):
[seagate.com...]
[forbes.com...]
[bbc.co.uk...]

That last one is old, and contains the following quote...
The study, published in the journal Science, calculates the amount of data stored in the world by 2007 as 295 exabytes*.

That is the equivalent of 1.2 billion average hard drives.

The researchers calculated the figure by estimating the amount of data held on 60 technologies, from PCs and DVDs to paper adverts and books.

"If we were to take all that information and store it in books, we could cover the entire area of the US or China in 13 layers of books," Dr Martin Hilbert of the University of Southern California told the BBC's Science in Action.

*Note the 295 exabytes. We are now beyond 33 zettabytes (33,000 exabytes, roughly 110 times bigger, or about 1,450 layers of books over the US)
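The scaling in that footnote is easy to check; a quick sketch of the arithmetic, taking the 2007 study's figures (295 EB, 13 layers of books over the US) and the rough 33 ZB current estimate as given:

```python
# Scale the 2007 study's figures up to the ~33 ZB current estimate.
EB_2007 = 295              # exabytes stored worldwide in 2007 (BBC/Science figure)
ZB_NOW = 33                # rough current estimate, in zettabytes
EB_NOW = ZB_NOW * 1000     # 1 zettabyte = 1,000 exabytes

growth = EB_NOW / EB_2007  # growth factor since 2007
layers = 13 * growth       # layers of books covering the US, scaled up

print(f"~{growth:.0f}x more data, ~{layers:.0f} layers of books")
# → ~112x more data, ~1454 layers of books
```

So "100 times bigger" is a round-number approximation; the factor works out closer to 110.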

iamlost

4:21 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If this 'no longer indexes all the web' headline is correct, it means either or both of the following is true:
1. Google is no longer noting and/or crawling every URL it comes across.
2. Google is deleting some number of stored URLs from existing indexes aka memory.
Note: other than what's necessary to handle link rot (although there have always been bot request indications that G never forgets a link...)

Simply for crawl reasons the above seems unlikely. Plus, G needs to crawl simply to know whether content is good, bad, or indifferent. Etc. So, imo, it is a matter of the definition of indexing.

As to whether G knows of a given page, one's log file is a better resource than anything G shares publicly. If a page has been crawled, it has been indexed; if a page receives G search referrals, it has been indexed; if there is a link on a crawled page to a non-crawled page, that URL aka address has been indexed/stored; if a URL is returned in a search query result, it has been indexed. Etc.
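Mining one's own logs along those lines is straightforward; a minimal sketch, assuming a combined-format access log at a hypothetical path (for rigor, verify Googlebot via reverse DNS rather than trusting the user-agent string):

```python
import re
from collections import Counter

# Hypothetical path; point this at your own server's access log.
LOG_PATH = "access.log"

# Combined log format: client, identd, user, [timestamp], "METHOD path proto", ...
REQUEST_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')

def googlebot_hits(path):
    """Count requests per URL made by clients claiming to be Googlebot."""
    hits = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if "Googlebot" not in line:  # crude user-agent filter
                continue
            m = REQUEST_RE.match(line)
            if m:
                hits[m.group(1)] += 1
    return hits
```

If a URL shows up in this tally, Googlebot has at least fetched it, whatever the site: command suggests.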

Is the headline 'true'?
Possibly; for some definition, almost certainly, as JohnMu and other Googlers tend to be most careful in what they say. However, in the broad sense I'd say not, as that implies any number of inefficiencies.

Google has been moving away from allowing search commands for years, usually by making them increasingly mediocre first; that any given page/site cannot, or can no longer, be found via a Google search query only means that it is no longer in the search-level indices, not that it has been removed totally from all Google data storage. Even if, perhaps especially if, it is a hazardous page, it needs to be indexed for reference.

The level of content/data completeness may vary, but not indexed to some degree? Nah. That would screw with too many necessary requirements, not least link graphs.

Until/unless Google provides a more detailed convincing response I'm saying the headline/statement is if not misleading then misunderstood.

And now I'll return to pretty much ignoring Google because, to date, my sites get their bots and their referrals simply by existing. And being purely stupendously awesome. :)

Brett_Tabke

4:22 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>but what's that got to do with Search?

That Google is struggling to maintain quality across its products. The shutdown cycle is a growing cancer that is affecting search.

> data

Google is estimated (sorry, wish I had a reference) to have 1.25m boxes in data centers around the world, with average storage in the 3-4 TB range. Somewhere around 5 exabytes in the whole network. You could index 99% of the text on the web with that (the tricky part being duplicate dynamic content).
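A back-of-the-envelope check of that estimate (the 1.25m-box and per-box figures are the poster's guesses, not published numbers):

```python
servers = 1_250_000       # estimated boxes in Google's data centers (a guess)
tb_per_server = 4         # upper end of the assumed 3-4 TB average

total_tb = servers * tb_per_server   # 5,000,000 TB
total_eb = total_tb / 1_000_000      # decimal units: 1 EB = 1,000,000 TB

print(f"~{total_eb:.0f} exabytes across the fleet")  # → ~5 exabytes
```

At the 3 TB end the same arithmetic gives roughly 3.75 EB, so "around 5 exabytes" is the generous case.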

> suddenly signify a lack of commitment?

Because they are impacting quality. They've reached a tipping point of "non indexation" that even tech people are starting to notice.

They have also crammed so many in-site links onto the SERPs now that organic exposure is falling fast. Thus, they are killing off swaths of the web (starting with older content).

aristotle

6:41 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Shaddows -- You can't judge the amount of useful information by the number of hard drives used to store it. For the web, that would include all the worthless inane social media posts and youtube videos. Not to mention all the scraped and re-hashed content, quackery, intentional mis-information, and so on.

I have hundreds of old books full of valuable information that isn't on the web. There's also an enormous amount of knowledge that people carry around in their heads that isn't on the web. As a simple example, when my mother was a child living in the country, the road she lived on was initially a dirt road. She can remember that when she was six years old, the county paved it. It was a big deal for the people who lived on that road. So knowing when she was born, you could deduce the year when the road was paved. There might be an entry about it in the old county records, but I bet you can't find it on the web.

But that's just one memory of one person. What about all the other people on the earth? Just because information isn't stored on hard drives, doesn't mean that it doesn't exist.

EditorialGuy

6:46 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not everything belongs in the organic search results. A search engine's results aren't supposed to be a raw data dump.

Brett_Tabke

7:40 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Agreed, EditorialGuy, but there are huge swaths of the web that Google is now abandoning. I think, with the loss of the organic page-one SERP, we are witnessing the end of Google indexing the web in large measure. Soon it will only index those that buy AdWords or that are profoundly compelling.

Maybe it is time to stop calling Google a Web Search Engine and start to call them a 'Fortune 1000' index.

aristotle

9:11 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That would be like skimming off a thin layer of cream that rose to the top of the cesspool.

MrSavage

9:18 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Who will ask the question of whether the growth of Google's own YouTube plays a role in the discarding of the "old"? Losing information from the web is of secondary importance compared to the fruitful indexing and ranking of YouTube videos in the SERPs. Nobody here wants to discuss this angle, I'm sure. Defenders of the Faith isn't just an album title. A few among us have the poster up on their wall and a sticker on their briefcase.

nomis5

9:28 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It would be interesting to have views on what the age of a webpage / website means.

Is it literally just how long it's been available? The date of the last update? The date of the last significant update?

If G is removing websites / pages purely based on age then they need to be very careful. In that situation there will be a huge number of redirects simply to avoid the age problem.

robzilla

10:58 pm on Apr 11, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They have also crammed so many in-site links onto the SERPs now that organic exposure is falling fast. Thus, they are killing off swaths of the web (starting with older content).

There's no disputing that the organic landscape is changing, as it always has, but, in keeping with the news, the leaps in logic taken here (and earlier) are about as large as a black hole. I can see how one might imagine a connection, but does that suffice to make it "thusly" so?

If G is removing websites / pages purely based on age then they need to be very careful

They're not, and the whole "10 years old" theory is nonsense. I have no trouble pulling up obscure pages from 20+ years ago. If old pages are missing, there's something else about them that, possibly in combination with their age, essentially says: "this is very unlikely to be interesting to anyone". How that's determined would, of course, certainly be interesting to know. But if you look at one of the blog posts that could no longer be found, That new sound [inessential.com], see if you can figure out the topic or point of the post, try to think of the type of query that it might be returned for, and then imagine Google's AI trying to figure out what to do with it. Is this a page that needs to be in the "primary" index?

EditorialGuy

1:17 am on Apr 12, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They're not, and the whole "10 years old" theory is nonsense. I have no trouble pulling up obscure pages from 20+ years ago.

Our top landing page is from an article that was written in 1998. After 20+ years and occasional tweaks to keep facts up to date, it continues to produce a steady stream of traffic and revenue.

tangor

1:57 am on Apr 12, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not sure what "age" has to do with it ... a site from 1996 still ranks well, was updated to responsive in 2016(!) and continues to do well. Being internationally recognized with links from 5 continents didn't hurt.

However, never had adsense or any other kind of advertising/third party of any kind (or js, or vids, or audio). Plain Jane scholarly report. Serp 1 most times, but competitor erosion is beginning to show... sometimes (by topics) as low as Serp 9.

Nobody can keep Serp 1 forever. Just a fact of life.

tangor

2:01 am on Apr 12, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also note we have a different set of "quality raters" in the work force. Kids from a different education system that pushes various ideologies and values compared to the last generation. We have to deal with THEIR likes and dislikes as well as competitors, too. (sigh)

Robert Charlton

12:18 pm on Apr 12, 2019 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Regarding the disappearance of old websites on Google, I noted earlier in this thread that...
I've just been reading that Google will allow users to specify date ranges that include a "before:" qualifier. This suggests that Google may be aware of the criticism about dropping old results, and may be giving those who are looking for these vanished results a way to find them once again.

The reference was from Danny Sullivan, and here's a thread I've started on the topic, largely to introduce the new operators and to explore Google's apparently strong interest in obtaining better dates on sites and pages...

Google testing before: and after: commands to help find old pages
https://www.webmasterworld.com/google/4942129.htm [webmasterworld.com]
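Per that announcement, the operators accept either a full date or a bare year; a couple of hypothetical example queries (the keywords are made up, only the before:/after: syntax is what was announced):

```
widget history before:2010
avocado toast after:2018-01-01 before:2018-06-01
```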

In relation to this current thread about how much of the web Google indexes, I noted that the introduction of new date operators...
...suggests another approach to unearthing old pages, which is to allow the user to choose the time segments, rather than attempting a full historical view in a set of ten links. The segmented approach is particularly useful in finding gold in old pages, which are otherwise buried under ever-growing layers of new results. How this will eventually relate to the apparent size of the web that Google is indexing remains to be seen.

...I'm very curious where Google is going with this... how many results are simply going to be Supplemental, and how many others are likely to become Vintage, like old wines, brought out for special occasions.

This 47-message thread spans 2 pages.