Thousands of pages indexed but not showing in results


kaijohannkursch

3:37 pm on Oct 12, 2003 (gmt 0)

10+ Year Member



We manage a new site (a few months old) with a large number of indexable pages. It has been indexed in three steps: first the index page, then several thousand pages, and then...

On 9-11 September Googlebot crawled nearly 200,000 different pages, but those pages do NOT show in Google results... and Googlebot is visiting the site infrequently (a few hits on the index page nearly every day... one day a thousand hits...)

The site is PR6 (surely PR7 next update) with a lot of inbound links (and growing daily), most pointing to the index page, but some to internal pages as well (some from high-PR sites). We've noticed Google showing more pages (and even counting more backlinks) on its other sites since then.

It's been more than a month and we are still waiting for the pages to show... Has anyone seen something similar? Can we expect Google to show these pages soon?

[edited by: kaijohannkursch at 5:04 pm (utc) on Oct. 12, 2003]

Net_Wizard

9:48 pm on Oct 13, 2003 (gmt 0)



Sure, if that's what you think is happening. But then we're back to your original question: what happened to those crawled pages? ;)

Cheers

kaijohannkursch

10:06 pm on Oct 13, 2003 (gmt 0)

10+ Year Member



I don't know (that's why I started this topic), but obviously it's not what you suggested, as that is not the way Google works.

BigDave

10:37 pm on Oct 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



kaijohannkursch,

I think you need a refresher course in how search engines work; Net_Wizard is right.

I would highly recommend that anyone who wants to understand some of what is involved in the algo and its various tweaks install ht://Dig from [htdig.org...] on a website. It is an open source search engine, not a search service.

Then go to [htdig.org...] and read up on the configuration file attributes, and note whether each one takes effect while building the index (htdig) or during searching (htsearch). Play with these factors, tweaking them up and down, and run some searches. You might be amazed at how much of a difference a minor tweak can make.

Almost all the factors that cause different weightings in the results are calculated at indexing time, not at search time.

Obviously, Google does not use ht://Dig; their engine is far more sophisticated. And you might know something that I don't about how Google runs their searches. But if you had such inside knowledge, I really doubt you would be asking such a simple question here.

Now, back to your original question, I refer you to Google:
[google.com...]


5. How long does the Google robot take to index a URL once it's been submitted?

Depending on the timing of the submission and of our crawl, the entire process can take between six and eight weeks.

Your question was not about submitting pages, but this still applies. If a crawled page does not qualify to make it into the fresh listings, I find that it can still take over a month to make it into the index. It's improving quite a bit, but you still need a little patience.

kaijohannkursch

11:01 pm on Oct 13, 2003 (gmt 0)

10+ Year Member



I simply disputed two of Net_Wizard's conclusions:

- Pages need to be PRed before showing in search results. I think this needs no further explanation; it's clear that's not true.

- "In page" factors are precalculated before the pages show in result pages. If some pages display in Google without their content indexed, it is obvious this is not true.

In addition, think about a LARGE text page: thousands of different words, and a nearly infinite number of word combinations (two-word, three-word... expressions). Are you suggesting Google pre-ranks each page for each of those infinite combinations? Obviously not.

There are factors Google computes before displaying a page. I don't think Googlebot spiders a page and that's all; I am not a five-year-old. But in this case, I repeat: "If Google needed to do that kind of processing on the crawled pages, it would have displayed them gradually. There is no need to release them in bulk".

Or are you saying Google needs to apply whatever process it applies to spidered pages to ALL indexed pages within a site before releasing them?

OldGuy

11:04 pm on Oct 13, 2003 (gmt 0)

10+ Year Member



A test on 20 test URLs with extremely simple content... 5 weeks to update content on 5-page sites.

Still indexing the 2,000 page sites.

Curious as to the turnaround on 200K pages...

Net_Wizard

12:05 am on Oct 14, 2003 (gmt 0)



I think you misread my post :) I don't work for Google and certainly I don't know their exact process.

However, I know three things when it comes to building a search engine.

1. Crawling
2. Indexing
3. Search

These are the three basics of search engines. Of the three, Search is optimized for speed. The ranking algorithms are not part of the Search stage; they are applied at the 2nd stage, Indexing. Of the three stages, Indexing is the slowest and can take days, weeks, even months depending on the size of the database.

The only real-time work in the Search stage is the actual query; the rest is back end.

I wouldn't even be surprised if Google has indices for popular queries just for the sake of speed. This is called 'database optimization', and there are various ways to optimize a database.

Again, I don't know their exact optimization technique; if I did, I wouldn't be here learning from others and talking to you, but busy selling my knowledge to rich clients and guaranteeing them the #1 position. ;)

Crawlers/spiders don't do real-time indexing. We sometimes say 'my pages have been indexed by Googlebot', but that's a misnomer: the crawler doesn't index your pages, it fetches/downloads them. To conserve resources, search engines just grab your pages; anything more would waste both your resources and the crawler's. It would be very inefficient for a search engine to do that.

Indexing - this is the meat of the search engine, whether Google or ATW or any other. This is where all the filtering, scoring, and cataloging occurs. Data here is not ready for public consumption and lives in a database separate from the Search database(s). Whether they calculate PR at this point is beside the point; only Google knows that.

Search - an optimized database (or set of databases) for public queries. A well-built search engine would not require recalculation at this point, for two reasons: speed and resources.
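The three stages above can be sketched in miniature. This is a toy illustration only - every function name and the crude term-frequency scoring are invented, nothing like Google's actual system - but it shows the key point: scoring happens offline at indexing time, while the search stage is just a fast lookup.

```python
# Toy sketch of the three stages: crawl (fetch only), index (slow, offline),
# search (fast lookup). All names and the scoring are illustrative.
from collections import defaultdict

def crawl(fetch, urls):
    """Stage 1: just fetch and store raw pages -- no analysis yet."""
    return {url: fetch(url) for url in urls}

def build_index(pages):
    """Stage 2: the slow, offline part. Tokenize, score, catalog.
    Ranking factors are precomputed here, not at query time."""
    index = defaultdict(dict)          # word -> {url: score}
    for url, text in pages.items():
        words = text.lower().split()
        for w in set(words):
            index[w][url] = words.count(w) / len(words)  # crude tf score
    return index

def search(index, query):
    """Stage 3: optimized for speed -- only lookups, no recalculation."""
    hits = index.get(query.lower(), {})
    return sorted(hits, key=hits.get, reverse=True)

pages = crawl(lambda u: {"a": "cheap hotels in phuket",
                         "b": "phuket weather and hotels hotels"}[u], ["a", "b"])
idx = build_index(pages)
print(search(idx, "hotels"))   # pages ordered by their precomputed score
```

Notice that once `build_index` has run, `search` touches no page text at all - which is exactly why the live query can be fast while the indexing batch takes as long as it takes.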

If you still think this is wrong, then I suggest you build your own search engine - not to mock you, but to give you a chance to get first-hand experience of how search engines work.

Cheers

BigDave

12:15 am on Oct 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You have a tendency to read things into posts that are not there.

I simply negated two Net_Wizard's conclusions:

No, you did not. Those were your conclusions about what was written, not what Net_Wizard wrote.


- Pages need to be PRed before showing in search results. I think this does not need further explanation and it's clear to see that's not true.

What Net_wizard wrote was:

Each URL have to wait their turn to be analyzed, graded, PRed, and ranked accordingly. In short, the actual indexing process.

Which is totally different. To make it into the index, a page has to be processed. The URLs that are in the index without being crawled have also been processed, and many of them have PR assigned. They also have scores for anchor text and components of their URL. They just don't have any score yet for their on-page factors.

- "In page" factors are precalculated before the pages show in result pages. If some pages displays at Google without content indexed it is obvious this is not true.

Yes, many of the on-page factors are pre-calculated when the page is inserted into the index, though not all. Those pages that are not in the index but show in the results won out on their off-page scoring.

How can google take their on page score into account when the page isn't there? They cannot. Your logic is flawed.

In addition. Think about a LARGE text page. Thousands of different words, and nearly infinite word combinations (two-word, three-word... expressions).

Oh, believe me, I have thought about that. It is one of the toughest pieces of SE programming.

The way you handle a search on a quoted phrase is that you AND together the results for each of the words in the phrase, and use that to limit the number of cached pages you have to dig through at search time.

The thing is, digging for a large quoted phrase in a huge document is the exception rather than the rule. And it would be almost impossible to do without using the precalculated information to narrow the search.
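A toy sketch of what this looks like, assuming a positional inverted index (the data layout here is invented for illustration): the per-word posting lists are ANDed first, and word positions are only checked in the few documents that survive the intersection.

```python
# Illustrative positional inverted index for quoted-phrase search:
# intersect the cheap, precomputed posting lists first, then verify
# word adjacency only in the surviving documents.
from collections import defaultdict

docs = {
    1: "thousands of pages indexed but not showing",
    2: "pages indexed in three steps",
    3: "indexed pages showing in results",
}

# Built at indexing time: word -> {doc_id: [positions]}
postings = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        postings[word].setdefault(doc_id, []).append(pos)

def phrase_search(phrase):
    words = phrase.split()
    # Step 1: AND the doc-id sets -- narrows the candidates fast.
    candidates = set(postings[words[0]])
    for w in words[1:]:
        candidates &= set(postings[w])
    # Step 2: check adjacency only in the surviving documents.
    result = []
    for d in candidates:
        for start in postings[words[0]][d]:
            if all(start + i in postings[w][d] for i, w in enumerate(words)):
                result.append(d)
                break
    return sorted(result)

print(phrase_search("pages indexed"))   # only docs 1 and 2 have the exact phrase
```

The expensive adjacency check runs over two documents here instead of three; at web scale that same trick is the difference between feasible and impossible.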

Are you suggesting Google pre-ranks each page for each of those infinite combinations? Obviously not.

Correct, I am not saying that. You just made a bad assumption about what I was saying, just like you did with Net_Wizard.

There are factors Google computes before displaying a page. I don't think Googlebot spiders a page and that's all; I am not a five-year-old. But in this case, I repeat: "If Google needed to do that kind of processing on the crawled pages, it would have displayed them gradually. There is no need to release them in bulk".

1. Calculating PR is a major undertaking. It takes days.
2. Inserting and deleting records from a live database that is being searched half a billion times a day would lead to spinlock hell. Servers are taken offline to update the database, then brought back online.
3. The Google index is terabytes in size and has to be transferred to each of the datacenters. This is a major operation.

If you do not have to implement it yourself, it is really easy to say "There is no need to release them in bulk", but saying it, doesn't make it so.
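To give a feel for point 1, here is a toy power-iteration PageRank over an invented three-page link graph. The real computation iterates in the same spirit over billions of pages, which is why a full PR recompute is a batch job measured in days, not something done per page on the fly.

```python
# Toy power-iteration PageRank; the link graph below is hypothetical.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}      # start uniform
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Each page q that links to p passes on pr[q] split
            # evenly among q's outgoing links.
            inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / len(pages) + damping * inbound
        pr = new
    return pr

# Hypothetical site: index links to two deep pages, which link back.
links = {"index": ["a", "b"], "a": ["index"], "b": ["index"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # the index page accumulates the most PR
```

Even this tiny graph needs dozens of iterations to settle; every iteration over a multi-billion-node graph is a pass over terabytes of link data.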

Or are you telling Google needs to apply whatever the process it applies to spidered pages to ALL indexed pages within a site before release them?

No. Where did you get that idea? No one suggested that.

asinah

4:49 am on Oct 14, 2003 (gmt 0)

10+ Year Member



Regarding the backlinks: I didn't count the backlinks from your own domain in the Google index. I noted you have a couple of links from external sites that I had overlooked, but many of those links come from websites related to your business, as well as from several forums.

Try rewriting your page URLs to blabla.com/World/...
and give them an .html extension. This also makes for easy inclusion in the other search engines, such as ATW and AltaVista, in which you currently have zero pages indexed.
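For what it's worth, on Apache this kind of rewrite might look something like the following sketch. The script name and the `c` parameter are purely illustrative, since the site's real URL scheme isn't shown in this thread:

```apache
# Hypothetical mod_rewrite rule: expose static-looking /World/<name>.html
# URLs while the content is still served by a dynamic script.
RewriteEngine On
RewriteRule ^World/([A-Za-z0-9_-]+)\.html$ /index.php?c=$1 [L]
```

Crawlers then see plain .html paths with no question mark, which the 2003-era engines being discussed here (ATW, AltaVista) reportedly handled better than query strings.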

(Don't ignore those engines, as they could be a life saver if Google drops your pages. I was heavily penalised by Google because my server was down for over a week (Google did a deep crawl while the site was down), but my 25,000 pages in ATW still keep revenue coming in.)

You seem to have a lot of content in Spanish, so ATW is a must, but having checked your pages, they are not optimized for either ATW or AltaVista.

(We get about 400-500 visitors a day from ATW for our Spanish content alone, while Google brings us only about 600-700, down from a peak of 2,000 last month.)

Finally, a piece of advice: set up another domain running the DMOZ directory (80 languages) and link back to your main domain. (Then ask DMOZ to list your new domain under "Sites using Dmoz content"; that gives you a link to one of your domains, it should show up in the Google directory within 3-6 months, and you should do fine.)

Take care and good luck!

cabbie

5:35 am on Oct 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It was only a month ago. Give it a couple more weeks, I reckon.
Google might be updating continuously now, but they're in no hurry to index deep pages on a relatively new site.

Oh hang on... that's what I wrote in msg#5 and msg#8.
It seems, kaijohannkursch, that you won't accept anything but a "crikey, something must be wrong with Google" post.
You may have been an SEO for a few years, and my respect goes out to you for having so many pages, but some of the guys answering you in this thread are Google whizkids in the #1 Google forum in the world, so I would heed their opinion for sure.
Hey GoogleGuy, can you help put kaijohannkursch's mind at ease at all?

asinah

6:31 am on Oct 14, 2003 (gmt 0)

10+ Year Member



Latest updates on my website:
After being punished by Google for losing DNS, the site being down, etc. for so long: Google has added another 2,000 pages to its index (26,000 total) over the last 12 hours, and www2 and www3 already show over 32,000 pages in their index.

Our Dmoz pages are now: 6740 (Up to Level 3)

Our Hotel pages are now: 15450 (mainly Japanese, Chinese, Arabic). It took Google 3 months to get those pages indexed, but the content in German, English, Italian and Dutch is still not back, and I guess it will take another month or two to have it back.

Our Weather feed has now: 4250

Our Country Guides now have: 3500

Only the Amazon XML feed shows no additional pages over yesterday's 2500.

We modified DNS again yesterday afternoon, Asia time (24 hours ago), and for the first time in 3 weeks we had 4,400 visitors online yesterday (Monday is a poor day). I am also monitoring the log file in real time, and I am now getting 2 visitors from one Google site or another every 20 seconds.

Now my job is done and I will be flying to Phuket for a 2-week holiday. Hopefully, by the time I return, I will be back to the previous level, with at least my 88,000 pages back in Google.

BTW: I don't count myself an SEO expert, as I can't keep up with so many changes on a weekly basis, but my number one rule is "Take it easy, relax and wait".

I received a sticky mail today from one of the users asking if I would offer SEO services for their travel sites. No, I am not in that business, but I am happy to share experiences on a non-commercial basis in this forum.

Take care everyone and I will update you all in the first week of November how things turned out for my site.

onedumbear

6:36 am on Oct 14, 2003 (gmt 0)

10+ Year Member



flying into Phuket

asinah, just what is "et" you are flying in to phuk?
I'm sorry, I couldn't resist

asinah

7:56 am on Oct 14, 2003 (gmt 0)

10+ Year Member



I am on a holiday with my wife and daughters. Beach, Beach, Beach and scuba diving

kaijohannkursch

9:30 am on Oct 14, 2003 (gmt 0)

10+ Year Member



Sorry if I misunderstood, which seems to be the rule in this topic.
Some points.

If I was talking about the delay between when pages are crawled and when they are displayed in results, and Net_Wizard says "Each URL have to wait their turn to be ... PRed", I thought he was suggesting pages are PRed before being displayed... I thought so, and everyone I discussed this post with thought so too. If we misunderstood, sorry again.

When talking about large text pages I was not referring only to quoted searches... Take into account that a given page targets every combination of words it contains (the rank it reaches is another matter... usually too low to be considered).

About the process Google applies to crawled pages: in my previous post I said I know there are precalculated factors. A longer or shorter process? Well, who knows.

If we agree pages do not need to be released in bulk, why are pages crawled after the 200K mentioned being indexed (and even link-counted) first? (Treat this question as rhetorical, as this topic would be eternal.) That's the question I'm trying to answer. PR and link-level factors do not seem to count, as many of the pages already displaying are deep links from the index...

And Cabbie, we were simply discussing. I think there were unclear points (though you seem to misunderstand exactly what I am asking), so we were working through them... I think discussing a subject is the objective of any forum.

Silicon

2:33 am on Oct 15, 2003 (gmt 0)

10+ Year Member



It seems to me G is slowing down when it comes to new pages showing up in the SERPs... IMHO, I think G will probably go back to the monthly update, with a beefed-up freshbot, because people were taking advantage of testing the algo and getting results in such a short period of time.

claus

9:30 am on Oct 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



kaijohannkursch, I'm sorry about all the confusion. I believe your questions are answered quite in depth, but perhaps not as explicitly as you would have wished.

(1) You are probably all right = not banned or penalized; your indexed page count is increasing
(2) Expect your indexed page count to increase (but not by 33% forever), and expect pages to be spidered even if not indexed.
(3) Don't expect all pages to get indexed. Ever.
(4) Do follow asinah's advice:
- remove empty pages from your directory (focus on good content pages)
- rewrite URLs; remove "?c=82-4" and add "/World/" instead (plain text, not IDs - no question mark)

As for (2), this depends a lot on how many (external) incoming links you have. With this page count, you will need a lot. And as for external incoming links, the URLs of your site might make the count unreliable, as "blabla.com/?c=82-4 is your mainpage also if it is not" (as asinah wrote).

Asinah, did i understand you right if i say this:

You have one website/domain with:
(a) 2,500 pages travel portal
(b) 350,000 pages ODP
(c) 400,000 pages Amazon
(d) 9,000 pages weather-feed
(e) 145,000 pages hotel database

...i.e. a grand total of 906,500 pages on the site, which is around one year old?

Did i also understand this right:

(1) Before the problems, you had around 88,000 pages indexed in Google (10% of pages)
(2) You had server+dns problems one month ago; indexed page count dropped
(3) Problems are solved, you now have around 30,000 pages indexed in Google (3% of pages)

I get a bit confused, as you sometimes use the word "links" meaning "pages indexed". However, these percentages might provide a good guideline for kaijohannkursch and everyone else. In essence, as not all pages will get indexed, you should concentrate on those adding most value (in whatever unit of measurement you might have).

However, almost 100K pages indexed after just one year is still a lot. To accomplish this - apart from your 1K link exchange - what is the amount and structure of your external (off-site) incoming links?

/claus

asinah

5:40 pm on Oct 15, 2003 (gmt 0)

10+ Year Member



Claus,

You have one website/domain with:
(a) 2,500 pages travel portal

YES

(b) 350,000 pages ODP

YES

(c) 400,000 pages Amazon

YES

(d) 9,000 pages weather-feed

YES

(e) 145,000 pages hotel database

YES

...i.e. a grand total of 906,500 pages on the site, which is around one year old?

NO, ALTOGETHER IT IS ABOUT 2.5 MILLION PAGES.

AMAZON ITSELF IS AROUND 2 MILLION PAGES, AS I HAVE THE WHOLE XML FEED VIA MOD_REWRITE AS .HTML FILES.

Did i also understand this right:

(1) Before the problems, you had around 88,000 pages indexed in Google (10% of pages)

YES, GENERATING AROUND 10,000 VISITORS PER DAY. WE ALSO HAVE AROUND 25,000 PAGES IN ATW, SO THE 10K VISITORS PER DAY INCLUDES ALL THE OTHER SEs.

(2) You had server+dns problems one month ago; indexed page count dropped

THE INDEX PAGE IS STILL DROPPED, WITH A PENALTY, AS THE SITE WAS DOWN.

(3) Problems are solved, you now have around 30,000 pages indexed in Google (3% of pages)

AROUND 30K, BUT AS MENTIONED THE INDEX PAGE IS STILL GONE.

I get a bit confused, as you sometimes use the word "links" meaning "pages indexed". However, these percentages might provide a good guideline for kaijohannkursch and everyone else. In essence, as not all pages will get indexed, you should concentrate on those adding most value (in whatever unit of measurement you might have).

IT SHOULD HAVE BEEN "PAGES INDEXED" SORRY ABOUT THAT.

I WAS READY TO KICK OUT THE AMAZON XML FEED, BUT AS OUR SITE IS BASED ON TRAVEL, IT WORKED VERY WELL WITH OUR TRAVEL GUIDES ETC. - UNTIL RECENTLY WE HAD ABOUT 9000 PAGES IN FROOGLE.

However, almost 100K pages indexed after just one year is still a lot. To accomplish this; apart from your 1K linkexchange, what is the amount and structure of your external (off-site) incoming links?

FRANKLY SPEAKING, THE LINK EXCHANGE WITH ALL THE OTHER TRAVEL PORTALS IS NOT THAT GREAT. MOST OF THE OFF-SITE LINKS ARE FROM LARGE SITES SUCH AS CNN, BBC, LONELY PLANET AND MANY LARGE WEBSITES IN JAPAN, CHINA, MALAYSIA, SAUDI ARABIA, GERMANY, NORWAY, SWEDEN, ETC. - AND WE DON'T LINK BACK TO THE LARGE SITES.

WE TRIED TO IGNORE THE US AND, IN GENERAL, THE ENGLISH-SPEAKING COUNTRIES, AS THE COMPETITION IS VERY BIG. EXAMPLE: MALAYSIA, INDONESIA, BRUNEI AND SINGAPORE TOGETHER HAVE A POPULATION OF AROUND 230 MILLION, WITH A LARGE MIDDLE CLASS THAT USES THE INTERNET. WE ARE THE ONLY ONES OFFERING TRAVEL CONTENT AND HOTEL RESERVATIONS ON THE INTERNET IN THE LOCAL LANGUAGES. THE SAME GOES FOR JAPAN AND CHINA.

WE ALSO DEVELOPED OUR OWN HOTEL DATABASE IN VIET NAM, WHICH TOOK 4 MONTHS TO FINISH, AND THE TEAM IS WORKING DAILY ON THE MULTILINGUAL CONTENT.

BTW: THE EXTERNAL LINKS IN GOOGLE TO OUR SITE NUMBER JUST ABOUT 300-400.

ANYONE THAT WANT TO KNOW THE SITE JUST SEND ME A STICKY MAIL.

swampy webber

12:35 am on Oct 16, 2003 (gmt 0)

10+ Year Member



I have pages crawled quite often that take at least this long to show up. Not saying it's good, just stating the fact.

asinah

6:51 am on Oct 18, 2003 (gmt 0)

10+ Year Member



Latest Google figures, compared to 72 hours ago:
(new value / value from 3 days ago)

Our Dmoz pages are now: 9350/6740 (Up to Level 3)

Our Hotel pages are now: 21800/15450

Our Weather feed has now: 4400/4250

Our Country Guide's has now: 3900/3500

Amazon: 9200/2500

A lot of activity on our box, with an average of 30,000 pages being crawled per day.

I have also been in contact with someone at Google AdSense, just to get confirmation about the index page. The answer took 12 hours, and that was the first time I had contacted anyone at Google. BTW: the reply came back from info@google but went into good detail, and they are looking into the details of the index page.

I am hoping that an update is on the way over the next few days. Content in English, German, Italian and French is also showing up, and I guess more pages will be added over the next 3-4 days.

Combined pages in the Google index are now about 38,000, and I hope to have 50,000 back by next week.
