Forum Moderators: open

Checking which Pages Crawled, and which Pages Dropped?

How to find which Pages Dropped? -sj datacenter is showing 3000 less pages


Imaster

8:58 am on May 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello,

I was checking my site (PR 6) on the www-sj.google.com server and noticed that about 3,000 pages (out of 30,000) have been dropped by Google (which will be reflected in the next update, once the dance is over). I tried, but I am simply not able to work out manually which pages have been dropped.

Is there a way to get a list of URLs (from a site) which are indexed by Google? I wish to compare the current normal server with the -sj server, and with that I will be able to work out why those pages were removed. It could be some fault in my index pages.

I don't think it could be any penalty, as I am a very honest webmaster and haven't used any wrong techniques (to my knowledge).

Or could it be that it didn't deepcrawl enough?

My site was down for 3 hours during the deepcrawl. Do you think a period of 3 hours could cause a drop of about 3000 pages?

IMaster

takagi

9:27 am on May 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello Imaster, don't pay too much attention to the current data in www-sj.google.com. But to answer the question about finding the indexed pages, search for

site:www.mydomain.com -blablabla

This will show all indexed pages from your site (even those not yet spidered) that don't contain the string 'blablabla'. Google will only show 1000 results, so that won't work for a site with 30,000 pages. To get a better understanding about the missing pages, you have to limit the search. If 5% of the pages are about green widgets, search for

site:www.mydomain.com "green widgets"

and you will only see the 150 pages (5% of 30,000) on www.google.com and maybe 135 pages (5% of '30,000-3,000') on www-sj.google.com. To make it easier, you could change Google to display 100 pages per SERP instead of the default 10 pages (www.google.com >> Preferences >> Number of Results).
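Once you have gathered the URLs from those segmented site: searches on both servers, the missing pages are just the set difference between the two lists. A minimal sketch with made-up URLs (the paths are hypothetical, not from the thread):

```python
# Hypothetical example: diff the URL lists gathered from the segmented
# site: searches on www.google.com (main) and www-sj.google.com (-sj).
main_index = {
    "/widgets/green-1.html",
    "/widgets/green-2.html",
    "/widgets/green-3.html",
}
sj_index = {
    "/widgets/green-1.html",
    "/widgets/green-3.html",
}

# Pages on the main server but absent from -sj are the candidates
# that may have been dropped.
dropped = sorted(main_index - sj_index)
print(dropped)  # ['/widgets/green-2.html']
```

Repeating this per segment ("green widgets", "blue widgets", ...) keeps each result set under the 1000-result limit while still covering the whole site.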

Imaster

10:27 am on May 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Takagi,

Wow, you are very brilliant, and equally helpful. :)

After spending a lot of time analyzing (using your superb tips), I was able to pinpoint exactly some of the pages which are not being shown on the -sj datacenter. I will keep an eye on those pages and see if they are shown on the normal server after the dance is over. (Fingers crossed.)

You said "don't pay too much attention to the current data in www-sj.google.com".

How does this -sj server work? I would love to know more, and would appreciate it if you could answer the questions below.

- So what are the odds that those missing pages might come after the dance. ;)
- Why do you think some pages are not shown on the -sj server? Doesn't -sj contain the old data, too?
- If those pages are not being shown on -sj, does it mean that they have been dropped by Google, or does it mean that GoogleBot is yet to crawl those pages? Do you think GoogleBot could be crawling all those missing 3,000 pages at this very moment, so that they will start showing again after the dance?

Thanks.
Internet Master (IMaster)

takagi

11:28 am on May 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why do you think some pages are not shown on the -sj server? Doesn't -sj contain the old data, too?

The sj data center contains a mix of old data (February?) and some very new data. Some members here complained that their whole site was missing, or that only the homepage was left over. It's unclear why this mix of old and new data is used. From the remarks of GoogleGuy in messages 27 and 44 in Pre-update Part 2 [webmasterworld.com] you can understand that they are testing some new algorithm ('secret sauce'). Google could deliberately leave out some selection of the index, and add some pages they usually filter out for spamming, to prevent webmasters from comparing the current SERPs and the new SERPs (i.e. the old and the new algo). But again, no official reason is given, so we can only guess.

So what are the odds that those missing pages might come after the dance?

The only reason you have to think these 3,000 pages are missing is that downtime, isn't it? If the downtime was at the end of the deep crawl, it could mean that these missing pages will stay out for a month. Just wait for the next update (a week or so?) and you will know. At this moment you cannot do anything, and nobody can tell you for sure whether they will be missing after the update.

Do you think GoogleBot could be crawling all those missing 3,000 pages at this very moment, so that they will start showing again after the dance?

If you have log files, you can see the answer. Without a log, it's hard to say.
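To make the log check concrete: Googlebot identifies itself in the user-agent field of each request, so you can filter an Apache-style access log for its hits. A minimal sketch with made-up log lines (the IPs and paths are hypothetical):

```python
# Hypothetical sketch: scan an Apache combined-format access log for
# Googlebot requests, to see whether the "missing" pages are still
# being crawled.
sample_log = """\
66.249.64.1 - - [05/May/2003:08:12:01 +0000] "GET /widgets/green-1.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
10.0.0.5 - - [05/May/2003:08:12:05 +0000] "GET /index.html HTTP/1.0" 200 2048 "-" "Mozilla/4.0"
66.249.64.2 - - [05/May/2003:08:13:44 +0000] "GET /widgets/green-2.html HTTP/1.0" 200 4096 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
"""

googlebot_paths = []
for line in sample_log.splitlines():
    if "Googlebot" in line:
        # The quoted request looks like 'GET /path HTTP/1.0';
        # the path is its second whitespace-separated token.
        request = line.split('"')[1]
        googlebot_paths.append(request.split()[1])

print(googlebot_paths)  # ['/widgets/green-1.html', '/widgets/green-2.html']
```

Comparing the crawled paths against your list of missing pages tells you whether Googlebot is revisiting them or skipping them entirely.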

Imaster

11:45 am on May 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Takagi,

Thanks for the insight on the -sj server. I will keep an eye on my log files to see any activity.

Thanks for your superb help.

Internet Master -> who feels that it is Takagi who should be named the "Internet Master" for his enormous knowledge. ;)

hetzeld

11:52 am on May 5, 2003 (gmt 0)

10+ Year Member



Takagi > and you will only see the 150 pages (5% of 30,000) on www.google.com and maybe 135 pages (5% of '30,000-3,000') on www-sj.google.com.
---------------
Hi Takagi,

5% of 30000 makes 1500 (not 150), which is still too much for Google to report accurately as it exceeds the 1000 boundary. ;)

Dan

takagi

11:59 am on May 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



5% of 30000 makes 1500 (not 150)...

Dan, you're absolutely right. Sorry! I hope the rest of the message was clear. At least it seems Imaster was happy with it.

Imaster

8:31 pm on May 5, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Takagi,

Yes, I was "extremely" happy with your message and superb explanation of the entire scenario. I would rate that a 10/10 points. :)

Internet Master