The story about those Google SE results at the end.

What is the deal with these mini results?

Paully

5:17 pm on Sep 26, 2002 (gmt 0)

Can anyone explain why some Google SE results end up at the end of the listings with no descriptions? Thanks in advance.

ciml

5:27 pm on Sep 26, 2002 (gmt 0)

If Google finds a URL from another page, but doesn't get time to index it (or if it's /robots.txt protected), then it just lists the URL as the title, with no snippet.

If it's a new listing, it'll probably get crawled. Otherwise it may not have enough PageRank to convince Google to fetch it.

Paully

5:43 pm on Sep 26, 2002 (gmt 0)

:( Most of my $$ pages are listed this way for some reason. I guess I will look into creating a new sitemap.

rmjvol

6:10 pm on Sep 26, 2002 (gmt 0)

> or if it's /robots.txt protected

Hadn't heard that reason before. It doesn't make sense that G would list links to a page that I told them to stay away from. Guess I'll dig around a little more if you're sure about that.

Paully

7:49 pm on Sep 26, 2002 (gmt 0)

Has anyone noticed this phenomenon happening more in this update?

Also, does it take most people two cycles to get fully indexed? Or do new pages get fully indexed from crawl to update in the same month?

ciml

12:50 pm on Sep 27, 2002 (gmt 0)

rmjvol:
> It doesn't make sense that G would list links to a page that I told them to stay away from.

To be honest, it's not something I check often so if Google's changed I may not have noticed.

Robots Exclusion Protocol asks robots not to fetch the resource. Listing the URL is another matter. I remember GoogleGuy mentioning something about the META robots tag removing the whole listing if set to "noindex". Has anyone here tested that?
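To make the fetch-versus-list distinction concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the example.com URLs, and the `may_fetch` helper are all made up for illustration. The check only answers "may this robot fetch the URL?"; it says nothing about whether an engine may list a URL it discovered through links.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not from this thread).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant robot would skip the first URL and fetch the second.
print(may_fetch(ROBOTS_TXT, "ExampleBot", "http://www.example.com/private/contact.html"))
print(may_fetch(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html"))
```

Nothing in this check stops an engine from recording the disallowed URL when it sees a link to it on some other page, which is the behavior being discussed here.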

Paully:
> ...does it take most people two cycles to get fully indexed? Or new pages get perfectly indexed from crawl to update in the same month?

I prefer to think in terms of two updates. If your main links have high PageRank, then you are unlikely to get them credited from a late crawl.

For example, if you put up a new page last week, and linked it from a PR7 page, then Google will probably have included the page by now, but I would expect it to get its PR6 (and better rankings) at the next update.

Note that as Google gets fresher, this may change.

Paully

6:40 am on Oct 1, 2002 (gmt 0)

Thanks for the good info ciml.

Anyone else have any more info on the results with no description?

I should have made that the title of this thread, would have brought more traffic, lol. Hey, I'm new to this SEO thing. :)

jdMorgan

5:13 pm on Oct 1, 2002 (gmt 0)

ciml,

> Robots Exclusion Protocol asks robots not to fetch the resource. Listing the URL is another matter. I remember GoogleGuy mentioning something about the META robots tag removing the whole listing if set to "noindex". Has anyone here tested that?

Yes, and it doesn't work in a practical application. At least it doesn't work if you disallow a page in robots.txt and put the "noindex,nofollow" robots metatag on the page.

I would like to try removing the disallow in robots.txt, leaving only the noindex metatag, but then that creates a catch-22: What about other search engines that don't recognize the on-page noindex robots metatag?

And it's really a double catch-22: How can Googlebot know about the on-page metatag if the page is disallowed in robots.txt? It can't.
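The catch-22 can be sketched as a toy model of a compliant crawler. This is an assumption-laden illustration, not Google's actual pipeline: the `listing_for` function, the `"url-only"` / `"omitted"` / `"full"` labels, the `ExampleBot` user agent, and the example.com page are all hypothetical. It assumes the engine drops a page whose META robots tag says "noindex", as GoogleGuy is reported to have mentioned, and lists a bare URL for anything it is not allowed to fetch.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def listing_for(url, robots_txt, pages, user_agent="ExampleBot"):
    """Toy model: decide how a compliant engine would list url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # The page is never fetched, so its META tag is never seen.
        return "url-only"
    meta = RobotsMetaParser()
    meta.feed(pages[url])
    if "noindex" in meta.directives:
        return "omitted"
    return "full"


URL = "http://www.example.com/contact.html"
PAGES = {URL: '<html><head><meta name="robots" content="noindex,nofollow">'
              "</head><body>Contact</body></html>"}

# Disallowed in robots.txt: the noindex tag on the page is invisible.
print(listing_for(URL, "User-agent: *\nDisallow: /contact.html\n", PAGES))
# Not disallowed: the page is fetched and the noindex tag takes effect.
print(listing_for(URL, "User-agent: *\nDisallow:\n", PAGES))
```

In this model the first call yields the bare-URL listing and the second yields no listing at all, so combining the disallow with the metatag can never produce the second outcome.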

So, regardless of what the letter of the RFC says, I think Google should not list links to robots.txt-disallowed pages. Otherwise, it just doesn't work practically. I'd prefer they not mention disallowed pages in their index at all, since this is part of my spam-avoidance strategy. There are other engines that do, and other engines that don't. I'd prefer that they don't, but because of the double catch-22 and the RFC, I've just learned to live with it.

Jim

WebGuerrilla

5:34 pm on Oct 1, 2002 (gmt 0)

By far my number one gripe with Google. Not only will they list the URLs of excluded pages in their db, they will also return those URLs as matches for searches if there is enough supporting link text.

ciml

5:52 pm on Oct 1, 2002 (gmt 0)

That's the dilemma, Jim. Use /robots.txt and have the URL listed in Google, or use META noindex and risk some other bot indexing it?

My (unpopular) view is that what we publish to the Web is public. /robots.txt is a nice convention for asking search engines not to bother indexing spider traps (session IDs, etc.), but not useful from a privacy or security point of view. Please keep in mind that my view derives from the days when the 'webmaster' was the server admin, not the document author.

Spam avoidance is a problem, but the spam bots will wilfully ignore your /robots.txt and META robots.

<added>
WebGuerrilla, this used to be a major feature; I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" discusses it. It's probably good that they stopped listing email addresses though :)

jdMorgan

6:31 pm on Oct 1, 2002 (gmt 0)

ciml,

> robots.txt is ... not useful from a privacy or security point of view.
> Spam avoidance is a problem, but the spam bots will wilfully ignore your /robots.txt and META robots.

Yes, I understand this - and as I said, have learned to live with it. I'm not concerned about spambots on my sites here, I'm concerned about making it easy for any kind of harvester - robotic or human - using Google Search to find "concentrated spambot food."

There are the "Contact WidgetUsers.org!" front pages, which lead to e-mail contacts and phone numbers - I'd like those URLs kept out of the index, even though they are now forms-based.

There are also some pages on my sites which are not very useful outside the context of the pages which link to them. I know this from confused feedback from searchers who have entered those pages directly, and I am trying to use a totally objective assessment of their out-of-context worthlessness here. I'd like them kept out of search results in order to improve the user experience (where've I heard that before?). But this is a secondary concern.

I've been on the 'net long enough to remember the academic goal of a totally open information space. Most of my sites fit that model; most are non-commercial, information-source-type sites (which never have any problems with placement in Google, so I never have to watch the update threads). On the other hand, the UCE interests now on the web couldn't give a rat's behind about making information freely available for the good of all mankind. All they want is 5 million e-mail addresses to sell on a CD to all buyers.

So, in my opinion, Google should bend the robots RFC just a little, and do what some other SEs do: not list URLs which go to pages disallowed in robots.txt.

It's just a wish. I hope they will re-consider the trade-offs between strict compliance with the robots standard, and the real world concerns of spam and abuse of resources. But I have learned to live with their policy as it is.

Jim