After many months of fruitless waiting, crossed fingers, etc., and a number of appeals to the Gplex following GGuy's posts in these fora, I finally got a reply as to why a site dropped out of the index and hasn't reappeared.
According to the good G help folk, the site is "partially indexed", i.e. URL only, because the crawlers don't have enough information about the site to provide a title and description.
Well, I have to say, this is a new one on me. I double-checked robots.txt to make sure I didn't accidentally block Gbot, validation is A-OK (many, many, many checks!), and so are the other usual checklist items.
Now my question: if the bot can crawl the site sufficiently to put the URL in the index, why isn't the rest of the meta there? Or does this mean that Gbot will return soon (please please please ...) and complete the indexing job?
BTW, I did a search to check that another thread didn't already cover this, and didn't find any G refs.
Feedback much appreciated.
Hooroo
JP
There are a few reasons for 'URL-only' listings, including these:
* /robots.txt forbids indexing or is malformed.
In the past, returning 403 Forbidden blocked Google too, but some servers erroneously use it as the default for 'not found', so Google relaxed their treatment.
* The page is linked well enough for Google to see it, but not well enough for Google to bother fetching it.
If the PageRank is very low, or if it takes a lot of clicks back to get out of the domain it's on, then Googlebot may not crawl it.
If the URL has a "?" in it, and especially if it has a few "&" characters, then it is less likely to be fetched (quickly). This isn't a barrier, however: get a high PR link and a /foo.cgi?a=1&b=2&c=3&d=4 URL should be crawled.
Any CGI parameter called 'id' should be avoided, e.g. /bar.cgi?id=50
* The page is broken. (i.e. Google doesn't find the title or content)
Also, empty pages consisting only of frames, JavaScript, Flash, images, etc. can be a problem. Anyone following W3C's Web Content Accessibility Guidelines should have no problem.
* There are connectivity problems.
Your server doesn't need to be down for Google to fail to reach it. If there's a problem at some point between your server and their crawler then expect a 'URL-only' listing.
It would be great if we had some way to check. Maybe Google translate or a WAP proxy would be close enough, but I think Google are crawling from different places these days, so there may not be a reliable test.
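For anyone who wants to run the first two checks from this list offline, here is a minimal Python sketch. It only uses the standard library (urllib.robotparser and urllib.parse). The function names googlebot_allowed and url_warnings are made up for illustration, and the URL warnings simply encode the rules of thumb from this post; they are not anything Google has published.

```python
from urllib.parse import urlsplit, parse_qs
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url):
    """Return True if the given robots.txt text lets 'Googlebot' fetch url.

    This reflects how Python's parser reads the file, which may differ
    from how Googlebot itself interprets a malformed robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


def url_warnings(url):
    """Flag the URL traits mentioned in the list (heuristics only)."""
    warnings = []
    query = urlsplit(url).query
    if query:
        params = parse_qs(query, keep_blank_values=True)
        warnings.append("query string with %d parameter(s)" % len(params))
        if "id" in params:
            # Rule of thumb from this thread, not a documented Google rule.
            warnings.append("parameter named 'id'")
    return warnings
```

Paste the contents of your own /robots.txt into googlebot_allowed along with the URL that shows up as URL-only, and pass the same URL to url_warnings. An empty list from url_warnings just means none of the traits above were spotted; it says nothing about PageRank or connectivity.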
For my site, robots.txt is blank, PR is a 6, the URL doesn't have a "?" in it, the meta tags are normal and work fine with other engines, and it doesn't use frames/Flash/etc. The server has its occasional connectivity problems, but very occasional and certainly not when Googlebot has come visiting, and Google seems to have no problem indexing other sites on the same server (with different IP addresses).
I'd love to know what other conditions might cause a URL-only listing.
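One cheap way to rule out the "page is broken" case from the list is to see what a plain parser finds for the title and meta description. The sketch below uses Python's standard html.parser. The HeadInfo class and head_info function are hypothetical names, and this only shows what Python's parser sees, which may not match what Googlebot's parser sees.

```python
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Collect the <title> text and the meta description, if present."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def head_info(html):
    """Return (title, description); title is '' and description None if absent."""
    parser = HeadInfo()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description
```

Feed it the raw source of the affected page (saved from "view source", not a browser's rendered copy). If it comes back with an empty title and no description, the page markup is worth a closer look; if both are found, the cause is probably elsewhere on the list.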
Thanks for all the feedback. I've been away from my PC for 72 glorious hours (out of town ... great excuse to unplug totally ;-) ).
ciml, interesting list. You might want to add the following DW-generated DTD for Transitional XHTML to the list.
<?xml version="1.0" encoding="iso-8859-1"?>
<!doctype html public "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
I asked in other posts last year whether that might be the reason, and was assured by those more experienced than I that it was unlikely, because they had XHTML-coded pages which ranked fine. But I have another site that I decided to update to Transitional XHTML code, and it also doesn't appear in G results.
Previously it had no DTD at all, just good old-fashioned lean, clean, validated HTML 3.2 code, and it ranked well.
Both of the sites concerned index and rank fine on other search engines, but are nowhere to be seen on G, except as "partially indexed" URL-only listings.
I'd be interested to hear if anyone else has noticed something similar with the Transitional XHTML DTD on IIS servers. If I have this trouble with G indexing Transitional XHTML, I'm sure not updating any other pages to the XHTML standard until I get a good reason why this should be so, or other search engines start making significant inroads on G's predominance in the search place!
Cheers and hooroo
JP