Forum Moderators: Robert Charlton & goodroi
As for the order of the pages being included with title and description, there seems to be no rhyme or reason. Some of my pages that date back over a year still have no title or description, some new pages get picked up in a matter of days. I have pounded my head against the wall trying to figure this one out.
Go Figure...
I have pages that took months to show the title and description ... pages seem to get their title and description at a rate of one to three pages per week ... I have pounded my head against the wall trying to figure this one out.
The tentative upshot is that a specific URL needs to be crawled 3 times before it will change from URL-only + no cache to Title + Description. The waters of this are muddied, however, by the existence of a sleuth-bot (identified in the referer string by
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") which operates under HTTP/1.1 and does not seem to count towards the total of hits, whereas the common police-bots (IDed by
"Googlebot/2.1; (+http://www.google.com/bot.html)and HTTP/1.0) *do* count towards the 3-hits-and-you-are-in.
If accurate, this explanation would account for the apparent G bias towards long-established sites, the so-called "Sandbox", and the URL-only issue.
I have piqued mine own interest. The original research was possible because my former hosts had, by their behaviour, driven my site into the hands of a former colleague. I had the time whilst my server was uprooted to gather the info. That server + site transfer has literally just been completed, but I shall take a couple of hours to update the previous info from the logs, and report back.
I shall take a couple of hours to update the previous info from the logs
The PRE code does not work properly on this forum, so the following is not so easy to read, but here is a synopsis of the results:
Site-Hits by the GBot, 30/Jan/2005:04:02:26 - 17/Jun/2005:07:00.
(for 1st 20 results on site:my-site.com SERPs, May 17)
..............................................................................*..........
.................................*........................*...................*..........
..................*....*.........*........................*...................*..........
........x....x....*....*....x....*........................*...................*..........
.......01...02...03...05...06...07...08...09...11...12...13...14...15...16...17...19...20
13.Jun...........................G..............M....M....G....M..............M..........
06.Jun............G....MMG.......M....M....MM...M....M....M....M..............M....M....M
30.May...........................G........................G...................G..........
23.May....................................................G...................G..........
16.May......................G.............................G..............................
09.May...........................G.............................M..............G..........
02.May..M....M....M....M..............M....M....M....M.........M....M....M....G....M....M
25.Apr......................G....G........................G..............................
18.Apr......................M....G........................G...................G..........
.
28.Feb...........................G........................G...................G..........
21.Feb......................G.........M.......................................M....M.....
14.Feb..G....M..............G....G.........M....M.........G....M....M....M....G.........M
07.Feb...........................G...................G....G.........M.........G..........
31.Jan......................G.......................................G....................
.
Notes:
[no available logs between 03/Mar/2005:09:25:10 and 17/Apr/2005:04:03:48]
.
**** = Title + desc May 17, now URL only
*** = Title + desc May 17 + Jun 17
** = Title + desc Jun 17, prev url-only
x = not in first 100 results
all others are still url-only
.
G = 1 x visit from standard GBot: HTTP/1.0 Googlebot/2.1 (+http://www.google.com/bot.html)
M = 1 x visit from Mozilla GBot: HTTP/1.1 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
.
URLs lost on hits #4, 10 + 18.
Currently, 16 of 1st 20 and 83% of 1st 100 SERPs are url-only.
On 17 May, 17 of 1st 20 and 87% of 1st 100 SERPs were url-only.
It only takes one hit from a standard GBot for the title + snippet to appear. However, that ignores the effects of the Mozilla GBot: the snippet-eater, the roast-my-site-3-times-a-second GBot.
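For anyone wanting to separate the two beasts in their own logs, here is a minimal sketch. The field layout is an assumption (any combined-style log line will do), and the rule follows the distinction above: the standard GBot requests under HTTP/1.0 with a bare Googlebot/2.1 user-agent, whilst the Mozilla GBot requests under HTTP/1.1 with a Mozilla/5.0-prefixed user-agent.

```python
def classify_gbot(line):
    """Return 'G' for the standard GBot, 'M' for the Mozilla GBot, else None.

    Assumes the raw access-log line contains both the request's HTTP
    version and the user-agent string, as in Apache's combined format.
    """
    if "Googlebot" not in line:
        return None  # not Google traffic at all
    if "Mozilla/5.0" in line and "HTTP/1.1" in line:
        return "M"   # the HTTP/1.1 sleuth-bot / snippet-eater
    if "HTTP/1.0" in line:
        return "G"   # the standard police-bot
    return None

standard = '66.249.64.4 "GET /robots.txt HTTP/1.0" "Googlebot/2.1 (+http://www.google.com/bot.html)"'
mozilla = '66.249.65.232 "GET /robots.txt HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
print(classify_gbot(standard), classify_gbot(mozilla))  # G M
```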
I now need to get back to my normal work.
As for the order of the pages being included with title and description, there seems to be no rhyme or reason. Some of my pages that date back over a year still have no title or description, some new pages get picked up in a matter of days. I have pounded my head against the wall trying to figure this one out.
I'm right there with you webdude. I think I dented the wall the other day.
The Google SiteMap so far has not shown much improvement for my site in our test. I put up a SiteMap with 1500 URLs, 500 each of: unindexed, partially indexed (URL only), and fully indexed pages. So far the main bot activity (98%) is on pages already fully indexed.
Regardless of this test, we will be rolling out a full SiteMap for our full site. Hopefully it will help, since our main problem seems to be Google not crawling deep enough to find all of our forum-style content.
The url appeared in the SERPS, but that was all, as Google had no other information than just the url.
Once the bot(s) were allowed back in, the placement reappeared, along with the corresponding description and title.
I can't say that this would be your case, but, it was ours.
Hope this helped.
I find that very interesting. I am going to give it a whack. While I have lost nothing in Bourbon, I have many pages that are URL only, about 50%. I would really like to get them listed with title and description. The bot is taking forever to include these pages in the index. It seems that Google SiteMap is taking the load off of the first googlebot that lists URLs. Could be that Google SiteMap is telling the second bot to crawl the links provided (the bot that usually adds title and desc.).
Where is this report from (the one you posted on the previous page)?
It is (1) grepping the apache access-logs and (2) manual hard work.
Have a look at msg #:59+60 [webmasterworld.com] - I give actual examples of the commands used.
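For anyone who wants to reproduce the tally without all the manual hard work, here is a rough Python sketch. The combined-log layout, the sample lines, and the file names are all assumptions, not the actual commands from the messages referenced above; adjust the regex to your own log format.

```python
import re
from collections import Counter

# Rough sketch: count Googlebot hits per URL from Apache access-log lines.
# Matches the request portion of a combined-format log line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/1\.[01]"')

def googlebot_hits_per_url(lines):
    """Tally hits per requested URL for lines with a Googlebot user-agent."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue  # skip non-Google traffic
        match = REQUEST_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits

# Made-up sample lines in Apache combined format:
sample = [
    '66.249.64.4 - - [02/May/2005:08:00:00 +0100] "GET /page1.html HTTP/1.0" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '66.249.65.232 - - [02/May/2005:08:00:05 +0100] "GET /page1.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [02/May/2005:08:00:09 +0100] "GET /page2.html HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
]
print(googlebot_hits_per_url(sample))  # Counter({'/page1.html': 2})
```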
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
--------------------------------------------------------------------------------
A semi colon character was expected. Error processing resource 'file:///C:/Documents and Settings/Administrator/Desktop/sitemap.xml'. Line 142, Position 99
<loc>http://examplesite.com/MyState-Widget-Forum.taf?_function=detail&ForumMasterThreads_uid1=454&start=1</loc>
--------------------------------------------------------------------------------------------------^
There seems to be a problem at the = sign in "_uid1=454"
I have looked at this entry and I don't get it.
Needless to say, this is bombing the G SiteMaps. I would really like to figure it out. Any clues?
Anyway, a word of caution on some of the on-line, auto-generating programs for use in the G SiteMaps. It appears that XML cannot use a raw & in the <loc>. After much digging, I found that all & characters must be changed to &amp;, otherwise the xml file will error out. In other words, the URL must conform to RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt).
It seems some of the on-line programs are not replacing these characters.
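For what it's worth, the escaping is a one-liner in Python's standard library. This is just an illustrative sketch; the URL below is made up, not one of the pages discussed above.

```python
from xml.sax.saxutils import escape

# Illustrative only: escape() converts raw &, < and > characters into
# their XML entities, so the <loc> entry becomes well-formed XML.
url = "http://examplesite.com/Forum.taf?_function=detail&uid1=454&start=1"
loc = "<loc>%s</loc>" % escape(url)
print(loc)
# <loc>http://examplesite.com/Forum.taf?_function=detail&amp;uid1=454&amp;start=1</loc>
```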
More Info....
Note: All data values, including URLs, in your Sitemap files must be XML-encoded. The chart below lists the characters with their corresponding encoded values. You can use either the entity or the character code to XML-encode a character. Please see the FAQ for more information about XML encoding.
Character .......... Entity .. Character Code
Ampersand (&) ...... &amp; ... &#38;
Single Quote (') ... &apos; .. &#39;
Double Quote (") ... &quot; .. &#34;
Greater Than (>) ... &gt; .... &#62;
Less Than (<) ...... &lt; .... &#60;
Anyway, I learned the hard way. We'll see if the new xml file works.
Sorry to disagree just a tad with you here. I have pages that are generated from a forum. There are hundreds of these pages, some listed with full title and description, some as just URL. In both cases, at least in my case, age of the page seems to have nothing to do with whether or not the title/description gets displayed. I have pages that have been URL only for the past 6 months. Every once in a while, one of those will go to title/description in the SERPs. Some pages get picked up right away and display title/description. Some go URL only for a couple of days then go to title/description. Some it takes weeks -- some it takes months.
What I am seeing on this particular site is that there seems to be no rhyme or reason to the way/how/why some pages get picked up with title/description while some don't. Nor does it make sense to me the time frame it takes to get these listed correctly.
Now I know some of you are going to say that this is the dupe content filter/penalty/yada yada that is causing this, and that may be true, but I am still confused as to the way/how/why. These pages all have different titles, descriptions, text and links on them. Of course they are templated, but why some and not others? There is a lot of valuable info there for the bot to see.
Anyway... I have successfully had the XML file downloaded and acknowledged by G yesterday and will see what happens. There seems to be more bot activity on the site right now. I'll wait a few days and see how it goes.
I have had many pages re-indexed since submitting a sitemap, however their position in the SERPs is way down. Several remain URL only.
All pages dropped a point in Page Rank over the period as well, the site now has no "similar pages" shown in the SERP listings, backlinks are way down. Hurt, hurt! But not a simple diagnosis.
I tend to believe that this symptom is due to inadequate PR or inbound links. Googlebot may appear to crawl every URL, but it indexes only some portions with title/description. In my case of URL only, this happens when pages are too many levels deep from Googlebot's entry points OR there are too many links on the hub page.
That's an interesting thought and correct too. The pages I am referring to do get updated a lot. Some are old pages that haven't seen updates in months though. That is the "no rhyme or reason" of my previous post. Others are very recent that show the title and description and are still being updated -- go figure.
As for SiteMaps, I did get a complete crawl and now this is very interesting. It picked up almost all of the pages with the title and description -- Yippie Skippie! -- but now, the other url-only pages are still there too. So now I have 2 links to every page for about half of my forum.
mmmmm, I wonder if this is going to trip a duplicate content penalty?
Suggestions?
..SiteMaps, I did get a complete crawl ... picked up almost all of the pages with the title and description
This is an edited sample from a recent access log of the 2 different beasts:
# fgrep 'GET /robots.txt' /var/log/httpd/access_log* | less
(the first is the 'good' bot, whilst the second is the HTTP/1.1 snippet-eater.)
66.249.64.4 "GET /robots.txt HTTP/1.0" "Googlebot/2.1 (+http://www.google.com/bot.html)"
66.249.65.232 "GET /robots.txt HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
A very quick sampling shows accesses from this latter on: