Forum Moderators: Robert Charlton & goodroi
Here are some little tidbits I've added to my reference library while researching inclusion in the various services Google has to offer. This one is specific to Google News, as the title implies.
Google News (publishers) Help
[google.com...]
Google News (publishers) Help > Technical Requirements
[google.com...]
The links above will take you to the starting points for the snippets of information below...
Google News (publishers) Help > Technical Requirements: Article URLs
[google.com...]
Display a three-digit number. The URL for each article must contain a unique number consisting of at least three digits. For example, we can't crawl an article with this URL: http://www.example.com/news/article23.html. We can, however, crawl an article with this URL: http://www.example.com/news/article234.html. Keep in mind that if the only number in the article consists of an isolated four-digit number that resembles a year, such as http://www.example.com/news/article2006.html, we won't be able to crawl it.
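That quoted rule is mechanical enough to sketch as a quick check. This is just my interpretation of the wording above, not Google's actual crawler logic; in particular, the 1900-2099 "looks like a year" range is my assumption, since the docs only say the number "resembles a year":

```python
import re

def news_crawlable(url):
    """Rough sketch of the quoted article-URL rule: the URL must contain
    a number of at least three digits, and a lone four-digit number that
    resembles a year doesn't count by itself."""
    numbers = re.findall(r"\d+", url)
    qualifying = [n for n in numbers if len(n) >= 3]
    if not qualifying:
        return False  # no number of at least three digits
    # If every qualifying number is a 4-digit year-like value,
    # the docs say the article won't be crawled.
    if all(len(n) == 4 and 1900 <= int(n) <= 2099 for n in qualifying):
        return False
    return True

print(news_crawlable("http://www.example.com/news/article23.html"))    # False
print(news_crawlable("http://www.example.com/news/article234.html"))   # True
print(news_crawlable("http://www.example.com/news/article2006.html"))  # False
```

The three example URLs from the quote come out exactly as the docs describe.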
^ Did you know that? Of course you did! ;)
Google News (publishers) Help > Technical Requirements: Dynamic content
[google.com...]
Google News indexes dynamically generated webpages, including .asp, .php, and pages with question marks in their URLs. However, these pages can cause problems with our crawler, and may be ignored.
^ Ya, even after all the advances in technology, there are still challenges in this area.
Google News (publishers) Help > Technical Requirements: Forum URLs
[google.com...]
Google News is unable to include articles that are set up as posts or threads. For example, if a URL specifically contains one of the following substrings, it will not be crawled.
And...
Please keep in mind that we're unable to include sites that don't have a formal editorial review process.
^ I didn't know that certain URI strings are off limits to the Google News Crawler. I wonder how this translates over to search?
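The actual substring list isn't quoted above, so purely as an illustration of how a substring-based filter would behave, here is a sketch with hypothetical placeholders. The substrings below are my assumptions, not Google's real list:

```python
# Hypothetical illustration only: Google's actual substring list is not
# quoted above. These example substrings are placeholder assumptions.
BLOCKED_SUBSTRINGS = ["forum", "thread"]

def looks_like_forum_url(url):
    """Return True if the URL contains any (assumed) blocked substring."""
    return any(s in url.lower() for s in BLOCKED_SUBSTRINGS)

print(looks_like_forum_url("http://www.example.com/forum/showthread.php?t=5"))
print(looks_like_forum_url("http://www.example.com/news/article234.html"))
```

A filter this simple would indeed be sidestepped by renaming a path segment, which is what the rename-to-"community" question later in the thread is getting at.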
Oh, here is an interesting one...
Google News (publishers) Help > Technical Requirements: Links to your articles
In order for our crawler to correctly gather your content, each article needs to link to a page dedicated solely to that article. We're unable to index articles from news sections which consist of one long page rather than a series of links that lead to articles on individual pages.
And...
Keep in mind that our automated system is currently best able to crawl headlines or anchors (text links such as "Full story" or "read more") that have 22 words or less.
^ 22 words or less. There are more references to that 22 word limit and another that states 2 to 22 words as a minimum/maximum for headlines and page titles!
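A quick sketch of that 2-to-22-word headline/anchor range as a check. How Google actually tokenizes words is unknown; this just splits on whitespace:

```python
def headline_ok(text, min_words=2, max_words=22):
    """Check an anchor or headline against the quoted 2-22 word range.
    Splitting on whitespace is an assumption about how words are counted."""
    n = len(text.split())
    return min_words <= n <= max_words

print(headline_ok("Full story"))  # True  (2 words)
print(headline_ok("More"))        # False (1 word)
```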
Those are just a few of the things I added to my library. There are all sorts of interesting tidbits within the documentation for each of their products and/or services. There is consistency across the board on some suggestions, and then there are specifics if you are targeting certain Google services such as Google News.
Do you think the written suggestions for other Google Products and Services apply to search in general? I was surprised to see min/max for headlines and titles. 22 words? Whew, that is one healthy title. We all know that Google does not stop at character 67 which is usually the max point of truncation. 22 words? What is the average word length?
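On the average-word-length question, a back-of-envelope answer, assuming roughly 5 letters per English word (a common rule-of-thumb estimate, not from the Google docs) plus one space between words:

```python
AVG_WORD_LEN = 5  # rough average English word length (assumption)
words = 22
chars = words * AVG_WORD_LEN + (words - 1)  # letters plus separating spaces
print(chars)  # 131
```

That's about double the ~67-character point where titles usually truncate in the SERPs, which is what makes the 22-word ceiling look so roomy.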
eg:
Please keep in mind that we're unable to include sites that don't have a formal editorial review process
I mean, a lot of forums have their new thread submissions moderated before they're accepted, i.e. editorial review, and the first post is considered the "article" with replies just comments on it.
also, does that mean that if we simply rename a sub-folder named "forums" to "community" (or whatever)...
Voila! your threads are open for news crawls? (assuming no issue with the showthread or ?forumid)
interesting, one of my sites' front page (CMS) jumped in PR and yet the forums for that site dropped to ZERO (previously 3). hmmm...
[edited by: GrendelKhan_TSU at 4:06 am (utc) on Feb. 16, 2009]
Display a three-digit number. The URL for each article must contain a unique number consisting of at least three digits. For example, we can't crawl an article with this URL: http://www.example.com/news/article23.html. We can, however, crawl an article with this URL: http://www.example.com/news/article234.html. Keep in mind that if the only number in the article consists of an isolated four-digit number that resembles a year, such as http://www.example.com/news/article2006.html, we won't be able to crawl it.
Anyone know why this is so?
Syzygy
Interesting to note: "this rule is waived with News sitemaps." reference [google.com] - but in a News Sitemap, you only include articles published in the past three days, so I guess that's part of the difference.
The isolated 4 digit number looks like a year but is very generic. I believe that is why you'll see this type of practice /2009/02/19 as a standard archiving string.
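A /YYYY/MM/DD archive path like that carries an unambiguous date rather than a lone year-like number. As a sketch (the pattern is mine, not anything Google publishes), pulling the date back out of such a URL is trivial:

```python
import re

# A dated archive path like /2009/02/19/ encodes the publication date
# unambiguously, unlike an isolated four-digit number in a filename.
DATE_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

m = DATE_PATH.search("http://www.example.com/2009/02/19/some-story.html")
print(m.groups())  # ('2009', '02', '19')
```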
I'm interested to know the exact reason but I doubt we will ever get it in writing. It probably has to do with something top secret in the index. ;)
While following the link for Sitemaps, I caught this, which is totally unrelated, but I don't recall seeing a fixed number for Sitemaps before.
A News sitemap can contain no more than 1,000 URLs. Your sitemap index file shouldn't list more than 1,000 sitemaps.
1,000 x 1,000 = 1,000,000 URIs
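The two caps are from the quoted docs; wrapping them in a sanity check one might run against one's own sitemap setup is my own addition:

```python
MAX_URLS_PER_NEWS_SITEMAP = 1000  # cap quoted from the docs above
MAX_SITEMAPS_PER_INDEX = 1000     # cap quoted from the docs above

def within_limits(urls_per_sitemap, sitemap_count):
    """Return True if a News sitemap setup stays inside both quoted caps."""
    return (urls_per_sitemap <= MAX_URLS_PER_NEWS_SITEMAP
            and sitemap_count <= MAX_SITEMAPS_PER_INDEX)

# Theoretical ceiling across a full sitemap index:
print(MAX_URLS_PER_NEWS_SITEMAP * MAX_SITEMAPS_PER_INDEX)  # 1000000
print(within_limits(1000, 1000))  # True
print(within_limits(1001, 1))     # False
```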
So, two things about the number in the URL. First, because many sites have both news and non-news items, the number helps GNews distinguish which are news stories and which aren't (they also require such sites to designate one or a few main news pages that link only to news stories, to make crawling and indexing easier). If a site works with GNews to submit a news-specific sitemap, these requirements aren't necessary.
The other thing the date does (it can come from a dateline rather than the number in the URL, but they need to determine the date one way or another) is help with grouping. If they group stories, it's logical to suspect the same topic will be grouped on more than one occasion. So if they're grouping stories on Obama's stimulus bill today, they want to show the more recent stories and weed out stories on the same topic from a month ago, or even a week ago. Having a numerical date just helps them group things better and provide more relevant results.
Obviously, webmasters can run into trouble if they aren't aware of these caveats, but there are rules and specifics in web search that require knowledge and forethought too.
1,000,000 new urls in three days seems like a very generous limit
indeed; some of the biggest sites i've dealt with publish about 150 articles a day, if the estimate i was given was correct.
this is also interesting to note; although google has said it has been able to read javascript for quite some time, they state the following:
Google News doesn't accept articles embedded in JavaScript because they sometimes display different content for users (who see the JavaScript-based text) than for search engines (which see the non-JavaScript-based text).
[google.com...]
and
Google News does not recognize or follow Flash, graphic/image or JavaScript links which link to articles. Our automated crawler is best able to crawl plain text HTML links.
[google.com...]
they aren't breakthrough discoveries, but g did give us another peek at their hand.