Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Google News Publishers Help
pageoneresults




msg:3844479
 4:17 pm on Feb 7, 2009 (gmt 0)

Have any of you ever read the documentation that Google have for their other services besides search? Do you think the suggestions given for other services are to be considered when developing documents in general? Ever read the documentation on their Search Appliances?

Here are some tidbits I've added to my reference library while researching the inclusion requirements for the various services that Google have to offer. This one is specific to Google News, as the title implies.

Google News (publishers) Help
[google.com...]

Google News (publishers) Help > Technical Requirements
[google.com...]

The above will take you to the starting points for the below snippets of information...

Google News (publishers) Help > Technical Requirements: Article URLs
[google.com...]

Display a three-digit number. The URL for each article must contain a unique number consisting of at least three digits. For example, we can't crawl an article with this URL: http://www.example.com/news/article23.html. We can, however, crawl an article with this URL: http://www.example.com/news/article234.html. Keep in mind that if the only number in the article consists of an isolated four-digit number that resembles a year, such as http://www.example.com/news/article2006.html, we won't be able to crawl it.

^ Did you know that? Of course you did! ;)
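
For anyone who wants to test their own URLs against that rule, here is a minimal sketch of one possible reading of it. It is not Google's actual logic. The function names are made up for illustration, and two things are assumptions: that "resembles a year" means a four-digit number from 1900 to 2099, and that only the path and query string count, not the host name. It does not check that the number is unique across articles.

```python
import re
from urllib.parse import urlparse


def looks_like_year(number: str) -> bool:
    # Assumption: "resembles a year" means four digits in the range 1900-2099.
    return len(number) == 4 and 1900 <= int(number) <= 2099


def has_article_number(url: str) -> bool:
    """Return True if the URL holds a run of 3+ digits that is not just a year."""
    parts = urlparse(url)
    # Assumption: digits in the host name are ignored; path and query are scanned.
    tail = f"{parts.path}?{parts.query}"
    candidates = [n for n in re.findall(r"\d+", tail) if len(n) >= 3]
    return any(not looks_like_year(n) for n in candidates)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/news/article23.html",
        "http://www.example.com/news/article234.html",
        "http://www.example.com/news/article2006.html",
    ):
        print(url, has_article_number(url))
```

Run against the three example URLs from the quoted help text, this prints False, True and False, which matches what the documentation says about each.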

Google News (publishers) Help > Technical Requirements: Dynamic content
[google.com...]

Google News indexes dynamically generated webpages, including .asp, .php, and pages with question marks in their URLs. However, these pages can cause problems with our crawler, and may be ignored.

^ Ya, even after all the advances in technology, there are still challenges in this area.

Google News (publishers) Help > Technical Requirements: Forum URLs
[google.com...]

Google News is unable to include articles that are set up as posts or threads. For example, if a URL contains specifically one of these following substrings, then it will not be crawled.

And...

Please keep in mind that we're unable to include sites that don't have a formal editorial review process.

^ I didn't know that certain URI strings are off limits to the Google News Crawler. I wonder how this translates over to search?

Oh, here is an interesting one...

Google News (publishers) Help > Technical Requirements: Links to your articles

In order for our crawler to correctly gather your content, each article needs to link to a page dedicated solely to that article. We're unable to index articles from news sections which consist of one long page rather than a series of links that lead to articles on individual pages.

And...

Keep in mind that our automated system is currently best able to crawl headlines or anchors (text links such as "Full story" or "read more") that have 22 words or less.

^ 22 words or less. There are more references to that 22 word limit and another that states 2 to 22 words as a minimum/maximum for headlines and page titles!
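
A quick way to audit headlines and anchor text against those numbers is a plain word count. This is only a sketch: the function name is hypothetical, words are assumed to be whitespace-separated, and the 2 to 22 range is taken from the references mentioned above rather than from any published algorithm.

```python
def within_word_limits(text: str, min_words: int = 2, max_words: int = 22) -> bool:
    """Check a headline or anchor against the 2-to-22 word range cited above."""
    # Assumption: a "word" is any whitespace-separated token.
    word_count = len(text.split())
    return min_words <= word_count <= max_words


if __name__ == "__main__":
    print(within_word_limits("Full story"))           # two words
    print(within_word_limits("Breaking"))             # one word
    print(within_word_limits(" ".join(["word"] * 23)))  # twenty-three words
```

The three calls print True, False and False.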

Those are just a few of the things I added to my library. There are all sorts of interesting tidbits within the documentation for each of their products and/or services. There is consistency across the board with some suggestions and then there are specifics if you are targeting certain Google services such as Google News.

Do you think the written suggestions for other Google Products and Services apply to search in general? I was surprised to see min/max for headlines and titles. 22 words? Whew, that is one healthy title. We all know that Google does not stop at character 67 which is usually the max point of truncation. 22 words? What is the average word length?
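
On the average word length question, some back-of-the-envelope arithmetic, assuming single spaces between words and taking 67 characters as the cutoff: a 22-word title needs 21 spaces, which leaves 46 characters to share among the words.

```python
MAX_WORDS = 22          # word limit quoted from the Google News help pages
TRUNCATION_CHARS = 67   # usual truncation point mentioned in this post

# Assumption: exactly one space between words, no other punctuation.
spaces = MAX_WORDS - 1
avg_word_length = (TRUNCATION_CHARS - spaces) / MAX_WORDS

print(f"{avg_word_length:.2f}")
```

That prints 2.09, so a 22-word title would have to average roughly two characters per word to fit inside the usual display width.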

 

tedster




msg:3844514
 5:10 pm on Feb 7, 2009 (gmt 0)

Regular Google Search will go (or did go) beyond 22 words in the title element - at least some of the time. I know of one example from last year that ranked on the first page for a word that appeared ONLY at the end of a very long title element (over 1000 characters) - nothing on-page or in backlinks.

GrendelKhan TSU




msg:3850386
 4:06 am on Feb 16, 2009 (gmt 0)

great find. thanks! and actually I find all of that quite disturbing (but useful).

eg:
Please keep in mind that we're unable to include sites that don't have a formal editorial review process

is that even "fair", so to speak?

I mean, a lot of forums have their new thread submissions moderated before accepted. ie: editorial review. and the first post is considered the "article" and replies just comments to it.

also, does that mean that if we simply rename a sub-folder named "forums" to "community" (or whatever)...
Voilà! your threads are open for news crawls? (assuming no issue with the showthread or ?forumid)

interesting, one of my sites front page (cms) PR jumped and yet the forums for that site PR dropped to ZERO (previously 3). hmmm...

[edited by: GrendelKhan_TSU at 4:06 am (utc) on Feb. 16, 2009]

Syzygy




msg:3850949
 10:07 pm on Feb 16, 2009 (gmt 0)

Display a three-digit number. The URL for each article must contain a unique number consisting of at least three digits. For example, we can't crawl an article with this URL: http://www.example.com/news/article23.html. We can, however, crawl an article with this URL: http://www.example.com/news/article234.html. Keep in mind that if the only number in the article consists of an isolated four-digit number that resembles a year, such as http://www.example.com/news/article2006.html, we won't be able to crawl it.

Anyone know why this is so?

Syzygy

tedster




msg:3853343
 7:36 pm on Feb 19, 2009 (gmt 0)

I've always assumed it has to do with some technical requirement of the Google News back end and not thought about it any further.

Interesting to note: "this rule is waived with News sitemaps." reference [google.com] - but in a News Sitemap, you only include articles published in the past three days, so I guess that's part of the difference.

pageoneresults




msg:3853440
 9:58 pm on Feb 19, 2009 (gmt 0)

I presented the question to Vanessa Fox in a recent WebmasterRadio.FM chat. We'll see what she comes back with. I would think it is along the lines of what tedster says: a technical requirement, perhaps with space reserved internally for those 3 digits.

The isolated 4 digit number looks like a year but is very generic. I believe that is why you'll see this type of practice /2009/02/19 as a standard archiving string.

I'm interested to know the exact reason but I doubt we will ever get it in writing. It probably has to do with something top secret in the index. ;)

In following the link for Sitemaps, I caught this which is totally unrelated but I don't recall seeing a fixed number on Sitemaps.

A News sitemap can contain no more than 1,000 URLs. Your sitemap index file shouldn't list more than 1,000 sitemaps.

1,000 x 1,000 = 1,000,000 URIs
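
If you generate these files yourself, the two limits translate into a simple chunking step. The sketch below only plans how URLs would be split across sitemap files; it does not write any XML. The function and constant names are hypothetical, and the limits are the ones quoted above.

```python
from typing import List

MAX_URLS_PER_SITEMAP = 1000    # per the quoted News sitemap limit
MAX_SITEMAPS_PER_INDEX = 1000  # per the quoted sitemap index limit


def plan_news_sitemaps(urls: List[str]) -> List[List[str]]:
    """Split URLs into groups that each fit in one News sitemap."""
    chunks = [
        urls[i:i + MAX_URLS_PER_SITEMAP]
        for i in range(0, len(urls), MAX_URLS_PER_SITEMAP)
    ]
    if len(chunks) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("too many URLs for a single sitemap index")
    return chunks


if __name__ == "__main__":
    # Hypothetical article URLs, numbered 100 to 2599.
    urls = [f"http://www.example.com/news/article{n}.html" for n in range(100, 2600)]
    print([len(chunk) for chunk in plan_news_sitemaps(urls)])
```

With 2,500 URLs this prints [1000, 1000, 500], so three sitemap files would be listed in the index.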

tedster




msg:3853519
 11:30 pm on Feb 19, 2009 (gmt 0)

1,000,000 new urls in three days seems like a very generous limit ;)

Syzygy




msg:3853569
 12:20 am on Feb 20, 2009 (gmt 0)

So, it's something 'technical'. Will look forward to finding out more!

Syzygy

[edited by: tedster at 1:05 am (utc) on Feb. 20, 2009]

roseberry




msg:3856144
 9:27 pm on Feb 23, 2009 (gmt 0)

Google's news index is completely different from its web index and requires manual approval for inclusion. Because they're different, the same page from a site could be included in the web index and not the news index (most people know this, but thought I'd recap). The other big difference is how users find pages in G News. While there is a search function, mostly people view headlines by category (business, sports, health, etc.) with headlines grouped together by topic.

So two things with the # in the URL. #1 - because there are many sites that have news and non-news items, having the # helps GNews distinguish which are news stories and which aren't (they also require these types of sites to designate one or a few main news pages that link only to news stories, to make crawling and indexing easier). If a site works with GNews to submit a news-specific sitemap, these requirements no longer apply.

The other thing the date does (this can be accomplished by a dateline rather than the # in the URL, but they need to determine the date one way or another) is that if they are going to group stories, it's logical to suspect that the same topic will be grouped on more than one occasion. So if they're grouping stories on Obama's stimulus bill today - they want to show more recent stories and make sure they weed out the stories on the same topic from a month ago, or even a week ago. Having a numerical date just helps them group things better and provide more relevant results.

Obviously, webmasters can run into trouble if they aren't aware of these caveats, but there are rules and specifics in web search that require knowledge and forethought too.

nealrodriguez




msg:3901779
 7:18 pm on Apr 27, 2009 (gmt 0)

1,000,000 new urls in three days seems like a very generous limit

indeed; some of the biggest sites with which i have dealt, i think, publish about 150 articles a day, if the estimate i was given was correct.

this is also interesting to note; although google has said that they have been able to read javascript for quite some time, they suggest the following:

Google News doesn't accept articles embedded in JavaScript because they sometimes display different content for users (who see the JavaScript-based text) than for search engines (which see the no script-based text).

[google.com...]

and

Google News does not recognize or follow Flash, graphic/image or JavaScript links which link to articles. Our automated crawler is best able to crawl plain text HTML links.

[google.com...]
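
one rough way to see which links on a section page are plain html links is to parse the markup and skip anything that depends on javascript. this is only a sketch using python's standard html.parser; the class and function names are made up, and treating "javascript:" and "#" hrefs as not followable is my assumption, not something google documents. it does not detect flash or image-only links.

```python
from html.parser import HTMLParser
from typing import List


class PlainLinkCollector(HTMLParser):
    """Collect href values from <a> tags that do not rely on JavaScript."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        target = href.strip().lower()
        # Assumption: javascript: pseudo-links and bare "#" anchors are skipped.
        if target.startswith("javascript:") or target.startswith("#"):
            return
        self.links.append(href)


def plain_links(html: str) -> List[str]:
    collector = PlainLinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    sample = (
        '<a href="/news/article234.html">Full story</a>'
        '<a href="javascript:openStory(235)">Full story</a>'
        '<a href="#" onclick="openStory(236)">Full story</a>'
    )
    print(plain_links(sample))
```

on the sample markup only the first link, /news/article234.html, is returned.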

they aren't breakthrough discoveries, but g did give us another peek at their hand.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved