Bug, filter or something else?

doc_z

7:43 pm on May 3, 2004 (gmt 0)

I'm seeing strange results when searching for some keyword combinations. When I search for word1 word2, I get numerous results. The first one is from companyXY (www.companyxy.com/directory/page.html). If I add the company name (i.e. when I search for word1 word2 companyxy), the results from that domain are completely removed from the SERPs, although the company name appears not only in the domain name but also in the title and the text. There are only a few results left, and none are from www.companyxy.com. That's strange enough, but it gets even more curious: when I repeat the search by clicking on "repeat the search with the omitted results included", I get 48 results from www.companyxy.com but none from other domains ('Results 1 - 48 of about 4,500 for ...').

Is this a bug? Is someone else seeing similar things?

kaled

11:59 pm on May 3, 2004 (gmt 0)

Sounds like Florida. It has variously been called an over-optimisation filter or a Bayesian spam filter; I called it a dynamic spam filter, since the definition of spam seems to be dependent on the search terms.

The repeat-search issue you described is new to me, but Google is buggy.

Kaled.

doc_z

5:02 am on May 4, 2004 (gmt 0)

If this is an over-optimization filter and it is triggered by your own company name, I would call it a bug. Also, only a few pages from that domain are optimized; most of them are not. And of course, none of the pages were optimized for the company name. Finally, I wouldn't expect that only the results which were previously filtered out are shown when repeating the search with 'filter=0'.

caveman

5:48 am on May 4, 2004 (gmt 0)

Consider the possibility that overuse of *any* word trips filters under certain circumstances, if not offset or countered by other factors. Also, I use the term 'filters' loosely.

doc_z

1:35 pm on May 4, 2004 (gmt 0)

Of course, Google can filter out any result - it's their decision. I'm just not sure whether this is an intended result.

Also, I don't think that this is a reasonable behaviour:

1. companyXY is a trademark. Most of the results shown have no right to use it. However, these pages are shown while all results from companyXY are removed.

2. It doesn't make sense for the user to be shown all results except those from www.companyxy.com for the normal search, and only results from www.companyxy.com when repeating the search with 'omitted results included'. (No results are 'included' - they are replaced.)

The reason that this word is triggering a filter might be that people are bidding on this keyword on Google Adwords.

kaled

3:25 pm on May 4, 2004 (gmt 0)

It doesn't make sense ....

That, I'm afraid, seems to be what they call progress at the 'Plex - baffle webmasters with flaky results. Don't worry too much about users - they're so stupid they don't know what they want.

Kaled.

PS Your adword theory certainly warrants investigation.

vincevincevince

3:29 pm on May 4, 2004 (gmt 0)

it seems like the page is too narrowly optimised

the page may have focused upon "term1 term2" so exactly that it doesn't have enough relevancy to deal with "term1 also term2" or "term1 domain term2" - remember google doesn't give much weighting to the order of words as you type them in.

if you add a 3rd word to your search in place of the domain name, do you see the same result?

ranking highly for "term1 term2" does not indicate a high ranking for "term1 term2 term3"

doc_z

4:15 pm on May 4, 2004 (gmt 0)

it doesn't have enough relevancy to deal with "term1 also term2" or "term1 domain term2" - remember google doesn't give much weighting to the order of words as you type them in.
ranking highly for "term1 term2" does not indicate a high ranking for "term1 term2 term3"

It's not a ranking problem because the pages are completely removed from the SERPs (there are only a few results left for this combination). Moreover, the problem is independent of the order of the words. Also - as already mentioned - the only relevant result for a search which contains (the trademark) companyXY is www.companyXY.com. Of course, the page isn't optimized for companyXY, but the name appears several times (title, description, text).

if you add a 3rd word to your search in place of the domain name, do you see the same result?

I tried this for several different words. In most cases, pages from that domain are shown at the top of the results. But there are also a few searches where the pages are completely removed. In the latter case, 'repeating the search with the omitted results included' always leads to the strange behaviour described above.

hutcheson

4:27 pm on May 4, 2004 (gmt 0)

It sounds to me like the results from that domain are "considered similar" to one (or more) of the results from another domain. If there are only a few other pages listed, look at the text on them -- I'm betting one of them has almost exactly the same text.

Google is trying to eliminate "similar results" even from other domains (and, considering the difficulties, is surprisingly good at it). One search I do every now and then eliminates 90% of the top 100 results, causing content from the second primary source to first appear in searches in position 11 rather than position 116.

The first source can't be said to be harmed, as it dominates page 1 anyway; the second and third sources must surely approve of this.

As an aside, the eliminated results are split about equally into (1) pages on the same domain, wildly different, but containing the same author name; and (2) pages, each on a separate domain, that contain a near BUT NOT EXACT copy of a page on yet another domain.

[edited by: hutcheson at 4:32 pm (utc) on May 4, 2004]

hutcheson

4:31 pm on May 4, 2004 (gmt 0)

Just to emphasize: this isn't the dreaded "content filter" that some people have been so concerned about (and which I believe is a myth), although it may be the cause of what they are seeing.

The page is still in the search results! It's just not shown when there is a higher-ranking page in the same domain, or a VERY SIMILAR higher-ranking page on another domain. So if you're counting on SMC product descriptions, or your hotel-now hotel promotional blurbs, to get picked up by Google -- count again. With seven million results, people aren't going to be saying "Oh, the first three million aren't enough, I'll look at the others."

Put another way, it's not a "mom-and-pop" filter, it's a plagiarism filter.
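
To make the behaviour described above concrete, here is a minimal Python sketch. It is only an illustration of the rule as stated in this post, not Google's actual algorithm, which is not public. The function names, the 0.9 threshold and the use of difflib for text similarity are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def text_similarity(a, b):
    """Ratio in [0, 1] of how alike two page texts are (1.0 = identical)."""
    return SequenceMatcher(None, a, b).ratio()


def collapse_results(ranked, threshold=0.9):
    """Split ranked (url, text) pairs into (shown, omitted) lists.

    Assumed rule: a result is omitted when a higher-ranking result that is
    already shown sits on the same host, or has very similar text on any host.
    """
    shown, omitted = [], []
    for url, text in ranked:
        host = urlparse(url).netloc
        hidden = any(
            urlparse(shown_url).netloc == host
            or text_similarity(shown_text, text) >= threshold
            for shown_url, shown_text in shown
        )
        (omitted if hidden else shown).append((url, text))
    return shown, omitted
```

Note that nothing is deleted in this sketch: the omitted list is still there and could be displayed again when the search is repeated with the omitted results included.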

kaled

6:24 pm on May 4, 2004 (gmt 0)

If I buy a blue filter for a camera, it only lets through blue light. Similarly, does a plagiarism filter only let through pages that are copies or approximations of originals, whilst the original, genuine pages are sent to oblivion?

If Google kept first-indexed-on (date/time) data for pages, it would be able to determine to a reasonably high level of reliability which pages are original and which are copies. Attempting to filter out duplicates without such data is guaranteed to fail frequently (and result in original pages being filtered out).

Again, as a concept, this is not rocket science but it seems to be beyond the algo designers at the Plex.
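
The idea can be sketched in a few lines of Python. This is purely hypothetical: nothing in this thread shows what data Google actually stores, and the function name and the URL-to-timestamp mapping are assumptions for illustration.

```python
from datetime import datetime


def pick_original(first_indexed):
    """Return the URL with the earliest first-indexed timestamp.

    first_indexed maps each duplicate page's URL to the (hypothetical)
    datetime at which it was first indexed.
    """
    if not first_indexed:
        raise ValueError("no pages given")
    return min(first_indexed, key=first_indexed.get)
```

This would only be as reliable as the crawl order: a copy that happens to be indexed before its source would be picked as the original.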

Kaled.

doc_z

7:00 pm on May 4, 2004 (gmt 0)

It sounds to me like the results from that domain are "considered similar" to one (or more) of the results from another domain.

This sounds reasonable (and there are cases where the original content is filtered out), but it doesn't seem to fit this case for the following reasons:

- I couldn't find similar results
- even if the first (top) result from this domain were considered similar, there are numerous other (unique) results
- the pages from companyXY are the original pages and most of them have high PR
- if they are considered original content for 'word1 word2', it would be strange to see these pages as duplicate content for 'word1 word2 companyXY' (assuming that some similar pages exist)

hutcheson

3:17 am on May 5, 2004 (gmt 0)

You raise an interesting question about exactly what constitutes "similar" and "very similar" pages.

"Similar" pages [as in "show similar pages" in the search results] won't necessarily be found by the same search term, but generally seem to be connected by hyperlinks and perhaps by similarity of vocabulary. The omitted "very similar pages" can be on any domain, but seem to always have very similar page text. But all the examples I have are pretty extreme -- nearly all of the page text is identical.
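
One common way to put a number on "very similar page text" is to compare overlapping word sequences (shingles) with the Jaccard index. The Python sketch below shows that general technique under assumed parameters (three-word shingles); whether Google measures similarity this way is not known from anything in this thread.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Jaccard index of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0  # both texts are shorter than one shingle
    return len(sa & sb) / len(sa | sb)
```

With a measure like this, an exact copy scores 1.0, a near but not exact copy scores slightly lower, and unrelated pages score close to 0.0, so "very similar" becomes a matter of choosing a cut-off.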