Forum Moderators: Robert Charlton & goodroi


Does Cutts April Fools Inadvertently Reveal Google Targets for Spam?


seoskunk

5:00 am on Apr 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I’ve been working really hard with some friends on a project to handle SEO automatically. Now we’re ready to take the wraps off it over at seo.ninja.

One of the ideas that helped the World Wide Web succeed was that it separated presentation and content. You could write your text and decouple it from the problem of how the text looked. AutoSEO takes that to the next stage with search engines, so you don’t have to think about things like redirects.

How much would you pay to never have to worry about keyword density, H1 headers, or meta descriptions again? How about... free? That's right, AutoSEO is free for individuals, students, self-hosted installs, and companies with fewer than 100 employees. AutoSEO is also built from the ground up to handle mobile browsers.

We’re starting with a limited set of invites to kick the tires on the system before opening things up for wider usage. Read more about the project over at seo.ninja!

My next project: AutoSEO [mattcutts.com]


Let's examine this:

Spam signals =

individuals
students
self-hosted installs
companies with fewer than 100 employees

Spam Techniques =

H1 headers
keyword density
Redirects

[edited by: aakk9999 at 9:58 am (utc) on Apr 3, 2015]
[edit reason] Added link to Matt's blog post [/edit]

rustybrick

11:00 am on Apr 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Doubt it...

martinibuster

11:46 am on Apr 3, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There is no scientific research or patent filed by any search engine that describes a method for identifying spam by cross-referencing a domain's whois data with college and high school registrations, Minecraft downloads, or employment status at McDonald's, and then flagging those student domain owners for spam action. If you can find such research then please post a URL for it.

Thank you.

martinibuster

1:01 pm on Apr 3, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The topic of this thread is an extreme example of SEO speculation, but there are more common theories out there that rest on the same kind of "what if" foundations. The danger in baseless speculation, or even speculation that takes a phrase out of context, is that it gets repeated and then discussed as if it were a real thing, when the original idea came from a single sentence taken out of context or, worse, has no research or patent filing to back up the possibility that the theory might be a real algo in use.

Here is an example of an SEO myth that has gone mainstream: that Google favors brands. There is no scientific research or patent filing that describes a method for identifying a "brand." There is no basis for that belief, yet it is the foundation for other SEO theories, like the one about brand mentions, and for SEO practices, like the idea that we should all strive to become brands. All of those theories and practices trace back to a belief that has no basis at all.

EditorialGuy

1:59 pm on Apr 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Shouldn't this have been posted on April 1?

seoskunk

2:29 pm on Apr 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ok calm down, it's a tongue in cheek post :)

martinibuster

3:31 pm on Apr 3, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Emoticons are your friends. ;)

Tongue in cheek ----> :P

rish3

3:39 pm on Apr 3, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



There is no scientific research or patent filed by any search engine that describes a method for identifying spam by cross-referencing a domain's whois data with college and high school registrations, Minecraft downloads, or employment status at McDonald's, and then flagging those student domain owners for spam action. If you can find such research then please post a URL for it.


Detecting spam related and biased contexts for programmable search engines [google.com...]

Certainly covers "cross-referencing of a domain's whois data". Doesn't list any of your example target cross-references specifically, but if you assume a broad definition of "information":

"analyzing domain-related information corresponding to a domain associated with the document over time, and scoring the document based, at least in part, on a result of the analyzing."

:)

lucy24

7:18 pm on Apr 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: insert link to tvtropes "Don't Explain The Joke" page here ::

RedBar

10:10 am on Apr 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There is no scientific research or patent filing that describes a method for identifying a "brand."


Was it Cutts, or maybe Mueller, who said 2-3 years back that "brands" were the way forward in fighting spam?

martinibuster

12:20 pm on Apr 4, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



“Brands are the solution, not the problem,” Mr. Schmidt said. “Brands are how you sort out the cesspool.”


It was neither. It was Eric Schmidt, in a speech to magazine publishers. His statement was taken out of context and made to represent a smoking gun about brand preference. But if SEOs take a moment to read the statement in context, along with further statements from that same event, they will see that they are mistaken, and mistaken to their own harm: any time you believe an inaccurate data point, you close your eyes to what is really going on and deprive yourself of the tools to better your situation and understand what is happening around you.

At that very same conference, Schmidt was asked by the brand-name magazine executives how brands could rank better. His response might surprise you, given how often his quote about brands versus cesspools is trotted out as Google's smoking gun. I wrote an article [thesempost.com] about it at SEMPost, if you are interested in the myth of Google's preference for brands.

Kratos

3:10 pm on Apr 4, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



The fact that Cutts disabled comments on that post is enough to tell you that it was indeed an April Fools' post.

Personally, I don't get these kinds of things. It's a great way to waste people's time, and I don't see anything funny about it. I was surprised to see reputable online sites play such pranks. Anyway...

rish3

9:39 pm on Apr 6, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



the myth of Google's preference for brands

It always seems odd to me when someone declares that anything is, or is not, a signal.

How would anyone outside of a small subset of people at Google know? You can test, I suppose, but you can't make other variables stand still, so it's flawed testing at best. You could make a case for why something would be a bad signal, but that doesn't mean it's not used, in some way.

And, given that they've handed the reins over to machine learning, it's not that odd to say they might not know what is, and is not, a signal these days.

Edit: Here's an example. It's often said that bounce rate is probably not a signal because it is "too noisy". That is true for certain types of sites and certain queries. But the bounce rate for a query with purchase intent, landing on an e-commerce site, is probably a decent indicator of an unhappy visitor. Or perhaps a visit arriving directly from Google product search. If that data were available to Google, why would they ignore it?
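
The point above, that the same raw metric can be noise for one query type and signal for another, can be sketched with a toy segmentation. All numbers and the intent labels below are invented for illustration; nothing here is Google's actual pipeline:

```python
# Toy sketch: bounce rate aggregated per query intent. A high bounce
# rate on purchase-intent queries may be more telling than the same
# rate on informational ones, where one-and-done visits are normal.

def bounce_rate(visits):
    """visits: list of (query_intent, bounced) tuples.
    Returns bounce rate per intent segment."""
    by_intent = {}
    for intent, bounced in visits:
        total, bounces = by_intent.get(intent, (0, 0))
        by_intent[intent] = (total + 1, bounces + (1 if bounced else 0))
    return {intent: bounces / total
            for intent, (total, bounces) in by_intent.items()}

# Hypothetical visit log
visits = [
    ("informational", True), ("informational", True),
    ("informational", False),
    ("purchase", True), ("purchase", False),
    ("purchase", False), ("purchase", False),
]
rates = bounce_rate(visits)
```

The same ~25% bounce rate that would be unremarkable on an informational query could flag a problem on a purchase-intent query, which is why "too noisy" depends entirely on how the metric is segmented.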

martinibuster

10:33 pm on Apr 6, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



How would anyone outside of a small subset of people at Google know?


No one outside Google can know. But anyone can understand. There is copious documentation, in the form of scientific research papers and patent applications, that lays out what kinds of algorithms they have studied, what was superseded and why (indicating it may no longer be in use, or may never have been in use), and what seems to have gone unchallenged and is likely incorporated into modern algorithms. With that kind of understanding you can then know what the engineers are talking about when an engineer says something like this: [wired.com]

I think you look for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. Whenever we look at the most blocked sites, it did match our intuition and experience, but the key is, you also have your experience of the sorts of sites that are going to be adding value for users versus not adding value for users. And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons…


You will read that and say: OK, I've read about classifiers. A classifier is a decision function, a routine that takes strings of data (could be text, could be features of an entire page, could be various ratios) and then decides, using mathematics, that based on certain features a particular page is similar to pages that quality raters have judged to be spam. That's a classifier: in this example, a mathematical solution for mimicking what we casually do when we look at a page and, within seconds, clap our hands and say, this is spam.
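
As a loose illustration of what "a classifier" means in this sense, here is a toy nearest-centroid sketch. The features, labels, and training data are all invented, and this is nothing like Google's actual system; it only shows the shape of the idea, turning labeled examples into a decision function:

```python
# Toy page-quality classifier: average the feature vectors of each
# labeled class, then assign new pages to the nearest class centroid.
# Features and labels are hypothetical, for illustration only.

def train_centroids(labeled_pages):
    """labeled_pages: list of (feature_vector, label).
    Returns the mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in labeled_pages:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, f in enumerate(features):
            acc[i] += f
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(features, centroids):
    """Return the label whose centroid is closest (squared distance)."""
    def dist(label):
        return sum((f - c) ** 2
                   for f, c in zip(features, centroids[label]))
    return min(centroids, key=dist)

# Invented features: [ad-to-content ratio, words per page, outlinks]
training = [
    ([0.1, 900, 12], "quality"),
    ([0.2, 700, 8],  "quality"),
    ([0.8, 150, 90], "spam"),     # thin, ad-heavy, link-stuffed
    ([0.9, 100, 70], "spam"),
]
centroids = train_centroids(training)
print(classify([0.85, 120, 80], centroids))  # → spam
```

Real systems use far richer features and models, but the principle is the same: human judgments go in, a mathematical decision boundary comes out.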

If one does not have a foundation in understanding how algorithms work one could be forgiven for reading the above statement and declaring it a smoking gun.

...we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side.


But the above quote is not a smoking gun that "proves" Google has a brand preference algorithm. To assert that is to take the statement way out of context. The context is about creating a mathematical way to mimic our gut instinct of knowing spam when we see it. That's it. That's all it is.

It's often said that bounce rate is probably not a signal, because it is "too noisy".


That is a good example, and a whole other can of worms, because bounce rate is related to CTR, to what people do when they click from a SERP and then click back. Before we even have that conversation, you/me/we/the industry have to come to an agreement on what we're discussing: specifically, what counts as a ranking signal. Many call CTR a ranking signal, but in many instances, dating back to about 2003 when GoogleGuy stated on WebmasterWorld [webmasterworld.com] that CTR is used for quality control, CTR is most often seen in algorithms related to re-ranking the SERPs, i.e. quality control.

So we can go around in circles saying yes it is / no it isn't, unless we first take a moment to understand the contexts in which things like CTR can be used, may be used, and probably are used. There are many uses for CTR and bounce rate. One of them relates back to understanding what went wrong with the classifiers: in other words, using CTR and other factors to understand what went wrong with a SERP, which algo factor or classifier is responsible for a "bad" result, and then deciding to either rank it lower if it's still useful, chuck it, or maybe just show it at certain times of day, etc. And as for the noisiness of the CTR signal: yes, there is noise, but there has also been research on identifying that noise, discarding it, and then finding useful meaning in CTR, bounce rate, and other user behaviors.
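
A toy sketch of the re-ranking / quality-control idea described above: results whose observed CTR falls far below what is typical for their position get demoted. The expected-CTR curve and the threshold are invented for illustration; this is not a documented Google mechanism:

```python
# Toy SERP re-ranking sketch: flag results that badly underperform
# the typical CTR for their position and move them to the end.
# EXPECTED_CTR and the 0.5 cutoff are hypothetical values.

EXPECTED_CTR = [0.30, 0.15, 0.10, 0.07, 0.05]  # by position

def rerank(results):
    """results: list of (url, observed_ctr), in original rank order.
    Returns URLs with badly underperforming results demoted."""
    keep, demote = [], []
    for pos, (url, ctr) in enumerate(results):
        if ctr < 0.5 * EXPECTED_CTR[pos]:
            demote.append(url)   # likely a "bad" result for this query
        else:
            keep.append(url)
    return keep + demote

serp = [("a.example", 0.28), ("b.example", 0.03), ("c.example", 0.11)]
print(rerank(serp))  # → ['a.example', 'c.example', 'b.example']
```

Note that nothing here ranks pages in the first place; CTR only audits an existing ranking, which is the distinction being drawn between a ranking signal and quality control.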

rish3

12:07 am on Apr 7, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



If one does not have a foundation in understanding how algorithms work one could be forgiven for reading the above statement and declaring it a smoking gun.


If someone calls that a smoking gun, it's likely from a different set of experiences that are sometimes just as relevant as technical experience.

I have certainly been in the room at large companies when a non-technical leader made a technical decision that wasn't a terrific one. It still got implemented.

As for depending on what Google employees say...well, that's tricky. Matt Cutts made a video responding to questions about the Vince update, which was perceived to be about brands.

His answer was that it wasn't about brands, but rather about "trust, authority, reputation, PageRank, high quality." The problem with that is that only 1 of those 5 has a real definition. Any one of "trust", "authority", "reputation" and "high quality" could potentially contain some derivation of "identifying a brand".

Yahoo has a patent, for example, on TrustRank. It does start with a whitelist of sites, and without getting into the gory details, you could call it something that "prefers brands"...if the list of seed sites matched that objective. Could Google have something similar? Do they patent everything in the algorithm?
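
For readers unfamiliar with the TrustRank idea mentioned above, a minimal sketch of seed-based trust propagation looks roughly like this. The graph, seed set, and constants are invented, and this is a drastic simplification of the published TrustRank concept, not Yahoo's or Google's implementation:

```python
# Minimal TrustRank-style sketch: trust originates at a hand-picked
# seed whitelist and flows along outlinks with a damping factor, so
# pages far from the seeds accumulate little trust.

def trust_rank(graph, seeds, damping=0.85, iters=50):
    """graph: {page: [outlinked pages]}; seeds: trusted whitelist.
    Returns a trust score per page."""
    pages = list(graph)
    seed_score = 1.0 / len(seeds)
    trust = {p: (seed_score if p in seeds else 0.0) for p in pages}
    for _ in range(iters):
        # seeds keep a base trust; everything else starts from zero
        new = {p: (1 - damping) * (seed_score if p in seeds else 0.0)
               for p in pages}
        for p, links in graph.items():
            if links:
                share = damping * trust[p] / len(links)
                for q in links:
                    new[q] = new.get(q, 0.0) + share
        trust = new
    return trust

# Hypothetical three-page link chain
graph = {
    "seed.example": ["good.example"],
    "good.example": ["far.example"],
    "far.example": [],
}
scores = trust_rank(graph, seeds={"seed.example"})
# trust decays with link distance from the seed set
```

Whether the seed whitelist "prefers brands" is entirely a property of who curates the seeds, which is exactly the point being debated.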

I'm also of the belief that the machine learning is making things very murky. Depending on what they give it visibility to, it can certainly decide to use signals that Google did not intend it to. It has to have that ability, or it's not machine learning.

martinibuster

2:14 am on Apr 7, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Yahoo has a patent, for example, on TrustRank... Could google have something similar?


No. TrustRank was shown to have several flaws in subsequent research studies. I wrote an article about that at SEMPost, called TrustRank Teardown. There was a thing called Topical TrustRank that superseded it, and maybe something like it or a derivative of it is in action in some form, but trust as a thing that gets passed around from page to page, like PageRank, isn't something there's recent research about. Do you know of any?

rish3

2:41 am on Apr 7, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



No. TrustRank was shown to have several flaws in subsequent research studies.


Sorry, just cracks me up that you can definitively say "no", they don't have something similar. They must have something related to "trust", right? They talk about it. And they mention it separately from PageRank...in a list where PageRank is one thing, and trust is another.

Yes, I believe that you're experienced in this, and I believe you looked at the Yahoo! paper...and saw real flaws. That's not, however, the same thing as saying there's no way to use "trust" in their algorithm...to have something "similar".

They do seem, at times, to use ideas that are flawed enough that they can only be used as downward filters, like dampers. They don't have enough faith in the ideas to use them to rank things, but they have enough faith to use them to demote things.

But, then, we're both really guessing. Your guesses are more educated than mine, but they are still guesses.

Machine learning was also dismissed, some time ago, as being too flawed for use in a search algorithm.

Lastly, I can't speak to Google, but I can say that tech companies regularly use new approaches without publishing anything about it...on purpose.

martinibuster

3:46 am on Apr 7, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Thanks for the great discussion rish3 and for bringing up all these issues. I really appreciate your input and questions. I believe that a community that encourages the respectful challenging of opinions and statements is good for everyone involved. I have learned a lot at WebmasterWorld by participating. I can't emphasize enough how participating in discussions, answering questions, and challenging the answers all work together to help ourselves. It's not enough to just read. Participating helps the participants. So thanks rish3 for your part in moving this discussion forward. :)

...I believe you looked at the Yahoo! paper...and saw real flaws... But, then, we're both really guessing.


I do not want to claim credit for anything I didn't do. I did not see the flaws in TrustRank. Subsequent researchers documented the flaws and published the findings (as I related in my article...).

Machine learning was also dismissed...


Machine learning is in use. Period. There is no denying this. Machine learning has been in use for many years.

I can say that tech companies regularly use new approaches without publishing anything about it...on purpose.


I agree with you, 100%. ;)

netmeg

12:19 pm on Apr 7, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



(I can't contribute anything useful to this discussion, but just wanted to pop in to say I'm enjoying the read)

RP_Joe

1:13 pm on Apr 7, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



It seems odd to me that they have all the whois data but do not flag domains registered in 3rd world countries.
Since record storage is cheap, I cannot believe they do not keep a blacklist (or dark-gray list) of previous spammers / spam websites.

rish3

1:21 pm on Apr 7, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



It seems odd to me that they have all the whois data but do not flag domains registered in 3rd world countries.

You can put anything you want in the whois data, so Google would only have access to what was entered, versus where the person was when they entered it.

And, of course, there are plenty of legitimate, successful, enterprises in third world countries.

System

7:22 am on Apr 14, 2015 (gmt 0)

redhat



The following 4 messages were cut out to new thread by goodroi. New thread at: google/3005869.htm [webmasterworld.com]
7:57 am on Apr 14, 2015 (utc -5)