
Google News Archive Forum

    
Factors for Identifying Commonly Operated Websites
How can Google and Other Crawler-based SE's Tie My Sites Together?
SEOmariachi
msg:120753 - 9:40 pm on Sep 2, 2004 (gmt 0)

As Google (and Yahoo) step up their efforts to give away less organic traffic and better monetize their searchers, while still maintaining the quality search that keeps searchers loyal, the squeeze is being put on webmasters who operate multiple websites with similar themes and content.

SE's weeding out the "cloners" or "multi-servers" sounds like a user- and quality-focused initiative at first, until you realize that its second (or perhaps first?) objective is to create a scarcity of free organic indexing so that it can SELL an alternative--logical from a business perspective. This takes the form of AdWords for Google, and Overture's PFI and PPC for Yahoo. PFI is perhaps the most literal alternative, because Yahoo is charging you NOT for guaranteed performance, but just for a shot at competing with the other sites in their index; Google, on the other hand, has sworn off PFI (for now).

But for both major SE's, organic listings for cloned websites are a drain on revenue potential: "At most", they think, "we should give away one 'set' of goodwill to any particular webmaster for any particular bucket of unique content." Any more than that, and they're giving away too much and missing out on an opportunity to sell that goodwill to someone who is willing to pay for it. Any more than that, and they're diluting the experience of their prized searchers.

Some webmasters have a philosophical problem with cloning their own websites. The rest of us use it as one of many important tools for sustaining or supplementing our livelihoods. For the cloners, I would like to ask...

1.) What factors are most likely to be used by SE's to tie together a group of commonly operated websites?

2.) What is the level of automation for determining these factors, both currently, and in the near future?

3.) Which of these factors is likely to impact only organic indexing, or possibly also the supposed privilege of participating in AdWords, and Overture PFI or PPC?

4.) What level of paranoia is justified in segregating all of these aspects in order to limit the risk of having your cloning found out?

5.) How different must your content be (both page by page and site by site) for it to stay under the duplicate-content radar?

Here is an initial list of ours, which I'm sure is a small percentage of all the imaginable gotchas. Or, perhaps some of these issues are too paranoid and not worth the effort? (A rough sketch of how such signals might be combined follows the list.)

Same content verbatim
Same cookie structure
Common obscure registrar
Common name servers
Same credit card used for ANYthing (Y! Dir, Overture, AdWords, etc.)
Website hosted on same IP/block
Email hosted on same IP/block
Email subdomains hosted on same IP/block
Hosting of any service on different IP blocks with same netblock owner
Common traceroute hops between IPs that appear to be separate
Whois information matching
Alexa contact information matching
Interlinking of domains
Common backlinks (indirect interlinking)
Same contact information posted on websites
Common outages among domains (during server reboot, etc.)
Past common billing for AdWords/Overture
Login from same IP to separate AdWords/Overture accounts
Residual cookies from past logins to AdWords/Overture accounts
Similar file names or linking/directory structure
Aspects of HTML that identify a common backend, such as:
    Comments
    Tabulation
    Javascript function names
    Cookie structure
    Linked CSS and JS files
    CSS class names
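
To make question 1 a bit more concrete, here is a minimal sketch of how a crawler-based SE could combine signals like the ones above into a single "commonly operated" score. This is purely illustrative guesswork on my part: the signal names, weights and threshold are assumptions, not anything Google or Yahoo has confirmed.

```python
# A minimal sketch (not any engine's actual method) of how signals like those
# above could be combined into a "commonly operated" score. All signal names,
# weights, and the threshold are hypothetical illustrations.

SIGNAL_WEIGHTS = {
    "same_ip_block": 3.0,
    "same_name_servers": 2.0,
    "whois_contact_match": 3.0,
    "same_registrar": 0.5,
    "interlinked": 1.0,
    "shared_backlinks": 1.0,
    "same_contact_info_on_pages": 2.5,
    "matching_html_fingerprint": 2.0,   # comments, JS names, CSS classes, etc.
}

def common_operator_score(signals: dict) -> float:
    """Sum the weights of every signal that fires for a pair of sites."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))

def likely_same_operator(signals: dict, threshold: float = 5.0) -> bool:
    return common_operator_score(signals) >= threshold

# Example: two sites sharing an IP block, name servers, and WHOIS contact
print(likely_same_operator({
    "same_ip_block": True,
    "same_name_servers": True,
    "whois_contact_match": True,
}))  # True: 3.0 + 2.0 + 3.0 = 8.0 >= 5.0
```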

 

DavidT
msg:120754 - 3:14 pm on Sep 3, 2004 (gmt 0)

He's not asking what we think about cloning sites; start your own thread if you want to sound off about something, knock yourselves out.

Pretty long list there, SEOmariachi. I can't really agree with all of your premises; in Google's case I'd say the desire is more in the QC area than money-grabbing, although those two things are connected in a way.

Some of it is way paranoid, yeah, but I don't suppose it hurts to be careful, aside from the time and effort that goes into it.

And I wouldn't limit it to 'clones'; Google will make the connection between two very different sites, with similar backlink structures, that to some extent target the same areas. Not clones at all, just related.

mifi601
msg:120755 - 3:29 pm on Sep 3, 2004 (gmt 0)

I think being on the same IP block is a biggie.
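
For what it's worth, the basic check is cheap to do. A toy sketch, assuming a plain /24 boundary (a real engine could just as easily use announced netblocks from WHOIS/ASN data instead):

```python
# A quick sketch of the "same IP block" check: are two hosts in the same /24?
# Purely illustrative; real engines could use announced netblocks (ASN/WHOIS)
# rather than a fixed /24.
import ipaddress

def same_slash_24(ip_a: str, ip_b: str) -> bool:
    net_a = ipaddress.ip_network(f"{ip_a}/24", strict=False)
    return ipaddress.ip_address(ip_b) in net_a

print(same_slash_24("192.0.2.10", "192.0.2.200"))   # True: same /24
print(same_slash_24("192.0.2.10", "198.51.100.7"))  # False
```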

DaveAtIFG
msg:120756 - 3:33 pm on Sep 3, 2004 (gmt 0)

Several members responded to SEOmariachi's post with "Cloning sites is spam" type responses which I deleted. As DavidT so wisely points out, "start your own thread if you want to sound off about something".

Please recall Terms of Service [webmasterworld.com] item 16 and stay on topic, folks! :)

mincklerstraat
msg:120757 - 4:15 pm on Sep 3, 2004 (gmt 0)

Indeed, this is a very interesting question, and the ethics or personal dislike of site cloning, however important, is an entirely different question - unless you're so gung ho on this that you think all such discussions should be boycotted here, which is pretty gung ho indeed, given the likelihood that some clone-detecting techniques could end up hitting sites that aren't really all that cloney. SEOmariachi is taking some of the well-known means of identifying clones and extrapolating, and when it comes to credit card info and ad campaign logins, adding some I haven't seen or even thought of yet. The questions are also interesting whether or not you agree with his premise.

Least likely, I'd guess, would be the following, in no particular order:
- tabulation
- simultaneous outages (why bother with this if you can get netblock owner info? Over-scrupulous webmasters will put clones on different hosts anyway and update them with XML or other cross-domain tech. There's also the need to keep some kind of statistics over time for reliability, instead of being able to calculate immediately.)

2.) Level of automation: I do some SEO in which I try to out-guess the search engines not only for today, but for a few months, maybe a year from now. The question is, basically, what resources could SE's have at their disposal to automatically detect site anomalies (not necessarily 'search engine spam' or cloning, but I won't go into it here)? What's necessary is processor power, storage of base material for comparison, and smart algorithms that don't waste either of these. Obviously, the more 'important' a site becomes in rankings, the more care can be taken in fine-tuned checking, i.e., bigger algos, more processor time used. This is a piece of cake programming-wise and a no-brainer when it comes to using your resources wisely. This means that in the future your tricks probably won't work if you're in the top 20 for a popular search term. You'd spend more time disguising your clonishness than it would take to start from scratch. But your tricks are likely to work if you're not so big. And tricks that get you into the top results could be the same ones that break you once you're a focus of attention.

What you then want to study are comparison operations, and how they behave when conducted on a large scale. If you know a language that uses regexes, think: how would I search for a match, or a flag that something might be fishy, using regexes and hashes? What info would I need at hand? How would I build this engine? What would I look for first that would be easiest and consume the fewest resources?

It's this first bit that should be on your chopping block to obfuscate and randomize. And then think further, and randomize what comes next, and there's a good chance that the SE's won't discover your clonishness for quite a while.
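
As a rough illustration of that "cheapest check first" idea, here is a small sketch that pulls easy HTML fingerprints (CSS class names and JS function names) out with regexes, hashes the sorted sets, and only flags pages for heavier comparison when the cheap hashes collide. The regexes, the SHA-1 choice and the overall pipeline are my own assumptions, not a known SE implementation:

```python
# A sketch of the "cheapest check first" idea above: pull easy-to-extract HTML
# fingerprints with regexes (CSS class names, JS function names), hash the
# sorted set, and only escalate to heavier content comparison when the cheap
# hashes collide. Hypothetical illustration, not any engine's actual pipeline.
import hashlib
import re

CSS_CLASS_RE = re.compile(r'class="([^"]+)"')
JS_FUNC_RE = re.compile(r'\bfunction\s+([A-Za-z_]\w*)')

def cheap_fingerprint(html: str) -> str:
    classes = set()
    for attr in CSS_CLASS_RE.findall(html):
        classes.update(attr.split())
    funcs = set(JS_FUNC_RE.findall(html))
    blob = "|".join(sorted(classes)) + "#" + "|".join(sorted(funcs))
    return hashlib.sha1(blob.encode()).hexdigest()

page_a = '<div class="nav main"><script>function trackClick(){}</script></div>'
page_b = '<div class="main nav"><script>function trackClick(){}</script></div>'

# Same backend template hashes identically even though markup order differs.
print(cheap_fingerprint(page_a) == cheap_fingerprint(page_b))  # True
```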

nativenewyorker
msg:120758 - 9:15 am on Sep 7, 2004 (gmt 0)

Shhhhhh... Perhaps we're giving GG too many ideas to annihilate us.

shri
msg:120759 - 3:47 am on Sep 8, 2004 (gmt 0)

My current tin-foil hat theory is that the link structure of a site is important.

Think of your links as roads on a street map. You don't need to know the names of the buildings or the colors they're painted.

If you see three very similar maps, it takes you milliseconds to figure out they're all related to each other.

In practical terms...

1) Cloned sites A, B and C have 5322 pages.
2) Home Page on all of them has 48 links with the same anchor text and same directory structure.
3) Level one pages have the same names and all point to the same level two pages.

At least, this is what I seem to have figured out from Krishna Bharat's paper and another one, possibly related to Yahoo... i.e. look for common elements and not content to find duplicate sites.
( [www8.org...] ) Sorry... can't find the second paper I mentioned.
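
A toy version of that "street map" comparison might look like this: represent each page by its set of (anchor text, link target) pairs and measure the overlap, ignoring the surrounding content entirely. This is only a sketch inspired by the common-elements idea, not the actual algorithm from either paper:

```python
# A sketch of the "street map" comparison: look at link structure
# (anchor text + relative target path), not page content, and measure overlap
# between two sites' home pages. The 1.0/0.0 examples and the representation
# are illustrative assumptions.

def link_map(links):
    """Each link is (anchor_text, relative_path)."""
    return set(links)

def structural_similarity(site_a, site_b) -> float:
    a, b = link_map(site_a), link_map(site_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)   # Jaccard overlap

site_a = [("Red Widgets", "/widgets/red.html"), ("Blue Widgets", "/widgets/blue.html")]
site_b = [("Red Widgets", "/widgets/red.html"), ("Blue Widgets", "/widgets/blue.html")]
site_c = [("About us", "/about.html"), ("Contact", "/contact.html")]

print(structural_similarity(site_a, site_b))  # 1.0 - same "map", likely related
print(structural_similarity(site_a, site_c))  # 0.0 - unrelated structure
```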

cabbie
msg:120760 - 4:52 am on Sep 8, 2004 (gmt 0)

I am taking notes, SEOmariachi, on your whole list, but at the moment I have only seen evidence that Google uses automation in some form to find common linking patterns and site structure.
Other things like whois info and IP range seem to be taken into account only with a manual review.
They may discount the links between sites in the same IP range, but I have sites with consecutive IPs sitting beside each other in the SERPs, so they by no means ban them.
Cheers.

attard
msg:120761 - 12:14 am on Sep 9, 2004 (gmt 0)

Suppose you have legitimately related sites. Not clones of each other, but subtopics or niches. For instance, say you owned sweets.com and then wanted separate sites to talk about candy, cake, cookies, sugar, artificial sweeteners, etc.

Each has unique content but common links - say each site has a link to the others' home pages, or to a common shopping cart.

How do you avoid getting bumped for interlinking?

mivox
msg:120762 - 12:33 am on Sep 9, 2004 (gmt 0)

How do you avoid getting bumped for interlinking?

My ideas (based on my own interlinked sites)...

1.) Definitely get plenty of incoming links from other people's sites (directories and personal/hobby sites are both good, manufacturers - if you retail products - are a give-away, competitors are great but next to impossible to get)

2.) Possibly set them up as subdomains (candy.sweets.com, cakes.sweets.com, etc.).

I do both, and have never had a problem.

5stars
msg:120763 - 12:48 am on Sep 9, 2004 (gmt 0)

How do you avoid getting bumped for interlinking?

I have several sites. The master site talks about all our widgets in general. Each of the other domains covers a specific widget, and there is no duplicate content per se... but rather duplicate subject matter. All the sites share the same shopping cart on yet another domain, and they are all heavily interlinked.

If I go to Alexa and type in one of the subdomains, my home page for the master domain comes up, even though each subcategory domain has its own home page. So I know they have tied them together. I have never tried to hide the fact that they are related. Every domain is hosted on the same server. Do you think our site received a manual review? If we did, we have not been penalized as of yet.

What steps would you recommend sites like ours take to ward off the gray bar?

Thanks in advance!

Green_Widge
msg:120764 - 6:00 am on Sep 9, 2004 (gmt 0)

SE's weeding out the "cloners" or "multi-servers" sounds like a user- and quality-focused initiative at first, until you realize that its second (or perhaps first?) objective is to create a scarcity of free organic indexing so that it can SELL an alternative--logical from a business perspective. This takes the form of AdWords for Google, and Overture's PFI and PPC for Yahoo. PFI is perhaps the most literal alternative, because Yahoo is charging you NOT for guaranteed performance, but just for a shot at competing with the other sites in their index; Google, on the other hand, has sworn off PFI (for now).

But for both major SE's, organic listings for cloned websites are a drain on revenue potential: "At most", they think, "we should give away one 'set' of goodwill to any particular webmaster for any particular bucket of unique content." Any more than that, and they're giving away too much and missing out on an opportunity to sell that goodwill to someone who is willing to pay for it. Any more than that, and they're diluting the experience of their prized searchers.

If you think about it, weeding out the clones would make Google less money, not more, by your reasoning, since there would be a broader range of websites being represented.

If there are 10 clones on the front page, then the other 9 guys who are usually there have to buy AdWords, and the customer who wants to buy from those guys has to either dig deeper into the SERPs or click on an AdWords ad.

As far as Google is concerned, the monetary trade-off between AdWords and too many good-quality SERPs will never be an issue. For every search term, there are thousands of sites that would like to be number one. So there will always be tons of people willing to pay for AdWords to be put on the front page for their search terms.

The issue of SERPs that are too good doesn't come into play either. When I sat 6 people down at google.com and asked them to do searches and then click on what they were looking for, a third of them clicked on the AdWords and checked them out before going to the free SERPs anyway (especially when searching for products).

While this isn't a big study, of course, it leads me to believe that if you did a search for "FORD" and 10 different sites showed up with great Ford content, you would still have a large number of people who would click on the AdWords to see what other Ford stuff was out there. AND, on the other hand, if you had forddotcom show up 10 times, you would probably get the same or better AdWords clickthrough rates.

I agree with you that Google and the others will try to weed out the "dups" for quality control reasons, but I don't think it will ever be because they are worried that their revenues will fall because the quality of their searches is too good.

sit2510
msg:120765 - 6:16 am on Sep 9, 2004 (gmt 0)

>>> their revenues will fall because the quality of their searches is too good.

I disagree with your statement. Quality results will help Google keep old users as well as attract new ones. With more repeat users, and by the law of averages, AdWords clickthroughs would increase proportionately with the increase in the number of queries - this will help enhance G's revenues in the long term.

Green_Widge
msg:120766 - 6:39 am on Sep 9, 2004 (gmt 0)

their revenues will fall because the quality of their searches is too good.

You took that out of context, didn't you? What I said was:

I don't think it will ever be because they are worried that their revenues will fall because the quality of their searches is too good.

It actually looks like we agree: Google will not lose money by having good search results. (I can understand why you would be confused, though, since I just read what I wrote above and it confused the heck out of me. LOL)

MHes
msg:120767 - 7:59 am on Sep 9, 2004 (gmt 0)

There are some threads here that are ahead of their time and make this a very special place... this is one of them.

I always think about the scenario of a small town with a popular web designer and a commonly used hosting company (probably recommended by the designer). He has built 100 sites for 100 different clients offering 100 different services. All are relevant and the best results for searches done around that town. They all 'naturally' link together, based on sensible recommendations, because they all are from the same town.

If Google penalises by looking at IPs, linking and website style, they will lose the relevant sites for that town.

The best way forward for Google, IMHO, is to deal only with duplication of text. Duplication is what frustrates the user and makes the SERPs rubbish. If sites offer unique content, then it doesn't matter if they originate from the same IP or webmaster, as long as they are relevant.
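
In that spirit, a bare-bones sketch of text-only duplication detection: break pages into overlapping word shingles and flag pairs whose shingle sets mostly overlap. The shingle size and threshold here are arbitrary assumptions for illustration, not anything Google has published about its own duplicate filter:

```python
# A sketch of text-duplication detection: compare word shingles (overlapping
# 4-word windows) between two pages and flag high overlap. The shingle size
# and 0.9 threshold are illustrative assumptions.

def shingles(text: str, k: int = 4):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def text_overlap(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    return text_overlap(a, b) >= threshold

original = "We sell the finest hand made widgets in the whole of France"
clone    = "We sell the finest hand made widgets in the whole of Scotland"
print(text_overlap(original, clone))  # high overlap, likely duplicates
print(is_duplicate(original, "A town guide with completely different text"))  # False
```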

However, currently the SERPs do factor in IP penalties with respect to linking. This, in my opinion, is counterproductive, as it encourages savvy webmasters to link with non-relevant sites - albeit on the same theme but a different IP - for ranking purposes. These links are pointless for the user, but great for Google.... who wants links to "buy widgets in Scotland" when the site is about a town in France? If I try to do a site or sites for only my local area and keep the links focused on businesses in that area, I restrict the IP range of links in. The result is that another webmaster based outside the area can probably conjure up all sorts of varied links in by not being locally focused, and get top position.

Content is the important bit, nothing else should matter.
