More importantly, let's say I wanted to follow the IYP template for a website and list all the plumbers in the country. Users type in their zip code and get a list of all the plumbers in their area. (Typically, from what I've seen in IYPs, each zip code gets its own webpage.) How should I handle the following issues that may come up:
1. Duplicate content. The address, contact info and any "content" related to a plumber would show up several times if they service several different zip codes in a city. (By content, I mean something similar to a blurb about the plumber or reviews.)
2. Multiple links. If I include the plumber's website in their contact info, I'll be linking out to it on every zip code webpage. I assume this would seriously dilute any benefit my website would get from a reciprocal link arrangement. The plumber links to my IYP once and my IYP links to his website dozens of times.
3. Let's assume I go to the trouble to compile a list of all the plumbers in the country and this data doesn't exist anywhere else. I realize you can't copyright this kind of information, but is there some way to at least make it harder for a scraper to make an exact copy of my website? What do the big IYPs do?
I tend to think of online directories as categorical descriptive listings of websites with hotlinks, where applicable, to the website. Phone number info tends to be found "at the website".
So far as duplicate content and linking go I'm not in the "make it for Google" camp. I'm in the "make it work for users/visitors" camp. So I tend not to worry what Googlebot thinks. Let Googlebot figure out what's good for your directory users. If it can't then maybe Googlebot will eventually become irrelevant and all the worrying about what Googlebot thinks will prove to have been a waste. YMMV. Design for the benefit of users is all I'm going to add to the analysis. And, that's not to say that your information architecture should be . . lacking.
It's a large and bountiful world, with lots of fruits and vegetables in the market basket, so there's lots of room for variety and consumer preference. I wouldn't necessarily emulate one model or the other.
[edited by: Webwork at 12:48 pm (utc) on Oct. 22, 2008]
Any tips on how to protect my content? I guess the best defense is to be first in the marketplace with unique information AND to build good copyrightable content. It's just frustrating that all of the data that I've compiled for the listings could be grabbed up in no time at all.
I don't see the proposition as setting up a false dichotomy, especially in the realm of the "duplicate content dialogue". So much time has been invested in that dialogue that could be more fruitfully applied elsewhere.
First off, I'm never quite clear if people are clear about exactly what duplicate content is OR should be.
The only definition of duplicate content that a search engine should be concerned about is "content that duplicates content previously and originally published on another website".
Instead, I have read countless threads about people sweating about "duplicate content" on their own website - duplicate only by virtue that the content management system made the content accessible through a variety of ways of accessing that data. THAT type of "duplicate content" should never have been part of the Google duplicate content dialogue - but it was and still is - as evidenced by this thread and the OP's concern for duplicate content.
So, I respectfully say rubbish to your "false dichotomy", my dear Buckworks. ;)
In the case of duplicate content on one's own website, which duplication only reflects an attempt to facilitate user access, I say the problem is Google's to fix and one should not have to give it a second thought. Forget fretting about what to "no follow" and just devote your time to making the data accessible to users by offering both taxonomic aids AND on-site search.
I say it's a case of "OR" . . AND fiddlesticks to Google. :p
Any tips on how to protect my content?
I strongly suggest you take to reading the material put up by our fellow moderator IncrediBILL, who has spent years fending off bots sent to scrape his website's content.
IF you can slow down and/or frustrate the bots you are ahead of the game, but it's a never-ending battle.
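For anyone wondering what "slow down" can look like in practice, one common approach is a per-client request budget. What follows is only a minimal sketch, assuming you can identify a client by IP address and that you pass the current time in yourself; the class name and the thresholds are made up for illustration and are not taken from anything IncrediBILL has published.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_hits` requests per client within `window` seconds."""

    def __init__(self, max_hits: int, window: float) -> None:
        self.max_hits = max_hits
        self.window = window
        # client id -> timestamps of that client's recent allowed requests
        self._hits = defaultdict(deque)

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False  # over budget: respond with a delay, a 429 or a captcha
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_hits=3, window=10.0)
results = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

A human browsing a few zip pages stays under the budget; a scraper walking every zip code in sequence hits it quickly. Determined bots rotate IPs, so treat this as friction rather than protection.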
Also, read about what you can and cannot expect to copyright when it comes to directories AND go to the source - the U.S. Copyright Office (copyright.gov) - and not the forums for the most authoritative answer. Forums are full of misinformation, most of which leans towards "it's okay to rip off directory content".
No, not entirely, but you gotta do some reading in order to work your angles . . so you can successfully sue and bankrupt the bad actors . . or at least wipe out their Adsense accounts for a long time . . :-/
duplicate only by virtue that the content management system made the content accessible through a variety of ways of accessing that data
No, it was duplicate because the CMS generated pointlessly redundant pages by presenting large chunks of identical content in multiple locations (different URLs). It would be much better to figure out ways to provide a "variety of ways of accessing" that lead users to the same content on the same URL (or at least fewer URLs than some CMSes generate).
In my experience, there is no problem with duplicate content in the form of single paragraphs or lists of details that end up mixed and matched on different category pages. As long as the resulting pages end up sufficiently different from each other, Google would have no reason to filter any of them. Many businesses might end up listed in more than one category or zip code, for example, but different zips or category pages would end up with a very different mix of listings.
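If you want a rough number for "sufficiently different", one option is to compare the set of listings on two pages. This is a minimal sketch, assuming each page can be reduced to a set of business IDs; the function name and the sample IDs are hypothetical, and nothing here reflects how Google actually measures similarity.

```python
def listing_overlap(page_a: set[str], page_b: set[str]) -> float:
    """Jaccard similarity: listings shared by both pages / listings on either page."""
    if not page_a and not page_b:
        return 0.0
    return len(page_a & page_b) / len(page_a | page_b)


# Two neighbouring zip pages that share a single plumber.
zip_one = {"acme", "bobs", "cityflow"}
zip_two = {"acme", "drainpro", "ezpipe"}
overlap = listing_overlap(zip_one, zip_two)  # 1 shared listing out of 5 distinct
```

Pairs of pages that score close to 1.0 are the ones worth merging or differentiating; pages with a low score share a plumber or two but are otherwise a different mix.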
I say the problem is Google's to fix
They do, by filtering out results that they perceive as pointlessly redundant.
fiddlesticks to Google
No comment ...
From the OP:
The plumber links to my IYP once and my IYP links to his website dozens of times.
Then the plumber would see your site as a valuable place to get listed. That is A Good Thing. However, if his link would show up on "dozens" of URLs you should try to tighten up how you're presenting your content so there are fewer URLs involved.
pointlessly redundant pages . . identical content in multiple locations . .
There is "the data", which isn't redundant and isn't in multiple locations, and then there's the bot, which blindly navigates by links - starting with the premise that it must visit a link - unless no-followed - and collect "the data found there" and then process the data - to determine if it has seen the same data "elsewhere".
So, no-follow or, no, follow and figure it out?
A human might not have trouble "seeing" that archives, for example, can exist in a variety of settings: by author, by date, by topic AND easily deal with that, i.e., not have duplicate content issues. But, then there's the bot . .
When Google decides to grow up and to enable authors, copywriters, content creators, . . whatever . . to "submit their creation" (if they choose) for proof of initial authorship then we will be a long way ahead in the resolution of the real duplicate content issue.
AND, when the bot visits any given URL - to "find" the data - then it needn't choke on the fact that "it's elsewhere" when it - the bot - knows it needn't concern itself with "the other", since that's only a matter of allowing visitors to access the same data by whatever path is their preference: Do I want to scope out the site, by author, since some authors are better than others? Do I want to scope out the site by topic, since that's all I'm interested in? Etc.
So, as you said - the bot can (and to my understanding does) - do a decent job of "seeing onsite duplicate content", but that's not - or should not be contextualized by anyone - as duplicate content to be worried about, as in a site being penalized "for duplicate content". On site "duplicate content" should never be a penalty trigger, at least not in the case of popular CMSes that choose such an approach to data access.
I have no bones to pick with anyone - the millions - who will, by their or any consensual wisdom, build with the bot in mind but, so far as my little peashooter of a brain is concerned, the bot needs to be able to sort things out. The world should forget about "designing for Google" and design according to whatever works for humans. IF shoddy human information architecture of popular CMSes is the order of the day so be it, until such time as - for no other or better reason than the function of the CMS itself - the CMS is redesigned.
No comment ...
The bot couldn't care less about you or me or our world view or reason or anything of our minds, design or creation. It unceremoniously tosses off sites, without explanation, every day - even "good ones". So, I stand by the comment: Fiddlesticks. Build, as best you can, for your users. Buckworks, on the other hand, will wisely build for the bot and benefit therefrom, until that moment when the bot by some caprice or design shows Buckworks how little all her honest intelligent endeavors mean to the bot. One can only hope that by then - should such a misfortune arise - she will have built a stream of defensible traffic such that she, too, can say with the same unperturbed zest - "Fiddlesticks!". ;)
Sorry to digress.
To the OP: Listen to Buckworks. I'm just brewing up a little tempest in a teacup of a revolution, in hopes that the monolith - Google - becomes just one of many entry points "to search" and becomes far less "important" in the scheme of things. I just see no good coming from this Google monoculture. YMMV.
[edited by: Webwork at 5:36 pm (utc) on Oct. 22, 2008]
There is "the data", which isn't redundant and isn't in multiple locations
A particular piece of data might reside in one and only one place within the database, but when live pages are presented on the web that same piece of data can indeed appear in multiple locations.
that's only a matter of allowing visitors to access the same data by whatever path is their preference
Yes, but it doesn't necessarily follow that those multiple paths should lead to multiple landing URLs to read the same content.
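To make that concrete: the browse paths can differ while the landing URL stays the same. Here is a minimal sketch, assuming every browse URL ends in the business slug and that profiles live under a hypothetical /business/ prefix; the function name and URL layout are invented for illustration.

```python
from urllib.parse import urlsplit


def canonical_profile_path(url: str) -> str:
    """Map any browse URL that ends in a business slug to one canonical path."""
    path = urlsplit(url).path.rstrip("/")  # query string and trailing slash are ignored
    slug = path.rsplit("/", 1)[-1]         # last path segment is the business slug
    return f"/business/{slug}"


by_zip = canonical_profile_path(
    "https://example.com/zip/90210/plumbers/acme-plumbing-123?sort=name"
)
by_category = canonical_profile_path(
    "https://example.com/category/plumbers/acme-plumbing-123/"
)
```

Both calls produce the same path, so the zip page and the category page can each link or redirect to that single profile URL rather than rendering their own copy.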
the bot needs to be able to sort things out
Savvy webmasters will build their sites so things are well sorted before the bot ever sees them.
Build, as best you can, for your users. Buckworks, on the other hand, will wisely build for the bot and benefit therefrom
No, it is not "on the other hand". I aim at both. If you think that pleasing users means displeasing the bots or vice versa then do some more thinking. It's certainly possible to focus on one and forget the other ... many people do ... but there's no inherent contradiction between the two goals.
If you take care with the sort of details that would make your content come across more intelligently to a non-sighted user, you'd automatically do a better job of pleasing the bots even if you had no idea they existed.
a stream of defensible traffic
The kind of work it takes to gain exposure in places that would send direct traffic often pleases Google along the way. That's another "both/and".
I just see no good coming from this Google monoculture.
Many would agree with that, myself included, but we have to deal with the world as it is.
but when live pages are presented
The idea that one is creating duplicate content arises from the view that a Link=URL=Document=Page.
When is a query a query and not "a page reference"?
Only when a query is instantiated as plain text in a search box?
Isn't the problem Google's difficulty gleaning meaning from links themselves? Discerning when a link is a trigger designed to return a certain "view of data" versus "(entirely) new data"?
What about Google advancing its analysis of links themselves? Why stop at "no follow"? Why not "archive by author dataview"? Argh. Not my job.
As an aside, perhaps we should all emulate Google, forgoing website design by designing "websites" around a datastore, a search box and a bit of text? Perhaps then we could better spend our time and resources on stocking the shelves in the datastore - adding to the data, information, content - instead of constantly moving the cans, boxes and "product" around on the shelves in an effort to please the bot.
Ack. Ignore me. Buckworks no doubt grows rich by knowing how to make the bot happy and respecting that, whilst I fiddle about.
Yes, I know, put away the fiddle. Amen.
[edited by: Webwork at 2:05 pm (utc) on Oct. 24, 2008]
constantly moving the cans, boxes and "product" around on the shelves in an effort to please the bot
That comment reflects a seriously limited view of what it takes to please a bot. You don't please the bots by tweaking just for the sake of tweaking, you please them by building on good foundations in the first place, like logical file structures, logical semantic structures, good source ordering, intelligent navigation, lean clean code, paying attention to the details that ensure accessibility for all users, and so on. If those foundations are in good order, you won't have to think about them very often and you can indeed focus on the content.
Note that every one of those bot-pleasing things will improve the user experience in some way, some for a few users, some for every user. Both/and.
It's also worth noting that "designing your site for users" implies that users are able to find your site in the first place.
a seriously limited view
In the context of "directories" I prefer to think of it as taking a concept to a logical extreme.
If Google is the gold standard of information discovery, retrieval, indexing and access - with the latter defined by a page with nothing more than the iconic search box - then what's to hold everyone else to the standard of looking not Google and acting not Google?
Heaven forbid, whilst being not Google, those that are not Google fail to serve to Google the information architecture, taxonomies, titles, metas, keyword density, etc. that Google says is so important to the WWW. To everyone else, that is, but Google - so far as "websites" go.
Who knows. Maybe the next evolutionary step in the development of the increasingly botted WWW, heavily mined by non-human "visitors", will be the emergence of query dependent websites - en masse.
What if a directory didn't look like a directory? What if, instead, it looked like a directory fashioned on-the-fly, like Google's SERPs? Entirely possible.
Interesting to think how this might play out with future implementations of existing forms of software. Say a forum where there was no index, just a query box or a comment box, with a backend that filters or sorts and "files" comments. Say, "me too" type comments are sorted and filed into an enormous "Me Too" thread - so the accumulated value of the added comment eventually adds up to something?
Does a directory need a "directory front end"? I guess so, since structure can be an aid to discovery. I guess a directory that is driven by a well-designed database would also be one pleasing to Google, but what if databases or DBMSes evolve towards more "loose" or "fuzzy" models, something akin to emulating a human brain/mind?
Maybe taxonomies, site architecture, navigation, presentation layers, frontends and all the rest is overrated and a burden in an information economy. Maybe Google will be a driving force leading to that eventuality.
Just a thought to play around with. Maybe it really is true, at least in the end, that it's the content that matters?
Back to the real work of grinding out keywords and taxonomies and link navigation structures and . . . Bleh. ;)
[edited by: Webwork at 5:07 pm (utc) on Oct. 24, 2008]
taking a concept to a logical extreme
If you're going to take something to a logical extreme, make sure your logic is sound to start with.
that Google says is so important to the WWW
It's not just Google. People have been working on the art and science of information retrieval since long before Google existed.
Stop fixating on Google and see the bigger picture. There is a bigger picture here, y'know, bigger than just Google.
since structure can be an aid to discovery
Yep, for humans and bots alike.
Methods of organization can vary a lot depending on who is doing the organizing, and also who else might be affected by their efforts. If you're creating something for personal use only, you can do whatever you want. If you want to create something that will be used by / useful to others, you'll have to consider their needs and preferences as well as your own.
A simple example from the physical world: if there is more than one cook in the kitchen, they'll need to organize the cupboards so they each know where to find what they need. There could be many ways to set up a kitchen, but there tends to be a lot of similarity in how experienced cooks organize their tools and ingredients. The similarity evolves from the nature of the tasks conducted in the space.
taxonomies, site architecture, navigation, presentation layers, frontends and all the rest is overrated and a burden in an information economy
Is the cup a burden to the tea it holds?
Let's try to get this thread back on topic. ;)
I work at a large IYP so this is interesting to me. I like Webwork's distinction that an IYP is phone-number centric while a web directory is a list of URLs that rarely includes telephone numbers.
Many IYPs (think Europe, Asia, etc as well as North America) are not well optimised or not at all. However, paid-listing based companies (not AdSense based) need to show "value" to their advertisers. For us this means checking all sorts of things, such as not showing the phone number in a Google SERP snippet. We need the click to the Business Profile Page to count that visit. If the searcher got the phone number in the SERP, we won't know that the advertiser got the benefit and worse, the searcher might tell the advertiser that they were found in Google, not the IYP. :(
You have to decide whether to generate a spiderable page for every postcode where an advertiser chooses to be found. For florists and similar "come-to-me" businesses, this could be every postcode in the country. In Australia we have about 11,000 of these. We have not done this yet but intend to. The pages will differ in what appears in their title tag, metas, H1 and a little blurb that says that this is an out-of-area advertiser who services this area.
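For readers who want to picture the per-postcode variation, here is a minimal sketch, assuming each advertiser has one home postcode and that only the title, meta description, H1 and blurb change from page to page. The field names, wording and sample advertiser are hypothetical and are not our production templates.

```python
from dataclasses import dataclass


@dataclass
class Advertiser:
    name: str
    category: str
    home_postcode: str


def page_elements(adv: Advertiser, postcode: str, locality: str) -> dict[str, str]:
    """Build the page elements that differ between postcode pages."""
    out_of_area = postcode != adv.home_postcode
    # Only out-of-area pages carry the explanatory blurb.
    blurb = (
        f"{adv.name} is based outside {locality} but services this area."
        if out_of_area
        else ""
    )
    return {
        "title": f"{adv.name} - {adv.category} in {locality} {postcode}",
        "meta_description": f"Contact {adv.name}, {adv.category.lower()} servicing {locality}.",
        "h1": f"{adv.category} servicing {locality}",
        "blurb": blurb,
    }


florist = Advertiser("Rose Florist", "Florists", "2000")
away = page_elements(florist, "3000", "Melbourne")
home = page_elements(florist, "2000", "Sydney")
```

Everything else on the page (address, phone, description) stays identical, which is the duplicated portion I am talking about below.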
I don't consider this duplication of content although many page elements relating to that advertiser are the same. When you consider the static page elements, then you could say that over 50% of each page is duplicated. However, this is where I have faith in the algorithm's ability to ignore navigation and static elements on large sites. We're doing this as if search engines did not exist -- this is G's recommendation to all of us. If the advertiser paid to show up for any locality, they would be upset if they did not.
I also believe that high TrustRank sites survive filters better than those of minor players, such as my private sites.
You could choose to let the site search deal with out-of-area situations such as the above, which means that they don't show up in a Google search, but they do in a site search.
Depending on how comprehensive your site is, some listings may require address suppression, e.g. certain adult listings such as escort services. Although you know their billing address, you have to ensure you don't show their page differently for that suburb, otherwise you have broken that part of the contract.
For reciprocals from the advertisers, they should be encouraged to link where they choose. Most will give just one link - either to your home page or to their own suburb page.
Our business profile pages don't have PR-passing URL links. They are not nofollowed but go through our counter. I suspect this is why many of these pages show PR5 and possibly more. If you want to pass PR, you could no-follow all the out-of-area pages but not the single in-area page.
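The last suggestion could look something like the sketch below, assuming the template knows whether the current page is the advertiser's single in-area page. The function name is hypothetical, and it deliberately leaves out any click counter; it only shows where the nofollow would go.

```python
from html import escape


def website_link(name: str, url: str, in_area: bool) -> str:
    """Render the advertiser's website link; nofollow it on out-of-area pages."""
    rel = "" if in_area else ' rel="nofollow"'
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(name)}</a>'


in_area_link = website_link("Acme Plumbing", "https://acme.example/", in_area=True)
out_of_area_link = website_link("Acme Plumbing", "https://acme.example/", in_area=False)
```

That way the advertiser gets one followed link from their own suburb page no matter how many postcodes they appear in.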
Scraping concerns? The best you can do is make a token effort to show in a lawsuit that you took reasonable steps to protect your IP. Put a fake listing in every heading or at least the popular ones. Then you can decide whether you want to bother with a lawsuit or not. A DMCA note to the SEs will be a lot cheaper and more effective than a strongly worded letter from your lawyer.
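Checking a suspect site for your fake listings is easy to automate. A minimal sketch, assuming you have already collected the suspect site's listings as rows with name and phone fields; the function names and the sample data are invented for illustration.

```python
import re


def _key(name: str, phone: str) -> tuple[str, str]:
    """Normalise a listing so formatting differences don't hide a match."""
    return (name.strip().lower(), re.sub(r"\D", "", phone))


def planted_listings_found(
    planted: list[tuple[str, str]], suspect_rows: list[dict[str, str]]
) -> list[tuple[str, str]]:
    """Return the planted (name, phone) listings that appear in the suspect data."""
    suspect_keys = {_key(row["name"], row["phone"]) for row in suspect_rows}
    return [p for p in planted if _key(*p) in suspect_keys]


planted = [("Zyx Plumbing Co", "(02) 5550 0147"), ("Quokka Drains", "(08) 5550 0199")]
suspect = [
    {"name": "zyx plumbing co ", "phone": "02 5550 0147"},
    {"name": "Real Plumber", "phone": "02 9999 0000"},
]
hits = planted_listings_found(planted, suspect)
```

A planted listing that exists nowhere else turning up on another site is the kind of evidence you would attach to a DMCA notice.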
You won't stop actual theft of data. Most businesses are already listed online and it is hard to prove theft. Tell your advertisers that their listings can also be found "on the Internet" so you are giving them added value.