Forum Moderators: Robert Charlton & goodroi
I need your advice. Let's say you decided to add a component to your site that's best implemented by adding lots (thousands) of algorithmically-generated pages to your site. Let's say they are all technical data tables. These pages obviously don't carry much weight as far as search engines are concerned, but having them in the right places is *really* what the users of the site need.
How would you do it without angering Google and incurring some sort of crazy penalty? Assume you do NOT actually need them in Google's index (or at least don't care).
I was thinking of the following approach: the pages are generated and placed in a special directory /auto-gen-pages/, after which the directory is closed to spidering in robots.txt. Seems like a nice solution, right? Only I have one fear: my site is still going to have links going into this directory. Would some penalizing algorithm then decide that I am up to no good because I have tons of links pointing to a "closed" directory? (Which I assume could be a sign of a commercial redirect or something like that?)
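Just to make it concrete, the robots.txt I have in mind would be roughly this (the directory name is only an example):

  User-agent: *
  Disallow: /auto-gen-pages/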
Another way I was thinking of was letting them spider the pages, but making sure the links to them are "nofollow". I am being totally honest here, right? I don't even want Google to take my own pages into consideration. Then again, I seem to remember a certain excessively powerful individual named "Matt" saying something along the lines that all links marked "nofollow" are considered paid links and treated as such. Well, I am not selling links, I just don't want Google to assign any weight to them! I don't want them to penalize my site for selling links (especially when it's not true!)
Finally, there's the option to "just put it out there and not worry about anything". However, my intuition tells me I am almost certain to get penalized for adding hundreds or even thousands of pages quickly.
It seems like an extremely simple task, yet I can't come up with a penalty-free solution. It's driving me absolutely mental.
What would YOU do?
Thanks in advance
All you have to do is put the pages in a robots.txt disallowed directory.
Nofollow doesn't carry any penalties or suggest that the link is paid for. It means you don't vouch for a link.
Penalty for linking to a robots.txt-disallowed directory? That's just silly.
Nofollow isn't necessarily a good option: if other people link to a nofollowed page it'll still get indexed, so it only really helps if you're sure no one else will link to these pages.
The general white hat rule of thumb is to use nofollow on outbound links, and meta noindex/robots.txt disallow to discount internal pages.
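For example (example.com is just a placeholder):

  Outbound link you don't vouch for:
  <a href="http://example.com/page.html" rel="nofollow">some page</a>

  Internal page you want discounted:
  <meta name="robots" content="noindex">
  (or a robots.txt disallow covering that section of the site)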
All you have to do is put the pages in a robots.txt disallowed directory.
Google will still index/cache the pages even when disallowed by robots.txt if you are linking to them (this has been discussed many times).
You need to generate each page with a robots meta tag of <meta name="robots" content="noindex,nofollow">
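In other words, the <head> of every generated page would carry something like this (the title is just a placeholder):

  <head>
  <title>Data table for widget XYZ</title>
  <meta name="robots" content="noindex,nofollow">
  </head>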
You're worrying too much.
I disagree. This is what the 2007 version of Google does to whitehat webmasters. If you try to improve your site, you have to worry about how Google could potentially see your improvement as some sort of spam trick and penalize you for it. It's completely ridiculous IMO, but that's just how it works.
Anyway, if you truly don't care about the pages, just use robots.txt to block them to be extra safe. However, just adding these pages and leaving them wide open (not blocking them with robots.txt) will also be fine. I don't think you have to worry about a penalty like this. Big sites do not mean higher rankings. It's more about the overall site theme and the theme/trust of inbound links these days that allow you to rank well.
It doesn't matter if Google still indexes it. If you don't link to it you're not losing any link juice to those pages, so the ranking of the rest of your site isn't impacted. That's all that matters. If you see a few URL-only listings in the main index, so what?
"This is what the 2007 version of Google does to whitehat webmasters."
No, that's what webmasters choose to do to themselves.
So why is it that we get slammed in the SERPs, while sites like eBay, which add 1,000 times more pages than we do, get higher and higher rankings?
If you don't link to it you're not losing any link juice to those pages, so the ranking of the rest of your site isn't impacted. That's all that matters.
Please re-read the opening post:
Only I have one fear: my site is still going to have links going into this directory. Would some penalizing algorithm then decide that I am up to something no good because I have tons links pointing to a "closed" directory?
Which is laboring over a penalty that doesn't exist.
The right solution here is robots.txt disallow, not META ROBOTS NOINDEX, or REL=NOFOLLOW.
META ROBOTS is more granular and takes more time than putting one disallow in a txt file.
Having URLs linger in the SERPs does no harm to a site.
Which is laboring over a penalty that doesn't exist.
I've seen many sites drop like a rock after adding thousands of pages all at once. Some have taken quite a while to recover, too...
robots.txt will do nothing to keep Googlebot from indexing files if you are linking to them.
META ROBOTS is more granular and takes more time than putting one disallow in a txt file.
Sure, but it's the only thing that actually works, isn't it? Why would you bother disallowing in robots.txt if the pages are going to be indexed anyway?
Besides, he is generating thousands of pages... I think he can manage to stick a simple noindex meta tag into the template.
edited: Halfdeck - Please see [webmasterworld.com...] also, because I think you are having trouble understanding how Google crawls files that are linked to regardless of robots.txt.
Seen many sites drop like a rock after adding 1000's of pages all at once. Some have taken quite a time to recover also...
That's largely a trust/PageRank issue due to splitting a domain's PageRank into 100000 bits and publishing them all at the same time, so that Google doesn't allow full PageRank to flow into them due to lack of trust.
You can address that issue by preventing those pages from leeching PageRank or getting crawled using robots.txt, so that links to disallowed URLs do not pass PageRank.
That thread has to do with the correct use of robots.txt, 404 and 410.
Robots.txt tells Google not to crawl a page. It doesn't say whether or not a page exists.
404 tells Google the page isn't found on the server for whatever reason.
410 tells Google the page is gone.
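(If you ever did need to send a 410, one quick way on Apache - assuming mod_alias is loaded, and the path here is just a placeholder - is a line like:

  Redirect gone /removed-page.html

but that's not the situation here.)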
In this particular case, the urls are not gone. He just doesn't want Google to index them. Both robots disallow and META NOINDEX will accomplish close to the same thing. Of course META NOINDEX completely prevents a URL from getting indexed, while robots.txt disallow depends on backlinks. 100,000 disallowed URLs still do not lower Google's trust in your site nor do they water down your site's PageRank so that your site gets sucked into the supplemental index.
So what if a couple of sites happen to link to one of the disallowed URLs and they get indexed? Let them get indexed. Google still knows from your robots.txt you're not trying to spam their index.
Nice splitting hairs with you :)
So what if a couple of sites happen to link to one of the disallowed URLs and they get indexed? Let them get indexed. Google still knows from your robots.txt you're not trying to spam their index. Nice splitting hairs with you
Not sure why you consider it splitting hairs? If you don't want files indexed you have to be sure to use the meta tag, as robots.txt will not prevent this. Two completely different things... hehe
So no, it's not splitting hairs...
If you don't want files indexed you have to be sure to use the meta tag as the robots.txt will not prevent this.
His principal worry is angering the "Google Gods." If his end goal were to prevent URLs from getting indexed, then of course META NOINDEX would be the failsafe method. But that's not his end goal. His end goal is to avoid a penalty.
For that all you need to do is add a line in robots.txt.
Ok, are we done now? :)