how to get pages "unindexed"?

Forum Moderators: phranque

didn't have an index in the root so every subfolder got crawled

Shipintern

2:24 pm on May 20, 2010 (gmt 0)

I set up a new hosting account in order to host many different websites as add-on domains. However, I did not intend to have any site on the root domain, so I never put an index.html file in the root. Then I found out the search engines were crawling all the folders, so now, for every site I have under the account, both of these are indexed:

website-1.com
example.com/website-1.com

website-2.com
example.com/website-2.com

website-3.com
example.com/website-3.com

etc...

Today I added the index.html file, but I wanted to know if there is anything I can do to get those subfolders out of the search engines so they are not double-indexed and do not appear as duplicate content to Google.

(p.s. moderator feel free to move this thread if there is a more appropriate section of the forum. thanks)

[edited by: incrediBILL at 7:38 pm (utc) on May 20, 2010]
[edit reason] Only use EXAMPLE.COM for sample domain names [/edit]

physics

8:49 pm on May 20, 2010 (gmt 0)

robots.txt? [google.com...]
I'm a little confused about your example.com/website-1.com line. Is website-1.com the name of the HTML page indexed?

jdMorgan

10:19 pm on May 20, 2010 (gmt 0)

You could use mod_rewrite to redirect requests for
example.com/<something>.com/<optional-path>
to http://<something>.com/<optional-path>, as in


RewriteRule ^(([^./]+\.)+com)(/(.*))?$ http://$1/$4 [R=301,L]

If you install this redirect, do NOT Disallow indexing of the /<something>.com subfolders using robots.txt. The search engines must be allowed to fetch those 'bad' URLs in order to receive the redirect response.
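As a rough sketch of how that rule might sit in the .htaccess file in the root of the primary account: the RewriteEngine line is required for any RewriteRule to run, and the RewriteCond host check is an addition that is not part of the rule as posted. It assumes example.com is the primary domain, as in the thread. Depending on how the add-on domains are mapped on the server, the root .htaccess may also be read for requests made directly to an add-on domain, and limiting the rule to the primary host name guards against a redirect loop in that case.

```apache
# Root .htaccess of the primary account (sketch; example.com is assumed
# to be the primary domain).
RewriteEngine On

# Assumption, not from the thread: only act on requests that arrived
# under the primary host name, never on requests to an add-on domain.
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]

# example.com/<something>.com/<optional-path>
#   -> http://<something>.com/<optional-path>
RewriteRule ^(([^./]+\.)+com)(/(.*))?$ http://$1/$4 [R=301,L]
```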

If you don't use mod_rewrite for any other functions, and do not plan to ever do so in the future, then you could use mod_alias's RedirectMatch directive:


RedirectMatch 301 ^/(([^./]+\.)+com)(/(.*))?$ http://$1/$4

The extra complication in the regex patterns prevents a redirect to http://www.something.com (missing trailing slash), which the server would then redirect a second time, making this 'solution' less effective. Some search engines don't like multiple/stacked/chained redirects and some don't handle them correctly, so it's best to fix everything all at once.
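To make the pattern easier to follow, here is the same RedirectMatch line with the capture groups traced in comments for two sample requests. The paths /website-1.com/page1 and /website-1.com are illustrative, taken from the folder names used earlier in the thread.

```apache
# Request path: /website-1.com/page1
#   $1 = website-1.com   (one or more dot-terminated labels, then "com")
#   $2 = website-1.      (last repetition of the inner label group)
#   $3 = /page1          (optional slash plus the rest of the path)
#   $4 = page1           (the rest of the path without the leading slash)
#   Target: http://website-1.com/page1
#
# Request path: /website-1.com   (no trailing slash)
#   $1 = website-1.com; $3 and $4 are empty
#   Target: http://website-1.com/   (the slash comes from the target itself)
RedirectMatch 301 ^/(([^./]+\.)+com)(/(.*))?$ http://$1/$4
```

Because the slash after $1 is written into the target rather than captured, the redirect always lands on a URL with a path, which is what avoids the second redirect.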

Jim

[edit] Corrections as noted below. [/edit]

[edited by: jdMorgan at 3:37 am (utc) on May 25, 2010]

Shipintern

3:07 am on May 21, 2010 (gmt 0)

for example, see the following google search and you will see what i'm talking about:

<snip>

[edited by: jdMorgan at 3:56 am (utc) on May 21, 2010]
[edit reason] No specifics, please. [/edit]

Shipintern

4:40 am on May 21, 2010 (gmt 0)

sorry, can't show you specifics. think of it this way: my main website (call it primarydomain.com) is in the root of the hosting account. i then have several other websites that are 'add-on' domains to that hosting account. (i don't think this should need explaining.) but the sites all got indexed twice, and if i do a google search for

inurl:otherwebsite.com

then it shows two sets of results:

otherwebsite.com
primarydomain.com/otherwebsite.com

otherwebsite.com/page1
primarydomain.com/otherwebsite.com/page1

etc etc.

[edited by: phranque at 6:31 am (utc) on May 24, 2010]
[edit reason] disabled graphic smileys ;) [/edit]

phranque

6:45 am on May 24, 2010 (gmt 0)

the 301 status code response as suggested by jdMorgan is the best solution to your problem.

Shipintern

2:32 pm on May 24, 2010 (gmt 0)

sorry, but is this just a method to redirect, or would this actually get the link containing the root domain 'de-indexed'? When I first saw the post by jdMorgan I just assumed that it was a redirect. When you do a redirect like that, does the search engine ultimately stop paying attention to the redirected URL and un-index it? thanks

lammert

12:56 am on May 25, 2010 (gmt 0)

A 301 redirect signals to the search engines that the URL the redirect originates from should be removed from the index, and that all link juice from that URL should be added to the URL the redirect points to. With a redirect, the search engines will therefore not only un-index the wrong URL, but any link juice acquired by that wrong URL will be used to increase the rankings of the URL you want indexed, which is a small advantage compared to just blocking the wrong URLs with robots.txt or an equivalent method.

Shipintern

1:20 am on May 25, 2010 (gmt 0)

yea, okay that makes sense. and 301s are easy enough. can i just put this in the .htaccess file then?

RewriteRule ^(([^./]+\.)+com)(/(.*))?$ http://$1.com/$4 [R=301,L]

all that code is over my head. i've always done them like

Redirect 301 /link.html http://example.com/newlink.html

[edited by: phranque at 3:29 am (utc) on May 25, 2010]
[edit reason] fixed urls [/edit]

jdMorgan

3:36 am on May 25, 2010 (gmt 0)

Reviewing the code, it needs correction, as what I originally posted would have doubled up on the ".com":


RewriteRule ^(([^./]+\.)+com)(/(.*))?$ http://$1/$4 [R=301,L]
-or-
RedirectMatch 301 ^/(([^./]+\.)+com)(/(.*))?$ http://$1/$4

Jim

phranque

3:52 am on May 25, 2010 (gmt 0)

if you have existing rewrite rules in there make sure you order them from most specific to most general.
also if you mix mod_rewrite directives (RewriteRule) with mod_alias directives (RedirectMatch) be aware of issues that may arise from mixing "directives from different modules" [httpd.apache.org].
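A minimal sketch of that ordering, kept entirely within mod_rewrite so that no mod_alias directives are mixed in. The single-page redirect reuses the /link.html example from earlier in the thread; the RewriteCond host check is an assumption not given in the thread, with example.com standing in for the primary domain.

```apache
RewriteEngine On

# Most specific first: one moved page.
RewriteRule ^link\.html$ http://example.com/newlink.html [R=301,L]

# Most general last: any /<something>.com/<optional-path> request.
# Assumption: the host check limits the rule to the primary domain.
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteRule ^(([^./]+\.)+com)(/(.*))?$ http://$1/$4 [R=301,L]
```

Note that a RewriteCond applies only to the RewriteRule that immediately follows it, so the host check here does not affect the single-page rule above it.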

physics

9:36 pm on May 25, 2010 (gmt 0)

Shipintern, if you want to make sure the URL is de-indexed read this [google.com...] and read up on blocking pages with robots.txt

phranque

9:55 pm on May 25, 2010 (gmt 0)

note: robots.txt is from the robots exclusion protocol, which is about crawling, not indexing.
if you exclude the crawler it will never make the request and will therefore not see the redirect.
if the crawler finds a link to the excluded page it may index the url without crawling the page.