
Tricky robots.txt issue

Only want the home page crawled.

12:47 am on Jun 24, 2010 (gmt 0)

Senior Member from CA


Due to duplicate content issues, we only want the home page crawled,

e.g. www.example.com

What would the robots.txt file look like if only the home page URI is to be crawled -- no file extensions and no other files?

I have found no information on how to do this.
1:05 am on June 24, 2010 (gmt 0)

tangor, Senior Member from US


Does your home page have links to other pages? If so, consider all of those indexed, too... and if you don't want that, then noindex,nofollow those links and prepare to include in robots.txt every folder/file on your site...

Given the goal stated above, this does not seem to be a robots.txt issue...
2:29 am on June 24, 2010 (gmt 0)

jdmorgan, Senior Member


> then prepare to include in robots.txt every folder/file on your site...

That's not really necessary. Although the universally-supported original Standard for Robot Exclusion defines only the "Disallow" directive, you could disallow fetching of all pages except the home page using 26 Disallows (one per letter) -- or 36 if you include digits, or a few more than that -- because robots.txt uses prefix-matching.

So, 26 lines like

Disallow: /a
Disallow: /b
...
Disallow: /y
Disallow: /z

would disallow all resources whose URL-paths begin with a-z. If any of your URL-paths begin with digits or non-encoded punctuation marks, you can add Disallow lines for those initial characters as well.
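
To make that concrete, a complete file along those lines might look like the sketch below -- an illustration only, assuming every URL-path on the site begins with a letter or a digit. Note that the root path "/" matches none of these prefixes, so the home page itself stays crawlable:

User-agent: *
Disallow: /a
Disallow: /b
# ... one line for each of c through x ...
Disallow: /y
Disallow: /z
Disallow: /0
# ... one line for each of 1 through 8 ...
Disallow: /9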

Jim
3:57 am on June 24, 2010 (gmt 0)

tangor, Senior Member from US


Jim, as always, you remain a treasure trove of knowledge. I still maintain that if the index page is crawled, whatever is linked from that page (which is what makes the index page WORK) will appear in the SERPs. We know Google grabs EVERYTHING, even if disallowed, and Yahoo and Bing do the same.
9:57 am on June 24, 2010 (gmt 0)

phranque, Administrator


Note that robots.txt is intended to exclude robots from crawling a page.
If a URL is discovered through an internal link from your home page or through an inbound link, that URL can be indexed without the content ever being crawled.
If you prefer that neither the URL nor the content be indexed, you must allow crawling of that content so that the robot can see either of the following (examples after this list):
- the robots noindex meta tag you have placed in the <head> of your HTML document, or
- the X-Robots-Tag HTTP header with a noindex value returned with the requested resource.
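
For illustration, the meta tag version is a single line in each page that should stay out of the index:

<meta name="robots" content="noindex">

And the header version can be set server-side -- a minimal sketch assuming an Apache server with mod_headers enabled, using PDFs (which have no <head> to carry a meta tag) as the example resource:

<Files "*.pdf">
Header set X-Robots-Tag "noindex"
</Files>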

> to index the home page URI, with no file extensions and no other files.

Nothing in the Robots Exclusion Protocol will canonicalize your home page URL or rewrite it to an extensionless URL, if that's what you are asking.
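
If the underlying concern is duplicate content, the usual tool for that is a canonical link element rather than robots.txt -- a minimal sketch, with www.example.com standing in for your own host, placed in the <head> of each duplicate page:

<link rel="canonical" href="http://www.example.com/">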