
Sitemaps, Meta Data, and robots.txt Forum

    
Disallowing .html/

rros
7:13 pm on Aug 20, 2010 (gmt 0)
My blog is creating pages that I haven't seen before and I'm not sure how to disallow them. These urls have extra bits appended after ".html" and are worthless pages. They all look like

mydomain.com/folder/page.html/somethingelse

How do I write robots.txt to prevent them from being indexed? I looked over this forum but could not find an answer. Thanks.

 

phranque
12:59 pm on Aug 24, 2010 (gmt 0)

robots.txt excludes the crawler from fetching the content, but it won't prevent the search engine from indexing a url it has discovered - such urls typically show up without a title or snippet.
if you want to prevent indexing, you must allow the url to be crawled and then add a robots noindex meta tag to the head of your html document:
<meta name="robots" content="noindex">
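
for comparison, a robots.txt rule to block crawling of these urls could look something like the following (googlebot supports the "*" wildcard; the pattern is an assumption based on your example url, and again it only blocks crawling - it does not de-index urls that are already known):

# block any url with extra segments after ".html" (wildcard support is a google/bing extension)
User-agent: *
Disallow: /*.html/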

is it possible that those "worthless" urls are being caused by incorrect relative urls in anchor tags that should be absolute urls?
if this is the case you should fix the relative urls to make them absolute.
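as a quick illustration (the file names here are just placeholders): a relative href resolves against whatever url the page was requested under, so if the page is ever reached with extra path segments the bad urls multiply, while a root-relative href always resolves the same way:

<!-- relative: requested as /folder/page.html/extra this resolves to /folder/page.html/other.html -->
<a href="other.html">other page</a>
<!-- root-relative: always resolves to /folder/other.html -->
<a href="/folder/other.html">other page</a>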

or are they caused by inbound links?
in this case they should serve a 404 Not Found status code response or a 301 redirect to the canonical url.
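a minimal .htaccess sketch of the 301 option, assuming apache with mod_rewrite and that the canonical page always ends in ".html":

RewriteEngine On
# send /folder/page.html/anything back to /folder/page.html with a 301
RewriteRule ^(.+\.html)/.+$ /$1 [R=301,L]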

if you are not sure whether they are coming from internal links, you can use a link checker such as Xenu's Link Sleuth to find them.

rros
3:49 am on Aug 25, 2010 (gmt 0)

Thank you for the very detailed answer, phranque. You made me take a second look and voilà! Those urls came as a result of uploading images to WordPress as attachments. The script would create yet another page to hold each image, which produced those strange urls. I have now gone back and changed all links to images into direct links, which is actually better and faster for the user.

The original bad urls still live on the server.

Is it true that Google may drop them, since they may no longer be linked from any other page? I went into the database and found the table "wp_attachment_metadata" that appears to hold the bad links. But I may have to hire one of the B-Dienst specialists who broke the British Naval codes in 1935 to find out where they really are. Another alternative would be to 301 them to the appropriate image files. Any suggestions, please?

phranque
9:20 am on Aug 26, 2010 (gmt 0)

you have 3 choices here:
- 301 redirect to the image file
- meta robots noindex or X-Robots-Tag header
- 410 Gone status code response (you can do this with the G flag on a RewriteRule)
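
a minimal .htaccess sketch of the 410 option, assuming apache with mod_rewrite and that the leftover attachment urls still match the ".html/something" pattern from your first post:

RewriteEngine On
# answer the old attachment urls with "410 Gone" so they drop out of the index
RewriteRule ^(.+\.html)/.+$ - [G,L]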

rros
6:24 pm on Aug 26, 2010 (gmt 0)

Thank you!
