phranque

msg:4191404 | 12:59 pm on Aug 24, 2010 (gmt 0) |
robots.txt excludes the crawler from getting the content, but it won't prevent the search engine from indexing a url that has been discovered, typically without a title or snippet. if you want to prevent indexing, you must allow the url to be crawled and then add a robots noindex meta tag to the head of your html document:
<meta name="robots" content="noindex">

is it possible that those "worthless" urls are being caused by incorrect relative urls in anchor tags that should be absolute urls? if this is the case you should fix the relative urls to make them absolute.

or are they caused by inbound links? in this case they should serve a 404 Not Found status code response or a 301 redirect to the canonical url.

if you are not sure whether they come from internal links, you can use a link checker such as xenu linksleuth to find them.
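
to illustrate why a robots.txt block and a noindex tag work against each other, here is a minimal sketch using python's standard urllib.robotparser. the robots.txt content, the example.com urls and the helper name can_crawler_see_noindex are all made up for the example; the point is only that a disallowed url is never fetched, so any noindex tag on it is never read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks an attachment-style path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /attachment/
"""


def can_crawler_see_noindex(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url`.

    Only a fetched page can show its meta robots tag to the crawler, so a
    False result means a noindex tag on that page would go unseen.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Blocked by the Disallow rule: the crawler never reads the page's noindex tag.
blocked = can_crawler_see_noindex(ROBOTS_TXT, "http://example.com/attachment/photo-1/")

# Not blocked: the crawler can fetch the page and honor a noindex tag on it.
allowed = can_crawler_see_noindex(ROBOTS_TXT, "http://example.com/about/")
```

so if the goal is to get a url out of the index, the Disallow rule for it has to come out of robots.txt first.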
|
rros

msg:4191759 | 3:49 am on Aug 25, 2010 (gmt 0) |
Thank you for the very detailed answer, phranque. You made me take a second look and voilà! Those urls came as a result of uploading the images to wp as attachments. So the script would create yet another page to hold the image and create that strange url. Now, I went back and changed all links to images to direct links, which is actually better and faster for the user. The original bad urls still live on the server. Is it true that Google may drop them, as they may not be linked from any other page? I went into the database and found the table "wp_attachment_metadata" that appears to have the bad links. But I may have to hire one of the B-Dienst specialists that broke into the British Naval codes in 1935 to find out where they really are. Another alternative would be to 301 them to the appropriate image files. Any suggestions, please?
|
phranque

msg:4192428 | 9:20 am on Aug 26, 2010 (gmt 0) |
you have 3 choices here:
- 301 redirect to the image file
- meta robots noindex or X-Robots-Tag header
- 410 Gone status code response (you can do this with the G flag on a RewriteRule)
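
whichever of the three you pick, it is worth checking what the old urls actually return afterwards. a rough sketch of such a check in python follows; removal_signal is a hypothetical helper, it assumes you pass in the status code, response headers (with their usual capitalization) and html body you got from your own http client, and the regex is a simple approximation rather than a full html parser.

```python
import re

# Rough match for <meta name="robots" content="...noindex...">.
# Assumes the name attribute comes before content, as in the tag quoted earlier.
META_NOINDEX = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)


def removal_signal(status: int, headers: dict, body: str = "") -> str:
    """Name which of the three signals, if any, a response sends."""
    if status == 301:
        return "301 redirect to " + headers.get("Location", "(no Location header)")
    if status == 410:
        return "410 Gone"
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return "X-Robots-Tag noindex"
    if META_NOINDEX.search(body):
        return "meta robots noindex"
    return "none"


# Example inputs (made up): an old attachment page answered with 410.
print(removal_signal(410, {}))
```

a result of "none" for one of the old attachment urls would mean the page is still being served as an ordinary indexable page.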
|
rros

msg:4192692 | 6:24 pm on Aug 26, 2010 (gmt 0) |
Thank you!
|