Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
robots.txt - disallow any url containing.
using robots.txt and dynamic urls for phpbb3 forum
BrainDed



 
Msg#: 4314267 posted 6:00 pm on May 18, 2011 (gmt 0)

My goal is to get Google to stop crawling specific URLs and to set up an accurate sitemap. I am running a phpBB3 message board with SEF URLs. The problem is that the forum script generates a URL for every reply in a topic; these are basically anchors.

This creates thousands of URLs that are useless in the eyes of the search engines, even though the users like them for bookmarking.

TOPIC = Domain brewerscubs.com/messageboard/milwaukee-brewers/carlos-gomez-16796.html

Direct Link to post = brewerscubs.com/messageboard/milwaukee-brewers/carlos-gomez-16796.html#p412994


I have been researching a way to tell robots.txt to disallow any URL containing "#p" but have not had any luck. Also, my host, SiteGround, is busting my marbles about the CPU usage from the testing I have been doing with GSiteCrawler, so my days of testing are numbered... I need to get it right this time, so I turn to the experts :)

 

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4314267 posted 6:09 pm on May 18, 2011 (gmt 0)

Google should not treat anchors as if they were separate URLs; the part after # should be ignored.

Anchors are interpreted only within the browser, as a page-relative link. The part after # is not sent to the server when a link is clicked.
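
To illustrate the point about anchors, here is a minimal Python sketch that splits the two example URLs from the first post and compares what would actually be requested from the server. The `http://` scheme and the helper name `request_target` are assumptions added for the example; they are not from the thread.

```python
from urllib.parse import urldefrag, urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query, if any) that would be sent to the server."""
    # urldefrag drops everything from '#' onward; the fragment stays client-side.
    without_fragment, _fragment = urldefrag(url)
    parts = urlsplit(without_fragment)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Example URLs from the first post; the scheme is assumed.
topic = "http://brewerscubs.com/messageboard/milwaukee-brewers/carlos-gomez-16796.html"
post = topic + "#p412994"

print(request_target(topic))
print(request_target(post))
print(request_target(topic) == request_target(post))
```

Both URLs resolve to the same request, so there is nothing after the # for a robots.txt rule to match against.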

BrainDed



 
Msg#: 4314267 posted 6:35 pm on May 18, 2011 (gmt 0)

Fantastic!

Thank you for the reply.

g1smd




 
Msg#: 4314267 posted 8:10 pm on May 18, 2011 (gmt 0)

That's not to say there isn't some duplicate content somewhere to clean up. These searches will be useful to begin with:

site:example.com -inurl:www

site:www.example.com

In particular, change the setting to 100 results per page.

BrainDed



 
Msg#: 4314267 posted 9:48 pm on May 18, 2011 (gmt 0)

Thanks again!

I changed my URL structure just last week, so 90% of those indexed pages are returning a 404 error.

I thought about doing a 301 redirect, but I don't think it would work. Here is what I am working with.

Structure was /messageboard/(forum number)/(topic number)
Structure is now /messageboard/(forum NAME)/(topic NAME)

http://www.brewerscubs.com/messageboard/16/16417.html

http://www.brewerscubs.com/messageboard/mlb/astros-have-no-chance-in-winning-the-central-16417.html


The topic number in the URL is converted on the fly.



So the redirect would be:
redirect 301 /messageboard/(forum number) http://www.example.com/messageboard/(forum NAME)

But what about the change to the second part, the topic? There is no way I could create a redirect for every topic, as there are thousands.

Would it be best to add a disallow for the old URLs, ignore them, or take another route?

The site has been active for years, and I am just now paying attention to SEO. The pages on the board had no meta data at all prior to last week. Now the description is pulled from the text on the page and the title is the topic title.

g1smd




 
Msg#: 4314267 posted 10:07 pm on May 18, 2011 (gmt 0)

Use a RewriteRule to match incoming external URL requests and internally rewrite them to a PHP script that can then interpret the old URL from the request, look up the new URL in an array or database, and then send the correct HTTP 301 redirect.
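
The lookup step can be sketched roughly as follows. This is in Python rather than PHP, purely to show the logic, and it leaves out the RewriteRule that would hand the request to the script. The dictionaries `FORUM_SLUGS` and `TOPIC_SLUGS` and the function `redirect_for` are hypothetical stand-ins; on a real board the names would be looked up in the forum database. The sample entry uses the old and new URLs quoted earlier in the thread.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup data standing in for the forum database.
FORUM_SLUGS = {16: "mlb"}
TOPIC_SLUGS = {16417: "astros-have-no-chance-in-winning-the-central"}

# Old structure: /messageboard/(forum number)/(topic number).html
OLD_URL = re.compile(r"^/messageboard/(\d+)/(\d+)\.html$")


def redirect_for(path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location) for a request to an old-style URL."""
    match = OLD_URL.match(path)
    if match is None:
        return 404, None
    forum_id, topic_id = int(match.group(1)), int(match.group(2))
    forum = FORUM_SLUGS.get(forum_id)
    topic = TOPIC_SLUGS.get(topic_id)
    if forum is None or topic is None:
        # Unknown forum or topic: no redirect target exists.
        return 404, None
    new_url = f"http://www.brewerscubs.com/messageboard/{forum}/{topic}-{topic_id}.html"
    return 301, new_url


print(redirect_for("/messageboard/16/16417.html"))
```

Because the new URL is built from a lookup keyed on the old numbers, one pattern covers every topic and no per-topic redirect lines are needed.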

BrainDed



 
Msg#: 4314267 posted 10:28 pm on May 18, 2011 (gmt 0)

Wow... that's over my head. Thank you though!

Should I start a new topic in this section, [webmasterworld.com...], with this info, as we have strayed away from robots.txt?


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved