
Sitemaps, Meta Data, and robots.txt Forum

    
How to exclude dynamic URLs?
So search engines don't get bogged down.
Tech2004




msg:1529139
 4:51 am on Nov 11, 2004 (gmt 0)

I have read that engines like Google will stop indexing a site if it has too many dynamic pages in it. My site uses a bulletin board, PHPBB2, which puts a '?' in the URLs of its dynamic pages. How do I go about telling a robot to index only the outermost pages, such as the topics and replies, and not everything else? I have already started listing specific pages in a few lines of my current robots.txt, but if I have to list every single one, this could be quite tedious, and I'm not even sure it will work. Short of disallowing the entire folder the board is in, is there a simpler way? The support website for PHPBB2 said to come here for help. :/

 

Slade




msg:1529140
 5:44 am on Nov 11, 2004 (gmt 0)

This is my robots.txt for my phpbb:
User-agent: *
Disallow: /admin/
Disallow: /attach_mod/
Disallow: /db/
Disallow: /files/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /templates/
Disallow: /common.php
Disallow: /config.php
Disallow: /glance_config.php
Disallow: /groupcp.php
Disallow: /login.php
Disallow: /memberlist.php
Disallow: /modcp.php
Disallow: /posting.php
Disallow: /printview.php
Disallow: /privmsg.php
Disallow: /ranks.php
Disallow: /search.php
Disallow: /statistics.php
Disallow: /tellafriend.php
Disallow: /viewonline.php

Do you have session IDs disabled for bots?

Tech2004




msg:1529141
 6:51 am on Nov 11, 2004 (gmt 0)

Hi. No, I don't believe I do. How do you go about that?

Also, aren't you disallowing files in the root folder if you don't put the board's folder in front, as in /phpbb/whatever.php?

Thanks.

Tech2004




msg:1529142
 3:28 pm on Nov 11, 2004 (gmt 0)

This is what I have in the robots.txt in my root folder. 'supportbbs' is the PHPBB2 folder.

Disallow: /supportbbs/admin
Disallow: /supportbbs/cache
Disallow: /supportbbs/docs
Disallow: /supportbbs/db
Disallow: /supportbbs/images
Disallow: /supportbbs/includes
Disallow: /supportbbs/language
Disallow: /supportbbs/templates
Disallow: /supportbbs/memberlist.php
Disallow: /supportbbs/profile.php
Disallow: /supportbbs/search.php
Disallow: /supportbbs/groupcp.php
Disallow: /supportbbs/faq.php

The last 5 lines are what I started adding, as mentioned in my original post. Would it be better to use a wildcard in place of the extension to block more URLs?

E.g.: Disallow: /supportbbs/faq.*?

jdMorgan




msg:1529143
 12:22 am on Nov 13, 2004 (gmt 0)

The Standard does not support "wildcards." As specified, robots.txt uses prefix-matching, so
Disallow: /faq
blocks every URL-path that begins with /faq, which covers everything you intended with
Disallow: /faq.*

But because it is a prefix match, you can't disallow all files of a specific type. A line such as
Disallow: *.php
is invalid for most search engines.

However, just to make matters more complicated, Google has defined some extensions to robots.txt that allow you to disallow by filetype and more. See their Webmaster Help section. You can use their special extensions within a robots.txt record specifically addressed to Googlebot, but you'll need to find another solution for all the other robots that visit your site.

For example, this would stop Googlebot from fetching Excel files:

User-agent: Googlebot
Disallow: /*.xls$

Jim
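
To make the prefix-matching rule concrete, here is a minimal Python sketch. It is a simplified matcher written only for illustration: the function name is_disallowed and the sample paths are made up, and it ignores Allow lines, per-robot records, and URL escaping.

```python
def is_disallowed(path, disallow_prefixes):
    """Return True if path starts with any Disallow value (plain prefix match)."""
    # An empty Disallow value blocks nothing, so empty prefixes are skipped.
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)


rules = ["/faq"]  # hypothetical record: Disallow: /faq

for path in ["/faq.php", "/faq.php?mode=bbcode", "/faqs/index.html", "/forum/faq.php"]:
    print(path, is_disallowed(path, rules))
```

Note that /faq also catches /faqs/index.html, so a short prefix can block more than intended, while a value such as *.php is compared as a literal prefix under this rule and matches no real path.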

Tech2004




msg:1529144
 12:40 am on Nov 13, 2004 (gmt 0)

Thank you. (I was under the impression it would treat these as folders.)

The goal is to stop the bots from needlessly indexing certain pages on the bulletin board that do not really contain content useful in a web search.
So I am going to try adding this:

Disallow: /supportbbs/memberlist
Disallow: /supportbbs/profile
Disallow: /supportbbs/search
Disallow: /supportbbs/groupcp
Disallow: /supportbbs/faq
Disallow: /supportbbs/posting
Disallow: /supportbbs/privmsg

and hopefully any page in the supportbbs folder whose URL starts with one of the listed prefixes ('prefix matching', as you say) will not be indexed. It seems to be a good start. Thanks again.
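
If typing the lines by hand gets tedious, a list like this could be generated. The following is only a sketch: build_disallow_lines is a hypothetical helper, and the folder and script names are the ones from this post.

```python
def build_disallow_lines(folder, names):
    """Build one Disallow line per script-name prefix inside a folder."""
    seen = set()
    lines = []
    for name in names:
        if name in seen:  # skip a name that was listed twice
            continue
        seen.add(name)
        lines.append("Disallow: /{}/{}".format(folder.strip("/"), name))
    return lines


# Folder and names taken from the post; adjust them to your own board.
names = ["memberlist", "profile", "search", "groupcp", "faq", "posting", "privmsg"]
print("User-agent: *")
print("\n".join(build_disallow_lines("supportbbs", names)))
```

Each generated value is still a plain prefix, so /supportbbs/search covers search.php and any URL that continues from there.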

googalot




msg:1529145
 4:54 am on Nov 16, 2004 (gmt 0)

Hi :),

I am not sure if this is of use to you, but I achieve the results you are after by using mod_rewrite to give my dynamic pages static-looking URLs.

EXAMPLE:
green-widget.php?items=16
to
green-widget/16/

I then use a wildcard rule to prevent Googlebot from crawling any URL with a question mark in it, like so:

User-agent: Googlebot
Disallow: /*?

Although most validators will moan at you for this syntax, the source is Google's own Webmaster Tips:

from Google: [google.com...]
------------------------------------
12. How do I tell Googlebot not to crawl dynamically generated pages on my site?

The following robots.txt file will achieve this.

User-agent: Googlebot
Disallow: /*?
------------------------------------
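
For anyone wanting to check which URLs a wildcard rule like this would catch, here is a rough Python sketch. It is not Google's implementation. It assumes the semantics described in this thread: '*' stands for any run of characters, a trailing '$' anchors the end of the URL, and the rule still matches from the start of the path. The names pattern_to_regex and is_blocked are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a wildcard Disallow value into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(url_path, pattern):
    """Return True if url_path matches the wildcard Disallow pattern."""
    return pattern_to_regex(pattern).match(url_path) is not None


print(is_blocked("/green-widget.php?items=16", "/*?"))    # dynamic URL
print(is_blocked("/green-widget/16/", "/*?"))             # rewritten URL
print(is_blocked("/reports/q3.xls", "/*.xls$"))           # ends in .xls
print(is_blocked("/reports/q3.xls?download=1", "/*.xls$"))
```

Under these assumptions the rule /*? catches the dynamic URL but leaves the rewritten green-widget/16/ form alone, which is the effect described above.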

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved