Forum Moderators: goodroi

what to put into robots.txt?

         

Hitman3266

1:23 am on Nov 26, 2006 (gmt 0)

10+ Year Member



Basically, I have an Invision forum, and I have absolutely no clue what to put into the robots file. I've seen people post their robots files, but they have lines like Disallow: ..., and I don't want to disallow anything; I want the whole site to be indexed. How can I do this easily? Thanks.

[url snip]

[edited by: goodroi at 1:38 am (utc) on Nov. 26, 2006]

jdMorgan

2:00 am on Nov 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you don't want to disallow *anything* then your robots.txt can be completely blank (an empty file), or it should contain:
User-agent: *
Disallow:


Note that the Disallow argument field is blank, meaning "nothing" and also note the blank line at the end -- which some obscure robots need (or needed) to consider the record to be valid.
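One way to sanity-check the allow-all record is with Python's standard urllib.robotparser module. This is only a minimal sketch: the example.com URLs and the is_allowed helper are placeholders made up for illustration, and a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The allow-all record discussed above: an empty Disallow blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a site from this thread.
    for url in (
        "http://example.com/",
        "http://example.com/forum/index.php?showtopic=1",
    ):
        print(url, is_allowed(ALLOW_ALL, url))
    # A completely empty robots.txt is treated the same way by this parser.
    print(is_allowed("", "http://example.com/anything"))
```

Both the explicit record and the empty file should report every URL as fetchable.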

You might want to dig around on your server and make sure that you really want to allow all directories to be spidered. Many sites contain cgi-bin or stats directories that you should not allow to be spidered.
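If you do decide to keep spiders out of directories like these, a quick check along the following lines can confirm which URLs a record blocks. It is a rough sketch using Python's standard urllib.robotparser; the /cgi-bin/ and /stats/ paths follow the names mentioned above, but the exact paths on your server, the example.com URLs, and the blocked_urls helper are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: adjust the directory names to match your server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /stats/
"""


def blocked_urls(robots_txt, urls, agent="*"):
    """Return the subset of `urls` that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    candidates = [
        "http://example.com/index.html",
        "http://example.com/cgi-bin/form.pl",
        "http://example.com/stats/index.html",
    ]
    for url in blocked_urls(ROBOTS_TXT, candidates):
        print("blocked:", url)
```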

Ref: A Standard for Robot Exclusion [robotstxt.org]

Jim

goodroi

2:00 am on Nov 26, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hey Hitman3266,

Welcome to WebmasterWorld :) In general, people use Disallow in their robots.txt to keep bots out of admin areas, development and testing pages, and pages with heavy duplicate content.

When you say you want all of your pages indexed, I think you mean you want all of your content pages indexed (feel free to correct me :)). Personally, I would not want the engines indexing the admin parts of my forums. To get the engines to index the content pages, get links pointing to those deep pages and not only to your home page. By default the engines will index as much as possible, so just get those links pointing to your good stuff.

As for a specific example of disallowing stuff, you may want to double-check your regular pages vs. printer-friendly pages; that was a problem with other forum programs. It might also help to poke around other sites running Invision and see what they put into their robots.txt.

Good luck and happy indexing.

Asia_Expat

8:41 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



Here is the robots file from my own Invision installation, although you should note that I have not yet added the lines that will disallow 'getlastpost' URLs. You can PM me for my site's address if you like and monitor my robots file for changes.

User-agent: *
Disallow: /advertise/
Disallow: /forum/index.php?act=idx
Disallow: /forum/index.php?act=Login
Disallow: /forum/index.php?act=Search
Disallow: /forum/index.php?act=Shoutbox
Disallow: /forum/index.php?act=Reg
Disallow: /forum/index.php?act=Msg
Disallow: /forum/index.php?act=Mail
Disallow: /forum/index.php?act=Forward
Disallow: /forum/index.php?act=Track
Disallow: /forum/index.php?act=Post
Disallow: /forum/index.php?act=Print
Disallow: /forum/index.php?act=ST
Disallow: /forum/index.php?act=boardrules
Disallow: /forum/index.php?act=Help
Disallow: /forum/index.php?act=Stats
Disallow: /forum/index.php?act=Members
Disallow: /forum/index.php?act=Online
Disallow: /forum/index.php?act=calendar
Disallow: /forum/index.php?act=SR
Disallow: /forum/index.php?act=ICQ
Disallow: /forum/index.php?act=MSN
Disallow: /forum/index.php?act=AOL
Disallow: /forum/index.php?act=AIM
Disallow: /forum/index.php?act=SC
Disallow: /forum/index.php?act=task
Disallow: /forum/index.php?act=findpost
Disallow: /forum/index.php?act=UserCP
Disallow: /forum/index.php?&act=
Disallow: /forum/index.php?act=report
Disallow: /forum/index.php?act=buddy
Disallow: /forum/index.php?act=legends
Disallow: /forum/index.php?CODE=
Disallow: /forum/index.php?automodule
Disallow: /forum/index.php?act=attach
Disallow: /forum/index.php?&&CODE=
Disallow: /forum/index.php?&debug=1
Disallow: /forum/index.php?act=Profile
Disallow: /forum/index.php?showuser
Disallow: /forum/index.php?s=
Disallow: /*&mode=linear$
Disallow: /*&mode=threaded$
Disallow: /*&mode=linearplus$
Disallow: /*&p=
Disallow: /*&pid=

[edited by: Asia_Expat at 8:42 pm (utc) on Nov. 26, 2006]

Asia_Expat

8:45 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



... To clarify, this will allow all the important meaty threads but disallow everything you DON'T want indexed, and in theory it also protects you from dupe content issues. NOTE: I can't emphasize enough that you should monitor my file for when I add the 'getlastpost' exclusion; it's extremely important. The only reason I am dragging my heels is that I have removed those links from the skin I am currently using.

[edited by: Asia_Expat at 8:46 pm (utc) on Nov. 26, 2006]

AndyA

10:22 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



Asia_Expat,

I'd like some additional info on your Invision forum as well. I just added my /forum/ to robots.txt to disallow all robots, as I feel that two years of my forum issuing session IDs, plus a problem with the config file, have pulled my entire site down.

I'd like to eventually have the forum indexed, but not at the cost of the rest of the site. I've been using the IPB Portal as the destination URL for internal links from my site to the forum, and the Portal only issues URLs with session IDs. To go anywhere from that page you end up with a session in the URL, so I guess I will eventually have to dump it.

Asia_Expat

10:54 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



Hi,
Personally, I have not touched the IPB portal, so I can neither recommend it nor recommend against it.
I don't know if you have done this already, but in the cookie settings section, in the 'Cookie Domain' field, I added...

.mydomain.com

(add your own domain of course, and don't forget the dot at the beginning)

... and the session ID problems simply vanished into thin air. Also, if you look closely at the robots file above, IPB session IDs are disallowed, so eventually any session IDs you have indexed should be dropped from the index.

If you have never attempted to manage the indexing of your forum, I would think it's almost certainly causing you serious issues. All I can suggest is that you implement the above robots file and see what happens. It also includes wildcard exclusions to take care of 'Printer Friendly' versions and all the post 'Snapback' URLs. It's a very comprehensive robots file for IPB and takes care of just about everything, I think. (NOTE: I still have to add the 'Getlastpost' wildcard... check back later.)

I think you should let the bots back into your forum straight away and let them see this robots file because I noticed good results very quickly.
Wait for me to add the 'Getlastpost' exclusion though.

[edited by: Asia_Expat at 10:56 pm (utc) on Nov. 26, 2006]

Asia_Expat

11:09 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



Actually, I just noticed a few lines that could be improved slightly... will post again later

Asia_Expat

11:27 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



OK, here's the completed file, including the 'Getlastpost' exclusion wildcards.
You should note that the 'pid' exclusions are closely related to this and are also very important, as I have noticed that the 'Getlastpost' links/URLs actually produce a 302 header response before redirecting you to the last post, which means yet more dupe content. However, this robots file should take care of it all.
Note that only Google and Yahoo have confirmed they support the wildcards/pattern matching protocol.

If anyone can see any issues with my robots file, I'd be really grateful if you could let me know, preferably by PM...

User-agent: *
Disallow: /forum/index.php?act=idx
Disallow: /forum/index.php?act=Login
Disallow: /forum/index.php?act=Search
Disallow: /forum/index.php?act=Shoutbox
Disallow: /forum/index.php?act=Reg
Disallow: /forum/index.php?act=Msg
Disallow: /forum/index.php?act=Mail
Disallow: /forum/index.php?act=Forward
Disallow: /forum/index.php?act=Track
Disallow: /forum/index.php?act=Post
Disallow: /forum/index.php?act=Print
Disallow: /forum/index.php?act=ST
Disallow: /forum/index.php?act=boardrules
Disallow: /forum/index.php?act=Help
Disallow: /forum/index.php?act=Stats
Disallow: /forum/index.php?act=Members
Disallow: /forum/index.php?act=Online
Disallow: /forum/index.php?act=calendar
Disallow: /forum/index.php?act=SR
Disallow: /forum/index.php?act=ICQ
Disallow: /forum/index.php?act=MSN
Disallow: /forum/index.php?act=AOL
Disallow: /forum/index.php?act=AIM
Disallow: /forum/index.php?act=SC
Disallow: /forum/index.php?act=task
Disallow: /forum/index.php?act=findpost
Disallow: /forum/index.php?act=UserCP
Disallow: /forum/index.php?&act=
Disallow: /forum/index.php?act=report
Disallow: /forum/index.php?act=buddy
Disallow: /forum/index.php?act=legends
Disallow: /forum/index.php?CODE=
Disallow: /forum/index.php?automodule
Disallow: /forum/index.php?act=attach
Disallow: /forum/index.php?&&CODE=
Disallow: /forum/index.php?&debug=1
Disallow: /forum/index.php?act=Profile
Disallow: /forum/index.php?showuser
Disallow: /forum/index.php?s=
Disallow: /*&view=getnewpost$
Disallow: /*&view=getlastpost$
Disallow: /*&mode=linear$
Disallow: /*&mode=threaded$
Disallow: /*&mode=linearplus$
Disallow: /*&p=
Disallow: /*&pid=
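
To see how the wildcard lines in a file like this behave, here is a rough Python sketch of the pattern matching. It assumes the commonly described semantics of the wildcard extension ('*' matches any run of characters, a trailing '$' anchors the end of the URL, anything else is a plain prefix match); the function names and the showtopic URLs are made up for illustration, and real crawlers may differ in the details. Python's built-in urllib.robotparser is not used here because it does plain prefix matching only.

```python
import re


def rule_to_regex(rule):
    """Translate one Disallow path into a compiled regex.

    Assumed semantics: '*' matches any characters, a trailing '$' anchors
    the end of the URL, and everything else is matched as a literal prefix.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path, rules):
    """True if any rule matches `path` (URL path plus query string)."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# A small subset of the rules from the file above.
RULES = [
    "/forum/index.php?act=Login",
    "/forum/index.php?s=",
    "/*&view=getlastpost$",
    "/*&mode=linear$",
    "/*&p=",
]

if __name__ == "__main__":
    # Made-up thread URLs for illustration.
    samples = [
        "/forum/index.php?showtopic=42",
        "/forum/index.php?showtopic=42&view=getlastpost",
        "/forum/index.php?showtopic=42&mode=linear",
        # Not caught by 'linear$' alone; the full file has its own rule for it.
        "/forum/index.php?showtopic=42&mode=linearplus",
        "/forum/index.php?act=Login&CODE=00",
    ]
    for path in samples:
        print(path, "->", "blocked" if is_disallowed(path, RULES) else "allowed")
```

Under these assumed semantics, a prefix rule such as /forum/index.php?s= only matches when s= is the first query parameter, so it is worth testing a few real URLs from your own board against the rules.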

Asia_Expat

11:34 pm on Nov 26, 2006 (gmt 0)

10+ Year Member



May I also suggest you make use of the IPB Google sitemap generator mod (a search will reveal where you can get it) although this will also need careful management.

AndyA

5:09 pm on Nov 27, 2006 (gmt 0)

10+ Year Member



Thanks for the info, I'm using an older version of Invision Board, and I do intend to update it and buy a license soon. Some of the links have changed between the old and new versions, so this won't work for me now. I'll have to hold off on allowing the bots access until I can get everything straightened out.

So far, not a problem for any of the SEs except for Google.