
Sitemaps, Meta Data, and robots.txt Forum

    
Spider.txt
graemel

10+ Year Member



 
Msg#: 155 posted 11:33 pm on Jan 23, 2001 (gmt 0)

Hi, can anyone help me out with some info on how to create a spider.txt file and what needs to be in it?

Thanks

 

mivox

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 155 posted 12:01 am on Jan 24, 2001 (gmt 0)

1. name it robots.txt *not* spider.txt

2. make a list of everything on your site you DON'T want robots/spiders to visit, and list it in robots.txt like so:

User-agent: *
Disallow: /directory1
Disallow: /directory2/file1.htm

etc., etc.

The * after User-agent means NO spider is supposed to visit the files & directories in this section.

If you only want to ban specific robots from certain files, add a second section like so (replace the * with the user-agent of the spider you want to ban):

User-agent: Googlebot/2.1
Disallow: /don't_want_google_here

User-agent: FAST-WebCrawler/2.2
Disallow: /don't_want_FAST_here

Etc., etc...
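For what it's worth, you can sanity-check a two-section file like the one above with Python's standard-library robots.txt parser. This is just a sketch: example.com and the paths are placeholders, and the group header uses the plain agent name "Googlebot" (without a version suffix), which is what the stdlib matcher expects.

```python
# Check how a polite crawler would interpret a robots.txt with a
# wildcard section plus a bot-specific section.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /directory1
Disallow: /directory2/file1.htm

User-agent: Googlebot
Disallow: /no-google-here
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The * section applies to any robot that has no section of its own:
print(rp.can_fetch("SomeBot", "http://example.com/directory1/page.htm"))  # False
print(rp.can_fetch("SomeBot", "http://example.com/other/page.htm"))       # True

# Googlebot obeys only its own section, so /directory1 stays open to it:
print(rp.can_fetch("Googlebot", "http://example.com/no-google-here"))     # False
print(rp.can_fetch("Googlebot", "http://example.com/directory1/page.htm"))  # True
```

Note that a robot reads only the first section that matches its user-agent, so a bot with its own section ignores the * rules entirely.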

graemel

10+ Year Member



 
Msg#: 155 posted 12:29 am on Jan 24, 2001 (gmt 0)

Thanks for that. Now a question that I suppose shouldn't be posted here: I use FrontPage 2000, so how does one incorporate the robots.txt file into a website?

I've had a look at all the help files I can find, but am getting nothing.

Cheers

mivox

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 155 posted 12:35 am on Jan 24, 2001 (gmt 0)

Well... you don't "incorporate" it into your website. You'd create it in Notepad, or some other plain text editor, save it as a plain text file, and upload it to your webserver.

Don't let it get anywhere near FrontPage... who knows what could happen to it!

Marcia

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 155 posted 11:23 am on Jan 24, 2001 (gmt 0)

mivox, how about excluding spiders from an entire site? And will adding a robots.txt disallowing all spiders result in a site being removed from the index?

Machiavelli

10+ Year Member



 
Msg#: 155 posted 4:21 pm on Jan 24, 2001 (gmt 0)

The following disallows a (polite) spider from the entire site:

User-agent: *
Disallow: /

However, I have seen it take about 3 months after a spider visiting and reading robots.txt before they think about removing the disallowed pages from their index. Of course, in most cases it is entirely in the search engine's interest to remove disallowed content, because a link to such content is often useless.
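A quick check with Python's standard-library parser confirms that this two-line file shuts a polite robot out of everything (example.com and the bot name are placeholders):

```python
# "Disallow: /" is a prefix that every path starts with, so it
# blocks the whole site for every user-agent matched by *.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("AnyBot", "http://example.com/"))           # False
print(rp.can_fetch("AnyBot", "http://example.com/deep/page"))  # False
```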

hohmaster

10+ Year Member



 
Msg#: 155 posted 8:37 pm on Jan 24, 2001 (gmt 0)

I have a problem with my robots.txt. My site has a directory with a lot of very similar files in it. I want to exclude these files from SE robots. The problem is that there are subdirectories in this folder as well, and I don't want them to be excluded. Changing the structure of the site would be complicated, too.
What should I use?

User-agent: *
Disallow: /mydirectory/ OR
Disallow: /mydirectory

I have read about it somewhere, but I forgot it :-(

Could anyone help me?

mivox

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 155 posted 8:49 pm on Jan 24, 2001 (gmt 0)

I use:

Disallow: /mydirectory

Seems to work fine. One thing to keep in mind, though: robots.txt rules match by prefix, so either form (with or without the trailing slash) also blocks everything underneath, subdirectories included.

In your case, you can write a separate Disallow line for each file in that directory you don't want spidered, which still allows access to the directory itself and therefore the subdirectories:

Disallow: /mydirectory/myfile1.htm
Disallow: /mydirectory/myfile2.htm
Disallow: /mydirectory/myfile3.htm
etc...

It may result in a long robots.txt file, but it should do the trick.
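The prefix matching is easy to demonstrate with Python's standard-library parser. In this sketch (all paths and the bot name are placeholders), a bare directory rule blocks the subdirectories too, while per-file rules leave them reachable:

```python
# Compare a directory-prefix rule against per-file rules.
from urllib.robotparser import RobotFileParser

# A bare "Disallow: /mydirectory" is a prefix, so it catches
# /mydirectory/sub/... as well:
dir_rule = RobotFileParser()
dir_rule.parse(["User-agent: *", "Disallow: /mydirectory"])
print(dir_rule.can_fetch("AnyBot", "http://example.com/mydirectory/sub/page.htm"))  # False

# Per-file rules block only the named files; the subdirectory stays open:
file_rules = RobotFileParser()
file_rules.parse([
    "User-agent: *",
    "Disallow: /mydirectory/myfile1.htm",
    "Disallow: /mydirectory/myfile2.htm",
])
print(file_rules.can_fetch("AnyBot", "http://example.com/mydirectory/myfile1.htm"))   # False
print(file_rules.can_fetch("AnyBot", "http://example.com/mydirectory/sub/page.htm"))  # True
```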

hohmaster

10+ Year Member



 
Msg#: 155 posted 10:35 pm on Jan 24, 2001 (gmt 0)

Thank you mivox!

This was what I suspected to be right.
I know that writing a separate line for each file would also work, but there are nearly one hundred of them. Maybe I'm too lazy :-)

Anyway, it could be a warning for those who have the same site structure. I really didn't want to spam SEs, and these (asp) files contain one single line (a response redirect), but it might have led to Google dropping me off completely! I don't know for sure, but... SEs are getting more and more clever.

Thank you again

mivox

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 155 posted 10:47 pm on Jan 24, 2001 (gmt 0)

I personally try to keep my websites structured with no more than 1 directory level down from root:

mysite.com/directory/file.htm

The only time I use second level subdirectories is to hold administrative or data/graphics files:

mysite.com/images/icons/icn.gif
mysite.com/images/headers/head.gif

I disallow everything from my image and admin directories anyway, so that keeps everything else nice and clean. Either I allow the spiders into a directory or not.
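A robots.txt along those lines only needs a few lines. The directory names here are placeholders for whatever your image and admin directories are actually called:

```
User-agent: *
Disallow: /images
Disallow: /admin
```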

I'm lazy too... I hate having to redesign and reorganize things once I build them! A lesson for the lazy: it's worth spending the extra time to plan out a site structure at first that will allow you to be lazy later! ;)

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved