New to robots.txt and how spiders find robots

Forum Moderators: goodroi

jazzvn

6:51 am on Apr 8, 2004 (gmt 0)

Hi everyone,
I am totally new to robots.txt.
Can anyone show me how to put the syntax in a meta tag, and how to create a small, quick robots.txt, please?
Also, I think my website has never been submitted to any search engine. So if I put robots.txt in my root, is it going to help search engines find my website so that next time their spiders can crawl my web pages? PS: My website was never submitted to any search engine.
Thanks a lot for any piece of information from all of you.

Jazzvn

IONWeb

7:04 am on Apr 8, 2004 (gmt 0)

Welcome to WebmasterWorld, jazzvn.

A robots.txt file is simply a text file created in a text editor that is stored on the server. Here is a simple robots.txt:

User-agent: *
Disallow: /cgi-bin/
Disallow: /members/

Here is a site that details what the file is and what it does: [robotstxt.org]
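
If you want to see what those rules permit before uploading the file, here is a minimal sketch using Python's standard urllib.robotparser module. The host www.example.com, the sample paths, and the helper name is_allowed are made up for illustration. It only shows how one well-behaved parser reads the file; real spiders may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /members/
"""


def is_allowed(robots_txt, url, agent="*"):
    """Return True if a robot obeying robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical paths on a placeholder host.
    for path in ("/index.html", "/cgi-bin/search.cgi", "/members/profile.html"):
        url = "http://www.example.com" + path
        verdict = "allowed" if is_allowed(ROBOTS_TXT, url) else "disallowed"
        print(path, verdict)
```

Anything under /cgi-bin/ or /members/ should come back as disallowed, and everything else as allowed.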

Hope this helps to get you started.

jdMorgan

7:15 am on Apr 8, 2004 (gmt 0)

jazzvn,

Welcome to WebmasterWorld [webmasterworld.com]!

Here's the Robots.txt standard [robotstxt.org].

Putting the robots.txt file in your root directory will not help to get your site indexed. The purpose of robots.txt is to tell "good" robots not to request certain pages or subdirectories of your site. Bad robots, such as e-mail address harvesters, will ignore robots.txt.

Typical uses are to keep robots from requesting your scripts and/or shopping cart, to stop them from indexing or copying your images, and to keep them from listing your "semi-private" pages - although this is no guarantee that those pages will be kept private!

Another use is to keep robots from requesting pages that you don't need indexed and consuming large amounts of bandwidth.

In order to get your pages indexed in search engines, what you need to do is to get incoming links -- get other sites which are already indexed to link to your pages. Best results will be had if the sites that link to your pages share the topic of your pages. Also, links from reputable directories such as the Open Directory Project (DMOZ) are good to have.

In addition to robots.txt, which is a text file placed in the root directory of your site, you can also use the on-page HTML robots meta-tag of the form:

 <meta name="robots" content="noindex,nofollow">

Note that if a page is disallowed in robots.txt, then the page won't be read by robots, and the above tag will have no effect.

Also note that Ask Jeeves and Google will index your page (list it in search results) if they find any link to your page, regardless of whether that page is disallowed in robots.txt. The only way to stop them from listing your page is to allow them to fetch it (don't disallow the page in robots.txt) and use the on-page HTML robots meta-tag shown above.
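
The interaction between the two mechanisms can be sketched in Python with the standard html.parser and urllib.robotparser modules. This is only an illustration of the reasoning above, not how any search engine is actually implemented; the names RobotsMetaParser, robots_meta_directives and noindex_can_take_effect, and the example URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for item in content.split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_meta_directives(html):
    """Return the set of robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def noindex_can_take_effect(robots_txt, url, html, agent="*"):
    """A robot only sees the meta tag if robots.txt lets it fetch the page."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        return False  # page is never read, so the tag is never seen
    return "noindex" in robots_meta_directives(html)


PAGE = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
RULES = "User-agent: *\nDisallow: /album/"

if __name__ == "__main__":
    print(noindex_can_take_effect(RULES, "http://www.example.com/album/page.html", PAGE))
    print(noindex_can_take_effect(RULES, "http://www.example.com/private.html", PAGE))
```

The first call is for a page that robots.txt disallows, so the tag cannot be read; the second page is fetchable, so the noindex directive can be honored.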

Use a simple text editor such as Notepad to create your robots.txt. Once you have written your robots.txt file, validate it here [searchengineworld.com].

Jim

jazzvn

4:52 pm on Apr 8, 2004 (gmt 0)

Hi Moderator,
Does this mean that my page doesn't allow spiders?
<meta name="robots" content="noindex,nofollow">

It says noindex and nofollow.
What I really want is this:

User-agent:*
Disallow:/album/
# Album is the folder holding my whole family web album, and I don't want people to see it. Is that code right? This will let in any kind of spider, but not into my www.domain.com/album, right?
Thanks

rogerd

4:55 pm on Apr 8, 2004 (gmt 0)

Welcome, Jazzvn. Those code examples should work. You can test your code with some tools found here: [searchengineworld.com...]
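
For a quick local check as well, here is a rough sketch with Python's standard urllib.robotparser, fed the rules as posted (no space after the colon) plus a shortened comment line. The helper name crawl_allowed, the agent name ExampleBot and the sample paths are invented for the example, and other parsers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Rules as posted, without a space after the colon; the comment is shortened.
POSTED_RULES = """\
User-agent:*
Disallow:/album/
# Album is my family web album
"""


def crawl_allowed(path, agent="ExampleBot"):
    """Return True if the posted rules let agent fetch path on www.domain.com."""
    parser = RobotFileParser()
    parser.parse(POSTED_RULES.splitlines())
    return parser.can_fetch(agent, "http://www.domain.com" + path)


if __name__ == "__main__":
    # Hypothetical paths inside and outside the album folder.
    for path in ("/index.html", "/album/photo1.jpg", "/album"):
        print(path, "allowed" if crawl_allowed(path) else "disallowed")
```

With this parser the compact syntax is accepted and everything under /album/ is disallowed. Note that the bare path /album (no trailing slash) does not match the rule Disallow: /album/ here, since the rule is a simple prefix match.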

sidyadav

1:19 am on Apr 10, 2004 (gmt 0)

I don't want people to see it.

People or robots?
Robots.txt can only prevent access by robots which obey the exclusion standard. You cannot keep out browsers/people unless you use .htaccess.

If you use robots.txt, there's no need to use META tags. If you use META tags, there's no need to use robots.txt.

So I'd suggest going with the robots.txt code you posted in your last message.

Good luck!
Sid

PS: Welcome to WebmasterWorld!

EarWig

11:09 pm on Apr 10, 2004 (gmt 0)

Jim
A little help please
In your opinion would this also apply to Yahoo! Slurp?

Thanks & regards
Ray

sidyadav

11:41 pm on Apr 10, 2004 (gmt 0)

Jim
A little help please
In your opinion would this also apply to Yahoo! Slurp?

No, Yahoo! quits requesting pages on your site once it sees that it's disallowed, and it doesn't index them.

However, Google does list such pages. If you search for "Overture", result #3 is content.overture.com [overture.de]; even though it has banned all robots via robots.txt, Google still has its URL.

So there's a chance of this happening in Yahoo! results, as Yahoo! still seems to use some of Google's.

Sid

EarWig

11:49 pm on Apr 10, 2004 (gmt 0)

Hi sidyadav

Thank you for the feedback and info

I asked the question because Yahoo! recently listed a whole site of ours that does contain a robots.txt forbidding indexing.
The strange thing is that it lists all pages, but with no descriptions.
Just the URL and the company name, which contains the link.

Been trying to find out why without success so far.

Any ideas?
Ray