large robots.txt file

Forum Moderators: goodroi

What file size will cause a crawler to choke?

DoppyNL

12:15 pm on Mar 30, 2004 (gmt 0)

I'm working on a CMS and need to prevent crawlers from indexing certain URLs.
99% of the time those URLs will not return an HTML page, but a file for download.

The problem is that it is not possible to exclude those URLs with a short robots.txt; I will have to create a huge one.

some examples:
/path/ --> html page
/path/download/423/ --> file download
/another/path/ --> html page
/another/path/download/123/ --> file download

And so on for a lot of different pages.
The HTML pages should be indexed, but the files should not be.
This would result in a robots.txt something like:

User-agent: *
Disallow: /path/download/*
Disallow: /another/path/download/*
.....

Questions:
- Will crawlers choke on a huge robots.txt?
- Should I come up with another URL structure?
- Is there another robots.txt method that allows this more easily?
- Any other thoughts?
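
For anyone wanting to test rules like the ones above before publishing them, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the rules are written as plain path prefixes without the trailing `*`: Disallow lines in the original robots.txt convention already match by prefix, and wildcard support is an extension that not every crawler honors. The agent name `ExampleBot` and the helper `is_allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules as plain path prefixes (no trailing "*"); the paths come from
# the example in the post, the layout of this list is an assumption.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /path/download/",
    "Disallow: /another/path/download/",
]


def is_allowed(url_path, lines=ROBOTS_LINES, agent="ExampleBot"):
    """Return True if url_path may be fetched under the given rules."""
    parser = RobotFileParser()
    parser.parse(lines)  # parse rules from memory, no network access
    return parser.can_fetch(agent, url_path)


if __name__ == "__main__":
    for path in [
        "/path/",
        "/path/download/423/",
        "/another/path/",
        "/another/path/download/123/",
    ]:
        print(path, "allowed" if is_allowed(path) else "blocked")
```

This only shows how one prefix-matching parser reads the file; a real crawler may interpret it differently.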

Staffa

2:37 pm on Mar 30, 2004 (gmt 0)

Can you not organize your setup so that all files open to robots are in one or more directories and the files you want blocked are in separate directories? You can then just block the whole directory.

DoppyNL

3:59 pm on Mar 30, 2004 (gmt 0)

That's an idea, but it will remove the option of "going up in the path" to find the page where the file can be found.
So I'd like to show in the URL the page where the file comes from.

On top of that there will be a security check in place to determine if the user is allowed to download the file; this is related to the page.
If I were to change the URL structure, I would also have to pass something along to determine which page I have to check against.

Also, the problem isn't creating the robots.txt file; I can do that automatically in PHP.
The problem is, will search engines cope with large robots.txt files?

DoppyNL

11:37 am on Apr 8, 2004 (gmt 0)

bump to top.

question I still have:

How large can the robots.txt file be?
When do crawlers start having problems with it because of its size?

bufferzone

12:02 pm on Apr 8, 2004 (gmt 0)

I would think that the same principles as for normal web pages apply. I read an answer by GoogleGuy that you should not go over 15K, under 12K is fine, and if you can keep it under 10K it is best. If you look at Brett's robots.txt here at WebmasterWorld you will see that it is LLLLLLLLong. How many K's, I haven't checked. My guess would be that if you keep it shorter than Brett's you should be fine.

the_nerd

7:40 pm on May 21, 2004 (gmt 0)

Looking at WW's robots file I ask myself (and now you, because my answers weren't good): why not change the standard so it can handle includes as well?

We could then block everything and let in what we want. Might be unfair for new SEs, but could keep out a couple of suckers that bring down our sites.

sidyadav

9:43 pm on May 21, 2004 (gmt 0)

> We could then block everything and let in what we want.
> Might be unfair for new SEs, but could keep out a couple
> of suckers that bring down our sites.

Yeah but DoppyNL's purpose for using a robots.txt is way different. Instead of blocking certain robots, DoppyNL wants to block all the robots from certain prohibited web pages.

In this case, I would strongly suggest that you place the <meta name="robots" content="noindex,follow"> tag into all of the HTML pages that you want robots to keep out of their index. It's not worth taking the risk of creating a large robots.txt file, but then again, putting those tags into your prohibited web pages wouldn't be easy either - but once it's done, it's done.

Sid
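
Once the tags are in place, a quick way to confirm a page really carries the robots meta tag is to parse its HTML. Below is a minimal sketch using Python's standard `html.parser`; the class name `RobotsMetaFinder` and the helper `robots_directives` are made up for illustration, and fetching the page is left out.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex,follow" into separate lowercase directives.
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def robots_directives(html_text):
    """Return the list of robots directives found in the HTML text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
    print(robots_directives(page))
```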