Sitemaps, Meta Data, and robots.txt Forum

    
large robots.txt file
What size will cause a crawler to choke on the file?
DoppyNL, 10+ Year Member
Msg#: 351 posted 12:15 pm on Mar 30, 2004 (gmt 0)

I'm working on a CMS system and need to prevent crawlers from indexing certain URLs.
99% of the time those URLs will not return an HTML page, but a file for download.

The problem is that it is not possible to exclude those URLs with a short robots.txt; I will have to create a huge robots.txt for it.

Some examples:
/path/ --> HTML page
/path/download/423/ --> file download
/another/path/ --> HTML page
/another/path/download/123/ --> file download

and so on for a lot of different pages.
The HTML pages should be indexed, but the files should not be.
This would result in a robots.txt something like:

User-agent: *
Disallow: /path/download/*
Disallow: /another/path/download/*
.....

Questions:
- Will crawlers choke on a huge robots.txt?
- Should I come up with another URL structure?
- Is there another robots.txt method that allows this more easily?
- Any other thoughts?
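
Worth noting about the example above: the original robots.txt standard matches each Disallow value as a plain path prefix, so the trailing * is unnecessary, and crawlers that don't support the nonstandard wildcard extension (Googlebot does, many others don't) may even take it literally. A prefix-only version of the same rules:

User-agent: *
Disallow: /path/download/
Disallow: /another/path/download/
.....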

 

Staffa, WebmasterWorld Senior Member, 10+ Year Member
Msg#: 351 posted 2:37 pm on Mar 30, 2004 (gmt 0)

Can you not organize your setup so that all files open to robots are in one or more directories, and the files that you want blocked are in separate directories? You can then just block those directories wholesale.
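
A minimal sketch of that approach, assuming a hypothetical /files/ directory that holds all the downloads:

User-agent: *
Disallow: /files/

A single rule then covers every download, however many files are added later.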

DoppyNL, 10+ Year Member
Msg#: 351 posted 3:59 pm on Mar 30, 2004 (gmt 0)

That's an idea, but it would remove the option of "going up in the path" to find the page where the file can be found. I'd like the URL to show which page the file comes from.

On top of that, there will be a security check in place to determine whether the user is allowed to download the file, and that check is tied to the page. If I were to change the URL structure, I would also have to pass something along to identify which page to check against.

Also, the problem isn't creating the robots.txt file; I can do that automatically in PHP. The problem is: will search engines cope with large robots.txt files?
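
The generation DoppyNL describes could look something like this minimal sketch, assuming a hypothetical get_download_prefixes() helper that returns the CMS's download path prefixes, plus a rewrite rule mapping /robots.txt to the script:

<?php
// Serve a dynamically generated robots.txt from the CMS.
// get_download_prefixes() is a hypothetical helper returning
// the URL prefixes of all download paths, e.g. "/path/download/".
header('Content-Type: text/plain');
echo "User-agent: *\n";
foreach (get_download_prefixes() as $prefix) {
    echo "Disallow: $prefix\n";
}
?>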

DoppyNL, 10+ Year Member
Msg#: 351 posted 11:37 am on Apr 8, 2004 (gmt 0)

Bump to top.

The question I still have:

How large can the robots.txt file be?
When do crawlers start having problems with it because of its size?

bufferzone, 10+ Year Member
Msg#: 351 posted 12:02 pm on Apr 8, 2004 (gmt 0)

I would think the same principles as for normal web pages apply. I read an answer by GoogleGuy saying that you should not go over 15K; under 12K is fine, and if you can keep it under 10K, that is best. If you look at Brett's robots.txt here at WebmasterWorld you will see that it is LLLLLLLLong - how many K's I haven't checked. My guess would be that if you keep it shorter than Brett's you should be fine.

the_nerd, WebmasterWorld Senior Member, 10+ Year Member
Msg#: 351 posted 7:40 pm on May 21, 2004 (gmt 0)

Looking at WW's robots file, I ask myself (and now you, because my answers weren't good): why not change the standard so it can handle includes as well?

We could then block everything and let in what we want. Might be unfair for new SEs, but could keep out a couple of suckers that bring down our sites.
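
Part of that is already possible without changing the standard: an empty Disallow value means "block nothing", so you can shut everything out and whitelist crawlers by name. A sketch that, hypothetically, lets in only Googlebot:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Of course this only keeps out robots that honour robots.txt; the badly behaved ones ignore the file entirely.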

sidyadav, WebmasterWorld Senior Member, 10+ Year Member
Msg#: 351 posted 9:43 pm on May 21, 2004 (gmt 0)

> We could then block everything and let in what we want.
> Might be unfair for new SEs, but could keep out a couple
> of suckers that bring down our sites.

Yeah, but DoppyNL's purpose for using robots.txt is way different. Instead of blocking certain robots, DoppyNL wants to block all robots from certain prohibited URLs.

In this case, I would strongly suggest that you place "<meta name="robots" content="noindex,follow">" into all of the HTML pages that you want to prohibit robots from indexing. It's not worth taking the risk of creating a large robots.txt file; then again, putting those tags into your prohibited pages wouldn't be easy either - but once it's done, it's done.
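
One wrinkle with that: a meta tag can only go into HTML, and the URLs DoppyNL wants blocked mostly return file downloads. The same directive can be sent as an HTTP response header instead - X-Robots-Tag, a Google extension that appeared a few years after this thread. A minimal PHP sketch, with a hypothetical file path:

<?php
// A download has no <head> to hold a meta tag, so send the
// noindex directive as an HTTP header instead. X-Robots-Tag is
// a Google extension postdating this thread (circa 2007).
header('X-Robots-Tag: noindex');
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="report.pdf"');
readfile('/srv/cms/files/report.pdf'); // hypothetical path
?>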

Sid
