The problem is that I can't exclude those URLs with a short robots.txt; I would have to create a huge one.
Some examples:
/path/ --> html page
/path/download/423/ --> file download
/another/path/ --> html page
/another/path/download/123/ --> file download
And so on for a lot of different pages.
The HTML pages should be indexed, but the files should not.
This would result in a robots.txt something like:
User-agent: *
Disallow: /path/download/*
Disallow: /another/path/download/*
.....
Questions:
- Will crawlers choke on a huge robots.txt?
- Should I come up with another URL structure?
- Is there another robots.txt method that allows this more easily?
- Any other thoughts?
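On the third question, one possibility is a single wildcard rule such as "Disallow: /*/download/". Wildcards were not part of the original robots.txt convention, but the major crawlers (Google, Bing) honour "*" and a trailing "$", and RFC 9309 describes them. The sketch below is only a rough way to check which of the example URLs such a rule would catch. The function names are made up for illustration, and it ignores Allow lines and longest-match precedence, so it is not a full robots.txt parser.

```python
import re

def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the path start.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_disallowed(url_path: str, disallow_rules: list[str]) -> bool:
    """True if any Disallow pattern matches the start of url_path."""
    return any(rule_to_regex(rule).match(url_path) for rule in disallow_rules)

rules = ["/*/download/"]
examples = [
    "/path/",
    "/path/download/423/",
    "/another/path/",
    "/another/path/download/123/",
]
for path in examples:
    print(path, "blocked" if is_disallowed(path, rules) else "allowed")
```

Crawlers that do not understand wildcards would treat the "*" literally and block nothing, so this only helps with crawlers that implement the extension.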
On top of that, there will be a security check to determine whether the user is allowed to download the file; that check depends on the page the file belongs to.
If I were to change the URL structure, I would also have to pass something along to determine which page to check against.
Also, the problem isn't creating the robots.txt file; I can do that automatically in PHP.
The problem is: will search engines cope with large robots.txt files?
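For a rough idea of how big a generated file gets, here is a minimal sketch of the generation step, written in Python rather than the PHP mentioned above. The page list and the function names are hypothetical. The 500 KiB figure is the size limit Google documents for robots.txt, and RFC 9309 asks parsers to handle at least that much; other crawlers may behave differently. The trailing "*" is left off because a Disallow path already matches as a prefix.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB, the limit Google documents

def download_rule(page_path: str) -> str:
    """Build the Disallow line for one page's download directory."""
    trimmed = page_path.strip("/")
    base = f"/{trimmed}/" if trimmed else "/"
    return f"Disallow: {base}download/"

def build_robots_txt(page_paths: list[str]) -> str:
    """Return a robots.txt body with one de-duplicated rule per page."""
    rules = sorted({download_rule(p) for p in page_paths})
    return "\n".join(["User-agent: *", *rules]) + "\n"

# Hypothetical page list; in practice it would come from the site's database.
pages = ["/path/", "/another/path/"]
robots = build_robots_txt(pages)
size = len(robots.encode("utf-8"))

print(robots)
print(f"{size} bytes, within limit: {size <= MAX_ROBOTS_BYTES}")
```

At roughly 30 to 40 bytes per rule, it would take well over ten thousand pages to approach that limit.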
We could then block everything and let in what we want. Might be unfair for new SEs, but could keep out a couple of suckers that bring down our sites.
Yeah but DoppyNL's purpose for using a robots.txt is way different. Instead of blocking certain robots, DoppyNL wants to block all the robots from certain prohibited web pages.
In this case, I would strongly suggest that you place <meta name="robots" content="noindex,follow"> in every HTML page that you don't want robots to index. It's not worth taking the risk of creating a large robots.txt file. Then again, putting those tags into all of the affected pages wouldn't be easy either, but once it's done, it's done.
Sid
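One caveat on the meta tag: it only works inside HTML, and the URLs in question here are file downloads. The counterpart for non-HTML responses is the X-Robots-Tag HTTP header, which Google and Bing document. Below is a minimal sketch of the headers a download script might send; the function name is hypothetical and the filename is assumed to be plain ASCII without quotes.

```python
def download_headers(filename: str) -> dict[str, str]:
    """Headers for a file download that asks crawlers not to index it."""
    return {
        "Content-Type": "application/octet-stream",
        # Assumes filename contains no quotes or non-ASCII characters.
        "Content-Disposition": f'attachment; filename="{filename}"',
        # HTTP-header equivalent of <meta name="robots" content="noindex">.
        "X-Robots-Tag": "noindex",
    }

for name, value in download_headers("report.pdf").items():
    print(f"{name}: {value}")
```

A crawler has to fetch the URL to see this header, so it would not be combined with a robots.txt Disallow for the same path.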