Forum Moderators: goodroi


Robots.txt Help

         

branmh

5:47 pm on May 29, 2005 (gmt 0)

10+ Year Member



Is there a way to have a plain HTML file (i.e. a sitemap) with all the links that I would like a search engine to spider, but not have it spider my actual pages with the graphics (to save bandwidth)?

I hope this makes sense.

jbinbpt

11:34 pm on May 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What are you expecting them to see? Are your URLs descriptive enough to rank, or is your site map going to include content? If so, then block everything except the index and sitemap. I would worry about how this looks to the search engines, though: all links and no content will work against what you are trying to do, and you might end up with a banned site.
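The "block everything except the index and sitemap" approach can be sketched in robots.txt like this (assuming the sitemap lives at /sitemap.html; note that Allow is a non-standard extension to the robots.txt protocol, though major crawlers such as Googlebot honor it):

User-agent: *
Allow: /index.html
Allow: /sitemap.html
Disallow: /

Crawlers that support Allow match the most specific (longest) path, so the two Allow lines take precedence over the blanket Disallow.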

branmh

11:53 pm on May 30, 2005 (gmt 0)

10+ Year Member



What I was hoping to do is let the spider crawl a plain, graphics-free page, so that it doesn't trigger my dynamic pages. Those pages pull data from other sources, which eats my bandwidth and my sources' bandwidth, as well as the disk space for my cache files.

Is there a way to do this: put the links and descriptions in a plain HTML page, and not allow the bots access to my dynamic pages?

Do I make sense now?

encyclo

12:49 am on May 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The standard spiders don't download graphics anyway - and you can use robots.txt to exclude your graphics from the image bots. The easiest way is to place all your graphical elements in a separate directory, and then you just disallow everything in that directory:

User-agent: *
Disallow: /the_name_of_your_images_folder/

Other than that, if you want your content indexed, you must let the bots index your content... if you see what I mean ;)
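The same idea extends to the dynamic pages mentioned earlier: if the scripts that fetch remote data live under a single directory, one extra Disallow line keeps compliant bots on the plain HTML pages (a sketch; /dynamic/ is a placeholder for the actual path):

User-agent: *
Disallow: /the_name_of_your_images_folder/
Disallow: /dynamic/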

branmh

1:03 am on May 31, 2005 (gmt 0)

10+ Year Member



There has been one night when I downloaded around 50 megs of data from my source, just for no one to see — wasting bandwidth (mine and my source's) and disk space.

Dijkgraaf

10:03 pm on Jun 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well that type of badly behaved bot is unlikely to obey robots.txt in the first place, so trying to use robots.txt to stop it is probably pointless.
Did that spider even request robots.txt?
What was the user agent it used?
Has it come again and used the same IP address? If so banning that IP address would work.

branmh

12:04 am on Jun 2, 2005 (gmt 0)

10+ Year Member



I would still like the spider to index my site, but I'd prefer it not hit the live pages, which download all the source files from another server and waste that bandwidth as well as my disk space for caching them.

branmh

6:34 pm on Jun 12, 2005 (gmt 0)

10+ Year Member



As of right now Google has hit my site 7063 times and has caused my bandwidth usage to increase.

Now, which environment variable would be easier to use within a PHP file to keep the content area from loading for a search engine (REMOTE_HOST, HTTP_USER_AGENT, or something else)? And is there a list available of the REMOTE_HOST and HTTP_USER_AGENT values for all the search engines?
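A minimal sketch of the HTTP_USER_AGENT approach in PHP — the bot names checked below are illustrative rather than a complete list, the header can be spoofed so this is not a reliable access control, and the two echo lines stand in for whatever pages you would actually serve:

```php
<?php
// Returns true if the User-Agent string looks like a known crawler.
// The substrings here are examples only; real bot names vary over time.
function is_search_bot($agent) {
    $bots = array('Googlebot', 'Slurp', 'msnbot', 'Teoma');
    foreach ($bots as $bot) {
        // stripos() needs PHP 5; on PHP 4 use stristr() instead.
        if (stripos($agent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (is_search_bot($agent)) {
    // Serve the plain sitemap-style page; skip the remote fetch and caching.
    echo "serving plain sitemap page\n";
} else {
    // Normal visitors get the full dynamic page.
    echo "serving dynamic page\n";
}
?>
```

REMOTE_HOST is generally less practical: it requires a reverse-DNS lookup on every request, and many hosts leave it unset.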