Forum Moderators: goodroi


Robots.txt Help

         

branmh

5:47 pm on May 29, 2005 (gmt 0)

10+ Year Member



Is there a way to have a plain HTML file (i.e. a sitemap) with all the links that I would like a search engine to spider, but not have it spider my actual pages with the graphics (to save bandwidth)?

I hope this makes sense.

jbinbpt

11:34 pm on May 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What are you expecting them to see? Are your URLs descriptive enough to rank, or is your site map going to include content? If so, then block everything except the index and sitemap. I would worry about how this looks to the search engines, though: all links and no content will work against what you are trying to do, and you might end up with a banned site.
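The "block everything except the index and sitemap" approach can be sketched in robots.txt like this (assuming the sitemap lives at /sitemap.html; note that Allow is a non-standard extension to the robots.txt protocol, though major crawlers such as Googlebot honor it):

User-agent: *
Allow: /index.html
Allow: /sitemap.html
Disallow: /

Crawlers that support Allow match the most specific (longest) path, so the two Allow lines take precedence over the blanket Disallow.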

branmh

11:53 pm on May 30, 2005 (gmt 0)

10+ Year Member



What I was hoping to do is let the spider crawl a plain, graphics-free page, so that it doesn't trigger my dynamic pages. Those pages pull data from other sources, which eats my bandwidth and my sources' bandwidth, as well as the disk space for my cache files.

Is there a way to do this: put the links and descriptions in a plain HTML page, and not allow the bots access to my dynamic pages?

Do I make sense now?

encyclo

12:49 am on May 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The standard spiders don't download graphics anyway - and you can use robots.txt to exclude your graphics from the image bots. The easiest way is to place all your graphical elements in a separate directory, and then you just disallow everything in that directory:

User-agent: *
Disallow: /the_name_of_your_images_folder/

Other than that, if you want your content indexed, you must let the bots index your content... if you see what I mean ;)
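The same idea extends to the dynamic pages mentioned earlier: if the scripts that fetch remote data live under a single directory, one extra Disallow line keeps compliant bots on the plain HTML pages (a sketch; /dynamic/ is a placeholder for the actual path):

User-agent: *
Disallow: /the_name_of_your_images_folder/
Disallow: /dynamic/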

branmh

1:03 am on May 31, 2005 (gmt 0)

10+ Year Member



There has been one night when I downloaded around 50 megs of data from my source, just for no one to see — wasting bandwidth (mine and my source's) and disk space.

Dijkgraaf

10:03 pm on Jun 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well that type of badly behaved bot is unlikely to obey robots.txt in the first place, so trying to use robots.txt to stop it is probably pointless.
Did that spider even request robots.txt?
What was the user agent it used?
Has it come again and used the same IP address? If so banning that IP address would work.

branmh

12:04 am on Jun 2, 2005 (gmt 0)

10+ Year Member



I would still like the spider to index my site, but I'd prefer it not hit the live pages, which download all the source files from another server and waste that bandwidth as well as my disk space for caching them.

branmh

6:34 pm on Jun 12, 2005 (gmt 0)

10+ Year Member



As of right now Google has hit my site 7063 times and has caused my bandwidth usage to increase.

Now, which environment variable would be easier to use within a PHP file to keep the content area from loading for a search engine (REMOTE_HOST, HTTP_USER_AGENT, or something else)? And is there a list available of the REMOTE_HOST and HTTP_USER_AGENT values for all the search engines?
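A minimal sketch of the HTTP_USER_AGENT approach in PHP — the bot names checked below are illustrative rather than a complete list, the header can be spoofed so this is not a reliable access control, and the two echo lines stand in for whatever pages you would actually serve:

```php
<?php
// Returns true if the User-Agent string looks like a known crawler.
// The substrings here are examples only; real bot names vary over time.
function is_search_bot($agent) {
    $bots = array('Googlebot', 'Slurp', 'msnbot', 'Teoma');
    foreach ($bots as $bot) {
        // stripos() needs PHP 5; on PHP 4 use stristr() instead.
        if (stripos($agent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (is_search_bot($agent)) {
    // Serve the plain sitemap-style page; skip the remote fetch and caching.
    echo "serving plain sitemap page\n";
} else {
    // Normal visitors get the full dynamic page.
    echo "serving dynamic page\n";
}
?>
```

REMOTE_HOST is generally less practical: it requires a reverse-DNS lookup on every request, and many hosts leave it unset.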