

Wget/1.6

3 requests in 10 hrs


bobriggs

2:49 pm on May 15, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Very strange UA. First time I've seen this one. And three requests for the index page within a 10-hour period from three different IPs, none related to any other.

Only took the index page, no graphics, and did not request robots.txt.

Anybody know anything about this UA?

Woz

2:58 pm on May 15, 2001 (gmt 0)




Hi Bob,

got this thru google - [freeware.sgi.com...]

"GNU Wget (wget) is a freely available network utility to retrieve files from the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP (File Transfer Protocol), the two most widely used Internet protocols."

Onya
Woz

Everyman

3:08 pm on May 15, 2001 (gmt 0)



Wget is a GNU personal spider. The man page is dated 1996. There's also a Windows port. You're lucky it didn't try to grab your entire site. It has a zillion options, and can crawl subdirectories recursively.

You can set robots.txt on or off (this is typical; most personal bots that even bother with robots.txt appear to have a switch that allows you to ignore this file).
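For what it's worth, a site that wants to opt out can publish a robots.txt entry keyed to the Wget user-agent. A minimal sketch; it only helps when the person running wget leaves robots.txt support switched on:

```
# robots.txt at the site root
# Only honored if the wget user leaves robots support enabled
User-agent: Wget
Disallow: /
```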

The only thing wget DOESN'T seem to have is an optional delay between successive GETs. There's a timeout for a hung connection, but nothing for being polite on someone's site.
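As a rough illustration, a recursive grab with the kind of options described above might look like this (flag names from the wget documentation; exact availability in any given build or version may differ, and the site name is a placeholder):

```sh
# -r    follow links recursively into subdirectories
# -l 2  limit recursion depth to 2 levels
# -np   never ascend to the parent directory
wget -r -l 2 -np http://www.example.com/
```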

You can download wget for free on the Internet at various places. The Windows version I have works in DOS32 mode, and lacks some of the wildcard versatility of the Linux version.

skirril

3:34 pm on May 15, 2001 (gmt 0)




I assume the wget/1.6 hits I am seeing at the moment are related to the GRUB project (no, not the GRand Unified Bootloader), the so-called "Open Source Indexing site", www.grub.org.

As far as I can tell, they did not get robots.txt; no idea if they honor the meta tag.

Since it is a distributed system, blocking one IP won't do much either.

All I can hope is that they improve their code to at least honor robots.txt, and change the wget/1.6 UA to grub-wget/1.6 or something.
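If the UA string ever does become distinct, denying by user-agent is more practical than chasing IPs. A hedged Apache sketch, assuming mod_setenvif is loaded and using the hypothetical grub-wget string the post wishes for:

```apache
# Deny requests whose User-Agent contains "grub-wget"
# (hypothetical UA string; adjust to whatever actually shows up in your logs)
SetEnvIfNoCase User-Agent "grub-wget" blocked_bot
<Limit GET>
Order Allow,Deny
Allow from all
Deny from env=blocked_bot
</Limit>
```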

mollusk

5:52 pm on Aug 16, 2001 (gmt 0)



Question:

How do I enable robots such as wget to grab my HTML page?

My html header is:

<html><head><title>My Store Here</title>
<META NAME="description" CONTENT="A lot of content here">
<META content="text/html; charset=windows-1252" http-equiv=Content-Type></head>

And the page is at the second level of our web site, i.e. there is an <a href> link to it on my homepage.

But when I tried wget -r, it found this page but did not search any deeper from it.

What is wrong with my html header?

Should I add something else?

littleman

6:11 pm on Aug 22, 2001 (gmt 0)



wget -r [blahh.com...] will usually do the job. Do you have unusual links?
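On the depth question above: wget's recursion defaults to several levels, so if it stops after one page the cause is usually the links themselves (off-host URLs, which wget skips by default, or a robots.txt rule) rather than the HTML header. A sketch with the relevant switches (standard wget options; the URL is a placeholder):

```sh
# -r       follow links recursively (depth defaults to 5)
# -l 3     set an explicit depth limit
# --debug  print why each link is accepted or rejected
wget -r -l 3 --debug http://www.example.com/store/page.html
```

Note that links pointing to other hosts are not followed unless host-spanning is explicitly enabled, which is a common reason a recursive crawl seems to stall.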