Make my own web crawler

         

ThatBG

11:23 am on Dec 8, 2005 (gmt 0)

10+ Year Member



Hi guys, it's my first post here, but I must say I've been reading this site for quite a while and it's been a great help to me :) Anyway, I want to set up my own web crawler. It's just a one-off: I want to find a file, and I know which site it's on but not its exact address. Is there anything available that would do this?

Thanks,
TBG

physics

6:53 pm on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi ThatBG, welcome to WebmasterWorld.com!

It sounds like a web crawler may be overkill for what you need. Have you tried the site: command on Google, Yahoo, and MSN Search? For example:
site:example.com filename

If that doesn't work for you, look into wget if you're on a Unix-type system, or if you're on Windows, look on Tucows or a similar site for a free crawler script.

ThatBG

11:46 pm on Dec 8, 2005 (gmt 0)

10+ Year Member



Hi Physics,

I managed to find one, but it doesn't seem to be working on the site I want it to. If robots.txt says

User-agent: *
Disallow: /

does that mean it can't crawl anything?

Thanks,
Jake

ThatBG

11:48 pm on Dec 8, 2005 (gmt 0)

10+ Year Member



Hi Physics,

I managed to find one, but it doesn't seem to be working on the site I want it to, because the file isn't actually linked from anywhere on the site.

Thanks,
TBG

Stefan

11:54 pm on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I managed to find one, but it doesn't seem to be working on the site I want it to, because the file isn't actually linked from anywhere on the site.

Sorry if I've missed something, but how will a crawler find it if there are no links to it?
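
To make the point concrete, here is a minimal sketch of how a link-following crawler discovers URLs, using only Python's standard library. The names LinkCollector, extract_links and PAGE are made up for illustration, and the HTML is a stand-in rather than anything from the site in question. A crawler built this way only ever learns about addresses that appear in an href on a page it has already fetched, so a file with no links pointing at it never enters the queue.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the list of URLs a crawler would queue from this page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, for illustration only.
PAGE = '<p><a href="/about.html">About</a> <a href="docs/index.html">Docs</a></p>'
print(extract_links("http://example.com/", PAGE))
# ['http://example.com/about.html', 'http://example.com/docs/index.html']
```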

Could you just email the webmaster and ask them?

(And welcome to WW.)

physics

12:42 am on Dec 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Most spiders look for a file called robots.txt on every domain they crawl; it does not have to be linked to from anywhere.
If their robots.txt has those lines, then no, you're not allowed to crawl the site. I doubt asking the webmaster will help in this case.
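
If you want to check this yourself, here is a rough sketch using Python's standard urllib.robotparser module. The helper name is_allowed, the user agent "MyCrawler" and the example.com URL are placeholders I made up for illustration; the rules string is the standard "block everything" form of robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A robots.txt that blocks every crawler from the whole site.
ROBOTS = """User-agent: *
Disallow: /
"""

print(is_allowed(ROBOTS, "MyCrawler", "http://example.com/files/report.pdf"))
# False
```

A well-behaved crawler runs a check like this before every request and skips any URL that comes back False, which would explain a crawler fetching nothing from a site with those rules.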

ThatBG

10:24 am on Dec 9, 2005 (gmt 0)

10+ Year Member



I was after a web crawler that took a brute-force approach, not one that just follows links. But it doesn't matter now :)