Forum Moderators: DixonJones

Message Too Old, No Replies

How do I check if robots have crawled my site?

         

H2O_aa

4:49 pm on Feb 23, 2005 (gmt 0)

10+ Year Member



I heard from someone that there's a log where you can see which robots have crawled your site and which pages they requested, but I have no idea where to find it. Thanks.

pmkpmk

4:53 pm on Feb 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you have access to your logfiles (which is a necessity if you're serious about being a professional webmaster), then watch for User-Agent strings such as "Googlebot" and "Slurp". Requests for robots.txt in the logfile are also a dead giveaway.
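The scan described above can be sketched in a few lines of Python. The bot names and sample log lines below are illustrative assumptions, not an exhaustive list:

```python
import re

# Match a few well-known crawler names in the user-agent field
# (case-insensitive). Extend the list for the bots you care about.
BOT_PATTERN = re.compile(r'googlebot|slurp|msnbot', re.IGNORECASE)

def bot_hits(lines):
    """Return log lines whose user-agent field mentions a known bot."""
    return [line for line in lines if BOT_PATTERN.search(line)]

sample = [
    '66.249.64.1 - - [23/Feb/2005:11:22:24 +0100] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.168.2.1 - - [23/Feb/2005:11:22:24 +0100] "GET / HTTP/1.0" 200 468 "-" "Mozilla/5.0 Firefox/1.0"',
]
print(len(bot_hits(sample)))  # 1 — only the Googlebot line matches
```

In practice you would read the lines from your access log file instead of a hard-coded sample.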

If you don't have access to your logs, go out and get it!

H2O_aa

5:01 pm on Feb 23, 2005 (gmt 0)

10+ Year Member



I have a web statistics package set up for my site, and it shows 77 requests for robots.txt, which doesn't seem like an impressive number.

Also, where can I get access to my logfiles?

Thanks for your explanation.

pmkpmk

5:02 pm on Feb 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ask the helpdesk of your ISP (internet service provider). There's no standard answer to this question, but they will know what to do.

H2O_aa

5:10 pm on Feb 23, 2005 (gmt 0)

10+ Year Member



Sorry for my shallow knowledge, but web statistics are different from the logfiles you are talking about, right?

pmkpmk

5:19 pm on Feb 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes. Web statistics are the distilled result derived from your logs. On the popular Apache webserver, a logfile entry looks like this:

192.168.2.1 - - [23/Feb/2005:11:22:24 +0100] "GET /phpMyAdmin HTTP/1.0" 401 468 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0"
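That entry is in Apache's "combined" log format. A minimal parser, sketched in Python, pulls the fields apart; the regex here is a simplification and real logs can contain quirks (such as quotes inside the user agent) that need more careful handling:

```python
import re

# Fields of the Apache combined log format: client IP, identd, user,
# timestamp, request line, status code, response size, referer, user agent.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('192.168.2.1 - - [23/Feb/2005:11:22:24 +0100] '
        '"GET /phpMyAdmin HTTP/1.0" 401 468 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) '
        'Gecko/20041107 Firefox/1.0"')

m = LOG_RE.match(line)
print(m.group('ip'), m.group('status'), m.group('agent'))
```

Once the fields are separated, filtering for crawler user agents or counting requests per page is straightforward.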

g1smd

5:45 pm on Feb 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Log files are usually to be found in a folder located somewhere above the web root. You will need FTP access to your website to access that location.

Often, the server only keeps the last 7 to 20 days worth of results. Your host may have archived the older ones. If you need those, ask your host for them.

There is usually one log file per day, and each line in the file records the date and time that someone accessed your website, their IP address and user agent, what file they asked for, the status of that request, and so on.

If someone accessed a page of your site, and that page contained 3 images, then there would be 4 entries in the log file for that action. The log file can be very big for a large traffic site.

An analysis program can read the data from that file and then produce statistics for you: such things as number of unique users, number of pages accessed, users per country or ISP, and many other things.
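A toy version of that aggregation step can be sketched as follows; the field layout and sample entries are assumptions for illustration:

```python
from collections import Counter

def summarize(entries):
    """entries: list of (ip, path, agent) tuples extracted from a log.
    Returns a unique-visitor count and hits per user agent."""
    return {
        'unique_ips': len({ip for ip, _, _ in entries}),
        'hits_per_agent': Counter(agent for _, _, agent in entries),
    }

entries = [
    ('66.249.64.1', '/robots.txt', 'Googlebot/2.1'),
    ('66.249.64.1', '/index.html', 'Googlebot/2.1'),
    ('10.0.0.5', '/index.html', 'Firefox/1.0'),
]
stats = summarize(entries)
print(stats['unique_ips'])                       # 2
print(stats['hits_per_agent']['Googlebot/2.1'])  # 2
```

Real analysis packages add resolution of IPs to countries and ISPs, session grouping, and so on, but the core idea is the same counting exercise.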

wildfiction

8:15 pm on Feb 25, 2005 (gmt 0)

10+ Year Member



Is there a way to set up some code (ASP, for example) on my page(s) so that I will receive an email when my website is crawled?

g1smd

8:28 pm on Feb 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes. You can detect the UserAgent string and then take various actions when the UA you are looking for is detected as accessing that particular page.
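The thread asks about ASP, where the user agent comes from Request.ServerVariables("HTTP_USER_AGENT"); the same idea can be sketched in Python. The bot names and the notify callback are illustrative assumptions:

```python
KNOWN_BOTS = ('googlebot', 'slurp', 'msnbot')

def crawler_name(user_agent):
    """Return the matched bot name, or None for ordinary visitors."""
    ua = user_agent.lower()
    for bot in KNOWN_BOTS:
        if bot in ua:
            return bot
    return None

def handle_request(user_agent, notify):
    """Call notify(bot) when a known crawler fetches the page."""
    bot = crawler_name(user_agent)
    if bot is not None:
        notify(bot)  # e.g. send yourself an email here
    return bot

seen = []
handle_request('Googlebot/2.1 (+http://www.google.com/bot.html)', seen.append)
print(seen)  # ['googlebot']
```

In a real deployment, notify would send the email (for instance via smtplib, or CDONTS in classic ASP), ideally with some throttling so a full crawl doesn't flood your inbox.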

SEOMike

8:40 pm on Feb 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes. You can detect the UserAgent string and then take various actions when the UA you are looking for is detected as accessing that particular page.

I'm no programmer, so I'd be interested to hear how this is set up!

wildfiction

5:51 am on Feb 26, 2005 (gmt 0)

10+ Year Member



So what is the general strategy? Do you set up a single ASP page linked from your home page, so that the spider is likely to crawl it and you can log the hit? In other words, you only need it on one page, don't you?

I searched the net for some ASP code to do this and came up with something called XAgent, which seems to do the job, but their site has either been hijacked or is now under construction, so I haven't been able to find any example code yet.

mattglet

1:25 pm on Feb 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's one of the best articles I've seen:

[msdn.microsoft.com...]

Jackal

5:52 pm on Feb 27, 2005 (gmt 0)



OK, so what are some of the best sources on writing a good robots.txt for a website? I am interested in learning how to write one that blocks specific IP addresses (and maybe whole IP address blocks, should the need ever arise), as well as specific browsers that I know show up as dummy entries in the log files to evade detection of what is really doing what on a person's sites. I would appreciate anyone's help.

g1smd

6:10 pm on Feb 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The robots.txt file does not block anything at all. It is a friendly "please don't look here" message, but very few spiders actually read that file, and even fewer take note of what it says.

To physically block spiders or IPs you need to set file and directory permissions. One way to do this on Apache is to set these up in the .htaccess file. I assume that IIS boxes have some sort of equivalent functionality.
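A minimal sketch of that approach in .htaccess, assuming Apache-era mod_access and mod_rewrite are available; the IP addresses and user-agent name below are examples only:

```apache
# Deny one IP address and one IP block (Apache 1.3/2.0 mod_access syntax)
Order Allow,Deny
Allow from all
Deny from 192.168.2.1
Deny from 10.0.0.0/8

# Return 403 Forbidden to a specific user agent via mod_rewrite
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC]
RewriteRule .* - [F]
```

Unlike robots.txt, these rules are enforced by the server itself, so a spider cannot simply ignore them.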

Jackal

7:00 am on Mar 1, 2005 (gmt 0)



g1smd,

Thanks for the feedback, but I am soooooo lost! Sorry, everything you have explained seems like a foreign language to me. I am new at all of this, so I apologize for my ignorance. However, I really do need help creating an effective robots.txt, or some other way to keep out the robots I do not want indexing or searching my pages. I am not using Apache; in all honesty, my provider offers Windows-based services.

Jackal

7:53 am on Mar 1, 2005 (gmt 0)



OK, sorry for sounding the alarm so quickly! I have found some excellent sources (of course) here on the site, including the validator and many other posts and threads! Is this site cool or what? ;) So, never mind my previous question about robots.txt help. I am sure I will have other stupid questions later, though! ;)

[searchengineworld.com...]

Jackal

8:02 am on Mar 1, 2005 (gmt 0)



The text below is from the robots.txt tutorial, and it is the only part I do not understand. Can someone please elaborate on this? I am using FP2000 and I don't want to mess anything up. Thanks!

***********************

The robots.txt file should be created in Unix line ender mode! Most good text editors will have a Unix mode or your FTP client *should* do the conversion for you. Do not attempt to use an HTML editor that does not specifically have a text mode to create a robots.txt file.
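If your editor or FTP client won't do the conversion, the line endings can be fixed with a short script. This Python sketch rewrites a file in place, converting Windows CRLF (and bare CR) endings to Unix LF; the demo filename is temporary:

```python
import os
import tempfile

def to_unix_line_endings(path):
    """Rewrite a file in place, converting CRLF (and bare CR) to LF."""
    with open(path, 'rb') as f:
        data = f.read()
    with open(path, 'wb') as f:
        f.write(data.replace(b'\r\n', b'\n').replace(b'\r', b'\n'))

# Demonstrate on a temporary file written with Windows line endings.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'User-agent: *\r\nDisallow:\r\n')
os.close(fd)
to_unix_line_endings(path)
with open(path, 'rb') as f:
    converted = f.read()
os.unlink(path)
print(converted)  # b'User-agent: *\nDisallow:\n'
```

Opening the files in binary mode matters here: text mode would let the platform translate line endings behind your back.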
***********************

Jackal

8:08 am on Mar 1, 2005 (gmt 0)



Oh! One last thing! This is also from the robots.txt tutorial. Given that the previous passage says not to use an HTML editor, can I use FP2000 to enter what follows, or not? If not, what do I do to make the following work for me? My web hosting offers Windows-based services only. No Apache or anything else. Any suggestions?

*************************
# Robots.txt file from [searchengineworld.com...]
#
# Built from text file
[info.webcrawler.com...]
#
# This restricts access to only known and registered robots.
#

User-agent: Mozilla/3.0 (compatible;miner;mailto:miner@miner.com.br)
Disallow:

User-agent: WebFerret
Disallow:

User-agent: Due to a deficiency in Java it's not currently possible to set the User-agent.
Disallow:

User-agent: no
Disallow:

User-agent: 'Ahoy! The Homepage Finder'
Disallow:

User-agent: Arachnophilia
Disallow:

User-agent: ArchitextSpider
Disallow:

User-agent: ASpider/0.09
Disallow:

User-agent: AURESYS/1.0
Disallow:

User-agent: BackRub/*.*
Disallow:

User-agent: Big Brother
Disallow:

User-agent: BlackWidow
Disallow:

User-agent: BSpider/1.0 libwww-perl/0.40
Disallow:

User-agent: CACTVS Chemistry Spider
Disallow:

User-agent: Digimarc CGIReader/1.0
Disallow:

User-agent: Checkbot/x.xx LWP/5.x
Disallow:

User-agent: CMC/0.01
Disallow:

User-agent: combine/0.0
Disallow:

User-agent: conceptbot/0.3
Disallow:

User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
Disallow:

User-agent: root/0.1
Disallow:

User-agent: CS-HKUST-IndexServer/1.0
Disallow:

User-agent: CyberSpyder/2.1
Disallow:

User-agent: Deweb/1.01
Disallow:

User-agent: DragonBot/1.0 libwww/5.0
Disallow:

User-agent: EIT-Link-Verifier-Robot/0.2
Disallow:

User-agent: Emacs-w3/v[0-9\.]+
Disallow:

User-agent: EmailSiphon
Disallow:

User-agent: EMC Spider
Disallow:

User-agent: explorersearch
Disallow:

User-agent: Explorer
Disallow:

User-agent: ExtractorPro
Disallow:

User-agent: FelixIDE/1.0
Disallow:

User-agent: Hazel's Ferret Web hopper,
Disallow:

User-agent: ESIRover v1.0
Disallow:

User-agent: fido/0.9 Harvest/1.4.pl2
Disallow:

User-agent: Hämähäkki/0.2
Disallow:

User-agent: KIT-Fireball/2.0 libwww/5.0a
Disallow:

User-agent: Fish-Search-Robot
Disallow:

User-agent: Mozilla/2.0 (compatible fouineur v2.0; fouineur.9bit.qc.ca)
Disallow:

User-agent: Robot du CRIM 1.0a
Disallow:

User-agent: Freecrawl
Disallow:

User-agent: FunnelWeb-1.0
Disallow:

User-agent: gcreep/1.0
Disallow:

User-agent:?
Disallow:

User-agent: GetURL.rexx v1.05
Disallow:

User-agent: Golem/1.1
Disallow:

User-agent: Gromit/1.0
Disallow:

User-agent: Gulliver/1.1
Disallow:

User-agent: yes
Disallow:

User-agent: AITCSRobot/1.1
Disallow:

User-agent: wired-digital-newsbot/1.5
Disallow:

User-agent: htdig/3.0b3
Disallow:

User-agent: HTMLgobble v2.2
Disallow:

User-agent: no
Disallow:

User-agent: IBM_Planetwide,
Disallow:

User-agent: gestaltIconoclast/1.0 libwww-FM/2.17
Disallow:

User-agent: INGRID/0.1
Disallow:

User-agent: IncyWincy/1.0b1
Disallow:

User-agent: Informant
Disallow:

User-agent: InfoSeek Robot 1.0
Disallow:

User-agent: Infoseek Sidewinder
Disallow:

User-agent: InfoSpiders/0.1
Disallow:

User-agent: inspectorwww/1.0
[greenpac.com...]
Disallow:

User-agent: 'IAGENT/1.0'
Disallow:

User-agent: IsraeliSearch/1.0
Disallow:

User-agent: JCrawler/0.2
Disallow:

User-agent: Jeeves v0.05alpha (PERL, LWP, lglb@doc.ic.ac.uk)
Disallow:

User-agent: Jobot/0.1alpha libwww-perl/4.0
Disallow:

User-agent: JoeBot,
Disallow:

User-agent: JubiiRobot
Disallow:

User-agent: jumpstation
Disallow:

User-agent: Katipo/1.0
Disallow:

User-agent: KDD-Explorer/0.1
Disallow:

User-agent: KO_Yappo_Robot/1.0.4(http://yappo.com/info/robot.html)
Disallow:

User-agent: LabelGrab/1.1
Disallow:

User-agent: LinkWalker
Disallow:

User-agent: logo.gif crawler
Disallow:

User-agent: Lycos/x.x
Disallow:

User-agent: Lycos_Spider_(T-Rex)
Disallow:

User-agent: Magpie/1.0
Disallow:

User-agent: MediaFox/x.y
Disallow:

User-agent: MerzScope
Disallow:

User-agent: NEC-MeshExplorer
Disallow:

User-agent: MOMspider/1.00 libwww-perl/0.40
Disallow:

User-agent: Monster/vX.X.X -$TYPE ($OSTYPE)
Disallow:

User-agent: Motor/0.2
Disallow:

User-agent: MuscatFerret
Disallow:

User-agent: MwdSearch/0.1
Disallow:

User-agent: NetCarta CyberPilot Pro
Disallow:

User-agent: NetMechanic
Disallow:

User-agent: NetScoop/1.0 libwww/5.0a
Disallow:

User-agent: NHSEWalker/3.0
Disallow:

User-agent: Nomad-V2.x
Disallow:

User-agent: NorthStar
Disallow:

User-agent: Occam/1.0
Disallow:

User-agent: HKU WWW Robot,
Disallow:

User-agent: Orbsearch/1.0
Disallow:

User-agent: PackRat/1.0
Disallow:

User-agent: Patric/0.01a
Disallow:

User-agent: Peregrinator-Mathematics/0.7
Disallow:

User-agent: Duppies
Disallow:

User-agent: Pioneer
Disallow:

User-agent: PGP-KA/1.2
Disallow:

User-agent: Resume Robot
Disallow:

User-agent: Road Runner: ImageScape Robot (lim@cs.leidenuniv.nl)
Disallow:

User-agent: Robbie/0.1
Disallow:

User-agent: ComputingSite Robi/1.0 (robi@computingsite.com)
Disallow:

User-agent: Roverbot
Disallow:

User-agent: SafetyNet Robot 0.1,
Disallow:

User-agent: Scooter/1.0
Disallow:

User-agent: not available
Disallow:

User-agent: Senrigan/#*$!xxx
Disallow:

User-agent: SG-Scout
Disallow:

User-agent: Shai'Hulud
Disallow:

User-agent: SimBot/1.0
Disallow:

User-agent: Open Text Site Crawler V1.0
Disallow:

User-agent: SiteTech-Rover
Disallow:

User-agent: Slurp/2.0
Disallow:

User-agent: ESISmartSpider/2.0
Disallow:

User-agent: Snooper/b97_01
Disallow:

User-agent: Solbot/1.0 LWP/5.07
Disallow:

User-agent: Spanner/1.0 (Linux 2.0.27 i586)
Disallow:

User-agent: no
Disallow:

User-agent: Mozilla/3.0 (Black Widow v1.1.0; Linux 2.0.27; Dec 31 1997 12:25:00
Disallow:

User-agent: Tarantula/1.0
Disallow:

User-agent: tarspider
Disallow:

User-agent: dlw3robot/x.y (in TclX by [hplyot.obspm.fr...]
Disallow:

User-agent: Templeton/
Disallow:

User-agent: TitIn/0.2
Disallow:

User-agent: TITAN/0.1
Disallow:

User-agent: UCSD-Crawler
Disallow:

User-agent: urlck/1.2.3
Disallow:

User-agent: Valkyrie/1.0 libwww-perl/0.40
Disallow:

User-agent: Victoria/1.0
Disallow:

User-agent: vision-search/3.0'
Disallow:

User-agent: VWbot_K/4.2
Disallow:

User-agent: w3index
Disallow:

User-agent: W3M2/x.xxx
Disallow:

User-agent: WWWWanderer v3.0
Disallow:

User-agent: WebCopy/
Disallow:

User-agent: WebCrawler/3.0 Robot libwww/5.0a
Disallow:

User-agent: WebFetcher/0.8,
Disallow:

User-agent: weblayers/0.0
Disallow:

User-agent: WebLinker/0.0 libwww-perl/0.1
Disallow:

User-agent: no
Disallow:

User-agent: WebMoose/0.0.0000
Disallow:

User-agent: Digimarc WebReader/1.2
Disallow:

User-agent: webs@recruit.co.jp
Disallow:

User-agent: webvac/1.0
Disallow:

User-agent: webwalk
Disallow:

User-agent: WebWalker/1.10
Disallow:

User-agent: WebWatch
Disallow:

User-agent: Wget/1.4.0
Disallow:

User-agent: w3mir
Disallow:

User-agent: no
Disallow:

User-agent: WWWC/0.25 (Win95)
Disallow:

User-agent: none
Disallow:

User-agent: XGET/0.7
Disallow:

User-agent: Nederland.zoek
Disallow:

User-agent: BizBot04 kirk.overleaf.com
Disallow:

User-agent: HappyBot (gserver.kw.net)
Disallow:

User-agent: CaliforniaBrownSpider
Disallow:

User-agent: EI*Net/0.1 libwww/0.1
Disallow:

User-agent: Ibot/1.0 libwww-perl/0.40
Disallow:

User-agent: Merritt/1.0
Disallow:

User-agent: StatFetcher/1.0
Disallow:

User-agent: TeacherSoft/1.0 libwww/2.17
Disallow:

User-agent: WWW Collector
Disallow:

User-agent: processor/0.0ALPHA libwww-perl/0.20
Disallow:

User-agent: wobot/1.0 from 206.214.202.45
Disallow:

User-agent: Libertech-Rover www.libertech.com?
Disallow:

User-agent: WhoWhere Robot
Disallow:

User-agent: ITI Spider
Disallow:

User-agent: w3index
Disallow:

User-agent: MyCNNSpider
Disallow:

User-agent: SummyCrawler
Disallow:

User-agent: OGspider
Disallow:

User-agent: linklooker
Disallow:

User-agent: CyberSpyder (amant@www.cyberspyder.com)
Disallow:

User-agent: SlowBot
Disallow:

User-agent: heraSpider
Disallow:

User-agent: Surfbot
Disallow:

User-agent: Bizbot003
Disallow:

User-agent: WebWalker
Disallow:

User-agent: SandBot
Disallow:

User-agent: EnigmaBot
Disallow:

User-agent: spyder3.microsys.com
Disallow:

User-agent: www.freeloader.com.
Disallow:

User-agent: Googlebot
Disallow:

User-agent: METAGOPHER
Disallow:

User-agent: *
Disallow: /

*************************

cooldoug

1:00 pm on Mar 3, 2005 (gmt 0)

10+ Year Member



Your host should give you the logs; mine are under /logs.