Forum Moderators: DixonJones


Discovering hotlinking.

How?

         

Broadway

4:40 am on Mar 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My site is graphics-intensive. Everybody and his brother rips off my text (I find that with Google searches). I've always assumed that there is some image hotlinking going on too.

My hosting bill, which is tied to my monthly bandwidth usage, is starting to get a little big. While I have no real proof of hotlinking, I'm at the point where I would like to know.

My host provides Webalizer stats. Is there some aspect of them that can provide hotlinking information?

My host generates a new site log for each day of the week. I can download this log. I've looked at it with WordPad but it is very confusing. Is there some program I could run this file through on my own machine to decipher it?

keyplyr

8:54 am on Mar 13, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One way would be to use www.analog.cx website statistics analyzer (free) which runs on your local machine. It can be configured many ways, including showing file downloads compared to page downloads in parallel columns. If there are file hits but no page loads, the user is loading your image remotely. Then you just follow the referrer to see where this is occurring. This is not a fool-proof method because of blank referrers, tabbed browsing, SERPs in frames, and other methods that will not display an accurate referrer, but it does catch quite a few. Then the question is, what do you do?

Broadway

3:31 am on Mar 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



keyplyr,
I downloaded Analog; that was certainly easy enough. However, I really can't seem to figure out what "command lines" I need to put in the configuration file to provide the information you discuss. Could you help me with this? Thanks.

Macguru

3:53 am on Mar 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Broadway,

Analog is also my favorite stat program, but it can take some time to set the config file to suit your needs. There are many ways to prevent hotlinking [google.com] without having to look for the culprits.

There is also another thing to think about: these sites are linking to your site, thus contributing to its popularity. If you pull the plug, your rankings on most SE will probably suffer.

If you want to do a quick check on who links to your site, you can query GG with
-link: www.yoursite.com/ and run through them manually.

keyplyr

5:35 am on Mar 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... these sites are linking to your site, thus contributing to its popularity. If you pull the plug, your rankings on most SE will probably suffer.

Macguru, we're talking about remote linking to images. Ranking pertains to webpages.

Broadway - There's a folder named: examples, which contains a file named bigbyrep.cfg. Read through it. It shows you how to control what is displayed in each category on your report.html page.

Granted, analog is not a quick study, but it is a terrific tool and highly customizable.

<added>

This is an example of how to show columns for page requests vs. file requests by referrer in your analog.cfg file, thus showing who's remote linking to image files:


REFERRER ON
REFCHART ON
REFCOLS PR
REFSORTBY REQUESTS
REFFLOOR 1r
REFARGSSORTBY REQUESTS
REFARGSFLOOR 1r

(The 1 in 1r sets the floor at one request, so every referrer is listed. Increase the number to taste.)

</added>
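
For anyone curious what the P and R columns amount to, here is a rough Python sketch of the same tally done by hand. It is not how Analog works internally. It assumes an Apache combined-format log, and it assumes a "page" is any request ending in .htm, .html or a slash. The names LINE_RE, PAGE_RE and count_by_referrer are made up for this illustration.

```python
import re
from collections import defaultdict

# Assumed layout (Apache "combined"):
#   ... "GET /path HTTP/1.1" status bytes "referrer" "user agent"
LINE_RE = re.compile(r'"[A-Z]+ (\S+) [^"]*" \d{3} \S+ "([^"]*)"')

# Assumption: a "page" is anything ending in .htm, .html or "/".
PAGE_RE = re.compile(r'(\.html?|/)$', re.IGNORECASE)


def count_by_referrer(lines):
    """Return {referrer: [page_requests, all_requests]} for parseable lines."""
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        path, referrer = match.groups()
        path = path.split("?", 1)[0]  # ignore any query string
        counts[referrer][1] += 1
        if PAGE_RE.search(path):
            counts[referrer][0] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [09/Mar/2004:16:07:09 +0100] "GET /a.jpg HTTP/1.1" '
        '200 10 "http://other.example/x.htm" "UA"',
        '1.2.3.4 - - [09/Mar/2004:16:07:10 +0100] "GET /index.html HTTP/1.1" '
        '200 10 "http://www.yoursite.com/" "UA"',
    ]
    for referrer, (pages, requests) in count_by_referrer(sample).items():
        print(pages, requests, referrer)
```

A referrer with plenty of requests but zero pages is the pattern to look for: files fetched without any page being loaded.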

Hagstrom

10:24 am on Mar 14, 2004 (gmt 0)

10+ Year Member



If you have the (raw Apache) log, you can use grep in two steps:

First you select all log records that concern images (-E turns on the extended syntax that the parentheses and | need):

grep -E -i "(\.gif HTTP|\.jpg HTTP)" my.log > work.log

"my.log" is your log (of course) and work.log is a temporary file that you pass on to step 2:

grep -E -i -v "(http://.*.yoursite\.com|cache|atomz|babelfish| \"-\")" work.log > hotlink.log

This command will select all log records that are not referred from sites that you control. Remember to replace "yoursite\.com" with your own site-id.

cache, atomz and babelfish are sites that legitimately link to my images.
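
On a machine without grep, the same two-step filter could be approximated in Python. This is only a sketch of the two commands above, not a replacement for them. find_hotlinks and the two pattern names are hypothetical, and yoursite\.com is still a placeholder for your own site.

```python
import re
import sys

# Step 1: keep image requests (same idea as the first grep, case-insensitive).
IMAGE_RE = re.compile(r'\.(gif|jpg) HTTP', re.IGNORECASE)

# Step 2: drop lines that match anything considered legitimate.
# "yoursite\.com" is a placeholder; cache, atomz and babelfish come from
# the post above, and ' "-"' is the blank referrer.
ALLOWED_RE = re.compile(
    r'(http://.*.yoursite\.com|cache|atomz|babelfish| "-")', re.IGNORECASE
)


def find_hotlinks(lines):
    """Return image-request lines that do not match the allow-list."""
    images = (line for line in lines if IMAGE_RE.search(line))
    return [line for line in images if not ALLOWED_RE.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python find_hotlinks.py my.log > hotlink.log
    with open(sys.argv[1], errors="replace") as log:
        sys.stdout.writelines(find_hotlinks(log))
```

Like the grep version, this matches text anywhere in the line, so a user agent that happens to contain "cache" would also be dropped.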

Broadway

3:15 pm on Mar 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks keyplyr,
I had found big.cfg and fooled with it for well over an hour experimenting but still felt I was a long way from figuring things out regarding hotlinking and how to format things like you mentioned.

Broadway

3:27 pm on Mar 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hagstrom,
You mention your code reads the raw "apache" log. As a novice I chose a Windows/IIS plan. Now of course I regret it (no .htaccess control). Is there a difference between an apache log and the log I have?

Hagstrom

7:26 pm on Mar 14, 2004 (gmt 0)

10+ Year Member



I have no idea what those logs look like. But the main point is that I have a log that looks like this:

2X2.30.8.160 - - [09/Mar/2004:16:07:09 +0100] "GET /your-image.jpg HTTP/1.1" 200 1692 "http://www.yoursite.com/Yourpage.htm" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; DigExt)"

I used the first grep to extract all lines with "jpg" and "gif" in them - and the second grep to extract those where the referrer did not (that's the "-v") contain "yoursite.com".

[edited by: DaveAtIFG at 4:58 pm (utc) on Mar. 22, 2004]
[edit reason] Obscured IP [/edit]
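
One weakness of matching "yoursite.com" anywhere in the line is that a foreign referrer can carry that text in its query string. A possible refinement, sketched below in Python on the assumption that the log is in the format shown above, compares only the host part of the referrer. MY_DOMAIN, referrer_host and is_foreign are illustrative names, not part of any tool mentioned in this thread.

```python
import re
from urllib.parse import urlsplit

# Pulls the request path and the referrer out of a combined-format line.
LINE_RE = re.compile(r'"[A-Z]+ (\S+) [^"]*" \d{3} \S+ "([^"]*)"')

MY_DOMAIN = "yoursite.com"  # placeholder, as in the post above


def referrer_host(line):
    """Return the referrer's host in lower case, or None if blank or unparseable."""
    match = LINE_RE.search(line)
    if not match or match.group(2) in ("", "-"):
        return None
    return urlsplit(match.group(2)).hostname


def is_foreign(line):
    """True when the referrer host is neither MY_DOMAIN nor one of its subdomains."""
    host = referrer_host(line)
    if host is None:
        return False  # blank referrers cannot be attributed to anyone
    return host != MY_DOMAIN and not host.endswith("." + MY_DOMAIN)
```

Blank referrers are deliberately treated as not foreign here, which mirrors the "-" entry in the grep allow-list.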

The Cricketer

4:39 pm on Mar 22, 2004 (gmt 0)

10+ Year Member



Try 'Weblog expert Lite' - it's quality and pretty simple to use.

cgrantski

10:20 pm on Mar 22, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Logs are very confusing to look at.

You could try this, using the Command Window followed by Excel. I haven't done it in a while and hope I can remember it right. The main point of this is looking for third-party sites that are referrers of hits to your image files.

First, look inside one of your logs using WordPad or whatever and make sure *.jpg and *.gif files are actually being logged. If so, go to the command window (DOS-looking thing), change to the directory holding the log, and do this:

find ".jpg " ex040322.log > ex040322.jpegs
followed by, if you want, when it's finished:
find ".gif " ex040322.log >> ex040322.jpegs

(I'm just guessing at the logfile name and am arbitrarily using March 22 as the date of the log)

You'll end up with one file containing only jpeg file hits or only jpeg and gif hits. As a last step, open the original log and grab the line at the top starting with #Fields: and put it at the top of the new file.

Browse to the new file, right click on it in Windows, and choose "Open with Excel." In Excel, choose Delimited and specify a space as the sole delimiter. Excel will open only about the first 65,000 lines of your log (that is its row limit). In the transplanted line that starts with "#Fields:", delete the first cell (containing #Fields:), moving everything to the right of it one cell to the left. Now all your field names will line up vertically with the correct fields. Save this as an .xls file before you go any further.

Use Excel to sort on the Referrer field (probably called cs(Referer), spelled with a single "r" in the log). Be sure to specify that there is a header row when you do the sort. In the sorted worksheet, scan through the Referrer column. Since it's pretty rare for an image file to have a referrer other than your own site, you'll be looking for referrers that are not your site. When you find any, the images they're grabbing will be in the cs-uri-stem and/or the cs-uri-query fields of that line.
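
The find-and-sort steps above can also be approximated with a short Python script, which is not limited to the number of rows Excel can open. This is a sketch under assumptions: the log is in the W3C extended format with a #Fields: line, the cs-uri-stem and cs(Referer) fields are actually being logged, and MY_DOMAIN is a placeholder for your own domain. external_image_hits is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlsplit

MY_DOMAIN = "yoursite.com"          # placeholder for your own domain
IMAGE_SUFFIXES = (".jpg", ".gif")   # the two extensions used above


def external_image_hits(lines):
    """Count (referrer, image path) pairs whose referrer host is not MY_DOMAIN."""
    fields = []
    hits = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]   # column names follow the directive
            continue
        if not line or line.startswith("#") or not fields:
            continue                    # other directives, blanks, no header yet
        row = dict(zip(fields, line.split(" ")))
        stem = row.get("cs-uri-stem", "")
        referrer = row.get("cs(Referer)", "-")
        if not stem.lower().endswith(IMAGE_SUFFIXES):
            continue
        host = urlsplit(referrer).hostname
        if host is None:
            continue                    # blank ("-") or unparseable referrer
        if host == MY_DOMAIN or host.endswith("." + MY_DOMAIN):
            continue
        hits[(referrer, stem)] += 1
    return hits


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        # e.g. the day's log file, such as the ex040322.log guessed at above
        with open(sys.argv[1], errors="replace") as log:
            for (referrer, stem), n in external_image_hits(log).most_common():
                print(n, referrer, stem)
```

The output is one line per referrer and image, with the heaviest hotlinkers first.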

Okay. Now it's time for the smart people in this forum to shoot this full of holes ...