Apache Web Server Forum

fixing a regex in htaccess
This regex doesn't seem to work
ianevans




msg:1499705
 8:22 am on Aug 5, 2003 (gmt 0)

I set up a few SetEnvIf directives in my .htaccess to zap the more notorious photo-stealing forums and community sites.

This regex worked like a charm:

SetEnvIfNoCase Referer "^http://([^/]*)domain\.tld/" spam_ref=1

The example I saw had some more generic ways to look for URLs that contained certain terms, so that you could block by forum software, for example. (I've seen people using 50K images for avatars.)

So I added an example regex:

SetEnvIfNoCase Referer "^http://([^/]*)phpbb([^/]*)/" spam_ref=1

Restarted Apache. I then headed off to a phpbb site I knew was stealing my photos and looked. Hmmm....cleared my cache. Went back....photos still there.

Is there an error in the regex? What do I need for "find this string in any URL"?

On a connected note: CSS has the selectoracle which explains CSS selectors: "Applies to an h1 in a div named #content". Is there a site where you can enter a regex expression and it fires back with "Looks for a string that doesn't start with 0-9 and contains the word 'whatever'"?

 

Timotheos




msg:1499706
 9:50 pm on Aug 5, 2003 (gmt 0)

> Is there a site where you can enter a regex expression and it fires back with "Looks for a string that doesn't start with 0-9 and contains the word 'whatever'"?

Here's a place you can test your expression
[regexlib.com...]

ianevans




msg:1499707
 10:14 pm on Aug 5, 2003 (gmt 0)

Thanks for the tip.

The site you pointed me to shows that the regex isn't working.

At this point in the day regex is like black magic to me. :-)

How do I work out if a phrase is contained in an url?

I'd like to be able to find some generic phrases that appear in the majority of urls of the photo stealing sites. Stuff like phpbb, ultimatebb and viewtopic

Thanks.

DaveAtIFG




msg:1499708
 11:42 pm on Aug 5, 2003 (gmt 0)

Why not "reverse your logic" and only allow from your domain. For example:
SetEnvIfNoCase Referer www\.mydomain\.com not_spam_referral=1
This is essentially the example used in the Apache docs [httpd.apache.org] for the SetEnvIf directive.

ianevans




msg:1499709
 11:52 pm on Aug 5, 2003 (gmt 0)

By using that setting, I'd exclude image bots like Google's though.

DaveAtIFG




msg:1499710
 11:54 pm on Aug 5, 2003 (gmt 0)

Good point, and in your case that's certainly a bad thing.

claus




msg:1499711
 11:55 pm on Aug 5, 2003 (gmt 0)

>> How do I work out if a phrase is contained in an url?

>> SetEnvIfNoCase Referer "^http://([^/]*)phpbb([^/]*)/" spam_ref=1

AFAIK, it's as simple as this (for the word "phpbb"):

SetEnvIfNoCase Referer phpbb spam_ref=1

- and a phrase (backslash in front of spaces, dots, etc.):

SetEnvIfNoCase Referer "phpbb\ board" spam_ref=1

/claus



added:

Welcome to WebmasterWorld ianevans :)

ianevans




msg:1499712
 12:06 am on Aug 6, 2003 (gmt 0)

Ahhh...so the example I had was using a sledgehammer to crack an egg?

jdMorgan




msg:1499713
 12:15 am on Aug 6, 2003 (gmt 0)

Try changing
>> ^http://([^/]*)phpbb([^/]*)/
to ^http://([^/]*)/phpbb([^/]*)/

This allows for the slash following the domain name.

Jim

ianevans




msg:1499714
 6:49 am on Aug 6, 2003 (gmt 0)

Adding the slash worked.

Thanks.

claus




msg:1499715
 9:34 am on Aug 6, 2003 (gmt 0)

>> sledgehammer

Well, sort of... regexps are pattern matching - more like Swiss army knives than sledgehammers. What you did (and jdMorgan's suggestion) was to specify more exactly where in the referrer string the "phpbb" should be found.

My suggestion will match the string "phpbb" anywhere - the other one will match the same string only inside another string starting with "http://" (or "/" i believe) and ending with "/". The first approach will match more than the second; e.g. it will also match the referrer "https://www.phpbb.com" or "http://example.com/myphpbb/file.htm".

It's two different ways, "a match for the string" vs. "a match for the string in a specified location inside another string with specific characteristics".

/claus

edit: clarified a bit
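
To make the difference between the three patterns in this thread concrete, here is a minimal sketch in Python. It uses Python's re module rather than Apache's own regex engine, on the assumption that patterns this simple behave the same way in both. The referrer strings and the helper name matching_patterns are made up for illustration.

```python
import re

# The three patterns discussed so far. SetEnvIfNoCase ignores case,
# so each one is compiled with re.IGNORECASE.
PATTERNS = {
    "unanchored": re.compile(r"phpbb", re.IGNORECASE),
    "host-anchored": re.compile(r"^http://([^/]*)phpbb([^/]*)/", re.IGNORECASE),
    "path-anchored": re.compile(r"^http://([^/]*)/phpbb([^/]*)/", re.IGNORECASE),
}


def matching_patterns(referer):
    """Return the names of the patterns that match this referrer string."""
    return [name for name, rx in PATTERNS.items() if rx.search(referer)]


# Hypothetical referrers, including the two from the post above.
for ref in (
    "http://www.phpbb.com/",
    "https://www.phpbb.com",
    "http://example.com/myphpbb/file.htm",
    "http://forum.example.com/phpBB2/viewtopic.php",
):
    print(ref, "->", matching_patterns(ref))
```

Only the unanchored pattern matches all four strings. The host-anchored one needs "phpbb" before the first slash after "http://", and the path-anchored one needs it right after that slash, which is why adding the slash made a forum installed under /phpBB2/ start matching.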

jdMorgan




msg:1499716
 1:50 pm on Aug 6, 2003 (gmt 0)

Precisely. There are many, many ways to accomplish a particular goal with regular expressions and mod_rewrite; in many cases, it comes down to a matter of your own personal style.

The common reference to mod_rewrite as a "Swiss army knife" springs from the fact that with mod_rewrite and regex, you have many choices of tools to solve the problem at hand.

Jim

ianevans




msg:1499717
 2:28 am on Aug 7, 2003 (gmt 0)

The solutions offered here have worked great. Thanks for the help everyone.

Here's a similar problem, but I'm not quite sure if it can be solved as easily...it's probably a mod_rewrite job. (Which I don't have.)

The help here has effectively shut off all the forum/community folks who post our photos like mad. The other problem I've noticed is this:

A person on a blog or forum says "Hey look at this." They then create a link to my site that goes directly to the image. In other words an 'a href' pointing to [mydomain.com...]

So hundreds of folks flock to see the bare image. No context, no content...and yes...no ads.

I'd much prefer it if they went to
[mydomain.com...]

I'm assuming this isn't easily solved?

jdMorgan




msg:1499718
 2:59 am on Aug 7, 2003 (gmt 0)

ianevans,

You may want to reconsider DaveAtIFG's idea, and make a list of allowed domains -- including your own and those of the various image bots.

Just make a series of SetEnvIf directives to set an environment variable named, for example, "allowimg" for allowed referrers, and then Allow your images only if the variable is set.

SetEnvIf Referer "google\.com" allowimg
SetEnvIf Remote_Host "googlebot\.com" allowimg
SetEnvIfNoCase Referer "^(www\.)?yourdomain\.com" allowimg

Order Deny,Allow
Deny from all
Allow from env=allowimg

The first line allows your images to be displayed by pages in the Google cache. The second allows googlebot to fetch your images, and the third allows your images to be displayed when called by pages on your own site. Add more at will.

Jim

ianevans




msg:1499719
 6:40 am on Aug 7, 2003 (gmt 0)

Strangely, the "^(www\.)?yourdomain\.com" regex didn't work and I couldn't see my photos on my own site.

Changed it to:
"^http://([^/]*)yourdomain\.com/" from an earlier example and it worked. Curious.

On another note...I'm now toying with the idea of a custom 403 page. If the 403 is caused by someone looking for an image without the context it will look it up in the gallery database and redirect them to the proper, full context page.

ianevans




msg:1499720
 6:57 am on Aug 7, 2003 (gmt 0)

I really should stop these late night coding sessions. Just saw another post from you JD (http://www.webmasterworld.com/forum10/2083.htm) that mentioned the problem of potentially blocking legitimate visitors with blank referrers...

Since there's only a very small list of really bad abusers I'm pondering going back to blocking by those sites and perhaps the generic board script names...off to get a thinking cap and a coffee.

claus




msg:1499721
 11:12 am on Aug 7, 2003 (gmt 0)

>> Strangely, the "^(www\.)?yourdomain\.com" regex didn't work

This character is important: ^

It denotes the beginning of the string. The regexp above says that the string should begin with either "www.yourdomain.com" or "yourdomain.com".

That is never matched, as the referrer will begin with either "http" or "/".

/claus

jdMorgan




msg:1499722
 4:15 pm on Aug 7, 2003 (gmt 0)

ianevans,

> Strangely, the "^(www\.)?yourdomain\.com" regex didn't work and I couldn't see my photos on my own site.
> Changed it to:
> "^http://([^/]*)yourdomain\.com/" from an earlier example and it worked. Curious.
>
> On another note...I'm now toying with the idea of a custom 403 page. If the 403 is caused by someone looking for an image without the context it will look it up in the gallery database and redirect them to the proper, full context page.

Oops! - My mistake... :(

> I really should stop these late night coding sessions.

Yes, me too... obviously! :)

I'd suggest:
SetEnvIfNoCase Referer "^http://(www\.)?yourdomain\.com" allowimg

The "http://" is not necessary in the other referer-based directives, because the pattern is not start-anchored with "^".
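
A quick way to check the anchoring point is to try the patterns outside Apache. The sketch below uses Python's re module as a stand-in for Apache's regex engine (an assumption, though these patterns use nothing engine-specific); the Referer values and the helper name sets_variable are hypothetical.

```python
import re

# Hypothetical Referer values; a Referer normally carries the full URL,
# scheme included.
own_page = "http://www.yourdomain.com/gallery/page.html"
google_cache = "http://www.google.com/search?q=cache:www.yourdomain.com/"


def sets_variable(pattern, referer):
    """True if a SetEnvIfNoCase with this pattern would be expected to match."""
    return re.search(pattern, referer, re.IGNORECASE) is not None


# Start-anchored without the scheme: fails, the string starts with "http://".
print(sets_variable(r"^(www\.)?yourdomain\.com", own_page))         # False
# Start-anchored with the scheme: matches, with or without "www.".
print(sets_variable(r"^http://(www\.)?yourdomain\.com", own_page))  # True
# Unanchored: matches wherever "google.com" appears in the string.
print(sets_variable(r"google\.com", google_cache))                  # True
```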

Thanks, claus, for helping. Note to self: "The Referer var starts with [,...] Remote_Host does not."

Regarding the 403 redirect to your page context: You cannot redirect from an <IMG> link to a non-image page - the browser usually can't handle changing MIME-types from an image to a text file when an <IMG> link is used. However, if the user is clicking a text link to [yourdomain.com...] it may work. The problem is that the state of the browser is not visible to you, and so the results of the redirect function will not be consistent between these two access methods.

An alternative is to put up a "stolen image" graphic containing the URL of your site, and redirect all unwelcome hotlinking to that image. The URL won't be clickable, but you may recover some "advertising value" from the hotlink, anyway.

Jim

ianevans




msg:1499723
 7:13 pm on Aug 7, 2003 (gmt 0)

Three things:

1) I need to cut down on my coffee intake.

2) Pondering the small percentage of blank referrers vs. not having a long list of specific leeches:

Would it work to have an .htaccess that denies from all but allows from mysite, google and blank refers? How would you specify the blank refer in the regex? Would it be:
SetEnvIfNoCase Referer "" allowed=1

3) jd, you said "Regarding the 403 redirect to your page context: You cannot redirect from an <IMG> link to a non-image page"

Just to clarify I meant using a custom error page as in:
ErrorDocument 403 /lookforphoto.shtml

I realize that would not be in effect for IMG tags, but it would work for HREF links would it not?

jdMorgan




msg:1499724
 9:26 pm on Aug 7, 2003 (gmt 0)

ianevans,

> How would you specify the blank refer in the regex? Would it be:
> SetEnvIfNoCase Referer "" allowed=1

SetEnvIfNoCase Referer "^$" allowed

"^$" specifies blank and "=1" is the default, so is not needed.
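
The reason for preferring "^$" over "" can be sketched in Python. This assumes Apache treats an empty pattern the way Python's re module does, which is not confirmed in this thread; the referrer value and the helper name matches are made up.

```python
import re


def matches(pattern, referer):
    """True if the pattern is found anywhere in the referrer string."""
    return re.search(pattern, referer) is not None


blank = ""
leech = "http://leech.example/forum/viewtopic.php"  # hypothetical referrer

# An empty pattern is found in every string, blank or not.
print(matches("", blank), matches("", leech))      # True True
# "^$" means start of string immediately followed by end of string.
print(matches("^$", blank), matches("^$", leech))  # True False
```

Under that assumption, the "" version would set the variable for every request, which defeats the purpose of an allow list.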

There is a long list of image loaders/referers you might want to allow - it tends to be site- and webmaster- specific.

  • Referral from your site's domain(s)
  • Referral from your site IP address (if a non-shared IP address)
  • Blank referrer
  • Referral from "trusted sites" - partners, friends, etc.
  • Referral from Google cache or translator
  • Referral from Yahoo cache
  • Referral from AltaVista Translator
  • Referral from Cometsystems cache
  • Referral from SearchHippo cache
  • Referral from Alexa/Internet Archiver cache
  • Referral from freetranslation.com
  • Referral from wysiwyg://[0-9]{1,2}/http://yourdomain.com (this is a Netscape4 javascript image load)

    There are many others, but those are the "high points"

    > 3) jd, you said "Regarding the 403 redirect to your page context: You cannot redirect from an <IMG> link to a non-image page"
    >
    > Just to clarify I meant using a custom error page as in:
    > ErrorDocument 403 /lookforphoto.shtml
    >
    > I realize that would not be in effect for IMG tags, but it would work for HREF links would it not?

    An error redirect to an shtml page would work for links where the requested item was a "page" MIME-type, such as "text/html". I'm not sure that it would work for a link to an image MIME-type... It might work in some or even most cases. I use mod_rewrite to redirect troublemakers while avoiding changing MIME-types whenever possible. As a result, I haven't played with this in some time, so I can't claim current knowledge.

    How about testing and letting us know what you find out? :)

    Jim

    ianevans




    msg:1499725
     10:00 pm on Aug 7, 2003 (gmt 0)

    This is a fantastic site. I'll be subscribing as soon as I can.


    > There is a long list of image loaders/referers you might want to allow - it tends to be site- and webmaster- specific.
    > • Referral from Google cache or translator
    > • Referral from Yahoo cache
    > • Referral from AltaVista Translator
    > • etc...

    Is there a somewhat definitive list of these referrers/hostnames anywhere (this site?)


    > How about testing and letting us know what you find out? :)

    I'll give it a go.

    ianevans




    msg:1499726
     10:18 pm on Aug 7, 2003 (gmt 0)

    jd,

    Of course my google search on finding the referrers you mentioned pulled up another post on this site by you with examples.

    I'll quickly redo it in SetEnvIf form and run it by you before I toss it up. :-)

    ianevans




    msg:1499727
     11:35 pm on Aug 7, 2003 (gmt 0)

    Okay...based on this and other posts, I think this should do the trick:

    SetEnvIfNoCase Referer "^$" allowed
    SetEnvIfNoCase Referer "^http://([^/]*)mysite\.com" allowed
    SetEnvIfNoCase Referer "^http://my\.ip\.goes\.here" allowed
    SetEnvIfNoCase Referer "^http://216\.239\.(3[2-9]¦[45][0-9]¦6[0-3])\..*www\.mydomain\.com" allowed
    SetEnvIfNoCase Referer "^http://66\.218\.(64¦[78][0-9]¦9[0-5])\.[0-9]{1,3}/search/cache.*www\.mydomain\.com" allowed
    SetEnvIfNoCase Referer "^http://babel\.altavista.com/.*www\.mydomain\.com" allowed
    SetEnvIfNoCase Referer "^http://search.*\.cometsystems\.com/search.*www\.mydomain\.com" allowed
    SetEnvIfNoCase Referer "^http://.*searchhippo\.com.*www\.mydomain\.com" allowed
    SetEnvIfNoCase Referer "^http://web\.archive\.org/web/.*/http://www\.mydomain\.com" allowed
    SetEnvIfNoCase Referer "^207\.228\.(19[2-9]¦2[01][0-9]¦22[0-3])\." allowed
    SetEnvIfNoCase Referer "^http://fets\.freetranslation\.com.*mydomain" allowed
    SetEnvIfNoCase Referer "^wysiwyg://[0-9]{1,2}/http://www\.mydomain\.com" allowed
    <FilesMatch "\.(jpg¦JPG)">
    Order Deny,Allow
    Deny from all
    Allow from env=allowed
    </FilesMatch>

    JD:
    In an earlier message you also mentioned:
    SetEnvIf Referer "google\.com" allowed
    SetEnvIf Remote_Host "googlebot\.com" allowed

    Do I still need to add those? Also, in the cases like "^http://babel\.altavista.com/.*www\.mydomain\.com", do I need to add another line for the possibility that there is no www?

    Thanks again.
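
Numeric-range alternations like the ones in the IP-address lines are easy to get subtly wrong, so it can help to list which octet values each one accepts. This is a minimal Python sketch, with the alternations typed using real "|" pipes; octets_matched is a hypothetical helper, and Python's re module stands in for Apache's regex engine.

```python
import re


def octets_matched(alternation):
    """Return every value 0-255 whose decimal form the alternation matches in full."""
    rx = re.compile(r"^(?:" + alternation + r")$")
    return [n for n in range(256) if rx.search(str(n))]


# Third octet of the 216.239.x.x line: matches 32 through 63.
print(octets_matched("3[2-9]|[45][0-9]|6[0-3]"))

# Third octet of the 66.218.x.x line: matches 64 and 70 through 95, not 65 through 69.
print(octets_matched("64|[78][0-9]|9[0-5]"))

# Third octet of the 207.228.x.x line: matches 192 through 223.
print(octets_matched("19[2-9]|2[01][0-9]|22[0-3]"))
```

Whether the gap in the second range is intended depends on the address block being targeted, which this thread does not state.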

    jdMorgan




    msg:1499728
     4:14 am on Aug 8, 2003 (gmt 0)

    ianevans,

    You don't need parentheses in this one:
    SetEnvIfNoCase Referer "^http://[^/]*mysite\.com" allowed

    > In an earlier message you also mentioned:
    > SetEnvIf Referer "google\.com" allowed
    > SetEnvIf Remote_Host "googlebot\.com" allowed
    >
    > Do I still need to add those?

    No, for the first one, it's more efficient to use the IP address, since no reverse-DNS request will be needed.
    The second line is no longer needed because you are now allowing blank referrers (which is what you get with most robots).

    > Also, in the cases like "^http://babel\.altavista.com/.*www\.mydomain\.com", do I need to add another line for the possibility that there is no www?

    No, you don't need a separate line. Just make the "www\." optional, like so:

    SetEnvIfNoCase Referer "^http://babel\.altavista.com/.*(www\.)?mydomain\.com" allowed

    Jim

    ianevans




    msg:1499729
     4:59 am on Aug 8, 2003 (gmt 0)

    > SetEnvIfNoCase Referer "^http://[^/]*mysite\.com" allowed

    Just did that and I lost all images on my site. And yes, I did change mysite to the actual domain :-)

    ianevans




    msg:1499730
     5:40 am on Aug 8, 2003 (gmt 0)

    > ...and I lost all images on my site

    Repeat after me: "When testing the new .htaccess on your site, don't accidentally hit the Opera images off button"

    All is working fine so far. Just saw the site (and images) in the babel translator...will hit the google cache and others while I drink a nice tea, keeping my mouse far away from any Opera switches.

    BTW, you also wanted me to report back on my custom 403 experiment.

    I clicked on an href on another site pointing to one of my images.

    The custom error document looked the image up in my database, found the correct location for it and instead of presenting the bare .jpg it now presents the page that photo is on. Hooray for PHP.

    Originally wanted it to send you directly to the page, but the referer made the image come up blank. Tried resetting the referer variable, but that still didn't work. My 403 page now says this when you're looking for an image:

    You cannot access this image directly. Please use this <b>image link</b> to see the image you're looking for.

    ianevans




    msg:1499731
     9:09 am on Aug 8, 2003 (gmt 0)

    > All is working fine so far. Just saw the site (and images) in the babel translator...will hit the google cache

    Spoke too soon...saw the images on altavista because they were in my cache. Tried the experiment again, clearing the cache after each test site. altavista, google, etc. failed.

    jd, sent you an email with actual urls.

    ianevans




    msg:1499732
     1:58 am on Aug 9, 2003 (gmt 0)

    Quick question:

    What's the difference between the logical OR symbol (vertical line) and the vertical line with its middle missing that we see in the examples in this thread?

    After changing the chopped verticals to full verticals, the regex tester mentioned earlier in this thread was suddenly matching test cases.

    Earlier I stumbled across a thread where someone (I think it was jd) told the person asking the question to remember to change the OR symbol. Of course, now I can't find that thread...

    Anyway, despite the fact that it now matches on the regex tester, the setenvif's for google still don't allow my images to be displayed despite the fact they've been "allowed."

    claus




    msg:1499733
     8:55 am on Aug 9, 2003 (gmt 0)

    I'm not sure about the reasons for the Gbot - what lines do you use for the Google cache?

    The broken pipe: "¦"

    It's simply another character than the "real" one - it just looks like it but it isn't the same. You should always use the one that is not broken, which means editing all you copy from WW by hand - it's not that difficult though ;)

    I cannot tell you WW's reasons for replacing it of course but i can tell you why i would do so if i did: It's right that it's an OR operator, but it also has other uses most prominently as the "pipe" which gave it its name - this is a function that can send input/output to various places, just like the pipes that carry our drinking water to the tap, and then it's used as a field delimiter in some databases as well.

    I will not elaborate on this, it's just a security feature. I think you would have to be very skilled to cause any harm with it, but on a forum like this this is also essential to avoid, as, judging by posts alone, i think that there are very skilled people around, so it's probably replaced following the principle of "better safe than sorry" :)

    /claus

    edit: clarified a bit
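
The effect of the broken bar can be seen with a small Python sketch. To a regex engine, "¦" (U+00A6) is an ordinary literal character, so a pattern pasted with it does not alternate. The helper fix_broken_bars is a hypothetical convenience for patterns copied from the forum, not something any of the posters used.

```python
import re

BROKEN_BAR = "\u00a6"  # the "¦" character the forum substitutes for "|"


def fix_broken_bars(pattern):
    """Replace every broken bar in a copied pattern with a real pipe."""
    return pattern.replace(BROKEN_BAR, "|")


# The FilesMatch pattern from earlier in the thread, exactly as pasted.
pasted = r"\.(jpg" + BROKEN_BAR + "JPG)"

# As pasted, the pattern only matches the literal text ".jpg¦JPG".
print(re.search(pasted, "photo.jpg") is not None)                   # False
# With real pipes it is an alternation again.
print(re.search(fix_broken_bars(pasted), "photo.jpg") is not None)  # True
```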

    ianevans




    msg:1499734
     5:51 pm on Aug 9, 2003 (gmt 0)

    Claus,

    I figured it was something like that.

    BTW, got the search engines all working. Just want to toss down some notes and I'll toss it up here.

    JD: Hope your work is going well. Tried sending you a stickymail to read at your leisure, but the mailbox is full.

    It's sunny here...maybe I'll actually leave the computer today. :-)
