
Apache Web Server Forum

    
Struggling with regular expressions
Any decent tutorials?
vordmeister




msg:4361140
 5:30 pm on Sep 11, 2011 (gmt 0)

I've always struggled with regular expressions. Normally it's possible to pinch something from the web, but I should really start understanding them. Could anyone suggest a good book or tutorial?

Just at the moment I'm trying to replace images linked using bbcode from certain sites with an error message while retaining the rest of the text in the post. Here is what I have:


$string = "Lots of text [IMG]http://example.com/image1.jpg[/IMG] and [IMG]http://example.com/image2.jpg[/IMG] more text";
$pattern = "/[\[IMG\].?example.?\[/IMG\]]/";
$replacement = "Images from example.com not allowed";
echo preg_replace($pattern, $replacement, $string);


Obviously that gives errors right now. Would much appreciate help.

 

lucy24




msg:4361144
 6:10 pm on Sep 11, 2011 (gmt 0)

Urk. For starters, it helps if you say what language you're in. Is this a php/bb forum?

I started out here:

[regular-expressions.info...]

and eventually printed out a couple of sections after about the nineteenth time I had to consult the list.

Some things are universal to all Regular Expressions, but some are context-specific. For example, if you're accustomed to javascript you have to learn not to escape / or ; (they have syntactic meaning in js but not in RegEx itself). Conversely, almost anything that goes in .htaccess has to escape literal spaces (because spaces do have syntactic meaning in Apache though not in RegEx generically).

The most obvious error-- this is a RegEx universal-- is that you have confused ? with *. Question mark means "maybe one, maybe none". Asterisk means "none or some".
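A quick way to see the difference in PHP. This is only a small sketch, and the sample strings are made up for illustration:

```php
<?php
// ".?" allows at most one character between "a" and "b";
// ".*" allows any number of characters, including none.
$samples = ['ab', 'axb', 'axxxb'];

foreach ($samples as $s) {
    $optional = preg_match('/^a.?b$/', $s); // 1 on a match, 0 otherwise
    $anyCount = preg_match('/^a.*b$/', $s);
    echo "$s  .? => $optional  .* => $anyCount\n";
}
```

Only the last sample tells the two apart: "axxxb" has three characters in the middle, so ".?" rejects it while ".*" accepts it.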

You also need to know whether IMG is always capitalized, or whether it might also occur in lower case. This is dialect-specific.

In general it is not a good idea to have .* anywhere but the end of a pattern. Here you can constrain it much more narrowly by, say,

https?://(www\.)?(baddomain\.com|otherbaddomain\.com|evildomain\.info)/([^/.]+/)*\w+\.(jpg|png|gif)

again dealing with the upper/lower case issue. The part you capture and reuse is $2. If all your unwanted domains happen to be dot coms, you can leave them out of the capture:

(baddomain|otherbaddomain|evildomain)\.com

and put the .com in your replacement string instead.
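Here is roughly how the longer pattern and the $2 capture could be used from PHP. Treat it as a sketch: the sample text, the replacement wording and the choice of "~" as the delimiter are assumptions for illustration, not something from the original post.

```php
<?php
// "~" is used as the delimiter so the slashes in the URL need no escaping.
// The trailing "i" makes the whole pattern case-insensitive.
$pattern = '~https?://(www\.)?(baddomain\.com|otherbaddomain\.com|evildomain\.info)/([^/.]+/)*\w+\.(jpg|png|gif)~i';

// Made-up sample text with one blocked URL and one harmless URL.
$text = 'See http://www.baddomain.com/albums/a1/pic.jpg and http://example.org/ok.png';

// $2 in the replacement is whatever the second pair of parentheses matched,
// i.e. the domain name.
$result = preg_replace($pattern, '[image from $2 removed]', $text);
echo $result, "\n";
// See [image from baddomain.com removed] and http://example.org/ok.png
```

Note that this only replaces the URL itself; the surrounding [IMG] tags are left in place.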

Incidentally, why do you want to do this? In general, unwanted hotlinking is considered the originating site's problem. And you can't possibly list every x-rated domain in the world.

vordmeister




msg:4361146
 6:33 pm on Sep 11, 2011 (gmt 0)

It's a vBulletin forum running on PHP. The reason is that pictures hosted on Photobucket tend to disappear after a year or so, either when the member deletes them or when they exceed the bandwidth of their Photobucket account, rendering really good posts useless. I realise this bit of code will annoy a lot of people, but I'd like to trial it to see if the majority will use the various options we have on the forum server.

The IMG tags could be upper case or lower case. I want to change anything between and including the image tags to an error message where the site is a problematic external image host, but there can be a number of image tags on the same post and I want to retain the rest of the post.

Will try your code. Problem I've been having is constraining the replacement to the image tags on the ends and saying anything between them mentioning badserver should be replaced.

lucy24




msg:4361183
 8:50 pm on Sep 11, 2011 (gmt 0)

One more thing. Instead of replacing the image with text, it may come out cleaner if you make a small png that says "please don't link to images from such-and-such host" and pop it between the image tags. Just make sure it says clearly that this injunction is coming from your site, not from the other end, since the whole point of photobucket and similar is that they do allow hotlinking.

At the same time you could make the image into an anchor linking to the page where you explain the options. I don't know whether this goes inside or outside the [IMG] tags, but I'm sure you do.

In some RegEx varieties, you can use (?i) and (?-i) to switch case-insensitivity on or off. I have never personally used this, but it can't hurt to try.
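For PHP's preg_ functions specifically, both forms should work: a trailing "i" after the closing delimiter, or an inline (?i) inside the pattern. A minimal sketch, with made-up test strings:

```php
<?php
$tags = ['[IMG]', '[img]', '[Img]'];

foreach ($tags as $tag) {
    // Trailing "i" modifier: the whole pattern ignores case.
    $withFlag = preg_match('/^\[img\]$/i', $tag);
    // Inline "(?i)": the same effect, switched on inside the pattern.
    $inline = preg_match('/^(?i)\[img\]$/', $tag);
    echo "$tag => flag: $withFlag, inline: $inline\n";
}

// "(?-i)" switches case-insensitivity back off for the rest of the pattern,
// so here the tag may be in any case but "http" must be lower case.
$mixed = '/^(?i)\[img\](?-i)http$/';
var_dump(preg_match($mixed, '[IMG]http')); // int(1)
var_dump(preg_match($mixed, '[IMG]HTTP')); // int(0)
```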

Looking again at your example:

$pattern = "/[\[IMG\].?example.?\[/IMG\]]/"

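OK, so slashes mean the same thing they do in javascript: they mark where the pattern starts and ends. Better escape them then. They will occur, because you're using them in your search ("optionally more directories here"). Outside of .htaccess, it's very rare for Regular Expressions to object to punctuation being escaped when it doesn't need escaping. So if you're in doubt, you can type "\-" or "\/" and it will just be read as "-" or "/". Don't do the same with letters or digits, though: a backslash in front of those usually gives them a special meaning (in PHP's flavor, \a is the bell character and \z is the end of the text), so leave "a-z" alone.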
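In PHP the delimiter doesn't have to be a slash, which gives you two ways to deal with the slashes in a URL. The sketch below shows both; the URL is a made-up example, and "~" is just one common choice of alternative delimiter.

```php
<?php
$url = 'http://example.com/albums/pic.jpg';

// With "/" as the delimiter, every literal slash must be escaped.
$escaped = '/^https?:\/\/example\.com\/(?:[^\/.]+\/)*\w+\.jpg$/';

// With "~" as the delimiter, slashes can stay as they are.
$tilde = '~^https?://example\.com/(?:[^/.]+/)*\w+\.jpg$~';

var_dump(preg_match($escaped, $url) === 1); // bool(true)
var_dump(preg_match($tilde, $url) === 1);   // bool(true)

// preg_quote() escapes regex metacharacters (and the delimiter, if given),
// so literal text can be dropped into a pattern safely.
echo preg_quote('example.com/a+b', '/'), "\n"; // example\.com\/a\+b
```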

Do all photobucket links have the exact same format? If so, you can probably reduce grouping and wildcards even more. Just be sure to get your pipes in the right places:

http://www\.(photobucket\.com\/{exact format of pb link here}|otherunwantedhost\.com\/{exact format of their link}|thirdhost\.com\/{and another exact format})\.(jpe?g|gif|png)

That is, you can say things like \w+\/\w+ --using the exact number of directories-- instead of having to cover all possibilities with ([^/.]+/)*\w+\.(jpe?g|png|gif). Or, in your case, ([^\/.]+\/)*\w+\.(jpe?g|png|gif) with the slashes escaped. The first period is exempt from escaping because it sits inside square brackets, where a period is just a literal period.

Your forums probably have a fixed set of image formats that they recognize, so just list those. jpe?g is shorthand for jpeg|jpg at a savings of three bytes ;)
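Putting the pieces from this thread together, a replacement routine for the original question might look something like the sketch below. It is not vBulletin code: the function name, the host list, the notice text and the "goodhost.test" URL are all hypothetical, and the pattern assumes image URLs end in one of the listed extensions.

```php
<?php
/**
 * Replace [IMG]...[/IMG] tags that point at a blocked host with a notice.
 * The rest of the post is left untouched.
 */
function replace_blocked_images(string $post, array $blockedHosts, string $notice): string
{
    // Escape each host so the dots are literal, then join with "|".
    $hosts = implode('|', array_map(function ($host) {
        return preg_quote($host, '~');
    }, $blockedHosts));

    // [img] tag, optional subdomains, a blocked host, any path without
    // brackets or whitespace, a known image extension, closing tag.
    // The "i" modifier covers [IMG] / [img] and .JPG / .jpg alike.
    $pattern = '~\[img\]\s*https?://(?:[\w-]+\.)*(?:' . $hosts . ')'
             . '/[^\[\]\s]*\.(?:jpe?g|gif|png)\s*\[/img\]~i';

    // preg_replace() returns null on error; fall back to the original post.
    return preg_replace($pattern, $notice, $post) ?? $post;
}

$post = 'Lots of text [IMG]http://example.com/image1.jpg[/IMG] and '
      . '[img]http://goodhost.test/image2.jpeg[/img] more text';

echo replace_blocked_images($post, ['example.com'], '(Images from example.com not allowed)'), "\n";
// Lots of text (Images from example.com not allowed) and [img]http://goodhost.test/image2.jpeg[/img] more text
```

Each tag is matched on its own because the path part cannot contain brackets, so several images in one post are handled separately. If the notice text ever contains a "$" or a backslash, it would need escaping, since preg_replace() treats those specially in the replacement.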
