
PHP Server Side Scripting Forum

    
Regex pattern cannot start or end with hyphen
ocon




msg:4603468
 2:05 am on Aug 20, 2013 (gmt 0)

I'm having difficulty writing the regex for this pattern:
  • A string must be between 1 and 255 characters in length.
  • A string can only contain a-z, 0-9, or a hyphen.
  • A hyphen cannot follow another hyphen.
  • A string cannot start or end with a hyphen.
I'm finding the last two parts the most challenging. What I have so far is:

/^[a-z0-9]{1}[a-z0-9]{0,254}?!-$/

 

lucy24




msg:4603476
 2:41 am on Aug 20, 2013 (gmt 0)

Key question: Are you generating a string, or identifying one that already exists? If you're picking out strings from an existing pattern, you can probably drop the length constraint. All you need is

^([a-z0-9]+-)*[a-z0-9]$

If you have potential strings that fit this pattern except that they are longer than 255 characters, and you need to exclude those, then option B is to measure the string length separately. I suspect this would actually execute faster than if you had to juggle two things concurrently.

The {1} can definitely, unequivocally go, since one of anything is the default.

I said * (asterisk) up above, allowing for any number of non-hyphen sets broken up by single hyphens. But did you mean that the string will contain exactly one hyphen? Or no more than one?

Is it always lower-case a-z? Can _ ever occur? You might even be able to reduce [a-z0-9] to \w (alphanumerics plus lowline, but not hyphen).

phranque




msg:4603483
 3:05 am on Aug 20, 2013 (gmt 0)

^([a-z0-9]+-)*[a-z0-9]$


that wouldn't match the string "abc", so you probably want this:
^([a-z0-9]+-)*[a-z0-9]+$

and then the length check, of course - i think you can put a quantifier on a capturing group:
^(([a-z0-9]+-)*[a-z0-9]+){1,255}$
lucy24




msg:4603511
 4:07 am on Aug 20, 2013 (gmt 0)

that wouldn't match the string "abc"

Oh, oops, I left out the final +.

What he said ;)

i think you can put a quantifier on a capturing group

You can essentially do anything to a group (whether in brackets or in parentheses) that you can do to a single character. But it applies to the whole group, so
(([a-z0-9]+-)*[a-z0-9]+){1,255}
would mean 1-255 iterations of the entire package
([a-z0-9]+-)*[a-z0-9]+
rather than a limit of 255 characters!

(Same principle as a question mark:
(([a-z0-9]+-)*[a-z0-9]+)?
means the package-- as a unit-- is optional.)
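
To illustrate the point about group quantifiers, here is a minimal PHP sketch. It assumes the corrected pattern from this thread; the helper name is_valid_name() is hypothetical, and the D modifier is an addition that stops $ from matching before a trailing newline. It shows that {1,255} on the group still accepts a 300-character string, while a separate strlen() check enforces the limit.

```php
<?php
// Corrected pattern from the thread, with the quantifier applied to the whole group.
$grouped = '/^(([a-z0-9]+-)*[a-z0-9]+){1,255}$/';

$long = str_repeat('a', 300); // 300 characters, well over the limit

// {1,255} counts repetitions of the group, not characters,
// so a single repetition can swallow all 300 characters.
$groupedAcceptsLong = preg_match($grouped, $long) === 1; // true

// Hypothetical helper: measure the length separately, then check the shape.
function is_valid_name($s)
{
    $len = strlen($s);
    return $len >= 1
        && $len <= 255
        // D (not in the thread): $ matches only at the very end of the string.
        && preg_match('/^([a-z0-9]+-)*[a-z0-9]+$/D', $s) === 1;
}

echo $groupedAcceptsLong ? "grouped pattern accepts 300 chars\n" : "grouped pattern rejects 300 chars\n";
echo is_valid_name($long) ? "helper accepts 300 chars\n" : "helper rejects 300 chars\n";
```

Keeping the length test outside the pattern also keeps the regex itself short enough to read at a glance.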

ocon




msg:4603518
 4:21 am on Aug 20, 2013 (gmt 0)

That's just awesome! Thank you both. I find myself in a love/hate relationship with regex, but it's great when it comes together.

g1smd




msg:4603572
 2:12 pm on Aug 20, 2013 (gmt 0)

Does this have to be a single RegEx? The "count" RegEx is simply:
^.{1,255}$ or ^[a-z0-9-]{1,255}$

The matching "valid characters" RegEx is problematical. The suggested pattern
^([a-z0-9]+-)*[a-z0-9]+$ doesn't match when there is NO hyphen. Making the hyphen optional will make the pattern very ambiguous.

I think I would go with:

^[a-z0-9][a-z0-9-]*}$ AND ^[a-z0-9-]{1,255}$ AND !-$

or perhaps

^[a-z0-9][a-z0-9-]*[a-z0-9]$ AND ^[a-z0-9-]{2,255}$ if there are no "one character" requests.

There are multiple ways to code this. Some are more efficient than others.

lucy24




msg:4603697
 10:35 pm on Aug 20, 2013 (gmt 0)

doesn't match when there is NO hyphen

Uhm, why not? That's why I used * instead of +

^[a-z0-9][a-z0-9-]*}$

Typo for something, but I'm not sure what? As printed, } would be the literal } character.

[a-z0-9-]{1,255}

See first post. He needs to exclude consecutive hyphens.

Question: Would two or more consecutive hyphens ever occur? If not, you don't need to complicate the code. Alternatively, you could have a separate step that excludes
-{2,}

Really this is one of those questions that's easier to answer if you start by laying out in English exactly what you're trying to do.

g1smd




msg:4603797
 10:20 am on Aug 21, 2013 (gmt 0)

Yes, the extra } in my code is a typo. Remove the } and it's fixed.

I forgot about consecutive hyphens. You'll need a
!-- test too.
ocon




msg:4604377
 11:31 am on Aug 23, 2013 (gmt 0)

So I'm still not getting this regex and I need to refresh.

^[a-z0-9][a-z0-9-]*$ AND ^[a-z0-9-]{1,255}$ AND !-$

So I can break this into multiple parts? Let me see if I understand this:
  • I need to start with one character in a-z or 0-9, followed by any number of a-z, 0-9 or hyphen characters: ^[a-z0-9][a-z0-9-]*$
  • There must be between 1 and 255 a-z, 0-9 or - characters: ^[a-z0-9-]{1,255}$
  • I'm not sure what the last part means. Is it that there cannot be multiple hyphens in a row? !-$
The way I understand the code it would still allow for it to end with a hyphen. Maybe the first part could be substituted with ^([a-z0-9]|[a-z0-9][a-z0-9-]*[a-z0-9])$ to allow for one character non-hyphen strings, two character non-hyphen strings, or an unlimited number of character strings that do not begin or end with a hyphen.
lucy24




msg:4604381
 11:51 am on Aug 23, 2013 (gmt 0)

Two elements.

Length:
^.{1,255}$

Character constraints, here using \w as shorthand for [a-z0-9]:
((\w+-)*\w+)

You could add a third element
!--
to eliminate multiple hyphens, but you'd then need to add some other stuff to exclude initial and final hyphens. I don't think you'd gain anything.

Alternative RegEx with separate !-- exclusion:

\w([\w-]*\w)?

That's if your patterns can be as little as 1 or 2 letters.

Technically \w includes _ lowline but here I'm assuming you won't have any.

Pay close attention to the difference between brackets [] and parentheses ().
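
A short PHP sketch of what the \w shorthand lets through, in case underscores or upper-case letters can turn up after all. The sample strings are made up for illustration, and the D modifier is an addition so that $ matches only at the very end of the string.

```php
<?php
// \w accepts "_" and upper-case letters as well as a-z and 0-9.
$with_w     = '/^(\w+-)*\w+$/D';
$with_class = '/^([a-z0-9]+-)*[a-z0-9]+$/D';

// Made-up sample names.
$samples = array('my-page', 'my_page', 'My-Page');

foreach ($samples as $s) {
    echo str_pad($s, 10),
        'with \w: ', preg_match($with_w, $s) ? 'match' : 'no match',
        ', with [a-z0-9]: ', preg_match($with_class, $s) ? 'match' : 'no match',
        "\n";
}
```

If only lower-case letters and digits are acceptable, the explicit [a-z0-9] class is the safer choice.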

ocon




msg:4604393
 12:12 pm on Aug 23, 2013 (gmt 0)

Lucy, could I knock out both length and character constraint with:
^[a-z0-9-]{1,255}$
Prevent starting and trailing hyphens with:
^([^-]|[^-].*[^-])$
Prevent adjacent hyphens with:
^([^-]+-?)+$
g1smd




msg:4604409
 1:15 pm on Aug 23, 2013 (gmt 0)

The
!-$ prevented the trailing hyphen.

Use
!-- to prevent adjacent hyphen.
lucy24




msg:4604410
 1:24 pm on Aug 23, 2013 (gmt 0)

The rule against adjacent hyphens is best expressed as a negative like
!--
You don't have to find all of them; even if there's only one, the condition fails. (Or succeeds, depending on how you look at it.)

In fact all of the hyphen-related rules work best as negatives. As a single rule:
!(^|-)-|-$
or, probably more efficiently,
!^-|-($|-)
where a leading ! applies to the whole pattern.

If you say [^-] then you open the door for non-word characters such as punctuation.

I don't think you've ever explained what this RegEx-- we're now in the php forum-- is actually supposed to do. In the other thread, you're constructing a RewriteRule based on identifying a pattern in an incoming request. What's happening here? All of the Regular Expressions we've talked about will only work if you're testing against an existing string. If you're constructing the string in the first place, you need a different set of rules.
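
The two combined hyphen rules can be compared side by side with a small PHP sketch. The patterns are written without the leading !, so a match means the string breaks a hyphen rule. The helper name has_bad_hyphen() and the sample strings are hypothetical, and the D modifier is an addition so that $ matches only at the very end of the string.

```php
<?php
// The two combined rules; a match means "reject".
$rule_a = '/(^|-)-|-$/D';
$rule_b = '/^-|-($|-)/D';

// Hypothetical helper: true when the string breaks a hyphen rule.
function has_bad_hyphen($pattern, $s)
{
    return preg_match($pattern, $s) === 1;
}

// Made-up samples covering each hyphen rule.
$samples = array('abc', 'a-b', '-abc', 'abc-', 'a--b', '-');

foreach ($samples as $s) {
    $a = has_bad_hyphen($rule_a, $s);
    $b = has_bad_hyphen($rule_b, $s);
    echo str_pad($s, 6), $a ? 'rejected' : 'ok', ($a === $b ? '' : '  (rules disagree)'), "\n";
}
```

Both forms flag the same strings here; they differ only in how the alternatives are grouped.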

ocon




msg:4604419
 2:33 pm on Aug 23, 2013 (gmt 0)

Excellent! So using
(^-|--|-$) and ^[a-z0-9-]{1,255}$ together I can test that:
  • A string must be between 1 and 255 characters in length.
  • A string can only contain a-z, 0-9, or a hyphen.
  • A hyphen cannot follow another hyphen.
  • A string cannot start or end with a hyphen.
I'll be using this code in a script (briefly mentioned in the other thread as /scripts/createPage.php?name=$1) to ensure the integrity of a URL-defined variable and minimize unintended results. I'll also be using the regex in a RewriteRule that leads to this script as a secondary security measure.

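// Treat a missing parameter as an empty (and therefore invalid) name.
$path = isset($_GET['path']) ? $_GET['path'] : '';
if (preg_match('/(^-|--|-$)/', $path) || !preg_match('/^[a-z0-9-]{1,255}$/', $path)) {
    header('HTTP/1.0 404 Not Found');
    die();
}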

Note: I wasn't able to get
!(^|-)-|-$ or !^-|-($|-) to work for me. I'm sure it's because I'm poorly implementing them in the script above.
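
One possible explanation, assuming the ! was placed inside the pattern delimiters: a leading ! negates the pattern in a RewriteRule or RewriteCond, but in PHP's PCRE functions it is just a literal character. The negation has to move outside the pattern, as the ! operator in front of preg_match(). A minimal sketch follows; the function name is_valid_name() is hypothetical, and the D modifier is an addition so that $ matches only at the very end of the string.

```php
<?php
// With "!" inside the pattern, PCRE looks for a literal "!" character,
// so the leading-hyphen alternative can never match.
$literal_bang = preg_match('/!^-|-($|-)/', '-abc'); // 0: leading hyphen missed
$without_bang = preg_match('/^-|-($|-)/', '-abc');  // 1: leading hyphen found

// Hypothetical helper: the negation lives in PHP, not in the pattern.
function is_valid_name($s)
{
    return !preg_match('/^-|-($|-)/D', $s)                    // no bad hyphens
        && preg_match('/^[a-z0-9-]{1,255}$/D', $s) === 1;     // characters and length
}

echo is_valid_name('my-page') ? "my-page: valid\n" : "my-page: invalid\n";
echo is_valid_name('-abc') ? "-abc: valid\n" : "-abc: invalid\n";
```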
g1smd




msg:4604450
 4:42 pm on Aug 23, 2013 (gmt 0)

Lucy. Good job on combining the hyphen checking patterns!
I wasn't giving it too much thought, hoped someone else would fill in the blanks.
Thanks.

lucy24




msg:4604524
 10:04 pm on Aug 23, 2013 (gmt 0)

So using (^-|--|-$) and ^[a-z0-9-]{1,255}$ together

Note that the first rule is negative while the second rule is positive. In fact you can make both rules negative by expressing the second one as
.{256}

Is it extremely important to prevent malformed requests from ever reaching the php script? Instead of maintaining two parallel identical RegExes, one in php and one in htaccess, you could simply rewrite everything to the php script at once. Let it do the format-and-length validation and the directory generation, and then issue the appropriate redirect, 404 or 403.
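
On the all-negative idea: .{256} on its own replaces only the length half of the positive rule, so a fully negative check also has to reject disallowed characters and the empty string. A rough PHP sketch of that reading, where any match means the string is invalid; the function name is_valid_name() is hypothetical, and the D modifier is an addition so that $ matches only at the very end of the string.

```php
<?php
// Hypothetical helper built from negative rules only.
function is_valid_name($s)
{
    // Alternatives, in order: empty string, disallowed character,
    // 256 or more characters, leading hyphen, trailing or doubled hyphen.
    $bad = '/^$|[^a-z0-9-]|.{256}|^-|-($|-)/D';

    // Valid only when none of the "bad" alternatives matches.
    return preg_match($bad, $s) === 0;
}

// Made-up samples.
foreach (array('my-page', '', 'My-page', 'a--b') as $s) {
    echo var_export($s, true), ' => ', is_valid_name($s) ? 'valid' : 'invalid', "\n";
}
```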

JD_Toims




msg:4604613
 4:39 am on Aug 24, 2013 (gmt 0)

I'm having difficulty writing the regex for this pattern:

A string must be between 1 and 255 characters in length.
A string can only contain a-z, 0-9, or a hyphen.
A hyphen cannot follow another hyphen.
A string cannot start or end with a hyphen.

Personally, in PHP I don't know if I'd use a regex for the whole thing.
I think I'd perform some "quick checks" and then use a regex if the checks passed.

Something like:

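$len = strlen($the_string);
if ($len > 0
    && $len < 256
    && strpos($the_string, '--') === FALSE        // no adjacent hyphens
    && strpos($the_string, '-') !== 0             // no leading hyphen
    && strrpos($the_string, '-') !== ($len - 1)   // no trailing hyphen (strrpos finds the last hyphen)
    && preg_match('#^[a-z0-9-]+$#', $the_string)  // allowed characters only
) {
    // do what I want here
}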

Thanks to Lucy24 for pointing out the parallel thread here.
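
The quick-check idea can also be wrapped in a function and run against a few sample strings. This is a sketch under some assumptions: the name is_valid_name() and the samples are hypothetical, and the input is assumed to be a string. It looks at the first and last characters directly, because strpos() reports only the first hyphen's position and so would not notice the trailing hyphen in a string like "a-b-". The D modifier is an addition so that $ matches only at the very end of the string.

```php
<?php
// Hypothetical wrapper around the quick checks.
function is_valid_name($s)
{
    $len = strlen($s);
    return $len > 0
        && $len < 256
        && strpos($s, '--') === false                 // no adjacent hyphens
        && $s[0] !== '-'                              // no leading hyphen
        && $s[$len - 1] !== '-'                       // no trailing hyphen
        && preg_match('#^[a-z0-9-]+$#D', $s) === 1;   // allowed characters only
}

// Made-up samples, one per rule.
$tests = array('abc', 'a-b-c', 'a-b-', '-a', 'a--b', '');
foreach ($tests as $t) {
    echo var_export($t, true), ' => ', is_valid_name($t) ? 'valid' : 'invalid', "\n";
}
```

The cheap string functions run first, so the regex is only reached by strings that already pass the length and hyphen rules.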

g1smd




msg:4604687
 3:05 pm on Aug 24, 2013 (gmt 0)

With the right patterns, two RegEx checks, one for valid characters and length and one to enforce various restrictions, can be blisteringly fast.
