Dealing with an ALL CAPS forum post

StrToLower() and ucfirst() to clean it up

         

trillianjedi

11:18 am on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Finally got bugged enough by the umpteenth post on my forum where the OP decided that an all caps title would get him more attention ;-)

I found the strtolower() and ucfirst() functions, which I suspect are going to be my friends here, but ideally what I'd like to do is a bit of basic logic as follows (speed is not an issue as this will only be called on post, then written to the DB in its rewritten form):-

1. If the post is all caps then lower-case the lot.
2. Capitalise first word at start of each new sentence.
3. Capitalise "i" if it has a space either side of it.

I think those three would do the trick quite well.

Does anyone have any code for this, or is there perhaps an open-source forum which has such a function that I could pinch?

In terms of detecting if it's all uppercase, would one single lc character mess that detection up? Is there a way, for example, to detect if something is 90% uppercase?

Thanks,

TJ

sonjay

12:04 pm on Jan 23, 2006 (gmt 0)

10+ Year Member



I recently did something similar for fields that are in a database. This should do the trick for your first two:
$text = ucwords(strtolower($text));

Uppercasing I appropriately is a bit trickier -- if the "i" is at the beginning of the subject, e.g., "i need help", then simply capping it if it's surrounded by a space on either side wouldn't do it. Or a construction like "script fails-I don't understand." Parens, hyphens -- there are lots of ways that a standalone "i" could end up not being surrounded by whitespace.

Also, you're not accounting for acronyms or other usages that should be capped in some other way -- PHP, MySQL, or iPod, for example.

You could simply display the rewritten subject line back to the person in an editable field, with a note that all-caps subjects aren't allowed, and let them edit it.

Checking for mostly caps should be pretty easy. Use a regexp to count all occurrences of [A-Z], then compare that count to the total character count of the subject. If $allcaps > .9 * $totalchars, do your rewrite bit.
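A minimal sketch of that check, for illustration only. The function name is_mostly_caps() and the 0.9 default are assumptions, and it compares against the number of letters rather than the total character count, so that spaces and punctuation don't drag the ratio down. It also only counts ASCII letters.

```php
<?php
// Hypothetical helper: true when more than $threshold of the ASCII letters
// in $text are uppercase. Not from any forum package, just a sketch.
function is_mostly_caps(string $text, float $threshold = 0.9): bool
{
    // preg_match_all() returns the number of matches it found.
    $upper   = preg_match_all('/[A-Z]/', $text);
    $letters = preg_match_all('/[A-Za-z]/', $text);

    // No letters at all (e.g. "12345 !!!"): nothing to rewrite.
    if ($letters === 0) {
        return false;
    }

    return $upper > $threshold * $letters;
}

var_dump(is_mostly_caps('I NEED HELP WITH MY SCRIPT')); // bool(true)
var_dump(is_mostly_caps('I need help with my script')); // bool(false)
```

With this version a single stray lowercase letter in a long subject still counts as "mostly caps", since only the proportion matters.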

larryhatch

12:07 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you go to all that trouble, you might as well append a message at the end.

STOP SHOUTING!

trillianjedi

12:18 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sonjay - thanks.

$text = ucwords(strtolower($text));

Yes, that's the kind of thing I was thinking of, but would that capitalise the first word of every sentence automagically? Or do I first need to break down to the full stops?

Uppercasing I appropriately is a bit trickier

Yes, I agree. Perhaps between us we could thrash out a rule set:-

Capitalise "i" if:-

1. It has a space either side
2. It has a full stop or dash on its left and a space on its right.
3.?

acronyms

e.g. - iPod, generally don't have a space on the right, or if they do, have an alpha char on the left?

It's not going to be possible to get it 100% I'm sure, but to be honest, 90% would do. Bear in mind that non all-caps posts would get left alone, so the only affected ones would be the all caps ones, and a 90% fixed version is going to be a lot better than the all caps in any event.

Checking for mostly caps should be pretty easy. Use a regexp to count all occurrences of [A-Z], then compare that count to the total character count of the subject. If $allcaps > .9 * $totalchars, do your rewrite bit.

Excellent idea....

Larry - that's not such a bad idea actually. You could do it in a little more friendly manner:-

<AutoEdit by JediBot>Please don't post in all caps - thanks</AutoEdit>

TJ

sonjay

12:44 pm on Jan 23, 2006 (gmt 0)

10+ Year Member



Aargh ... You're right, my code caps the first letter of every word. I guess ucfirst() is what you need, and that complicates things because now you have to test for more than one sentence in the string.

It would be easy enough to explode the string on periods, then run ucfirst() on each array element, then implode it back into one string -- but now you're not accounting for abbreviations that might be used -- if the person enters i.e., e.g., etc., or any other abbreviation, the next letter after each period would be capitalized.
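Roughly what that explode/ucfirst/implode approach looks like, as a sketch only. The function name naive_sentence_case() is made up for illustration. One detail it has to work around: ucfirst() only looks at the very first character, so a piece that starts with a space (as most pieces do after exploding on periods) would otherwise be left untouched.

```php
<?php
// Hypothetical sketch of the "explode on periods" idea, abbreviation problem included.
function naive_sentence_case(string $text): string
{
    $parts = explode('.', strtolower($text));

    foreach ($parts as $k => $part) {
        // Keep any leading whitespace aside, capitalise what follows it.
        $trimmed   = ltrim($part);
        $lead      = substr($part, 0, strlen($part) - strlen($trimmed));
        $parts[$k] = $lead . ucfirst($trimmed);
    }

    return implode('.', $parts);
}

echo naive_sentence_case('HELP NEEDED. SCRIPT FAILS, E.G. ON LOGIN.'), "\n";
// Help needed. Script fails, e.G. On login.
```

The "e.G." and the capitalised "On" in the output are exactly the abbreviation problem: every letter after a period gets capitalised, whether or not a sentence really ended there.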

If it were me, I think I'd just do the ucfirst(strtolower($text)) on the subject, then re-display it and let the user edit it, with a note about all-caps not being permitted. If you're dead-set on creating an elaborate set of rules to avoid that, it would be an interesting exercise and I'll be happy to participate. But I think it would be faster and easier to push the edited subject back to the user for final editing. (And of course, their edited subject would also have to be run against your rules again.)

trillianjedi

1:04 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you're dead-set on creating an elaborate set of rules to avoid that, it would be an interesting exercise and I'll be happy to participate.

Not so much dead-set, I just find it quite rewarding coming up with things that work in an automated way like this - even if it only works to 90% accuracy. And I bet it could get honed down quite nicely over a period of time.

So if you're up for it, I am ;-)

I'll do a first draft of a function next day or so and come back and post here - anything you feel you could add to the two rules regarding the "i" above?

Rule set for exploding sentences:-

1. Full stop must be preceded by a letter.
2. Text between full stops must be >5 chars.
3.?

Again, I appreciate true intelligence here is not possible, but 90% would be great....

TJ

sonjay

4:20 pm on Jan 23, 2006 (gmt 0)

10+ Year Member



Okay, works for me. I'm sure I'll learn something along the way. Probably some things that I can use right away in my project where I use ucwords(strtolower($text)).

Another rule for sentences: Either full stop must be immediately followed by a space, or must not be immediately followed by a comma or another alpha character (as in, "e.g.,"). I'm not sure if that rule would work best with a "must be followed by" or a "must not be followed by." What else might come after a period that doesn't signify the end of a sentence, and would it be easier to define a ruleset for what to include, or what to exclude?

Also, for capitalizing "I", something along the lines of, the i must be followed by either a space or a single apostrophe (I'd, I'll, I'm).

Just got back from getting husband's arm fixed up at the doctor, so I need to get some actual work done now. I'll check back later.

trillianjedi

4:39 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What else might come after a period that doesn't signify the end of a sentence, and would it be easier to define a ruleset for what to include, or what to exclude?

This is perhaps the trickiest one. I would think it's safest to define what does get included?

Also, we have to consider:-

NOT EVERYONE USES A SPACE AFTER A FULLSTOP.SOME PEOPLE WRITE LIKE THIS.

As it stands, the ruleset so far would turn that into:-

Not everyone uses a space after a fullstop.some people write like this.

Which is not entirely bad, and still better than full caps? I can't see a way around that which wouldn't destroy filenames or links, eg:-

somedomain.com/myfile.html

Could get converted into:-

somedomain. Com/myfile. Html

... if you try to parse it with a basic full-stop rule.

With that in mind, although it's a bit hit and miss I think it's best to assume "<dot><space>" to be an end of sentence full-stop. Worst case scenario is the odd first word capital letter might get lost.
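A rough sketch of the "<dot><space>" rule, assuming the text has already been judged to be all caps. The function name sentence_case() is hypothetical. It lower-cases the lot, capitalises the first character, then capitalises any letter that follows a full stop and at least one space.

```php
<?php
// Hypothetical sketch: only "<dot><space>" counts as the end of a sentence.
function sentence_case(string $text): string
{
    $text = ucfirst(strtolower($text));

    return preg_replace_callback(
        '/(\. +)([a-z])/',
        function (array $m): string {
            // $m[1] is the dot plus spaces, $m[2] the letter that follows.
            return $m[1] . strtoupper($m[2]);
        },
        $text
    );
}

echo sentence_case('SEE SOMEDOMAIN.COM/MYFILE.HTML FOR DETAILS. THANKS'), "\n";
// See somedomain.com/myfile.html for details. Thanks
```

Filenames and links survive because their dots are not followed by a space. Abbreviations such as "e.g. " followed by a word would still trigger a capital, so this is nowhere near 100%.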

The other thing to consider would be not parsing anything in between CODE or PRE tags.

i must be followed by either a space or a single apostrophe (I'd, I'll, I'm).

Good point.

TJ

ergophobe

5:30 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wonder how well this would work for "i"

$pattern = '/(\W|^)i(\W|$)/';
$replace = '$1I$2';

Perhaps better still would be simply
$pattern = '/\bi\b/';
$replace = 'I';

It would get some cases wrong, like

iPod begins with a lowercase "i" => "I"
It's spelled "w-e-i-r-d" => w-e-I-r-d

If you try it on some real text, I'd be curious to know.
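For anyone wanting to try it, a small sketch that wraps the word-boundary pattern in a function. The name capitalise_i() is made up, and it assumes the text has already been lower-cased.

```php
<?php
// Hypothetical wrapper around the /\bi\b/ pattern.
function capitalise_i(string $text): string
{
    // \b matches between a word character and a non-word character
    // (or the start/end of the string), so "i'm", "(i" and "-i " all qualify.
    return preg_replace('/\bi\b/', 'I', $text);
}

echo capitalise_i("i think i'm lost-i don't know (i tried)"), "\n";
// I think I'm lost-I don't know (I tried)

echo capitalise_i('my ipod is weird'), "\n";
// my ipod is weird
```

An "i" inside a word is left alone, but anything where a lone "i" sits between punctuation, such as "w-e-i-r-d" or "i.e.", will be capitalised as well.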

whoisgregg

5:32 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



An "i" that should be capitalized is preceded and followed by any non-alphanumeric character.

whoisgregg

5:55 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Whoops, posted same time as ergophobe. Don't mind me, I was suggesting the first pattern, and it does have flaws. I imagine they are rarely encountered.

Moosetick

7:54 pm on Jan 23, 2006 (gmt 0)

10+ Year Member



I think the easiest way would be to put the burden back on the submitter. As you suggested, check to see if all letters are CAPS. If so, return the form to them with a note stating that all CAPS are not permitted. That should fix 95% of the posts. I suspect few people title a post with ..

I NEED ATTENTION. pLEASE READ MY POST!

Checking for >90% wouldn't be too difficult either. Reformatting it serverside would be a headache though and may require ongoing tweaking. Ask yourself if you want to commit to this project indefinitely!

By the way, I NEED ATTENTION. pLEASE READ MY POST!

willybfriendly

8:38 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



An "i" that should be capitalized is preceded and followed by any non-alphanumeric character

Except in cases such as Islam, Istanbul, Isaac, etc.

"Did Isaac Asimov foresee the development of iPods when he wrote 'I, Robot'?"

WBF

trillianjedi

10:45 pm on Jan 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ask yourself if you want to commit to this project indefinitely!

Could I better spend my time in terms of the end result achieved?

Definitely ;-)

But this is as much about my honing some php/regex skills as anything else - I'm sure I'll learn a lot in the exercise.

Whether or not a useful and usable function actually comes out of it is secondary...

TJ

jatar_k

6:50 am on Jan 24, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



you guys are nuts

build common rules and deny/edit outside of those. You'll waste more time making it than you will editing it.

just had to say it

have fun 'programming for all eventualities'

trillianjedi

11:17 am on Jan 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hehe

You'll waste more time making it than you will editing it.

I think I'll learn a lot from making it. I'm not a php coder, so this is something that I can play with, learn from, and it's at least more useful to me than working through an example in a book which I can't use. I haven't used RegEx from php yet.

have fun 'programming for all eventualities'

I think getting it 90% there is enough, and should be quite easy?

TJ

PeteM

12:57 pm on Jan 24, 2006 (gmt 0)

10+ Year Member



We do something similar here at work. We have a database that contains all names in upper case. We have to convert to mixed case for mailings. We do an UppercaseFirst-type conversion...

E.g. MACHENDRY becomes Machendry

We then convert all Mach to MacH using a table of rules and end up with MacHendry.

However, this conversion would not work for MACHINERY so we have a second set of rules that convert MacHinery back to Machinery. This table also contains exceptions such as BMW.
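A sketch of how that two-table approach might look in PHP. The function name and both tables are hypothetical, with only the entries mentioned above filled in; the real rule tables are not shown here. It uses the ?? operator, so it assumes PHP 7 or later.

```php
<?php
// Hypothetical sketch of the two-pass name conversion described above.
function mixed_case_name(string $name): string
{
    // First table: prefix rewrites applied after the basic conversion.
    $prefixRules = ['Mach' => 'MacH'];
    // Second table: corrections for over-eager rewrites, plus exceptions.
    $corrections = ['MacHinery' => 'Machinery', 'Bmw' => 'BMW'];

    // Basic conversion: MACHENDRY -> Machendry
    $result = ucfirst(strtolower($name));

    foreach ($prefixRules as $from => $to) {
        if (strpos($result, $from) === 0) {
            $result = $to . substr($result, strlen($from));
        }
    }

    // Undo known bad conversions and apply exceptions such as BMW.
    return $corrections[$result] ?? $result;
}

echo mixed_case_name('MACHENDRY'), "\n"; // MacHendry
echo mixed_case_name('MACHINERY'), "\n"; // Machinery
echo mixed_case_name('BMW'), "\n";       // BMW
```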

Pete