White space includes spaces, tabs, commas, periods, begin- and end-of-line markers, and other special characters as the situation dictates.
In what dialect of RegEx? Every one I've ever met distinguishes between \W (non-word characters, meaning spaces, punctuation, or any old squiggle) and \s (spaces, meaning spaces of all kinds, tabs, and \r and \n).
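For anyone who wants to see the difference rather than take my word for it, here is a minimal sketch in Python's `re` module. The sample string is made up for illustration, and other dialects may differ in the details of what `\s` covers.

```python
import re

# Hypothetical sample: words separated by a comma, a space, a period, a tab, and CRLF
sample = "ko, an.\tcacao\r\n"

# \s matches whitespace only: here the space, the tab, \r and \n
whitespace = re.findall(r"\s", sample)

# \W matches every non-word character: the whitespace above plus the punctuation
non_word = re.findall(r"\W", sample)

print(whitespace)
print(non_word)
```

The comma and the period turn up only in the `\W` result, which is the point: punctuation is non-word, but it is not white space.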
This is assuming for the sake of discussion that the original string is limited to ASCII, so you don't have to deal with, say, "ao" tucked into the middle of a Greek word. Then you have to pore over the documentation and figure out which of the eighteen variants of \p{ASCII} your specific dialect uses.
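As a rough illustration of why the ASCII assumption matters, here is a sketch using Python, which as far as I know has no `\p{ASCII}` in its standard `re` module and uses the `re.ASCII` flag instead. The Greek sample word is my own choice, not something from the original question.

```python
import re

# Hypothetical sample: a Greek word (written with escapes so the encoding is unambiguous)
# followed by an ASCII word
text = "\u03ba\u03b1\u03bb\u03cc\u03c2 day"

# Default for str patterns: \w is Unicode-aware, so the Greek letters count as word characters
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so the Greek word is skipped entirely
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

Other dialects spell this switch differently, which is exactly the documentation-poring problem.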
It gets worse if the source is not 'normal' human language textual material.
I don't think the OP was talking about human words at all, just strings. I can't think offhand of a language in which "ko" "an" "'ao" ... and "cacao" are all words. And g### refuses to recognize the leading apostrophe (or is it a glottal stop?) even when I put every single thing in quotes.
Hm. I know someone who lives in Hawai'i. I could ask :)
Anyway, since OP hasn't come back with fresh problems, the ignore-the-complicated-stuff solutions probably worked.