Manipulating the content of attr

Checking for valid tags, and mincing the dynamic urls....

         

SneakyWho am i

12:17 pm on Sep 5, 2007 (gmt 0)

10+ Year Member



I'm doing something involving data from the web. It's not just a sanitizer or validator though, I'm trying to make it "intelligent".

At the moment it strips all unwanted attributes from a given tag. After stripping those attributes, it knocks the rest of the tag off; this is because of the variety of input types the form can take.

Can anybody show me how to replace the = signs in a dynamic url, when that url is part of an <img> or <a> tag?

Example:

<a href="http://www.webmasterworld.com/postv4.cgi?action=new&forum=88" title ="Woo" target="_blank">Some text here</a>

What I want to do is convert that to:

<a href="http://www.webmasterworld.com/postv4.cgi?action_SOMETHING-ELSE_new&forum_SOMETHING-ELSE_88" title ="Woo" target="_blank">Some text here</a>

I've been trying to do it with the likes of regular expressions... In my example though it is imperative that only the = signs in the url are changed - nothing else.

If any of you have been down this road... I would be beyond grateful for whatever solution you could provide.
Thanks in advance.

SneakyWho am i

12:32 pm on Sep 5, 2007 (gmt 0)

10+ Year Member



I know, I'm double posting! Please nobody kill me.
I'd prefer to edit my last post, but I can't find an edit button, so I've replied to myself instead. This may be helpful: I was thinking a regular expression might be the way to go, but I'm still picking those up (and PHP in general, really). Here's what I'm using to sort attributes:

explode('=', trim($attrSet[$i]));

That's why it's breaking the urls. I'd rather work around it than change that filter as it's very paranoid...
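
For reference, a minimal sketch of why that call breaks the url and one possible workaround, assuming each entry of $attrSet is a single name="value" string (the sample value below is made up). explode() accepts an optional third argument, a limit, and a limit of 2 splits on the first = only:

```php
<?php
// Hypothetical attribute string, as it might appear in $attrSet[$i].
$attr = 'href="http://www.webmasterworld.com/postv4.cgi?action=new&forum=88"';

// Without a limit, every "=" is a split point, so the url is cut into pieces.
$broken = explode('=', trim($attr));

// With a limit of 2, only the first "=" splits; the url stays in one piece.
list($name, $value) = explode('=', trim($attr), 2);
```

This does change the filter slightly, so it may not suit a setup where the filter has to stay exactly as it is.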

Habtom

12:36 pm on Sep 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How about just string replacing

str_replace($search, $replace, $subject);

Hab

Habtom

12:36 pm on Sep 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In your case,

str_replace("=", "_SOMETHING-ELSE_", $url);
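
A quick sketch of this suggestion, assuming the url has already been pulled out into its own variable ($url and $tag are sample values, not part of the original script). It also shows what happens if the same call is run on a whole tag instead:

```php
<?php
// Sample values for illustration only.
$url = 'http://www.webmasterworld.com/postv4.cgi?action=new&forum=88';
$tag = '<a href="' . $url . '" title="Woo">Some text here</a>';

// On the isolated url, every "=" belongs to the query string.
$newUrl = str_replace('=', '_SOMETHING-ELSE_', $url);

// On the whole tag, the "=" after href and title is replaced as well.
$newTag = str_replace('=', '_SOMETHING-ELSE_', $tag);
```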

SneakyWho am i

2:31 am on Sep 6, 2007 (gmt 0)

10+ Year Member



Yeah... That was my first instinct too, but it broke the equals signs in other parts of the string as well. What you've suggested does change an equals sign into something else, but it applies to the entire input. I'm now trying to put together a regular expression that will handle it better.

I'm trying to process things more like this:

<a href="http://www.webmasterworld.com/postv4.cgi?action=reply&forum=88& discussion=3441821"><img ilo-full-src="http://www.searchengineworld.com/gfx/logo.png" src="http://www.SearchEngineWorld.com/gfx/logo.png" alt="http://www.webmasterworld.com" title="http://www.webmasterworld.com" align="left" border="0" hspace="7" vspace="0"></a>

I need them to look like this:

<a href="http://www.webmasterworld.com/postv4.cgi?action_somethingelse_reply& forum_somethingelse_88&discussion_somethingelse_3441821"><img ilo-full-src="http://www.searchengineworld.com/gfx/logo.png" src="http://www.SearchEngineWorld.com/gfx/logo.png" alt="http://www.webmasterworld.com" title="http://www.webmasterworld.com" align="left" border="0" hspace="7" vspace="0"></a>

Now... I want to convert the = signs in the url itself:
action=reply&forum=88

But I don't want to convert the other equals signs:
src="http://
border="0
align="l

I will ultimately get it myself in the end.... So it needs to do something along the lines of:
-IF there is an alphanumeric character on BOTH sides of the equals sign (with or without whitespace), and the RHS character is not part of the string "http" THEN convert it.

The reasoning.....:
- Some tags may be malformed in not having " or ' around attributes, so I can't use those characters as criteria for a match
- http is the protocol that will most often be submitted, so at the very least, the filter must exclude that from matches.
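
The rule above can be sketched as a single regular expression. This is only an illustration of the heuristic as stated, not a tested filter: the function name convertUrlEquals and the replacement text are assumptions, any whitespace around a matched = is dropped, and an unquoted attribute such as border=0 would also match, since it too has an alphanumeric character on both sides.

```php
<?php
// Hypothetical helper: replaces "=" only when an alphanumeric character
// sits on both sides and the right-hand side does not start with "http".
function convertUrlEquals(string $input): string
{
    // ([A-Za-z0-9])  alphanumeric character on the left
    // \s*=\s*        the equals sign, with or without whitespace
    // (?!http)       right-hand side must not begin with "http"
    // ([A-Za-z0-9])  alphanumeric character on the right
    $pattern = '/([A-Za-z0-9])\s*=\s*(?!http)([A-Za-z0-9])/i';

    return preg_replace($pattern, '${1}_SOMETHING-ELSE_${2}', $input);
}
```

Quoted attributes such as title="Woo" are left alone because the character after the = is a quote, not a letter or digit.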

I'm trying to vandalize only the = signs in the url, and not touch any of the others. I'm not sure that I'll be able to get it 100% but if you've seen some expression or function that will do this, I'd love to hear about it.

If not, and I make one myself, I will bring it back. It may be that a (non?)working example would be the best way to explain it.

Thank you at least for taking the time to reply :-)

[edited by: dreamcatcher at 6:32 am (utc) on Sep. 6, 2007]
[edit reason] Fixed side scroll. [/edit]

Sylver

10:52 am on Sep 6, 2007 (gmt 0)

10+ Year Member




<a href="http://www.webmasterworld.com/postv4.cgi?action=reply&forum=88& discussion=3441821"><img ilo-full-src="http://www.searchengineworld.com/gfx/logo.png" src="http://www.SearchEngineWorld.com/gfx/logo.png" alt="http://www.webmasterworld.com" title="http://www.webmasterworld.com" align="left" border="0" hspace="7" vspace="0"></a>

I need them to look like this:
<a href="http://www.webmasterworld.com/postv4.cgi?action_somethingelse_reply& forum_somethingelse_88&discussion_somethingelse_3441821"><img ilo-full-src="http://www.searchengineworld.com/gfx/logo.png" src="http://www.SearchEngineWorld.com/gfx/logo.png" alt="http://www.webmasterworld.com" title="http://www.webmasterworld.com" align="left" border="0" hspace="7" vspace="0"></a>

I think the easiest way to do that is to break it down in smaller steps and not try to do everything in one block.

First, get the href attribute content in a variable:
Something like this should do the job: /(href=)(.+)(\s|>)/Ui

The assumption here is that all urls are after "href" attribute, and that the href attribute ends with either a space or a ">" sign. Note the modifiers U (ungreedy) and i (caseless).

Collect the content of the 2nd parenthesis in a variable. If there are several URLs (most likely), store them all in an array.

Run through the array and strip " or ' if they exist at the beginning and end of the string. Put the stripped versions in a different array.

Now that you have isolated the url, we can safely assume that any "=" sign must be replaced by "_somethingelse_". That's easy, you have given several methods for it.

By now you should have one array containing all the original urls and one array containing all the new urls. Add some " to the beginning and end of the new urls. Replace the values of the old array with the values of the new array and you are done.
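
The steps above can be sketched roughly as follows. The function name rewriteHrefEquals and the default replacement text are assumptions for illustration, and the sketch inherits the assumption that an href value contains no whitespace:

```php
<?php
// Hypothetical helper following the steps described above.
function rewriteHrefEquals(string $html, string $token = '_somethingelse_'): string
{
    // Step 1: collect every href value (group 2); U = ungreedy, i = caseless.
    preg_match_all('/(href=)(.+)(\s|>)/Ui', $html, $matches);

    $old = [];
    $new = [];
    foreach ($matches[2] as $raw) {
        // Step 2: strip " or ' from the beginning and end of the value.
        $url = trim($raw, "\"'");

        // Step 3: inside the isolated url, every "=" can be replaced safely.
        $old[] = 'href=' . $raw;
        $new[] = 'href="' . str_replace('=', $token, $url) . '"';
    }

    // Step 4: swap each original href for its rewritten version.
    return str_replace($old, $new, $html);
}
```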

Cheers,
Sylver

Drag_Racer

11:26 am on Sep 6, 2007 (gmt 0)

10+ Year Member



If you can get a reference to the anchor tag, read its href attribute, split it on the '=' sign, then join the pieces with SOMETHING-ELSE and assign the result back to element.href

such as

var h = getElementByWhateverMeans.href.split(/=/);
getElementByWhateverMeans.href = h.join('SOMETHING-ELSE')
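
The snippet above is client-side JavaScript. Since the rest of the thread is PHP, a rough server-side equivalent of the same split-and-join idea might look like this, assuming the href value has already been extracted ($href is a sample value):

```php
<?php
// Sample href value, assumed to be already extracted from the anchor tag.
$href = 'http://www.webmasterworld.com/postv4.cgi?action=new&forum=88';

// Split on "=" and join the pieces with the replacement text.
$newHref = implode('SOMETHING-ELSE', explode('=', $href));
```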

SneakyWho am i

6:46 am on Sep 7, 2007 (gmt 0)

10+ Year Member



Wow, thanks guys. For a while there I was resigned to breaking it down into several steps as suggested here. Sadly it interferes with the workflow due to the variety of things the script processes, and due to the very paranoid method of sanitation that I've used.

I don't expect a lot of trouble from my technique in future although I think using the method you've provided will simplify things if I ever do it again.

What I used in the end was provided to me by Ketan Kulkarni from carvingIT.com

It's a short script involving a regular expression and it works like this more or less:

// Matches a src attribute up to one "=" that follows it (normally in the query string).
$pattern = '#(src=[^\?]*)([\?]?)([^=\s]+)=([^=\s\&]+)#i';
$replacement = '$1$2$3_EQUAL_SIGN_$4';
// Each match covers only one "=", so repeat until nothing matches any more.
while (preg_match($pattern, $output)) {
    $output = preg_replace($pattern, $replacement, $output);
}

It doesn't separate the url, it only manipulates it. The benefit of this is that I'm able to keep the context. It's allowed me to preserve the url itself while performing operations on the rest of it (without keeping the string in an array for the duration of the script.)
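
For comparison, here is a hedged sketch of a different way to manipulate the url in place without a loop, using preg_replace_callback. It is not the script described above: the function name replaceEqualsInUrlAttrs is made up, it covers both href and src, and it assumes the attribute values are quoted, so unquoted attributes are left untouched.

```php
<?php
// Hypothetical helper: rewrites "=" only inside quoted href/src values.
function replaceEqualsInUrlAttrs(string $html, string $token = '_EQUAL_SIGN_'): string
{
    // Group 1: attribute name and its "="; group 2: opening quote;
    // group 3: the value, up to the matching closing quote.
    // The lookbehind keeps names such as ilo-full-src from matching as src.
    $pattern = '#(?<![\w-])((?:href|src)\s*=\s*)(["\'])(.*?)\2#i';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($token): string {
            // Only the value (group 3) is changed; the rest is put back as-is.
            return $m[1] . $m[2] . str_replace('=', $token, $m[3]) . $m[2];
        },
        $html
    );
}
```

Because the callback only ever sees the attribute value, the = signs in other attributes such as alt or title are never candidates for replacement.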

Thanks again for all your help, everyone.

(And thanks for stopping my prior post from sidescrolling)