Seeking script to extract Description meta tag

Forum Moderators: phranque

esllou

4:17 pm on Jan 20, 2005 (gmt 0)

I have a folder of 100 html pages. I would like to be able to extract the description meta tag of each of them and insert this description in the main body tag at a pre-determined place.

Does anyone know of a script that could do this?

trillianjedi

12:58 pm on Jan 21, 2005 (gmt 0)

Have you tried ignoring the fact that it's HTML, thinking of it as plain text, and searching for tools designed to search, replace, and move text around?

For example - total guess - I wouldn't be surprised if you could get a decent word processor (Word, OpenOffice, etc.) to do it.

cartone

8:03 pm on Jan 21, 2005 (gmt 0)

I use TextPad to work on my HTML code, and there you can add all your pages to the main tree.
Then you just replace the text on all pages at the same time (it's an option in the replace function).

esllou

10:04 pm on Jan 21, 2005 (gmt 0)

cartone, I don't see how that would do what I want to do.

I need to extract the (different) description meta tag for a series of pages, keep the original meta tag in place, and put the copied description into the main body of the page, preferably wrapped in <P> tags!

I couldn't see how to use Word to help me either...

:-(

lammert

2:00 am on Jan 22, 2005 (gmt 0)

If you have access to a Linux box, you can use AWK or one of the other stream-editing tools available on that OS. The GAWK program from the GNU project is also available for many other operating systems, so you could look for a version for your OS.

A typical call would be

 awk -f myscript.awk *.html > descriptions.html

where myscript.awk contains something like

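/name="description"/ {
    # Assumes the meta tag sits on one line, with name= before content=.
    # Split the line at '='; terms[3] is then everything after content=
    split( $0, terms, "=" );
    gsub( "^\"", "", terms[3] );     # strip the leading "
    gsub( "\">$", "", terms[3] );    # strip the trailing ">

    printf( "\nDescription from %s:\n", FILENAME );
    printf( "%s\n\n", terms[3] );
}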

So what does this do? The first line tells AWK to act only on lines in your HTML files that contain the string 'name="description"'. Each such line is then split at the '=' characters. For a tag like <meta name="description" content="..."> that gives three parts, and the third one, terms[3], holds the description text.

The first gsub strips the opening " from terms[3] and the second one strips the "> from its end. So terms[3] now contains the clean text.

The two printf statements output the filename, followed by the description text. The printf statement in AWK uses the same format codes as printf in C (if that language is familiar to you), so you can output all kinds of text, including HTML tags.
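The AWK script only collects the descriptions. To also put each one back into its own page, as the original question asks, something along these lines might work. This is a rough Python sketch, not a tested tool. It assumes each page has a single-line meta tag with name= before content= (the same layout the AWK script expects), that the files are UTF-8, and that you have placed a marker comment at the spot where the paragraph should go. The marker text <!-- DESCRIPTION --> and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: name= comes before content= and the tag sits on one line,
# the same layout the AWK script expects.
META_RE = re.compile(
    r'<meta\s+name="description"\s+content="([^"]*)"',
    re.IGNORECASE,
)

# Hypothetical placeholder marking the pre-determined place in the body.
MARKER = "<!-- DESCRIPTION -->"


def extract_description(html):
    """Return the description text, or None if no matching tag is found."""
    match = META_RE.search(html)
    return match.group(1) if match else None


def insert_description(html):
    """Replace the first MARKER with the description wrapped in <p> tags.

    The original meta tag is left untouched. Pages without a description
    or without the marker are returned unchanged.
    """
    description = extract_description(html)
    if description is None or MARKER not in html:
        return html
    return html.replace(MARKER, "<p>" + description + "</p>", 1)


def process_folder(folder):
    """Rewrite every .html file in folder in place (assumes UTF-8 files)."""
    for path in sorted(Path(folder).glob("*.html")):
        original = path.read_text(encoding="utf-8")
        updated = insert_description(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
```

Calling process_folder("pages") would rewrite the files in a folder named pages, so try it on a copy of the folder first.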

I really do not know of any simple-to-use program under modern graphical operating systems that comes even close to the power of ancient *nix scripting languages.

esllou

2:30 am on Jan 22, 2005 (gmt 0)

just tried and I got:

/name="description"/ {
^Invalid char '

whatever that means!

lammert

3:30 am on Jan 22, 2005 (gmt 0)

I have tested it, and I can reproduce your error message when I first edit the file in Notepad and then save it in UTF-8 format. When Notepad saves as UTF-8, it writes three hidden bytes (the byte order mark) at the beginning of the text file as a sign to programs that the file uses UTF-8 encoding.

Normal editors do not show them on the screen. AWK doesn't know about this special file encoding and prints an "invalid char in expression" message instead.

If you save the awk script file in ANSI format with your editor this problem should disappear.
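If re-saving by hand is inconvenient, the three bytes can also be removed with a few lines of code. A minimal Python sketch, assuming the hidden characters are the standard UTF-8 byte order mark (bytes EF BB BF); the function names are made up for illustration.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def fix_script_file(path):
    """Rewrite the file at path without the BOM; return True if changed."""
    with open(path, "rb") as handle:
        data = handle.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as handle:
        handle.write(cleaned)
    return True
```

Calling fix_script_file("myscript.awk") would clean the script in place and leave a file without the marker untouched.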

trillianjedi

12:55 pm on Jan 22, 2005 (gmt 0)

Great call Lammert!

"I really do not know of any simple-to-use program under modern graphical operating systems that comes even close to the power of ancient *nix scripting languages."

You're right, and often we forget that when we get so used to having a mouse and buttons to click on.

I haven't used AWK in many years; your post reminded me just how damn useful it can be.