I have a group of HTML files from which I have to extract the content between <hr> and </hr> tags. I have done everything except this extraction. What I have done so far:
1. Loaded all HTML files and stored them in @html_files.
2. Stored each file's content in the @useful_files array.
3. Looped over the @useful_files array, checking each line for <hr>. If it is found, I need the following lines of content in the @elements array.
Is this possible? Am I on the right track?
foreach (@html_files) {
    $single_file = $_;
    $elemets = ();
    open $fh, '<', $dir.'/'.$single_file or die "Could not open '$single_file' $!\n";
    @useful_files = ();
    @useful_files = <$fh>;
    foreach (@useful_files) {
        $line = $_;
        chomp($line);
        if ($line =~ /<hr>/) {
            @elements = $line;
        }
    }
    create(@elements, $single_file)
}
Thanks !!!
My input html file will be like this
<HR SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">
<P STYLE="margin-top:0px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">Lorem ipsum dolor sit amet, consectetur adipiscing elit. </FONT></P>
<P STYLE="font-size:12px;margin-top:0px;margin-bottom:0px"> </P>
<TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%" BORDER="0" STYLE="BORDER-COLLAPSE:COLLAPSE">
<TR>
<TD WIDTH="45%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="44%"></TD></TR>
<TR>
<TD VALIGN="top"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">Title:</FONT></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">John</FONT></TD></TR>
</TABLE>
<p Style='page-break-before:always'>
<HR SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">
The HTML I have copied here is just a sample. I need the exact content between the <hr> tags in the @elements array.
In the simplest way, you could do this:
my @cont;
foreach (@ARGV) {
open my $fh,'<',$_ or die "Could not open '$_': $!\n";
push @cont,join('',map { chomp; $_ } <$fh>)=~m%<hr>(.*?)</hr>%g;
}
#print join("\n",@cont,'');
And yes, don't worry: all files will be closed on exit "automagically" :)
Hint: uncomment print statement to see the result.
You can use grep in the command line:
grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' file.html
This will allow you to extract anything between <hr> and </hr> even if new lines are present.
Example:
tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< '<hr>a b c d </hr>'
a b c d
tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< $'<hr>a b\nc d </hr>'
a b
c d
And of course you can run grep against multiple files.
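Note that the sample input in the question uses <HR ...> tags in upper case, with attributes and no closing </hr>, so a pattern for a literal <hr>...</hr> pair may not match it. A minimal sketch for that layout, assuming each <hr> tag sits on its own line and that the wanted content lies between the first two of them (the function name between_hr is made up for illustration):

```shell
# Hypothetical helper: print the lines between the first two <hr ...> tags.
# Assumes every <hr> tag is on a line of its own, as in the sample input.
between_hr() {
  awk '
    # An <hr> line (any case, with or without attributes): count it and skip it.
    tolower($0) ~ /<hr[ >]/ { n++; next }
    # Print only the lines that follow the first <hr> and precede the second.
    n == 1
  ' "$1"
}
```

Calling between_hr file.html prints the block to standard output, and a shell loop over several files can call it once per file.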
I know people say not to parse HTML with a regex, but this seems like the kind of relatively simple task that warrants the use of a regex.
Try this:
if ($line =~ m/<hr>(.*?)<\/hr>/){
push @elements, $1;
}
This will extract the text between <hr> and </hr> and store it at the next index of the @elements array.
Also, you should ALWAYS put use strict; and use warnings; at the top of your code! This will stop you from making dumb mistakes and prevent many needless headaches down the road.
You should also close your file once you are done reading its contents into the @useful_files array: close $fh;
(On a side note, the name of this array is misleading. I would suggest naming it something like @lines or @file_contents, since it contains the contents of a single file, not multiple files as the variable name seems to suggest.)
I'm trying to use awk to change the color of cells in an HTML table. Ideally, I would be able to use awk to locate the Nth instance (a variable passed from earlier in the script) of "tg-6k2t" after "Bob" and change the color code to "tg-b5xm". This is a giant HTML table with many different people's names.
<tr>
<td class="tg-6k2t">Bob</td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
</tr>
My desired output would be
<tr>
<td class="tg-6k2t">Bob</td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-b5xm"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
</tr>
You can do it with an Awk statement as follows,
awk -v count=6 '/"tg-6k2t".*Bob/{x=count}x--==1{sub(/tg-6k2t/,"tg-b5xm")}1' file
which generates the output below: it changes the 6th line counting from the line matching Bob (that line included). Change the count variable to suit your needs.
<tr>
<td class="tg-6k2t">Bob</td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-b5xm"></td>
</tr>
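The command above counts lines, starting with the Bob line itself. If the goal is the Nth occurrence of the class after the name, a variant could count matching cells instead. This is only a sketch: the function name recolor_nth is hypothetical, and it assumes one <td> per line with the name as the full content of its cell.

```shell
# Hypothetical helper: recolor the Nth "tg-6k2t" cell that follows the named cell.
# $1 = name (for example Bob), $2 = N, $3 = HTML file
recolor_nth() {
  awk -v name="$1" -v n="$2" '
    # After the name was seen, count matching cells and change the Nth one.
    seen && /tg-6k2t/ && ++hits == n { sub(/tg-6k2t/, "tg-b5xm"); seen = 0 }
    # The cell holding the name starts (or restarts) the count; it is not counted itself.
    index($0, ">" name "</td>") { seen = 1; hits = 0 }
    { print }
  ' "$3"
}
```

With the table from the question, recolor_nth Bob 3 file changes the third cell after Bob, which matches the desired output shown there.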
I'm just learning sed, and I feel like I'm getting close to doing what I want, just missing something obvious.
The objective is to take a bunch of <tr>...</tr> rows from an HTML table and append them to the single table in another page. So I want to take the initial file, strip everything above the first <tr> and everything from </table> on down, then insert the rest just above the </tbody> in the other file. So like below, except that <tr> and </tr> are on their own lines, if it matters.
Input File:
<html><body>
<p>Whatever...</p>
<table>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</table>
</body></html>
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
Becomes:
Input File: doesn't matter.
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</tbody>
</table>
</body></html>
Here's the code I'm trying to use:
#!/bin/bash
#$1 is the first parameter and $2 is the second parameter being passed when calling the script. The variable filename will be used to refer to this.
input=$1
inserttarget=$2
sed -e '/\<\/thead\>,$input' $input
sed -e '/\<\/table\>,$input' $input
sed -n -i -e '\<\/tbody\>/r' $inserttarget -e 1x -e '2,${x;p}' -e '${x;p}' $input
Pretty sure it's pretty simple, just messing the expression up. Can anyone set me straight?
Here I cut the problem in two:
1. Cut the rows from the input
2. Paste those rows in the output file
sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p'
This pipeline extracts every line containing <tr> from the block that runs from the first line matching <table> to the first line matching </table>. Those freshly cut lines are printed to standard output.
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget}
This multi-line command will add the lines read from stdin after the line matching </tbody>. Then we move the </tbody> by appending it after the new lines and removing the old one.
Another trick used here is to replace the default regex delimiter / by :, so that we can use '/' in our matching pattern.
Final solution:
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget} < <(sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p')
Et voila!
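As a rough alternative sketch (not the method used above), the same cut and paste could be done in a single awk pass that prints the merged result to standard output instead of editing the target in place. The function name append_rows is hypothetical. It assumes each row starts on a line containing <tr> and ends on a line containing </tr>, which also covers the case where those tags sit on their own lines.

```shell
# Hypothetical helper: print the target file with the rows of the input file
# inserted just before the closing </tbody>.
# $1 = input file (source of the rows), $2 = target file
append_rows() {
  awk '
    # First file: collect every line from a <tr> line through its </tr> line.
    NR == FNR {
      if ($0 ~ /<tr>/) inrow = 1
      if (inrow) rows = rows $0 "\n"
      if ($0 ~ /<\/tr>/) inrow = 0
      next
    }
    # Second file: emit the collected rows right before </tbody>.
    /<\/tbody>/ { printf "%s", rows }
    { print }
  ' "$1" "$2"
}
```

Something like append_rows "$input" "$inserttarget" > merged.html would then write the combined page to a new file (merged.html is just an example name), leaving both originals untouched.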
I have managed to extract data from a website and then get the relevant data from the extracted webpage. Now I am stuck on how to extract the data from the <td> columns into an array for data manipulation.
My extracted HTML is the following:
<tbody>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Lorem Ipsum</td>
<td>Off Campus</td>
<td>OFF</td>
<td>12 of 999 </td>
<td> </td>
<td> </td>
<td>Get</td>
</tr>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Dolor Sit Amet</td>
<td>Mount Lawley</td>
<td>ON</td>
<td>45 of 999 </td>
<td>Activity</td>
<td> </td>
<td>Get</td>
</tr>
</tbody>
I am doing this using a bash script as I must do it via bash only.
To parse HTML or XML, you'd better use dedicated command-line tools such as xmlstarlet or xmllint.
But with your HTML sample, you can try this:
mapfile -t td < <(sed -n 's/[\t ]*<td[^>]*>\(.*\)<\/td>/\1/p' file)
for cell in "${td[@]}"; do
printf '%s\n' "$cell"
done
sed extracts all td contents and passes the result to mapfile using process substitution.
mapfile stores each line from the process substitution in an array variable named td (the -t option strips the trailing newline from each entry).
It will work with your simple HTML as long as there is:
one td tag per line
opening and closing td on the same line
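If the cells need to stay grouped by table row, one possible sketch is to let awk join the cells of each <tr> into a single line. The function name rows_to_psv and the | separator are assumptions made for illustration. The sketch relies on the same two restrictions listed above, plus no | character inside the cell text.

```shell
# Hypothetical helper: print one line per <tr>, with its <td> contents joined by "|".
# $1 = HTML file with one <td>...</td> per line
rows_to_psv() {
  awk '
    # A new row starts: reset the accumulated line.
    /<tr[ >]/ { line = ""; sep = "" }
    # A cell: strip everything up to the opening tag and from the closing tag on.
    /<td[^>]*>.*<\/td>/ {
      cell = $0
      sub(/^.*<td[^>]*>/, "", cell)
      sub(/<\/td>.*$/, "", cell)
      line = line sep cell
      sep = "|"
    }
    # The row ends: print the joined cells.
    /<\/tr>/ { print line }
  ' "$1"
}
```

Each output line can then be split again in the shell, for example with rows_to_psv file | while IFS='|' read -r code unit title rest; do printf '%s\n' "$title"; done, where the variable names are only examples.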
I am trying to parse an html table in order to obtain the values. See here.
<tr>
<th>CLI:</th>
<td>0044123456789</td>
</tr>
<tr>
<th>Call Type:</th>
<td>New Enquiry</td>
</tr>
<tr>
<th class=3D"nopaddingtop">Caller's Name:</th>
<td class=3D"nopaddingtop"> </td>
</tr>
<tr>
<th class=3D"nopaddingmid"></th>
<td class=3D"nopaddingmid">Mr</td>
</tr>
<tr>
<th class=3D"nopaddingmid"></th>
<td class=3D"nopaddingmid">Lee</td>
</tr>
<tr>
<th class=3D"nopaddingbot"></th>
<td class=3D"nopaddingbot">Butler</td>
</tr>
I want to read the values associated with "CLI", "Call Type", and "Caller's Name" into separate variables using sed / awk.
For example:
cli="0044123456789"
call_type="New Enquiry"
caller_name="Mr Lee Butler"
How can I do this?
Many thanks, Neil.
An example for the CLI value:
var=$(xmllint --html --xpath '//th[contains(., "CLI")]/../td/text()' file.html)
echo "$var"
For the part spread over several <tr> rows:
$ for key in {4..6}; do
xmllint \
--html \
--xpath "//th[contains(., 'CLI')]/../../tr[$key]/td/text()" file.html
printf ' '
done
echo
Output:
Mr Lee Butler
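Since the question asked for sed / awk, here is a rough awk-only sketch for the same table. It is not as robust as a real HTML parser. It assumes one <th> or <td> per line, as in the sample, and it treats a row with an empty <th> as a continuation of the previous label. The function name th_td_pairs is hypothetical.

```shell
# Hypothetical helper: print "label<TAB>value" lines from a th/td table.
# $1 = HTML file with one <th> or <td> per line
th_td_pairs() {
  awk '
    /<th[ >]/ {
      label = $0
      gsub(/<[^>]*>/, "", label)            # drop the tags
      gsub(/^[ \t]+|[ \t]+$/, "", label)    # trim surrounding blanks
      sub(/:$/, "", label)                  # drop the trailing colon
      if (label != "") key = label          # an empty th keeps the previous label
    }
    /<td[ >]/ {
      value = $0
      gsub(/<[^>]*>/, "", value)
      gsub(/^[ \t]+|[ \t]+$/, "", value)
      if (value == "") next                 # skip blank cells
      if (!(key in vals)) { order[++n] = key; vals[key] = value }
      else vals[key] = vals[key] " " value  # join the values of continued rows
    }
    END { for (i = 1; i <= n; i++) printf "%s\t%s\n", order[i], vals[order[i]] }
  ' "$1"
}
```

A single value can then be picked out with something like cli=$(th_td_pairs file.html | awk -F'\t' '$1 == "CLI" { print $2 }').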
I am dealing with some HTML code and I got stuck on a problem. Here is an extract of the code; the format is exactly the same:
<tr>
<td nowrap valign="top" class="table_1row"><a name="d071301" id="d071301"></a>13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
Here I have to match
<tr>
<td nowrap valign="top"
and insert something before <tr>. The problem is that I have to match a pattern that spans different lines.
I have tried
grep -c "<tr>\n<td nowrap valign="top"" test.html
grep -c "<tr>\n*<td nowrap valign="top"" test.html
grep -c "<tr>*<td nowrap valign="top"" test.html
to test, but none of them work. So I see two ways to approach the problem:
Match <td nowrap valign="top" and insert on the line above
Match the whole string
<tr>
<td nowrap valign="top"
Could anyone suggest a way of doing it either way?
Using sed you can perform a replacement across multiple lines. It's also easy to substitute the match.
sed "/\s*<tr>\s*/ { N; s/.*<tr>\n\s*<td.*/insertion\n&/ }"
This cryptic line basically says:
match a line containing <tr> (/\s*<tr>\s*/)
append the next line to the pattern space (N)
substitute the matched pattern with the insertion followed by the matched string, where & represents the matched string (s/.*<tr>\n\s*<td.*/insertion\n&/)
sed is very powerful for performing substitutions, and it's a nice tool to know. See this manual if you want to learn more about sed:
http://www.grymoire.com/Unix/Sed.html
Try grep -Pzo "tr>\s*\n\s*<td" (the -z option makes grep treat the whole file as one record, so that \n can match across lines).
It's not clear how it will help you insert something before <tr>, but anyway.
Quoted strings do not nest; you need to escape the inner quote characters, or use single quotes instead of double quotes.
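For the insertion itself, one possible awk sketch holds back each <tr> line until the next line is known, and puts the pattern in single quotes so that the double quotes inside it need no escaping. The function name insert_before_tr is hypothetical, and the sketch assumes <tr> sits alone on its line, as in the extract from the question.

```shell
# Hypothetical helper: print the file with a line of text inserted before every
# <tr> line that is directly followed by a <td nowrap valign="top" ...> line.
# $1 = text to insert, $2 = HTML file
insert_before_tr() {
  awk -v ins="$1" '
    # The previous line was a held <tr>: decide now whether to insert before it.
    held {
      if ($0 ~ /^[ \t]*<td nowrap valign="top"/) print ins
      print prev
      held = 0
    }
    # Hold a lone <tr> line back until the following line has been read.
    /^[ \t]*<tr>[ \t]*$/ { prev = $0; held = 1; next }
    { print }
    # Flush a <tr> that happens to be the last line of the file.
    END { if (held) print prev }
  ' "$2"
}
```

For example, insert_before_tr '<!-- marker -->' test.html prints the page with the comment placed above each matching <tr>.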