How to extract data from <td> cols into an array using bash? - bash

I have managed to extract data from a website and then pull the relevant part out of the downloaded page. Now I am stuck on how to extract the data from the <td> columns into an array for data manipulation.
My extracted HTML is as follows:
<tbody>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Lorem Ipsum</td>
<td>Off Campus</td>
<td>OFF</td>
<td>12 of 999 </td>
<td> </td>
<td> </td>
<td>Get</td>
</tr>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Dolor Sit Amet</td>
<td>Mount Lawley</td>
<td>ON</td>
<td>45 of 999 </td>
<td>Activity</td>
<td> </td>
<td>Get</td>
</tr>
</tbody>
I am doing this using a bash script as I must do it via bash only.

To parse HTML or XML, you had better use dedicated command-line tools such as xmlstarlet or xmllint.
But with your HTML sample, you can try this:
mapfile -t td < <(sed -n 's/[\t ]*<td[^>]*>\(.*\)<\/td>/\1/p' file)
for cell in "${td[@]}"; do
    printf '%s\n' "$cell"
done
sed extracts the contents of every td and passes the result to mapfile using process substitution.
mapfile stores each line from the process substitution in an array variable named td (the -t option strips the trailing newline from each element).
It will work with your simple HTML, provided that there is:
one td tag per line
opening and closing td on the same line
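Once the cells are in a flat array, rows can be recovered with index arithmetic. The following is only a sketch: the inline sample and the `cols` value are assumptions made for illustration (the table in the question has nine cells per row), and it relies on the same one-td-per-line layout.

```shell
#!/usr/bin/env bash
# Hypothetical sample: two rows with three <td> cells each.
html='<tr>
<td>abc3207</td>
<td>151</td>
<td>Lorem Ipsum</td>
</tr>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Dolor Sit Amet</td>
</tr>'

# Store every cell in one flat array; -t drops the trailing newlines.
mapfile -t cells < <(sed -n 's/[[:space:]]*<td[^>]*>\(.*\)<\/td>/\1/p' <<< "$html")

cols=3   # assumed number of <td> cells per row
for ((i = 0; i < ${#cells[@]}; i += cols)); do
    # Print the first and third cell of each row.
    printf '%s: %s\n' "${cells[i]}" "${cells[i + 2]}"
done
```

Cell j of row r (both counted from zero) lives at index `r * cols + j`, so any column can be read without a second array.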

Related

Export filenames in a directory to CSV

I am looking to try and export the filenames in a directory to a CSV file to later include in a HTML table. I am wanting to have a few different column headings on the CSV/table which sometimes will include data and sometimes won't (based on the filename).
For example, the directory /usr/local/flowsim/data/ contains the following file: 10.0.0.0,9996,Netflow-Version-10.conf. This file contains config info which is also encoded in the filename (in this case IP, Port, Netflow Version). There may be a better way to name the file so as to get the info into the CSV, but I want to take the comma-separated values in the filename and put these into a table like so:
IP Address | Port | Netflow Version
10.0.0.0 | 9996 | Netflow Version 10
In addition to this I am trying to figure out a way to have the output of each of the .conf files in a column at the end of the table with the full contents of the file (the file contents may sometimes contain more details than just that of the filename so would be best to just include it in its own column), i.e.
IP Address | Port | Netflow Version | Full Configuration
10.0.0.0 | 9996 | Netflow Version 10 | Output of cat 10.0.0.0,9996,Netflow-Version-10.conf
Is there any way to achieve this or am I being unrealistic here? Thanks!
EDIT
I managed to resolve this using the following bash script:
#!/usr/bin/env bash
> /var/www/html/live-flows.html
head='<!DOCTYPE html>
<html>
<div class="u-expanded-width u-table u-table-responsive u-table-1">
<table class="u-table-entity u-table-entity-1">
<colgroup>
<col width="33.3%">
<col width="33.3%">
<col width="33.3%">
</colgroup>
<thead class="u-align-center u-custom-font u-grey-5 u-heading-font u-table-header u-table-header-1">
<tr style="height: 40px;">
<th class="u-border-1 u-border-grey-dark-1 u-table-cell">Collector IP Address</th>
<th class="u-border-1 u-border-grey-dark-1 u-table-cell">Collector Port</th>
<th class="u-border-1 u-border-grey-dark-1 u-table-cell">Netflow Version</th>
</tr>
</thead>
<tbody class="u-align-center u-table-body">
<tr style="height: 7px;">
</tr>'
tail='</tbody>
</table>
</html>'
printf '%s\n' "$head"
shopt -s nullglob
for file in /usr/local/flowsim/data/*,*,*.conf; do
# Skip files whose names do not match; "\." matches a literal dot.
[[ $file =~ ([^/]+),(.+),(.+)\.conf$ ]] || continue
ip=${BASH_REMATCH[1]}
port=${BASH_REMATCH[2]}
version=${BASH_REMATCH[3]}
printf ' <tr>\n <td>%s</td>\n <td>%s</td>\n <td>%s</td>\n </tr>\n' "$ip" "$port" "$version"
done
printf '%s\n' "$tail"
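The filename parsing at the heart of that script can be tried in isolation. This is a minimal sketch, assuming the naming scheme from the question; the sample path is hypothetical, and the hyphen-to-space replacement is an addition that produces the "Netflow Version 10" form shown in the desired table.

```shell
#!/usr/bin/env bash
# Hypothetical sample path, following the naming scheme from the question.
file='/usr/local/flowsim/data/10.0.0.0,9996,Netflow-Version-10.conf'

# Excluding "/" and "," from the groups keeps them from swallowing
# directory names or neighbouring fields; "\." matches a literal dot.
if [[ $file =~ ([^/,]+),([^,]+),([^,]+)\.conf$ ]]; then
    ip=${BASH_REMATCH[1]}
    port=${BASH_REMATCH[2]}
    version=${BASH_REMATCH[3]//-/ }   # replace every hyphen with a space
    printf '%s,%s,%s\n' "$ip" "$port" "$version"
fi
```

Wrapping the assignments in the `if` means a non-matching filename never reuses values left over from a previous match.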

Unterminated address regex - misapplying escape characters in bash sed script?

Just learning sed, and I feel like I'm getting close to doing what I want, just missing something obvious.
The objective is to take a bunch of <tr>...</tr>s from an HTML table and append them to the single table in another page. So I want to take the initial file, strip everything above the first <tr> and everything from </table> on down, then insert the result just above the </tbody> in the other file. So like below, except <tr> and </tr> are on their own lines, if it matters.
Input File:
<html><body>
<p>Whatever...</p>
<table>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</table>
</body></html>
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
Becomes:
Input File: doesn't matter.
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</tbody>
</table>
</body></html>
Here's the code I'm trying to use:
#!/bin/bash
#$1 is the first parameter and $2 is the second parameter being passed when calling the script. The variable filename will be used to refer to this.
input=$1
inserttarget=$2
sed -e '/\<\/thead\>,$input' $input
sed -e '/\<\/table\>,$input' $input
sed -n -i -e '\<\/tbody\>/r' $inserttarget -e 1x -e '2,${x;p}' -e '${x;p}' $input
Pretty sure it's pretty simple, just messing the expression up. Can anyone set me straight?
Here I cut the problem in two:
1. Cut the rows from the input
2. Paste those rows in the output file
sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p'
This line prints every line containing <tr> within the block that runs from the first line matching <table> to the first line matching </table>. The input file itself is not modified; the extracted lines are written to standard output.
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget}
This multi-line command adds the lines read from stdin after the line matching </tbody>. We then move the </tbody> by appending it after the new lines and deleting the old one.
Another trick used here is to replace the default regex delimiter / with :, so that we can use / in our matching pattern without escaping it.
Final solution:
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget} < <(sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p')
Et voila!
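To try the whole pipeline without touching real pages, it can be run against two throwaway files. This is a sketch assuming GNU sed (for -i, the one-line form of a, and reading /dev/stdin with r); the scratch directory and the shortened sample files are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Assumes GNU sed. The files below are hypothetical, shortened samples.
workdir=$(mktemp -d)
input=$workdir/input.html
inserttarget=$workdir/target.html

printf '%s\n' '<table>' '<tr><td>4</td></tr>' '<tr><td>5</td></tr>' '</table>' > "$input"
printf '%s\n' '<table>' '<tbody>' '<tr><td>1</td></tr>' '</tbody>' '</table>' > "$inserttarget"

# Step 1 (inside the process substitution): extract the <tr> lines.
# Step 2: queue them after </tbody>, re-append </tbody>, delete the old one.
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' "$inserttarget" < <(sed -n '\:<table>:,\:</table>:p' "$input" | sed -n '\:<tr>:p')

cat "$inserttarget"
```

Quoting "$input" and "$inserttarget" keeps the commands working when a path contains spaces.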

How to get strings inbetween two Strings

I have a group of HTML files from which I have to extract the content between <hr> and </hr> tags. I have done everything except this extraction. What I have done is:
1. Loaded all the HTML files and stored them in @html_files.
2. Then I store each file's content in the @useful_files array.
3. Then I loop over the @useful_files array and check each line for <hr>. If it is found, I need the following lines of content in the @elements array.
Is it possible? Am I on the right track?
foreach(@html_files){
$single_file = $_;
$elemets = ();
open $fh, '<', $dir.'/'.$single_file or die "Could not open '$single_file' $!\n";
@useful_files = ();
@useful_files = <$fh>;
foreach(@useful_files){
$line = $_;
chomp($line);
if($line =~ /<hr>/){
@elements = $line;
}
}
create(@elements,$single_file)
}
Thanks !!!
My input html file will be like this
<HR SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">
<P STYLE="margin-top:0px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">Lorem ipsum dolor sit amet, consectetur adipiscing elit. </FONT></P>
<P STYLE="font-size:12px;margin-top:0px;margin-bottom:0px"> </P>
<TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%" BORDER="0" STYLE="BORDER-COLLAPSE:COLLAPSE">
<TR>
<TD WIDTH="45%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="44%"></TD></TR>
<TR>
<TD VALIGN="top"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">Title:</FONT></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">John</FONT></TD></TR>
</TABLE>
<p Style='page-break-before:always'>
<HR SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">
The HTML code which I have copied here is just a sample. I need the exact content between the <hr> tags in the @elements array.
In the simplest way, you may do this:
my @cont;
foreach (@ARGV) {
open my $fh,'<',$_;
push @cont,join('',map { chomp; $_ } <$fh>)=~m%<hr>(.*?)</hr>%g;
}
#print join("\n",@cont,'');
And yes, don't worry: all files will be closed on exit "automagically" :)
Hint: uncomment the print statement to see the result.
You can use grep in the command line:
grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' file.html
This will allow you to extract anything between <hr> and </hr> even if new lines are present.
Example:
tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< '<hr>a b c d </hr>'
a b c d
tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< $'<hr>a b\nc d </hr>'
a b
c d
And of course you can run grep against multiple files.
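When a file holds several <hr>...</hr> blocks, the greedy pattern above returns everything from the first <hr> to the last </hr> as one match. A possible variation, sketched here under the assumptions of GNU grep with -P support and bash 4.4 or later (for mapfile -d), uses a non-greedy match and collects each block as a separate array element. The sample string is invented, and the sketch expects literal <hr> and </hr> tags rather than the attribute-laden <HR ...> of the sample input.

```shell
#!/usr/bin/env bash
# Hypothetical sample containing two blocks, one of them spanning two lines.
html=$'<hr>first\nblock</hr><p>skip</p><hr>second</hr>'

# (?s) lets "." match newlines; ".*?" stops at the nearest </hr>.
# With -z and -o, grep ends each match with a NUL byte, which
# mapfile -d '' uses as the element delimiter.
mapfile -d '' -t blocks < <(grep -Pzo '(?s)<hr>\K.*?(?=</hr>)' <<< "$html")

printf 'found %d blocks\n' "${#blocks[@]}"
printf '[%s]\n' "${blocks[@]}"
```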
I know people say not to parse HTML with a regex, but this seems like the kind of relatively simple task that warrants the use of a regex.
Try this:
if ($line =~ m/<hr>(.*?)<\/hr>/){
push @elements, $1;
}
This will extract the text between <hr> and </hr> and store it in the next index in the @elements array.
Also you should ALWAYS use strict; and use warnings; at the top of your code! This will stop you from making dumb mistakes and prevent many needless headaches down the road.
You should also close your file after you are done extracting its contents into the @useful_files array! close $fh;
(On a side note, the name of this array is misleading. I would suggest you name it something like @lines or @file_contents since it contains the contents of a single file... not multiple files as your variable name seems to suggest.)

Parsing text with sed / awk

I am trying to parse an html table in order to obtain the values. See here.
<tr>
<th>CLI:</th>
<td>0044123456789</td>
</tr>
<tr>
<th>Call Type:</th>
<td>New Enquiry</td>
</tr>
<tr>
<th class=3D"nopaddingtop">Caller's Name:</th>
<td class=3D"nopaddingtop"> </td>
</tr>
<tr>
<th class=3D"nopaddingmid"></th>
<td class=3D"nopaddingmid">Mr</td>
</tr>
<tr>
<th class=3D"nopaddingmid"></th>
<td class=3D"nopaddingmid">Lee</td>
</tr>
<tr>
<th class=3D"nopaddingbot"></th>
<td class=3D"nopaddingbot">Butler</td>
</tr>
I want to read the values associated with the "CLI", "Call Type", and "Caller's Name" into separate variables using sed / awk.
For example:
cli="0044123456789"
call_type="New Enquiry"
caller_name="Mr Lee Butler"
How can I do this?
Many thanks, Neil.
One example for CLI one :
var=$(xmllint --html --xpath '//th[contains(., "CLI")]/../td/text()' file.html)
echo "$var"
For the multi <tr> part :
$ for key in {4..6}; do
xmllint \
--html \
--xpath "//th[contains(., 'CLI')]/../../tr[$key]/td/text()" file.html
printf ' '
done
echo
Output:
Mr Lee Butler
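If xmllint is not available, something similar can be approximated with awk alone, as the question asked. The following is only a sketch: it assumes every <th> and <td> sits on its own line, as in the sample, and that a row with an empty <th> continues the previous label. The helper name get_field is hypothetical.

```shell
#!/usr/bin/env bash
# Shortened copy of the sample table from the question.
html=$(cat <<'EOF'
<tr>
<th>CLI:</th>
<td>0044123456789</td>
</tr>
<tr>
<th>Call Type:</th>
<td>New Enquiry</td>
</tr>
<tr>
<th class=3D"nopaddingtop">Caller's Name:</th>
<td class=3D"nopaddingtop"> </td>
</tr>
<tr>
<th class=3D"nopaddingmid"></th>
<td class=3D"nopaddingmid">Mr</td>
</tr>
<tr>
<th class=3D"nopaddingbot"></th>
<td class=3D"nopaddingbot">Butler</td>
</tr>
EOF
)

# get_field LABEL: print all non-empty <td> values that belong to LABEL,
# joined by single spaces. Reads the HTML from standard input.
get_field() {
    awk -v want="$1" '
        # Strip tags and surrounding blanks; remember the last non-empty label.
        /<th/ { s = $0; gsub(/<[^>]*>/, "", s); gsub(/^[ \t]+|[ \t]+$/, "", s)
                if (s != "") key = s }
        # Collect non-empty cell values while the wanted label is current.
        /<td/ { s = $0; gsub(/<[^>]*>/, "", s); gsub(/^[ \t]+|[ \t]+$/, "", s)
                if (key == want && s != "") out = (out == "" ? s : out " " s) }
        END   { print out }'
}

cli=$(get_field 'CLI:' <<< "$html")
call_type=$(get_field 'Call Type:' <<< "$html")
caller_name=$(get_field "Caller's Name:" <<< "$html")
printf '%s\n' "$cli" "$call_type" "$caller_name"
```

Each call rescans the input, which is wasteful for large files but keeps the helper simple.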

How to use sed to insert in a line before the matching pattern

I am dealing with some HTML code and I got stuck on a problem. Here is an extract of the code; the format is exactly the same:
<tr>
<td nowrap valign="top" class="table_1row"><a name="d071301" id="d071301"></a>13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
Here I have to match
<tr>
<td nowrap valign="top"
and insert something before <tr>. The problem is that I have to match a pattern that spans several lines.
I have tried
grep -c "<tr>\n<td nowrap valign="top"" test.html
grep -c "<tr>\n*<td nowrap valign="top"" test.html
grep -c "<tr>*<td nowrap valign="top"" test.html
to test, but none of them works. So I see two ways to approach the problem:
Match <td nowrap valign="top" and insert in the line above
Match whole string
<tr>
<td nowrap valign="top"
Could anyone suggest how to do it either way?
Using sed you can perform replacements across multiple lines. It is also easy to substitute the match.
sed "/\s*<tr>\s*/ { N; s/.*<tr>\n\s*<td.*/insertion\n&/ }"
This cryptic line basically says:
match a line containing <tr> (/\s*<tr>\s*/)
append the next line to the pattern space (N)
substitute the matched pattern with the insertion followed by the matched string, where & represents the matched string (s/.*<tr>\n\s*<td.*/insertion\n&/)
Sed is very powerful for performing substitutions; it is a nice tool to know. See this manual if you want to learn more about sed:
http://www.grymoire.com/Unix/Sed.html
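The same idea can be narrowed to the exact two-line pattern from the question. This is a sketch assuming GNU sed (for \n in the replacement); the sample HTML is shortened and the inserted comment line is just a placeholder.

```shell
#!/usr/bin/env bash
# Shortened, hypothetical sample based on the extract in the question.
html='<table>
<tr>
<td nowrap valign="top" class="table_1row">13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
</tr>'

# On a line holding only <tr>, append the next line (N). If the pair matches
# <tr> followed by <td nowrap valign="top", put the new line in front of it.
result=$(sed '/^[[:space:]]*<tr>[[:space:]]*$/ {
    N
    s/.*<tr>[[:space:]]*\n[[:space:]]*<td nowrap valign="top".*/<!-- inserted -->\n&/
}' <<< "$html")

printf '%s\n' "$result"
```

Putting the sed script in single quotes avoids the nested-quote problem that breaks the grep attempts in the question.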
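Try grep -Pzo "tr>\s*\n\s*<td" (the -z option makes grep treat the whole file as a single record, so that the pattern can span lines).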
It's not clear how it will help you to insert something before <tr>, but anyway.
Quoted strings do not nest; you need to escape the inner quote characters, or use single quotes instead of double quotes around the pattern.

Resources