I want to extract each HTML table from a list of links. The code I use is the following:
wget -O - "https://example.com/section-1/table-name/financial-data/" | xmllint --html --xpath '//*[#id="financial-data"]/div/table/tbody' - 2>/dev/null >> /Applications/parser/output.txt
This works perfectly fine; however, since this is not the only table I want to extract, it will be difficult to identify which financial data belongs to which table. As it stands, it only parses the one table and appends it to the output file, where the STDOUT looks like this:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
...
</tbody>
But I am looking for this:
<tbody>
<tr class="text-right">
<td>TABLE-NAME</td>
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td>TABLE-NAME</td>
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
...
</tbody>
Here TABLE-NAME is the name of the specific asset. The name can be extracted either with the XPath /html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text(), which is found on the same page as the table, or from the link itself (/table-name/).
I cannot figure out the syntax.
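As a side note, extracting the name from the link itself should be doable with plain shell parameter expansion; a minimal sketch, assuming every link ends in /financial-data/ (variable names are illustrative):
url="https://example.com/section-1/table-name/financial-data/"
name=${url%/financial-data/}   # strip the trailing /financial-data/ part
name=${name##*/}               # keep the last remaining segment: "table-name"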
NB: I purposely omitted the -q flag in the wget command as I want to see what is happening in the Terminal at the moment the script is executed.
Thanks!
UPDATE
According to @DanielHaley, this can be done through XMLStarlet; however, when I read through the documentation, I could not find an example of how to use it.
What is the correct syntax? Do I first have to parse the HTML table via xmllint --html --xpath and then apply xmlstarlet afterwards?
This is what I've found so far:
-i or --insert <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-a or --append <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
NEW UPDATE
Following this link, I came across a script that adds a subnode easily, like this:
wget -O - "https://example.com/section-1/table-name/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "Hello World" >> /Applications/parser/output.txt
Which writes the following to STDOUT:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
<td>Hello World</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
<td>Hello World</td>
</tr>
...
</tbody>
So far so good; however, this inserts fixed default text declared as a string via the -v option, in this case "Hello World". I want to replace this text string with the actual name of the asset. As stated previously, the TABLE-NAME is found on the same page as the table and can be accessed via the other XPath, so I tried the following code:
wget -O - "https://example.com/section-1/table-name/financial-data/" |
header=$(xmllint --html --xpath '/html/body/div[3]/div/div[1]/div[3]/div[1]/h1' -) |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt
Here you can see that I tried declaring a variable $header that should hold the name of the asset. This does not work and leaves my output file empty, probably because the declaration is wrong or the pipe syntax is incorrect.
How can I insert the corresponding XPath (the one referencing the name of the asset) into the newly created subnode <td>? A variable is the first thing I came up with; can it be done some other way?
You should try to insert the additional column before appending the output to output.txt. Make sure the table name you need is stored in a variable. You want to do something like:
tbl=testtbl
echo "<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
" | sed 's#.*<tr.*#&\n <td>'"${tbl}"'</td>#'
In the sed command, the usual slash delimiters are replaced by '#' so that you do not have to escape the slash in </td>.
When you have a file alltables.txt with the approx. 1160 table names, you can make a loop like this:
while IFS= read -r tbl; do
wget -O - "https://example.com/section-1/table-name/financial-data/" |
xmllint --html --xpath '//*[#id="financial-data"]/div/table/tbody' - 2>/dev/null |
sed 's#.*<tr.*#&\n <td>'"${tbl}"'</td>#' >> /Applications/parser/output.txt
done < alltables.txt
You could probably do this with the ed (edit) command in xmlstarlet, but I don't know xmlstarlet well enough to give you an easy answer.
Also, like you said, it looks like you'd have to pass the HTML through either xmllint or the xmlstarlet fo command before passing it to xmlstarlet ed. It doesn't look like ed supports --html.
What I would do is use the xmlstarlet tr (transform) command with an XSLT stylesheet.
It's very verbose, but it's much safer than trying to parse HTML/XML with regex. It's also a lot easier to extend.
Here's the XSLT. I added comments to try to help you understand what's happening.
XSLT 1.0 (stylesheet.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<!--Parameter to capture the table name. This is set on the command line.-->
<xsl:param name="tablename"/>
<!--Identity transform. Will basically output attributes/nodes without
change if not matched by a more specific template.-->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!--Template matching the root element. I do this to narrow the scope of what's
being processed.-->
<xsl:template match="/*">
<!--Process tbody.-->
<xsl:apply-templates select=".//*[#id='financial-data']/div/table/tbody"/>
</xsl:template>
<!--Match tr elements so we can add the new td with the table name.-->
<xsl:template match="tr">
<!--Output the tr element.-->
<xsl:copy>
<!--Process any attributes.-->
<xsl:apply-templates select="#*"/>
<!--Create new td element.-->
<td><xsl:value-of select="$tablename"/></td>
<!--Process any children of tr.-->
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
command line
wget -O - "https://example.com/section-1/table-name/financial-data/" |
xml tr --html stylesheet.xsl -p tablename="/html/body/div[3]/div/div[1]/div[3]/div[1]/h1"
I was able to test this locally by using cat on a local html file instead of wget. Let me know if you want me to add the test file/result to my answer.
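For completeness, the local test amounts to something like this (assuming the page was saved as table-name.html):
cat table-name.html |
xml tr --html stylesheet.xsl -p tablename="/html/body/div[3]/div/div[1]/div[3]/div[1]/h1"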
This script works but is inefficient; it needs some editing:
name_query="html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text()"
# Use xargs to TRIM result.
header=$(wget -O - "https://example.com/section-1/name-1/financial-data/" |
xmllint --html --xpath "$name_query" - 2>/dev/null |
xargs)
wget -O - "https://example.com/section-1/name-1/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt
This makes two requests:
Fetch the name and pass it to variable $header
Get the table and append a subnode <td>$header</td>
Hence, this writes the following to my output.txt file:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
<td>Name 1</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
<td>Name 1</td>
</tr>
...
</tbody>
It's relatively slow; this could presumably be done using only one request, but I can't figure out how.
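Presumably the page could be fetched once into a shell variable and reused for both extractions; a rough, untested sketch of that idea (assuming the page fits comfortably in a variable):
# Fetch the page once.
page=$(wget -O - "https://example.com/section-1/name-1/financial-data/")
# Extract the asset name from the saved copy.
header=$(printf '%s' "$page" |
xmllint --html --xpath "$name_query" - 2>/dev/null |
xargs)
# Extract the table from the same saved copy and append the name column.
printf '%s' "$page" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt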
I'm trying to use awk to change the color of cells in an HTML table. Ideally, I would be able to use awk to locate the Nth instance (a variable passed from earlier in the script) of "tg-6k2t" after "Bob" and change the color code to "tg-b5xm". This is a giant HTML table with many different people's names.
<tr>
<td class="tg-6k2t">Bob</td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
</tr>
My desired output would be
<tr>
<td class="tg-6k2t">Bob</td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-b5xm"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
</tr>
You can do it with an awk statement as follows:
awk -v count=6 '/"tg-6k2t".*Bob/{x=count}x--==1{sub(/tg-6k2t/,"tg-b5xm")}1' file
which generates the output below; the substitution lands on the 6th line counting from the line matching Bob, so change the count variable to your convenience.
<tr>
<td class="tg-6k2t">Bob</td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-6k2t"></td>
<td class="tg-b5xm"></td>
</tr>
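Since the offset comes from earlier in your script, it can be passed straight into awk's -v option as a shell variable; for instance, assuming the offset is held in $n, count=4 should reproduce the desired output above (the 4th cell of the row):
n=4
awk -v count="$n" '/"tg-6k2t".*Bob/{x=count}x--==1{sub(/tg-6k2t/,"tg-b5xm")}1' file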
I have managed to extract data from a website, and then get the relevant data from the extracted webpage. Now I am stuck on how to extract the data from the <td> columns into an array for data manipulation.
My extracted HTML is the following:
<tbody>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Lorem Ipsum</td>
<td>Off Campus</td>
<td>OFF</td>
<td>12 of 999 </td>
<td> </td>
<td> </td>
<td>Get</td>
</tr>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Dolor Sit Amet</td>
<td>Mount Lawley</td>
<td>ON</td>
<td>45 of 999 </td>
<td>Activity</td>
<td> </td>
<td>Get</td>
</tr>
</tbody>
I am doing this using a bash script as I must do it via bash only.
To parse HTML or XML, you'd better use dedicated command-line tools such as xmlstarlet or xmllint.
But with your html sample, you can try this :
mapfile td < <(sed -n 's/[\t ]*<td[^>]*>\(.*\)<\/td>/\1/p' file)
for td in "${td[#]}"; do
printf "$td"
done
sed extracts all td contents and passes the result to mapfile using process substitution.
mapfile stores each line from the process substitution in an array variable named td.
It will work with your simple HTML, given:
one td tag per line
opening and closing td on same line
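Once filled, the array can be indexed like any bash array; with the sample input above, something like this should print the third cell (note that each element keeps its trailing newline because mapfile was called without -t; add -t to strip them):
echo "${td[2]}"    # "Lorem Ipsum"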
I have a group of HTML files from which I have to extract the content between <hr> and </hr> tags. I have done everything except this extraction. What I have done is:
1. Loaded all HTML file names and stored them in @html_files.
2. Then I store each file's content in the @useful_files array.
3. Then I loop over the @useful_files array, checking each line for <hr>. If it is found, I need the following lines of content in the @elements array.
Is this possible? Am I on the right track?
foreach (@html_files) {
    $single_file = $_;
    @elements = ();
    open $fh, '<', $dir.'/'.$single_file or die "Could not open '$single_file' $!\n";
    @useful_files = ();
    @useful_files = <$fh>;
    foreach (@useful_files) {
        $line = $_;
        chomp($line);
        if ($line =~ /<hr>/) {
            @elements = $line;
        }
    }
    create(@elements, $single_file)
}
Thanks !!!
My input HTML file will be like this:
<HR SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">
<P STYLE="margin-top:0px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">Lorem ipsum dolor sit amet, consectetur adipiscing elit. </FONT></P>
<P STYLE="font-size:12px;margin-top:0px;margin-bottom:0px"> </P>
<TABLE CELLSPACING="0" CELLPADDING="0" WIDTH="100%" BORDER="0" STYLE="BORDER-COLLAPSE:COLLAPSE">
<TR>
<TD WIDTH="45%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom"></TD>
<TD WIDTH="4%"></TD>
<TD VALIGN="bottom" WIDTH="1%"></TD>
<TD WIDTH="44%"></TD></TR>
<TR>
<TD VALIGN="top"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">Title:</FONT></TD>
<TD VALIGN="bottom"><FONT SIZE="1"> </FONT></TD>
<TD VALIGN="bottom"><FONT STYLE="font-family:Times New Roman" SIZE="2">John</FONT></TD></TR>
</TABLE>
<p Style='page-break-before:always'>
<HR SIZE="3" style="COLOR:#999999" WIDTH="100%" ALIGN="CENTER">
The HTML code I have copied here is just a sample. I need the exact content between the <hr> tags in the @elements array.
In the simplest way, you may do this:
my @cont;
foreach (@ARGV) {
open my $fh,'<',$_;
push @cont,join('',map { chomp; $_ } <$fh>)=~m%<hr>(.*?)</hr>%g;
}
#print join("\n",@cont,'');
And yes, don't worry: all files will be closed on exit "automagically" :)
Hint: uncomment the print statement to see the result.
You can use grep in the command line:
grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' file.html
This will allow you to extract anything between <hr> and </hr> even if new lines are present.
Example:
tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< '<hr>a b c d </hr>'
a b c d
tiago@dell:/tmp$ grep -Pzo '<hr>\K((.|\n)*)(?=</hr>)' <<< $'<hr>a b\nc d </hr>'
a b
c d
And of course you can run grep against multiple files.
I know people say not to parse HTML with a regex, but this seems like the kind of relatively simple task that warrants the use of a regex.
Try this:
if ($line =~ m/<hr>(.*?)<\/hr>/){
push @elements, $1;
}
This will extract the text between <hr> and </hr> and store it in the next index in the #elements array.
Also you should ALWAYS use strict; and use warnings; at the top of your code! This will stop you from making dumb mistakes and prevent many needless headaches down the road.
You should also close your file after you are done extracting its contents into the @useful_files array! close $fh;
(On a side note, the name of this array is misleading. I would suggest you name it something like @lines or @file_contents, since it contains the contents of a single file... not multiple files as your variable name seems to suggest.)
I am trying to parse an HTML table in order to obtain the values. See here.
<tr>
<th>CLI:</th>
<td>0044123456789</td>
</tr>
<tr>
<th>Call Type:</th>
<td>New Enquiry</td>
</tr>
<tr>
<th class=3D"nopaddingtop">Caller's Name:</th>
<td class=3D"nopaddingtop"> </td>
</tr>
<tr>
<th class=3D"nopaddingmid"></th>
<td class=3D"nopaddingmid">Mr</td>
</tr>
<tr>
<th class=3D"nopaddingmid"></th>
<td class=3D"nopaddingmid">Lee</td>
</tr>
<tr>
<th class=3D"nopaddingbot"></th>
<td class=3D"nopaddingbot">Butler</td>
</tr>
I want to read the values associated with "CLI", "Call Type", and "Caller's Name" into separate variables, using sed / awk.
For example:
cli="0044123456789"
call_type="New Enquiry"
caller_name="Mr Lee Butler"
How can I do this?
Many thanks, Neil.
One example, for the CLI value:
var=$(xmllint --html --xpath '//th[contains(., "CLI")]/../td/text()' file.html)
echo "$var"
For the multi-<tr> part:
$ for key in {4..6}; do
xmllint \
--html \
--xpath "//th[contains(., 'CLI')]/../../tr[$key]/td/text()" file.html
printf ' '
done
echo
Output:
Mr Lee Butler
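To get all three variables the question asks for, the same calls can be wrapped in command substitutions; a sketch, assuming the row positions (tr[4] through tr[6]) stay fixed:
cli=$(xmllint --html --xpath '//th[contains(., "CLI")]/../td/text()' file.html)
call_type=$(xmllint --html --xpath '//th[contains(., "Call Type")]/../td/text()' file.html)
caller_name=$(for key in {4..6}; do
  xmllint --html --xpath "//th[contains(., 'CLI')]/../../tr[$key]/td/text()" file.html
  printf ' '
done | xargs)   # xargs trims the stray whitespace
echo "$cli $call_type $caller_name"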