I am looking to export the filenames in a directory to a CSV file, to later include in an HTML table. I want a few different column headings on the CSV/table, which will sometimes contain data and sometimes won't (based on the filename).
For example, the directory /usr/local/flowsim/data/ contains the file 10.0.0.0,9996,Netflow-Version-10.conf. The filename itself carries the config info (in this case IP, port, and Netflow version). There may be a better way to name the file to get the info into the CSV, but I want to take the comma-separated values in the filename and put them into a table like so:
IP Address    Port    Netflow Version
10.0.0.0      9996    Netflow Version 10
In addition to this, I am trying to figure out a way to add a column at the end of the table holding the full contents of each .conf file (the file contents may sometimes contain more detail than the filename alone, so it is best kept in its own column), i.e.:
IP Address    Port    Netflow Version       Full Configuration
10.0.0.0      9996    Netflow Version 10    Output of cat 10.0.0.0,9996,Netflow-Version-10.conf
Is there any way to achieve this, or am I being unrealistic here? Thanks!
EDIT
I managed to resolve this using the following bash script:
#!/usr/bin/env bash
# Redirect the whole script's output to the page (truncating any previous content)
exec > /var/www/html/live-flows.html
head='<!DOCTYPE html>
<html>
<div class="u-expanded-width u-table u-table-responsive u-table-1">
<table class="u-table-entity u-table-entity-1">
<colgroup>
<col width="33.3%">
<col width="33.3%">
<col width="33.3%">
</colgroup>
<thead class="u-align-center u-custom-font u-grey-5 u-heading-font u-table-header u-table-header-1">
<tr style="height: 40px;">
<th class="u-border-1 u-border-grey-dark-1 u-table-cell">Collector IP Address</th>
<th class="u-border-1 u-border-grey-dark-1 u-table-cell">Collector Port</th>
<th class="u-border-1 u-border-grey-dark-1 u-table-cell">Netflow Version</th>
</tr>
</thead>
<tbody class="u-align-center u-table-body">
<tr style="height: 7px;">
</tr>'
tail='</tbody>
</table>
</html>'
printf '%s\n' "$head"
shopt -s nullglob
for file in /usr/local/flowsim/data/*,*,*.conf; do
    if [[ $file =~ ([^/,]+),([^,]+),(.+)\.conf$ ]]; then
        ip=${BASH_REMATCH[1]}
        port=${BASH_REMATCH[2]}
        version=${BASH_REMATCH[3]}
        printf ' <tr>\n <td>%s</td>\n <td>%s</td>\n <td>%s</td>\n </tr>\n' "$ip" "$port" "$version"
    fi
done
printf '%s\n' "$tail"
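Since the original goal was a CSV, here is a minimal sketch of the same filename-parsing idea that emits CSV instead, with a fourth "Full Configuration" column holding each file's contents. The export_csv function name and the optional directory argument are my own additions, not part of the script above:

```shell
# Emit one CSV row per *,*,*.conf file; the directory defaults to the
# flowsim data path but can be overridden (e.g. for testing).
export_csv() {
    local dir=${1:-/usr/local/flowsim/data}
    echo 'IP Address,Port,Netflow Version,Full Configuration'
    shopt -s nullglob
    local file base ip port version
    for file in "$dir"/*,*,*.conf; do
        base=${file##*/}            # strip the directory
        base=${base%.conf}          # strip the extension
        IFS=, read -r ip port version <<< "$base"
        version=${version//-/ }     # "Netflow-Version-10" -> "Netflow Version 10"
        # Quote the last field, since the file contents may contain commas.
        printf '%s,%s,%s,"%s"\n' "$ip" "$port" "$version" "$(cat "$file")"
    done
}

# Usage: export_csv > /var/www/html/live-flows.csv
```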
I want to extract each html table from a list of links. The code I use is the following:
wget -O - "https://example.com/section-1/table-name/financial-data/" | xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null >> /Applications/parser/output.txt
This works perfectly fine; however, since this is not the only table I want to extract, it will be difficult to identify which financial data belongs to which table. As it stands, it only parses one table, which is appended to the output file, where the STDOUT looks like this:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
...
</tbody>
But I am looking for this:
<tbody>
<tr class="text-right">
<td>TABLE-NAME</td>
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td>TABLE-NAME</td>
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
...
</tbody>
Where the TABLE-NAME is the name of the specific asset. The name can be extracted either using the XPath /html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text() which appears in the same URL where the table is, or from the link itself /table-name/.
I cannot figure out the syntax.
NB: I purposely omitted the -q flag in the wget command as I want to see what is happening in the Terminal at the moment the script is executed.
Thanks!
UPDATE
According to @DanielHaley this can be done through XMLStarlet; however, when I read through the documentation I could not find an example of how to use it.
What is the correct syntax? Do I first have to parse the HTML table via xmllint --html --xpath and then apply xmlstarlet afterwards?
This is what I've found so far:
-i or --insert <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
-a or --append <xpath> -t (--type) elem|text|attr -n <name> -v (--value) <value>
NEW UPDATE
According to this link, I came across a script that adds a subnode easily, like this:
wget -O - "https://example.com/section-1/table-name/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "Hello World" >> /Applications/parser/output.txt
Which writes the following to STDOUT:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
<td>Hello World</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
<td>Hello World</td>
</tr>
...
</tbody>
So far so good; however, this inserts fixed text supplied as a string via the -v option, in this case "Hello World". I want to replace this text with the actual name of the asset. As stated previously, the TABLE-NAME is found on the same page as the table and can be accessed via the other XPath, hence I tried the following code:
wget -O - "https://example.com/section-1/table-name/financial-data/" |
header=$(xmllint --html --xpath '/html/body/div[3]/div/div[1]/div[3]/div[1]/h1' -) |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt
Here you can clearly see that I tried declaring a variable $header that shall include the name of the asset. This does not work and leaves my output file empty, probably because the declaration is wrong or the pipe's syntax is not correct.
How can I insert the according XPath (that references to the name of the asset) into the newly created subnode <td>? A variable is the first thing that I came up with; can it be done elsewise?
You should insert the additional column before appending the output to output.txt. Make sure the table name you need is stored in a variable. You want to do something like:
tbl=testtbl
echo '<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
</tr>
' | sed 's#.*<tr.*#&\n <td>'"${tbl}"'</td>#'
In the sed command the usual slash delimiter is replaced by '#', so you do not need to escape the slash in </td>.
When you have a file alltables.txt with the approx. 1160 table names, you can make a loop like this:
while IFS= read -r tbl; do
wget -O - "https://example.com/section-1/${tbl}/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
sed 's#.*<tr.*#&\n <td>'"${tbl}"'</td>#' >> /Applications/parser/output.txt
done < alltables.txt
You could probably do this with the ed (edit) command in xmlstarlet, but I don't know xmlstarlet well enough to give you an easy answer.
Also, like you said, it looks like you'd have to pass the HTML through either xmllint or the xmlstarlet fo command before passing it to xmlstarlet ed. It doesn't look like ed supports --html.
What I would do is use the xmlstarlet tr (transform) command with an XSLT stylesheet.
It's very verbose, but it's much safer than trying to parse HTML/XML with regex. It's also a lot easier to extend.
Here's the XSLT. I added comments to try to help you understand what's happening.
XSLT 1.0 (stylesheet.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<!--Parameter to capture the table name. This is set on the command line.-->
<xsl:param name="tablename"/>
<!--Identity transform. Will basically output attributes/nodes without
change if not matched by a more specific template.-->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!--Template matching the root element. I do this to narrow the scope of what's
being processed.-->
<xsl:template match="/*">
<!--Process tbody.-->
<xsl:apply-templates select=".//*[@id='financial-data']/div/table/tbody"/>
</xsl:template>
<!--Match tr elements so we can add the new td with the table name.-->
<xsl:template match="tr">
<!--Output the tr element.-->
<xsl:copy>
<!--Process any attributes.-->
<xsl:apply-templates select="@*"/>
<!--Create new td element.-->
<td><xsl:value-of select="$tablename"/></td>
<!--Process any children of tr.-->
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
command line
wget -O - "https://example.com/section-1/table-name/financial-data/" |
xml tr --html stylesheet.xsl -p tablename="/html/body/div[3]/div/div[1]/div[3]/div[1]/h1"
I was able to test this locally by using cat on a local html file instead of wget. Let me know if you want me to add the test file/result to my answer.
This script works but is inefficient; it needs some editing:
name_query="html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text()"
# Use xargs to TRIM result.
header=$(wget -O - "https://example.com/section-1/name-1/financial-data/" |
xmllint --html --xpath "$name_query" - 2>/dev/null |
xargs)
wget -O - "https://example.com/section-1/name-1/financial-data/" |
xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header" >> /Applications/parser/output.txt
This makes two requests:
1. Fetch the name and pass it to the variable $header
2. Get the table and append a subnode <td>$header</td>
Hence, this writes the following to my output.txt file:
<tbody>
<tr class="text-right">
<td class="text-left">Sep 08, 2017</td>
<td>4605.16</td>
<td>4661.00</td>
<td>4075.18</td>
<td>4228.75</td>
<td>2,700,890,000</td>
<td>76,220,200,000</td>
<td>Name 1</td>
</tr>
<tr class="text-right">
<td class="text-left">Sep 07, 2017</td>
<td>4589.14</td>
<td>4655.04</td>
<td>4491.33</td>
<td>4599.88</td>
<td>1,844,620,000</td>
<td>75,945,000,000</td>
<td>Name 1</td>
</tr>
...
</tbody>
It's relatively slow because this can actually be done using one request only, but I can't figure out how.
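For what it's worth, a sketch of the one-request variant: save the page body to a variable once, then run both xmllint queries against that saved copy. The fetch_table name is made up, and it reuses the $name_query variable from the script above:

```shell
# One request: fetch the page body once, then query it twice locally.
name_query="/html/body/div[3]/div/div[1]/div[3]/div[1]/h1/text()"

fetch_table() {
    local url=$1 page header
    page=$(wget -q -O - "$url") || return 1
    # First query: the asset name (xargs trims the whitespace).
    header=$(printf '%s\n' "$page" |
        xmllint --html --xpath "$name_query" - 2>/dev/null | xargs)
    # Second query: the table body, annotated with the name.
    printf '%s\n' "$page" |
        xmllint --html --xpath '//*[@id="financial-data"]/div/table/tbody' - 2>/dev/null |
        xmlstarlet ed --subnode "/tbody/tr" --type elem -n td -v "$header"
}

# Usage:
# fetch_table "https://example.com/section-1/name-1/financial-data/" >> /Applications/parser/output.txt
```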
Just learning sed, and I feel like I'm getting close to doing what I want, just missing something obvious.
The objective is to take a bunch of <tr>...</tr> rows in an HTML table and append them to the single table in another page. So I want to take the initial file, strip everything above the first <tr> and everything from </table> on down, then insert it just above the </table> in the other file. So like below, except <tr> and </tr> are on their own lines, if it matters.
Input File:
<html><body>
<p>Whatever...</p>
<table>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</table>
</body></html>
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
Becomes (the input file doesn't matter afterwards):
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</tbody>
</table>
</body></html>
Here's the code I'm trying to use:
#!/bin/bash
# $1 (input) is the file to copy rows from; $2 (inserttarget) is the file to insert them into.
input=$1
inserttarget=$2
sed -e '/\<\/thead\>,$input' $input
sed -e '/\<\/table\>,$input' $input
sed -n -i -e '\<\/tbody\>/r' $inserttarget -e 1x -e '2,${x;p}' -e '${x;p}' $input
Pretty sure it's pretty simple, just messing the expression up. Can anyone set me straight?
Here I cut the problem in two:
1. Cut the rows from the input
2. Paste those rows in the output file
sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p'
This line extracts all lines containing <tr> from the block ranging from the first line matching <table> to the first line matching </table>. All those freshly cut lines are printed on standard output.
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget}
This multi-line command will add the lines read from stdin after the line matching </tbody>. Then we move the </tbody> by appending it after the new lines and removing the old one.
Another trick used here is to replace the default regex delimiter / by :, so that we can use '/' in our matching pattern.
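A quick made-up demo of the delimiter trick; both addresses select the same line, but the second needs no escaping:

```shell
# Both commands print only the line containing </tbody>.
printf '%s\n' 'keep </tbody> here' 'drop this' | sed -n '/<\/tbody>/p'   # escaped slash
printf '%s\n' 'keep </tbody> here' 'drop this' | sed -n '\:</tbody>:p'   # ':' delimiter
```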
Final solution:
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget} < <(sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p')
Et voila!
I have managed to extract data from a website, then get the relevant part of the extracted webpage. Now I am stuck on how to extract the data from the <td> columns into an array for data manipulation.
My extracted HTML is following:
<tbody>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Lorem Ipsum</td>
<td>Off Campus</td>
<td>OFF</td>
<td>12 of 999 </td>
<td> </td>
<td> </td>
<td>Get</td>
</tr>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Dolor Sit Amet</td>
<td>Mount Lawley</td>
<td>ON</td>
<td>45 of 999 </td>
<td>Activity</td>
<td> </td>
<td>Get</td>
</tr>
</tbody>
I am doing this using a bash script as I must do it via bash only.
To parse HTML or XML, you'd better use dedicated command-line tools such as xmlstarlet or xmllint.
But with your html sample, you can try this :
mapfile td < <(sed -n 's/[\t ]*<td[^>]*>\(.*\)<\/td>/\1/p' file)
for td in "${td[@]}"; do
    printf '%s' "$td"
done
sed extracts all the td contents and passes the result to mapfile using process substitution.
mapfile stores each line from the process substitution in an array variable named td (without -t, each element keeps its trailing newline, which is why printf '%s' is enough here).
It will work with your simple HTML, given:
one td tag per line
opening and closing td on same line
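As a small self-contained illustration (the two sample cells are borrowed from the first row above), mapfile -t strips the trailing newlines so each cell lands cleanly in the array:

```shell
# Extract the td contents into an array; -t drops the trailing newlines.
mapfile -t cells < <(printf '%s\n' '  <td>abc3207</td>' '  <td>151</td>' |
    sed -n 's/[\t ]*<td[^>]*>\(.*\)<\/td>/\1/p')

# Each cell is now a separate array element.
for i in "${!cells[@]}"; do
    printf '%d: %s\n' "$i" "${cells[$i]}"
done
```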
I am trying to parse an HTML table in order to obtain the values. See here:
<tr>
<th>CLI:</th>
<td>0044123456789</td>
</tr>
<tr>
<th>Call Type:</th>
<td>New Enquiry</td>
</tr>
<tr>
<th class="nopaddingtop">Caller's Name:</th>
<td class="nopaddingtop"> </td>
</tr>
<tr>
<th class="nopaddingmid"></th>
<td class="nopaddingmid">Mr</td>
</tr>
<tr>
<th class="nopaddingmid"></th>
<td class="nopaddingmid">Lee</td>
</tr>
<tr>
<th class="nopaddingbot"></th>
<td class="nopaddingbot">Butler</td>
</tr>
I want to read the values associated with "CLI", "Call Type", and "Caller's Name" into separate variables using sed/awk.
For example:
cli="0044123456789"
call_type="New Enquiry"
caller_name="Mr Lee Butler"
How can I do this?
Many thanks, Neil.
An example for the CLI value:
var=$(xmllint --html --xpath '//th[contains(., "CLI")]/../td/text()' file.html)
echo "$var"
For the multi-<tr> part:
$ for key in {4..6}; do
xmllint \
--html \
--xpath "//th[contains(., 'CLI')]/../../tr[$key]/td/text()" file.html
printf ' '
done
echo
Output:
Mr Lee Butler
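Since the question asked for sed/awk, here is a hedged awk sketch of the same lookup that also merges the split name cells. The parse_table name is mine, and it assumes one tag per line, as in the snippet: remember the most recent non-empty <th> label and attach the following <td> values to it, so "Mr", "Lee", "Butler" all collect under "Caller's Name:".

```shell
# Pair each <td> with the last non-empty <th> label seen; empty labels
# inherit the previous one. Prints "label<TAB>value" lines on stdout.
parse_table() {
    awk '
        function strip(s) {
            gsub(/&nbsp;/, "", s)
            gsub(/^[ \t]+/, "", s); gsub(/[ \t]+$/, "", s)
            return s
        }
        /<th/ {                                    # header cell: remember its label
            line = $0
            sub(/.*<th[^>]*>/, "", line); sub(/<\/th>.*/, "", line)
            line = strip(line)
            if (line != "") key = line
            next
        }
        /<td/ {                                    # data cell: append to current label
            line = $0
            sub(/.*<td[^>]*>/, "", line); sub(/<\/td>.*/, "", line)
            line = strip(line)
            if (line != "")
                val[key] = (val[key] == "") ? line : val[key] " " line
        }
        END { for (k in val) printf "%s\t%s\n", k, val[k] }
    '
}

# e.g. caller_name=$(parse_table < file.html | grep "^Caller" | cut -f2)
```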