Unterminated address regex - misapplying escape characters in bash sed script? - bash

Just learning sed, and I feel like I'm getting close to doing what I want, just missing something obvious.
The objective is to take a bunch of <tr>...</tr> rows from an HTML table and append them to the single table in another page. So I want to take the initial file, strip everything above the first <tr> and everything from </table> on down, then insert the result just above the </table> in the other file. Like below, except <tr> and </tr> are on their own lines, if it matters.
Input File:
<html><body>
<p>Whatever...</p>
<table>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</table>
</body></html>
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
Becomes:
Input File: doesn't matter.
Target File:
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</tbody>
</table>
</body></html>
Here's the code I'm trying to use:
#!/bin/bash
# $1 and $2 are the first and second parameters passed when calling the script.
input=$1
inserttarget=$2
sed -e '/\<\/thead\>,$input' $input
sed -e '/\<\/table\>,$input' $input
sed -n -i -e '\<\/tbody\>/r' $inserttarget -e 1x -e '2,${x;p}' -e '${x;p}' $input
Pretty sure it's pretty simple, just messing the expression up. Can anyone set me straight?

Here I cut the problem in two:
1. Cut the rows from the input
2. Paste those rows in the output file
sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p'
This line extracts all lines containing <tr> from the block ranging from the first line matching <table> to the first line matching </table>, and prints them to standard output. The input file itself is not modified. Note that this assumes each row sits on a single line, as in the example.
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget}
This multi-line command will add the lines read from stdin after the line matching </tbody>. Then we move the </tbody> by appending it after the new lines and removing the old one.
Another trick used here is to replace the default regex delimiter / with : (written \:...: in an address), so that / can appear unescaped in the matching pattern.
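As a minimal sketch of the delimiter trick and of range addresses (the sample HTML and variable names below are made up for illustration): an address has to be closed by the same delimiter that opens it, which is why the question's '/\<\/thead\>,$input' is reported as an unterminated address regex. A $ on its own as the second address means "the last line", and shell variables are not expanded inside single quotes.

```shell
# Hypothetical sample input with a table in the middle.
html='<p>intro</p>
<table>
<tr><td>4</td></tr>
</table>
<p>outro</p>'

# Default delimiter: the slash inside </table> has to be escaped.
with_slash=$(printf '%s\n' "$html" | sed -n '/<\/table>/p')

# Custom delimiter: \: opens the address and : closes it, no escaping needed.
with_colon=$(printf '%s\n' "$html" | sed -n '\:</table>:p')

# Range address: delete from the </table> line through the last line ($).
before_table_end=$(printf '%s\n' "$html" | sed '\:</table>:,$d')

printf '%s\n' "$with_slash" "$with_colon" "$before_table_end"
```

Both address forms select the same line; the range form keeps only the three lines above </table>.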
Final solution:
sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' ${inserttarget} < <(sed -n '\:<table>:,\:</table>:p' ${input} | sed -n '\:<tr>:p')
Et voila!
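To try the two steps without touching real pages, here is a sketch that rebuilds the example files from the question in a scratch directory and runs the same commands on them. It assumes GNU sed (for -i, r /dev/stdin and the one-line a text form); the file names are placeholders, and the rows are piped in rather than passed with < <(...), which should be equivalent.

```shell
#!/bin/bash
# Scratch copies of the question's example files (names are hypothetical).
workdir=$(mktemp -d)
input="$workdir/input.html"
inserttarget="$workdir/target.html"

cat > "$input" <<'EOF'
<html><body>
<p>Whatever...</p>
<table>
<tr><td>4</td></tr>
<tr><td>5</td></tr>
<tr><td>6</td></tr>
</table>
</body></html>
EOF

cat > "$inserttarget" <<'EOF'
<html><body>
<p>Other whatever...</p>
<table>
<thead>
<tr><th>#</th></tr>
</thead>
<tbody>
<tr><td>1</td></tr>
<tr><td>2</td></tr>
<tr><td>3</td></tr>
</tbody>
</table>
</body></html>
EOF

# Step 1: collect the rows from the input table.
rows=$(sed -n '\:<table>:,\:</table>:p' "$input" | sed -n '\:<tr>:p')

# Step 2: replace the </tbody> line with the rows followed by a new </tbody>.
printf '%s\n' "$rows" | sed -i '\:</tbody>: {
r /dev/stdin
a </tbody>
d}' "$inserttarget"

merged=$(cat "$inserttarget")
printf '%s\n' "$merged"
rm -rf "$workdir"
```

Working on copies first is a cheap safeguard, since sed -i overwrites the target file in place.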

Related

Extract text between two words on a specific line

I'm trying to write a Linux bash script that downloads an HTML page, extracts numbers from it and assigns them to variables.
The HTML page has several lines, but I'm interested in these:
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 1</strong></td>
<td width="132">
<div align="right"><strong>61</strong></div></td>
</tr>
<tr>
<td width="16"><img src="img/ico_message.gif"></td>
<td width="180"><strong> TIME 2</strong></td>
<td width="132">
<div align="right"><strong>65</strong></div></td>
</tr>
</table></td>
Every time I download the page I have to read the two values in rows 5 and 11, between strong> and </strong (61 and 65 in this example, but they are different each time).
The two values extracted from the HTML must be assigned to two variables.
Thanks for any ideas.
Let's assume we have a page called page.html. You can first select the matching parts with grep, then extract the values with sed, and finally pick every other value with awk:
$ var0=$(cat page.html |\
grep -Ee "<strong>[0-9]+</strong>" -o |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==1')
$ var1=$(cat page.html |\
grep -Ee "<strong>[0-9]+</strong>" -o |\
sed -Ee "s/<strong>([0-9]+)<\/strong>/\1/g" |\
awk 'NR%2==0')
output:
$ echo $var0
61
$ echo $var1
65
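The pipeline above runs twice, once per variable. As a sketch of the same idea with a single pass, the values can be captured once and then picked by line number. The temporary file stands in for page.html and holds a shortened copy of the fragment from the question; the variable name values is made up for this example.

```shell
# Stand-in for page.html (shortened copy of the question's fragment).
page=$(mktemp)
cat > "$page" <<'EOF'
<tr>
<td width="180"><strong> TIME 1</strong></td>
<td width="132">
<div align="right"><strong>61</strong></div></td>
</tr>
<tr>
<td width="180"><strong> TIME 2</strong></td>
<td width="132">
<div align="right"><strong>65</strong></div></td>
</tr>
EOF

# grep -o keeps only the numeric <strong> cells; sed strips the tags around them.
values=$(grep -Eo '<strong>[0-9]+</strong>' "$page" | sed -E 's/<[^>]*>//g')

# First value on line 1, second value on line 2.
var0=$(printf '%s\n' "$values" | sed -n '1p')
var1=$(printf '%s\n' "$values" | sed -n '2p')

echo "$var0 $var1"
rm -f "$page"
```

The " TIME 1" cells are skipped because they contain more than digits between the tags.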
This might work for you (GNU sed):
sed -rn '/TIME/{:a;N;5bb;11bb;ba;:b;s/.*TIME ([^<]*).*<strong>([^<]*).*/var\1=\2/p}' file
The number that follows TIME in the page becomes part of the variable name, so the two results can be told apart: for the fragment above, the command prints var1=61 and var2=65.

How to replace any text between html tags

I have text between HTML tags. For example:
<td>vip</td>
Any text may appear between the <td></td> tags.
How can I cut the text out of these tags and put other text in its place?
I need to do it via bash/shell.
First of all, I tried to extract the text, but without success:
sed -n "/<td>/,/<\/td>/p" test.txt
As a result I get
<td>vip</td>
but according to the documentation, I should get only vip.
You can try this:
sed -i -e 's/\(<td>\).*\(<\/td>\)/<td>TEXT_TO_REPLACE_BY<\/td>/g' test.txt
Note that this only works for <td> tags. It matches everything from <td> to </td>, tags included, and writes the tags back with TEXT_TO_REPLACE_BY between them.
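One thing to watch for with this pattern: .* is greedy, so on a line with several cells it runs from the first <td> to the last </td>. A small sketch of the difference, using a made-up line with two cells and [^<]* as a stand-in for "anything up to the next tag":

```shell
# Hypothetical line with two cells.
line='<td>vip</td><td>guest</td>'

# Greedy .* swallows everything between the first <td> and the last </td>.
greedy=$(printf '%s\n' "$line" | sed 's/<td>.*<\/td>/<td>NEW<\/td>/g')

# [^<]* stops at the next tag, so each cell is replaced on its own.
per_cell=$(printf '%s\n' "$line" | sed 's/<td>[^<]*<\/td>/<td>NEW<\/td>/g')

printf '%s\n' "$greedy" "$per_cell"
```

The first result is a single <td>NEW</td>; the second keeps both cells. The [^<]* form assumes the cell text contains no nested tags.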
You can use this to get the value vip
sed -e 's,.*<td>\([^<]*\)</td>.*,\1,g'
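As a sketch of why the range address from the question still shows the tags while a substitution does not (the one-line sample is taken from the question; the variable names are made up):

```shell
line='<td>vip</td>'

# A range address only selects whole lines, so the tags are printed too.
whole_line=$(printf '%s\n' "$line" | sed -n '/<td>/,/<\/td>/p')

# s///p with a capture group keeps only what sits between the tags.
value=$(printf '%s\n' "$line" | sed -n 's,.*<td>\([^<]*\)</td>.*,\1,p')

printf '%s\n' "$whole_line" "$value"
```

Here -n together with the p flag prints only lines where the substitution succeeded, which also drops lines without a <td> cell.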
If your Input_file is the same as the example shown, the following may help too.
echo "<td>vip</td>" | awk -F"[><]" '{print $3}'
This prints the tag with echo, then uses awk with > and < as field separators and prints the 3rd field, which is the text you asked for.
d=$'<td>vip</td>\n<table>vip</table>\n<td>more data here</td>'
echo "$d"
<td>vip</td>
<table>vip</table>
<td>more data here</td>
awk '/<td>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>something</td>
<table>vip</table>
<td>something</td>
awk '/<table>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>vip</td>
<table>something</table>
<td>more data here</td>

Extract data between tags in shell script and store in array?

I'm using a bash script to obtain a value from a URL and it's returning a value in html tags form. Input:
<tr><td title='The name of the health check service.'>hc.name</td><td data-type='java.lang.String'>Replication Queue</td></tr>
<tr><td title='The persistence identifier of the service.'>service.pid</td><td data-type='java.lang.String'>com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck</td></tr>
<tr><td title='The health check result'>ok</td><td data-type='java.lang.Boolean'>true</td></tr>
<tr><td title='The health check status'>status</td><td data-type='java.lang.String'>OK</td></tr>
<tr><td title='The elapsed time in miliseconds'>elapsedTime</td><td data-type='java.lang.Long'>18</td></tr>
<tr><td title='The date when the execution finished'>finishedAt</td><td data-type='java.util.Date'>2017-03-24T00:23:36+0530</td></tr>
<tr><td title='Indicates of the execution timed out'>timedOut</td><td data-type='java.lang.Boolean'>false</td></tr>
The desired output should be stored in a variable with the values between <td> tags from the above code:
values=( ["hc.name"]="Replication Queue" ["status"]="OK")
I tried to use this sed code, but it only works when multiple <td></td> tags are on separate lines. In this case, multiple <td></td> are on the same line.
sed -n 's:.*<td>(.*)</td>.*:\1:p' sample.txt
That command displays results only with input like this:
<tr>
<td>ok</td>
<td>status</td>
</tr>
I think you will have better luck using Perl regular expressions for this, since they support non-greedy matching. Here is a Perl one-liner that prints the information from your file:
perl -ne 'm:.*?<td [^>]*>(.*?)</td>.*?<td [^>]*>(.*?)</td>:; print "[\"$1\"] = \"$2\"\n";' sample.txt
This outputs:
["hc.name"] = "Replication Queue"
["service.pid"] = "com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck"
["ok"] = "true"
["status"] = "OK"
["elapsedTime"] = "18"
["finishedAt"] = "2017-03-24T00:23:36+0530"
["timedOut"] = "false"
Here is a sed invocation that also works, but this is less precise as it matches all characters except > and < to approximate non-greedy matching, which is not supported in sed.
sed -n 's:.*<td [^>]*>\([^<]*\)</td><td [^>]*>\([^<]*\)</td>.*:[\"\1\"] = \"\2\":p' sample.txt
This outputs:
["hc.name"] = "Replication Queue"
["service.pid"] = "com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck"
["ok"] = "true"
["status"] = "OK"
["elapsedTime"] = "18"
["finishedAt"] = "2017-03-24T00:23:36+0530"
["timedOut"] = "false"
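Neither command above actually fills a bash array. As a sketch of that last step, the sed expression can emit key|value pairs that a while read loop stores in an associative array. This assumes bash 4 or later and that neither keys nor values contain the | character, which is chosen here only as a convenient separator; the sample file holds two of the rows from the question.

```shell
#!/bin/bash
# Two rows copied from the question, written to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr><td title='The name of the health check service.'>hc.name</td><td data-type='java.lang.String'>Replication Queue</td></tr>
<tr><td title='The health check status'>status</td><td data-type='java.lang.String'>OK</td></tr>
EOF

declare -A values

# sed prints "key|value" for each row; read splits on the | separator.
while IFS='|' read -r key value; do
  values["$key"]=$value
done < <(sed -n 's:.*<td [^>]*>\([^<]*\)</td><td [^>]*>\([^<]*\)</td>.*:\1|\2:p' "$sample")

echo "${values[hc.name]} / ${values[status]}"
rm -f "$sample"
```

Process substitution keeps the loop in the current shell, so the array is still populated after the loop ends.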
sgrep and sed method (more reliable than pure sed):
sgrep -o'%r"' '">" __ "<"' sample.txt | sed 's/^/["/;s/""/"/;s/"/"]="/2'
Output:
["hc.name"]="Replication Queue"
["service.pid"]="com.adobe.granite.replication.hc.impl.ReplicationQueueHealthCheck"
["ok"]="true"
["status"]="OK"
["elapsedTime"]="18"
["finishedAt"]="2017-03-24T00:23:36+0530"
["timedOut"]="false"

How to extract data from <td> cols into an array using bash?

I have managed to extract data from a website, then get the relevant data from the extracted webpage. Now I am stuck on how to extract data from the <td> columns into an array for data manipulation.
My extracted HTML is the following:
<tbody>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Lorem Ipsum</td>
<td>Off Campus</td>
<td>OFF</td>
<td>12 of 999 </td>
<td> </td>
<td> </td>
<td>Get</td>
</tr>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Dolor Sit Amet</td>
<td>Mount Lawley</td>
<td>ON</td>
<td>45 of 999 </td>
<td>Activity</td>
<td> </td>
<td>Get</td>
</tr>
</tbody>
I am doing this using a bash script as I must do it via bash only.
To parse HTML or XML, you are better off using dedicated command-line tools such as xmlstarlet or xmllint.
But with your HTML sample, you can try this:
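# -t strips the trailing newline from each array element
mapfile -t td < <(sed -n 's/[\t ]*<td[^>]*>\(.*\)<\/td>/\1/p' file)
# "${td[@]}" expands to every element; '%s\n' avoids treating data as a format string
for cell in "${td[@]}"; do
printf '%s\n' "$cell"
done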
sed extracts the contents of every td and passes the result to mapfile using process substitution.
mapfile stores each line from the process substitution in an array variable named td.
It works with your simple HTML as long as there is:
one td tag per line
opening and closing td on the same line
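Once the cells are in a flat array, rows can be addressed by arithmetic on the index. The sketch below assumes every row has the same number of cells (three in this shortened copy of the sample); the names cols and rows are made up for the example, and [[:space:]] is used in place of [\t ] for the leading whitespace.

```shell
#!/bin/bash
# Shortened copy of the question's table: two rows of three cells.
html=$(mktemp)
cat > "$html" <<'EOF'
<tbody>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Lorem Ipsum</td>
</tr>
<tr>
<td>abc3207</td>
<td>151</td>
<td>Dolor Sit Amet</td>
</tr>
</tbody>
EOF

cols=3   # assumption: fixed number of cells per row
mapfile -t td < <(sed -n 's/[[:space:]]*<td[^>]*>\(.*\)<\/td>/\1/p' "$html")
rows=$(( ${#td[@]} / cols ))

# Cell c of row r lives at index r * cols + c in the flat array.
for (( r = 0; r < rows; r++ )); do
  echo "row $r: ${td[r * cols + 2]}"
done
rm -f "$html"
```

If rows can have different lengths, this indexing no longer holds and a real HTML parser is the safer choice.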

How to use sed to insert in a line before the matching pattern

I am dealing with some HTML code and I got stuck on a problem. Here is an extract of the code; the format is exactly the same:
<tr>
<td nowrap valign="top" class="table_1row"><a name="d071301" id="d071301"></a>13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
Here I have to match
<tr>
<td nowrap valign="top"
and insert something before <tr>. The problem is that I have to match a pattern that spans two lines.
I have tried
grep -c "<tr>\n<td nowrap valign="top"" test.html
grep -c "<tr>\n*<td nowrap valign="top"" test.html
grep -c "<tr>*<td nowrap valign="top"" test.html
to test, but none of them works. So I see two ways to approach the problem:
Match <td nowrap valign="top" and insert on the line above
Match the whole string
<tr>
<td nowrap valign="top"
Could anyone suggest a way of doing it either way?
Using sed you can perform a replacement across multiple lines. It's also easy to reuse the match in the substitution.
sed "/\s*<tr>\s*/ { N; s/.*<tr>\n\s*<td.*/insertion\n&/ }"
This cryptic line basically says:
match a line containing <tr> (/\s*<tr>\s*/)
append the next line to the pattern space (N)
substitute the matched pattern with the insertion followed by the matched string, where & represents the matched string (s/.*<tr>\n\s*<td.*/insertion\n&/)
Sed is very powerful for performing substitutions, and it's a nice tool to know. See this manual if you want to learn more about sed:
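The steps above can be tried on a small sample. This sketch uses a simplified copy of the question's HTML, [[:space:]] instead of the GNU-only \s, and a backslash followed by a newline to produce the line break in the replacement; the <!-- inserted --> marker is a placeholder for whatever has to be inserted.

```shell
# Simplified copy of the question's HTML (attributes shortened).
html='<table>
<tr>
<td nowrap valign="top" class="table_1row">13-Jul-2011</td>
<td width="21%" valign="top" class="table_1row">LCQ8: Personal data of job</td>
</tr>'

# On a <tr> line: pull in the next line (N), then put the marker in front
# of the two-line match (&) if the second line starts the wanted <td>.
result=$(printf '%s\n' "$html" | sed '/<tr>/ {
N
s/.*<tr>\n[[:space:]]*<td nowrap valign="top".*/<!-- inserted -->\
&/
}')

printf '%s\n' "$result"
```

A <tr> whose following line does not start with the wanted <td> is left unchanged, because the substitution simply fails to match.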
http://www.grymoire.com/Unix/Sed.html
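Try grep -Pzo "tr>\s*\n\s*<td". The -z option (GNU grep) treats the whole file as a single record, which a pattern containing \n needs, because grep otherwise matches one line at a time.
It's not clear how this will help you insert something before <tr>, but anyway.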
Quoted strings do not nest; you need to escape the inner quote characters, or use single quotes around the pattern instead of double quotes.
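A short sketch of both quoting options, run against a scratch file that stands in for test.html (the file contents are shortened from the question):

```shell
# Scratch stand-in for test.html.
file=$(mktemp)
printf '%s\n' '<tr>' '<td nowrap valign="top" class="table_1row">13-Jul-2011</td>' > "$file"

# Single quotes: the double quotes inside are plain characters.
single=$(grep -c '<td nowrap valign="top"' "$file")

# Double quotes: each inner quote has to be escaped with a backslash.
escaped=$(grep -c "<td nowrap valign=\"top\"" "$file")

echo "$single $escaped"
rm -f "$file"
```

Both commands count the same single matching line; in the question's version the unescaped inner quotes end the string early, so grep receives a different pattern than intended.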

Resources