Fetching all text that matches a pattern using shell - bash

I have an HTML file that I have to read line by line. I then need to run a script that matches a class attribute of a span tag and returns the text enclosed by the span, along with the line number on which it appears.
The following is a single line from my .html file:
<span id="L9_454" class="e"><span class="ln">454</span><span class="bar"></span> <span class="k">if</span> ( (strncmp(<span class="fm" value="2705">p_rout</span>-><span class="fm" value="186">source_corresp</span>.<span class="fm" value="105">name</span>, <span class="fm" value="5190">IL_LOWERING_INIT_ROUTINE_PREFIX</span>, strlen(<span class="fm" value="5190">IL_LOWERING_INIT_ROUTINE_PREFIX</span>)) == 0) </span>
I need to run the script on every line and check whether class="fm" is set on any span tag. If it is, I need to dump the line number (454 in the example above) and the text of each span with class="fm" (p_rout, source_corresp, name, IL_LOWERING_INIT_ROUTINE_PREFIX and IL_LOWERING_INIT_ROUTINE_PREFIX) into an .xml file.
I know how to dump the data, but I don't know how to get the required text. I tried awk but couldn't work out which regex to match. Any other filter would also work. Please help.

awk '$1 ~ /fm/ {print $2}' RS=span FS='[<>]'
Set the record separator (RS) to span.
Set the field separator (FS) to < or >.
If field one contains fm, print field two.
Result:
p_rout
source_corresp
name
IL_LOWERING_INIT_ROUTINE_PREFIX
IL_LOWERING_INIT_ROUTINE_PREFIX
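The question also asks for the line number, which the one-liner above does not print. Here is a minimal sketch of one way to get both, assuming each input line carries its number in a <span class="ln"> element as in the sample. The function name extract_fm and the tab-separated output are illustrative choices, not part of the original answer:

```shell
# extract_fm: print "<line number><TAB><text>" for every span with class="fm".
# Reads the files given as arguments, or standard input if there are none.
extract_fm() {
  awk '
  {
    ln = ""
    # The source line number sits inside <span class="ln">...</span>;
    # the opening tag is 17 characters long.
    if (match($0, /<span class="ln">[0-9]+/)) {
      ln = substr($0, RSTART + 17, RLENGTH - 17)
    }
    rest = $0
    # Walk through every class="fm" span on the current line.
    while (match(rest, /<span class="fm"[^>]*>[^<]*/)) {
      hit = substr(rest, RSTART, RLENGTH)
      sub(/^[^>]*>/, "", hit)          # drop the opening tag, keep the text
      print ln "\t" hit
      rest = substr(rest, RSTART + RLENGTH)
    }
  }' "$@"
}
```

Running extract_fm file.html on the sample line should give one row per match (454 and p_rout separated by a tab, and so on), which can then be wrapped in whatever XML layout is needed.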

Related

How to use sed to extract the specific substring?

<div class="panel-body" id="current-conditions-body">
<!-- Graphic and temperatures -->
<div id="current_conditions-summary" class="pull-left" >
<img src="newimages/large/sct.png" alt="" class="pull-left" />
<p class="myforecast-current">Partly Cloudy</p>
<p class="myforecast-current-lrg">64&deg;F</p>
<p class="myforecast-current-sm">18&deg;C</p>
I am trying to extract the "64" in line 6. I was thinking of using awk '/<p class="myforecast-current-lrg">/{print}', but this only gives me the full line. I think I need to use sed, but I don't know how to use it.
Assumptions:
the input is nicely formatted as per the sample provided by the OP, so we can use some 'simple' pattern matching
Modifying the OP's current awk code:
# use split() function to break line using dual delimiters ">" and "&"; print 2nd array entry
awk '/<p class="myforecast-current-lrg">/{ n=split($0,arr,"[>&]");print arr[2]}'
# define dual input field delimiter as ">" and "&"; print 2nd field in line that matches search string
awk -F'[>&]' ' /<p class="myforecast-current-lrg">/{print $2}'
Both of these generate:
64
One sed idea:
sed -En 's/.*<p class="myforecast-current-lrg">([^&]+)&deg.*/\1/p'
This generates:
64
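If the degree sign might be written differently from page to page, a variation is to capture only the leading digits after the tag. This is a sketch under that assumption; the function name current_temp_f is made up for illustration:

```shell
# current_temp_f: print the number that follows the
# <p class="myforecast-current-lrg"> tag, whatever comes after the digits.
# Reads the files given as arguments, or standard input if there are none.
current_temp_f() {
  sed -En 's/.*<p class="myforecast-current-lrg">(-?[0-9]+).*/\1/p' "$@"
}
```

Usage would be something like current_temp_f page.html. Lines without the myforecast-current-lrg paragraph print nothing because of -n and the p flag.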

Converting CSV file to multiline text file

I have file which looks like following:
C_DocType_ID,SOReference,DocumentNo,ProductValue,Quantity,LineDescription,C_Tax_ID,TaxAmt
1000000,1904093563U,1904093563U,5210-1,1,0,1000000,0
1000000,1904093563U,1904093563U,6511,2,0,1000000,0
1000000,1904093563U,1904093563U,5001,1,0,1000000,0
1000000,1904083291U,1904083291U,5310,4,0,1000000,0
1000000,1904083291U,1904083291U,5311,3,0,1000000,0
1000000,1904083291U,1904083291U,6101,6,0,1000000,0
1000000,1904083291U,1904083291U,6102,1,0,1000000,0
1000000,1904083291U,1904083291U,6106,6,0,1000000,0
I need to convert it to text file which looks like this:
WOH~1.0~~1904093563Utest~~~ORD~~~~
WOL~~~5210-1~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6511~~~~~~~~2~~~~~~~~~~~~~~~~~~~~~
WOL~~~5001~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOH~1.0~~1904083291Utest~~~ORD~~~~~~
WOL~~~5310~~~~~~~~4~~~~~~~~~~~~~~~~~~~~~
WOL~~~5311~~~~~~~~3~~~~~~~~~~~~~~~~~~~~~
WOL~~~6101~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
WOL~~~6102~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6106~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
The output file has header records and line item records. A header record contains the SOReference and some hardcoded fields, and a line item record contains the ProductValue and Quantity associated with that SOReference. The input file has 2 unique SOReferences, which is why the output file contains 2 header records and their associated line item records.
Can this be done from the command line (awk/sed)? I have a series of files like this one that need to be converted to text.
With AWK, please try the following:
awk -F, '
FNR==1 {next}                     # skip the header line
{
    if ($2 != prevcol2) {         # insert newline when SOReference changes
        nl = FNR<=2 ? "" : "\n"   # suppress the newline in the 1st line
        printf("%sWOH~1.0~~%stest~~~ORD~~~~\n", nl, $2)
    }
    printf("WOL~~~%s~~~~~~~~%s~~~~~~~~~~~~~~~~~~~~~\n", $4, $5)
    prevcol2 = $2
}' file.csv
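Since the question mentions a series of files, the same logic can be wrapped in a loop. The sketch below assumes all the CSV files sit in one directory and that writing name.txt next to name.csv is acceptable; the function names convert_csv and convert_all are hypothetical:

```shell
# convert_csv: same logic as the awk answer above, reading one CSV file.
convert_csv() {
  awk -F, '
    FNR == 1 { next }                  # skip the header line
    $2 != prev {                       # new SOReference: emit a header record
      if (seen++) print ""             # blank line between groups, as in the answer
      printf "WOH~1.0~~%stest~~~ORD~~~~\n", $2
    }
    {
      printf "WOL~~~%s~~~~~~~~%s~~~~~~~~~~~~~~~~~~~~~\n", $4, $5
      prev = $2
    }
  ' "$1"
}

# convert_all DIR: write DIR/name.txt for every DIR/name.csv.
convert_all() {
  for f in "$1"/*.csv; do
    [ -e "$f" ] || continue            # no matches: the glob stays literal
    convert_csv "$f" > "${f%.csv}.txt"
  done
}
```

For example, convert_all /path/to/csvdir would process every .csv file in that directory. Like the answer above, this relies on rows with the same SOReference being adjacent in the input.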

bulk update form field names and IDs with bash, grep and sed

I need to update almost 100 HTML pages that contain 15-20 form fields each.
To pass Section 508 compliance, they all need to be uniquely named.
Each form-group has three of the same values for attributes like this:
<label for="input-title" class="control-label">Title*</label>
<input class="form-control" id="input-title" name="input-title" value="SA Analyst" required>
Notice the for, name and id attributes are all the same.
I just need it to be something like this with an incrementing number at the end:
<label for="input-title21" class="control-label">Title*</label>
<input class="form-control" id="input-title21" name="input-title21" value="SA Analyst" required>
The challenge is to:
- loop through all form fields in an HTML file (see regex below)
- update each "form-group" with an incrementing number appended to the end of each of the three attribute values "for, name and id"
- make sure each form-group has the same appended, incremented number (i.e. every three attributes would get the same number in the current loop)
Here is the starting bash code I am working with:
#!/bin/bash
FILES=/Users/Administrator/files/*.html
counter=1
for f in $FILES
do
echo "Processing $f file..."
# take action on each file. $f store current file name
# cat $f
# sed 's/<input/<input2/g' $f > $f.txt
sed "s/<input/<input$counter/g" $f > $f.txt
echo $counter
((counter++))
done
echo All done
This code successfully updates the input with the counter variable number and saves it to a .txt file but this is not yet the solution since it updates ALL input fields in the HTML file with the same incremented number.
Here is the regex I came up with that finds the form groups that need to be changed:
(.*for\=")([0-9A-Za-z-]+)(".*\n\s*[0-9A-Za-z\<\>\-\=\"\s]*[id=|name=]")([0-9A-Za-z-]+)(".*[id=|name=]")([0-9A-Za-z-]+)("\s[type|req])
So how do I integrate this regex with the bash code above and update the three attributes in every form-group?
With mawk:
scriptfile1:
/label for=\"input-title\"/ {
    num++
}
{
    gsub("label for=\"input-title\"","label for=\"input-title"num"\"")
    gsub("id=\"input-title\"","id=\"input-title"num"\"")
    gsub("name=\"input-title\"","name=\"input-title"num"\"")
    print
}
Here we increment a counter (num) every time we encounter the text label for="input-title". We then look for the three instances of input-title in each form-group (for=, id= and name=) using gsub and append the num variable to each. Finally, we print the rebuilt line.
Run with:
awk -f scriptfile1 sourcedatafilename
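The script above is tied to input-title. To cover every form-group, one option is to pick up whatever value the label's for= attribute carries and reuse it for the id= and name= substitutions. This is only a sketch: it assumes each group starts with a <label for="..."> line, that the value contains no regex metacharacters, and the function name number_fields is invented here:

```shell
# number_fields: append an incrementing number to the for=, id= and name=
# values of each form-group. Reads the files given as arguments, or
# standard input if there are none.
number_fields() {
  awk '
  {
    # A <label for="..."> starts a new form-group: bump the counter and
    # remember which attribute value this group uses.
    if (match($0, /for="[^"]+"/)) {
      num++
      val = substr($0, RSTART + 5, RLENGTH - 6)
    }
    if (val != "") {
      # Append the counter wherever for=, id= or name= carries that value.
      gsub("for=\"" val "\"", "for=\"" val num "\"")
      gsub(" id=\"" val "\"", " id=\"" val num "\"")
      gsub(" name=\"" val "\"", " name=\"" val num "\"")
    }
    print
  }' "$@"
}
```

Inside the bash loop from the question this could be called as number_fields "$f" > "$f.txt". The counter restarts for each awk invocation, so numbers are unique within a page but not across pages.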

Extract subset of a feed file with custom delimiter and create CSV file

I get a feed file in below format.
employee_id||034100151730105|L|
employee_cd||03410015|L|
dept_id||1730105|L|
dept_name||abc|L|
employee_firstname||pqr|L|
employee_lastname||ppp|L|
|R||L|
employee_id||034100151730108|L|
employee_cd||03410032|L|
dept_id||4230105|L|
dept_name||fdfd|L|
employee_firstname||sasas|L|
employee_lastname||dfdf|L|
|R||L|
.....
Is there an easy Unix script to extract a subset of fields and create a CSV like the one below?
employee_cd,employee_firstname,dept_name
03410015,pqr,abc
03410032,sasas,fdfd
.....
I would suggest an awk solution (assuming that the dept_name item always comes before the employee_firstname item):
awk -F'|' 'BEGIN{OFS=","; print "employee_cd,employee_firstname,dept_name";}
$1~/employee_cd|employee_firstname|dept_name/{ a[++c]=$3 }
END { for(i=1;i<length(a);i+=3) print a[i],a[i+2],a[i+1] }' file
The output:
employee_cd,employee_firstname,dept_name
03410015,pqr,abc
03410032,sasas,fdfd
Solution details:
OFS="," - setting output field separator
$1~/employee_cd|employee_firstname|dept_name/ - if first column matches one of the needed items
a[++c]=$3 - capturing an item value indexed by consecutive position
for(i=1;i<length(a);i+=3) print a[i],a[i+2],a[i+1] - outputting item values by threes
To save the output as .csv file:
the above command > output.csv
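If the order of the items inside a record cannot be relied on, a variation is to store each value under its field name and print when the |R||L| line closes the record. This is a sketch assuming every record ends with that marker line; the function name feed_to_csv is hypothetical:

```shell
# feed_to_csv: collect values by field name and emit one CSV row per record.
# Reads the files given as arguments, or standard input if there are none.
feed_to_csv() {
  awk -F'|' '
    BEGIN { OFS = ","; print "employee_cd,employee_firstname,dept_name" }
    $2 == "R" {                        # "|R||L|" closes one employee record
      print v["employee_cd"], v["employee_firstname"], v["dept_name"]
      delete v["employee_cd"]
      delete v["employee_firstname"]
      delete v["dept_name"]
      next
    }
    $1 != "" { v[$1] = $3 }            # remember the value under its field name
  ' "$@"
}
```

As before, feed_to_csv file > output.csv would save the result.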

Multiple occurrences in sed substitution

I am trying to retrieve some data within a specific div tag in my html file.
My current html code is in the following format.
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
When I try to extract the text in just the div with class2, using the bash code
sed -e ':a;N;$!ba
s/[[:space:]]\+/ /g
s/.*<div class\="class2">\(.*\).*/\1/g' test.html > out.html
I get the output html file with the following content:
some text some text </div> Some more text </div> Too much text
I want all the data after the first </div> to be removed, but instead only the final one is being replaced.
Can someone please explain my mistake?
You could do this in awk:
awk '/class2/,/<\/div>/ {a[++i]=$0}END{for (j=2;j<i;++j) print a[j]}' file
Between the lines that match /class2/ and /<\/div>/, write the contents to an array. At the end of the file loop through the array, skipping the first and last lines.
Instead of making an array, you could check for the first and last lines using a regular expression:
awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' file
This works for retrieving text inside the div class = "class2" tags
#!/bin/bash
htmlcode='
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
'
echo $htmlcode |
sed -e's,<,\
<,g' |
grep 'div class = "class2"' |
sed -e's,>,>\
,g'|
grep -v 'div class = "class2"'
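For a sed-only route closer to the original attempt, note that .* is greedy, so a pattern that matches up to a </div> tends to reach the last one. One way around this is to delete in two steps: first everything up to the opening tag, then everything from the first remaining </div> onward, which works because a regex match starts at the leftmost possible position. This is a sketch assuming a single class2 div per file; the function name inner_class2 is made up:

```shell
# inner_class2 FILE: print the text inside the div whose class is class2.
inner_class2() {
  # Flatten the file to one line, then re-add a final newline for sed.
  { tr '\n' ' ' < "$1"; echo; } |
    sed -e 's/.*<div class *= *"class2">//' \
        -e 's/<\/div>.*//' \
        -e 's/^ *//' -e 's/ *$//' -e 's/  */ /g'
}
```

With the sample file from the question, inner_class2 test.html should print only some text some text.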
