I have written an awk/shell script to process an input xml file and output another xml file with the desired elements. While this script works, I would like to simplify it so that I do not use any temporary files and instead pipe the output between commands.
Here's the script.
#extract elements
awk 'BEGIN {FS="[<|>]"} /(elementname).*$/{matchingstring=$0}
{ printf "%s\n", matchingstring}' input.xml > tmp.xml
#sort, uniq, append closing tag (/>)
for i in `cat tmp.xml | awk '{print $2}' |sort | uniq `; do grep -m 1 $i tmp.xml;
done | sort -r | sed "s/>$/\/>/" > tmp2.xml
# Append xml header and root element
awk 'BEGIN {
FS="[<|>]"}
NR==1{
print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
print "<listofelements>"
};
{ printf "%s\n", $0 }
END { print "</listofelements>";}' tmp2.xml > final.xml
Any inputs would be much appreciated.
One of the improvements would be:
awk 'BEGIN {FS="[<|>]"} /(elementname).*$/{matchingstring=$0}
{ printf "%s\n", matchingstring}' input.xml > tmp.xml
Can be replaced with :
awk '/(elementname).*$/' input.xml > tmp.xml
And also this below:
awk 'BEGIN {
FS="[<|>]"}
NR==1{
print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
print "<listofelements>"
};
{ printf "%s\n", $0 }
END { print "</listofelements>";}' tmp2.xml > final.xml
Can be changed to :
awk 'BEGIN {
print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
print "<listofelements>"}
END {print "</listofelements>";}1' tmp2.xml > final.xml
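To drop the temporary files altogether, the three stages can be chained with pipes. This is only a sketch: it assumes the de-duplication key is the second whitespace-separated field (as in the original for loop), and the sample input.xml is made up so the commands can be run as-is. The `!seen[$2]++` test keeps the first line for each key, which stands in for the sort | uniq | grep -m 1 loop.

```shell
# Hypothetical sample input, created only so the sketch is runnable.
workdir=$(mktemp -d)
cat > "$workdir/input.xml" <<'EOF'
<root>
<elementname id="b" x="1">
<other id="z">
<elementname id="a" x="2">
<elementname id="b" x="3">
</root>
EOF

# 1. keep matching lines, first occurrence per second field
# 2. reverse-sort
# 3. turn the trailing ">" into "/>"
# 4. wrap the result in the XML header and root element
awk '/elementname/ && !seen[$2]++' "$workdir/input.xml" |
    sort -r |
    sed 's/>$/\/>/' |
    awk 'BEGIN { print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                 print "<listofelements>" }
         { print }
         END   { print "</listofelements>" }' > "$workdir/final.xml"

cat "$workdir/final.xml"
```

Like the original, the `/elementname/` test also matches closing tags, so a stricter pattern may be needed for real input.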
I have a big CSV file that I need to cut into different pieces based on the value in one of the columns. My input file dataset.csv is something like this:
NOTE: edited to clarify that data is ,data, no spaces.
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
So, to split by action_type I simply do (I need the whole matching line in the resulting file):
awk -F, '$2 ~ /^1$/ {print}' dataset.csv >> 1_dataset.csv
awk -F, '$2 ~ /^2$/ {print}' dataset.csv >> 2_dataset.csv
This works as expected, but I am basically traversing my original dataset twice. My original dataset is about 5GB and I have 30 action_type categories. I need to do this every day, so I need to script it to run efficiently on its own.
I tried the following but it does not work:
# This is a file called myFilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Then I run it as:
awk -f myFilter.awk dataset.csv
But I get nothing. Literally nothing, not even errors. This suggests that my code is simply not matching anything or that my print / redirection statement is wrong.
You may try this awk to do this in a single command:
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
With GNU awk to handle many concurrently open files and without replicating the header line in each output file:
awk -F',' '{print > ($2 "_dataset.csv")}' dataset.csv
or if you also want the header line to show up in each output file then with GNU awk:
awk -F',' '
NR==1 { hdr = $0; next }
!seen[$2]++ { print hdr > ($2 "_dataset.csv") }
{ print > ($2 "_dataset.csv") }
' dataset.csv
or the same with any awk:
awk -F',' '
NR==1 { hdr = $0; next }
{ out = $2 "_dataset.csv" }
!seen[$2]++ { print hdr > out }
{ print >> out; close(out) }
' dataset.csv
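To see what the portable version produces, here is a small run on the sample data from the question. It is a sketch only: the temporary directory is an assumption made for the demo, and the sample rows are written without the stray space and typo.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'action,action_type,Result' \
    'up,1,stringA' 'down,1,stringB' 'left,2,stringC' > "$workdir/dataset.csv"

# Run in a subshell so the output files land next to the sample input.
(
    cd "$workdir" || exit 1
    awk -F',' '
    NR==1 { hdr = $0; next }            # remember the header, do not route it
    { out = $2 "_dataset.csv" }         # file name derived from action_type
    !seen[$2]++ { print hdr > out }     # first row of a category: write header
    { print >> out; close(out) }        # append the row, release the handle
    ' dataset.csv
)

cat "$workdir/1_dataset.csv"
```

The input is read once, and each category file starts with its own copy of the header.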
As currently coded the input field separator has not been defined.
Current:
$ cat myfilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Invocation:
$ awk -f myfilter.awk dataset.csv
There are a couple ways to address this:
$ awk -v FS="," -f myfilter.awk dataset.csv
or
$ cat myfilter.awk
BEGIN {FS=","}
{
action_type=$2
# output file names must be quoted strings in awk
if (action_type=="1") print $0 >> "1_dataset.csv";
else if (action_type=="2") print $0 >> "2_dataset.csv";
}
$ awk -f myfilter.awk dataset.csv
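The effect of the missing field separator is easy to check on one line of the sample data. This is a quick sketch, and the variable names are only for illustration. With the default whitespace splitting, the whole line is a single field, so $2 is empty and neither comparison can match.

```shell
line='up,1,stringA'

# Default FS (whitespace): the whole line is one field, $2 is empty.
default_fs=$(printf '%s\n' "$line" | awk '{ print NF ": [" $2 "]" }')

# FS set to a comma: three fields, $2 holds the action_type.
comma_fs=$(printf '%s\n' "$line" | awk -F, '{ print NF ": [" $2 "]" }')

printf '%s\n%s\n' "$default_fs" "$comma_fs"
```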
I am working with a CSV in bash, and attempting to merge the data in the 2nd column by matched data in the 3rd column.
My code works but the information in the other columns ends up just getting repeated instead of properly copied.
awk -F',' -v OFS=',' '{
env_name=$1
app_name=$4
lob_name=$5
if ($3 in a) {
a[$3] = a[$3]" "$2;
} else {
a[$3] = $2;
}
}
END { for (i in a) print env_name, i, a[i], app_name, lob_name}' input.tmp > output.tmp
This:
A,1,B,C,D
A,2,B,C,D
A,3,E,F,G
A,4,X,Y,Z
A,5,E,F,G
Should become this:
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
But instead we are getting this:
A,1 2,B,C,D
A,3 5,E,C,D
A,4,X,C,D
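Your grouping key should be every field except the second. In your script env_name, app_name and lob_name are plain scalars that get overwritten on every line, so the END block prints the same single set of values for every group.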
$ awk -F, 'BEGIN {SUBSEP=OFS=FS}
{k=$1 FS $3 FS $4 FS $5; a[k]=(k in a)?a[k]" "$2:$2}
END {for(k in a) {split(k,p); print p[1],a[k],p[2],p[3],p[4]}}' file
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
This can perhaps be simplified a bit:
$ awk 'BEGIN {OFS=FS=","}
{v=$2; $2=""; k=$0; a[k]=(k in a?a[k]" "v:v)}
END {for(k in a) {$0=k; $2=a[k]; print}}' file
sed + sort + awk
$ sed 's/,/+/3;s/,/+/3' merge_csv | sort -t, -k3 | awk -F, -v OFS=, ' { if($3==p) { a=a b " "; } if(p!=$3 && NR>1) { print $1,a b,p; a="" } b=$2; p=$3 } END { print $1,a b,p } ' | tr '+' ','
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
$
If Perl is an option, you can try this
$ perl -F, -lane '$x=join(",",@F[-3,-2,-1]); @t=@{$kv{$x}};push(@t,$F[1]);$kv{$x}=[@t]; END { for(keys %kv) { print "A,",join(" ",@{$kv{$_}}),",$_" }} ' merge_csv
A,1 2,B,C,D
A,4,X,Y,Z
A,3 5,E,F,G
$
Input file:
$ cat merge_csv
A,1,B,C,D
A,2,B,C,D
A,3,E,F,G
A,4,X,Y,Z
A,5,E,F,G
$
I am currently working on a script that will look through the output of nm and sum the values of column $1 using the following
read filename
nm --demangle --size-sort --radix=d ~/object/$filename | {
awk '{ sum+= $1 } END { print "Total =" sum }'
}
I want to do this for any number of files, looping through a directory and then printing a summary of the results. I want the result for each file and also the grand total of the first column across all files.
I am limited to using just bash and awk.
You need to put the read in a while ...; do ...; done loop and feed the output of the entire loop to awk.
e.g.
while read filename ; do
nm ... $filename
done | awk '{print $0} { sum+=$1 } END { print "Total="sum}'
The {print $0} rule in awk prints every nm output line, so you can see each file's entries above the total.
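If a separate total per file is also wanted, one option is to let awk sum each file's nm output on its own and have bash keep the running grand total. A rough sketch, assuming the object files are passed as arguments. The names sum_sizes and summarize are hypothetical helpers, not part of the original script.

```shell
# Sum the first column (the decimal symbol size) of nm output read from stdin.
sum_sizes() {
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Print one total per file, then the grand total over all files.
summarize() {
    grand=0
    for f in "$@"; do
        size=$(nm --demangle --size-sort --radix=d "$f" | sum_sizes)
        printf '%s: %s\n' "$f" "$size"
        grand=$((grand + size))
    done
    printf 'Total = %s\n' "$grand"
}
```

It could then be called as, for example, `summarize ~/object/*.o`, where the `*.o` pattern is an assumption about how the files are named.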
The bash globstar option enables recursive file matching,
so you can use a pattern such as **/*.txt as the file argument at the end of the awk command (BEGINFILE requires GNU awk):
$ shopt -s globstar
$ awk '
BEGINFILE {
c="nm --demangle --size-sort --radix=d \"" FILENAME "\""
while ((c | getline) > 0) { fs+=$1; ts+=$1; }
printf "%45s %10'\''d\n",FILENAME, fs
close(c); fs=0; nextfile
} END {
printf "%30s %s\n", " ", "-----------------------------"
printf "%45s %10'\''d\n", "total", ts
}' **/*filename*
I have this bash script:
function getlist() {
grep -E 'pattern' ../fileWithInput.js | sed "s#^regexPattern#\1 \2#" | grep -v :
}
getlist | while read line; do
method=$(echo $line | awk '{ print $1 }')
uri=$(echo $line | awk '{ print $2 }')
grep -vr "$method" .
#echo method: $method uri: $uri
done
Question:
Currently I have many 'pattern' strings. How can I check them against a directory and output only the 'pattern' strings that don't match anything?
What I have example in fileWithInput.js:
'foo','bar','hello'.
~/repo/anotherDirectory:
'foo','bar'.
How to print only strings from fileWithInput.js that are not in /repo/anotherDirectory?
Final output have to be like this:
'hello': 0 matches.
Please help with grep command to do this. Or maybe you have another idea. Thanks for attention and have a nice day!
filei.txt
'foo','bar','hello'.
filem.txt
'foo','bar'.
with awk
awk 'BEGIN{RS="[,\\.]"} NR==FNR{a[$0];next} {delete a[$0]} END{for(i in a){print i": 0 matches."}} ' filei.txt filem.txt
code breakdown:
BEGIN{RS="[,\\.]"} # record separator: a comma or a period
NR==FNR{a[$0];next} # first file: store each value as a key of array a, then skip to the next record
{delete a[$0]} # second file: delete the value from a if it appears here too
END{
for(i in a){
print i": 0 matches."} # print the values that were never matched
}
output:
'hello': 0 matches.
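Since the question asks about a whole directory, a grep-based variant may be closer to what is wanted. The following is a sketch under assumptions: the directory layout and the file name a.txt are invented for the demo, and each quoted string is searched as a fixed string (`-F`) anywhere under the directory (`-r`).

```shell
# Hypothetical layout mirroring the example from the question.
workdir=$(mktemp -d)
mkdir "$workdir/anotherDirectory"
printf '%s\n' "'foo','bar','hello'." > "$workdir/fileWithInput.js"
printf '%s\n' "'foo','bar'." > "$workdir/anotherDirectory/a.txt"

# Split the list on "," and ".", drop empty lines, then test each string.
report=$(tr ',.' '\n\n' < "$workdir/fileWithInput.js" | grep . |
while IFS= read -r pattern; do
    # -q: stay silent, the exit status alone says whether anything matched
    grep -rqF -- "$pattern" "$workdir/anotherDirectory" ||
        printf '%s: 0 matches.\n' "$pattern"
done)

printf '%s\n' "$report"
```

This runs one grep per string, which is fine for a short list but slower than the single awk pass when there are many patterns.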
I'm trying to take the last value in the third column of a CSV file and then replace the whole third column with this value.
I've been trying this:
var=$(tail -n 1 math_ready.csv | awk -F"," '{print $3}'); awk -F, '{$3="$var";}1' OFS=, math_ready.csv > math1.csv
But it's not working and I don't understand why...
Please help!
awk '
BEGIN { ARGV[2]=ARGV[1]; ARGC++; FS=OFS="," }
NR==FNR { last = $3; next }
{ $3 = last; print }
' math_ready.csv > math1.csv
The main problem with your script was trying to access a shell variable ($var) inside your awk script. Awk is not shell; it is a completely separate language/tool with its own namespace and variables. You cannot directly access a shell variable in awk, just as you couldn't access it in C. To access the VALUE of a shell variable you'd do:
shellvar=27
awk -v awkvar="$shellvar" 'BEGIN{ print awkvar }'
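A side-by-side check makes the difference visible. This is a small sketch on a made-up one-line input: inside the single-quoted awk program "$var" is just a string literal, while -v hands awk the shell variable's value.

```shell
var=5

# "$var" inside the awk program is a plain string, not the shell variable.
literal=$(printf '1,2,3\n' | awk -F, -v OFS=, '{ $3 = "$var" } 1')

# -v copies the shell value into an awk variable of the same name.
passed=$(printf '1,2,3\n' | awk -F, -v OFS=, -v var="$var" '{ $3 = var } 1')

printf '%s\n%s\n' "$literal" "$passed"
```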
Some additional cleanup:
When FS and OFS have the same value, don't assign them each to that value separately, use BEGIN{ FS=OFS="," } instead for clarity and maintainability.
Do not initialize variables AFTER the script that uses those variables unless you have a very specific reason to do so. Use awk -F... -v OFS=... 'script' to init those variables to separate values, not awk -F... 'script' OFS=...: it is unnatural to init variables in the code segment AFTER you've used them, and variables inited in the args list at the end are not yet initialized when the BEGIN section is executed, which can cause bugs.
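The BEGIN timing issue can be reproduced in two lines. This is a minimal sketch, and the variable x is arbitrary: an assignment given after the script has not been applied yet when BEGIN runs, whereas one given with -v has.

```shell
# Assignment placed after the script: not yet applied when BEGIN runs.
late=$(awk 'BEGIN { print "[" x "]" }' x=1)

# Assignment made with -v: applied before BEGIN runs.
early=$(awk -v x=1 'BEGIN { print "[" x "]" }')

printf 'after script: %s\nwith -v: %s\n' "$late" "$early"
```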
A shell variable is not expandable internally in awk. You can do this instead:
awk -F, -v var="$var" '{ $3 = var } 1' OFS=, math_ready.csv > math1.csv
And you probably can simplify your code with this:
awk -F, 'NR == FNR { r = $3; next } { $3 = r } 1' OFS=, math_ready.csv math_ready.csv > math1.csv
Example input:
1,2,1
1,2,2
1,2,3
1,2,4
1,2,5
Output:
1,2,5
1,2,5
1,2,5
1,2,5
1,2,5
Try this one-liner. It replaces the last column, so it does not depend on the column count:
var=`tail -1 sample.csv | perl -ne 'm/([^,]+)$/; print "$1";'`; cat sample.csv | while read line; do echo $line | perl -ne "s/[^,]*$/$var\n/; print $_;"; done
cat sample.csv
24,1,2,30,12
33,4,5,61,3333
66,7,8,91111,1
76,10,11,32,678
Out:
24,1,2,30,678
33,4,5,61,678
66,7,8,91111,678
76,10,11,32,678