I'm looping over a list of files, and for each file I'm scanning for specific things to grep out.
# create .fus file
grep DIF $file | awk '{
if ( $7 != $13 )
print $4, "\t", $5, "\t", $20, "\t", $10, "\t", $11, "\t", $22, "\t", "Null";
}' > $file_name.fus
# create .inv_fus file
grep DIF $file | awk '{
if ( $7 == $13 )
print $4, "\t", $5, "\t", $20, "\t", $10, "\t", $11, "\t", $22, "\t", "Null";
}' > $file_name.inv_fus
# create .del file
echo -e '1\t1\t1\t1' > ${file_name}.del
echo -e '1\t1\t1\t3' >> ${file_name}.del
grep DEL ${file} | awk '{print $4, "\t", $5, "\t", $12, "\t", "2"}' >> ${file_name}.del
The first awk checks whether the values of columns 7 and 13 differ; if they do, it writes the line to a file. The second awk checks whether the values are the same; if they are, it writes the line to a file. The third creates a file whose first two lines are always the same, with the rest filled in from lines containing 'DEL'.
I use the output files to generate a plot, but this fails because some fields are empty. How can I change my code (the awk statements, I guess?) so that it checks for empty fields (in columns 4, 5, 20, 10, 11 and 22) and replaces them with dots '.'?
As the other answer has said, there's a lot of simplification that could happen here, but without knowing the input or expected output it's hard to say which changes would be beneficial.
Regardless, the question seems to boil down to replacing empty fields with dots in your output, for some process down the line. Adding a function like this to your awk scripts should do the trick:
function clean() {
for(i = 1; i <= NF; i++) { if($i==""){$i="."}; }
}
For example, given this input in test.txt:
a1,a2,a3,a4
b1,,b3,b4
,,c3,
d1,d2,d3,
Running the following awk replaces every empty field with a period.
awk -F',' 'function clean() {
for(i = 1; i <= NF; i++) { if($i==""){$i="."}; }
}
BEGIN {OFS=","}
{clean(); print;}' test.txt
Example output:
a1,a2,a3,a4
b1,.,b3,b4
.,.,c3,.
d1,d2,d3,.
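Since empty fields only survive with an explicit field separator (awk's default whitespace splitting collapses them away), the same function can be folded into the question's first script, assuming tab-separated input. This is just a sketch, not tested against the real data:

```shell
# Sketch: the question's first awk with clean() applied before printing.
# Assumes tab-separated input; $file and $file_name come from the
# surrounding loop, as in the question.
grep DIF "$file" | awk -F'\t' '
function clean() {
    for (i = 1; i <= NF; i++)
        if ($i == "") $i = "."    # dot out empty fields
}
BEGIN { OFS = "\t" }
$7 != $13 { clean(); print $4, $5, $20, $10, $11, $22, "Null" }
' > "$file_name.fus"
```

The same clean() call can be dropped into the .inv_fus and .del scripts.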
Let's start by cleaning up your script. Replace the whole thing with just one simple awk command:
awk -v file_name="$file_name" '
BEGIN {OFS="\t"; print 1, 1, 1, 1 ORS 1, 1, 1, 3 > (file_name ".del")}
/DIF/ {print $4, $5, $20, $10, $11, $22, "Null" > (file_name "." ($7==$13?"inv_":"") "fus")}
/DEL/ {print $4, $5, $12, 2 > (file_name ".del")}
' "$file"
Now, update your question with sample input and expected output that captures what else you need it to do.
I have n records in a file karan.csv in the following format:
A=9607738162|B=9607562681|C=20200513191434|D=|F=959852599|G=MT|H=4012|I=4012|J=9607562681|K=947100410|
A=960299773008|B=9607793008|C=20200513191327|D=|E=ST|F=959852599|G=MO|H=2001|I=2001|J=9607793008|K=947100180|
A=9607704530|B=9607839496|C=20200513191730|D=|F=959852599|G=MT|I=5012|J=9607839496|K=|
Notice that the numbers of columns are 10, 11 and 9 respectively. The count varies randomly within the file, but the columns themselves remain the same.
Now I want to create a script that removes $5 (including its delimiter) from any line with 11 columns, so that it looks exactly like the row with 10 columns:
A=9607738162|B=9607562681|C=20200513191434|D=|F=959852599|G=MT|H=4012|I=4012|J=9607562681|K=947100410|
and that adds "H=|" as $7 where the column count is 9:
A=9607704530|B=9607839496|C=20200513191730|D=|F=959852599|G=MT|H=|I=5012|J=9607839496|K=|
Now I wrote the following code to achieve it:
for text in $(cat /tmp/karan.csv);do
count=`awk -F"|" '{print NF-1}' $text`
if [ $count == 9 ]
then
awk 'BEGIN{FS=OFS="|"}{$7="|H"}1' $text >> /tmp/karantest2.csv
elif [ $count == 10 ]
then
echo $text >> /tmp/karantest2.csv
else
awk -F"|" '{print $1,$2,$3,$4,$6,$7,$8,$9,$10,$11}' $text >> /tmp/karantest2.csv
fi
done
But after debugging, I realised the script was not moving ahead after:
count=`awk -F"|" '{print NF-1}' $text`
Can anyone please help me with this?
Regards
A pure awk solution:
awk -F'|' '
BEGIN { OFS="|" }
NF==10 { print $1, $2, $3, $4, $5, $6, "H=", $7, $8, $9, $10 }
NF==11 { print $0 }
NF==12 { print $1, $2, $3, $4, $6, $7, $8, $9, $10, $11, $12 }
' karan.csv
Output for the sample input provided is:
A=9607738162|B=9607562681|C=20200513191434|D=|F=959852599|G=MT|H=4012|I=4012|J=9607562681|K=947100410|
A=960299773008|B=9607793008|C=20200513191327|D=|F=959852599|G=MO|H=2001|I=2001|J=9607793008|K=947100180|
A=9607704530|B=9607839496|C=20200513191730|D=|F=959852599|G=MT|H=|I=5012|J=9607839496|K=|
A sed solution, which first inserts H=| on lines with 9 columns, then removes the 7th column on lines with 11 columns:
sed -E '/^([^\|]+\|){9}$/s/(([^\|]+\|){6})/\1H=\|/;/^([^\|]+\|){11}$/s/(([^\|]+\|){4})[^\|]+\|/\1/' inputfile
If you need a POSIX-compliant command, then
since -E is not POSIX, you have to escape every (, ), {, }, + (and other special characters, which are not in this command), and un-escape \| to make it literal;
since \+ is not POSIX either, you need to use the more verbose \{1,\}.
Here's the POSIX-compliant command:
sed '/^\([^|]\{1,\}|\)\{9\}$/s/\(\([^|]\{1,\}|\)\{6\}\)/\1H=|/;/^\([^|]\{1,\}|\)\{11\}$/s/\(\([^|]\{1,\}|\)\{4\}\)[^|]\{1,\}|/\1/' inputfile
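As a quick sanity check using the sample lines from the question, the 9-column line should gain H=| and the 11-column line should lose E=ST:

```shell
# Feed the question's 11- and 9-column sample lines through the POSIX sed.
printf '%s\n' \
  'A=960299773008|B=9607793008|C=20200513191327|D=|E=ST|F=959852599|G=MO|H=2001|I=2001|J=9607793008|K=947100180|' \
  'A=9607704530|B=9607839496|C=20200513191730|D=|F=959852599|G=MT|I=5012|J=9607839496|K=|' |
sed '/^\([^|]\{1,\}|\)\{9\}$/s/\(\([^|]\{1,\}|\)\{6\}\)/\1H=|/;/^\([^|]\{1,\}|\)\{11\}$/s/\(\([^|]\{1,\}|\)\{4\}\)[^|]\{1,\}|/\1/'
```

Both lines come out with 10 columns, matching the output of the awk answer above.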
I have an input file with repetitive headers (below):
A1BG A1BG A1CF A1CF A2ML1
aa bb cc dd ee
1 2 3 4 5
I want to print all columns with the same header into one file. E.g. for the above file there should be three output files: one for A1BG with 2 columns, a second for A1CF with 2 columns, and a third for A2ML1 with 1 column. Is there any way to do this with an awk or grep one-liner?
I tried following one-liner:
awk -v f="A1BG" '!o{for(x=1;x<=NF;x++)if($x==f){o=1;next}}o{print $x}' trial.txt
but this searches for the pattern in only one column (1 in this case). I want to look through all the header names and print all the columns whose header is A1BG.
This awk solution takes the same approach as Lars's but uses gawk 4.0 2D arrays:
awk '
# fill cols map of header to its list of columns
NR==1 {
for(i=1; i<=NF; ++i) {
cols[$i][n[$i]++]=i
}
}
{
# write tab-delimited columns for each header to its cols.header file
for(h in cols) {
of="cols."h
for(i=0; i < length(cols[h]); ++i) {
if(i > 0) printf("\t") >of
printf("%s", $cols[h][i]) >of
}
printf("\n") >of
}
}
'
This awk solution should be pretty fast; output files are tab-delimited and named cols.A1BG, cols.A1CF, etc.
awk '
# fill cols (column number -> header) and tab (per-header tab state)
NR==1 {
for(i=1; i<=NF; ++i) {
cols[i]=$i
tab[$i]=0
}
}
{
# reset tab state for every header
for(h in tab) tab[h]=0
# write tab-delimited column to its cols.header file
for(i=1; i<=NF; ++i) {
hdr=cols[i]
of="cols." hdr
if(tab[hdr]) {
printf("\t") >of
} else
tab[hdr]=1
printf("%s", $i) >of
}
# newline for every header file
for(h in tab) {
of="cols." h
printf("\n") >of
}
}
'
This is the output from both of my awk solutions:
$ ./scr.sh <in.txt; head cols.*
==> cols.A1BG <==
A1BG A1BG
aa bb
1 2
==> cols.A1CF <==
A1CF A1CF
cc dd
3 4
==> cols.A2ML1 <==
A2ML1
ee
5
I cannot help you with a 1-liner but here is a 10-liner for GNU awk:
script.awk
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (i==1)? i : f2c[$i] " " i } }
{ for( n in f2c ) {
split( f2c[n], fls, " ")
tmp = ""
for( f in fls ) tmp = (f ==1) ? $fls[f] : tmp "\t" $fls[f]
print tmp > n
}
}
Use it like this: awk -f script.awk your_file
The first action determines the output filenames from the columns of the first record (NR == 1).
The second action, for each record and each output file, collects that file's columns (as defined by the first record) into tmp and writes tmp to the output file.
The use of PROCINFO requires GNU awk; see Ed Morton's comments for alternatives.
Example run and output:
> awk -f mpapccfaf.awk mpapccfaf.csv
> cat A1BG
A1BG A1BG
aa bb
1 2
Here y'go, a one-liner as requested:
awk 'NR==1{for(i=1;i<=NF;i++)a[$i][i]}{PROCINFO["sorted_in"]="#ind_num_asc";for(n in a){c=0;for(f in a[n])printf"%s%s",(c++?OFS:""),$f>n;print"">n}}' file
The above uses GNU awk 4.* for true multi-dimensional arrays and sorted_in.
For anyone else reading this who prefers clarity over the brevity the OP needs, here it is as a more natural multi-line script:
$ cat tst.awk
NR==1 {
for (i=1; i<=NF; i++) {
names2fldNrs[$i][i]
}
}
{
PROCINFO["sorted_in"] = "#ind_num_asc"
for (name in names2fldNrs) {
c = 0
for (fldNr in names2fldNrs[name]) {
printf "%s%s", (c++ ? OFS : ""), $fldNr > name
}
print "" > name
}
}
$ awk -f tst.awk file
$ cat A1BG
A1BG A1BG
aa bb
1 2
$ cat A1CF
A1CF A1CF
cc dd
3 4
$ cat A2ML1
A2ML1
ee
5
Since you wrote in one of the comments to my other answer that you have 20000 columns, let's consider a two-step approach, to make it easier to find out which of the steps breaks.
step1.awk
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] ", $" i) } }
NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn])
exit
}
Step1 should give us a list of files together with their columns:
> awk -f step1.awk yourfile
Mpap_1:$1, $2, $3, $5, $13, $19, $25
Mpap_2:$4, $6, $8, $12, $14, $16, $20, $22, $26, $28
Mpap_3:$7, $9, $10, $11, $15, $17, $18, $21, $23, $24, $27, $29, $30
In my test data Mpap_1 is the header of columns 1, 2, 3, 5, 13, 19 and 25. Let's hope that this first step works with your large set of columns. (To be frank: I don't know whether awk can deal with $20000.)
Step 2: let's create one of those famous one-liners:
> awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }' | awk -v "OFS=\t" -f - yourfile
The first part is our step 1. The second part builds a second awk script on the fly, with lines like this: print $1, $2, $3, $5, $13, $19, $25 > "Mpap_1". This generated script is piped to the third part, which reads it from stdin (-f -) and applies it to your input file.
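The -f - trick in the last part is just awk reading its program text from standard input; a minimal standalone illustration (the file path here is made up for the demo):

```shell
# Write a one-line data file, then generate the awk program on the fly
# and feed it to awk via -f - (program read from standard input).
printf 'hello world\n' > /tmp/demo.txt
echo '{ print $2, $1 }' | awk -f - /tmp/demo.txt
```

This prints the two words swapped, exactly as if the program had been stored in a script file.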
In case something does not work: inspect the output of each part of step 2. You can execute the parts from the left up to (but not including) each of the | symbols and see what is going on, e.g.:
awk -f step1.awk yourfile
awk -f step1.awk yourfile | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1 "\"" }; END { print "}" }'
The following worked for me.
Code for step1.awk:
NR == 1 { PROCINFO["sorted_in"] = "#ind_num_asc"
for( i=1; i<=NF; i++ ) { f2c[$i] = (f2c[$i]=="")? "$" i : (f2c[$i] " \"\t\" $" i) } }
NR== 2 { for( fn in f2c) printf("%s:%s\n", fn,f2c[fn])
exit
}
Then run the one-liner which uses the above awk script:
awk -f step1.awk file.txt | awk -F : 'BEGIN {print "{"}; {print " print " $2, "> \"" $1".txt" "\"" }; END { print "}" }'| awk -f - file.txt
This outputs tab-delimited .txt files, each containing all the columns that share a header (a separate file for each header).
Thanks Lars Fischer and others.
Cheers
I'm trying to do some filtering with awk, but I'm running into an issue: I can make awk match a regex and print the line with the column, but I cannot make it print the line without the column.
awk -v OFS='\t' '$6 ~ /^07/ {print $3, $4, $5, $6}' file
Is currently what I have. Can I make awk print the line without the sixth column if it doesn't match the regex?
Set $6 to the empty string if the regex doesn't match. As simple as that. This should do it:
awk -v OFS='\t' '{ if ($6 ~ /^07/) { print $3, $4, $5, $6 } else { $6 = ""; print $0; } }' file
Note that $0 is the entire line, so this prints every field (including $2, which you didn't seem to use), with the 6th field emptied, leaving two consecutive tabs where it was.
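To see the effect on some made-up sample data (field 6 starts with 07 on the first line but not the second):

```shell
# Emptying $6 forces awk to rebuild $0 with OFS, so the second line
# comes out tab-joined with a trailing tab where field 6 used to be.
printf 'a b c d e 0712\na b c d e 9999\n' |
awk -v OFS='\t' '{ if ($6 ~ /^07/) { print $3, $4, $5, $6 } else { $6 = ""; print $0 } }'
```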
If you just want to print $3, $4 and $5 when there isn't a match, use this instead:
awk -v OFS='\t' '{ if ($6 ~ /^07/) print $3, $4, $5, $6; else print $3, $4, $5 }' file
I am trying to add a header to a split file, but with this code the header appears on every other line:
awk -F, '{print "eid,devicetype,meterid,lat,lng" > $7"-"$6".csv"}{print $1",", $2",", $3",", $4",", $5"," >> $7"-"$6".csv"}' path/filename
The awk code by itself works, but I need to add a header to each file. The script splits the input based on the values in columns 6 and 7 and names each output file with those values; it then drops columns 6 and 7, writing only columns 1-5 to the output file. This is on Unix, in a shell script run from PowerCenter.
I am sure it is probably a simple fix for others more experienced.
awk '
BEGIN { FS=OFS="," }
{ fname = $7 "-" $6 ".csv" }
!seen[fname]++ { print "eid", "devicetype", "meterid", "lat", "lng" > fname }
{ print $1, $2, $3, $4, $5 > fname }
' path/filename
You can use:
awk -F, '!a[$7,$6]++{print "eid,devicetype,meterid,lat,lng" > $7 "-" $6 ".csv"}
{print $1,$2,$3,$4,$5 > $7 "-" $6 ".csv"}' OFS=, /path/filename.csv
The !a[$7,$6]++ condition makes sure the header is printed only once per output file, the first time each $7-$6 combination is seen.
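The !a[$7,$6]++ test is the usual awk first-time-seen idiom: the expression is true only on the first occurrence of a key. A tiny demonstration on made-up data:

```shell
# !seen[key]++ is true only the first time a key appears, so the
# attached action (here: the implicit print) runs once per key.
printf 'x\nx\ny\nx\ny\n' | awk '!seen[$0]++'
```

Only the first x and the first y are printed.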
I can't seem to identify where the syntax error is. I've tried these 2 statements, but nothing gets written to the 'BlockedIPs' file. Can someone please help? Thanks!
awk '/ (TCP|UDP) / { split($5, addr, /:/); cmd = "/Users/user1/Scripts/geoiplookup " addr[1] | awk '{print $4, $5, $6}'; cmd | getline rslt; close(cmd); print $1, $2, $3, rslt }' < "$IP_PARSED" >> "$BlockedIPs"
awk '/ (TCP|UDP) / { split($5, addr, /:/); cmd = "/Users/user1/Scripts/geoiplookup " addr[1] " | awk '{print $4, $5, $6}'" ; cmd | getline rslt; close(cmd); print $1, $2, $3, rslt }' < "$IP_PARSED" >> "$BlockedIPs"
Your problem is primarily with quoting and stems from the fact that you're trying to call AWK from within an AWK one-liner. It's certainly possible, but getting the quoting right would be very tricky.
It would be much better to capture the complete output of geoiplookup in a variable and then use split() to pull out just the fields you need. Something like:
awk '/ (TCP|UDP) / { split($5, addr, /:/); cmd = "/Users/user1/Scripts/geoiplookup " addr[1]; cmd | getline rslt; split(rslt, r); close(cmd); print $1, $2, $3, r[4], r[5], r[6] }' < "$IP_PARSED" >> "$BlockedIPs"