How to merge set of columns based on a common field - bash

How can I do the following in bash.
a) Take out columns 1,2,6
b) Row is identified by field 'packetId'; There can be one or 2 rows with same 'packetId'; if there are 2 rows with same packetId, then append the first row with the last field of the second row
c) If there is only one row for a 'packetId', then ignore that row and donot print
Input
SequenceId,TimeStamp,packetId,size,secondaryid,eventType,randomfield,Source,Destination,SystemTime
1,3:41:24,1,100,xyz,event1,abc,S1,D1,1586989874
2,3:41:25,1,100,xyz,event2,abc,S1,D1,1586989877
3,3:41:26,2,100,xyz,event1,abc,S1,D1,1586989879
4,3:41:26,3,100,xyz,event1,abc,S1,D1,1586989871
5,3:41:26,3,100,xyz,event2,abc,S1,D1,1586989879
output
packetId,size,secondaryid,randomfield,Source,Destination,SystemTime,OtherSystemTime
1,100,xyz,abc,S1,D1,1586989874,1586989877
3,100,xyz,abc,S1,D1,1586989871,1586989879

You can do it all in awk, but it's simpler to use cut to first remove the fields you don't care about:
$ cut -d, -f3-5,7- input.csv |
awk -F, 'NR == 1 { print $0 ",OtherSystemTime"; next }
{ if ($1 in seen) print seen[$1] "," $NF; else seen[$1] = $0 }'
packetId,size,secondaryid,randomfield,Source,Destination,SystemTime,OtherSystemTime
1,100,xyz,abc,S1,D1,1586989874,1586989877
3,100,xyz,abc,S1,D1,1586989871,1586989879

Related

How to select the minimum value which includes exponential value for each ID based on the forth column?

Can you please tell me how to Select rows with min value, including exponential, based on fourth column and group by first column in linux?
Original file
ID,y,z,p-value
1,a,b,0.22
1,a,b,5e-10
1,a,b,1.2e-10
2,c,d,0.06
2,c,d,0.003
2,c,d,3e-7
3,e,f,0.002
3,e,f,2e-8
3,e,f,1.0
The file I want is as below.
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
Actually this worked fine, so thanks for everybody!
tail -n +2 original_file > txt sort -t, -k 4g txt | awk -F, '!visited[$1]++' | sort -k2,2 -k3,3 >> final_file
You can do it fairly easily in awk just by keeping the current record with the minimum 4th field for a given 1st field. You have to handle outputting the header-row and storing the first record to begin the comparison, which you can do by operating on the first record NR==1 (or first in each file processed, FNR==1).
You can store the first minimum in an array indexed by the first field and save the initial record containing values operating on the 2nd record. Then it is just a matter of checking if the first-field is not the same as the last, if so output the minimum record for the last and keep going until you run out of records. (note: this presumes the first-fields appear in increasing order as they do in your file) Then you use the END rule to output the final record.
You can put that together as follows:
awk -F, '
FNR==1 {print; next}
FNR==2 {rec=$0; m[$1]=$4; next}
{
if ($1 in m) {
if ($4 < m[$1]) {
rec=$0
m[$1]=$4
}
}
else {
print rec
rec=$0
m[$1]=$4
}
}
END {
print rec
}' file
(where your data is in the file file)
If your first field is not in increasing order, then you will need to save the current minimum record in an array as well. (e.g. turn rec into an array indexed by the first-field holding the total record as its value). You would then delay looping over both arrays until the END rule to output the minimum record for each first-field.
Example Use/Output
You can update the filename to match the filename containing your data, and then to test, all you need to do is select-copy the awk expression and middle-mouse paste it into an xterm in the directory containing your file, e.g.
$ awk -F, '
> FNR==1 {print; next}
> FNR==2 {rec=$0; m[$1]=$4; next}
> {
> if ($1 in m) {
> if ($4 < m[$1]) {
> rec=$0
> m[$1]=$4
> }
> }
> else {
> print rec
> rec=$0
> m[$1]=$4
> }
> }
> END {
> print rec
> }' file
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
Look things over and let me know if you have questions.
A non-awk approach, using GNU datamash:
$ datamash -H -f -t, -g1 min 4 < input.txt | cut -d, -f1-4
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
(The cut is needed because with the -f option datamash adds a fifth column that's a duplicate of the 4th; without it it'll just show the first and fourth column values. Minor annoyance.)
This does require that your data is sorted on the first column like in your sample.

Formatting output using awk

I've a file with following content:
A 28713.64 27736.1000
B 9835.32
C 38548.96
Now, i need to check if the last row in the first column is 'C', then the value of first row in third column should be printed in the third column against 'C'.
Expected Output:
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
I tried below, but it's not working:
awk '{if ($1 == "C") ; print $1,$2,$3}' file_name
Any help is most welcome!!!
This works for the given example:
awk 'NR==1{v=$3}$1=="C"{$0=$0 FS v}7' file|column -t
If you want to append the 3rd column value from A row to C row, change NR==1 into $1=="A"
The column -t part is just for making output pretty. :-)
EDIT: As per OP's comment OP is looking for very first line and looking to match C string at very last line of Input_file, if this is the case then one should try following.
awk '
FNR==1{
value=$NF
print
next
}
prev{
print prev
}
{
prev=$0
prev_first=$1
}
END{
if(prev_first=="C"){
print prev,value
}
else{
print
}
}' file | column -t
Assuming that your actual Input_file is same as shown samples and you want to pick value from 1st column whose value is A.
awk '$1=="A" && FNR==1{value=$NF} $1=="C"{print $0,value;next} 1' Input_file| column -t
Output will be as follows.
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
POSIX dictates that "assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS."
So...
awk 'NR==1{x=$3} $1=="C"{$3=x} 1' input.txt
Note that the output is not formatted well, but that's likely the case with most of the solutions here. You could pipe the output through column, as Ravinder suggested. Or you could control things precisely by printing your data with printf.
awk 'NR==1{x=$3} $1=="C"{$3=x} {printf "%-2s%-26s%s\n",$1,$2,$3}' input.txt
If your lines can be expressed in a printf format, you'll be able to avoid the unpredictability of column -t and save the overhead of a pipe.

bash - In a csv file, get the last column of all duplicate and concatenate to first occurence

I have a sorted csv file, where several entries are duplicates, except for the last column. How can I concatenate all the last columns to the first occurrence of each entry ?
Input:
Test1,123,somestuff
Test1,123,differentstuff
Test2,345,otherstuff
Output:
Test1,123,somestuff, differentstuff
Test2,345,otherstuff
EDIT:
Obtaining the last column is easy (cut -d, -f3 test.csv); now I need to add it to every first occurence of an entry.
Use awk utility:
awk -F, '{ k=$1 FS $2; a[k] = (k in a)? a[k] FS $3 : $3 }
END{ for(i in a) print i,a[i] }' OFS=',' csvfile
The output:
Test1,123,somestuff,differentstuff
Test2,345,otherstuff
-F, - field separator
k=$1 FS $2 - associative array key (grouping records by the first 2 field values)

Compare multiple Columns and Append the result into another file

I have two files file1 and file2, Both the files have 5 columns.
I want to compare first 4 columns of file1 with file2.
If they are equal, need to compare the 5th column. If 5th column values are different, need to print the file1's 5th column as file2's 6th column.
I have used below awk to compare two columns in two different files, but how to compare multiple columns and append the particular column in another file if matches found?
awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]'
file1:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,topcat
132,item2,grade3,wing7,middlecat
134,item2,grade3,wing7,middlecat
177,item8,gradeA,wing11,lowcat
file2:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,lowcat
132,item3,grade3,wing7,middlecat
126,item2,grade3,wing7,maingroup
177,item8,gradeA,wing11,lowcat
Desired output:
123,item3,grade5,wing10,lowcat,topcat
Awk can simulate multidimensional arrays by sequencing the indices. Underneath the indices are concatenated using the built-in SUBSEP variable as a separator:
$ awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next } a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat
awk -F, -v OFS=,
Set both input and output separators to ,
NR==FNR { a[$1,$2,$3,$4]=$5; next }
Create an associative array from the first file relating the first four fields of each line to the
fifth. When using a comma-separated list of values as an index, awk actually concatenates them
using the value of the built-in SUBSEP variable as a separator. This is awk's way of
simulating multidimensional arrays with a single subscript. You can set SUBSEP to any value you like
but the default, which is a non-printing character unlikely to appear in the data, is usually
fine. (You can also just do the trick yourself, something like a[$1 "|" $2 "|" $3 "|" $4],
assuming you know that your data contains no vertical bars.)
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }
Arriving here, we know we are looking at the second file. If the first four fields were found in the
first file, and the $5 from the first file is different than the $5 in the second, print the line
from the second file followed by the $5 from the first. (I am assuming here that no $5 from the first file will have a value that evaluates to false, such as 0 or empty.)
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $0; sub("(,[^,]*){"NF-4"}$","",key) }
NR==FNR { file1[key] = $5; next }
(key in file1) && ($5 != file1[key]) {
print $0, file1[key]
}
$ awk -f tst.awk file1 file2
123,item3,grade5,wing10,lowcat,topcat

Unix AWK field seperator finding sum of one field group by other

I am using below awk command which is returning me unique value of parameter $11 and occurrence of it in the file as output separated by commas. But along with that I am looking for sum of parameter $14(last value) in the output. Please help me on it.
sample string in file
EXSTAT|BNK|2014|11|05|15|29|46|23169|E582754245|QABD|S|000|351
$14 is last value 351
bash-3.2$ grep 'EXSTAT|' abc.log|grep '|S|' |
awk -F"|" '{ a[$11]++ } END { for (b in a) { print b"," a[b] ; } }'
QDER,3
QCOL,1
QASM,36
QBEND,23
QAST,3
QGLBE,30
QCD,30
TBENO,1
QABD,9
QABE,5
QDCD,5
TESUB,1
QFDE,12
QCPA,3
QADT,80
QLSMR,6
bash-3.2$ grep 'EXSTAT|' abc.log
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|12|23148|E582754224|QASM|S|000|1592
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|14|23151|E582754227|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|15|23152|E582754228|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|15|23153|E582754229|QADT|S|000|258
EXSTAT|BNK|2014|11|05|15|29|17|23154|E582754230|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|18|23155|E582754231|QADT|S|000|263
EXSTAT|BNK|2014|11|05|15|29|18|23156|E582754232|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|19|23157|E582754233|QADT|S|000|270
EXSTAT|BNK|2014|11|05|15|29|19|23158|E582754234|QADT|S|000|264
EXSTAT|BNK|2014|11|05|15|29|20|23159|E582754235|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|20|23160|E582754236|QADT|S|000|241
EXSTAT|BNK|2014|11|05|15|29|21|23161|E582754237|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|21|23162|E582754238|QADT|S|000|229
EXSTAT|BNK|2014|11|05|15|29|22|23163|E582754239|QADT|S|000|234
EXSTAT|BNK|2014|11|05|15|29|22|23164|E582754240|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|23|23165|E582754241|QADT|S|000|254
EXSTAT|BNK|2014|11|05|15|29|23|23166|E582754242|QADT|S|000|402
EXSTAT|BNK|2014|11|05|15|29|24|23167|E582754243|QADT|S|000|223
EXSTAT|BNK|2014|11|05|15|29|24|23168|E582754244|QADT|S|000|226
Just add another associative array:
awk -F"|" '{a[$11]++;c[$11]+=$14}END{for(b in a){print b"," a[b]","c[b]}}'
tested below:
> cat temp
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|12|23148|E582754224|QASM|S|000|1592
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|14|23151|E582754227|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|15|23152|E582754228|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|15|23153|E582754229|QADT|S|000|258
EXSTAT|BNK|2014|11|05|15|29|17|23154|E582754230|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|18|23155|E582754231|QADT|S|000|263
EXSTAT|BNK|2014|11|05|15|29|18|23156|E582754232|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|19|23157|E582754233|QADT|S|000|270
EXSTAT|BNK|2014|11|05|15|29|19|23158|E582754234|QADT|S|000|264
EXSTAT|BNK|2014|11|05|15|29|20|23159|E582754235|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|20|23160|E582754236|QADT|S|000|241
EXSTAT|BNK|2014|11|05|15|29|21|23161|E582754237|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|21|23162|E582754238|QADT|S|000|229
EXSTAT|BNK|2014|11|05|15|29|22|23163|E582754239|QADT|S|000|234
EXSTAT|BNK|2014|11|05|15|29|22|23164|E582754240|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|23|23165|E582754241|QADT|S|000|254
EXSTAT|BNK|2014|11|05|15|29|23|23166|E582754242|QADT|S|000|402
EXSTAT|BNK|2014|11|05|15|29|24|23167|E582754243|QADT|S|000|223
EXSTAT|BNK|2014|11|05|15|29|24|23168|E582754244|QADT|S|000|226
> awk -F"|" '{a[$11]++;c[$11]+=$14}END{for(b in a){print b"," a[b]","c[b]}}' temp
QGLBE,1,424
QADT,20,5510
QASM,1,1592
QCD,1,373
>
also check the test here
You need not use grep for searching the file if it contains EXSTAT the awk can do that for you as well.
For example:
awk 'BEGIN{FS="|"; OFS=","} $1~EXSTAT && $12~S {sum[$11]+=$14; count[$11]++}END{for (i in sum) print i,count[i],sum[i]}' abc.log
for the input file abc.log with contents
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|12|23148|E582754224|QASM|S|000|1592
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|14|23151|E582754227|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|15|23152|E582754228|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|15|23153|E582754229|QADT|S|000|258
EXSTAT|BNK|2014|11|05|15|29|17|23154|E582754230|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|18|23155|E582754231|QADT|S|000|263
EXSTAT|BNK|2014|11|05|15|29|18|23156|E582754232|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|19|23157|E582754233|QADT|S|000|270
EXSTAT|BNK|2014|11|05|15|29|19|23158|E582754234|QADT|S|000|264
EXSTAT|BNK|2014|11|05|15|29|20|23159|E582754235|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|20|23160|E582754236|QADT|S|000|241
EXSTAT|BNK|2014|11|05|15|29|21|23161|E582754237|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|21|23162|E582754238|QADT|S|000|229
EXSTAT|BNK|2014|11|05|15|29|22|23163|E582754239|QADT|S|000|234
EXSTAT|BNK|2014|11|05|15|29|22|23164|E582754240|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|23|23165|E582754241|QADT|S|000|254
EXSTAT|BNK|2014|11|05|15|29|23|23166|E582754242|QADT|S|000|402
EXSTAT|BNK|2014|11|05|15|29|24|23167|E582754243|QADT|S|000|223
EXSTAT|BNK|2014|11|05|15|29|24|23168|E582754244|QADT|S|000|226
it will give an output as
QASM,1,1592
QGLBE,1,424
QADT,20,5510
QCD,1,373
What it does?
'BEGIN{FS="|"; OFS=","} excecuted before the input file is processed. It sets FS, input field seperator as | and OFS output field seperator as ,
$1~EXSTAT && $12~S{sum[$11]+=$14; count[$11]++} action is for each line
$1~EXSTAT && $12~S checks if first field is EXSTAT and 12th field is S
sum[$11]+=$14 array sum of field $14 indexed by $11
count[$11]++ array count indexed by $11
END{for (i in sum) print i,count[i],sum[i]}' excecuted at end of file, prints the content of the arrays
You can use a second array.
awk -F"|" '/EXSTAT\|/&&/\|S\|/{a[$11]++}/EXSTAT\|/{s[$11]+=$14}\
END{for(b in a)print b","a[b]","s[b];}' abc.log
Explanation
/EXSTAT\|/&&/\|S\|/{a[$11]++} on lines that contain both EXSTAT| and |S|, increment a[$11].
/EXSTAT\|/ on lines containing EXSTAT| add $14 to s[$11]
END{for(b in a)print b","a[b]","s[b];} print out all keys in array a, values of array a, and values of array s, separated by commas.
#!awk -f
BEGIN {
FS = "|"
}
$1 == "EXSTAT" && $12 == "S" {
foo[$11] += $14
}
END {
for (bar in foo)
printf "%s,%s\n", bar, foo[bar]
}

Resources