bash command for group by count

I have a file in the following format
abc|1
def|2
abc|8
def|3
abc|5
xyz|3
I need to group by the words in the first column and sum the values in the second column. For instance, the output for this file should be
abc|14
def|5
xyz|3
Explanation: the values for the word "abc" are 1, 8, and 5; adding them gives 14, so the output line is "abc|14". Similarly, the values for "def" are 2 and 3, which sum to 5, giving "def|5".
Thank you very much for the help :)
I tried the following command
awk -F "|" '{arr[$1]+=$2} END {for (i in arr) {print i"|"arr[i]}}' filename
another command which I found was
awk -F "," 'BEGIN { FS=OFS=SUBSEP=","}{arr[$1]+=$2 }END {for (i in arr) print i,arr[i]}' filename
Neither showed me the intended results, and I'm also unsure how these commands actually work.

Short GNU datamash solution:
datamash -s -t\| -g1 sum 2 < filename
The output:
abc|14
def|5
xyz|3
-s - sort the input first (datamash only groups adjacent equal keys, so unsorted input needs this)
-t\| - field separator
-g1 - group by the 1st column
sum 2 - sum up the values of the 2nd column
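If GNU datamash isn't available, a rough equivalent (just a sketch) is your original awk command piped through sort to order the keys:
awk -F "|" '{arr[$1]+=$2} END {for (i in arr) print i"|"arr[i]}' filename | sort -t "|" -k1,1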

I will just add an answer to fix the sorting issue you had. With your Awk logic you don't need to pipe the output of Awk through sort/uniq; you can do the sorting in Awk itself.
Referring to the GNU Awk manual's Using Predefined Array Scanning Orders with gawk, you can use the PROCINFO["sorted_in"] variable (gawk-specific) to control how Awk sorts your final output.
Referring to the section below,
@ind_str_asc
Order by indices in ascending order compared as strings; this is the most basic sort. (Internally, array indices are always strings, so with a[2*5] = 1 the index is "10" rather than numeric 10.)
So, to use this for your requirement, in the END clause just do,
END{PROCINFO["sorted_in"]="@ind_str_asc"; for (i in unique) print i,unique[i]}
with your full command being,
awk '
BEGIN{FS=OFS="|"}{
unique[$1]+=$2;
next
}
END{
PROCINFO["sorted_in"]="@ind_str_asc";
for (i in unique)
print i,unique[i]
}' file
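Run against the sample file, this prints the groups in ascending key order:
abc|14
def|5
xyz|3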

awk -F\| '{ arry[$1]+=$2 } END { n=asorti(arry,arry2); for (i=1;i<=n;i++) print arry2[i]"|"arry[arry2[i]] }' filename
Your initial solution should work apart from the sorting issue. Use the asorti function to copy the indices of arry into arry2 in sorted order (it returns the number of elements), then loop over arry2 by position to print the sorted output.

Related

Searching for a string between two characters

I need to find two numbers from lines which look like this
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that don't contain ":", so we can search for the colon to find the lines with numbers. Every line has different numbers.
I need to find both numbers after ":", which are separated by "-", then subtract the first number from the second one and print the result for each line
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For the example I showed above, my result would be:
4899
Because it is the result of 458800 - 453901 = 4899
I can't figure it out on my own and would appreciate some help
With GNU awk: separate the row into multiple columns using the : and - separators. In each row containing :, subtract the contents of column 2 from the contents of column 3 and print the result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899
Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899
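If you also want to keep the name before the colon (assuming every matching line looks like the >Chr14:453901-458800 sample), a possible sketch splits on >, : and - at once:
awk -F '[>:-]' '/:/{print $2": "$4-$3}' file.txt
For the sample line this prints Chr14: 4899.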

print 3 consecutive column after specific string from CSV

I need to print the 2 columns after a specific string (in my case it is 64). There can be multiple instances of 64 within the same CSV row; however, the next instance will not occur within 3 columns of the previous occurrence. The output for each instance should be on its own line, and unique. The problem is that the specific string does not fall in the same column for all rows. Every row has dynamic data and there is no header in the CSV. Say the input file is as below (it's just a sample; the actual file has approximately 300 columns and 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have searched many threads but could not find anything closely related. Please help.
We can use a pipe of two commands... the first puts each 64 at the start of a line, and the second prints the first three columns whenever it sees a leading 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming columns are separated by blank space or comma
a sort -u is needed for the uniqueness (it's possible in sed too, but it would mean adding another "simple" action of the same kind)
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
             if($i==64)
               {k=$i FS $(++i) FS $(++i);
                if (!a[k]++)
                   print k
               }
           }' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.
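Another possible sketch, assuming the two fields after each 64 are always numeric, lets grep -o pull the triplets out and deduplicates them with sort -u:
grep -oE '(^|,)64,[0-9]+,[0-9]+' file | sed 's/^,//' | sort -u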

awk combine 2 commands for csv file formatting

I have a CSV file which has 4 columns. I want to first:
print the first 10 items of each column
only print the items in the third column
My method is to pipe the first awk command into another, but I didn't get exactly what I wanted:
awk 'NR < 10' my_file.csv | awk '{ print $3 }'
The only missing thing was the -F.
awk -F "," 'NR < 10' my_file.csv | awk -F "," '{ print $3 }'
You don't need to run awk twice.
awk -F, 'NR<=10{print $3}'
This prints the third field for every line whose record number (line) is less than or equal to 10.
Note that < is different from <=. The former matches records one through nine, the latter matches records one through ten. If you need ten records, use the latter.
Note that this will walk through your entire file, so if you want to optimize your performance:
awk -F, 'NR>10{exit} {print $3}'
This exits as soon as the record number exceeds 10; until then it prints the third column, so it does not step through your entire file.
Note also that awk's "CSV" matching is very simple; awk does not understand quoted fields, so the record:
red,"orange,yellow",green
has four fields, two of which have double quotes in them. YMMV depending on your input.
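A quick check (purely illustrative) shows this: piping that record through awk reports four fields, with the quote stuck to the second one:
echo 'red,"orange,yellow",green' | awk -F, '{print NF; print $2}'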

AWK array parsing issue

My two input files are pipe separated.
File 1 :
a|b|c|d|1|44
File 2 :
44|ab|cd|1
I want to store all my values of first file in array.
awk -F\| 'FNR==NR {a[$6]=$0;next}'
So if I store it the above way, is it possible to index into the array; say I want to know $3 of File 1. How can I get that from a[]?
Also, will I be able to access the array values after I come out of that awk?
Thanks
I'll answer the question as it is stated, but I have to wonder whether it is complete. You state that you have a second input file, but it doesn't play a role in your actual question.
1) It would probably be most sensible to store the fields individually, as in
awk -F \| '{ for(i = 1; i < NF; ++i) a[$NF,i] = $i } END { print a[44,3] }' filename
See here for details on multidimensional arrays in awk. You could also use the split function:
awk -F \| '{ a[$NF] = $0 } END { split(a[44], fields); print fields[3] }'
but I don't see the sense in it here.
2) No. At most you can print the data in a way that the surrounding shell understands and use command substitution to build a shell array from it, but POSIX shell doesn't know arrays at all, and bash only knows one-dimensional arrays. If you require that sort of functionality, you should probably use a more powerful scripting language such as Perl or Python.
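For completeness, a small sketch of that command-substitution idea (the file name and index are only illustrative, and it relies on the fields containing no whitespace):
fields=($(awk -F '|' 'NR==1 { for (i = 1; i <= NF; ++i) print $i }' file1))
echo "${fields[2]}"   # third field of File 1, i.e. c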
If, and I'm wildly guessing here, you want to use the array built from the first file while processing the second, you don't have to quit awk for this. A common pattern is
awk -F \| 'FNR == NR { for(i = 1; i < NF; ++i) { a[$NF,i] = $i }; next } { code for the second file here }' file1 file2
Here FNR == NR is a condition that is only true when the first file is processed (the number of the record in the current file is the same as the number of the record overall; this is only true in the first file).
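As a concrete sketch with the two sample files above (assuming the goal is to look up File 1's fields by File 2's first column, which matches File 1's last column):
awk -F \| 'FNR == NR { for(i = 1; i < NF; ++i) a[$NF,i] = $i; next }
           { print $1, a[$1,3] }' file1 file2
For the sample data this prints 44 c.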
To keep it simple, you can reach your goal of storing (and accessing) values in array without using awk:
arr=($(cat yourFilename |tr "|" " ")) #store in array named arr
# accessing individual elements
echo ${arr[0]}
echo ${arr[4]}
# ...or accessing all elements
for n in ${arr[*]}
do
echo "$n"
done
...even though I wonder if that's what you are looking for. The initial question is not really clear.

Bash/Shell: How to remove duplicates from csv file by columns?

I have a CSV separated with ;. I need to remove lines where the content of the 2nd and 3rd columns is not unique, and deliver the material to standard output.
Example input:
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
Desired output
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
I have found solutions where only the first occurrence of a duplicate is kept in the output:
sort -u -t ";" -k2,1 file
but this is not enough.
I have tried to use uniq -u but I can't find a way to check only a few columns.
Using awk:
awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
Explanation: If the $2,$3 combination doesn't yet exist in the seen array, the whole record is stored in the data array under the key $2,$3. Every time a $2,$3 entry is found, its counter in seen is incremented. Then, at the end, only those entries with counter==1 are printed.
If order is important and if you can use perl then:
perl -F";" -lane '
$key = "@F[1,2]";
$uniq{$key}++ or push @rec, [$key, $_]
}{
print $_->[1] for grep { $uniq{$_->[0]} == 1 } @rec' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
We use column 2 and column 3 to create a composite key. We build an array of arrays by pushing the key and the line onto the array rec for the first occurrence of each key.
In the END block, we check if that occurrence is the only occurrence. If so, we go ahead and print the line.
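If Perl is not an option, an order-preserving two-pass awk sketch does the same thing: the first pass counts each $2,$3 key, and the second pass prints only the lines whose key occurred exactly once.
awk -F';' 'NR==FNR{count[$2,$3]++; next} count[$2,$3]==1' file file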
awk '!a[$0]++' file_input > file_output
This worked for me. It compares whole lines.
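Note that this keeps the first occurrence of every duplicated line rather than dropping all of them. If you want the same idiom keyed on columns 2 and 3 only (still keeping first occurrences, so the behaviour differs from the required output above), a sketch is:
awk -F';' '!a[$2,$3]++' file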
