I want to get a comma-separated list of all of the values in certain columns (2, 4, 5), grouped by the value in column 1 of a tab-delimited file.
I was adapting the command below, but it gives me a list of all the values in the column rather than one list per person, and I'm not sure how to get that.
awk -F"\t" '{print $2}' $i | sed -z 's/\n/,/g;s/,$/\n/'
This is what I am working with
Bob 24 M apples red
Bob 12 M apples green
Linda 56 F apples red
Linda 102 F bananas yellow
And this is what I would like to get (I want to keep duplicates and the order)
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
Assumptions:
for duplicate names the gender is always the same; if it is not, keep the last one seen
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
{ nums[$1] = nums[$1] sep[$1] $2
gender[$1] = $3
fruits[$1] = fruits[$1] sep[$1] $4
colors[$1] = colors[$1] sep[$1] $5
sep[$1] = ","
}
END { # PROCINFO["sorted_in"]="@ind_str_asc" # this line requires GNU awk
for (name in nums)
print name,nums[name],gender[name],fruits[name],colors[name]
}
' input.tsv
This generates:
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
NOTE: this just happens to display the output in Name order; awk does not guarantee the traversal order of for (name in nums). If ordering (by Name) needs to be guaranteed, OP can run the output through sort or, if using GNU awk, uncomment the PROCINFO["sorted_in"] line.
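If the first-seen order of the names should be kept in any POSIX awk, one option is to record each name the first time it appears and loop over that list in END. The following is a minimal sketch under that assumption; the group_concat wrapper and the order/n variables are hypothetical names introduced here for illustration, not part of the answer above.

```shell
# group_concat: hypothetical helper; reads tab-separated rows on stdin
group_concat() {
  awk '
    BEGIN { FS = OFS = "\t" }
    !($1 in nums) { order[++n] = $1 }        # remember names in first-seen order
    {
      sep = ($1 in nums) ? "," : ""          # no separator before the first value
      nums[$1]   = nums[$1]   sep $2
      gender[$1] = $3                        # last gender seen wins
      fruits[$1] = fruits[$1] sep $4
      colors[$1] = colors[$1] sep $5
    }
    END {
      for (i = 1; i <= n; i++) {
        k = order[i]
        print k, nums[k], gender[k], fruits[k], colors[k]
      }
    }'
}

printf 'Bob\t24\tM\tapples\tred\nBob\t12\tM\tapples\tgreen\nLinda\t56\tF\tapples\tred\nLinda\t102\tF\tbananas\tyellow\n' | group_concat
```

Because the order is tracked explicitly, no sort step or GNU-only feature is needed, at the cost of one extra array.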
You never need sed when you're using awk.
Assuming your key values (first fields) are grouped as shown in your example (if not, sort the file first), you can do this without reading the whole file into memory and for any number of input fields. You just have to identify which field numbers don't accumulate values, i.e. fields 1 and 3 in this case:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
$1 != vals[1] {
if ( NR>1 ) {
prt()
}
delete vals
}
{
for ( i=1; i<=NF; i++ ) {
pre = ( (i in vals) && (i !~ /^[13]$/) ? vals[i] "," : "" )
vals[i] = pre $i
}
}
END { prt() }
function prt( i) {
for ( i=1; i<=NF; i++ ) {
printf "%s%s", vals[i], (i<NF ? OFS : ORS)
}
}
$ awk -f tst.awk file
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
This is related to my previous question, "bash command for group by count".
What if I want to generalize this? For instance
The input file is
ABC|1|2
ABC|3|4
BCD|7|2
ABC|5|6
BCD|3|5
The output should be
ABC|9|12
BCD|10|7
The result is calculated by grouping on the first column and summing the values of the 2nd and 3rd columns, similar to GROUP BY in SQL.
I tried modifying the command provided in the link but failed. I don't know whether I'm making a conceptual error or a silly mistake, but none of the commands below work.
Commands used
awk -F "|" '{arr[$1]+=$2} END arr2[$1]+=$5 END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2} END {arr2[$1]+=$5} END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2 arr2[$1]+=$5} END {for (i in arr2) {print i"|"arr[i]"|"arr2[i]}}' sample
Additionally, what I'm trying here is limited to summing two columns only. What if there are n columns and we want to perform different operations, such as addition on one column and subtraction on another? How can the command be modified further?
Example
ABC|1|2|4|......... up to n columns
ABC|4|5|6|......... up to n columns
DEF|1|4|6|......... up to n columns
Let's say a sum is needed for the first data column, maybe an average for the second, some other operation for the third, etc. How can this be tackled?
For 3 fields (key and 2 data fields):
$ awk '
BEGIN { FS=OFS="|" } # set separators
{
a[$1]+=$2 # sum second field to a hash
b[$1]+=$3 # ... b hash
}
END { # in the end
for(i in a) # loop all
print i,a[i],b[i] # and output
}' file
BCD|10|7
ABC|9|12
More generic solution for n columns using GNU awk:
$ awk '
BEGIN { FS=OFS="|" }
{
for(i=2;i<=NF;i++) # loop all data fields
a[$1][i]+=$i # sum them up to related cells
a[$1][1]=i # set field count to first cell
}
END {
for(i in a) {
for((j=2)&&b="";j<a[i][1];j++) # buffer output
b=b (b==""?"":OFS)a[i][j]
print i,b # output
}
}' file
BCD|10|7
ABC|9|12
Latter only tested for 2 fields (busy at a meeting :).
gawk approach using multidimensional array:
awk 'BEGIN{ FS=OFS="|" }{ a[$1]["f2"]+=$2; a[$1]["f3"]+=$3 }
END{ for(i in a) print i,a[i]["f2"],a[i]["f3"] }' file
a[$1]["f2"]+=$2 - summing up values of the 2nd field (f2 - field 2)
a[$1]["f3"]+=$3 - summing up values of the 3rd field (f3 - field 3)
The output:
ABC|9|12
BCD|10|7
Additional short datamash solution (will give the same output):
datamash -st\| -g1 sum 2 sum 3 <file
-s - sort the input lines
-t\| - field separator
-g1 - group by the 1st field
sum 2 sum 3 - sums up values of the 2nd and 3rd fields respectively
awk -F\| '{ array[$1]="";for (i=2;i<=NF;i++) { arr[$1,i]+=$i } } END { for (i in array) { printf "%s",i;for (p=2;p<=NF;p++) { printf "|%s",arr[i,p] } print "" } }' filename
We use two arrays, array and arr. array is a one-dimensional array tracking every distinct first field. arr is keyed on the first field and the field index (awk joins the two subscripts into a single key), so after the first input line arr["ABC",2] is 1 and arr["ABC",3] is 2. At the end we loop through array and, for each key, through each field index, pulling the sums out of arr. Note that for (i in array) visits the keys in an unspecified order.
This will work in any awk and will retain the input keys order in the output:
$ cat tst.awk
BEGIN { FS=OFS="|" }
!seen[$1]++ { keys[++numKeys] = $1 }
{
for (i=2;i<=NF;i++) {
sum[$1,i] += $i
}
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s%s", key, OFS
for (i=2;i<=NF;i++) {
printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
}
}
}
$ awk -f tst.awk file
ABC|9|12
BCD|10|7
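The question also asks about applying a different operation to each column (a sum for one, an average for another). A minimal sketch of that idea, assuming the sample input above and assuming "sum of field 2, average of field 3" as the wanted operations; the sum_and_avg wrapper and the sum2/sum3/cnt array names are hypothetical:

```shell
# sum_and_avg: hypothetical helper; reads |-separated rows on stdin
sum_and_avg() {
  awk '
    BEGIN { FS = OFS = "|" }
    !seen[$1]++ { keys[++numKeys] = $1 }     # keep input key order
    {
      sum2[$1] += $2                         # running sum of field 2
      sum3[$1] += $3                         # running sum of field 3
      cnt[$1]++                              # rows per key, for the average
    }
    END {
      for (k = 1; k <= numKeys; k++) {
        key = keys[k]
        print key, sum2[key], sum3[key] / cnt[key]
      }
    }'
}

printf 'ABC|1|2\nABC|3|4\nBCD|7|2\nABC|5|6\nBCD|3|5\n' | sum_and_avg
```

For the sample this prints ABC|9|4 and BCD|10|3.5. Other per-column operations (min, max, difference) would follow the same pattern: one accumulator array per column, combined in END.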
I want to override the child_value with the parent_value using awk. The solution must be generic and applicable to larger data sources. The parent record is defined by $1==$2.
This is my input file (format: ID;PARENT_ID;VALUE):
10;20;child_value
20;20;parent_value
This is the result I want:
10;20;parent_value
20;20;parent_value
This is my current approach:
awk -F\;
BEGIN {
OFS = FS
}
{
if ($1 == $2) {
mapping[$1] = $3
}
all[$1]=$0
}
END {
for (i in all) {
if (i[$3] == 'child_value') {
i[$3] = mapping[i]
}
print i
}
}
' file.in
Needless to say, it doesn't work like that ;-) Can anyone help?
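For multiple parent/child pairs, perhaps on non-consecutive lines, read the file twice (file{,} is brace expansion for file file: the first pass collects the parent values, the second rewrites the children):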
$ awk -F\; -v OFS=\; 'NR==FNR {if($1==$2) a[$2]=$3; next}
$1!=$2 {$3=a[$2]}1' file{,}
10;20;parent_value
20;20;parent_value
assumes the second field is the parent id.
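When the input cannot be read twice (a pipe, for example), the buffering approach attempted in the question can be made to work. This is a minimal sketch, assuming the whole input fits in memory; the inherit_parent_value wrapper and the id/pid/val/parent array names are hypothetical. Unlike the two-pass command above, it keeps a child's own value when no parent record exists.

```shell
# inherit_parent_value: hypothetical helper; reads ID;PARENT_ID;VALUE rows on stdin
inherit_parent_value() {
  awk '
    BEGIN { FS = OFS = ";" }
    $1 == $2 { parent[$1] = $3 }                  # parent record: remember its value
    { id[NR] = $1; pid[NR] = $2; val[NR] = $3 }   # buffer every record in input order
    END {
      for (i = 1; i <= NR; i++) {
        # use the parent value when one is known, else keep the original value
        v = (pid[i] in parent) ? parent[pid[i]] : val[i]
        print id[i], pid[i], v
      }
    }'
}

printf '10;20;child_value\n20;20;parent_value\n' | inherit_parent_value
```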
Well, if your data is sorted in descending order before processing (you could use sort if it is not sorted at all, or tac if it is sorted in ascending order), it's enough to hash the first entry of each key in $2 and use that value for the following records with the same key in $2. This relies on the parent record sorting ahead of its children, as it does in the sample:
$ sort -t\; -k2nr -k1nr bar | \
awk '
BEGIN{
FS=OFS=";"
}
{
if($2 in a) # if $2 in hash a, use it
$3=a[$2]
else # else add it
a[$2]=$3
if(p!=$2) # delete previous entries from wasting memory
delete a[p]
p=$2 # p is for previous on next round
}1'
20;20;parent_value
10;20;parent_value
If I have a number of files, such as
1.txt:
1;ab, bc
2;cd, de, ef
3;fgh
2.txt:
4;bc
1;cd, ef
5;ab
2;g
3.txt:
5;ef, hl
7;a, b, c
3;k, jk
1;b
6;x
Assuming that ; is a delimiter and the first column serves as ID, how to concatenate corresponding second columns (using eg. commas), so that the output becomes
output.txt:
1;ab, bc, cd, ef, b
2;cd, de, ef, g
3;fgh, k, jk
4;bc
5;ab, ef, hl
7;a, b, c
6;x
awk to the rescue!
$ awk -F";" '{a[$1]=a[$1]?a[$1]","$2:$2}
END{for(k in a) print k";"a[k]}' file{1,2,3} | sort
1;ab, bc,cd, ef,b
2;cd, de, ef,g
3;fgh,k, jk
4;bc
5;ab,ef, hl
6;x
7;a, b, c
join(1) is designed to join two files, and its input has to be sorted, so why bother? The awk source:
#!/usr/bin/env awk -f
BEGIN { FS = ";" }
FNR==NR { a[$1] = $2; next}
{
if ($1 in a) {
a[$1] = a[$1] ", " $2
} else {
a[$1] = $2;
}
}
END {
for (i in a) {
printf("%s%s%s\n", i,FS,a[i]);
}
}
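Neither answer reproduces the ID order shown in the question (1, 2, 3, 4, 5, 7, 6), which is the order in which the IDs first appear. A minimal sketch that keeps that order and the ", " separator, assuming the three sample files; the merge_by_id wrapper, the order/n variables and the temporary directory are hypothetical and only there for illustration:

```shell
# merge_by_id: hypothetical helper; takes the input files as arguments
merge_by_id() {
  awk '
    BEGIN { FS = ";" }
    !($1 in a) { order[++n] = $1; a[$1] = $2; next }   # first time this ID is seen
    { a[$1] = a[$1] ", " $2 }                          # later rows: append
    END { for (i = 1; i <= n; i++) print order[i] FS a[order[i]] }
  ' "$@"
}

# recreate the sample files from the question in a scratch directory
tmp=$(mktemp -d)
printf '1;ab, bc\n2;cd, de, ef\n3;fgh\n'          > "$tmp/1.txt"
printf '4;bc\n1;cd, ef\n5;ab\n2;g\n'              > "$tmp/2.txt"
printf '5;ef, hl\n7;a, b, c\n3;k, jk\n1;b\n6;x\n' > "$tmp/3.txt"

merge_by_id "$tmp/1.txt" "$tmp/2.txt" "$tmp/3.txt"
```

The scratch directory in "$tmp" can be removed once the output has been checked.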
I have a tab-separated fileA where the 12th column (counting from 1) contains several comma-separated identifiers. Some of them, however, occur more than once in the same row:
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(some have white space after the comma, some do not).
I would like to only get the unique identifiers and remove the multiples for each row in the 12th column:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
Here is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
The idea was to go over each row at a time, cut out the 12th column, replace all commas with newlines and then sort and take uniq to get rid of duplicates, paste it back and print the columns in the right order, skipping the original identifier column.
However, this does not seem to work. Any ideas?
Just for completeness, and because I personally prefer Perl over Awk for this sort of thing, here's a Perl one-liner solution:
perl -F'\t' -le '%u=();@k=split/,\s*/,$F[11];@u{@k}=@k;$F[11]=join",",sort
keys%u;print join"\t",@F'
Explanation:
-F'\t' Loop over input lines, splitting each one into fields at tabs (from Perl 5.20 on, -F implies -a and -n)
-l automatically remove newlines from input and append on output
-e get code to execute from the next argument instead of standard input
%u = (); # clear out the hash variable %u
@k = split /,\s*/, $F[11]; # Split 12th field (1st is 0) on comma plus optional white space into array @k
@u{@k} = @k; # Hash slice: store each element of @k in %u as both key and value
Because hash keys are unique, that last step means that the keys of %u are now a deduplicated copy of @k.
$F[11] = join ",", sort keys %u; # replace the 12th field with the sorted unique list
print join "\t", @F; # and print out the modified line
If I understand you correctly, then with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
This works as follows:
BEGIN { OFS = FS } # output FS same as input FS
{
delete b # clear dirty table from last pass
n = split($12, a, /, */) # split 12th field into tokens,
$12 = "" # then clear it out for reassembly
for(i = 1; i <= n; ++i) { # wade through those tokens
if(!(a[i] in b)) { # those that haven't been seen yet:
b[a[i]] # remember that they were seen
$12 = $12 a[i] "," # append to result
}
}
sub(/,$/, "", $12) # remove trailing comma from resulting field
print # print the transformed line
}
The delete b; has been POSIX-conforming for only a short while, so if you're working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept.
Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
split($2,f,/ *, */)
$2 = ""
delete seen
for (i=1;i in f;i++) {
if ( !seen[f[i]]++ ) {
$2 = $2 (i>1?",":"") f[i]
}
}
print
}
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen you can use split("",seen).
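The same idea can be wrapped so that the column number is a parameter instead of being hard-coded. A minimal sketch, assuming tab-separated input and comma-separated subfields with optional spaces; the dedupe_field wrapper and the col/tok/out names are hypothetical:

```shell
# dedupe_field: hypothetical helper; usage: dedupe_field COLUMN < file
dedupe_field() {
  awk -v col="$1" '
    BEGIN { FS = OFS = "\t" }
    {
      n = split($col, tok, / *, */)      # subfields, spaces around commas dropped
      split("", seen)                    # portable way to empty an array
      out = ""
      for (i = 1; i <= n; i++)
        if (tok[i] != "" && !seen[tok[i]]++)
          out = out (out == "" ? "" : ",") tok[i]
      $col = out                         # assigning a field rebuilds $0 with OFS
      print
    }'
}

printf 'a,a,a\tGO:0042302, GO:0042302, GO:0042302\tb,b,b\n' | dedupe_field 2
```

For the question's file this would be called as dedupe_field 12 < fileA. First-seen order of the identifiers is kept; nothing is sorted.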
Using this awk:
awk -F '\t' -v OFS='\t' '{
delete seen;
s = "";                      # reset the result for every row
n = split($12, a, /[,; ]+/);
for (i=1; i<=n; i++) {
if (!(a[i] in seen)) {
seen[a[i]];
s = s (s == "" ? "" : ",") a[i]   # no trailing comma
}
}
$12=s} 1' file
The 12th field then becomes:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
In your example data, a comma followed by a space delimits the subfields of the 12th field, and every subfield after the first is merely a repeat of the first. The subfields appear to already be in sorted order.
GO:0042302, GO:0042302, GO:0042302
^^^dup1^^^ ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
^^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^
Based on that, you could simply keep the first of the subfields and toss the rest (OFS is set so the rebuilt line stays tab-separated):
awk 'BEGIN{FS=OFS="\t"} {sub(/, .*/, "", $12)} 1' fileA
If instead you can have different sets of repeated subfields, where the keys are not sorted, like this:
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
then, if you were stuck with the default macOS awk, you could add sort/uniq functions to an executable awk script:
#!/usr/bin/awk -f
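BEGIN {FS=OFS="\t"} # keep tabs when the line is rebuilt
{
c = uniq(a, split($12, a, /, |,/))
sort(a, c)
s = a[1]
for(i=2; i<=c; i++) { s = s "," a[i] }
$12 = s # write the result back to the 12th field
}
47 # any non-zero pattern prints the modified line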
# take an indexed arr as from split and de-dup it
function uniq(arr, len, i, uarr) {
for(i=len; i>=1; i--) { uarr[arr[i]] }
delete arr
for(k in uarr) { arr[++i] = k }
return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len, haschanged, tmp, i)
{
haschanged = 1
while( haschanged==1 ) {
haschanged = 0
for(i=1; i<=(len-1); i++) {
if( arr[i] > arr[i+1] ) {
tmp = arr[i]
arr[i] = arr[i + 1]
arr[i + 1] = tmp
haschanged = 1
}
}
}
}
If you had GNU-awk, I think you could swap out the sort(a, c) call with asort(a), and drop the bubble-sort local function completely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281