Finding hits based on subgroups (A, B, C) in big data set - bash

I have a huge amount of data to analyze.
From this example file I need to know all hits ("samples") that were found either
only in B,
only in C,
in A and C or
in A and B,
but not the ones that are found
in B and C or
in A, B and C.
[edit: to keep it simpler: there should be no co-occurrence of B and C]
These letters are found in column $8.
The first two columns together can be used as an identifier for each "sample".
Example: You can see that for "463;88" we find A and C in column $8, which makes "463;88" a hit that I need in a separate output file. "348;64" is found in A, B and C and would therefore be discarded/ignored.
File1.csv
463;88;1;193187729;280062;CDC73;IS;A;0.0
463;88;1;193188065;280062;CDC73;IS;A;0.0
463;88;1;193188527;280062;CDC73;IS;A;0.0
463;88;1;193188542;280062;CDC73;IS;C;0.0
348;64;1;155219446;384172;GBAP1;IS;B;0.0
348;64;1;155224629;384172;GBAP1;IS;C;0.0
348;64;1;155224965;384172;GBAP1;IS;A;0.0
71;35;2;27400461;145220;PPM1G;IS;A;0.0
71;35;2;27400930;145220;PPM1G;IS;A;0.0
71;35;2;27401162;145220;PPM1G;IS;A;0.0
71;35;2;27403518;145220;PPM1G;IS;B;0.0
71;35;2;27403545;145220;PPM1G;IS;B;0.0
71;35;2;27404353;145220;PPM1G;IS;B;0.0
71;35;2;27419156;145220;NRBP1;IS;B;0.0
7;14;20;2894103;92099;PTPRA;IS;B;0.0
7;14;20;2906211;92099;PTPRA;IS;C;0.0
7;14;20;2907301;92099;PTPRA;IS;C;0.0
...
Does anyone have a suggestion for how to do this, e.g. with bash, awk, grep...?
It does not need to be very efficient or fast; it just needs to be reliable.
Edit:
In several steps, I generated a CSV table with columns 1 and 2 of the lines that contain <3 different entries in column $8:
awk -F';' '{print $1, $2, $8}' File1.csv | sort -n | uniq > file.tmp1
awk '{print $1, $2}' file.tmp1 | sort -n | uniq -c | sed ... > file.tmp2 (the sed just reshapes this into csv)
Finally,
awk to print only the identifier columns from file.tmp2 where the count was <3 (i.e. only one or two different letters in column $8 of the original file).
File2.csv
6;3;
12;9;
348;40;
463;88;
...
Then, I wanted to use
fgrep --file=File2.csv File1.csv
but this does not seem to work properly, and it still requires manual analysis since it also gives me false hits.

Another alternative: this keeps only the keys of the lines to be deleted, but scans the file twice. Also, the file doesn't need to be sorted.
$ awk -F';' '{k=$1 FS $2}
NR==FNR {if($8=="B") b[k];
else if($8=="C") c[k];
if(k in b && k in c) d[k];
next}
!(k in d)' file{,}
463;88;1;193187729;280062;CDC73;IS;A;0.0
463;88;1;193188065;280062;CDC73;IS;A;0.0
463;88;1;193188527;280062;CDC73;IS;A;0.0
463;88;1;193188542;280062;CDC73;IS;C;0.0
71;35;2;27400461;145220;PPM1G;IS;A;0.0
71;35;2;27400930;145220;PPM1G;IS;A;0.0
71;35;2;27401162;145220;PPM1G;IS;A;0.0
71;35;2;27403518;145220;PPM1G;IS;B;0.0
71;35;2;27403545;145220;PPM1G;IS;B;0.0
71;35;2;27404353;145220;PPM1G;IS;B;0.0
71;35;2;27419156;145220;NRBP1;IS;B;0.0
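By the way, file{,} is just bash brace expansion: it expands to the same file name twice, which is how awk ends up reading the file in two passes:
$ echo file{,}
file file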
With gawk's bitwise operations, this can be further simplified to
$ awk -F';' 'BEGIN {c["B"]=1; c["C"]=2}
{k=$1 FS $2}
NR==FNR {d[k]=or(d[k],c[$8]); next}
d[k]!=3' file{,}
or() is idempotent here: it sets a bit in the array value whenever "B" or "C" is seen for a key. If both have been seen, the value becomes 3, and in the second pass everything else is printed.
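To see the bit accumulation in isolation (a quick check, assuming gawk for the or() builtin):
$ gawk 'BEGIN { v=0; v=or(v,1); print v; v=or(v,1); print v; v=or(v,2); print v }'
1
1
3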

You don't show the expected output in your question so it's a guess but is this what you're looking for?
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ curr = $1 FS $2 }
curr != prev { prt(); prev=curr }
{ lines[++numLines]=$0; seen[$8]++ }
END { prt() }
function prt() {
if ( !(seen["B"] && seen["C"]) ) {
for ( lineNr=1; lineNr<=numLines; lineNr++) {
print lines[lineNr]
}
}
delete seen
numLines=0
}
$ awk -f tst.awk file
463;88;1;193187729;280062;CDC73;IS;A;0.0
463;88;1;193188065;280062;CDC73;IS;A;0.0
463;88;1;193188527;280062;CDC73;IS;A;0.0
463;88;1;193188542;280062;CDC73;IS;C;0.0
71;35;2;27400461;145220;PPM1G;IS;A;0.0
71;35;2;27400930;145220;PPM1G;IS;A;0.0
71;35;2;27401162;145220;PPM1G;IS;A;0.0
71;35;2;27403518;145220;PPM1G;IS;B;0.0
71;35;2;27403545;145220;PPM1G;IS;B;0.0
71;35;2;27404353;145220;PPM1G;IS;B;0.0
71;35;2;27419156;145220;NRBP1;IS;B;0.0
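Note that this relies on all lines for a given $1;$2 pair being adjacent, as they are in the sample. If they might not be, grouping them first should be enough, e.g. (an untested sketch, sorting numerically on the two key fields):
$ sort -t';' -k1,1n -k2,2n file | awk -f tst.awk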

Something like this should work, unless you run out of memory:
BEGIN { FS=";"; }
{
keys[$1,$2] = 1;
# keep every line for this key/letter combination, not just the last one seen
data[$1,$2,$8] = (($1,$2,$8) in data ? data[$1,$2,$8] ORS : "") $0;
}
END {
for (key in keys) {
a = data[key,"A"];
b = data[key,"B"];
c = data[key,"C"];
if (!(b && c)) {
if (a) { print a; }
if (b) { print b; }
if (c) { print c; }
}
}
}
Assuming that all lines with the same key are consecutive, this should work:
BEGIN { FS=";"; }
$1";"$2 != key {
  if (key != "" && !(data["B"] && data["C"])) {
    print key;
  }
  delete data;
  key = $1";"$2;
}
{
  data[$8] = 1;
}
END {
  if (key != "" && !(data["B"] && data["C"])) {
    print key;
  }
}
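The keys this prints can then be turned back into full lines with one more pass, e.g. (a sketch; keys.awk is just a hypothetical name for the script above saved to a file):
$ awk -f keys.awk File1.csv > good_keys
$ awk -F';' 'NR==FNR {ok[$0]; next} ($1 FS $2) in ok' good_keys File1.csv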

Related

Get comma separated list of column values based on value in another column

I want to get a comma-separated list of all of the values in certain columns (2,4,5) based on the value in column 1 of a tab-delimited file.
I was working on adapting the command below, but it gives me a list of all the values in the column, not just the ones for each person, and I'm not sure how to do that.
awk -F"\t" '{print $2}' $i | sed -z 's/\n/,/g;s/,$/\n/'
This is what I am working with
Bob 24 M apples red
Bob 12 M apples green
Linda 56 F apples red
Linda 102 F bananas yellow
And this is what I would like to get (I want to keep duplicates and the order)
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
Assumptions:
for duplicate names the gender will always be the same; otherwise save the 'last' one seen
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
{ nums[$1] = nums[$1] sep[$1] $2
gender[$1] = $3
fruits[$1] = fruits[$1] sep[$1] $4
colors[$1] = colors[$1] sep[$1] $5
sep[$1] = ","
}
END { # PROCINFO["sorted_in"]="@ind_str_asc" # this line requires GNU awk
for (name in nums)
print name,nums[name],gender[name],fruits[name],colors[name]
}
' input.tsv
This generates:
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
NOTE: this just happens to display the output in Name order; if ordering (by Name) needs to be guaranteed, OP can run the output through sort, or, if using GNU awk, uncomment the PROCINFO["sorted_in"] line.
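For example, piping the output through sort to order on the first (Name) field, with '...' standing in for the program above:
$ awk '...' input.tsv | sort -k1,1
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow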
You never need sed when you're using awk.
Assuming your key values (first fields) are grouped as shown in your example (if not, sort the file first), then without reading the whole file into memory, and for any number of input fields (you just have to identify which field numbers don't accumulate values, i.e. fields 1 and 3 in this case), you can do:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
$1 != vals[1] {
if ( NR>1 ) {
prt()
}
delete vals
}
{
for ( i=1; i<=NF; i++ ) {
pre = ( (i in vals) && (i !~ /^[13]$/) ? vals[i] "," : "" )
vals[i] = pre $i
}
}
END { prt() }
function prt( i) {
for ( i=1; i<=NF; i++ ) {
printf "%s%s", vals[i], (i<NF ? OFS : ORS)
}
}
$ awk -f tst.awk file
Bob 24,12 M apples,apples red,green
Linda 56,102 F apples,bananas red,yellow
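If the names were not already grouped, a stable sort on just the first field would group them while preserving each name's original line order, e.g. (a sketch; -s is a GNU/BSD sort option):
$ sort -s -k1,1 file | awk -f tst.awk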

Bash group by on the basis of n number of columns

This is related to my previous question that I asked (bash command for group by count).
What if I want to generalize this? For instance
The input file is
ABC|1|2
ABC|3|4
BCD|7|2
ABC|5|6
BCD|3|5
The output should be
ABC|9|12
BCD|10|7
The result is calculated by grouping on the first column and adding up the values of the 2nd and 3rd columns, similar to a GROUP BY in SQL.
I tried modifying the command provided in the link but failed. I don't know whether I'm making a conceptual error or a silly mistake, but none of the commands I tried are working.
Command used
awk -F "|" '{arr[$1]+=$2} END arr2[$1]+=$5 END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2} END {arr2[$1]+=$5} END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2 arr2[$1]+=$5} END {for (i in arr2) {print i"|"arr[i]"|"arr2[i]}}' sample
Additionally, what I'm trying here is limited to summing just two columns. What if there are n columns and we want to perform operations such as addition on one column and subtraction on another? How can that further be modified?
Example
ABC|1|2|4|......... upto n columns
ABC|4|5|6|......... upto n columns
DEF|1|4|6|......... upto n columns
let's say a sum is needed for the first column, maybe an average for the second column, some other operation for the third column, etc. How can this be tackled?
For 3 fields (key and 2 data fields):
$ awk '
BEGIN { FS=OFS="|" } # set separators
{
a[$1]+=$2 # sum second field to a hash
b[$1]+=$3 # ... b hash
}
END { # in the end
for(i in a) # loop all
print i,a[i],b[i] # and output
}' file
BCD|10|7
ABC|9|12
More generic solution for n columns using GNU awk:
$ awk '
BEGIN { FS=OFS="|" }
{
for(i=2;i<=NF;i++) # loop all data fields
a[$1][i]+=$i # sum them up to related cells
a[$1][1]=i # set field count to first cell
}
END {
for(i in a) {
b="" # reset the output buffer
for(j=2;j<a[i][1];j++) # loop all data fields, buffer output
b=b (b==""?"":OFS) a[i][j]
print i,b # output
}
}' file
BCD|10|7
ABC|9|12
The latter was only tested for 2 fields (busy at a meeting :).
gawk approach using multidimensional array:
awk 'BEGIN{ FS=OFS="|" }{ a[$1]["f2"]+=$2; a[$1]["f3"]+=$3 }
END{ for(i in a) print i,a[i]["f2"],a[i]["f3"] }' file
a[$1]["f2"]+=$2 - summing up values of the 2nd field (f2 - field 2)
a[$1]["f3"]+=$3 - summing up values of the 3rd field (f3 - field 3)
The output:
ABC|9|12
BCD|10|7
Additional short datamash solution (will give the same output):
datamash -st\| -g1 sum 2 sum 3 <file
-s - sort the input lines
-t\| - field separator
sum 2 sum 3 - sums up values of the 2nd and 3rd fields respectively
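Since the follow-up asks about mixing operations (sum for one column, average for another), note that datamash can apply a different operation per column, e.g. (assuming GNU datamash):
datamash -st\| -g1 sum 2 mean 3 <file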
awk -F\| '{ array[$1]="";for (i=1;i<=NF;i++) { arr[$1,i]+=$i } } END { for (i in array) { printf "%s",i;for (p=2;p<=NF;p++) { printf "|%s",arr[i,p] } print "" } }' filename
We use two arrays (array and arr): array is a single-dimensional array tracking all the first pieces, and arr is a multidimensional array keyed on the first piece and then the field index, so after the first line, for example, arr["ABC",2]=1 and arr["ABC",3]=2. At the end we loop through array and then through each field in the data set, pulling the data out of the multidimensional array arr.
This will work in any awk and will retain the input keys order in the output:
$ cat tst.awk
BEGIN { FS=OFS="|" }
!seen[$1]++ { keys[++numKeys] = $1 }
{
for (i=2;i<=NF;i++) {
sum[$1,i] += $i
}
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s%s", key, OFS
for (i=2;i<=NF;i++) {
printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
}
}
}
$ awk -f tst.awk file
ABC|9|12
BCD|10|7
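For the mixed-operations part of the question (e.g. sum of the 2nd field, average of the 3rd), here is a minimal sketch along the same lines; the per-column operations are hard-coded and the names are only illustrative:
$ cat mixed.awk
BEGIN { FS=OFS="|" }
!seen[$1]++ { keys[++numKeys] = $1 }
{
  sums[$1] += $2             # 2nd field: running sum
  tot[$1] += $3; cnt[$1]++   # 3rd field: total and count for the average
}
END {
  for (keyNr=1; keyNr<=numKeys; keyNr++) {
    key = keys[keyNr]
    print key, sums[key], tot[key]/cnt[key]
  }
}
$ awk -f mixed.awk file
ABC|9|4
BCD|10|3.5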

Awk replace after loop

I want to override the child_value with the parent_value using awk. The solution must be generic and applicable to larger data sources. The parent record is defined by $1==$2.
This is my input file (format: ID;PARENT_ID;VALUE):
10;20;child_value
20;20;parent_value
This is the result I want:
10;20;parent_value
20;20;parent_value
This is my current approach:
awk -F\; '
BEGIN {
OFS = FS
}
{
if ($1 == $2) {
mapping[$1] = $3
}
all[$1]=$0
}
END {
for (i in all) {
if (i[$3] == 'child_value') {
i[$3] = mapping[i]
}
print i
}
}
' file.in
Needless to say, it doesn't work like that ;-) Can anyone help?
For multiple parent/child pairs, perhaps on non-consecutive lines...
$ awk -F\; -v OFS=\; 'NR==FNR {if($1==$2) a[$2]=$3; next}
$1!=$2 {$3=a[$2]}1' file{,}
10;20;parent_value
20;20;parent_value
assumes the second field is the parent id.
Well, if your data is sorted in descending order before processing (you could use sort if it's not sorted at all, or tac to reverse the line order if it's sorted in ascending order), it's enough to hash the first entry of each key in $2 and use that value for the following records with the same key in $2:
$ sort -t\; -k2nr -k1nr bar | \
awk '
BEGIN{
FS=OFS=";"
}
{
if($2 in a) # if $2 in hash a, use it
$3=a[$2]
else # else add it
a[$2]=$3
if(p!=$2) # delete previous entries from wasting memory
delete a[p]
p=$2 # p is for previous on next round
}1'
20;20;parent_value
10;20;parent_value

How to concatenate corresponding columns from multiple files into one single file, using a specified column as ID?

If I have a number of files, such as
1.txt:
1;ab, bc
2;cd, de, ef
3;fgh
2.txt:
4;bc
1;cd, ef
5;ab
2;g
3.txt:
5;ef, hl
7;a, b, c
3;k, jk
1;b
6;x
Assuming that ; is a delimiter and the first column serves as ID, how to concatenate corresponding second columns (using eg. commas), so that the output becomes
output.txt:
1;ab, bc, cd, ef, b
2;cd, de, ef, g
3;fgh, k, jk
4;bc
5;ab, ef, hl
7;a, b, c
6;x
awk to the rescue!
$ awk -F";" '{a[$1]=a[$1]?a[$1]","$2:$2}
END{for(k in a) print k";"a[k]}' file{1,2,3} | sort
1;ab, bc,cd, ef,b
2;cd, de, ef,g
3;fgh,k, jk
4;bc
5;ab,ef, hl
6;x
7;a, b, c
Because join(1) is designed for joining two files, and the input has to be sorted, why bother. The awk source:
#!/usr/bin/env awk -f
BEGIN { FS = ";" }
FNR==NR { a[$1] = $2; next}
{
if ($1 in a) {
a[$1] = a[$1] ", " $2
} else {
a[$1] = $2;
}
}
END {
for (i in a) {
printf("%s%s%s\n", i,FS,a[i]);
}
}
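A possible way to run it (assuming the script is saved as, say, merge.awk, a hypothetical name; the trailing sort is only there because for (i in a) makes no ordering guarantee):
$ awk -f merge.awk 1.txt 2.txt 3.txt | sort -t';' -k1,1n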

Unix/Bash: Uniq on a cell

I have a tab-separated fileA where the 12th column (starting from 1) contains several comma-separated identifiers. Some of them in the same row, however, can occur more than once:
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(some with a space after the comma, some without).
I would like to only get the unique identifiers and remove the multiples for each row in the 12th column:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
Here is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
The idea was to go over each row, one at a time, cut out the 12th column, replace all commas with newlines, then sort and uniq to get rid of duplicates, paste it back, and print the columns in the right order, skipping the original identifier column.
However, this does not seem to work. Any ideas?
Just for completeness, and because I personally prefer Perl over Awk for this sort of thing, here's a Perl one-liner solution:
perl -F'\t' -le '%u=(); @k=split/,/,$F[11]; @u{@k}=@k; $F[11]=join",",sort keys %u; print join"\t",@F'
Explanation:
-F'\t' Loop over input lines, splitting each one into fields at tabs
-l automatically remove newlines from input and append on output
-e get code to execute from the next argument instead of standard input
%u = (); # clear out the hash variable %u
@k = split /,/, $F[11]; # Split 12th field (1st is 0) on comma into array @k
@u{@k} = @k; # Copy the contents of @k into %u as key/value pairs
Because hash keys are unique, that last step means that the keys of %u are now a deduplicated copy of @k.
$F[11] = join ",", sort keys %u; # replace the 12th field with the sorted unique list
print join "\t", @F; # and print out the modified line
If I understand you correctly, then with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
This works as follows:
BEGIN { OFS = FS } # output FS same as input FS
{
delete b # clear dirty table from last pass
n = split($12, a, /, */) # split 12th field into tokens,
$12 = "" # then clear it out for reassembly
for(i = 1; i <= n; ++i) { # wade through those tokens
if(!(a[i] in b)) { # those that haven't been seen yet:
b[a[i]] # remember that they were seen
$12 = $12 a[i] "," # append to result
}
}
sub(/,$/, "", $12) # remove trailing comma from resulting field
print # print the transformed line
}
The delete b; has been POSIX-conforming for only a short while, so if you're working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept.
Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
split($2,f,/ *, */)
$2 = ""
delete seen
for (i=1;i in f;i++) {
if ( !seen[f[i]]++ ) {
$2 = $2 (i>1?",":"") f[i]
}
}
print
}
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen you can use split("",seen).
Using this awk:
awk -F '\t' -v OFS='\t' '{
delete seen; s="";
split($12, a, /[,; ]+/);
for (i=1; i<=length(a); i++) {
if (!(a[i] in seen)) {
seen[a[i]];
s=sprintf("%s%s,", s, a[i])
}
}
sub(/,$/, "", s);
$12=s} 1' file
GO:0042302
GO:0004386,GO:0005524,GO:0006281
In your example data, the comma followed by a space is the delimiter between the repeated groups within the 12th field. Every subfield after that is merely a repeat of the first subfield. The subfields appear to already be in sorted order.
GO:0042302, GO:0042302, GO:0042302
^^^dup1^^^ ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
^^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^
Based on that, you could simply keep the first of the subfields and toss the rest:
awk -F"\t" -v OFS='\t' '{sub(/, .*/, "", $12)} 1' fileA
If, instead, you can have different sets of repeated subfields, where the keys are not sorted, like this:
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
If you were stuck with a default macOS awk, you could introduce sort/uniq functions in an awk executable script:
#!/usr/bin/awk -f
BEGIN {FS=OFS="\t"}
{
c = uniq(a, split($12, a, /, |,/))
sort(a, c)
s = a[1]
for(i=2; i<=c; i++) { s = s "," a[i] }
$12 = s
}
47 # print out the modified line
# take an indexed arr as from split and de-dup it
function uniq(arr, len, i, uarr, k) {
for(i=len; i>=1; i--) { uarr[arr[i]] }
delete arr
for(k in uarr) { arr[++i] = k }
return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len, haschanged, tmp, i)
{
haschanged = 1
while( haschanged==1 ) {
haschanged = 0
for(i=1; i<=(len-1); i++) {
if( arr[i] > arr[i+1] ) {
tmp = arr[i]
arr[i] = arr[i + 1]
arr[i + 1] = tmp
haschanged = 1
}
}
}
}
If you had GNU-awk, I think you could swap out the sort(a, c) call with asort(a), and drop the bubble-sort local function completely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281
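Following up on the asort() remark above: a gawk-only sketch of the same script (assuming gawk; the uniq() helper stays, the bubble sort goes) could look like this:
#!/usr/bin/gawk -f
BEGIN { FS = OFS = "\t" }
{
  c = uniq(a, split($12, a, /, |,/))
  asort(a) # gawk builtin: sorts the values, re-indexing them 1..c
  s = a[1]
  for (i = 2; i <= c; i++) { s = s "," a[i] }
  $12 = s
}
47 # print out the modified line
# take an indexed arr as from split and de-dup it
function uniq(arr, len, i, uarr, k) {
  for (i = len; i >= 1; i--) { uarr[arr[i]] }
  delete arr
  for (k in uarr) { arr[++i] = k }
  return( i )
}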
