Replacing values in large table using conversion table - bash

I am trying to replace values in a large space-delimited text-file and could not find a suitable answer for this specific problem:
Say I have a file "OLD_FILE", containing a header and approximately 2 million rows:
COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A
rs1000000 12 125456933 G A
rs10000010 4 21227772 T C
rs10000012 4 1347325 G C
rs10000013 4 36901464 C A
rs10000017 4 84997149 T C
rs1000002 3 185118462 T C
rs10000023 4 95952929 T G
...
I want to replace the first value of each row with a corresponding value, using a large (2.8M rows) conversion table. In this conversion table, the first column lists the value I want to have replaced, and the second column lists the corresponding new values:
COL1_b36 COL2_b37
rs10 7_92383888
rs1000000 12_126890980
rs10000010 4_21618674
rs10000012 4_1357325
rs10000013 4_37225069
rs10000017 4_84778125
rs1000002 3_183635768
rs10000023 4_95733906
...
The desired output would be a file where all values in the first column have been changed according to the conversion table:
COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
...
Additional info:
Performance is an issue (the following command takes approximately a year):
while read a b; do sed -i "s/\b$a\b/$b/g" OLD_FILE ; done < CONVERSION_TABLE
A complete match is necessary before replacing
Not every value in the OLD_FILE can be found in the conversion table...
...but every value that should be replaced can be found in the conversion table.
Any help is very much appreciated.

Here's one way using awk:
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE
Results:
COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
Explanation, in order of appearance:
NR==1 { next } # simply skip processing the first line (header) of
# the first file in the arguments list (TABLE)
FNR==NR { ... } # This is a construct that only returns true for the
# first file in the arguments list (TABLE)
a[$1]=$2 # So when we loop through the TABLE file, we add the
# column one to an associative array, and we assign
# this key the value of column two
next # This simply skips processing the remainder of the
# code by forcing awk to read the next line of input
$1 in a { ... } # Now when awk has finished processing the TABLE file,
# it will begin reading the second file in the
# arguments list which is OLD_FILE. So this construct
# is a condition that returns true literally if column
# one exists in the array
$1=a[$1] # re-assign column one's value to be the value held
# in the array
1 # The 1 on the end simply enables default printing. It
# would be like saying: $1 in a { $1=a[$1]; print $0 }'

This might work for you (GNU sed):
sed -r '1d;s|(\S+)\s*(\S+).*|/^\1\\>/s//\2/;t|' table | sed -f - file
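For reference, the first sed turns each row of the conversion table into one substitution command, so the script it feeds to the second sed (via -f -) starts out like this:
/^rs10\>/s//7_92383888/;t
/^rs1000000\>/s//12_126890980/;t
...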

You can use join:
join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort)
This drops the headers of both files; you can add the header back with head -n1 file1 (see the sketch after the output below).
Output:
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
7_92383888 7 92221824 C A
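Putting that together, a minimal sketch (assuming, as above, that file1 is OLD_FILE and file2 is the conversion table; NEW_FILE is just a placeholder name):
{ head -n1 file1; join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort); } > NEW_FILE
Note that a plain join drops rows of file1 that have no match in file2; see join's -a option if you need to keep them.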

Another way with join. Assuming the files are sorted on the 1st column:
head -1 OLD_FILE
join <(tail -n+2 CONVERSION_TABLE) <(tail -n+2 OLD_FILE) | cut -f 2-6 -d' '
But with data of this size you should consider using a database engine.
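If you do go down the database route, a rough sqlite3 sketch might look like the following (untested; lookup.db, old.body, conv.body and the table/column names are all invented for illustration, and the headers are stripped before importing):
# strip the header rows so they are not imported as data
tail -n+2 OLD_FILE > old.body
tail -n+2 CONVERSION_TABLE > conv.body
sqlite3 lookup.db <<'SQL'
CREATE TABLE old(col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT, col5 TEXT);
CREATE TABLE conv(old_id TEXT, new_id TEXT);
.separator " "
.import old.body old
.import conv.body conv
CREATE INDEX conv_idx ON conv(old_id);
-- keep the original value when there is no match in the conversion table
SELECT COALESCE(c.new_id, o.col1), o.col2, o.col3, o.col4, o.col5
FROM old o LEFT JOIN conv c ON o.col1 = c.old_id;
SQL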

Related

Bash: reshape a dataset of many rows to dataset of many columns

Suppose I have the following data:
# the values are shown as-is; I want to reshape exactly as below
0 a
1 b
2 c
0 d
1 e
2 f
0 g
1 h
2 i
...
And I would like to reshape the data such that it is:
0 a d g ...
1 b e h ...
2 c f i ...
Without writing a complex composition. Is this possible using the unix/bash toolkit?
Yes, trivially I can do this inside a programming language; the idea is NOT to "just" do that. What I am looking for is something like cat X.csv | rs [magic options] (rs, the reshape utility, would be great, except it isn't working here on Debian Stretch).
Otherwise, an answer that involves a long composition of commands or a script is out of scope: I already have that, but would rather not use it.
Using GNU datamash:
$ datamash -s -W -g 1 collapse 2 < file
0 a,d,g
1 b,e,h
2 c,f,i
Options:
-s sort
-W use whitespace (spaces or tabs) as delimiters
-g 1 group on the first field
collapse 2 print comma-separated list of values of the second field
To convert the tabs and commas to space characters, pipe the output to tr:
$ datamash -s -W -g 1 collapse 2 < file | tr '\t,' ' '
0 a d g
1 b e h
2 c f i
bash version:
function reshape {
    local index number key
    declare -A result
    while read -r index number; do
        result[$index]+=" $number"
    done
    for key in "${!result[@]}"; do
        echo "$key${result[$key]}"
    done
}
reshape < input
We just need to make sure the input is in Unix format (no carriage returns from Windows line endings).
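A quick way to strip carriage returns, if needed (a simple sketch; dos2unix would do the same job, and input.crlf is just a hypothetical name for a Windows-format copy):
tr -d '\r' < input.crlf > input   # remove carriage returns
reshape < input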

Joining lines, modulo the number of records

Say my stream is x*N lines long, where x is the number of records and N is the number of columns per record, and is output column-wise. For example, x=2, N=3:
1
2
Alice
Bob
London
New York
How can I join every line, modulo the number of records, back into columns:
1 Alice London
2 Bob New York
If I use paste with a run of - arguments, I get the transposed output. I could use split with the -l option equal to x (the number of records), then recombine the pieces afterwards with paste, but I'd like to do it within the stream without spitting out temporary files all over the place.
Is there an "easy" solution (i.e., rather than invoking something like awk)? I'm thinking there may be some magic join solution, but I can't see it...
EDIT Another example, when x=5 and N=3:
1
2
3
4
5
a
b
c
d
e
alpha
beta
gamma
delta
epsilon
Expected output:
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
You are looking for pr to "columnate" the stream:
pr -T -s$'\t' -3 <<'END_STREAM'
1
2
Alice
Bob
London
New York
END_STREAM
1 Alice London
2 Bob New York
pr is in coreutils.
Most systems should include a tool called pr, intended to print files. It's part of POSIX.1 so it's almost certainly on any system you'll use.
$ pr -3 -t < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
Or if you prefer,
$ pr -3 -t -s, < inp1
1,a,alpha
2,b,beta
3,c,gamma
4,d,delta
5,e,epsilon
or
$ pr -3 -t -w 20 < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilo
Check the POSIX specification for standard usage information, or man pr for the options specific to your operating system.
In order to reliably process the input you need to either know the number of columns in the output file or the number of lines in the output file. If you just know the number of columns, you'd need to read the input file twice.
Hackish coreutils solution
# If you know the number of output columns (ncols) in advance but not
# the number of output lines, you can calculate the latter with wc -l
olines=$(( $(wc -l < file) / ncols ))
# Split the file into chunks of ${olines} lines, one chunk per output column
split -l"${olines}" file FOO   # FOO is a prefix. Choose a better one
paste FOO*
AWK solutions
If you know the number of output columns in advance you can use this awk script:
convert.awk:
BEGIN {
    # Treat the whole file as one big record whose fields are
    # separated by newlines
    RS=""
    FS="\n"
}
FNR==NR {
    # We are reading the file twice (see invocation below).
    # On the first pass we only store the number of fields (lines)
    # in the variable n, because we need it when processing the
    # file, then skip straight to the next input.
    n=NF
    next
}
{
    # n / c is the number of output lines
    # For every output line ...
    for(i=0;i<n/c;i++) {
        # ... print the columns belonging to it
        for(ii=1+i;ii<=NF;ii+=n/c) {
            printf "%s ", $ii
        }
        print ""   # adds a newline
    }
}
and call it like this:
awk -vc=3 -f convert.awk file file # Twice the same file
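With the six-line sample from the question (x=2, N=3) and c=3, this prints (apart from a trailing space on each line):
1 Alice London
2 Bob New York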
If you know the number of output lines in advance you can use the following awk script:
convert.awk:
BEGIN {
    # Treat the whole file as one big record whose fields are
    # separated by newlines
    RS=""
    FS="\n"
}
{
    # x is the number of output lines and has been passed to the
    # script. For each output line ...
    for(i=0;i<x;i++){
        # ... print the columns belonging to it
        for(ii=i+1;ii<=NF;ii+=x){
            printf "%s ",$ii
        }
        print ""   # adds a newline
    }
}
And call it like this:
awk -vx=2 -f convert.awk file

Bash - fill empty cell with following value in the column

I have a long tab-delimited file and I am trying to fill each empty cell with a value that appears later in the same column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads (e.g. an awk fill-down solution), but the problem is that the value I want is not before the empty cell, but after it, a kind of .FillUp in Excel.
Additional information:
the input file may have a different number of lines
"A" and "B" in the input file may be at different rows and not evenly spaced
the second column may take only two values
the last cell in the second column may not have a value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
Alternatively, you can process the file in a single pass, as long as you know the first value to fill in; this should be quite a bit faster than reversing the file twice:
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes knowledge about the nature of that second column being only 'A' or 'B' or blank.
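If the two-value alternation assumption doesn't hold, another single-pass sketch (plain awk, my own variation) is to buffer rows until the next filled cell appears; rows after the last filled cell still need a default, passed here as last=B:
awk -v last="B" '
$2 != "" {                      # a filled row: flush the buffered rows with its value
    for (i = 1; i <= n; i++) print buf[i], $2
    n = 0
    print
    next
}
{ buf[++n] = $0 }               # empty second column: hold the row for later
END {                           # rows after the last filled cell get the default
    for (i = 1; i <= n; i++) print buf[i], last
}' input.txt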

Moving columns from the back of a file to the front

I have a file with a very large number of columns (basically several thousand sets of threes) with three special columns (Chr and Position and Name) at the end.
I want to move these final three columns to the front of the file, so that the columns become Name Chr Position, followed by the gene trios.
I think this might be possible with awk, but I don't know enough about how awk works to do it!
Sample input:
Gene1.GType Gene1.X Gene1.Y ....ending in GeneN.Y Chr Position Name
Desired Output:
Name Chr Position (Gene1.GType Gene1.X Gene1.Y ) x n samples
I think the below example does more or less what you want.
$ cat file
A B C D E F G Chr Position Name
1 2 3 4 5 6 7 8 9 10
$ cat process.awk
{
    printf "%s %s %s", $(NF-2), $(NF-1), $NF
    for( i=1; i<NF-2; i++)
    {
        printf " %s", $i
    }
    print ""
}
$ awk -f process.awk file
Chr Position Name A B C D E F G
8 9 10 1 2 3 4 5 6 7
NF in awk denotes the number of fields on a row.
one liner:
awk '{ Chr=$(NF-2) ; Position=$(NF-1) ; Name=$NF ; $(NF-2)=$(NF-1)=$NF="" ; print Name, Chr, Position, $0 }' file
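A small variant along the same lines (a sketch) that rebuilds the record explicitly, so the three blanked-out fields don't leave stray spaces behind:
awk '{
    name = $NF; pos = $(NF-1); chr = $(NF-2)   # grab the trailing trio
    out = name OFS chr OFS pos                 # start the new record with them
    for (i = 1; i <= NF-3; i++) out = out OFS $i
    print out
}' file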

Getting the count of unique values in a column in bash

I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?
It can use any common command line language like awk, perl, python etc.
To see a frequency count for column two (for example):
awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr
fileA.txt
z z a
a b c
w d e
fileB.txt
t r e
z d a
a g c
fileC.txt
z r a
v d c
a m c
Result:
3 d
2 r
1 z
1 m
1 g
1 b
Here is a way to do it in the shell:
FIELD=2
cut -f "$FIELD" * | sort | uniq -c | sort -nr
This is the sort of thing bash is great at.
The GNU site suggests this nice awk script, which prints both the words and their frequency.
Possible changes:
You can pipe through sort -nr (and reverse word and freq[word]) to see the result in descending order.
If you want a specific column, you can omit the for loop and simply write freq[$3]++ (replace 3 with the column number); a combined example follows the script below.
Here goes:
# wordfreq.awk --- print list of word frequencies
{
$0 = tolower($0) # remove case distinctions
# remove punctuation
gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
for (i = 1; i <= NF; i++)
freq[$i]++
}
END {
for (word in freq)
printf "%s\t%d\n", word, freq[word]
}
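For instance, applying both suggested changes, counting only column 2 of tab-delimited files and printing in descending order of frequency, might look like this (a sketch, not part of the original script):
awk -F '\t' '{ freq[$2]++ } END { for (word in freq) printf "%d\t%s\n", freq[word], word }' * | sort -nr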
Perl
This code computes the occurrences of all columns, and prints a sorted report for each of them:
# columnvalues.pl
while (<>) {
    @Fields = split /\s+/;
    for $i ( 0 .. $#Fields ) {
        $result[$i]{$Fields[$i]}++
    };
}
for $j ( 0 .. $#result ) {
    print "column $j:\n";
    @values = keys %{$result[$j]};
    @sorted = sort { $result[$j]{$b} <=> $result[$j]{$a} || $a cmp $b } @values;
    for $k ( @sorted ) {
        print " $k $result[$j]{$k}\n"
    }
}
Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*
Explanation
In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure
In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
Results based on the sample input files provided by @Dennis
column 0:
a 3
z 3
t 1
v 1
w 1
column 1:
d 3
r 2
b 1
g 1
m 1
z 1
column 2:
c 4
a 3
e 2
.csv input
If your input files are .csv, change /\s+/ to /,/
Obfuscation
In an ugly contest, Perl is particularly well equipped.
This one-liner does the same:
perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*
Ruby(1.9+)
#!/usr/bin/env ruby
Dir["*"].each do |file|
h=Hash.new(0)
open(file).each do |row|
row.chomp.split("\t").each do |w|
h[ w ] += 1
end
end
h.sort{|a,b| b[1]<=>a[1] }.each{|x,y| print "#{x}:#{y}\n" }
end
Here is a tricky one approaching linear time (but probably not faster!) by avoiding sort and uniq, except for the final sort. It is based on... tee and wc instead!
$ FIELD=2
$ values="$(cut -f $FIELD *)"
$ mkdir /tmp/counts
$ cd /tmp/counts
$ echo | tee -a $values
$ wc -l * | sort -nr
9 total
3 d
2 r
1 z
1 m
1 g
1 b
$
Pure-Bash version:
FIELD=1
declare -A results
while read -a line; do
    results[${line[$FIELD]:-(empty)}]=$((results[${line[$FIELD]:-(empty)}]+1))
done < file.txt
echo "${results[@]@A}"
The key logic is to fill an associative array which keys are the values found in the file and the array's value is the number of occurrence:
$FIELD is the selected column number (bash arrays are zero-indexed, so 1 means the second column)
${line[$FIELD]} is the column value from that line in the file
${...:-(empty)} is a special case for empty values (what happens if there are fewer columns than expected?)
To have the output sorted in the expected OP format, a little more work is needed:
sort -rn < <(
    for k in "${!results[@]}"; do
        echo "${results[$k]} $k"
    done
)
Warning: it works well for tab-delimited and space-delimited files, but breaks for values that contain spaces.
