match columns in 2 tab-delimited text files - bash

I have two tab-delimited .txt files
file1 has 20 million lines and the following structure
col1 col2 col3 col4 col5
1 x x A x
2 y y A x
3 z z A x
4 x x B x
5 x y B x
6 x y E x
7 x z F x
file2 has 3000 lines and the following structure
col1
A
B
C
D
Now I want to extract from file1 the lines where col4 of file1 matches a value in col1 of file2.
So the new file3 should look like this:
col1 col2 col3 col4 col5
1 x x A x
2 y y A x
3 z z A x
4 x x B x
5 x y B x
How can I do this with perl or bash?

You can use a standard awk idiom to join the 2 files (the FNR==1 clause passes file1's header line through):
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1]; next } FNR == 1 || $4 in a' file2 file1
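A quick sanity check, recreating the sample data from the question as tab-delimited files (file names as in the question):

```shell
# Recreate the sample inputs as tab-delimited files.
printf 'col1\tcol2\tcol3\tcol4\tcol5\n1\tx\tx\tA\tx\n2\ty\ty\tA\tx\n3\tz\tz\tA\tx\n4\tx\tx\tB\tx\n5\tx\ty\tB\tx\n6\tx\ty\tE\tx\n7\tx\tz\tF\tx\n' > file1
printf 'col1\nA\nB\nC\nD\n' > file2

# Build a lookup from file2's col1, then keep the file1 lines whose $4
# is in it; FNR==1 passes file1's header line through as well.
awk 'BEGIN{FS=OFS="\t"} FNR==NR{a[$1]; next} FNR==1 || $4 in a' file2 file1 > file3
cat file3
```

file3 now holds the header plus the five matching lines from the question.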

try this -
awk -F'[ ]+' 'NR==FNR {a[$1]++;next} $4 in a{print $0}' f2 f1
1 x x A x
2 y y A x
3 z z A x
4 x x B x
5 x y B x

Since you also asked about Perl, here's a reusable perl solution. You first read file2 and build a hash of lookup keys, then read file1, printing any line whose column 4 matches a key in the hash. Something like this might work:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $key_file = shift;
open(my $fh, "<", $key_file);
my $header = <$fh>;    # skip file2's header line
my %keys = map { chomp; $_ => 1 } <$fh>;
close $fh;

my $query_file = shift;
open(my $q_fh, "<", $query_file);
print scalar <$q_fh>;  # pass file1's header line through
while (<$q_fh>) {
    my @fields = split;
    print if $keys{$fields[3]};
}
close $q_fh;
You can run this as table_combine.pl <file2> <file1>.

Related

Bash way to compare specific columns from two different files based on an index list

I have two tab-separated files of 1708 rows each and different numbers of columns. My goal is to compare the values stored in all rows, but only in some specific columns. I have two lists containing the numbers of the columns that I want to compare; here is an example:
FileA ➝ col_ind_A = [12,20,24,55]
FileB ➝ col_ind_B = [14,28,35,79]
Here, column 12 of fileA should be compared with column 14 of fileB, column 20 of fileA with column 28 of fileB, and so on. If fileA has value 0 and fileB doesn't, I want to modify fileC (a copy of fileA) at that position, storing the value from fileB (which is not 0):
# FileA #FileB #FileC
col11 col12 col13 col13 col14 col15 col11 col12 col13
A C G A C G A C G
G 0 T G T T G T T
I've seen that comparing columns is usually done with awk, but I'm quite new to bash and I don't know how to iterate over the rows of the two files while I iterate over the col_ind lists to pick the column positions that I want to compare. Any suggestions are welcome.
If it's of any help, here is an R snippet that does exactly this (it is just too slow):
for (i in 1:1708) {          # rows
  for (j in 1:31946) {       # cols
    if (fileA[i, col_ind_A[j]] == '0' && fileA[i, col_ind_A[j]] != fileB[i, col_ind_B[j]]) {
      fileC[i, col_ind_A[j]] <- fileB[i, col_ind_B[j]]  # write the value from fileB into fileC
    }
  }
}
Any help would be great. Thanks!!
A perl script that does it:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw/say/;
use List::Util qw/pairs/;

# Adjust as needed.
my @columns = (12 => 14, 20 => 28, 24 => 35, 55 => 79);

my ($filea_name, $fileb_name) = @ARGV;
@columns = pairs map { $_ - 1 } @columns;

open my $filea, '<', $filea_name;
open my $fileb, '<', $fileb_name;
$, = " ";    # Or "\t" or whatever to delimit output columns
while (my $linea = <$filea>) {
    my $lineb = <$fileb> or die "Files have different line counts\n";
    chomp $linea;
    chomp $lineb;
    my @acols = split ' ', $linea;
    my @bcols = split ' ', $lineb;
    for my $p (@columns) {
        if ($acols[$$p[0]] eq "0" && $bcols[$$p[1]] ne "0") {
            $acols[$$p[0]] = $bcols[$$p[1]];
        }
    }
    say @acols;
}
(Takes FileA and FileB as its command line arguments)
Since you asked for an awk solution, here's a straightforward one:
awk -v col_ind_A='12 20 24 55' -v col_ind_B='14 28 35 79' '
BEGIN { OFS="\t"
split(col_ind_A, ciA)
split(col_ind_B, ciB)
while ((getline < "FileB") > 0 && split($0, B) && (getline < "FileA") > 0)
{
for (i in ciA) if ($ciA[i] == 0) $ciA[i] = B[ciB[i]]
print >"FileC"
}
}'
But this won't be faster than the R code. An optimization step for the R code would probably be to eliminate the inner loop:
for (i in 1:nrow(FileA))
{
j = which(FileA[i, col_ind_A] == 0)
FileC[i, col_ind_A[j]] = FileB[i, col_ind_B[j]]
}
First join the files line by line, then just check the condition you want to check.
# recreate input
cat >file1 <<EOF
col11 col12 col13
A C G
G 0 T
EOF
cat >file2 <<EOF
col13 col14 col15
A C G
G T T
EOF
paste file1 file2 |
awk '{ if ($2 == 0 && $2 != $5) $2 = $5; print $1, $2, $3 }'
outputs:
col11 col12 col13
A C G
G T T
Judging from for(i in 1:1708){ #rows, maybe you want to iterate over all columns; assuming both files have the same number of columns:
paste file1 file2 |
awk '{
for (i=1;i<=NF/2;++i) if ($i == 0 && $i != $(i+NF/2)) $i = $(i+NF/2);
for (i=1;i<=NF/2;++i) printf "%s%s", $i, i==NF/2?ORS:OFS;
}'

Combining 2 lines together but "interlaced"

I have 2 lines from an output as follow:
a b c
x y z
I would like to pipe both lines from the last command into a script that would combine them "interlaced", like this:
a x b y c z
The solution should work for a random number of columns from the output, such as:
a b c d e
x y z x y
Should result in:
a x b y c z d x e y
So far, I have tried using awk, perl, sed, etc., but without success. All I can do is put the output onto one line, but it won't be "interlaced":
$ echo -e 'a b c\nx y z' | tr '\n' ' ' | sed 's/$/\n/'
a b c x y z
Keep the fields of odd-numbered records in an array, and update the fields of even-numbered records from it. This interlaces each pair of successive lines of input.
prog | awk 'NR%2{split($0,a);next} {for(i in a)$i=(a[i] OFS $i)} 1'
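The one-liner can be sanity-checked with printf standing in for prog:

```shell
# Odd-numbered lines are buffered in a[]; each even line's fields are
# then prefixed with the buffered fields before the line is printed.
printf 'a b c\nx y z\n' |
awk 'NR%2{split($0,a);next} {for(i in a)$i=(a[i] OFS $i)} 1'
# -> a x b y c z
```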
Here's a 3 step solution:
$ # get one argument per line
$ printf 'a b c\nx y z' | xargs -n1
a
b
c
x
y
z
$ # split the lines into groups of 2 and combine them side by side
$ printf 'a b c\nx y z' | xargs -n1 | pr -2ts' '
a x
b y
c z
$ # combine all input lines into single line
$ printf 'a b c\nx y z' | xargs -n1 | pr -2ts' ' | paste -sd' '
a x b y c z
$ printf 'a b c d e\nx y z 1 2' | xargs -n1 | pr -2ts' ' | paste -sd' '
a x b y c z d 1 e 2
Could you please try the following; it joins every 2 lines in "interlaced" fashion (note that delete a must run after the print loop, not inside it):
awk '
FNR%2!=0 && FNR>1{
  for(j=1;j<=NF;j++){
    printf("%s%s",a[j],j==NF?ORS:OFS)
  }
  delete a
}
{
  for(i=1;i<=NF;i++){
    a[i]=(a[i]?a[i] OFS:"")$i
  }
}
END{
  for(j=1;j<=NF;j++){
    printf("%s%s",a[j],j==NF?ORS:OFS)
  }
}' Input_file
Here is a simple awk script
script.awk
NR == 1 {split($0,inArr1)} # read fields from 1st line into inArr1
NR == 2 {split($0,inArr2); # read fields from 2nd line into inArr2
  for (i = 1; i <= NF; i++) printf("%s%s%s%s", inArr1[i], OFS, inArr2[i], i==NF?ORS:OFS); # output interlaced fields, ending the line after the last pair
}
input.txt
a b c d e
x y z x y
running:
awk -f script.awk input.txt
output:
a x b y c z d x e y
Multiline awk solution:
interlaced.awk
{
  a[NR] = $0
}
END {
  split(a[1], b)
  split(a[2], c)
  for (i=1; i in b; i++) {
    printf "%s%s %s", i==1?"":OFS, b[i], c[i]
  }
  printf ORS
}
Run it like this:
foo_program | awk -f interlaced.awk
Perl will do the job. It was invented for this type of task.
echo -e 'a b c\nx y z' | \
perl -MList::MoreUtils=mesh -e \
'@f=mesh @{[split " ", <>]}, @{[split " ", <>]}; print "@f"'
a x b y c z
You can of course print out the meshed output any way you want.
Check out http://metacpan.org/pod/List::MoreUtils#mesh
You could even make it into a shell function for easy use:
function meshy {
perl -MList::MoreUtils=mesh -e \
'@f=mesh @{[split " ", <>]}, @{[split " ", <>]}; print "@f"'
}
$ echo -e 'X Y Z W\nx y z w' |meshy
X x Y y Z z W w
$
Ain't Perl grand?
This might work for you (GNU sed):
sed -E 'N;H;x;:a;s/\n(\S+\s+)(.*\n)(\S+\s+)/\1\3\n\2/;ta;s/\n//;s// /;h;z;x' file
Process two lines at a time. Append the two lines in the pattern space to the hold space, which introduces a newline at the front of the two lines. Using pattern matching and back references, nibble away at the front of each of the two lines and place the pairs at the front. Eventually the pattern match fails; then remove the first newline and replace the second with a space. Copy the amended line to the hold space, clean up the pattern space ready for the next couple of lines (if any), and print.

How to repeat lines in bash and paste with different columns?

Is there a short way in bash to repeat the first line of a file as often as needed to paste it with another file, Kronecker-product style (for the mathematicians among you)?
What I mean is, I have a file A:
a
b
c
and a file B:
x
y
z
and I want to merge them as follows:
a x
a y
a z
b x
b y
b z
c x
c y
c z
I could probably write a script that reads the files line by line and loops over them, but I am wondering if there is a short one-line command that could do the same job. I can't think of one and, as you can see, I am also lacking the keywords to search for. :-D
Thanks in advance.
You can use this one-liner awk command:
awk 'FNR==NR{a[++n]=$0; next} {for(i=1; i<=n; i++) print $0, a[i]}' file2 file1
a x
a y
a z
b x
b y
b z
c x
c y
c z
Breakup:
FNR == NR { # while processing the first file in the list
  a[++n]=$0 # store the row in array 'a' at an incrementing index
  next      # move on to the next record
}
{ # while processing the second file
  for(i=1; i<=n; i++) # iterate over the array a
    print $0, a[i]    # print the current row and the array element
}
An alternative to awk:
join <(sed 's/^/_\t/' file1) <(sed 's/^/_\t/' file2) | cut -d' ' -f2-
Add a fake key so that join matches every record of file1 with every record of file2, then trim the key off afterwards.
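The same trick works in plain sh with temporary files instead of process substitution (the _ key is arbitrary; any constant works):

```shell
printf 'a\nb\nc\n' > file1
printf 'x\ny\nz\n' > file2

# Prefix every line with the same dummy key so that join produces the
# full cross product, then cut the key back off.
sed 's/^/_ /' file1 > keyed1
sed 's/^/_ /' file2 > keyed2
join keyed1 keyed2 | cut -d' ' -f2-
```

This prints the nine pairs a x through c z, matching the desired output in the question.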

complex line copying&modifying on-the-fly with grep or sed

Is there a way to do the following with either grep or sed: read each line of a file, copy it twice, and modify each copy:
Original line:
X Y Z
A B C
New lines:
Y M X
Y M Z
B M A
B M C
where X, Y, Z, M are all integers, and M is a fixed integer (e.g. 2) that we inject while copying! I suppose a solution (if any) will be so complex that people (including me) will start bleeding after seeing it!
$ awk -v M=2 '{print $2,M,$1; print $2,M,$3;}' file
Y 2 X
Y 2 Z
B 2 A
B 2 C
How it works
-v M=2
This defines the variable M to have value 2.
print $2,M,$1
This prints the second column, followed by M, followed by the first column.
print $2,M,$3
This prints the second column, followed by M, followed by the third column.
Extended Version
Suppose that we want to handle an arbitrary number of columns: print all columns between the first and the last, followed by M, followed by the first; then print all columns between the first and the last, followed by M, followed by the last. In this case, use:
awk -v M=2 '{for (i=2;i<NF;i++)printf "%s ",$i; print M,$1; for (i=2;i<NF;i++)printf "%s ",$i; print M,$NF;}' file
As an example, consider this input file:
$ cat file2
X Y1 Y2 Z
A B1 B2 C
The above produces:
$ awk -v M=2 '{for (i=2;i<NF;i++)printf "%s ",$i; print M,$1; for (i=2;i<NF;i++)printf "%s ",$i; print M,$NF;}' file2
Y1 Y2 2 X
Y1 Y2 2 Z
B1 B2 2 A
B1 B2 2 C
The key change to the code is the addition of the following command:
for (i=2;i<NF;i++)printf "%s "
This command prints all columns from the i=2, which is the column after the first to i=NF-1 which is the column before the last. The code is otherwise similar.
Sure; you can write:
sed 's/\(.*\) \(.*\) \(.*\)/\2 M \1\n\2 M \3/'
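For instance, on the sample lines (the \n in the replacement is a GNU sed extension; substitute your fixed integer for the literal M):

```shell
printf 'X Y Z\nA B C\n' |
sed 's/\(.*\) \(.*\) \(.*\)/\2 M \1\n\2 M \3/'
# -> Y M X
#    Y M Z
#    B M A
#    B M C
```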
With bash builtin commands:
m=2; while read a b c; do echo "$b $m $a"; echo "$b $m $c"; done < file
Output:
Y 2 X
Y 2 Z
B 2 A
B 2 C

Substituting values of one column from a list of corresponding values

I want to replace the entries in one column of input file A.txt with the list given in B.txt, in corresponding order.
For example:
A.txt is tab-delimited, but within one column the values are separated by commas.
I need to change one of the entries of that column, say P=:
1 X y Z Q=Alpha,P=beta,O=Theta
2 x a b Q=Alpha,P=beta,O=Theta
3 y b c Q=Alpha,P=beta,O=Theta
4 a b c Q=Alpha,P=beta,O=Theta
5 x y z Q=Alpha,P=beta,O=Theta
B.txt is
1 gamma
2 alpha
3 alpha
4 gamma
5 alpha
Now, reading each entry in A.txt, replace the P= value with the corresponding value from B.txt.
Output:
1 X y Z Q=Alpha,P=gamma,O=Theta
2 x a b Q=Alpha,P=alpha,O=Theta
3 y b c Q=Alpha,P=alpha,O=Theta
4 a b c Q=Alpha,P=gamma,O=Theta
5 x y z Q=Alpha,P=alpha,O=Theta
Thanks in advance!!!
Assuming A.txt and B.txt are sorted on the first column, you can first join both files and then perform the replacement within a specified field using sed:
For example:
join -t $'\t' -j 1 A.txt B.txt | sed 's/,P=.*,\(.*\)\t\(.*\)/,P=\2,\1/g'
You could have sed write you a sed script, e.g.:
sed 's:^:/^:; s: :\\b/s/P=[^,]+/P=:; s:$:/:' B.txt
Output:
/^1\b/s/P=[^,]+/P=gamma/
/^2\b/s/P=[^,]+/P=alpha/
/^3\b/s/P=[^,]+/P=alpha/
/^4\b/s/P=[^,]+/P=gamma/
/^5\b/s/P=[^,]+/P=alpha/
Pipe it into a second sed:
sed 's:^:/^:; s: :\\b/s/P=[^,]+/P=:; s:$:/:' B.txt | sed -r -f - A.txt
Output:
1 X y Z Q=Alpha,P=gamma,O=Theta
2 x a b Q=Alpha,P=alpha,O=Theta
3 y b c Q=Alpha,P=alpha,O=Theta
4 a b c Q=Alpha,P=gamma,O=Theta
5 x y z Q=Alpha,P=alpha,O=Theta
Another solution:
awk '{getline b < "B.txt"; split(b, a, " "); sub(/P=[^,]*/, "P=" a[2]); print}' A.txt
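A runnable sketch of this getline approach on (a prefix of) the sample data; it assumes B.txt has the same number of lines as A.txt and is read in lockstep:

```shell
# Recreate a prefix of the sample inputs.
printf '1 X y Z Q=Alpha,P=beta,O=Theta\n2 x a b Q=Alpha,P=beta,O=Theta\n' > A.txt
printf '1 gamma\n2 alpha\n' > B.txt

# For each A.txt line, read the matching B.txt line, grab its 2nd field,
# and splice it into the P= entry.
awk '{getline b < "B.txt"; split(b, a, " "); sub(/P=[^,]*/, "P=" a[2]); print}' A.txt
# -> 1 X y Z Q=Alpha,P=gamma,O=Theta
#    2 x a b Q=Alpha,P=alpha,O=Theta
```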
