shell command: join crashes for large files?

I have two files, 1.txt and 2.txt.
1.txt has the following content:
a 1 2 3 4 5
b 4 5 6 7 7
c 4 5 6 7 6
d 6 5 4 3 2
and 2.txt has:
b
d
I need to extract those lines from 1.txt whose first field matches the first field of a line in 2.txt:
b 4 5 6 7 7
d 6 5 4 3 2
I thought a simple join command should work for me:
join 1.txt 2.txt
But unfortunately, the command produces just a couple of lines, even though both files are pretty large.
I cannot figure out what's going on.
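A likely cause, assuming the real files are not sorted on their first field: join expects both inputs to be sorted on the join field and can skip lines otherwise. Below is a minimal sketch using scratch copies of the sample data; the temporary directory and the .sorted file names are made up for illustration.

```shell
# Hypothetical scratch copies of the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 'a 1 2 3 4 5' 'b 4 5 6 7 7' 'c 4 5 6 7 6' 'd 6 5 4 3 2' > "$tmp/1.txt"
printf '%s\n' 'b' 'd' > "$tmp/2.txt"

# join expects both inputs sorted on the join field (field 1 by default).
# LC_ALL=C makes sort and join agree on the collation order.
LC_ALL=C sort -k1,1 "$tmp/1.txt" > "$tmp/1.sorted"
LC_ALL=C sort -k1,1 "$tmp/2.txt" > "$tmp/2.sorted"

result=$(LC_ALL=C join "$tmp/1.sorted" "$tmp/2.sorted")
printf '%s\n' "$result"
```

If the real files are already sorted, the cause is something else (for example a different field separator), and this sketch will not help.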

Related

How to align rows from two different files by similitude? [duplicate]

I need help aligning two files by matching the values in column 2 of file 1 against column 1 of file 2.
file 1:
1 d 3
2 e 4
5 o 1
file 2:
e 6
o 5
d 8
I want to get
1 d 3 d 8
2 e 4 e 6
5 o 1 o 5
Try using the join command:
join -o "1.1,1.2,1.3,2.1,2.2" -1 2 <(sort -k2,2 file1) <(sort -k1,1 file2)
output:
1 d 3 d 8
2 e 4 e 6
5 o 1 o 5
join requires both inputs to be sorted on the join field. Yours weren't, so the command above sorts them first.
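To illustrate why the sort key should be the join field rather than the whole line, here is a small sketch with made-up data in which column 1 and column 2 of file1 sort in different orders. The temporary directory and the .sorted names are assumptions for the example.

```shell
tmp=$(mktemp -d)
# Hypothetical data: same shape as file 1, but column 1 order and column 2 order disagree.
printf '%s\n' '1 o 3' '2 d 4' '3 e 5' > "$tmp/file1"
printf '%s\n' 'e 6' 'o 5' 'd 8' > "$tmp/file2"

# Sort each file on the field that join will compare.
LC_ALL=C sort -k2,2 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Join field 2 of file1 against field 1 of file2.
aligned=$(LC_ALL=C join -1 2 -2 1 -o 1.1,1.2,1.3,2.1,2.2 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$aligned"
```

The output follows the order of the join key (d, e, o), not the original order of file1.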
If both files have exactly the same keys (and number of lines), you can use paste:
paste -d' ' <(sort -k2,2 file1) <(sort file2)

extract specific columns from dataset using AWK

I am trying to apply a simple awk script to a dataset file.
The file has 150 columns; I only need columns 20 through 30.
Below is the script I used to print fields 20 through 30 of each record.
Code:
BEGIN{}
{
for(f=20;f<=30;f++){
print $f;
}
}
I don't know why each of the selected field values ends up on its own line.
That is,
sample dataset
1 2 3 4 5 6 7
2 2 3 4 5 6 7
3 3 3 4 5 6 7
4 4 4 4 5 6 7
5 5 5 5 5 6 7
6 6 6 6 6 6 7
7 7 7 7 7 7 7
I get output as
1
2
3
4
5
6
7
2
2
3
4
5
6
7
...so on
Below is another way of doing the same
awk -v f=20 -v t=30 '{for(i=f;i<=t;i++) \
printf("%s%s",$i,(i==t)?"\n":OFS)}' file
Notes
f and t are the starting and the ending columns respectively.
We used the ternary operator to control the field separator between the needed columns.
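A scaled-down run of the same loop on part of the 7-column sample dataset. The values f=2 and t=4 are stand-ins chosen only because the sample has fewer than 20 columns.

```shell
# Three rows of the sample dataset from the question.
sample=$(printf '%s\n' '1 2 3 4 5 6 7' '2 2 3 4 5 6 7' '3 3 3 4 5 6 7')

# Print fields f..t on one line per record: OFS between fields,
# a newline only after the last requested field.
cols=$(printf '%s\n' "$sample" | awk -v f=2 -v t=4 '{
    for (i = f; i <= t; i++)
        printf("%s%s", $i, (i == t) ? "\n" : OFS)
}')
printf '%s\n' "$cols"
```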
Edit
If you need columns 20 through 30 plus the last column, the following would suffice:
awk -v f=20 -v t=30 '{for(i=f;i<=t;i++) \
printf("%s%s",$i,(i==t)?OFS""$NF"\n":OFS)}' file
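Solution
print appends a newline after every call, so each field lands on its own line. Use printf inside the loop and emit the newline once per record:
BEGIN{FS=" ";}
{
for(f=20;f<=30;f++){
printf("%s ",$f);
}
print "";
}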

Paste every two lines in a file together as one line BASH

My colleague has given me a file, in which half of the lines are made of 8 columns of info and the other half are made of the 9th column of info. They are always next to each other, e.g.
1 2 3 4 5 6 7 8
1.1
2 3 4 5 6 7 8 9
1.2
...
a b c d e f g h
abcd
I know how to paste every two lines as one and print them out in Python. But I was wondering if it's possible to do that even more conveniently in BASH?
Thanks guys!
You could use sed or awk, as other answers have mentioned. Those answers are all good.
You could also do this easily in pure shell.
$ while read line1; do read line2; echo "$line1 $line2"; done < input.txt
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
Note that leading and trailing whitespace on each line is not preserved, because read strips it.
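If the whitespace matters, a variant along these lines should keep it: IFS= stops read from trimming, and -r stops it from interpreting backslashes. The leading spaces in the sample input are made up to show the effect.

```shell
# Hypothetical input with two leading spaces on the first line.
input=$(printf '%s\n' '  1 2 3 4 5 6 7 8' '1.1' '2 3 4 5 6 7 8 9' '1.2')

# Read lines in pairs without trimming or backslash processing.
paired=$(printf '%s\n' "$input" |
    while IFS= read -r line1 && IFS= read -r line2; do
        printf '%s %s\n' "$line1" "$line2"
    done)
printf '%s\n' "$paired"
```

With an odd number of lines, the final unpaired line is dropped by this loop.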
There's another tool available on most unix-like systems called paste:
$ paste - - < input.txt
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
In this case, there's a big space in the first line because paste separates columns using tabs, by default, and the trailing space in the first line of input.txt caused the separating tab to be offset to the next column. You can read paste's man page for options to control this.
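For instance, the -d option sets the delimiter. A small sketch that joins each pair of lines with a single space instead of a tab, using the sample lines from the question:

```shell
input=$(printf '%s\n' '1 2 3 4 5 6 7 8' '1.1' '2 3 4 5 6 7 8 9' '1.2')

# "- -" reads standard input twice per output line; -d ' ' joins with a space.
joined=$(printf '%s\n' "$input" | paste -d ' ' - -)
printf '%s\n' "$joined"
```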
Another awk
awk '{f=$0;getline;print f,$0}' file
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
And, just for fun, a GNU awk version:
awk -v RS="[0-9][.][0-9]" '{$1=$1;print $0,RT}' file
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
Here the record separator (RS) is set to a regex that matches the value on every second line.
RT then holds the text that actually matched the separator.
try:
awk '{printf "%s%s",$0,(NR%2?FS:RS)}' file
or:
awk 'NR%2{printf "%s ",$0;next}7' file
test:
kent$ echo "1 2 3 4 5 6 7 8
1.1
2 3 4 5 6 7 8 9
1.2"|awk '{printf "%s%s",$0,(NR%2?FS:RS)}'
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
kent$ echo "1 2 3 4 5 6 7 8
1.1
2 3 4 5 6 7 8 9
1.2"|awk 'NR%2{printf "%s ",$0;next}7'
1 2 3 4 5 6 7 8 1.1
2 3 4 5 6 7 8 9 1.2
You can use sed:
sed 'N;s/\n/ /' file
or awk:
awk 'NF==1{print $0; next}{printf "%s ",$0}' file

Join multiple tables by row names [duplicate]

I would like to merge multiple tables by row names. The tables differ in the number of rows, and they have both unique and shared rows, all of which should appear in the output. If possible I would like to solve the problem with awk, but I am also fine with other solutions.
table1.tab
a 5
b 5
d 9
table2.tab
a 1
b 2
c 8
e 11
I would like to obtain the following table:
table3.tab
a 5 1
b 5 2
d 9 0
c 0 8
e 0 11
I tried using join
join table1.tab table2.tab > table3.tab
but I get
table3.tab
a 5 1
b 5 2
Rows c, d and e are missing from the output.
You want to do a full outer join:
join -a1 -a2 -e "0" -o 0,1.2,2.2 table1.tab table2.tab
a 5 1
b 5 2
c 0 8
d 9 0
e 0 11
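The same full outer join with each option annotated, as a sketch on scratch copies of the two tables (the temporary directory is an assumption for the example). Both inputs are already sorted on the first field, which join needs.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a 5' 'b 5' 'd 9' > "$tmp/table1.tab"
printf '%s\n' 'a 1' 'b 2' 'c 8' 'e 11' > "$tmp/table2.tab"

# -a1 -a2 : also print unpairable lines from file 1 and from file 2
# -e 0    : fill fields that are missing with 0
# -o ...  : print the join key (0), then field 2 of each file
merged=$(LC_ALL=C join -a1 -a2 -e 0 -o 0,1.2,2.2 \
    "$tmp/table1.tab" "$tmp/table2.tab")
printf '%s\n' "$merged"
```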
This awk one-liner should work for your example:
awk 'NR==FNR{a[$1]=$2;k[$1];next}{b[$1]=$2;k[$1]}
END{for(x in k)printf"%s %d %d\n",x,a[x],b[x]}' table1.tab table2.tab
test
kent$ head f1 f2
==> f1 <==
a 5
b 5
d 9
==> f2 <==
a 1
b 2
c 8
e 11
kent$ awk 'NR==FNR{a[$1]=$2;k[$1];next}{b[$1]=$2;k[$1]}END{for(x in k)printf"%s %d %d\n",x,a[x],b[x]}' f1 f2
a 5 1
b 5 2
c 0 8
d 9 0
e 0 11
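One caveat: awk does not define the order in which "for (x in k)" visits the keys, so the sorted output above is not guaranteed. A sketch that pipes the result through sort to make the order deterministic, using scratch copies of f1 and f2:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a 5' 'b 5' 'd 9' > "$tmp/f1"
printf '%s\n' 'a 1' 'b 2' 'c 8' 'e 11' > "$tmp/f2"

# First file fills a[], second fills b[]; k[] collects every key seen.
# Missing values are empty strings, which %d prints as 0.
merged=$(awk 'NR==FNR {a[$1]=$2; k[$1]; next}
              {b[$1]=$2; k[$1]}
              END {for (x in k) printf "%s %d %d\n", x, a[x], b[x]}' \
         "$tmp/f1" "$tmp/f2" | LC_ALL=C sort -k1,1)
printf '%s\n' "$merged"
```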

Redirecting Multiple stdins?

I have three files named One, Two, Three.
One contains:
1
3
2
Two contains:
4
6
5
Three contains:
7
9
8
When I give the following command:
$ sort < One < Two < Three
I get the output:
7
8
9
But when I give the following command:
$ sort One Two Three
I get the output:
1
2
3
4
5
6
7
8
9
Can anyone please shed light on what exactly is happening here? Why is the input from One and Two not taken into account in the first command?
Your command is the same as:
sort 0<One 0<Two 0<Three
(file descriptor 0 is standard input)
Redirections are processed in the order they appear, from left to right.
The sort command itself never sees any of those file names.
bash opens One, Two and Three on file descriptor 0, one after another.
So the rightmost redirection overrides the ones to its left.
In the end, sort reads from file descriptor 0, which is bound to Three.
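A quick way to check this, sketched with scratch copies of the three files in a temporary directory (an assumption for the example), in a shell without zsh-style multios:

```shell
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 > "$tmp/One"
printf '%s\n' 4 6 5 > "$tmp/Two"
printf '%s\n' 7 9 8 > "$tmp/Three"

# Each "<" rebinds file descriptor 0, so only the last file is read.
last_only=$(sort < "$tmp/One" < "$tmp/Two" < "$tmp/Three")

# Redirecting from Three alone should give the same result.
three_only=$(sort < "$tmp/Three")

printf '%s\n' "$last_only"
```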
You can't redirect multiple files to the same file descriptor in bash. To work around this limitation you could use cat:
cat One Two Three | sort
On a side note, zsh supports what it calls multios:
zsh$ setopt multios
zsh$ sort < One < Two < Three > Four > Five
zsh$ tr '\n' ' ' < Four < Five
1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9
