Merge multiple text files on one column in bash

How do you merge multiple plain text files (>2) on the first column? For example, I have three files like the below:
cat file1.txt
a 1
b 2
c 3
cat file2.txt
a 2
b 3
c 4
cat file3.txt
a 3
b 4
c 5
I am trying to merge these files into one, joined on the first column, like this:
cat ideal.txt
a 1 2 3
b 2 3 4
c 3 4 5

How about join?
Join lines of two sorted files on a common field.
More information: https://www.gnu.org/software/coreutils/join
join file1.txt file2.txt > join1.txt
join join1.txt file3.txt > ideal.txt
cat ideal.txt
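join requires both inputs to be sorted on the join field. If your files might not already be sorted, a minimal sketch of the same pipeline that sorts them on the fly with process substitution (assuming whitespace-separated columns with the key in column 1):
join <(sort -k1,1 file1.txt) <(sort -k1,1 file2.txt) > join1.txt
join join1.txt <(sort -k1,1 file3.txt) > ideal.txt
Since join writes its output in key order, join1.txt is already sorted for the second step.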
Here's a script (I named the file "jj") that you might use in order to work with many files. To run it, type: ./jj file1.txt file2.txt file3.txt
#!/usr/bin/env bash
# define temporary location, WIP/CACHE
tmp="/tmp/outjointmp"
# define target location
out="/tmp/outjoin"
# truncate both files, just in case there is any residue from anything
: > "$out"
: > "$tmp"
# first, copy the contents of the first file into the target file
cat "$1" > "$out"
# loop through all remaining arguments
while [[ $# -gt 1 ]]; do
    join "$out" "$2" > "$tmp"
    shift
    # copy over the temp into destination file
    cat "$tmp" > "$out"
done
cat "$out"
Result and output:
$ ./jj file1.txt file2.txt file3.txt
a 1 2 3
b 2 3 4
c 3 4 5
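A variation on the same script, sketched with mktemp so the working files don't clash with anything else in /tmp; the merging logic is unchanged and the result is still printed to stdout:
#!/usr/bin/env bash
# create private temporary files instead of fixed /tmp paths
tmp="$(mktemp)"
out="$(mktemp)"
# clean up the working files when the script exits
trap 'rm -f "$tmp" "$out"' EXIT
# seed the output with the first file, then fold in the rest one by one
cat "$1" > "$out"
while [[ $# -gt 1 ]]; do
    join "$out" "$2" > "$tmp"
    shift
    cat "$tmp" > "$out"
done
cat "$out"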

A recursive function using process substitution should do the trick in order to join more than two files:
#!/bin/bash
join_rec() {
    if (( $# <= 2 )); then
        join "$@"
    else
        join "$1" <(join_rec "${@:2}")
    fi
}
join_rec file*.txt > joined_file
This assumes the input files are sorted.
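If that assumption doesn't hold, here is a sketch of the same recursion that sorts each input on the fly (it still expects at least two files and a whitespace-separated key in column 1):
join_rec_sorted() {
    if (( $# <= 2 )); then
        join <(sort -k1,1 "$1") <(sort -k1,1 "$2")
    else
        # join's output is already in key order, so only the new file needs sorting
        join <(sort -k1,1 "$1") <(join_rec_sorted "${@:2}")
    fi
}
join_rec_sorted file*.txt > joined_file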

Related

Looping over files and editing them based on array information

I have a directory with a lot of files, which can be grouped based on their names. For example here I have 4 groups with 5 files in each:
ls ./
# group 1
NpXynWT_apo_300K_0.pdb
NpXynWT_apo_300K_1.pdb
NpXynWT_apo_300K_2.pdb
NpXynWT_apo_300K_3.pdb
NpXynWT_apo_300K_4.pdb
# group 2
NpXynWT_apo_340K_0.pdb
NpXynWT_apo_340K_1.pdb
NpXynWT_apo_340K_2.pdb
NpXynWT_apo_340K_3.pdb
NpXynWT_apo_340K_4.pdb
# group 3
NpXynWT_com_300K_0.pdb
NpXynWT_com_300K_1.pdb
NpXynWT_com_300K_2.pdb
NpXynWT_com_300K_3.pdb
NpXynWT_com_300K_4.pdb
# group 4
NpXynWT_com_340K_0.pdb
NpXynWT_com_340K_1.pdb
NpXynWT_com_340K_2.pdb
NpXynWT_com_340K_3.pdb
NpXynWT_com_340K_4.pdb
So here each of the 5 files of the same group is different by the end suffix from 0 to 4:
NpXynWT_apo_300K_0 ... NpXynWT_apo_300K_4
NpXynWT_apo_340K_0 ... NpXynWT_apo_340K_4
etc
I need to loop over all of these 40 files and
pre-process each file: add "MODEL" plus the number of the file (a number in the range 0 to 4) before the first line, and append "ENDMDL" after the last line.
cat together the pre-processed files of the same group
In summary, my script should create 4 new "combined" files, each consisting of the 5 sub-files of one group from the initial list.
For this I created an array of the groups and looped over it with an index from 0 to 4, using two loops: 1) pre-processing of each file; 2) cat-ing the pre-processed files together:
# list of 4 groups
systems=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
# pre-process files
for model in "${systems[#]}"; do
i="0"
while [ $i -lt 5 ]; do
# EDIT EXISTING FILES
sed -i "1 i\MODEL $i" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
i=$[$i+1]
done
done
# cat pre-processed filles
for model in "${systems[@]}"; do
    cat "${pdbs}"/"${model}"_[0-4]_FA.pdb > "${output}/${model}.pdb"
done
1 - Would it be possible to merge both loops? E.g. would it be the same as
# pre-processing PBDs and it catting
for model in "${systems[#]}"; do
##echo "$model"
i="0"
while [ $i -lt 5 ]; do
k=$[$i+1]
## do something with pdb
sed -i "1 i\MODEL $k" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
#gedit "${pdbs}"/"${model}"_"$i"_FA.pdb
i=$[$i+1]
done
# now we cat together the post-processed files
cat "${pdbs}"/"${model}"_[0-4]_FA.pdb > "${output}/${model}.pdb"
done
2 - Would it be possible to simplify the two file-editing operations from the first loop?
sed -i "1 i\MODEL $i" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
How can I match info from the array "groups" to the files present in the folder?
Use find. It is there to find files.
groups=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
for group in "${groups[@]}"; do
    find . -name "${group}_*.pdb" -type f
done
You can be even more exact by using -regex and similar find options.
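To act on what find returns, for example the MODEL/ENDMDL edit from the question, one common pattern is to feed the results into a while read loop. A sketch, assuming GNU sed and file names like NpXynWT_apo_300K_0.pdb where the trailing number is the model number:
groups=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
for group in "${groups[@]}"; do
    find . -maxdepth 1 -name "${group}_*.pdb" -type f -print0 |
    while IFS= read -r -d '' f; do
        # take the model number from the end of the file name
        n="${f##*_}"
        n="${n%.pdb}"
        # insert MODEL before the first line and append ENDMDL after the last
        sed -i -e "1i\\MODEL $n" -e '$a\ENDMDL' "$f"
    done
done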

Compare files by position

I want to compare two files only by their first column.
My first file looks like this:
0009608a4138a8e7 hdisk26 altinst_rootvg
000f7d4a8234a675 hdisk12 vgdbf
000f7d4a8234d5c9 hdisk22 vgarcbkp
My second file looks like this:
000f7d4a8234a675 hdiskpower64 [Lun_vgdbf]
000f7d4a8234d5c9 hdiskpower61 [Lun_vgarcbkp]
This is the output I would like to generate:
0009608a4138a8e7 hdisk26 altinst_rootvg
000f7d4a8234a675 hdisk12 vgdbf hdiskpower64 [Lun_vgdbf]
000f7d4a8234d5c9 hdisk22 vgarcbkp hdiskpower61 [Lun_vgarcbkp]
I wonder why diff does not support positional compare.
Something like diff -y -p1-17 file1 file2. Any idea?
You can use join to produce your desired output:
join -a 1 file1 file2
The -a 1 option tells join to also output lines from the first file that have no match in the second, so this assumes the first file contains every id that is present in the second.
It also relies on the files being sorted on their first field, which seems to be the case for your sample data. If it's not, you will need to sort them beforehand (the join command will warn you when its inputs aren't sorted).
Sample execution :
$ echo '1 a b
> 2 c d
> 3 e f' > test1
$ echo '2 9 8
> 3 7 6' > test2
$ join -a 1 test1 test2
1 a b
2 c d 9 8
3 e f 7 6
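If the real files aren't already sorted on the first field, a sketch that sorts them on the fly before joining:
join -a 1 <(sort -k1,1 file1) <(sort -k1,1 file2)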

Should I use a for loop to process text files line by line?

So I have two text files
FILE1: 1-40 names
FILE2: 1-40 names
Now what I want the program to do (in the terminal) is go through each name, incrementing by ONE in each file, so that the first name from FILE1 runs with the first line from FILE2, and the 20th name from FILE1 runs with the 20th line from FILE2.
BUT I DON'T WANT IT TO run the first name of FILE1 against every name listed in FILE2, repeating that over and over again.
Should I do a for loop?
I was thinking of doing something like:
for f in (cat FILE1); do
flirt -in $f -ref (cat FILE2);
done
I'm doing this using BASH.
Yes, you can do it quite easily, but it will require reading from two different file descriptors at once. You can simply redirect one of the files into the next available file descriptor and use it to feed your read loop, e.g.
while read f1var && read -u 3 f2var; do
echo "f1var: $f1var -- f2var: $f2var"
done <file1.txt 3<file2.txt
This reads line-by-line from each file: a line from file1.txt on the standard file descriptor into f1var, and a line from file2.txt on file descriptor 3 into f2var.
A short example might help:
Example Input Files
$ cat f1.txt
a
b
c
$ cat f2.txt
d
e
f
Example Use
$ while read f1var && read -u 3 f2var; do \
echo "f1var: $f1var -- f2var: $f2var"; \
done <f1.txt 3<f2.txt
f1var: a -- f2var: d
f1var: b -- f2var: e
f1var: c -- f2var: f
Using paste as an alternative
The paste utility also provides a simple alternative for combining files line-by-line, e.g.:
$ paste f1.txt f2.txt
a d
b e
c f
In Bash, you might make use of arrays:
echo "Alice
> Bob
> Claire" > file-1
echo "Anton
> Bärbel
> Charlie" > file-2
n1=($(cat file-1))
n2=($(cat file-2))
for n in {0..2}; do echo ${n1[$n]} ${n2[$n]} ; done
Alice Anton
Bob Bärbel
Claire Charlie
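One caveat: n1=($(cat file-1)) splits on every whitespace character, so names containing spaces would be broken apart. A sketch of the same idea with mapfile (bash 4+), which stores one array element per line:
mapfile -t n1 < file-1
mapfile -t n2 < file-2
for (( n = 0; n < ${#n1[@]}; n++ )); do
    echo "${n1[n]} ${n2[n]}"
done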
Getting familiar with join and nl (number lines) can't be wrong, so here is a different approach:
nl -w 1 file-1 > file1
nl -w 1 file-2 > file2
join -1 1 -2 1 file1 file2 | sed -r 's/^[0-9]+ //'
nl will put a large number of blanks in front of small line numbers if we don't tell it -w 1.
We join the files by matching line number and remove the line number afterwards with sed.
paste is of course much more elegant; I didn't know about it.

different lines in two files when ignoring last column - in bash

I have two files, smaller and bigger, and bigger contains all the lines of smaller. Those lines are almost the same; only the last column differs.
file_smaller
A NM 0
B GT 4
file_bigger
A NM 5 <-same as in file_smaller according to my rules
C TY 2
D OP 6
B GT 3 <-same as in file_smaller according to my rules
I would like to write lines, where the two files differ, that means:
wished_output
C TY 2
D OP 6
Could you please help me to do so? Thanks a lot.
You can do the following:
cat file_bigger file_smaller |sed 's=\(.*\).$=\1='|sort| uniq -u > temp_pat
grep -f temp_pat file_bigger ; rm temp_pat
which will (in that order)
merge the files
strip the last column (by removing the last character, which works here because the last column is a single digit)
sort the result
keep only the unique lines, saved to temp_pat
find the original lines in file_bigger
all in all, the expected result.
awk 'FILENAME=="file_smaller" {arr[$1 $2]=$0}
     FILENAME=="file_bigger" { tmp=$1 $2; if (tmp in arr) {next} else {print $0} }
' file_smaller file_bigger
See if that meets your needs.
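A compact variant of the same idea is the FNR==NR idiom: collect the keys of the smaller file first, then print only those lines of the bigger file whose key was never seen (a sketch):
awk 'FNR==NR { seen[$1,$2]=1; next } !(($1,$2) in seen)' file_smaller file_bigger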
grep -vf <(cut -d " " -f 1-2 file_smaller| sed 's/^/^/') file_bigger
The process substitution results in this:
^A NM
^B GT
Then, grep -v removes those patterns from "file_bigger"
Bash 4 using associative arrays:
#!/usr/bin/env bash
f() {
    if (( $# != 2 )); then
        echo "usage: ${FUNCNAME} <smaller> <bigger>" >&2
        return 1
    fi
    local -A smaller
    local -a x
    while read -ra x; do
        smaller["${x[*]::2}"]=0
    done <"$1"
    while read -ra x; do
        ((${smaller["${x[*]::2}"]:-1})) && echo "${x[*]}"
    done <"$2"
}
f /dev/fd/3 /dev/fd/0 <<"SMALLER" 3<&0 <<"BIGGER"
A NM 0
B GT 4
SMALLER
A NM 5
C TY 2
D OP 6
B GT 3
BIGGER
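Run against regular files instead of the here-documents above, it would presumably be called simply as:
f file_smaller file_bigger
which should print the two lines unique to file_bigger, i.e. C TY 2 and D OP 6.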

BASH: Iterating two variables in a for loop

I have two files, numbers.txt (1 \n 2 \n 3 \n 4 \n 5 \n) and alpha.txt (a \n n \n c \n d \n e \n).
Now I want to iterate both the files at the same time something like.
for num in `cat numbers.txt` && alpha in `cat alpha.txt`
do
echo $num "blah" $alpha
done
The other idea I had is
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
break
done
done
but this kind of code always takes the first value of $alpha.
I hope my problem is clear enough.
Thanks in advance.
Here is what I actually intended to do (it's just an example).
I have one more file, say template.txt, with this content:
variable1= NUMBER
variable2= ALPHA
I wanted to take the output from the two files, i.e. numbers.txt and alpha.txt (one line from each at a time), and replace NUMBER and ALPHA with the respective content from those two files.
So here is what I did, once I learned how to iterate over both files together:
paste number.txt alpha.txt | while read num alpha
do
cp template.txt temp.txt
sed -i "{s/NUMBER/$num/g}" temp.txt
sed -i "{s/ALPHA/$alpha/g}" temp.txt
cat temp.txt >> final.txt
done
Now what i am having in final.txt is:
variable1= 1
variable2= a
variable1= 2
variable2= b
variable1= 3
variable2= c
variable1= 4
variable2= d
variable1= 5
variable2= e
variable1= 6
variable2= f
variable1= 7
variable2= g
variable1= 8
variable2= h
variable1= 9
variable2= i
variable1= 10
variable2= j
It's a very simple and naive approach. I wanted to know: is there any other way to do this?
Any suggestion will be appreciated.
No, your question isn't clear enough. Specifically, the way you wish to iterate through your files is unclear, but assuming you want to have an output such as:
1 blah a
2 blah b
3 blah c
4 blah d
5 blah e
you can use the paste utility, like this:
paste numbers.txt alpha.txt | while read num alpha ; do
echo "$num and $alpha"
done
or even:
paste -d# numbers.txt alpha.txt | sed 's/#/ blah /'
Your first loop is impossible in bash. Your second one, without the break, would combine each line from numbers.txt with each line from alpha.txt, like this:
1 AND a
1 AND n
1 AND c
...
2 AND a
...
3 AND a
...
4 AND a
...
Your break makes it skip all lines from the alpha.txt, except the 1st one (bmk has already explained it in his answer)
It should be possible to organize the correct loop using the while loop construction, but it would be rather ugly.
There are lots of easier alternatives which may be a better choice, depending on the specifics of your task. For example, you could try this:
paste numbers.txt alpha.txt
or, if you really want your "AND"s, then, something like this:
paste numbers.txt alpha.txt | sed 's/\t/ AND /'
And if your numbers are really sequential (and you can live without 'AND'), you can simply do:
cat -n alpha.txt
Here is an alternate solution according to the first model you suggested:
while read -u 5 a && read -u 6 b
do
echo $a $b
done 5<numbers.txt 6<alpha.txt
The notation 5<numbers.txt tells the shell to open numbers.txt using file descriptor 5. read -u 5 a means read a value for a from file descriptor 5, which has been associated with numbers.txt.
The advantage of this approach over paste is that it gives you fine-grained control over how you merge the two files. For example, you could read one line from the first file and two lines from the second file.
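For instance, a sketch that consumes one line of numbers.txt for every two lines of alpha.txt:
while read -u 5 a && read -u 6 b1 && read -u 6 b2
do
    echo "$a $b1 $b2"
done 5<numbers.txt 6<alpha.txt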
In your second example the inner loop is executed only once because of the break. It will simply jump out of the loop, i.e. you will always only get the first element of alpha.txt. Therefore I think you should remove it:
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
done
done
If multiple loop isn't specifically your requirement but getting corresponding lines is then you may try the following code:
for line in `cat numbers.txt`
do
    echo $line "and" $(cat alpha.txt | head -n"$line" | tail -n1)
done
head gives you the first $line lines of alpha.txt, and tail takes the last of those.
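The same per-line lookup can also be done with sed, which may read a little more clearly (a sketch):
for line in `cat numbers.txt`
do
    echo $line "and" "$(sed -n "${line}p" alpha.txt)"
done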
#tollboy, I think the answer you are looking for is this:
count=1
for item in $(paste number.txt alpha.txt); do
    if [[ "${item}" =~ [a-zA-Z] ]]; then
        echo "variable${count}= ${item}" >> final.txt
    elif [[ "${item}" =~ [0-9] ]]; then
        echo "variable${count}= ${item}" >> final.txt
    fi
    count=$((count % 2 + 1))
done
When you type paste number.txt alpha.txt in your console, you see:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
From bash's point of view, $(paste number.txt alpha.txt) looks like this:
1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 9 i 10 j
So for each item in that list, figure out whether it is alphabetic or numeric, and print it to the output file.
Lastly, toggle the count between 1 and 2 so that the variable names alternate.
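If the goal is still the template filling from the question, a sketch (using the file names from the question) that skips the intermediate temp.txt entirely by letting sed write straight to the final file:
paste numbers.txt alpha.txt | while read -r num alpha; do
    sed "s/NUMBER/$num/g; s/ALPHA/$alpha/g" template.txt
done > final.txt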
