Join command in Bash - bash

Hi community. I've got a problem in the bash terminal. I need to merge two files and want to use ; as the delimiter in the join command, but it doesn't work. How can I fix it? Thanks!
join -1 2 -2 2 -t; tasks.txt procowner.txt > answ.txt
Update: the error message from the shell:
join: option requires an argument -- t
usage: join [-a fileno | -v fileno ] [-e string] [-1 field] [-2 field]
[-o list] [-t char] file1 file2
zsh: command not found: tasks.txt

The ; is treated as a command terminator by bash. This in turn means bash sees two separate commands:
join -1 2 -2 2 -t
# and
tasks.txt procowner.txt > answ.txt
The first one generates a syntax error for the join command; the second one generates an error stating tasks.txt is not a valid command.
The simple fix is to quote the ;, e.g.:
join -1 2 -2 2 -t';' tasks.txt procowner.txt > answ.txt
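To see the quoted form working end to end, here is a minimal sketch. The contents of tasks.txt and procowner.txt are invented for illustration (the question does not show them), and both are assumed to be sorted on field 2 already, which join requires.

```shell
# Hypothetical sample data; the real files are not shown in the question.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 't1;100;backup' 't2;200;deploy' > tasks.txt
printf '%s\n' 'alice;100' 'bob;200' > procowner.txt

# Quoting ';' stops the shell from treating it as a command separator,
# so join receives it as the argument to -t.
join -1 2 -2 2 -t';' tasks.txt procowner.txt > answ.txt
cat answ.txt
```

Escaping the character with a backslash (-t \;) has the same effect as the single quotes.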

Related

Linux Bash Script find and join

Below are my example files.
Using AccountInfo(Account) to search the sub-folder existing on FolderStatus(Sub-folder).
AccountInfo
#server,Account,status,homefolder
vmftp01,admin01,enable,/home/admin01
vmftp01,admin02,enable,/home/admin02
vmftp01,admin03,enable,/home/admin03
FolderStatus
#account,sub-folder
admin01,/sftp/inbox
admin02,/ftp/inbox
admin02,/sftp/inbox
admin02,/as2/inbox
admin03,/ftp/inbox
admin03,/sftp/inbox
Desired Output:
/home/admin01/sftp/inbox
/home/admin02/ftp/inbox
/home/admin02/sftp/inbox
/home/admin02/as2/inbox
/home/admin03/ftp/inbox
/home/admin03/sftp/inbox
I tried:
join -t 1.2,2.2 file1 file2
with error:
join: file2:4: is not sorted: admin02,/as2/inbox
First of all, your FolderStatus file is indeed not sorted, but that's only a problem because your join command line doesn't seem to be correct.
Try:
join -t , -1 2 -o 1.4,2.2 AccountInfo FolderStatus
With AccountInfo = f2 and FolderStatus = f1, it worked for me this way (up to a comma that can be easily eliminated):
ronald@oncilla:~/tmp$ join -t , -1 2 -o 1.4,2.2 f2 f1
/home/admin01,/sftp/inbox
/home/admin02,/ftp/inbox
/home/admin02,/sftp/inbox
/home/admin02,/as2/inbox
/home/admin03,/ftp/inbox
/home/admin03,/sftp/inbox
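The "-t ," option specifies the field separator.
The "-1 2" option specifies that the key in file 1 is the second column; the key in file 2 defaults to its first column. The "-o 1.4,2.2" option specifies the output format: field 4 of file 1 followed by field 2 of file 2.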
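Putting those options together with the sorting that join needs, here is a minimal sketch using the sample files from the question. Stripping the '#' header lines with grep, the intermediate file names (accounts.sorted, folders.sorted, paths.txt) and removing the comma with tr are assumptions of this sketch, not part of the original answer.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > AccountInfo <<'EOF'
#server,Account,status,homefolder
vmftp01,admin01,enable,/home/admin01
vmftp01,admin02,enable,/home/admin02
vmftp01,admin03,enable,/home/admin03
EOF
cat > FolderStatus <<'EOF'
#account,sub-folder
admin01,/sftp/inbox
admin02,/ftp/inbox
admin02,/sftp/inbox
admin02,/as2/inbox
admin03,/ftp/inbox
admin03,/sftp/inbox
EOF

# Drop the '#' header lines and sort each file on its join field.
grep -v '^#' AccountInfo  | sort -t, -k2,2 > accounts.sorted
grep -v '^#' FolderStatus | sort -t, -k1,1 > folders.sorted

# Join on the account name, print homefolder and sub-folder, then delete
# the separating comma (safe here because the paths contain no commas).
join -t, -1 2 -2 1 -o 1.4,2.2 accounts.sorted folders.sorted | tr -d ',' > paths.txt
cat paths.txt
```

Because the input is sorted first, the lines for one account may come out in a different order than in the desired output listed in the question.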

efficient join >100 files

I have a list containing >100 tab-delimited files, containing 5-8 million rows, and 16 columns (always in the same exact order). From each file I need to extract 5 specific columns, including one identifier-column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:
output1: ID, VAR1
output2: VAR2.1,VAR2.2,VAR2.3
output3: VAR3.1,VAR3.2,VAR3.3
output4: VAR4.1,VAR4.2,VAR4.3
where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.
My problem is that the input files contain partially overlapping IDs and I need to extract the union of these rows (i.e. all IDs that occur at least once in one of the input files). To be more exact, output1 should contain the unions of the "ID"- and "VAR1"-columns of all input files. The row order of the remaining output files should be identical to output1. Finally, rows not present in any given input file should be padded with "NA" in output2, output3 and output4.
I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.
My script so far:
ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
while read FILE;do
sort -k${ID},${ID} < ${FILE} | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} 'BEGIN{OFS="\t"};{print $ID,$VAR1 > "tmp1";print $ID,$VAR2 > "tmp2";print $ID,$VAR3 > "tmp3";print $ID,$VAR4 > "tmp4"}'
awk 'FNR==NR{a[$1]=$1;next};{if(($1 in a)==0){print $0 > "tmp5"}}' output1 tmp1
cat output1 tmp5 > foo && mv foo output1
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar3 output3
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar4 output4
rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1
Final remark: I use cat for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when I pre-process my files). So I can just append rows that are not already included to the bottom of output1 and sort the final output-file
First you have to figure out where most of the time is lost. You can run echo "running X"; time ./X for each step and make sure you are not trying to optimize the fastest part of the script.
You can simply run the three joins in the background in parallel (cmd args &) and then wait for all of them to finish. If the joins take 1 second and the awk part before them takes 10 minutes, this will not help much.
You can also put the wait before the cat output1 tmp5 ... line and before the final sort -k1 ... line. For this to work you'll have to name the temporary files differently and rename them just before the joins. The idea is to generate the input for the three parallel joins for the first file in the background, wait, then rename the files, run the joins in the background and generate the next inputs. After the loop is complete, just wait for the last joins to finish. This will help if the awk part consumes CPU time comparable to that of the joins.
HTH. Even more complex parallel execution scenarios are possible.
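The simplest variant described above (three joins in the background, then wait) might look like the following sketch. The tiny output2..output4 and tmp2..tmp4 files are hypothetical stand-ins for the question's data, and -o auto is a GNU coreutils extension, as in the question's script.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
tab=$(printf '\t')

# Hypothetical stand-ins for the accumulated outputs and the per-file columns.
for i in 2 3 4; do
    printf 'id1\ta%s\nid2\tb%s\n' "$i" "$i" > "output$i"
    printf 'id2\tc%s\nid3\td%s\n' "$i" "$i" > "tmp$i"
done

# Start the three independent joins in the background ...
for i in 2 3 4; do
    join -e NA -a1 -a2 -t "$tab" -o auto "output$i" "tmp$i" > "bar$i" &
done
# ... and block until all of them have finished.
wait

# Only now is it safe to replace the accumulated outputs.
for i in 2 3 4; do
    mv "bar$i" "output$i"
done
cat output2
```

A bare wait does not report whether one of the joins failed; if that matters, record each PID from $! and wait on them individually.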

Reading from file and passing it to another command

I have a two-column, tab-delimited file that contains input for a command.
The input file looks like this:
2795.bam 2865.bam
2825.bam 2865.bam
2794.bam 2864.bam
the command line is:
macs2 callpeak -t trt.bam -c ctrl.bam -n Macs.name.bam --gsize hs --nomodel
where trt.bam stands for the file names in column 1 and ctrl.bam for the file names in column 2.
What I am trying to do is read these values from the input file and run the command on them.
To achieve this I am doing the following:
cat temp | awk '{print $1 "\t" $2 }' | macs2 callpeak -t $1 -c $2 -n Macs.$1 --gsize hs --nomodel
This is failing. The error that I get is:
usage: macs2 callpeak [-h] -t TFILE [TFILE ...] [-c [CFILE [CFILE ...]]]
[-f {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}]
[-g GSIZE] [--keep-dup KEEPDUPLICATES]
[--buffer-size BUFFER_SIZE] [--outdir OUTDIR] [-n NAME]
[-B] [--verbose VERBOSE] [--trackline] [--SPMR]
[-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal]
[--nomodel] [--shift SHIFT] [--extsize EXTSIZE]
[-q QVALUE | -p PVALUE] [--to-large] [--ratio RATIO]
[--down-sample] [--seed SEED] [--tempdir TEMPDIR]
[--nolambda] [--slocal SMALLLOCAL] [--llocal LARGELOCAL]
[--broad] [--broad-cutoff BROADCUTOFF]
[--cutoff-analysis] [--call-summits]
[--fe-cutoff FECUTOFF]
macs2 callpeak: error: argument -t/--treatment: expected at least one argument
In an ideal situation this should be taking inputs like this:
macs2 callpeak -t 2795.bam -c 2865.bam -n Macs.2795 --gsize hs --nomodel
Here Macs is standalone software that runs on Linux. At present, the software fails to read the input from the file.
Any inputs are deeply appreciated.
I believe what you want is a loop over all lines in your input file. In bash, you can do this as follows:
while read -r tfile cfile; do
macs2 callpeak -t "$tfile" -c "$cfile" -n "Macs.$tfile" --gsize hs --nomodel
done < "input_file.txt"
See: https://mywiki.wooledge.org/BashFAQ/001 (cf. Sundeep's comment)
Original answer:
while read -a a; do
macs2 callpeak -t "${a[0]}" -c "${a[1]}" -n "Macs.${a[0]}" --gsize hs --nomodel
done < "input_file.txt"
This will read the input file input_file.txt line by line and store it in a bash array named a using read -a a. From that point forward, you process your command with the variables ${a[0]} and ${a[1]}.
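The loop can be tried without macs2 installed by letting echo print the command that would run. This sketch also strips the .bam suffix so that the -n argument matches the Macs.2795 form the question asks for; the variable name and the commands.txt file are assumptions made for illustration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '2795.bam\t2865.bam\n2825.bam\t2865.bam\n2794.bam\t2864.bam\n' > input_file.txt

# 'echo' stands in for the real macs2 call so the loop can be tried safely.
while read -r tfile cfile; do
    name="Macs.${tfile%.bam}"    # strip the .bam suffix for the -n argument
    echo macs2 callpeak -t "$tfile" -c "$cfile" -n "$name" --gsize hs --nomodel
done < input_file.txt > commands.txt
cat commands.txt
```

Once the printed commands look right, removing the echo runs them for real.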

Should I use a for loop to process text files line by line?

So I have two text files
FILE1: 1-40 names
FILE2: 1-40 names
Now what I want the program to do (Terminal) is to go through each name, by incrementing by ONE in each file so that the first name from FILE1 runs the first line from FILE2, and 20th name from FILE1 runs the 20th line from FILE2.
BUT I DON'T WANT IT TO run first name of FILE1, and then run through all of the names listed in FILE2, and repeat that over and over again.
Should I do a for loop?
I was thinking of doing something like:
for f in (cat FILE1); do
flirt -in $f -ref (cat FILE2);
done
I'm doing this using BASH.
Yes, you can do it quite easily, but it requires reading from two different file descriptors at once. You can simply redirect one of the files into the next available file descriptor and use it to feed your read loop, e.g.
while read f1var && read -u 3 f2var; do
echo "f1var: $f1var -- f2var: $f2var"
done <file1.txt 3<file2.txt
This reads line by line from both files: a line from file1.txt on standard input into f1var, and a line from file2.txt on file descriptor 3 into f2var.
A short example might help:
Example Input Files
$ cat f1.txt
a
b
c
$ cat f2.txt
d
e
f
Example Use
$ while read f1var && read -u 3 f2var; do \
echo "f1var: $f1var -- f2var: $f2var"; \
done <f1.txt 3<f2.txt
f1var: a -- f2var: d
f1var: b -- f2var: e
f1var: c -- f2var: f
Using paste as an alternative
The paste utility also provides a simple alternative for combining files line-by-line, e.g.:
$ paste f1.txt f2.txt
a d
b e
c f
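The paste output can also feed a read loop directly, which pairs line N of one file with line N of the other. In this sketch echo stands in for the flirt command from the question, and the variable names and pairs.txt are hypothetical.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'a\nb\nc\n' > f1.txt
printf 'd\ne\nf\n' > f2.txt
tab=$(printf '\t')

# paste joins line N of each file with a tab; read splits the pair again.
paste f1.txt f2.txt | while IFS="$tab" read -r in_name ref_name; do
    echo flirt -in "$in_name" -ref "$ref_name"   # echo stands in for the real command
done > pairs.txt
cat pairs.txt
```

Setting IFS to a tab only for the read keeps names that contain spaces intact.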
In Bash, you might make use of arrays:
echo "Alice
Bob
Claire" > file-1
echo "Anton
Bärbel
Charlie" > file-2
n1=($(cat file-1))
n2=($(cat file-2))
for n in {0..2}; do echo ${n1[$n]} ${n2[$n]} ; done
Alice Anton
Bob Bärbel
Claire Charlie
Getting familiar with join and nl (number lines) can't be wrong, so here is a different approach:
nl -w 1 file-1 > file1
nl -w 1 file-2 > file2
join -1 1 -2 1 file1 file2 | sed -r 's/^[0-9]+ //'
nl will put a lot of blanks in front of small line numbers unless we tell it -w 1.
We join the files by matching line number and remove the line number afterwards with sed.
Paste is of course much more elegant. Didn't know about this.

`join` with -e "NA" parameter incorrectly fills "NA" into a non-empty field

I am encountering a weird issue with join in a script I've written.
I have two files, say:
File1.txt (1st field: cluster size; 2nd field: brain coordinates)
54285;-40,-64,-2
5446;-32,6,24
File2.txt (1st field: cluster index; 2nd field: z-value; 3rd field: brain coordinates)
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
1;5.66;-50,16,34
Where I am joining on the last field, the brain coordinates.
When I use the command
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" File1.txt File2.txt
I expect
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
But I get
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
1;NA;5.66;-50,16,34
Such that the cluster size is missing on row 3 (i.e., cluster size for cluster #1, "5446").
If I edit File2 to take out lines that don't have a match in File1, i.e.:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
I get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
If I edit File2.txt like so, adding a line without a cluster-size value to cluster #1:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
1;5.66;-50,16,34
I also get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
BUT, if I edit File2.txt like so, adding a line without a cluster-size value to cluster #2:
File2.txt
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
Then I do not receive the expected output:
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
Can anyone give me any insight into why this is occurring? Have I done something wrong, or is there something quirky going on with join that I haven't been able to suss out from the man page?
Although alternative solutions for joining these files (that is, using tools other than join) are welcome, I am most interested in figuring out why the current command isn't working.
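Input files to the join command must be sorted on the join fields. join walks both files in a single pass, so a key that appears out of order can be skipped and reported as unmatched.
Try this instead (note that this uses process substitution, which is a bashism):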
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" <(sort -k2,2 -t';' File1.txt) \
<(sort -k3,3 -t';' File2.txt)
1;5446;5.78;-32,6,24
2;54285;7.59;-40,-64,-2
1;NA;5.66;-50,16,34
2;NA;7.33;62,-60,14
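For shells without process substitution, the same result can be reached with temporary files. This is a sketch using the sample data from the question; the .sorted and joined.txt file names are hypothetical, and forcing LC_ALL=C is an assumption made so that sort and join agree on the collation order.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' '54285;-40,-64,-2' '5446;-32,6,24' > File1.txt
printf '%s\n' '2;7.59;-40,-64,-2' '2;7.33;62,-60,14' \
              '1;5.78;-32,6,24' '1;5.66;-50,16,34' > File2.txt

# Use one locale for sort and join so both agree on the ordering.
export LC_ALL=C
sort -t';' -k2,2 File1.txt > File1.sorted
sort -t';' -k3,3 File2.txt > File2.sorted

# Same join as above, now reading the pre-sorted copies.
join -a 2 -e NA -1 2 -2 3 -t';' -o '2.1 1.1 2.2 0' File1.sorted File2.sorted > joined.txt
cat joined.txt
```

The rows come out ordered by the join field (the coordinates), not in the original order of File2.txt.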

Resources