Linux Bash Script find and join - bash

Below are my example files.
I want to use the Account column of AccountInfo to look up the matching sub-folders in FolderStatus and print each one appended to the account's home folder.
AccountInfo
#server,Account,status,homefolder
vmftp01,admin01,enable,/home/admin01
vmftp01,admin02,enable,/home/admin02
vmftp01,admin03,enable,/home/admin03
FolderStatus
#account,sub-folder
admin01,/sftp/inbox
admin02,/ftp/inbox
admin02,/sftp/inbox
admin02,/as2/inbox
admin03,/ftp/inbox
admin03,/sftp/inbox
Desired Output:
/home/admin01/sftp/inbox
/home/admin02/ftp/inbox
/home/admin02/sftp/inbox
/home/admin02/as2/inbox
/home/admin03/ftp/inbox
/home/admin03/sftp/inbox
I tried:
join -t 1.2,2.2 file1 file2
with error:
join: file2:4: is not sorted: admin02,/as2/inbox

Your FolderStatus file is indeed not sorted, but that is only a problem because your join command line is not correct.
Try:
join -t , -1 2 -o 1.4,2.2 AccountInfo FolderStatus
With AccountInfo saved as f2 and FolderStatus saved as f1, this worked for me (apart from a comma that is easy to remove):
ronald#oncilla:~/tmp$ join -t , -1 2 -o 1.4,2.2 f2 f1
/home/admin01,/sftp/inbox
/home/admin02,/ftp/inbox
/home/admin02,/sftp/inbox
/home/admin02,/as2/inbox
/home/admin03,/ftp/inbox
/home/admin03,/sftp/inbox
The "-t ," option sets the field separator to a comma.
The "-1 2" option makes the second column the join key in file 1 (file 2 uses its first column by default). The "-o" option specifies the output format: here, field 4 of file 1 followed by field 2 of file 2.
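As a minimal sketch of the whole pipeline, the following rebuilds the two sample files (header lines left out for brevity), runs the join, and removes the leftover comma with sed. The temporary directory and the result variable are illustrative names, not part of the original answer.

```shell
# Sketch: rebuild the sample files, join them, and strip the comma.
# workdir and result are illustrative names only.
workdir=$(mktemp -d)

cat > "$workdir/AccountInfo" <<'EOF'
vmftp01,admin01,enable,/home/admin01
vmftp01,admin02,enable,/home/admin02
vmftp01,admin03,enable,/home/admin03
EOF

cat > "$workdir/FolderStatus" <<'EOF'
admin01,/sftp/inbox
admin02,/ftp/inbox
admin02,/sftp/inbox
admin02,/as2/inbox
admin03,/ftp/inbox
admin03,/sftp/inbox
EOF

# -t ,    fields are comma-separated
# -1 2    join key is field 2 of the first file (field 1 of the second by default)
# -o ...  print field 4 of file 1, then field 2 of file 2
# sed     deletes the single comma that join puts between the two output fields
result=$(join -t , -1 2 -o 1.4,2.2 "$workdir/AccountInfo" "$workdir/FolderStatus" | sed 's/,//')
printf '%s\n' "$result"
```

The sed step assumes the paths themselves contain no commas, which holds for the sample data.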

Related

Join command in Bash

Hi community. I've got a problem in the bash terminal. I need to merge two files using ; as the delimiter in the join command, but it doesn't work. How can I fix it? Thanks!
join -1 2 -2 2 -t; tasks.txt procowner.txt > answ.txt
Update: the error message is
join: option requires an argument -- t
usage: join [-a fileno | -v fileno ] [-e string] [-1 field] [-2 field]
[-o list] [-t char] file1 file2
zsh: command not found: tasks.txt
The ; is treated as a command terminator by bash. This in turn means bash sees two separate commands:
join -1 2 -2 2 -t
# and
tasks.txt procowner.txt > answ.txt
The first one generates a syntax error for the join command; the second one generates an error stating tasks.txt is not a valid command.
The simple fix is to quote the ;, e.g.:
join -1 2 -2 2 -t';' tasks.txt procowner.txt > answ.txt
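To see the fix in action, here is a small sketch. The contents of tasks.txt and procowner.txt are not shown in the question, so the rows below are hypothetical; both files are already sorted on field 2, the join key.

```shell
# Hypothetical sample data; the real files are not shown in the question.
workdir=$(mktemp -d)
printf '%s\n' 'backup;101' 'deploy;102' > "$workdir/tasks.txt"
printf '%s\n' 'alice;101' 'bob;102'     > "$workdir/procowner.txt"

# Quoting keeps the shell from treating ; as a command terminator.
quoted=$(join -1 2 -2 2 -t ';' "$workdir/tasks.txt" "$workdir/procowner.txt")

# A backslash escape has the same effect.
escaped=$(join -1 2 -2 2 -t \; "$workdir/tasks.txt" "$workdir/procowner.txt")

printf '%s\n' "$quoted"
```

Both forms pass a literal ; to join as the argument of -t, so the two variables end up holding the same text.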

Is there a way to merge two tables and keep only the matching values using linux bash?

I have two tables of string values, and the objective is to make a new table that keeps only the values that appear in both parent tables.
Example:
TABLE1
AX-18000257
AX-18000500
AX-18000816
AX-18000945
AX-18001189
AX-18001512
AX-18001524
TABLE2
AX-18000257
AX-18000512
AX-18000816
AX-18000947
AX-18001589
AX-18001525
AX-18001524
Expected output would be:
AX-18000257
AX-18000816
AX-18001189
AX-18001524
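It can be done with grep, where -F treats each line of file2 as a fixed string and -x requires a whole-line match:
grep -F -x -f file2 file1 > file3 #thanks David Ranieri
Or with join:
join -1 1 -2 1 file1 file2 > file3
join expects both inputs to be sorted on the join field. If they are not, it prints a warning such as "join: file 1 is not in sorted order" or "join: file 2 is not in sorted order" and may miss matches, so sort both files first.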
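A minimal sketch of the sort-then-join approach, using the two tables exactly as posted in the question. The file names with the .sorted suffix and the matches variable are illustrative. LC_ALL=C is set on both sort and join so that they agree on the collation order.

```shell
# Sketch: sort both tables, then keep only the IDs present in both.
workdir=$(mktemp -d)
printf '%s\n' AX-18000257 AX-18000500 AX-18000816 AX-18000945 \
              AX-18001189 AX-18001512 AX-18001524 > "$workdir/file1"
printf '%s\n' AX-18000257 AX-18000512 AX-18000816 AX-18000947 \
              AX-18001589 AX-18001525 AX-18001524 > "$workdir/file2"

# join requires sorted input; sort and join must use the same locale.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2" > "$workdir/file2.sorted"

matches=$(LC_ALL=C join "$workdir/file1.sorted" "$workdir/file2.sorted")
printf '%s\n' "$matches"
```

With the tables as posted, three IDs are common to both files: AX-18000257, AX-18000816 and AX-18001524.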

efficient join >100 files

I have a list containing >100 tab-delimited files, containing 5-8 million rows, and 16 columns (always in the same exact order). From each file I need to extract 5 specific columns, including one identifier-column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:
output1: ID, VAR1
output2: VAR2.1,VAR2.2,VAR2.3
output3: VAR3.1,VAR3.2,VAR3.3
output4: VAR4.1,VAR4.2,VAR4.3
where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.
My problem is that the input files contain partially overlapping IDs and I need to extract the union of these rows (i.e. all IDs that occur at least once in one of the input files). To be more exact, output1 should contain the unions of the "ID"- and "VAR1"-columns of all input files. The row order of the remaining output files should be identical to output1. Finally, rows not present in any given input file should be padded with "NA" in output2, output3 and output4.
I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.
My script so far:
ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
while read -r FILE; do
    # Sort on the ID column, then write ID plus each variable to its own temp file
    sort -k${ID},${ID} < "${FILE}" | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} 'BEGIN{OFS="\t"};{print $ID,$VAR1 > "tmp1";print $ID,$VAR2 > "tmp2";print $ID,$VAR3 > "tmp3";print $ID,$VAR4 > "tmp4"}'
    # Append rows whose ID is not yet in output1
    awk 'FNR==NR{a[$1]=$1;next};{if(($1 in a)==0){print $0 > "tmp5"}}' output1 tmp1
    cat output1 tmp5 > foo && mv foo output1
    # Full outer join of each accumulated output with the new column, padding gaps with NA
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar3 output3
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar4 output4
    rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1
Final remark: I use cat for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when I pre-process my files). So I can just append rows that are not already included to the bottom of output1 and sort the final output-file
First, figure out where most of the time is lost. You can run echo "running X"; time ./X for each step to make sure you are not optimizing the fastest part of the script.
You can run the three joins in parallel in the background, ( cmd args ) &, and then wait for all of them to finish. If the joins take 1 second and the awk part before them takes 10 minutes, this will not help much.
You can also put the wait before the cat output1 tmp5... line and before the final sort -k1... line. For this to work, you have to give the temporary files different names and rename them just before the joins. The idea is to generate the input for the three parallel joins for the first file in the background, wait, rename the files, and then run the joins in the background while generating the next inputs. After the loop completes, wait for the last joins to finish. This helps if the awk part consumes CPU time comparable to the joins.
HTH. Even more complex parallel execution scenarios are possible.
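The simplest variant of this advice, running the three joins concurrently and waiting for them, might look like the sketch below. The file names output2..4, tmp2..4 and bar2..4 follow the question's script; the sample rows and the workdir, TAB and result names are made up for illustration. It relies on -o auto, a GNU join extension that the question already uses.

```shell
# Sketch: run three independent joins in the background, then wait for all.
workdir=$(mktemp -d)
TAB=$(printf '\t')

for n in 2 3 4; do
    # Made-up rows: "output" is the accumulated table, "tmp" the next file's column.
    printf 'id1\ta%s\nid2\tb%s\n' "$n" "$n" > "$workdir/output$n"
    printf 'id2\tc%s\nid3\td%s\n' "$n" "$n" > "$workdir/tmp$n"
done

for n in 2 3 4; do
    # Each job reads and writes only its own files, so the jobs can overlap.
    ( join -e NA -a1 -a2 -t "$TAB" -o auto \
          "$workdir/output$n" "$workdir/tmp$n" > "$workdir/bar$n" \
      && mv "$workdir/bar$n" "$workdir/output$n" ) &
done
wait    # returns once every background job has finished

result=$(cat "$workdir/output2")
printf '%s\n' "$result"
```

Whether this pays off depends on the timing measurement suggested above: if sort and awk dominate the run time, parallel joins alone will change little.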

Join in unix when field is numeric in a huge file

So I have two files, File A and File B. File A is huge (>60 GB): it has 16 columns, a mix of numeric and string fields separated by "|", and over 600,000,000 lines. Field 3 in this file is the ID. It is a numeric field of varying length (e.g., someone's ID can be 1, and someone else's can be 100).
File B just has a bunch of IDs (~1,000,000), and I want to extract all the rows from File A whose ID is in File B. I have started doing this on Linux with the following code
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems like the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros but 1) I'm not entirely sure how to do this, and 2) this seems very slow/time inefficient.
Any other ideas out there? or help on how to add the padding of 0s only to the relevant field.
I would first sort file b using the unique flag (-u)
sort -u file.b > sortedfile.b
Then loop through sortedfile.b and grep file.a for each ID. In zsh I would do:
foreach C (`cat sortedfile.b`)
grep $C file.a > /dev/null
if [ $? -eq 0 ]; then
echo $C >> res.txt
fi
end
Redirect the output of grep to /dev/null, test whether there was a match ($? -eq 0), and if so append (>>) the ID to res.txt.
A single > would overwrite the file. I'm a bit rusty at zsh now, so there might be a typo. You may be using bash, which has no foreach and uses for ... in ...; do ... done instead.
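For reference, a bash-compatible rendering of the same loop could look like the sketch below. The sample rows and the workdir name are invented for illustration. Like the zsh version, it searches for the ID anywhere in the line, so an ID of 1 also matches a row whose ID is 100, and it scans file.a once per ID, which will be slow on a file of the size described in the question.

```shell
# Sketch: the zsh foreach loop rewritten with a while/read loop.
workdir=$(mktemp -d)
printf '%s\n' 'x|y|1|z' 'x|y|100|z' 'x|y|7|z' > "$workdir/file.a"   # made-up rows
printf '%s\n' 1 5 7 1 > "$workdir/file.b"                            # made-up IDs

sort -u "$workdir/file.b" > "$workdir/sortedfile.b"
: > "$workdir/res.txt"    # start with an empty result file

while IFS= read -r id; do
    # -q: print nothing, exit status 0 if at least one line matches
    # -F: treat the ID as a fixed string, not a regular expression
    if grep -qF -- "$id" "$workdir/file.a"; then
        printf '%s\n' "$id" >> "$workdir/res.txt"
    fi
done < "$workdir/sortedfile.b"

cat "$workdir/res.txt"
```

Using grep -q replaces the redirect to /dev/null and the separate $? test with a single condition.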

`join` with -e "NA" parameter incorrectly fills "NA" into a non-empty field

I am encountering a weird issue with join in a script I've written.
I have two files, say:
File1.txt (1st field: cluster size; 2nd field: brain coordinates)
54285;-40,-64,-2
5446;-32,6,24
File2.txt (1st field: cluster index; 2nd field: z-value; 3rd field: brain coordinates)
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
1;5.66;-50,16,34
Where I am joining on the last field, the brain coordinates.
When I use the command
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" File1.txt File2.txt
I expect
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
But I get
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
1;NA;5.66;-50,16,34
Such that the cluster size is missing on row 3 (i.e., cluster size for cluster #1, "5446").
If I edit File2 to take out lines that don't have a match in File1, i.e.:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
I get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
If I edit File2.txt like so, adding a line without a cluster-size value to cluster #1:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
1;5.66;-50,16,34
I also get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
BUT, if I edit File2.txt like so, adding a line without a cluster-size value to cluster #2:
File2.txt
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
Then I do not receive the expected output:
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
Can anyone give me any insight into why this is occurring? Have I done something wrong, or is there something quirky going on with join that I haven't been able to suss out from the man page?
Although alternative solutions for joining these files (that is, using tools other than join) are welcome, I am most interested in figuring out why the current command isn't working.
Input files to the join command must be sorted on the join fields.
Try this instead (note that this uses process substitution, which is a bashism):
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" <(sort -k2,2 -t';' File1.txt) \
    <(sort -k3,3 -t';' File2.txt)
1;5446;5.78;-32,6,24
2;54285;7.59;-40,-64,-2
1;NA;5.66;-50,16,34
2;NA;7.33;62,-60,14
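The reason the unsorted input loses "5446": join walks both files in a single forward pass. After pairing the -40,-64,-2 lines, it compares -32,6,24 from File1.txt with 62,-60,14 from File2.txt, decides the File1.txt line has no partner, and moves past it. When -32,6,24 later shows up in File2.txt, File1.txt is already exhausted, so the line is printed as unpaired with NA.

Where process substitution is not available, the same result can be reached with temporary files. This is a sketch with illustrative names (workdir, the .sorted files, result); LC_ALL=C keeps sort and join on the same collation order.

```shell
# Sketch: sort into temporary files instead of using <( ... ).
workdir=$(mktemp -d)
printf '%s\n' '54285;-40,-64,-2' '5446;-32,6,24' > "$workdir/File1.txt"
printf '%s\n' '2;7.59;-40,-64,-2' '2;7.33;62,-60,14' \
              '1;5.78;-32,6,24' '1;5.66;-50,16,34' > "$workdir/File2.txt"

# Sort each file on its own join field (field 2 and field 3 respectively).
LC_ALL=C sort -t ';' -k2,2 "$workdir/File1.txt" > "$workdir/File1.sorted"
LC_ALL=C sort -t ';' -k3,3 "$workdir/File2.txt" > "$workdir/File2.sorted"

# -a 2 keeps unpaired lines from the second file; -e NA fills the missing field.
result=$(LC_ALL=C join -a 2 -e NA -1 2 -2 3 -t ';' -o '2.1 1.1 2.2 0' \
             "$workdir/File1.sorted" "$workdir/File2.sorted")
printf '%s\n' "$result"
```

The rows come out ordered by the coordinate field rather than in the original File2.txt order, as in the output shown above.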

Resources