I have a list of >100 tab-delimited files, each containing 5-8 million rows and 16 columns (always in exactly the same order). From each file I need to extract 5 specific columns, including one identifier column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:
output1: ID, VAR1
output2: VAR2.1,VAR2.2,VAR2.3
output3: VAR3.1,VAR3.2,VAR3.3
output4: VAR4.1,VAR4.2,VAR4.3
where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.
My problem is that the input files contain partially overlapping IDs and I need to extract the union of these rows (i.e. all IDs that occur at least once in one of the input files). To be more exact, output1 should contain the union of the "ID" and "VAR1" columns of all input files. The row order of the remaining output files should be identical to output1. Finally, for IDs that are missing from a given input file, the corresponding columns should be padded with "NA" in output2, output3 and output4.
I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.
My script so far:
ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
while read FILE;do
sort -k${ID},${ID} < ${FILE} | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} 'BEGIN{OFS="\t"};{print $ID,$VAR1 > "tmp1";print $ID,$VAR2 > "tmp2";print $ID,$VAR3 > "tmp3";print $ID,$VAR4 > "tmp4"}'
awk 'FNR==NR{a[$1]=$1;next};{if(($1 in a)==0){print $0 > "tmp5"}}' output1 tmp1
cat output1 tmp5 > foo && mv foo output1
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar2 output3
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar2 output4
rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1
Final remark: I use cat for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when I pre-process my files), so I can just append rows that are not already included to the bottom of output1 and sort the final output file.
First you have to figure out where most of the time is lost. You can run `echo "running X"; time ./X` for each step and make sure you are not trying to optimize the fastest part of the script.
You can simply run the three joins in the background in parallel, i.e. `(cmd args) &`, and then `wait` for all of them to finish (see the sketch below). If this takes 1 second and the awk part before it takes 10 minutes, then this will not help a lot.
You can also put the wait before the `cat output1 tmp5...` line and before the final `sort -k1...` line. For this to work you'll have to name the temporary files differently and rename them just before the joins. The idea is to generate the input for the three parallel joins for the first file in the background, wait, then rename the files, run the joins in the background and generate the next inputs. After the loop is complete, just wait for the last joins to finish. This will help if the awk part consumes CPU time comparable to the joins.
HTH. You can build even more complex parallel execution scenarios along these lines.
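For example, here is a minimal sketch of the parallel-joins idea, reusing the file names from your script (untested, just to show the pattern of backgrounding and waiting):
join -e "NA" -a1 -a2 -o auto -t $'\t' -1 1 -2 1 output2 tmp2 > bar2 &
join -e "NA" -a1 -a2 -o auto -t $'\t' -1 1 -2 1 output3 tmp3 > bar3 &
join -e "NA" -a1 -a2 -o auto -t $'\t' -1 1 -2 1 output4 tmp4 > bar4 &
wait    # block until all three background joins have finished
mv bar2 output2; mv bar3 output3; mv bar4 output4
The three joins read different inputs and write to different files, so they can safely run concurrently; wait makes sure all of them are done before the renames.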
How can I split a large CSV file (1 GB) into multiple files (say one part with 1000 rows, the 2nd part with 10000 rows, the 3rd part with 100000 rows, etc.) and preserve the header in each part?
How can I achieve this:
h1 h2
a aa
b bb
c cc
.
.
12483720 rows
into
h1 h2
a aa
b bb
.
.
.
1000 rows
And
h1 h2
x xx
y yy
.
.
.
10000 rows
Another awk. First some test records:
$ seq 1 1234567 > file
Then the awk:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file
Explained:
$ awk '
NR==1 { # first record:
n=1000 # set first output file size and
h=$0 # store the header
}
{
print > n # output to file
}
NR==n+c { # once target NR has been reached. close(n) goes here if needed
n*=10 # grow target magnitude
c=NR-1 # set the correction factor.
print h > n # first the head
}' file
Count the records:
$ wc -l 1000*
1000 1000
10000 10000
100000 100000
1000000 1000000
123571 10000000
1234571 total
Here is a small adaptation of the solution from: Split CSV files into smaller files but keeping the headers?
awk -v l=1000 '(NR==1){header=$0;next}
               (NR==2 || n==l) {
                  c=sprintf("%0.5d",c+1);
                  close(file); file=FILENAME; sub(/csv$/,c".csv",file)
                  print header > file
                  if (n==l) l*=10
                  n=0
               }
               {print $0 > file; n++}' file.csv
This works in the following way:
(NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
(n==l){...}: Every time we wrote the requested amount of records/lines, we need to start writing to a new file. This happens every time n==l and we perform the following actions:
c=sprintf("%0.5d",c+1): increase the counter with one, and print it as 000xx
close(file): close the file you just wrote too.
file=FILENAME; sub(/csv$/,c".csv",file): define the new filename
print header > file: open the file and write the header to that file.
n=0: reset the current record count
l*=10: increase the maximum record count for the next file
{print $0 > file; n++}: write the entries to the file and increment the record count
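As a quick sanity check, this is roughly what a run should produce for an input named file.csv (the output names follow the FILENAME-plus-counter pattern above; the line counts are header plus data rows and are illustrative, not measured):
$ wc -l file.000*.csv
   1001 file.00001.csv    # header + 1000 records
  10001 file.00002.csv    # header + 10000 records
 100001 file.00003.csv    # header + 100000 records
 ...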
Hacky, but it utilizes the split utility, which does most of the heavy lifting for splitting the files. Then, since the split files follow a well-defined naming convention, I loop over the files without the header, concatenate the header with each file body into tmp.txt, and then move that file back to the original filename.
# Use `split` utility to split the file csv, with 5000 lines per files,
# adding numerical suffixes, and adding additional suffix '.split' to help id
# files.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv
# This identifies all files that should NOT have headers
# ls -1 *.split | egrep -v -e 'x0+\.split'
# This identifies files that do have headers
# ls -1 *.split | egrep -e 'x0+\.split'
# Walk the files that do not have headers. For each one, cat the header from
# file with header, with rest of body, output to tmp.txt, then mv tmp.txt to
# original filename.
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
cat <(head -1 $(ls -1 *.split | egrep -e 'x0+\.split')) $f > tmp.txt
mv tmp.txt $f
done
Here's a first approach:
#!/bin/bash
head -1 "$1" > header
tail -n +2 "$1" | split - y   # split only the body, so the first chunk does not get the header twice
for f in y*; do
cp header h$f
cat $f >>h$f
done
rm -f header
rm -f y*
The following bash solution should work nicely:
IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
curr_file_name="file_with_${curr_file_max_rows}_rows"
echo "$header" > "$curr_file_name"
for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
IFS='' read -r row || break 2
echo "$row" >> "$curr_file_name"
}
}
The outer loop produces the number of rows we're going to write to each successive file. It generates the file names and writes the header to them. It is an infinite loop because we don't check how many lines the input has, and therefore don't know beforehand how many files we're going to write, so we'll have to break out of this loop to end it.
Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this loop we try to read a line from the input. If it works, we write it to the current output file; if it doesn't (we've reached the end of the input), we break out of both levels of the loop.
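A hedged usage note: the snippet reads from standard input, so (assuming you save it as, say, split_by_magnitude.sh — the name is made up) you would run it like this:
bash split_by_magnitude.sh < file.csv
ls file_with_*_rows    # file_with_1000_rows, file_with_10000_rows, ...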
I have 3 files:
File 1:
1111111
2222222
3333333
4444444
5555555
File 2:
6666666
7777777
8888888
9999999
File 3
8888888 7777777
9999999 6666666
4444444 8888888
I want to search file 3 for lines that contain a string from both file 1 and file 2, so the result of this example would be:
4444444 8888888
because 4444444 is in file 1 and 8888888 is in file 2.
I currently have a solution, however my files contain 500+ lines and it can take a very long time to run my script:
#!/bin/sh
cat file1 | while read line
do
cat file2 | while read line2
do
grep -w -m 1 "$line" file3 | grep -w -m 1 "$line2" >> results
done
done
How can I improve this script to run faster?
The current process is going to be slow due to the repeated scans of file2 (once for each row in file1) and file3 (once for each row in the Cartesian product of file1 and file2). The additional invocation of sub-processes (as a result of the pipes |) is also going to slow things down.
So, to speed this up we want to look at reducing the number of times each file is scanned and limit the number of sub-processes we spawn.
Assumptions:
there are only 2 fields (when using white space as the delimiter) in each row of file3 (e.g., we won't see a row like "field1 has several strings" "and field2 does, too"); otherwise we will need to come back and revisit the parsing of file3
First our data files (I've added a couple extra lines):
$ cat file1
1111111
2222222
3333333
4444444
5555555
5555555 # duplicate entry
$ cat file2
6666666
7777777
8888888
9999999
$ cat file3
8888888 7777777
9999999 6666666
4444444 8888888
8888888 4444444 # switch position of values
8888888XX 4444444XX # larger values; we want to validate that we're matching on exact values and not sub-strings
5555555 7777777 # want to make sure we get a single hit even though 5555555 is duplicated in `file1`
One solution using awk:
$ awk '
BEGIN { filenum=0 }
FNR==1 { filenum++ }
filenum==1 { array1[$1]++ ; next }
filenum==2 { array2[$1]++ ; next }
filenum==3 { if ( array1[$1]+array2[$2] >= 2 || array1[$2]+array2[$1] >= 2) print $0 }
' file1 file2 file3
Explanation:
this single awk script will process our 3 files in the order in which they're listed (on the last line)
in order to apply different logic for each file we need to know which file we're processing; we'll use the variable filenum to keep track of which file we're currently processing
BEGIN { filenum=0 } - initialize our filenum variable; while the variable should automatically be set to zero the first time it's referenced, it doesn't hurt to be explicit
FNR maintains a running count of the records processed for the current file; each time a new file is opened FNR is reset to 1
when FNR==1 we know we just started processing a new file, so increment our variable { filenum++ }
as we read values from file1 and file2 we're going to use said values as the indexes for the associative arrays array1[] and array2[], respectively
filenum==1 { array1[$1]++ ; next } - create an entry in our first associative array (array1[]) with the index equal to field1 (from file1); the value of the array element will be a number > 0 (1 == field exists once in the file, 2 == field exists twice in the file); next says to skip the rest of the processing and go to the next row in the current file
filenum==2 { array2[$1]++ ; next } - same as previous command except in this case we're saving fields from file2 in our second associative array (array2[])
filenum==3 - optional because if we get this far in this script we have to be on our third file (file3); again, doesn't hurt to be explicit (and makes this easier to read/understand)
{ if ( ... ) } - test if the fields from file3 exist in both file1 and file2
array1[$1]+array2[$2] >= 2 - if (file3) field1 is in file1 and field2 is in file2 then we should find matches in both arrays and the sum of the array element values should be >= 2
array1[$2]+array2[$1] >= 2 - same as previous command except we're testing for our 2 fields (file3) being in the opposite source files/arrays
print $0 - if our test returns true (ie, the current fields from file3 exist in both file1 and file2) then print the current line (to stdout)
Running this awk script against my 3 files generates the following output:
4444444 8888888 # same as the desired output listed in the question
8888888 4444444 # verifies we still match if we swap positions; also verifies
# we're matching on actual values and not a sub-string (ie, no
# sign of the row `8888888XX 4444444XX`)
5555555 7777777 # only shows up in output once even though 5555555 shows up
# twice in `file1`
At this point we've a) limited ourselves to a single scan of each file and b) eliminated all sub-process calls, so this should run rather quickly.
NOTE: One trade-off of this awk solution is the requirement for memory to store the contents of file1 and file2 in the arrays; which shouldn't be an issue for the relatively small data sets referenced in the question.
You can do it faster if you load all the data first and then process it:
f1=$(cat file1)
f2=$(cat file2)
IFSOLD=$IFS; IFS=$'\n'
f3=( $(cat file3) )
IFS=$IFSOLD
for item in "${f3[#]}"; {
sub=( $item )
test1=${sub[0]}; test1=${f1//[!$test1]/}
test2=${sub[1]}; test2=${f2//[!$test2]/}
[[ "$test1 $test2" == "$item" ]] && result+="$item\n"
}
echo -e "$result" > result
I am stuck on this. I have a while-read loop in my code that takes very long, and I would like to run it on many processors. Specifically, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is that I don't know how to tell each while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But in neither case was I able to tell which file should be the input to the loop, so that parallel would understand "take each of those xa* files and place them into individual loops in parallel".
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800 KB) with DNA sequence accessions, reads it line-by-line and looks up the corresponding full taxonomy for those accessions in a databank (~45000 lines). Then, it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreds to thousands. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of all the sequences it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799: I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has all the accessions of the sequences that it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about ~12 h running on one processor to do it for all the 45000 DNA sequence accessions. Then, I would like to split input_file and do it on all the processors I have (14), because that would reduce the time spent on it.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
As an alternative I threw together a quick test.
#!/usr/bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep commands in a shell loop is the wrong approach here - you would be miles better off using Perl or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric characters, and if I nest 3 loops of that I get 36^3 or 46,656 lines, which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
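Applied to your actual data, a minimal sketch of that idea (assumptions: the accession is the first tab-separated field of both path/taxonomy.file.tsv and input_file; the output file name is made up) could look like this:
awk -F'\t' '
FNR==NR { tax[$1]=$2; next }                              # 1st file: accession -> taxonomy
        { print $1 "\t" (($1 in tax) ? tax[$1] : "NA") }  # 2nd file: constant-time hash lookup
' path/taxonomy.file.tsv input_file > accession_taxonomy.tsv
This replaces one grep of the taxonomy file per accession with a single pass over each file.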
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
wait          # let the background jobs finish before removing their input files
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead I am using native bash to spawn processes with &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait
So I have two text files
FILE1: 1-40 names
FILE2: 1-40 names
Now what I want the program (in the terminal) to do is go through each name, incrementing by ONE in each file, so that the first name from FILE1 runs with the first line from FILE2, and the 20th name from FILE1 runs with the 20th line from FILE2.
BUT I DON'T WANT IT TO run the first name of FILE1 against every name listed in FILE2, and repeat that over and over again.
Should I do a for loop?
I was thinking of doing something like:
for f in (cat FILE1); do
flirt -in $f -ref (cat FILE2);
done
I'm doing this using BASH.
Yes, you can do it quite easily, but it will require reading from two different file descriptors at once. You can simply redirect one of the files into the next available file descriptor and use it to feed your read loop, e.g.
while read f1var && read -u 3 f2var; do
echo "f1var: $f1var -- f2var: $f2var"
done <file1.txt 3<file2.txt
This will read line-by-line from each file: a line from file1.txt on the standard file descriptor into f1var, and a line from file2.txt on fd 3 into f2var.
A short example might help:
Example Input Files
$ cat f1.txt
a
b
c
$ cat f2.txt
d
e
f
Example Use
$ while read f1var && read -u 3 f2var; do \
echo "f1var: $f1var -- f2var: $f2var"; \
done <f1.txt 3<f2.txt
f1var: a -- f2var: d
f1var: b -- f2var: e
f1var: c -- f2var: f
Using paste as an alternative
The paste utility also provides a simple alternative for combining files line-by-line, e.g.:
$ paste f1.txt f2.txt
a d
b e
c f
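To feed those pairs back into a loop, a hedged sketch using the flirt call from the question (adjust to your actual command line) would be:
paste FILE1 FILE2 | while read -r in ref; do
    flirt -in "$in" -ref "$ref"    # one run per paired line
done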
In Bash, you might make use of arrays:
echo "Alice
> Bob
> Claire" > file-1
echo "Anton
Bärbel
Charlie" > file-2
n1=($(cat file-1))
n2=($(cat file-2))
for n in {0..2}; do echo ${n1[$n]} ${n2[$n]} ; done
Alice Anton
Bob Bärbel
Claire Charlie
Getting familiar with join and nl (number lines) can't be wrong, so here is a different approach:
nl -w 1 file-1 > file1
nl -w 1 file-2 > file2
join -1 1 -2 1 file1 file2 | sed -r 's/^[0-9]+ //'
nl will put a large amount of blanks in front of the small line numbers if we don't tell it -w 1.
We join the files by matching line number and remove the line number afterwards with sed.
Paste is of course much more elegant. Didn't know about this.
So I have two files, File A and File B. File A is huge (>60 GB), has 16 fields (a mix of numeric and string values), is separated by "|", and has over 600,000,000 lines. Field 3 in this file is the ID and it is a numeric field with different lengths (e.g., someone's ID can be 1, and someone else's can be 100).
File B just has a bunch of IDs (~1,000,000) and I want to extract all the rows from File A that have an ID that is in File B. I have started doing this on Linux with the following code:
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems like the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros but 1) I'm not entirely sure how to do this, and 2) this seems very slow/time inefficient.
Any other ideas out there? Or help on how to add the padding of 0s only to the relevant field?
I would first sort file b using the unique flag (-u)
sort -u file.b > sortedfile.b
Then loop through sortedfile.b and grep file.a for each entry. In zsh I would do:
foreach C (`cat sortedfile.b`)
grep $C file.a > /dev/null
if [ $? -eq 0 ]; then
echo $C >> res.txt
fi
end
Redirect output from grep to /dev/null and test whether there was a match ($? -eq 0) and append (>>) the result from that line to res.txt.
A single > would overwrite the file. I'm a bit rusty at zsh now so there might be a typo. You may be using bash, which has a slightly different loop syntax.