I am using Fedora and bash to do some text manipulation on my files. I am trying to combine a large number of files, each with two columns of data. From these files, I want to extract the data in the 2nd column and put it all in a single file. Previously, I used the following script:
paste 0_0.dat 0_6.dat 0_12.dat | awk '{print $1, $2, $4}' >0.dat
But this gets painfully hard as the number of files grows -- I am trying to do this with 100 files. So I looked through the web to see if there's a way to achieve this simply, but came up empty-handed.
I'd like to invoke a 'for' loop, if possible -- for example,
for i in $(seq 0 6 600)
do
paste 0_$i.dat | awk '{print $2}'>>0.dat
done
but this does not work, of course, because each pass of paste just appends more rows to 0.dat instead of adding a new column.
Please let me know if you have any recommendations on how to do what I'm trying to do ...
DATA FILE #1 looks like below (delimited by a space)
-180 0.00025432
-179 0.000309643
-178 0.000189226
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
DATA FILE #2
-180 0.0002352
-179 0.000423452
-178 0.00019304
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
First column goes from -180 to 180, with increment of 1.
DESIRED
(n is the number of columns, which equals the number of files)
-180 0.00025432 0.00025123 0.000235123 0.00023452 0.00023415 ... n
-179 0.000223432 0.0420504 0.2143450 0.002345123 0.00125235 ... n
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
179 0.002352534 ... n
180 0.001504992 ... n
Thanks,
join can get you your desired result.
join <(sort -r file1) <(sort -r file2)
Test:
[jaypal:~/Temp] cat file1
-180 0.00025432
-179 0.000309643
-178 0.000189226
[jaypal:~/Temp] cat file2
-180 0.0005524243
-179 0.0002424433
-178 0.0001833333
[jaypal:~/Temp] join <(sort -r file1) <(sort -r file2)
-180 0.00025432 0.0005524243
-179 0.000309643 0.0002424433
-178 0.000189226 0.0001833333
To do multiple files at once, you can use it with the find command:
find . -type f -name "file*" -exec join '{}' +
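Note that join itself only accepts two files per invocation, so for many files you would have to chain it pairwise. A rough sketch of that (my own addition, assuming the 0_*.dat naming from the question):
out=$(mktemp)
cp 0_0.dat "$out"
for i in $(seq 6 6 600); do
    # join the accumulated result with the next file, keeping the reverse-sorted keys aligned
    join <(sort -r "$out") <(sort -r "0_$i.dat") > "$out.next" && mv "$out.next" "$out"
done
mv "$out" 0.dat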
How about this:
paste "$#" | awk '{ printf("%s", $1);
for (i = 2; i < NF; i += 2)
printf(" %s", $i); printf "\n";
}'
This assumes that you don't run into a limit with paste (check how many open files it can have). The "$@" notation means 'all the arguments given, exactly as given'. The awk script simply prints $1 from each line of pasted output, followed by the even-numbered columns, and then a newline. It doesn't validate that the odd-numbered columns all match; it would perhaps be sensible to do so, and you could code a vaguely similar loop in awk for that (see the sketch below). It also doesn't check that the number of fields on this line is the same as on the previous line; that's another reasonable check. But this does do the whole job in one pass over all the files - for an essentially arbitrary list of files.
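As a rough sketch of that validation (my own addition, not part of the original answer), you could first check that every odd-numbered column matches the first:
paste "$@" | awk '{
    for (i = 3; i <= NF; i += 2)
        if ($i != $1) {
            # report the first mismatching key column and stop
            printf("line %d: key mismatch (%s vs %s)\n", NR, $1, $i) > "/dev/stderr"
            exit 1
        }
}'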
I have 100 input files -- how do I use this code to open up these files?
You put my original answer in a script 'filter-data'; you invoke the script with the 101 file names generated by seq. The paste command pastes all 101 files together; the awk command selects the columns you are interested in.
filter-data $(seq --format="0_%g.dat" 0 6 600)
The seq command with the format will list you 101 file names; these are the 101 files that will be pasted.
You could even do without the filter-data script:
paste $(seq --format="0_%g.dat" 0 6 600) | awk '{ printf("%s", $1);
      for (i = 2; i <= NF; i += 2)
          printf(" %s", $i);
      printf "\n";
}'
I'd probably go with the more general script as the main script, and if need be I'd create a 'one-liner' that invokes the main script with the specific set of arguments currently of interest.
The other key point which might be a stumbling block: paste is not limited to 2 files only; it can paste as many files as you can have open (give or take about 3).
Based on my assumptions that you see in the comments above, you don't need paste. Try this
awk '{
arr[$1] = arr[$1] "\t" $2 };
END {for (x=-180;x<=180;x++) print x "\t" arr[x]
}' *.txt \
| sort -n
Note that we just take all of the values into an array keyed on the first field, appending each new value under the $1 key. After all the data has been read in, the END section prints out each key and its accumulated values. (While debugging you can add labels like "x=" or ":vals=" inside the print to help explain what is happening; leave them out for completely clean tab-separated data.) Change '\t' to ':' or '|', or ... shudder ',' if you need to. Change the *.txt to whatever your filespec is.
Be aware that all Unix command lines have limits on the number of filenames, and on their combined length (the filenames, not the data inside them), that can be processed in one invocation. Let us know if you get error messages about that.
The pipe to sort ensures that data is sorted by column1.
With my test data, the output was
-178 0.0001892261 0.0001892262 0.0001892263 0.000189226
-179 0.0003096431 0.0003096432 0.0003096433 0.000309643
-180 0.000254321 0.000254322 0.000254323 0.00025432
178 0.0001892261 0.0001892262 0.0001892263 0.000189226
179 0.0003096431 0.0003096432 0.0003096433 0.000309643
180 0.000254321 0.000254322 0.000254323 0.00025432
Based on 4 files of input.
I hope this helps.
P.S. Welcome to StackOverflow (S.O.). Please remember to read the FAQs, http://tinyurl.com/2vycnvr , vote for good Q/A by using the gray triangles, http://i.imgur.com/kygEP.png , and to accept the answer that best solves your problem, if any, by pressing the checkmark sign, http://i.imgur.com/uqJeW.png
This might work for you:
echo *.dat | sed 's/\S*/<(cut -f2 &)/2g;s/^/paste /' | bash >all.dat
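To see what it builds: with files 0_0.dat, 0_6.dat and 0_12.dat, the command that gets piped to bash would look roughly like this (note that cut -f2 assumes tab-delimited input; for space-delimited files you would add -d' '):
# the first file is kept whole; every later file is replaced by a process substitution of its 2nd column
paste 0_0.dat <(cut -f2 0_6.dat) <(cut -f2 0_12.dat)
and the outer redirection then sends the combined result to all.dat.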
Related
I want to remove the leading part of specific sequences, given a list of IDs and positions, and extract the trimmed sequences from a large FASTA file.
input test.fasta file:
>GHAT8X
MKFNDIRNDGHEDCFNNIIFASKLSSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANAIAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLAPEYI
>GHAMNO
MRLIGCCLETENPVLVFEYVEYGTLADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVAKLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQRAVD
>GHAXM6
MYSCLGAIKNSGKEDKEKCIMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVLLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVELAVKCVSES
seqid_len.txt file:
GHAT8X 25
GHAMNO 26
GHAXM6 20
Expected output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
MRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVL
LTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVEL
AVKCVSES
I tried:
sed 's/_/|/g' seqid_len.txt | while read line;do grep -i -A1 ${line%%[1-9]*} test.fasta | seqkit subseq -r ${line##[a-z]* }:-1 ; done
I only get the GHAT8X and GHAMNO sequences out. Also, renaming the header does not work.
Any correction on this or any python solution would be really helpful.
Have a great weekend.
Thanks
Would you please try the following:
#!/bin/bash
awk 'NR==FNR {a[">" $1] = $2 + 0; next} # create an array which maps the header to the starting position of the sequence
$0 in a { # the header matches an array index
start = a[$0] # get the starting position
print # print the header
getline # read the sequence line
print substr($0, start) # print the sequence by removing the beginnings
}
' seqid_len.txt test.fasta | fold -w 60 # wrap the output within 60 columns
Output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
IMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLV
LLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVE
LAVKCVSES
You'll see that the 3rd sequence starts with IMR..., one column shifted compared with your expected MRN.... If the 3rd one is correct and the 1st and 2nd sequences are the ones that should be fixed, change the calculation $2 + 0 to $2 + 1.
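For reference, awk's substr() is 1-based, which is where the possible off-by-one comes from; a quick hypothetical check:
echo "MKFNDIRNDG" | awk '{print substr($0, 3)}'    # prints FNDIRNDG, i.e. the first 2 characters are dropped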
I have a tab separated text file, call it input.txt
cat input.txt
Begin Annotation Diff End Begin,End
6436687 >ENST00000422706.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-205|APOL1|2901|protein_coding| 50 6436736 6436687,6436736
6436737 >ENST00000426053.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-206|APOL1|2808|protein_coding| 48 6436784 6436737,6436784
6436785 >ENST00000319136.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000075315.5|APOL1-201|APOL1|3000|protein_coding| 51 6436835 6436785,6436835
6436836 >ENST00000422471.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319151.1|APOL1-204|APOL1|561|nonsense_mediated_decay| 11 6436846 6436836,6436846
6436847 >ENST00000475519.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319153.1|APOL1-212|APOL1|600|retained_intron| 11 6436857 6436847,6436857
6436858 >ENST00000438034.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319152.2|APOL1-210|APOL1|566|protein_coding| 11 6436868 6436858,6436868
6436869 >ENST00000439680.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319252.1|APOL1-211|APOL1|531|nonsense_mediated_decay| 10 6436878 6436869,6436878
6436879 >ENST00000427990.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319154.2|APOL1-207|APOL1|624|protein_coding| 12 6436890 6436879,6436890
6436891 >ENST00000397278.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319100.4|APOL1-202|APOL1|2795|protein_coding| 48 6436938 6436891,6436938
6436939 >ENST00000397279.8|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-203|APOL1|1564|protein_coding| 28 6436966 6436939,6436966
6436967 >ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding| 11 6436977 6436967,6436977
6436978 >ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay| 11 6436988 6436978,6436988
Using the information in input.txt I want to obtain information from a file called Other_File.fa. This file is an annotation file filled with ENST#'s (transcript IDs) and sequences of A's,T's,C's,and G's. I want to store the sequence in a file called Output.log (see example below) and I want to store the command used to retrieve the text in a file called Input.log (see example below).
I have tried to do this using awk and cut so far using a for loop. This is the code I have tried.
for line in `awk -F "\\t" 'NR != 1 {print substr($2,2,17)"#"$5}' input.txt`
do
transcript=`cut -d "#" -f 1 $line`
range=`cut -d "#" -f 2 $line` #Range is the string location in Other_File.fa
echo "Our transcript is ${transcript} and our range is ${range}" >> Input.log
sed -n '${range}' Other_File.fa >> Output.log
done
Here is an example of the 11 lines between ENST00000433768.5 and ENST00000431184.1 in Other_File.fa.
grep -A 11 ENST00000433768.5 Other_File.fa
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
>ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay|
The range value in input.txt for this transcript is 6436967,6436977. In my file Input.log for this transcript I hope to get
Our transcript is ENST00000433768.5 and our range is 6436967,6436977
And in Output.log for this transcript I hope to get
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
But I am getting the following error, and I am unsure as to why or how to fix it.
cut: ENST00000433768.5#6436967,6436977: No such file or directory
cut: ENST00000433768.5#6436967,6436977: No such file or directory
Our transcript is and our range is
My thought was that each line from the awk would be read as a string and cut could then split the string along the "#" symbol I added, but cut is treating each line as a filename and throwing an error when it can't locate that file in my directory.
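For reference, cut can read from a here-string rather than a file; a minimal sketch of that fix (keeping the rest of the loop unchanged):
transcript=$(cut -d "#" -f 1 <<< "$line")   # split the string itself, not a file named after it
range=$(cut -d "#" -f 2 <<< "$line")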
Thanks.
EDIT2: This is a generic solution which compares the 2 files (input and other_file.fa) and prints whichever range is found on whichever line. E.g., the range values might be found on line 300 but say you should print lines 1 to 20; it will work in that case too. Also note this calls the system command, which in turn calls sed (like you were using a range within sed); there are other ways too, such as loading the whole Input_file into an array and then printing, but I am going with this one here. Fair warning: this is not tested with huge files.
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum,",")
print arr[$2]
start=lineNum[1]
end=lineNum[2]
print "sed -n \047" start","end"p \047 " FILENAME
system("sed -n \047" start","end"p\047 " FILENAME)
start=end=0
}
' file1 FS="[>|]" other_file.fa
EDIT: With OP's edited samples, please try the following to print lines based on the other file. It assumes that the range values always refer to lines after the line on which they are found (e.g., the range values are found on the 3rd line and the range says print lines 4 to 10).
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum," ")
start=lineNum[1]
end=lineNum[2]
}
FNR>=start && FNR<=end{
print
if(FNR==end){
start=end=0
}
}
' file1 FS="[>|]" other_file.fa
You need not do this with a for loop that calls the awk program once per line. This could be done in a single awk, considering that you only have to print the values. Written and tested with your shown samples.
awk -F'[>| ]' 'FNR>1{print "Our transcript is:"$3" and our range is:"$NF}' Input_file
NOTE: This will print the transcript and range values for each line of your Input_file; in case you want to perform some further operation with those values, please do mention it.
I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the last two numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges from 1 to 10. A sample of these files is in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1,25,50...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result should have a row for each pair of averaged 2nd columns of the sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds with the sys and known files with the same two last numbers.
Besides, I would like to copy in the first column the penultimate number of the files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (Though you can easily swap out to awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
datamash -W mean 2 < "$systime" >> "$sysmeans"
datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste to combine them all and print the result to standard output.
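For reference, if datamash is not available, each datamash call could be replaced by the mean one-liner from the question, something like:
# mean of column 2, equivalent to: datamash -W mean 2 < "$systime"
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$systime" >> "$sysmeans"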
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" |sort -t\- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); type=(f[1] ~ /sys/ ? "sys" : "known"); a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if(type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, clear any existing average value and save the type as either "sys" or "known".
On every line, calculate the Cumulative Moving Average
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
I am stuck on this. I have a while-read loop in my code that takes very long, and I would like to run it on many processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is that I don't know how to tell each while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But with neither was I able to tell which file should be the input to the loop, so that parallel would understand "take each of those xa* files and feed each one to its own loop in parallel".
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines; varies from 500 to ~45000 lines, 800 kB) with DNA sequence accessions, reads it line by line, and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreds to thousands. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of all the sequences it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, take the sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has the accessions of all the sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about ~12 h running on one processor to do this for all 45000 DNA sequence accessions. So, I would like to split input_file and run it on all the processors I have (14), because that would reduce the time spent on it.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
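If you want to cap it at exactly your 14 threads, -j works together with --pipepart (just an illustration; the defaults are usually fine):
parallel -j14 -a input_file --pipepart doit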
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0-9 to the letters of the alphabet, I get 36 alphanumeric characters, and if I nest 3 loops of that I get 36^3 or 46,656 lines, which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
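Applied to your actual files, the same associative-array idea might look roughly like this (a sketch, assuming taxonomy.file.tsv is tab-separated with the accession in column 1 and the taxonomy in column 2):
# build a hash of accession -> taxonomy once, then look up every accession from input_file in one pass
awk -F'\t' 'FNR==NR {tax[$1] = $2; next} {print $1 "\t" tax[$1]}' taxonomy.file.tsv input_file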
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
wait    # let the background workers finish before removing the temp directory
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead I use native bash, spawning background processes with &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait
I'm trying to write a script to pull the numbers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script and it generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the following printout in the file; my goal is to print only the final SUM, not the individual digits printed out in a vertical format.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the individual numbers listed above; I just want the total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
while(match($0,/[0-9]+\.[0-9]+/)) # while matches on record
{
sum+=substr($0, RSTART, RLENGTH) # extract matches and sum them
$0=substr($0, RSTART + RLENGTH) # reset to start after previous match
count++ # count matches
}
}
END {
  print sum"/"count"="sum/count              # print sum, count and average
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
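A minimal, safer sketch for the original problem (only the total is appended, and the sum is computed into a variable first so the file has been fully read before anything is written back):
# corrected regex, and no read/write race on the same file
total=$(grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt" |
        awk '{ sum += $1 } END { print sum }')
printf 'Sum: %s\n' "$total" >> "/location/$value-1.txt"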