I have a text file with repeated data patterns, and grep keeps returning every match in the file instead of one block per loop iteration.
for ((count = 1; count !=17; count++)); do # 17 times
xuz1[count]=`grep -e "1 O1" $out_file | cut -c10-29`
xuz2[count]=`grep -e "2 O2" $out_file | cut -c10-29`
xuz3[count]=`grep -e "3 O3" $out_file | cut -c10-29`
echo ${xuz1[count]}
echo ${xuz2[count]}
echo ${xuz3[count]}
done
The data looks like this:
some text.....
Text....
.....
1 O1 111111 111111 111111
2 O2 222211 222211 222211
3 O3 643653 652346 757686
some text.....
1 O1 111122 111122 111122
2 O2 222222 222222 222222
3 O3 343653 652346 757683
some text.....
1 O1 111333 111333 111333
2 O2 222333 222333 222333
3 O3 343653 652346 757684
.
.
.
And the result I'm getting is:
xuz1[1] = 111111 111111 111111
xuz2[1] = 222211 222211 222211
xuz3[1] = 643653 652346 757686
xuz1[2] = 111111 111111 111111
xuz2[2] = 222211 222211 222211
xuz3[2] = 643653 652346 757686
...
The result I'm looking for is:
xuz1[1]=111111 111111 111111
xuz2[1]=222211 222211 222211
xuz3[1]=343653 652346 757683
xuz1[2]=111122 111122 111122
xuz2[2]=222222 222222 222222
xuz3[2]=343653 652346 757684
also tried "grep -m 1 -e"
Which way should I go?
for now I ended up with one line
grep -A4 -e "1 O1" $out_file | cut -c10-29
Some text.... Is a huge text part.
A little bash script with a single grep is enough:
grep -E '^[0-9]+ +O[0-9]+ +.*' "$out_file" |
while read idx oidx cols; do
if ((idx == 1)); then
let ++i
name=xuz$i
let j=1
fi
echo "$name[$j]=$cols"
let ++j
done
You haven't really described what you want, but I guess something like this.
awk '! /^[1-9][0-9]* O[0-9] / { n++; m=0; if (NR>1) print ""; next }
{ print "xuz" ++m "[" n "]=" substr($0, 10) }' "$out_file"
If the regex doesn't match, we assume we are looking at one of the "some text" pieces, and that this starts a new record. Increment n and reset m. Otherwise, print the output for this item within this record.
If some text could be more than one line, you will need a minor change, but I hope this should be enough at least to send you in the right direction.
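For instance, one way to handle multiple lines of "some text" is to start a new record only when a data line follows a non-data line; a rough sketch of that minor change:
awk '/^[1-9][0-9]* O[0-9] / { if (!m) n++; print "xuz" ++m "[" n "]=" substr($0, 10); next }
     { if (m) print ""; m = 0 }' "$out_file"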
You can do this in pure Bash, too, though this is going to be highly inefficient - you would expect a Bash while read loop to be at least a hundred times slower than Awk, and the code is markedly less idiomatic and elegant.
while read -r m x result; do
case $m::$x in
[1-9]::O[1-9])
printf 'xuz%d[%d]=%s\n' "$m" "$n" "$result";;
*)
# If n is unset, don't print an empty line
printf '%s' "${n+$'\n'}"
((n++));;
esac
done <"$out_file"
I would aggressively challenge any requirement to do this in pure Bash. If it's for homework, the requirement is unrealistic, and a core skill for shell script authors is to understand the limits of the shell and the strengths of the common support tools like Awk. The Awk language is virtually guaranteed to be available wherever you have a shell, in particular a heavy shell like Bash. (In a limited e.g. embedded environment, a limited shell like Dash would make more sense. Then e.g. the let keyword won't be available, though it should not be hard to make this script properly portable.)
The case statement accepts glob patterns, not regular expressions, so the pattern here is slightly less general (we accept one positive digit in the first field).
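If multi-digit indices ever appear in the first field, bash's extended globs could cover that too; a small sketch (assumes shopt -s extglob, and reuses the variables from the loop above):
shopt -s extglob               # enable extended glob patterns
case $m::$x in
    [1-9]*([0-9])::O[1-9])     # one or more digits, no leading zero
        printf 'xuz%d[%d]=%s\n' "$m" "$n" "$result";;
esac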
Thank you all for participating in the discussion.
This is my home project to help my wife extract data from research calculations; the speed-up is around 400 times.
The file used for extracting the data contains around 2000 lines.
The needed data blocks look like this,
and they're repeated 10-20 times in the file:
uiyououy COORDINATES
NR ATOM CCCCC X Y Z
1 O1 8.00 0.000000000 0.882236820 -0.789494235
2 O2 8.00 0.000000000 -1.218250722 -1.644061652
3 O3 8.00 0.000000000 1.218328524 0.400260050
4 O4 8.00 0.000000000 -0.882314622 2.033295837
Text text text text
tons of text
To extract the 4 lines I used the expression below:
grep -A4 --no-group-separator -e "1 O1" $from_file | cut -c23-64 > xyz_temp.txt
# grep the 4 lines at once into a temp file
sed -i '/^[ \t]*$/d' xyz_temp.txt
# delete empty lines from the xyz file
Next is to convert the strings into numbers (using '| bc -l' for the arithmetic):
while IFS= read line
do
IFS=' ' read -r -a arr_line <<< "$line"
# break line of xyz into 3 numbers
s1=$(echo "${arr_line[0]}" \* 0.529177249 | bc -l)
# some math conversion
s2=$(echo "${arr_line[1]}" \* 0.529177249 | bc -l)
s3=$(echo "${arr_line[2]}" \* 0.529177249 | bc -l)
#-------to array non sorted ------------
arr[$n]=${n}";"${from_file}";"${gd_}";"${frt[count_4s]}";"${n4}";"${s1}";"${s2}";"${s3}
echo ${arr[n]}
#--------------------------------------------
done <"$from_file_txt"
Sort the array:
IFS=$'\n' sorted=($(sort -t \; -k4 -k5 -g <<<"${arr[*]}"))
# -t sets the field separator ';', -k the sort column, -g a generic numeric sort
# -k4 -k5: sort by column 4, then by column 5
#printf "%s\n" "${sorted[*]}"
unset IFS
The last part combines the data into the result view:
echo "$n"
n2=1
n42=1
count_4s2=1
i=0
echo "============================== sorted =============================="
################### loop for empty 4s lines
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
printf "%s\n" "${sorted[i]}"
while [ $i -lt $((n-2)) ]
do
i=$((i+1))
if [ "$n42" = "4" ] # 1234
then n42=0
count_4s2=$((count_4s2+1))
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
fi
#--------------------------------------------
n2=$((n2+1))
n42=$((n42+1))
printf "%s\n" "${sorted[i]}"
done ############# while
#00000000000000000000000000000000000000
printf "%s\n"
echo ==END===END===END==
The output looks like this:
============================== sorted ==============================
;;;;;1;
17;A-13_A1+.out;1.3;0.4;1;0;.221176355474853043;-.523049776514580244
18;A-13_A1+.out;1.3;0.4;2;0;-.550350051428402955;-.734584881824005358
19;A-13_A1+.out;1.3;0.4;3;0;.665269869069959489;.133910683627893251
20;A-13_A1+.out;1.3;0.4;4;0;-.336096173116409577;1.123723974181515102
;;;;;2;
13;A-13_A1+.out;1.3;0.45;1;0;.279265277182782148;-.504490787956469897
14;A-13_A1+.out;1.3;0.45;2;0;-.583907412327951988;-.759310392973448167
15;A-13_A1+.out;1.3;0.45;3;0;.662538493711206290;.146829200993661293
16;A-13_A1+.out;1.3;0.45;4;0;-.357896358566036450;1.116971979936256771
;;;;;3;
9;A-13_A1+.out;1.3;0.5;1;0;.339333719743262501;-.482029749553797105
10;A-13_A1+.out;1.3;0.5;2;0;-.612395507070451545;-.788968880150283253
11;A-13_A1+.out;1.3;0.5;3;0;.658674809217196345;.163289820251690233
12;A-13_A1+.out;1.3;0.5;4;0;-.385613021360830052;1.107708808923212876
==END===END===END==
Note: some code might not be shown here.
The next step is to paste it into Excel with ';' as the separator.
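For example, dumping the sorted array into a file that Excel can import with ';' as the delimiter might look like this (a sketch; result.csv is a hypothetical name):
printf "%s\n" "${sorted[@]}" > result.csv   # one ';'-separated record per line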
I would like to substitute a set of single-byte characters with a set of literal strings in a stream, without any constraint on the line size.
#!/bin/bash
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
chars_to_strings $'\a\b\t\v' '<bell>' '<backspace>' '<horizontal-tab>' '<vertical-tab>'
The expected output would be:
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>...
I can think of a bash function that would do that, something like:
chars_to_strings() {
local delim buffer
while true
do
delim=''
IFS='' read -r -d '.' -n 4096 buffer && (( ${#buffer} != 4096 )) && delim='.'
if [[ -n "${delim:+_}" ]] || [[ -n "${buffer:+_}" ]]
then
# Do the replacements in "$buffer"
# ...
printf "%s%s" "$buffer" "$delim"
else
break
fi
done
}
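For what it's worth, the replacement step itself (the commented part above) could be done with bash parameter expansion; a sketch covering the four example characters:
buffer=${buffer//$'\a'/<bell>}
buffer=${buffer//$'\b'/<backspace>}
buffer=${buffer//$'\t'/<horizontal-tab>}
buffer=${buffer//$'\v'/<vertical-tab>}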
But I'm looking for a more efficient way, any thoughts?
Since you seem to be okay with using ANSI C quoting via $'...' strings, then maybe use sed?
sed $'s/\a/<bell>/g; s/\b/<backspace>/g; s/\t/<horizontal-tab>/g; s/\v/<vertical-tab>/g'
Or, via separate commands:
sed -e $'s/\a/<bell>/g' \
-e $'s/\b/<backspace>/g' \
-e $'s/\t/<horizontal-tab>/g' \
-e $'s/\v/<vertical-tab>/g'
Or, using awk, which replaces newline characters too (by customizing the Output Record Separator, i.e., the ORS variable):
$ printf '\a,\b,\t,\v\n' | awk -vORS='<newline>' '
{
gsub(/\a/, "<bell>")
gsub(/\b/, "<backspace>")
gsub(/\t/, "<horizontal-tab>")
gsub(/\v/, "<vertical-tab>")
print $0
}
'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><newline>
For a simple one-liner with reasonable portability, try Perl.
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
perl -pe 's/\a/<bell>/g;
s/\b/<backspace>/g;s/\t/<horizontal-tab>/g;s/\v/<vertical-tab>/g'
Perl internally does some intelligent optimizations, so it is not encumbered by lines which are longer than its input buffer.
Perl by itself is not POSIX, of course; but it can be expected to be installed on any even remotely modern platform (short of perhaps embedded systems etc).
Assuming the overall objective is to provide the ability to process a stream of data in real time, without having to wait for an EOL/end-of-buffer occurrence to trigger processing ...
A few items:
continue to use the while/read -n loop to read a chunk of data from the incoming stream and store it in a buffer variable
push the conversion code into something that's better suited to string manipulation (i.e., something other than bash); for the sake of discussion we'll choose awk
within the while/read -n loop, printf "%s\n" "${buffer}" and pipe the output from the while loop into awk; NOTE: the key item is to introduce an explicit \n into the stream so as to trigger awk processing for each new 'line' of input; the OP can decide whether this additional \n must be distinguished from a \n occurring in the original stream of data
awk then parses each line of input as per the replacement logic, making sure to prepend anything left over to the next line of input (i.e., for when the while/read -n breaks an item in the 'middle')
General idea:
chars_to_strings() {
while read -r -n 15 buffer # using '15' for demo purposes otherwise replace with '4096' or whatever OP wants
do
printf "%s\n" "${buffer}"
done | awk '{print NR,FNR,length($0)}' # replace 'print ...' with OP's replacement logic
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1 # add some delay to data being streamed to chars_to_strings()
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
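With the OP's replacements dropped in place of the demo print, the conversion stage might look roughly like this (a sketch; printing without a newline in awk also drops the \n characters injected by the read loop, so genuine newlines in the data would need the extra handling mentioned above):
chars_to_strings() {
    while read -r -n 4096 buffer
    do
        printf "%s\n" "${buffer}"
    done | awk '{
        gsub(/\a/, "<bell>")
        gsub(/\b/, "<backspace>")
        gsub(/\t/, "<horizontal-tab>")
        gsub(/\v/, "<vertical-tab>")
        printf "%s", $0        # no newline: strip the \n added by the read loop
    }'
}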
A variation on this idea using a named pipe:
mkfifo /tmp/pipeX
sleep infinity > /tmp/pipeX # keep pipe open so awk does not exit
awk '{print NR,FNR,length($0)}' < /tmp/pipeX &
chars_to_strings() {
while read -r -n 15 buffer
do
printf "%s\n" "${buffer}"
done > /tmp/pipeX
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
# kill background 'awk' and/or 'sleep infinity' when no longer needed
Don't waste FS/OFS - use those built-in variables to cover 2 of the 5 characters needed:
echo $' \t abc xyz \t \a \n\n ' |
mawk 'gsub(/\7/, "<bell>", $!(NF = NF)) + gsub(/\10/,"<bs>") +\
gsub(/\11/,"<h-tab>")^_' OFS='<v-tab>' FS='\13' ORS='<newline>'
<h-tab> abc xyz <h-tab> <bell> <newline><newline> <newline>
To have NO constraint on the line length you could do something like this with GNU awk:
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(foo,bar)
print
}'
That will read and process the input 100 chars at a time no matter which chars are present, whether it has newlines or not, and even if the input was one multi-terabyte line.
Replace gsub(foo,bar) with whatever substitution(s) you have in mind, e.g.:
$ printf '\a,\b,\t,\v' |
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(/\a/,"<bell>")
gsub(/\b/,"<backspace>")
gsub(/\t/,"<horizontal-tab>")
gsub(/\v/,"<vertical-tab>")
print
}'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab>
And of course it would be trivial to pass a list of old and new strings to awk rather than hardcoding them; you would just have to sanitize any regexp or backreference metachars before calling gsub().
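For instance, a rough sketch of such a parameterized call (old and new are arbitrary names here; it assumes none of the old characters are regexp metachars and none of the new strings contain & or backslash, which is exactly the sanitizing mentioned):
$ printf '\a,\b,\t,\v' |
awk -v RS='.{1,100}' -v ORS= -v old=$'\a\b\t\v' \
    -v new='<bell>,<backspace>,<horizontal-tab>,<vertical-tab>' '
BEGIN { n = split(new, repl, ",") }   # the i-th character of "old" maps to repl[i]
{
$0 = RT
for (i = 1; i <= n; i++)
    gsub(substr(old, i, 1), repl[i])  # dynamic regexp, hence the need to sanitize metachars in general
print
}'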
I want to compare the 2nd and 4th columns of lines in a file. In detail, line 1 with 2,3,4...N, then line 2 with 3,4,5...N, and so on.
I have written a script; it works, but it runs very long, over 30 minutes.
The file has 1733 lines including the header; my code is:
for line1 in {2..1733}; do \
for line2 in {$((line1+1))..1733}; do \
i7_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 2) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 2) | wc -l);
i5_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 4) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 4) | wc -l);
if [ $i7_diff -lt 3 ]; then
if [ $i5_diff -lt 3 ]; then
echo $(sed -n "${line1}p" Lab_primers.tsv)"\n" >> primer_collision.txt
echo $(sed -n "${line2}p" Lab_primers.tsv)"\n\n" >> primer_collision.txt
fi;
fi;
done
done
I used nested for loops, then sed to print exactly line $line, and cut to extract the desired column. Finally, the cmp and wc commands count the number of differences between the two columns of a pair of lines.
If the condition is met (both the 2nd and 4th columns of a pair of lines have fewer than 3 differences), the code prints the pair of lines to the output file.
Here is an excerpt of the input (it has 1733 lines):
I7_Index_ID index I5_Index_ID index2 primer
D703 CGCTCATT D507 ACGTCCTG 27
D704 GAGATTCC D507 ACGTCCTG 28
D701 ATTACTCG D508 GTCAGTAC 29
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
D705 ATTCAGAA D504 TCAGAGCC 45
D706 GAATTCGT D504 TCAGAGCC 46
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
D715 GGTGATGA D510 AACCTCTC 103
D716 AACCTACG D510 AACCTCTC 104
i787 TGCTTCCA i593 ATCGTCTC R35
Then the expected output is:
D703 CGCTCATT D507 ACGTCCTG 27
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
i787 TGCTTCCA i593 ATCGTCTC R35
My question is: what is a better way to code this, and how can I reduce the running time?
Thank you for your help!
You could start by sorting on fields 2 and 4.
Then there is no need for a double loop: if a pair exists, the two lines will be adjacent.
sort -k 2,2 -k 4,4 myfile.txt
Then we need to print only the bunches of consecutive lines that share the same fields 2 and 4.
first=yes
sort -k 2,2 -k 4,4 test.txt | while read l
do
fields=(${l})
new2=${fields[1]}
new4=${fields[3]} # Fields 2 and 4, bash-way
if [[ "$new2" = "$old2" ]] && [[ "$new4" = "$old4" ]]
then
if [[ $first ]]
then
# first time we print something for this series: we need
# to also print the previous line (the first of the series)
echo; echo "$oldl"
# But if the next line is identical (series of 3, no need to repeat this line)
first=
fi
echo "$l"
else
# This line is not identical to the previous. So nothing to print
# If the next one is identical to this one, then, this one will
# be the first of its series
first=yes
fi
old2=$new2
old4=$new4
oldl="${l}"
done
One frustrating thing: uniq -D does almost all the job we did in this script, except that it is unable to restrict the comparison to specific fields.
But we can also rewrite the lines so that uniq can work.
I'm not very fluent in awk (if I were, I am pretty sure awk could do the uniq work for me), but well,
sort -k 2,2 -k 4,4 test.txt | awk '{print $0" "$2" "$4}' | uniq -D -f 5 | awk '{printf "%-12s %-9s %-12s %-9s %s\n",$1,$2,$3,$4,$5}'
does the job.
sort sorts the lines by fields 2 and 4. awk appends a copy of fields 2 and 4 to the end of each line, which then makes uniq usable, since uniq is able to ignore the first N fields. So here we tell uniq to ignore the first 5 fields, that is, to work only on the copies of fields 2 and 4. With -D, uniq displays only the duplicated lines.
Then the last awk removes the copies of the fields we no longer need.
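For the record, here is a hedged awk-only sketch of that uniq step, relying on the same sort (it prints only the runs of consecutive lines that share fields 2 and 4):
sort -k 2,2 -k 4,4 test.txt |
awk '{ key = $2 FS $4 }
     key == prev { if (!shown) { print prevline; shown = 1 } print; next }
     { prev = key; prevline = $0; shown = 0 }'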
I am stuck on this. I have a while-read loop in my code that takes very long, and I would like to run it on many processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is that I don't know how to tell each while loop which file to read and work on.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried:
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
I also tried:
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But with neither was I able to tell which file would be the input to the loop, so that parallel would understand "take each of those xa* files and feed each one to an individual loop in parallel".
An example of the input file (a list of strings):
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines; it varies from 500 to ~45000 lines, 800 KB) with DNA sequence accessions, reads it line by line, and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreds to thousands. Only columns 1 and 2 matter here: column 1 is the name of the sequence and column 2 is the name of each sequence it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799: I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has all the accessions of the sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about ~12 h running on one processor to do this for all the 45000 DNA sequence accessions. So I would like to split input_file and run it on all the processors I have (14), because that would reduce the time spent on it.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, and run ncpu jobs in parallel.
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
As an alternative, I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric characters, and if I nest 3 loops of that I get 36^3, or 46,656, lines, which roughly matches your file sizes. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a, finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
wait          # wait for the background jobs to finish before removing the temp dir
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead I use native bash, spawning background processes with &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait
I have two files, smaller and bigger, and bigger contains all the lines of smaller. Those lines are almost the same; only the last column differs.
file_smaller
A NM 0
B GT 4
file_bigger
A NM 5 <-same as in file_smaller according to my rules
C TY 2
D OP 6
B GT 3 <-same as in file_smaller according to my rules
I would like to print the lines where the two files differ, which means:
wished_output
C TY 2
D OP 6
Could you please help me to do so? Thanks a lot.
You can do the following:
cat file_bigger file_smaller |sed 's=\(.*\).$=\1='|sort| uniq -u > temp_pat
grep -f temp_pat file_bigger ; rm temp_pat
which will (in the same order)
merge the files
remove the last column
sort the result
print only unique lines in temp_pat
find the original lines in file_bigger
all in all, the expected result.
awk 'FILENAME=="file_smaller" {arr[$1 $2]=$0}
FILENAME=="file_bigger" { tmp=$1 $2; if ( tmp in arr) {next} else {print $0}}
' file_smaller file_bigger
See if that meets your needs.
grep -vf <(cut -d " " -f 1-2 file_smaller| sed 's/^/^/') file_bigger
The process substitution results in this:
^A NM
^B GT
Then, grep -v removes those patterns from "file_bigger"
Bash 4 using associative arrays:
#!/usr/bin/env bash
f() {
if (( $# != 2 )); then
echo "usage: ${FUNCNAME} <smaller> <bigger>" >&2
return 1
fi
local -A smaller
local -a x
while read -ra x; do
smaller["${x[#]::2}"]=0
done <"$1"
while read -ra x; do
((${smaller["${x[@]::2}"]:-1})) && echo "${x[*]}"
done <"$2"
}
f /dev/fd/3 /dev/fd/0 <<"SMALLER" 3<&0 <<"BIGGER"
A NM 0
B GT 4
SMALLER
A NM 5
C TY 2
D OP 6
B GT 3
BIGGER
How can I print a value, either 1, 2 or 3, at random? My best guess failed:
#!/bin/bash
1 = "2 million"
2 = "1 million"
3 = "3 million"
print randomint(1,2,3)
To generate random numbers with bash, use the $RANDOM internal Bash variable:
arr[0]="2 million"
arr[1]="1 million"
arr[2]="3 million"
rand=$(( RANDOM % 3 ))
echo ${arr[$rand]}
From the bash manual for RANDOM:
Each time this parameter is referenced, a random integer between 0 and 32767 is generated. The sequence of random numbers may be initialized by assigning a value to RANDOM. If RANDOM is unset, it loses its special properties, even if it is subsequently reset.
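As a side note, assigning to RANDOM seeds the generator, so the same seed reproduces the same sequence; a quick sketch:
RANDOM=42              # seed the generator
echo $RANDOM $RANDOM   # the same two numbers on every run with this seed (exact values depend on the bash version)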
Coreutils shuf
Present in Coreutils, this utility works well if the strings don't contain newlines.
E.g. to pick a letter at random from a, b and c:
printf 'a\nb\nc\n' | shuf -n1
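If the strings may contain newlines, GNU shuf also has a NUL-delimited mode; a sketch (-z / --zero-terminated is GNU-specific):
printf '%s\0' $'first\nchoice' second third | shuf -z -n1 | tr -d '\0'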
POSIX eval array emulation + RANDOM
Modifying Marty's eval technique to emulate arrays (which are non-POSIX):
a1=a
a2=b
a3=c
eval echo \$a$(expr $RANDOM % 3 + 1)
This still leaves the RANDOM non-POSIX.
awk's rand() is a POSIX way to get around that.
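A minimal sketch of that awk route, reusing a1, a2 and a3 from above (srand() with no argument seeds from the current time, so calls within the same second repeat):
eval echo \$a$(awk 'BEGIN { srand(); print int(rand() * 3) + 1 }')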
A 64-character alphanumeric string:
randomString32() {
index=0
str=""
for i in {a..z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {A..Z}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {0..9}; do arr[index]=$i; index=`expr ${index} + 1`; done
for i in {1..64}; do str="$str${arr[$RANDOM%$index]}"; done
echo $str
}
~.$ set -- "First Expression" Second "and Last"
~.$ eval echo \$$(expr $RANDOM % 3 + 1)
and Last
~.$
I want to corroborate using shuf from coreutils, with the nice -n1 -e approach.
Example usage, for a random pick among the values a, b, c:
CHOICE=$(shuf -n1 -e a b c)
echo "choice: $CHOICE"
I looked at the balance for two sample sizes (1000 and 10000):
$ for lol in $(seq 1000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
350 a
316 b
334 c
$ for lol in $(seq 10000); do shuf -n1 -e a b c; done > shufdata
$ less shufdata | sort | uniq -c
3315 a
3377 b
3308 c
Ref: https://www.gnu.org/software/coreutils/manual/html_node/shuf-invocation.html