How can I parallelize my loop ? (fasta file)

How can I parallelize my loop ? (fasta file) - bash

I wrote a script to change specific lines in one text files (fasta format) and I want to parallelize because there is a lot of lines (~800k).
>CTC14_37541|M00842:336:000000000-C7WWK:1:2101:20913:9309:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=75;p=71|CO
And I want to transform it to:
>Sample-CTC14_Read37541
I have two problems.
I tried to run my script with and without function:
Without function, it works: all the lines I want to change are modified.
When I use a function, only one line is modified. Something is wrong in my function header()?
Second problem is the parallelization. I tried something with "&" but I'm not sure that is the best solution. Any idea?
My code without function and parallel:
#!/bin/bash
TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH
for fasta in *.fasta
do
echo $fasta
lines=$(grep ">" $fasta)
for line in $lines
do
if [[ $line = *">"* ]]; then
read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
newheader=$(echo ">Sample-$sample$read_nb")
sed -i -e "s/$line/$newheader/g" $fasta
sed -i -e "s/ /\n/g" $fasta
fi
done
done
echo "END"
My code with function and parallel:
#!/bin/bash
TMP_PATH="/path/where/is/my/fasta"
cd $TMP_PATH
n=0
maxjobs=500
header(){
if [[ $line = *">"* ]]; then
read_nb="_Read"$(echo $line | cut -d'|' -f1 | cut -d'_' -f2)
sample=$(echo $line | cut -d'_' -f1 | cut -d'>' -f2)
newheader=$(echo ">Sample-$sample$read_nb")
sed -i -e "s/$line/$newheader/g" $fasta
sed -i -e "s/ /\n/g" $fasta
fi
}
for fasta in *.fasta
do
lines=$(grep ">" $fasta)
for line in $lines
do
header $line &
#limit jobs
if (( $(($((++n)) % $maxjobs)) == 0 )) ; then
wait
echo $n wait
fi
done
done
I have a fasta file as input that contains several headers and sequences. And I want to transform headers in order to use my fasta file in a specific workflow. I need to go from that :
>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>CTC14_20535|M00842:336:000000000-C7WWK:1:1108:28568:20175:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=64|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>CTC14_24700|M00842:336:000000000-C7WWK:1:1110:7911:9824:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0|
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
To this:
>Sample-CTC14_Read18758
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGCGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCTTGGGGAGCAAACAGG
>Sample-CTC14_Read20535
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACCCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
>Sample-CTC14_Read24700
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
And I want to make this parallel because I have a lot of lines to change (~700-800k) and it takes very long time if I run the script line by line.
With my script without function, job is works but it's too long.
With my script with function and parallel, job doesn't work fine because only one header is changed in my fasta instead of all headers and I don't understand why. I tried different ways to write and call my function but the result is always the same.
Moreover, I tried with the gnu-parallel but it's the same way. I think my function or my call have a problem but I don't understand where.
I think use awk as you suggested is a good idea but I'm not comfortable with it. Can you help me please?
Proper format of my fasta file is:
>CTC14_1600|M00842:336:000000000-C7WWK:1:1101:26089:18004:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=71|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGACGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_11169|M00842:336:000000000-C7WWK:1:1105:11636:11876:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=76;p=65|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGACGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAACTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$
>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG$

Assuming that >CTC14_18758|M00842:336:000000000- is on a separate line, this code will convert the input to the output.
#!/bin/sed -f
#skip blank lines
/^[[:space:]]*$/n
#change >CTC14_18758|M00842:336:000000000-
# to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
/^>/s/|.*$//
# remove 2ndary header
# C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGAATATTGGAC...
# to
# TGGGGAATATTGGAC...
s/^[^>].*| //
Save that as a file/script.
Then mark it as executable with
chmod +x mySed
and run it like
./mySed -i fileIn
Or if you get an warning/error message about -i, then run
./mySed fileIn > fileOut && mv fileOut fileIn
Now you can eliminate your function header(), and the 2ndary loop in your code.
Just
for file in *.fasta ; do
echo "processing file=$file"
/path/to/mySed -i "$file"
# run other processing if needed
# don't think you need wait any more
#uncomment? wait
done
-------------- version 2 sed ---------------
#!/bin/sed -f
#skip blank lines
/^[[:space:]]*$/n
#>CTC14_18758|M00842:336:000000000-C7WWK:1:1108:17474:5670:0:66|o:98|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=66;p=62|CO:0| TGGGGA...
#change >CTC14_18758|M00842:336:000000000-
# to >Sample-CTC14_Read18758
s/^>/>Sample-/
s/_/_Read/
s/|.*| / /
# /^>/s/-.*| / /
# s/-.*| / /
works with data like
>CTC14_16471|M00842:336:000000000-C7WWK:1:1107:6941:10486:0:66|o:97|mo:0.000000|MR:n=0;r1=0;r2=0|Q30:p=77;p=70|CO:0| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCATGAGTGAAGAAGGCCTTTGGGTTGTAAAGCTCTTTTAGTGAGGAAGATAATGGCGGTACTCACAGAAGAAGTCCTGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGAGGGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCTGGTTAATAAGTTAAAAGTGAAATCCCGAGGCTTAACCTTGGAATTGCTTTTAAAGCTATTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTGATGTAGAGGTAAAATTCGTGAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAAGCGCGAAGGCGTGGGGAGCAAACAGG
IHTH

Related

Inline array substitution

I have file with a few lines:
x 1
y 2
z 3 t
I need to pass each line as paramater to some program:
$ program "x 1" "y 2" "z 3 t"
I know how to do it with two commands:
$ readarray -t a < file
$ program "${a[#]}"
How can i do it with one command? Something like that:
$ program ??? file ???

The (default) options of your readarray command indicate that your file items are separated by newlines.
So in order to achieve what you want in one command, you can take advantage of the special IFS variable to use word splitting w.r.t. newlines (see e.g. this doc) and call your program with a non-quoted command substitution:
IFS=$'\n'; program $(cat file)
As suggested by #CharlesDuffy:
you may want to disable globbing by running beforehand set -f, and if you want to keep these modifications local, you can enclose the whole in a subshell:
( set -f; IFS=$'\n'; program $(cat file) )
to avoid the performance penalty of the parens and of the /bin/cat process, you can write instead:
( set -f; IFS=$'\n'; exec program $(<file) )
where $(<file) is a Bash equivalent to to $(cat file) (faster as it doesn't require forking /bin/cat), and exec consumes the subshell created by the parens.
However, note that the exec trick won't work and should be removed if program is not a real program in the PATH (that is, you'll get exec: program: not found if program is just a function defined in your script).

Passing a set of params should be more organized :
In this example case I'm looking for a file containing chk_disk_issue=something etc.. so I set the values by reading a config file which I pass in as a param.
# -- read specific variables from the config file (if found) --
if [ -f "${file}" ] ;then
while IFS= read -r line ;do
if ! [[ $line = *"#"* ]]; then
var="$(echo $line | cut -d'=' -f1)"
case "$var" in
chk_disk_issue)
chk_disk_issue="$(echo $line | tr -d '[:space:]' | cut -d'=' -f2 | sed 's/[^0-9]*//g')"
;;
chk_mem_issue)
chk_mem_issue="$(echo $line | tr -d '[:space:]' | cut -d'=' -f2 | sed 's/[^0-9]*//g')"
;;
chk_cpu_issue)
chk_cpu_issue="$(echo $line | tr -d '[:space:]' | cut -d'=' -f2 | sed 's/[^0-9]*//g')"
;;
esac
fi
done < "${file}"
fi
if these are not params then find a way for your script to read them as data inside of the script and pass in the file name.

Weird bash results using cut

I am trying to run this command:
./smstocurl SLASH2.911325850268888.911325850268896
smstocurl script:
#SLASH2.911325850268888.911325850268896
model=$(echo \&model=$1 | cut -d'.' -f 1)
echo $model
imea1=$(echo \&simImea1=$1 | cut -d'.' -f 2)
echo $imea1
imea2=$(echo \&simImea2=$1 | cut -d'.' -f 3)
echo $imea2
echo $model$imea1$imea2
Result Received
&model=SLASH2911325850268888911325850268896
Result Expected
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896
What am I missing here ?

You are cutting based on the dot .. In the first case your desired string contains the first string, the one containing &model, so then it is printed.
However, in the other cases you get the 2nd and 3rd blocks (-f2, -f3), so that the imea text gets cutted off.
Instead, I would use something like this:
while IFS="." read -r model imea1 imea2
do
printf "&model=%s&simImea1=%s&simImea2=%s\n" $model $imea1 $imea2
done <<< "$1"
Note the usage of printf and variables to have more control about what we are writing. Using a lot of escapes like in your echos can be risky.
Test
while IFS="." read -r model imea1 imea2; do printf "&model=%s&simImea1=%s&simImea2=%s\n" $model $imea1 $imea2
done <<< "SLASH2.911325850268888.911325850268896"
Returns:
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896
Alternatively, this sed makes it:
sed -r 's/^([^.]*)\.([^.]*)\.([^.]*)$/\&model=\1\&simImea1=\2\&simImea2=\3/' <<< "$1"
by catching each block of words separated by dots and printing back.

You can also use this way
Run:
./program SLASH2.911325850268888.911325850268896
Script:
#!/bin/bash
String=`echo $1 | sed "s/\./\&simImea1=/"`
String=`echo $String | sed "s/\./\&simImea2=/"`
echo "&model=$String
Output:
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896

awk way
awk -F. '{print "&model="$1"&simImea1="$2"&simImea2="$3}' <<< "SLASH2.911325850268888.911325850268896"
or
awk -F. '$0="&model="$1"&simImea1="$2"&simImea2="$3' <<< "SLASH2.911325850268888.911325850268896"
output
&model=SLASH2&simImea1=911325850268888&simImea2=911325850268896

head output till a specific line

I have a bunch of files in the following format.
A.txt:
some text1
more text2
XXX
more text
....
XXX
.
.
XXX
still more text
text again
Each file has at least 3 lines that start with XXX. Now, for each file A.txt I want to write all the lines till the 3rd occurrence of XXX (in the above example it is till the line before still more text) to file A_modified.txt.
I want to do this in bash and came up with grep -n -m 3 -w "^XXX$" * | cut -d: -f2 to get the corresponding line number in each file.
Is is possible to use head along with these line numbers to generate the required output?
PS: I know a simple python script would do the job but I am trying to do in this bash for no specific reason.

A simpler method would be to use awk. Assuming there's nothing but files of interest in your present working directory, try:
for i in *; do awk '/^XXX$/ { c++ } c<=3' "$i" > "$i.modified"; done
Or if your files are very big:
for i in *; do awk '/^XXX$/ { c++ } c>=3 { exit }1' "$i" > "$i.modified"; done

head -n will print out the first 'n' lines of the file
#!/bin/sh
for f in `ls *.txt`; do
echo "searching $f"
line_number=`grep -n -m 3 -w "^XXX$" $f | cut -d: -f1 | tail -1`
# line_number now stores the line of the 3rd XXX
# now dump out the first 'line_number' of lines from this file
head -n $line_number $f
done

awk parse filename and add result to the end of each line

I have number of files which have similar names like
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out
etc.
I need to get number before .csv(1 or 2) from the file name and put it into end of every line in file with TAB separator.
I have written this code, it finds number that I need, but i do not know how to put this number into file. There is space in the filename, my script breaks because of it.
Also I am not sure, how to send to script list of files. Now I am working only with one file.
My code:
#!/bin/sh
string="DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out"
out=$(echo $string | awk 'BEGIN {FS="_"};{print substr ($7,0,1)}')
awk ' { print $0"\t$out" } ' $string

for file in *
do
sfx=$(echo "$file" | sed 's/.*_\(.*\).csv.*/\1/')
sed -i "s/$/\t$sfx/" "$file"
done

Using sed:
$ sed 's/.*_\(.*\).csv.*/&\t\1/' file
DWH_Export_AUSTA_20120701_20120731_v1_1.csv.397.dat.2012-10-02 04-01-46.out 1
DWH_Export_AUSTA_20120701_20120731_v1_2.csv.397.dat.2012-10-02 04-03-12.out 2
DWH_Export_AUSTA_20120801_20120831_v1_1.csv.397.dat.2012-10-02 04-04-16.out 1
To make this for many files:
sed 's/.*_\(.*\).csv.*/&\t\1/' file1 file2 file3
OR
sed 's/.*_\(.*\).csv.*/&\t\1/' file*
To make this changed get saved in the same file(If you have GNU sed):
sed -i 's/.*\(.\).csv.*/&\t\1/' file

Untested, but this should do what you want (extract the number before .csv and append that number to the end of every line in the .out file)
awk 'FNR==1 { split(FILENAME, field, /[_.]/) }
{ print $0"\t"field[7] > FILENAME"_aaaa" }' *.out
for file in *_aaaa; do mv "$file" "${file/_aaaa}"; done

If I understood correctly, you want to append the number from the filename to every line in that file - this should do it:
#!/bin/bash
while [[ 0 < $# ]]; do
num=$(echo "$1" | sed -r 's/.*_([0-9]+).csv.*/\t\1/' )
#awk -e "{ print \$0\"\t${num}\"; }" < "$1" > "$1.new"
#sed -r "s/$/\t$num/" < "$1" > "$1.mew"
#sed -ri "s/$/\t$num/" "$1"
shift
done
Run the script and give it names of the files you want to process. $# is the number of command line arguments for the script which is decremented at the end of the loop by shift, which drops the first argument, and shifts the other ones. Extract the number from the filename and pick one of the three commented lines to do the appending: awk gives you more flexibility, first sed creates new files, second sed processes them in-place (in case you are running GNU sed, that is).

Instead of awk, you may want to go with sed or coreutils.
Grab number from filename, with grep for variety:
num=$(<<<filename grep -Eo '[^_]+\.csv' | cut -d. -f1)
<<<filename is equivalent to echo filename.
With sed
Append num to each line with GNU sed:
sed "s/\$/\t$num" filename
Use the -i switch to modify filename in-place.
With paste
You also need to know the length of the file for this method:
len=$(<filename wc -l)
Combine filename and num with paste:
paste filename <(seq $len | while read; do echo $num; done)
Complete example
for filename in DWH_Export*; do
num=$(echo $filename | grep -Eo '[^_]+\.csv' | cut -d. -f1)
sed -i "s/\$/\t$num" $filename
done

Speed up bash filter function to run commands consecutively instead of per line

I have written the following filter as a function in my ~/.bash_profile:
hilite() {
export REGEX_SED=$(echo $1 | sed "s/[|()]/\\\&/g")
while read line
do
echo $line | egrep "$1" | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
done
exit 0
}
to find lines of anything piped into it matching a regular expression, and highlight matches using ANSI escape codes on a VT100-compatible terminal.
For example, the following finds and highlights the strings bin, U or 1 which are whole words in the last 10 lines of /etc/passwd:
tail /etc/passwd | hilite "\b(bin|[U1])\b"
However, the script runs very slowly as each line forks an echo, egrep and sed.
In this case, it would be more efficient to do egrep on the entire input, and then run sed on its output.
How can I modify my function to do this? I would prefer to not create any temporary files if possible.
P.S. Is there another way to find and highlight lines in a similar way?

sed can do a bit of grepping itself: if you give it the -n flag (or #n instruction in a script) it won't echo any output unless asked. So
while read line
do
echo $line | egrep "$1" | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
done
could be simplified to
sed -n "s/$REGEX_SED/\x1b[7m&\x1b[0m/gp"
EDIT:
Here's the whole function:
hilite() {
REGEX_SED=$(echo $1 | sed "s/[|()]/\\\&/g");
sed -n "s/$REGEX_SED/\x1b[7m&\x1b[0m/gp"
}
That's all there is to it - no while loop, reading, grepping, etc.

If your egrep supports --color, just put this in .bash_profile:
hilite() { command egrep --color=auto "$#"; }
(Personally, I would name the function egrep; hence the usage of command).

I think you can replace the whole while loop with simply
sed -n "s/$REGEX_SED/\x1b[7m&\x1b[0m/gp"
because sed can read from stdin line-by-line so you don't need read
I'm not sure if running egrep and piping to sed is faster than using sed alone, but you can always compare using time.
Edit: added -n and p to sed to print only highlighted lines.

Well, you could simply do this:
egrep "$1" $line | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
But I'm not sure that it'll be that much faster ; )

Just for the record, this is a method using a temporary file:
hilite() {
export REGEX_SED=$(echo $1 | sed "s/[|()]/\\\&/g")
export FILE=$2
if [ -z "$FILE" ]
then
export FILE=~/tmp
echo -n > $FILE
while read line
do
echo $line >> $FILE
done
fi
egrep "$1" $FILE | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
return $?
}
which also takes a file/pathname as the second argument, for case like
cat /etc/passwd | hilite "\b(bin|[U1])\b"

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How can I parallelize my loop ? (fasta file) - bash

Related

Inline array substitution

Weird bash results using cut

head output till a specific line

awk parse filename and add result to the end of each line

Speed up bash filter function to run commands consecutively instead of per line

Categories

Resources