Optimizing performance of an IO-heavy shell script

The following bash code, which reads from a single input file line by line and writes to a large number (~100) of output files, performs unreasonably slowly -- about 30 seconds for 10,000 lines -- when I want it to be usable on millions or billions of lines of input.
In the following code, batches is an already-defined associative array (in other languages, a map).
How can this be improved?
while IFS='' read -r line
do
    x=`echo "$line" | cut -d" " -f1`;
    y=`echo "$line" | cut -d" " -f2`;
    # echo "find match between $x and $y";
    a="${batches["$x"]}";
    b="${batches["$y"]}";
    if [ -z $a ] && [ -n $b ]
    then echo "$line" >> Output/batch_$b.txt;
    elif [ -n $a ] && [ -z $b ]
    then echo "$line" >> Output/batch_$a.txt;
    elif [ -z $a ] && [ -z $b ]
    then echo "$line" >> Output/batch_0.txt;
    elif [ $a -gt $b ]
    then echo "$line" >> Output/batch_$a.txt;
    elif [ $a -le $b ]
    then echo "$line" >> Output/batch_$b.txt;
    fi
done < input.txt

while IFS= read -r line; do
    x=${line%%$'\t'*}; rest=${line#*$'\t'}   # change $'\t' to match your real separator (the question's cut uses a space)
    y=${rest%%$'\t'*}; rest=${rest#*$'\t'}
    ...
done <input.txt
That way you aren't starting two external programs every single time you want to split line into x and y.
Under normal circumstances you could use read to do the string-splitting implicitly, by reading the columns into separate fields, but because read trims leading whitespace, that doesn't work correctly if (as here) your columns are whitespace-separated and the first can be empty; consequently, it's necessary to use parameter expansion. See BashFAQ #73 for details on how parameter expansion works, or BashFAQ #100 for a general introduction to string manipulation with bash's native facilities.
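To see the failure mode just described (a throwaway demo, not part of the script):

printf ' second\n' | { IFS=' ' read -r x y; echo "x=[$x] y=[$y]"; }
# prints x=[second] y=[] -- read ate the leading space, so the empty
# first column vanished and every field shifted left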
Also, re-opening your output files every time you want to write a single line to them is silly at this kind of volume. Either use awk, which handles this for you automatically (a sketch follows below), or write a helper (note that the following requires a fairly new release of bash -- probably 4.2):
write_to_file() {
    local filename content new_out_fd
    filename=$1; shift
    printf -v content '%s\t' "$@"        # join the remaining arguments with tabs
    content=${content%$'\t'}
    declare -g -A output_fds             # map: filename -> cached fd
    if ! [[ ${output_fds[$filename]} ]]; then
        exec {new_out_fd}>"$filename"    # open once, remember the fd
        output_fds[$filename]=$new_out_fd
    fi
    printf '%s\n' "$content" >&"${output_fds[$filename]}"
}
...and then:
if [[ $a && ! $b ]]; then
    write_to_file "Output/batch_$a.txt" "$line"
elif [[ ! $a ]] && [[ $b ]]; then
    write_to_file "Output/batch_$b.txt" "$line"
elif [[ ! $a ]] && [[ ! $b ]]; then
    write_to_file "Output/batch_0.txt" "$line"
elif (( a > b )); then
    write_to_file "Output/batch_$a.txt" "$line"
else
    write_to_file "Output/batch_$b.txt" "$line"
fi
Note that caching FDs only makes sense if you have few enough output files that you can keep a file descriptor open for each of them, and if files receive enough writes that avoiding the repeated re-opens is a net benefit. If that doesn't hold for you, feel free to leave this part out and only do the faster string-splitting.
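For reference, here's a sketch of the same dispatch logic in awk, which keeps output files open for you automatically. It assumes the batches map can be dumped to a two-column "key batch" file; batches.txt is a hypothetical name. (Use -F'[ ]' if empty leading columns must be preserved, since awk's default splitting collapses whitespace runs.)

awk '
    NR == FNR { batch[$1] = $2; next }   # first file: load the map
    {
        a = batch[$1]; b = batch[$2]
        if      (a != "" && b == "") out = a
        else if (a == "" && b != "") out = b
        else if (a == "" && b == "") out = 0
        else if (a + 0 > b + 0)      out = a
        else                         out = b
        print > ("Output/batch_" out ".txt")
    }
' batches.txt input.txt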
Just for completeness, here's another approach (also written using automatic FD management, thus requiring bash 4.2) -- running two cut invocations and letting each of them stream through the entire input file.
exec {x_columns_fd}< <(cut -d" " -f1 <input.txt)
exec {y_columns_fd}< <(cut -d" " -f2 <input.txt)
while IFS='' read -r line && \
      IFS='' read -r -u "$x_columns_fd" x && \
      IFS='' read -r -u "$y_columns_fd" y; do
    ...
done <input.txt
This works because it's not cut itself that's inefficient -- it's starting it up, reading its output, and shutting it down over and over that costs you. If you just run two copies of cut and let each of them process the whole file, performance is fine.

Related

Bash sed command gives me "invalid command code ."

I'm trying to automate a build process by replacing the .js chunk names on particular lines of my main.config.php file. When I run the following code:
declare -a js_strings=("footer." "footerJQuery." "headerCSS." "headerJQuery.")
build_path="./build/build"
config_path="./system/Config/main.config.php"
while read -r line;
do
    for js_string in ${js_strings[@]}
    do
        if [[ $line == *$js_string* ]]
        then
            for js_file in "$build_path"/*
            do
                result="${js_file//[^.]}"
                if [[ $js_file == *$js_string* ]] && [[ ${#result} -eq 3 ]]
                then
                    sed -i "s/$line/$line$(basename $js_file)\";/g" $config_path
                fi
            done
        fi
    done
done < "$config_path"
I get this message back, and file has not been updated/edited:
sed: 1: "./system/Config/main.co ...": invalid command code .
I haven't been able to find anything in my searches that pertains to this specific message. Does anyone know what I need to change or try to get the specific lines replaced in my .php file?
Updated script with same message:
declare -a js_strings=("footer." "footerJQuery." "headerCSS." "headerJQuery.")
build_path="./build/build"
config_path="./system/Config/main.config.php"
while read -r line;
do
    for js_string in ${js_strings[@]}
    do
        if [[ $line == *$js_string* ]]
        then
            for js_file in "$build_path"/*
            do
                result="${js_file//[^.]}"
                if [[ $js_file == *$js_string* ]] && [[ ${#result} -eq 3 ]]
                then
                    filename=$(basename $js_file)
                    newline="${line//$js_string*/$filename\";}"
                    echo $line
                    echo $newline
                    sed -i "s\\$line\\$newline\\g" $config_path
                    echo ""
                fi
            done
        fi
    done
done < "$config_path"
Example $line:
$config['public_build_header_css_url'] = "http://localhost:8080/build/headerCSS.js";
Example $newline:
$config['public_build_header_css_url'] = "http://localhost:8080/build/headerCSS.7529a73071877d127676.js";
Updated script with changes suggested by @Vercingatorix:
declare -a js_strings=("footer." "footerJQuery." "headerCSS." "headerJQuery.")
build_path="./build/build"
config_path="./system/Config/main.config.php"
while read -r line;
do
    for js_string in ${js_strings[@]}
    do
        if [[ $line == *$js_string* ]]
        then
            for js_file in "$build_path"/*
            do
                result="${js_file//[^.]}"
                if [[ $js_file == *$js_string* ]] && [[ ${#result} -eq 3 ]]
                then
                    filename=$(basename $js_file)
                    newline="${line//$js_string*/$filename\";}"
                    echo $line
                    echo $newline
                    linenum=$(grep -n "^${line}\$" ${config_path} | cut -d':' -f 1 )
                    echo $linenum
                    [[ -n "${linenum}" ]] && sed -i "${linenum}a\\
${newline}
;${linenum}d" ${config_path}
                    echo ""
                fi
            done
        fi
    done
done < "$config_path"
Using sed's s command to replace a line of that complexity is a losing proposition, because whatever delimiter you choose may appear in the line and mess things up. If these are in fact entire lines, it is better to delete them and insert a new one:
linenum=$(fgrep -nx -e "${line}" "${config_path}" | awk -F: '{print $1}')
[[ -n "${linenum}" ]] && sed -i "" "${linenum}a\\
${newline}
;${linenum}d" "${config_path}"
What this does is search for the line number of the line that matches $line in its entirety, then extract the line-number portion. fgrep (i.e. grep -F) is necessary so that the symbols in your file aren't interpreted as regular expressions. If there was a match, it runs sed, appending the new line (a) and then deleting the old one (d). Also note the empty string after -i: on macOS, BSD sed requires an explicit backup-suffix argument there. Without it, sed consumes your s/.../.../ script as the suffix and then tries to parse the filename as a sed script -- which is exactly what produced the "invalid command code ." error in the question.
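Here's a self-contained demonstration of the append-then-delete trick against a hypothetical demo.txt (GNU sed shown; on macOS write sed -i '' as above):

printf '%s\n' 'alpha' '$config = "http://localhost:8080/build/headerCSS.js";' 'omega' > demo.txt
line='$config = "http://localhost:8080/build/headerCSS.js";'
newline='$config = "http://localhost:8080/build/headerCSS.7529a73071877d127676.js";'
linenum=$(fgrep -nx -e "$line" demo.txt | cut -d: -f1)   # exact whole-line match, no regex
[[ -n "$linenum" ]] && sed -i "${linenum}a\\
${newline}
;${linenum}d" demo.txt
cat demo.txt   # the old line is gone; the new one sits in its place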

Shell - Skip matching lines while read

I'm looking for help with my shell script... hope I'll find it here...
Here's my code :
#!/bin/sh
while IFS= read -r line || [ -n "$line" ]
do
    [[ "$line" =~ ^[[:space:]]*\# ]] && continue # This line must stay
    [[ "$line" =~ *read[[:space:]]-r[[:space:]]line* ]] && continue
    echo "${line%$NL}"
done < $0
The first test suppresses comment-only lines.
The second test's purpose is to suppress the "while IFS= read..." lines -- no matter what, it's just a test :-)
The "done < $0" was written here intentionally... for the test!
Running the script outputs this:
while IFS= read -r line || [ -n "$line" ]
do
    [[ "$line" =~ ^[[:space:]]*\# ]] && continue # This line must stay
    [[ "$line" =~ *read[[:space:]]-r[[:space:]]line* ]] && continue
    echo "${line%$NL}"
done < $0
I thought the first line would be gone, because it matches the 2nd test.
What's my mistake?
For the record, I don't want to use an extra sed or awk command.
Actually, the input data (here the $0 file) has to be standard input (e.g. an extract from a tee command). I've read a lot of StackOverflow questions about this, with sed or awk answers that didn't match my purpose.
The regex is invalid. A short test shows:
> [[ 'while IFS= read -r line || [ -n "$line" ]' =~ *read[[:space:]]-r[[:space:]]line* ]]
> echo $?
2
The * can't be "alone" -- it can't be the first character in a POSIX extended regular expression. It has to "bind" to something, e.g. a dot (.), which represents any character. You want:
[[ $line =~ .*read[[:space:]]-r[[:space:]]line.* ]]
The problem is the leading quantifier * in the regex on the second continue line. You may use:
[[ "$line" =~ read[[:space:]]+-r[[:space:]]+line ]] && continue
There is no need to match anything before read or after line in this regex.
Also, it is better to use the quantifier + after [[:space:]], so that it matches 1 or more whitespace characters.
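A quick sanity check of the corrected pattern (a throwaway test, not part of the script):

line='while IFS= read -r line || [ -n "$line" ]'
[[ $line =~ read[[:space:]]+-r[[:space:]]+line ]] && echo matched   # prints: matched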
You can do more refactoring and combine both regexes into one using alternation, as in this code:
while IFS= read -r line || [[ -n $line ]]
do
    [[ $line =~ ^[[:space:]]*#|read[[:space:]]+-r[[:space:]]+line ]] && continue
    echo "${line%$NL}"
done < $0

Search for exact string

I'm trying to search for an exact string located in character positions 4-7.
When I run the cut command in the terminal it works;
however, in a script it fails, as I believe the if statement provides me "0".
This is what I've done:
for NAME in `cat LISTS_NAME`; do
    if [[ John == cut -c 4-7 "${NAME}" ]]; then
        Do Something ...
    fi
    if [[ Dana == cut -c 4-7 "${NAME}" ]]; then
        Do Something...
    fi
done
Can you advise me how to do this using cut or any other approach?
You aren't running the cut command there. You are comparing John and Dana to the literal string cut -c 4-7 <value-of-$NAME>.
You need to use:
if [[ John == $(cut -c 4-7 "${NAME}") ]]; then
etc.
That being said, you should only do the cut call once and store the result in a variable. And for exact matching, you need to quote the right-hand side of == to avoid glob matching. So:
substr=$(cut -c 4-7 <<<"${NAME}")
if [[ John == "$substr" ]]; then
And then, to avoid duplicated if ...; then lines, you could do better with a case statement:
substr=$(cut -c 4-7 <<<"${NAME}")
case $substr in
    John)
        Do something
        ;;
    Dana)
        Do something else
        ;;
esac
Your script has many problems, and you don't need cut. Do it this way:
while read -r line; do
    if [[ "${line:3:4}" == "John" ]]; then
        Do Something ...
    elif [[ "${line:3:4}" == "Dana" ]]; then
        Do Something...
    fi
done < LISTS_NAME
In BASH, "${line:3:4}" is the same as cut -c 4-7.
EDIT: If you don't want precise string matching then you can use:
while read -r line; do
    if [[ "${line:3}" == "John"* ]]; then
        Do Something ...
    elif [[ "${line:3}" == "Dana"* ]]; then
        Do Something...
    fi
done < LISTS_NAME
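For anyone unsure of the offsets, a quick check with a made-up sample line:

line='xx John Smith'          # hypothetical sample data
echo "${line:3:4}"            # -> John (offset 3, length 4; bash offsets are 0-based)
echo "$line" | cut -c 4-7     # -> John (cut columns are 1-based, 4 through 7)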

iterating through each line and each character in a file

I'm trying to get a file name and a character index, and to print the character at that index from each line (and do this for each character index the user enters, if such a character exists).
This is my code:
#!/bin/bash
read file_name
while read x
do
    if [ -f $file_name ]
    then
        while read string
        do
            counter=0
            while read -n char
            do
                if [ $counter -eq $x ]
                then
                    echo $char
                fi
                counter=$[$counter+1]
            done < $(echo -n "$string")
        done < $file_name
    fi
done
But, it says an error:
line 20: abcdefgh: No such file or directory
Line 20 is the last done, so it doesn't help me figure out where the error is.
So what's wrong in my code and how do I fix it?
Thanks a lot.
I think "cut" might fit the bill:
read file_name
if [ -f $file_name ]
then
    while read char    # read a character index from the user
    do
        cut -c $char $file_name
    done
fi
This line seems to be problematic:
done < $(echo -n "$string")
Replace that with:
done < <(echo -n "$string")
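Equivalently, a here-string avoids the process substitution entirely. A small sketch of the inner loop (note that <<< appends a trailing newline, and that the question's read -n also needs an explicit count of 1):

counter=0
while IFS= read -r -n 1 char
do
    if [ $counter -eq $x ]
    then
        echo "$char"
    fi
    counter=$((counter+1))
done <<<"$string"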
Replace this:
counter=0
while read -n char
do
    if [ $counter -eq $x ]
    then
        echo $char
    fi
    counter=$[$counter+1]
done < $(echo -n "$string")
with this:
if [ $x -lt ${#string} ]
then
    echo "${string:$x:1}"
fi
It does the same thing, but avoids such errors.
Another approach is to use cut:
cut -b $(($x+1)) $file_name | grep -v "^$"
It can replace the two inner loops.
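Why the +1 and the grep: the script's counter is 0-based while cut's columns are 1-based, and cut emits an empty line for any input line shorter than the requested column. A quick illustration with made-up data:

x=2
printf '%s\n' 'abcdef' 'xy' 'hello' > file.txt   # hypothetical input
cut -b $((x+1)) file.txt                # -> c, an empty line, l
cut -b $((x+1)) file.txt | grep -v "^$" # -> c, l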

Read line by line from different locations please help to avoid duplicated code

I have the following script:
for args
do
    while read line; do
        # do something
    done <"$args"
done
If the script is started with a list of filenames, it should read each file line by line.
Now I'm looking for a way to read from stdin when the script is started without a list of filenames, but I don't want to duplicate the while loop.
Any ideas?
Quick answer:
[[ -z $1 ]] && defaultout=/dev/stdin
for f in "$@" $defaultout; do
    while read line; do
        # do something
    done < "$f"
done
Drawback: the parameters are not validated as files.
Second attempt:
[[ -z $1 ]] && defaultout=/dev/stdin
for f in $@ $defaultout; do
    if [[ -f $f ]]; then
        while read line; do
            # do something
        done < "$f"
    fi
done
Drawback: Filenames with spaces will be parsed into two words.
You could try:
args="$*"
if [ "$args" = "" ]; then
args=/dev/stdin;
fi
for arg in $args; do
while read -r line; do
# do something
done < "$arg";
done
The following should do what you want:
cat "$#" | while read line; do
# something
done
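For completeness, another way to avoid duplicating the loop (a sketch, not from the answers above) is to put it in a function and redirect its input per file:

# Process stdin line by line; the loop lives in exactly one place.
process() {
    while IFS= read -r line; do
        printf '%s\n' "$line"   # do something
    done
}

if [ "$#" -eq 0 ]; then
    process                     # no filenames: read standard input
else
    for f in "$@"; do
        process < "$f"          # one redirection per named file
    done
fi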
