I have big log files (1-2 GB and more). I'm new to programming, and bash has been useful and easy for me: when I need something, I can usually get it done (sometimes with help from people here). Simple scripts work fine, but when I need complex operations, everything runs very slowly, maybe because bash is slow or maybe because my programming skills are bad.
So do I need C for complex processing of my server log files, or do I just need to optimize my scripts?
If I just need optimization, how can I check which parts of my code are slow and which are fine?
For example I have while-do loop:
while read -r date month size;
do
...
...
done < file.tmp
How can I use awk to make this run faster?
That depends on how you use bash. To illustrate, consider how you'd sum a possibly large number of integers.
This function does what Bash was meant for: being control logic for calling other utilities.
sumlines_fast() {
awk '{n += $1} END {print n}'
}
It runs in 0.5 seconds on a million line file. That's the kind of bash code you can very effectively use for larger files.
Meanwhile, this function does what Bash is not intended for: being a general purpose programming language:
sumlines_slow() {
local i=0
while IFS= read -r line
do
(( i += $line ))
done
echo "$i"
}
This function is slow, and takes 30 seconds to sum the same million line file. You should not be doing this for larger files.
Finally, here's a function that could have been written by someone who has no understanding of bash at all:
sumlines_garbage() {
i=0
for f in `cat`
do
i=`echo $f + $i | bc`
done
echo $i
}
It treats forks as being free and therefore runs ridiculously slowly. It would take something like five hours to sum the file. You should not be using this at all.
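If you want to check numbers like these yourself (and, more generally, find out which parts of a script are slow), the simplest tool is bash's built-in time keyword on a representative input. A minimal sketch, assuming the two functions above are defined in your shell:
# generate a million-line test file of integers
seq 1000000 > /tmp/numbers.txt
# time each implementation on the same input
time sumlines_fast < /tmp/numbers.txt
time sumlines_slow < /tmp/numbers.txt
To tie this back to the loop in the question: a while read -r date month size loop can usually be replaced by a single awk invocation that reads file.tmp once. The loop body wasn't shown, so as a purely illustrative sketch, assume it was summing the size column per month:
awk '
    { total[$2] += $3 }                        # accumulate size (field 3) per month (field 2)
    END { for (m in total) print m, total[m] } # one line of output per month
' file.tmp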
I wrote a little bash script to check my memory usage and warn me if it's too high.
My problem now is that I would like to keep the floating-point value instead of just cutting it away, but I'm not able to do it.
I would prefer awk over anything else as it's pre-installed on many systems and I already use it in the script.
#!/bin/bash
#define what values are too high (%)
inacceptableRAM="90"
#check RAM %
RAM=`free | grep Mem | awk '{print $3/$2 * 100.0}'`
#send Alarm for RAM
if [ $RAM -ge $inacceptableRAM ] ; then
echo "Alarm RAM usage is #$RAM"
fi
So how can I replace my if [ ... -ge ... ] comparison using awk?
I think it should look something like:
awk ' $RAM >= $inacceptableRAM '
but what do I need to do to make it work inside the bash script?
Since you're comparing with an integer, you can just trim off the decimal part when comparing:
if [ "${RAM%.*}" -ge "$inacceptableRAM" ] ; then
If you want to do it entirely in awk, the only tricky thing is that you have to use -v var=value to convert the inacceptableRAM shell variable into an awk variable:
free | awk -v limit="$inacceptableRAM" '/Mem/ {ram=$3/$2*100; if (ram>=limit) print ram}'
Note that I'm using /Mem/ in the awk script to effectively replace the grep command. Piping from grep to awk is almost never necessary, since you can just do it all in awk.
Other recommendations: use $( ) instead of backticks for command substitutions (see BashFAQ #82), and use lower- or mixed-case variable names (e.g. ram instead of RAM) to avoid accidentally using one of the many all-caps names that have special meanings (see this answer).
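Putting those recommendations together, one possible shape for the whole check (a sketch only, with limit used in place of inacceptableRAM for brevity) is to let awk print the usage only when it crosses the threshold:
#!/usr/bin/env bash
# threshold in percent; lower-case name and $( ) instead of backticks, per the notes above
limit=90
# awk prints the usage only when it is at or above the threshold,
# so an empty result means everything is fine
ram=$(free | awk -v limit="$limit" '/Mem/ && $3/$2*100 >= limit {print $3/$2*100}')
if [ -n "$ram" ]; then
    echo "Alarm RAM usage is #$ram"
fi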
An alternative to awk is bc, something like:
#!/usr/bin/env bash
#define what values are too high (%)
inacceptableRAM="90"
#check RAM %
ram=$(free | awk '/Mem/{print $3/$2 * 100.0}')
#send Alarm for RAM
if (( $(bc <<< "$ram > $inacceptableRAM") )) ; then
echo "Alarm RAM usage is #$ram"
fi
You're trying to do too much in shell. A shell is a tool to manipulate files and processes and to sequence calls to other tools. The tool that the guys who created shell also created for shell to call to manipulate text is awk:
#!/usr/bin/env bash
free |
awk '
BEGIN {
#define what values are too high (%)
unacceptableRam = 90
}
/Mem/ {
#check RAM %
ram = ( $2 ? $3 / $2 * 100 : 0 )
#send Alarm for RAM
if ( ram >= unacceptableRam ) {
print "Alarm RAM usage is #" ram
}
}
'
It's worth considering whether it's time for you to "upgrade" to a proper programming language with support for features like floating-point arithmetic. Bash and shell scripting are great, but they run up against limitations very quickly. Even Google's Shell Style Guide suggests changing languages when things get complicated. It's likely you could get a Python script doing exactly what your Bash script is doing with just a few more lines of code.
That said, I am very guilty of leaning on Bash even when I probably shouldn't :)
I recently wanted to compute a floating point average in a Bash script and decided to delegate to a Python subprocess just for that task, rather than swap over fully to Python. I opted for Python because it's almost as ubiquitous as Bash, easy to read, and extensible. I considered bc but was worried it might not be installed on all systems. I didn't consider awk because, as great a tool as it is, its syntax is often opaque and arcane. Python is probably slower than bc or awk but I really don't care about that incremental performance difference. If I did care that would be strong evidence I should rewrite the whole script in a performant language.
For your use case Jetchisel's suggestion to use bc <<< "$ram > $inacceptableRAM" is probably sufficient, as the Python equivalent is more verbose, but for something more complex Python may be a good choice.
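For illustration, delegating just the comparison to a Python one-liner from the existing script might look something like this (a sketch, assuming python3 is on the PATH; variable names follow the question):
#!/usr/bin/env bash
inacceptableRAM="90"
ram=$(free | awk '/Mem/{print $3/$2 * 100.0}')
# python3 exits 0 when the condition holds and 1 otherwise, so it can drive the if directly
if python3 -c "import sys; sys.exit(0 if float(sys.argv[1]) >= float(sys.argv[2]) else 1)" "$ram" "$inacceptableRAM"; then
    echo "Alarm RAM usage is #$ram"
fi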
I want to rename multiple individual entries in a long file based on a comma-delimited table. I figured out a way to do it, but I feel it's highly inefficient and I'm wondering if there's a better way.
My file contains >30k entries like this:
>Gene.1::Fmerg_contig0.1::g.1::m.1 Gene.1::Fmerg_contig0.1::g.1
TPAPHKMQEPTTPFTPGGTPKPVFTKTLKGDVVEPGDGVTFVCEVAHPAAYFITWLKDSK
>Gene.17::Fmerg_Transcript_1::g.17::m.17 Gene.17::Fmerg_Transcript_1::g.17
PLDDKLADRVQQTDAGAKHALKMTDEGCKHTLQVLNCRVEDSGIYTAKATDENGVWSTCS
>Gene.15::Fmerg_Transcript_1::g.15::m.15 Gene.15::Fmerg_Transcript_1::g.15
AQLLVQELTEEERARRIAEKSPFFMVRMKPTQVIENTNLSYTIHVKGDPMPNVTFFKDDK
And the table with the renaming information looks like this:
original,renamed
Fmerg_contig0.1,Fmerg_Transcript_0
Fmerg_contig1.1,Fmerg_Transcript_1
Fmerg_contig2.1,Fmerg_Transcript_2
The inefficient solution I came up with looks like this:
#!/bin/bash
#script to revert dammit name changes
while read line; do
IFS="," read -r -a contig <<< "$line"
sed -i "s|${contig[1]}|${contig[0]}|g" Fmerg_final.fasta.transdecoder_test.pep
done < Fmerg_final.fasta.dammit.namemap.csv
However, this means that sed runs through the entire target file once for every line of the table.
I could imagine there is a way to read the big file only once and check each line against the name list as you go, but I'm not sure how to tackle this. I chose bash because it's the language I'm most fluent in, but I'm not averse to using Perl or Python if they offer an easier solution.
This is an O(n) problem and you solved it with an O(n) solution, so I wouldn't consider it inefficient. However, if you are comfortable with bash you can do more with it, no problem.
Divide and conquer.
I have done this many times, as you can reduce the total work time to something closer to the time it takes to process a single part.
Take this pseudocode: I call a function that cuts the 30K-line file into, say, X parts, then I process each part in a loop with & so they run as background jobs.
declare -a file_part_names
# cut the rename table into parts
function cut_file_into_parts() {
    orig_file="$1"
    number_parts="$2"
    # e.g. split "$orig_file" into "$number_parts" pieces and record their names in file_part_names
}
# handle the renaming driven by one part of the table
function rename_fields_in_file() {
    file_part="$1"
    # "$tmp_file" stands for this worker's own temporary copy of the big file
    while IFS="," read -r -a contig; do
        sed -i "s|${contig[1]}|${contig[0]}|g" "$tmp_file"
    done < "$file_part"
}
# main
cut_file_into_parts "Fmerg_final.fasta.dammit.namemap.csv" 500
for file_part in "${file_part_names[@]}"; do
    # keep roughly 100 background jobs running at a time
    while (( $(jobs -rp | wc -l) >= 100 )); do
        sleep 10
    done
    rename_fields_in_file "$file_part" &
done
wait
# Now that you have a pile of temp files processed, combine them all.
for tmp_result in temp_*.txt; do
    cat "$tmp_result" >> final_result.txt
done
In summary: cut the big file into, say, 500 temp files labeled file1, file2, etc. in something like /tmp/folder. Then go through them one at a time, launching each as a child process, with up to around 100 running at once; keep the pipeline full by checking the job count, doing nothing (sleep 10) while you're at the limit and adding more when you're under it. When done, one more loop combines file1_finish.txt, file2_finish.txt, etc., which is super quick.
NOTE: if this is too much, you can always just break the file up and call the same script X times, once per part, instead of using background jobs.
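As an aside, the single pass the question wonders about ("only access each line once") can also be sketched directly in awk: read the mapping CSV into an array first, then walk the FASTA file once and rewrite only the header lines. This is just a sketch, assuming the renamed form only needs replacing in lines that start with > (the output file name is made up):
awk -F, '
    NR == FNR { if (FNR > 1) map[$2] = $1; next }   # build renamed -> original, skipping the CSV header row
    /^>/      { for (r in map) gsub(r, map[r]) }    # only sequence headers need editing
    { print }
' Fmerg_final.fasta.dammit.namemap.csv Fmerg_final.fasta.transdecoder_test.pep > reverted.pep
Like the sed loop, this still does substring matching, so names that are prefixes of other names (Fmerg_Transcript_1 vs Fmerg_Transcript_12) need care, but the big file is only read once.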
I have a list of times in the format HH:MM:SS that I am looping through to find the nearest time that has not yet passed. The code that I have is:
for i in ${times[@]}; do
hours=$(echo $i | sed 's/\([0-9]*\):.*/\1/g')
minutes=$(echo $i | sed 's/.*:\([0-9]*\):.*/\1/g')
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
if [[ hours -ge currentHours ]]; then
if [[ minutes -ge currentMinutes ]]; then
break
fi
fi
done
The variable times is an array of all the times that I am sorting through (it's about 20-40 lines). I'd expect this to take less than 1 second; however, it is taking upwards of 5 seconds. Any suggestions for decreasing the time spent on the regular expressions would be appreciated.
times=($(cat file.txt))
Here is a list of the times that are stored in a text file and are imported into the times variable using the above line of code.
6:05:00
6:35:00
7:05:00
7:36:00
8:08:00
8:40:00
9:10:00
9:40:00
10:11:00
10:41:00
11:11:00
11:41:00
12:11:00
12:41:00
13:11:00
13:41:00
14:11:00
14:41:00
15:11:00
15:41:00
15:56:00
16:11:00
16:26:00
16:41:00
16:58:00
17:11:00
17:26:00
17:41:00
18:11:00
18:41:00
19:10:00
19:40:00
20:10:00
20:40:00
21:15:00
21:45:00
One of the key things to understand in looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while it can often speed up your scripts to use a single invocation of awk or sed to process a large stream of input, starting those invocations inside a tight loop will greatly outweigh the performance of those tools once they're running.
Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc. -- then causes a subprocess to be fork()ed off for that process, and then the executable associated with that process to be exec()'d -- something that involves a great deal of OS-level overhead (the binary needs to be linked, loaded, etc.).
This loop would be better written as:
IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
# 10# forces base 10 so zero-padded values such as 08 or 09 are not treated as invalid octal
if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
break
fi
done <file.txt
In this form only one external command is run, date +"%H:%M", outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
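Put together, a version of the loop with no external commands at all might look like this (just a sketch; it keeps the same comparison logic as above and assumes bash 4.2 or newer):
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
while IFS=: read -r hours minutes _; do
    if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
        break
    fi
done <file.txt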
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")
To be honest I'm not sure why it's taking an extremely long time but there are certainly some things which could be made more efficient.
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
for time in "${times[@]}"; do
IFS=: read -r hours minutes seconds <<<"$time"
if [[ 10#$hours -ge 10#$currentHours && 10#$minutes -ge 10#$currentMinutes ]]; then
break
fi
done
This uses read, a built-in command, to split the text into variables, rather than calling external commands and creating subshells.
I'm assuming the loop runs quickly enough that it's safe to compute currentHours and currentMinutes once, before the loop, rather than re-reading the clock on every iteration.
Note that you can also just use awk to do the whole thing:
awk -F: -v currentHours="$(date +%H)" -v currentMinutes="$(date +%M)" '
$1 >= currentHours && $2 >= currentMinutes { print; exit }' file.txt
Just to make the program produce some output, I added a print, so that the matching line is printed.
awk to the rescue!
awk -v time="12:12:00" '
function pad(x) {split(x,ax,":"); return (ax[1]<10)?"0"x:x}
BEGIN {time=pad(time)}
time>pad($0) {next}
{print; exit}' times
12:41:00
With zero-padding of the hour you can do a string-only comparison.
I am creating this function to run multiple greps over every line of a file. I run it as follows:
cat file.txt | agrep string1 string2 ... stringN
The idea is to print every line that contains all of the strings string1, string2, ..., stringN. I followed two approaches; the first is a recursive method:
agrep () {
if [ $# = 0 ]; then
cat
else
pattern="$1"
shift
grep -e "$pattern" | agrep "$#"
fi
}
On the other hand, I have a second, iterative approach, since I am using a for loop:
function agrep () {
for a in $#; do
cmd+=" | grep '$a'";
done ;
while read line ; do
eval "echo "\'"$line"\'" $cmd";
done;
}
These two approaches work very well, but I would like to know which one is more efficient, and also whether there is a way to measure this in bash. I don't consider myself experienced enough to determine it, because I don't know whether bash works better with iterative or recursive methods, or whether eval is too expensive to use.
These two functions are designed to work with large texts and to process every line of them, so I would really appreciate any explanation or advice about this.
This is an example of a text file called risk:
1960’s. Until the 1990’s it was a purely theoretical analysis of the
problem of function estimation from a given collection of data.
In the middle of the 1990’s new types of learning algorithms
(called support vector machines) based on the developed t
and then if I run:
cat risk | agrep Until
I get:
1960’s. Until the 1990’s it was a purely theoretical analysis of the
but, on the other hand, if I run:
cat risk | agrep Until new
it prints nothing, since there isn't any line containing both strings. This example was just to clarify the usage of the function.
I completely agree with the comments and answers that have already informed you of the pitfalls of your current approach.
Based on the suggestion made by karakfa, I would suggest using a function that calls awk, along these lines:
agrep() {
awk 'BEGIN {
# read command line arguments and unset them
for (i = 1; i < ARGC; ++i) {
strings[i] = ARGV[i]
ARGV[i] = ""
}
}
{
for (i in strings) {
# if the line does not match, skip it
if ($0 !~ strings[i]) next
}
# print remaining lines
print
}' "$@"
}
This passes in all of the arguments to the function as arguments to awk, which would normally treat them as filenames. Each argument is added to a new array, strings, and removed from ARGV before any lines of input are processed.
Use it like this:
agrep string1 string2 string3 < file
Both are inefficient, but since grep is very fast you may not be noticing. A better approach is switching to awk:
awk '/string1/ && /string2/ && ... && /stringN/' file
will do the same in one iteration.
Security
The eval-based approach has a critical flaw: it allows code injection via maliciously formed strings being searched for. Thus, of the two as given, the recursive approach is the only reasonable option for real-world production scenarios.
Why is the eval approach insecure? Look at this code for a moment:
cmd+=" | grep '$a'";
What happens if a=$'\'"$(rm -rf ~)"\''?
A corrected implementation might modify this line to read as follows:
printf -v cmd '%s | grep -e %q' "$cmd" "$a"
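For illustration, a corrected version of the whole iterative function might look something like this (a sketch only; it still builds a pipeline string, but every pattern is safely quoted before eval sees it):
agrep () {
    local a cmd="cat"
    for a in "$@"; do
        # append one safely-quoted grep stage per pattern
        printf -v cmd '%s | grep -e %q' "$cmd" "$a"
    done
    eval "$cmd"
}
With no arguments it degenerates to a plain cat, matching the recursive version's base case.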
Performance
Your recursive approach does all its recursing while setting up a pipeline of length proportional to the number of arguments passed to agrep. Once that pipeline has been set up, the shell itself is out of the way (all ongoing operations are performed by the grep processes), and the performance overhead is exactly identical to the performance of the pipeline itself.
Thus, for a sufficiently large input file, the performance of the setup stage becomes effectively nil, and the relevant performance difference will be that between cat and a while read loop -- which cat will handily win for inputs large enough to overcome its startup costs.
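Since the question also asks whether there is a way to measure this in bash: the simplest check is to run both versions on the same large input under bash's time keyword (the function and file names below are placeholders, assuming you've given the two implementations distinct names):
time agrep_recursive string1 string2 < big.txt > /dev/null
time agrep_iterative string1 string2 < big.txt > /dev/null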
I have a 4 GB file that I need to do some operations on. I have a Bash script to do this, but Bash seems ill-suited to reading large data files into an array, so I decided to break up my file with awk.
My current script is :
for((i=0; i<100; i++)); do awk -v i=$i 'BEGIN{binsize=60000}{if(binsize*i < NR && NR <= binsize*(i+1)){print}}END{}' my_large_file.txt &> my_large_file_split$i.fastq; done
However the problem with this script is that it will read in and loop through this large file 100 times (which presumably will lead to about 400GB of IO).
QUESTION: Is there a better strategy that reads the large file only once? Perhaps doing the writing to files within awk instead of redirecting its output?
Assuming binsize is the number of lines you want per chunk, you could just maintain and reset a line counter as you step through the file, and set alternate output files within awk instead of using the shell to redirect.
awk -v binsize=60000 '
  BEGIN {
    filenum = 1
    outfile = "output_chunk_1.txt"
  }
  count >= binsize {
    close(outfile)
    filenum++
    outfile = "output_chunk_" filenum ".txt"
    count = 0
  }
  {
    count++
    print > outfile
  }
' my_large_file.txt
I haven't actually tested this code, so if it doesn't work verbatim, at least it should give you an idea of a strategy to use. :-)
The idea is that we'll step through the file, updating the filename in a variable whenever our line count for a chunk reaches binsize. Note that the close(outfile) isn't strictly necessary, as awk will of course close any open files when it exits, but it may save you a few bytes of memory per open file handle (which will only be significant if you have many, many output files).
That said, you could do almost exactly the same thing in bash alone:
#!/usr/bin/env bash
binsize=60000
filenum=1; count=0
while IFS= read -r line; do
  if [ "$count" -ge "$binsize" ]; then
    ((filenum++))
    count=0
  fi
  ((count++))
  outfile="output_chunk_${filenum}.txt"
  printf '%s\n' "$line" >> "$outfile"
done < my_large_file.txt
(Also untested.)
And while I'd expect the awk solution to be faster than bash, it might not hurt to do your own benchmarks. :)