I am creating this function to run multiple greps over every line of a file. I run it as follows:
cat file.txt | agrep string1 string2 ... stringN
The idea is to print every line that contains all of the strings string1, string2, ..., stringN. I followed two approaches; the first is a recursive method:
agrep () {
if [ $# = 0 ]; then
cat
else
pattern="$1"
shift
grep -e "$pattern" | agrep "$@"
fi
}
On the other hand, I have a second approach, an iterative method that uses a for loop:
function agrep () {
for a in $@; do
cmd+=" | grep '$a'";
done ;
while read line ; do
eval "echo "\'"$line"\'" $cmd";
done;
}
These two approaches both work very well, but I would like to know which one is more efficient, and also, if possible, whether there is a way to measure this in bash. I don't consider myself experienced enough to determine this, because I don't know whether bash works better with iterative or recursive methods, or whether using eval is too expensive.
These two functions are designed to work with large texts and to process every line of them. I would really appreciate any explanation or advice about this.
This is an example of text file called risk:
1960’s. Until the 1990’s it was a purely theoretical analysis of the
problem of function estimation from a given collection of data.
In the middle of the 1990’s new types of learning algorithms
(called support vector machines) based on the developed t
Then if I run:
cat risk | agrep Until
I get:
1960’s. Until the 1990’s it was a purely theoretical analysis of the
But on the other hand, if I run:
cat risk | agrep Until new
it prints nothing, since there isn't any line containing both strings. This example was only meant to clarify how the function is used.
I completely agree with the comments and answers that have already informed you of the pitfalls of your current approach.
Following the suggestion made by karakfa, I would use a function that calls awk, along these lines:
agrep() {
awk 'BEGIN {
# read command line arguments and unset them
for (i = 1; i < ARGC; ++i) {
strings[i] = ARGV[i]
ARGV[i] = ""
}
}
{
for (i in strings) {
# if the line does not match, skip it
if ($0 !~ strings[i]) next
}
# print remaining lines
print
}' "$@"
}
This passes all of the arguments to the function on to awk, which would normally treat them as filenames. Each argument is added to a new array, strings, and removed from ARGV before any lines of input are processed.
Use it like this:
agrep string1 string2 string3 < file
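For example, with the sample file risk from the question, and assuming the function above is defined in the current shell, running
agrep Until 1990 < risk
should print only the line that contains both strings:
1960’s. Until the 1990’s it was a purely theoretical analysis of the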
Both are inefficient, but since grep is very fast you may not notice. A better approach is switching to awk:
awk '/string1/ && /string2/ && ... && /stringN/' file
will do the same in one iteration.
Security
The eval-based approach has a critical flaw: it allows code injection via maliciously formed search strings. Thus, of the two as given, the recursive approach is the only reasonable option for real-world production scenarios.
Why is the eval approach insecure? Look at this code for a moment:
cmd+=" | grep '$a'";
What happens if a=$'\'"$(rm -rf ~)"\''?
A corrected implementation might modify this line to read as follows:
printf -v cmd '%s | grep -e %q' "$cmd" "$a"
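Putting that fix together, a corrected iterative version might look like the sketch below. It is still the slower of the two designs, and the function name iagrep plus the choice to build one pipeline and run eval once (rather than once per input line) are assumptions of this sketch, not part of the code above:
iagrep() {
    local a cmd='cat'
    for a in "$@"; do
        # %q quotes each pattern so it reaches grep as data, never as shell code
        printf -v cmd '%s | grep -e %q' "$cmd" "$a"
    done
    # every pattern is now safely quoted, so a single eval assembles the pipeline
    eval "$cmd"
}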
Performance
Your recursive approach does all its recursing while setting up a pipeline whose length is proportional to the number of arguments passed to agrep. Once that pipeline has been set up, the shell itself is out of the way (all ongoing work is done by the grep processes), and the ongoing cost is simply that of the pipeline itself.
Thus, for a sufficiently large input file, the performance of the setup stage becomes effectively nil, and the relevant performance difference will be that between cat and a while read loop -- which cat will handily win for inputs large enough to overcome its startup costs.
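As for measuring this in bash: the time keyword is usually enough to compare the two. A rough sketch (the file name large.txt and the name agrep2 for the eval-based version are placeholders; run each test a few times, on a file big enough that startup costs don't dominate):
# recursive version
time agrep string1 string2 < large.txt > /dev/null
# eval/while-read version, renamed to agrep2 so both can be loaded at once
time agrep2 string1 string2 < large.txt > /dev/null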
Related
I'm new to Bash and Linux. I'm trying to split a big file into smaller files in Bash. The problem is with the copying using awk or sed, I suppose, as I somehow get the sum of a constant and a variable wrong; brackets or different quotes do not make any difference. So far my code looks something like this:
for i in {0..10}
do
touch /path/file$i
awk -v "NR>=1+$i*9 && NR<=$i*9" /path/BigFile > /path/file$i
(or with sed) sed -n "1+$i*9,$i*9" /path/BigFile > /path/file$i
done
Thank you in advance
Instead of reinventing this wheel, you can use the split utility. split -l 10 will tell it to split into chunks of 10 lines (or maybe less for the last one), and there are some options you can use to control the output filenames -- you probably want -d to get numeric suffixes (the default is alphabetical).
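For example, something like the following should produce /path/file00, /path/file01, and so on, each 10 lines long (the input path is the one from the question; the output prefix is just an illustration):
split -l 10 -d /path/BigFile /path/file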
hobbs' answer (use split) is the right way to solve this. Aside from being simpler, it's also linear (meaning that if you double the input size, it only takes twice as long) while these for loops are quadratic (doubling the input quadruples the time it takes), plus the loop requires that you know in advance how many lines there are in the file. But for completeness let me explain what went wrong in the original attempts.
The primary problem is that the math is wrong. NR>=1+$i*9 && NR<=$i*9 will never be true. For example, in the first iteration, $i is 0, so this is equivalent to NR>=1 && NR<=0, which requires that the record number be at least 1 but no more than 0. Similarly, when $i is 1, it becomes NR>=10 && NR<=9... same basic problem. What you want is something like NR>=1+$i*9 && NR<=($i+1)*9, which matches for lines 1-9, then 10-18, etc.
The second problem is that you're using awk's -v option without supplying a variable name & value. Either remove the -v, or (the cleaner option) use -v to convert the shell variable i into an awk variable (and then put the awk program in single-quotes and don't use $ to get the variable's value -- that's how you get shell variables' values, not awk variables). Something like this:
awk -v i="$i" 'NR>=1+i*9 && NR<=(i+1)*9' /path/BigFile > "/path/file$i"
(Note that I double-quoted everything involving a shell variable reference -- not strictly necessary here, but a general good scripting habit.)
The sed version also has a couple of problems of its own. First, unlike awk, sed doesn't do math; if you want to use an expression as a line number, you need to have the shell do the math with $(( )) and pass the result to sed as a simple number. Second, your sed command specifies a line range, but doesn't say what to do with it; you need to add a p command to print those lines. Something like this:
sed -n "$((1+i*9)),$(((i+1)*9)) p" /path/BigFile > "/path/file$i"
Or, equivalently, omit -n and tell it to delete all but those lines:
sed "$((1+i*9)),$(((i+1)*9)) ! d" /path/BigFile > "/path/file$i"
But again, don't actually do any of these; use split instead.
I am still learning how to write shell scripts, and I have been given a challenge: find an easier way to echo "Name1" "Name2" ... "Name15". I'm not too sure where to start; I've had ideas, but I don't want to look silly if I mess it up. Any help?
I haven't actually tried anything just yet; so far it's all just been thought.
#This is what I wrote to start
#!/bin/bash
echo "Name1"
echo "Name2"
echo "Name3"
echo "Name4"
echo "Name5"
echo "Name6"
echo "Name7"
echo "Name8"
echo "Name9"
echo "Name10"
echo "Name11"
echo "Name12"
echo "Name13"
echo "Name14"
echo "Name15"
My expected result is obviously just for it to output "Name1", "Name2", etc., but I'm looking for a more creative way to do it. If possible, throw in a few ways to do it so I can learn. Thank you.
The easiest (possibly not the most creative) way to do this is to use printf:
printf "%s\n" Name{1..15}
This relies on bash brace expansion ({1..15}) to generate the 15 strings.
Use a for loop:
for i in {1..15}; do echo "Name$i"; done
A few esoteric solutions, from the least to the most unreasonable:
base64-encoded string:
base64 -d <<<TmFtZTEKTmFtZTIKTmFtZTMKTmFtZTQKTmFtZTUKTmFtZTYKTmFtZTcKTmFtZTgKTmFtZTkKTmFtZTEwCk5hbWUxMQpOYW1lMTIKTmFtZTEzCk5hbWUxNApOYW1lMTUK
The weird string is your expected result encoded in base64, an encoding generally used to represent binary data as text. base64 -d <<< weirdChain passes the weird string as input to the base64 tool and asks it to decode it, which displays your expected result.
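If you are wondering where such a string comes from, encoding is simply the reverse operation; for instance (using the brace expansion from the printf answer above, and GNU base64's -w0 to keep the result on a single line):
printf '%s\n' Name{1..15} | base64 -w0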
generate an infinite stream of "Name", truncate it, use line numbers:
yes Name | awk 'NR == 16 { exit } { printf("%s%s\n", $0, NR) }'
yes outputs an infinite stream of whatever it's passed as an argument (or y by default, used to automate interactive scripts that ask for [y/n] confirmation). The awk command exits once it reaches the 16th line, and otherwise prints its input (provided by yes) followed by the line number. The truncation could just as easily be done with head -15; I also tried the nl "number lines" utility and grep -n to number the lines, but they add the line numbers as a prefix, which requires an extra re-formatting step.
read random binary data and hope to stumble on all the lines you want to output:
timeout 1d strings /dev/urandom | grep -Eo "Name(1[0-5]|[1-9])" | sort -uV
strings /dev/urandom will extract ASCII sequences from the binary random source /dev/urandom, grep will filter those which respect the format of a line of your expected output, and sort will put those lines in the correct order. Since sort needs to receive its whole input before it reorders it, and /dev/urandom won't stop producing data, we use timeout 1d to stop reading from /dev/urandom after a whole day, in the hope that it has sifted through enough random data to find your 15 lines (I'm not sure that's even remotely likely).
use an HTTP client to retrieve this page, extract the bash script you posted and execute it.
my_old_script=$(curl "https://stackoverflow.com/questions/57818680/" | grep "#This is what I wrote to start" -A 18 | tail -n+4)
eval "$my_old_script"
curl is a command line tool that can be used as an HTTP client, grep with its -A 18 parameter will select the "This is what I wrote to start" text and the 18 lines that follow, tail will remove the first 3 lines, and eval will execute your script.
While it will be much more efficient than the previous solution, it's an even less reasonable solution because high-rep users can edit your question to make this solution execute arbitrary code on your computer. Ideally you'd be using an HTML-aware parser rather than basic string manipulation to extract the code, but we're not talking about best practices here...
I am writing a bash script that loops over a large file of data from which I have extracted the key parts I need to use. It seemed quite trivial when I tried to do it; all I need is something akin to this:
string1=...
string2=...
correct=0
for i in 1..29
do
if [string1[i] == string2[i]]
then
correct=correct+1
fi
done
When I tried doing something like this, I got a "bad substitution" error, which I assume is because some of the keys look like this:
`41213343323455122411141331555 - key`
`3113314233111 22321112433111* - answer`
The spaces and occasional * that are found don't need special treatment in my case; I just need a simple comparison of each index.
#!/bin/bash
answersCorrect=0
for i in $(nawk 'BEGIN{ for(i=1;i<=29;i++) print i}')
do
if [ "${answer:i:1}" = "${key:i:1}" ]
then
answersCorrect=$answersCorrect+1 #this line#
fi
done
I am getting no errors now; however, I don't think I'm incrementing answersCorrect correctly. When I output it, it is something like 0+1+1+1 instead of just 3 (this segment is being used inside a while loop).
Fixed solution for that line: answersCorrect=$((answersCorrect+1))
The original problem is fixed by the comments and some extra work by @Mikel.
An alternative is comparing the strings after converting them to lines, one character per line:
diff --suppress-common-lines <(fold -w1 <<< "${string1}") <(fold -w1 <<< "${string2}") |
grep -c "^<"
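For completeness, here is a minimal self-contained sketch of the loop-based approach, applying the substring comparison and the corrected increment from above to the sample key and answer from the question (the two values are copied from the example there; note that bash substrings are zero-indexed, so the loop runs from 0 to 28):
#!/bin/bash
key='41213343323455122411141331555'
answer='3113314233111 22321112433111*'
correct=0
for ((i = 0; i < ${#key}; i++)); do
    # compare one character at a time; spaces and '*' need no special handling
    if [ "${key:i:1}" = "${answer:i:1}" ]; then
        correct=$((correct + 1))
    fi
done
echo "$correct"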
I have a stream that is null delimited, with an unknown number of sections. For each delimited section I want to pipe it into another pipeline until the last section has been read, and then terminate.
In practice, each section is very large (~1GB), so I would like to do this without reading each section into memory.
For example, imagine I have the stream created by:
for I in {3..5}; do seq $I; echo -ne '\0';
done
I'll get a stream that looks like:
1
2
3
^@1
2
3
4
^@1
2
3
4
5
^@
When piped through cat -v.
I would like to pipe each section through paste -sd+ | bc, so I get a stream that looks like:
6
10
15
This is simply an example. In actuality the stream is much larger and the pipeline is more complicated, so solutions that don't rely on streams are not feasible.
I've tried something like:
set -eo pipefail
while head -zn1 | head -c-1 | ifne -n false | paste -sd+ | bc; do :; done
but I only get
6
10
If I leave off bc I get
1+2+3
1+2+3+4
1+2+3+4+5
which is basically correct. This leads me to believe that the issue is potentially related to buffering and the way each process is actually interacting with the pipes between them.
Is there some way to fix the way that these commands exchange streams so that I can get the desired output? Or, alternatively, is there a way to accomplish this with other means?
In principle this is related to this question, and I could certainly write a program that reads stdin into a buffer, looks for the null character, and pipes the output to a spawned subprocess, as the accepted answer to that question does. Given the general support for streams and null delimiters in bash, I'm hoping to do something a little more "native". In particular, if I go that route, I'll have to pass the pipeline (paste -sd+ | bc) as a string instead of just letting the same shell interpret it. There's nothing inherently bad about that, but it's a little ugly and will require a bunch of somewhat error-prone escaping.
Edit
As was pointed out in an answer, head makes no guarantees about how much it buffers. Unless it buffered only a single byte at a time, which would be impractical, this will never work. Thus, it seems like the only solution is to read each section into memory, or to write a specific program.
The issue with your original code is that head doesn't guarantee that it won't read more than it outputs. Thus, it can consume more than one (NUL-delimited) chunk of input, even if it's emitting only one chunk of output.
read, by contrast, guarantees that it won't consume more than you ask it for.
set -o pipefail
while IFS= read -r -d '' line; do
bc <<<"${line//$'\n'/+}"
done < <(build_a_stream)
If you want native logic, there's nothing more native than just writing the whole thing in shell.
Calling external tools -- including bc, cut, paste, or others -- involves a fork() penalty. If you're only processing small amounts of data per invocation, the efficiency of the tools is overwhelmed by the cost of starting them.
while read -r -d '' -a numbers; do # read up to the next NUL into an array
sum=0 # initialize an accumulator
for number in "${numbers[@]}"; do # iterate over that array
(( sum += number )) # ...using an arithmetic context for our math
done
printf '%s\n' "$sum"
done < <(build_a_stream)
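Assuming the loop above is used as-is, feeding it the example stream from the question (instead of build_a_stream, which is defined just below) should produce the expected sums 6, 10, and 15:
for I in {3..5}; do seq $I; echo -ne '\0'; done |
while read -r -d '' -a numbers; do
    sum=0
    for number in "${numbers[@]}"; do
        (( sum += number ))
    done
    printf '%s\n' "$sum"
done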
For all of the above, I tested with the following build_a_stream implementation:
build_a_stream() {
local i j IFS=$'\n'
local -a numbers
for ((i=3; i<=5; i++)); do
numbers=( )
for ((j=0; j<=i; j++)); do
numbers+=( "$j" )
done
printf '%s\0' "${numbers[*]}"
done
}
As discussed, the only real solution seemed to be writing a program to do this specifically. I wrote one in Rust called xstream-util. After installing it with cargo install xstream-util, you can pipe the input into
xstream -0 -- bash -c 'paste -sd+ | bc'
to get the desired output
6
10
15
It doesn't avoid having to run the inner pipeline through bash, so it still needs escaping if the pipeline is complicated. Also, it currently only supports single-byte delimiters.
I have big log files (1-2 GB and more). I'm new to programming, and bash has been useful and easy for me. When I need something, I can usually get it done (sometimes with help from here). Simple scripts work fine, but when I need complex operations, they run very slowly; maybe bash is slow, or maybe my programming skills are that bad.
So do I need C for complex processing of my server log files, or do I just need to optimize my scripts?
If I just need optimization, how can I check which parts of my code are bad and which are good?
For example, I have a while-do loop:
while read -r date month size;
do
...
...
done < file.tmp
How can I use awk to make this run faster?
That depends on how you use bash. To illustrate, consider how you'd sum a possibly large number of integers.
This function does what Bash was meant for: being control logic for calling other utilities.
sumlines_fast() {
awk '{n += $1} END {print n}'
}
It runs in 0.5 seconds on a million line file. That's the kind of bash code you can very effectively use for larger files.
Meanwhile, this function does what Bash is not intended for: being a general purpose programming language:
sumlines_slow() {
local i=0
while IFS= read -r line
do
(( i += $line ))
done
echo "$i"
}
This function is slow, and takes 30 seconds to sum the same million line file. You should not be doing this for larger files.
Finally, here's a function that could have been written by someone who has no understanding of bash at all:
sumlines_garbage() {
i=0
for f in `cat`
do
i=`echo $f + $i | bc`
done
echo $i
}
It treats forks as being free and therefore runs ridiculously slowly. It would take something like five hours to sum the file. You should not be using this at all.
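If you want to check timings like these yourself (and, more generally, see which parts of your scripts are slow), generate a test file and use the time keyword on each variant; a rough sketch, with an assumed file name:
seq 1000000 > numbers.txt   # a million-line test file
time sumlines_fast < numbers.txt
time sumlines_slow < numbers.txt
# try the last one only on a small slice unless you have hours to spare
time head -n 1000 numbers.txt | sumlines_garbage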