Create a random name for the complete script - bash

I have a script that joins several csv files into an output file called merged_t*.csv
Here is the script:
for i in $(ls -latr sample_*.csv);
do
paste -d, $i >> out_$RANDOM.csv;
done
sed 's/^|$/\x27/g' out_$RANDOM.csv | paste -d, > merged_t$RANDOM.csv
The "$RANDOM" in the first command "out_$RANDOM" must be the same that the "out_$RANDOM" in the second.
How can i do it?

declare your own variable first:
myrandom=$RANDOM
for i in $(ls -latr sample_*.csv); do
paste -d, $i >> out_${myrandom}.csv;
done
sed 's/^|$/\x27/g' out_${myrandom}.csv | paste -d, > merged_t${myrandom}.csv

use the command that already solve your problem: mktemp
csv=$(mktemp out_XXXXXXXXXX.csv)
for i in $(ls -latr sample_*.csv);
do
paste -d, $i >> ${csv};
done
sed 's/^|$/\x27/g' ${csv} | paste -d, > merged_t${csv}

Ipor Sircer already answered you correctly. The solution is to save the value of $RANDOM into a variable.
To explain why your question had a bug-
Each time you call $RANDOM, it does exactly what it is supposed to- generate a random number. So if you call it multiple times, it will generate different random numbers. So each calling of it will result in a different name. However when you call it once and save it in a variable, it can no longer change (since it has only been called the one time).
$RANDOM is not the best way to generate random numbers though. To test this, you can do the following-
for j in {1..500}
do
for i in {1..1000}
do
echo "$RANDOM"
done | awk '{a+=$1} END {print a/NR}'
done
What this small script does is generate a thousand $RANDOM outputs (inner loop) and then take the average of the 1000 numbers. It does this 500 times in this example (outer loop). You will see that the averages over any thousand iterations of $RANDOM are quite similar to each other (all around 16000). This shows that there isn't very good variation being observed in the different times you call it.
Better ways include the awk srand command.

Related

How to rewrite a bad shell script to understand how to perform similar tasks? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
Improve this question
So, I wrote a bad shell script (according to several questions, one of which I asked) and now I am wondering which way to go to perform the same, or similar, task(s).
I honestly have no clue about which tool may be best for what I need to achieve and I hope that, by understanding how to rewrite this piece of code, it will be easier to understand which way to go.
There we go:
# read reference file line by line
while read -r linE;
do
# field 2 will be grepped
pSeq=`echo $linE | cut -f2 -d" "`
# field 1 will be used as filename to store the grepped things
fName=`echo $linE | cut -f1 -d" "`
# grep the thing in a very big file
grep -i -B1 -A2 "^"$pSeq a_very_big_file.txt | sed 's/^--$//g' | awk 'NF' > $dir$fName".txt"
# grep the same thing in another very big file and store it in the same file as abovr
grep -i -B1 -A2 "^"$pSeq another_very_big_file.txt | sed 's/^--$//g' | awk 'NF' >> $dir$fName".txt"
done < reference_file.csv
At this point I am wondering...how to achieve the same result, whithout using a while loop to read into the reference_file.csv? What is the best way to go, to solve similar problems?
EDIT: when I mentioned the two very_big_files, I am talking > 5GB.
EDIT II: these should be the format of the files:
reference_file.csv:
object pattern
oj1 ptt1
oj2 ptt2
... ...
ojN pttN
a_very_big_file and another_very_big_file:
>head1
ptt1asequenceofcharacters
+
asequenceofcharacters
>head2
ptt1anothersequenceofcharacters
+
anothersequenceofcharacters
>headN
pttNathirdsequenceofcharacters
+
athirdsequenceofcharacters
Basically, I search for pattern in the two files, then I need to get the line above and the two below each match. Of course, not all the lines in the two files match with the patterns in the reference_file.csv.
Global Maxima
Efficient bash scripts are typically very creative and nothing you can achieve by incrementally improving a naive solution.
The most important part of finding efficient solutions is to know your data. Every restriction you can make allows optimizations. Some examples that can make a huge difference:
- The input is sorted or data in different files has the same order.
- The elements in a list are unique.
- One of the files to be processed is way bigger than the others.
- The symbol X never appears in the input or only appears at special places.
- The order of the output does not matter.
When I try to find an efficient solution, my first goal is to make it work without an explicit loop. For this, I need to know the available tools. Then comes the creative part of combining these tools. To me, this is like assembling a jigsaw puzzle without knowing the final picture. A typical mistake here is similar to the XY problem: After you assembled some pieces, you might be fooled into thinking you'd know the final picture and search for a piece Y that does not exist in your toolbox. Frustrated, you implement Y yourself (typically by using a loop) and ruin the solution.
If there is no right piece for your current approach, either use a different approach or give up on bash and use a better scripting/programming language.
Local Maxima
Even though you might not be able to get the best solution by improving a bad solution, you still can improve it. For this you don't need to be very creative if you know some basic anti-patterns and their better alternatives. Here are some typical examples from your script:
Some of these might seem very small, but starting a new process is way more expensive than one might suppose. Inside a loop, the cost of starting a process is multiplied by the number of iterations.
Extract multiple fields from a line
Instead of calling cut for each individual field, use read to read them all at once:
while read -r line; do
field1=$(echo "$line" | cut -f1 -d" ")
field2=$(echo "$line" | cut -f2 -d" ")
...
done < file
while read -r field1 field2 otherFields; do
...
done < file
Combinations of grep, sed, awk
Everything grep (in its basic form) can do, sed can do better. And everything sed can do, awk can do better. If you have a pipe of these tools you can combine them into a single call.
Some examples of (in your case) equivalent commands, one per line:
sed 's/^--$//g' | awk 'NF'
sed '/^--$/d'
grep -vFxe--
grep -i -B1 -A2 "^$pSeq" | sed 's/^--$//g' | awk 'NF'
awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}'
Multiple grep on the same file
You want to read files at most once, especially if they are big. With grep -f you can search multiple patterns in a single run over one file. If you just wanted to get all matches, you would replace your entire loop with
grep -i -B1 -A2 -f <(cut -f2 -d' ' reference_file | sed 's/^/^/') \
a_very_big_file another_very_big_file
But since you have to store different matches in different files ... (see next point)
Know when to give up and switch to another language
Dynamic output files
Your loop generates multiple files. The typical command line utils like cut, grep and so on only generate one output. I know only one standard tool that generates a variable number of output files: split. But that does not filter based on values, but on position. Therefore, a non-loop solution for your problem seems unlikely. However, you can optimize the loop by rewriting it in a different language, e.g. awk.
Loops in awk are faster ...
time awk 'BEGIN{for(i=0;i<1000000;++i) print i}' >/dev/null # takes 0.2s
time for ((i=0;i<1000000;++i)); do echo $i; done >/dev/null # takes 3.3s
seq 1000000 > 1M
time awk '{print}' 1M >/dev/null # takes 0.1s
time while read -r l; do echo "$l"; done <1M >/dev/null # takes 5.4s
... but the main speedup will come from something different. awk has everything you need built into it, so you don't have to start new processes. Also ... (see next point)
Iterate the biggest file
Reduce the number of times you have to read the biggest files. So instead of iterating reference_file and reading both big files over and over, iterate over the big files once while holding reference_file in memory.
Final script
To replace your script, you can try the following awk script. This assumes that ...
the filenames (first column) in reference_file are unique
the two big files do not contain > except for the header
the patterns (second column) in reference_file are not prefixes of each other.
If this is not the case, simply remove the break.
awk -v dir="$dir" '
FNR==NR {max++; file[max]=$1; pat[max]=$2; next}
{
for (i=1;i<=max;i++)
if ($2~"^"pat[i]) {
printf ">%s", $0 > dir"/"file[i]
break
}
}' reference_file RS=\> FS=\\n a_very_big_file another_very_big_file

How to loop a variable range in cut command

I have a file with 2 columns, and i want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range i desire is the character in the position of the value in the second column plus the next 10 characters. I will give an example in a while.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File that i want to extract the variable range of characters(just one very long line with no spaces and no bold font) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
...or, more literally (for copy/paste to test):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range in a new line. So, it would always keep the range of 10, but in different start points and those start points are set by the values in the second column from the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
There's no need for cut here; dd can do the job of indexing into a file, and reading only the number of bytes you want. (Note that status=none is a GNUism; you may need to leave it out on other platforms and redirect stderr otherwise if you want to suppress informational logging).
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
If file2.txt is not too large, then you can read it in memory,
and use Bash sub-strings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to #CharlesDuffy for the tip to read data without a useless cat, and the while loop.)
One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat bad if you need high performance. Shell scripting is commonly used by people who rarely shell scripts, but knows a few commands and just wants to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line is pretty easy. It just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually, they are used with lines instead of characters. Nevertheless, I print the first pos + 10 characters with head. The result is piped into tail which prints the last 10 characters.
Thanks to #CharlesDuffy for improvements.

Different output for pipe in script vs. command line

I have a directory with files that I want to process one by one and for which each output looks like this:
==== S=721 I=47 D=654 N=2964 WER=47.976% (1422)
Then I want to calculate the average percentage (column 6) by piping the output to AWK. I would prefer to do this all in one script and wrote the following code:
for f in $dir; do
echo -ne "$f "
process $f
done | awk '{print $7}' | awk -F "=" '{sum+=$2}END{print sum/NR}'
When I run this several times, I often get different results although in my view nothing really changes. The result is almost always incorrect though.
However, if I only put the for loop in the script and pipe to AWK on the command line, the result is always the same and correct.
What is the difference and how can I change my script to achieve the correct result?
Guessing a little about what you're trying to do, and without more details it's hard to say what exactly is going wrong.
for f in $dir; do
unset TEMPVAR
echo -ne "$f "
TEMPVAR=$(process $f | awk '{print $7}')
ARRAY+=($TEMPVAR)
done
I would append all your values to an array inside your for loop. Now all your percentages are in $ARRAY. It should be easy to calculate the average value, using whatever tool you like.
This will also help you troubleshoot. If you get too few elements in the array ${#ARRAY[#]} then you will know where your loop is terminating early.
# To get the percentage of all files
Percs=$(sed -r 's/.*WER=([[:digit:].]*).*/\1/' *)
# The divisor
Lines=$(wc -l <<< "$Percs")
# To change new lines into spaces
P=$(echo $Percs)
# Execute one time without the bc. It's easier to understand
echo "scale=3; (${P// /+})/$Lines" | bc

Unix script is sorting the input

I am having sometime here with my home assignment. Maybe you guys will advise what to read or what commands I can use in order to create the following:
Create a shell script test that will act as follows:
The script will display the following message on the terminal screen:
Enter file names (wild cards OK)
The script will read the list of names.
For each file on the list that is a proper file, display a table giving the ten most frequently used words in the file, sorted with the most frequent first. Include the count.
Repeat steps 1-3 over and over until the user indicates end-of-file. This is done by entering the single character Ctrl-d as a file name.
Here is what I have so far:
#!/bin/bash
echo 'Enter file names (wild cards OK)'
read input_source
if test -f "$input_source"
then
I'm usually ignoring homework questions without showing some progress and effort to learn something - but you're as beautifully cheeky so i'll make an exception.
here is what you want
while read -ep 'Files?> ' files
do
for file in $files
do
echo "== word counts for the $file =="
tr -cs '[:alnum:]' '\n' < "$file" | sort | uniq -c | tail | sort -nr
done
done
And now = at least try understand what the above doing...
Ps: voting to close...
How to find the ten most frequently used words in a file
Assumptions:
The files given have one word per line.
The files are not huge, so efficiency isn't a primary concern.
You can use sort and uniq to find the count of non-unique values in a file, then tail to cut off all but the last ten, and reverse-numeric sort to put them in descending order.
sort "$afile" | uniq -c | tail | sort -rd
Some tips:
have access to the complete bash manual: it's daunting at first, but it's an invaluable reference -- http://www.gnu.org/software/bash/manual/bashref.html
You can get help about bash builtins at the command line: try help read
the read command can handle printing the prompt with the -p option (see previous tip)
you'll accomplish the last step with a while loop:
while read -p "the prompt" filenames; do
# ...
done

Fastest way to print a single line in a file

I have to fetch one specific line out of a big file (1500000 lines), multiple times in a loop over multiple files, I was asking my self what would be the best option (in terms of performance).
There are many ways to do this, i manly use these 2
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this do they both only fetch the first line or one of the two (or both) first open the whole file and then fetch the row 1?
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q} file" 'read line < file && echo $line')
# files upto a hundred million lines (if your on slow machine decrease!!)
for (( j=1; j<=100,000,000;j=j*10 ))
do
echo "Lines in file: $j"
# create file containing j lines
seq 1 $j > file
# initial read of file
cat file > /dev/null
for comm in {0..3}
do
avg=0
echo
echo ${heading[$comm]}
for (( i=1; i<=$n; i++ ))
do
case $comm in
0)
t=$( { time head -1 file > /dev/null; } 2>&1);;
1)
t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
2)
t=$( { time sed '1{p;q}' file > /dev/null; } 2>&1);;
3)
t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
esac
avg=$avg+$t
done
echo "scale=3;($avg)/$n" | bc
done
done
Just save as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q} file
.002
read line < file && echo $line
0
**Results from file with 1,000,000 lines.*
So the times for sed -n 1p will grow linearly with the length of the file but the timing for the other variations will be constant (and negligible) as they all quit after reading the first line:
Note: timings are different from original post due to being on a faster Linux box.
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external external commands, use read which is a shell builtin for bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, file data is probably not cached in memory. However, if you try a second command on the same file again, the data as well as the inode have been cached, so the timed results are may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris for example. Or anyway, several days.
For example, linux caches everything and the kitchen sink, which is a good performance attribute. But it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So - pick one file, read it with a command. Now it is cached. Run the same test command several dozen times, this is sampling the effect of the command and child process creation, not your I/O hardware.
this is sed vs read for 10 iterations of getting the first line of the same file, after read the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q} solution above.
Test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeats 100 times, each time selecting the next line. So it selects line 10000000,10000001,10000002, ... and so on till 10000099
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s
How about avoiding pipes?
Both sed and head support the filename as an argument. In this way you avoid passing by cat. I didn't measure it, but head should be faster on larger files as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them - unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
I have done extensive testing, and found that, if you want every line of a file:
while IFS=$'\n' read LINE; do
echo "$LINE"
done < your_input.txt
Is much much faster then any other (Bash based) method out there. All other methods (like sed) read the file each time, at least up to the matching line. If the file is 4 lines long, you will get: 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 reads whereas the while loop just maintains a position cursor (based on IFS) so would only do 4 reads in total.
On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed based, extracting a specific line from each time) versus ~0-1 seconds (while...read based, reading through the file once)
The above example also shows how to set IFS in a better way to newline (with thanks to Peter from comments below), and this will hopefully fix some of the other issue seen when using while... read ... in Bash at times.
For the sake of completeness you can also use the basic linux command cut:
cut -d $'\n' -f <linenumber> <filename>

Resources