I'm trying to use a Bash script to run a large number of calculations (just over 2 million) using a terminal-based program called uvspec. But I've hit a serious barrier following the latest addition to the calculation...
The script opens an input file which has about 2e6 (2 million) lines looking like this:
0 66.3426 -9.999 -9999
0 66.6192 -9.999 -9999
0 61.9212 1.655 1655
0 61.9999 1.655 1655
...
Each of these values is a different value I want to substitute into an input file (using sed), so I read each line into an array. Many of these lines contain a negative value in the 4th column, e.g. -9999, which causes errors in the program, so I would like to skip those lines and write a standard output instead; I'm doing this with the if statement. The problem is that something terribly wrong is coming out of my output, and I'm 99.9% sure the cause is a mistake in the following script, as I'm fairly new to bash.
Can anyone spot anything here that doesn't make sense or is bad syntax?
Any comments on the script in general would also be useful feedback.
cat ".../Maps/dniinput" | while IFS=$' ' read -r -a myArray
do
if [ "${myArray[3]}" -gt 0 ]
then
sed s/TAU/"${myArray[0]}"/ x.template x.template > a.template
sed s/SZA/"${myArray[1]}"/ a.template a.template > b.template
sed s/ALT/"${myArray[2]}"/ b.template b.template > x.inp
../bin/uvspec < x.inp >> dni.out
else
echo "0 -9999" >> dnijul.out
fi
done
Sed can do all three substitutions in one go, and you can pipe the output straight into your analysis program without creating the intermediate a.template and b.template files at all:
sed -e "s/.../.../" -e "s/.../.../" -e "s/.../.../" x.template | ../bin/uvspec
By the way, you can also get rid of the "cat" at the start, and replace your array with variables whose names better match what they are, if you use a loop like this:
while IFS=$' ' read -r tau sza alt p4
do
echo $tau $sza $alt $p4
done < a
0 66.3426 -9.999 -9999
0 66.6192 -9.999 -9999
0 61.9212 1.655 1655
0 61.9999 1.655 1655
I named the fourth element "p4" because you refer to the 4th one as the altitude in your comment, but in your code you replace the word "ALT" with the third column - so I am not really sure what your parameters are, but you should hopefully get the idea from the example above.
You might want to combine those "sed" lines into something more like:
sed -e "s/TAU/${myArray[0]}/" -e "s/SZA/${myArray[1]}/" \
-e "s/ALT/${myArray[2]}/" < x.template \
| ../bin/uvspec >> dni.out
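Putting the two suggestions together, the whole loop might look something like the sketch below (untested, and only a sketch: it keeps your file names, your skip condition on the fourth column, and your two output files exactly as in the question):
while IFS=' ' read -r tau sza alt p4
do
    if [ "$p4" -gt 0 ]
    then
        # one sed pass does all three substitutions; no intermediate template files
        sed -e "s/TAU/$tau/" -e "s/SZA/$sza/" -e "s/ALT/$alt/" x.template \
            | ../bin/uvspec >> dni.out
    else
        # placeholder output for skipped lines, as in your else branch
        echo "0 -9999" >> dnijul.out
    fi
done < ".../Maps/dniinput"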
I'm uncertain how to use an until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
What I'm hoping to achieve is to sort the numbers in the first column in numerical order so that I can pull all the lines that start with the same number into new text files "cluster${i}.txt". From there I want to sort each "cluster${i}.txt" file numerically on its fourth column. After sorting, I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample "cluster1.txt" would look like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
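If your awk doesn't cache descriptors, one hedged workaround is to close each cluster file once its index is finished; that is safe here because the input arrives already sorted on the first field. A sketch:
sort -nk 1,4 <in.txt | awk '
    $1 != last {
        if (last != "") close("cluster" last ".txt")  # done with the previous index
        print $0 > "output.txt"                       # first line for this index
        last = $1
    }
    { print $0 >> ("cluster" $1 ".txt") }'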
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.
So I have two files, File A and File B. File A is huge (>60 GB): it has 16 fields (a mix of numeric and string values), is separated by "|", and has over 600,000,000 lines. Field 3 in this file is the ID; it is numeric, with varying lengths (e.g., one person's ID can be 1 and someone else's can be 100).
File B just has a bunch of IDs (~1,000,000), and I want to extract all the rows from File A that have an ID that is in File B. I have started doing this on Linux with the following code:
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros, but 1) I'm not entirely sure how to do this, and 2) it seems very slow/time-inefficient.
Any other ideas out there? Or help on how to add the zero-padding only to the relevant field?
I would first sort file B using the unique flag (-u):
sort -u file.b > sortedfile.b
Then loop through sortedfile.b and grep file.a for each entry. In zsh I would do:
foreach C (`cat sortedfile.b`)
grep $C file.a > /dev/null
if [ $? -eq 0 ]; then
echo $C >> res.txt
fi
end
Redirect the output from grep to /dev/null and test whether there was a match ($? -eq 0); if so, append (>>) the matching ID to res.txt.
A single > would overwrite the file. I'm a bit rusty with zsh now, so there might be a typo. You may be using bash, which has a slightly different foreach syntax.
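As for the zero-padding idea mentioned in the question, here is a minimal sketch with awk; the width of 12 is an assumption, so pick anything at least as wide as your longest ID:
# pad the ID field to a fixed width in both files, then sort and join as before
awk -F'|' -v OFS='|' '{ $3 = sprintf("%012d", $3); print }' FileA.txt > FileAPadded.txt
awk '{ printf "%012d\n", $1 }' FileB.txt > FileBPadded.txt
After that, the original sort and join commands should line up, since the padded keys compare the same way lexically and numerically.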
I am absolutely new to bash scripting, but I need to perform some task with it. I have a file with just one column of numbers (6,250,000 of them). I need to extract 100 at a time, put them into a new file and submit each batch of 100 to another program. I think this should be some kind of loop going through my file 100 numbers at a time and submitting them to the program.
Let's say my numbers in the file would look like this.
1.6435
-1.2903
1.1782
-0.7192
-0.4098
-1.7354
-0.4194
0.2427
0.2852
I need to feed each of those 62500 output files to a program which has a parameter file. I was doing something like this:
lossopt()
{
cat<<END>temp.par
Parameters for LOSSOPT
***********************
START OF PARAMETERS:
lossin.out \Input file with distribution
1 \column number
lossopt.out \Output file
-3.0 3.0 0.01 \xmin, xmax, xinc
-3.0 1
0.0 0.0
0.0 0.0
3.0 0.12
END
}
for i in {1..62500}
do
sed -n 1,100p ./rearnum.out > ./lossin.out
echo temp.par | ./lossopt >> lossopt.out
rm lossin.out
cut -d " " -f 101- rearnum.out > rearnum.out
done
rearnum is my big initial file
If you need to split it into files containing 100 lines each, I'd use split -l 100 <source>, which will create a lot of files named like xaa, xab, xac, ..., each of which contains at most 100 lines of the source file (the last file may contain fewer). If you want the names to start with something other than x, give the prefix as the last argument, as in split -l 100 <source> OUT, which will produce files like OUTaa, OUTab, ...
Then you can loop over those files and process them however you like. If you need to run a script with them you could do something like
for file in OUT*; do
<other_script> "$file"
done
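Applied to the numbers file from the question, that could look something like the sketch below; it is only a sketch and assumes, as in your original loop, that temp.par names lossin.out as the input file and that lossopt reads the parameter file name from stdin:
split -l 100 rearnum.out chunk_        # chunk_ is an arbitrary prefix; gives chunk_aa, chunk_ab, ...
for file in chunk_*; do
    cp "$file" lossin.out              # temp.par points at lossin.out as the input
    echo temp.par | ./lossopt >> lossopt.out
done
rm -f chunk_* lossin.out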
You can still use a read loop and redirection:
#!/bin/bash
fnbase=${1:-file}
increment=${2:-100}
declare -i count=0
declare -i fcount=1
fname="$(printf "%s_%08d" "$fnbase" $((fcount)))"
while read -r line; do
((count == 0)) && :> "$fname"
((count++))
echo "$line" >> "$fname"
((count % increment == 0)) && {
count=0
((fcount++))
fname="$(printf "%s_%08d" "$fnbase" $((fcount)))"
}
done
exit 0
use/output
$ bash script.sh yourprefix <yourfile
Which will take yourfile with many thousands of lines and write every 100 lines out to yourprefix_00000001 -> yourprefix_99999999 (the default prefix is file, giving file_00000001, etc.). Each new file is truncated to 0 lines before writing begins.
Again you can specify on the command line the number of lines to write to each file. E.g.:
$ bash script.sh yourprefix 20 <yourfile
Which will write 20 lines per file to yourprefix_00000001 -> yourprefix_99999999
Even though it may seem stupid to a bash professional, I will take the risk and post my own answer to my question.
cat<<END>temp.par
Parameters for LOSSOPT
***********************
START OF PARAMETERS:
lossin.out \Input file with distribution
1 \column number
lossopt.out \Output file
-3.0 3.0 0.01 \xmin, xmax, xinc
-3.0 1
0.0 0.0
0.0 0.0
3.0 0.12
END
for i in {1..62500}
do
sed -n 1,100p ./rearnum.out >> ./lossin.out
echo temp.par | ./lossopt >> sdis.out
rm lossin.out
tail -n +101 rearnum.out > temp
tail -n +1 temp > rearnum.out
rm temp
done
This script consecutively "eats" the big initial file and feeds the "pieces" into the external program. After it takes one portion of 100 numbers, it deletes that portion from the big file. The process repeats until the big file is empty. It is not an elegant solution, but it worked for me.
Suppose I have a file (sizes.txt)
daveclark#foo.com 0 23252 0
mikeclark#foo.com 0 45131 1
clark#foo.com 0 55235 0
joeclark#bar.net 33632 1
maryclark#bar.net 0 55523 0
clark#bar.net 0 99356 0
Now I have another file (users.txt)
clark#foo.com
clark#bar.net
What I want to do is find the line in sizes.txt for each of the specific email addresses in users.txt, using a loop, bash or a one-liner on CentOS. Here's the key point: I need to find the lines that contain exactly clark#foo.com and then clark#bar.net, meaning there should be one line only for each.
The most simple way that comes to mind...
for i in `cat users.txt`; do grep $i sizes.txt; done
...but this does not work, because processing the first line of users.txt returns the lines containing daveclark#foo.com, mikeclark#foo.com and clark#foo.com, when I explicitly want only the line containing "clark#foo.com" (the third line of sizes.txt). Processing the second line of users.txt has the same problem (it returns the maryclark#bar.net and clark#bar.net lines). I know this has to be something totally simple that I'm overlooking.
What you are looking for is an exact match with grep. In your case that would be the -w option.
So
for i in `cat users.txt`; do
grep -w "^$i" sizes.txt
done
should do the trick.
Cheers.
You can try something like this using only bash built-in functions and syntax:
while read -r user ; do
while read -r s_user s_column_2 s_column_3 s_column_4 ; do
[ "${s_user}" = "${user}" ] && printf "%b\t%b\t%b\t%b\n" "${s_user}" "${s_column_2}" "${s_column_3}" "${s_column_4}"
done < sizes.txt
done < users.txt
This nested while could be slow with a big sizes.txt file. In that case you could combine it with awk.
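For example, a minimal awk sketch along those lines: read users.txt into a lookup table first, then print only the sizes.txt lines whose first field matches exactly.
# NR==FNR is true only while reading the first file (users.txt)
awk 'NR==FNR { want[$1]; next } $1 in want' users.txt sizes.txt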
I am very new to Bash scripting. I am trying to write a script that works with two files. Each line of the files looks like this:
INST <_variablename_> = <_value_>;
The two files share many variables, but they are in a different order, so I can't just diff them. What I want to do is go through the files and find all the variables that have different values, or all the variables that are specified in one file but not the other.
Here is my script so far. Again, I'm very new to Bash so please go easy on me, but also feel free to suggest improvements (I appreciate it).
#!/bin/bash
line_no=1
while read LINE
do
search_var=`echo $LINE | awk '{print $2}'`
result_line=`grep -w $search_var file2`
if [ $? -eq 1 ]
then
echo "$line_no: not found [ $search_var ]"
else
value=`echo $LINE | awk '{print $4}'`
result_value=`echo $result_line | awk '{print $4}'`
if [ "$value" != "$result_value" ]
then
echo "$line_no: mismatch [ $search_var , $value , $result_value ]"
fi
fi
line_no=`expr $line_no + 1`
done < file1
Now here's an example of some of the output that I'm getting:
111: mismatch [ TXAREFBIASSEL , TRUE; , "TRUE"; ]
, 4'b1100; ] [ TXTERMTRIM , 4'b1100;
113: not found [ VREFBIASMODE ]
, 2'b00; ]ch [ CYCLE_LIMIT_SEL , 2'b00;
, 3'b100; ]h [ FDET_LCK_CAL , 3'b101;
The first line is what I would expect (I'll deal with the quotes later). On the second, fourth, and fifth line, it looks like the final value is overwriting the "line_no: mismatch" part. And furthermore, on the second and fourth line, the values DO match--it shouldn't print anything at all!
I asked my friend about this, and his suggestion was "Do it in Perl." So I'm learning Perl right now, but I'd still like to know what's going on and why this is happening.
Thank you!
EDIT:
Sigh. I figured out the problem. One of the files had Unix line breaks, and the other had DOS line breaks. I actually thought this might be the case, but I also thought that vi was supposed to display some character if it opened a DOS-ended file. Since they looked the same, I assumed they were the same.
Thanks for your help and suggestions everybody!
Rather than simply replacing the Bash language with Perl, how about a paradigm shift?
diff -w <(sort file1) <(sort file2)
This will sort both files, so that the variables will appear in the same order in each, and will diff the results (ignoring whitespace differences, just for fun).
This may give you more or less what you need, without any "code" per se. Note that you could also sort the files into intermediate files and run diff on those if you find that easier...I happen to like doing it with no temporary files.
What about this? A count of 2 means the line is present in both files with the same value; the other values can be compared easily.
sort 1.txt 2.txt | uniq -c
2 a = 10
1 b = 20
1 b = 40
1 c = 10
1 c = 30
1 e = 50
Or like this, to get just your keys and values:
sed 's|INST \(.*\) = \(.*\)|\1 = \2|' 1.txt 2.txt | sort | uniq -c
2 a = 10
1 b = 20
1 b = 40
1 c = 10
1 c = 30
1 e = 50