Array arithmetic in bash shell

I have a number of files (nTotal), each containing a single column of L floating-point numbers. I want to add up the entries on line i of all these files and, at the end, compute their average and standard deviation.
I first read each file into an array, then try to add that array to a running-total array, which gives me a syntax error: (standard_in) 2: syntax error. I expect that sum[i] now contains the sum of all the entries on line i of all the files. Then I find the average.
Edit: I changed the for loops as suggested.
for (( n = 1; n < $nTotal; n++ ))
do
    IFS=$'\n'
    arr1=($(./a.out filename | sed 's/:.*//'))
    for (( i = 1; i < $L; i++ ))
    do
        sum[i]=`echo "${sum[i]} - ${arr1[i]}" | bc`
    done
done
for (( i = 1; i < $L; i++ ))
do
    ya=$(echo -1*${sum[i]} | bc)
    aveSum=$(echo $ya/$nTotal | bc -l)
done
Edit: ./a.out produces files with one column of float numbers.
To find the standard deviation, though, I again read the data files and store them in arrays (I'm sure this is not the smartest way of doing it, but I couldn't think of anything else). I also could not get the standard deviation using:
for (( i = 1; i < $L; i++ ))
do
    ya=$(echo -1*${sum[i]} | bc)
    ta=$(echo $ya/$nTotal | bc -l)
    tempval=`echo "${arr1[i]} - $ta * ${arr1[i]} - $ta" | bc`
    val[i]=`echo "${val[i]} - $tempval" | bc`
done
Here I get zero for the val[i] elements, and I can't figure out what is wrong. I would really appreciate any guidance on this problem.

Bash might not be exactly the easiest for this problem, particularly since it doesn't implement non-integer arithmetic. I'd use awk:
awk '{ n[FNR]++;
       delta = $1 - mean[FNR];
       mean[FNR] += delta / n[FNR];
       m2[FNR] += delta * ($1 - mean[FNR]);
     }
     END { for (i = 1; i in n; ++i)
               print mean[i], sqrt(m2[i]/(n[i]-1));
     }' file1 file2 ...
The math is taken directly from the well-known "online" mean and variance algorithm (Welford's algorithm). The program assumes that all files have exactly L lines; if a few have more or fewer, the missing data will just be ignored, so you might want to do a better validity test. In the particular case where only one file has too many lines, the standard deviation computation will die with a division by zero; arguably that doesn't matter, since the correct data will already have been printed, but you might want to fix that, too.
The program makes use of a couple of awk features: first, array elements are automatically (and lazily) initialized to 0 when used as numbers; second, FNR is the line number within the current file. (NR is the line number in the input as a whole, but in this case FNR is more useful.)
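As a quick illustration of the FNR/NR difference, with two throwaway two-line files:
$ printf '10\n20\n' > f1; printf '30\n40\n' > f2
$ awk '{ print FILENAME, NR, FNR }' f1 f2
f1 1 1
f1 2 2
f2 3 1
f2 4 2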

Related

Listing skipped numbers in a large txt file using bash

I need to find a way to display the missing numbers from a large txt file. It's a web graph that has 875,713 vertices. However, when I sort the file, the largest number displayed at the end is 916,427, so some numbers are not used as vertex indices. Is there a bash command I could use to do this?
I found this after searching around some other threads, but I'm not entirely sure if it's correct:
awk 'NR != $1 { for (i = prev + 1; i < $1; i++) {print i} } { prev = $1 + 1 }' file
Assuming the 'number' of each vertex is in the first column, you can use:
awk '{a[$1]} END{for(i = 1; i <= 916427; i++){if(!(i in a)){print i}}}' file
E.g.
# create some example data and remove "10"
seq 916427 | sed '10d' > test.txt
head test.txt
1
2
3
4
5
6
7
8
9
11
awk '{a[$1]} END { for (i = 1; i <= 916427; i++) { if (!(i in a)) {print i}}}' test.txt
10
If you don't want to store the array in memory (otherwise @jared_mamrot's solution would work), you can use
awk 'NR==1 {p=$1; next} {for (i=p+1; i<$1; i++) {print i}; p=$1}' < <( sort -n file)
which sorts the file first.
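A quick check using the test.txt generated above, which is missing 10:
awk 'NR==1 {p=$1; next} {for (i=p+1; i<$1; i++) {print i}; p=$1}' < <( sort -n test.txt )
10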
Just because you tagged your question bash, I'll provide a bash solution. :)
# sample data as jared suggested, with 10 removed...
seq 916427 | sed '10d' > test.txt
# read sample data into an array...
mapfile -t a < test.txt
# reverse the $a array into $b
for i in "${a[@]}"; do b[$i]=1; done
# step through list of possible numbers, testing if each one is an index of $b
for ((i=1; i<${a[((${#a[@]}-1))]}; i++)); do [[ -z ${b[i]} ]] && echo $i; done
The line noise in the last line (${a[((${#a[@]}-1))]}) simply means "the value of the last array element"; the -1 is there because, without instructions otherwise, mapfile starts numbering array elements at zero.
This takes a little longer to run than awk, because awk is awesome. But it runs in bash without calling any external tools. Aside from the ones generating our sample data, of course!
Note that the last line verifies $b array membership with a string comparison. You might get a very slight performance increase by doing a math comparison instead ((( ${b[i]} )) || echo $i) but the improvement would be so small that it's not even worth mentioning. Oh dang.
Note also that both this and the awk solution involve creating very large arrays in memory, then stepping through those arrays. Be careful of your memory, and don't waste array space with unnecessary data. You will probably want to pull just your indices out of your original dataset for this comparison, rather than loading everything into a bash or awk array.
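For example, assuming the vertex index is the first whitespace-separated column (graph.txt and indices.txt are just placeholder names):
# keep only the index column, deduplicated and numerically sorted,
# before feeding it to either solution above
awk '{ print $1 }' graph.txt | sort -n -u > indices.txt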

How to insert a comma after 4 digits for all number with more than 8 digits

I have csv-file that looks like this:
12625,6475,387,-388,-332,-217,-104,17,125,160,121,38,-101,-282,-368
-2675,6475,420,-385,-330,-217,-106,16,124,158,120,37,-104,-281,-365
2725,6475,633,-377,-327,-222,-117,6,113,148,109,26,-114,-282,-359
-12775,6475,927,-367,-324,-229,-133,-9,99,134,95,11,-128,-283,-351
12825,64751200,-357,-320,-236,-147,-23,86,121,82,-3,-140,-283,-344
^ missing comma
In some rows I have the problem shown in the last row of the example, where a comma is missing between the second and third column. I know from the data that the most digits a legitimate entry can have is 5 (in some cases with a - in front) and all entries that have 8 digits originate from missing commas, which should appear after the fourth digit.
I am looking for an expression - presumably with sed - that inserts a comma after the fourth digit of all 8-digit numbers in the file.
What I have so far is
echo "12356" | sed 's/\B[0-9]\{3\}/&,/g'
which will insert a comma after the fourth digit. How can I restrict this so that it only happens for 8-digit numbers, not for 5-digit numbers?
I am also open to any more elegant way that might exist to solve that problem.
Thank you
Try this sed
sed -E 's/([0-9]{4})([0-9]{4})/\1,\2/g'
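For example, applied to the problematic row from the question, it should produce:
$ echo '12825,64751200,-357,-320,-236,-147,-23,86,121,82,-3,-140,-283,-344' | sed -E 's/([0-9]{4})([0-9]{4})/\1,\2/g'
12825,6475,1200,-357,-320,-236,-147,-23,86,121,82,-3,-140,-283,-344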
Because sed has already been mentioned, here’s some awk…
awk -F, -vOFS=, '{
    for (i = 1; i <= NF; ++i)
        if (length($i) >= 8)
            $i = substr($i, 1, 4) "," substr($i, 5)
} 1' < some_file.csv
…and here’s some pure Bash, for no good reason:
(
    IFS=,
    while read -ra line; do
        for i in "${!line[@]}"; do
            ((${#line[i]} >= 8)) && line[i]="${line[i]::4},${line[i]:4}"
        done
        printf '%s\n' "${line[*]}"
    done
) < some_file.csv
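The ${line[i]::4} and ${line[i]:4} expansions simply split a string at the fourth character, for example:
s=64751200
echo "${s::4},${s:4}"   # prints 6475,1200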

Creating histograms in bash

EDIT
I read the question that this is supposed to be a duplicate of (this one). I don't agree. In that question the aim is to get the frequencies of individual numbers in the column. However if I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. i.e. if that solution tells me that the frequency of 0.45 is 2 and 0.44 is 1 (for my input data), I'm still left with the problem of grouping those two frequencies into a total of 3 for the range 0.4-0.5.
END EDIT
QUESTION-
I have a long column of data with values between 0 and 1.
This will be of the type-
0.34
0.45
0.44
0.12
0.45
0.98
.
.
.
A long column of decimal values with repetitions allowed.
I'm trying to change it into a histogram sort of output such as (for the input shown above)-
0.0-0.1 0
0.1-0.2 1
0.2-0.3 0
0.3-0.4 1
0.4-0.5 3
0.5-0.6 0
0.6-0.7 0
0.7-0.8 0
0.8-0.9 0
0.9-1.0 1
Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.
I wrote it (badly) as-
for i in $(seq 0 0.1 0.9)
do
    awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l;
done
Which basically does a wc -l of the entries it finds in each range.
Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also please note that the bin size should be a variable, as in my proposed solution.
I already read this answer and want to avoid the loop. I'm sure there's a much much faster way in awk that bypasses the for loop. Can you help me out here?
Following the same algorithm as in my previous answer, I wrote an awk script which is extremely fast.
The script is the following:
#!/usr/bin/awk -f
BEGIN {
    bin_width = 0.1;
}
{
    bin = int(($1 - 0.0001) / bin_width);
    if (bin in hist) {
        hist[bin] += 1
    } else {
        hist[bin] = 1
    }
}
END {
    for (h in hist)
        printf " * > %2.2f -> %i \n", h*bin_width, hist[h]
}
The bin_width is the width of each channel. To use the script, just copy it into a file, make it executable (with chmod +x <namefile>) and run it with ./<namefile> <name_of_data_file>.
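For the six sample values shown in the question, the output would look roughly like this (the iteration order of for (h in hist) is unspecified, so the lines may come out in a different order):
 * > 0.10 -> 1
 * > 0.30 -> 1
 * > 0.40 -> 3
 * > 0.90 -> 1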
For this specific problem, I would drop the last digit, then count occurrences of sorted data:
cut -b1-3 | sort | uniq -c
which gives, on the specified input set:
1 0.1
1 0.3
3 0.4
1 0.9
Output formatting can be done by piping through this awk command:
| awk 'BEGIN{r=0.0}
{while($2>r){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}
printf "%1.1f-%1.1f %3d\n",$2,$2+0.1,$1}
END{while(r<0.9){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}}'
The only loop in this algorithm is the one over the lines of the file.
This is an example of how to do what you asked in bash. Bash is probably not the best language for this, since it is slow at math. I use bc; you can use awk if you prefer.
How the algorithm works
Imagine you have many bins: each bin corresponds to an interval. Each bin is characterized by a width (CHANNEL_DIM) and a position. Together, the bins must cover the entire interval in which your data fall. Dividing a value by the bin width gives the position (index) of the bin it belongs to, so you just add 1 to that bin's counter. A much more detailed explanation is given here.
#!/bin/bash
# This is the input: you can use $1 and $2 to read input as cmd line argument
FILE='bash_hist_test.dat'
CHANNEL_NUMBER=9 # They are actually 10: 0 is already a channel
# check the max and the min to define the dimension of the channels:
MAX=`sort -n $FILE | tail -n 1`
MIN=`sort -rn $FILE | tail -n 1`
# Define the channel width
CHANNEL_DIM_LONG=`echo "($MAX-$MIN)/($CHANNEL_NUMBER)" | bc -l`
CHANNEL_DIM=`printf '%2.2f' $CHANNEL_DIM_LONG `
# Probably printf is not the best function in this context because
#+the result could be system dependent.
# Determine the channel for a given number
# Usage: find_channel <number_to_histogram> <width_of_histogram_channel>
function find_channel(){
    NUMBER=$1
    CHANNEL_DIM=$2
    # The channel is found by dividing the value by the channel width and
    #+rounding the result.
    RESULT_LONG=`echo $NUMBER/$CHANNEL_DIM | bc -l`
    RESULT=`printf '%.0f' $RESULT_LONG`
    echo $RESULT
}
# Read the file and do the computation
while IFS='' read -r line || [[ -n "$line" ]]; do
    CHANNEL=`find_channel $line $CHANNEL_DIM`
    [[ -z ${HIST[$CHANNEL]} ]] && HIST[$CHANNEL]=0
    let HIST[$CHANNEL]+=1
done < $FILE
# Print the histogram, iterating over the populated channel indices
for i in ${!HIST[*]}; do
    CHANNEL_START=`echo "$CHANNEL_DIM * $i - .04" | bc -l`
    CHANNEL_END=`echo " $CHANNEL_DIM * $i + .05" | bc`
    printf '%+2.1f : %2.1f => %i\n' $CHANNEL_START $CHANNEL_END ${HIST[$i]}
done
Hope this helps. Comment if you have other questions.

How to round a large number in Shell command?

In Mac terminal, I would like to round a large number.
For example,
At 10^13th place:
1234567812345678 --> 1230000000000000
Or at 10^12th place:
1234567812345678 --> 1235000000000000
So I would like to specify the place, and then get the rounded number.
How do I do this?
You can use arithmetic expansion:
$ val=1234567812345678
$ echo $(( ${val: -13:1} < 5 ? val - val % 10**13 : val - val % 10**13 + 10**13 ))
1230000000000000
$ echo $(( ${val: -12:1} < 5 ? val - val % 10**12 : val - val % 10**12 + 10**12 ))
1235000000000000
This checks whether the most significant removed digit is 5 or greater; the value is truncated by subtracting the remainder modulo 10**13 (or 10**12), and if that removed digit was 5 or greater, 10**13 (or 10**12) is added back to round up.
If you don't want to have to write it this way, you can wrap it in a little function:
round () {
echo $(( ${1: -$2:1} < 5 ? $1 - $1 % 10**$2 : $1 - $1 % 10**$2 + 10**$2 ))
}
which can then be used like this:
$ round "$val" 13
1230000000000000
$ round "$val" 12
1235000000000000
Notice that quoting $val isn't strictly necessary here; it's just a good habit.
If the one-liner is too cryptic, this is a more readable version of the same:
round () {
local rounded=$(( $1 - $1 % 10**$2 )) # Truncate
# Check if most significant removed digit is >= 5
if (( ${1: -$2:1} >= 5 )); then
(( rounded += 10**$2 ))
fi
echo $rounded
}
Apart from arithmetic expansion, this also uses parameter expansion to get a substring: ${1: -$2:1} stands for "take $1, count $2 from the back, take one character". There has to be a space before -$2 (or it has to be in parentheses), because otherwise it would be interpreted as a different expansion, checking if $1 is unset or null, which we don't want.
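For instance, a quick illustration of why the space matters, using val as defined above:
$ val=1234567812345678
$ echo "${val: -13:1}"    # first of the 13 digits that will be dropped
4
$ echo "${val:-13:1}"     # without the space: "use 13:1 if val is unset", so the whole value
1234567812345678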
awk's [s]printf function can do rounding for you, within the limits of double-precision floating-point arithmetic:
$ for p in 13 12; do
awk -v p="$p" '{ n = sprintf("%.0f", $0 / 10^p); print n * 10^p }' <<<1234567812345678
done
1230000000000000
1235000000000000
For a pure bash implementation, see Benjamin W.'s helpful answer.
Actually, if you want to round to n significant digits you might be best served by mixing up traditional math and strings.
Serious debugging is left to the student, but this is what I quickly came up with for the bash shell; I hope the Mac is close enough:
function rounder
{
    local value=$1;
    local digits=${2:-3};
    local zeros="$( eval "printf '0%.0s' {1..$digits}" )"; # proper zeros
    # a bit of shell magic that repeats the '0' $digits times.
    if (( value > 1$zeros )); then
        # large enough to require rounding
        local length=${#value};
        local digits_1=$(( digits + 1 ));
        local tval="${value:0:$digits_1}";   # leading digits, plus one
        tval=$(( tval + 5 ));                # half-add
        local zerox="";
        if (( ${#tval} > digits_1 )); then   # the half-add carried an extra digit
            zerox="0";
        fi
        # keep $digits significant digits and pad back to the original length
        local pad="$( eval "printf '0%.0s' {1..$(( length - digits ))}" )";
        value="${tval:0:$digits}${pad}$zerox";
    fi
    echo "$value";
}
See whether this can be done much more concisely; that's another exercise for the student.
Floating-point math is avoided because of its inherent problems.
All sorts of special cases, like negative numbers, are not covered.
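A quick check against the numbers from the question:
$ rounder 1234567812345678 3
1230000000000000
$ rounder 1234567812345678 4
1235000000000000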

Inserting if loop within awk

I had a problem solved in a previous post using awk, but now I want to put an if condition in it, and I am getting an error.
Here's the problem:
I had a lot of files that looked like this:
Header
175566717.000
175570730.000
175590376.000
175591966.000
175608932.000
175612924.000
175614836.000
.
.
.
175680016.000
175689679.000
175695803.000
175696330.000
And I wanted to extract the first 2000 lines (line 1 to 2000), then extract the lines 1500 to 3500, then 3000 to 5000 and so on... What I mean is: extract a window of 2000 lines with an overlap of 500 lines between contiguous windows until the end of the file.
This is the awk command used for it:
awk -v i=1 -v t=2000 -v d=501 'NR>1{a[NR-1]=$0}END{
    while(i<NR-1){
        ++n;
        for(k=i;k<i+t;k++)print a[k] > "win"n".txt";
        close("_win"n".txt")
        i=i+t-d
    }
}' myfile.txt
And I get several files with names win1.txt, win2.txt, win3.txt, etc.
My problem now is that, because the number of lines in the file is not a multiple of 2000, my last window has fewer than 2000 lines. How can I add an if condition that does this: if the last window would have fewer than 2000 numbers, the previous window should instead include all the lines up to the end of the file.
EXTRA INFO
When the windows are created, there is a line break at the end. That is why I needed the if condition to take into account a window of fewer than 2000 numbers, not just lines.
If you don't have to use awk for some other reason, try the sed approach
#!/bin/bash
file="$(sed '/^\s*$/d' myfile.txt)"
sed -n 1,2000p <<< "$file"
first=1500
last=3500
max=$(wc -l <<< "$file" | awk '{print $1}')
while [[ $max -ge 2000 && $last -lt $((max+1500)) ]]; do
    sed -n "$first","$last"p <<< "$file"
    ((first+=1500))
    ((last+=1500))
done
Obviously this is going to be slower than awk and more error-prone for gigantic files, but it should work in most cases.
Change the while condition to make it stop earlier:
while (i+t <= NR) {
Change the end condition of the for loop to compensate for the last output file being potentially bigger:
for (k = i; k < (i+t+t-d <= NR ? i+t : NR); k++)
The rest of your code can stay the same, although I took the liberty of removing the close statement (why was that there?) and of setting d=500, to make the output files really overlap by 500 lines.
awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}END{
    while (i+t <= NR) {
        ++n;
        for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt";
        i=i+t-d
    }
}' myfile.txt
I tested it with small values of t and d, and it seems to work as requested.
One final remark: for big input files, I wouldn't encourage storing the whole thing in array a.
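As a quick sanity check, assuming the window files land in the current directory, you can confirm that every window except the last has exactly 2000 lines and that the last one absorbed the remainder:
wc -l win*.txt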
