Creating histograms in bash - bash
EDIT
I read the question that this is supposed to be a duplicate of (this one). I don't agree. In that question the aim is to get the frequencies of individual numbers in the column. However, if I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. That is, if that solution tells me that the frequency of 0.45 is 2 and 0.44 is 1 (for my input data), I'm still left with the problem of grouping those two frequencies into a total of 3 for the range 0.4-0.5.
END EDIT
QUESTION-
I have a long column of data with values between 0 and 1.
This will be of the type-
0.34
0.45
0.44
0.12
0.45
0.98
.
.
.
A long column of decimal values with repetitions allowed.
I'm trying to change it into a histogram sort of output such as (for the input shown above)-
0.0-0.1 0
0.1-0.2 1
0.2-0.3 0
0.3-0.4 1
0.4-0.5 3
0.5-0.6 0
0.6-0.7 0
0.7-0.8 0
0.8-0.9 0
0.9-1.0 1
Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.
I wrote it (badly) as-
for i in $(seq 0 0.1 0.9)
do
awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l;
done
Which basically does a wc -l of the entries it finds in each range.
Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also, please note that the bin size should be a variable, as in my proposed solution.
I already read this answer and want to avoid the loop. I'm sure there's a much faster way in awk that bypasses the for loop. Can you help me out here?
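To make the target concrete, here is a minimal sketch of the kind of single-pass, loop-free awk binning being asked about. It is only an illustration: the file name input, the hard-coded upper end of the range (1.0), and the convention that a value equal to an upper boundary lands in the higher bin are all assumptions.
awk -v bs=0.1 '{ h[int($1/bs)]++ }                                           # bs is the (variable) bin size
END { for (i=0; i*bs<1; i++) printf "%.1f-%.1f %d\n", i*bs, (i+1)*bs, h[i]+0 }' input
A single pass fills the h array keyed by bin index; the END block then prints every bin from 0 up to the assumed range end, using h[i]+0 so empty bins print as 0.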
Following the same algorithm as my previous answer, I wrote an awk script which is extremely fast.
The script is the following:
#!/usr/bin/awk -f
BEGIN {
    bin_width = 0.1;    # width of each bin/channel
}
{
    # Shift the value down slightly so that numbers sitting exactly on a
    # boundary (e.g. 0.40) fall into the lower bin, then truncate to get
    # the bin index.
    bin = int(($1 - 0.0001) / bin_width);
    if (bin in hist) {
        hist[bin] += 1
    } else {
        hist[bin] = 1
    }
}
END {
    # Note: awk's for-in loop visits the bins in an unspecified order.
    for (h in hist)
        printf " * > %2.2f -> %i \n", h*bin_width, hist[h]
}
The bin_width is the width of each channel. To use the script, just copy it into a file, make it executable (with chmod +x <namefile>) and run it with ./<namefile> <name_of_data_file>.
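As a quick sanity check, this is roughly what a hypothetical run on the sample data from the question might print, assuming the script is saved as histogram.awk and the data as input (the order of the bins may vary, since awk's for-in loop is unordered):
chmod +x histogram.awk
./histogram.awk input
 * > 0.10 -> 1 
 * > 0.30 -> 1 
 * > 0.40 -> 3 
 * > 0.90 -> 1 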
For this specific problem, I would drop the last digit, then count occurrences of sorted data:
cut -b1-3 | sort | uniq -c
which gives, on the specified input set:
1 0.1
1 0.3
3 0.4
1 0.9
Output formatting can be done by piping through this awk command:
| awk 'BEGIN{r=0.0}
{while($2>r+0.01){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}
printf "%1.1f-%1.1f %3d\n",$2,$2+0.1,$1;r=r+.1}
END{while(r<0.95){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}}'
The only loop in this algorithm runs over the lines of the file. (The +0.01 in the comparisons is a small tolerance so that accumulated floating-point error in r does not produce spurious empty bins.)
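Putting the pieces together, a hypothetical end-to-end invocation (assuming the data file is named input) would be:
cut -b1-3 input | sort | uniq -c | awk 'BEGIN{r=0.0}
{while($2>r+0.01){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}
printf "%1.1f-%1.1f %3d\n",$2,$2+0.1,$1;r=r+.1}
END{while(r<0.95){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}}'
which should print the ten 0.1-wide bins shown in the question.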
This is an example of how to do what you asked in bash. Bash is probably not the best language for this, since it is slow at math; I use bc for the arithmetic, but you can use awk if you prefer.
How the algorithm works
Imagine you have many bins: each bin corresponds to an interval. Each bin is characterized by a width (CHANNEL_DIM) and a position. Together, the bins must cover the entire interval where your data fall. Dividing a value by the bin width gives you the position (index) of its bin, so you just add 1 to that bin's counter. Here is a much more detailed explanation.
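As a hypothetical worked example of that division step, with a bin width of 0.1:
echo "0.44/0.1" | bc -l    # prints 4.4000...; rounding gives channel 4, i.e. the 0.4-0.5 interval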
#!/bin/bash

# This is the input: you can use $1 and $2 to read input as cmd line arguments
FILE='bash_hist_test.dat'
CHANNEL_NUMBER=9  # They are actually 10: 0 is already a channel

# Check the max and the min to define the dimension of the channels:
MAX=`sort -n $FILE | tail -n 1`
MIN=`sort -rn $FILE | tail -n 1`

# Define the channel width
CHANNEL_DIM_LONG=`echo "($MAX-$MIN)/($CHANNEL_NUMBER)" | bc -l`
CHANNEL_DIM=`printf '%2.2f' $CHANNEL_DIM_LONG`
# Probably printf is not the best function in this context because
#+the result could be system dependent.

# Determine the channel for a given number
# Usage: find_channel <number_to_histogram> <width_of_histogram_channel>
function find_channel(){
  NUMBER=$1
  CHANNEL_DIM=$2
  # The channel is found by dividing the value by the channel width and
  #+rounding it.
  RESULT_LONG=`echo $NUMBER/$CHANNEL_DIM | bc -l`
  RESULT=`printf '%.0f' $RESULT_LONG`
  echo $RESULT
}

# Read the file and do the computation
while IFS='' read -r line || [[ -n "$line" ]]; do
  CHANNEL=`find_channel $line $CHANNEL_DIM`
  [[ -z ${HIST[$CHANNEL]} ]] && HIST[$CHANNEL]=0
  let HIST[$CHANNEL]+=1
done < $FILE

counter=0
for i in ${HIST[*]}; do
  CHANNEL_START=`echo "$CHANNEL_DIM * $counter - .04" | bc -l`
  CHANNEL_END=`echo "$CHANNEL_DIM * $counter + .05" | bc`
  printf '%+2.1f : %2.1f => %i\n' $CHANNEL_START $CHANNEL_END $i
  let counter+=1
done
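A hypothetical run, assuming the sample data from the question has been saved as bash_hist_test.dat (the file name the script expects) and the script itself as bash_hist.sh:
printf '%s\n' 0.34 0.45 0.44 0.12 0.45 0.98 > bash_hist_test.dat
chmod +x bash_hist.sh
./bash_hist.sh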
Hope this helps. Comment if you have other questions.
Related
Average marks from list
Sorry if I don't write well, it's my first post. I have a list in one file with the name, id, marks etc. of students (see below), and I want to calculate the average mark and write it to another file, but I don't know how to take only the marks and write the average to another file. Thanks;
#name surname student_index_number course_group_id lecturer_id list_of_marks
athos musketeer 1 1 1 3,4,5,3.5
porthos musketeer 2 1 1 2,5,3.5
aramis musketeer 3 2 2 2,1,4,5
This is what I have so far:
while read line; do
echo "$line" | cut -f 6 -d ' '
done<main_list
awk 'NR>1{n=split($NF,a,",");for(i=1;i<=n;i++){s+=a[i]};print $1,s/n;s=0}' input
athos 3.875
porthos 3.5
aramis 3
For all lines except the header (NR>1 filters out the header), pick up the last column and split it into individual numbers on the commas. Using the for loop, sum all the marks and then divide by the number of subjects.
Something like (untested):
awk '{ n = split($6, a, ","); total=0; for (v in a) total += a[v]; print total / n }' main_list
For a pure bash solution, could you please try the following.
while read first second third fourth fifth sixth
do
  if [[ "$first" =~ (^#) ]]
  then
    continue
  fi
  count="${sixth//[^,]}"
  val=$(echo "(${#count}+1)" | bc)
  echo "scale=2; (${sixth//,/+})/$val" | bc
done < "Input_file"
How to select a specific percentage of lines?
Good morning! I have a file.csv with 140 lines and 26 columns and I need to sort the lines according to the values in column 23. This is an example:
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller2,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-1088,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller3,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-2159,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
To filter the lines on the values of column 23, I do this:
awk -F "%*," '$23 > 4' myfikle.csv
The result:
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
In my example I use the value of 4% in column 23, the goal being to retrieve all the rows whose value in column 23 increases significantly. The problem is that I can't rely on the 4% value because it is only representative of the current table, so I have to find another way to retrieve the rows that have a high value in column 23. I want to sort the controllers in descending order by the percentage in column 23 and process the first 10% of the sorted lines, to make sure I get the controllers with a large percentage. The goal is to be able to vary the percentage according to the number of lines in the table. Do you have any tips for that? Thanks! :)
I could have sworn that this question was a duplicate, but so far I couldn't find a similar question. Whether your file is sorted or not does not really matter. From any file you can extract the NUMBER first lines with head -n NUMBER. There is no built-in way to specify the number percentually, but you can compute that PERCENT% of your file's lines are NUMBER lines.
percentualHead() {
  percent="$1"
  file="$2"
  linesTotal="$(wc -l < "$file")"
  (( lines = linesTotal * percent / 100 ))
  head -n "$lines" "$file"
}
or shorter but less readable
percentualHead() {
  head -n "$(( "$(wc -l < "$2")" * "$1" / 100 ))" "$2"
}
Calling percentualHead 10 yourFile will print the first 10% of lines from yourFile to stdout. Note that percentualHead only works with files because the file has to be read twice. It does not work with FIFOs, <(), or pipes.
If you want to use standard tools, you'll need to read the file twice. But if you're content to use perl, you can simply do:
perl -e 'my @sorted = sort <>; print @sorted[0..$#sorted * .10]' input-file
Here is one for GNU awk to get the top p% from the file, although they are output in the order of appearance:
$ awk -F, -v p=0.5 '                  # 50 % of top $23 records
NR==FNR {                             # first run
    a[NR]=$23                         # hash percentages to a, NR as key
    next
}
FNR==1 {                              # second run, at the beginning
    n=asorti(a,a,"@val_num_desc")     # sort percentages into descending order
    for(i=1;i<=n*p;i++)               # get only the top p %
        b[a[i]]                       # hash their NRs to b
}
(FNR in b)                            # top p %, BUT not in order
' file file | cut -d, -f 23           # file processed twice, cut 23rd field for demo
45.04%
19.18%
Get a percentage of randomly chosen lines from a text file
I have a text file (bigfile.txt) with thousands of rows. I want to make a smaller text file with 1% of the rows, chosen randomly. I tried the following:
output=$(wc -l bigfile.txt)
ds1=$(0.01*output)
sort -r bigfile.txt|shuf|head -n ds1
It gives the following error:
head: invalid number of lines: ‘ds1’
I don't know what is wrong.
Even after you fix the other issues with your bash script, bash cannot do floating-point arithmetic. You need an external tool like awk, which I would use as:
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' bigfile.txt)
(( randomCount )) && sort -r file | shuf | head -n "$randomCount"
E.g. writing a file with 221 lines using the loop below and trying to get random lines:
tmpfile=$(mktemp /tmp/abc-script.XXXXXX)
for i in {1..221}; do echo $i; done >> "$tmpfile"
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' "$tmpfile")
If I print the count, it returns the integer 2, and using that on the next command:
sort -r "$tmpfile" | shuf | head -n "$randomCount"
86
126
Roll a die (with rand()) for each line of the file and get a number between 0 and 1. Print the line if the die shows less than 0.01:
awk 'rand()<0.01' bigFile
Quick test - generate 100,000,000 lines and count how many get through:
seq 1 100000000 | awk 'rand()<0.01' | wc -l
999308
Pretty close to 1%.
If you want the order random as well as the selection, you can pass this through shuf afterwards:
seq 1 100000000 | awk 'rand()<0.01' | shuf
On the subject of efficiency, which came up in the comments, this solution takes 24s on my iMac with 100,000,000 lines:
time { seq 1 100000000 | awk 'rand()<0.01' > /dev/null; }
real 0m23.738s
user 0m31.787s
sys 0m0.490s
The only other solution that works here, heavily based on OP's original code, takes 13 minutes 19s.
How to sum a row of numbers from text file-- Bash Shell Scripting
I'm trying to write a bash script that calculates the average of numbers by rows and columns. An example of a text file that I'm reading in is:
1 2 3 4 5
4 6 7 8 0
There is an unknown number of rows and an unknown number of columns. Currently, I'm just trying to sum each row with a while loop. The desired output is:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
And so on and so forth with each row. Currently this is the code I have:
while read i
do
echo "num: $i"
(( sum=$sum+$i ))
echo "sum: $sum"
done < $2
To call the program it's stats -r test_file ("-r" indicates rows; I haven't started columns quite yet). My current code actually just takes the first number of each column and adds them together, and then the rest of the numbers error out as a syntax error. It says the error comes from line 16, which is the (( sum=$sum+$i )) line, but I honestly can't figure out what the problem is. I should tell you I'm extremely new to bash scripting and I have googled and searched high and low for the answer and can't find it. Any help is greatly appreciated.
You are reading the file line by line, and summing a whole line is not an arithmetic operation. Try this:
while read i
do
  sum=0
  for num in $i
  do
    sum=$(($sum + $num))
  done
  echo "$i Sum: $sum"
done < $2
Just split each line into its individual numbers with the inner for loop. I hope this helps.
Another non bash way (con: OP asked for bash, pro: does not depend on bashisms, works with floats). awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}'
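For instance, a hypothetical run on the question's sample file (assuming it is saved as numbers.txt) would look like:
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}' numbers.txt
1 2 3 4 5 Sum: 15
4 6 7 8 0 Sum: 25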
Another way (not pure bash):
while read line
do
  sum=$(sed 's/[ ]\+/+/g' <<< "$line" | bc -q)
  echo "$line Sum = $sum"
done < filename
Using the numsum -r util covers the row addition, but the output format needs a little glue, by inefficiently paste-ing a few utils:
paste "$2" \
      <(yes "Sum =" | head -$(wc -l < "$2") ) \
      <(numsum -r "$2")
Output:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
Note -- to run the above line on a given file foo, first initialize $2 like so:
set -- "" foo
paste "$2" <(yes "Sum =" | head -$(wc -l < "$2") ) <(numsum -r "$2")
What's an easy way to read random line from a file?
What's an easy way to read a random line from a file in a shell script?
You can use shuf:
shuf -n 1 $FILE
There is also a utility called rl. In Debian it's in the randomize-lines package that does exactly what you want, though not available in all distros. On its home page it actually recommends the use of shuf instead (which didn't exist when it was created, I believe). shuf is part of the GNU coreutils, rl is not.
rl -c 1 $FILE
Another alternative: head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1
sort --random-sort $FILE | head -n 1 (I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)
This is simple:
cat file.txt | shuf -n 1
Granted this is just a tad slower than "shuf -n 1 file.txt" on its own.
perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book: perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
using a bash script:
#!/bin/bash
# replace with file to read
FILE=tmp.txt
# count number of lines
NUM=$(wc -l < ${FILE})
# generate a random number in the range 1-NUM
let "X = ${RANDOM} % ${NUM} + 1"
# extract the X-th line
sed -n ${X}p ${FILE}
Single bash line: sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt Slight problem: duplicate filename.
Here's a simple Python script that will do the job:
import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))])
Usage:
python randline.py file_to_get_random_line_from
Another way using 'awk' awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name
A solution that also works on MacOSX, and should also work on Linux(?):
N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file
Where:
N is the number of random lines you want
NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2 --> save the line numbers written in file1 and then print the corresponding lines from file2
jot -r $N 1 $(wc -l < $file) --> draw N numbers randomly (-r) in the range (1, number_of_lines_in_file) with jot. The process substitution <() makes it look like a file to the interpreter, so file1 in the previous example.
#!/bin/bash
IFS=$'\n' wordsArray=($(<$1))
numWords=${#wordsArray[@]}
sizeOfNumWords=${#numWords}
while [ True ]
do
  for ((i=0; i<$sizeOfNumWords; i++))
  do
    let ranNumArray[$i]=$(( ( $RANDOM % 10 ) + 1 ))-1
    ranNumStr="$ranNumStr${ranNumArray[$i]}"
  done
  if [ $ranNumStr -le $numWords ]
  then
    break
  fi
  ranNumStr=""
done
noLeadZeroStr=$((10#$ranNumStr))
echo ${wordsArray[$noLeadZeroStr]}
Here is what I discovered, since my Mac OS doesn't use all the easy answers. I used the jot command to generate a number, since the $RANDOM variable solutions seem not to be very random in my testing. When testing my solution I had a wide variance in the solutions provided in the output.
RANDOM1=`jot -r 1 1 235886`
#range of jot ( 1 235886 ) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1
The echo of the variable is to get a visual of the generated random number.
Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:
sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME
(This works even if FILENAME is empty, in which case no line is emitted.)
One possible advantage of this approach is that it only calls rand() once.
As pointed out by @AdamKatz in the comments, another possibility would be to call rand() for each line:
awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME
(A simple proof of correctness can be given based on induction.)
Caveat about rand(): "In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk." -- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html