median of column with awk - bash

How can I use AWK to compute the median of a column of numerical data?
I can think of a simple algorithm but I can't seem to program it:
What I have so far is:
sort | awk 'END{print NR}'
And this gives me the number of elements in the column. I'd like to use this to print a certain row (NR/2). If NR/2 is not an integer, then I round up to the nearest integer and that is the median, otherwise I take the average of (NR/2)+1 and (NR/2)-1.

With awk you have to store the values in an array and compute the median at the end, assuming we look at the first column:
sort -n file | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Sure, for real median computation do the rounding as described in the question:
sort -n file | awk ' { a[i++]=$1; }
END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
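A quick sanity check of the rounding logic, assuming GNU coreutils seq is available (one odd-sized and one even-sized input):
seq 9 | sort -n | awk '{ a[i++]=$1 } END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1] }'    # prints 5
seq 10 | sort -n | awk '{ a[i++]=$1 } END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1] }'   # prints 5.5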

This awk program assumes one column of numerically sorted data:
#!/usr/bin/awk -f
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        print count[(NR + 1) / 2];
    } else {
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}
Sample usage:
sort -n data_file | awk -f median.awk
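If the numbers of interest sit in, say, the third column of a whitespace-separated file (data.txt is a hypothetical name here), extract that column before sorting:
awk '{ print $3 }' data.txt | sort -n | awk -f median.awk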

OK, just saw this topic and thought I could add my two cents, since I looked for something similar in the past. Even though the title says awk, all the answers make use of sort as well. Calculating the median for a column of data can be easily accomplished with datamash:
> seq 10 | datamash median 1
5.5
Note that sort is not needed, even if you have an unsorted column:
> seq 10 | gshuf | datamash median 1
5.5
The documentation lists all the functions it can perform, with good examples for files with many columns as well. Admittedly it has nothing to do with awk, but datamash is of great help in cases like this and can also be used in conjunction with awk. Hope it helps somebody!
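As a sketch of using it on a delimited file (the file name and column number here are hypothetical), the -t option sets the field separator, just as in the group-by answer further down:
datamash -t',' median 2 < file.csv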

This AWK based answer to a similar question on unix.stackexchange.com gives the same results as Excel for calculating the median.

If you have an array to compute the median from (this contains a one-liner version of Johnsyweb's solution):
array=(5 6 4 2 7 9 3 1 8) # numbers 1-9
IFS=$'\n'
median=$(sort -n <<< "${array[*]}" | awk '{arr[NR]=$1} END {if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}')
unset IFS
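For the nine numbers above the expected result is 5, so a quick check:
echo "$median"   # 5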

Related

Subsetting a CSV based on a percentage of unique values

I've been reading through other similar questions. I have this working, but it is very slow due to the size of the CSV I'm working with. Are there ways to make this more efficient?
My goal:
I have an incredibly large CSV (>100 GB). I would like to take all of the unique values in a column, extract 10% of these, and then use that 10% to subsample the original CSV.
What I'm doing:
1 - I'm pulling all unique values from column 11 and writing those to a text file:
cat File1.csv | cut -f11 -d , | sort | uniq > uniqueValues.txt
2 - Next, I'm sampling a random 10% of the values in uniqueValues.txt:
cat uniqueValues.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .10) print $0 }' > uniqueValues.10pct.txt
3 - Next, I'm pulling the rows in File1.csv which have column 11 matching values from uniqueValues.10pct.txt:
awk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
As far as I can tell, this seems to be working. Does this seem reasonable? Any suggestions on how to improve the efficiency?
Any suggestions on how to improve the efficiency?
Avoid sort in 1st step as 2nd and 3rd do not care about order, you might do your whole 1st step using single awk command as follows:
awk 'BEGIN{FS=","}!arr[$11]++{print $11}' File1.csv > uniqueValues.txt
Explanation: I inform GNU AWK that the field separator (FS) is a comma, then for each line I do arr[$11]++ to get the number of occurrences of the value in the 11th column and use ! to negate it, so 0 becomes true, whilst 1 and greater become false. If this holds true I print the 11th column.
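A toy illustration of the !arr[...]++ deduplication idiom, keyed on the 1st column of a two-column inline input rather than the 11th:
$ printf '%s\n' x,1 y,2 x,3 | awk 'BEGIN{FS=","}!arr[$1]++{print $1}'
x
y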
Please test this single-awk version against your 1st step on your data, then select whichever is faster.
As for the 3rd step, you might attempt using a non-GNU AWK if you are allowed to install tools on your machine. For example, the author of the article¹ Don’t MAWK AWK – the fastest and most elegant big data munging language! found nawk faster than GNU AWK and mawk faster than nawk. After installing, prepare test data and measure the times for
gawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
nawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
mawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
then use whichever proves fastest.
¹be warned that the values shown pertain to the versions available in September 2009; you might get different times with the versions available in June 2022.
You might find this to be faster (untested since no sample input/output provided):
cut -f11 -d',' File1.csv |
sort -u > uniqueValues.txt
numUnq=$(wc -l < uniqueValues.txt)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{vals[$1]; next} $11 in vals' - File1.csv
You could try replacing that first cut | sort; numUnq=$(wc...) with
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
to see if that's any faster but I doubt it since cut, sort, and wc are all very fast while awk has to do regexp-based field splitting and store all $11 values in memory (which can get slow as the array size increases due to how dynamic array allocation works).
Create a sample *.csv file:
for ((i=1;i<=10;i++))
do
    for ((j=1;j<=100;j++))
    do
        echo "a,b,c,d,e,f,g,h,i,j,${j},k,l,m"
    done
done > large.csv
NOTES:
1,000 total lines
100 unique values in the 11th field
each unique value shows up 10 times in the file
We'll look at a couple awk ideas that:
keep track of unique values as we find them
apply the random percentage check as we encounter a new (unique) value
require just a single pass through the source file
NOTE: both of these awk scripts (below) replace all of OP's current code (cat/cut/sort/uniq/cat/awk/awk)
First idea applies our random percentage check each time we find a new unique value:
awk -F',' '
BEGIN { srand() }
!seen[$11]++ { if (rand() <= 0.10)     # if this is the 1st time we have seen this value and rand() is <= 10% then ...
                   keep[$11]           # add the value to our keep[] array
             }
$11 in keep                            # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
one drawback to this approach is that the total number of unique values is not guaranteed to always be exactly 10% since we're at the mercy of the rand() function, for example ...
a handful of sample runs generated 70, 110, 100, 140 and 110 lines (ie, 7, 11, 10, 14 and 11 unique values) in small.csv
A different approach where we pre-generate a random set of modulo-100 values (ie, 0 to 99); as we find a new unique value we check the count (of unique values) modulo 100 and, if we find a match in our pre-generated set, we print the row:
awk -F',' -v pct=10 '
BEGIN { srand()
        delete mods                        # force awk to treat all "mods" references as an array and not a scalar
        while (length(mods) < pct)         # repeat loop until we have "pct" unique indices in the mods[] array
              mods[int(rand() * 100)]      # generate random integers between 0 and 99
      }
!seen[$11]++ { if ((++uniqcnt % 100) in mods)  # if this is the 1st time we have seen this value then increment our unique value counter and if "modulo 100" is an index in the mods[] array then ...
                   keep[$11]                   # add the value to our keep[] array
             }
$11 in keep                                    # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
for a large pct this assumes the rand() results are evenly distributed between 0 and 1 so that the mods[] array is populated in a timely manner
this has the benefit of printing lines that represent exactly 10% of the possible unique values (depending on number of unique values the percentage will actually be 10% +/- 1%)
a half dozen sample runs all generated exactly 100 lines (ie, 10 unique values) in small.csv
If OP still needs to generate the two intermediate (sorted) files (uniqueValues.txt and uniqueValues.10pct.txt) then this could be done in the same awk script via an END {...} block, eg:
END { PROCINFO["sorted_in"]="#ind_num_asc" # this line of code requires GNU awk otherwise OP can sort the files at the OS/bash level
for (i in seen)
print i > "uniqueValues.txt"
for (i in keep)
print i > "uniqueValues.10pct.txt" # use with 1st awk script
# print i > "uniqueValues." pct "pct.txt" # use with 2nd awk script
}

How to sort a file by line length and then alphabetically for the second key?

Say I have a file:
ab
aa
c
aaaa
I would like it to be sorted like this
c
aa
ab
aaaa
That is to sort by line length and then alphabetically. Is that possible in bash?
You can prepend the length of each line, then sort numerically, and finally cut the length prefix back off:
< your_file awk '{ print length($0), $0; }' | sort -n | cut -d' ' -f2-
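A quick check against the sample input from the question:
$ printf 'ab\naa\nc\naaaa\n' | awk '{ print length($0), $0; }' | sort -n | cut -d' ' -f2-
c
aa
ab
aaaa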
You see that I've accomplished the sorting via sort -n, without doing any multi-key sorting. Honestly I was lucky that this worked:
I didn't think that lines could begin with numbers, so I expected sort -n to work because alphabetic and numeric sorting give the same result when all the strings are the same length, which is exactly the case here because we are sorting by the line length that I'm adding via awk.
It turns out everything works even if your input has lines starting with digits, the reason being that sort -n
sorts numerically on the leading numeric part of the lines;
in case of ties, it uses strcmp to compare the whole lines
Here's some demo:
$ echo -e '3 11\n3 2' | sort -n
3 11
3 2
# the `3 ` on both lines makes them equal for numerical sorting
# but `3 11` comes before `3 2` by `strcmp` because `1` comes before `2`
$ echo -e '3 11\n03 2' | sort -n
03 2
3 11
# the `03 ` vs `3 ` is a numerical tie,
# but `03 2` comes before `3 11` by `strcmp` because `0` comes before `3`
So the lucky part is that the `,` I included in the awk command inserts a space (actually an OFS), i.e. a non-digit, thus "breaking" the numeric sorting and letting the strcmp sorting kick in (on the whole lines, which compare equal numerically in this case).
Whether this behavior is POSIX or not, I don't know, but I'm using GNU coreutils 8.32's sort. Refer to this question of mine and this answer on Unix for details.
awk could do it all itself, but I think using sort to sort is more idiomatic (as in: use sort to sort) and efficient, as explained in a comment (after all, why would you not expect sort to be the best-performing tool in the shell for sorting?).
Insert a length for the line using gawk (zero-filled to four places so it will sort correctly), sort by two keys (first the length, then the first word on the line), then remove the length:
gawk '{printf "%04d %s\n", length($0), $0}' | sort -k1 -k2 | cut -d' ' -f2-
If it must be bash:
while read -r line; do printf "%04d %s\n" ${#line} "${line}"; done | sort -k1 -k2 | (while read -r len remainder; do echo "${remainder}"; done)
For GNU awk:
$ gawk '{
a[length()][$0]++ # hash to 2d array
}
END {
PROCINFO["sorted_in"]="#ind_num_asc" # first sort on length dim
for(i in a) {
PROCINFO["sorted_in"]="#ind_str_asc" # and then on data dim
for(j in a[i])
for(k=1;k<=a[i][j];k++) # in case there are duplicates
print j
# PROCINFO["sorted_in"]="#ind_num_asc" # I don't think this is needed?
}
}' file
Output (the test file used here also contained two duplicate aaaaaaaaaa lines, which exercise the duplicate handling):
c
aa
ab
aaaa
aaaaaaaaaa
aaaaaaaaaa

bash command for group by count

I have a file in the following format
abc|1
def|2
abc|8
def|3
abc|5
xyz|3
I need to group by these words in the first column and sum the value of the second column. For instance, the output of this file should be
abc|14
def|5
xyz|3
Explanation: the corresponding values for word "abc" are 1, 8, and 5. By adding these numbers, the sum comes out to be 14 and the output becomes "abc|14". Similarly, for word "def", the corresponding values are 2 and 3. Summing up these, the final output comes out to be "def|5".
Thank you very much for the help :)
I tried the following command
awk -F "|" '{arr[$1]+=$2} END {for (i in arr) {print i"|"arr[i]}}' filename
another command which I found was
awk -F "," 'BEGIN { FS=OFS=SUBSEP=","}{arr[$1]+=$2 }END {for (i in arr) print i,arr[i]}' filename
Neither showed me the intended results, although I'm also unsure how these commands work.
Short GNU datamash solution:
datamash -s -t\| -g1 sum 2 < filename
The output:
abc|14
def|5
xyz|3
-t\| - field separator
-g1 - group by the 1st column
sum 2 - sum up values of the 2nd column
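The -s flag tells datamash to sort the input itself; without it, rows belonging to the same group must already be adjacent, so pre-sorting on the key column works too (a sketch equivalent to the command above):
sort -t'|' -k1,1 filename | datamash -t'|' -g1 sum 2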
I will just add an answer to fix the sorting issue you had. In your Awk logic you don't need to pipe the output of Awk to sort/uniq; you can handle the sorting within Awk itself.
Referring to the GNU Awk manual section Using Predefined Array Scanning Orders with gawk, you can use the PROCINFO["sorted_in"] variable (gawk-specific) to control how you want Awk to sort your final output.
Referring to the section below,
#ind_str_asc
Order by indices in ascending order compared as strings; this is the most basic sort. (Internally, array indices are always strings, so with a[2*5] = 1 the index is "10" rather than numeric 10.)
So to apply this to your requirement, in the END clause just do:
END{PROCINFO["sorted_in"]="#ind_str_asc"; for (i in unique) print i,unique[i]}
with your full command being,
awk '
BEGIN { FS=OFS="|" }
{
    unique[$1]+=$2
    next
}
END {
    PROCINFO["sorted_in"]="#ind_str_asc"
    for (i in unique)
        print i,unique[i]
}' file
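Run against the sample data from the question, this prints the groups ordered by the key:
abc|14
def|5
xyz|3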
awk -F\| '{ arry[$1]+=$2 } END { asorti(arry,arry2);for (i in arry2) { print arry2[i]"|"arry[arry2[i]]} }' filename
Your initial solution should work apart from the issue with sort. Use the asorti function to sort the indices from arry into arry2, then process those in the loop.
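For reference, asorti(src, dst) is gawk-specific: it copies the indices of src into dst as values indexed 1..n in sorted order and returns n. A minimal illustration:
$ gawk 'BEGIN { a["def"]; a["abc"]; a["xyz"]; n = asorti(a, b); for (i = 1; i <= n; i++) print i, b[i] }'
1 abc
2 def
3 xyz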

awk calculate average or zero

I am calculating the average for a bunch of numbers in a bunch of text files like this:
grep '^num' file.$i | awk '{ sum += $2 } END { print sum / NR }'
But sometimes the file doesn't contain the pattern, in which case I want the script to return zero. Any ideas for this slightly modified one-liner?
You're adding to your load (average) by spawning an extra process to do everything the first one can do. Using 'grep' and 'awk' together is a red flag. You would be better off writing:
awk '/^num/ {n++;sum+=$2} END {print n?sum/n:0}' file
Try this:
... END { print NR ? sum/NR : 0 }
Use awk's ternary operator, m ? m : n, which means: if m has a value (the '?' branch), use it, else (the ':' branch) use the other value. Both m and n can be strings, numbers, or expressions that produce a value.
grep '^num' file.$i | awk '{ sum += $2 } END { print sum ? sum / NR : 0.0 }'
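A quick check that the fallback kicks in when the pattern never matches (toy input with no ^num lines; awk then sees empty input and sum stays unset):
$ printf 'foo 1\nbar 2\n' | grep '^num' | awk '{ sum += $2 } END { print sum ? sum / NR : 0.0 }'
0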

Perform arithmetic operations on all "cells" in tab-delimited file

I have a tab delimited file of n by m (where n is number of rows and m is number of columns).
I want to perform a mathematical operation on the values present in the file (say, adding 5 to the value in each column and then dividing it by 12).
Any one-line regex command or a mixture of things would help.
Thank you in advance.
awk '{
    # add all numbers on a line
    tot=0
    for (i=1;i<=NF;i++) tot+=$i
    # print detail
    print "LineNo=" NR "\ttot=" tot "\tavg=" tot/12 "\tdata=" $0
    gTot+=tot
}
END {
    print "Number of Lines =" NR "\n" \
          "GrandTotal=\t" gTot
}
' yourFile
You'll want to work through this excellent awk tutorial to really understand what is happening.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer. Note that you can 'accept' only one answer (with a check mark) and you can vote for up to 30 answers each day.
Example using awk:
gawk '{for (i = 1; i <= NF; i += 1) {printf "%f\t", ($i + 5) / 12;} printf "\n"}'
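For a small tab-delimited example (two rows, two columns), each cell x becomes (x + 5) / 12:
$ printf '1\t7\n19\t31\n' | gawk '{for (i = 1; i <= NF; i += 1) {printf "%f\t", ($i + 5) / 12;} printf "\n"}'
0.500000        1.000000
2.000000        3.000000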
Try sed or awk (awk is very good); they were designed to do exactly this.
