How to use "column" to center a chart? - bash

I was wondering what the best way to sort a chart using the column command to center each column instead of the default left aligned column was. I have been using the column -t filename command.
Current Output:
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Desired Output: Something like this
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
Basically, I want it to be centered instead of left aligned.

Here is an awk solution:
# Collect all lines in "data", keep track of maximum width for each field
{
data[NR] = $0
for (i = 1; i <= NF; ++i)
max[i] = length($i) > max[i] ? length($i) : max[i]
}
END {
for (i = 1; i <= NR; ++i) {
# Split record into array "arr"
split(data[i], arr)
# Loop over array
for (j = 1; j <= NF; ++j) {
# Calculate amount of padding required
pad = max[j] - length(arr[j])
# Print field with appropriate padding, see below
printf "%*s%*s%s", length(arr[j]) + int(pad/2), arr[j], \
pad % 2 == 0 ? pad/2 : int(pad/2) + 1, "", \
j == NF ? "" : " "
}
# Newline at end of record
print ""
}
}
Called like this:
$ awk -f centre.awk infile
Label1 label2
Anotherlabel label2442
label152 label42242
label78765 label373737737
The printf statement uses padding with dynamic widths:
The first %*s takes care of left padding and the data itself: arr[j] gets printed and padded to a total width of length(arr[j]) + int(pad/2).
The second %*s prints the empty string, left padded to half of the total padding required. pad % 2 == 0 ? pad/2 : int(pad/2) + 1 checks if the total padding was an even number, and if not, adds an extra space.
The last %s prints j == NF ? "" : " ", i.e., two spaces, unless we're at the last field.
Some older awks don't support the %*s syntax, but the formatting string can be assembled like width = 5; "%" width "s" in that case.

Here's a Python program to do what you want. It's probably too hard to do in bash, so you'll need to use a custom program or awk script. Basic algorithm:
count number of columns
[optional] make sure each line has the same number of columns
figure out the maximum length of data for each column
print each line using the max lengths
.
#!/usr/bin/env python3
import sys
def column():
# Read file and split each line into fields (by whitespace)
with open(sys.argv[1]) as f:
lines = [line.split() for line in f]
# Check that each line has the same number of fields
num_fields = len(lines[0])
for n, line in enumerate(lines):
if len(line) != num_fields:
print('Line {} has wrong number of columns: expected {}, got {}'.format(n, num_fields, len(line)))
sys.exit(1)
# Calculate the maximum length of each field
max_column_widths = [0] * num_fields
for line in lines:
line_widths = (len(field) for field in line)
max_column_widths = [max(z) for z in zip(max_column_widths, line_widths)]
# Now print them centered using the max_column_widths
spacing = 4
format_spec = (' ' * spacing).join('{:^' + str(n) + '}' for n in max_column_widths)
for line in lines:
print(format_spec.format(*line))
if __name__ == '__main__':
column()

Related

grep variable pattern and output match and sequence position

Given the following string,
>Q07092
MWVSWAPGLWLLGLWATFGHGANTGAQCPPSQQEGLKLEHSSSLPANVTGFNLIHRLSLMKTSAIKKIRNPKGPLILRLGAAPVTQPTRRVFPRGLPEEFALVLTLLLKKHTHQKTWYLFQVTDANGYPQISLEVNSQERSLELRAQGQDGDFVSCIFPVPQLFDLRWHKLMLSVAGRVASVHVDCSSASSQPLGPRRPMRPVGHVFLGLDAEQGKPVSFDLQQVHIYCDPELVLEEGCCEILPAGCPPETSKARRDTQSNELIEINPQSEGKVYTRCFCLEEPQNSEVDAQLTGRISQKAERGAKVHQETAADECPPCVHGARDSNVTLAPSGPKGGKGERGLPGPPGSKGEKGARGNDCVRISPDAPLQCAEGPKGEKGESGALGPSGLPGSTGEKGQKGEKGDGGIKGVPGKPGRDGRPGEICVIGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGIGLPGTPGDPGGPPGPKGDKGSSGIPGKEGPGGKPGKPGVKGEKGDPCEVCPTLPEGFQNFVGLPGKPGPKGEPGDPVPARGDPGIQGIKGEKGEPCLSCSSVVGAQHLVSSTGASGDVGSPGFGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGEPCEPCPALSNLQDGDVRVVALPGPSGEKGEPGPPGFGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGDGCTACPSLQGTVTDMAGRPGQPGPKGEQGPEGVGRPGKPGQPGLPGVQGPPGLKGVQGEPGPPGRGVQGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGASVSGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGECSCPSQGDLIFSGMPGAPGLWMGSSWQPGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGLTAELGSLPIEQHLLKSICGDCVQGQRAHPGYLVEKGEKGDQGIPGVPGLDNCAQCFLSLERPRAEEARGDNSEGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGPQAEKGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGISAVGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGMPGGPGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGDMVNYDEIKRFIRQEIIKMFDERMAYYTSRMQFPMEMAAAPGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGIGIAGENGLPGPPGPQGPPGYGKMGATGPMGQQGIPGIPGPPGPMGQPGKAGHCNPSDCFGAMPMEQQYPPMKTMKGPFG
I want to first grep for pattern matching 6 or more xGx repeats, where x is any character. This, I can easily do,
grep -EIho -B1 '([^G]G[^G]){6,}' file
which outputs
>Q07092
KGERGLPGPPGSKGEKGARGN
EGPKGEKGESGALGPSGLPGSTGEKGQKGEKGD
IGPKGQKGDPGFVGPEGLAGEPGPPGLPGPPGI
PGPKGDKGSSGIPGKEGP
FGLPGLPGRAGVPGLKGEKGNFGEAGPAGSPGPPGPVGPAGIKGAKGE
FGLPGKQGKAGERGLKGQKGDAGNPGDPGTPGTTGRPGLSGEPGVQGPAGPKGEKGD
AGRPGQPGPKGEQGPEGV
PGKPGQPGLPGVQGPPGLKGVQGEPGPPGR
QGPQGEPGAPGLPGIQGLPGPRGPPGPTGEKGAQGSPGVKGATGPVGPPGA
SGPPGRDGQQGQTGLRGTPGEKGPRGEKGEPGE
PGPQGPPGIPGPPGPPGVPGLQGVPGNNGLPGQPGL
EGDPGCVGSPGLPGPPGLPGQRGEEGPPGMRGSPGPPGPIGPPGFPGAVGSPGLPGLQGERGLTGLTGDKGEPGPPGQPGYPGATGPPGLPGIKGERGYTGSAGEKGEPGPPGSEGLPGPPGPAGPRGERGPQGNSGEKGDQGFQGQPGFPGPPGPPGFPGKVGSPGPPGP
KGSEGIRGPSGLPGSPGPPGPPGIQGPAGLDGLDGKDGKPGLRGDPGPAGPPGLMGPPGFKGKTGHPGLPGPKGDCGKPGPPGSTGRPGAEGEPGAMGPQGRPGPPGHVGPPGPPGQPGPAGI
VGLKGDRGATGERGLAGLPGQPGPPGHPGPPGEPGTDGAAGKEGPPGKQGFYGPPGPKGDPGAAGQKGQAGEKGRAGM
PGKSGSMGPVGPPGPAGERGHPGAPGPSGSPGLPGVPGSMGD
PGRPGPPGKDGAPGRPGAPGSPGLPGQIGREGRQGLPGVRGLPGTKGEKGDIGI
AGENGLPGPPGPQGPPGY
MGATGPMGQQGIPGIPGPPGPMGQPGKAGH
Now, I want to find the character position of all G's when they occur in 'TGA' or 'SGA'. The character positions should be based on the input and NOT the output.
Expected output,
$ some-grep-awk-code
>Q07092
TGA: 573
SGA: 384
The awk solution,
awk -v str='TGA' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=length(str)+pos } }' file
outputs TGA both at character position 25 and 573. However, I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
Really appreciate any help!
Here's a basic awk solution:
Each sequence must span a single line
The resulting positions are relatives to the start of the line
The algorithm first searches the parts of the line that match [^G]G[^G]{6,}, then searches for the occurrences of SGA and TGA in those parts. The implementation is a little tedious, as there's no offset option for the match() and index() functions of awk.
awk '
BEGIN {
regexp = "([^G]G[^G]){6,}"
search["SGA"]
search["TGA"]
}
/^>/ {
print
next
}
{
i0 = 1
s0 = $0
while ( match( s0, regexp ) ) {
head = substr(s0,RSTART,RLENGTH)
tail = substr(s0,RSTART+RLENGTH)
i0 += RSTART - 1
for (s in search) {
s1 = head
i1 = i0
while ( i = index(s1, s) ) {
s1 = substr(s1, i+1)
i1 += i
search[s] = search[s] " " i1-1
}
}
s0 = tail
i0 += RLENGTH
}
for (s in search) {
print s ":" search[s]
search[s] = ""
}
}
'
Example with simplified sequences
>TEST1
SGA.G..G.TGATGA.G..G..G.SGA.....TGA.....SGA.....G..G.SGA.G..G..G.
>TEST2
.G..G.TGA.G..G.G.....G..G..G..G.SGA.G.
>TEST1
SGA: 1 25 54
TGA: 10 13
>TEST2
SGA: 33
TGA:
TODO
Parameterize the regex and the search strings: it's not difficult per se but the current code will run into an infinite loop when a search string is empty or when the regex allows 0-length matches; you'll need to prevent that from happening.
Allow multi-line sequences
Allow overlapping matches for the regex. Basically, it means to look for the next match at RSTART+1 of the previous iteration; that will generate a lot of duplicate results that you need to discard one way or an other.
You could match all occurrences of the regular expression [ST]GA and look at the wider substring surrounding each match to compare that window to (.G.){6}. Here is some code to do that:
$ awk '
/^>/ { label = $0 ORS; next }
{
while (match(substr($0, pos + 1), /[ST]GA/)) {
pos += RSTART
if (len = RLENGTH) {
wbeg = pos - 18 + len # 18 is the length of .G..G..G..G..G..G.
wlen = 2 * 18 - len + (wbeg < 1 ? wbeg - 1 : 0)
wbeg = (wbeg < 1 ? 1 : wbeg) # substr must start from at least 1
window = substr($0, wbeg, wlen)
if (window ~ /.G..G..G..G..G..G./) {
str = substr($0, pos, len)
print label str ":", pos + int(len / 2)
label = ""
}
pos += len - 1
}
if (pos >= length($0)) {
break
}
}
pos = 0
}
' file
>Q07092
SGA: 384
The output only shows SGA: 384 because that is the only portion of the example input that meets the requirement:
I want to only identify the character position of G in SGA/TGA when they occur in the midst of six or more xGx repeats.
With your shown samples please try following awk code. Written and tested in GNU awk should work in any POSIX awk. In this code we could pass how many strings/variables into the function and can get their ALL present index values in the line. Pass all the values needs to be searched into awk variable named keyWords and it will look for all those into the lines.
awk -v keyWords="SGA,TGA" '
BEGIN{
num=split(keyWords,arr1,",")
for(i=1;i<=num;i++){
checkValues[arr1[i]]
}
}
!/>/{
start=diff=prev=""
while(match($0,/(.G.){6,}/)){
lineMatch=substr($0,RSTART,RLENGTH)
start+=(RSTART>1?RSTART-1:RSTART)
diff=(start-prev)
for(key in checkValues){
if(ind=index(lineMatch,key)){
print substr(lineMatch,ind,length(key)),(RSTART?RSTART-1:1)+ind+start+diff
}
prev=start
}
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file
Output with shown samples will be as follows:
>Q07092
SGA: 384

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted you can pull out the first row and the last row to get the min and max. Then you just need go through the bigfile with an awk script to compute the mean.
So for each smallfile small you would run the script
awk -v start=$(head -n 1 small) -v end=$(tail -n 1 small) -f script bigfile
Where script can be something simple like
BEGIN {
sum = 0;
count = 0;
range_start = -1;
range_end = -1;
}
{
irow = int($1)
ival = $2 + 0.0
if (irow >= start && end >= irow) {
if (range_start == -1) {
range_start = NR;
}
sum = sum + ival;
count++;
}
else if (irow > end) {
if (range_end == -1) {
range_end = NR - 1;
}
}
}
END {
print "start =", range_start, "end =", range_end, "mean =", sum / count
}
You can try below:
for r in *; do
awk -v r=$r -F' ' \
'NR==1{b=$2;v=$4;next}{if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
First pass it saves column 2 & 4 into temp variables. For all other passes it checks if filename r is between the begin range (previous coluimn 2) and end range (current column 2).
It then works out the mean and prints the result.

Organising inconsistent values [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
No idea if this is ok to ask here since it's not programming but I have no idea where else to go:
I want to organise the following data in a consistent way. At the moment it's a mess, with only the first two columns (comma separated) consistent. The remaining columns can number anywhere from 1-9 and are usually different.
In other words, I want to sort it so the text matches (all of the value columns in a row, all of the recoil columns in a row, etc). Then I can remove the text and add a header, and it will still make sense.
bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1
bm_wp_upg_o_marksmansight_front, extra, value = 1
bm_wp_m4_upper_reciever_edge, upper_reciever, value = 3, recoil = 1
bm_wp_m4_upper_reciever_round, upper_reciever, value = 1
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2
Any suggestions (even on just where the right place is to actually ask this) would be great.
Context is just raw data ripped from a game file that I'm trying to organise.
I'm afraid regex isn't going to help you much here because of the irregular nature of your input (it would be possible to match it, but it would be a bear to get it all arranged one way or another). This could be done pretty easily with any programming language, but for stuff like this, I always go to awk.
Assuming your input is in a file called input.txt, put the following in a program called parse.awk:
BEGIN {
FS=" *, *";
formatStr = "%32s,%8s,%8s,%8s,%10s,%16s,%8s,%18s,%10s,%10s,%16s,%16s\n";
printf( formatStr, "id", "sight", "value", "zoom", "recoil", "spread_moving", "extra", "upper_receiver", "barrel", "damage", "spread_moving", "concealment" );
}
{
split("",a);
for( i=2; i<=NF; i++ ) {
if( split( $(i), kvp, " *= *" ) == 1 ) {
a[kvp[1]] = "x";
} else {
a[kvp[1]] = gensub( /^\s*|\s*$/, "", "g", kvp[2] );
}
}
printf( formatStr, $1, a["sight"], a["value"], a["zoom"], a["recoil"],
a["spread_moving"], a["extra"], a["upper_receiver"],
a["barrel"], a["damage"], a["spread_moving"], a["concealment"] );
}
Run awk against it:
awk -f parse.awk input.txt
And get your output:
id, sight, value, zoom, recoil, spread_moving, extra, upper_receiver, barrel, damage, spread_moving, concealment
bm_wp_upg_o_t1micro, x, 3, 3, 1, -1, , , , , -1,
bm_wp_upg_o_marksmansight_rear, x, 3, 1, 1, , , , , , ,
bm_wp_upg_o_marksmansight_front, , 1, , , , x, , , , ,
bm_wp_m4_upper_reciever_edge, , 3, , 1, , , , , , ,
bm_wp_m4_upper_reciever_round, , 1, , , , , , , , ,
bm_wp_m4_uupg_b_long, , 4, , , -2, , , x, 1, -2, -2
Note that I chose to just use an 'x' for sight, which seems to a present/absent thing. You can use whatever you want there.
If you're using Linux or a Macintosh, you should have awk available. If you're on Windows, you'll have to install it.
I did make another awk version. I think this should a little easier to read.
All value/column are read from the file to make it as dynamic as possible.
awk -F, '
{
ID[$1]=$2 # use column 1 as index
for (i=3;i<=NF;i++ ) # loop through all fields from #3 to end
{
gsub(/ +/,"",$i) # remove space from field
split($i,a,"=") # split field in name and value a[1] and a[2]
COLUMN[a[1]]++ # store field name as column name
DATA[$1" "a[1]]=a[2] # store data value in DATA using field #1 and column name as index
}
}
END {
printf "%49s ","info" # print info
for (i in COLUMN)
{printf "%15s",i} # print column name
print ""
for (i in ID) # loop through all ID
{
printf "%32s %16s ",i, ID[i] # print ID and info
for (j in COLUMN)
{
printf "%14s ",DATA[i" "j]+0 # print value
}
print ""
}
}' file
Output
info spread recoil zoom concealment spread_moving damage value
bm_wp_m4_upper_reciever_round upper_reciever 0 0 0 0 0 0 1
bm_wp_m4_uupg_b_long barrel 1 0 0 -2 -2 1 4
bm_wp_upg_o_marksmansight_rear sight 1 1 1 0 0 0 3
bm_wp_upg_o_marksmansight_front extra 0 0 0 0 0 0 1
bm_wp_m4_upper_reciever_edge upper_reciever 0 1 0 0 0 0 3
bm_wp_upg_o_t1micro sight 0 1 3 0 -1 0 3
Stick with Ethan's answer — this is just me enjoying myself. (And yes, that makes me pretty weird!)
awk script
awk 'BEGIN {
# f_idx[field] holds the column number c for a field=value item
# f_name[c] holds the names
# f_width[c] holds the width of the widest value (or the field name)
# f_fmt[c] holds the appropriate format
FS = " *, *"; n = 2;
f_name[0] = "id"; f_width[0] = length(f_name[0])
f_name[1] = "type"; f_width[1] = length(f_name[1])
}
{
#-#print NR ":" $0
line[NR,0] = $1
len = length($1)
if (len > f_width[0])
f_width[0] = len
line[NR,1] = $2
len = length($2)
if (len > f_width[1])
f_width[1] = len
for (i = 3; i <= NF; i++)
{
split($i, fv, " = ")
#-#print "1:" fv[1] ", 2:" fv[2]
if (!(fv[1] in f_idx))
{
f_idx[fv[1]] = n
f_width[n++] = length(fv[1])
}
c = f_idx[fv[1]]
f_name[c] = fv[1]
gsub(/ /, "", fv[2])
len = length(fv[2])
if (len > f_width[c])
f_width[c] = len
line[NR,c] = fv[2]
#-#print c ":" f_name[c] ":" f_width[c] ":" line[NR,c]
}
}
END {
for (i = 0; i < n; i++)
f_fmt[i] = "%s%" f_width[i] "s"
#-#for (i = 0; i < n; i++)
#-# printf "%d: (%d) %s %s\n", i, f_width[i], f_name[i], f_fmt[i]
#-# pad = ""
for (j = 0; j < n; j++)
{
printf f_fmt[j], pad, f_name[j]
pad = ","
}
printf "\n"
for (i = 1; i <= NR; i++)
{
pad = ""
for (j = 0; j < n; j++)
{
printf f_fmt[j], pad, line[i,j]
pad = ","
}
printf "\n"
}
}' data
This script adapts to the data it finds in the file. It assigns the column heading 'id' to column 1 of the input, and 'type' to column 2. For each of the sets of values in columns 3..N, it splits up the data into key (in fv[1]) and value (in fv[2]). If the key has not been seen before, it is assigned a new column number, and the key is stored as the column name, and the width of key as the initial column width. Then the value is stored in the appropriate column within the line.
When all the data's read, the script knows what the column headings are going to be. It can then create a set of format strings. Then it prints the headings and all the rows of data. If you don't want fixed width output, then you can simplify the script considerably. There are some (mostly minor) simplifications that could be made to this script.
Data file
bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1
bm_wp_upg_o_marksmansight_front, extra, value = 1
bm_wp_m4_upper_receiver_edge, upper_receiver, value = 3, recoil = 1
bm_wp_m4_upper_receiver_round, upper_receiver, value = 1
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2
Output
id, type,value,zoom,recoil,spread_moving,spread,damage,concealment
bm_wp_upg_o_t1micro, sight, 3, 3, 1, -1, , ,
bm_wp_upg_o_marksmansight_rear, sight, 3, 1, 1, , 1, ,
bm_wp_upg_o_marksmansight_front, extra, 1, , , , , ,
bm_wp_m4_upper_receiver_edge,upper_receiver, 3, , 1, , , ,
bm_wp_m4_upper_receiver_round,upper_receiver, 1, , , , , ,
bm_wp_m4_uupg_b_long, barrel, 4, , , -2, 1, 1, -2

how to write bash script in ubuntu to normalize the index of text comparison

I had a input which is a result from text comparison. It is in a very simple format. It has 3 columns, position, original texts and new texts.
But some of the records looks like this
4 ATCG ATCGC
10 1234 123
How to write the short script to normalize it to
7 G GC
12 34 3
probably, the whole original texts and the whole new text is like below respectively
ACCATCGGA1234
ACCATCGCGA123
"Normalize" means "trying to move the position in the first column to the position that changes gonna occur", or "we would remove the common prefix ATG, add its length 3 to the first field; similarly on line 2 the prefix we remove is length 2"
This script
awk '
BEGIN {OFS = "\t"}
function common_prefix_length(str1, str2, max_len, idx) {
idx = 1
if (length(str1) < length(str2))
max_len = length(str1)
else
max_len = length(str2)
while (substr(str1, idx, 1) == substr(str2, idx, 1) && idx < max_len)
idx++
return idx - 1
}
{
len = common_prefix_length($2, $3)
print $1 + len, substr($2, len + 1), substr($3, len + 1)
}
' << END
4 ATCG ATCGC
10 1234 123
END
outputs
7 G GC
12 34 3

How to match a column name and find out the column position in awk?

I am trying to parse some csv files using awk. I am new to shell scripting and awk.
The csv file i am working on looks something like this :
fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600
I need to find the average AccessTimes of the different functions.
I have been working with awk and have been able to get the average times provided the exact column numbers are specified like $2, $3 etc.
However I need to have a general script in which if i input "minAccessTime" in the command argument, I need the script to print the average AccessTime (instead of explicitly specifying $2 or $3 while using awk).
I have been googling about this and saw in various forums but none of them seems to work.
Can someone tell me how to do this ? It would be of great help !
Thanks in advance!!
This awk script should give you all that you want.
It first evaluates which column you're interested in by using the name passed in as the COLM variable and checking against the first line. It converts this into an index (it's left as the default 0 if it couldn't find the column).
It then basically runs through all other lines in your input file. On all these other lines (assuming you've specified a valid column), it updates the count, sum, minimum and maximum for both the overall data plus each individual function name.
The former is stored in count, sum, min and max. The latter are stored in associative arrays with similar names (with _arr appended).
Then, once all records are read, the END section outputs the information.
NR == 1 {
for (i = 1; i <= NF; i++) {
if ($i == COLM) {
cidx = i;
}
}
}
NR > 1 {
if (cidx > 0) {
count++;
sum += $cidx;
if (count == 1) {
min = $cidx;
max = $cidx;
} else {
if ($cidx < min) { min = $cidx; }
if ($cidx > max) { max = $cidx; }
}
count_arr[$1]++;
sum_arr[$1] += $cidx;
if (count_arr[$1] == 1) {
min_arr[$1] = $cidx;
max_arr[$1] = $cidx;
} else {
if ($cidx < min_arr[$1]) { min_arr[$1] = $cidx; }
if ($cidx > max_arr[$1]) { max_arr[$1] = $cidx; }
}
}
}
END {
if (cidx == 0) {
print "Column '" COLM "' does not exist"
} else {
print "Overall:"
print " Total records = " count
print " Sum of column = " sum
if (count > 0) {
print " Min of column = " min
print " Max of column = " max
print " Avg of column = " sum / count
}
for (task in count_arr) {
print "Function " task ":"
print " Total records = " count_arr[task]
print " Sum of column = " sum_arr[task]
print " Min of column = " min_arr[task]
print " Max of column = " max_arr[task]
print " Avg of column = " sum_arr[task] / count_arr[task]
}
}
}
Storing that script into qq.awk and placing your sample data into qq.in, then running:
awk -F, -vCOLM=minAccessTime -f qq.awk qq.in
generates the following output, which I'm relatively certain will give you every possible piece of information you need:
Overall:
Total records = 7
Sum of column = 2930
Min of column = 200
Max of column = 700
Avg of column = 418.571
Function getStage:
Total records = 3
Sum of column = 1900
Min of column = 600
Max of column = 700
Avg of column = 633.333
Function getInfo:
Total records = 4
Sum of column = 1030
Min of column = 200
Max of column = 300
Avg of column = 257.5
For `maxAccessTime, you get:
Overall:
Total records = 7
Sum of column = 5120
Min of column = 600
Max of column = 1000
Avg of column = 731.429
Function getStage:
Total records = 3
Sum of column = 2600
Min of column = 800
Max of column = 1000
Avg of column = 866.667
Function getInfo:
Total records = 4
Sum of column = 2520
Min of column = 600
Max of column = 700
Avg of column = 630
And, for xyzzy (a non-existent column), you'll see:
Column 'xyzzy' does not exist
If I understand the requirements correctly, you want the average of a column, and you'd like to specify the column by name.
Try the following script (avg.awk):
BEGIN {
FS=",";
}
NR == 1 {
for (i=1; i <= NF; ++i) {
if ($i == SELECTED_FIELD) {
SELECTED_COL=i;
}
}
}
NR > 1 && $1 ~ SELECTED_FNAME {
sum[$1] = sum[$1] + $SELECTED_COL;
count[$1] = count[$1] + 1;
}
END {
for (f in sum) {
printf("Average %s for %s: %d\n", SELECTED_FIELD, f, sum[f] / count[f]);
}
}
and invoke your script like this
awk -v SELECTED_FIELD=minAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -f avg.awk < data.csv
or
awk -v SELECTED_FIELD=maxAccessTime -v SELECTED_FNAME=getInfo -f avg.awk < data.csv
EDIT:
Rewritten to group by function name (assumed to be first field)
EDIT2:
Rewritten to allow additional parameter to filter by function name (assumed to be first field)

Resources