Calculate sum of size notated figures? - bash

I want to calculate the total size of all the .mobi files from this link (it's a good link, by the way).
To make this a learning experience, I have built a pipeline (let's call it a) that outputs all the sizes from that page, which look like:
189K
20M
549K
2.2M
1.9M
3.1M
2.5M
513K
260K
1.1M
2.8M
5.1M
3.7M
1.5M
5.6M
1.0M
5.6M
1.5M
4.9M
3.4M
810K
My target is to get the total size (e.g. 50.50M, or 50000K): the sum of all these numbers.
My question is: how do I calculate that total using piping (a | some_other_commands)? Answers using Python or any other language (preferably one-liners) are welcome. Thanks a lot.

For fun, a solution in shell:
a | sed -e 's/M$/ 1024 * +/' -e 's/K$/ +/' | dc -e '0' -f - -e 'p'
sed rewrites each M entry into RPN that multiplies by 1024 and adds, and each K entry into a plain add; dc pushes an initial 0, executes the stream, and prints the total (in K).

Perl one-liner:
a | perl -ne 's/^([\d.]+)M$/$1*1024/e;$sum+=$_; END{print $sum."K"}'
It assumes that all entries are in either kilobytes or megabytes, as shown in the OP's input.
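If other suffixes can appear, the same lookup-table idea generalizes. A minimal Ruby sketch, assuming K/M/G suffixes and output in K (the SCALE table and the script itself are mine, not from the answer above):

# Sum size-suffixed figures from stdin; lines without a K/M/G suffix contribute nothing.
SCALE = { "K" => 1, "M" => 1024, "G" => 1024 * 1024 }   # factors relative to K
total = ARGF.each_line.inject(0) do |t, line|
  line.chomp =~ /\A([\d.]+)([KMG])\z/ ? t + $1.to_f * SCALE[$2] : t
end
puts "#{total.round(1)}K"

Run it as a | ruby sum_sizes.rb (the file name is hypothetical).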

Sigh, someone says “one-liner” and all my code-golf reflexes fire...
ruby -e 'puts $<.read.split.inject(0){ |m,e| m += e.to_f * { "M" => 1, "K" => 0.001 }[e[-1,1]]}.to_s+"M"'
or, with some shortcuts...
ruby -ne 'p @e=@e.to_f+$_.to_f*{"M"=>1,"K"=>0.001}[$_[-2,1]]'
Update: Heh, OK, hard to read. The OP asked for a "one-liner". :-) Here is the expanded version:
#!/usr/bin/env ruby
total = 0
while s = gets                                             # read a line
  scalefactorMK = s.chomp[-1,1]                            # get the M or K suffix
  scalefactor = { 'M' => 1, 'K' => 0.001 }[scalefactorMK]  # numeric scale
  total += s.to_f * scalefactor                            # accumulate total
end
puts "%5.1fM" % [total]

If you have Ruby (1.9+):
require 'net/http'
url = "http://hewgill.com/~greg/stackoverflow/ebooks/"
response = Net::HTTP.get_response( URI.parse(url) )
data = response.body
total = 0
data.split("\n").each do |x|
  if x =~ /\.mobi/
    size = x.split(/\s+/)[-1]
    c = case size[-1]
        when 'K' then 1024
        when 'M' then 1024 * 1024
        when 'G' then 1024 * 1024 * 1024
        end
    total += size[0..-2].to_f * c   # strip the suffix, keep the decimal part
  end
end
puts "Total size: %.2f MB" % ( total / (1024.0 * 1024.0) )

awk (assuming files smaller than 1K don't add substantially to the total):
a | awk '/K/ {sum += $1/1024} /M/ {sum += $1} END {printf("%.2fM\n", sum)}'

Related

Awk Standard deviation for each unique identifier

I have the following dataset, with multiple different ids in column 1, and I wish to calculate the mean and standard deviation of column 2 for each id:
123456 0.1234
123456 0.5673
123456 0.0011
123456 -0.0947
123457 0.9938
123457 0.0001
123457 0.2839
I have the following code to get the mean per id, but I'm struggling to amend it to get the SD as well:
awk '{sum4[$1] += $2; count4[$1]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' < want3.txt > mean_id.txt
The desired output is a file of id, mean, and SD:
123456 0.149275 0.2926
123457 0.425933 0.5118
Any advice would be much appreciated.
Thanks
Here is another approach, which is more memory-efficient but possibly less precise when the mean is large:
$ awk -v t=1 '{s[$1]+=$2; ss[$1]+=$2*$2; c[$1]++}
      END {for(k in s) print k, m=s[k]/c[k], sqrt((ss[k]-m^2*c[k])/(c[k]-t))}' file
123456 0.149275 0.292628
123457 0.425933 0.51185
This computes the sample standard deviation. If you have the full population rather than just a sample, you can set t=0 to get the population standard deviation, which will be slightly lower; for large N the two are practically equivalent (within the margin of error due to measurement error).
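The one-pass formula in the END block rests on a standard identity (a quick derivation, not part of the original answer). With m = \frac{1}{n}\sum_i x_i:

\sum_{i=1}^{n} (x_i - m)^2 = \sum_i x_i^2 - 2m\sum_i x_i + nm^2 = \sum_i x_i^2 - nm^2

so only the running sums s[k] = \sum x and ss[k] = \sum x^2 are needed, never the individual values. The precision caveat comes from the subtraction: when the mean is large, \sum x^2 and nm^2 are two large, nearly equal numbers.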
With GNU awk, derived from Ivan's answer, which used the population standard deviation (division by n); I switched to the sample standard deviation (division by n-1).
awk '
{
    numrec[$1] += 1
    sum[$1] += $2
    array[$1, numrec[$1]] = $2
}
END {
    for (w in numrec) {
        for (x = 1; x <= numrec[w]; x++)
            sumsq[w] += (array[w, x] - (sum[w] / numrec[w]))^2
        printf("%d %.6f %.4f\n", w, sum[w]/numrec[w], sqrt(sumsq[w]/(numrec[w]-1)))
    }
}
' file
Output:
123456 0.149275 0.2926
123457 0.425933 0.5118

Ruby takes double the amount of RAM it should

I was trying things out on my home server and wrote a little Ruby program that fills up the RAM by a given amount. But I actually have to halve the number of bytes I want to put into RAM. Am I missing something here, or is this a bug?
Here's the code:
class RAM
  def initialize
    @b = ''
  end

  def fill_ram(size)
    puts 'Choose if you want to set the size in bytes, megabytes or gigabytes.'
    answer = ''
    valid = ['bytes', 'megabytes', 'gigabytes']
    until valid.include?(answer)
      answer = gets.chomp.downcase
      if answer == 'bytes'
        size = size * 0.5
      elsif answer == 'megabytes'
        size = size * 1024 * 1024 * 0.5
      elsif answer == 'gigabytes'
        size = size * 1024 * 1024 * 1024 * 0.5
      else
        puts 'Please choose between bytes, megabytes or gigabytes.'
      end
    end
    size1 = size
    if @b.bytesize != 0
      size1 = size + @b.bytesize
    end
    until @b.bytesize == size1
      @b << '0' * size
    end
    size = 0
  end

  def clear_ram
    exit
  end

  def read_ram
    puts 'At the moment this program fills ' + @b.bytesize.to_s + ' bytes of RAM'
  end
end
Just imagine the "* 0.5" on each line weren't there.
I tested it in IRB: I created a new RAM object and filled it with 1000 megabytes of data. In my case it actually filled the RAM with 2000 megabytes, so I added the "times 0.5" to each line, but that can't be the real solution.
When I run it I get:
Choose if you want to set the size in bytes, megabytes or gigabytes.
bytes
At the moment this program fills 512 bytes of RAM
I think the problem is a missing check of the encoding.
I ran my test in US-ASCII (one character = 1 byte).
If you run it in UTF-16, that would explain your problem.
Can you run the following code to check your encoding:
p Encoding.default_internal
p Encoding.default_external
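For illustration, a minimal sketch of how the encoding changes bytesize (the string and target encoding here are my own example, not from the question):

s = '0' * 4                  # four ASCII characters
p s.encoding                 # e.g. #<Encoding:UTF-8>
p s.bytesize                 # => 4: '0' is a single byte in UTF-8/US-ASCII

t = s.encode('UTF-16LE')     # re-encode with two bytes per character
p t.bytesize                 # => 8: double the byte count for the same text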
After reading the comment: the result of your script depends on the parameter of RAM.fill_ram. How do you start your script, and how often do you call RAM.fill_ram?
Please provide the full code.
I called my example with:
r = RAM.new
r.fill_ram(1024)
r.read_ram

Awk Calc Avg Rows Below Certain Line

I'm having trouble using awk to calculate the average of specific numbers in a column BELOW a specific text identifier. I have two columns of data, and I want to start the average at a common identifier that repeats: 01/1991. So awk should calculate the average of each line beginning with 01/1991 plus the next 21 lines, for a total of 22 rows covering the years 1991-2012. The desired output is one average per TextID/Name entry over all the January (01) values for 1991-2012, as shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using the code shown below, but the average starts being calculated BEFORE the identifier line rather than on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks; explanations of the solution are greatly appreciated! I have edited the original post with more description - thank you again.
If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991". (Also note that $1="01/1991" assigns to $1 rather than comparing; you want ==.) And NR%22==0 looks at line numbers divisible by 22, not at the 22 lines after the point you care about.
You can do something like this instead:
awk '
  BEGIN { l = -1; }
  $1 == "01/1991," {
    l = 22;
    s = 0;
  }
  l > 0  { s += $2; l--; }
  l == 0 { print s/22; l--; }'
It has a counter l that is set to the number of lines to sum; it then accumulates that many lines and prints the average.
You may want to consider simply summing all lines from one 01/1991 to the next, though, which might be more robust; see the sketch below.
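For comparison, a minimal Ruby sketch of that per-block idea: average every data line within each TextID block, so nothing is hard-coded to 22 lines (the input format is taken from the sample data above; everything else is mine):

#!/usr/bin/env ruby
name, sum, count = nil, 0.0, 0

# Print the average for the block just finished, if any.
flush = lambda do
  printf("%s\nAvg: %.2f\n", name, sum / count) if name && count > 0
end

ARGF.each_line do |line|
  if line.start_with?("TextID/Name")        # a new block begins
    flush.call
    name, sum, count = line.chomp, 0.0, 0
  elsif line =~ %r{\A\d+/\d+,\s*([\d.]+)}   # a "01/1991, 57.67" data line
    sum   += $1.to_f
    count += 1
  end
end
flush.call                                  # don't forget the final block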
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
$have_started = 0;
$count = 0;
$sum = 0;
while (<>) {
    # Start summing values at the first 01/1991 line
    $have_started = 1 if /^01\/1991,/;
    # Grab the value after the date and comma
    if ($have_started && /\d+\/\d+,\s+([\d.]+)/) {
        $sum += $1;
        $count++;
    }
}
print "Average of all values = " . $sum/$count . "\n";
Run it like so:
$ cat your-text-file.txt | above-perl-script.pl

Calculating the difference between durations with milliseconds in Ruby

TL;DR: I need to get the difference between HH:MM:SS.ms and HH:MM:SS.ms as HH:MM:SS.ms.
What I need:
Here's a tricky one. I'm trying to calculate the difference between two timestamps such as the following:
In: 00:00:10.520
Out: 00:00:23.720
Should deliver:
Diff: 00:00:13.200
I thought I'd parse the times into actual Time objects and use the difference there. This works great in the previous case, and returns 00:0:13.200.
What doesn't work:
However, for some inputs this doesn't work right, as Ruby uses usec rather than msec:
In: 00:2:22.760
Out: 00:2:31.520
Diff: 00:0:8.999760
Obviously, the difference should be 00:00:8.760 and not 00:00:8.999760. I'm really tempted to just tdiff.usec.to_s.gsub('999','') …
My code so far:
Here's my code so far (these are parsed from the input strings like "0:00:10:520").
tin_first, tin_second = ins.split(".")
tin_hours, tin_minutes, tin_seconds = tin_first.split(":")
tin_usec = tin_second * 1000 # note: tin_second is still a String here, so * 1000 repeats the string rather than multiplying
tin = Time.gm(0, 1, 1, tin_hours, tin_minutes, tin_seconds, tin_usec)
The same happens for tout. Then:
tdiff = Time.at(tout-tin)
For the output, I use:
"00:#{tdiff.min}:#{tdiff.sec}.#{tdiff.usec}"
Is there any faster way to do this? Remember, I just want to have the difference between two times. What am I missing?
I'm using Ruby 1.9.3p6 at the moment.
Using Time:
require 'time' # Needed for Time.parse

def time_diff(time1_str, time2_str)
  t = Time.at( Time.parse(time2_str) - Time.parse(time1_str) )
  (t - t.gmt_offset).strftime("%H:%M:%S.%L")
end
out_time = "00:00:24.240"
in_time = "00:00:14.520"
p time_diff(in_time, out_time)
#=> "00:00:09.720"
Here's a solution that doesn't rely on Time:
def slhck_diff( t1, t2 )
  ms_to_time( time_as_ms(t2) - time_as_ms(t1) )
end

# Converts "00:2:22.760" to 142760
def time_as_ms( time_str )
  re = /(\d+):(\d+):(\d+)(?:\.(\d+))?/
  parts = time_str.match(re).to_a.map(&:to_i)
  parts[4] + (parts[3] + (parts[2] + parts[1]*60)*60)*1000
end

# Converts 142760 to "00:02:22.760"
def ms_to_time(ms)
  m = ms.floor / 60000
  "%02i:%02i:%06.3f" % [ m/60, m%60, ms/1000.0 % 60 ]
end
t1 = "00:00:10.520"
t2 = "01:00:23.720"
p slhck_diff(t1,t2)
#=> "01:00:13.200"
t1 = "00:2:22.760"
t2 = "00:2:31.520"
p slhck_diff(t1,t2)
#=> "00:00:08.760"
I figured the following could work:
out_time = "00:00:24.240"
in_time = "00:00:14.520"
diff = Time.parse(out_time) - Time.parse(in_time)
Time.at(diff).strftime("%H:%M:%S.%L")
# => "01:00:09.720"
It does print 01 for the hour, which I didn't understand at first; presumably Time.at interprets the value as a local-time timestamp, so the formatted hour is shifted by the time-zone offset (UTC+1 here).
In the meantime, I used:
Time.at(diff).strftime("00:%M:%S.%L")
# => "00:00:09.720"
Any answer that does this better will get an upvote or the accept, of course.
in_time = "00:02:22.760"
out_time = "00:02:31.520"
diff = (Time.parse(out_time) - Time.parse(in_time))*1000
puts diff
Output:
8760.0 milliseconds
Time.parse(out_time) - Time.parse(in_time) gives the result in seconds, so it is multiplied by 1000 to convert it to milliseconds.
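If the milliseconds then need to be formatted back as HH:MM:SS.ms, plain integer math does it; a small sketch, assuming a non-negative integer number of milliseconds:

ms = 8760
s, ms = ms.divmod(1000)    # split off milliseconds
m, s  = s.divmod(60)       # split off seconds
h, m  = m.divmod(60)       # split off minutes
puts format("%02d:%02d:%02d.%03d", h, m, s, ms)   # => 00:00:08.760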

Ruby data extraction from a text file

I have a relatively big text file with blocks of data laid out like this:
ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 0.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 0.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 0.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
(they contain more lines and then are repeated)
I would first like to extract the numerical value after TUNE X = and write those values to a text file. Then I would like to extract the numerical values of Line Frequency and Amplitude as pairs and write them to a file.
My question is the following: although I could make something more or less working using a simple regexp, I'm not convinced it's the right way to do it, and I would like some advice or examples showing how to do this efficiently in Ruby.
Generally (not tested):
toggle = false
File.open("file").each do |line|
  if line[/TUNE/]
    puts line.split("=", 2)[-1].strip
  end
  if line[/Line Frequency/]
    toggle = true
    next
  end
  if toggle
    a = line.split
    puts "#{a[1]} #{a[2]}"
  end
end
Go through the file line by line, check for /TUNE/, then split on "=" and take the last item.
Do the same for lines containing /Line Frequency/ and set the toggle flag to true; this signifies that the following lines contain the data you want. Since the frequency and amplitude are in fields 2 and 3, split each line and take those positions. Generally, this is the idea. As for toggling, you may want to reset the flag to false at the start of the next block using a pattern (e.g. SIGNAL CASE or ANALYSIS).
file = File.open("data.dat")
@tune_x    = []
@frequency = []
@amplitude = []
file.each_line do |line|
  tune_x_scan = line.scan(/TUNE X = (\d*\.\d*)/)
  data_scan   = line.scan(/(\d*\.\d*E[-+]\d*)/)
  @tune_x    << tune_x_scan[0][0] unless tune_x_scan.empty?
  @frequency << data_scan[0][0]   unless data_scan.empty?   # field 2 of a data line
  @amplitude << data_scan[1][0]   if data_scan.size > 1     # field 3 of a data line
end
There are lots of ways to do it. This is a simple first pass at it:
text = 'ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 0.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 0.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 0.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 1.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 1.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 1.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
ANALYSIS OF X SIGNAL, CASE: 1
TUNE X = 2.2561890123390808
Line Frequency Amplitude Phase Error mx my ms p
1 2.2561890123391E+00 0.204316425208E-01 0.164145385871E+03 0.00000000000E+00 1 0 0 0
2 2.2562865535359E+00 0.288712798671E-01 -.161563284233E+03 0.97541196785E-04 1 0 0 0
'
require 'stringio'
pretend_file = StringIO.new(text, 'r')
That gives us a StringIO object we can pretend is a file. We can read from it by lines.
I changed the numbers a bit just to make it easier to see that they are being captured in the output.
pretend_file.each_line do |li|
  case
  when li =~ /^TUNE.+?=\s+(.+)/
    print $1.strip, "\n"
  when li =~ /^\d+\s+(\S+)\s+(\S+)/
    print $1, ' ', $2, "\n"
  end
end
For real use you'd want to change the print statements to a file handle: fileh.print
The output looks like:
# >> 0.2561890123390808
# >> 0.2561890123391E+00 0.204316425208E-01
# >> 0.2562865535359E+00 0.288712798671E-01
# >> 1.2561890123390808
# >> 1.2561890123391E+00 0.204316425208E-01
# >> 1.2562865535359E+00 0.288712798671E-01
# >> 2.2561890123390808
# >> 2.2561890123391E+00 0.204316425208E-01
# >> 2.2562865535359E+00 0.288712798671E-01
You can read the file line by line and cut each line by character position. For example:
to extract TUNE X, take characters 10 through 27 on line 2;
to extract LINE FREQUENCY, take characters 3 through 22 on line 6+n.
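A minimal Ruby sketch of that fixed-column idea (the character ranges come from the answer above; the line-offset handling is mine and would need checking against the real file):

# Ruby string indices are 0-based, so "characters 10 through 27" becomes [9..26].
File.foreach("data.dat").with_index(1) do |line, n|
  puts line[9..26] if n == 2    # TUNE X value on line 2
  puts line[2..21] if n >= 6    # frequency/amplitude region from line 6 on
end

Note that this is fragile: any change in column widths breaks it, which is why the regexp approaches above are generally preferable.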
