I am reading a CSV file that looks like this:
1.00 cm; 2.00cm ; 3.00 cm; ... ; 100 cm
2.00 cm; 4.00 cm; 6.00 cm; ... ; 100 cm
4.00 cm; 8.00 cm; 12.00 cm; ... ; 100cm
8.00 cm; 16.00 cm; 24.00 cm; ... ; 100cm
I have already written the following code:
CSV.foreach("/Users/testUser/Entwicklung/coverrechner/CoverPaperDE.csv", col_sep: ';') do |row|
puts row[0]
end
This produces the following output:
1.00 cm
2.00 cm
4.00 cm
8.00 cm
Example:
My matrix is constructed like this:
1.1 1.2 1.3 1.4
2.1 2.2 2.3 2.4
3.1 3.2 3.3 3.4
4.1 4.2 4.3 4.4
I want the following output:
1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2 ... 4.4
How can I achieve this?
Do you want the first element of the first row and the fourth element (row[3]) of the last row?
first = last = nil
CSV.foreach("/Users/testUser/Entwicklung/coverrechner/CoverPaperDE.csv", col_sep: ';') do |row|
first ||= row[0]
last = row[3]
end
puts "first is #{first} and last is #{last}"
In addition, this gives exactly the expected output:
csv_str = "1.00 cm; 2.00cm ; 3.00 cm; ... ; 100 cm\n2.00 cm; 4.00 cm; 6.00 cm; ... ; 100 cm\n4.00 cm; 8.00 cm; 12.00 cm; ... ; 100cm\n8.00 cm; 16.00 cm; 24.00 cm; ... ; 100cm"
CSV.parse(csv_str, col_sep: ';') do |row|
puts row[0]
end
output:
1.00 cm
2.00 cm
4.00 cm
8.00 cm
(CSV.parse works the same as your method, just parsing a string instead of reading a file; it is used here for the sake of an SSCCE.)
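If what you actually want is the column-major (transposed) output from your matrix example, here is a minimal sketch (assuming every row has the same number of columns and the file fits in memory; the path is the one from your question):

require "csv"

# Read the whole file into an array of rows, then swap rows and columns.
rows = CSV.read("/Users/testUser/Entwicklung/coverrechner/CoverPaperDE.csv", col_sep: ";")
# Array#transpose flips the matrix; flatten then yields the column-major order
puts rows.transpose.flatten.join(" ")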
I have the following two tables:
Historical_Data_Tbl:

DATE    Cloud%  Wind_KM  Solar_Utiliz  Price
01-Jan  0.85    0        0.1           4.5
02-Jan  0.85    0        0.1           4.5
03-Jan  0.95    15       0             10
04-Jan  0.95    15       0             8
05-Jan  0.6     25       0.35          6
06-Jan  0.6     25       0.35          6
07-Jan  0.2     55       0.8           6
08-Jan  0.2     55       0.8           7
09-Jan  0.55    10       0.5           5.5
10-Jan  0.55    10       0.5           5.5
11-Jan  0.28    12       0.6           2
12-Jan  0.28    12       0.6           2
13-Jan  0.1     40       0.9           3
14-Jan  0.1     40       0.9           3
15-Jan  0.33    17       0.7           8
16-Jan  0.01    17       0.95          1
17-Jan  0.01    17       0.95          1
Forecast_Tbl:

Date  Fcst_Cloud  Fcst_Wind  Fcst_Solar  Max_Cloud  Min_Cloud  Max_Wind  Min_Wind  Max_Solar  Min_Solar
1     0.5         12         0.5         0.7        0.3        27        -3        0.75       0.25
2     0.8         10         0.1         1          0.6        25        -5        0.35       -0.15
3     0.15        15         0.8         0.35       -0.05      30        0         1.05       0.55
4     0.75        10         0.2         0.95       0.55       25        -5        0.45       -0.05
5     0.1         99         0.99        0.3        -0.1       114       84        1.24       0.74
6     0.11        35         0.8         0.31       -0.09      50        20        1.05       0.55
CODE BELOW:
let
//Read in Historical table and set data types
Source = Excel.CurrentWorkbook(){[Name="Historical"]}[Content],
Historical = Table.Buffer(Table.TransformColumnTypes(Source,{
{"DATE", type date}, {"Cloud%", type number}, {"Wind_KM", Int64.Type},
{"Solar_Utiliz", type number}, {"Price", type number}})),
//Read in Forecast table and set data types
Source1 = Excel.CurrentWorkbook(){[Name="Forecast"]}[Content],
Forecast = Table.Buffer(Table.TransformColumnTypes(Source1,{
{"Date", Int64.Type}, {"Fcst_Cloud", type number}, {"Fcst_Wind", Int64.Type},
{"Fcst_Solar", type number}, {"Max_Cloud", type number},
{"Min_Cloud", type number}, {"Max_Wind", Int64.Type}, {"Min_Wind", Int64.Type},
{"Max_Solar", type number}, {"Min_Solar", type number}})),
//Generate list of filtered Historical Table for each row in Forecast Table with aggregations
//Merge aggregations with the associated Forecast row
#"Filtered Historical" = List.Generate(
()=>[t=Table.SelectRows(Historical, (h)=>
h[#"Cloud%"] <= Forecast[Max_Cloud]{0} and h[#"Cloud%"]>= Forecast[Min_Cloud]{0}
and h[Wind_KM] <= Forecast[Max_Wind]{0} and h[Wind_KM] >= Forecast[Min_Wind]{0}
and h[Solar_Utiliz] <= Forecast[Max_Solar]{0} and h[Solar_Utiliz] >= Forecast[Min_Solar]{0}),
idx=0],
each [idx] < Table.RowCount(Forecast),
each [t=Table.SelectRows(Historical, (h)=>
h[#"Cloud%"] <= Forecast[Max_Cloud]{[idx]+1} and h[#"Cloud%"]>= Forecast[Min_Cloud]{[idx]+1}
and h[Wind_KM] <= Forecast[Max_Wind]{[idx]+1} and h[Wind_KM] >= Forecast[Min_Wind]{[idx]+1}
and h[Solar_Utiliz] <= Forecast[Max_Solar]{[idx]+1} and h[Solar_Utiliz] >= Forecast[Min_Solar]{[idx]+1}),
idx=[idx]+1],
each Forecast{[idx]} & Record.FromList(
{List.Count([t][Price]),List.Min([t][Price]), List.Max([t][Price]),
List.Modes([t][Price]){0}, List.Median([t][Price]), List.Average([t][Price])},
{"Count","Min","Max","Mode","Median","Average"})),
#"Converted to Table" = Table.FromList(#"Filtered Historical", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1",
{"Date", "Fcst_Cloud", "Fcst_Wind", "Fcst_Solar", "Max_Cloud", "Min_Cloud", "Max_Wind", "Min_Wind", "Max_Solar", "Min_Solar",
"Count", "Min", "Max", "Mode", "Median", "Average"}),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Column1",{
{"Date", Int64.Type}, {"Fcst_Cloud", Percentage.Type}, {"Fcst_Wind", Int64.Type}, {"Fcst_Solar", type number},
{"Max_Cloud", type number}, {"Min_Cloud", type number}, {"Max_Wind", Int64.Type}, {"Min_Wind", Int64.Type},
{"Max_Solar", type number}, {"Min_Solar", type number}, {"Count", Int64.Type},
{"Min", Currency.Type}, {"Max", Currency.Type}, {"Mode", Currency.Type}, {"Median", Currency.Type}, {"Average", Currency.Type}})
in
#"Changed Type"
And this is the resulting output:
Date  Fcst_Cloud  Fcst_Wind  Fcst_Solar  Max_Cloud  Min_Cloud  Max_Wind  Min_Wind  Max_Solar  Min_Solar  Count  Min  Max  Mode  Median  Average
1     0.5         12         0.5         0.7        0.3        27        0         0.75       0.25       5      5.5  8    6     6       6.2
2     0.8         10         0.1         1          0.6        25        -5        0.35       -0.15      6      4.5  10   4.5   6       6.5
3     0.15        15         0.8         0.35       -0.05      30        0         1.05       0.55       5      1    8    2     2       2.8
4     0.75        10         0.2         0.95       0.55       25        -5        0.45       -0.05      6      4.5  10   4.5   6       6.5
6     0.11        35         0.8         0.31       -0.09      50        20        1.05       0.55       2      3    3    3     3       3
(Screenshot of the output: https://i.stack.imgur.com/8ozB2.png)
The issue is that when one forecast row (for example, where Date "5" should appear in the output table) has no data points within the filtered range of the Historical Data table, it returns a blank for the entire row.
What I would like it to do instead is return the original data from the Forecast_Tbl in the first 10 columns, show "0" in the "Count" column, and use the previous row's "Average" value (in this case 6.5) when no filter criteria are met. Below is the output I would like the table to return:
Date  Fcst_Cloud  Fcst_Wind  Fcst_Solar  Max_Cloud  Min_Cloud  Max_Wind  Min_Wind  Max_Solar  Min_Solar  Count  Min  Max  Mode  Median  Average
1     0.5         12         0.5         0.7        0.3        27        0         0.75       0.25       5      5.5  8    6     6       6.2
2     0.8         10         0.1         1          0.6        25        -5        0.35       -0.15      6      4.5  10   4.5   6       6.5
3     0.15        15         0.8         0.35       -0.05      30        0         1.05       0.55       5      1    8    2     2       2.8
4     0.75        10         0.2         0.95       0.55       25        -5        0.45       -0.05      6      4.5  10   4.5   6       6.5
5     0.1         99         0.99        0.3        -0.1       114       84        1.24       0.74       0                             6.5
6     0.11        35         0.8         0.31       -0.09      50        20        1.05       0.55       2      3    3    3     3       3
I have tried using conditional if statements, but without success.
How about adding an Index column and then, for each row whose generated record errored out (no matching historical rows), rebuilding the record from the matching Forecast row plus Count = 0 and the previous row's Average:
....
{"Count","Min","Max","Mode","Median","Average"})),
#"Converted to Table" = Table.FromList(#"Filtered Historical", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Added Index" = Table.AddIndexColumn(#"Converted to Table", "Index", 0, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index", "Column2", each try if Value.Is([Column1], type record ) then [Column1] else null otherwise Record.Combine({Forecast{[Index]}, [Count = 0, Average = #"Added Index"{[Index]-1}[Column1][Average]]})),
#"Expanded Column1" = Table.ExpandRecordColumn(Table.SelectColumns(#"Added Custom",{"Column2"}), "Column2",
{"Date", "Fcst_Cloud", "Fcst_Wind", "Fcst_Solar", "Max_Cloud", "Min_Cloud", "Max_Wind", "Min_Wind", "Max_Solar", "Min_Solar",
"Count", "Min", "Max", "Mode", "Median", "Average"}),
....
I'm looking for an algorithm that has two input values and one output value and follows this pattern:
Input_A: 10 (When INPUT_B is increased from 0 to 1 in very small steps, it should reach the value '1' 100/10=10 times.)
Input_B => Output
0.025 => 0.25
...
0.05 => 0.50
...
0.075 => 0.75
...
0.1 => 1.00
...
0.125 => 0.25
...
0.15 => 0.50
...
0.175 => 0.75
...
0.2 => 1.00
....
0.9 => 1.00
....
0.95 => 0.50
...
Input_A: 20 (When INPUT_B is increased from 0 to 1 in very small steps, it should reach the value '1' 100/20=5 times.)
Input_B => Output
0.025 => 0.50
...
0.05 => 1.00
...
0.075 => 0.50
...
0.1 => 1.00
...
0.125 => 0.50
...
0.15 => 1.00
...
0.175 => 0.50
...
0.2 => 1.00
....
0.9 => 1.00
....
0.9125 => 0.25
...
0.925 => 0.50
...
0.95 => 1.00
...
I think I managed to create an algorithm that follows the first pattern. But I couldn't find one that follows both.
myAlgorithm(Input_A, Input_B) {
    return (Input_B && Input_B % 0.1 == 0) ? 1 : (Input_B % 0.1) * Input_A;
}
It seems you need something like this (shown here for Input_A = 10, Input_B = 0.175):
A10   = Input_B * Input_A     // 0.175 * 10 = 1.75
AInt  = (int)A10              // integer part = 1
AFrac = A10 - AInt            // fractional part = 0.75
Output = AFrac ? AFrac : 1.0  // special case: a zero fractional part maps to 1.0
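In other words, the output is just the fractional part of Input_A * Input_B, with a zero fraction mapped to 1.0. A minimal Ruby sketch of that formula (the method name is mine, not from the question):

# The output ramps from 0 to 1 exactly Input_A times as Input_B goes from 0 to 1.
def my_algorithm(input_a, input_b)
  frac = (input_a * input_b) % 1.0  # fractional part
  frac.zero? ? 1.0 : frac           # map a zero fraction to 1.0
end

my_algorithm(10, 0.175) # => 0.75
my_algorithm(20, 0.175) # => 0.5
my_algorithm(10, 0.1)   # => 1.0

Beware of floating-point rounding: with arbitrary inputs the product is rarely an exact integer, so in practice you may want to treat values within a small epsilon of an integer as the 1.0 case.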
Here is a very simplified part of my code:
srand 0
WIDTH, HEIGHT = 10, 10
array = Array.new(HEIGHT){ [0]*WIDTH }
require "profile"
10000.times do
  y, x = rand(HEIGHT), rand(WIDTH)
  g = array[y][x] + [-1,+1].sample
  # keep the change unless an in-bounds neighbour would differ from it by more than 1
  array[y][x] = g unless [[y-1,x],[y+1,x],[y,x-1],[y,x+1]].any?{ |y, x|
    y>=0 && y<HEIGHT && x>=0 && x<WIDTH && 1 < (array[y][x] - g).abs
  }
end
We see (ruby 2.0.0p451 (2014-02-24) [i386-mingw32]):
% cumulative self self total
time seconds seconds calls ms/call ms/call name
39.33 0.86 0.86 47471 0.02 0.05 nil#
If we remove the unless clause entirely:
53.48 0.48 0.48 10000 0.05 0.09 nil#
It looks like we get this nil# operation:
on every comparison or boolean operation?
on every Array object creation or block invocation?
It would be nice to get a thorough answer.
I am trying to shift Pandas dataframe column data by groups of the first index level. Here is the demo code:
In [8]: df = mul_df(5,4,3)
In [9]: df
Out[9]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 -0.5505 0.7445 -0.3645
B001 0.9129 -1.0473 -0.5478
B002 0.8016 0.0292 0.9002
B003 2.0744 -0.2942 -0.7117
A0001 B000 0.7064 0.9636 0.2805
B001 0.4763 0.2741 -1.2437
B002 1.1563 0.0525 -0.7603
B003 -0.4334 0.2510 -0.0105
A0002 B000 -0.6443 0.1723 0.2657
B001 1.0719 0.0538 -0.0641
B002 0.6787 -0.3386 0.6757
B003 -0.3940 -1.2927 0.3892
A0003 B000 -0.5862 -0.6320 0.6196
B001 -0.1129 -0.9774 0.7112
B002 0.6303 -1.2849 -0.4777
B003 0.5046 -0.4717 -0.2133
A0004 B000 1.6420 -0.9441 1.7167
B001 0.1487 0.1239 0.6848
B002 0.6139 -1.9085 -1.9508
B003 0.3408 -1.3891 0.6739
In [10]: grp = df.groupby(level=df.index.names[0])
In [11]: grp.shift(1)
Out[11]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 NaN NaN NaN
B001 -0.5505 0.7445 -0.3645
B002 0.9129 -1.0473 -0.5478
B003 0.8016 0.0292 0.9002
A0001 B000 NaN NaN NaN
B001 0.7064 0.9636 0.2805
B002 0.4763 0.2741 -1.2437
B003 1.1563 0.0525 -0.7603
A0002 B000 NaN NaN NaN
B001 -0.6443 0.1723 0.2657
B002 1.0719 0.0538 -0.0641
B003 0.6787 -0.3386 0.6757
A0003 B000 NaN NaN NaN
B001 -0.5862 -0.6320 0.6196
B002 -0.1129 -0.9774 0.7112
B003 0.6303 -1.2849 -0.4777
A0004 B000 NaN NaN NaN
B001 1.6420 -0.9441 1.7167
B002 0.1487 0.1239 0.6848
B003 0.6139 -1.9085 -1.9508
The mul_df() code is given here: How to speed up Pandas multilevel dataframe sum?
Now I want to run grp.shift(1) on a big dataframe:
In [1]: df = mul_df(5000,30,400)
In [2]: grp = df.groupby(level=df.index.names[0])
In [3]: timeit grp.shift(1)
1 loops, best of 3: 5.23 s per loop
5.23 s is too slow. How can I speed it up?
(My computer configuration is: Pentium Dual-Core T4200 @ 2.00 GHz, 3.00 GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, Pandas 0.11.0, numexpr 2.0.1, Anaconda 1.5.0 (32-bit))
How about shifting the whole DataFrame object and then setting the first row of every group to NaN?
import numpy as np

dfs = df.shift(1)
# the cumulative group sizes mark where each subsequent group starts
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
The problem is that the shift operation is not Cython-optimized, so it involves a callback to Python. Compare this with:
In [84]: %timeit grp.shift(1)
1 loops, best of 3: 1.77 s per loop
In [85]: %timeit grp.sum()
1 loops, best of 3: 202 ms per loop
I added an issue for this: https://github.com/pydata/pandas/issues/4095
There is a similar question, and I added an answer there that works for a shift in either direction and of any magnitude: pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift
Code (including test setup) is:
#
# the function to use in apply
#
def replace_shift_overlap(grp,col,N,value):
if (N > 0):
grp[col][:N] = value
else:
grp[col][N:] = value
return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
#
# the apply
#
df = df.groupby(level=0).apply(replace_shift_overlap,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',1,inplace=True)
EDIT: Note that the initial sort really eats into the effectiveness of this. So in some cases the original answer is more effective.
Try this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 15, 30, 45,43,67,22,12,14,54],
'B': [13, 23, 18, 33, 48, 1,7, 56,66,45,32],
'C': [17, 27, 22, 37, 52,77,34,21,22,90,8],
'D': ['a','a','a','a','b','b','b','c','c','c','c']
})
df
#> A B C D
#> 0 10 13 17 a
#> 1 20 23 27 a
#> 2 15 18 22 a
#> 3 30 33 37 a
#> 4 45 48 52 b
#> 5 43 1 77 b
#> 6 67 7 34 b
#> 7 22 56 21 c
#> 8 12 66 22 c
#> 9 14 45 90 c
#> 10 54 32 8 c
def groupby_shift(df, col, groupcol, shift_n, fill_na = np.nan):
'''df: dataframe
col: column need to be shifted
groupcol: group variable
shift_n: how much need to shift
fill_na: how to fill nan value, default is np.nan
'''
rowno = list(df.groupby(groupcol).size().cumsum())
lagged_col = df[col].shift(shift_n)
na_rows = [i for i in range(shift_n)]
for i in rowno:
if i == rowno[len(rowno)-1]:
continue
else:
new = [i + j for j in range(shift_n)]
na_rows.extend(new)
na_rows = list(set(na_rows))
na_rows = [i for i in na_rows if i <= len(lagged_col) - 1]
lagged_col.iloc[na_rows] = fill_na
return lagged_col
df['A_lag_1'] = groupby_shift(df, 'A', 'D', 1)
df
#> A B C D A_lag_1
#> 0 10 13 17 a NaN
#> 1 20 23 27 a 10.0
#> 2 15 18 22 a 20.0
#> 3 30 33 37 a 15.0
#> 4 45 48 52 b NaN
#> 5 43 1 77 b 45.0
#> 6 67 7 34 b 43.0
#> 7 22 56 21 c NaN
#> 8 12 66 22 c 22.0
#> 9 14 45 90 c 12.0
#> 10 54 32 8 c 14.0
I'm using the ruby-prof gem to profile my code.
The results look like this:
%self total self wait child calls name
50.56 31.06 23.45 0.00 7.62 234593 Array#each
14.29 6.62 6.62 0.00 0.00 562480 Array#-
13.63 6.32 6.32 0.00 0.00 157816 Array#|
11.20 5.20 5.20 0.00 0.00 6210903 Hash#default
2.44 46.36 1.13 0.00 46.36 78909 Analyzer#process
2.02 46.36 0.94 0.00 46.36 78908 Analyzer#try
1.70 0.79 0.79 0.00 0.00 562481 UnboundMethod#bind
1.53 7.34 0.71 0.00 6.62 562480 Method#call
1.18 0.55 0.55 0.00 0.00 562480 Kernel#instance_variable_defined?
0.76 46.36 0.35 0.00 46.36 6250 Array#combination
0.37 0.17 0.17 0.00 0.00 105763 Array#concat
0.25 25.19 0.12 0.00 25.07 77842 Enumerable#group_by
0.02 46.36 0.01 0.00 46.36 3125 Enumerator#each
0.02 0.01 0.01 0.00 0.00 78908 Array#empty?
...
I'm sure that my code does not try to access non-existent keys in any of the Hashes.
The question is: what might Hash#default mean here?
And here is a piece of code:
class Analyzer
  def process(level, hypo_right, hypo_wrong)
    if portion = @portions[level]
      selections = @selections[level] - hypo_wrong
      master_size = selections.size
      selections -= hypo_right
      new_portion = portion + selections.size - master_size
      if new_portion > selections.size || new_portion < 0
        return
      elsif new_portion == 0
        try(level, hypo_right, [], hypo_wrong, selections)
      else
        selections.combination(new_portion).each do |hypo_right2|
          try(level, hypo_right, hypo_right2, hypo_wrong, (selections - hypo_right2))
        end
      end
    else
      puts hypo_right.inspect
    end
  end

  def try(level, right, right2, wrong, wrong2)
    local_right = right | right2
    local_wrong = wrong | wrong2
    right2.each { |r| local_wrong.concat(@siblings[r]) }
    unless wrong2.empty?
      grouped_wrong = local_wrong.group_by{ |e| @vars[e] }
      wrong2.each do |w|
        qid = @vars[w]
        if grouped_wrong[qid].size == @limits[qid]
          local_right << (@q_hash[qid] - grouped_wrong[qid])[0]
        end
      end
    end
    process(level + 1, local_right, local_wrong)
  end

  def start
    process(0, [], [])
  end
end

@selections and @portions are Arrays; @q_hash, @siblings, @limits and @vars are Hashes.
Thanks to riffraff, I found the answer: Enumerable#group_by probes its internal result hash, and that lookup calls Hash#default once for each new group key:
require 'ruby-prof'
h = (0..9).inject({}) {|h, x| h[x] = (x+97).chr;h }
a = (0..1000000).collect { rand(100) }
RubyProf.start
g = a.group_by {|x| h[x/10] }
RubyProf::FlatPrinter.new(RubyProf.stop).print(STDOUT)
Thread ID: 17188993880
Total: 1.210938
%self total self wait child calls name
100.00 1.21 1.21 0.00 0.00 1 Array#each
0.00 0.00 0.00 0.00 0.00 10 Hash#default
0.00 1.21 0.00 0.00 1.21 1 Enumerable#group_by
0.00 1.21 0.00 0.00 1.21 1 Object#irb_binding
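For completeness, the mechanism is easy to demonstrate directly: Hash#[] falls back to the hash's default method whenever a key is missing, which is why each first-time group key in group_by triggers one Hash#default call. A tiny illustration (my own, not from the original thread):

h = {}
def h.default(key = nil)  # override the default method to observe the fallback
  puts "Hash#default called for #{key.inspect}"
  super
end

h[:missing]     # prints: Hash#default called for :missing
h[:present] = 1
h[:present]     # no output, the key exists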