single stage cluster sampling

Choose a single-stage cluster sample of size 1,000 students and then calculate the mean value for both variables along with their 95% confidence intervals.
V1 is the class number, V2 is the student number in the class, and V3 and V4 are the test results.
c1 <- sample(unique(test$V1), size=1000, replace=F)
c1_s <- test[test$V1 %in% c1,]
I have done the coding for the sampling, but I need help with the code for calculating the mean values of V3 and V4 along with their 95% CIs.
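A minimal sketch for that part, assuming the sampled data frame c1_s from the snippet above and treating the sampled students as approximately independent (a design-based interval that accounts for the clustering would use the survey package instead):
ci95 <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  se <- sd(x) / sqrt(length(x))
  c(mean = m,
    lower = m - qt(0.975, df = length(x) - 1) * se,
    upper = m + qt(0.975, df = length(x) - 1) * se)
}
ci95(c1_s$V3)  # mean and 95% CI for V3
ci95(c1_s$V4)  # mean and 95% CI for V4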


How can I get results faster for my MINLP optimization with APM GEKKO?

I am trying to run an optimization for the energy supply of a domestic house. The energy demand should be satisfied by a heat pump, PV modules, an electric water heater, and the public electricity grid. The energy system also contains a battery storage and a heat storage. The only (binary) integer variable in my program is the heat pump. My goal is to optimize the system over a timeframe of one year (8760 timesteps). When I run the code with 1800 timesteps, I get results in about 500 seconds. For 4500 timesteps it already takes about 9 hours. For the full 8760 timesteps the code is still running (for about 24 hours now) without any solution. In earlier iterations of the code it ran for more than a week without generating results. I have already tried a few things I read here to speed up the optimization. Is there any way I can get the program to find a solution faster? Since I am a beginner at Python, it is very much possible that my code is inefficient. I would very much appreciate any idea that gets me faster results, or a way to estimate the time the program needs to find a solution. Thank you very much in advance.
Here is my code; I have shortened the energy-demand arrays to 100 values to keep it brief:
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt
timesteps= 100
m = GEKKO(remote=False)
t = np.linspace(0, timesteps-1, timesteps) # time
m.time = t
m.options.SOLVER = 1
m.options.IMODE = 6
m.options.NODES = 2
m.options.REDUCE=3
m.solver_options = ['minlp_maximum_iterations 1000',\
'minlp_max_iter_with_int_sol 1000',\
'minlp_integer_tol 1.0e-1',\
'minlp_branch_method 1',\
'objective_convergence_tolerance 1.0e-4',\
'minlp_gap_tol 9e-1']
# Energy Demand
#1. electricity
EL_Demand_Arr1=np.array([1.9200000,1.4890000,1.4920000,1.1300000,0.64099997,0.58600003,0.58399999,0.61000001,0.54900002,0.59500003,0.92699999,0.95599997,0.91000003,1.1450000,1.1090000,1.6360000,1.4740000,1.4680001,2.6150000,2.1810000,1.2320000,1.3700000,0.96899998,1.3220000,1.1880000,0.64399999,0.53899997,0.55299997,0.52899998,0.56099999,0.54600000,0.80000001,1.1350000,0.70700002,1.1680000,1.0440000,2.3160000,1.6420000,2.2370000,2.8870001,1.8550000,1.4030000,0.70599997,1.4980000,3.4879999,1.5130000,1.4349999,1.3520000,1.0530000,0.51700002,0.55000001,0.52800000,0.52999997,0.56199998,0.53700000,0.58999997,0.53500003,0.92500001,1.3490000,0.66299999,4.3810000,1.0200000,0.79799998,0.77899998,1.0840000,2.1530001,3.7449999,5.3490000,1.8710001,2.3610001,0.78799999,0.47099999,0.56800002,0.51700002,0.54799998,0.55699998,0.51400000,0.56500000,3.2790000,2.2750001,1.2300000,0.97899997,0.78200001,1.0140001,0.77800000,0.58099997,0.52999997,0.55900002,1.1770000,1.5400000,1.4349999,2.0400000,2.2790000,1.6520000,1.6450000,1.2830000,0.55800003,0.52499998,0.51899999,0.53799999])
EL_Demand_Arr2=EL_Demand_Arr1.round(decimals=3)
EL_Demand_Arr=EL_Demand_Arr2[0:timesteps]
EL_Demand=m.Param(EL_Demand_Arr,name='EL_Demand')
#2. heat
H_Demand_Arr1=np.array([1.0960000,1.0790000,1.1590000,1.1760000,1.6940000,2.2639999,2.1450000,2.0769999,2.0720000,2.0300000,1.9069999,1.8810000,1.7880000,1.8180000,1.8049999,2.0430000,2.1489999,2.1700001,2.1830001,2.1910000,1.9920000,1.5290000,1.1810000,1.0400000,1.4310000,1.4110000,1.4700000,1.4900000,1.8880000,2.4530001,2.2809999,2.3199999,2.2960000,2.3299999,2.1630001,2.1289999,2.0599999,2.1090000,2.0940001,2.3450000,2.4380000,2.4679999,2.4630001,2.4480000,2.2219999,1.8480000,1.5779999,1.4310000,1.5000000,1.4790000,1.5410000,1.5620000,1.9790000,2.5720000,2.3910000,2.4319999,2.4070001,2.4430001,2.2679999,2.2309999,2.1589999,2.2110000,2.1949999,2.4579999,2.5560000,2.5869999,2.5820000,2.5660000,2.3290000,1.9380000,1.6540000,1.5000000,1.7160000,1.6930000,1.7630000,1.7869999,2.2650001,2.9430001,2.7360001,2.7839999,2.7539999,2.7950001,2.5950000,2.5539999,2.4710000,2.5300000,2.5120001,2.8130000,2.9250000,2.9600000,2.9549999,2.9370000,2.6659999,2.2170000,1.8930000,1.7160000,1.7980000,1.7670000,1.8789999,1.9160000])
H_Demand_Arr2=H_Demand_Arr1.round(decimals=3)
H_Demand_Arr=H_Demand_Arr2[0:timesteps]
H_Demand=m.Param(H_Demand_Arr,name='H_Demand')
#3. Domestic Hot Water
DHW_Demand_Arr1=np.array([1.7420000,0,0,2.0320001,0,0,3.7739999,2.4960001,3.3670001,0,2.4380000,1.1030000,0,0,0,3.1350000,2.2060001,0,4.4120002,0,0,0,0.87099999,1.5089999,0,0,0,0,0,0.87099999,0.81300002,1.1610000,2.5539999,1.6260000,0,0,0.63900000,0,3.4830000,2.8450000,2.4960001,7.1409998,5.7480001,2.3800001,3.1930001,0,1.1610000,0,0,0,0,0,0,0,2.6129999,1.9160000,4.2379999,0.34799999,5.4569998,0,0,2.8450000,0,0,0,0,0,2.4960001,1.6260000,0,2.5539999,0,0,0,0,0,1.6260000,0,3.0190001,0,2.8450000,1.1030000,2.9030001,0,0,0,0.98699999,0,1.1610000,0.34799999,1.3930000,1.2770000,4.4120002,0,0,0,0,1.8580000,0,0.98699999])
DHW_Demand_Arr2=DHW_Demand_Arr1.round(decimals=3)
DHW_Demand_Arr=DHW_Demand_Arr2[0:timesteps]
DHW_Demand=m.Param(DHW_Demand_Arr,name='TWW_BED')
#4. electricity production from PV
PV_P_Arr1=np.array([0,0,0,0,0,0,0,0,0,0.057000000,0.14399999,0.30500001,0.13600001,0.28900000,0.22000000,0.0040000002,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.061999999,0.78899997,0.56300002,0.13600001,0.052999999,0.017000001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.037000000,0.098999999,0.15000001,0.11200000,0,0.12600000,0.032000002,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0040000002,0.73600000,1.8250000,2.4020000,3.1870000,0.66500002,0.045000002,0,0,0,0,0,0,0,0,0,0,0,0,0])
PV_P_Arr2=PV_P_Arr1.round(decimals=3)
PV_P_Arr=PV_P_Arr2[0:timesteps]
PV_P=m.Param(PV_P_Arr,name='PV_P')
# Heat pump "bit": '1' during the heating season and '0' outside it, to tell the program that the heat pump may only be used during the heating season
HP_Bit_Arr1=np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1])
HP_Bit_Arr=HP_Bit_Arr1[0:timesteps]
HP_Bit=m.Param(HP_Bit_Arr,name='HP_Bit')
# Battery Storage
B_S = m.SV(0,lb=0)
B_S.FSTATUS=1
m.fix_initial(B_S,0)
B_S_Load = m.SV(0,lb=0) #Loading Battery
B_S_Recover = m.SV(0,lb=0) #recover energy from the battery
eff_B_S = 0.95 #efficiency Battery
# Heat Storage
H_S = m.SV(0,lb=0)
H_S.FSTATUS=1
m.fix_initial(H_S,0)
H_S_Load = m.SV(lb=0) #Loading Heat Storage
H_S_Recover = m.SV(lb=0) #Recover Energy from Heat Storage
eff_H_S = 0.9 #efficiency Heat Storage
#Heat Pump
# Binary Variable for Heat Pump (it can either be turned on '1' or off '0')
P_HP_Binary=HP_Bit*m.sos1([0,1])
P_HP_Binary.STATUS = 1
#Electrical sizing of the Heat Pump
P_el_HP1= m.SV(0,lb=0)
P_el_HP1.FSTATUS=1
#Electrical Water Heater
EH=m.SV(0,lb=0)
EH.FSTATUS=1
# The Power of the Heat Pump multiplied with the Binary Variable gives the actual Output of the Heat Pump
P_el_HP=m.Intermediate(P_HP_Binary*P_el_HP1)
COP_HP=3.5 #COP of the Heat Pump
Q_HP=m.Intermediate(COP_HP*P_el_HP) # thermal Energy Output of the Heat Pump
# The objective of this optimization is to minimize the cost of the energy system. Since you only pay for the maximum value of the heat pump, energy storages, and electric water heater (not for the value at each timestep), I define an FV that describes the maximum value of each component
P_el_HP_max=m.FV(lb=0) #Heat Pump
P_el_HP_max.STATUS=1
H_S_max=m.FV(lb=0) #Heat Storage
H_S_max.STATUS=1
B_S_max=m.FV(lb=0) #Battery
B_S_max.STATUS=1
EH_max=m.FV(lb=0) #Electrical Water Heater
EH_max.STATUS=1
# We have energy production from PV; there is a possibility to feed energy that is not needed into the public grid
I_Excess=m.Var(0,lb=0)
# In case we have more demand for electrical energy than production from PV, we have the possibility to get energy from the public grid
I_feed_out=m.SV(0,lb=0)
I_feed_out.FSTATUS=1
# Volume of the heat storage in m^3 (energy in kWh converted to kJ via the factor 3600; water with rho = 1000 kg/m^3, c_p = 4.18 kJ/(kg K), dT = 35-10 K)
Vol_HS=m.Intermediate((H_S_max*3600)/(1000*4.18*(35-10)))
# boundary conditions
m.Equations([PV_P +I_feed_out + B_S_Recover - P_el_HP - B_S_Load - I_Excess - EH == EL_Demand, #Energy Balance needs to satisfy the Demand
B_S.dt() == B_S_Load - B_S_Recover/eff_B_S, #Loading and Recovery of the Battery
B_S_Load * B_S_Recover == 0, #It is not allowed to Load and Recover at the same Time, at least one of both needs to be equal to '0' at each Timestep
P_el_HP*COP_HP + H_S_Recover - H_S_Load + EH == H_Demand + DHW_Demand, #The Demand of Heat and DHW needs to be satisfied at each timestep
H_S.dt() == H_S_Load - H_S_Recover/eff_H_S, #Loading and recovery of the Heat Storage
H_S_Load * H_S_Recover == 0, #It is not allowed to Load and Recover at the same Time, at least one of both needs to be equal to '0' at each Timestep
# The maximum value of the energy system components is the upper bound for the value at each timestep
P_el_HP1 <= P_el_HP_max,
P_el_HP1 >= 0.4*P_el_HP_max, # the heat pump is a variable-speed heat pump and has a minimum output of 40% of the nominal power
H_S <= H_S_max,
B_S <= B_S_max,
EH <= EH_max,])
#Objective: minimize the cost of the energy system (the cost of components that only need to be bought once is divided by the number of timesteps)
Objective=(((P_el_HP_max*1918.4)*P_el_HP_max+(EH_max*50)+B_S_max*1664.9*B_S_max+(Vol_HS*2499.3*Vol_HS))/(20*timesteps)-0.05*I_Excess+0.35*(I_feed_out))
m.Minimize(Objective)
m.solve(disp=True)
#Print Results
print("Nominal Power of the Heat Pump=",max(P_el_HP),"kW")
print("maximum Capacity of the Heat Storage=",max(H_S),"kW")
print("Volume of the Heat Storage=", max(Vol_HS),"m^3")
print("maximum Capacity of the Battery", max(B_S),"kW")
print("Electricity from the Public Grid",sum(I_feed_out[0:timesteps-1]))
# Plot results
fig, axes = plt.subplots(6, 1, figsize=(5, 5.1), sharex=True)
axes = axes.ravel()
ax = axes[0]
ax.plot(t, EL_Demand, 'r-', label='Electrical Demand',lw=1)
ax.plot(t, PV_P,'b:', label='PV Production',lw=1) # e.g. a generator (which we do not have in our energy system)
ax = axes[1]
ax.plot(t, EL_Demand, 'r-', label='Electrical Demand',lw=1)
ax.plot(t,I_feed_out, 'k--', label='Electricity from the public Grid',lw=1)
ax = axes[2]
ax.plot(t,B_S.value, 'k-', label='Battery Storage',lw=1)
ax.plot(t,B_S_Load,'g--',label='Battery Storage Loading',lw=1)
ax.plot(t,B_S_Recover,'b:',label='Battery Storage Recovery',lw=1) # lw --> linewidth
ax = axes[3]
ax.plot(t,H_Demand, 'r-', label='Heat Demand',lw=1)
ax.plot(t, Q_HP.value,'b:',\
label='Thermal Production Heat Pump',lw=1)
ax = axes[4]
ax.plot(t,H_S, 'k-', label='Heat Storage',lw=1)
ax.plot(t,H_S_Load,'g--',label='Heat Storage Loading',lw=1)
ax.plot(t,H_S_Recover.value,'b:',\
label='Heat Storage Recovered Energy',lw=1)
ax = axes[5]
ax.plot(t,DHW_Demand, 'r-', label='Domestic Hot Water Demand',lw=1)
ax.plot(t, EH,'b:',\
label='Electrical Water Heater',lw=1)
for ax in axes:
ax.legend(loc='center left',\
bbox_to_anchor=(1,0.5),frameon=False)
ax.grid()
ax.set_xlim(0,len(t)-1)
plt.savefig('Results.png', dpi=600,\
bbox_inches = 'tight')
plt.show()
The scale-up issue is likely with the heat pump binary variable. An exhaustive search for your cases leads to the evaluation of 2^8760 possible solutions. The APOPT solver uses a branch and bound method that greatly reduces the number of potential solution candidates. Here are solver options that are recommended to improve the speed and control the solution tolerance.
m = GEKKO()
m.solver_options = ['minlp_gap_tol 0.1',\
'minlp_maximum_iterations 1000',\
'minlp_max_iter_with_int_sol 500',\
'minlp_branch_method 1',\
'nlp_maximum_iterations 100']
m.options.SOLVER = 1
minlp_maximum_iterations: maximum number of NLP solutions from the branch and bound method. A successful solution is returned if there is an integer solution upon reaching the maximum number of iterations. Otherwise, the solution is not considered to be successful and an error message is returned with the failed solution.
minlp_max_iter_with_int_sol: maximum number of NLP solutions when a candidate integer solution is found
minlp_gap_tol: the gap is the spread between the lowest candidate leaf (obj_r = non-integer solution) and the best integer solution (obj_i). When the gap is below the minlp_gap_tol, the best integer solution is returned.
minlp_branch_method: 1 = depth first (find integer solution faster), 2 = breadth first, 3 = lowest objective leaf, 4 = highest objective leaf
nlp_maximum_iterations: maximum number of iterations for each NLP sub-problem. Reducing the NLP maximum iterations can improve the solution speed because less computational time is spent on candidate solutions that may not converge.
I recommend fine-tuning these solver options on a short time horizon problem, perhaps 100 time steps to solve in a few seconds. The improvement in computational speed should also apply for the larger problems.
There are excellent suggestions in the other answer posted on tweaking GEKKO, which I'm not too familiar with, but the main issue you have is that you've made a non-linear model, which will be extraordinarily difficult to solve over that many time periods. I'd strongly suggest:
Reformulate. You can very likely make this linear. You are multiplying variables together in several places, which makes the model non-linear. There are linear formulations that could be substituted; find them all and fix them, even if you have to add more variables. You might be able to make this a simple LP (no integer requirements), and it would solve in a snap. For instance, you do not need to multiply the binary heat pump variable by the heat pump output to regulate that; the product is non-linear. You should just be doing something like:
heat_pump_output[t] <= heat_pump_max_output * heat_pump_on[t]
where heat_pump_max_output is a fixed parameter (optionally time-indexed) and heat_pump_on[t] is either a parameter limiting on times or a binary variable, if needed.
There are several other parts that might be changed as well, such as charge/discharge, where you have two variables and might consider just one "flow" variable that can be positive or negative. (This might be tough if you have different efficiencies for charging and discharging.)
There are also ways to linearize "or" conditions if the above doesn't work, or if there are other needs for binary variables without multiplication; one standard pattern is sketched below.
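For instance, the complementarity constraints that forbid simultaneous charging and discharging (B_S_Load * B_S_Recover == 0 above) can be replaced by a linear big-M pair with one new binary variable z[t] per timestep, where M is an assumed upper bound on the charge/discharge power:
B_S_Load[t] <= M * z[t]
B_S_Recover[t] <= M * (1 - z[t])
At most one of the two can be nonzero in any timestep, and both constraints are linear in the variables.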
Review your objective for non-linearities also. It is unclear why you are squaring variables in your objective function (e.g. P_el_HP_max*1918.4*P_el_HP_max) when you are looking at costs; a purchase cost is normally linear in the component size.
If the above is unsuccessful and you cannot linearize the model, then think about not solving for all time steps at once, or just pick a much larger time step, perhaps aggregated to 6 observations a day at "stressful" times.
Here is a highly similar model written in pulp that solves in about 30 seconds for 8000+ time steps. Translation into GEKKO shouldn't be too daunting if you like that framework.

H2O isolation forest -- scores greater than 1

Isolation Forest in H2O (3.30.0.1, R 3.6.1) computes scores greater than 1 when the model is applied to the test set. Here is the code to reproduce scores greater than 1. It looks like h2o is not using the normalization from the original paper [https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest], which is score = 2^(-mean_length/c(n)); c(n) is always positive for n > 0, so the scores should always be less than 1.
Other implementations of isolation forest produce scores less than 1 for the same data set.
Download train and test data files.
library(data.table)
library(h2o)
h2o.init()
#import data
train<-h2o.importFile('train.csv')
test<-h2o.importFile('test.csv')
#Train model
model <- h2o.isolationForest(training_frame = train)
# Calculate score
scores <- h2o.predict(model,test)
max(scores[,1])
If your question is why the predicted value is greater than 1 for some rows in the test set, then this is due to those test values having shorter mean_lengths, i.e. fewer splits than average are needed to isolate them compared to the training data. Remember that an isolation forest uses an ensemble of trees (not just one). So if you have records that are more unusual than the training data (anomalies), then your predicted value can be greater than 1 (equivalently, the mean_length is shorter than normal).
You can see this by looking at the rows in your test data that have predict larger than 1:
scores[scores[,1] > 1, ]
   predict mean_length
1 1.232558        3.82
2 1.023256        4.36
3 1.069767        4.24
4 1.286822        3.68
Also, you can see that the mean_length of your training data, averaged over all rows, was 4.42 (which is larger than the mean_lengths of the flagged test rows above):
scores_train <- h2o.predict(model,train)
mean(scores_train[,'mean_length'])
4.42
Check out this post to learn more about interpreting isolation forests.
Neema pointed out that h2o normalizes using the min/max path length, which makes scores greater than 1 possible.
[https://support.h2o.ai/support/tickets/97280]
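For comparison, a sketch of the normalization from the original paper, using the mean_length column that h2o.predict returns (n here is the sub-sample size used to grow each tree; 256 below is only an assumed example value that you would need to match to your model settings):
# c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + Euler-Mascheroni constant
c_n <- function(n) 2 * (log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
paper_score <- function(mean_length, n) 2^(-mean_length / c_n(n))
paper_score(3.82, 256)  # always < 1 for n > 1, unlike the min/max normalization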

suitable formula/algorithm for detecting temperature fluctuations

I'm creating an app to monitor water quality. The temperature data is updated every 2 minutes to a Firebase realtime database. The app has two requirements:
1) It should alert the user when the temperature exceeds 33 degrees or drops below 23 degrees - this part is done.
2) It should alert the user when there is a big temperature fluctuation, analysing the data every 30 minutes - this part I'm confused about.
I don't know what algorithm to use to detect a big temperature fluctuation over a period of time and alert the user. Can someone help me with this?
For a period of 30 minutes, your app would give you 15 values.
If you want to detect a big change in this data, here is one way to do so.
You can implement the following method:
Calculate the mean and the standard deviation of the values.
Subtract the mean from each data value and take the absolute value of the result.
Check whether the absolute value is greater than one standard deviation; if it is, you have a big fluctuation.
See this example for better understanding:
Let's suppose you have these values for 10 minutes:
25,27,24,35,28
First Step:
Mean = 27 (approx.)
One standard deviation = 3.8
Second Step: Absolute(Data - Mean)
abs(25-27) = 2
abs(27-27) = 0
abs(24-27) = 3
abs(35-27) = 8
abs(28-27) = 1
Third Step:
Check whether any of the differences is greater than the standard deviation:
abs(35-27) gives 8, which is greater than 3.8.
So there is a big fluctuation. If all the differences are less than the standard deviation, then there is no big fluctuation.
You can further improve the result by using two or three standard deviations instead of one; a short sketch of the whole method follows.
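A minimal R sketch of this method, assuming temps holds one 30-minute window of readings and using the population standard deviation as in the example above (k = 1 flags anything beyond one standard deviation; raise k to 2 or 3 to be stricter):
has_big_fluctuation <- function(temps, k = 1) {
  m <- mean(temps)
  s <- sqrt(mean((temps - m)^2))  # population standard deviation
  any(abs(temps - m) > k * s)
}
has_big_fluctuation(c(25, 27, 24, 35, 28))  # TRUE: the reading 35 deviates by ~7.2 > 3.87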
Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last point and the current value? That's defensible.
If the maximum allowable absolute value is some multiple of your 33 - 23 = 10 degree range, you're in business.
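A one-line sketch of that definition, assuming temps holds the readings in time order and threshold is the chosen limit:
any(abs(diff(temps)) > threshold)  # TRUE if any consecutive 2-minute change exceeds the threshold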

pair-wise aggregation of input lines in Hadoop

A bunch of driving cars produce traces (sequences of ordered positions):
car_id  order_id  position
car1    0         (x0,y0)
car1    1         (x1,y1)
car1    2         (x2,y2)
car2    0         (x0,y0)
car2    1         (x1,y1)
car2    2         (x2,y2)
car2    3         (x3,y3)
car2    4         (x4,y4)
car3    0         (x0,y0)
I would like to compute the distance (path length) driven by the cars.
At the core, I need to process all records line by line, pair-wise. If the car_id of the previous line is the same as the current one, then I need to compute the distance to the previous position and add it to the aggregated value. If the car_id of the previous line is different from the current line, then I need to output the aggregate for the previous car_id and initialize the aggregate of the current car_id with zero.
What should the architecture of the Hadoop program look like? Is it possible to achieve the following?
Solution (1):
(a) Every mapper computes the aggregated distance of the trace (per physical block)
(b) Every mapper aggregates the distances further in case the trace was split among multiple blocks and nodes
Comment: this solution requires knowing whether I am on the last record (line) of the block. Is this information available at all?
Solution (2):
(a) The mappers read the data line by line (doing no computations) and send the data to the reducer based on the car_id.
(b) The reducers sort the data for each car_id based on order_id, compute the distances, and aggregate them
Comment: high network load due to the laziness of the mappers
Solution (3):
(a) Implement a custom reader that defines a logical record to be the whole trace of one car
(b) Each mapper computes the distances and the aggregate
(c) A reducer is not really needed, as everything is done by the mapper
Comment: high main-memory cost, as the whole trace needs to be loaded into main memory (although only two lines are used at a time)
I would go with Solution (2), since it is the cleanest to implement and reuse.
You certainly want to sort based on car_id AND order_id, so you can compute the distances on the fly without loading them all up into memory.
Your concern about high network usage is valid, however, you can pre-aggregate the distances in a combiner.
What would that look like? Let's take some pseudocode:
Mapper:
  foreach record:
    emit((car_id, order_id), (x, y))
Combiner:
  if prev_order_id + 1 == order_id:  // subsequent measures
    // compute the distance and emit it as the last possible order
    emit((car_id, MAX_VALUE), distance(prev, cur))
  else:
    // send to the reducer, since it is probably crossing block boundaries
    emit((car_id, order_id), (x, y))
The reducer then has two main parts:
compute the sum over subsequent measures, like the combiner did
sum over all existing sums, tagged with order_id = MAX_VALUE
That is already the best effort you can get from a network-usage point of view.
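Framework aside, the core of the per-car aggregation is just "sort by order_id, then sum consecutive Euclidean distances". A language-agnostic sketch in R, assuming a hypothetical data frame traces with columns car_id, order_id, x, y:
path_length <- function(df) {
  df <- df[order(df$order_id), ]                   # sort the trace by order_id
  sum(sqrt(diff(df$x)^2 + diff(df$y)^2))           # sum consecutive Euclidean distances
}
sapply(split(traces, traces$car_id), path_length)  # one aggregate per car_id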
From a software point of view, you'd be better off using Spark: your logic will be five lines instead of 100 across three class files.
For your other question:
this solution requires to know whether I am on the last record (line)
of the block. Is this information available at all?
Hadoop only guarantees that it does not split through records when reading; it may very well be that your record is already touching two different blocks underneath. The way to find out is basically to rewrite your input format to make this information available to your mappers, or better still, to take your logic into account when splitting blocks.

R CODE Creating a heat map for multi-time scale vs time analysis

Dear stackoverflowers,
I have derived a modified version of an entropy measure of ME (market efficiency), where I windowed/rolled CMSE (Composite Multiscale Entropy) over length-500 windows for the SP500. I then ran 5000 replications of length-500 Gaussian iid RVs. I set any windowed CMSE[i,j] with a higher value than the lower bound of the 5000 bootstrap CMSE replications equal to 1. The data set in front of you is the result.
How do I insert the data?
The question is how one would create a heat map when there are 8007 columns (the time variable) and for each time there are 28 scales (the time-scale variable), using something like ggplot2.
I can get it to come up, very ugly, like this:
heatmap.2(adjrollingME_CMSE,col=redgreen(75),dendrogram='none', Rowv=FALSE,
Colv=FALSE,trace='none')
library(ggplot2)
date<- index(DSP500F)[1:8007]
y<- 0:28
gg <- ggplot(data =data.frame(adjrollingME_CMSE), aes(x = date, y =y, fill = value)),
geom_tile()
gg
Don't know how to automatically pick scale for object of type function. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the data. Problems: hm
I cannot see your data set, but it sounds like you are using a matrix rather than storing your data in a data frame. With R (and ggplot2 in particular), you should store this kind of data in a data frame in long form, one row per cell. This is a really strange idea at first, but it is sort of like taking your matrix and normalizing it.
Here is some info on data frames.
And here is another question about converting a matrix to a data frame.
Good luck!
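Building on that answer, a minimal sketch with reshape2::melt, assuming adjrollingME_CMSE is a numeric matrix with 28 rows (scales) and 8007 columns (time points):
library(ggplot2)
library(reshape2)
df <- melt(adjrollingME_CMSE)  # long form: Var1 = scale, Var2 = time, value
names(df) <- c("scale", "time", "value")
ggplot(df, aes(x = time, y = scale, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "green", high = "red")
geom_tile draws one rectangle per (time, scale) cell, which is the ggplot2 equivalent of the heatmap.2 call above.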
