R CODE Creating a heat map for multi-time scale vs time analysis

I have derived a modified version of an entropy measure of ME (market efficiency) in which I rolled CMSE (Composite Multiscale Entropy) over a window of length 500 for the SP500. I then ran 5000 replications of length-500 Gaussian iid random variables. I set any windowed CMSE[i,j] whose value is higher than the lower bound of the 5000 bootstrap CMSE replications equal to 1. The data set in front of you is the result.
How do I insert the data?
The question is how one would create a heat map when there are 8007 columns (the time variable) and, at each time, 28 scales (the time-scale variable), using something like ggplot2.
I can get a very ugly version to come up like this:
heatmap.2(adjrollingME_CMSE,col=redgreen(75),dendrogram='none', Rowv=FALSE,
date<- index(DSP500F)[1:8007]
y<- 0:28
gg <- ggplot(data =data.frame(adjrollingME_CMSE), aes(x = date, y =y, fill = value)),
Don't know how to automatically pick scale for object of type function. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:hm

I cannot see your data set, but it sounds like you are using a matrix rather than storing your data in a data frame. For plotting with ggplot2, you should store time series data in a data frame, with one row per (time, scale) pair and a column holding the value. This is a really strange idea at first, but it is sort of like taking your matrix and normalizing it.
Here is some info on data frames.
And here is another question about converting a matrix to a data frame.
Good luck!


Finding the time in which a specific value is reached in time-series data when peaks are found

I would like to find the time instant at which a certain value is reached in a time-series data with noise. If there are no peaks in the data, I could do the following in MATLAB.
Code from here
% create example data
ts = timeseries(d,t);
% define threshold
thr = 55;
data = ts.data(:);
time = ts.time(:);
ind = find(data>thr,1,'first');
time(ind) %time where data>threshold
But when there is noise, I am not sure what has to be done.
In the time-series data plotted in the above image I want to find the time instant at which the y-axis value 5 is reached. The data actually stabilizes to 5 at t>=100 s. But due to the presence of noise in the data, we see a peak that reaches 5 somewhere around 20 s. I would like to know how to detect e.g. 100 s as the right time and not 20 s. The code posted above will only give 20 s as the answer. I saw a post here that explains using a sliding window to find when the data equilibrates. However, I am not sure how to implement the same. Suggestions will be really helpful.
The sample data plotted in the above image can be found here
Suggestions on how to implement in Python or MATLAB code will be really helpful.
I don't want to capture when the peak (/noise/overshoot) occurs. I want to find the time when equilibrium is reached. For example, around 20 s the curve rises and dips below 5. After ~100 s the curve equilibrates to a steady-state value 5 and never dips or peaks.
Precise data analysis is a serious business (and my passion) that involves a lot of understanding of the system you are studying. Here are some comments; unfortunately, I doubt there is a simple, nice answer to your problem at all. You will have to think about it. Data analysis basically always requires "discussion".
First to your data and problem in general:
When you talk about noise, in data analysis this means a statistical random fluctuation, most often Gaussian (sometimes other distributions, e.g. Poisson). Gaussian noise is a) random in each bin and b) symmetric in the negative and positive directions. Thus, what you observe in the peak at ~20s is not noise. It has very different, very systematic and extended characteristics compared to random noise. This is an "artifact" that must have an origin, but about which we can only speculate here. In real-world applications, studying and removing such artifacts is the most expensive and time-consuming task.
Looking at your data, the random noise is negligible. This is very precise data. For example, from ~150s onwards there are no visible random fluctuations up to the fourth decimal place.
Having concluded that this is not noise in the common sense, it could be at least two things: a) a feature of the system you are studying, thus something for which you could develop a model/formula and "fit" it to the data; b) a characteristic of limited bandwidth somewhere in the measurement chain, thus here a high-frequency cutoff. See e.g. https://en.wikipedia.org/wiki/Ringing_artifacts . Unfortunately, for both a and b there are no catch-all generic solutions, and your problem description (even with code and data) is not sufficient to propose an ideal approach.
After spending ~one hour on your data and making some plots, I believe (speculate) that the extremely sharp feature at ~10s cannot be a "physical" property of the data. It simply is too extreme/steep. Something fundamental happened here. A guess of mine could be that some device was just switched on (was off before). Thus, the data before is meaningless, and there is a short period of time afterwards for the system to stabilize. There is not really an alternative in this scenario but to entirely discard the data until the system has stabilized at around 40s. This also makes your problem trivial: just delete the first 40s, then the maximum becomes evident.
So what technical solutions could you use? Please don't be too upset that you have to think about this yourself and assemble the best possible solution for your case. I copied your data into two numpy arrays x and y and ran the following tests in Python:
Remove unstable time
This is the trivial solution -- I prefer it.
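import matplotlib.pyplot as plt
plt.plot(x, y, label="original")
y_cut = y.copy()  # copy, so that the original y is not modified as well
y_cut[:40] = 0  # zero out the first 40 samples
plt.plot(x, y_cut, label="cut 40s")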
Note: carry on reading below only if you are a bit crazy (about data).
Sliding window
You mentioned "sliding window" which is best suited for random noise (which you don't have) or periodic fluctuations (which you also don't really have). Sliding window just averages over consecutive bins, averaging out random fluctuations. Mathematically this is a convolution.
Technically, you can actually solve your problem like this (try even larger values of Nwindow yourself):
Nwindow = 10
y_slide_10 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow = 20
y_slide_20 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
Nwindow = 30
y_slide_30 = np.convolve(y, np.ones((Nwindow,))/Nwindow, mode='same')
plt.plot(x,y, label="original")
plt.plot(x,y_slide_10, label="window=10")
plt.plot(x,y_slide_20, label='window=20')
plt.plot(x,y_slide_30, label='window=30')
#plt.xscale('log') # useful
Thus, technically you can succeed to suppress the initial "hump". But don't forget this is a hand-tuned and not general solution...
Another caveat of any sliding window solution: it always distorts your timing. Since you average over an interval in time, your convolved trace is shifted back or forth in time, depending on whether the signal is rising or falling (slightly, but significantly). In your particular case this is not a problem since the main signal region has basically no time-dependence (very flat).
Frequency domain
This should be the silver bullet, but it also does not work well/easily for your example. The fact that this doesn't work better is the main hint to me that the first 40s of data are better discarded.... (i.e. in a scientific work)
You can use fast Fourier transform to inspect your data in frequency-domain.
import scipy.fft
y_fft = scipy.fft.rfft(y)
# original frequency domain plot
plt.plot(y_fft, label="original")
The structures in frequency represent the features of your data. The peak at zero is the stabilized region after ~100s; the humps are associated with (rapid) changes in time. You can now play around and change the frequency spectrum (--> filter) but I think the spectrum is so artificial that this doesn't yield great results here. Try it with other data and you may be very impressed! I tried two things: first, cutting high-frequency regions out (setting them to zero), and second, applying a sliding-window filter in the frequency domain (sparing the peak at 0, since this cannot be touched. Try it and you will know why).
# cut high-frequency by setting to zero
y_fft_2 = np.array(y_fft)
y_fft_2[50:70] = 0
# sliding window in frequency
Nwindow = 15
Start = 10
y_fft_slide = np.array(y_fft)
y_fft_slide[Start:] = np.convolve(y_fft[Start:], np.ones((Nwindow,))/Nwindow, mode='same')
# frequency-domain plot
plt.plot(y_fft, label="original")
plt.plot(y_fft_2, label="high-frequency, filter")
plt.plot(y_fft_slide, label="frequency sliding window")
Converting this back into time-domain:
# reverse FFT into time-domain for plotting
y_filtered = scipy.fft.irfft(y_fft_2)
y_filtered_slide = scipy.fft.irfft(y_fft_slide)
# time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_filtered[:500], label="high-f filtered")
plt.plot(x[:500], y_filtered_slide[:500], label="frequency sliding window")
# plt.xscale('log') # useful
There are apparent oscillations in those solutions which make them essentially useless for your purpose. This leads me to my final exercise: applying a sliding-window filter once more, this time to the "frequency sliding window" time-domain trace:
# extra time-domain sliding window
y_fft_90 = np.convolve(y_filtered_slide, np.ones((Nwindow,))/Nwindow, mode='same')
# final time-domain plot
plt.plot(x[:500], y[:500], label="original")
plt.plot(x[:500], y_fft_90[:500], label="frequency-sliding window, slide")
# plt.xscale('log') # useful
I am quite happy with this result, but it still has very small oscillations and thus does not solve your original problem.
How much fun. One hour well wasted. Maybe it is useful to someone. Maybe even to you, Natasha. Please don't be mad at me...
Let's assume your data is in data variable and time indices are in time. Then
import numpy as np
threshold = 0.025
stable_index = np.where(np.abs(data[-1] - data) > threshold)[0][-1] + 1
print('Stabilizes after', time[stable_index], 'sec')
Stabilizes after 96.6 sec
Here data[-1] - data is the difference between the last value of data and all the data values. The assumption here is that the last value of data represents the equilibrium point.
np.where(np.abs(data[-1] - data) > threshold)[0] gives all the indices at which data still deviates from that final value by more than the threshold, that is, where it has not yet stabilized. We take only the last such index. The next one is where the time series is considered stabilized, hence the + 1.
If you're dealing with deterministic data which is eventually converging monotonically to some fixed value, the problem is pretty straightforward. Your last observation should be the closest to the limit, so you can define an acceptable tolerance threshold relative to that last data point and scan your data from back to front to find where you exceeded your threshold.
Things get a lot nastier once you add random noise into the picture, particularly if there is serial correlation. This problem is common in simulation modeling (see (*) below), and is known as the issue of initial bias. It was first identified by Conway in 1963, and has been an active area of research since then with no universally accepted definitive answer on how to deal with it. As with the deterministic case, the most widely accepted answers approach the problem starting from the right-hand side of the data set since this is where the data are most likely to be in steady state. Techniques based on this approach use the end of the dataset to establish some sort of statistical yardstick or baseline to measure where the data start looking significantly different as observations get added by moving towards the front of the dataset. This is greatly complicated by the presence of serial correlation.
If a time series is in steady state, in the sense of being covariance stationary then a simple average of the data is an unbiased estimate of its expected value, but the standard error of the estimated mean depends heavily on the serial correlation. The correct standard error squared is no longer s2/n, but instead it is (s2/n)*W where W is a properly weighted sum of the autocorrelation values. A method called MSER was developed in the 1990's, and avoids the issue of trying to correctly estimate W by trying to determine where the standard error is minimized. It treats W as a de-facto constant given a sufficiently large sample size, so if you consider the ratio of two standard error estimates the W's cancel out and the minimum occurs where s2/n is minimized. MSER proceeds as follows:
Starting from the end, calculate s2 for half of the data set to establish a baseline.
Now update the estimate of s2 one observation at a time using an efficient technique such as Welford's online algorithm, calculate s2/n where n is the number of observations tallied so far. Track which value of n yields the smallest s2/n. Lather, rinse, repeat.
Once you've traversed the entire data set from back to front, the n which yielded the smallest s2/n is the number of observations from the end of the data set which are not detectable as being biased by the starting conditions.
Justification - with a sufficiently large baseline (half your data), s2/n should be relatively stable as long as the time series remains in steady state. Since n is monotonically increasing, s2/n should continue decreasing subject to the limitations of its variability as an estimate. However, once you start acquiring observations which are not in steady state the drift in mean and variance will inflate the numerator of s2/n. Hence the minimal value corresponds to the last observation where there was no indication of non-stationarity. More details can be found in this proceedings paper. A Ruby implementation is available on BitBucket.
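The back-to-front procedure described above can be sketched in Python roughly as follows. This is a minimal illustration of the steps as stated here, not the published MSER code or the Ruby implementation. The function name mser_truncation_index is made up for the example, and the choice to start tracking the minimum once half of the data has been tallied is an assumption.

```python
import numpy as np


def mser_truncation_index(data):
    """Return the index of the first observation to keep (hypothetical helper).

    Scans from the end of the series towards the front, updating the mean and
    the sum of squared deviations with Welford's online algorithm, and tracks
    the sample size n that minimizes s2/n.
    """
    data = np.asarray(data, dtype=float)
    total = len(data)
    if total < 4:
        raise ValueError("need at least 4 observations")

    baseline = total // 2  # assumed: last half of the data is the baseline
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    best_stat = np.inf
    best_n = baseline

    for value in data[::-1]:  # walk from the last observation to the first
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        if n >= baseline:
            stat = (m2 / (n - 1)) / n  # s2 / n
            if stat < best_stat:
                best_stat = stat
                best_n = n

    # best_n observations, counted from the end, show no detectable bias
    return total - best_n
```

For a series that starts with ten outlying values followed by a stationary alternating sequence, mser_truncation_index([50.0] * 10 + [1.0, -1.0] * 100) drops the first ten observations.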
Your data has such a small amount of variation that MSER concludes that it is still converging to steady state. As such, I'd advise going with the deterministic approach outlined in the first paragraph. If you have noisy data in the future, I'd definitely suggest giving MSER a shot.
(*) - In a nutshell, a simulation model is a computer program and hence has to have its state set to some set of initial values. We generally don't know what the system state will look like in the long run, so we initialize it to an arbitrary but convenient set of values and then let the system "warm up". The problem is that the initial results of the simulation are not typical of the steady state behaviors, so including that data in your analyses will bias them. The solution is to remove the biased portion of the data, but how much should that be?

How to take random samples for H2O data frame in R?

I have an H2O data table with 40 columns and 1 million rows. I want to do a random selection of 0.3 million rows without replacement. The h2o.sample function I found online gives the following error (I've already started the H2O cluster):
Error: could not find function "h2o.sample"
Is there any other way I can do this? Thanks in advance!
There is no h2o.sample() function (maybe there was in a very old version of H2O?). You can use the h2o.splitFrame() function to split your frame into pieces. This also serves as a way to take a random subset of your data frame (without replacement). The function will actually create two (or more) pieces of your data, so if you want just the 30%, here is an example in R using iris to get a ~30% random sample of the rows:
hf <- as.h2o(iris)
ss <- h2o.splitFrame(hf, ratios = c(0.3), seed = 1)
sub_hf <- ss[[1]] # will contain 30% of the rows
Note that for scalability reasons, h2o.splitFrame() uses "approximate splitting", which means that you won't necessarily get exactly 30% of the rows. However, the expected value is 30%, and it will be closer to the desired percentage when your data is bigger. The iris data is a tiny 150-row dataset, so there is more variance.
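To get a feel for why the sampled fraction is only approximately 30%, here is a small sketch in Python with NumPy (not R, and not H2O). It assumes, purely for illustration, that each row is kept independently with probability 0.3; this is not a description of H2O's actual implementation, and the function name is made up.

```python
import numpy as np


def approximate_split_fraction(n_rows, ratio=0.3, seed=1):
    """Fraction of rows kept when every row is kept independently with
    probability `ratio` (an illustrative assumption, not H2O's algorithm)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(n_rows) < ratio  # one independent draw per row
    return keep.mean()


# The kept fraction scatters around 0.3; the scatter shrinks as n_rows grows.
for n_rows in (150, 10_000, 1_000_000):
    print(n_rows, approximate_split_fraction(n_rows))
```

Under this assumption the standard deviation of the kept fraction is sqrt(0.3 * 0.7 / n_rows), which is about 3.7 percentage points for 150 rows and about 0.05 percentage points for 1 million rows.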

Algorithm or command line tool to decimate point cloud of terrain points?

I need to take a list of lidar survey points (longitude, latitude, and elevation) for terrain definition that is denser than needed and decimate it based on a 2-dimensional grid. The idea would be to end up with points based on an NxN (e.g. 1 meter x 1 meter) grid using the longitude, latitude (x,y) values, thereby eliminating the points that are not needed. The goal is to determine what the elevation is at each point in the grid after the decimation, not to use elevation as part of the decimation rule itself.
An actual or precisely structured grid is not necessary or the goal here. I only use the grid terminology to best approximate what I envision as the remainder of the cloud of points after reducing it in such a manner that we always have a point within a certain radius (e.g. 1 meter). It is possible there is a better term to use than grid.
I would like to either code/script this myself in a scripting or programming language, if I can start with a decimation algorithm, or use a command line tool from an existing project that can run on Ubuntu and be called from our application as a system call. The approach should not require using GUI-based software or tools. It needs to be part of an automated set of steps.
The data currently exists in a tab-separated values file, but I could load the data into a sqlite database file if using a database/SQL query driven algorithm would be better/faster. The ideal scripting language would be Ruby or Python, but it can be any really, and if C/C++/C# libraries for this already exist then we could wrap those for our needs.
Clarifying the use of the result of this decimated list: given a user's location (known by latitude and longitude), what is the closest point in the list and in turn its elevation? We can do this now of course, but we have more data than is necessary, so we just want to relax the density of the data so that we can still find the closest point within a tolerance distance (e.g. 1 meter) when using the decimated list instead of the full list. The latitude, longitude values in the list are in decimal GPS (e.g. 38.68616190027656, -121.11013105991036)
PART 1: decimated version
Load data
Load the data from the tabular file (change sep according to the separator you are using):
# installed as dependency
import pandas as pd
# https://github.com/daavoo/pyntcloud
from pyntcloud import PyntCloud
dense = PyntCloud(pd.read_csv("example.tsv",
This is how the example I created looks:
Build VoxelGrid
Assuming that the latitude and longitude in your file are in meters, you can generate a grid as follows:
grid_id = dense.add_structure("voxelgrid",
sizes=[1, 1,None],
voxelgrid = dense.voxelgrids[grid_id]
This voxelgrid has a size of 1 along the x (latitude) and y (longitude) dimensions.
Build decimated version
decimated = dense.get_sample("voxelgrid_centroids", voxelgrid=grid_id)
decimated is a numpy (N,3) array. You can store it for later use in a SQL database, etc.
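If you would rather not depend on a point cloud library, the same idea (one averaged point per occupied grid cell) can be sketched with plain NumPy. This is only an approximation of the approach above, not pyntcloud's implementation; the function name decimate_by_grid and its parameters are hypothetical, and the coordinates are assumed to already be in meters.

```python
import numpy as np


def decimate_by_grid(points, cell_size=1.0):
    """Reduce an (N, 3) array of x, y, z points to one centroid per occupied
    cell of a 2D grid with square cells of side `cell_size`."""
    points = np.asarray(points, dtype=float)

    # Integer cell coordinates from x and y only; z plays no role in binning.
    cells = np.floor(points[:, :2] / cell_size).astype(np.int64)

    # Map every point to the index of its (unique) cell.
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()

    # Sum x, y, z per cell, then divide by the number of points in the cell.
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]


example = np.array([[0.1, 0.1, 10.0], [0.9, 0.9, 20.0], [1.5, 0.2, 30.0]])
print(decimate_by_grid(example))
```

The first two example points fall into the same 1 x 1 cell and are averaged; the third is alone in its cell and is returned unchanged. With coordinates in decimal degrees, they would first have to be projected to a metric coordinate system for a 1 meter cell size to be meaningful.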
PART 2: Query
Option A: query voxelgrid
Get mean altitudes for each grid cell
You can now get a vector with the mean z (altitude) value for each cell in the grid:
z_mean = voxelgrid.get_feature_vector(mode="z_mean")
Query the grid with the users' locations:
import numpy as np
users_location = np.random.rand(100000, 2)
Add a column of zeros because query requires 3D (This doesn't affect the results):
users_location = np.c_[ users_location, np.zeros(users_location.shape[0]) ]
Get the cell in which each user is located:
users_cell = voxelgrid.query(users_location)
And finally, get the altitude corresponding to each user:
users_altitude = z_mean[users_cell]
Option B: Use decimated version for query
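Build a KDTree from the x and y columns of decimated only, so that altitude does not influence the nearest-neighbour search:
import numpy as np
from scipy.spatial import cKDTree
kdt = cKDTree(decimated[:, :2])
Query the KDTree with user locations and look up the altitude of each nearest point:
users_location = np.random.rand(100000, 2)
distances, indices = kdt.query(users_location, k=1, workers=-1)  # workers was named n_jobs before SciPy 1.6
users_altitude = decimated[indices, 2]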
As an extra, you can save and load the voxelgrid with pickle:
import pickle
pickle.dump(voxelgrid, open("voxelgrid.pkl", "wb"))
voxelgrid = pickle.load(open("voxelgrid.pkl", "rb"))
If you have a point cloud as a text file (.xyz) a simple and fast solution is to take a random sample from the file using shuf.
10 million points in a xyz-file equals 10 million lines of text. You can run:
shuf -n 5000000 input.xyz -o out.xyz
You have decimated the file to half the original size.

Own fast Gamma Index implementation

My friends and I are writing our own implementation of the Gamma Index algorithm. It should compute the result within 1 s for standard-size 2D pictures (512 x 512), though it could also handle 3D pictures, and it should be portable and easy to install and maintain.
Gamma Index, in case you haven't come across it, is a method for comparing pictures. As input we provide two pictures (reference and target); every picture consists of points distributed over a regular fine grid; every point has a location and a value. As output we receive a picture of Gamma Index values. For each point of the target picture we calculate some function (called gamma) against every point of the reference picture (in the original version) or against the points of the reference picture that are closest to the target point (in the version usually used in Gamma Index calculation software). The Gamma Index for a given target point is the minimum of the gamma function values calculated for it.
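The "closest points only" variant described above can be sketched with NumPy by looping over pixel offsets instead of over pixels, so that each step is a whole-array operation. The gamma function itself is not given in this post; the sketch assumes the commonly used form sqrt(distance^2 / dta^2 + difference^2 / dd^2), and the function name, parameter names, and defaults are all hypothetical.

```python
import numpy as np


def gamma_index_2d(reference, target, spacing=1.0, dta=3.0, dd=0.03,
                   search_radius=None):
    """Gamma Index image for two equally shaped 2D arrays (illustrative sketch).

    spacing: grid step; dta: distance criterion (same unit as spacing);
    dd: value-difference criterion (absolute, same unit as pixel values).
    """
    reference = np.asarray(reference, dtype=float)
    target = np.asarray(target, dtype=float)
    if search_radius is None:
        search_radius = dta  # assumed default: search only within dta
    reach = int(np.ceil(search_radius / spacing))

    # Pad with NaN so that shifted views never wrap around the image edges.
    padded = np.pad(reference, reach, mode="constant", constant_values=np.nan)
    rows, cols = target.shape
    best = np.full(target.shape, np.inf)  # running minimum of gamma squared

    for dy in range(-reach, reach + 1):
        for dx in range(-reach, reach + 1):
            dist2 = (dy * dy + dx * dx) * spacing ** 2
            if dist2 > search_radius ** 2:
                continue
            # shifted[i, j] is reference[i + dy, j + dx] (NaN outside the image)
            shifted = padded[reach + dy:reach + dy + rows,
                             reach + dx:reach + dx + cols]
            gamma2 = dist2 / dta ** 2 + (shifted - target) ** 2 / dd ** 2
            best = np.fmin(best, gamma2)  # fmin ignores the NaN padding

    return np.sqrt(best)
```

The number of Python-level iterations depends only on the search radius, not on the picture size, which is the reason for organizing the loops this way. Whether this meets the 1 s target would have to be measured.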
So far we have tried the following ideas, with these results:
use a GPU - the calculation time decreased 10 times. The problem is that it is fairly difficult to install on machines with a non-NVIDIA graphics card
use a supercomputer or cluster - the problem is the maintenance of this solution. Plus, every picture has to be encrypted for travel through the network due to data sensitivity
iterate over points ordered by their distance to the target point, with an extra stop criterion - this way we got 15 seconds in the best case (which is actually not ideally precise)
Currently we are writing in Python because of NumPy's awesome optimizations of matrix calculations, but we are open to other languages too.
Do you have any ideas how we can accelerate our algorithm(s) in order to meet the objectives? Do you think this level of performance is achievable?
Some more information about GI for anyone interested:

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total number of results for App Engine queries that return large numbers of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the entity's random value that I had assigned for this purpose into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):
estimated total = OFFSET / random_value
The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the entity's random number assignment should be proportionate to the beginning and end of the results with respect to its offset (in this case, 1000).
The problem is that I am getting low estimates, and the lower the offset, the lower the estimate. I had anticipated that a lower offset would give a less accurate estimate, but I thought that the error would fall both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But then the predictions predictably follow a 4th-degree polynomial (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).
Am I making a mistake here, or does the standard Python random number generator not distribute numbers evenly enough for this purpose?
It turns out that this problem is due to my mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results is different from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close by. When I did a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud", meaning that offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum after each offset. For example the maximum error for any offset after 10000 was less than 1%:
When using GAE it makes a lot more sense not to try to do large amounts of work on reads; it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a map reduce job to get the initial count.
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
Some quick thoughts:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently.
I did some math, and I think the estimation method you proposed here could be rephrased as an "order statistic" problem.
For example:
If the actual number of entities is 60000, the question becomes: "what is the probability that your 1000th [2000th, 3000th, ...] sample falls in the interval [l, u], so that the estimated total number of entities based on this sample has an acceptable error relative to 60000?"
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]
I think the probability won't be very large.
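That probability can be checked numerically. The sketch below relies on the standard result that the k-th smallest of N independent Uniform(0, 1) values follows a Beta(k, N - k + 1) distribution, and it assumes the estimate is OFFSET divided by the random value at that offset (which matches the interval quoted above: 1000/63000 and 1000/57000). The function name and parameters are made up for the example.

```python
import numpy as np


def prob_estimate_within(total, offset, rel_error=0.05,
                         n_trials=200_000, seed=0):
    """Monte Carlo probability that offset / U_(offset) lies within
    rel_error of `total`, where U_(offset) is the offset-th smallest of
    `total` independent Uniform(0, 1) values."""
    rng = np.random.default_rng(seed)
    # Draw the order statistic directly instead of sorting `total` uniforms.
    u_k = rng.beta(offset, total - offset + 1, size=n_trials)
    estimates = offset / u_k
    return np.mean(np.abs(estimates - total) <= rel_error * total)


for offset in (1000, 5000, 10000):
    print(offset, prob_estimate_within(60000, offset))
```

The relative spread of the offset-th order statistic is roughly 1/sqrt(offset) when the offset is small compared to the total, so larger offsets give tighter estimates.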
This doesn't directly deal with the calculations aspect of your question, but would using the count attribute of a query object work for you? Or have you tried that out and it's not suitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
