PyMC3: When using the shape parameter for time series, is time on the row or column axis? - pymc

I want to sample multiple time series of the same length in PyMC3. If I have code that reads:
with model:
    z = pm.GaussianRandomWalk('z', sd=1, shape=(100, 10))
Does this read "100 time series of length 10 each", or "10 time series of length 100 each"?

Related

Calculating Moving Average for N Months in DAX Power BI

I have a measure that calculates Moving Average for 3 months:
Moving_Avg_3_Months = AVERAGEX(DATESINPERIOD('Calendar FY'[Date],
LASTDATE('Calendar FY'[Date]), -3, MONTH),[CUS Revenue Credible All])
Is it possible to create a measure that would calculate the Moving Average for my [CUS Revenue Credible All], but for N months, where N = 3, N = 6, or whatever number I'd like?
If you create a new table with the different values for the moving average you want to use, e.g. TableMovingAverage: [-3,-6,-12,-24,...,N],
and modify your DAX formula like this:
Moving_Avg_3_Months =
AVERAGEX(
    DATESINPERIOD('Calendar FY'[Date],
        LASTDATE('Calendar FY'[Date]),
        SELECTEDVALUE('TableMovingAverage', -3),
        MONTH),
    [CUS Revenue Credible All])
SELECTEDVALUE returns a scalar if only one value is in the specified table; otherwise it returns the default value, -3 in this case.
If you filter TableMovingAverage you can switch between different moving averages.

Best algorithm to compare massive data

I have a big dataset as a CSV (334 MB) which looks like the below.
month, output
1,"['23482394','4358309','098903284'....(total 2.5 million entries)]"
2,"['92438545','23482394',323103404'....(total 2.2 million entries)]"
3,"[...continue
Now, I need to compare what percentage of the output in a month overlaps with the output in the previous month.
For example, when I compare month 1 and month 2, I'd like to get a result like "Month 2 output has 90% overlap against month 1", then "Month 3 has 88% overlap against month 2".
What's the best way to solve this problem in Python 3?
You can use the set intersection method to extract the common elements between two arrays or lists. The complexity of set intersection is O(min(len(a), len(b))).
# generate random numpy array with unique elements
import numpy as np
month1 = np.random.choice(range(10**5, 10**7), size=25*10**5, replace=False)
month2 = np.random.choice(range(10**5, 10**7), size=22*10**5, replace=False)
month3 = np.random.choice(range(10**5, 10**7), size=21*10**5, replace=False)
print('Month 1, 2, and 3 contains {}, {}, and {} elements respectively'.format(len(month1), len(month2), len(month3)))
Month 1, 2, and 3 contains 2500000, 2200000, and 2100000 elements respectively
# Compare month arrays for overlap
import time
startTime = time.time()
common_m1m2 = set(month1).intersection(month2)
common_m2m3 = set(month2).intersection(month3)
print('Percent of elements in both month 1 & 2: {}%'.format(round(100*len(common_m1m2)/len(month2), 2)))
print('Percent of elements in both month 2 & 3: {}%'.format(round(100*len(common_m2m3)/len(month3), 2)))
print('Process time:{:.2f}s'.format(time.time()-startTime))
Percent of elements in both month 1 & 2: 25.3%
Percent of elements in both month 2 & 3: 22.24%
Process time:2.46s
With your actual data you will likely see a much higher overlap between month entries than with this random sample.
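If your real file stores each month's output as a quoted Python-style list, as in the sample above, a rough sketch for loading it and computing the month-over-month overlap could look like the following (the file name data.csv and the column names are placeholders taken from the sample):
# Sketch: parse the month,output CSV and compare consecutive months.
# Assumes each 'output' field is a quoted Python-style list of ID strings,
# as in the sample data above; 'data.csv' is a placeholder file name.
import ast
import csv

months = {}  # month number -> set of entries
with open('data.csv', newline='') as f:
    reader = csv.DictReader(f, skipinitialspace=True)
    for row in reader:
        months[int(row['month'])] = set(ast.literal_eval(row['output']))

for m in sorted(months)[1:]:
    overlap = months[m] & months[m - 1]
    pct = 100 * len(overlap) / len(months[m])
    print('Month {} output has {:.1f}% overlap against month {}'.format(m, pct, m - 1))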

Fastest approximate counting algorithm

What's the fastest way to get an approximate count of the number of rows of an input file or stdout data stream? FYI, I'm after a probabilistic algorithm; I can't find many examples online.
The data could just be one or two columns coming from an awk script or a CSV file. Let's say I want an approximate group-by on one of the columns. I would use a database GROUP BY, but the number of rows is over 6-7 billion. I would like the first approximate result in under 3 to 4 seconds, then run a Bayesian update or something once decisions are made on the prior. Any ideas on a really rough initial group count?
If you can provide an example of the algorithm in Python or Java, that would be very helpful.
@Ben Allison's answer is a good way to go if you want to count the total lines. Since you mentioned Bayes and the prior, I will add an answer in that direction to calculate the percentage of the different groups. (See my comments on your question: if you have an idea of the total and you want to do a group-by, estimating the percentage of the different groups makes more sense.)
The recursive Bayesian update:
I will start by assuming you have only two groups, group1 and group2 (this extends to multiple groups; see the explanation later).
For m group1s out of the first n lines (rows) you have processed, we denote the event as M(m,n). Obviously you will see n-m group2s, because we assume they are the only two possible groups. So the conditional probability of the event M(m,n), given the proportion of group1 (call it s), is given by the binomial distribution with n trials. We are trying to estimate s in a Bayesian way.
The conjugate prior for the binomial is the beta distribution. For simplicity, we choose Beta(1,1) as the prior (of course, you can pick your own values for alpha and beta here), which is the uniform distribution on (0,1). Therefore, for this beta distribution, alpha=1 and beta=1.
The recursive update formulas for a binomial + beta prior are as below:
if group == 'group1':
    alpha = alpha + 1
else:
    beta = beta + 1
The posterior of s is actually also a beta distribution:
p(s | M(m,n)) = s^(m+alpha-1) * (1-s)^(n-m+beta-1) / B(m+alpha, n-m+beta) = Beta(m+alpha, n-m+beta)
where B is the beta function. To report the estimate result, you can rely on Beta distribution's mean and variance, where:
mean = alpha/(alpha+beta)
var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
The python code: groupby.py
So a few lines of python to process your data from stdin and estimate the percentage of group1 would be something like below:
import sys

alpha = 1.
beta = 1.
for line in sys.stdin:
    data = line.strip()
    if data == 'group1':
        alpha += 1.
    elif data == 'group2':
        beta += 1.
    else:
        continue
    # posterior is Beta(alpha, beta); report its mean and variance after each line
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
    print('mean = %.3f, var = %.3f' % (mean, var))
The sample data
I feed a few lines of data to the code:
group1
group1
group1
group1
group2
group2
group2
group1
group1
group1
group2
group1
group1
group1
group2
The approximate estimation result
And here is what I get as results:
mean = 0.667, var = 0.056
mean = 0.750, var = 0.037
mean = 0.800, var = 0.027
mean = 0.833, var = 0.020
mean = 0.714, var = 0.026
mean = 0.625, var = 0.026
mean = 0.556, var = 0.025
mean = 0.600, var = 0.022
mean = 0.636, var = 0.019
mean = 0.667, var = 0.017
mean = 0.615, var = 0.017
mean = 0.643, var = 0.015
mean = 0.667, var = 0.014
mean = 0.688, var = 0.013
mean = 0.647, var = 0.013
The result shows that group1 is estimated to make up 64.7% of the data up to the 15th row processed (based on our Beta(1,1) prior). You might notice that the variance keeps shrinking because we have more and more observations.
Multiple groups
Now if you have more than two groups, just change the underlying distribution from binomial to multinomial; the corresponding conjugate prior is then the Dirichlet. Everything else follows with similar changes.
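For the multi-group case, a minimal sketch of that multinomial + Dirichlet update could look like the one below (the uniform Dirichlet(1, ..., 1) prior and the one-group-label-per-line stdin format are just illustrative assumptions):
# Sketch: multinomial counts with a uniform Dirichlet prior.
# Reads one group label per line from stdin; labels are whatever your data uses.
import sys
from collections import Counter

counts = Counter()   # observed count per group
prior = 1.0          # Dirichlet pseudo-count per group (uniform prior)

for line in sys.stdin:
    group = line.strip()
    if group:
        counts[group] += 1

# Posterior is Dirichlet(prior + count_k); the posterior mean of group k's
# share is (count_k + prior) / (n + prior * K), where K is the number of groups.
total = sum(counts.values()) + prior * len(counts)
for group, count in sorted(counts.items()):
    print('{}: estimated share = {:.3f}'.format(group, (count + prior) / total))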
Further notes
You said you would like the approximate estimate in 3-4 seconds. In this case, you just sample a portion of your data and feed the output to the above script, e.g.,
head -n100000 YOURDATA.txt | python groupby.py
That's it. Hope it helps.
If it's reasonable to assume the data are IID (so there's no bias, such as certain types of records occurring only in certain parts of the stream), then just subsample and scale up the counts by the approximate size ratio.
Take, say, the first million records (this should be processable in a couple of seconds). Its size is x units (MB, chars, whatever you care about). The full stream has size y, where y >> x. Now, derive counts for whatever you care about from your sample x, and simply scale them by the factor y/x for approximate full counts. An example: you want to know roughly how many records have column 1 with value v in the full stream. The first million records have a file size of 100MB, while the total file size is 10GB. In the first million records, 150,000 of them have value v for column 1. So, you assume that in the full 10GB of records, you'll see 150,000 * (10,000,000,000 / 100,000,000) = 15,000,000 with that value. Any statistics you compute on the sample can simply be scaled by the same factor to produce an estimate.
If there is bias in the data such that certain records are more or less likely to be in certain places of the file then you should select your sample records at random (or evenly spaced intervals) from the total set. This is going to ensure an unbiased, representative sample, but probably incur a much greater I/O overhead.
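A rough sketch of the subsample-and-scale idea in Python (the file name big.csv, the column position, and the value v are placeholder assumptions; the sample size of one million lines follows the example above):
# Sketch: count matches in a fixed-size head sample, then scale by file size.
# 'big.csv', the column index, and the value b'v' are placeholders.
import os

SAMPLE_LINES = 1000000
path = 'big.csv'
total_bytes = os.path.getsize(path)

sample_bytes = 0
matches = 0
with open(path, 'rb') as f:
    for i, line in enumerate(f):
        if i >= SAMPLE_LINES:
            break
        sample_bytes += len(line)
        if line.split(b',')[0] == b'v':   # column 1 equals the value v
            matches += 1

scale = total_bytes / sample_bytes       # the y/x factor from the explanation above
print('Approximate total matches: {:.0f}'.format(matches * scale))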

Algorithm to smooth numbers with variable input time

I have an app that accepts integers at a variable rate every .25 to 2 seconds.
I'd like to output the data in a smoothed format for 3, 5 or 7 seconds depending on user input.
If the data always came in at the same rate, let's say every .25 seconds, then this would be easy. The variable rate is what confuses me.
Data might come in like this:
Time - Data
0.25 - 100
0.50 - 102
1.00 - 110
1.25 - 108
2.25 - 107
2.50 - 102
etc...
I'd like to display a 3 second rolling average every .25 seconds on my display.
The simplest form of doing this is to put each item into an array with a time stamp.
array.push([0.25, 100])
array.push([0.50, 102])
array.push([1.00, 110])
array.push([1.25, 108])
etc...
Then every .25 seconds I would read through the array, back to front, until I got to a time that was less than now() - rollingAverageTime. I would sum that and display it. I would then .Shift() the beginning of the array.
That seems not very efficient though. I was wondering if someone had a better way to do this.
Why don't you save the timestamp of the starting value and then accumulate the values and the number of samples until you get a timestamp that is >= startingTime + rollingAverageTime and then divide the accumulator by the number of samples taken?
EDIT:
If you want to preserve the number of samples, you can do it this way:
Take an accumulator, and for each input value add it to the accumulator and store the value and its timestamp in a shift register. At every cycle, compare the latest sample's timestamp with the oldest timestamp in the shift register plus the smoothing time; if it is equal or greater, subtract the oldest saved value from the accumulator, delete that entry from the shift register, and output the accumulator divided by the smoothing time. If you iterate, you obtain a rolling average with (I think) the least amount of computation per cycle:
a sum (to increment the accumulator)
a sum and a subtraction (to compare the timestamp)
a subtraction (from the accumulator)
a division (to calculate the average; done in a smart way this can be a shift right)
For a total of about 4 algebraic sums and a division (or shift).
EDIT:
To take the time since the last sample into account as a weighting factor, you can divide each value by the ratio between this time and the averaging time, and you obtain an already-weighted average without having to divide the accumulator.
I added this part because it doesn't add computational load, so you can implement it quite easily if you want to.
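A minimal Python sketch of this accumulator-plus-shift-register idea, using a deque as the shift register (the three-second window and the sample data are only examples, and it divides by the number of samples in the window rather than by the smoothing time, which gives a plain sample mean):
# Sketch: keep a running sum and evict samples older than the smoothing window.
from collections import deque

class RollingAverage:
    def __init__(self, window_seconds=3.0):
        self.window = window_seconds
        self.samples = deque()      # (timestamp, value) pairs, oldest first
        self.accumulator = 0.0

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        self.accumulator += value
        # drop samples that have fallen out of the window
        while self.samples and self.samples[0][0] <= timestamp - self.window:
            _, old_value = self.samples.popleft()
            self.accumulator -= old_value
        return self.accumulator / len(self.samples)

avg = RollingAverage(3.0)
for t, v in [(0.25, 100), (0.50, 102), (1.00, 110), (1.25, 108), (2.25, 107), (2.50, 102)]:
    print(t, avg.add(t, v))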
The answer from clabacchio has the basics right, but perhaps you need a bit more sophisticated answer.
Calculating the average:
0.25 - 100
0.50 - 102
1.00 - 110
In the above subset of the data what is the answer you want? You could use the mean of these numbers or you could do it in a weighted fashion. You could convert the data into:
0.50 - 0.25 = 0.25 ---- (100+102)/2 = 101
1.00 - 0.50 = 0.50 ---- (102+110)/2 = 106
Then you can take the weighted average of these values, weight being the time difference, and value being the average value.
The final answer = (0.25*101 + 0.5*106)/(0.25+0.5) ≈ 104.33.
Now coming to "moving" averages:
You can either use previous k values or previous k seconds worth of data. In both cases you can keep two sums: weighted sum and sum of weights.
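A rough sketch of that weighted approach, keeping just the two running sums (weighted sum and sum of weights) and evicting old samples; the three-second window and the weight-by-time-since-previous-sample reading of the idea are assumptions:
# Sketch: time-weighted rolling average over the last `window` seconds.
# Each value is weighted by the time elapsed since the previous sample.
from collections import deque

def weighted_rolling_average(samples, window=3.0):
    """samples: iterable of (timestamp, value); yields (timestamp, average)."""
    kept = deque()            # (timestamp, value, weight), oldest first
    weighted_sum = 0.0
    sum_of_weights = 0.0
    prev_t = None
    for t, v in samples:
        w = (t - prev_t) if prev_t is not None else 0.0
        prev_t = t
        if w > 0:
            kept.append((t, v, w))
            weighted_sum += v * w
            sum_of_weights += w
        # evict samples outside the window, updating both running sums
        while kept and kept[0][0] <= t - window:
            _, old_v, old_w = kept.popleft()
            weighted_sum -= old_v * old_w
            sum_of_weights -= old_w
        if sum_of_weights > 0:
            yield t, weighted_sum / sum_of_weights

data = [(0.25, 100), (0.50, 102), (1.00, 110), (1.25, 108), (2.25, 107), (2.50, 102)]
for t, avg in weighted_rolling_average(data):
    print(t, round(avg, 2))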
So... the worst case scenario is 4 readings per second over 7 seconds = 28 values in your array to process. That will be done in nanoseconds anyway, so not worth optimizing IMHO.

How to decide on weights?

For my work, I need some kind of algorithm with the following input and output:
Input: a set of dates (from the past). Output: a set of weights - one weight per given date (the sum of all weights = 1).
The basic idea is that the date closest to today should receive the highest weight, the second-closest date gets the second-highest weight, and so on...
Any ideas?
Thanks in advance!
First, for each date in your input set assign the amount of time between the date and today.
For example: the following date set {today, tomorrow, yesterday, a week from today} becomes {0, 1, 1, 7}. Formally: val[i] = abs(today - date[i]).
Second, invert the values so that their relative order is reversed. The simplest way of doing so would be: val[i] = 1/val[i].
Other suggestions:
val[i] = 1/val[i]^2
val[i] = 1/sqrt(val[i])
val[i] = 1/log(val[i])
The hardest and most important part is deciding how to invert the values. Think about what the nature of the weights should be: do you want noticeable differences between two far-away dates, or should two far-away dates have roughly equal weights? Do you want a date that is very close to today to have a dramatically larger weight or only a moderately larger weight?
Note that you should come up with an inverting procedure that cannot divide by zero. In the example above, dividing by val[i] for today's date results in division by zero. One way to avoid this is smoothing: the most trivial form is add-one smoothing, where you just add one to each value (so today becomes 1, tomorrow becomes 2, next week becomes 8, etc.).
Now the easiest part is to normalize the values so that they'll sum up to one.
sum = val[1] + val[2] + ... + val[n]
weight[i] = val[i]/sum for each i
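Putting those pieces together, here is a small sketch using add-one smoothing and the simple 1/val inversion (both choices, the function name, and the sample dates are just illustrative; swap in any of the other inversions suggested above):
# Sketch: inverse-distance weights with add-one smoothing, normalized to sum to 1.
from datetime import date

def date_weights(dates, today=None):
    today = today or date.today()
    vals = [1.0 / (abs((today - d).days) + 1) for d in dates]   # invert + add-one smoothing
    total = sum(vals)
    return [v / total for v in vals]

# the date closest to 'today' gets the largest weight
dates = [date(2023, 1, 1), date(2023, 6, 1), date(2023, 12, 1)]
print(date_weights(dates, today=date(2024, 1, 1)))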
Sort dates and remove dups
Assign values (maybe starting from the farthest date in steps of 10 or whatever you need - these values can be arbitrary, they just reflect order and distance)
Normalize weights to add up to 1
Executable pseudocode (tweakable):
#!/usr/bin/env python
import random, pprint
from operator import itemgetter
# for simplicity's sake dates are integers here ...
pivot_date = 1000
past_dates = set(random.sample(range(1, pivot_date), 5))
weights, stepping = [], 10
for date in sorted(past_dates):
    weights.append( (date, stepping) )
    stepping += 10
sum_of_steppings = sum([ itemgetter(1)(x) for x in weights ])
normalized = [ (d, (w / float(sum_of_steppings)) ) for d, w in weights ]
pprint.pprint(normalized)
# Example output
# The 'date' closest to 1000 (here: 889) has the highest weight,
# 703 the second highest, and so forth ...
# [(151, 0.06666666666666667),
# (425, 0.13333333333333333),
# (571, 0.2),
# (703, 0.26666666666666666),
# (889, 0.3333333333333333)]
How to weight: just compute the difference between each date and the current date
x(i) = abs(date(i) - current_date)
you can then use different expression to assign weights:
w(i) = 1/x(i)
w(i) = exp(-x(i))
w(i) = exp(-x(i)^2)
use a Gaussian distribution - more complicated, not recommended
Then use normalized weights: w(i)/sum(w(i)) so that the sum is 1.
(Note that the exponential function is commonly used by statisticians in survival analysis.)
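A quick sketch of the w(i) = exp(-x(i)) option with normalization (measuring x in days and rescaling it by an arbitrary scale factor before exponentiating are assumptions; without rescaling, dates more than a few days apart get vanishingly small weights):
# Sketch: exponential-decay weights w(i) = exp(-x(i)), normalized to sum to 1.
import math
from datetime import date

def exp_weights(dates, today=None, scale=30.0):
    today = today or date.today()
    x = [abs((today - d).days) / scale for d in dates]   # rescaled distance in days
    w = [math.exp(-xi) for xi in x]
    total = sum(w)
    return [wi / total for wi in w]

dates = [date(2024, 1, 10), date(2023, 12, 1), date(2023, 6, 15)]
print(exp_weights(dates, today=date(2024, 2, 1)))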
The first thing that comes to my mind is to use a geometric series:
http://en.wikipedia.org/wiki/Geometric_series
(1/2)+(1/4)+(1/8)+(1/16)+(1/32)+(1/64)+(1/128)+(1/256)..... sums to one.
Yesterday would be 1/2
2 days ago would be 1/4
and so on
Let i be the index of the i-th date, and let D0 be the first (earliest) date.
Let Ni be the difference in days between the i-th date and the first date D0.
Let D be the normalization factor (for example the sum of all Ni, so the weights add up to 1).
Assign each date a weight equal to Ni / D.
Convert the dates to yyyymmddhhmiss format (24-hour clock), sum all these values to get a total, divide each value by the total, and sort by the resulting weight.
declare @Data table
(
    Date bigint,
    Weight float
)
declare @sumTotal decimal(18,2)

insert into @Data (Date)
select top 100
    replace(replace(replace(convert(varchar, Datetime, 20), '-', ''), ':', ''), ' ', '')
from Dates

select @sumTotal = sum(Date)
from @Data

update @Data set
    Weight = Date / @sumTotal

select * from @Data order by 2 desc
