So my friend and I ran an experiment two weeks ago and we've encountered something a bit weird. I should preface this by saying that I don't really program much, so sorry if this is a dumb question or seems like a waste of time.
Let's say we have data set A and data set B (the experiment itself doesn't matter). All of the times are given in fractional days. The format of the data should match, but the times at which the data points were recorded aren't necessarily aligned between the two sets (they each have their own time vectors). For example, the measurements for data set A are recorded every 100 ms, whereas the instrument for data set B averages its data and only records a point once every minute or so. My problem is aligning the times for the different types of data collected. For data set A, the data and the time vectors have a length of 25042 (25042x1 double). Data set B and its time vector have a length of 828 (828x1 double).
It comes down to the fact that I need to look at data set B and find the times that correspond to peaks in the data. Those times are the only times of interest to me in data set A, which is why I need a way of aligning the time vectors/series and thus the data. If an exact solution isn't possible, even an approximation would be a great help. Does anybody have any ideas?
So you have two time vectors, tA and tB, and a vector of indices bIndices that contains the known peak(s); these correspond to the time(s) tB(bIndices(:)). You need to loop through bIndices, and for each peak index b search through the entire vector tA from the start until you reach a time that is greater than or equal to tB(b):
bIndices = [101, 403,...]; %Vector containing the indices of the peaks in 'tB'
aIndices = []; %Allocate an empty vector for the matching indices into 'tA'
A = []; %Allocate an empty vector for the matched times from 'tA'
B = []; %Allocate an empty vector for the peak times from 'tB'
for b = bIndices %Cycle through all peak indices one at a time, setting 'b' to the current single index
    B = [B tB(b)]; %Retrieve the actual peak time using the index and concatenate it
    for a = 1:length(tA) %Loop through the entire time vector tA
        if (tA(a) >= tB(b)) %Time is greater than or equal
            %Concatenate the newly found index 'a' from tA to the vector aIndices:
            aIndices = [aIndices a];
            %Concatenate the newly found time 'tA(a)' to the time vector A:
            A = [A tA(a)]; %Or if you want the actual time
            break; %Exit the inner loop and search for the next index 'b'
        end
    end
end
At the end, A stores the times from tA that match the peak times in B (approximately; each will be equal to or slightly later than its counterpart). A - B gives the discrepancy between the two (both vectors should be the same length); it should be pretty small, and any zeros mean the two aligned perfectly at those instants. aIndices holds the corresponding indices into tA at the desired time(s). I didn't actually test this code, but hopefully the logic is sound.
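If the nested loops turn out to be slow, the same search can be expressed more compactly. This is an untested sketch of my own (not part of the answer above); it assumes every peak time in tB falls within the range covered by tA, otherwise find returns empty and arrayfun errors out:
aIndices = arrayfun(@(t) find(tA >= t, 1, 'first'), tB(bIndices)); %For each peak time, first index of tA at or after it
A = tA(aIndices); %The matched times in tA
B = tB(bIndices); %The peak times in tB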
I have tried doing it using arrayfun() as follows, as answered on Stack Overflow:
prob_y = arrayfun(@(x) length(find(y == x)), unique(y)) / length(y);
The problem with this is that I have to find the occurrences of 40 different values in a vector, so I would have to call arrayfun() 40 times. It runs quickly for the first value, but from the second value onwards it takes a very long time, and my vector is also huge. Can someone please suggest an alternative that would save time?
You could use hist and unique together to do this efficiently.
[a, b] = hist(y, unique(y));
a = a/length(y);
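For example, with a small made-up vector y (a quick sanity check, not from the original post):
y = [1 2 2 3 3 3];
[a, b] = hist(y, unique(y)); %a = [1 2 3] (the counts), b = [1 2 3] (the unique values)
a = a / length(y); %a = [0.1667 0.3333 0.5], the probability of each value in b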
I've collected some data from a potentiometer using an Arduino microcontroller. Here is the data which was sampled at 500 Hz (it's a lot of data):
http://cl.ly/3D3s1U3m1R1T?_ga=1.178935463.2093327149.1426657579
If you zoom in you can see that essentially I have a pot that just rotates back and forth, i.e., I should see a linear increase and then a linear decrease. While the general shape of the data affirms this, almost every single time there are some really annoying (sometimes surprisingly wide) spikes that get in the way of a really nice shape. Is there any way I can make some type of algorithm or filter which fixes this? I tried a median filter and using percentiles, but neither worked. I feel like it shouldn't be that hard, because I can clearly see what it should look like (basically the minimum of where the spikes occur), but for some reason everything I try fails miserably or at least loses the integrity of the original data.
I'd appreciate any help I can get with this.
There are many ways to tackle your problem; however, none of them will ever be perfect. I'll give you two approaches here.
1) Moving average (low pass filter)
In MATLAB, one easy way to low-pass filter your data without having to explicitly use the FFT is to use the filter function (available in the base package; you do not need any specific toolbox).
You create a kernel for the filter and apply it twice (once in each direction) to cancel the phase shift introduced. This is in effect a "Moving Average" filter with zero phase shift.
The size (length) of the kernel will control how heavy the averaging process will be.
So for example, 2 different filter lengths:
n = 100 ; %// length of the filter
kernel = ones(1,n)./n ;
q1 = filter( kernel , 1 , fliplr(p) ) ; %// apply the filter in one direction
q1 = filter( kernel , 1 , fliplr(q1) ) ; %// re-apply in the opposite direction to cancel phase shift
n = 500 ; %// length of the filter
kernel = ones(1,n)./n ;
q2 = filter( kernel , 1 , fliplr(filter( kernel , 1 , fliplr(p) )) ) ; %// same as above, in one line
This will produce, on your data:
As you can see, each filter size has its pros and cons. The more you filter, the more of your spikes you cancel, but the more you deform your original signal. It is up to you to find your optimum settings.
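As a side note (my own addition, assuming you happen to have the Signal Processing Toolbox), the filtfilt function performs essentially the same forward-backward, zero-phase filtering in a single call:
q1 = filtfilt( kernel , 1 , p ) ; %// zero-phase moving average in one call (Signal Processing Toolbox)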
2) Search for derivative anomalies
This is a different approach. You can observe in your signal that the spikes are mostly sudden: the value of your signal changes rapidly, and luckily faster than the "normal" rate of change of your desired signal. That means you can calculate the derivative of your signal and identify all the spikes (the derivative will be much higher there than for the rest of the curve).
Since this only identifies the "beginning" and "end" of the spikes (not the occasional plateau in the middle), we will need to extend the zone identified as faulty by this method a little.
Once the faulty data have been identified, you just discard these points and re-interpolate your curve over the original interval (using the points you have left as support).
%% // Method 2 - Reinterpolation of cancelled data
%// OPTIONAL: slightly smooth the initial data to get a cleaner derivative
n = 10 ; kernel = ones(1,n)./n ;
ps = filter( kernel , 1 , fliplr(filter( kernel , 1 , fliplr(p) )) ) ;
%// Identify the derivative anomalies (too high or too low)
dp = [0 diff(ps)] ; %// simplest form of derivative (just the difference between consecutive points)
dpos = dp >= (std(dp)/2) ; %// positive derivatives above a certain threshold (I chose the STD, but you could choose something else)
dneg = dp <= -(std(dp)/2) ; %// negative derivatives below the threshold
ixbad = dpos | dneg ; %// prepare a global vector of indices to cancel
%// This will cancel "nPtsOut" points on the RIGHT of each POSITIVE derivative
%// point identified, and "nPtsOut" points on the LEFT of each NEGATIVE one
nPtsOut = 100 ; %// decide how many points after/before spikes we are going to cancel
for ii = 1:nPtsOut
    ixbad = ixbad | circshift( dpos , [0 ii]) | circshift( dneg , [0 -ii]) ;
end
%// Now we just reinterpolate the missing gaps
xp = 1:length(p) ; %// prepare a base for the reinterpolation
p_fixed = interp1( xp(~ixbad) , p(~ixbad) , xp ) ; %// do the reinterpolation (avoid calling this "pi", which would shadow the built-in constant)
This will produce:
The red signal is the result of the moving average above, and the green signal is the result of the derivative approach.
There are also settings you can change to adjust this result (the threshold for the derivative, 'nPtsOut' and even the initial smoothing of the data).
As you can see, for the same amount of spike cancellation as the moving average method, it respects the integrity of the initial data a bit more. However, it is not perfect either, and some intervals will still be deformed. But as I said at the beginning, no method is ever perfect.
It seems you have large spikes near the maximum and minimum points of your pot. You could, for instance, limit the range of your valid data to between 200 and 300.
Another option is a first-order low-pass filter like this one:
alpha = 0.01; % parameter to tune!
p_filtered = zeros(size(p)); % preallocate the output
p_filtered(1) = p(1);
for i = 2:length(p)
    p_filtered(i) = alpha*p(i) + (1-alpha)*p_filtered(i-1);
end
The noise spikes are being caused by the pot's wiper bouncing along the resistive track as the knob is turned. This is a common problem with pots. In future, you should look at adding a 0.1 uF capacitor to the pot's output, which should fix the problem.
With your current data the simplest option is to just do a simple moving average and visually tune the number of samples averaged until the spikes are sufficiently suppressed while not affecting the underlying data. Note that a moving average is just a low pass filter with a sinc frequency response.
The normal way to post-process this sort of data is to do an FFT (using an appropriate windowing function), zero out the noise values above the signal of interest and then take an inverse FFT. This is also just lowpass filtering (with a sinc * windowing function weighted moving average), but you use the insight provided by the FFT to select your cutoff frequency. If you’re not comfortable with the maths involved in doing this then just go with the simple moving average filter. It should be fine for your needs.
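To make the FFT route a bit more concrete, here is a rough MATLAB sketch of a brick-wall low-pass (my own illustration, not the poster's code). It skips the windowing step mentioned above, and the cutoff value is a placeholder you would choose by inspecting the spectrum:
fs = 500; % sampling rate of the data (Hz)
cutoff = 5; % placeholder cutoff frequency (Hz), chosen by looking at the FFT
P = fft(p); % signal in the frequency domain
f = (0:numel(p)-1) * fs / numel(p); % frequency of each FFT bin
keep = (f <= cutoff) | (f >= fs - cutoff); % low frequencies plus their mirrored (conjugate) copies
P(~keep) = 0; % zero out everything above the cutoff
p_lp = real(ifft(P)); % back to the time domain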
I have two arrays of data:
I would like to align these similar graphs together (by adding an offset to either array):
Essentially what I want is the most constructive interference, as shown when two waves together produce the same wave but with larger amplitude:
This is also the same as finding the most destructive interference, but one of the arrays must be inverted as shown:
Notice that the second wave is inverted (peaks become troughs and vice versa).
The actual data will not only consist of one major and one minor peak and trough, but of many, and there might not be any noticeable spikes. I have made the data in the diagram simpler to show how I would like the data aligned.
I was thinking about a few loops, such as:
biggest = 0
loop from -10 to 10 as offset
    count = 0
    loop through array1 as i
        count += array1[i] + array2[i - offset]
    replace biggest with count (and remember this offset) if count / sizeof(array1) > biggest
However, that requires looping through every offset and, for each one, looping through both arrays. My real arrays are extremely large, and this would take too long.
How would I go about determining the offset required to match data1 with data2?
JSFiddle (note that this is language-agnostic and I would like to understand the algorithm more so than the actual code)
Look at Convolution and Cross-correlation and their computation using the Fast Fourier Transform. That's how it is done in real-life applications.
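For example, in MATLAB (a sketch of mine, using MATLAB only because the rest of this thread does; it assumes both signals are the same length and that xcorr from the Signal Processing Toolbox is available):
[c, lags] = xcorr(data2, data1); % cross-correlation of data2 against data1 at every possible lag
[~, k] = max(c); % lag at which the two signals line up best
offset = lags(k); % positive means data2 lags data1; delay data1 by this many samples to align them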
If (and only if) your data has very recognizable spikes, you could do what a human would do: match the spikes: Fiddle
The important part is the function matchData().
An improved version would search for the N largest maxima and minima, then calculate an average offset, along the lines of the sketch below.
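One rough way to do that last step in MATLAB (my own sketch; it assumes the N largest samples in each array really do belong to the same N spikes, in the same order; in practice you would enforce a minimum spacing between picked peaks, e.g. with findpeaks):
N = 5; % number of spikes to match (arbitrary choice)
[~, i1] = sort(data1, 'descend'); % indices of data1 sorted by value
[~, i2] = sort(data2, 'descend');
i1 = sort(i1(1:N)); % positions of the N largest samples, back in time order
i2 = sort(i2(1:N));
offset = round(mean(i2 - i1)); % average index shift between the matched spikes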
I am trying to figure out the system design behind Google Trends (or any other large-scale trend feature, like Twitter's trending topics).
Challenges:
Need to process a large amount of data to calculate trends.
Filtering support: by time, region, category, etc.
Need a way to store the data for archiving/offline processing. Filtering support might require multi-dimensional storage.
This is my assumption (I have zero practical experience with MapReduce/NoSQL technologies):
Each search item from a user will maintain a set of attributes that will be stored and eventually processed.
We would also maintain a list of searches by timestamp, region of search, category, etc.
Example:
Searching for the term Kurt Cobain:
Kurt -> (timestamp, region of search origin, category, etc.)
Cobain -> (timestamp, region of search origin, category, etc.)
Question:
How do they efficiently calculate the frequency of a search term?
In other words, given a large data set, how do they find the top 10 most frequent items in a distributed, scalable manner?
Well... finding the top K terms is not really a big problem. One of the key ideas in this field has been the idea of "stream processing", i.e., performing the operation in a single pass over the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So, people came up with the idea of "count sketches" (you can read more on the count-min sketch page on Wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:
h1(x) = 0 ... n-1, uniformly at random
h2(x) = +1 or -1, each with probability 0.5
You then do A[h1(x)] += h2(x). The key observation is that since each value randomly hashes to +/-1, E[ A[h1(x)] * h2(x) ] = count(x), where E is the expected value of the expression, and count is the number of times x appeared in the stream.
Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining a large set of hash counters and taking the average or the minimum count from each set.
With this sketch data structure, you are able to get an approximate frequency for each item. Now, you simply maintain a list of the 10 items with the largest frequency estimates so far, and at the end you will have your list.
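To make the idea concrete, here is a tiny MATLAB sketch of a single row of such a sketch, applied to the example stream above (my own illustration; the width n and the random lookup tables standing in for real hash functions are made up, and a single row gives a noisy estimate, which is exactly why you keep several independent rows as described above):
stream = 'ABKACABBCDFGABFHIBACFIUXAC'; % the example stream from above, as a char array
n = 16; % width of the sketch (made up)
h1 = randi(n, 1, 256); % "hash" 1: maps each possible character code to a bucket 1..n
h2 = 2*randi(2, 1, 256) - 3; % "hash" 2: maps each character code to +1 or -1
A = zeros(1, n); % the sketch itself
for x = double(stream) % single pass over the stream (characters as numeric codes)
    A(h1(x)) = A(h1(x)) + h2(x);
end
est = A(h1(double('A'))) * h2(double('A')); % noisy estimate of count('A'); the true count here is 6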
How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you, Google, or whoever).
But many of the tools and much of the research are out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real time.
Also check out some of the Big Data and Web Science conferences like KDD or WSDM, as well as papers put out by Google Research.
How to design such a system is challenging, with no single correct answer, but the tools and research are available to get you started.
Summary
As Ted Jaspers wisely pointed out, the methodology I described in the original proposal back in 2012 is actually a special case of an exponential moving average. The beauty of this approach is that it can be calculated recursively, meaning you only need to store a single popularity value with each object and then you can recursively adjust this value when an event occurs. There's no need to record every event.
This single popularity value represents all past events (within the limits of the data type being used), but older events begin to matter exponentially less as new events are factored in. This algorithm will adapt to different time scales and will respond to varying traffic volumes. Each time an event occurs, the new popularity value can be calculated using the following formula:
(a * t) + ((1 - a) * p)
a — coefficient between 0 and 1 (higher values discount older events faster)
t — current timestamp
p — current popularity value (e.g. stored in a database)
Reasonable values for a will depend on your application. A good starting place is a=2/(N+1), where N is the number of events that should significantly affect the outcome. For example, on a low-traffic website where the event is a page view, you might expect hundreds of page views over a period of a few days. Choosing N=100 (a≈0.02) would be a reasonable choice. For a high-traffic website, you might expect millions of page views over a period of a few days, in which case N=1000000 (a≈0.000002) would be more reasonable. The value for a will likely need to be gradually adjusted over time.
To illustrate how simple this popularity algorithm is, here's an example of how it can be implemented in Craft CMS in 2 lines of Twig markup:
{% set popularity = (0.02 * date().timestamp) + (0.98 * entry.popularity) %}
{% do entry.setFieldValue("popularity", popularity) %}
Notice that there's no need to create new database tables or store endless event records in order to calculate popularity.
One caveat to keep in mind is that exponential moving averages have a spin-up interval, so it takes a few recursions before the value can be considered accurate. This means the initial condition is important. For example, if the popularity of a new item is initialized using the current timestamp, the item immediately becomes the most popular item in the entire set before eventually settling down into a more accurate position. This might be desirable if you want to promote new content. Alternatively, you may want content to work its way up from the bottom, in which case you could initialize it with the timestamp of when the application was first launched. You could also find a happy medium by initializing the value with an average of all popularity values in the database, so it starts out right in the middle.
Original Proposal
There are plenty of suggested algorithms for calculating popularity based on an item's age and the number of votes, clicks, or purchases an item receives. However, the more robust methods I've seen often require overly complex calculations and multiple stored values which clutter the database. I've been contemplating an extremely simple algorithm that doesn't require storing any variables (other than the popularity value itself) and requires only one simple calculation. It's ridiculously simple:
p = (p + t) / 2
Here, p is the popularity value stored in the database and t is the current timestamp. When an item is first created, p must be initialized. There are two possible initialization methods:
Initialize p with the current timestamp t
Initialize p with the average of all p values in the database
Note that initialization method (1) gives recently added items a clear advantage over historical items, thus adding an element of relevance. On the other hand, initialization method (2) treats new items as equals when compared to historical items.
Let's say you use initialization method (1) and initialize p with the current timestamp. When the item receives its first vote, p becomes the average of the creation time and the vote time. Thus, the popularity value p still represents a valid timestamp (assuming you round to the nearest integer), but the actual time it represents is abstracted.
With this method, only one simple calculation is required and only one value needs to be stored in the database (p). This method also prevents runaway values, since a given item's popularity can never exceed the current time.
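For concreteness, here is a tiny sketch of the update loop in MATLAB (the vote timestamps are made-up values, just to show the recursion):
votes = [1000, 2000, 3500, 4000]; % hypothetical vote timestamps
p = votes(1); % initialization method (1): start from the first timestamp
for t = votes(2:end)
    p = (p + t) / 2; % the proposed update, one calculation per vote
end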
An example of the algorithm at work over a period of 1 day: http://jsfiddle.net/q2UCn/
An example of the algorithm at work over a period of 1 year: http://jsfiddle.net/tWU9y/
If you expect votes to steadily stream in at sub-second intervals, then you will need to use a microsecond timestamp, such as the PHP microtime() function. Otherwise, a standard UNIX timestamp will work, such as the PHP time() function.
Now for my question: do you see any major flaws with this approach?
I think this is a very good approach, given its simplicity. A very interesting result.
I made a quick set of calculations and found that this algorithm does seem to understand what "popularity" means. Its problem is that it has a clear tendency to favor recent votes like this:
Imagine we take the time and break it into discrete timestamp values ranging from 100 to 1000. Assume that at t=100 both A and B (two items) have the same P = 100.
A gets voted on 7 times, at t = 200, 300, 400, 500, 600, 700 and 800,
resulting in a final Pa(800) ≈ 700.
B gets voted on 4 times, at t = 300, 500, 700 and 900,
resulting in a final Pb(900) ≈ 712.
When t=1000 comes, both A and B receive votes, so:
Pa(1000) = 850 with 8 votes
Pb(1000) = 856 with 5 votes
Why? Because the algorithm allows an item to quickly beat historical leaders if it receives more recent votes (even if the item has fewer votes in total).
EDIT INCLUDING SIMULATION
The OP created a nice fiddle that I changed to get the following results:
http://jsfiddle.net/wBV2c/6/
Item A receives one vote each day from 1970 till 2012 (15339 votes)
Item B receives one vote each month from Jan to Jul 2012 (7 votes)
The result: B is more popular than A.
The proposed algorithm is a good approach, and is a special case of an Exponential Moving Average where alpha=0.5:
p = alpha*p + (1-alpha)*t = 0.5*p + 0.5*t = (p+t)/2 (for alpha = 0.5)
A way to counteract the fact that the proposed solution with alpha=0.5 tends to favor recent votes (as noted by daniloquio) is to choose higher values for alpha (e.g. 0.9 or 0.99). Note, however, that applying this to the test case proposed by daniloquio does not work, because when alpha increases the algorithm needs more 'time' to settle (so the arrays would have to be longer, which is often true in real applications).
Thus:
for alpha=0.9 the algorithm averages approximately the last 10 values
for alpha=0.99 the algorithm averages approximately the last 100 values
for alpha=0.999 the algorithm averages approximately the last 1000 values
etc.
I see one problem: only the last ~24 votes count.
p_(i+1) = (p_i + t_(i+1)) / 2
For two votes we have
p2 = (p1 + t2) / 2 = ((p0 + t1) /2 + t2 ) / 2 = p0/4 + t1/4 + t2/2
Expanding that for 32 votes gives:
p32 = p0*2^-32 + t1*2^-32 + t2*2^-31 + t3*2^-30 + ... + t32*2^-1
So for signed 32-bit timestamp values, t1 has essentially no effect on the result: because t1 gets divided by 2^32, it contributes practically nothing to p32.
If we have two items A and B, no matter how big the difference between them is, if they both get the same 32 votes they will end up with the same popularity. So your history only goes back 32 votes: there is no difference between 2032 votes and 32 votes if the last 32 votes are the same.
If the difference is less than a day, they will be equal after 17 votes.
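A quick numerical illustration of that point in MATLAB (the timestamps are made up):
p_a = 0; % item A starts with the oldest possible popularity value
p_b = 2^31 - 1; % item B starts with the most recent possible signed 32-bit value
votes = 1.4e9 + (1:32) * 3600; % the same 32 hourly votes applied to both items
for t = votes
    p_a = (p_a + t) / 2;
    p_b = (p_b + t) / 2;
end
p_b - p_a % about (2^31 - 1) / 2^32, i.e. well under one second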
The flaw is that something with 100 votes is usually more meaningful than something with only one recent vote. However, it isn't hard to come up with variants of your scheme that work reasonably well.
I don't think that the above-discussed logic is going to work.
p_(i+1) = (p_i + t) / 2
Article A gets viewed at timestamps 70, 80 and 90, giving popularity(Article A) = 82.5.
Article B gets viewed at timestamps 50, 60, 70, 80 and 90, giving popularity(Article B) = 80.625.
In this case, the popularity of Article B should have been higher: firstly, Article B was viewed as recently as Article A, and secondly, it was also viewed more times than Article A.