Principle Component Analysis with very big dimension of data

Principle Component Analysis with very big dimension of data - algorithm

I have a set of samples (vectors) each have a dimension about of M (10000) and the size of the set is also about N(10000), and i want to find first (with biggest eiegenvalues) 10 PC of this set. Due to the big dimension of samples i cannot calculate covariation matrix in reasonable time. Are there any methods to select PC without calculation of full cov matrix or methods that can effectively handle big dimension of data or something like this? So these methods should require less operations than O(M*M*N).

NIPALS -- Non-linear iterative partial least squares
see for example here: http://en.wikipedia.org/wiki/NIPALS

guys, maybe it could help somehow, i have found solution in family of EM-PCA methods (see for example this, http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/papers/PCA_RoweisEMPCA.pdf)

Related

determine optimal cut-off value for data (in matlab)

I realize this is an unspecific question (because I don't know a lot about the topic, please help me in this regard), that said here's the task I'd like to achieve:
Find a statistically sound algorithm to determine an optimal cut-off value to binarize a vector to filter out minimal values (i.e. get rid of). Here's code in matlab to visualize this problem:
randomdata=rand(1,100,1);
figure;plot(randomdata); %plot random data between 0 and 1
cutoff=0.5; %plot cut-off value
line(get(gca,'xlim'),[cutoff cutoff],'Color','red');
Thanks

You could try using Matlab's percentile function:
cutoff = prctile(randomdata,10);

In matlab, speed up cross correlation

I have a long time series with some repeating and similar looking signals in it (not entirely periodical). The length of the time series is about 60000 samples. To identify the signals, I take out one of them, having a length of around 1000 samples and move it along my timeseries data sample by sample, and compute cross-correlation coefficient (in Matlab: corrcoef). If this value is above some threshold, then there is a match.
But this is excruciatingly slow (using 'for loop' to move the window).
Is there a way to speed this up, or maybe there is already some mechanism in Matlab for this ?
Many thanks
Edited: added information, regarding using 'xcorr' instead:
If I use 'xcorr', or at least the way I have used it, I get the wrong picture. Looking at the data (first plot), there are two types of repeating signals. One marked by red rectangles, whereas the other and having much larger amplitudes (this is coherent noise) is marked by a black rectangle. I am interested in the first type. Second plot shows the signal I am looking for, blown up.
If I use 'xcorr', I get the third plot. As you see, 'xcorr' gives me the wrong signal (there is in fact high cross correlation between my signal and coherent noise).
But using "'corrcoef' and moving the window, I get the last plot which is the correct one.
There maybe a problem of normalization when using 'xcorr', but I don't know.

I can think of two ways to speed things up.
1) make your template 1024 elements long. Suddenly, correlation can be done using FFT, which is significantly faster than DFT or element-by-element multiplication for every position.
2) Ask yourself what it is about your template shape that you really care about. Do you really need the very high frequencies, or are you really after lower frequencies? If you could re-sample your template and signal so it no longer contains any frequencies you don't care about, it will make the processing very significantly faster. Steps to take would include
determine the highest frequency you care about
filter your data so higher frequencies are blocked
resample the resulting data at a lower sampling frequency
Now combine that with a template whose size is a power of 2
You might find this link interesting reading.
Let us know if any of the above helps!

Your problem seems like a textbook example of cross-correlation. Therefore, there's no good reason using any solution other than xcorr. A few technical comments:
xcorr assumes that the mean was removed from the two cross-correlated signals. Furthermore, by default it does not scale the signals' standard deviations. Both of these issues can be solved by z-scoring your two signals: c=xcorr(zscore(longSig,1),zscore(shortSig,1)); c=c/n; where n is the length of the shorter signal should produce results equivalent with your sliding window method.
xcorr's output is ordered according to lags, which can obtained as in a second output argument ([c,lags]=xcorr(..). Always plot xcorr results by plot(lags,c). I recommend trying a synthetic signal to verify that you understand how to interpret this chart.
xcorr's implementation already uses Discere Fourier Transform, so unless you have unusual conditions it will be a waste of time to code a frequency-domain cross-correlation again.
Finally, a comment about terminology: Correlating corresponding time points between two signals is plain correlation. That's what corrcoef does (it name stands for correlation coefficient, no 'cross-correlation' there). Cross-correlation is the result of shifting one of the signals and calculating the correlation coefficient for each lag.

Why do Perlin noise algorithms use lookup tables for random numbers

I have been researching noise algorithms for a library I wish to build, and have started with Perlin noise (more accurately Simplex noise, I want to work with arbitrary dimensions, or at least up to 6). Reading Simplex noise demystified, helped, but looking through the implementations at the end, i saw a big lookup table named perm.
In the code example, it seems to be used to generate indexes into a set of gradients, but the method seems odd. I assume that the table is just there to provide 1) determinism, and 2) a speed boost.
My question is, does the perm lookup table have any auxiliary meaning or purpose, or it there for the reasons above? Or another way, is there a specific reason that a pseudo-random number generator is not used, other than performance?

This is a bytes array. The range is 0 to 255.
You may randomize it if you want. You will probably want to seed the random... etc.

The perm table (and grad table) is used for optimization. They are just lookup tables of precomputed values. You are correct on both points 1) and 2).
Other than performance and portability, there is no reason you couldn't use a PRN.

Good Data Structure for Unit Conversion? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
StackOverflow crowd. I have a very open-ended software design question.
I've been looking for an elagant solution to this for a while and I was wondering if anyone here had some brilliant insight into the problem. Consider this to be like a data structures puzzle.
What I am trying to do is to create a unit converter that is capable of converting from any unit to any unit. Assume that the lexing and parsing is already done. A few simple examples:
Convert("days","hours") // Yields 24
Convert("revolutions", "degrees") // Yields 360
To make things a little more complicated, it must smoothly handle ambiguities between inputs:
Convert("minutes","hours") // Yields (1/60)
Convert("minutes","revolutions") // Yields (1/21600)
To make things even more fun, it must handle complex units without needing to enumerate all possibilities:
Convert("meters/second","kilometers/hour")
Convert("miles/hour","knots")
Convert("Newton meters","foot pounds")
Convert("Acre feet","meters^3")
There's no right or wrong answer, I'm looking for ideas on how to accomplish this. There's always a brute force solution, but I want something elegant that is simple and scalable.

I would start with a hashtable (or persisted lookup table - your choice how you implement) that carries unit conversions between as many pairs as you care to put in. If you put in every possible pair, then this is your brute force approach.
If you have only partial pairs, you can then do a search across the pairs you do have to find a combination. For example, let's say I have these two entries in my hashtable:
Feet|Inches|1/12
Inches|Centimeters|2.54
Now if I want to convert feet to centimeters, I have a simple graph search: vertices are Feet, Inches, and Centimeters, and edges are the 1/12 and 2.54 conversion factors. The solution in this case is the two edges 1/12, 2.54 (combined via multiplication, of course). You can get fancier with the graph parameters if you want to.
Another approach might be applying abductive reasoning - look into AI texts about algebraic problem solvers for this...
Edit: Addressing Compound Units
Simplified problem: convert "Acres" to "Meters^2"
In this case, the keys are understanding that we are talking about units of length, so why don't we insert a new column into the table for unit type, which can be "length" or "area". This will help performance even in the earlier cases as it gives you an easy column to pare down your search space.
Now the trick is to understand that length^2 = area. Why not add another lookup that stores this metadata:
Area|Length|Length|*
We couple this with the primary units table:
Meters|Feet|3.28|Length
Acres|Feet^2|43560|Area
So the algorithm goes:
Solution is m^2, which is m * m, which is a length * length.
Input is acres, which is an area.
Search the meta table for m, and find the length * length mapping. Note that in more complex examples there may be more than one valid mapping.
Append to the solution a conversion Acres->Feet^2.
Perform the original graph search for Feet->M.
Note that:
The algorithm won't know whether to use area or length as the basic domain in which to work. You can provide it hints, or let it search both spaces.
The meta table gets a little brute-force-ish.
The meta table will need to get smarter if you start mixing types (e.g. Resistance = Voltage / Current) or doing something really ugly and mixing unit systems (e.g. a FooArea = Meters * Feet).

Whatever structure you choose, and your choice may well be directed by your preferred implementation (OO ? functional ? DBMS table ?) I think you need to identify the structure of units themselves.
For example a measurement of 1000km/hr has several components:
a scalar magnitude, 1000;
a prefix, in this case kilo; and
a dimension, in this case L.T^(-1), that is, length divided by time.
Your modelling of measurements with units needs to capture at least this complexity.
As has already been suggested, you should establish what the base set of units you are going to use are, and the SI base units immediately suggest themselves. Your data structure(s) for modelling units would then be defined in terms of those base units. You might therefore define a table (thinking RDBMS here, but easily translatable into your preferred implementation) with entries such as:
unit name dimension conversion to base
foot Length 0.3048
gallon(UK) Length^3 4.546092 x 10^(-3)
kilowatt-hour Mass.Length^2.Time^(-2) 3.6 x 10^6
and so forth. You'll also need a table to translate prefixes (kilo-, nano-, mega-, mibi- etc) into multiplying factors, and a table of base units for each of the dimensions (ie meter is the base unit for Length, second for Time, etc). You'll also have to cope with units such as feet which are simply synonyms for other units.
The purpose of dimension is, of course, to ensure that your conversions and other operations (such as adding 2 feet to 3.5 metres) are commensurate.
And, for further reading, I suggest this book by Cardarelli.
EDIT in response to comments ...
I'm trying to veer away from suggesting (implementation-specific) solutions so I'll waffle a bit more. Compound units, such as kilowatt-hours, do pose a problem. One approach would be to tag measurements with multiple unit-expressions, such as kilowatt and hour, and a rule for combining them, in this case multiplication I could see this getting quite hairy quite quickly. It might be better to restrict the valid set of units to the most common ones in the domain of the application.
As to dealing with measurements in mixed units, well the purpose of defining the Dimension of a unit is to provide some means to ensure that only sensible operations can be applied to measurements-with-units. So, it's sensible to add two lengths (L+L) together, but not a length (L) and a volume (L^3). On the other hand it is sensible to divide a volume by a length (to get an area (L^2)). And it's kind of up to the application to determine if strange units such as kilowatt-hours per square metre are valid.
Finally, the book I link to does enumerate all the possibilities, I guess most sensible applications with units will implement only a selection.

I would start by choosing a standard unit for every quantity(eg. meters for length, newtons for force, etc) and then storing all the conversion factors to that unit in a table
then to go from days to hours, for example, you find the conversion factors for seconds per day and seconds per hour and divide them to find the answer.
for ambiguities, each unit could be associated with all the types of quantities it measures, and to determine which conversion to do, you would take the intersection of those two sets of types(and if you're left with 0 or more than one you would spit out an error)

I assume that you want to hold the data about conversion in some kind of triples (fstUnit, sndUnit, multiplier).
For single unit conversions:
Use some hash functions in O(1) to change the unit stucture to a number, and then put all multipliers in a matrix (you only have to remember the upper-right part, because the reflection is the same, but inversed).
For complex cases:
Example 1. m/s to km/h. You check (m,km) in the matrix, then the (s,h), then multiply the results.
Example 2. m^3 to km^3. You check (m,km) and take it to the third power.
Of course some errors, when types don't match like field and volume.

You can make a class for Units that takes the conversion factor and the exponents of all basic units (I'd suggest to use metric units for this, that makes your life easier). E.g. in Pseudo-Java:
public class Unit {
public Unit(double factor, int meterExp, int secondExp, int kilogrammExp ... [other base units]) {
...
}
}
//you need the speed in km/h (1 m/s is 3.6 km/h):
Unit kmPerH = new Unit(1 / 3.6, 1, -1, 0, ...)

I would have a table with these fields:
conversionID
fromUnit
toUnit
multiplier
and however many rows you need to store all the conversions you want to support
If you want to support a multi-step process (degrees F to C), you'd need a one-to-many relationship with the units table, say called conversionStep, with fields like
conversionID
sequence
operator
value
If you want to store one set of conversions but support multi-step conversions, like storing
Feet|Inches|1/12
Inches|Centimeters|2.54
and supporting converting from Feet to Centimeters, I would store a conversion plan in another table, like
conversionPlanID
startUnits
endUnits
via
your row would look like
1 | feet | centimeters | inches

Aging a dataset

For reasons I'd rather not go into, I need to filter a set of values to reduce jitter. To that end, I need to be able to average a list of numbers, with the most recent having the greatest effect, and the least recent having the smallest effect. I'm using a sample size of 10, but that could easily change at some point.
Are there any reasonably simple aging algorithms that I can apply here?

Have a look at the exponential smoothing. Fairly simple, and might be sufficient for your needs. Basically recent observations are given relatively more weight than the older ones.
Also (depending on the application) you may want to look at various reinforcement learning techniques, for example Q-Learning or TD-Learning or generally speaking any method involving the discount.

I ran into something similar in an embedded control application.
The simplest option that I came across was a 3/4 filter. This gets applied continuously over the entire data set:
current_value = (3*current_value + new_value)/4
I eventually decided to go with a 16-tap FIR filter instead:
Overview
FIR FAQ
Wikipedia article

Many weighted averaging algorithms could be used.
For example, for items I(n) for n = 1 to N in sequence (newest to oldest):
(SUM(I(n) * (N + 1 - n)) / SUM(n)

It's not exactly clear from the question whether you're dealing with fixed-length
data or if data is continuously coming in. A nice physical model for the latter
would be a low pass filter, using a capacitor and a resistor (R and C). Assuming
your data is equidistantly spaced in time (is it?), this leads to an update prescription
U_aged[n+1] = U_aged[n] + deltat/Tau (U_raw[n+1] - U_aged[n])
where Tau is the time constant of the filter. In the limit of zero deltat, this
gives an exponential decay (old values will be reduced to 1/e of their value after
time Tau). In an implementation, you only need to keep a running weighted sum U_aged.
deltat would be 1 and Tau would specify the 'aging constant', the number of steps
it takes to reduce a sample's contribution to 1/e.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio