How to plot variables with possibly wildly varying values?

I want to build an application that does the equivalent of running lsof in a loop (possibly changing its output format, because string processing may not be fast enough for real time) and then associates each line (entry) with the iterations in which it was present. I will call these iterations frames from here on, since that makes things easier to follow. My intention is that showing when files are held open by applications can reveal something about their structure without having a big impact on their execution, which is often a problem. One problem I have is processing the output, which would be a table relating frames to entries: I already anticipate wildly varying entry lengths. This runs into the usual problem of drawing geometry at very different scales: the small items become vanishingly small while the big ones become huge, and fragmentation makes it even worse. So my question is whether plotting libraries deal with this problem, and how they do it.

The easiest and most well-established technique for showing both small and large values in reasonable detail is a logarithmic scale. Instead of plotting raw values, plot their logarithms. This is notoriously problematic if you can have zero or even negative values, but as I understand your situation, all your lengths would be strictly positive, so this should work.
Another statistical solution you could apply is to plot ranks instead of raw values. Take all the observed values, and put them in a sorted list. When plotting any single data point, instead of plotting the value itself you look up that value in the list of values (possibly using binary search since it's a sorted list) then plot the index at which you found the value.
This is a monotonic transformation, so small values map to small indices and big values to big indices. On the other hand, it completely discards the actual magnitude; only the relative comparisons matter.
If this is too radical, you could consider using it as an ingredient for something more tuneable. You could experiment with a linear combination, i.e. plot
a*x + b*log(x) + c*rank(x)
then tweak a, b and c till the result looks pleasing.
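The three ingredients above can be sketched in a few lines of Python. This is only an illustration: the helper name `make_blend` and the sample lengths are made up, and the rank is taken as the index returned by a binary search over the sorted observations, as described above.

```python
import bisect
import math

def make_blend(values, a=0.0, b=1.0, c=0.0):
    """Return (blend, rank) where blend(x) = a*x + b*log(x) + c*rank(x).

    Assumes every value is strictly positive, so log(x) is defined.
    """
    ordered = sorted(values)

    def rank(x):
        # Index of x in the sorted observations, found by binary search.
        return bisect.bisect_left(ordered, x)

    def blend(x):
        return a * x + b * math.log(x) + c * rank(x)

    return blend, rank

# Made-up entry lengths spanning six orders of magnitude.
lengths = [1, 3, 3, 40, 2500, 1_000_000]
blend, rank = make_blend(lengths, a=0.0, b=1.0, c=1.0)
for x in lengths:
    print(x, rank(x), round(blend(x), 2))
```

With non-negative a, b and c the blend stays monotonic, so the ordering of the data points is preserved whatever weights you settle on.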


What is the fastest way to intersect two large set of ids

The Problem
On a server, I host ids in a json file. From clients, I need to mandate the server to intersect and sometimes negate these ids (the ids never travel to the client even though the client instructs the server its operations to perform).
I typically have 1000's of ids, often have 100,000's of ids, and have a maximum of 56,000,000 of them, where each value is unique and between -100,000,000 and +100,000,000.
These ids files are stable and do not change (so it is possible to generate a different representation for it that is better adapted for the calculations if needed).
Sample ids
Largest file sizes
I need an algorithm that will intersect ids in the sub-second range for most cases. What would you suggest? I code in java, but do not limit myself to java for the resolution of this problem (I could use JNI to bridge to native language).
Potential solutions to consider
Although you could not limit yourselves to the following list of broad considerations for solutions, here is a list of what I internally debated to resolve the situation.
Neural-Network pre-qualifier: Train a neural-network for each ids list that accepts another list of ids to score its intersection potential (0 means definitely no intersection, 1 means definitely there is an intersection). Since neural networks are good and efficient at pattern recognition, I am thinking of pre-qualifying a more time-consuming algorithm behind it.
Assembly-language: On a Linux server, code an assembly module that implements such an algorithm. I know that assembly is a mess to maintain and code, but sometimes one needs the speed of a highly optimized algorithm without the overhead of a higher-level compiler. Maybe this use-case is simple enough to benefit from an assembly language routine executed directly on the Linux server (and then I'd always pay attention to stick with the same processor to avoid having to re-write this too often)? Or, alternatively, maybe C would be close enough to assembly to produce clean and optimized assembly code without the overhead of maintaining assembly code.
Images and GPU: GPU and image processing could be used and instead of comparing ids, I could BITAND images. That is, I create a B&W image of each ids list. Since each id has a unique value between -100,000,000 and +100,000,000 (where a maximum of 56,000,000 of them are used), the image would be mostly black, but the pixel would become white if the corresponding id is set. Then, instead of keeping the list of ids, I'd keep the images, and do a BITAND operation on both images to intersect them. This may be fast indeed, but translating the resulting image back to ids may then be the bottleneck. Also, each image could be significantly large (maybe too large for this to be a viable solution). A 200,000,000-bit sequence is about 25 MB (roughly 23.8 MiB) each, so just loading this in memory is quite demanding.
String-matching algorithms: String comparisons have many adapted algorithms that are typically extremely efficient at their task. Create a binary file for each ids set. Each id would be 4 bytes long. The corresponding binary file would have each and every id sequenced as their 4 bytes equivalent into it. The algorithm could then be to process the smallest file to match each 4 bytes sequence as a string into the other file.
Am I missing anything? Any other potential solution? Could any of these approaches be worth diving into them?
I did not yet try anything as I want to secure a strategy before I invest what I believe will be a significant amount of time into this.
EDIT #1:
Could the solution be a map of hashes for each sector in the list? If the information is structured so that each id resides under its corresponding hash key, then the smaller ids set could be scanned sequentially: matching an id against the larger ids set would first require hashing the value to find the key, and then sequentially matching it against the ids stored under that key.
This should make the algorithm O(n) in time, and since I'd pick the smallest ids set as the one to scan sequentially, n is small. Does that make sense? Is that the solution?
Something like this (where the H entry is the hash):
{
"H780" : [ 45902780, 46062780, -42912780, -19812780, 25323780, 40572780, -30131780, 60266780, -26203780, 46152780, 67216780, 71666780, -67146780, 46162780, 67226780, 67781780, -47021780, 46122780, 19973780, 22113780, 67876780, 42692780, -18473780, 30993780, 67711780, 67791780, -44036780, -45904780, -42142780, 18703780, 60276780, 46182780, 63600780, 63680780, -70486780, -68290780, -18493780, -68210780, 67731780, 46092780, 63450780, 30074780, 24772780, -26483780, 68371780, -18483780, 18723780, -29834780, 46202780, 67821780, 29594780, 46082780, 44632780, -68406780, -68310780, -44056780, 67751780, 45912780, 40842780, 44642780, 18743780, -68220780, -44066780, 46142780, -26193780, 67681780, 46222780, 67761780 ],
"H782" : [ 27343782, 67456782, 18693782, 43322782, -37832782, 46152782, 19113782, -68411782, 18763782, 67466782, -68400782, -68320782, 34031782, 45056782, -26713782, -61776782, 67791782, 44176782, -44096782, 34041782, -39324782, -21873782, 67961782, 18703782, 44186782, -31143782, 67721782, -68340782, 36103782, 19143782, 19223782, 31711782, 66350782, 43362782, 18733782, -29233782, 67811782, -44076782, -19623782, -68290782, 31721782, 19233782, 65726782, 27313782, 43352782, -68280782, 67346782, -44086782, 67741782, -19203782, -19363782, 29583782, 67911782, 67751782, 26663782, -67910782, 19213782, 45992782, -17201782, 43372782, -19992782, -44066782, 46142782, 29993782 ],
"H540" : [...
You can convert each file (list of ids) into a bit-array of length 200_000_001, where the bit at index j is set if the list contains the value j-100_000_000. This is possible because the range of id values is fixed and small.
Then you can simply use bitwise AND and NOT operations to intersect and negate lists of ids. Depending on the language and libraries used, this may require operating element-wise: iterating over the arrays and applying the corresponding operation at each index.
Finally, you should measure your performance and decide whether you need to do some optimizations, such as parallelizing operations (you can work on different parts of arrays on different processors), preloading some of arrays (or all of them) into memory, using GPU, etc.
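As a rough sketch of the bit-array idea, the following Python stores each set as one big integer so that intersection and negation become single bitwise operations. The function names are hypothetical, and the default bounds are simply the id range stated in the question; the demo uses a tiny range so it runs instantly.

```python
MIN_ID, MAX_ID = -100_000_000, 100_000_000   # id range stated in the question

def to_bitmap(ids, lo=MIN_ID, hi=MAX_ID):
    """Return an int whose bit (v - lo) is set for every id v in ids."""
    buf = bytearray((hi - lo) // 8 + 1)       # one bit per possible id
    for v in ids:
        j = v - lo
        buf[j >> 3] |= 1 << (j & 7)
    return int.from_bytes(buf, "little")

def negate(bitmap, lo=MIN_ID, hi=MAX_ID):
    """Flip every bit inside the valid id range."""
    full = (1 << (hi - lo + 1)) - 1
    return bitmap ^ full

def from_bitmap(bitmap, lo=MIN_ID):
    """Translate a bitmap back into a sorted list of ids."""
    data = bitmap.to_bytes((bitmap.bit_length() + 7) // 8, "little")
    ids = []
    for i, byte in enumerate(data):
        if byte:                               # skip empty bytes quickly
            for b in range(8):
                if (byte >> b) & 1:
                    ids.append(lo + i * 8 + b)
    return ids

# Tiny demo range so the example is cheap to run.
a = to_bitmap([-10, -3, 0, 5, 10], lo=-10, hi=10)
b = to_bitmap([-3, 5, 7], lo=-10, hi=10)
print(from_bitmap(a & b, lo=-10))                          # ids in both
print(from_bitmap(a & negate(b, lo=-10, hi=10), lo=-10))   # ids in a but not b
```

Decoding the result back to ids is the slow part in pure Python; a native array library or a compiled language would be the natural next step once you have measured it.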
First, the bitmap approach will produce the required performance, at the cost of a large memory overhead. You'll need to benchmark it, but I'd expect times of maybe 0.2 seconds, almost entirely dominated by the cost of loading data from disk and then reading the result.
However there is another approach that is worth considering. It will use less memory most of the time. For most of the files that you state, it will perform well.
First let's use Cap'n Proto for a file format. The type can be something like this:
struct Ids {
  isNegated @0 :Bool;
  ids @1 :List(Int32);
}
The key is that ids are always kept sorted. So list operations are a question of running through them in parallel. And now:
Applying not is just flipping isNegated.
If neither is negated, it is a question of finding IDs in both lists.
If the first is not negated and the second is, you just want to find IDs in the first that are not in the second.
If the first is negated and the second is not, you just want to find IDs in the second that are not in the first.
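If both are negated, you just want to find all ids in either list, and the result is itself negated (the complement of a union is the intersection of the complements).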
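The four cases above can be sketched with plain sorted Python lists standing in for the stored id lists. The function names are hypothetical, and the sketch assumes ids are unique and sorted, as stated; each set is a pair of (sorted ids, negation flag).

```python
def intersect_sorted(a, b):
    """Ids present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def difference_sorted(a, b):
    """Ids in a that are not in b."""
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

def union_sorted(a, b):
    """Ids present in either sorted list, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def and_sets(a_ids, a_neg, b_ids, b_neg):
    """Intersect two (sorted ids, negated flag) sets; returns the same shape."""
    if not a_neg and not b_neg:
        return intersect_sorted(a_ids, b_ids), False
    if not a_neg and b_neg:
        return difference_sorted(a_ids, b_ids), False
    if a_neg and not b_neg:
        return difference_sorted(b_ids, a_ids), False
    # Both negated: not-A and not-B equals not-(A or B).
    return union_sorted(a_ids, b_ids), True

print(and_sets([1, 3, 5, 7], False, [3, 4, 7, 9], True))
```

Every helper is a single linear pass over both lists, which is what makes the comparison counts quoted in this answer plausible.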
If your list has 100k entries, then the file will be about 400k. A not requires copying 400k of data (very fast). And intersecting with another list of the same size involves 200k comparisons. Integer comparisons complete in a clock cycle, and branch mispredictions take something like 10-20 clock cycles. So you should be able to do this operation in the 0-2 millisecond range.
Your worst case 56,000,000 file will take over 200 MB and intersecting 2 of them can take around 200 million operations. This is in the 0-2 second range.
For the 56 million file and a 10k file, your time is almost all spent on numbers in the 56 million file and not in the 10k one. You can speed that up by adding a "galloping" mode where you do a binary search forward in the larger file looking for the next matching number, skipping most of the entries in between. Do be warned that this code tends to be tricky and involves lots of mispredictions. You'll have to benchmark it to find out how big a size difference is needed.
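One possible shape for the galloping mode is sketched below in Python. It is an assumption about how such a search might be written, not a tuned implementation: for each id of the small list it doubles a step forward in the large list until it overshoots, then binary-searches only inside that bracket.

```python
import bisect

def gallop_intersect(small, large):
    """Intersect two sorted, duplicate-free lists when small is much shorter."""
    out = []
    lo = 0                      # everything before lo in large is < current id
    n = len(large)
    for x in small:
        # Gallop: double the step until large[lo + step] >= x or we pass the end.
        step = 1
        while lo + step < n and large[lo + step] < x:
            step *= 2
        hi = min(lo + step + 1, n)
        # Binary search restricted to the bracket [lo, hi).
        lo = bisect.bisect_left(large, x, lo, hi)
        if lo == n:
            break               # large is exhausted, no further matches possible
        if large[lo] == x:
            out.append(x)
    return out

large = list(range(0, 1000, 3))
small = [-5, 0, 9, 10, 300, 998, 999, 2000]
print(gallop_intersect(small, large))
```

The cost per small-list id is logarithmic in the distance skipped, so the total work depends mostly on the short list rather than on the long one.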
In general this approach will lose for your very biggest files. But it will be a huge win for most of the sizes of file that you've talked about.

Why does a finite sum take so long to calculate?

I'm trying to compute the following sum:
It is calculated instantly. So I raise the number of points to 24^3 and it still works fast:
But when the number of points is 25^3 it's almost impossible to await the result! Moreover, there is a warning:
Why is it so time-consuming to calculate a finite sum? How can I get a precise answer?
Try
max=24;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.143978,14330.9}
and
max=25;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.156976,14636.6}
and even
max=50;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{1.36679,16932.5}
Changing your code in this way avoids doing hundreds or thousands of If tests that will almost always result in True. And it potentially uses symbolic algorithms to find those results instead of needing to add up each one of the individual values.
Compare those results and times if you replace Sum with NSum and if you replace /500 with *.002
To try to guess why the times you see suddenly change as you increment the bound, other people have noticed in the past that it appears there are some hard coded bounds inside some of the numerical algorithms and when a range is small enough Mathematica will use one algorithm, but when the range is just large enough to exceed that bound then it will switch to another and potentially slower algorithm. It is difficult or impossible to know exactly why you see this change without being able to inspect the decisions being made inside the algorithms and nobody outside Wolfram gets to see that information.
To get a more precise numerical value you can change N[...] to N[...,64] or N[...,256] or eliminate the N entirely and get a large complicated exact numeric result.
Be cautious with this, check the results carefully to make certain that I have not made any mistakes. And some of this is just guesswork on my part.
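To check the splitting trick independently of Mathematica, here is a small Python sketch. It assumes the original expression excluded the single point (1, 1, 1), where the denominator vanishes, with a conditional test; the function names are made up. It verifies numerically that the three partial sums cover exactly the same points.

```python
import math

def term(i, j, k):
    # 1 / (e^((i^2 + j^2 + k^2 - 3)/500) - 1); singular only at (1, 1, 1).
    return 1.0 / math.expm1((i * i + j * j + k * k - 3) / 500)

def sum_with_test(n):
    """Triple sum that tests every point and skips (1, 1, 1)."""
    return sum(term(i, j, k)
               for i in range(1, n + 1)
               for j in range(1, n + 1)
               for k in range(1, n + 1)
               if (i, j, k) != (1, 1, 1))

def sum_split(n):
    """Same sum as three pieces whose index ranges never reach (1, 1, 1)."""
    r = range(1, n + 1)
    part1 = sum(term(i, j, k) for i in range(2, n + 1) for j in r for k in r)
    part2 = sum(term(1, j, k) for j in range(2, n + 1) for k in r)
    part3 = sum(term(1, 1, k) for k in range(2, n + 1))
    return part1 + part2 + part3

print(sum_with_test(24), sum_split(24))
```

The two printed values should agree to floating-point accuracy and can be compared with the figure Mathematica reports above for max=24.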

word2vec window size at sentence boundaries

I am using word2vec (and doc2vec) to get embeddings for sentences, but I want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
[
['apple', 'banana','carrot','dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass','copper', ..., 'zinc'],
...
]
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'dates')) before it eventually gets to the correct ('apple', ('banana','carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than 'zucchini',
since there are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled in inverse proportion to the distance from the target word to the context word. This also causes an issue, making nearby words seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
All these considerations would be the same using SG or CBOW mode.
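To put a number on the window-shrinking argument, the sketch below computes how often the effective window still covers an entire sentence. It assumes, as described above, that the effective window is drawn uniformly from 1 to the configured window on each visit, and it takes the worst case of a target word at one end of the sentence; the function name is hypothetical.

```python
from fractions import Fraction

def prob_full_context(window, sentence_len):
    """Chance that one visit to an end-of-sentence word sees the whole sentence.

    Assumes the effective window is uniform on 1..window and that reaching
    the far end of the sentence needs a window of sentence_len - 1.
    """
    needed = max(sentence_len - 1, 1)
    if needed > window:
        return Fraction(0)
    return Fraction(window - needed + 1, window)

for w in (25, 1000, 26_000_000):
    print(w, float(prob_full_context(w, sentence_len=26)))
```

For 26-word sentences, a window of 1000 already covers the full sentence on the large majority of visits, and a window a million times the sentence length makes the shortfall negligible.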
I believe a million-times-larger window will be adequate for your needs, but if for some reason it isn't, another way to essentially cancel out any nearness effects would be to ensure the word order within each item of your corpus is re-shuffled every time the item is accessed as training data. That ensures any nearness advantages will be mixed evenly across all words, especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even just a one-time shuffle of each sentence would be enough. Then, over all examples of co-occurring words, the word co-occurrences would be sampled in the right proportions even with small windows.)
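A minimal sketch of the re-shuffling idea, using only the standard library: a re-iterable corpus wrapper that hands out every sentence in a fresh random order on each pass. The class name `ShuffledCorpus` is made up, and the sketch assumes the training code iterates over the corpus once per epoch (gensim accepts a restartable iterable of token lists as its corpus).

```python
import random

class ShuffledCorpus:
    """Re-iterable corpus that yields each sentence in a new random word order."""

    def __init__(self, sentences, seed=None):
        self.sentences = [list(s) for s in sentences]
        self.rng = random.Random(seed)

    def __iter__(self):
        for sentence in self.sentences:
            shuffled = sentence[:]        # copy so the stored order is untouched
            self.rng.shuffle(shuffled)
            yield shuffled

corpus = ShuffledCorpus([
    ["apple", "banana", "carrot", "dates", "elderberry"],
    ["aluminium", "brass", "copper", "zinc"],
], seed=42)

for epoch in range(2):
    print([sentence for sentence in corpus])
```

Because the shuffle happens inside `__iter__`, every pass over the data sees a different ordering without the caller having to do anything.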
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth it to search further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested the optimal ns_exponent for training vectors for recommendation-systems is far different from the usual 0.75 default for natural-language modeling.)

In matlab, speed up cross correlation

I have a long time series with some repeating and similar looking signals in it (not entirely periodical). The length of the time series is about 60000 samples. To identify the signals, I take out one of them, having a length of around 1000 samples and move it along my timeseries data sample by sample, and compute cross-correlation coefficient (in Matlab: corrcoef). If this value is above some threshold, then there is a match.
But this is excruciatingly slow (using 'for loop' to move the window).
Is there a way to speed this up, or maybe there is already some mechanism in Matlab for this ?
Many thanks
Edited: added information, regarding using 'xcorr' instead:
If I use 'xcorr', or at least the way I have used it, I get the wrong picture. Looking at the data (first plot), there are two types of repeating signals. One is marked by red rectangles, whereas the other, which has much larger amplitudes (this is coherent noise), is marked by a black rectangle. I am interested in the first type. The second plot shows the signal I am looking for, blown up.
If I use 'xcorr', I get the third plot. As you see, 'xcorr' gives me the wrong signal (there is in fact high cross correlation between my signal and coherent noise).
But using 'corrcoef' and moving the window, I get the last plot, which is the correct one.
There may be a problem of normalization when using 'xcorr', but I don't know.
I can think of two ways to speed things up.
1) make your template 1024 elements long. Suddenly, correlation can be done using FFT, which is significantly faster than DFT or element-by-element multiplication for every position.
2) Ask yourself what it is about your template shape that you really care about. Do you really need the very high frequencies, or are you really after lower frequencies? If you could re-sample your template and signal so it no longer contains any frequencies you don't care about, it will make the processing very significantly faster. Steps to take would include
determine the highest frequency you care about
filter your data so higher frequencies are blocked
resample the resulting data at a lower sampling frequency
Now combine that with a template whose size is a power of 2
You might find this link interesting reading.
Let us know if any of the above helps!
Your problem seems like a textbook example of cross-correlation. Therefore, there's no good reason to use any solution other than xcorr. A few technical comments:
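xcorr assumes that the mean was removed from the two cross-correlated signals. Furthermore, by default it does not scale the signals' standard deviations. Both of these issues can be addressed by z-scoring your two signals: c=xcorr(zscore(longSig,1),zscore(shortSig,1)); c=c/n; where n is the length of the shorter signal should produce results comparable with your sliding window method. (Note that the sliding corrcoef normalizes each window separately, while zscore normalizes the long signal once as a whole, so the two are not identical when the local amplitude varies a lot.)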
xcorr's output is ordered according to lags, which can be obtained as a second output argument ([c,lags]=xcorr(...)). Always plot xcorr results by plot(lags,c). I recommend trying a synthetic signal to verify that you understand how to interpret this chart.
xcorr's implementation already uses the Discrete Fourier Transform, so unless you have unusual conditions it will be a waste of time to code a frequency-domain cross-correlation again.
Finally, a comment about terminology: Correlating corresponding time points between two signals is plain correlation. That's what corrcoef does (its name stands for correlation coefficient, no 'cross-correlation' there). Cross-correlation is the result of shifting one of the signals and calculating the correlation coefficient for each lag.
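For reference, the sliding-window correlation coefficient described in the question can be written out explicitly. The sketch below is plain Python rather than MATLAB, the function name is made up, and it is meant to show the per-window normalization, not to be fast. Each window is centered and scaled on its own, which is why a small-amplitude match can still score close to 1 next to large-amplitude noise.

```python
import math

def sliding_corrcoef(signal, template):
    """Pearson correlation of template with every same-length window of signal."""
    m = len(template)
    t_mean = sum(template) / m
    t_dev = [t - t_mean for t in template]
    t_norm = math.sqrt(sum(d * d for d in t_dev))
    out = []
    for start in range(len(signal) - m + 1):
        window = signal[start:start + m]
        w_mean = sum(window) / m
        w_norm = math.sqrt(sum((w - w_mean) ** 2 for w in window))
        if w_norm == 0.0 or t_norm == 0.0:
            out.append(0.0)     # flat window: correlation undefined, report 0
            continue
        # sum(t_dev) is zero, so the window mean drops out of the numerator.
        dot = sum(d * w for d, w in zip(t_dev, window))
        out.append(dot / (t_norm * w_norm))
    return out

template = [0, 1, 0, -1, 0, 2, 0]
weak_copy = [0.1 * t + 3 for t in template]          # small, offset occurrence
loud_noise = [50 * x for x in (1, -1, 1, -1, 1, -1, 1)]
signal = [0.0] * 5 + weak_copy + [0.0] * 3 + loud_noise + [0.0] * 4
r = sliding_corrcoef(signal, template)
print(max(range(len(r)), key=r.__getitem__))         # offset of the best match
```

In this synthetic example the best match is the weak copy at offset 5, even though the noise burst is several hundred times larger in amplitude.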

Geohashes - Why is interleaving index values necessary?

I have had a look at this post about geohashes. According to the author, the final step in calculating the hash is interleaving the x and y index values. But is this really necessary? Is there a proper reason not to just concatenate these values, as long as the hash table is built according to that altered indexing rule?
From the wiki page
Geohashes offer properties like arbitrary precision and the
possibility of gradually removing characters from the end of the code
to reduce its size (and gradually lose precision).
If you simply concatenated x and y coordinates, then users would have to take a lot more care when trying to reduce precision by being careful to remove exactly the right number of characters from both the x and y coordinate.
There is a related (and more important) reason beyond arbitrary precision: Geohashes with a common prefix are close to one another. The longer the common prefix, the closer they are.
54.321 -2.345 has geohash gcwm48u6
54.322 -2.346 has geohash gcwm4958
(See http://geohash.org to try this)
This feature enables fast lookup of nearby points (though there are some complications), and only works because we interleave the two dimensions to get a sort of approximate 2D proximity metric.
As the wikipedia entry goes on to explain:
When used in a database, the structure of geohashed data has two
advantages. First, data indexed by geohash will have all points for a
given rectangular area in contiguous slices (the number of slices
depends on the precision required and the presence of geohash "fault
lines"). This is especially useful in database systems where queries
on a single index are much easier or faster than multiple-index
queries. Second, this index structure can be used for a
quick-and-dirty proximity search - the closest points are often among
the closest geohashes.
Note that the converse is not always true - if two points happen to lie on either side of a subdivision (e.g. either side of the equator) then they may be extremely close but have no common prefix. Hence the complications I mentioned earlier.
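The interleaving itself is easy to see in code. The following is a minimal sketch of the standard geohash encoding (alternate longitude and latitude bisection bits, then emit one base-32 character per 5 bits); the function name is made up. It illustrates both points made above: truncating a hash gives the hash of a coarser cell, and nearby points share a prefix.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=8):
    """Encode a point by interleaving longitude and latitude bisection bits."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value, nbits = 0, 0
    use_lon = True                      # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bit = 1 if coord >= mid else 0
        rng[1 - bit] = mid              # bit 1 keeps the upper half, bit 0 the lower
        value = (value << 1) | bit
        nbits += 1
        use_lon = not use_lon
        if nbits == 5:                  # every 5 bits become one character
            chars.append(BASE32[value])
            value, nbits = 0, 0
    return "".join(chars)

full = geohash_encode(54.321, -2.345, 8)
print(full, geohash_encode(54.321, -2.345, 5))   # the second is a prefix of the first
print(geohash_encode(54.322, -2.346, 8))         # nearby point, shared prefix
```

If the x bits were simply concatenated with the y bits, dropping trailing characters would only coarsen one coordinate, and a shared prefix would say nothing about the other.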
