I want to store a big sparse matrix using Spark,
so I tried to use CoordinateMatrix, since it is a distributed matrix.
However, I have not found a way to access an entry directly, such as:
apply(int x, int y)
I only found functions like:
public RDD<MatrixEntry> entries()
In this case, I have to loop over the entries to find the one I want, which is not an efficient way.
Has anyone used CoordinateMatrix before?
What should I do to get an entry from a CoordinateMatrix efficiently?
Short answer is you don't. RDDs (and CoordinateMatrix is more or less a wrapper around an RDD[MatrixEntry]) are not well suited for random access. Moreover, RDDs are immutable, so you cannot simply modify a single entry. If that is a requirement, you're probably looking at the wrong technology.
There is some limited support for random access if you use a PairRDD. If such an RDD is partitioned, you can use the lookup method to efficiently retrieve a single value:
import org.apache.spark.HashPartitioner
import org.apache.spark.mllib.linalg.distributed.MatrixEntry

val n = ??? // Number of partitions

val pairs = mat.entries
  .map { case MatrixEntry(i, j, v) => ((i, j), v) }
  .partitionBy(new HashPartitioner(n))

pairs.lookup((1, 1))
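Note that lookup returns a Seq of all values stored under the key, and it is only efficient when the RDD has a known partitioner, since then only the single partition the key maps to is searched. You will likely also want to cache pairs so the partitioned data is reused across queries.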
Basically I need to keep track of a large number of counters. I can increment or decrement each counter by name. The simplest way to do so is to use a hash table, using counter_name as key and its corresponding count as the value for that key.
The counters don't need to be 100% accurate, approximate values for count are fine. So I'm wondering if there is any probabilistic data structure that can reduce the space complexity of N counters to lower than O(N), kinda similar to how HyperLogLog reduces the memory requirement of counting N items by giving only an approximate result. Any ideas?
In my opinion, the thing you are looking for is a Count-Min sketch.
Reading a stream of elements a1, a2, a3, ..., an, where there can be many repeated elements, it can answer at any time the question: how many elements ai have you seen so far?
Basically, your unique counter names can be bijected onto such stream elements. A Count-Min sketch allows you to adjust its parameters to trade memory for accuracy.
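To make the trade-off concrete, here is a minimal illustrative sketch in Haskell (CMS, newCMS, increment and count are names made up for this example; a real implementation would use a fixed mutable array of counters and pairwise-independent hashes rather than IntMaps):
import qualified Data.IntMap.Strict as IM
import Data.Hashable (hashWithSalt)

-- d rows of w counters; each row uses a differently salted hash.
data CMS = CMS { cmsWidth :: Int, cmsRows :: [IM.IntMap Int] }

newCMS :: Int -> Int -> CMS   -- width, depth
newCMS w d = CMS w (replicate d IM.empty)

bucket :: CMS -> Int -> String -> Int
bucket s salt name = hashWithSalt salt name `mod` cmsWidth s

-- Incrementing a name bumps one counter per row.
increment :: String -> CMS -> CMS
increment name s = s { cmsRows = zipWith bump [0 ..] (cmsRows s) }
  where bump salt = IM.insertWith (+) (bucket s salt name) 1

-- The estimate is the minimum over all rows; collisions can only
-- inflate counters, so the result is never an underestimate.
count :: String -> CMS -> Int
count name s = minimum (zipWith look [0 ..] (cmsRows s))
  where look salt = IM.findWithDefault 0 (bucket s salt name)
A wider sketch lowers the per-query error and a deeper one lowers the probability of a bad estimate. Decrements can be fed in as (-1) updates, though the one-sided error guarantee then no longer holds.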
P.S. I described some other popular probabilistic data structures here.
Stefan Haustein's correct that the names are likely to take more space than the counters, and you may be able to prioritise certain names as he suggests, but failing that you can consider how best to store the names. If they're fairly short (e.g. 8 characters or less), you might consider using a closed hashing table that stores them directly in the buckets. If they're long, you could store them contiguously (NUL terminated) in a block of memory, and in the hash table store the offset into that block of their first character.
For the counter itself, you can save space by using a probabilistic approach as follows:
#include <cstdlib> // rand()

template <typename T, typename Q = unsigned>
class Approx_Counter
{
  public:
    Approx_Counter() : n_(0) { }

    Approx_Counter& operator++()
    {
        // Past 2, only increment n_ with probability 1 / 2^n_.
        if (n_ < 2 || rand() % (operator Q()) == 0)
            ++n_;
        return *this;
    }

    // Reported count: exact below 2, otherwise 2^n_.
    // Q(1) << n_ avoids overflowing int when Q is a wider type.
    operator Q() const { return n_ < 2 ? n_ : Q(1) << n_; }

  private:
    T n_; // stores only the exponent, so a tiny type suffices
};
Then you can use e.g. Approx_Counter<unsigned char, unsigned long>. Swap out rand() for a C++11 generator if you care.
The idea's simple:
when n_ is 0, ++ has definitely not been invoked
when n_ is 1, ++ has definitely been invoked exactly once
when n_ >= 2, it indicates ++ has probably been invoked about 2^n_ times
To keep that last implication in line with the number of ++ invocations actually made, each invocation has a 1 in 2^n_ chance of actually incrementing n_ again.
Just make sure your rand() or substitute returns values much larger than the largest counter value you want to track, otherwise you'll get rand() % (operator Q()) == 0 too often and increment inappropriately.
That said, having a smaller counter doesn't help much if you have pointers or offsets to it, so you'll want to squeeze the counter into the bucket too, another reason to prefer your own closed hashing implementation if you genuinely need to tighten up memory usage but want to stick with a hash table (a trie is another possibility).
The above is still O(N) in counter space, just with a smaller constant. For genuinely sub-O(N) options, you need to consider whether/how keys are related, such that incrementing one counter might reasonably impact multiple keys. You've given us no insights in your question to date.
The names probably take up more space than the counters.
How about having a fixed number of counters and only keep the ones with the highest counts, plus some kind of LRU mechanism to allow new counters to rise to the top? I guess it really depends on your use case...
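For what it's worth, the "fixed number of counters" idea has a classical increment-only variant, the Misra-Gries frequent-items summary. A minimal Haskell sketch (the names are mine, not a library's); after n increments, each reported count is too low by at most n/k:
import qualified Data.Map.Strict as M

type Counters = M.Map String Int

-- Keep at most k counters: a hit increments its counter, a miss with
-- room inserts a new one, and a miss with a full table decrements
-- every counter, dropping those that reach zero.
bump :: Int -> String -> Counters -> Counters
bump k name cs
  | M.member name cs = M.adjust (+ 1) name cs
  | M.size cs < k    = M.insert name 1 cs
  | otherwise        = M.filter (> 0) (M.map (subtract 1) cs)
Names with genuinely high counts survive the decrements, which matches the "keep the ones with the highest counts" intuition.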
Suppose I want to update some existing value in a map, or do something else if the key is not found. How do I do this, without performing 2 lookups? What's the golang equivalent of the following C++ code:
auto it = m.find(key);
if (it != m.end()) {
    // update the value, without performing a second lookup
    it->second = calc_new_value(it->second);
} else {
    // do something else
    m.insert(make_pair(key, 42));
}
Go does not expose the map's internal (key,value) pair data structure like C++ does, so you can't replicate this exactly.
One possible work around would be to make the values of your map pointers, so you can keep the same values in the map but update what they point to. For example, if m is a map[int]*int, you could change a value with:
v, ok := m[10]
if ok {
    *v = 42 // update in place through the pointer
}
With that said, I wouldn't be surprised if the savings from reducing the number of hash lookups will be eaten by the additional memory management overhead. So it would be worth benchmarking whatever solution you settle on.
You cannot. The situation is actually the same with Python dicts. However it shouldn't matter. Both lookup and assignment to a Go map are amortized O(1). Combining the two operations has the same time complexity.
I have a collection of 2-D points which represent a 1-variable function.
Given an arbitrary input, I have to return the value of the point whose x-coordinate is closest to it.
Example:
Curve:
(1,5)
(2,8)
(5,9)
Input: 3 Output: 8
My main concern is speed, space doesn't matter as much.
Which data structure would be best?
EDIT: The table is static, it won't change during runtime
It depends upon whether the table is static or dynamic.
If it's static data, simple sorted array and binary search will get the job done: search for the key, if it isn't found, check the index above and below to see which is closer to the search key, and return its associated value.
If the data is dynamic, I'd go with a B+Tree variant (though any balanced tree structure should work). Essentially the same algorithm, but you'd be checking sibling nodes, instead of just checking adjacent array cells.
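For illustration, here is the same check-below-and-above logic sketched in Haskell, with Data.Map (a balanced search tree) standing in for the sorted array; lookupLE and lookupGE are the two O(log n) probes:
import qualified Data.Map as M

curve :: M.Map Double Double
curve = M.fromList [(1, 5), (2, 8), (5, 9)]

-- Nearest key at or below, nearest key at or above; pick the closer.
closest :: Double -> M.Map Double Double -> Maybe Double
closest x m = case (M.lookupLE x m, M.lookupGE x m) of
    (Just (xl, yl), Just (xr, yr)) -> Just (if x - xl <= xr - x then yl else yr)
    (Just (_, y), Nothing)         -> Just y
    (Nothing, Just (_, y))         -> Just y
    (Nothing, Nothing)             -> Nothing
For the table in the question, closest 3 curve yields Just 8.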
You say the table is static, and won't change during runtime.
Then if you need blazing performance, and if the table is not too large, it's hard to beat a hard-coded binary search.
For the table you gave, it looks like this:
result = (x < 3.5
? (x < 1.5
? 5
: 8
)
: 9
);
You may have to write a little program to take the table as input, and generate the code as output, so you can include it in your main program.
If you don't mind using a macro, you might make it a little easier to write, like this:
#define M(a,middle,b) (x < (middle) ? (a) : (b))
result = M( M(5, 1.5, 8), 3.5, 9);
The only way to beat that is with a hard-coded hash search (using a switch statement).
If the table can change between runs, it might make sense to have the program, at startup, generate the code, compile and link it into a DLL, load the DLL, and run with that.
That can take all of about a second, and then you have the high speed.
Is there a Haskell library that allows me to have a Map from ranges to values? (Preferably somewhat efficient.)
let myRangeMap = RangeMap [(range 1 3, "foo"),(range 2 7, "bar"),(range 9 12, "baz")]
in rangeValues 2
==> ["foo","bar"]
I've written a library to search in overlapping intervals because the existing ones did not fit my needs. I think it may have a more approachable interface than for example SegmentTree:
https://www.chr-breitkopf.de/comp/IntervalMap/index.html
It's also available on Hackage: https://hackage.haskell.org/package/IntervalMap
This task is called a stabbing query on a set of intervals. An efficient data structure for it is called (one-dimensional) segment tree.
The SegmentTree package provides an implementation of this data structure, but unfortunately I cannot figure out how to use it. (I feel that the interface of this package does not provide the right level of abstraction.)
Perhaps the rangemin library does what you want?
Good old Data.Map (and its more efficient Data.IntMap cousin) has a function
splitLookup :: Ord k => k -> Map k a -> (Map k a, Maybe a, Map k a)
which splits a map into submaps of keys less than / greater than a given key. This can be used for certain kinds of range searching.
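For example, here is a rough stabbing-query sketch on top of splitLookup, keying each interval by its low endpoint (RangeMap, fromRanges and rangeValues are invented names; it degrades to a linear scan of everything starting below the query point, which is exactly what the interval-tree packages above avoid):
import qualified Data.Map as M
import Data.Maybe (fromMaybe)

-- low endpoint -> [(high endpoint, value)]
type RangeMap v = M.Map Int [(Int, v)]

fromRanges :: [((Int, Int), v)] -> RangeMap v
fromRanges rs = M.fromListWith (++) [ (lo, [(hi, v)]) | ((lo, hi), v) <- rs ]

-- Every interval containing x must start at or below x; split there
-- and keep the candidates whose high endpoint reaches x.
rangeValues :: Int -> RangeMap v -> [v]
rangeValues x m = [ v | (hi, v) <- candidates, x <= hi ]
  where
    (below, at, _) = M.splitLookup x m
    candidates     = fromMaybe [] at ++ concat (M.elems below)
Here rangeValues 2 (fromRanges [((1,3),"foo"),((2,7),"bar"),((9,12),"baz")]) returns the same two hits as the example in the question, modulo ordering.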
I have a function that takes a parameter and produces a result. Unfortunately, it takes quite long for the function to produce the result. The function is being called quite often with the same input, that's why it would be convenient if I could cache the results. Something like
let cachedFunction = createCache slowFunction
in (cachedFunction 3.1) + (cachedFunction 4.2) + (cachedFunction 3.1)
I was looking into Data.Array and although the array is lazy, I need to initialize it with a list of pairs (using listArray), which is impractical. If the 'key' is e.g. the Double type, I cannot initialize it at all, and even if I could theoretically assign an Integer to every possible input, I have several tens of thousands of possible inputs and only actually use a handful. I would need to initialize the array (or, preferably, a hash table, as only a handful of results will be used) using a function instead of a list.
Update: I am reading the memoization articles, and as far as I understand it, MemoTrie could work the way I want. Maybe. Could somebody try to produce the 'cachedFunction'? Preferably for a slow function that takes 2 Double arguments? Or, alternatively, one that takes one Int argument in a domain of ~ [0..1 billion] that wouldn't eat all memory?
Well, there's Data.HashTable. Hash tables don't tend to play nicely with immutable data and referential transparency, though, so I don't think it sees a lot of use.
For a small number of values, stashing them in a search tree (such as Data.Map) would probably be fast enough. If you can put up with doing some mangling of your Doubles, a more robust solution would be to use a trie-like structure, such as Data.IntMap; these have lookup times proportional primarily to key length, and roughly constant in collection size. If Int is too limiting, you can dig around on Hackage to find trie libraries that are more flexible in the type of key used.
As for how to cache the results, I think what you want is usually called "memoization". If you want to compute and memoize results on demand, the gist of the technique is to define an indexed data structure containing all possible results, in such a way that when you ask for a specific result it forces only the computations needed to get the answer you want. Common examples usually involve indexing into a list, but the same principle should apply for any non-strict data structure. As a rule of thumb, non-function values (including infinite recursive data structures) will often be cached by the runtime, but not function results, so the trick is to wrap all of your computations inside a top-level definition that doesn't depend on any arguments.
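The classic list-indexing instance of this is Fibonacci: fibs is a top-level definition with no arguments, so its elements are shared and each one is computed at most once, however often the recursive calls index into it:
fibs :: [Integer]
fibs = map fib [0 ..]
  where
    fib 0 = 0
    fib 1 = 1
    fib n = fibs !! (n - 1) + fibs !! (n - 2)
The first evaluation of fibs !! 30 does the work; asking for it again is just a list walk.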
Edit: MemoTrie example ahoy!
This is a quick and dirty proof of concept; better approaches may exist.
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}

import Data.MemoTrie
import Data.Binary
import Data.ByteString.Lazy hiding (map)

-- Round-trip a Double through its binary serialization so the
-- existing HasTrie instance for [Int] can be reused.
mangle :: Double -> [Int]
mangle = map fromIntegral . unpack . encode

unmangle :: [Int] -> Double
unmangle = decode . pack . map fromIntegral

instance HasTrie Double where
    data Double :->: a = DoubleTrie ([Int] :->: a)
    trie f = DoubleTrie $ trie $ f . unmangle
    untrie (DoubleTrie t) = untrie t . mangle

slow :: Double -> Integer
slow x | x < 1     = 1
       | otherwise = slow (x / 2) + slow (x / 3)

memoSlow :: Double -> Integer
memoSlow = memo slow
Do note the GHC extensions used by the MemoTrie package; hopefully that isn't a problem. Load it up in GHCi and try calling slow vs. memoSlow with something like (10^6) or (10^7) to see it in action.
Generalizing this to functions taking multiple arguments or whatnot should be fairly straightforward. For further details on using MemoTrie, you might find this blog post by its author helpful.
See memoization
There are a number of tools in GHC's runtime system explicitly to support memoization.
Unfortunately, memoization isn't really a one-size fits all affair, so there are several different approaches that we need to support in order to cope with different user needs.
You may find the original 1999 writeup useful as it includes several implementations as examples:
Stretching the Storage Manager: Weak Pointers and Stable Names in Haskell by Simon Peyton Jones, Simon Marlow, and Conal Elliott
I will add my own solution, which seems to be quite slow as well. The first parameter is a function that returns an Int32, which is a unique identifier of the parameter. If you want to uniquely identify it by different means (e.g. by 'id'), you have to change the second argument of H.new to a different hash function. I will try to find out how to use Data.Map and test whether I get faster results.
import qualified Data.HashTable as H
import Data.Int
import System.IO.Unsafe

cache :: (a -> Int32) -> (a -> b) -> (a -> b)
cache ident f = unsafePerformIO $ createfunc
  where
    createfunc = do
        -- keys are already Int32 identifiers, so (==) and id suffice
        storage <- H.new (==) id
        return (doit storage)
    doit storage = unsafePerformIO . comp
      where
        comp x = do
            look <- H.lookup storage (ident x)
            case look of
                Just res -> return res
                Nothing  -> do
                    -- stores an unevaluated thunk; the real work
                    -- still happens on first use of the result
                    let result = f x
                    H.insert storage (ident x) result
                    return result
You can write the slow function as a higher order function, returning a function itself. Thus you can do all the preprocessing inside the slow function and the part that is different in each computation in the returned (hopefully fast) function. An example could look like this:
(SML code, but the idea should be clear)
fun computeComplicatedThing (x : real) (y : real) = (* ... some very complicated computation *)

val computeComplicatedThingFast = computeComplicatedThing 3.14 (* provide x, do the computation that needs only x *)

val result1 = computeComplicatedThingFast 2.71 (* provide y, do the computation that needs x and y *)
val result2 = computeComplicatedThingFast 2.81
val result3 = computeComplicatedThingFast 2.91
I have several tens of thousands possible inputs and I only actually use a handful. I would need to initialize the array ... using a function instead of a list.
I'd go with listArray (start, end) (map func [start..end])
func doesn't really get called above. Haskell is lazy and creates thunks which will be evaluated when the value is actually required.
When using a normal array you always need to initialize its values. So the work required for creating these thunks is necessary anyhow.
Several tens of thousands is far from a lot. If you had trillions, then I would suggest using a hash table, yada yada.
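To make that concrete, a minimal sketch (memoFromTo is a made-up name; the array cells are thunks that only get evaluated for the indices you actually hit):
import Data.Array

-- Memoize f on the range [lo .. hi] with a lazily filled array.
memoFromTo :: Int -> Int -> (Int -> Integer) -> (Int -> Integer)
memoFromTo lo hi f = (arr !)
  where arr = listArray (lo, hi) (map f [lo .. hi])
E.g. let fast = memoFromTo 0 100000 slowFunction; repeated calls to fast then share results.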
I don't know Haskell specifically, but how about keeping existing answers in some hashed data structure (might be called a dictionary or hashmap)? You can wrap your slow function in another function that first checks the map and only calls the slow function if it hasn't found an answer.
You could make it fancy by capping the map at a certain size and, when it reaches that, throwing out the least recently used entry. For this you would additionally need to keep a map of key-to-timestamp mappings.