Global cache dict across dask workers - caching

Let's say I have a delayed function which does a certain task but it needs a dict to store intermediate key/value pairs which are read and modified in each dask worker.
Can delayed or another mechanism be used to share the cache dict across workers?
I can't seem to find any documentation about doing this.

You could probably achieve what you want using actors - which, be warned, are marked as "experimental" and do not see much use. The data structure would be stored on one particular worker, and other workers would communicate with it to effect changes. Therefore, if there's any chance of workers going down, you would stand to lose results.
Naturally, you could instead interface your tasks with any external key/value storage: in-cluster things like redis or even a shared filesystem, or external things like cloud storage.
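A minimal sketch of the actor approach mentioned above, assuming a `dask.distributed` cluster is available. The names `SharedCache`, `square_with_cache` and `run_on_cluster` are made up for illustration, and the local in-process cluster is only there to keep the example small. Note that two tasks can still miss on the same key at the same time and both compute it; the actor only guarantees that each individual call sees one consistent dict.

```python
class SharedCache:
    """Plain key/value cache; as a Dask actor it lives on a single worker."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value
        return value

    def size(self):
        return len(self._data)


def square_with_cache(x, cache):
    """Task body: look x up in the shared cache, compute and store on a miss."""
    hit = cache.get(x).result()  # actor method calls return futures
    if hit is not None:
        return hit
    return cache.put(x, x * x).result()


def run_on_cluster(values):
    """Hypothetical driver; starts a local cluster purely for illustration."""
    # Imported lazily so SharedCache stays usable without Dask installed.
    from dask.distributed import Client

    with Client(processes=False) as client:
        # actor=True keeps one SharedCache instance alive on one worker.
        cache = client.submit(SharedCache, actor=True).result()
        futures = [
            client.submit(square_with_cache, v, cache, pure=False) for v in values
        ]
        results = client.gather(futures)
        return results, cache.size().result()
```

Calling `run_on_cluster([2, 3, 2])` would run the three tasks against the same cache. If the worker holding the actor dies, the cached entries are gone, which is the caveat raised above.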

There are a variety of ways to coordinate data between workers. I recommend looking at Dask's Coordination Primitives.
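As a rough illustration of the coordination primitives, the sketch below keeps a small dict in a `Variable` and guards each read-modify-write with a `Lock`. It assumes the dict stays small and holds only simple values (the variable is passed through the scheduler), and the names `merge_entry`, `record_result`, `run_on_cluster` and `"shared-cache"` are invented for this example.

```python
def merge_entry(cache, key, value):
    """Return a new dict with key set to value; the input dict is left untouched."""
    updated = dict(cache)
    updated[key] = value
    return updated


def record_result(key, value, name="shared-cache"):
    """Task body: read-modify-write a small dict held in a Dask Variable."""
    # Lazy import; this function only makes sense inside a running cluster.
    from dask.distributed import Lock, Variable

    var = Variable(name)
    with Lock(name + "-lock"):  # only one task at a time may update the dict
        var.set(merge_entry(var.get(), key, value))
    return value


def run_on_cluster(pairs, name="shared-cache"):
    """Hypothetical driver; starts a local cluster purely for illustration."""
    from dask.distributed import Client, Variable

    with Client(processes=False) as client:
        var = Variable(name)
        var.set({})  # initialise before any task tries to read it
        futures = [
            client.submit(record_result, k, v, name, pure=False) for k, v in pairs
        ]
        client.gather(futures)
        return var.get()
```

Without the lock, two tasks could read the same old dict and one update would overwrite the other. For larger caches an external store is probably the better fit.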

Related

Spark write the file inside the worker process

I have a Spark job that generates a set of results with statistics. I have more work items than slaves, so each slave processes more than one item.
I cache results after generating RDD objects to be able to reuse them as I have multiple write operations: one for result objects and another for statistics. Both write operations use saveAsHadoopFile.
Without caching Spark reruns the job again per each write operation and that is taking a long time and redoing the same execution twice (more if I had more writes).
With caching I am hitting the memory limit. Some of the previously calculated results are lost during caching and I am seeing "CacheManager:58 - Partition rdd_1_0 not found, computing it" messages. Spark eventually goes into an infinite loop as it tries to cache more results while losing some others.
I am aware of the fact that Spark has different storage levels for caching. Using memory + disk would solve our problem. But I am wondering whether we can write down files right in the worker without generating RDD objects or not. I am not sure if that is possible though. Is it?
It turns out that writing files inside a Spark worker process is no different from writing a file in any Java process. The write operation just requires code that serializes the data and saves the files to HDFS. This question has several answers on how to do it.
saveAsHadoopFile is just a convenient way of doing it.
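A hedged PySpark sketch of both ideas from this thread: persisting with the memory + disk storage level so evicted partitions spill instead of being recomputed, and writing each partition from inside the worker. The original job appears to be JVM-based, so this is a translation for illustration only. `write_partition`, `save_from_workers` and `out_dir` are hypothetical names, and `out_dir` is assumed to be a path every executor can write to (for example a shared mount); writing to HDFS would need an HDFS client instead of `open`.

```python
import json
import os


def write_partition(index, rows, out_dir):
    """Write one partition's rows as JSON lines; yield the path that was written."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"part-{index:05d}.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    yield path


def save_from_workers(rdd, out_dir):
    """Persist with spill-to-disk, then write each partition inside its executor."""
    # Lazy import; requires a PySpark installation and an existing RDD.
    from pyspark import StorageLevel

    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    # The function runs on the executors; collect() only returns the file paths.
    return rdd.mapPartitionsWithIndex(
        lambda index, rows: write_partition(index, rows, out_dir)
    ).collect()
```

A second write (for the statistics, say) could reuse the persisted RDD with another call of the same shape, so the upstream computation runs only once.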

Can Redis use disk as part of a LRU cache?

We need a distributed LRU cache, but one which can use both memory and disk. We have a large dataset, which is stored on disk permanently. From that dataset, we create other calculated datasets, but only when clients need them.
Since these secondary datasets are derived from data which is persistent, we never need to permanently save this derived data.
I thought that Redis would have the ability to use disk as a secondary LRU cache, but have not been able to find any documentation that points to that. It seems like Redis only uses the disk to persist the entire cache. I envisioned that we'd be able to scale out horizontally with a bunch of Redis instances.
If Redis can not do this, is there another system that does?
If the data does not fit into memory, the OS can swap it out to disk. This is called virtual memory. You can find an explanation here: http://redis.io/topics/virtual-memory
Remark: You want to retrieve some data, do stuff on it and you have some intermediate results. Please check whether you may want to distribute your processing, not only the data. Take a look at Apache Hadoop and especially Apache Spark.
The way to solve this problem without changing how your clients work is in fact not to use Redis, but instead to use a Redis-compatible database like Ardb, which in turn can be configured to use LevelDB under the hood, which supports LRU-type on-disk caches.
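To make the requirement concrete, here is a toy single-process sketch of a memory-plus-disk LRU cache. It is not how Redis, Ardb or LevelDB work internally and it is not distributed; `TwoTierLRU`, `memory_items` and the `compute` callback are invented for illustration, keys are assumed to be filename-safe strings, and the disk tier is left unbounded for brevity.

```python
import os
import pickle
from collections import OrderedDict


class TwoTierLRU:
    """Keep the most recently used entries in memory; spill evicted ones to disk."""

    def __init__(self, directory, memory_items=2):
        self.directory = directory
        self.memory_items = memory_items
        self._hot = OrderedDict()  # insertion order doubles as recency order
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Assumption: keys are strings that are safe to use as file names.
        return os.path.join(self.directory, f"{key}.pkl")

    def put(self, key, value):
        self._hot[key] = value
        self._hot.move_to_end(key)
        while len(self._hot) > self.memory_items:
            old_key, old_value = self._hot.popitem(last=False)  # least recently used
            with open(self._path(old_key), "wb") as fh:
                pickle.dump(old_value, fh)

    def get(self, key, compute):
        if key in self._hot:
            self._hot.move_to_end(key)
            return self._hot[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as fh:
                value = pickle.load(fh)
            os.remove(path)  # the entry is promoted back into memory
        else:
            value = compute(key)  # rebuild from the persistent dataset
        self.put(key, value)
        return value
```

Because the derived data can always be rebuilt through `compute`, losing either tier costs only time, which matches the situation described in the question.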

sharing global array amongst map-reduce tasks

I need to keep a global array of strings across all map and reduce tasks, which each one of them can update while running.
Is it possible to do that in Hadoop 1.2.1?
As far as I understood, counters only work with type long, and distributed cache files are read-only.
Would be great if someone can give pointers for this problem.
Thanks!
You really should not have shared variables in map-reduce programs.
But if you really need it, check out ZooKeeper: it is a distributed coordination service and a core part of the Hadoop ecosystem. You can use it to store any kind of shared data, including arrays of strings.
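A rough sketch of how a shared array of strings could be kept in a single znode, using the Python `kazoo` client purely for illustration (Hadoop tasks would normally use the Java client). It assumes `zk` is an already started `KazooClient`, that the array is stored as a JSON list, and that it stays small enough for one znode. `append_item` and `append_shared` are hypothetical helper names.

```python
import json


def append_item(raw, item):
    """Decode a JSON array stored as bytes, append item, and re-encode it."""
    items = json.loads(raw.decode("utf-8")) if raw else []
    items.append(item)
    return json.dumps(items).encode("utf-8")


def append_shared(zk, path, item):
    """Append to the znode at path, retrying when another task wrote first."""
    # Lazy import; requires the kazoo package and a reachable ZooKeeper ensemble.
    from kazoo.exceptions import BadVersionError

    zk.ensure_path(path)
    while True:
        raw, stat = zk.get(path)
        try:
            # The version check makes the update fail if the znode changed meanwhile.
            zk.set(path, append_item(raw, item), version=stat.version)
            return
        except BadVersionError:
            continue  # someone else updated the znode; re-read and retry
```

The versioned `set` acts as an optimistic lock, so concurrent map and reduce tasks do not silently overwrite each other's additions.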

Distributed Processing Clarification

I have something in mind but I don't know the typical solution that could help me achieve that.
I need to have a distributed environment where not only memory is shared but processing is also shared, that means ALL Shared Processors work as one Big Processor Computing The code I wrote.
Could this be achieved knowing that I have limited knowledge in Data Grids and Hadoop?
Data Grid Platform (I knew that memory only is shared in that model) or Hadoop (where the code is shared among nodes but each node processes the code separately from other nodes but works on a subset of the data on HDFS).
But I need a solution that not only (shares memory or code as hadoop) but also the processing power of all the machines as one Single Big processor and one single Big Memory?
Do you expect to just spawn a thread and have it executed somewhere, with the middleware miraculously balancing the load across nodes and moving threads from one node to another? I think you won't find this directly. The tagged frameworks don't have transparent shared memory either, for good reasons.
When using multiple nodes, you usually need them for processing power, and hiding everything and pretending you're on a single machine tends to cause unnecessary communication, slowing things down.
Instead, you can always design your app using the distribution API provided by those frameworks. For example in Infinispan, look for the Map-Reduce or Distributed Executors API.
I need to have a distributed environment where not only memory is shared but processing is also shared, that means ALL Shared Processors work as one Big Processor Computing The code I wrote.
You do not benefit from processing on a single machine. An application scales when its processing is spread across multiple machines. If you want to see the benefits of one Big Processor Computing, you can virtualize a big physical machine into multiple virtual nodes (using technologies like VMware).
But distributed processing across multiple VM nodes on multiple physical machines in a big cluster is best for distributed applications. Hadoop or Spark is the best fit for this type of application, depending on whether you need batch processing (Hadoop) or real-time processing (Spark).

Distributed computing using message queue VS Map/Reduce

Context:
We are considering an AMQP-compliant solution as a way to compute a constant live stream of data that amounts to 90 gb daily. What we'd like to achieve is live stats, more or less, based on all or some combination of the metrics we're observing. The considered strategy is to send data on the queue and have worker process deltas of the data, sending the data back on the queue as an aggregation of the original data.
Observation:
To me, this looks like a job for something like Hadoop, but concerns (and shields) were raised, mainly about speed. I didn't have the time to benchmark both, though we're expecting to pump a good amount of data through the queue (anywhere in the neighborhood of 10~100 MB/s). I still think it looks like a job for a distributed computing system, and I also feel the queue solution will scale more poorly than a distributed computing solution.
Question:
Put simply, am I right? I've read a bit on Hadoop + HDFS, and I was thinking about using another FS, like Lustre or something, to circumvent the NameNode SPOF, and using some kind of solution to tolerate the failure of nodes of any kind across the whole cluster.
It's really hard to write your own "distributed environment" solution when you need fault tolerance, good balancing, etc. If you need near-realtime map/reduce you should check out Storm, which is what Twitter uses for their huge data needs. It's less complicated than Hadoop, and better at consuming queue-type input (in my opinion).
Also, if you decide to analyze your data on Hadoop, don't worry too much about the NameNode SPOF; there are some ways to avoid it.

Resources