What's the relationship between Apache Spark and MapReduce - shell

I've got some questions about the Spark framework.
First, if I want to write an application that runs on a Spark cluster, is it unavoidable to follow the map-reduce procedure? Following the map-reduce procedure means a lot of code has to be rewritten in parallel form, so I'm looking for a simple way to move my current project to a cluster with few code changes.
The second is about the spark-shell. I launched the spark-shell on a cluster using the following command: MASTER=spark://IP:PORT ./bin/spark-shell. Then I wrote some Scala code in the spark-shell, for example:
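// Program 1: plain Scala loop
var count1 = 0
var ntimes = 10000
var index = 0
while (index < ntimes) {
  index += 1
  val t1 = Math.random()
  val t2 = Math.random()
  if (t1*t1 + t2*t2 < 1)
    count1 += 1
}
var pi = 4.0 * count1 / ntimes

// Program 2: RDD map/reduce (sc is the SparkContext provided by spark-shell)
val NUM_SAMPLES = 10000
val count2 = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count2 / NUM_SAMPLES)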
This code contains two different Pi calculation programs. I'm wondering whether all of it runs on the cluster. My guess is that only the code inside the map{} function is executed on the cluster, while the rest is executed only on the master node, but I'm not sure whether that's correct.

Spark provides a more generic framework than simply Map & Reduce. If you examine the API you can find quite a few other functions that are more generic, such as aggregate. In addition, Spark supports features such as broadcast variables and accumulators that make parallel programming much more effective.
The second question (you really should separate the two):
Yes, the two programs are executed differently. If you want to take advantage of Spark's parallel capabilities, you have to use the RDD data structures. Until you understand how the RDD is distributed and how operations affect the RDD, it will be difficult to use Spark effectively.
Any code that is not executing in a method over an RDD is not parallel; it runs only in the driver process (here, the spark-shell).
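To make the difference between the two styles concrete, here is a minimal sketch in plain Python (not Spark code; everything runs locally). It contrasts a loop that mutates a counter with the same computation written as independent per-sample work followed by an associative sum, which is the shape Spark can distribute. The names inside_circle, pi_loop and pi_map_reduce are made up for this illustration.

```python
import random
from functools import reduce


def inside_circle(_):
    """Return 1 if a random point in the unit square lies inside the unit circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0


def pi_loop(n):
    """Sequential style: a mutable counter updated in a loop."""
    count = 0
    for i in range(n):
        count += inside_circle(i)
    return 4.0 * count / n


def pi_map_reduce(n):
    """Map/reduce style: independent per-sample work, then an associative sum."""
    hits = reduce(lambda a, b: a + b, map(inside_circle, range(n)))
    return 4.0 * hits / n


if __name__ == "__main__":
    print("loop estimate:      ", pi_loop(100000))
    print("map/reduce estimate:", pi_map_reduce(100000))
```

In the second form each sample depends only on its own input, so the mapped function could be shipped to workers and the partial sums combined in any order; the first form depends on shared mutable state and would stay on the driver.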


Scikit-Learn with Dask-Distributed using nested parallelism?

For example suppose I have the code:
vectorizer = CountVectorizer(input=u'filename', decode_error=u'replace')
classifier = OneVsRestClassifier(LinearSVC())
pipeline = Pipeline([
    ('vect', vectorizer),
    ('clf', classifier)])
with parallel_backend('distributed', scheduler_host=host_port):
    scores = cross_val_score(pipeline, X, y, cv=10)
If I execute this code I can see in the dask webview (through Bokeh) that 10 tasks are created (1 for each fold). However if I execute:
(I know X and y should be split into training and testing, but this is just for testing purposes).
with parallel_backend('distributed', scheduler_host=host_port):
I can see 1 task for each y class being created (20 in my case). Is there a way to have the cross_val_score be run in parallel AND the underlying OneVsRestClassifier run in parallel? Or is the original code of
with parallel_backend('distributed', scheduler_host=host_port):
    scores = cross_val_score(pipeline, X, y, cv=10)
running the OneVsRestClassifier in parallel along with the cross_val_score in parallel and I'm just not seeing it? Will I have to implement this manually with dask-distributed?
The design of the parallel backends of joblib is currently too limited to handle nested parallel calls. This problem is tracked here: https://github.com/joblib/joblib/pull/538
We will also need to extend the distributed backend of joblib to use http://distributed.readthedocs.io/en/latest/api.html#distributed.get_client

How do we improve a MongoDB MapReduce function that takes too long to retrieve data and gives out of memory errors?

Retrieving data from MongoDB takes too long, even for small datasets. For bigger datasets we get out of memory errors from the JavaScript engine. We've tried several schema designs and several ways to retrieve data. How do we optimize the MongoDB schema, the mapReduce function, or MongoWire to retrieve more data faster?
We're not very experienced with MongoDB yet and are therefore not sure whether we're missing optimization steps or if we're just using the wrong tools.
1. Background
For graphing and playback purposes we want to store changes for several objects over time. Currently we have tens of objects per project, but expectations are we need to store thousands of objects. The objects may change every second or not change for long periods of time. A Delphi backend writes to and reads from MongoDB through MongoWire and SuperObjects, the data is displayed in a web frontend.
2. Schema design
We're storing the object changes in minute-second-millisecond objects in a record per hour. The schema design is similar to the one described here. Sample:
o: object1,
dt: $date,
v: {0: {0:{0: {speed: 8, rate: 0.8}}}, 1: {0:{0: {speed: 9}}}, …}
We've put indexes on {dt: -1, o: 1} and {o:1}.
3. Retrieving data
We use a mapReduce to construct a new date based on the minute-second-millisecond objects and to put the object back in v:
o: object1,
dt: $date,
v: {speed: 8, rate:0.8}
An average document is about 525 kB before the mapReduce function and has had ~29000 updates. After mapReduce of such a document, the result is about 746 kB.
3.1 Retrieving data through the mongo shell with mapReduce
We're using the following map function:
function mapF() {
  for (var i = 0; i < 3600; i++) {
    var imin = Math.floor(i / 60);
    var isec = (i % 60);
    var min = '' + imin;
    var sec = '' + isec;
    if (this.v.hasOwnProperty(min) && this.v[min].hasOwnProperty(sec)) {
      for (var ms in this.v[min][sec]) {
        if (imin !== 0 && isec !== 0 && ms !== '0' && this.v[min][sec].hasOwnProperty(ms)) { // is our keyframe
          var currentV = this.v[min][sec][ms];
          // newT is new date computed by the min, sec, ms above
          if (toDate > newT && newT > fromDate) {
            if (fields && fields.length > 0) {
              for (var p = 0, length = fields.length; p < length; p++) {
                // check if field is present and put it in newV
              }
              if (newV) {
                emit(this.o, {vs: [{o: this.o, dt: newT, v: newV}]});
              }
            } else {
              emit(this.o, {vs: [{o: this.o, dt: newT, v: currentV}]});
            }
          }
        }
      }
    }
  }
}
The reduce function basically just passes the data on. The call to mapReduce:
db.collection.mapReduce(mapF, reduceF, {
  out: {inline: 1},
  query: {o: {$in: objectNames}, dt: {$gte: keyframeFromDate, $lt: keyframeToDate}},
  sort: {dt: 1},
  scope: {toDate: toDateWithinKeyframe, fromDate: fromDateWithinKeyframe, fields: []},
  jsMode: true
});
Retrieving 2 objects over 1 hour: 2.4 seconds.
Retrieving 2 objects over 5 hours: 8.3 seconds.
For this method we would have to write js and bat files at runtime and read the JSON data back in. We have not measured times for this yet, because frankly, we don't like the idea very much.
Another problem with this method is that we get out of memory errors from the v8 JavaScript engine when we try to retrieve data for longer periods and/or more objects. Using a PC with more RAM helps to some extent in preventing out of memory errors, but it doesn't make retrieving data faster.
This article mentions splitVector, which we might use to divide the workload. But we're not sure how to use the keyPattern and maxChunkSizeBytes options. Can we use a keyPattern for both o and dt?
We might use multiple collections, but our dataset isn't that big to start with at the moment, so we're worried about how many collections we'd need.
3.2 Retrieving data through mongoWire with mapReduce
For retrieving data through mongoWire with mapReduce, we use the same mapReduce functions as above. We use the following Delphi code to start the query:
'mapreduce', 'collection',
'map', bsonJavaScriptCodePrefix + FMapVCRFunction.Text,
'reduce', bsonJavaScriptCodePrefix + FReduceVCRFunction.Text,
'out', BSON(['inline', 1]),
'query', mapquery,
'sort', BSON(['dt', -1]),
'scope', scope
Retrieving data with this method is about 3-4 times (!) slower. On top of that, the data has to be translated from BSON (IBSONDocument) to JSON (SuperObject), which is a major time-consuming part of this method. For retrieving raw data we use TMongoWireQuery, which translates the BSON document in parts, while this mapReduce function uses TMongoWire directly and tries to translate the complete result. This might explain why this takes so long, while normally it's quite fast. If we can reduce the time it takes for the mapReduce to return results, this might be a next step for us to focus on.
3.3 Retrieving raw data and parsing in Delphi
Retrieving raw data to Delphi takes a bit longer than the previous method, but, probably because of the use of TMongoWireQuery, the translation from BSON to JSON is much quicker.
4. Questions
Can we do further optimizations on our schema design?
How can we make the mapReduce function faster?
How can we prevent the out of memory errors of the v8 engine? Can someone give more information on the splitVector function?
How can we make best use of mapReduce from Delphi? Can we use MongoWireQuery instead of MongoWire?
5. Specs
MongoDB 3.0.3
MongoWire from 2015 (recently updated)
Delphi 2010 (got XE5 as well)
4GB RAM (tried on 8GB RAM as well, less out of memory, but reading times are about the same)
Phew, what a question! First up: I'm not an expert at MongoDB. I wrote TMongoWire as a way to get to know MongoDB a little. Also I really (really) dislike when wrappers have a plethora of overloads to do the same thing but for all kinds of specific types. A long time ago programmers didn't have generics, but we did have Variant. So I built a MongoDB wrapper (and IBSONDocument) based around variants. That said, I apparently made something people like to use, and by keeping it simple, it performs quite well. (I haven't been putting much time in it lately, but on the top of the list is catering for the new authentication schemes since version 3.)
Now, about your specific setup. You say you use mapreduce to get from 500KB to 700KB? I think there's a hint there you're using the wrong tool for the job. I'm not sure what the default mongo shell does differently than when you do the same over TMongoWire.Get, but if I assume mapReduce assembles the response first before sending it over the wire, that's where the performance gets lost.
So here's my advice: you're right with thinking about using TMongoWireQuery. It offers a way to process data faster as the server will be streaming it in, but there's more.
I strongly suggest using an array to store the list of seconds. Even if not all seconds have data, store null on the seconds without data so each minute array has 60 items. This is why:
One nicety that turned up in designing TMongoWireQuery, is the assumption you'll be processing a single (BSON) document at a time, and that the contents of the documents will be roughly similar, at least in the value names. So by using the same IBSONDocument instance when enumerating the response, you actually save a lot of time by not having to de-allocate and re-allocate all those variants.
That goes for simple documents, but would actually be nice to have on arrays as well. That's why I created IBSONDocumentEnumerator. You need to pre-load an IBSONDocument instance with an IBSONDocumentEnumerator in the place where you're expecting the array of documents, and you need to process the array in roughly the same way as with TMongoWireQuery: enumerate it using the same IBSONDocument instance, so when subsequent documents have the same keys, time is saved not having to re-allocate them.
In your case though, you would still need to pull the data of an entire hour through the wire just to select the seconds you need. As I said before, I'm not a MongoDB expert, but I suspect there could be a better way to store data like this. Either with a separate document per second (I guess this would let the indexes do more of the work, and MongoDB can take that insert-rate), or with a specific query construction so that MongoDB knows to shorten the seconds array into just the data you're requesting (is that what $slice does?)
Here's an example of how to use IBSONDocumentEnumerator on documents like {name:"fruit",items:[{name:"apple"},{name:"pear"}]}
while q.Next(d) do
while e.Next(d1) do
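If the reshaping is moved out of mapReduce and into the client, the work that the map function does comes down to walking the nested minute/second/millisecond keys and rebuilding a timestamp for each entry. Below is a minimal Python sketch of that flattening, assuming the document shape shown in the question (string keys under v, and dt holding the start of the hour). The name flatten_hour and its parameters are hypothetical, and skipping a record when none of the requested fields is present is an assumption. The sketch does not special-case the keyframe entry.

```python
from datetime import datetime, timedelta


def flatten_hour(doc, from_date, to_date, fields=None):
    """Yield one flat record per stored change strictly inside (from_date, to_date)."""
    for minute, seconds in doc["v"].items():
        for second, millis in seconds.items():
            for ms, values in millis.items():
                # Rebuild the full timestamp from the hour start plus the nested keys.
                ts = doc["dt"] + timedelta(
                    minutes=int(minute), seconds=int(second), milliseconds=int(ms)
                )
                if not (from_date < ts < to_date):
                    continue
                if fields:
                    # Keep only the requested fields (assumption: skip if none match).
                    values = {k: values[k] for k in fields if k in values}
                    if not values:
                        continue
                yield {"o": doc["o"], "dt": ts, "v": values}


# Sample document modelled on the one in the question.
sample = {
    "o": "object1",
    "dt": datetime(2015, 6, 1, 10),
    "v": {"0": {"0": {"0": {"speed": 8, "rate": 0.8}}},
          "1": {"0": {"0": {"speed": 9}}}},
}

if __name__ == "__main__":
    for rec in flatten_hour(sample, datetime(2015, 6, 1, 9, 59), datetime(2015, 6, 1, 11)):
        print(rec)
```

Because it is a generator, records can be consumed one at a time while documents stream in, rather than assembling the whole result in memory first.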

Apache Spark DAGScheduler Missing Parents for Stage

When running my iterative program on Apache Spark I occasionally get the message:
INFO scheduler.DAGScheduler: Missing parents for Stage 4443: List(Stage 4441, Stage 4442)
I gather it means it needs to compute the parent RDD, but I am not 100% sure. I don't just get one of these; I end up with hundreds if not thousands of them at a time. It completely slows down my programme, and another iteration does not complete for 10-15 minutes (they usually take 4-10 seconds).
I cache the main RDD on each iteration, using StorageLevel.MEMORY_AND_DISK_SER. The next iteration uses this RDD. The lineage of the RDD therefore gets very large hence the need for caching. However, if I am caching (and spilling to disk) how can a parent be lost?
I quote Imran Rashid from Cloudera:
It's normal for stages to get skipped if they are shuffle map stages, which get read multiple times. E.g., here's a little example program I wrote earlier to demonstrate this: "d3" doesn't need to be re-shuffled since each time it's read with the same partitioner. So skipping stages in this way is a good thing:
val partitioner = new org.apache.spark.HashPartitioner(10)
val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x }.partitionBy(partitioner)
(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }.partitionBy(partitioner)
  println(idx + " ---> " + otherData.join(d3).count())
}
If you run this and look in the UI, you'd see that all jobs except for the first one have one stage that is skipped. You will also see this in the log:
15/06/08 10:52:37 INFO DAGScheduler: Parents of final stage: List(Stage 12, Stage 13)
15/06/08 10:52:37 INFO DAGScheduler: Missing parents: List(Stage 13)
Admittedly that is not very clear, but that is sort of indicating to you that the DAGScheduler first created stage 12 as a necessary step, and then later on changed its mind by realizing that everything it needed for stage 12 already existed, so there was nothing to do.
See the following for the email source:

Multiple node cassandra cluster is really slow

I had a single node cassandra cluster on EC2. I was running my tests on it and it worked great.
But then, I had to move this cluster to a VPC, so rather than moving the data, I created a new cluster with two nodes (both seeds), and imported the data from the former cluster using sstableloader.
I thought it was really slow, so decided to add two more instances (not seeds). It's even slower.
I use a ONE consistency, and my replication factor is 1, so I don't quite see why it is so slow.
To give you an idea, I can only do 3 reads per second.
We use the EC2Snitch but not the AMI recommended by Cassandra though (we didn't see that part in the documentation when we installed it).
I didn't run a cleanup yet on the two first nodes after adding the two new nodes.
When I request all elements of a column family which contains only a dozen rows, it times out. If I request one element, I get the result after a long time, and with a huge tracing session (~30000 lines...)!
Does anyone know what I can do to make it faster? I don't quite know where to look right now.
My Cassandra version is Cassandra 2.1.3.
Here is my keyspace schema:
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'NetworkTopologyStrategy', 'us-west-2': '1'} AND durable_writes = true;
And the options for our column family
CREATE TABLE keyspace_name."CFName" (
// ...
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
I had to run a compaction on my nodes because I had too many tombstones.
Many thanks to the amazing IRC channel on freenode #cassandra.

Minimizing shuffle when mapper output is mostly sorted

I have a map-reduce process in which the mapper takes input from a file that is sorted by key. For example:
1 ...
2 ...
2 ...
3 ...
3 ...
3 ...
4 ...
Then it gets transformed and 99.9% of the keys stay in the same order in relation to one another and 99% of the remainder are close. So the following might be the output of running the map task on the above data:
a ...
c ...
c ...
d ...
e ...
d ...
e ...
Thus, if you could make sure that a reducer took in a range of inputs and put that reducer in the same node where most of the inputs were already located, the shuffle would require very little data transfer. For example, suppose that I partitioned the data so that a-d were taken care of by one reducer and e-g by the next. Then if a-d could be run on the same node that had handled the mapping of 1-4, only two records for e would need to be sent over the network.
How do I construct a system that takes advantage of this property of my data? I have both Hadoop and Spark available and do not mind writing custom partitioners and the like. However, the full workload is such a classic example of MapReduce that I'd like to stick with a framework which supports that paradigm.
Hadoop mail archives mention consideration of such an optimization. Would one need to modify the framework itself to implement it?
From the Spark perspective there is no direct support for this: the closest is mapPartitions with preservesPartitioning=true. However, that will not directly help in your case, because it requires that the keys not be changed.
/**
 * Return a new RDD by applying a function to each partition of this RDD.
 * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
 * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
 */
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
If you were able to know definitively that none of the keys would move outside of their original partitions the above would work. But the values on the boundaries would likely not cooperate.
What is the scale of the data compared to the migrating keys? You may consider adding a postprocessing step. First construct a partition for all migrating keys. Your mapper would output a special key value for keys needing to migrate. Then postprocess the results to do some sort of append to the standard partitions. That is extra hassle, so you would need to weigh the benefit against the extra step and the added pipeline complexity.
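The split between records that stay put and records that migrate can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark or Hadoop API: partition_for, split_local_and_migrating and the boundaries list are hypothetical names, and the range boundaries are assumed to be known up front.

```python
from bisect import bisect_right


def partition_for(key, boundaries):
    """Return the range-partition index for key.

    boundaries is a sorted list in which boundaries[i] is the first key
    belonging to partition i + 1.
    """
    return bisect_right(boundaries, key)


def split_local_and_migrating(records, local_partition, boundaries):
    """Separate mapper output into records that can stay on this node and
    records that must be sent to another partition."""
    local, migrating = [], []
    for key, value in records:
        target = partition_for(key, boundaries)
        if target == local_partition:
            local.append((key, value))
        else:
            # Tag with the destination so a later step can append it there.
            migrating.append((target, key, value))
    return local, migrating


# Mapper output from the question's example; partition 0 holds a-d, partition 1 holds e-g.
mapper_output = [("a", 1), ("c", 2), ("c", 3), ("d", 4), ("e", 5), ("d", 6), ("e", 7)]
local, migrating = split_local_and_migrating(mapper_output, 0, ["e"])

if __name__ == "__main__":
    print(len(local), "records stay local;", len(migrating), "records migrate")
```

For the example data, the five records keyed a-d stay on the node and only the two records for e are tagged for transfer, matching the count given in the question. The local list would still need a (cheap, nearly sorted) sort before reducing, since d appears after e in the mapper output.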
