Spark performance - how to parallelize large loops?

I have a Spark application that runs 8000 loop iterations in total on a cluster of 5 nodes. Each node has 125 GB of memory and 32 cores. The code in question looks like the following:
for (m <- 0 until deviceArray.size) { // there are 1000 devices
  var id = deviceArray(m)
  for (t <- 1 to timePatterns) { // there are 8 time patterns
    var hrpvData = get24HoursPVF(dataDF, id, t).cache()
    var hrpvDataZI = hrpvData.zipWithIndex
    var clustersLSD = runKMeans(hrpvData, numClusters, numIterations)
    var clusterPVPred = hrpvData.map(x => clustersLSD.predict(x))
    var clusterPVMap = hrpvDataZI.zip(clusterPVPred)
    var pvhgmRDD = clusterPVMap.map { r => (r._2, r._1._2) }.groupByKey
    var arrHGinfo = pvhgmRDD.collect
    // Post process data
    // .....
    hrpvData.unpersist()
  }
}
The function call get24HoursPVF() prepares feature vectors for k-means and takes about 40 seconds. Each loop iteration takes about 50 seconds to finish on the cluster. My data size is 2 to 3 GB (read from tables). With 8000 loops, the total running time of this Spark application is unacceptable (8000 × 50 s).
Since each device is independent, is there any way to parallelize the 8000 iterations? Or how can I use the cluster to cut the total running time? Scala Futures won't work here, because they only submit the jobs nearly simultaneously; Spark still doesn't run those jobs concurrently.

Aside from the for loops, you've got two of the slowest API calls in Spark in your code there: groupByKey and collect.
groupByKey should almost never be used; look at reduceByKey instead (see this Databricks blog for more details).
collect transfers all the data in that RDD to an array on the driver node; unless that's a small amount of data, it'll have a fairly big performance impact.
On the for loops, I'm not particularly familiar with what you're trying to do, but in
var hrpvData = get24HoursPVF(dataDF, id, t).cache()
you're building and caching a new dataframe for each id and t value. I'm not sure why you couldn't just build one single dataframe containing every combination of id and t at the start, then run your zipWithIndex, map, etc. over that whole dataframe.
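For illustration, if the post-processing only needs a per-cluster aggregate rather than the full list of row indices, the groupByKey/collect pair could be replaced along these lines (a sketch only; clusterPVMap comes from the question, and the count aggregate is just an example):

// Sketch: assumes clusterPVMap: RDD[((Vector, Long), Int)] as built in the question,
// i.e. ((featureVector, rowIndex), clusterLabel).
val pvhgmPairs = clusterPVMap.map { case ((_, idx), cluster) => (cluster, idx) }

// If only an aggregate per cluster is needed (e.g. a count), reduceByKey combines
// values on the map side and shuffles far less data than groupByKey:
val clusterSizes = pvhgmPairs
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)

// Collect only the small aggregated result to the driver, not the raw rows.
val arrHGinfo = clusterSizes.collect()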

Related

union not happening with Spark transform

I have a Spark stream into which records are flowing, and the interval size is 1 second.
I want to union all the data in the stream. So I have created an empty RDD and then, using the transform method, I union each RDD in the stream with this empty RDD.
I am expecting this empty RDD to contain all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me if my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();

JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
    .transform(rdd -> rdd.union(records));

transformedMessages.foreachRDD(record -> {
    System.out.println("Aman" + record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("tempTable");
    ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we effectively have transformedMessages = messages (leaving aside the flatMap, which is not relevant to this discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema), records is still empty. Nothing in the flow of the program changes it, so it remains empty as an invariant over time.
I think that what we want to do, instead of
.transform(rdd -> rdd.union(records));
is:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
That said, please note that this process iteratively adds to the lineage of the 'records' RDD and also accumulates all the data over time. It is not a job that can run stably for a long period because, eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach does not seem scalable or sustainable.
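A minimal Scala sketch of that idea, under the caveats above (ssc, messages and processData stand in for the question's streaming context, input DStream and mapping function):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Sketch only: 'ssc', 'messages' and 'processData' are assumed from the question.
var records: RDD[Row] = ssc.sparkContext.emptyRDD[Row]

messages
  .flatMap(record => processData(record))
  .foreachRDD { rdd =>
    // Re-assign the accumulated RDD on the driver. This grows the lineage and the
    // data volume every batch, so it is not sustainable for a long-running job.
    records = rdd.union(records)
    println("Accumulated so far: " + records.count())
  }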

What is the most lightweight/efficient/cheapest RDD action to perform on a huge RDD in Apache Spark?

I am new to Apache Spark.
Below is a snippet of my sample code.
import org.apache.spark.storage.StorageLevel

val x = 5
val arrayVal = (1 to 100000)
val rdd1 = sc.parallelize(arrayVal, x) // huge RDD of 10000 to 100000 elements
var rdd2 = rdd1.map(x => (x, x))
rdd2 = rdd2.cache()
rdd2.count()
val cartesianRDD = rdd2.cartesian(rdd2)
var filteredRDD = cartesianRDD.filter(f => (f._1._1 < f._2._1))
filteredRDD = filteredRDD.repartition(x / 2)
rdd2 = rdd2.unpersist(false)
filteredRDD.persist(StorageLevel.MEMORY_ONLY) // to avoid re-calculation
filteredRDD.count()
The count on this RDD takes many minutes. I want to know the best or most efficient/cheapest/lightweight action to trigger the RDD transformations.
I have also tried rdd.take(1) and rdd.first(), with the same result.
Ultimately my goal is to reduce the time taken by any of these actions, so that the total execution time goes down.
Thanks in advance.
rdd.first() is the cheapest one you can have, since it only materializes the first partition.
The cheapest action that will materialize all partitions is rdd.foreachPartition(_ => ()).
Ultimately my goal is to reduce the time taken by any of these actions, so that the total execution time goes down.
However, the action you take will not affect the time taken by the previous steps. If you want to decrease total time, you have to optimize other things.
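To make that concrete, a small sketch using the question's filteredRDD:

// Cheapest possible trigger: only the first (non-empty) partition is computed.
filteredRDD.first()

// Cheapest trigger that computes every partition: no work per element,
// and nothing is returned to the driver.
filteredRDD.foreachPartition(_ => ())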

Is Fork-Join framework in Java 8 the best option?

I have a scenario where I want to read a spreadsheet of around 2000 records and insert them into a database.
Currently we are using the Executor framework. We have a limitation that the number of tasks should be only 5, and each task reads 20 rows from the spreadsheet. We provide each task with the start and end index of the rows it should read.
Say, currently,
Task-1 handles rows 1-20
Task-2 handles rows 21-40
Task-3 handles rows 41-60
Task-4 handles rows 61-80
Task-5 handles rows 81-100
If Task-1 finishes its execution, it takes the next 20 rows, that is 101-120. Suppose Task-2 finishes before Task-1: it will start reading from 121-140, not 101-120.
Can I handle this scenario more effectively with the Fork-Join framework, keeping the restriction of 5 tasks and 20 rows per task?
I need some insight into the performance implications.
No need to switch thread pools. To balance the load you can just maintain an atomic variable that points to the first row not yet taken:
AtomicInteger currentRow = new AtomicInteger(); // shared between tasks
final int maxRow = 2000;
final int batchSize = 20;

// Inside every task:
while (true) {
    int row = currentRow.getAndAdd(batchSize);
    if (row >= maxRow) return;
    int from = row + 1;
    int to = Math.min(row + batchSize, maxRow);
    // process rows from..to; it's guaranteed that other threads
    // do not process the same rows.
}
The body of every task is exactly the same. This implementation also does not depend on the number of tasks created: if you later decide to have 3 tasks or 7 tasks, just adjust the thread pool size and submit more (or fewer) tasks.

Is it possible to get the first n elements of every RDD in Spark Streaming?

When using Spark Streaming, is it possible to get the first n elements of every RDD in a DStream? In the real world, my stream consists of a number of geotagged events, and I want to take the 100 (or whatever) which are closest to a given point for further processing, but a simple example which shows what I'm trying to do is something like:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object take {
  def main(args: Array[String]) {
    val data = 1 to 10
    val sparkConf = new SparkConf().setAppName("Take")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))
    val rdd = streamingContext.sparkContext.makeRDD(data)
    val stream = new ConstantInputDStream(streamingContext, rdd)

    // In the real world, do a bunch of stuff which results in an ordered RDD

    // This obviously doesn't work
    // val filtered = stream.transform { _.take(5) }

    // In the real world, do some more processing on the DStream

    stream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
I understand I could pull the top n results back to the driver fairly easily, but that isn't something I want to do in this case as I need to do further processing on the RDD after having filtered it down.
Why is it not working? I think your example is fine.
1. Compute the distance for each event.
2. Sort the events by distance, with a number of partitions adapted to your amount of data.
3. Take the first 100 events from each partition (so you'll shuffle only a small part of the initial data), and turn the returned collection into a new RDD with sparkContext.parallelize(data).
4. Sort again with only one partition, so all the data is shuffled into the same dataset.
5. Take the first 100 events; this is your top 100.
The code for the sort is the same in steps 2 and 4; you only change the number of partitions.
Step 1 is executed on the DStream; steps 2 to 5 are executed on the RDDs inside a transform operation.
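A rough sketch of steps 2 to 5 as a function over one batch RDD, assuming step 1 has already produced (distance, event) pairs; the partition count of 16 is only an illustrative value, and eventStream, distanceTo and target in the usage comment are placeholders:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Sketch only: the value type T stands for the question's geotagged event.
def top100ByDistance[T: ClassTag](withDistance: RDD[(Double, T)]): RDD[(Double, T)] = {
  // Step 2: sort by distance across a partition count suited to the data volume.
  val sorted = withDistance.sortByKey(ascending = true, numPartitions = 16)

  // Step 3: keep only the first 100 events of each partition, so at most
  // 16 * 100 records survive into the next shuffle.
  val candidates = sorted.mapPartitions(_.take(100))

  // Steps 4 and 5: sort the survivors into a single partition and keep the
  // overall first 100, staying inside an RDD for further processing.
  candidates
    .sortByKey(ascending = true, numPartitions = 1)
    .mapPartitions(_.take(100))
}

// Step 1 plus the rest, on the DStream:
// val top100Stream = eventStream.map(e => (distanceTo(target, e), e))
//                               .transform(top100ByDistance(_))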

streaming and bulk update to elasticsearch

As part of data analysis, I collect records I need to store in Elasticsearch. As of now I gather the records in an intermediate list, which I then write via a bulk update.
While this works, it has its limits when the number of records is so large that they do not fit into memory. I am therefore wondering if it is possible to use a "streaming" mechanism, which would allow me to
persistently open a connection to elasticsearch
continuously update in a bulk-like way
I understand that I could simply open a connection to Elasticsearch and update classically as data become available, but this is about 10 times slower, so I would like to keep the bulk mechanism:
import elasticsearch
import elasticsearch.helpers
import elasticsearch.client
import random
import string
import time

index = "testindexyop1"
es = elasticsearch.Elasticsearch(hosts='elk.example.com')

if elasticsearch.client.IndicesClient(es).exists(index=index):
    ret = elasticsearch.client.IndicesClient(es).delete(index=index)

data = list()
for i in range(1, 10000):
    data.append({'hello': ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))})

start = time.time()

# this version takes 25 seconds
# for _ in data:
#     res = es.bulk(index=index, doc_type="document", body=_)

# and this one - 2 seconds
elasticsearch.helpers.bulk(client=es, index=index, actions=data, doc_type="document", raise_on_error=True)

print(time.time() - start)
You can always simply split data into n approximately equally sized sets such that each of them fits in memory and then do n bulk updates. This seems to be the easiest solution to me.
