I am running a PySpark application in which we compare two large datasets of 3 GB each. There are some differences between the datasets, which we filter out via an outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key, how='outer')
                    .where(condition)
                    .select(primary_key))
mismatch_ids_row.count()
So the output of the join is small, say 10 records. The shuffle partition count at this point is about 30, derived as total data size / target partition size (100 MB).
After the join, each of the two original datasets is joined with the join result to filter out the relevant rows for each dataframe.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
We drop duplicates here since the result of the first (outer) join can contain duplicated rows where some values are null.
These two dataframes are then joined to compare at the column level and pinpoint exactly where the data is mismatched.
result_df = (df_1.join(df_2, on=some condition, how="full_outer"))
result_df.count()
The resultant dataset is then displayed:
result_df.show()
The issue is that the first join, which involves the larger data, uses a sort-merge join with 30 shuffle partitions, which is fine since the dataset is fairly large.
After the first join, there are only 10 mismatched rows, yet joining them back with the 3 GB datasets is a costly operation, and using a broadcast join didn't help.
The major issue, in my opinion, comes when the two small resultant datasets are joined in the second join to produce the result. Here too many shuffle partitions are killing the performance.
The application is running in client mode as a Spark run for testing purposes, and the parameters are sufficient for it to run on the driver node.
As an example:
data1 = [(335008138387,83165192,"yellow","2017-03-03",225,46),
(335008138384,83165189,"yellow","2017-03-03",220,4),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
data2 = [(335008138387,83165192,"yellow","2017-03-03",300,46),
(335008138384,83165189,"yellow","2017-03-03",220,10),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType

field = [
    StructField("row_num", LongType(), True),
    StructField("tripid", IntegerType(), True),
    StructField("car_type", StringType(), True),
    StructField("dates", StringType(), True),
    StructField("pickup_location_id", IntegerType(), True),
    StructField("trips", IntegerType(), True)
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1,schema=schema)
sourcetwodf = spark.createDataFrame(data=data2,schema=schema)
They have just two differences; on a larger dataset, think of these as 10 or more differences.
df_1 will get rows from sourceonedf based on mismatch_ids_row, and likewise df_2 from sourcetwodf. They are then joined to create another resultant dataframe which outputs the data.
How can we optimize this piece of code so that an optimal number of partitions is used and it performs faster than it does now?
At this point it takes ~500 seconds for the whole activity, when it could take about 200 seconds less. Also, why does show() take time as well? There are only 10 records, so it should print quite fast if they are all in one partition, I guess.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
    mismatch_ids_row.alias('a')
    .join(mismatch_ids_row.alias('b'), on=some condition...)
    .select(...)
)
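For concreteness, here is one rough way the whole idea could look end to end, written against the sample schema from the question. Rather than a literal self-join, this sketch aliases the two sources before the first join so both sides' columns stay addressable afterwards; the primary key and the compared columns (pickup_location_id, trips) are only assumptions for illustration, and condition is whatever mismatch condition you already use:

from pyspark.sql import functions as F

# Assumption for illustration only: the primary key is this pair of columns.
primary_key = ["row_num", "tripid"]

mismatch_ids_row = (
    sourceonedf.alias('a')
    .join(sourcetwodf.alias('b'), on=primary_key, how='outer')
    .where(condition)          # your existing mismatch condition
)
mismatch_ids_row.cache()
mismatch_ids_row.count()

# Column-level comparison straight off the small cached result,
# without df_1, df_2, or a second pass over the 3 GB inputs.
result_df = mismatch_ids_row.select(
    *primary_key,
    F.col('a.pickup_location_id').alias('pickup_location_id_one'),
    F.col('b.pickup_location_id').alias('pickup_location_id_two'),
    F.col('a.trips').alias('trips_one'),
    F.col('b.trips').alias('trips_two'),
)
result_df.show()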
I have a table with around 50,000,000 records.
I would like to fetch one column of the whole table
SELECT id FROM `project.dataset.table`
Running this code in the Web Console takes around 80 seconds.
However when doing this with the Ruby Gem, I'm limited to fetch only 100,000 records per query. With the #next method I can access the next 100,000 records.
require "google/cloud/bigquery"

@big_query = Google::Cloud::Bigquery.new(
  project: "project",
  keyfile: "keyfile"
)
@dataset = @big_query.dataset("dataset")
@table = @dataset.table("table")

queue = @big_query.query("SELECT id FROM `project.dataset.table`", max: 1_000_000)
stash = queue
loop do
  queue = queue.next
  unless queue
    break
  else
    O.timed stash.size
    stash += queue
  end
end
The problem with this is that each request takes around 30 seconds. max: 1_000_000 is of no use, I'm stuck at 100,000. This way the query takes over 4 hours, which is not acceptable.
What am I doing wrong?
You should rather run an export job; this way you will have the data as file(s) on GCS.
Then downloading from there is easy.
https://cloud.google.com/bigquery/docs/exporting-data
Ruby example here https://github.com/GoogleCloudPlatform/google-cloud-ruby/blob/master/google-cloud-bigquery/lib/google/cloud/bigquery.rb
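For example, a rough sketch with the same gem (the bucket name and file pattern are placeholders, and depending on the gem version the method is extract_job or extract):

require "google/cloud/bigquery"

big_query = Google::Cloud::Bigquery.new(
  project: "project",
  keyfile: "keyfile"
)
table = big_query.dataset("dataset").table("table")

# Export the table to Cloud Storage; the wildcard lets BigQuery split a large
# table into several files.
job = table.extract_job "gs://my-bucket/table-export-*.csv"
job.wait_until_done!
raise job.error["message"] if job.failed?
# Then download the files from gs://my-bucket/ and read the id column locally.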
I have a Spark stream into which records are flowing, and the batch interval is 1 second.
I want to union all the data in the stream. So I have created an empty RDD, and then, using the transform method, I union each RDD in the stream with this empty RDD.
I am expecting this empty RDD to have all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me if my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();

JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
                                               .transform(rdd -> rdd.union(records));

transformedMessages.foreachRDD(record -> {
    System.out.println("Aman" + record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("tempTable");
    ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we have transformedMessages = messages (leaving aside the flatMap function, which is not relevant to the discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema), records is still empty. That does not change in the flow of the program, so it will remain empty as an invariant over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
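A rough Java translation of that idea could look like the sketch below. Note that a Java lambda can only capture effectively final variables, so the accumulated RDD has to live in a holder (an AtomicReference here, but a class field works too):

// needs: import java.util.concurrent.atomic.AtomicReference;

AtomicReference<JavaRDD<Row>> accumulated =
        new AtomicReference<>(ss.emptyDataFrame().toJavaRDD());

JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record));

transformedMessages.foreachRDD(rdd -> {
    // Union this micro-batch into the accumulator, instead of unioning the
    // always-empty 'records' RDD into the batch.
    accumulated.set(accumulated.get().union(rdd));

    StructType schema = DataTypes.createStructType(fields);
    Dataset<Row> ds = ss.createDataFrame(accumulated.get(), schema);
    ds.createOrReplaceTempView("tempTable");
    ds.show();
});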
That said, please note that this approach iteratively adds to the lineage of the 'records' RDD and also accumulates all the data over time. It is not a job that can run stably for a long period of time, since eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach does not seem scalable or sustainable.
I have a Spark application that runs 8000 loop iterations in total on a cluster of 5 nodes. Each node has 125 GB of memory and 32 cores. The code in question looks like the following:
for (m <- 0 until deviceArray.size) {   // there are 1000 devices
  var id = deviceArray(m)
  for (t <- 1 to timePatterns) {        // there are 8 time patterns
    var hrpvData = get24HoursPVF(dataDF, id, t).cache()
    var hrpvDataZI = hrpvData.zipWithIndex
    var clustersLSD = runKMeans(hrpvData, numClusters, numIterations)
    var clusterPVPred = hrpvData.map(x => clustersLSD.predict(x))
    var clusterPVMap = hrpvDataZI.zip(clusterPVPred)
    var pvhgmRDD = clusterPVMap.map { r => (r._2, r._1._2) }.groupByKey
    var arrHGinfo = pvhgmRDD.collect
    // Post process data
    // .....
    hrpvData.unpersist()
  }
}
The function call get24HoursPVF() prepares feature vectors for k-means, and it takes about 40 seconds. Each loop takes about 50 seconds to finish using the cluster. My data size is from 2 to 3 GB (read from tables). Given 8000 loops, the total time running this Spark application is unacceptable (8000x50s).
Since each device is independent, is there any way to parallelize the 8000 iterations? Or how can the cluster be used to cut the total running time? Scala Futures won't work, because they would just submit the jobs near-simultaneously, and Spark won't actually run these jobs simultaneously.
Aside from the for loops, you've got two of the slowest API calls in Spark in your code there: groupByKey and collect.
groupByKey should almost never be used; instead, look at reduceByKey. See this Databricks blog for more details.
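As a generic illustration of the difference (not tied to the code above), summing trips per device:

import org.apache.spark.rdd.RDD

// Hypothetical (deviceId, trips) pairs; sc is the SparkContext.
val trips: RDD[(Long, Int)] = sc.parallelize(Seq((1L, 4), (1L, 6), (2L, 3)))

// groupByKey ships every individual value for a key across the shuffle, then sums:
val slow = trips.groupByKey().mapValues(_.sum)

// reduceByKey pre-aggregates within each partition before shuffling:
val fast = trips.reduceByKey(_ + _)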
collect transfers all the data in that RDD to an array on the driver node; unless that's a small amount of data, it'll have a fairly big performance impact.
On the for loops, I'm not particularly familiar with what you're trying to do, but in
var hrpvData = get24HoursPVF(dataDF, id, t).cache()
you're building and caching a new dataframe for each id and t value. I'm not sure why you couldn't just build one single dataframe containing each variant of id and t at the start, then run your zipWithIndex, map, etc over that whole dataframe?
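To make that concrete, one possible shape is sketched below. get24HoursPVFAll and runLocalKMeans are hypothetical stand-ins (assuming Long device ids), and the per-group processing would have to mirror whatever runKMeans and your post-processing actually do; the point is simply to key all feature vectors by (id, t) once and do the per-device work inside a single distributed job instead of 8000 driver-side iterations:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical variant of get24HoursPVF that builds the feature vectors for
// every (deviceId, timePattern) combination in one pass, keyed by that pair.
val allFeatures: RDD[((Long, Int), Vector)] = get24HoursPVFAll(dataDF).cache()

val perDeviceResults = allFeatures
  .groupByKey()                     // or aggregateByKey, to avoid materialising full groups
  .mapValues { vectors =>
    // local clustering / post-processing for one (device, pattern) group
    runLocalKMeans(vectors.toSeq)   // hypothetical
  }

perDeviceResults.count()            // or collect the (small) per-group summaries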
When using Spark Streaming, is it possible to get the first n elements of every RDD in a DStream? In the real world, my stream consists of a number of geotagged events, and I want to take the 100 (or however many) that are closest to a given point for further processing. A simple example which shows what I'm trying to do is something like:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
object take {
  def main(args: Array[String]) {
    val data = 1 to 10
    val sparkConf = new SparkConf().setAppName("Take")
    val streamingContext = new StreamingContext(sparkConf, Seconds(1))

    val rdd = streamingContext.sparkContext.makeRDD(data)
    val stream = new ConstantInputDStream(streamingContext, rdd)

    // In the real world, do a bunch of stuff which results in an ordered RDD

    // This obviously doesn't work
    // val filtered = stream.transform { _.take(5) }

    // In the real world, do some more processing on the DStream

    stream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
I understand I could pull the top n results back to the driver fairly easily, but that isn't something I want to do in this case as I need to do further processing on the RDD after having filtered it down.
Why is it not working? I think your example is fine.
1. Compute the distance for each event.
2. Sort the events by distance, with a number of partitions adapted to your amount of data.
3. Take the first 100 events from each partition (so you'll shuffle a small part of the initial data), and make the returned collection a new RDD with sparkContext.parallelize(data).
4. Sort again with only one partition so all the data is shuffled into the same dataset.
5. Take the first 100 events; this is your top 100.
The code for the sort is the same in steps 2 and 4; you just change the number of partitions.
Step 1 is executed on the DStream, steps 2 to 5 are executed on the RDDs in a transform operation.
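A rough sketch of how that could look with the toy example above, placed where the commented-out transform is (before the context is started). Here n = 5, the "distance" is just the value itself, and the partition count is illustrative:

val n = 5
val numPartitions = 4   // adapt to your data volume

val topN = stream.transform { rdd =>
  val sc = rdd.sparkContext
  // Step 1: key each event by its distance (here simply the value).
  val byDistance = rdd.map(x => (x.toDouble, x))
  // Step 2: sort by distance across several partitions.
  val sorted = byDistance.sortByKey(ascending = true, numPartitions = numPartitions)
  // Step 3: first n per partition, then re-parallelise that small candidate set.
  val candidates = sc.parallelize(sorted.mapPartitions(_.take(n)).collect().toSeq)
  // Steps 4-5: one single-partition sort, then the global first n, still as an RDD.
  candidates.sortByKey(ascending = true, numPartitions = 1)
            .mapPartitions(_.take(n))
            .values
}

topN.print()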