union not happening with Spark transform - spark-streaming

I have a Spark stream with records flowing in, and the batch interval is 1 second.
I want to union all the data in the stream. So I created an empty RDD and then, using the transform method, I union each RDD in the stream with this empty RDD.
I expect this empty RDD to contain all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me whether my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();
JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
    .transform(rdd -> rdd.union(records));
transformedMessages.foreachRDD(record -> {
    System.out.println("Aman" + record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("tempTable");
    ds.show();
});

Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so transformedMessages = messages (ignoring the flatMap function, which is not relevant to this discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema); records is still empty. Nothing in the program ever reassigns it, so it stays empty over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
That said, please note that this process adds to the lineage of the 'records' RDD on every batch and also accumulates all data over time. Such a job cannot run stably for a long period: eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach seems neither scalable nor sustainable.
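The point that union returns a new RDD instead of modifying 'records' can be illustrated with a plain-Python analogy. This is not Spark code: immutable tuples stand in for RDDs, and the `union` helper and the batch values are hypothetical names chosen for the illustration.

```python
# Plain-Python analogy: tuples are immutable, much like RDDs.
def union(left, right):
    """Return a NEW tuple; neither argument is modified."""
    return left + right

batches = [(1, 2), (3,), (4, 5)]  # hypothetical micro-batches

# Pattern from the question: the result of union is used downstream,
# but `records` itself is never reassigned, so it stays empty.
records = ()
transformed = [union(batch, records) for batch in batches]

# Pattern from the answer: reassign the accumulator on every batch.
accumulated = ()
for batch in batches:
    accumulated = union(batch, accumulated)

print(records)      # still empty
print(accumulated)  # holds every element seen so far
```

As the answer warns, the accumulator in the second pattern grows with every batch and never shrinks.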

Related

Multiple consecutive join operations on PySpark

I am running a PySpark application that compares two large datasets of 3 GB each. There are some differences between the datasets, which we filter out via an outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition).select(primary_key))
mismatch_ids_row.count()
The output of the join, as shown by the count, is small: say 10 records. The number of shuffle partitions at this point is about 30, calculated as amount of data / partition size (100 MB).
After the join, each of the two original datasets is joined with the result to filter out the matching rows for each dataframe.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
Here we drop duplicates because the outer join doubles the result of the first join, with some values being null.
These two dataframes are then joined to do a column-level comparison and find exactly where the data is mismatched.
result_df = (df_1.join(df_2,on=some condition, how="full_outer"))
result_df.count()
The resultant dataset is then displayed with:
result_df.show()
The issue is that the first join, which handles more data, uses a sort-merge join with 30 partitions, which is fine since the dataset is fairly large.
Once the first join is done, there are only 10 mismatched rows; joining them with the 3 GB datasets is a costly operation, and using broadcast didn't help.
The major issue, in my opinion, comes when joining the two small resulting datasets in the second join to produce the result. Here too many shuffle partitions are killing the performance.
The application runs in client mode, as a Spark run for testing purposes, and the parameters are sufficient for it to run on the driver node.
Here is the DAG for the last operation:
As an example:
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType

data1 = [
    (335008138387, 83165192, "yellow", "2017-03-03", 225, 46),
    (335008138384, 83165189, "yellow", "2017-03-03", 220, 4),
    (335008138385, 83165193, "yellow", "2017-03-03", 210, 11),
    (335008138386, 83165194, "yellow", "2017-03-03", 230, 12),
    (335008138387, 83165195, "yellow", "2017-03-03", 240, 13),
    (335008138388, 83165196, "yellow", "2017-03-03", 250, 14),
]
data2 = [
    (335008138387, 83165192, "yellow", "2017-03-03", 300, 46),
    (335008138384, 83165189, "yellow", "2017-03-03", 220, 10),
    (335008138385, 83165193, "yellow", "2017-03-03", 210, 11),
    (335008138386, 83165194, "yellow", "2017-03-03", 230, 12),
    (335008138387, 83165195, "yellow", "2017-03-03", 240, 13),
    (335008138388, 83165196, "yellow", "2017-03-03", 250, 14),
]
field = [
    StructField("row_num", LongType(), True),
    StructField("tripid", IntegerType(), True),
    StructField("car_type", StringType(), True),
    StructField("dates", StringType(), True),
    StructField("pickup_location_id", IntegerType(), True),
    StructField("trips", IntegerType(), True),
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1, schema=schema)
sourcetwodf = spark.createDataFrame(data=data2, schema=schema)
They have just two differences; on a larger dataset, think of these as 10 or more differences.
df_1 gets the rows from sourceonedf that match mismatch_ids_row, and df_2 does the same for sourcetwodf. They are then joined to create another resultant dataframe, which outputs the data.
How can we optimize this piece of code so that it uses an optimal number of partitions and runs faster than it does now?
At this point the whole activity takes ~500 seconds, when it could take about 200 seconds less. And why does show() take time as well? There are only 10 records, so it should print pretty fast if they are all in one partition, I guess.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
mismatch_ids_row.alias('a')
.join(mismatch_ids_row.alias('b'), on=some condition...)
.select(...)
)
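To see why the extra joins back to the source tables are unnecessary, here is a plain-Python sketch of the same logic on three of the sample rows from the question. It is not PySpark: dictionaries stand in for DataFrames, the helper names are hypothetical, and using `tripid` as the primary key is an assumption, since the question does not say which column `primary_key` is.

```python
# Column order follows the schema in the question.
COLUMNS = ("row_num", "tripid", "car_type", "dates", "pickup_location_id", "trips")
KEY = COLUMNS.index("tripid")  # assumed primary key

data1 = [
    (335008138387, 83165192, "yellow", "2017-03-03", 225, 46),
    (335008138384, 83165189, "yellow", "2017-03-03", 220, 4),
    (335008138385, 83165193, "yellow", "2017-03-03", 210, 11),
]
data2 = [
    (335008138387, 83165192, "yellow", "2017-03-03", 300, 46),
    (335008138384, 83165189, "yellow", "2017-03-03", 220, 10),
    (335008138385, 83165193, "yellow", "2017-03-03", 210, 11),
]

def mismatches(source_one, source_two):
    """Full outer join on KEY, keeping only keys whose rows differ.

    Each result carries BOTH rows, so no second join against the
    sources is needed for the column-level comparison.
    """
    one = {row[KEY]: row for row in source_one}
    two = {row[KEY]: row for row in source_two}
    result = {}
    for key in sorted(one.keys() | two.keys()):
        a, b = one.get(key), two.get(key)  # None when a side has no such key
        if a != b:
            result[key] = (a, b)
    return result

def differing_columns(a, b):
    """Names of the columns in which two rows with the same key differ."""
    if a is None or b is None:
        return list(COLUMNS)
    return [name for name, x, y in zip(COLUMNS, a, b) if x != y]

report = {
    key: differing_columns(a, b)
    for key, (a, b) in mismatches(data1, data2).items()
}
print(report)
```

The small mismatch result is the only thing that needs to be kept around, which is what caching the first join achieves in the answer.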

Could using changelogs cause a bottleneck for the app itself?

I have a Spring Cloud Kafka Streams application that rekeys incoming data so that it can join two topics, and then uses selectKey, mapValues and aggregation on the data. Over time the consumer lag seems to increase, and scaling out by adding more instances of the app doesn't help at all. With every instance, the consumer lag still seems to increase.
I scaled the instances up and down between 1 and 18, but noticed no big difference. The number of messages it lags behind keeps increasing every 5 seconds, independent of the number of instances.
KStream<String, MappedOriginalSensorData> flattenedOriginalData = originalData
.flatMap(flattenOriginalData())
.through("atl-mapped-original-sensor-data-repartition", Produced.with(Serdes.String(), new MappedOriginalSensorDataSerde()));
//#2. Save modelid and algorithm parts of the key of the errorscore topic and reduce the key
// to installationId:assetId:tagName
//Repartition ahead of time avoiding multiple repartition topics and thereby duplicating data
KStream<String, MappedErrorScoreData> enrichedErrorData = errorScoreData
.map(enrichWithModelAndAlgorithmAndReduceKey())
.through("atl-mapped-error-score-data-repartition", Produced.with(Serdes.String(), new MappedErrorScoreDataSerde()));
return enrichedErrorData
//#3. Join
.join(flattenedOriginalData, join(),
JoinWindows.of(
// allow messages within one second to be joined together based on their timestamp
Duration.ofMillis(1000).toMillis())
// configure the retention period of the local state store involved in this join
.until(Long.parseLong(retention)),
Joined.with(
Serdes.String(),
new MappedErrorScoreDataSerde(),
new MappedOriginalSensorDataSerde()))
//#4. Set instalation:assetid:modelinstance:algorithm::tag key back
.selectKey((k,v) -> v.getOriginalKey())
//#5. Map to ErrorScore (basically removing the originalKey field)
.mapValues(removeOriginalKeyField())
.through("atl-joined-data-repartition");
Then the aggregation part:
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
.as(localStore.getStoreName());
// Set retention of changelog topic
materialized.withLoggingEnabled(topicConfig);
// Configure how windows looks like and how long data will be retained in local stores
TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));
// Processing description:
// 2. With the groupByKey we group the data on the new key
// 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
// 4. With reduce we determine the maximum value in the time window
// 5. Materialized will make it stored in a table
stream.groupByKey()
.windowedBy(configuredTimeWindows)
.reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
}
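private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
// Chain the calls and return the result: depending on the Kafka Streams version,
// until() may return a new TimeWindows instead of modifying the receiver
return TimeWindows.of(windowSizeMs).until(retentionMs);
}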
I would expect that increasing the number of instances would decrease the consumer lag tremendously.
In this setup there are multiple topics involved:
* original-sensor-data
* error-score
* kstream-joinother
* kstream-jointhis
* atl-mapped-original-sensor-data-repartition
* atl-mapped-error-score-data-repartition
* atl-joined-data-repartition
The idea is to join original-sensor-data with error-score. The rekeying requires the atl-mapped-* topics. The join then uses the kstream-* topics, and in the end the result of the join is written to atl-joined-data-repartition. After that, the aggregation also creates topics, but I am leaving those out of scope for now.
original-sensor-data -- atl-mapped-original-sensor-data-repartition -- kstream-jointhis --\
                                                                                            atl-joined-data-repartition
error-score ----------- atl-mapped-error-score-data-repartition ------ kstream-joinother -/
Since I introduced the join and the atl-mapped topics, increasing the number of instances no longer seems to have much effect, so I'm wondering whether this topology could have become its own bottleneck. Judging by the consumer lag, the original-sensor-data and error-score topics have a much smaller lag than, for instance, the atl-mapped-* topics. Is there a way to cope with this by removing these changelogs, or would that make it impossible to scale?

Sending Items to specific partitions

I'm looking for a way to send structures to predetermined partitions so that they can be used by another RDD.
Let's say I have two RDDs of key-value pairs:
val a:RDD[(Int, Foo)]
val b:RDD[(Int, Foo)]
val aStructure = a.reduceByKey(//reduce into large data structure)
b.mapPartitions{
iter =>
val usefulItem = aStructure(samePartitionKey)
iter.map(//process iterator)
}
How could I set up the partitioning so that the specific data structure I need is present for the mapPartitions call, without the extra overhead of sending over all values (which would happen if I used a broadcast variable)?
One thought I have had is to store the objects in HDFS, but I'm not sure whether that would be a suboptimal solution.
Another thought I am currently exploring is whether I can create a custom Partition or Partitioner that could hold the data structure (although that might get too complicated and become problematic).
Thank you for your help!
Edit:
Pangea makes a very good point that I should offer some more specifics. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a mapPartitions over the RDD of vectors, where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition, and doing a join would give me a lot of copies of that index.
val vectors:RDD[(Int, SparseVector)]
val invertedIndexes:RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions { iter =>
  val invIndex = invertedIndexes(samePartitionKey) // pseudocode: the index for this partition
  iter.map(invIndex.calculateSimilarity(_))
}
A Partitioner is a function that, given an element's key, returns the partition it belongs to. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I understand your question correctly, you want the data to be partitioned while doing the reduce.
See the example:
import org.apache.spark.Partitioner
// create example data
val a = sc.parallelize(List((1,1), (1,2), (2,3), (2,4)))
// create simple sample partitioner - 2 partitions, one for odd
// one for even key.hashCode. You should put your partitioning logic here
// (math.abs keeps the result non-negative when hashCode is negative)
val p = new Partitioner { def numPartitions: Int = 2; def getPartition(key: Any): Int = math.abs(key.hashCode % 2) }
// your reduceByKey function. Sample: just add
val f = (a:Int,b:Int) => a + b
val rdd = a.reduceByKey(p, f)
// here your rdd will be partitioned the way you want with the number
// of partitions you want
rdd.partitions.size
res8: Int = 2
rdd.map() .. // go on with your processing
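The idea behind sharing one partitioner can be sketched in plain Python (this is not Spark code). When two keyed collections are split by the same function, matching keys land in the same partition index, so each partition of vectors only needs the index entries of its own partition. The function names and sample data below are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 2

def get_partition(key):
    # Python's % is non-negative for a positive modulus,
    # so negative hash values are handled safely here.
    return hash(key) % NUM_PARTITIONS

def partition_by(pairs):
    """Group (key, value) pairs by the partition their key maps to."""
    parts = defaultdict(list)
    for key, value in pairs:
        parts[get_partition(key)].append((key, value))
    return parts

vectors = [(1, "v1a"), (1, "v1b"), (2, "v2a"), (3, "v3a")]
indexes = [(1, "index-1"), (2, "index-2"), (3, "index-3")]

# Both collections use the SAME partitioning function.
vector_parts = partition_by(vectors)
index_parts = partition_by(indexes)

def process_partition(p):
    """Look up each vector's index using only partition p's local data."""
    local_index = dict(index_parts[p])
    return [(key, vec, local_index[key]) for key, vec in vector_parts[p]]

results = {p: process_partition(p) for p in range(NUM_PARTITIONS)}
print(results)
```

Every lookup succeeds inside the partition, and each index object is stored once per partition rather than once per vector.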

Spark dataframe requery when converted to rdd

I have a dataframe queried as
val df1 = sqlContext.sql("select * from table1 limit 1")
df1.cache()
df1.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7698141.001,8141-11,GOOD,22.01,number,2015-10-07 11:34:37.492])
However, if I continue
val df2 = df1.rdd
df2.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7685751.001,5751-05,GOOD,0.0,number,2015-10-03 13:19:22.631])
The two results are totally different even though I tried to cache df1. Is there a way to make the result consistent, i.e. so that df2 does not requery the table to get the value? Thank you.
With take(1) you are just taking an arbitrary row out of the RDD. When the command is executed, no order or sorting is specified. As you have a distributed dataset, there is no guarantee that you get the same value every time.
You could sort or filter the RDD, e.g. based on a key (index) or a schema column. Then you should always be able to extract the same value you are looking for.
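A plain-Python sketch of this point: when partitions can be read in any order, "the first row" is not well defined, whereas selecting by an explicit sort key is. This is an illustration only, not Spark code; the partition contents and the two helper functions are hypothetical.

```python
from itertools import chain, permutations

# Hypothetical partitions of (id, value) rows.
partitions = [
    [(3, "c")],
    [(1, "a"), (4, "d")],
    [(2, "b")],
]

def take_first(parts):
    """Take one row with no ordering: first row of whichever partition comes first."""
    return next(chain.from_iterable(parts))

def take_first_sorted(parts, key=lambda row: row[0]):
    """Take the row with the smallest sort key, independent of partition order."""
    return min(chain.from_iterable(parts), key=key)

# Try every possible order in which the partitions could be read.
unordered = {take_first(p) for p in permutations(partitions)}
ordered = {take_first_sorted(p) for p in permutations(partitions)}

print(len(unordered))  # more than one possible answer
print(ordered)         # exactly one answer
```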

Hadoop Buffering vs Streaming

Could someone please explain to me what is the difference between Hadoop Streaming vs Buffering?
Here is the context I have read in Hive :
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
In a reduce-side join, the values from multiple tables are usually tagged so that the reducer stage can identify which table they come from.
Consider the case of two tables:
On each reduce call, the mixed values associated with both tables are iterated.
During iteration, the values for one of the tags/tables are stored locally in an ArrayList. (This is buffering.)
As the rest of the values are streamed through and values for the other tag/table are detected, the values of the first tag are fetched from the saved ArrayList. The two tagged values are joined and written to the output collector.
Contrast this with the case where the larger table's values are kept in the ArrayList: this could result in an OOM error if the ArrayList outgrows the memory of the container's JVM.
void reduce(TextPair key, Iterator<TextPair> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // buffer for table1
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // table1 tag (copied, because Hadoop may reuse the key object between iterations)
    Text table1Tag = new Text(key.getSecond());
    TextPair value = null;
    while (values.hasNext()) {
        value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            // buffering: keep a copy, since Hadoop reuses the value object
            table1Values.add(new Text(value.getFirst()));
        } else {
            // streaming: join each incoming row against the buffer and emit it
            for (Text val : table1Values) {
                output.collect(key.getFirst(), new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}
You can use the hint below to specify which of the joined tables is streamed on the reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Hadoop Streaming in general refers to using custom-made Python or shell scripts to perform your map-reduce logic (for example, using the Hive TRANSFORM keyword).
Hadoop buffering, in this context, refers to the phase of a map-reduce job for a Hive query with a join in which records are read into the reducers, after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables are last: it helps optimize the implementation of joins in Hive.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, it must take records from multiple tables, sort them by the join key, and then collate them in the proper order. The reducer reads them grouped by table, so it sees the groups from the different tables one after another and can only start processing once all tables have been seen. The groups from the first tables need to be buffered (kept in memory) because they cannot be processed until the last table is seen. The last table can be streamed (each row is processed as it is read), since the other tables' groups are already in memory and the join can start.
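The buffering/streaming distinction can be sketched in plain Python. This is a toy model, not Hive or Hadoop code. It assumes, as in a reduce-side join, that the values for one join key arrive tagged with their table and ordered so that the buffered table comes first; the tags, helper names and sample rows are hypothetical.

```python
from itertools import chain

def reduce_join(tagged_values, buffered_tag):
    """Join the values for ONE join key.

    tagged_values: iterable of (tag, value) pairs in which all values
    tagged `buffered_tag` arrive before the values of the other table.
    """
    buffer = []  # buffering: these rows are held in memory
    for tag, value in tagged_values:
        if tag == buffered_tag:
            buffer.append(value)
        else:
            # streaming: the row is joined and emitted as it is read,
            # and is never stored
            for buffered in buffer:
                yield (buffered, value)

def big_table_rows(n):
    """Lazily produce the rows of the larger table, one at a time."""
    for i in range(n):
        yield ("big", f"big-{i}")

small_rows = [("small", "s0"), ("small", "s1")]
joined = list(reduce_join(chain(small_rows, big_table_rows(5)), "small"))
print(len(joined))
```

Memory use is bounded by the buffered table (two rows here), however many rows the streamed table has, which is why placing the largest table last reduces reducer memory.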

Resources