pyspark stream does not read delta lake change data feed in an order - spark-streaming

df = spark.readStream.option("readChangeFeed", "true")\
.option("startingVersion", 2)\
.load(hubble_account_tablePath)
display(df)
This returns the change data feed unordered. Any suggestions for getting the change data feed in ascending order as a continuous stream?

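I think you can order by the commit version; in the change data feed that column is named _commit_version. Note that Structured Streaming only supports sorting a streaming DataFrame after an aggregation in complete output mode, so on a plain stream the sort has to be applied to each micro-batch, for example inside foreachBatch:
def show_ordered(batch_df, batch_id):
    # Sort the rows of one micro-batch by commit version before showing them.
    batch_df.orderBy("_commit_version").show()

df = spark.readStream.option("readChangeFeed", "true")\
.option("startingVersion", 2)\
.load(hubble_account_tablePath)
df.writeStream.foreachBatch(show_ordered).start()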

Related

Multiple consecutive join operations on PySpark

I am running a PySpark application that compares two large datasets of 3 GB each. There are some differences between the datasets, which we filter out via an outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key,how='outer').where(condition).select(primary_key))
mismatch_ids_row.count()
The output of the join (per the count) is small, say 10 records. The number of shuffle partitions at this point is about 30, calculated as data size / partition size (100 MB).
After the first join, each of the two original datasets is joined with the resulting dataset to filter out the mismatched rows for each dataframe.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
Here we drop duplicates because the outer join in the first step can produce doubled rows in which some values are null.
These two dataframes are then joined to do a column-level comparison and find exactly where the data is mismatched.
df = (df_1.join(df_2,on=some condition, how="full_outer"))
df.count()
The resulting dataset is then displayed with:
df.show()
The issue is that the first join, on the larger data, uses a sort-merge join with 30 shuffle partitions, which is fine since the dataset is fairly large.
After the first join there are only 10 mismatched rows, and joining them with the 3 GB datasets is a costly operation; using broadcast didn't help.
The major issue, in my opinion, comes when joining the two small resulting datasets in the second join to produce the result. Here too many shuffle partitions are killing the performance.
The application runs in client mode, as a Spark run for testing purposes, and the parameters are sufficient for it to run on the driver node.
Here is the DAG for the last operation:
As an example:
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType
data1 = [(335008138387,83165192,"yellow","2017-03-03",225,46),
(335008138384,83165189,"yellow","2017-03-03",220,4),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
data2 = [(335008138387,83165192,"yellow","2017-03-03",300,46),
(335008138384,83165189,"yellow","2017-03-03",220,10),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
field = [
StructField("row_num",LongType(),True),
StructField("tripid",IntegerType(),True),
StructField("car_type",StringType(),True),
StructField("dates", StringType(), True),
StructField("pickup_location_id", IntegerType(), True),
StructField("trips", IntegerType(), True)
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1,schema=schema)
sourcetwodf = spark.createDataFrame(data=data2,schema=schema)
They have just two differences; on a larger dataset, think of these as 10 or more differences.
df_1 gets the rows of sourceonedf that match mismatch_ids_row, and df_2 does the same for sourcetwodf. They are then joined to create the resulting dataframe, which outputs the data.
How can we optimize this piece of code so that it uses an optimal number of partitions and performs faster than it does now?
At this point the whole activity takes ~500 seconds, when it could take about 200 seconds less. Also, why does show() take time as well? There are only 10 records, so it should print pretty fast if they are all in one partition, I guess.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
mismatch_ids_row.alias('a')
.join(mismatch_ids_row.alias('b'), on=some condition...)
.select(...)
)
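To illustrate why the outer-join result already carries everything needed for a column-level comparison, here is a plain-Python model of the same logic (no Spark involved) on the first three sample rows from the question. The function name column_mismatches and the choice of tripid as the join key are assumptions made for this sketch, not something the question states.

```python
COLUMNS = ["row_num", "tripid", "car_type", "dates", "pickup_location_id", "trips"]

def column_mismatches(rows1, rows2, key="tripid"):
    """Full-outer-join two row lists on `key` and report the differing columns.

    Returns {key_value: {column: (value_in_rows1, value_in_rows2)}}.
    A row missing on one side contributes None for that side's values.
    """
    k = COLUMNS.index(key)
    left = {row[k]: row for row in rows1}
    right = {row[k]: row for row in rows2}
    result = {}
    for key_value in sorted(left.keys() | right.keys()):
        a = left.get(key_value)
        b = right.get(key_value)
        diffs = {}
        for i, col in enumerate(COLUMNS):
            va = a[i] if a is not None else None
            vb = b[i] if b is not None else None
            if va != vb:
                diffs[col] = (va, vb)
        if diffs:  # keep only keys that actually differ
            result[key_value] = diffs
    return result

# First three rows of data1 and data2 from the question.
data1 = [(335008138387, 83165192, "yellow", "2017-03-03", 225, 46),
         (335008138384, 83165189, "yellow", "2017-03-03", 220, 4),
         (335008138385, 83165193, "yellow", "2017-03-03", 210, 11)]
data2 = [(335008138387, 83165192, "yellow", "2017-03-03", 300, 46),
         (335008138384, 83165189, "yellow", "2017-03-03", 220, 10),
         (335008138385, 83165193, "yellow", "2017-03-03", 210, 11)]

mismatches = column_mismatches(data1, data2)
for key_value, diffs in mismatches.items():
    print(key_value, diffs)
```

Both sides of every mismatched row are available after a single join pass, which is the property the cached self-join above relies on.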

In PySpark, does df.select(column1, column2, ...) impact performance?

For example, I have a dataframe with 10 columns, and later I need to join this dataframe with other dataframes. But only column1 and column2 of the dataframe are used; the others are not needed.
If I do this:
df1 = df.select(['column1', 'column2'])
...
...
result = df1.join(other_df)....
Is this good for performance?
If yes, why is it good, and is there any documentation?
Thanks.
Spark is a distributed, lazily evaluated framework, which means that whether you select all columns or only some of them, they are brought into memory only when an action is applied.
So if you run
df.explain()
at any stage, it will show you the projection of the columns. A column is loaded into memory only if it is required; otherwise it is not selected.
It is still better to specify the required columns: it is a best practice and also makes the logic of your code easier to understand.
To understand more about action and transformation visit here
Especially for a join, the fewer columns you use (and therefore select), the more efficient it will be.
Of course, Spark is lazy and optimized, which means that as long as you don't call an action such as show() or count() in between, the position of the select makes no difference.
So doing:
df = df.select(["a", "b"])
df = df.join(other_df)
df.show()
or joining first and selecting afterwards:
df = df.join(other_df)
df = df.select(["a", "b"])
df.show()
makes no difference, because the optimizer moves the select ahead of the join when the query is compiled for the count() or show() that follows.
On the other hand, and to answer your question:
doing a show() or count() in between will definitely impact performance, and the variant with fewer columns will definitely be faster.
Try comparing:
df = df.select(["a", "b"])
df.count()
df = df.join(other_df)
df.show()
and
df = df.join(other_df)
df.count()
df = df.select(["a", "b"])
df.show()
You will see the difference in time.
The difference might not be huge, but if you're using filters (df = df.filter(df["b"] == "blabla")), it can be really big, especially if you're working with joins.
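One detail worth checking when writing such a filter in PySpark: the condition has to be built from a Column or passed as a SQL expression string, because a comparison between two Python string literals is evaluated by Python itself before Spark ever sees it. A minimal plain-Python illustration (no SparkSession is assumed, and the variable names are only for this sketch):

```python
# Comparing two string literals happens in Python, not in Spark:
# the result is a plain boolean, not a filter condition on column "b".
literal_comparison = ("b" == "blabla")
print(literal_comparison)

# A SQL expression string is one form that DataFrame.filter accepts.
# It is only constructed here, not executed.
sql_condition = "b = 'blabla'"
print(sql_condition)
```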

union not happening with Spark transform

I have a Spark stream with records flowing in. The batch interval is 1 second.
I want to union all the data in the stream. So I created an empty RDD and then, using the transform method, I union each RDD in the stream with this empty RDD.
I expect this empty RDD to contain all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me if my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();
JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
.transform(rdd -> rdd.union(records));
transformedMessages.foreachRDD(record -> {
System.out.println("Aman" +record.count());
StructType schema = DataTypes.createStructType(fields);
Dataset ds = ss.createDataFrame(records, schema);
ds.createOrReplaceTempView("tempTable");
ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we have transformedMessages = messages (ignoring the flatMap function, which is not relevant to the discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema); records is still empty. Nothing in the flow of the program changes that, so it remains empty as an invariant over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
That said, please note that this process iteratively adds to the lineage of the 'records' RDD and also accumulates all data over time. Such a job cannot run stably for a long period because, given enough data, it will eventually grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach seems neither scalable nor sustainable.
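The difference between the two variants can be modelled in plain Python, with lists standing in for RDDs and a list of lists standing in for the micro-batches. This is only a sketch of the reasoning above under those assumptions, not Spark code, and all names are hypothetical.

```python
# Stand-in for the micro-batches that arrive on the stream.
batches = [[1, 2], [3], [4, 5, 6]]

# Variant 1: mirrors transform(rdd -> rdd.union(records)).
# The union produces a NEW collection per batch; `records` is never reassigned.
records = []
transformed = [batch + records for batch in batches]
print(records)       # still empty
print(transformed)   # identical to the incoming batches

# Variant 2: mirrors foreachRDD { rdd => records = rdd.union(records) }.
# The result of each union is assigned back, so the data accumulates.
accumulated = []
for batch in batches:
    accumulated = batch + accumulated
print(accumulated)   # grows with every batch, without bound on a real stream
```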

Apache PIG - Sampling grouped data in foreach using a percentage value

I have country-region data in a Pig relation, which I am trying to sample based on the number of countries in each region. I want to select 10% of the countries from each region. I am trying to use SAMPLE within FOREACH for this, but it looks like SAMPLE is not supported within FOREACH.
COUNTRY_FULL = LOAD 'COUNTRY_REGION' USING org.apache.hive.hcatalog.pig.HCatLoader();
COUNTRIES = FILTER COUNTRY_FULL by partition_dt=='2016-09-04';
COUNTRIES_GROUPED_BY_REGION = GROUP COUNTRIES BY region_id;
SAMPLED_DATA = FOREACH COUNTRIES_GROUPED_BY_REGION {
SAMPLED = SAMPLE COUNTRIES 0.1;
GENERATE FLATTEN(SAMPLED);
};
DUMP SAMPLED_DATA;
Is there a way to achieve this percentage-based sampling within a grouped relation in Pig?
The standard trick here is to perform the desired operation (for example, SAMPLE) before or after your FOREACH.
In this case I would say it should be possible to use SAMPLE somewhere before the FOREACH.
I haven't tried this out, so I'm not sure about the syntactical correctness, but what if we try something along the following lines? We basically sort inside the nested FOREACH on a random number and pick the top 10% of the data:
data = countries, RANDOM() as random;
orderedData = ORDER data BY random;
sampledData = LIMIT orderedData COUNT(data)/10;
GENERATE FLATTEN(sampledData);
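The "order each group by a random number and keep the top 10%" idea can be sketched in plain Python to show what the result should look like. This is a model of the logic only, not Pig; the function name sample_per_group, the fixed seed, and the generated country data are assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_per_group(rows, key, percent=10, seed=0):
    """Return about `percent` percent of the rows of every group, chosen at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sampled = []
    for group_rows in groups.values():
        # Order the group by a random number, then keep the leading rows.
        shuffled = sorted(group_rows, key=lambda _: rng.random())
        keep = len(shuffled) * percent // 100  # truncates: groups under 10 rows yield nothing
        sampled.extend(shuffled[:keep])
    return sampled

# Hypothetical data: 30 countries in region 1, 20 in region 2, 5 in region 3.
countries = ([(f"c{i}", 1) for i in range(30)]
             + [(f"d{i}", 2) for i in range(20)]
             + [(f"e{i}", 3) for i in range(5)])

sample = sample_per_group(countries, key=lambda row: row[1])
print(len(sample))
```

Because the row count is divided with truncation, as in COUNT(data)/10, a region with fewer than 10 countries contributes no rows at all, which may or may not be what is wanted.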

Spark dataframe requery when converted to rdd

I have a dataframe queried as
val df1 = sqlContext.sql("select * from table1 limit 1")
df1.cache()
df1.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7698141.001,8141-11,GOOD,22.01,number,2015-10-07 11:34:37.492])
However, if I continue
val df2 = df1.rdd
df2.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7685751.001,5751-05,GOOD,0.0,number,2015-10-03 13:19:22.631])
The two results are totally different even though I tried to cache df1. Is there a way to make the result consistent, i.e. so that df2 does not requery the table to get the value? Thank you.
With take(1) you are just taking one arbitrary value out of the RDD. When the command is executed, no order/sorting is specified. As you have a distributed dataset, there is no guarantee that you get the same value every time.
You could sort or filter the RDD, e.g. based on a key (index) or a schema column. Then you should always be able to extract the same value you are looking for.
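As a plain-Python model of that point (lists stand in for the rows a scan returns; the data and the helper take are hypothetical), taking the first row of an unordered result depends on the order the rows happen to arrive in, while taking the first row after sorting does not:

```python
# Two hypothetical scans of the same table that return rows in different orders,
# as can happen when partitions are read in a different sequence.
scan_a = [(10, "2015-10-07"), (7, "2015-10-03"), (3, "2015-10-05")]
scan_b = [(7, "2015-10-03"), (3, "2015-10-05"), (10, "2015-10-07")]

def take(rows, n):
    """Return the first n rows, in whatever order they are given."""
    return rows[:n]

unordered_equal = take(scan_a, 1) == take(scan_b, 1)
ordered_equal = take(sorted(scan_a), 1) == take(sorted(scan_b), 1)
print(unordered_equal, ordered_equal)
```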
