For example, I have a dataframe with 10 columns, and later I need to join this dataframe with other dataframes. Only column1 and column2 are used; the other columns are not needed.
If I do this:
df1 = df.select(['column1', 'column2'])
...
...
result = df1.join(other_df)....
Is this good for performance?
If yes, why is it good, and is there any documentation on it?
Thanks.
Spark is a distributed, lazily evaluated framework, which means that whether you select all columns or only some of them, they are brought into memory only when an action is applied to the dataframe.
So if you run
df.explain()
at any stage, it will show you the projection of the columns. A column is loaded into memory only if it is required; otherwise it is not selected.
It is still better to specify the required columns: it is considered best practice and makes the logic of your code easier to understand.
To understand more about actions and transformations, see the Spark programming guide.
Especially for a join, the fewer columns you use (and therefore select), the more efficient it will be.
Of course, Spark is lazy and optimized, which means that as long as you don't call a triggering function such as show() or count(), the order in which you write these steps won't change anything.
So doing:
df = df.select(["a", "b"])
df = df.join(other_df)
df.show()
or joining first and selecting afterwards:
df = df.join(other_df)
df = df.select(["a", "b"])
df.show()
doesn't change anything, because Spark optimizes the query and applies the select first when the query is compiled for the count() or show() that follows.
On the other hand, and to answer your question:
Doing a show() or count() in between will definitely impact performance, and the version with fewer columns will definitely be faster.
Try comparing:
df = df.select(["a", "b"])
df.count()
df = df.join(other_df)
df.show()
and
df = df.join(other_df)
df.count()
df = df.select(["a", "b"])
df.show()
You will see the difference in time.
The difference might not be huge, but if you're using filters (df = df.filter(df["b"] == "blabla")), it can be really big, especially if you're working with joins.
I am a beginner with SAS and am trying to create a table with the code below. The code has been running for 3 hours now. The dataset is quite large (150000 rows). However, when I use a different date, it runs in 45 minutes. The date I have used is a valid value of date_key. Any suggestions on why this may be happening and what I can do about it? Thanks in advance.
proc sql;
create table xyz as
select monotonic() as rownum ,*
from x.facility_yz
where (Fac_Name = 'xyz' and (Ratingx = 'xyz' or Ratingx is null) )
and Date_key = '20000101'
;
quit;
I tried running it again, but I got the same problem.
Is your dataset coming from an external database? A SAS dataset of this size should not take nearly this long to query - it should be almost instant. If it is external, you may be able to take advantage of indexing. Try to find out what the database is indexed on and use that as a first pass. You may also consider using a data step rather than SQL with the monotonic() function.
For example, assume it is indexed by date:
data xyz1;
set x.facility_xyz;
where date_key = '20000101';
run;
Then you can filter this final dataset within SAS itself. 150,000 rows is nothing for a SAS dataset, assuming there aren't hundreds of variables making it large. A SAS dataset this size should run lightning fast when querying.
data xyz2;
set xyz1;
where fac_name = 'xyz' AND (Ratingx = 'xyz' or Ratingx = ' ');
rownum = _N_;
run;
Or, you could try it all in one pass while still taking advantage of the index:
data xyz;
set x.facility_xyz;
where date_key = '20000101';
if(fac_name = 'xyz' AND (Ratingx = 'xyz' or Ratingx = ' ') );
rownum+1;
run;
You could also try rearranging your where statement to see if you can take advantage of compound indexing:
data xyz;
set x.facility_xyz;
where date_key = '20000101'
AND fac_name = 'xyz'
AND (Ratingx = 'xyz' or Ratingx = ' ')
;
rownum = _N_;
run;
More importantly, only keep variables that are necessary. If you need all of them then that is okay, but consider using the keep= or drop= dataset options to only pull what you need. This is especially important when talking with an external database.
What kind of libname do you use?
If you are running implicit pass-through with SAS functions in the query, that would explain why it takes so long.
If you are using a SAS/ACCESS to xxx module, first add this option to see what is going on: options sastrace=',,,d' sastraceloc=saslog;
You should probably use explicit pass-through, that is, write the query in the RDBMS's native language to avoid automatic translation of your code.
I am running a PySpark application where we are comparing two large datasets of 3GB each. There are some differences in the datasets, which we are filtering via outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition).select(primary_key))
mismatch_ids_row.count()
So the output of the join, according to count(), is small: say 10 records. The number of shuffle partitions at this point is about 30, calculated as amount of data / partition size (100 MB).
After the first join, each of the two original datasets is joined with the resulting dataset to filter out the mismatched rows of each dataframe.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
Here we drop duplicates because the outer join doubles the rows of the first join's result, with some values being null.
These two dataframes are then joined to do a column-level comparison and find exactly where the data is mismatched.
df = (df_1.join(df_2,on=some condition, how="full_outer"))
df.count()
The resulting dataset is then displayed with:
df.show()
The issue is that the first join, which has more data, uses a sort-merge join with 30 shuffle partitions, which is fine since the dataset is fairly large.
After the first join, there are only 10 mismatched rows, and joining them with the 3 GB dataset is a costly operation; using broadcast didn't help.
The major issue, in my opinion, comes in the second join, when the two small resulting datasets are joined to produce the result. Here too many shuffle partitions are killing the performance.
The application runs in client mode, since this is a test run, and the parameters are sufficient for it to run on the driver node.
Here is the DAG for the last operation:
As an example:
data1 = [(335008138387,83165192,"yellow","2017-03-03",225,46),
(335008138384,83165189,"yellow","2017-03-03",220,4),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
data2 = [(335008138387,83165192,"yellow","2017-03-03",300,46),
(335008138384,83165189,"yellow","2017-03-03",220,10),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
field = [
StructField("row_num",LongType(),True),
StructField("tripid",IntegerType(),True),
StructField("car_type",StringType(),True),
StructField("dates", StringType(), True),
StructField("pickup_location_id", IntegerType(), True),
StructField("trips", IntegerType(), True)
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1,schema=schema)
sourcetwodf = spark.createDataFrame(data=data2,schema=schema)
They have just two differences; on a larger dataset, think of these as 10 or more differences.
df_1 will get rows from sourceonedf based on mismatch_ids_row, and df_2 will do the same from sourcetwodf. They are then joined to create another dataframe, which outputs the data.
How can we optimize this piece of code so that it uses an optimal number of partitions and performs faster than it does now?
At this point it takes ~500 seconds to do the whole activity, when it could take about 200 seconds less. Also, why does show() take time as well? There are only 10 records, so I would expect it to print quickly if they are all in one partition.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
mismatch_ids_row.alias('a')
.join(mismatch_ids_row.alias('b'), on=some condition...)
.select(...)
)
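To make the idea concrete without a Spark cluster, here is a minimal plain-Python sketch of the same logic: one full outer join keyed on a single column already carries both sides of every row, so the column-level comparison can be done on that result directly. It uses the first three rows of the sample data from the question and assumes tripid is the primary key (the question does not say which column is); dictionaries stand in for dataframes, so this illustrates the logic only, not Spark's execution.

```python
# First three rows of the question's sample data.
data1 = [
    (335008138387, 83165192, "yellow", "2017-03-03", 225, 46),
    (335008138384, 83165189, "yellow", "2017-03-03", 220, 4),
    (335008138385, 83165193, "yellow", "2017-03-03", 210, 11),
]
data2 = [
    (335008138387, 83165192, "yellow", "2017-03-03", 300, 46),
    (335008138384, 83165189, "yellow", "2017-03-03", 220, 10),
    (335008138385, 83165193, "yellow", "2017-03-03", 210, 11),
]
columns = ["row_num", "tripid", "car_type", "dates", "pickup_location_id", "trips"]
KEY = columns.index("tripid")  # assumption: tripid identifies a row


def index_by_key(rows):
    """Map each key value to its full row."""
    return {row[KEY]: row for row in rows}


left, right = index_by_key(data1), index_by_key(data2)

# Full outer join: visit every key present on either side exactly once.
mismatches = {}
for key in sorted(left.keys() | right.keys()):
    a, b = left.get(key), right.get(key)
    if a == b:
        continue  # identical rows are not mismatches
    if a is None or b is None:
        mismatches[key] = {"missing_in": "source one" if a is None else "source two"}
    else:
        # Both sides are already at hand, so no second join is needed.
        mismatches[key] = {
            name: (x, y) for name, x, y in zip(columns, a, b) if x != y
        }

print(mismatches)
```

In Spark terms, the loop body corresponds to the select(...) on the aliased self-join (or directly on the cached outer join), where each column of side 'a' is compared with the same column of side 'b'.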
I am working on a complex application. From source data, we compute many statistics, e.g.:
val df1 = sourceData.filter($"col1" === "val" and ...)
.select(...)
.groupBy(...)
.min()
val df2 = sourceData.filter($"col2" === "val" and ...)
.select(...)
.groupBy(...)
.count()
As the dataframes are grouped on the same columns, the resulting dataframes are then joined together:
df1.join(df2, Seq("groupCol"), "full_outer")
.join(df3....)
.write.save(...)
(In my code this is done in a loop.)
This is not performant. The problem is that each dataframe (I have about 30) ends with an action, so in my understanding each dataframe is computed and returned to the driver, which then sends the data back to the executors to perform the join.
This gives me a memory error. I can increase the driver memory, but I am looking for a better way of doing it. For example, if all dataframes were computed only at the end (when the joined dataframe is saved), I guess that everything would be managed by the cluster.
Is there a way to do a kind of lazy action? Or should I join the dataframes in another way?
Thanks.
First of all, the code you've shown contains only one action-like operation - DataFrameWriter.save. All other components are lazy.
But laziness doesn't really help you here. The biggest problem (assuming no ugly data skew or misconfigured broadcasting) is that the individual aggregations require separate shuffles and an expensive subsequent merge.
A naive solution would be to leverage that:
the dataframe are grouped on the same columns
to shuffle first:
val groupColumns: Seq[Column] = ???
val sourceDataPartitioned = sourceData.repartition(groupColumns: _*)
and use the result to compute individual aggregates
val df1 = sourceDataPartitioned
...
val df2 = sourceDataPartitioned
...
However, this approach is rather brittle and is unlikely to scale in the presence of large or skewed groups.
Therefore it would be much better to rewrite your code to perform only a single aggregation. Luckily for you, standard SQL behavior is all you need.
Let's start by structuring your code into three-element tuples with:
_1 being a predicate (the condition you use with filter).
_2 being a list of Columns for which you want to compute aggregates.
_3 being an aggregate function.
where an example structure could look like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, min}
val ops: Seq[(Column, Seq[Column], Column => Column)] = Seq(
($"col1" === "a" and $"col2" === "b", Seq($"col3", $"col4"), count),
($"col2" === "b" and $"col3" === "c", Seq($"col4", $"col5"), min)
)
Now you compose aggregate expressions using
agg_function(when(predicate, column))
pattern
import org.apache.spark.sql.functions.when
val exprs: Seq[Column] = ops.flatMap {
case (p, cols, f) => cols.map {
case c => f(when(p, c))
}
}
and use it on the sourceData
sourceData.groupBy(groupColumns: _*).agg(exprs.head, exprs.tail: _*)
Add aliases when necessary.
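The agg_function(when(predicate, column)) pattern works because aggregate functions skip NULLs, which is the standard SQL behavior mentioned above. As a minimal sketch of the same idea outside Spark, the snippet below uses Python's built-in sqlite3 module, with CASE WHEN playing the role of when(). The table name, column names, and rows are made up for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_data (group_col TEXT, col1 TEXT, col2 TEXT, col3 INTEGER)"
)
conn.executemany(
    "INSERT INTO source_data VALUES (?, ?, ?, ?)",
    [
        ("g1", "a", "b", 10),
        ("g1", "a", "x", 20),
        ("g1", "z", "b", 5),
        ("g2", "a", "b", 7),
        ("g2", "z", "z", 1),
    ],
)

# One GROUP BY computes both statistics. A CASE without ELSE yields NULL when
# the predicate is false, and COUNT/MIN ignore NULLs, so each aggregate only
# sees the rows that its own filter would have kept.
rows = conn.execute(
    """
    SELECT group_col,
           COUNT(CASE WHEN col1 = 'a' THEN col3 END) AS count_col3_where_col1_a,
           MIN(CASE WHEN col2 = 'b' THEN col3 END)   AS min_col3_where_col2_b
    FROM source_data
    GROUP BY group_col
    ORDER BY group_col
    """
).fetchall()

print(rows)  # [('g1', 2, 5), ('g2', 1, 7)]
```

Each aggregate here replaces one filter + groupBy + join step of the original approach, so the data is grouped only once.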
I have a dataframe queried as
val df1 = sqlContext.sql("select * from table1 limit 1")
df1.cache()
df1.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7698141.001,8141-11,GOOD,22.01,number,2015-10-07 11:34:37.492])
However, if I continue
val df2 = df1.rdd
df2.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7685751.001,5751-05,GOOD,0.0,number,2015-10-03 13:19:22.631])
The two results are totally different even though I tried to cache df1. Is there a way to make the result consistent, i.e. so that df2 does not requery the table to get the value? Thank you.
With take(1) you are just taking one arbitrary row from the RDD. When the command is executed, no order or sorting is specified. As you have a distributed dataset, there is no guarantee that you get the same row every time.
You could sort or filter the RDD, e.g. based on a key (index) or a schema column. Then you should always be able to extract the same value you are looking for.
Hi, can anyone help me with the logic of this query?
SELECT C.CPPID, c.CPP_AMT_MANUAL
FROM CPP_PRCNT CC,CPP_VIEW c
WHERE
CC.CPPYR IN (
SELECT C.YEAR FROM CPP_VIEW_VIEW C WHERE UPPER(C.CPPNO) = UPPER('123')
AND C.CPP_CODE ='CPP000000000053'
and TO_CHAR(c.CPP_DATE,'YYYY/Mon')='2012/Nov'
)
AND UPPER(C.CPPNO) = UPPER('123')
AND C.CPP_CODE ='CPP000000000053'
and TO_CHAR(c.CPP_DATE,'YYYY/Mon') = '2012/Nov';
Please correct me if I have structured the query wrongly in terms of performance and standards. Thanks in advance.
If you have indexes or partitioned tables, I would not apply functions to columns but to variables, so that the indexes or partition pruning can be used.
Also, I use ANSI-92 SQL syntax. You don't specify (at least not directly) a join condition between cpp_prcnt and cpp_view, so it is actually a Cartesian product (cross join):
SELECT C.CPPID, c.CPP_AMT_MANUAL
FROM CPP_PRCNT CC
CROSS JOIN CPP_VIEW c
WHERE
CC.CPPYR IN (
SELECT C.YEAR
FROM CPP_VIEW_VIEW C
WHERE C.CPPNO = '123'
AND C.CPP_CODE ='CPP000000000053'
AND trunc(c.CPP_DATE,'MM')=to_date('2012/Nov','YYYY/Mon')
)
AND C.CPPNO = '123'
AND C.CPP_CODE ='CPP000000000053'
AND trunc(c.CPP_DATE,'MM')=to_date('2012/Nov','YYYY/Mon')
If you show us the definition of cpp_view_view (it seems to be a view over cpp_view), the definition of CPP_VIEW (if simple), and what you're trying to achieve, I bet there are more things to be improved or fixed.
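To see why the missing join condition matters, here is a small sketch using Python's built-in sqlite3 module as a stand-in for the real database. The table names come from the query, but the schemas are reduced to one or two columns and the rows are invented, so treat it as an illustration of the Cartesian product rather than a reproduction of the actual data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpp_prcnt (cppyr INTEGER)")
conn.execute("CREATE TABLE cpp_view (cppid INTEGER, cppno TEXT)")
conn.executemany("INSERT INTO cpp_prcnt VALUES (?)", [(2011,), (2012,), (2012,)])
conn.executemany("INSERT INTO cpp_view VALUES (?, ?)", [(1, "123"), (2, "123")])

# Comma-separated tables with no condition linking cc and c: every matching
# cpp_prcnt row is paired with every matching cpp_view row.
paired = conn.execute(
    """
    SELECT c.cppid
    FROM cpp_prcnt cc, cpp_view c
    WHERE cc.cppyr IN (2012)
      AND c.cppno = '123'
    """
).fetchall()

# The same filter on cpp_view alone, for comparison.
single = conn.execute("SELECT cppid FROM cpp_view WHERE cppno = '123'").fetchall()

print(len(single), len(paired))  # 2 4
```

Two cpp_view rows match, but the query returns four rows because each one is repeated once per matching cpp_prcnt row. If cpp_prcnt is only meant to act as a filter, an EXISTS or a real join condition would avoid the duplication.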
There are a couple of things you could improve:
if possible, get rid of the UPPER() in the comparison - applying it to the column renders any index on CPPNO useless. If that's not possible, consider a function-based index on UPPER(CPPNO)
do not convert your DATE column to a string to compare it with a string - do it the other way round (i.e. convert your string to a date => only one conversion is needed instead of one per table row, and indices can be used)
play around with EXISTS instead of IN, as suggested by Dileep - it might be faster
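The first point can be checked by looking at the query plan. The sketch below uses Python's built-in sqlite3 module rather than Oracle, so the plan text and the optimizer differ from what you would see with the real tables; the two-column schema and the index names are assumptions made for the example. The principle it shows is the same: wrapping the indexed column in a function hides it from an ordinary index, while an index on the expression itself can be used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpp_view (cppid INTEGER, cppno TEXT)")
conn.execute("CREATE INDEX idx_cppno ON cpp_view (cppno)")
conn.executemany(
    "INSERT INTO cpp_view VALUES (?, ?)", [(i, str(i)) for i in range(1000)]
)


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


wrapped_sql = "SELECT cppid FROM cpp_view WHERE UPPER(cppno) = '123'"
plain_sql = "SELECT cppid FROM cpp_view WHERE cppno = '123'"

plan_wrapped = plan(wrapped_sql)  # function on the column: full table scan
plan_plain = plan(plain_sql)      # bare column: index lookup

# SQLite's counterpart of a function-based index: index the expression itself.
conn.execute("CREATE INDEX idx_upper_cppno ON cpp_view (UPPER(cppno))")
plan_wrapped_indexed = plan(wrapped_sql)

print(plan_wrapped)
print(plan_plain)
print(plan_wrapped_indexed)
```

The exact wording of the plans depends on the SQLite version, so the thing to look for is whether an index name appears in each line.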