Spark SQL performance: version 1.6 vs version 1.5

I have tried to compare the performance of Spark SQL version 1.6 and version 1.5. In a simple case, Spark 1.6 is quite a bit faster than Spark 1.5. However, on a more complex query - in my case an aggregation query with grouping sets - Spark SQL version 1.6 is very much slower than Spark SQL version 1.5. Has anybody noticed the same issue? And, even better, does anyone have a solution for this kind of query?
Here is my code:
case class Toto(
  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: String = f"${(math.random*1e6).toLong}%06.0f",
  c: String = f"${(math.random*1e6).toLong}%06.0f",
  n: Int = (math.random*1e3).toInt,
  m: Double = math.random*1e3)
val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
df.registerTempTable( "toto" )
val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlText = s"$sqlSelect $sqlGroupBy"
val rs1 = sqlContext.sql( sqlText )
rs1.saveAsParquetFile( "rs1" )
Here are two screenshots, from Spark 1.5.2 and Spark 1.6.0, both run with --driver-memory=1G. The DAG for the Spark 1.6.0 run can be viewed at DAG.

Thanks to Herman van Hövell for his reply on the Spark dev list. To share it with other members, I reproduce his response here:
1.6 plans single distinct aggregates like multiple distinct aggregates; this inherently causes some overhead but is more stable in case of high cardinalities. You can revert to the old behavior by setting the spark.sql.specializeSingleDistinctAggPlanning option to false. See also: https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462
Actually, in order to revert to the old behavior, the setting's value should be "true".
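For reference, a minimal sketch of applying that setting before re-running the query, using setConf on the existing sqlContext (the property name is the one from the linked branch-1.6 source):
// Revert to the 1.5-style planning of single distinct aggregates (Spark 1.6).
// Note: as observed above, reverting requires the value "true".
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")
val rs2 = sqlContext.sql(sqlText)
rs2.saveAsParquetFile("rs2")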

Related

Optimizing Apache Spark SQL Queries

I am facing very long latencies on Apache Spark when running some SQL queries. To simplify the queries, I run my calculations sequentially: the output of each query is stored as a temporary table (.registerTempTable('TEMP')) so it can be used in the following SQL query, and so on... But the queries take too much time, while the equivalent 'pure Python' code takes just a few minutes.
sqlContext.sql("""
SELECT PFMT.* ,
DICO_SITES.CodeAPI
FROM PFMT
INNER JOIN DICO_SITES
ON PFMT.assembly_department = DICO_SITES.CodeProg """).registerTempTable("PFMT_API_CODE")
sqlContext.sql("""
SELECT GAMMA.*,
(GAMMA.VOLUME*GAMMA.PRORATA)/100 AS VOLUME_PER_SUPPLIER
FROM
(SELECT PFMT_API_CODE.* ,
SUPPLIERS_PROP.CODE_SITE_FOURNISSEUR,
SUPPLIERS_PROP.PRORATA
FROM PFMT_API_CODE
INNER JOIN SUPPLIERS_PROP ON PFMT_API_CODE.reference = SUPPLIERS_PROP.PIE_NUMERO
AND PFMT_API_CODE.project_code = SUPPLIERS_PROP.FAM_CODE
AND PFMT_API_CODE.CodeAPI = SUPPLIERS_PROP.SITE_UTILISATION_FINAL) GAMMA """).registerTempTable("TEMP_ONE")
sqlContext.sql("""
SELECT TEMP_ONE.* ,
ADCP_DATA.* ,
CASE
WHEN ADCP_DATA.WEEK <= weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_ST + ADCP_DATA.ADD_CAPACITY_ST
WHEN ADCP_DATA.WEEK > weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_LT + ADCP_DATA.ADD_CAPACITY_LT
END AS CAPACITY_REF
FROM TEMP_ONE
INNER JOIN ADCP_DATA
ON TEMP_ONE.reference = ADCP_DATA.PART_NUMBER
AND TEMP_ONE.CodeAPI = ADCP_DATA.API_CODE
AND TEMP_ONE.project_code = ADCP_DATA.PROJECT_CODE
AND TEMP_ONE.CODE_SITE_FOURNISSEUR = ADCP_DATA.SUPPLIER_SITE_CODE
AND TEMP_ONE.WEEK_NUM = ADCP_DATA.WEEK_NUM
""" ).registerTempTable('TEMP_BIS')
sqlContext.sql("""
SELECT TEMP_BIS.CSF_ID,
TEMP_BIS.CF_ID ,
TEMP_BIS.CAPACITY_REF,
TEMP_BIS.VOLUME_PER_SUPPLIER,
CASE
WHEN TEMP_BIS.CAPACITY_REF >= VOLUME_PER_SUPPLIER THEN 'CAPACITY_OK'
WHEN TEMP_BIS.CAPACITY_REF < VOLUME_PER_SUPPLIER THEN 'CAPACITY_NOK'
END AS CAPACITY_CHECK
FROM TEMP_BIS
""").take(100)
Could anyone highlight the best practices (if there are any) for writing PySpark SQL queries on Spark?
Does it make sense that the script runs much faster locally on my computer than on the Hadoop cluster?
Thanks in advance
You should cache your intermediate results. What is the data source? Can you retrieve only the relevant data from it, or only the relevant columns? There are many options; you should provide more info about your data.
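As a minimal sketch of the caching suggestion (shown in Scala; the PySpark calls cache() and registerTempTable() are the same), using the first query from the question:
// Cache each intermediate result before registering it as a temp table,
// so the following queries reuse it instead of recomputing the whole lineage.
val pfmtApiCode = sqlContext.sql(
  """SELECT PFMT.*, DICO_SITES.CodeAPI
     FROM PFMT
     INNER JOIN DICO_SITES
     ON PFMT.assembly_department = DICO_SITES.CodeProg""")
pfmtApiCode.cache() // materialized on first action
pfmtApiCode.registerTempTable("PFMT_API_CODE")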

Issue with WITH clause with Cloudera JDBC Driver for Impala - Returning column name instead of actual Data

I am using the Cloudera JDBC Driver for Impala v2.5.38 with Spark 1.6.0 to create DataFrames. It works fine for all queries except those with a WITH clause, but WITH is used extensively in my organization.
Below is my code snippet.
def jdbcHDFS(url: String, sql: String): DataFrame = {
  val jdbcURL = s"jdbc:impala://$url"
  val connectionProperties = new java.util.Properties
  connectionProperties.setProperty("driver", "com.cloudera.impala.jdbc41.Driver")
  sqlContext.read.jdbc(jdbcURL, s"($sql) AS ST", connectionProperties)
}
Given below are examples of working and non-working SQL:
val workingSQL = "select empname from (select * from employee) as tmp"
val nonWorkingSQL = "WITH tmp as (select * from employee) select empname from tmp"
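For reference, the rddDF values below came from calls like the following (host, port, and database are hypothetical placeholders):
// Hypothetical Impala endpoint; substitute your own host and port.
val rddDF = jdbcHDFS("impala-host.example.com:21050/default", workingSQL) // or nonWorkingSQL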
Below is the output of rddDF.first for the above SQLs.
For workingSQL
scala> rddDF.first
res8: org.apache.spark.sql.Row = [Kushal]
For nonWorkingSQL
scala> rddDF.first
res8: org.apache.spark.sql.Row = [empname] // Here we expect actual data, i.e. 'Kushal', instead of the column name as in the output of the previous query.
It would be really helpful if anyone could suggest a solution for this.
Please note: both queries work fine in impala-shell as well as in Hive through Hue.
Update:
I tried setting up a plain JDBC connection and executing the nonWorkingSQL, and it worked!
I then thought the issue was that Spark wraps a "SELECT * FROM ( )" around the query, so I tried the SQL below to find the root cause, but it also worked and displayed the expected result.
String sql = "SELECT * FROM (WITH tmp as (select * from employee) select empname from tmp) AS ST"
Hence, the root cause is not clear and needs to be analysed so that it works with Spark as well. Please suggest further.
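For completeness, a sketch of the plain-JDBC test described in this update (the host and port are hypothetical; the driver class is the one from the snippet above):
import java.sql.DriverManager

// Plain JDBC, bypassing Spark; here the WITH query returns real data.
Class.forName("com.cloudera.impala.jdbc41.Driver")
val conn = DriverManager.getConnection("jdbc:impala://impala-host.example.com:21050")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("WITH tmp AS (SELECT * FROM employee) SELECT empname FROM tmp")
while (rs.next()) println(rs.getString("empname")) // prints 'Kushal', not 'empname'
rs.close(); stmt.close(); conn.close()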

Why is Spark DataFrame repartition not working correctly?

Spark 1.6.2 HDP 2.5.2
I am using Spark SQL to fetch data from a Hive table and then repartitioning on a particular column, "serial", with 100 partitions, but Spark does not repartition the data into 100 partitions (visible as the number of tasks in the Spark UI); instead there are 126 tasks.
val data = sqlContext.sql("""select * from default.tbl_orc_zlib""")
val filteredData = data.filter( data("day").isNotNull ) // NULL check
//Repartition on serial column with 100 partitions
val repartData = filteredData.repartition(100,filteredData("serial"))
val repartSortData = repartData.sortWithinPartitions("serial","linenr")
val mappedData = repartSortData.map(s => s.mkString("\t"))
val res = mappedData.pipe("xyz.dll")
res.saveAsTextFile("hdfs:///../../../")
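One quick way to confirm the actual partition count, independent of the task counts shown in the UI, is to inspect the underlying RDD (a diagnostic, not a fix):
// Number of partitions produced by the repartition; expected to be 100.
println(repartData.rdd.partitions.length)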
But if I use a coalesce first and then repartition, the number of tasks becomes 150 (the correct 50 from the coalesce plus 100 from the repartition):
val coalescedData = filteredData.coalesce(50) // works fine: 50 tasks
val repartCoalesced = coalescedData.repartition(100, coalescedData("serial")) // 100 tasks
Can someone please explain to me why this is happening?

Iteratively running queries on Apache Spark

I've been trying to execute 10,000 queries over a relatively large dataset of 11M records. More specifically, I am trying to transform an RDD using filter, based on some predicate, and then compute how many records conform to that filter by applying the count action.
I am running Apache Spark on my local machine, which has 16GB of memory and an 8-core CPU. I have set --driver-memory to 10G in order to cache the RDD in memory.
However, because I have to redo this operation 10,000 times, it takes unusually long to finish. I am also attaching my code, hoping it will make things clearer.
Loading the queries and the DataFrame I am going to query against:
//load normalized dimensions
val df = spark.read.parquet("/normalized.parquet").cache()
//load query ranges
val rdd = spark.sparkContext.textFile("part-00000")
Parallelizing the execution of queries
Here, my queries are collected in a list and executed in parallel using par. I then collect the parameters each query needs in order to filter the Dataset. The isWithin function tests whether the Vector contained in my dataset is within the bounds given by the query.
After filtering my dataset, I execute count to get the number of records in the filtered dataset, and then create a string reporting how many there were.
val results = queries.par.map(q => {
  val volume = q(q.length - 1)
  val dimensions = q.slice(0, q.length - 1)
  val count = df.filter(row => {
    val v = row.getAs[DenseVector]("scaledOpen")
    isWithin(volume, v, dimensions)
  }).count
  q.mkString(",") + "," + count
})
Now, what I have in mind is that this task is generally really hard given the large dataset I have, and that trying to run such a thing on a single machine makes it harder. I know this could be much faster on something running on top of Spark or by utilizing an index. However, I am wondering if there is a way to make it faster as it is.
Just because you parallelize access to a local collection doesn't mean anything is executed in parallel. The number of jobs that can be executed concurrently is limited by cluster resources, not by driver code.
At the same time, Spark is designed for high-latency batch jobs. If the number of jobs goes into the tens of thousands, you just cannot expect things to be fast.
One thing you can try is to push the filters down into a single job. First, convert the DataFrame to an RDD:
import org.apache.spark.mllib.linalg.{Vector => MLlibVector}
import org.apache.spark.rdd.RDD
val vectors: RDD[org.apache.spark.mllib.linalg.DenseVector] = df.rdd.map(
  _.getAs[MLlibVector]("scaledOpen").toDense
)
map vectors to {0, 1} indicators:
import breeze.linalg.DenseVector

// It is not clear what the type of queries is
type Q = ???
val queries: Seq[Q] = ???

val inds: RDD[breeze.linalg.DenseVector[Long]] = vectors.map(v => {
  // Create a {0, 1} indicator vector
  DenseVector(queries.map(q => {
    // Define as before
    val volume = ???
    val dimensions = ???
    // Output 0 or 1 for each q
    if (isWithin(volume, v, dimensions)) 1L else 0L
  }): _*)
})
aggregate partial results:
val counts: breeze.linalg.DenseVector[Long] = inds
  .aggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)
and prepare final output:
queries.zip(counts.toArray).map {
  case (q, c) => s"""${q.mkString(",")},$c"""
}
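As a purely illustrative instantiation of the placeholders above, assuming (as q(q.length-1) in the question's code suggests) that each query is an Array[Double] whose last element is the volume; rdd, vectors, and isWithin are as defined earlier:
import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// Hypothetical concrete query type: comma-separated doubles, volume last.
type Q = Array[Double]
val queries: Seq[Q] = rdd.map(_.split(",").map(_.toDouble)).collect().toSeq

val inds: RDD[DenseVector[Long]] = vectors.map { v =>
  DenseVector(queries.map { q =>
    val volume = q(q.length - 1)
    val dimensions = q.slice(0, q.length - 1)
    if (isWithin(volume, v, dimensions)) 1L else 0L
  }: _*)
}

// One pass over the data answers all 10,000 queries at once.
val counts: DenseVector[Long] =
  inds.aggregate(DenseVector.zeros[Long](queries.size))(_ += _, _ += _)

val results: Seq[String] = queries.zip(counts.toArray).map {
  case (q, c) => s"${q.mkString(",")},$c"
}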

How to increase sort_area_size

How do I set sort_area_size in Oracle 10g, and what size should it be, given that I have more than 2.2M rows in a single table? Please also tell me the suggested size for SORT_AREA_RETAINED_SIZE.
My queries are very slow; most of them take more than 1 hour to complete.
Please suggest ways I can optimize my queries and tune my Oracle 10g database.
Thanks
Updated with the query.
The query is:
SELECT A.TITLE,C.TOWN_VILL U_R,F.CODE TOWN_CODE,F.CITY_TOWN_MAKE,A.FRM,A.PRD_CODE,A.BR_CODE,A.SIZE_CODE ,B.PRICES,
A.PROJECT_YY,A.PROJECT_MM,d.province ,D.BR_CODE BRANCH_CODE,D.STRATUM,L.LSM_GRP LSM,
SUM(GET_FRAC_FACTOR_ALL_PR_NEW(A.FRM,A.PRD_CODE,A.BR_CODE,A.SIZE_CODE,A.PROJECT_YY,A.PROJECT_MM,A.FRAC_CODE ,B.PRICES,A.QTY_USED,A.VERIF_CODE, A.PACKING_CODE, J.TYPE ,'R') )
* MAX(D.UNIVERSE) / MAX(E.SAMPLE) /1000000 MARKET , D.UNIVERSE ,E.SAMPLE
FROM A2_FOR_CPMARKETS A,
BRAND J,
PRICES B,CP_SAMPLE_ALL_MONTHS C ,
CP_LSM L,
HOUSEHOLD_GL D,
SAMPLE_CP_ALL_MONTHS E ,
City_Town_ALL F
WHERE A.PRD_CODE = B.PRD_CODE
AND A.BR_CODE = B.BR_CODE
AND DECODE(A.SIZE_CODE,NULL,'L',A.SIZE_CODE) = B.SIZE_CODE -- for unbranded loose
AND DECODE(B.VAR_CODE,'X','X',A.VAR_CODE) = B.VAR_CODE
AND DECODE(B.COL_CODE,'X','X',A.COL_CODE) = B.COL_CODE
AND DECODE(B.PACK_CODE,'X','X',A.PACKING_CODE) = B.PACK_CODE
AND A.project_yy||A.project_MM BETWEEN B.START_DATE AND B.END_DATE
AND A.PRD_CODE=J.PRD_CODE
AND A.BR_CODE=J.BR_CODE
AND A.FRM = C.FRM
AND A.PROJECT_YY=L.YEAR
AND A.frm=L.FORM_NO
AND C.TOWN_VILL= D.U_R
AND C.CLASS = D.CLASS
AND D.TOWN=F.GRP
AND D.TOWN = E.TOWN_CODE
AND A.PROJECT_YY = E.PROJECT_YY
AND A.PROJECT_MM = E.PROJECT_MM
AND A.PROJECT_YY = C.PROJECT_YY
AND A.PROJECT_MM = C.PROJECT_MM
-- FOR HOUSEHOLD_GL
AND A.PROJECT_YY = D.YEAR
AND A.PROJECT_MM = D.MONTH
-- END HOUSEHOLD_GL
AND C.TOWN_VILL = E.TOWN_VILL
AND C.CLASS = E.CLASS
AND C.TOWN_VILL = F.TOWN_VILL
AND C.TOWN_CODE=F.CODE
AND (DECODE(e.PROJECT_YY,'1997','1','1998','1','1999','1','2000','1','2001','1','2002','1','2') = F.TYP )
GROUP BY A.TITLE,C.TOWN_VILL,F.CODE ,F.CITY_TOWN_MAKE,A.FRM,A.PRD_CODE,A.BR_CODE,A.SIZE_CODE ,B.PRICES,
A.PROJECT_YY,A.PROJECT_MM,d.province,D.BR_CODE ,D.STRATUM,L.LSM_GRP ,
UNIVERSE ,E.SAMPLE
(Broken image link to an explain plan screenshot on a local path: http://C:\Documents and Settings\Hussain\My Documents\My Pictures\explain plan.jpg)
Check the Oracle documentation for SORT_AREA_SIZE. You can use the alter session set sort_area_size=10000 command to modify it for the session, and alter system to modify it system-wide. SORT_AREA_RETAINED_SIZE works the same way.
Is your entire table (with 2.2M rows) fetched into the result set? Is there some sort operation in it?
There could be other reasons for the query performing badly. Can you share the query and its explain plan?
When you generate an execution plan for the query using the DBMS_Xplan.Display method, Oracle will estimate (usually pretty reasonably) what size of temporary tablespace storage you would need to execute it.
2.2 million rows may be irrelevant to the sort size, by the way. The memory required for aggregate operations such as MAX and SUM relates more to the size of the result set than to the size of the source data.
Providing a link to a jpg file stored on your PC does not count as having provided an execution plan, by the way.
A.project_yy||A.project_MM BETWEEN B.START_DATE AND B.END_DATE
You know we have DATE datatypes in databases, right? Using incorrect datatypes makes it harder for Oracle to determine data distributions, predicate selectivity, and appropriate query plans.
