I am new to Spark and need some help debugging very slow performance.
I am running the transformations below, and they have been running for more than 2 hours.
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext( sc )
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2b33f7a0
scala> val t1_df = hiveContext.sql("select * from T1" )
scala> t1_df.registerTempTable( "T1" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> t1_df.count
17/06/07 07:26:51 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res3: Long = 1732831
scala> val t1_df1 = t1_df.dropDuplicates( Array("c1","c2","c3", "c4" ))
scala> t1_df1.registerTempTable( "ABC" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).count
[Stage 4:====================================================> (89 + 8) / 97]
I am using Spark 2.1.0 and reading data from Hive 2.1.1 on a 7-node cluster of Amazon VMs, each with 250 GB of RAM and 64 virtual cores. With these resources, I expected this simple query on 1.7 million records to fly, but it is painfully slow.
Any pointers would be of great help.
Adding explain plan:
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).explain
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, (isnull((c1#26 = c1#26#1398)) || (c1#26 = c1#26#1398))
:- FileScan parquet default.t1_pq[cols
more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<hdr_msg_src:string,hdr_recv_tsmp:timestamp,hdr_desk_id:string,execprc:string,dreg:string,c...
+- BroadcastExchange IdentityBroadcastMode
   +- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
      +- Exchange hashpartitioning(c1#26, c2#59, c3#60L, c4#82, 200)
         +- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
            +- *FileScan parquet default.atn_load_pq[c1#26,c2#59,c3#60L,c4#82] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:string,c2:string,c3:bigint,c4:string>

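Although I think the count in your query will always be 0 (every c1 in T1 also appears in the deduplicated table), you can try a left-anti join instead. Also, don't forget to cache t1_df to avoid recomputing it multiple times:
val t1_df = hiveContext.sql("select * from T1").cache
val deduped = t1_df.dropDuplicates(Array("c1", "c2", "c3", "c4"))
// Left-anti join: keep only the rows of t1_df whose c1 has no match in deduped
t1_df.join(deduped, Seq("c1"), "leftanti").count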
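To see why the count is expected to be 0, and how NOT IN can differ from an anti join when c1 contains NULLs, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Spark SQL. The table contents are made up for illustration, and the sketch only demonstrates the standard SQL NULL semantics; it says nothing about Spark's execution plan or performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (c1 TEXT, c2 TEXT)")
# Made-up rows: one exact duplicate and one row with a NULL c1
conn.executemany(
    "INSERT INTO T1 VALUES (?, ?)",
    [("a", "x"), ("a", "x"), ("b", "y"), (None, "z")],
)
# ABC plays the role of the deduplicated table
conn.execute("CREATE TABLE ABC AS SELECT DISTINCT c1, c2 FROM T1")

# The query from the question: NOT IN over the deduplicated table
not_in_count = conn.execute(
    "SELECT COUNT(*) FROM T1 WHERE c1 NOT IN (SELECT c1 FROM ABC)"
).fetchone()[0]

# NOT EXISTS, the SQL form of an anti join on c1
anti_count = conn.execute(
    "SELECT COUNT(*) FROM T1 t "
    "WHERE NOT EXISTS (SELECT 1 FROM ABC a WHERE a.c1 = t.c1)"
).fetchone()[0]

print(not_in_count, anti_count)
```

With these rows the NOT IN count is 0, while the NOT EXISTS count is 1, because the row with a NULL c1 never matches anything. An anti join therefore returns the same result as NOT IN only when c1 has no NULLs.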


Apache airflow with conditional statements

I am new to Airflow. I want to do an operation like below using airflow operators.
Briefly I want to read some data from a database table and according to the values of a column in that table I want to do different tasks.
This is the table which I used to get data.
| task_name | status |
| a | 1 |
| b | 2 |
| c | 4 |
| d | 3 |
| e | 4 |
From the above table I want to select the rows where status=4 and according to their task name run the relevant jar file (for running jar files I am planning to use Bash Operator). I want to execute this task using Airflow. Note that I am using PostgreSQL.
This is the code which I have implemented so far.
from airflow.models import DAG
from airflow.operators.postgres_operator import PostgresOperator
from datetime import datetime, timedelta
from airflow import settings
# set the default attributes
default_args = {
    'owner': 'Airflow',
    'start_date': datetime(2020, 10, 4)
}
status_four_dag = DAG(
    dag_id='status_check',
    default_args=default_args,
    schedule_interval=timedelta(seconds=5)
)
# The operator's remaining arguments are not shown in the post; task_id here is a placeholder
check_status = PostgresOperator(
    task_id='check_status',
    sql='''select * from table1 where status=4;''',
    dag=status_four_dag
)
I am stuck at the point where I need to check the task_name and call the relevant BashOperator.
Your support is appreciated. Thank you.
XComs are used for passing messages between tasks. Push the JAR filename and any other arguments needed to build the command to XCom, then consume them in the downstream task.
For example,
check_status >> handle_status
check_status - checks the status in the DB and writes the JAR filename and arguments to XCom
handle_status - pulls the JAR filename and arguments from XCom, builds the command and executes it
Sample code:
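from random import randint
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
def check_status(**kwargs):
    # Random stand-in for the DB check: push the chosen JAR filename to XCom
    if randint(1, 100) % 2 == 0:
        kwargs["ti"].xcom_push("jar_filename", "even.jar")
    else:
        kwargs["ti"].xcom_push("jar_filename", "odd.jar")
with DAG(dag_id='new_example', default_args=default_args) as dag:
    # provide_context=True is needed on Airflow 1.x so that "ti" is passed in kwargs
    t0 = PythonOperator(task_id='check_status', python_callable=check_status, provide_context=True)
    t1 = BashOperator(
        task_id='handle_status',
        bash_command="""
        jar_filename={{ ti.xcom_pull(task_ids='check_status', key='jar_filename') }}
        echo "java -jar ${jar_filename}"
        """)
    t0 >> t1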
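In the real DAG, check_status would replace the random choice with the result of the status=4 query. As a rough sketch of that selection step in plain Python (no Airflow or database required), the function below turns rows from the question's table into commands. The `<task_name>.jar` naming convention and the `build_commands` helper are assumptions made for illustration, not something stated in the question.

```python
# Rows as (task_name, status), copied from the table in the question
ROWS = [("a", 1), ("b", 2), ("c", 4), ("d", 3), ("e", 4)]


def build_commands(rows, wanted_status=4):
    """Return one 'java -jar' command per row that has the wanted status."""
    commands = []
    for task_name, status in rows:
        if status == wanted_status:
            # Hypothetical naming convention: <task_name>.jar
            commands.append(f"java -jar {task_name}.jar")
    return commands


commands = build_commands(ROWS)
print(commands)
```

The resulting list (or the individual JAR filenames) is what check_status could push to XCom for the downstream BashOperator to pull.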

How to convert Iterable[String, String, String] to DataFrame?

I have a dataset of (String, String, String) records which is about 6 GB. After parsing the dataset, I did a groupBy using (element => element._2) and got an RDD[(String, Iterable[(String, String, String)])]. Then, for each element of the grouped RDD, I call toList in order to convert it to a DataFrame.
val dataFrame = groupbyElement._2.toList.toDF()
But it is taking a huge amount of time to save the data in Parquet format.
Is there a more efficient way to do this?
N.B. I have a five-node cluster. Each node has 28 GB RAM and 4 cores. I am using standalone mode and giving 16 GB RAM to each executor.
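You can try using the DataFrame/Dataset methods instead of the RDD ones. It could look something like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = Seq(
  ("ABC", "123", "a"),
  ("ABC", "321", "b"),
  ("BCA", "123", "c")).toDF("Col1", "Col2", "Col3")
df.show()
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
| ABC| 123|   a|
| ABC| 321|   b|
| BCA| 123|   c|
+----+----+----+
// Group by Col2 and collect the other two columns into lists
val df2 = df
  .groupBy($"Col2")
  .agg(
    collect_list($"Col1") as "Col1_list",
    collect_list($"Col3") as "Col3_list")
df2.show()
+----+----------+---------+
|Col2| Col1_list|Col3_list|
+----+----------+---------+
| 123|[ABC, BCA]|   [a, c]|
| 321|     [ABC]|      [b]|
+----+----------+---------+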
Additionally, instead of reading the data into a RDD you could make use of the methods to get a dataframe directly.
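For readers unfamiliar with collect_list, the grouping in the answer can be mimicked in plain Python. This is only an illustration of what the aggregation computes on the three sample rows; it is not how Spark executes it, and the `group_collect` helper is a made-up name.

```python
from collections import defaultdict

# The three sample rows from the answer, as (Col1, Col2, Col3)
rows = [("ABC", "123", "a"), ("ABC", "321", "b"), ("BCA", "123", "c")]


def group_collect(rows):
    """Group by Col2 and collect Col1 and Col3 into lists, like collect_list."""
    grouped = defaultdict(lambda: ([], []))
    for col1, col2, col3 in rows:
        col1_list, col3_list = grouped[col2]
        col1_list.append(col1)
        col3_list.append(col3)
    return dict(grouped)


result = group_collect(rows)
print(result)
```

Each key of the result corresponds to one row of the grouped DataFrame, so the whole group ends up in a single record instead of being converted to a DataFrame one group at a time.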

Spark partition pruning doesn't work on 1.6.0

I created partitioned Parquet files on HDFS and created a Hive external table on top of them. When I query the table with a filter on the partitioning column, Spark scans all the partition directories instead of only the matching partition. We are on Spark 1.6.0.
df = hivecontext.createDataFrame([
    ("class1", "Economics", "name1", None),
    ("class2", "Economics", "name2", 92),
    ("class2", "CS", "name2", 92),
    ("class1", "CS", "name1", 92)
], ["class", "subject", "name", "marks"])
Creating the Parquet partitions:
hivecontext.setConf("spark.sql.parquet.compression.codec", "snappy")
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
df1.write.parquet("/transient/testing/students", mode="overwrite", partitionBy='subject')
df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
+------+-----+-----+---------+
| class| name|marks|  subject|
+------+-----+-----+---------+
|class1|name1|    0|Economics|
|class2|name2|   92|Economics|
+------+-----+-----+---------+
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
+- 'Filter ('subject = Economics)
   +- 'UnresolvedRelation `vatmatching_stage`.`students`, None
== Analyzed Logical Plan ==
class: string, name: string, marks: bigint, subject: string
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
   +- Subquery students
      +- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Optimized Logical Plan ==
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
   +- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Physical Plan ==
Scan ParquetRelation: vatmatching_stage.students[class#90,name#91,marks#92L,subject#89] InputPaths: hdfs://dev4/transient/testing/students/subject=Art, hdfs://dev4/transient/testing/students/subject=Civil, hdfs://dev4/transient/testing/students/subject=CS, hdfs://dev4/transient/testing/students/subject=Economics, hdfs://dev4/transient/testing/students/subject=Music
However, if I run the same query in the Hive browser, I can see that Hive does partition pruning:
location hdfs://testing/students/subject=Economics
name vatmatching_stage.students
numFiles 1
numRows -1
partition_columns subject
partition_columns.types string
Is this a limitation in Spark 1.6.0, or am I missing something here?
I found the root cause of this issue. The HiveContext used for querying the table did not have "spark.sql.hive.convertMetastoreParquet" set to "false"; it was set to "true", the default value.
When I set it to "false", I can see that partition pruning is used.
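As a conceptual illustration of what partition pruning means for the input paths listed in the physical plan above, the sketch below filters those paths by the partition directory name. This is not how Spark or Hive implement pruning; the `prune` helper is a hypothetical name, and the paths are copied from the plan in the question.

```python
# Input paths as listed in the physical plan in the question
INPUT_PATHS = [
    "hdfs://dev4/transient/testing/students/subject=Art",
    "hdfs://dev4/transient/testing/students/subject=Civil",
    "hdfs://dev4/transient/testing/students/subject=CS",
    "hdfs://dev4/transient/testing/students/subject=Economics",
    "hdfs://dev4/transient/testing/students/subject=Music",
]


def prune(paths, column, value):
    """Keep only the directories whose last path segment is '<column>=<value>'."""
    wanted = f"{column}={value}"
    return [p for p in paths if p.rsplit("/", 1)[-1] == wanted]


pruned = prune(INPUT_PATHS, "subject", "Economics")
print(pruned)
```

With pruning in effect, only the subject=Economics directory would need to be read for the query in the question, instead of all five.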

How to remove the parentheses around records when saveAsTextFile on RDD[(String, Int)]? [duplicate]

How do I remove the parentheses "(" and ")" from the output of the Spark job below?
When I try to read the Spark output with a Pig script, it causes a problem.
My code:
scala> val words = Array("HI","HOW","ARE")
words: Array[String] = Array(HI, HOW, ARE)
scala> val wordsRDD = sc.parallelize(words)
wordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val keyvalueRDD = wordsRDD.map(elem => (elem,1))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:25
scala> val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
wordcountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:27
scala> wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
Output of the above code:
hadoop dfs -cat /user/cloudera/outputfiles/part*
But I want the Spark output to be stored without the parentheses.
Now I want to read the above output using a Pig script.
The LOAD statement in the Pig script treats "(HOW" as the first atom and "1)" as the second atom.
Is there any way to get rid of the parentheses in the Spark code itself? I don't want to apply the fix in the Pig script.
Pig script :
records = LOAD '/user/cloudera/outputfiles' USING PigStorage(',') AS (word:chararray);
dump records;
Pig output :
Use the map transformation before you save the records to the outputfiles directory, e.g.
wordcountRDD.map { case (k, v) => s"$k, $v" }.saveAsTextFile("/user/cloudera/outputfiles")
See Spark's documentation about map.
I strongly recommend using Datasets instead.
scala> words.toSeq.toDS.groupBy("value").count().show
+-----+-----+
|value|count|
+-----+-----+
|  HOW|    1|
|  ARE|    1|
|   HI|    1|
+-----+-----+
scala> words.toSeq.toDS.groupBy("value").count.write.csv("outputfiles")
$ cat outputfiles/part-00199-aa752576-2f65-481b-b4dd-813262abb6c2-c000.csv
See Spark SQL, DataFrames and Datasets Guide.
The parentheses come from the string representation of a Tuple. You can define your own format manually:
val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
  // here we set a custom format
  .map(x => x._1 + "," + x._2)
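The idea behind these answers, formatting each pair as a delimited line before writing it, can be sketched in plain Python without Spark. The record order and the `to_line` helper are illustrative assumptions; the split at the end only imitates how a comma-delimited loader would break each line into fields.

```python
# Word counts as (word, count) pairs, matching the words used in the question
records = [("HI", 1), ("HOW", 1), ("ARE", 1)]


def to_line(record, sep=","):
    """Format one (word, count) pair as a delimited line without parentheses."""
    word, count = record
    return f"{word}{sep}{count}"


lines = [to_line(r) for r in records]

# Splitting on the delimiter now yields clean fields
parsed = [tuple(line.split(",")) for line in lines]
print(lines)
print(parsed)
```

Because each line no longer carries the tuple's parentheses, the first field is the bare word and the second is the bare count.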

cloudera impala jdbc query doesn't see array<string> Hive column

I have a table in Hive that has the following structure:
> describe volatility2;
Query: describe volatility2
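+------------------+---------------+---------+
| name             | type          | comment |
+------------------+---------------+---------+
| version          | int           |         |
| unmappedmkfindex | int           |         |
| mfvol            | array<string> |         |
+------------------+---------------+---------+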
It was created by Spark HiveContext code by using a DataFrame API like this:
val volDF = hc.createDataFrame(volRDD)
which carried over the RDD structure that was defined in the schema:
def schemaVolatility: StructType = StructType(
StructField("Version", IntegerType, false) ::
StructField("UnMappedMKFIndex", IntegerType, false) ::
StructField("MFVol", DataTypes.createArrayType(StringType), true) :: Nil)
However, when I'm trying to select from this table using the latest JDBC Impala driver the last column is not visible to it. My query is very simple - trying to print the data to the console - exactly like in the example code provided by the driver download:
String sqlStatement = "select * from default.volatility2";
con = DriverManager.getConnection(connectionUrl);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sqlStatement);
System.out.println("\n== Begin Query Results ======================");
ResultSetMetaData metadata = rs.getMetaData();
for (int i=1; i<=metadata.getColumnCount(); i++) {
    // loop body not shown in the original post
}
System.out.println("== End Query Results =======================\n\n");
The console output is this:
== Begin Query Results ======================
== End Query Results =======================
Is this a driver bug, or am I missing something?
I found the answer to my own question. I am posting it here in case it helps others and saves them some searching. Apparently Impala recently introduced support for so-called "complex types" in its SQL, which include arrays among others. The link to the document is this:
According to it, what I had to do was change the query to look like this:
select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol
and I got the expected results back.
