Spark partition pruning doesn't work on 1.6.0 - hadoop

I created partitioned Parquet files on HDFS and created a Hive external table over them. When I query the table with a filter on the partitioning column, Spark scans all the partition directories instead of only the matching partition. We are on Spark 1.6.0.
dataframe:
df = hivecontext.createDataFrame([
("class1", "Economics", "name1", None),
("class2","Economics", "name2", 92),
("class2","CS", "name2", 92),
("class1","CS", "name1", 92)
], ["class","subject", "name", "marks"])
creating parquet partitions:
hivecontext.setConf("spark.sql.parquet.compression.codec", "snappy")
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
df.write.parquet("/transient/testing/students", mode="overwrite", partitionBy='subject')
Query:
df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
df.show()
+------+-----+-----+---------+
| class| name|marks| subject|
+------+-----+-----+---------+
|class1|name1| 0|Economics|
|class2|name2| 92|Economics|
+------+-----+-----+---------+
df.explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
+- 'Filter ('subject = Economics)
+- 'UnresolvedRelation `vatmatching_stage`.`students`, None
== Analyzed Logical Plan ==
class: string, name: string, marks: bigint, subject: string
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
+- Subquery students
+- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Optimized Logical Plan ==
Project [class#90,name#91,marks#92L,subject#89]
+- Filter (subject#89 = Economics)
+- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students
== Physical Plan ==
Scan ParquetRelation: vatmatching_stage.students[class#90,name#91,marks#92L,subject#89] InputPaths: hdfs://dev4/transient/testing/students/subject=Art, hdfs://dev4/transient/testing/students/subject=Civil, hdfs://dev4/transient/testing/students/subject=CS, hdfs://dev4/transient/testing/students/subject=Economics, hdfs://dev4/transient/testing/students/subject=Music
But if I run the same query in the Hive browser, I can see that Hive does partition pruning:
location hdfs://testing/students/subject=Economics
name vatmatching_stage.students
numFiles 1
numRows -1
partition_columns subject
partition_columns.types string
Is this a limitation in Spark 1.6.0, or am I missing something here?

Found the root cause of this issue. The HiveContext used for querying the table didn't have "spark.sql.hive.convertMetastoreParquet" set to "false". It was set to "true", the default value.
When I set it to "false", I can see that it uses partition pruning.
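
To confirm the effect of the setting, one option is to compare the partition directories listed under InputPaths before and after changing it. The helper below is only a sketch in plain Python: the name partitions_scanned is made up for illustration, and it assumes you paste in the text printed by df.explain(True) and that the paths use the Hive-style column=value layout shown in the question.

```python
import re


def partitions_scanned(plan_text, column):
    """Return the sorted, distinct values of `column` that appear as
    Hive-style `column=value` path segments in the given plan text."""
    pattern = re.compile(r"/" + re.escape(column) + r"=([^/,\s\]]+)")
    return sorted(set(pattern.findall(plan_text)))


# InputPaths copied from the physical plan in the question.
plan = (
    "InputPaths: hdfs://dev4/transient/testing/students/subject=Art, "
    "hdfs://dev4/transient/testing/students/subject=Civil, "
    "hdfs://dev4/transient/testing/students/subject=CS, "
    "hdfs://dev4/transient/testing/students/subject=Economics, "
    "hdfs://dev4/transient/testing/students/subject=Music"
)

# Five partitions are listed, so the filter on subject was not used for pruning.
print(partitions_scanned(plan, "subject"))
```

If pruning is in effect for the query in the question, only Economics should be left in the list.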

Apache airflow with conditional statements

I am new to Airflow. I want to do an operation like the one below using Airflow operators.
Briefly, I want to read some data from a database table and run different tasks depending on the values of a column in that table.
This is the table which I used to get data.
+-----------+--------+
| task_name | status |
+-----------+--------+
| a | 1 |
| b | 2 |
| c | 4 |
| d | 3 |
| e | 4 |
+-----------+--------+
From the above table I want to select the rows where status=4 and according to their task name run the relevant jar file (for running jar files I am planning to use Bash Operator). I want to execute this task using Airflow. Note that I am using PostgreSQL.
This is the code which I have implemented so far.
from airflow.models import DAG
from airflow.operators.postgres_operator import PostgresOperator
from datetime import datetime, timedelta
from airflow import settings
#set the default attributes
default_args = {
'owner': 'Airflow',
'start_date': datetime(2020,10,4)
}
status_four_dag = DAG(
dag_id = 'status_check',
default_args = default_args,
schedule_interval = timedelta(seconds=5)
)
test=PostgresOperator(
task_id='check_status',
sql='''select * from table1 where status=4;''',
postgres_conn_id='test',
database='status',
dag=status_four_dag,
)
I am stuck at the point where I need to check the task_name and call the relevant BashOperators.
Your support is appreciated. Thank you.
XComs are used for passing messages between tasks. Push the JAR filename and any other arguments needed to form the command to XCom, and consume them in the subsequent task.
For example,
check_status >> handle_status
check_status - checks the status in the DB and writes the JAR filename and arguments to XCom
handle_status - pulls the JAR filename and arguments from XCom, forms the command and executes it
Sample code:
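from random import randint

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# default_args is the dict defined in the question's code.

def check_status(**kwargs):
    # Stand-in for the real DB check: pick a JAR name and push it to XCom.
    if randint(1, 100) % 2 == 0:
        kwargs["ti"].xcom_push("jar_filename", "even.jar")
    else:
        kwargs["ti"].xcom_push("jar_filename", "odd.jar")

with DAG(dag_id='new_example', default_args=default_args) as dag:
    t0 = PythonOperator(
        task_id="check_status",
        provide_context=True,
        python_callable=check_status
    )
    # The template pulls the value pushed by check_status at run time.
    t1 = BashOperator(
        task_id="handle_status",
        bash_command="""
        jar_filename={{ ti.xcom_pull(task_ids='check_status', key='jar_filename') }}
        echo "java -jar ${jar_filename}"
        """
    )
    t0 >> t1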
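
The sample uses a random number in place of the database check. For the table in the question, the selection step could look roughly like the plain-Python sketch below. The function name jar_commands and the convention that task x runs x.jar are assumptions made for illustration. In a real DAG the rows would come from the PostgreSQL query, and the resulting names would be pushed to XCom as in the sample.

```python
def jar_commands(rows, wanted_status=4):
    """Build one `java -jar` command per row whose status matches.

    `rows` is an iterable of (task_name, status) tuples, e.g. the rows of
    table1. Assumption: the JAR for task `x` is named `x.jar`.
    """
    return [
        "java -jar {}.jar".format(task_name)
        for task_name, status in rows
        if status == wanted_status
    ]


# Rows copied from the table in the question.
rows = [("a", 1), ("b", 2), ("c", 4), ("d", 3), ("e", 4)]
print(jar_commands(rows))  # ['java -jar c.jar', 'java -jar e.jar']
```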

How can I exclude partitions when converting CSV to ORC using AWS Glue?

I have a bunch of CSV files in S3 that I am trying to convert to ORC using an ETL job in AWS Glue. I have a crawler that crawls the directory containing the CSVs and generates a table. The table looks like this:
Column name | Data type | Partition key
---------------------------------------
field1 | string |
field2 | string |
field3 | string |
partition_0 | string | Partition (0)
partition_1 | string | Partition (1)
Next, I try to convert the CSVs into ORC files. Here is a similar ETL script to what I'm using:
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'database', 'table_name', 'output_dir'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
partition_predicate = '(partition_0 = "val1") AND (partition_1 = "val2")'
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = args['database'], table_name = args['table_name'], push_down_predicate = partition_predicate, transformation_ctx = "datasource0")
final = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = { "path": args['output_dir'] }, format = "orc")
job.commit()
I have another crawler that crawls my output directory containing the ORC files. When it generates the table, it looks like this:
Column name | Data type | Partition key
---------------------------------------
field1 | string |
field2 | string |
field3 | string |
partition_0 | string |
partition_1 | string |
partition_0 | string | Partition (0)
partition_1 | string | Partition (1)
It looks like it considers the partitions to be fields in the ORC file (which they should not be). How can I modify my script so that the CSV to ORC conversion won't include the partition keys as schema columns?
If you need to preserve partitioning, add the partitionKeys option to the writer:
final = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = { "path": args['output_dir'], "partitionKeys": ["partition_0", "partition_1"] }, format = "orc")
Otherwise, just drop the partitioning columns:
cleanDyf = datasource0.drop_fields(["partition_0", "partition_1"])
final = glueContext.write_dynamic_frame.from_options(frame = cleanDyf, connection_type = "s3", connection_options = { "path": args['output_dir'] }, format = "orc")
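
As a rough mental model of the first option (plain Python, not Glue code): with Hive-style partitioning, the partition values move into the directory name and are left out of the stored fields, so a crawler should see them only as partition keys. The function name split_partition_keys and the sample record are made up for illustration.

```python
def split_partition_keys(record, partition_keys):
    """Return (relative directory, remaining fields) for one record,
    using the Hive-style key=value directory layout."""
    directory = "/".join(
        "{}={}".format(key, record[key]) for key in partition_keys
    )
    data = {k: v for k, v in record.items() if k not in partition_keys}
    return directory, data


# Hypothetical record with the columns from the question's table.
record = {
    "field1": "a",
    "field2": "b",
    "field3": "c",
    "partition_0": "val1",
    "partition_1": "val2",
}
directory, data = split_partition_keys(record, ["partition_0", "partition_1"])
print(directory)     # partition_0=val1/partition_1=val2
print(sorted(data))  # ['field1', 'field2', 'field3']
```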

Sql Window function on whole dataframe in spark

I am working on a Spark Streaming project which consumes data from Kafka every 3 minutes. I want to calculate a moving sum of a value. Below is sample logic for an RDD, which works well. I want to know whether this logic will also work for Spark Streaming. I read in some docs that you have to assign a range of data, e.g. Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1), but I want to apply the logic to the whole dataframe. Does the logic below work on all the values in the dataframe, or will it only take a range of values?
val customers = spark.sparkContext.parallelize(List(("Alice", "2016-05-01", 50.00),
("Alice", "2016-05-03", 45.00),
("Alice", "2016-05-04", 55.00),
("Bob", "2016-05-01", 25.00),
("Bob", "2016-05-04", 29.00),
("Bob", "2016-05-06", 27.00))).
toDF("name", "date", "amountSpent")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val wSpec1 = Window.partitionBy("name").orderBy("date")
customers.withColumn( "movingSum",
sum(customers("amountSpent")).over(wSpec1) ).show()
output
+-----+----------+-----------+---------+
| name| date|amountSpent|movingSum|
+-----+----------+-----------+---------+
| Bob|2016-05-01| 25.0| 25.0|
| Bob|2016-05-04| 29.0| 54.0|
| Bob|2016-05-06| 27.0| 81.0|
|Alice|2016-05-01| 50.0| 50.0|
|Alice|2016-05-03| 45.0| 95.0|
|Alice|2016-05-04| 55.0| 150.0|
+-----+----------+-----------+---------+
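
On the frame question: when a window spec has an orderBy but no explicit frame, Spark SQL defaults to the frame from the start of the partition up to the current row (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). The sum above is therefore a running total over all rows of each name in that dataframe, which matches the output shown. The plain-Python sketch below reproduces the same numbers. The function name running_sum_by_key is made up, and it assumes dates are unique within a name (with ties, a RANGE frame would include all tied rows).

```python
from collections import defaultdict


def running_sum_by_key(rows):
    """Cumulative sum of amount per name, in date order.

    `rows` is an iterable of (name, date, amount) tuples with ISO dates,
    so sorting the date strings sorts them chronologically.
    """
    totals = defaultdict(float)
    result = []
    for name, date, amount in sorted(rows, key=lambda r: (r[0], r[1])):
        totals[name] += amount
        result.append((name, date, amount, totals[name]))
    return result


# Data copied from the question.
customers = [
    ("Alice", "2016-05-01", 50.00),
    ("Alice", "2016-05-03", 45.00),
    ("Alice", "2016-05-04", 55.00),
    ("Bob", "2016-05-01", 25.00),
    ("Bob", "2016-05-04", 29.00),
    ("Bob", "2016-05-06", 27.00),
]
for row in running_sum_by_key(customers):
    print(row)
```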

very slow spark performance

I am a newbie to Spark and need some help debugging very slow performance.
I am doing the transformations below, and it's been running for more than 2 hours.
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext( sc )
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2b33f7a0
scala> val t1_df = hiveContext.sql("select * from T1" )
scala> t1_df.registerTempTable( "T1" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> t1_df.count
17/06/07 07:26:51 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res3: Long = 1732831
scala> val t1_df1 = t1_df.dropDuplicates( Array("c1","c2","c3", "c4" ))
scala> t1_df1.registerTempTable( "ABC" )
warning: there was one deprecation warning; re-run with -deprecation for details
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).count
[Stage 4:====================================================> (89 + 8) / 97]
I am using Spark 2.1.0 and reading data from Hive 2.1.1 on an Amazon VM cluster of 7 nodes, each with 250 GB RAM and 64 virtual cores. With these massive resources, I was expecting this simple query on 1.7 million records to fly, but it's painfully slow.
Any pointers would be of great help.
UPDATES:
Adding explain plan:
scala> hiveContext.sql( "select * from T1 where c1 not in ( select c1 from ABC )" ).explain
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, (isnull((c1#26 = c1#26#1398)) || (c1#26 = c1#26#1398))
:- FileScan parquet default.t1_pq[cols
more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<hdr_msg_src:string,hdr_recv_tsmp:timestamp,hdr_desk_id:string,execprc:string,dreg:string,c...
+- BroadcastExchange IdentityBroadcastMode
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- Exchange hashpartitioning(c1#26, c2#59, c3#60L, c4#82, 200)
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- *FileScan parquet default.atn_load_pq[c1#26,c2#59,c3#60L,c4#82] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:string,c2:string,c3:bigint,c4:string>
Although I think the count in your query will always be 0, you may try using a left anti join, and don't forget to cache t1_df to avoid recomputing it multiple times:
val t1_df = hiveContext.sql("select * from T1").cache

t1_df
  .join(
    t1_df.dropDuplicates(Array("c1", "c2", "c3", "c4")),
    Seq("c1"),
    "leftanti"
  )
  .count()
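
The remark that the count will always be 0 can be checked on a toy example: dropping duplicates on (c1, c2, c3, c4) still keeps at least one row for every c1 value, so no row of T1 has a c1 that is missing from ABC. The sketch below is plain Python with made-up rows and hypothetical helper names, and it ignores nulls. With nulls in c1 the two forms differ, because NOT IN returns no rows once the subquery contains a null.

```python
def drop_duplicates(rows, keys):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    kept = []
    for row in rows:
        combo = tuple(row[k] for k in keys)
        if combo not in seen:
            seen.add(combo)
            kept.append(row)
    return kept


def left_anti(left, right, key):
    """Rows of `left` whose `key` value does not appear in `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Made-up rows standing in for T1.
t1 = [
    {"c1": "x", "c2": 1, "c3": 1, "c4": "p"},
    {"c1": "x", "c2": 1, "c3": 1, "c4": "p"},  # duplicate on all four keys
    {"c1": "y", "c2": 2, "c3": 2, "c4": "q"},
]
abc = drop_duplicates(t1, ["c1", "c2", "c3", "c4"])
print(len(abc), len(left_anti(t1, abc, "c1")))  # 2 0
```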

cloudera impala jdbc query doesn't see array<string> Hive column

I have a table in Hive that has the following structure:
> describe volatility2;
Query: describe volatility2
+------------------+---------------+---------+
| name | type | comment |
+------------------+---------------+---------+
| version | int | |
| unmappedmkfindex | int | |
| mfvol | array<string> | |
+------------------+---------------+---------+
It was created by Spark HiveContext code by using a DataFrame API like this:
val volDF = hc.createDataFrame(volRDD)
volDF.saveAsTable(volName)
which carried over the RDD structure that was defined in the schema:
def schemaVolatility: StructType = StructType(
StructField("Version", IntegerType, false) ::
StructField("UnMappedMKFIndex", IntegerType, false) ::
StructField("MFVol", DataTypes.createArrayType(StringType), true) :: Nil)
However, when I'm trying to select from this table using the latest JDBC Impala driver the last column is not visible to it. My query is very simple - trying to print the data to the console - exactly like in the example code provided by the driver download:
String sqlStatement = "select * from default.volatility2";
Class.forName(jdbcDriverName);
con = DriverManager.getConnection(connectionUrl);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sqlStatement);
System.out.println("\n== Begin Query Results ======================");
ResultSetMetaData metadata = rs.getMetaData();
for (int i=1; i<=metadata.getColumnCount(); i++) {
System.out.println(rs.getMetaData().getColumnName(i)+":"+rs.getMetaData().getColumnTypeName(i));
}
System.out.println("== End Query Results =======================\n\n");
The console output is this:
== Begin Query Results ======================
version:version
unmappedmkfindex:unmappedmkfindex
== End Query Results =======================
Is it a driver bug, or am I missing something?
I found the answer to my own question. I'm posting it here so it may help others and save them time searching. Apparently Impala recently introduced support for so-called "complex types" in its SQL, which include arrays among others. The link to the document is this:
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_using
According to this, I had to change the query to look like this:
select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol
and I got the expected results back.
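
As I understand the join notation in that query, it returns one row per array element, with the element exposed as ITEM. The plain-Python sketch below models that flattening. The function name flatten_array_column and the sample rows are made up, and it assumes inner-join behaviour, where rows with an empty array produce no output.

```python
def flatten_array_column(rows, array_col):
    """Return one output row per array element, with the element stored
    under the key 'item' next to the row's scalar columns."""
    flattened = []
    for row in rows:
        scalars = {k: v for k, v in row.items() if k != array_col}
        for item in row[array_col] or []:
            flat = dict(scalars)
            flat["item"] = item
            flattened.append(flat)
    return flattened


# Made-up rows with the column names from the table in the question.
volatility2 = [
    {"version": 1, "unmappedmkfindex": 7, "mfvol": ["0.1", "0.2"]},
    {"version": 2, "unmappedmkfindex": 8, "mfvol": []},
]
for row in flatten_array_column(volatility2, "mfvol"):
    print(row)
```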