Why does one action produce two jobs? - performance

I use Spark 2.1.0.
Why does the following single action produce 2 identical jobs (the same DAG in each)? Shouldn't it produce just 1? Here is the code:
case class Person(name: String, age: Int)

val path = "/usr/lib/spark/examples/src/main/resources/people.txt"
val peopleDF = spark.
  sparkContext.
  textFile(path, 4).
  map(_.split(",")).
  map(attr => Person(attr(0), attr(1).trim.toInt)).
  toDF
peopleDF.show()
I can see this in the web UI when checking what is going on. I suppose it has something to do with the DataFrame transformations.

Although, in general, a single structured query may lead to more than one Spark job, in this particular case Spark 2.3.0-SNAPSHOT gives only one (contrary to what you see).
That single job (Job 12 in my run) is also pretty nice, i.e. just a single-stage, no-shuffle Spark job.
The reason you can see more than one Spark job per structured query (whether expressed in SQL or with the Dataset API) is that Spark SQL offers a high-level API on top of RDDs and uses RDDs and actions freely to make your life as a Spark developer and a Spark performance-tuning expert easier. In most cases (especially when you want to build abstractions), you would have to fire up those Spark jobs yourself to achieve comparable performance.
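As a hedged illustration of Spark SQL firing extra jobs on your behalf (the path and options below are just an example, not from the question): schema inference for a CSV source typically shows up in the web UI as its own job, separate from the job triggered by show().
// Schema inference requires an extra pass over the data, which usually
// appears as a separate job before the one run by show().
val df = spark.read.
  option("header", "true").
  option("inferSchema", "true"). // extra scan => extra job
  csv("/usr/lib/spark/examples/src/main/resources/people.csv")
df.show()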

Related

Hive - How to know which execution engine I am currently using

I want to automate my Hive ETL workflow in such a way that I execute Hive jobs based on the execution engine (Tez or MR), because of memory constraints.
Could you please help? I want to be able to cross-check, in the middle of my workflow, which execution engine I am currently dealing with.
Thanks in advance.
The Hive execution engine is controlled by the hive.execution.engine property. It can be one of the following:
mr (MapReduce, the default)
tez (Tez execution, for Hadoop 2 only)
spark (Spark execution, for Hive 1.1.0 onward)
The property can be read and updated from the Hive/Beeline CLI:
For reading - SET hive.execution.engine;
For updating - SET hive.execution.engine=tez;
If you want to get this value programmatically, use HiveClient, which supports multiple interfaces such as JDBC, Java, Python, PHP, Ruby, and C++.
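As a minimal sketch of the JDBC route (assuming a running HiveServer2; host, port and credentials below are placeholders), SET with no value returns the current setting as a single row:
import java.sql.DriverManager

// Placeholders: adjust host, port, database and credentials to your cluster.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "user", "password")
val stmt = conn.createStatement()
// "SET hive.execution.engine" returns one row such as "hive.execution.engine=tez"
val rs = stmt.executeQuery("SET hive.execution.engine")
while (rs.next()) println(rs.getString(1))
conn.close()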
References
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903061#ConfigurationProperties-hive.execution.engine
https://cwiki.apache.org/confluence/display/Hive/HiveClient

Apache Spark-SQL vs Sqoop benchmarking while transferring data from RDBMS to hdfs

I am working on a use case where I have to transfer data from an RDBMS to HDFS. We have benchmarked this case using Sqoop and found that we are able to transfer around 20 GB of data in 6-7 minutes.
Whereas when I try the same with Spark SQL, the performance is very low (1 GB of records takes 4 minutes to transfer from Netezza to HDFS). I am trying to do some tuning and increase the performance, but it is unlikely I can tune it to the level of Sqoop (around 3 GB of data in 1 minute).
I agree that Spark is primarily a processing engine, but my main question is that both Spark and Sqoop use a JDBC driver internally, so why is there such a difference in performance (or maybe I am missing something)? I am posting my code here.
import org.apache.spark.{SparkConf, SparkContext}

object helloWorld {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Netezza_Connection").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.read.format("jdbc").
      option("url", "jdbc:netezza://hostname:port/dbname").
      option("dbtable", "POC_TEST").
      option("user", "user").
      option("password", "password").
      option("driver", "org.netezza.Driver").
      option("numPartitions", "14").
      option("lowerBound", "0").
      option("upperBound", "13").
      option("partitionColumn", "id").
      option("fetchSize", "100000").
      load().registerTempTable("POC")
    val df2 = sqlContext.sql("select * from POC")
    val partitioner = new org.apache.spark.HashPartitioner(14)
    val rdd = df2.rdd.map(x => (String.valueOf(x.get(1)), x)).partitionBy(partitioner).values
    rdd.saveAsTextFile("hdfs://Hostname/test")
  }
}
I have checked many other posts but could not get a clear answer about the internal working and tuning of Sqoop, nor did I find a Sqoop vs Spark SQL benchmark. Kindly help me understand this issue.
You are using the wrong tools for the job.
Sqoop will launch a slew of processes (on the datanodes) that will each make a connection to your database (see num-mapper), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.
Get the dataset with Sqoop and then process it with Spark.
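A hedged sketch of that split (assuming Spark 2.x; the HDFS paths and delimiter below are made up for illustration): the Sqoop-imported files can simply be picked up from HDFS and processed in Spark.
// Assumes Sqoop has already imported POC_TEST as comma-delimited text under this path.
val imported = spark.read.
  option("sep", ",").
  csv("hdfs://Hostname/sqoop/POC_TEST")
// ...do the Spark processing here, then write the result out, e.g. as ORC.
imported.write.format("orc").save("hdfs://Hostname/test_orc")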
You can try the following:
1. Read the data from Netezza without any partitions and with the fetchSize increased to a million:
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("fetchSize","1000000").load().registerTempTable("POC")
2. Repartition the data before writing it to the final file:
val df3 = df2.repartition(10) //to reduce the shuffle
3. ORC format is more optimized than text, so write the final output to Parquet/ORC:
df3.write.format("orc").save("hdfs://Hostname/test")
@amitabh
Although marked as the answer, I disagree with it.
Once you give the predicate to partition the data while reading over JDBC, Spark will run separate tasks for each partition. In your case the number of tasks should be 14 (you can confirm this using the Spark UI).
I notice that you are using local as master, which provides only 1 core for the executors. Hence there is no parallelism, which is what is happening in your case.
Now, to get the same throughput as Sqoop, you need to make sure that these tasks run in parallel. Theoretically this can be done either by:
1. Using 14 executors with 1 core each
2. Using 1 executor with 14 cores (the other end of the spectrum)
Typically, I would go with 4-5 cores per executor. So I would test the performance with 15/5 = 3 executors (I added 1 to 14 to account for 1 core for the driver running in cluster mode).
Use spark.executor.cores and spark.executor.instances in sparkConf.set to play with these configs.
If this does not significantly increase performance, the next thing to look at would be the executor memory.
Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.
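A minimal sketch of those settings in Scala (the 3 executors x 5 cores values are illustrative, following the reasoning above, and only take effect when running against a real cluster rather than local):
import org.apache.spark.SparkConf

// Illustrative sizing only: 3 executors x 5 cores ~= 14-15 concurrent JDBC partition reads.
val conf = new SparkConf().
  setAppName("Netezza_Connection").
  set("spark.executor.instances", "3").
  set("spark.executor.cores", "5").
  set("spark.executor.memory", "4g")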
I had the same problem: the piece of code you are using does not actually partition the read.
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("numPartitions","14").option("lowerBound","0").option("upperBound","13").option("partitionColumn", "id").option("fetchSize","100000").load().registerTempTable("POC")
You can check the number of partitions created in your Spark job with:
df.rdd.partitions.length
You can use the following code to connect to the database (the URL, table, bounds and partition count below are placeholders taken from the question; adapt them to your setup):
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "user")
connectionProperties.put("password", "password")
connectionProperties.put("driver", "org.netezza.Driver")

val df = sqlContext.read.jdbc(url = "jdbc:netezza://hostname:port/dbname",
  table = "POC_TEST",
  columnName = "ID",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 14,
  connectionProperties = connectionProperties)
To optimize your Spark job, the following are the parameters to tune:
1. Number of partitions
2. --num-executors
3. --executor-cores
4. --executor-memory
5. --driver-memory
6. fetch-size
Options 2, 3, 4 and 5 depend on your cluster configuration. You can monitor your Spark job in the Spark UI.
Sqoop and Spark SQL both use JDBC connectivity to fetch data from RDBMS engines, but Sqoop has an edge here since it is specifically made to migrate data between an RDBMS and HDFS.
Every single option available in Sqoop has been fine-tuned to get the best performance while doing the data ingestion.
You can start with the option -m, which controls the number of mappers.
This is what you need to do to fetch data in parallel from the RDBMS. Can you do it in Spark SQL?
Of course, but the developer would need to take care of the "multithreading" that Sqoop takes care of automatically.
The below solution helped me:
var df = spark.read.format("jdbc").
  option("url", "url").
  option("user", "user").
  option("password", "password").
  option("dbtable", "dbTable").
  option("fetchSize", "10000").
  load()
df.registerTempTable("tempTable")
var dfRepart = spark.sql("select * from tempTable distribute by primary_key") //this will repartition the data evenly
dfRepart.write.format("parquet").save("hdfs_location")
Apache Sqoop is retired now - https://attic.apache.org/projects/sqoop.html
Using Apache Spark is a good option. This link shows how Spark can be used instead of Sqoop - https://medium.com/zaloni-engineering/apache-spark-vs-sqoop-engineering-a-better-data-pipeline-ef2bcb32b745
Alternatively, one can choose a cloud service such as Azure Data Factory, Amazon Redshift, etc.

What is the difference between HUE, YARN and OOZIE

I understand the concepts of HDFS and MapReduce and how it is important to move the processing logic to the data to increase efficiency. I was even able to run a couple of MapReduce jobs on my basic Hadoop cluster. Surrounding these concepts there are a lot of different technologies like YARN, HUE and OOZIE, all of which seem to do the same thing (at least from a very high level), which is operation visibility and CRUD abilities for jobs (which can be map-reduce or something else).
Am I correct in making this assumption, or is there a much more fundamental difference between them?
Thanks
Kay
YARN - MapReduce is an API in which you implement your data processing logic. Once the code is compiled, you submit the job using the hadoop jar command. YARN is the framework that keeps track of resources, submits the job to the cluster, executes it, and shows/logs its progress.
OOZIE - Take a data integration example. You might have to get one data set from one database and another data set from a second database, then join and process the data and reload it into a cache or a third database. This involves two Sqoop jobs to pull data from the databases, a Hive/MapReduce job to join and process the data, and then a push into the cache/database. All these jobs depend on each other; e.g. we are supposed to process the data only after it has been pulled from the source databases. Hence we need a workflow to execute the complete data integration process, and OOZIE facilitates that. It is a MapReduce-based workflow tool; the workflow itself is executed as one or more MapReduce jobs.
HUE - There are many tools in Hadoop: HDFS (the file system), Sqoop, Hive/Pig to process the data, Impala, HBase and many more. To execute POCs, it can get tedious to connect to the cluster, and it also requires some Linux skills. To overcome those challenges, all the Hadoop ecosystem tools are consolidated under one umbrella, called Hue.

Will hadoop(sqoop) load oracle faster than SQL loader?

We presently load CDRs into an Oracle warehouse using a combination of bash shell scripts and SQL*Loader with multiple threads. We are hoping to offload this process to Hadoop because we envisage that the increase in data, due to an increase in the subscriber base, will soon max out the current system. We also want to gradually introduce Hadoop into our data warehouse environment.
Will loading from hadoop be faster?
If so, what is the best set of Hadoop tools for this?
Further info:
We usually get a continuous stream of pipe-delimited text files through FTP to a folder, add two more fields to each record, load them into temp tables in Oracle and run a procedure to load into the final table. How would you advise the process flow to be, in terms of tools to use? For example:
files are FTPed to the Linux file system (or is it possible to FTP straight to Hadoop?) and Flume loads them into Hadoop;
fields are added (what would be best for this? Pig, Hive, Spark or any other recommendations?);
files are then loaded into Oracle using Sqoop;
the final procedure is called (can Sqoop make an Oracle procedure call? If not, what tool would be best to execute the procedure and help control the whole process?).
Also, how can one control the level of parallelism? Does it equate to the number of mappers running the job?
I had a similar task of exporting data from a < 6 node Hadoop cluster to an Oracle data warehouse.
I've tested the following:
Sqoop
OraOop
Oracle Loader for Hadoop from the "Oracle BigData Connectors" suite
A Hadoop streaming job which uses SQL*Loader (sqlldr) as the mapper; in its configuration you can read from stdin using: load data infile "-"
Considering just speed, the Hadoop streaming job with sqlldr as the mapper was the fastest way to transfer the data, but you have to install sqlldr on each machine of your cluster. It was more of a personal curiosity; I would not recommend using this way to export data, as the logging capabilities are limited and it is likely to have a bigger impact on your data warehouse performance.
The winner was Sqoop. It is pretty reliable, it's the import/export tool of the Hadoop ecosystem, and it was the second fastest solution according to my tests (1.5x slower than first place).
Sqoop with OraOop (last updated in 2012) was slower than the latest version of Sqoop, and requires extra configuration on the cluster.
Finally, the worst time was obtained using Oracle's BigData Connectors; if you have a big cluster (>100 machines) then it should not be as bad as the time I obtained. The export was done in two steps. The first step involves reprocessing the output and converting it to an Oracle format that plays nicely with the data warehouse. The second step was transferring the result to the data warehouse. This approach is better if you have a lot of processing power, and it would not impact the data warehouse's performance as much as the other solutions.

Apache Spark comparing files with SQL data

I am going to use Apache Spark for processing big text files, where part of the processing cycle is comparing text parts with data from a big SQL table.
The task is:
1) Process the files and break the text into pieces
2) Compare the pieces with the database ones
Definitely, the bottleneck will be the SQL side. I'm completely new to Apache Spark, and while I'm sure that subtask #1 is a natural fit for it, I'm not fully sure that subtask #2 can be handled by Spark (I mean, in an efficient way).
The question is: how does Spark deal with iterative selects from a big SQL table (maybe caching as much as it can?) in a parallel and distributed environment?
Posting as an answer per request:
If you need to repetitively process data from a SQL data source, I usually find it worth using Sqoop to pull the data into HDFS so that my processing can run more easily. This is particularly useful while I'm developing my data flow, since I'll often run the same job on a sample of data several times in a short time period, and if it has been sqooped I don't have to hit the database server every time.
If your job is periodic/batch style (a daily data cleanup or report or something), this may be a sufficient implementation, and having a collection of historic data in HDFS ends up being useful for other purposes many times.
If you need live, up-to-the-minute data, then you'll want to use JdbcRDD, as described in this other answer, which lets you treat a SQL data source as an RDD in your Spark data flow.
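For reference, a minimal JdbcRDD sketch (the connection URL, table and column names here are invented for illustration; the query must contain exactly two '?' placeholders, which JdbcRDD fills with the partition bounds):
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Each of the 10 partitions queries its own id range in parallel.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "password"),
  "SELECT id, text_part FROM reference_table WHERE id >= ? AND id <= ?",
  1L, 1000000L, 10,
  (rs: ResultSet) => (rs.getLong("id"), rs.getString("text_part")))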
Good luck.
