Concurrency in Sqoop - hadoop

I have read documents recommending that Sqoop be installed on an edge node, for reasons I understand, and that a connection to the source database is established for every mapper. My question: will all 4 connections be established from the edge node, or does the Sqoop client on the edge node just create some kind of driver that monitors the ingestion while the datanodes connect to the database, get their part of the data, split it locally and then put it in HDFS?

Sqoop is a wrapper over MapReduce that performs the import/export operations.
The mappers run in your cluster, while the Sqoop client runs on the edge node.
Each mapper opens its own connection to your database.
Which rows each mapper consumes is decided by the client when it submits the job.
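For illustration, a hedged sketch of such an import (the connect string, table and split column here are placeholders, not taken from the question):
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --split-by order_id \
  -m 4
# Before launching the mappers, the client runs a boundary query such as
#   SELECT MIN(order_id), MAX(order_id) FROM orders
# and assigns each of the 4 mappers its own slice of that range; each mapper
# then opens its own JDBC connection from the node it runs on, not from the edge node.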

The edge node acts as an interface to the Hadoop cluster; sqoop import/export launches a MapReduce job based on the generic and tool-specific arguments.
The MapReduce job runs the number of mappers given by the -m or --num-mappers argument.
For detailed information see the links below:
http://www.dummies.com/programming/big-data/hadoop/edge-nodes-in-hadoop-clusters/
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1764013

Related

If we use 6 mappers in Sqoop to import data from Oracle, how many connections will be established between Sqoop and the source?

If we use 6 mappers in Sqoop to import data from Oracle, how many connections will be established between Sqoop and the source?
Will it be a single connection, or 6 connections, one per mapper?
As per the Sqoop docs:
Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
That means all the mappers will make concurrent connections.
Also keep in mind that if your table has only 2 records, Sqoop will use only 2 mappers, not all 6.
Check my other answer to understand the concept of the number of mappers in a Sqoop command.
EDIT:
All the mappers will open connections as JDBC client programs. Active connections (the ones that actually fire SQL queries) are then cached and shared among the mappers.
Run the sqoop import command with --verbose and you will see logs like:
DEBUG manager.OracleManager$ConnCache: Got cached connection for jdbc:oracle:thin:@192.xx.xx.xx:1521:orcl/dev
DEBUG manager.OracleManager$ConnCache: Caching released connection for jdbc:oracle:thin:@192.xx.xx.xx:1521:orcl/dev
Check the getConnection and recycle methods for more details.
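For example (a hedged sketch; the connect string mirrors the obfuscated one in the logs above, and the table name is a placeholder):
sqoop import \
  --connect jdbc:oracle:thin:@192.xx.xx.xx:1521:orcl \
  --username dev -P \
  --table MY_TABLE \
  -m 6 \
  --verbose
# With --verbose, DEBUG manager.OracleManager$ConnCache lines like those above
# show up in the logs as connections are cached and reused.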
Each map task will get a DB connection, so in your case 6 maps means 6 connections. Please visit the Sqoop source on GitHub to see how it is implemented.
-m specifies the number of mapper tasks that run as part of the job,
so more mappers means more connections.
It probably depends on the manager, but I would guess all of them are likely to create one connection per mapper. Take DirectPostgresSqlManager: it creates one connection per mapper through psql COPY TO STDOUT.
Please take a look at the managers at
Sqoop Managers

Apache Spark SQL vs Sqoop benchmarking while transferring data from RDBMS to HDFS

I am working on a use case where I have to transfer data from an RDBMS to HDFS. We benchmarked this case using Sqoop and found that we can transfer around 20 GB of data in 6-7 minutes.
When I try the same with Spark SQL, the performance is much lower (1 GB of records takes 4 minutes to transfer from Netezza to HDFS). I am trying to do some tuning and increase its performance, but it is unlikely I can tune it to the level of Sqoop (around 3 GB of data per minute).
I accept that Spark is primarily a processing engine, but my main question is this: both Spark and Sqoop use the JDBC driver internally, so why is there such a big difference in performance (or maybe I am missing something)? I am posting my code here.
import org.apache.spark.{SparkConf, SparkContext}

object helloWorld {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Netezza_Connection").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    sqlContext.read.format("jdbc")
      .option("url", "jdbc:netezza://hostname:port/dbname")
      .option("dbtable", "POC_TEST")
      .option("user", "user").option("password", "password")
      .option("driver", "org.netezza.Driver")
      .option("numPartitions", "14").option("partitionColumn", "id")
      .option("lowerBound", "0").option("upperBound", "13")
      .option("fetchSize", "100000")
      .load().registerTempTable("POC")
    val df2 = sqlContext.sql("select * from POC")
    val partitioner = new org.apache.spark.HashPartitioner(14)
    val rdd = df2.rdd.map(x => (String.valueOf(x.get(1)), x)).partitionBy(partitioner).values
    rdd.saveAsTextFile("hdfs://Hostname/test")
  }
}
I have checked many other posts but could not get a clear answer on the internal working and tuning of Sqoop, nor did I find a Sqoop vs Spark SQL benchmark. Kindly help me understand this issue.
You are using the wrong tools for the job.
Sqoop will launch a slew of processes (on the datanodes) that will each make a connection to your database (see num-mappers), and each will extract a part of the dataset. I don't think you can achieve that kind of read parallelism with Spark.
Get the dataset with Sqoop and then process it with Spark.
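A hedged sketch of that two-step approach (the connect string, driver and credentials are copied from the question; the target directory is a placeholder, and treating Netezza as a generic JDBC source here is an assumption):
sqoop import \
  --connect jdbc:netezza://hostname:port/dbname \
  --driver org.netezza.Driver \
  --username user --password password \
  --table POC_TEST \
  --split-by id \
  --num-mappers 14 \
  --target-dir /data/poc_test
# Spark then reads the extracted files from /data/poc_test instead of going
# back to the database over JDBC.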
You can try the following:
Read the data from Netezza without any partitions and with fetchSize increased to a million:
sqlContext.read.format("jdbc")
  .option("url", "jdbc:netezza://hostname:port/dbname")
  .option("dbtable", "POC_TEST")
  .option("user", "user").option("password", "password")
  .option("driver", "org.netezza.Driver").option("fetchSize", "1000000")
  .load().registerTempTable("POC")
Repartition the data before writing it to the final file:
val df3 = df2.repartition(10) // to reduce the shuffle
ORC is more optimized than text, so write the final output to Parquet or ORC:
df3.write.format("orc").save("hdfs://Hostname/test")
#amitabh
Although marked as an answer, I disagree with it.
Once you give the predicates to partition the data while reading over JDBC, Spark will run separate tasks for each partition. In your case the number of tasks should be 14 (you can confirm this using the Spark UI).
I notice that you are using local as master, which provides only 1 core to the executors. Hence there is no parallelism, which is what is happening in your case.
Now, to get the same throughput as Sqoop you need to make sure that these tasks are running in parallel. Theoretically this can be done either by:
1. Using 14 executors with 1 core each
2. Using 1 executor with 14 cores (the other end of the spectrum)
Typically I would go with 4-5 cores per executor, so I would test the performance with 15/5 = 3 executors (I added 1 to 14 to account for one core for the driver running in cluster mode).
Use spark.executor.cores and spark.executor.instances in sparkConf.set to play with the configs (see the sketch after this answer).
If this does not significantly increase performance, the next thing would be to look at the executor memory.
Finally, I would tweak the application logic to look at mapRDD sizes, partition sizes and shuffle sizes.
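A minimal sketch of those executor settings (the numbers are illustrative, and spark.executor.instances assumes a YARN-style cluster manager rather than local mode):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Netezza_Connection")
  .set("spark.executor.instances", "3") // ~15 cores / 5 cores per executor
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "8g")   // revisit after checking the Spark UI
val sc = new SparkContext(conf)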
I had the same problem, because the piece of code you are using is not partitioning the read:
sqlContext.read.format("jdbc").option("url","jdbc:netezza://hostname:port/dbname").option("dbtable","POC_TEST").option("user","user").option("password","password").option("driver","org.netezza.Driver").option("numPartitions","14").option("lowerBound","0").option("upperBound","13").option("partitionColumn", "id").option("fetchSize","100000").load().registerTempTable("POC")
You can check the number of partitions created in your Spark job with:
df.rdd.partitions.length
You can use the following code to connect to the DB:
sqlContext.read.jdbc(url = db_url,
  table = tableName,
  columnName = "ID",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = numPartitions,
  connectionProperties = connectionProperties)
To optimize your Spark job, the main parameters are:
1. # of partitions
2. --num-executors
3. --executor-cores
4. --executor-memory
5. --driver-memory
6. fetch size
Options 2, 3, 4 and 5 depend on your cluster configuration (a spark-submit sketch follows below).
You can monitor your Spark job in the Spark UI.
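A hedged spark-submit sketch tying those flags together (the class name matches the question's object, the jar name and resource numbers are placeholders; fetch size is set in the JDBC options, not here):
spark-submit \
  --master yarn \
  --num-executors 3 \
  --executor-cores 5 \
  --executor-memory 8g \
  --driver-memory 4g \
  --class helloWorld \
  netezza-ingest.jar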
Sqoop and Spark SQL both use JDBC connectivity to fetch data from RDBMS engines, but Sqoop has an edge here since it is specifically made to migrate data between an RDBMS and HDFS.
Every single option available in Sqoop has been fine-tuned to get the best performance during data ingestion.
You can start with the option -m, which controls the number of mappers.
This is what you need to do to fetch data in parallel from the RDBMS. Can I do it in Spark SQL?
Of course, but the developer would need to take care of the "multithreading" that Sqoop takes care of automatically.
The solution below helped me:
var df = spark.read.format("jdbc")
  .option("url", "url")
  .option("user", "user").option("password", "password")
  .option("dbtable", "dbTable")
  .option("fetchSize", "10000")
  .load()
df.registerTempTable("tempTable")
var dfRepart = spark.sql("select * from tempTable distribute by primary_key") // this will repartition the data evenly
dfRepart.write.format("parquet").save("hdfs_location")
Apache Sqoop is retired now - https://attic.apache.org/projects/sqoop.html
Using Apache Spark is a good option. This link shows how Spark can be used instead of Sqoop - https://medium.com/zaloni-engineering/apache-spark-vs-sqoop-engineering-a-better-data-pipeline-ef2bcb32b745
Alternatively, one can choose a cloud service such as Azure Data Factory or Amazon Redshift.

Moving Hive data from one Hadoop cluster to another without using the distcp command?

How can I move Hive data from one Hadoop cluster to another Hadoop cluster without using the distcp command, since we cannot use it? Do we have another option, like Sqoop or Flume?
distcp is the efficient way to move huge amounts of data from one Hadoop cluster to another.
Sqoop and Flume cannot be used to transfer data from one Hadoop cluster to another: Sqoop is predominantly used to move data between Hadoop and relational databases, whereas Flume is used to ingest streaming data into Hadoop.
Your other options would be to use:
a high-throughput message queue like Kafka, but this would become more complicated than using distcp;
traditional hadoop fs shell commands like cp, or get followed by put, as sketched below.
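For example (a minimal sketch; the namenode hosts, ports and paths are placeholders, and it assumes the two clusters can reach each other):
# copy directly between clusters
hadoop fs -cp hdfs://nn-cluster1:8020/user/hive/warehouse/mydb.db/mytable hdfs://nn-cluster2:8020/user/hive/warehouse/mydb.db/mytable
# or stage through the local filesystem of an edge node
hadoop fs -get hdfs://nn-cluster1:8020/user/hive/warehouse/mydb.db/mytable /tmp/mytable
hadoop fs -put /tmp/mytable hdfs://nn-cluster2:8020/user/hive/warehouse/mydb.db/mytable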
FYI, when you are talking about Hive data you should also consider keeping the Hive metadata (metastore) in sync between the clusters.

Sqoop speculative execution

I have a few questions about Sqoop.
I was curious whether we can turn speculative execution on or off for a Sqoop import/export job.
Also, do we have any option for setting the number of reducers in a Sqoop import/export? According to my analysis Sqoop does not require any reducers, but I am not sure whether I am correct. Please correct me on this.
I have used Sqoop with MySQL and Oracle; what other databases can we use besides these?
Thanks
1) In Sqoop, speculative execution is off by default, because if multiple mappers ran for a single task we would get duplicate data in HDFS. To avoid this discrepancy it is kept off (a sketch of forcing it off explicitly follows below).
2) The number of reducers for a Sqoop job is 0, since it is merely a map-only job that dumps data into HDFS. We are not aggregating anything.
3) You can use PostgreSQL and HSQLDB along with MySQL and Oracle. However, direct import is supported only for MySQL and PostgreSQL.
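Regarding (1), a hedged sketch of pinning speculative execution off explicitly through Hadoop's generic -D options, which must come before the Sqoop-specific arguments (the connect string is a placeholder, and whether you need the newer mapreduce.map.speculative or the older mapred.map.tasks.speculative.execution property name depends on your Hadoop version):
sqoop import \
  -D mapreduce.map.speculative=false \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  -m 4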
Speculative execution is turned on by default. It can be enabled or disabled independently
for map tasks and reduce tasks, on a cluster-wide basis, or on a per-job basis.
No reducer for Sqoop (screenshot): http://i.stack.imgur.com/CH8pb.png
Any JDBC-compatible RDBMS, e.g. MySQL, Oracle, PostgreSQL.

Running pig on a multi node Cassandra cluster

I am working on a BI process that will read data from Cassandra, create summaries using MapReduce and write them back to a different keyspace.
Starting with a single node everything worked as I expected, but when moving to a multi-node setup I am not sure I fully understand the topology and configuration.
I have a setup with 3 nodes. Each has a Cassandra node (version 1.1.9), a DataNode and a TaskTracker (version 0.20.2+923.421-CDH3U5). The NameNode and JobTracker are on a different server. At this point I am trying to run the Pig script from the DataNode server.
The thing I am not sure of is the Pig argument PIG_INITIAL_ADDRESS. I assumed the query would run on all Cassandra nodes, each TaskTracker would only query the local Cassandra node, and the reducer would handle any duplicates. Based on that assumption I thought PIG_INITIAL_ADDRESS should be localhost. But when running the Pig script it fails:
java.io.IOException: Unable to connect to server localhost:9160
My questions are: should the initial address be any one of the Cassandra nodes, and is the map split across the cluster derived from the Cassandra key partitions (will I get the distribution I need)?
If I were to use Java MapReduce, would I still need to supply the initial address?
Does the current implementation assume Pig is running from a Cassandra node?
The PIG_INITIAL_ADDRESS is the address of one of the Cassandra nodes in your ring. For the Hadoop job to read data from or write data to Cassandra, it just needs to have some properties set. Those properties can also be set in the job properties or in the default Hadoop configuration on the server you're running the job from. Other than that, it's just like submitting a job to a job tracker.
For more information, I would look at the README in the Cassandra source download under examples/pig. There is a lot of explanation in there as well.
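A hedged sketch of the environment that README has you set before launching the bundled pig_cassandra wrapper (the variable names are as I recall them from examples/pig; the host, port, partitioner and script name are illustrative, and the address should point at your ring rather than localhost unless Pig runs on a Cassandra node):
export PIG_INITIAL_ADDRESS=cassandra-node1
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
examples/pig/bin/pig_cassandra -x mapreduce summarize.pig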
