Sqoop speculative execution - hadoop

I have the below questions about Sqoop.
I was curious whether we can turn speculative execution on or off for a Sqoop import/export job.
Also, is there any option to set the number of reducers in a Sqoop import/export process? From my analysis, Sqoop does not require any reducers, but I'm not sure if I'm correct. Please correct me on this.
I have used Sqoop with MySQL and Oracle; what other databases can we use besides these?
Thanks

1) In Sqoop, speculative execution is off by default, because if multiple mappers run for a single task we get duplicate data in HDFS. To avoid this discrepancy, it is turned off.
2) The number of reducers for a Sqoop job is 0, since it is merely a map-only job that dumps data into HDFS; we are not aggregating anything.
3) You can use PostgreSQL and HSQLDB along with MySQL and Oracle. However, direct import is supported only for MySQL and PostgreSQL.

In MapReduce generally, speculative execution is turned on by default. It can be enabled or disabled independently
for map tasks and reduce tasks, on a cluster-wide basis, or on a per-job basis.
No reducer for a Sqoop job (screenshot): http://i.stack.imgur.com/CH8pb.png
Any JDBC-compatible RDBMS can be used, e.g. MySQL, Oracle, PostgreSQL.
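If you ever need to set these properties explicitly, Sqoop accepts the generic Hadoop -D arguments before its tool-specific options. The command below is only a sketch with placeholder connection details; the speculative-execution properties shown are the standard Hadoop ones, and Sqoop's map-only import normally disables speculation on its own:

sqoop import -D mapreduce.map.speculative=false -D mapreduce.reduce.speculative=false \
    --connect jdbc:mysql://dbhost/mydb --username user --password pass \
    --table mytable --num-mappers 4 --target-dir /data/mytable

There is no reducer-count option because the import runs as a map-only job; --num-mappers (or -m) is the knob that controls parallelism.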

Related

Concurrency in Sqoop

I have read documents recommending that Sqoop be installed on the edge node, for many reasons, which is understood, and that for every mapper a connection to the source database is established. My question is: will all 4 connections be established from the edge node, or does the Sqoop client on the edge node just create some kind of driver that monitors the ingestion, while the data nodes connect to the database, get their part of the data, and put it in HDFS?
Sqoop is a wrapper over MapReduce that performs import/export operations.
The mappers run in your cluster, while the Sqoop client runs on the edge node.
Each mapper opens its own connection to your database.
Which rows are consumed by each mapper is decided by the client when submitting the job.
The edge node acts as an interface to the Hadoop cluster; sqoop import/export launches a MapReduce job based on the generic and tool-specific arguments.
The MapReduce job runs the number of mappers given by the -m or --num-mappers argument.
For Detailed information see below links:
http://www.dummies.com/programming/big-data/hadoop/edge-nodes-in-hadoop-clusters/
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1764013
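To see this split in practice, you can tell Sqoop which column to partition on; the client computes the min/max boundaries of that column, and each mapper, running on a cluster node, opens its own connection and pulls only its range of rows. A minimal sketch with placeholder host, credentials, table, and column names:

sqoop import --connect jdbc:mysql://dbhost/mydb --username user --password pass \
    --table orders --split-by order_id --num-mappers 4 --target-dir /data/orders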

How to move data from RDBMS to hadoop without Sqoop?

I need to move huge amounts of data from an RDBMS to Hadoop without using Sqoop. I have a database of 2200 tables, and using Sqoop to import them to HDFS is a hectic job that consumes a lot of time, and hitting the database with a select each time affects its performance. I have more sources to move from the RDBMS to HDFS, and I query the files in HDFS with Hive. Can someone help me with a more efficient way?
You could always do it manually with any back-end code: read data from the database and stream-write it to HDFS. In your application configuration you could then have any customization you need (threads, timeouts, batch sizes, etc.), and it is a rather straightforward solution. We tried this once for some reason I don't remember, but mostly we use Sqoop and have no issues with it. You could also keep a copy (some kind of replica) of the database that is not used by any external systems other than your Sqoop job, so that user selects would not affect performance.
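As a rough illustration of the manual approach, you can dump a table with the database's own client and stream the output straight into HDFS. This is only a sketch assuming a MySQL source; the database, table, and HDFS path are placeholders:

# dump one table as tab-separated text and stream it into HDFS via stdin
mysql --batch --skip-column-names -e "SELECT * FROM mytable" mydb | \
    hdfs dfs -put - /data/mytable/part-00000

Looping a command like this over the 2200 tables (running several in parallel) gives you full control over concurrency, timeouts, and scheduling, at the cost of building and maintaining that plumbing yourself.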

Relationship between Hive and Hadoop MapReduce?

Is there any Hive internal process that connects to reduce or map tasks?
Adding to that!
How does Hive work in relation with MapReduce?
How is the job getting scheduled?
How does the query result return to the hive driver?
Hive has no process that communicates with map/reduce tasks directly. Once the job is scheduled, it communicates (flow 6.3 in the image) only with the JobTracker (the Application Master in YARN) for job-processing concerns.
This image gives a clear understanding of:
How Hive uses MapReduce as its execution engine
How the job gets scheduled
How the result returns to the driver
Edit (suggested by dennis-jaheruddin): Hive is typically controlled by means of HQL (Hive Query Language), which is often conveniently abbreviated to Hive.
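If you want to see how Hive translates a query into MapReduce stages before anything is scheduled, you can ask the Hive CLI for the execution plan; the table and column names here are just placeholders:

hive -e "EXPLAIN SELECT col, COUNT(*) FROM mytable GROUP BY col;"

The plan lists the map and reduce stages that the driver will hand off to the JobTracker / Application Master.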

How to load incremental records from Oracle to HDFS on a daily basis: can we use Sqoop or MR jobs, and which is the preferred method?

How can I load incremental records from Oracle to HDFS on a daily basis? Can we use Sqoop or MR jobs?
Sqoop is designed exactly for this purpose, and will result in MR jobs that do the work of copying data. There are several methods of determining what is new in the Oracle table, for example using the table's ID, or perhaps a date-modified field if you have one.
Compared to most things in Hadoop, Sqoop is pretty easy. Here's a link to the docs; search for "incremental" or start with section 7.2.9 for more info: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
FYI, once you get this working normally, check out the Sqoop extension designed to work with Oracle databases; it uses a very efficient method for streaming data directly, making the process even faster and more lightweight on your Oracle DB.
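For illustration, an append-mode incremental import keyed on an ID column might look like the sketch below; the host, credentials, table, and column are placeholders, and a date-modified column would use --incremental lastmodified instead:

sqoop import --connect jdbc:oracle:thin:@dbhost:1521/orcl --username user --password pass \
    --table MY_TABLE --incremental append --check-column ID --last-value 0 \
    --target-dir /data/my_table

Defining this as a saved job (sqoop job --create ...) lets Sqoop remember the last imported value between daily runs so you don't have to track it yourself.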

Move data from oracle to HDFS, process and move to Teradata from HDFS

My requirement is to
Move data from Oracle to HDFS
Process the data on HDFS
Move processed data to Teradata.
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB, and the processed data may be about the same.
After searching a lot on the internet, I found that I could:
Use ORAOOP to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
Do large-scale processing with custom MapReduce, Hive, or Pig.
Use the Sqoop Teradata connector to move data from HDFS to Teradata (again, have a shell script with the code and then schedule it).
Is this the right approach in the first place, and is it feasible for the required time period (please note that this is not a daily batch or similar)?
Other options that I found are the following:
Storm (for real-time data processing), but I am not able to find an Oracle spout or a Teradata bolt out of the box.
Any open-source ETL tool such as Talend or Pentaho.
Please share your thoughts on these options as well, and any other possibilities.
Looks like you have several questions, so let's try to break it down.
Importing into HDFS
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in and out of HDFS, and it can connect to various databases, including Oracle, natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information: here and here. Note that you can also import directly into a Hive table with Sqoop, which could be convenient for your analysis.
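A sketch of that Hive-import variant, reusing the same placeholder connection details, would be:

sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy \
    --table tbl --hive-import --hive-table tbl

--hive-import creates the Hive table if it does not exist and loads the imported files into the Hive warehouse, so the data is immediately queryable.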
Processing
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, since you might be more familiar with SQL-like syntax. Pig is closer to pure relational algebra and its syntax is not SQL-like; it is mostly a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
In Hive you could run your query and tell it to write the results to HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
Exporting into Teradata
Cloudera released a Teradata connector for Sqoop last year, as described here, so you should take a look, as this looks like exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want; in the end, what will matter is the size of your cluster. If you want it quick, scale your cluster up as needed. The good thing with Hive and Sqoop is that the processing will be distributed across your cluster, so you have total control over the schedule.
If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software’s SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.
I’m not sure if you need to reprocess the entire data set each time or can possibly just use the deltas. SharePlex also supports replicating the change data to a JMS queue. It might be possible to create a Spout that reads from that queue. You could probably also build your own trigger based solution but it would be a bit of work.
As a disclosure, I work for Dell Software.
