How to move data from an RDBMS to Hadoop without Sqoop?

I need to move a huge amount of data from an RDBMS to Hadoop without using Sqoop. The database has 2,200 tables, and using Sqoop to import them into HDFS is a tedious job that consumes a lot of time, and hitting the database with a select for every table affects its performance. I have more sources to move from RDBMS to HDFS, and I query the files in HDFS with Hive. Can someone suggest a more efficient way?

You could always do it manually with any back-end code: read the data from the database and stream-write it to HDFS. In your application configuration you can then customize whatever you need (threads, timeouts, batch sizes, etc.), and it is a fairly straightforward solution. We tried this once for some reason I no longer remember, but mostly we use Sqoop and have no issues with it. You could also keep a copy (some kind of replica) of the database that is not used by any external system other than your Sqoop job, so user selects would not affect performance.
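A minimal sketch of that manual approach, assuming a MySQL source, the mysql command-line client, and placeholder host, credential, and table names:
# Stream one table as tab-separated rows straight into HDFS without landing it on local disk
mysql -h dbhost -u etl -p --batch --quick -e "SELECT * FROM mydb.customers" | hdfs dfs -put - /staging/customers/customers.tsv
You would wrap something like this in a loop over your table list and control parallelism (how many tables you export at once) in your own scripts.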

Related

HDFS and HBase: how do they work?

Hi everybody,
I'm quite new to big data. I have installed an HDFS + HBase test database, and I use Talend Big Data (an ETL tool) for my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, can I never query that data? I mean, do I have to read the entire file if I want to filter out the data I'm interested in? Is that right?
Thanks a lot for any help!
HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS; use it when you need random access to your data.
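A quick illustration of that random access, assuming the HBase shell is available and a hypothetical table named my_table already exists:
# Fetch a single row by key, with no scan of the underlying files
echo "get 'my_table', 'row-42'" | hbase shell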
If you want to store your files on HDFS as they are and query them, you can create an external table on top of them using Hive.
The best option is to use Hive on top of the files that are in HDFS. You can use bucketing and partitioning in Hive to improve performance.
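A hedged sketch of such an external table, assuming tab-delimited files under a placeholder /data/web_logs directory and hypothetical column names:
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  user_id BIGINT,
  url STRING,
  ts STRING)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs'"
Partitioning by a column such as log_date lets Hive prune whole directories at query time instead of scanning every file.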

How to load incremental records from Oracle to HDFS on a daily basis? Can we use Sqoop or MR jobs, and which is the preferred method?

How can I load incremental records from Oracle to HDFS on a daily basis? Can we use Sqoop or MR jobs?
Sqoop is designed exactly for this purpose, and it launches MR jobs that do the work of copying the data. There are several methods of determining what is new in the Oracle table, for example using the table's id column, or perhaps a date-modified field if you have one.
Compared to most things in Hadoop, Sqoop is pretty easy. Here's a link to the docs; search for "incremental" or start with section 7.2.9 for more info: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
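A hedged sketch of an incremental append import, assuming a numeric ORDER_ID key and placeholder connection details (use --incremental lastmodified with a date column instead if that is what you track):
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username etl --password-file /user/etl/ora.pwd \
  --table ORDERS \
  --target-dir /data/orders \
  --incremental append \
  --check-column ORDER_ID \
  --last-value 1000000
On the next run Sqoop prints the new last value; a saved Sqoop job (sqoop job --create ...) can track that value for you automatically.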
FYI, once you have this working normally, check out the Sqoop extension designed to work with Oracle databases, which uses a very efficient method of streaming data directly, making the process even faster and lighter on your Oracle DB.

Extracting data from Oracle to Hadoop: is Sqoop a good idea?

I'm looking to extract some data from an Oracle database and transfer it to a remote HDFS file system. There appear to be a couple of possible ways of achieving this:
Use Sqoop. This tool will extract the data, copy it across the network, and store it directly in HDFS.
Use SQL to read the data and store it on the local file system. When this has completed, copy (ftp?) the data to the Hadoop system.
My question: will the first method (which is less work for me) cause Oracle to lock tables for longer than required?
My worry is that Sqoop might take out a lock on the database when it starts to query the data, and that this lock won't be released until all of the data has been copied across to HDFS. Since I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency), the lock would remain longer than would otherwise be required.
Sqoop issues ordinary select queries against the Oracle database, so it takes the same locks a select query would; no additional locking is performed by Sqoop.
Data will be transferred in several concurrent tasks (mappers). Any expensive function call will put a significant performance burden on your database server, and advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel. This will adversely affect transfer performance.
For efficient advanced filtering, run the filtering query on your database prior to import, save its output to a temporary table, and run Sqoop to import the temporary table into Hadoop without the --where parameter.
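A hedged sketch of that pattern, assuming sqlplus is available, placeholder credentials, and a hypothetical ORDERS_RECENT staging table:
# Materialize the filtered rows once on the Oracle side
sqlplus -s etl/secret@dbhost:1521/ORCL <<'SQL'
CREATE TABLE ORDERS_RECENT AS SELECT * FROM ORDERS WHERE ORDER_DATE > SYSDATE - 7;
EXIT;
SQL
# Then import the staging table with a plain parallel table scan (no --where)
sqoop import --connect jdbc:oracle:thin:@dbhost:1521/ORCL --username etl --password-file /user/etl/ora.pwd --table ORDERS_RECENT --target-dir /data/orders_recent --num-mappers 4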
The Sqoop import itself has nothing to do with copying data across the network for redundancy: Sqoop writes the data to one location, and HDFS then replicates it according to the cluster's replication factor.
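If you want to confirm that after an import, assuming a placeholder output path, you can ask HDFS directly:
# Show the replication factor, size, and name of one imported file
hdfs dfs -stat "replication=%r bytes=%b name=%n" /data/orders/part-m-00000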

Hadoop MapReduce DBInputFormat and DBOutputFormat

I need to import data from MySQL, run an MR job, and export the results back to MySQL.
I am able to do this successfully in a single MR job for a few records using DBInputFormat and DBOutputFormat.
When I scale the input to 100+ million records, the MR job hangs.
The alternative is to export the data to HDFS, run the MR job, and push the results back to MySQL.
For a huge dataset of around 400+ million records, which is the better option: using DBInputFormat and DBOutputFormat, or using HDFS as the data source and destination?
Using HDFS adds a step before and after my MR job, and since the data is stored on HDFS it will be replicated (default 3) and will require more disk space.
Thanks,
Rupesh
I think the best approach in this situation is to use Sqoop. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases like MySQL or Oracle. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported, and it uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Please look into this link and explore Sqoop for details: SQOOP details
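A hedged sketch of that round trip, assuming placeholder MySQL connection details and hypothetical orders / orders_out tables:
# Pull the source table into HDFS in parallel
sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /staging/orders --num-mappers 8
# ... run the MR job against /staging/orders, writing its output to /staging/orders_out ...
# Push the results back into MySQL
sqoop export --connect jdbc:mysql://dbhost/sales --username etl -P --table orders_out --export-dir /staging/orders_out --num-mappers 8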
In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with. This is pretty tedious, and entirely mechanical. Sqoop auto-generates class definitions to deserialize the data from the database. These classes can also be used to store the results in Hadoop's SequenceFile format, which lets you take advantage of built-in compression within HDFS too. The classes are written out as .java files that you can incorporate into your own data processing pipeline later. The class definition is created by taking advantage of JDBC's ability to read metadata about databases and tables.
When Sqoop is invoked, it retrieves the table’s metadata, writes out the class definition for the columns you want to import, and launches a MapReduce job to import the table body proper.
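If you only want the generated class (for example, to reuse it in your own MR pipeline), a hedged sketch using placeholder connection details:
# Generate the record class and its compiled jar without running an import
sqoop codegen --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --outdir ./generated-src --bindir ./generated-bin
The resulting orders.java implements the (de)serialization you would otherwise have to hand-write for DBInputFormat.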

Move data from Oracle to HDFS, process it, and move it from HDFS to Teradata

My requirement is to
Move data from Oracle to HDFS
Process the data on HDFS
Move processed data to Teradata.
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB, and the processed data may be about the same size.
After searching a lot on the internet, I found the following options:
ORAOOP to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
Do the large-scale processing with custom MapReduce, Hive, or Pig.
Sqoop with the Teradata connector to move data from HDFS to Teradata (again, have the code in a shell script and schedule it).
Is this the right approach in the first place, and is it feasible for the required time period (please note that this is not a daily batch or anything like it)?
Other options that I found are the following:
Storm (for real-time data processing), but I am not able to find an Oracle spout or a Teradata bolt out of the box.
Any open-source ETL tool such as Talend or Pentaho.
Please share your thoughts on these options as well, and on any other possibilities.
Looks like you have several questions so let's try to break it down.
Importing into HDFS
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information: here and here. Note that you can also import directly into a Hive table with Sqoop, which could be convenient for your analysis.
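A hedged variant of the command above that does a direct Hive import, using the same placeholder connection details and a hypothetical Hive table name:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --hive-import --hive-table staging_tbl
This creates the Hive table if it does not exist and loads the data in one step, so it is queryable as soon as the job finishes.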
Processing
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, as you might be more familiar with its SQL-like syntax. Pig is closer to pure relational algebra and its syntax is not SQL-like; it is mostly a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
In Hive you could run your query and tell it to write the results in HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
Exporting into TeraData
Cloudera released a Teradata connector for Sqoop last year, as described here, so you should take a look, as this seems to be exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want; in the end, what matters is the size of your cluster. If you need it to be quick, scale the cluster up as needed. The good thing about Hive and Sqoop is that the processing is distributed across your cluster, so you have total control over the schedule.
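For the 15-minute requirement, a hedged sketch of the scheduling side, assuming a hypothetical wrapper script that chains the import, the Hive step, and the export:
# crontab entry: run the Oracle -> HDFS -> Teradata pipeline every 15 minutes
*/15 * * * * /opt/etl/oracle_to_teradata.sh >> /var/log/etl/pipeline.log 2>&1
In practice a workflow scheduler such as Oozie is the more common way to chain Sqoop and Hive actions on a Hadoop cluster, but a cron-driven shell script matches what you described.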
If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software’s SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.
I'm not sure whether you need to reprocess the entire data set each time or can just use the deltas. SharePlex also supports replicating the change data to a JMS queue, so it might be possible to create a spout that reads from that queue. You could probably also build your own trigger-based solution, but it would be a bit of work.
As a disclosure, I work for Dell Software.
