How do I output data in a MapReduce job for Sqoop to export? - hadoop

I've read a lot about importing from SQL using Sqoop, but there are only tidbits on exporting, and the examples always assume that you're exporting imported/pre-formatted data for some reason, or are using Hive.
How, from a MapReduce job, do I write data to HDFS that Sqoop can read and export?
This Sqoop documentation shows me the file formats supported. I guess I can use text/CSV, but how do I get there in MapReduce?
I've found this answer, which says to just modify the options for TextOutputFormat, but that just writes keys and values. My "values" are multiple fields/columns!
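For concreteness, here is a plain-Java sketch (class and method names are illustrative, not from Sqoop or Hadoop) of the formatting step in question: joining one record's fields into a single CSV line. In the real job, a reducer would emit this string as the Text value with a NullWritable key, so that TextOutputFormat writes one parseable record per line.

```java
import java.util.List;
import java.util.StringJoiner;

// Sketch of the value-formatting step only. In the actual MapReduce job,
// the reducer would build this string and emit it as a Text value with a
// NullWritable key under TextOutputFormat.
public class CsvLineBuilder {

    // Join the columns of one record into a single CSV line, quoting any
    // field that contains the delimiter, a quote, or a newline.
    public static String toCsvLine(List<String> fields) {
        StringJoiner joiner = new StringJoiner(",");
        for (String field : fields) {
            if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
                joiner.add("\"" + field.replace("\"", "\"\"") + "\"");
            } else {
                joiner.add(field);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(List.of("42", "O'Brien, Pat", "2015-01-01")));
        // prints: 42,"O'Brien, Pat",2015-01-01
    }
}
```

On the export side, the matching Sqoop input-parsing flags would be `--input-fields-terminated-by ','` and `--input-optionally-enclosed-by '"'`.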

Try using a storage format that carries a schema, like Avro or Parquet (the more buggy of the two). Then you can "query" those files and export their data into an RDBMS.
However, it looks like that support was a bit buggy/broken, and only worked properly if you created the files with Kite or Sqoop (which internally uses Kite).
http://grokbase.com/t/sqoop/user/1532zggqb7/how-does-sqoop-export-detect-avro-schema

I used the codegen tool to generate classes that could write to SequenceFiles:
sqoop/bin/sqoop-codegen --connect jdbc:sqlserver://... --table MyTable --class-name my.package.name.ClassForMyTable --outdir ./out/
And then I was able to read those in using Sqoop, exporting with the bulk setting. But the performance was abysmal. In the end, I instead just wrote simple CSV-like text files importable with the BCP tool, and what took hours with Sqoop completed in minutes.

Related

Can we use Sqoop to move any structured data file apart from moving data from RDBMS?

This question was asked to me in a recent interview.
As per my knowledge, we can use Sqoop to transfer data between an RDBMS and Hadoop ecosystem components (HDFS, Hive, Pig, HBase).
Can someone please help me find the answer?
As per my understanding, Sqoop can't move an arbitrary structured data file (like a CSV) into HDFS or other Hadoop ecosystem components like Hive, HBase, etc.
Why would you use Sqoop for this?
You can simply put any data file directly into HDFS using its REST, web, or Java APIs.
Sqoop is not meant for this type of use case.
The main purpose of sqoop import is to fetch data from an RDBMS in parallel.
Apart from that, Sqoop has Sqoop Import Mainframe.
The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to a directory on open systems. The records in a dataset can contain only character data. Records will be stored with the entire record as a single text field.

Options to Import Data from MySql DB to MapR DB/HBase

I have a single table in MySql which contains around 24,000,000 records. I need a way to import this data into a table in MapR DB with multiple column families. I initially chose Sqoop as the tool to import the data, but later found that I cannot use Sqoop to import it directly, as Sqoop does not support multiple-column-family imports yet.
I have populated the data in MapR FS using Sqoop from the MySql database.
What are my choices to import this data from MapR FS to MapR DB table with 3 column families?
It seems for bulk import, I have two choices:
ImportTSV tool: this probably requires the source data to be in TSV format. But the data that I have imported in MapR FS from MySql using Sqoop seems to be in the CSV format. What is the standard solution for this approach?
Write a custom Map Reduce program to translate the data in MapR FS to HFile and load it into MapR DB.
I just wanted to confirm that these are the only two choices available to load the data. This seems a bit restrictive, given that such a requirement is a very basic one in any system.
If custom Map Reduce is the way to go, an example or working sample would be really helpful.
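On the CSV-vs-TSV mismatch in option 1: ImportTSV can be told to use a different separator (e.g. `-Dimporttsv.separator=,`), and Sqoop can emit tab-separated files directly at import time via `--fields-terminated-by '\t'`, so an explicit conversion is often avoidable. If you do convert, here is a minimal sketch, assuming plain unquoted CSV output from Sqoop (the class name is mine):

```java
// Hypothetical converter, assuming Sqoop's default output: plain CSV with
// no quoted fields and no embedded commas.
public class CsvToTsv {

    public static String convert(String csvLine) {
        // ImportTSV rejects fields containing tabs, so replace any stray
        // tabs before swapping the delimiter.
        return csvLine.replace("\t", " ").replace(",", "\t");
    }

    public static void main(String[] args) {
        System.out.println(convert("1,alice,2015-01-01"));
    }
}
```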
Create a Hive table pointing to MapR DB using the HBaseStorageHandler; you can then use Sqoop to import into that Hive table.
If you have already landed the data in MapR FS, use the Hive LOAD command to load the data into MapR DB.

Data moving from RDBMS to Hadoop, using SQOOP and FLUME

I am in the process of learning Hadoop and stuck with few concepts on moving data from Relational database to Hadoop and vice versa.
I have transferred files from MySQL to HDFS using SQOOP import queries. The files I transferred were structured datasets and not any server log data. I recently read that we usually use Flume for moving log files into Hadoop.
My question is:
1. Can we use SQOOP as well for moving log files?
2. If yes, which of SQOOP or FLUME is preferred for log files, and why?
1) Sqoop can be used to transfer data between any RDBMS and HDFS. To use Sqoop, the data has to be structured, usually as specified by the schema of the database from which the data is being imported or exported. Log files are not always structured (it depends on the source and type of log), so Sqoop is not used for moving them.
2) Flume can collect and aggregate data from many different kinds of customizable data sources. It gives more flexibility in controlling which specific events to capture, and in running a user-defined workflow before storing them in, say, HDFS.
I hope this clarifies the difference between Sqoop and Flume.
SQOOP is designed to transfer data from an RDBMS to HDFS, whereas FLUME is for moving large amounts of log data.
Both are different and specialized for different purposes.
Like
You can use SQOOP to import data via JDBC (which you cannot do in FLUME),
and
You can use FLUME to say something like "I want to tail 200 lines of log file from this server".
Read more about FLUME here
http://flume.apache.org/
SQOOP not only transfers data from RDBMSs but also from NoSQL databases like MongoDB. You can transfer the data directly to HDFS or Hive.
When transferring data to Hive, you don't need to create the table beforehand; Sqoop takes the schema from the database itself.
Flume is used to fetch log data or streaming data.
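To make the Flume side concrete, here is a minimal agent configuration sketch for the "tail a log file into HDFS" case described above; all agent, host, and path names are illustrative:

```properties
# Illustrative Flume agent: tail an application log into HDFS.
agent1.sources = logsrc
agent1.channels = memch
agent1.sinks = hdfssink

# The exec source is simple but not durable; a spooling-directory or
# Taildir source is safer in production.
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/myapp/app.log
agent1.sources.logsrc.channels = memch

agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.channel = memch
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
agent1.sinks.hdfssink.hdfs.fileType = DataStream
agent1.sinks.hdfssink.hdfs.useLocalTimeStamp = true
```

An agent like this would be started with `flume-ng agent --name agent1 --conf-file agent1.properties`.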

Hadoop MapReduce DBInputFormat and DBOutputFormat

I need to import data from MYSQL, run a MR and export it back to MYSQL.
I am able to do it successfully in a single MR job for a few records using DBInputFormat and DBOutputFormat.
When I scale the input to 100+ million records, the MR job hangs.
The alternative is to export the data to HDFS, run the MR job, and push the results back to MySQL.
For a huge dataset of around 400+ million records, which option is the better one: using DBInputFormat and DBOutputFormat, or using HDFS as the data source and destination?
Using HDFS adds a step before and after my MR job.
Since the data is stored on HDFS, it would be replicated (default factor of 3) and will require more hard drive space.
Thanks
Rupesh
I think the best approach in such a situation is to use SQOOP. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases like MySQL or Oracle. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Please look into this link and explore Sqoop for details. SQOOP details
In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with. This is pretty tedious, and entirely algorithmic. Sqoop auto-generates class definitions to deserialize the data from the database. These classes can also be used to store the results in Hadoop's SequenceFile format, which allows you to take advantage of built-in compression within HDFS too. The classes are written out as .java files that you can incorporate into your own data processing pipeline later. The class definition is created by taking advantage of JDBC's ability to read metadata about databases and tables.
When Sqoop is invoked, it retrieves the table’s metadata, writes out the class definition for the columns you want to import, and launches a MapReduce job to import the table body proper.
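A hand-written class of the kind described above might look roughly like this; the field names are illustrative, and the Hadoop `Writable`/`DBWritable` plumbing is reduced to comments so the JDBC part stands out:

```java
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch only: in a real job this class would implement
// org.apache.hadoop.mapreduce.lib.db.DBWritable (readFields/write against
// JDBC) and org.apache.hadoop.io.Writable (serialization between tasks).
public class UserRecord {
    long id;
    String name;

    // Called once per row by DBInputFormat's record reader.
    public void readFields(ResultSet rs) throws SQLException {
        this.id = rs.getLong(1);
        this.name = rs.getString(2);
    }

    @Override
    public String toString() {
        // One tab-separated line per record, ready for TextOutputFormat.
        return id + "\t" + name;
    }

    public static void main(String[] args) {
        UserRecord r = new UserRecord();
        r.id = 7;
        r.name = "ann";
        System.out.println(r);   // prints "7\tann"
    }
}
```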

Move data from oracle to HDFS, process and move to Teradata from HDFS

My requirement is to
Move data from Oracle to HDFS
Process the data on HDFS
Move processed data to Teradata.
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB and the processed data also may be the same.
After searching a lot on the internet, I found that:
ORAOOP can move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
Large-scale processing can be done with custom MapReduce, Hive, or Pig.
The SQOOP Teradata connector can move data from HDFS to Teradata (again, have a shell script with the code and then schedule it).
Is this the right option in the first place, and is it feasible for the required time period (please note that this is not a daily batch or similar)?
Other options that I found are the following:
STORM (for real-time data processing). But I am not able to find an Oracle spout or a Teradata bolt out of the box.
Any open source ETL tools like Talend or Pentaho.
Please share your thoughts on these options as well and any other possibilities.
Looks like you have several questions so let's try to break it down.
Importing in HDFS
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@//myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information: here and here. Note that you can also import directly into a Hive table with Sqoop, which could be convenient for your analysis.
Processing
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, since you might be more familiar with SQL-like syntax. Pig is closer to pure relational algebra, and its syntax is NOT SQL-like; it is more a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
In Hive you could run your query and tell it to write the results in HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
Exporting into TeraData
Cloudera released a Teradata connector for Sqoop last year, as described here, so you should take a look; this sounds like exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want; in the end, what will matter is the size of your cluster. If you want it quick, scale your cluster up as needed. The good thing with Hive and Sqoop is that the processing will be distributed across your cluster, so you have total control over the schedule.
If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software’s SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.
I’m not sure if you need to reprocess the entire data set each time or can possibly just use the deltas. SharePlex also supports replicating the change data to a JMS queue. It might be possible to create a Spout that reads from that queue. You could probably also build your own trigger based solution but it would be a bit of work.
As a disclosure, I work for Dell Software.
