Export data from a database and write to HDFS (hadoop fs) - performance

I am trying to export data from a DB table and write it into HDFS.
My question is: will the NameNode become a bottleneck? And what is the mechanism? Does the NameNode cache a block (64 MB) and then hand it off to a DataNode?
Is there a better way than writing to HDFS directly? I feel this doesn't take advantage of parallelism.
Thanks :)

Have you considered using Sqoop? Sqoop can be used to extract data from any DB that supports JDBC and put it in HDFS.
http://www.cloudera.com/blog/2009/06/introducing-sqoop/
The Sqoop import command takes the number of map tasks to run (it defaults to 1). When parallelizing the work (map tasks > 1), the splitting column can be specified, or Sqoop will make a guess based on the sequence key of the table. Each map task will create a separate file for its results in the output directory. The NN will not be a bottleneck unless the number of files created is huge (the NN keeps the metadata about the files in memory).
Sqoop can also recognize the source DB (Oracle, MySQL or others) and use DB-specific tools such as mysqldump instead of the JDBC channel for better performance.
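For illustration, a minimal sketch of such an import; the connection string, table, and column names below are hypothetical placeholders:
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --target-dir /data/sales/orders
Here --split-by names the column used to divide the table among the mappers and --num-mappers sets the degree of parallelism; each of the 8 mappers writes its own part file under /data/sales/orders. Adding --direct asks Sqoop to use the DB-specific fast path where one is available.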

Related

Can we use Sqoop to move any structured data file apart from moving data from RDBMS?

I was asked this question in a recent interview.
As far as I know, we can use Sqoop to transfer data between an RDBMS and the Hadoop ecosystem (HDFS, Hive, Pig, HBase).
Can someone please help me find the answer?
As per my understanding, Sqoop can't move an arbitrary structured data file (like a CSV) into HDFS or other Hadoop ecosystem components such as Hive, HBase, etc.
Why would you use Sqoop for this?
You can simply put any data file directly into HDFS using its REST, web, or Java API.
Sqoop is not meant for this type of use case.
The main purpose of sqoop import is to fetch data from an RDBMS in parallel.
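As a hedged illustration of the direct route (host, port, and paths are placeholders):
# Command line: copy a local CSV straight into HDFS
hdfs dfs -put /local/data/customers.csv /user/hadoop/customers/
# REST (WebHDFS): the NameNode answers the first PUT with a redirect to a DataNode;
# the second PUT uploads the file body to the Location URL returned in that response
curl -i -X PUT "http://namenode.example.com:50070/webhdfs/v1/user/hadoop/customers.csv?op=CREATE"
curl -i -X PUT -T customers.csv "<Location-URL-from-previous-response>"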
Apart from that, Sqoop also has an import-mainframe tool.
The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to a directory on open systems. The records in a dataset can contain only character data, and each record is stored with the entire record as a single text field.
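A hedged sketch of that tool's invocation; the host, dataset name, and target directory are hypothetical:
sqoop import-mainframe \
  --connect mainframe.example.com \
  --dataset MYUSER.DATA.PDS \
  --username mfuser -P \
  --target-dir /data/mainframe/pds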

Save and access a table-like data structure in Hadoop

I want to save and access a table-like data structure in HDFS with MapReduce programming. Part of this data structure is shown in the following picture. It has tens of thousands of columns and hundreds of rows, and all nodes should have access to it.
My question is: how can I save this data structure in HDFS and access it with MapReduce programming? Should I use arrays? (Or Hive tables? Or HBase?)
Thank you.
HDFS is a distributed file system which stores your big files across distributed servers.
You can copy your files from the local system to HDFS using the command
hadoop fs -copyFromLocal /source/local/path /destination/hdfs/path
Once the copy completes, an external Hive table can be created on /destination/hdfs/path.
This table can then be queried using the Hive shell.
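A minimal sketch of such an external table, assuming a comma-delimited file; the table and column names are placeholders:
CREATE EXTERNAL TABLE my_table (
  row_id STRING,
  col_1 DOUBLE,
  col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/destination/hdfs/path';
Because the table is external, dropping it later leaves the files at /destination/hdfs/path untouched.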
Do consider Hive for this scenario. If you want to do table-style processing, as with a SAS dataset, an R data.frame/data.table, or a Python pandas DataFrame, an equivalent is almost always possible in SQL. Hive provides a powerful SQL abstraction over the MapReduce and Tez engines. If you want to graduate to Spark at some point, you can read Hive tables into DataFrames. As @sumit pointed out, you just need to transfer your data from local to HDFS (using the HDFS copyFromLocal or put command) and define an external Hive table on top of it.
If you want to write some custom map-reduce on this data, access the Hive table's backing data (most likely under /user/hive/warehouse). After reading the data from stdin, parse it in the mapper (the field separator can be found using describe extended <hive_table>) and emit it in key-value pair format.
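For example (hedged, with a placeholder table name):
hive -e "DESCRIBE EXTENDED my_table"          # shows the field delimiter and the table's HDFS location
hadoop fs -ls /user/hive/warehouse/my_table   # the raw files a custom map-reduce job would read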

Extracting data from Oracle to Hadoop. Is Sqoop a good idea?

I'm looking to extract some data from an Oracle database and transfer it to a remote HDFS file system. There appear to be a couple of possible ways of achieving this:
Use Sqoop. This tool will extract the data, copy it across the network and store it directly in HDFS.
Use SQL to read the data and store it on the local file system. When this has completed, copy (FTP?) the data to the Hadoop system.
My question: will the first method (which is less work for me) cause Oracle to lock tables for longer than required?
My worry is that Sqoop might take out a lock on the database when it starts to query the data, and that this lock won't be released until all of the data has been copied across to HDFS. Since I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency), the lock will be held longer than would otherwise be required.
Sqoop issues ordinary select queries on the Oracle database, so it takes the same locks that a select query would. No additional locking is performed by Sqoop.
Data will be transferred in several concurrent tasks (mappers). Any expensive function call will put a significant performance burden on your database server. Advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel, which will adversely affect transfer performance.
For efficient advanced filtering, run the filtering query on your database prior to the import, save its output to a temporary table, and run Sqoop to import the temporary table into Hadoop without the --where parameter.
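A hedged sketch of that staging pattern; the Oracle connection string, table, and column names are hypothetical:
# Step 1 (run in Oracle, not shown here): CREATE TABLE ORDERS_STAGING AS
#   SELECT ... FROM ORDERS WHERE <your filtering conditions>;
# Step 2: import the pre-filtered table, no --where needed
sqoop import \
  --connect jdbc:oracle:thin:@//db.example.com:1521/ORCL \
  --username etl_user -P \
  --table ORDERS_STAGING \
  --split-by ORDER_ID \
  --target-dir /data/orders_staging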
Sqoop import has nothing to do with copying the data across the network: Sqoop stores the data at one location, and HDFS replicates it based on the cluster's replication factor.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a MapReduce job on my data and now I want to query it so I can put it on my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts the SQL statements into a bunch of MapReduce jobs with intermediate stages in between. Also, Hive is a high-latency system, i.e. depending on your dataset sizes you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results of your MapReduce job on your website, it's highly recommended that you export the results back to an RDBMS using Sqoop and serve them from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, then another option you could consider is a NoSQL system like HBase.
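A hedged sketch of that export step; the connection string, table, and paths are placeholders, and the target table must already exist in the database:
sqoop export \
  --connect jdbc:mysql://db.example.com/webapp \
  --username app_user -P \
  --table wordcount_results \
  --export-dir /user/hadoop/wordcount/output \
  --input-fields-terminated-by '\t'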
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data into and out of your HDFS cluster. The video is easy to watch and describes the advantages/disadvantages of each tool, but this outline should give you the basics of the Hadoop ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
Hbase - Allows interactive access to HDFS. Sits on top of HDFS and applies structure to data. Allows random reads and scales horizontally with the cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Indexes on the row key only. Does not use the MapReduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll benefit from it most if your data is already organized (for instance, into columns).
Let's take an example: your map-reduce job produced a CSV file named wordcount.csv containing two columns: word and count. This CSV file is on HDFS.
Now suppose you want to know the count of the word "gloubiboulga". You can achieve this with the following code:
CREATE TABLE data
(
word STRING,
`count` INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, `count` from data where word = 'gloubiboulga';
Please note that while this language looks very much like SQL, you'll still have to learn a few things about it.

Hadoop MapReduce DBInputFormat and DBOutputFormat

I need to import data from MySQL, run an MR job, and export the results back to MySQL.
I am able to do this successfully in a single MR job for a few records using DBInputFormat and DBOutputFormat.
When I scale the input to 100+ million records, the MR job hangs.
The alternative to this is to export the data to HDFS, run the MR job, and push the results back to MySQL.
For a huge dataset of around 400+ million records, which option is the better one: using DBInputFormat and DBOutputFormat, or using HDFS as the data source and destination?
Using HDFS adds a step before and after my MR job.
Since the data is stored on HDFS it would be replicated (default 3) and will require more hard drive space.
Thanks
Rupesh
I think the best approach in this situation is to use Sqoop. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases like MySQL or Oracle. Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Please look into this link and explore Sqoop for details: SQOOP details
In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with. This is pretty tedious, and entirely algorithmic. Sqoop auto-generates class definitions to deserialize the data from the database. These classes can also be used to store the results in Hadoop's SequenceFile format, which allows you to take advantage of built-in compression within HDFS too. The classes are written out as .java files that you can incorporate into your own data processing pipeline later. The class definition is created by taking advantage of JDBC's ability to read metadata about databases and tables.
When Sqoop is invoked, it retrieves the table’s metadata, writes out the class definition for the columns you want to import, and launches a MapReduce job to import the table body proper.
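A hedged sketch of what that looks like on the command line; the connection details and table name are hypothetical:
sqoop import \
  --connect jdbc:mysql://db.example.com/warehouse \
  --username etl_user -P \
  --table transactions \
  --split-by txn_id \
  --num-mappers 8 \
  --as-sequencefile \
  --target-dir /data/transactions
# The generated transactions.java can also be produced on its own with 'sqoop codegen'
# and reused in a follow-on MapReduce job to deserialize the imported records.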
