Extracting Data from Oracle to Hadoop. Is Sqoop a good idea?

I'm looking to extract some data from an Oracle database and transfer it to a remote HDFS file system. There appear to be a couple of possible ways of achieving this:
Use Sqoop. This tool will extract the data, copy it across the network and store it directly into HDFS
Use SQL to read the data and store it on the local file system. When this has completed, copy (ftp?) the data to the Hadoop system.
My question: will the first method (which is less work for me) cause Oracle to lock tables for longer than required?
My worry is that Sqoop might take out a lock on the database when it starts to query the data, and that this lock won't be released until all of the data has been copied across to HDFS. Since I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency), the lock will be held longer than would otherwise be required.

Sqoop issues ordinary SELECT queries against the Oracle database, so it takes the same locks a plain SELECT would. No additional locking is performed by Sqoop.
Data will be transferred in several concurrent tasks (mappers). Any expensive function call will put a significant performance burden on your database server. Advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel, which will adversely affect transfer performance.
For efficient advanced filtering, run the filtering query on your database prior to the import, save its output to a temporary table, and run Sqoop to import the temporary table into Hadoop without the --where parameter.
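For example, a minimal sketch of that pattern, assuming the filtered rows were already staged into a table called STAGE_ORDERS (the connection string, credentials, table and directory names are placeholders):
# Sketch: import a pre-filtered staging table in parallel, no --where needed.
# Connection string, credentials, table and directory names are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table STAGE_ORDERS \
  --split-by ORDER_ID \
  --num-mappers 4 \
  --target-dir /data/staging/orders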
Sqoop import has nothing to do with copying the data across the network: Sqoop stores the data at one location, and HDFS then replicates it based on the cluster's replication factor.

Related

How to move data from RDBMS to hadoop without Sqoop?

I need to move huge amounts of data from an RDBMS to Hadoop without using Sqoop. I have a database of 2,200 tables, and using Sqoop to import them into HDFS is a hectic job that consumes a lot of time, and hitting the database with a SELECT each time affects its performance. I have more sources to move from the RDBMS to HDFS, and I query the files in HDFS with Hive. Can someone suggest a more efficient way?
You could always do it manually with any back-end code: read the data from the database and stream-write it to HDFS. Then in your application configuration you can have any customization you need (threads, timeouts, batch sizes, etc.), and it's a rather straightforward solution. We tried this once for some reason I don't remember, but mostly we use Sqoop and have no issues with it. You could also make a copy (some kind of replica) of the database that is not used by any external systems other than your Sqoop job, so that user SELECTs would not affect performance.
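As a rough illustration of the "read from the database and stream straight into HDFS" idea, here is a minimal sketch that pipes a query result into hdfs dfs -put (which reads stdin when given "-"); the connection string, query and paths are made up:
# Sketch: stream a query result into HDFS without landing it on local disk first.
# Connection string, query and HDFS path are placeholders.
sqlplus -s scott/tiger@//dbhost:1521/ORCL <<'SQL' | hdfs dfs -put - /data/raw/customers.psv
set pagesize 0 feedback off heading off
select customer_id || '|' || name || '|' || city from customers;
exit
SQL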

When should I use Greenplum Database versus HAWQ?

We have a use case for retail industry data and we are building an EDW.
We currently do our reporting from HAWQ, but we want to shift our MPP database from HAWQ to Greenplum.
Basically, we would like to make changes to our current data pipeline.
Our points of confusion about GPDB:
How is the GPDB layer going to affect our existing data pipeline? The current pipeline is external system --> Talend --> Hadoop/HAWQ --> Tableau. We want to transform it into external system --> Talend --> Hadoop/HAWQ --> Greenplum --> Tableau.
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Which file format should I opt for when storing files in GPDB, given that in HAWQ we store files in plain text format? Which of the supported formats (Avro, Parquet, etc.) are good for writing in GPDB?
How are data files processed by GPDB, so that it also brings faster reporting and predictive analysis?
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ into Greenplum.
Any help would be much appreciated.
This question is sort of like asking, "When should I use a wrench?" The answer is going to be subjective, as Greenplum can be used for many different things. But I will do my best to give my opinion, because you asked.
How is the GPDB layer going to affect our existing data pipeline? The current pipeline is external system --> Talend --> Hadoop/HAWQ --> Tableau. We want to transform it into external system --> Talend --> Hadoop/HAWQ --> Greenplum --> Tableau.
There are lots of ways to build this data pipeline; your goal of loading data into Hadoop first and then loading it into Greenplum is very common and works well. You can use external tables in Greenplum to read data in parallel, directly from HDFS, so the data movement from the Hadoop cluster to Greenplum can be achieved with a simple INSERT statement:
INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
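The external table itself might be defined roughly like this; treat it as a sketch only, since the column list is made up and the location protocol depends on your Greenplum version (older releases use gphdfs, newer ones use PXF):
# Sketch: readable external table over an HDFS file, then a parallel load.
# Host, port, path, columns and the gphdfs protocol choice are assumptions.
psql -d warehouse <<'SQL'
CREATE EXTERNAL TABLE hdfs_customer_file (
    customer_id bigint,
    name        text,
    city        text
)
LOCATION ('gphdfs://namenode:8020/data/raw/customers.psv')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
SQL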
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Isolation, for one. With a separate cluster for Greenplum, you can provide analytics to your customers without impacting the performance of your Hadoop activity, and vice versa. This isolation can also provide an additional security layer.
Which file format should I opt for when storing files in GPDB, given that in HAWQ we store files in plain text format? Which of the supported formats (Avro, Parquet, etc.) are good for writing in GPDB?
With your data pipeline as you suggested, I would make the data format decision in Greenplum based on performance. For large tables, partition them and make them column-oriented with QuickLZ compression. For smaller tables, just make them append-optimized. For tables with lots of updates or deletes, keep the default heap storage.
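As a sketch of what those three storage choices look like in Greenplum DDL (all table, column and distribution keys here are made up):
# Sketch: three storage styles; table, column and distribution keys are placeholders.
psql -d warehouse <<'SQL'
-- Large fact table: partitioned, column-oriented, quicklz-compressed.
CREATE TABLE sales_fact (sale_id bigint, sale_date date, amount numeric)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
    (START (date '2016-01-01') END (date '2017-01-01') EVERY (INTERVAL '1 month'));

-- Smaller table: plain append-optimized.
CREATE TABLE store_dim (store_id int, store_name text)
WITH (appendonly=true)
DISTRIBUTED BY (store_id);

-- Frequently updated/deleted table: default heap storage.
CREATE TABLE inventory_status (sku text, qty int)
DISTRIBUTED BY (sku);
SQL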
How are data files processed by GPDB, so that it also brings faster reporting and predictive analysis?
Greenplum is an MPP database. The storage is "shared nothing" meaning that each node has unique data that no other node has (excluding mirroring for high-availability). A segment's data will always be on the local disk.
In HAWQ, because it uses HDFS, the data for a segment doesn't have to be local. On day 1, when you wrote the data to HDFS, it was local, but after node failures, expansion, etc., HAWQ may have to fetch the data from other nodes. This makes Greenplum's performance a bit more predictable than HAWQ's, because of how Hadoop works.
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ into Greenplum.
Push, no; pull, yes. As I mentioned above, you can create an external table in Greenplum to SELECT data from HDFS. You can also create writable external tables in Greenplum to push data to HDFS.
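A rough sketch of the writable direction (again, the table names, protocol and path are assumptions):
# Sketch: rows INSERTed into a writable external table are written out to HDFS.
psql -d warehouse <<'SQL'
CREATE WRITABLE EXTERNAL TABLE hdfs_customer_out (LIKE greenplum_customer)
LOCATION ('gphdfs://namenode:8020/data/export/customers')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO hdfs_customer_out SELECT * FROM greenplum_customer;
SQL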

Will Hadoop (Sqoop) load Oracle faster than SQL*Loader?

We presently load CDRs into an Oracle warehouse using a combination of bash shell scripts and SQL*Loader with multiple threads. We are hoping to offload this process to Hadoop, because we envisage that the increase in data due to a growing subscriber base will soon max out the current system, and we also want to gradually introduce Hadoop into our data warehouse environment.
Will loading from Hadoop be faster?
If so, what is the best set of Hadoop tools for this?
Further info:
We usually get a continuous stream of pipe-delimited text files through FTP into a folder, add two more fields to each record, load them into temp tables in Oracle, and run a procedure to load the final table. How would you advise the process flow should look in terms of tools to use? For example:
files are FTPed to the Linux file system (or is it possible to FTP straight to Hadoop?) and Flume loads them into Hadoop;
fields are added (what is the best tool for this? Pig, Hive, Spark or any other recommendations?);
the files are then loaded into Oracle using Sqoop;
the final procedure is called (can Sqoop make an Oracle procedure call? If not, what tool would be best to execute the procedure and help control the whole process?).
Also, how can one control the level of parallelism? Does it equate to the number of mappers running the job?
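On the parallelism question: in Sqoop the degree of parallelism is simply the number of mappers, set with -m / --num-mappers. A hedged sketch of the export step for pipe-delimited files (the connection string, credentials, table and directory are placeholders):
# Sketch: push pipe-delimited HDFS files into an Oracle staging table.
# Connection string, credentials, table and directory are placeholders.
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/DWH \
  --username loader -P \
  --table CDR_STAGE \
  --export-dir /data/cdr/enriched \
  --input-fields-terminated-by '|' \
  --num-mappers 8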
I had a similar task of exporting data from a <6-node Hadoop cluster to an Oracle data warehouse.
I've tested the following:
Sqoop
OraOop
Oracle Loader for Hadoop from the "Oracle BigData Connectors" suite
A Hadoop streaming job that uses sqlldr (SQL*Loader) as the mapper; in its control file you can read from stdin using: load data infile "-"
Considering just speed, the Hadoop streaming job with sqlldr as the mapper was the fastest way to transfer the data, but you have to install sqlldr on each machine of your cluster. It was more of a personal curiosity; I would not recommend this way of exporting data, as the logging capabilities are limited and it has a bigger impact on your data warehouse's performance.
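For the curious, the streaming variant looked roughly like this; treat it as a sketch, since the jar path, control file and credentials are placeholders (the control file uses infile "-" so that sqlldr reads the mapper's stdin):
# Sketch: a map-only streaming job where each mapper pipes its split into sqlldr.
# Jar path, HDFS paths, control file and credentials are placeholders.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -input /data/cdr/enriched \
  -output /tmp/sqlldr_job_output \
  -mapper "sqlldr userid=loader/secret@DWH control=cdr.ctl" \
  -file cdr.ctl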
The winner was Sqoop: it is pretty reliable, it is the import/export tool of the Hadoop ecosystem, and it was the second fastest solution according to my tests (1.5x slower than first place).
Sqoop with OraOop (last updated in 2012) was slower than the latest version of Sqoop and required extra configuration on the cluster.
Finally, the worst time was obtained using Oracle's BigData Connectors; if you have a big cluster (>100 machines) then it should not be as bad as the time I obtained. The export was done in two steps: the first step preprocesses the output and converts it to an Oracle format that plays nicely with the data warehouse, and the second step transfers the result to the data warehouse. This approach is better if you have a lot of processing power, and it does not impact the data warehouse's performance as much as the other solutions.

HBase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure over the underlying files in HDFS, so that we can give the user querying abilities over the HDFS files. Correct me if I'm wrong here.
HBase: again, we create a similar table structure, but in a bit more structured (column-oriented) way, again over the HDFS file system.
Aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with a traditional RDBMS (but not entirely), and HQL syntax is almost the same as SQL, which is good for a database programmer from a learning perspective, whereas HBase is completely different in the sense that it can be queried only on the basis of its row key.
If you design a table in an RDBMS, you follow a structured approach to defining columns, concentrating on attributes, while in HBase the whole design is concentrated around the data: we design the table depending on the type of query to be used, and in HBase the columns are dynamic and can change at runtime (a core feature of NoSQL).
You said "aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce". It is not that simple: when a Hive query is executed, a MapReduce job is created and triggered, and depending on the data size and complexity it may take time, since for each MapReduce job the JobTracker has a number of steps to perform, initializing tasks such as map, combine, shuffle/sort, reduce, etc.
But when we access HBase, it directly looks up the data it has indexed, based on the specified Scan or Get parameters; in other words, it just acts as a database.
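A small illustration of that access pattern from the HBase shell (the table name, row keys and scan bounds are made up):
# Sketch: HBase reads are keyed lookups (Get) or range scans (Scan), not SQL.
echo "get 'user_profiles', 'user42'" | hbase shell
echo "scan 'user_profiles', {STARTROW => 'user4', LIMIT => 10}" | hbase shell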
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond a SQL way to write map/reduce jobs: with what Hortonworks calls the "Stinger initiative" they have added a dedicated file format (ORC) and are improving Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop, i.e. a relatively fast way to run analytic queries on data stored in Hadoop.
Hive:
It just creates a tabular structure over the underlying files in HDFS, so that users get SQL-like querying abilities over existing HDFS files, with typical latencies of up to minutes.
However, for best performance it is recommended to ETL data into Hive's ORC format.
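A minimal sketch of that ETL step, assuming an existing text-backed table named raw_events (all names here are made up):
# Sketch: copy an existing text-format Hive table into an ORC-backed one.
hive -e "CREATE TABLE events_orc STORED AS ORC AS SELECT * FROM raw_events;"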
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the page—for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
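In the HBase shell, a table along those lines might be created roughly as follows; this is only a sketch, with family names chosen to match the description above:
# Sketch: webtable-style schema with versioned content and link/metadata families.
hbase shell <<'EOF'
create 'webtable', {NAME => 'contents', VERSIONS => 3}, 'anchor', 'inbound', 'metadata'
put 'webtable', 'org.hbase.www', 'contents:html', '<html>...</html>'
EOF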
The fact that HBase uses HDFS is just an implementation detail: it allows HBase to run on an existing Hadoop cluster and it guarantees redundant storage of the data, but it is not a feature in any other sense.
Also, is it true that we can't create an HBase table over an already existing HDFS file?
That's right, you can't: internally, HBase stores its data in its own HFile format.

Export data from a database and write it to HDFS (hadoop fs)

I am trying to export data from a DB table and write it into HDFS.
The problem is: will the NameNode become a bottleneck? And what is the mechanism, will the NameNode cache a slice (64 MB) and then hand it to a DataNode?
Is there any better way than writing to HDFS directly? Because I think that doesn't take advantage of parallelism.
Thanks :)
Have you considered using Sqoop? Sqoop can be used to extract data from any DB that supports JDBC and put it in HDFS.
http://www.cloudera.com/blog/2009/06/introducing-sqoop/
The Sqoop import command takes the number of map tasks to be run (it defaults to 1). Also, when parallelizing the work (map tasks > 1), the splitting column can be specified, or Sqoop will make a guess based on the table's sequence key. Each map task creates a separate file for its results in the output directory. The NN will not be a bottleneck unless the number of files created is huge (the NN keeps the metadata about files in memory).
Sqoop can also detect the source DB (Oracle, MySQL or others) and use DB-specific tools like mysqldump and mysqlimport instead of the JDBC channel for better performance.
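A hedged sketch of such a parallel import (the connection string, credentials, table and column names are placeholders):
# Sketch: parallelism is controlled by --num-mappers; --split-by picks the split column.
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username reader -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /data/shop/orders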
