Load 2TB of data from Hive to a local server - hadoop

I have 2TB of data in my Hadoop cluster, stored in a Hive database, and I would like to bring this data onto my local server. I used Hive to perform this task through the beeline CLI, roughly as below:
use db1;
for i in T1 T2 T3 ...; do
export table $i to '/tmp/$i';
done
(Note: you may notice some errors in the query above; it isn't exactly the syntax I used, but it's close enough and it works for me, so please don't focus on the query itself.)
This query is really slow at completing the task, so what I'm actually looking for is another solution, such as Sqoop, hadoop fs -get /user/hive/warehouse/database.db, or even Hive itself, that can do this as fast as possible.
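For reference, the hadoop fs -get route mentioned above can be sketched roughly as follows; this is only a sketch that assumes the default warehouse location and a placeholder local destination (/data/hive_dump) with at least 2TB free:

# copy the whole database directory out of HDFS to the local filesystem
hadoop fs -get /user/hive/warehouse/db1.db /data/hive_dump/
# individual tables live in subdirectories such as /user/hive/warehouse/db1.db/T1,
# so they can also be pulled one at a time if a single transfer is too large

This copies the files in whatever format the tables are stored in (text, ORC, etc.), so it only helps if the local side can read that format.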

Related

Data cleaning in HDFS without using Hive

Is there an option like hadoop fs -sed? Essentially, I am trying to replace "\" with "something" in my data directly in HDFS, without having to bring the data down to local storage and load it back.
Currently I am using getmerge to bring the data to local, clean it, and load it back to HDFS with copyFromLocal. It takes a lot of time this way. Is there an easier or faster way of doing this character replacement?
Not clear why you'd use Hive for this anyway.
Pig or Spark are far better options that don't require an explicit schema for the data.
See Pig REPLACE function
In any case, Hadoop CLI has no sed option
Another option would be NiFi, but that requires more setup, and is overkill for this task.
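To make the Pig suggestion concrete, here is a minimal sketch of the REPLACE approach, run entirely inside the cluster so the data never has to come down to local storage; the paths are placeholders and it assumes the records can be read as plain text lines:

# write the Pig script (quoted heredoc so nothing is shell-expanded), then run it
cat > replace_backslash.pig <<'EOF'
-- read each record as one whole line of text
raw = LOAD '/data/input' USING TextLoader() AS (line:chararray);
-- REPLACE takes a Java regex, so a literal backslash is written as '\\\\'
cleaned = FOREACH raw GENERATE REPLACE(line, '\\\\', 'something');
STORE cleaned INTO '/data/output';
EOF
pig replace_backslash.pig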

How to move data from RDBMS to Hadoop without Sqoop?

I need to move a huge amount of data from an RDBMS to Hadoop without using Sqoop. I have a database of 2200 tables, and using Sqoop to import them to HDFS is a hectic job that consumes a lot of time, and hitting the database with a select for each table affects its performance. I have more sources to move from RDBMS to HDFS, and I query the files in HDFS with Hive. Can someone suggest a more efficient way?
You could always do it manually with any back-end code: read data from the database and stream-write it to HDFS. Then in your application configuration you can have any customization you need (threads, timeouts, batch sizes, etc.), and this is a rather straightforward solution. We tried it once for some reason I don't remember, but mostly we use Sqoop and have no issues with it. You could also keep a copy (some kind of replica) of the database that is not used by any external system other than your Sqoop job, so user selects would not affect performance.
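The same "roll your own export" idea can be prototyped from the shell by piping a client dump straight into HDFS; this is only a sketch that assumes a MySQL-style command-line client and placeholder table/path names (a real implementation would add the threading and batching mentioned above):

# dump one table with the database's CLI client and stream it into HDFS;
# the "-" source makes hdfs dfs -put read from stdin
mysql -h dbhost -u etl -p"$DB_PASS" -B -e "SELECT * FROM mydb.orders" \
  | hdfs dfs -put -f - /staging/mydb/orders.tsv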

How to load incremental records from Oracle to HDFS on a daily basis, and can we use Sqoop or MR jobs? Which is the preferred method?

How can I load incremental records from Oracle to HDFS on a daily basis? Can we use Sqoop or MR jobs?
Sqoop is designed exactly for this purpose, and will result in MR jobs that do the work of copying data. There are several methods of determining what is new in the Oracle table, for example using the table's id, or perhaps a date modified field if you have one.
Compared to most things in Hadoop, Sqoop is pretty easy. Here's a link to the docs -- search for "incremental" or start with section 7.2.9 for more info: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
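As a hedged illustration of the incremental append mode described in that section (the connection string, credentials, table, and column names below are placeholders):

sqoop import \
  --connect jdbc:oracle:thin:@oraclehost:1521/ORCL \
  --username loader --password-file /user/loader/.oracle.pw \
  --table SALES --target-dir /data/sales \
  --incremental append --check-column SALE_ID --last-value 1000000
# use --incremental lastmodified with a date-modified column if rows are updated in place;
# a saved job (sqoop job --create ...) remembers --last-value between daily runs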
FYI: once you get this working normally, check out the Sqoop extension designed to work with Oracle databases; it uses a very efficient method of streaming data directly, making the process even faster and lighter on your Oracle DB.

Will Hadoop (Sqoop) load Oracle faster than SQL*Loader?

We presently load CDRs into an Oracle warehouse using a combination of bash shell scripts and SQL*Loader with multiple threads. We are hoping to offload this process to Hadoop because we envisage that the increase in data due to a growing subscriber base will soon max out the current system. We also want to gradually introduce Hadoop into our data warehouse environment.
Will loading from hadoop be faster?
If so, what is the best set of Hadoop tools for this?
Further info:
We usually get a continuous stream of pipe-delimited text files via FTP into a folder, add two more fields to each record, load them into temp tables in Oracle, and run a procedure to load them into the final table. How would you advise the process flow should look in terms of tools to use? For example:
Files are FTPed to the Linux file system (or is it possible to FTP straight to Hadoop?) and Flume loads them into Hadoop.
Fields are added (what would be best for this? Pig, Hive, Spark, or any other recommendations?).
Files are then loaded into Oracle using Sqoop.
The final procedure is called (can Sqoop make an Oracle procedure call? If not, what tool would be best to execute the procedure and help control the whole process?).
Also, how can one control the level of parallelism? Does it equate to the number of mappers running the job?
I had a similar task of exporting data from a <6-node Hadoop cluster to an Oracle data warehouse.
I've tested the following:
Sqoop
OraOop
Oracle Loader for Hadoop from the "Oracle BigData Connectors" suite
A Hadoop streaming job which uses SQL*Loader (sqlldr) as the mapper; in its configuration you can read from stdin using: load data infile "-"
Considering speed alone, the Hadoop streaming job with SQL*Loader as the mapper was the fastest way to transfer the data, but you have to install SQL*Loader on each machine of your cluster. It was more of a personal curiosity; I would not recommend exporting data this way, since the logging capabilities are limited and it should have a bigger impact on your data warehouse's performance.
The winner was Sqoop: it is pretty reliable, it is the import/export tool of the Hadoop ecosystem, and it was the second-fastest solution according to my tests (about 1.5x slower than first place).
Sqoop with OraOop (last updated 2012) was slower than the latest version of Sqoop, and requires extra configuration on the cluster.
Finally, the worst time was obtained using Oracle's Big Data Connectors; if you have a big cluster (>100 machines) it should not be as bad as the time I obtained. The export was done in two steps. The first step involves reprocessing the output and converting it to an Oracle format that plays nicely with the data warehouse. The second step was transferring the result to the data warehouse. This approach is better if you have a lot of processing power, and it would not impact the data warehouse's performance as much as the other solutions.
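On the parallelism question above: Sqoop's -m / --num-mappers flag sets how many parallel map tasks (and therefore concurrent database sessions) run the transfer. A hedged example of the HDFS-to-Oracle export, with placeholder connection details and a pipe delimiter to match the source files:

sqoop export \
  --connect jdbc:oracle:thin:@dwhost:1521/DWH \
  --username loader --password-file /user/loader/.dw.pw \
  --table CDR_STAGING --export-dir /data/cdrs/enriched \
  --input-fields-terminated-by '|' \
  -m 8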

Move data from Oracle to HDFS, process it, and move it to Teradata from HDFS

My requirement is to:
Move data from Oracle to HDFS
Process the data on HDFS
Move processed data to Teradata.
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB and the processed data also may be the same.
After searching a lot on the internet, I found that:
OraOop can move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
Large-scale processing can be done with custom MapReduce, Hive, or Pig.
The Sqoop Teradata connector can move data from HDFS to Teradata (again, have a shell script with the code and schedule it).
Is this the right approach in the first place, and is it feasible for the required time period (please note that this is not a daily batch or similar)?
Other options that I found are the following:
Storm (for real-time data processing), but I am not able to find an Oracle spout or a Teradata bolt out of the box.
Any open-source ETL tools like Talend or Pentaho.
Please share your thoughts on these options as well and any other possibilities.
Looks like you have several questions so let's try to break it down.
Importing into HDFS
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information, see here and here. Note that you can also import directly into a Hive table with Sqoop, which could be convenient for doing your analysis.
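As a hedged aside, the import-directly-into-Hive variant just adds the Hive flags to the same placeholder command:

sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy \
  --table tbl --hive-import --hive-table tbl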
Processing
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, since you might be more familiar with SQL-like syntax. Pig is closer to pure relational algebra and its syntax is NOT SQL-like; it is more a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
In Hive you could run your query and tell it to write the results in HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
Exporting into TeraData
Cloudera released a Sqoop connector for Teradata last year, as described here, so you should take a look, as it seems to be exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want; in the end, what will matter is the size of your cluster. If you want it quick, then scale your cluster up as needed. The good thing about Hive and Sqoop is that the processing is distributed across your cluster, so you have total control over the schedule.
If you have concerns about the overhead or latency of moving the data from Oracle into HDFS, a possible commercial solution might be Dell Software’s SharePlex. They recently released a connector for Hadoop that would allow you to replicate table data from Oracle to Hadoop. More info here.
I’m not sure if you need to reprocess the entire data set each time or can possibly just use the deltas. SharePlex also supports replicating the change data to a JMS queue. It might be possible to create a Spout that reads from that queue. You could probably also build your own trigger based solution but it would be a bit of work.
As a disclosure, I work for Dell Software.
