How to store Hadoop data into Oracle

My final table is in Hive (HDFS).
I have tried:
1) Sqoop
2) SQL*Loader
3) OraOop
The performance of all of them is very discouraging when pushing the data into the Oracle database. I have to load a 1 TB file, and 1 GB alone takes around 8 minutes overall (1,297,372,920 rows) on a 5-node cluster, whether with Sqoop, OraOop, or SQL*Loader.

Your Sqoop export to Oracle speed will be determined by various factors, including data size and characteristics, network performance, and perhaps most importantly the target database server's configuration. Since the current release of Sqoop doesn't allow the use of "direct" mode when exporting data to Oracle, the optimizations available in this use case are limited. I'd strongly encourage you to review the documentation (http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_export_literal) and try to get yourself into a position where you can work with incremental imports/exports, since you're displeased with the latency on your 1 TB dataset. Perhaps go with an initial full load of your entire desired dataset and find a way to update only incrementally from there.
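As an illustration of the kind of tuning that is available, here is a hedged sketch of a batched Sqoop export. The connection string, table name, delimiter, and mapper count are placeholders rather than values from the question, so validate them on a small slice of data first.

# Batched JDBC inserts: many rows per statement, many statements per commit.
sqoop export \
  -Dsqoop.export.records.per.statement=1000 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username dwh_user -P \
  --table TARGET_TABLE \
  --export-dir /user/hive/warehouse/final_table \
  --input-fields-terminated-by '\001' \
  --batch \
  --num-mappers 8

Raising --num-mappers only helps while the Oracle side can keep up; as noted above, the target database server's configuration usually ends up being the bottleneck.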

Related

Teradata Fast export (or TPT) vs Sqoop export

Edit: I need to identify which is better for exporting a huge data volume from Teradata: Sqoop, TPT, or fexp.
OP: I am already aware that Teradata's FastExport and TPT cannot be used to export data directly to Hadoop. I can bring the data to a local environment and move it to Hadoop in parallel.
I want to know which tool extracts the data from Teradata in the most efficient way.
I have to extract a dataset with a huge data volume (almost 25 billion records, ~15 TB in size).
Of course the data in Teradata is well partitioned and I am going to split my extraction strategy based on partitions and Unique PI.
I was not able to find enough content providing a direct comparison between the Teradata utilities and Sqoop.
Which tool would make the least impact on currently running jobs in the Teradata environment and extract the data in the most optimized way?
Of course Teradata's FastExport can't be used to export to Hadoop directly; it's an old legacy tool which is no longer being enhanced.
Any new development should be done using TPT, e.g.
Using the DataConnector Operator to Write Files and Tables in Hadoop
Common Data Movement Jobs

What is the best way to ingest data from Teradata into Hadoop with Informatica?

What is the best way to ingest data from a Teradata database into Hadoop with parallel data movement?
If we create a job which simply opens one session to the Teradata database, it will take a lot of time to load a huge table.
If we create a set of sessions to load the data in parallel, and also run a SELECT in each of the sessions, then it will produce a set of full table scans on Teradata to generate the data.
What is the recommended best practice to load data in parallel streams without putting unnecessary workload on Teradata?
If Teradata supports table partitioning like Oracle does, you could try reading the table based on partitioning points, which will enable parallelism in the reads.
Another option is to split the table into multiple slices, for example by adding a WHERE clause on an indexed column. This will ensure an index scan and avoid a full table scan.
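To make the split idea concrete, here is a hedged sketch of a plain Sqoop import that parallelises the read by ranges of an indexed numeric column; the JDBC URL, credentials, and table/column names are invented for illustration.

# Sqoop computes MIN/MAX of the split column and hands each mapper a range,
# so the sessions read disjoint slices instead of repeating a full table scan.
sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=sales_db \
  --driver com.teradata.jdbc.TeraDriver \
  --username etl_user -P \
  --table SALES_FACT \
  --split-by SALES_ID \
  --num-mappers 8 \
  --target-dir /data/staging/sales_fact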
The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well.
Informatica Big Data Edition uses a standard Sqoop invocation via the command line, submitting a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is the digest from this documentation (you will find that this connector supports different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
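A hedged invocation sketch based on the Cloudera documentation referenced above; the host, database, and table names are placeholders, and the exact connector-specific argument names should be double-checked against your connector version.

# Connector-specific options are passed after the "--" separator.
sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=sales_db \
  --username etl_user --password-file /user/etl/.td_pass \
  --table SALES_FACT \
  --target-dir /data/staging/sales_fact \
  --num-mappers 8 \
  -- --input-method split.by.amp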
If you use partition names in the SELECT clause, PowerCenter will select only the rows within that partition, so there won't be duplicate reads (don't forget to choose "Database partitioning" at the Informatica session level). However, if you use key-range partitioning, you have to specify the ranges in the session settings. We usually use Oracle's NTILE analytic function to split the table into multiple portions so that each SELECT reads a unique slice. If the table has a range/auto-generated/surrogate key column, use it in the WHERE clause: write a sub-query to divide the table into multiple portions.
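For illustration, a hedged sketch of how NTILE can be used to derive the range boundaries; the schema, credentials, and bucket count are made up and would need to match your own table and desired degree of parallelism.

# Emit one (range_start, range_end) pair per bucket; feed these into the
# key-range partition settings or into per-session WHERE clauses.
sqlplus -s dwh_user/secret@ORCL <<'SQL'
SELECT bucket,
       MIN(order_id) AS range_start,
       MAX(order_id) AS range_end
FROM (
  SELECT order_id,
         NTILE(8) OVER (ORDER BY order_id) AS bucket
  FROM orders
)
GROUP BY bucket
ORDER BY bucket;
SQL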

Cognos reporting on Hive datasource is very slow?

I am new to Cognos and am trying to create reports on top of Hadoop using the Hive JDBC driver. I'm able to connect to Hive through JDBC and can generate reports, but the reports run very slowly. I did the same job against DB2, with the same data as in Hadoop, and the reports ran very quickly compared to the ones on top of Hive. I'm using the same data sets in both Hadoop and DB2, but can't figure out why the reports on top of Hadoop are so slow. I installed Hadoop in pseudo-distributed mode and connected through JDBC.
These are the versions of the software I used:
IBM Cognos 10.2.1 with fix pack 11,
Apache Hadoop 2.7.2,
Apache Hive 0.12.
They are installed on different systems: Cognos on Windows 7 and Hadoop on Red Hat.
Can anyone tell me where I might have gone wrong in setting up Cognos or Hadoop? Is there any way to speed up report run times in Cognos on top of Hadoop?
When you say you installed Hadoop in pseudo distributed mode are you saying you are only running it on a single server? If so, it's never going to be as fast as DB2. Hadoop and Hive are designed to run on a cluster and scale. Get 3 or 4 servers running in a cluster and you should find that you can start to see some impressive query speeds over large datasets.
Check that you have allowed the Cognos Query Service to access more than the default amount of memory for its Java heap (http://www-01.ibm.com/support/docview.wss?uid=swg21587457). I currently run an initial size of 8 GB and a max of 12 GB, but still manage to blow this occasionally.
The next issue you will run into is that Cognos doesn't know Hive's SQL specifics (or Impala's, which is what I am using). This means that any non-basic query is going to be converted to a SELECT FROM and maybe a GROUP BY. The big missing piece will be the WHERE clause, which means Cognos is going to try to suck in all the data from the Hive table and then do the filtering in Cognos, rather than pass that off to Hive where it belongs. Cognos knows how to write DB2 SQL with all its specifics, so in that case it can push the workload down.
More complex queries and platform-specific functions (date functions, analytic functions, etc.) will generally not get passed to Hive, so try to structure your data and queries so that such functions are not required in filters.
Use the Hive query logs to monitor the queries that Cognos is running. Also try things like adding fields to the query and then dragging that field into the filter, rather than dragging it directly from the model into the filter. I have found this can help in getting Cognos to include the filter in a WHERE clause.
The other option is to use pass-through SQL queries in Report Studio and just write it all in Hive's SQL. I have just done this for a set of dashboards which required a stack of top 5s from a fact table with 5 million rows. For 5 rows, Cognos was extracting all 5 million rows and then ranking them within Cognos. Do this a number of times and all of a sudden Cognos is going to struggle. With a pass-through query I could use Impala's RANK() function and get only 5 rows back: much, much faster, and faster than what DB2 would do, seeing as I am running on a proper (but small) cluster.
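As a hedged illustration of the pass-through idea, this is roughly what such a top-5 query looks like. It is shown via impala-shell so it can be tested outside Cognos, and the table and column names (sales_fact, region, product, amount) are invented.

# RANK() runs inside Impala, so only the top 5 rows per region come back.
impala-shell -q "
SELECT region, product, amount
FROM (
  SELECT region, product, amount,
         RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
  FROM sales_fact
) ranked
WHERE rnk <= 5"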
Another consideration with Hive is whether you are using Hive on MapReduce or Hive on Tez. From what a colleague has found, Hive on Tez is much faster than Hive on MapReduce for the type of queries Cognos runs.
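If your Hive version supports it (Tez support arrived around Hive 0.13), switching engines is a per-session setting; a minimal, hedged example, where the table name is purely illustrative:

# Run one query on Tez instead of MapReduce; requires Tez on the cluster.
hive --hiveconf hive.execution.engine=tez -e "SELECT region, COUNT(*) FROM sales_fact GROUP BY region"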

Extracting Data from Oracle to Hadoop. Is Sqoop a good idea

I'm looking to extract some data from an Oracle database and transfer it to a remote HDFS file system. There appear to be a couple of possible ways of achieving this:
Use Sqoop. This tool will extract the data, copy it across the network and store it directly into HDFS
Use SQL to read the data and store it on the local file system. When this has been completed, copy (FTP?) the data to the Hadoop system.
My question: will the first method (which is less work for me) cause Oracle to lock tables for longer than required?
My worry is that Sqoop might take out a lock on the database when it starts to query the data, and this lock isn't going to be released until all of the data has been copied across to HDFS. Since I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency), the lock will remain longer than would otherwise be required.
Sqoop issues the usual SELECT queries on the Oracle database, so it takes the same locks as a SELECT query would. No additional locking is performed by Sqoop.
Data will be transferred in several concurrent tasks (mappers). Any expensive function call will put a significant performance burden on your database server. Advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel, which will adversely affect transfer performance.
For efficient advanced filtering, run the filtering query on your database prior to the import, save its output to a temporary table, and run Sqoop to import the temporary table into Hadoop without the --where parameter (a sketch of this pattern follows below).
The Sqoop import has nothing to do with copying the data across the network afterwards: Sqoop writes to one location, and HDFS then replicates the data according to the cluster's replication factor.
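A hedged sketch of that "filter first, then import" pattern; the staging table, filter condition, and split column are placeholders for whatever your real filtering query looks like.

# 1) Do the expensive filtering once, inside Oracle.
sqlplus -s dwh_user/secret@ORCL <<'SQL'
CREATE TABLE orders_export_stage AS
SELECT *
FROM   orders
WHERE  order_date >= DATE '2015-01-01';
SQL

# 2) Let Sqoop read the pre-filtered table in parallel, with no --where clause.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username dwh_user -P \
  --table ORDERS_EXPORT_STAGE \
  --split-by ORDER_ID \
  --num-mappers 8 \
  --target-dir /data/staging/orders_export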

Will Hadoop (Sqoop) load Oracle faster than SQL*Loader?

We presently load CDRs into an Oracle warehouse using a combination of bash shell scripts and SQL*Loader with multiple threads. We are hoping to offload this process to Hadoop because we envisage that the increase in data due to a growing subscriber base will soon max out the current system. We also want to gradually introduce Hadoop into our data warehouse environment.
Will loading from Hadoop be faster?
If so, what is the best set of Hadoop tools for this?
Further info:
We usually get a continuous stream of pipe-delimited text files via FTP into a folder, add two more fields to each record, load them into temp tables in Oracle, and run a procedure to load into the final table. How would you advise the process flow should look in terms of tools to use? For example:
Files are FTPed to the Linux file system (or is it possible to FTP straight to Hadoop?) and Flume loads them into Hadoop.
Fields are added (what would be best for this? Pig, Hive, Spark, or any other recommendations?).
Files are then loaded into Oracle using Sqoop.
The final procedure is called (can Sqoop make an Oracle procedure call? If not, what tool would be best to execute the procedure and help control the whole process?).
Also, how can one control the level of parallelism? Does it equate to the number of mappers running the job?
I had a similar task of exporting data from a < 6 node Hadoop cluster to an Oracle data warehouse.
I've tested the following:
Sqoop
OraOop
Oracle Loader for Hadoop from the "Oracle BigData Connectors" suite
A Hadoop Streaming job which uses SQL*Loader (sqlldr) as the mapper; in its control file you can read from stdin using: load data infile "-"
Considering just speed, the Hadoop Streaming job with sqlldr as the mapper was the fastest way to transfer the data, but you have to install sqlldr on each machine of your cluster. It was more of a personal curiosity; I would not recommend using this approach to export data, since the logging capabilities are limited and it should have a bigger impact on your data warehouse's performance.
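For the curious, a very rough sketch of that streaming trick. Here stdin_load.ctl is a hypothetical SQL*Loader control file whose INFILE clause is "-", so each mapper pipes its input split straight into sqlldr; as said above, not something I'd recommend for production.

# Map-only streaming job; every mapper runs sqlldr against the warehouse.
# (The streaming jar path varies by distribution.)
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -input /user/hive/warehouse/final_table \
  -output /tmp/sqlldr_export_logs \
  -mapper "sqlldr userid=dwh_user/secret@ORCL control=stdin_load.ctl" \
  -file stdin_load.ctl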
The winner was Sqoop. It is pretty reliable, it's the import/export tool of the Hadoop ecosystem, and it was the second-fastest solution according to my tests (1.5x slower than first place).
Sqoop with OraOop (last updated in 2012) was slower than the latest version of Sqoop and requires extra configuration on the cluster.
Finally, the worst time was obtained using Oracle's BigData Connectors. If you have a big cluster (>100 machines) it should not be as bad as the time I obtained. The export was done in two steps. The first step involves reprocessing the output and converting it to an Oracle format that plays nicely with the data warehouse. The second step was transferring the result to the data warehouse. This approach is better if you have a lot of processing power, and it would not impact the data warehouse's performance as much as the other solutions.
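To tie this back to the original question, here is a hedged sketch of what a Sqoop-based flow could look like end to end; all paths, table names, the two added fields, and the final procedure name (LOAD_FINAL) are placeholders.

# 1) Land the FTP'd pipe-delimited files in HDFS.
hdfs dfs -put /landing/cdr_*.txt /data/cdr/incoming/

# 2) Add the two extra fields with Hive (assumes an external table cdr_raw
#    is defined over /data/cdr/incoming).
hive -e "
INSERT OVERWRITE DIRECTORY '/data/cdr/enriched'
SELECT c.*, unix_timestamp(), 'FTP'
FROM cdr_raw c"

# 3) Export the enriched files to an Oracle temp table; -m (--num-mappers)
#    is what controls the level of parallelism in Sqoop.
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username dwh_user -P \
  --table CDR_TEMP \
  --export-dir /data/cdr/enriched \
  --input-fields-terminated-by '\001' \
  -m 8

# 4) Call the final procedure; a plain sqlplus call from the driver script
#    is the simplest way to finish the flow.
echo "EXEC LOAD_FINAL;" | sqlplus -s dwh_user/secret@ORCL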
