What is the best way to ingest data from Terdata into Hadoop with Informatica? - hadoop

What is the best ways to parallel ingest data from Teradata database into Hadoop with parallel data moving?
If we create a job which is simple opens one session to Teradata database it will take a lot of time to load huge table.
if we create a set of sessions to load data in parallel, and also make Select in each of the sessions, than it will make a set of Full table scans Teradata to produce a data
What is the recommended best practice to load data in parallelised streams and make unnecessary workload to Teradata?

If Tera data supports table partitioning like oracle, you could try reading the table based on partitioning points which will enable parallelism in read...
Other option you have is, split the table into multiple partitions like adding a where clause on indexed column. This will ensure index scan and you can avoid full table scan.

The most scalable way to ingest data into Hadoop form teradata, which i found is to use Teradata connector for hadoop. It is included in Cloudera & Hortonworks distributions. I will show example base on Cloudera documentation, but the same works with Hortonworks as well:
Informatica big Data edition is using standard Scoop invocation via command line and submitting set of parameters to it. So the main question is - which driver to use to make parallel connections between two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is the digest from this documentation (You could find that this connector support different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.

If you use partition names in the select clause, Power Center will select only the rows within that partition so there won't be duplicate read (don't forget to choose Database partitioning in Informatica session level). However if you use key range partition you have to choose the range as you mentioned in settings. Usually we use NTILE oracle analytical function to split the table into multiple portions so that the read will be unique across the selects. Please let me know if you have any question. If you have range/auto generated/surrogate key column in the table use it in where clause - write a sub-query to divide the table into multiple portions.

Related

Why Hive when HDFS already provide data storage?

I have started learning Hadoop.I understood that HDFS provides distributed storage system and Mapreduce is for data processing.Now i ma reading Hadoop ecosystem.
From the definition of Hive, it is a data ware house built on hadoop for providing SQL like interface.
My question is when hadoop provides HDFS which is falut tolerant , distributed then why hive? Does hive replaces HDFS?.
Does hive provide only sql interface or storage also?
Hive does not replace HDFS. Hive provides sql type interface to data that is stored in HDFS. Its basically used for querying and analysis of data that is stored. Hive in a sense actually eliminates a lot of boiler plate code, that you would have to write if you were using mapreduce. for example just think of how you are going to create different types of joins(left, right, bucketed) or group by clause or any other sql clause in mapreduce and you will get your answer (you lines of code will easily scale to 100's ). Hive provides them out-of-the-box. You dont need to write those lengthy programs in mapreduce. Hive already does that for you.
One thing to note is, Hive itself uses Mapreduce behind the scenes. So any group by, count, join is converted to mapreduce jobs only. You can change this though to Tez/Spark.
for your second question, hive does not provide any storage, it just uses a database (derby as default, MySQL would be a good choice if you want to use a different db) as a metastore just to store the metadata related to the tables, partitions, views, buckets etc.. (metadata is like location of tables, type of data stored in tables, partitions info of the tables, created date, modified date etc..) you create with hive.
To answer your question in comment...
Hive can process structured (csv,txt etc) data & semi-structured(xml,json,parquet etc). It cannot process unstructured data like audio, video etc.
Note: Semi structured data can be handled in DDLs and also through spark to be put into Hive.
I encourage you to learn what is external and managed tables in hive too.
Happy learning.

Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes. Vertica is co-located on cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
Right now what I did is using HDFS Connector, basically create an external table in Vertica, then link it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM "hdfs://hostname/...../data" PARQUET;
Since the data size is big. This method will not achieve good performance.
I have done some research, Vertica Hadoop Integration
I have tried HCatalog but there's some configuration error on my Hadoop so that's not working.
My use case is to not change data format on HDFS(Parquet), while query it using Vertica. Any ideas on how to do that?
EDIT: The only reason Vertica got slow performance is because it cant use Parquet's partitions. With higher version Vertica(8+), it can utlize hive's metadata now. So no HCatalog needed.
Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');

how to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to perform reporting on the data stored on Apache Hadoop. We send daily reports to client so need to import data from operational store to hadoop daily.
Hadoop works on the append only mode. Hence we can not perform the Hive update/delete query. We can perform Insert overwrite on dimension tables and add delta values in the fact tables. Introducing thousands for the delta rows daily does not seem quite impressive solution.
Are there any other standard better ways to update modified data in Hadoop?
Thanks
HDFS might be append only, but Hive does support updates from 0.14 on.
see here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
A design pattern is to take all your previous and current data and insert it into a new table every time.
Depending on your usecase have a look at Apache Impala/Hbase/... or even Drill.

Questions about migration, data model and performance of CDH/Impala

I have some questions about migration, data model and performance of Hadoop/Impala.
How to migrate Oracle application to cloudera hadoop/Impala
1.1 How to replace oracle stored procedure in impala or M/R or java/python app.
For example, the original SP include several parameters and sqls.
1.2 How to replace unsupported or complex SQL like over by partition from Oracle to impala.
Are there any existing examples or Impala UDF?
1.3 How to handle update operation since part of data has to be updated.
For example, use data timestamp? use the store model which can support update like HBase? or use delete all data/partition/dir and insert it again(insert overwrite).
Data store model , partition design and query performance
2.1 How to chose impala internal table or external table like csv, parquet, habase?
For example, if there are several kind of data like importing exsited large data in Oracle into hadoop, new business data into hadoop, computed data in hadoop and frequently updated data in hadoop, how to choose the data model? Do you need special attention if the different kind of data need to join?
We have XX TB's data from Oracle, do you have any suggestion about the file format like csv or parquet? Do we need to import the data results into impala internal table or hdfs fs after calculation. If those kind of data can be updated, how to we considered that?
2.2 How to partition the table /external table when joining
For example, there are huge number of sensor data and each one includes measuring data, acquisition timestamp and region information.
We need:
calculate measuring data by different region
Query a series of measuring data during a certain time interval for specific sensor or region.
Query the specific sensor data from huge number of data cross all time.
Query data for all sensors on specific date.
Would you please provide us some suggestion about how to setup up the partition for internal and directories structure for external table(csv) .
In addition, for the structure of the directories, which is better when using date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide about that?

Hbase in comparison with Hive

Im trying to get a clear understanding on HBASE.
Hive:- It just create a Tabular Structure for the Underlying Files in
HDFS. So that we can enable the user to have Querying Abilities on the
HDFS file. Correct me if im wrong here?
Hbase- Again, we have create a Similar table Structure, But bit more
in Structured way( Column Oriented) again over HDFS File system.
aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce.
Also is that true that we cant create a Hbase table over an Already existing HDFS file?
Hive shares a very similar structures to traditional RDBMS (But Not all), HQL syntax is almost similar to SQL which is good for Database Programmer from learning perspective where as HBase is completely diffrent in the sense that it can be queried only on the basis of its Row Key.
If you want to design a table in RDBMS, you will be following a structured approach in defining columns concentrating more on attributes, while in Hbase the complete design is concentrated around the data, So depending on the type of query to be used we can design a table in Hbase also the columns will be dynamic and will be changing at Runtime (core feature of NoSQL)
You said aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce .This is not a simple thinking.Because when a hive query is executed, a mapreduce job will be created and triggered.Depending upon data size and complexity it may consume time, since for each mapreduce job, there are some number of steps to do by JobTracker, initializing tasks like maps,combine,shufflesort, reduce etc.
But in case we access HBase, it directly lookup the data they indexed based on specified Scan or Get parameters. Means it just act as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond a SQL way to write map/reduce jobs and with what HortonWorks calls the "stinger initiative" they have added a dedicated file format (Orc) and import Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. relatively fast way to run analytics queries for data stored on Hadoop)
Hive:
It's just create a Tabular Structure for the Underlying Files in HDFS. So that we can enable the user to have SQL-like Querying Abilities on existing HDFS files - with typical latency up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the pageā€”for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows to run HBase on an existing Hadoop cluster, it guarantees redundant storage of data; but it is not a feature in any other sense.
Also is that true that we cant create a Hbase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

Resources