I have two clusters, A and B. Cluster A has 5 Hive tables. I need to run a Hive query over these 5 tables, and the result of the query should update a table on cluster B (the target table covers all the columns of the query result).
Note: we should not create any files on cluster A during this process, although temporary files are allowed.
Is this doable? What permissions/configurations are required between the two clusters to achieve this?
How can I accomplish this task, and is there any other, more efficient alternative?
After achieving this, I need to automate it using Oozie.
Do you use a database for each cluster's metadata or Hive tables? If so, and if you use the same database for storing the Hive tables of both clusters, then you can share them. I know it sounds obvious, but I mention it in case you haven't thought about it.
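One commonly used approach, assuming the clusters can reach each other's HDFS (compatible RPC versions, network connectivity, and write permission on cluster B for the user running the query), is to define an external table on cluster A whose LOCATION points at cluster B's HDFS and insert the query result into it. Below is a minimal sketch using the pyhive client; the hosts, paths, table names, and columns are placeholders, not the actual 5-table query:

```python
from pyhive import hive  # HiveServer2 client: pip install 'pyhive[hive]'

# Hypothetical HiveServer2 endpoint on cluster A.
conn = hive.connect(host="clusterA-hiveserver2", port=10000, username="etl_user")
cur = conn.cursor()

# External table whose data lives on cluster B's HDFS, so the query result is
# written directly to cluster B and only temp/intermediate files touch cluster A.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS cross_cluster_result (
        col1 STRING,
        col2 STRING,
        total BIGINT
    )
    STORED AS ORC
    LOCATION 'hdfs://clusterB-namenode:8020/warehouse/cluster_b_target'
""")

# Placeholder for the real 5-table query on cluster A.
cur.execute("""
    INSERT OVERWRITE TABLE cross_cluster_result
    SELECT t1.col1, t2.col2, SUM(t3.metric)
    FROM t1
    JOIN t2 ON t1.id = t2.id
    JOIN t3 ON t2.id = t3.id
    GROUP BY t1.col1, t2.col2
""")
```

On cluster B you would then point a table at the same location (running MSCK REPAIR TABLE or ADD PARTITION if it is partitioned). The statement can later be wrapped in an Oozie Hive/Hive2 action for the automation step; the alternative of writing the result on cluster A and copying it across with distcp would create files on cluster A, which the constraint above rules out.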
Related
What is the best way to ingest data from a Teradata database into Hadoop with parallel data movement?
If we create a job which simply opens one session to the Teradata database, it will take a long time to load a huge table.
If we create a set of sessions to load the data in parallel, and run a SELECT in each of the sessions, then Teradata will perform a set of full table scans to produce the data.
What is the recommended best practice for loading data in parallel streams without putting unnecessary workload on Teradata?
If Teradata supports table partitioning like Oracle, you could try reading the table based on the partitioning points, which will enable parallelism in the read.
Another option is to split the table into multiple partitions, for example by adding a WHERE clause on an indexed column. This ensures an index scan and avoids a full table scan.
The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata Connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well.
Informatica Big Data Edition uses a standard Sqoop invocation via the command line, submitting a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is a digest from this documentation (you will find that this connector supports different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This is the optimal method for retrieving data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
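For orientation, here is a minimal sketch of a parallel Sqoop import driven from Python over the generic Teradata JDBC driver; the host, database, credentials, split column, and paths are placeholders, and the connector-specific options that select a method such as split.by.amp should be taken from the Cloudera/Hortonworks documentation linked above rather than from this sketch:

```python
import subprocess

# Hypothetical connection details; a generic-JDBC sketch, not the
# connector-specific invocation documented by Cloudera.
sqoop_import = [
    "sqoop", "import",
    "--driver", "com.teradata.jdbc.TeraDriver",
    "--connect", "jdbc:teradata://td-host/DATABASE=sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.td_pass",  # keep credentials off the command line
    "--table", "ORDERS",
    "--split-by", "ORDER_ID",   # indexed, evenly distributed column -> balanced mappers
    "--num-mappers", "8",       # 8 parallel sessions instead of one big SELECT
    "--target-dir", "/data/staging/orders",
]
subprocess.run(sqoop_import, check=True)
```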
If you use partition names in the SELECT clause, PowerCenter will select only the rows within that partition, so there won't be duplicate reads (don't forget to choose database partitioning at the Informatica session level). However, if you use key-range partitioning, you have to specify the ranges in the settings, as you mentioned. Usually we use the NTILE analytical function (as in Oracle) to split the table into multiple portions so that the reads are unique across the SELECTs. If the table has a range/auto-generated/surrogate key column, use it in the WHERE clause: write a sub-query to divide the table into multiple portions. Please let me know if you have any questions.
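To illustrate the NTILE idea, here is a small sketch that generates N non-overlapping source queries, one per parallel session; the table and key names are placeholders:

```python
# Hypothetical table and key column; each generated query reads a distinct
# slice of the table, so the parallel sessions never read the same row twice.
NUM_SESSIONS = 4
TEMPLATE = """
SELECT t.*
FROM (
    SELECT s.*, NTILE({n}) OVER (ORDER BY s.order_id) AS bucket
    FROM sales.orders s
) t
WHERE t.bucket = {b}
"""
split_queries = [
    TEMPLATE.format(n=NUM_SESSIONS, b=b) for b in range(1, NUM_SESSIONS + 1)
]
# Each query is then assigned to one Informatica session (or one mapper).
```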
The scenario is: I need to process an input file, and for each record I need to check whether certain fields in the input file match fields stored in a Hadoop cluster.
We are thinking of using MRJob to process the input file and using Hive to get the data from the Hadoop cluster. I would like to know whether it is possible to connect to Hive from inside an MRJob module. If so, how can I do that?
If not, what would be the ideal approach to fulfil my requirement?
I am new to Hadoop, MRJob and Hive.
Please provide some suggestions.
"matching the fields stored in an Hadoop cluster." --> You mean that you need to search if the fields exists in this file too?
About how many files are there in total which you need to scan?
One solution is to load every single item into an HBase table and, for every record in the input file, "GET" the record from the table. If the GET is successful, the record exists elsewhere in HDFS; otherwise it doesn't. You would need a unique identifier for each HBase record, and the same identifier should exist in your input file as well.
You could connect to Hive as well, but the schema would need to be rigid in order for all your HDFS files to be loadable into a single Hive table. HBase doesn't really care about columns (only column families are needed). One more downside of MapReduce plus Hive is that the speed will be low compared to HBase, which is near real time.
Hope this helps.
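If you go the HBase route, a minimal MRJob sketch of the per-record GET check could look like the following; it assumes the HBase Thrift server is running and uses the happybase client, and the host, table, and field positions are hypothetical:

```python
from mrjob.job import MRJob
import happybase  # Python HBase client; talks to the HBase Thrift server


class MatchAgainstHBase(MRJob):
    """For each input record, check whether its identifier exists in HBase."""

    def mapper_init(self):
        # Hypothetical Thrift host and reference table.
        self.connection = happybase.Connection("hbase-thrift-host")
        self.table = self.connection.table("reference_records")

    def mapper(self, _, line):
        record_id = line.split(",")[0].strip()
        row = self.table.row(record_id.encode("utf-8"))  # point GET by row key
        yield record_id, 1 if row else 0  # 1 = found in the cluster, 0 = not found

    def mapper_final(self):
        self.connection.close()


if __name__ == "__main__":
    MatchAgainstHBase.run()
```

Connecting to Hive from inside the job is technically possible too (for example through a HiveServer2 client such as pyhive), but issuing a Hive query per input record would be far too slow; the keyed GET above is what gives the near-real-time behaviour mentioned in the answer.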
I have a data structure in Hadoop with 100 columns and a few hundred rows. Most of the time I need to query 65% of the columns. In this case, which is better to use: HBase or Hive? Please advise.
The number of columns you are accessing is NOT, by itself, the criterion for deciding between HBase and Hive.
Hive (SQL):
Use Hive when you have warehousing needs, you are good at SQL, and you don't want to write MapReduce jobs. One important point, though: Hive queries get converted into corresponding MapReduce jobs under the hood, which run on your cluster and give you the result. Hive does the trick for you. But not every problem can be solved using HiveQL; sometimes, if you need really fine-grained and complex processing, you may have to fall back on MapReduce.
HBase (NoSQL database):
You can use HBase to serve that purpose. If you have some data which you want to access in real time, you could store it in HBase.
An HBase get 'rowkey' is powerful when you know your access pattern.
HBase follows the CP side of the CAP theorem:
Consistency:
Every node in the system contains the same data (e.g., replicas are never out of date).
Availability:
Every request to a non-failing node in the system returns a response
Partition Tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communication is lost) or data is lost (a node is lost).
Also have a look at this.
It's very difficult to answer the question in one line.
HBase is a NoSQL database: you need to store your data denormalized, because HBase is very bad at joining tables.
Hive: you can store the data in a similar, normalized format in Hive, but you would only see the benefits when doing batch processing.
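To make the access-pattern difference concrete, a small sketch with hypothetical hosts, tables, and columns: a keyed point read in HBase versus a scan-style analytical query in Hive.

```python
import happybase
from pyhive import hive

# HBase: a single keyed GET -- ideal when you know the row key and need the
# answer in real time.
events = happybase.Connection("hbase-thrift-host").table("events")
one_row = events.row(b"user123#2024-01-01")  # returns only that row's cells

# Hive: a batch-style query over many rows and columns, run as MapReduce under the hood.
cursor = hive.connect(host="hiveserver2-host", port=10000).cursor()
cursor.execute("SELECT col1, col2, AVG(col65) FROM events_wide GROUP BY col1, col2")
summary = cursor.fetchall()
```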
Hi everyone,
I am new to the Hadoop world and I have a problem with an HBase join.
I have two clusters: cluster A's HBase has an employee table, and cluster B's HBase has a department table.
So, how can I join employee and department?
Do I need to install Hive?
If the tables are in two separate clusters, you'll need to get one of the HBase tables from one cluster to another. This can be done via sqoop.
From there, you could, in theory, use Phoenix as suggested by Vignesh I in the comments, however, there are some limitations there. You would need to create a Phoenix view of both of those HBase tables. Native HBase views in Phoenix, currently, do not automatically update if they are updated outside of Phoenix, which most native HBase tables would be. This effectively renders views of native HBase tables in Phoenix snapshots instead of views; you will need to rebuild any indexes on a regular basis (and potentially stats as well) in order to capture any updates to the underlying HBase tables.
There is a JIRA open to enhance this behavior so that it would auto update, but the ETA of such a feature is unknown at this time.
What I would recommend, unless you have very specific real-time needs (in which case Phoenix, if you could live with the view limitations, may be the better choice), is to use Pig.
Within the Pig script, you can join the two HBase tables and then perform various transformations.
Hive would be another option, but in that case, you would need to sqoop both tables from HBase into Hive, and then proceed from there within Hive.
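The Pig route would express the join in Pig Latin (loading both tables with HBaseStorage and using JOIN). Purely to illustrate the join itself, here is a client-side hash-join sketch with the happybase Python client; it is only workable when the department table fits in memory, assumes one of the tables has already been moved so both are reachable from a single cluster, and all host, table, and column names are hypothetical:

```python
import happybase

# Assumes the employee table has already been copied so that both tables are
# reachable from one cluster (e.g. via the Sqoop/replication step described above).
conn = happybase.Connection("clusterB-hbase-thrift")

# Build an in-memory lookup from the small department table: dept_id -> department name.
departments = {
    row_key: data.get(b"info:name", b"").decode()
    for row_key, data in conn.table("department").scan(columns=[b"info:name"])
}

# Stream the employee table and attach the matching department name to each employee.
for emp_key, emp in conn.table("employee").scan(columns=[b"info:name", b"info:dept_id"]):
    dept_id = emp.get(b"info:dept_id", b"")
    print(emp_key.decode(),
          emp.get(b"info:name", b"").decode(),
          departments.get(dept_id, "UNKNOWN"))
```

For tables of any real size, do the equivalent join inside the cluster with Pig or Hive as described above.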
I am converting an SSIS solution to Hadoop for ETL processing in the data warehouse.
My expected system:
ETL - landing & staging (Hadoop) ----put-data---> Data-warehouse(MySQL)
The problem is: in the transform phase, I need to look up data in MySQL from the Hadoop side (a Pig or MapReduce job). There are two solutions:
1st: clone all the tables needed for lookups from MySQL into Hadoop. This means we have to maintain the data in two places.
2nd: query MySQL directly. I am worried about the number of connections hitting the MySQL server.
What is the solution/best practice for this problem? Are there any other solutions?
You will have to have some representation of your dimension tables in Hadoop. Depending on how you do the ETL of the dimension data, you might actually get them as a side effect of that ETL.
Are you planning to store the most granular fact data in MySQL? In my experience, Hive + Hadoop beats relational databases when it comes to storing and analyzing fact data. If you need real-time access to the query results, you can "cache" the summary results by storing them in MySQL.
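A hedged sketch of the two data movements this implies, with hypothetical connection strings, tables, and paths: importing a MySQL dimension table into Hive so the Pig/MapReduce jobs can join against it locally, and exporting the summarized results back to MySQL.

```python
import subprocess

# 1) Clone a MySQL dimension table into Hive so lookups happen inside the cluster
#    instead of opening many connections to the MySQL server.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host/dwh",
    "--username", "etl_user", "--password-file", "/user/etl/.mysql_pass",
    "--table", "dim_customer",
    "--hive-import", "--hive-table", "staging.dim_customer",
    "--num-mappers", "4",
], check=True)

# 2) Push the aggregated ("summary") results back to MySQL for real-time access.
subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://mysql-host/dwh",
    "--username", "etl_user", "--password-file", "/user/etl/.mysql_pass",
    "--table", "fact_daily_summary",
    "--export-dir", "/warehouse/summary/daily",
], check=True)
```

Scheduling both steps around each ETL run keeps the "maintain data in two places" cost down to a single refresh job per dimension table.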