What is Hive best suited for?

I need to take daily snapshots from all of the enterprise's databases and update Hive with them.
If that is the best approach, how do I go about it? I have used Sqoop to manually import data into Hive, but what do I connect PHP to: Hive or Sqoop?
I understand Hive is used for OLAP and not OLTP, but is taking snapshots once a day something Hive would support nicely, or should I consider other options like HBase?
I am open to more suggestions, considering that the data is structured for the most part.
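For reference, a manual Sqoop-to-Hive snapshot import of the kind described above looks roughly like this; the connection string, credentials, and table names are hypothetical placeholders, and the same command can be scheduled (e.g. via cron or Oozie) for daily runs:

    # One source table, replaced in Hive on every run (daily snapshot).
    # All connection details below are placeholders.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.pw \
      --table orders \
      --hive-import --hive-table snapshots.orders \
      --hive-overwrite \
      -m 4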

Purpose of using HBase in Hadoop instead of Hive

In my project, we are using Hadoop 2, Spark, and Scala. Scala is the programming language, and Spark is used here for analysis. We are using both Hive and HBase. I can access all the details of files on HDFS using Hive.
My confusions are:
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What is the functionality of Hive and HBase?
If we only use Hive, what would be the problem?
Can anyone please let me know?
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
What is the functionality of Hive and HBase?
HBase is a NoSQL database which stores data in key-value pairs. Hive has integration with HBase (see Hive HBase Integration).
Advantage: Hive queries over HBase. Think joins and an easy way to do aggregates and simple operations on your HBase data.
HBase gives you a scalable storage infrastructure that keeps data online. StumbleUpon uses HBase for their live website. Hive is not a real-time query engine, so its data store could not be used for similar purposes. Hive over HBase gives you the best of both worlds.
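As a minimal sketch of that integration (the table and column names here are hypothetical), a Hive table can be mapped onto an existing HBase table with the HBase storage handler, after which ordinary HiveQL aggregates and joins run against the HBase data:

    # Map a Hive table onto an existing HBase table ('users' and its
    # columns are placeholders), then aggregate it with plain HiveQL.
    hive -e "
    CREATE EXTERNAL TABLE hbase_users (
      rowkey STRING,
      name   STRING,
      visits INT
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:visits')
    TBLPROPERTIES ('hbase.table.name' = 'users');
    SELECT name, SUM(visits) FROM hbase_users GROUP BY name;
    "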
If we only use Hive, what would be the problem?
If we use only Hive, there is no problem as such, but in a project there are many scenarios we have to consider:
Performance
Storage
Stability of used technology
Compatibility (the Hive warehouse is easily accessible to most of the tools in Hadoop)
When I can perform all jobs using Hive, why is HBase required to store the data? Is it not an overhead?
I can't say whether it is an overhead or not, but HBase responds to requests in real time because it is a database, whereas Hive runs jobs on the MapReduce/Spark/Tez engines.
What is the functionality of Hive and HBase?
Hive:
It's a SQL-like language that gets translated into MapReduce/Spark/Tez jobs; it only runs batch processes on Hadoop. For more, check how Hive queries run on the MapReduce engine.
HBase:
It's a key/value store database which runs on top of HDFS (or S3 on AWS). It serves requests in real time.
If we only use Hive, what would be the problem?
As discussed, if a query needs to be processed in real time, then HBase is the choice over Hive.
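To make the latency difference concrete (the table, row key, and column names are hypothetical), compare a point read issued to each system:

    # Real-time point read, answered directly by the HBase region server:
    echo "get 'users', 'user123', 'info:name'" | hbase shell

    # The same lookup in Hive is compiled into a batch job, so even a
    # single-row read pays the job-startup latency:
    hive -e "SELECT name FROM users WHERE rowkey = 'user123';"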

Oracle Hadoop Connectors vs Sqoop

I have used Sqoop to ingest data from Oracle into Hadoop and it worked well. It took only 4 minutes to bring 86 million records from Oracle into a Hive table, without using partitioning in Sqoop. Can anyone give some details about the Oracle Hadoop connectors? Will they perform better than Sqoop?
Most of the connectors will have close to the same performance, as you'll have a set of MapReduce jobs at the very end of your workflow, and that plays the main role in your overall performance.
Oracle provides a set of different connectors for accessing Hive, and you can check a nice overview of the standard solutions, but I doubt that in the end you will see significant performance differences beyond what you see with Sqoop:
https://docs.oracle.com/cd/E37231_01/doc.20/e36961/start.htm#BDCUG119
Sqoop is a generic tool for working with relational databases from the Hadoop realm, and it is not limited to Oracle. Besides, it integrates with other Hadoop solutions like Oozie for building complicated workflows, which makes it a good candidate over other types of connectors.
Personally, I prefer Sqoop for Hadoop-driven import/export operations and the connector approach for querying data in Hadoop.
Sqoop will leverage a standard JDBC connection. Oracle's connector works with a fast-loader/fast-export class integrated into the Sqoop connection. It should be faster than plain Sqoop over JDBC.
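As a baseline for such a comparison, a plain-JDBC Sqoop import from Oracle looks roughly like this (connection details and table names are placeholders); the Oracle-specific direct path, where available, is the main lever you would benchmark against it:

    # Baseline import over plain JDBC; parallelism comes from --split-by
    # and -m. All connection details are placeholders.
    sqoop import \
      --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
      --username scott --password-file /user/etl/.orapw \
      --table SALES.ORDERS \
      --split-by ORDER_ID \
      -m 8 \
      --hive-import --hive-table staging.orders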

Does Hive depend on/require Hadoop?

The Hive installation guide says that Hive can be applied to an RDBMS. My question is: it sounds like Hive can exist without Hadoop, right? Is it an independent HQL engine that can work with any data source?
You can run Hive in local mode to use it without a Hadoop cluster for debugging purposes. See the URL below:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode
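As a sketch of what that page describes (the exact property names vary by Hive/Hadoop version; older releases use mapred.job.tracker=local instead), local mode can be forced per session:

    # Run a query in-process with no cluster; the property names are
    # version-dependent, and the table name is a placeholder.
    hive -e "
    SET mapreduce.framework.name=local;
    SET hive.exec.mode.local.auto=true;
    SELECT COUNT(*) FROM some_table;
    "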
Hive provides a JDBC driver so you can query Hive like any JDBC source; however, if you are planning to run Hive queries on a production system, you need Hadoop infrastructure to be available. Hive queries eventually convert into MapReduce jobs, and HDFS is used as the data storage for Hive tables.
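For example, the JDBC route can be exercised from the command line with beeline (host and port are placeholders; 10000 is the usual HiveServer2 default):

    # Connect to HiveServer2 over JDBC and run a statement.
    beeline -u jdbc:hive2://hiveserver:10000/default -e "SHOW TABLES;"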

Load data into Greenplum DB using MapReduce or Sqoop

I want to try loading data into Greenplum using MapReduce or Sqoop. For now, the way to load a Greenplum DB from HDFS is to create an external table with gphdfs and then load the internal table. I want to try out a solution that loads the data directly into Greenplum with Sqoop or MapReduce. I need some input on how I can proceed with this. Could you please help me out?
With regard to Sqoop, Sqoop export will help achieve this.
http://www.tutorialspoint.com/sqoop/sqoop_export.htm
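A sketch of such an export, using Greenplum's PostgreSQL-compatible JDBC interface (the host, database, paths, and table names are all placeholders):

    # Push an HDFS directory into a Greenplum table over JDBC.
    sqoop export \
      --connect jdbc:postgresql://gpmaster:5432/analytics \
      --username etl --password-file /user/etl/.gppw \
      --table public.orders \
      --export-dir /data/orders \
      --input-fields-terminated-by ',' \
      -m 4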
While it is not Sqoop, I am currently in the experimental phase of using Greenplum's external tables to load from HDFS. So far it seems to perform well.
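That is the gphdfs route the question mentions; a sketch, with the host, path, and table names as placeholders (gphdfs also needs one-time server configuration not shown here):

    # Expose HDFS files as an external table, then load the internal
    # table with a plain INSERT ... SELECT.
    psql -d analytics -c "
    CREATE EXTERNAL TABLE ext_orders (LIKE orders)
    LOCATION ('gphdfs://namenode:8020/data/orders/*.csv')
    FORMAT 'TEXT' (DELIMITER ',');
    INSERT INTO orders SELECT * FROM ext_orders;
    "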

Build an application for reporting and analysis on Hadoop framework

I have an application in SAS where I pull data from Oracle and produce Excel reports using Base SAS and SAS macros. The problem is that day by day my database is getting huge, and fetching data from Oracle is taking more time; as a result, my jobs are running slowly.
So I want my application to be built on Hadoop for reporting and analysis purposes. Can someone please suggest an approach and the tools I need to use for this?
The short answer is: it depends.
For unloading data from Oracle I would recommend using Sqoop (http://sqoop.apache.org/); it is designed for this specific use case, can even do incremental loads, and can create a Hive table for the unloaded data.
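The incremental part might look like the following sketch (connection, table, and column names are placeholders); Sqoop prints the next --last-value to use for the subsequent run:

    # Append only rows whose ORDER_ID is greater than the last value
    # imported; all names below are placeholders.
    sqoop import \
      --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
      --username scott --password-file /user/etl/.orapw \
      --table SALES.ORDERS \
      --target-dir /data/orders \
      --incremental append \
      --check-column ORDER_ID \
      --last-value 0 \
      -m 4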
When the data is unloaded, you can use Impala to build the report you need. Impala can natively work with Hive tables, so things are really simple. Of course, you would have to rewrite your SAS code as a set of SQL statements that run on top of Impala.
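For instance, a report query could be run through impala-shell and written out as a CSV that Excel opens directly (the table and column names are hypothetical):

    # -B produces delimited output; -o writes it to a file.
    impala-shell -B --output_delimiter=',' -o report.csv -q "
      SELECT region, SUM(amount) AS total
      FROM staging.orders
      GROUP BY region"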
Next, if you need a visualization tool to run on top of it, you can try something like Tableau or any other tool capable of using ODBC/JDBC to connect to Impala.
Finally, I think Hadoop + Sqoop + Impala would cover your needs. But I'd also recommend taking a look at MPP databases, because using SAS means you have pretty structured data, and an MPP database could be a better fit for this case.
