Is Hive, Pig or Impala used from the command line only? - hadoop

I am new to Hadoop and have this confusion. Can you please help?
Q. How are Hive, Pig and Impala used in practical projects? Are they used from the command line only, or from within Java, Scala, etc.?

One can use Hive and Pig from the command line, or run scripts written in their language.
Of course, it is possible to build and call these scripts in any way you like, so you could have a Java program build a Pig command on the fly and execute it.
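As an aside, besides building a command string for the pig client, Pig also ships a Java embedding API (PigServer). A minimal sketch, assuming the Pig jars are on the classpath; the file names are hypothetical and ExecType.LOCAL is used only to keep it self-contained:

    import java.io.IOException;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigFromJava {
        public static void main(String[] args) throws IOException {
            // ExecType.MAPREDUCE would submit to the cluster; LOCAL runs in-process.
            PigServer pig = new PigServer(ExecType.LOCAL);
            String input = args.length > 0 ? args[0] : "input.txt"; // hypothetical input path
            pig.registerQuery("lines = LOAD '" + input + "' AS (line:chararray);");
            pig.registerQuery("nonEmpty = FILTER lines BY SIZE(line) > 0;");
            pig.store("nonEmpty", "cleaned_output"); // execution is triggered by the store
        }
    }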
The Hive (and Pig) languages are typically used to talk to a Hive database. Besides this, it is also possible to talk to the Hive database over a JDBC/ODBC connection. This can be done directly from almost anywhere, so you could have a Java program open a JDBC connection to query your Hive tables.
Within the context of this answer, I believe everything I said about the Hive language also applies to Impala.
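For example, a minimal sketch of the JDBC route from Java, assuming HiveServer2 is reachable on localhost:10000, the Hive JDBC driver is on the classpath, and a hypothetical web_logs table exists (Impala exposes a similar JDBC/ODBC endpoint, typically on a different port):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // not needed with JDBC 4 auto-loading
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }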

Related

Does Hive depend on/require Hadoop?

The Hive installation guide says that Hive can be applied to an RDBMS. My question is: it sounds like Hive can exist without Hadoop, right? Is it an independent HQL engine that could work with any data source?
You can run Hive in local mode to use it without Hadoop for debugging purposes. See the URL below:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode
Hive provides a JDBC driver so you can query Hive like any other JDBC source; however, if you are planning to run Hive queries on a production system, you need the Hadoop infrastructure to be available. Hive queries are eventually converted into MapReduce jobs, and HDFS is used as the data storage for Hive tables.

Hadoop real time implementation

I would like to know how Hadoop components are used in real time.
Here are my questions:
Data importing/exporting:
I know the options available in Sqoop, but I would like to know how Sqoop is commonly used in real-time implementations.
If I'm correct:
1.1 Sqoop commands are placed in shell scripts and called from schedulers/event triggers. Can I have a real-time code example of this, specifically passing parameters to Sqoop dynamically (such as the table name) in a shell script?
1.2 I believe an Oozie workflow could also be used. Any examples, please?
Pig
How are Pig commands commonly called in real time? Via Java programs?
Any real-time code examples would be a great help.
If I am correct, Pig is commonly used for data quality checks/cleanups on staging data before loading it into the actual HDFS path or into Hive tables,
and we could see Pig scripts inside shell scripts (in real-time projects).
Please correct me or add anything I missed.
Hive
Where will we see Hive commands in real-time scenarios?
In shell scripts, or in Java API calls for reporting?
HBase
HBase commands are commonly called as API calls in languages like Java.
Am I correct?
Sorry for too many questions. I don't see any article/blog on how these components are used in real-time scenarios.
Thanks in advance.
The reason you don't see articles on the use of those components in real-time scenarios is that those components are not real-time oriented, but batch oriented.
Sqoop: not used in real time - it is batch oriented.
I would use something like Flume to ingest data.
Pig, Hive: again, not real-time ready. Both are batch oriented. The setup time of each query/script can take tens of seconds.
You can replace both with something like Spark Streaming (it even supports Flume).
HBase: it is a NoSQL database on top of HDFS. It can be used for real-time work and is quick on inserts. It can also be used from Spark.
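Since the question also asked whether HBase is usually driven through API calls from Java: yes, that is the common pattern. A minimal sketch of a single-row insert, assuming an HBase 1.x+ client on the classpath and a hypothetical 'events' table with column family 'd' that already exists:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseInsertExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("events"))) {
                Put put = new Put(Bytes.toBytes("user42#2015-06-01T12:00:00")); // row key
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("OK"));
                table.put(put); // low-latency single-row write
            }
        }
    }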
If you want to use those systems to support real-time apps, think of something like a Lambda architecture, which has a batch layer (using Hive, Pig and so on) and a speed layer using streaming/real-time technologies.
Regards.

Build an application for reporting and analysis on Hadoop framework

I have a SAS application where I pull data from Oracle and produce reports in Excel using Base SAS and SAS macros. The problem is that my database is getting huge day by day, and fetching data from Oracle is taking more and more time; as a result my jobs are running slowly.
So I want my application to be built on Hadoop for reporting and analysis purposes. Can someone please suggest an approach and the tools I need to use for this?
The short answer is: it depends.
For unloading data from Oracle I would recommend using Sqoop (http://sqoop.apache.org/); it is designed for this specific use case, can even do incremental loads, and can create a Hive table for the unloaded data.
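To make that concrete, here is a hedged sketch of driving such an import from Java (the connection details, paths and the SALES table are hypothetical; the same argument list can be run as a plain shell command from a scheduler instead):

    import java.util.Arrays;
    import java.util.List;

    public class SqoopImportLauncher {
        public static void main(String[] args) throws Exception {
            // The table name is passed in dynamically; everything else is a hypothetical placeholder.
            String table = args.length > 0 ? args[0] : "SALES";
            List<String> command = Arrays.asList(
                    "sqoop", "import",
                    "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",
                    "--username", "scott",
                    "--password-file", "/user/etl/.oracle_pwd",
                    "--table", table,
                    "--hive-import",             // create/load a Hive table for the unloaded data
                    "--incremental", "append",   // only pull rows added since the last run
                    "--check-column", "ID",
                    "--last-value", "0");
            Process p = new ProcessBuilder(command).inheritIO().start();
            System.exit(p.waitFor());
        }
    }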
Once the data is unloaded, you can use Impala to build the report you need. Impala can natively work with Hive tables, so things are really simple. Of course, you would have to rewrite your SAS code as a set of SQL statements that would run on top of Impala.
Next, if you need a visualization tool to run on top of it, you can try something like Tableau or any other tool capable of using ODBC/JDBC to connect to Impala.
Finally, I think Hadoop + Sqoop + Impala would cover your needs. But I'd also recommend taking a look at MPP databases, because using SAS means you have pretty structured data, and an MPP database might be a better fit for this case.

Can PL/SQL Reliably be Converted to Pig Latin or an Oozie Pipeline with Pig Latin and Hive

I am curious about replacing my Oracle db with Hadoop and am learning about the Hadoop ecosystem.
I have many PL/SQL scripts that would require replacement if I were to go this route.
I am under the impression that with some hard work I would be able to convert/translate any PL/SQL script into an analogous Pig Latin script. If not only Pig Latin, then a combination of Hive and Pig via Oozie.
Is this correct?
While most SQL statements can be translated into equivalent Pig and/or Hive statements, there are several limitations inherent to the Hadoop filesystem that get passed down to these languages. The primary limitation is that HDFS is a write-once, read-many system. This means that a statement such as an UPDATE or a DELETE SQL command will not work, because both would require the language to change the contents of an already existing file, which would contradict the write-once paradigm of Hadoop.
There are, however, workarounds. Both commands can be simulated by copying the file in question and making the changes while writing the copy, then deleting the original and moving the copy into the original's location. Neither Pig nor Hive has this functionality built in, so you would have to step slightly outside these languages to do it. For instance, a few lines of bash could handle the deletion and movement of the copy once the Pig script has executed; given that you can use bash to call the Pig script in the first place, this allows for a fairly simple solution. Or you could look into HBase, which provides the ability to do something similar. However, both solutions involve things outside of Pig/Hive, so if you absolutely cannot go outside those languages, the answer is no.
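The answer above suggests a few lines of bash for the final delete-and-move step; the same shuffle can also be done from Java with the HDFS FileSystem API, which keeps it inside a single workflow. A sketch with hypothetical paths, meant to run after the Pig/Hive job has written the rewritten copy:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplaceHdfsOutput {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // uses fs.defaultFS from core-site.xml
            Path original = new Path("/data/customers");             // hypothetical path being "updated"
            Path rewritten = new Path("/data/customers_rewritten");  // output of the Pig/Hive job
            fs.delete(original, true);      // drop the old copy (recursive)
            fs.rename(rewritten, original); // move the rewritten data into its place
            fs.close();
        }
    }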
You can use PL/HQL (Procedural SQL on Hadoop), an open source tool (Apache License 2.0) that implements a PL/SQL-like procedural language for Apache Hive and other SQL-on-Hadoop implementations.
The PL/HQL language is compatible to a large extent with Oracle PL/SQL, ANSI/ISO SQL/PSM (IBM DB2, MySQL, Teradata, etc.), Teradata BTEQ, PostgreSQL PL/pgSQL (Netezza), and Transact-SQL (Microsoft SQL Server and Sybase), which lets you leverage existing SQL/DWH skills and a familiar approach to implement data warehouse solutions on Hadoop. It also facilitates migration of existing business logic to Hadoop.

Hadoop Basics: What do I do with the output?

(I'm sure a similar question exists, but I haven't found the answer I'm looking for yet.)
I'm using Hadoop and Hive (for our developers with SQL familiarity) to batch process multiple terabytes of data nightly. From an input of a few hundred massive CSV files, I'm outputting four or five fairly large CSV files. Obviously, Hive stores these in HDFS. Originally these input files were extracted from a giant SQL data warehouse.
Hadoop is extremely valuable for what it does. But what's the industry standard for dealing with the output? Right now I'm using a shell script to copy these back to a local folder and upload them to another data warehouse.
This question (Hadoop and MySQL Integration) calls the practice of re-importing Hadoop exports non-standard. How do I explore my data with a BI tool, or integrate the results into my ASP.NET app? Thrift? Protobuf? Hive ODBC API Driver? There must be a better way...
Enlighten me.
At Foursquare I'm using Hive's Thrift driver to put the data into databases/spreadsheets as needed.
I maintain a job server that executes jobs via the Hive driver and then moves the output wherever it is needed. Using Thrift directly is very easy and allows you to use any programming language.
If you're dealing with Hadoop directly (and can't use this), you should check out Sqoop, built by Cloudera.
Sqoop is designed for moving data in batch (whereas Flume is designed for moving it in real time, and seems more aligned with putting data into HDFS than taking it out).
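If an export back to MySQL needs to run from inside a Java job server rather than a shell script, Sqoop 1.x can also be invoked programmatically. A hedged sketch - the table, paths and credentials are hypothetical, and it assumes the Sqoop client jars and Hadoop configuration are on the classpath:

    import org.apache.sqoop.Sqoop;

    public class ExportToMySql {
        public static void main(String[] args) {
            String[] exportArgs = {
                    "export",
                    "--connect", "jdbc:mysql://warehouse-db/reporting",
                    "--username", "etl",
                    "--password-file", "/user/etl/.mysql_pwd",
                    "--table", "nightly_summary",
                    "--export-dir", "/user/hive/warehouse/nightly_summary",
                    "--input-fields-terminated-by", "\\001" // Hive's default field delimiter (Ctrl-A)
            };
            int exitCode = Sqoop.runTool(exportArgs); // same arguments as the sqoop command line
            System.exit(exitCode);
        }
    }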
Hope that helps.
