Spark & HCatalog? - hadoop

I feel comfortable loading HCatalog data using Pig and was wondering whether it's possible to use Spark instead of Pig. Unfortunately, I'm quite new to Spark...
Can you point me to any materials on how to get started? Are there any Spark libraries to use?
Any examples? I've done all the exercises on http://spark.apache.org/ but they focus on RDDs and don't go any further.
I would be grateful for any help.
Regards
Pawel

You can use Spark SQL to read from a Hive table directly instead of going through HCatalog.
https://spark.apache.org/sql/
You can apply the same transformations as in Pig (filter, join, group by, ...) using Spark's Java, Scala, or Python API.
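A minimal PySpark sketch of that approach, assuming Spark 2.x or later built with Hive support and a Hive metastore reachable from the cluster. The table name `default.events` and the columns `status` and `category` are hypothetical placeholders, not anything from the question.

```python
def build_session(app_name="hive-table-demo"):
    """Create a SparkSession that can see tables in the Hive metastore."""
    # Imported lazily so the helper below can be read and tested without Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # lets spark.table() / spark.sql() resolve Hive tables
        .getOrCreate()
    )


def count_by_category(spark, table_name):
    """Rough equivalent of a Pig LOAD + FILTER + GROUP + COUNT pipeline."""
    df = spark.table(table_name)                 # hypothetical Hive table
    filtered = df.filter("status = 'ok'")        # hypothetical column `status`
    return filtered.groupBy("category").count()  # hypothetical column `category`


def main():
    spark = build_session()
    count_by_category(spark, "default.events").show()
    spark.stop()
```

Calling `main()` from a script run with spark-submit (or pasting the two helpers into the pyspark shell) should print one row per category with its count.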

For using the HCatalog InputFormat wrapper with Spark, see the following gist, which was written before Spark SQL existed.
https://gist.github.com/granturing/7201912

Our systems have both installed, and we can use either. Spark takes on the traits of the language you use it from (Scala, Python, ...). For example, when using Spark with Python you can use many Python libraries within Spark.

Related

running word-counter example with hadoop and hbase

word-counter example with hbase and hadoop
I am new to Hadoop and HBase. I want to implement a real example on a data set to understand the logic behind them.
I have already installed Hadoop and HBase on my system (Ubuntu 17.04):
hadoop-2.8.0
hbase-1.3.1
Is there a step-by-step tutorial for implementing a word-count example
(or any other basic example)?
There is a comprehensive tutorial in the HBase reference guide:
http://hbase.apache.org/book.html#mapreduce.example
Note that the reference guide also describes Cascading, an alternative API for MapReduce. It still runs on MapReduce underneath but lets you write the code in a simplified way.

Are Hive, Pig or Impala used from the command line only?

I am new to Hadoop and am confused about this. Can you please help?
Q. How are Hive, Pig or Impala used in practical projects? Are they used from the command line only, or also from within Java, Scala, etc.?
One can use Hive and Pig interactively from the command line, or run scripts written in their languages.
Of course, you can call (or build) these scripts any way you like, so you could have a Java program build a Pig command on the fly and execute it.
The Hive (and Pig) languages are typically used to talk to a Hive database. Besides this, it is also possible to talk to the Hive database over a JDBC/ODBC connection. This can be done from anywhere, so you could have a Java program open a JDBC connection to talk to your Hive tables.
Within the context of this answer, I believe everything I said about the Hive language also applies to Impala.
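As a rough illustration of building a command on the fly, here is a sketch in Python rather than Java (the same idea carries over to any language that can start a process). It assumes the `pig` and `hive` command-line clients are on the PATH; the script name and parameter names in the usage note are hypothetical. The JDBC route is not shown here.

```python
import subprocess


def pig_command(script_path, params):
    """Build a `pig` invocation that passes each parameter via -param."""
    cmd = ["pig"]
    for name, value in sorted(params.items()):
        cmd += ["-param", f"{name}={value}"]
    cmd += ["-f", script_path]
    return cmd


def hive_command(query):
    """Build a `hive` invocation that runs a single query string via -e."""
    return ["hive", "-e", query]


def run(cmd):
    """Run the command and return its stdout; raises if the exit code is non-zero."""
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return completed.stdout
```

For example, `run(pig_command("clean.pig", {"input": "/data/staging"}))` would start Pig with `-param input=/data/staging -f clean.pig`. Passing the arguments as a list avoids shell quoting problems when values contain spaces.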

Hive on Spark in MapR Distribution

We are currently working with Hive, which by default uses MapReduce as the processing framework in our MapR cluster. We now want to switch from MapReduce to Spark for better performance. As I understand it, we need to set hive.execution.engine=spark.
My question is: is Hive on Spark currently supported by MapR? If yes, what configuration changes do we need to make?
Your help is very much appreciated. Thanks
No, MapR (5.2) doesn't support that. From their docs,
MapR does not support Hive on Spark. Therefore, you cannot use Spark as an execution engine for Hive. However, you can run Hive and Spark on the same cluster. You can also use Spark SQL and Drill to query Hive tables.
Cheers.
I know and understand that your question is about using Spark as the data processing engine for Hive, and as you can see in the other answers, it is not officially supported by MapR today.
However, if your goal is to make Hive faster and not use MapReduce, you can switch to Tez. For this, install MEP 3.0.
See: http://maprdocs.mapr.com/home/Hive/HiveandTez.html

Hadoop real time implementation

I would like to know how Hadoop components are used in real time.
Here are my questions:
Data import/export:
I know the options available in Sqoop, but I would like to know how Sqoop is commonly used in real time implementations.
If I'm correct:
1.1 Sqoop commands are placed in a shell script and called from schedulers/event triggers. Can I have a real time code example of this, specifically passing parameters to Sqoop dynamically (such as the table name) in a shell script?
1.2 I believe an Oozie workflow could also be used. Any examples, please?
Pig
How are Pig commands commonly called in real time? Via Java programs?
Any real time code examples would be a great help.
If I am correct, Pig is commonly used for data quality checks/cleanups on staging data before loading it into the actual HDFS path or into Hive tables,
and we could see Pig scripts called from shell scripts (in real time projects).
Please correct me or add anything I missed.
Hive
Where will we see Hive commands in real time scenarios?
In shell scripts, or in Java API calls for reporting?
HBase
HBase commands are commonly called as API calls from languages like Java.
Am I correct?
Sorry for so many questions. I don't see any articles/blogs on how these components are used in real time scenarios.
Thanks in advance.
The reason you don't see articles on using those components in realtime scenarios is that they are not realtime oriented but batch oriented.
Sqoop: not used in realtime; it is batch oriented.
I would use something like Flume to ingest data.
Pig, Hive: again, not realtime ready. Both are batch oriented. The setup time of each query/script can be tens of seconds.
You can replace both with something like Spark Streaming (it even supports Flume).
HBase: a NoSQL database on top of HDFS. It can be used for realtime and is quick on inserts. It can be used from Spark.
If you want to use those systems to support realtime apps, think of something like a Lambda architecture, which has a batch layer (using Hive, Pig and so on) and a speed layer using streaming/realtime technologies.
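On the batch side, the scheduler-driven Sqoop call from point 1.1 of the question could be sketched roughly as follows. This is written in Python instead of a shell script, and the JDBC URL, user name, password file and target directory in the usage note are hypothetical values you would replace with your own.

```python
import subprocess


def sqoop_import_command(table, jdbc_url, username, password_file, target_root):
    """Build a `sqoop import` command for one table, chosen at call time."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # avoids putting the password on the command line
        "--table", table,
        "--target-dir", f"{target_root.rstrip('/')}/{table}",
        "--num-mappers", "1",
    ]


def import_table(table, **connection):
    """Run the import as a blocking batch step; raises if Sqoop exits non-zero."""
    subprocess.run(sqoop_import_command(table, **connection), check=True)
```

A scheduler (cron, Oozie, or similar) could then call something like `import_table("customers", jdbc_url="jdbc:mysql://dbhost/sales", username="etl", password_file="/user/etl/.pw", target_root="/data/staging")`, passing a different table name on each run.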
Regards.

Can I use hadoop in Jupyter/IPython

Can I use Hadoop & MapReduce in Jupyter/IPython? Is there something similar to what PySpark is for Spark?
Of course you can. There are many frameworks, such as Hadoop Streaming, mrjob and dumbo, to name a few. The technical side of using these in Jupyter should consist of either subprocess.Popen() calls or ordinary Python imports, depending on the framework.
A nice overview/critique of some of these frameworks can be found in this Cloudera blog post.
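To make the Hadoop Streaming option a bit more concrete, here is a small word-count sketch that can be tried in a notebook cell. The map and reduce logic runs in-process on a Python list, and `streaming_command` only builds the argument list you would hand to subprocess.Popen(). The jar location, HDFS paths and script names are assumptions that depend on your installation, and the stdin/stdout wrapper scripts that Hadoop Streaming would actually execute are not shown.

```python
from itertools import groupby


def mapper(lines):
    """Emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum counts per word; expects pairs sorted by word, as Hadoop delivers them."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_locally(lines):
    """Simulate map -> sort -> reduce in-process, e.g. to test logic in a notebook."""
    return dict(reducer(sorted(mapper(lines))))


def streaming_command(streaming_jar, input_path, output_path, mapper_script, reducer_script):
    """Build a Hadoop Streaming invocation; all paths are supplied by the caller."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -files must come before the streaming options.
        "-files", f"{mapper_script},{reducer_script}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper_script,
        "-reducer", reducer_script,
    ]
```

Testing with `run_locally(["Hello world", "hello Hadoop"])` first keeps the feedback loop short before submitting anything to the cluster.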

Resources