Apache Impala metadata - separate from Hive or is it the same? - hadoop

Pretty new to Hadoop ecosystem services here, starting to learn stuff bit by bit.
I tried to find out about Apache Impala and its metadata store. The documentation says there is a daemon specifically for it and describes how metadata is gathered for each table. Other sources say that the Hive metastore is in fact also the Impala metadata store.
I can't find a single source of best practices or any user-guide-style document for handling metadata in Impala. Is that because the metadata store is in fact the Hive store?
Just generally confused by the lack of information out there.

Related

What is the relationship between Spark, Hadoop and Cassandra

My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have SQL-style interfaces. However, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL, assuming that this is a brand new project with no existing installation?
Spark is a distributed in-memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses HDFS (Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS and to save results in HDFS.
For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, Hbase, a Cassandra database, etc. Once loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
In terms of it being an alternative to Hadoop, Spark can be much faster than Hadoop at certain operations. For example, a multi-pass MapReduce operation can be dramatically faster in Spark than with Hadoop MapReduce, since most of Hadoop's disk I/O is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
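To make the multi-pass point concrete, here is a toy Python model. It is not Spark or Hadoop code: `CountingStore`, `run_disk_per_pass` and `run_in_memory` are hypothetical names, and the assumption that every MapReduce-style pass reads its input from storage and writes its output back is a deliberate simplification. It only counts storage round trips.

```python
from typing import Callable, List, Sequence


class CountingStore:
    """Stand-in for durable storage (e.g. HDFS) that counts reads and writes."""

    def __init__(self, data: Sequence[int]) -> None:
        self.data: List[int] = list(data)
        self.reads = 0
        self.writes = 0

    def read(self) -> List[int]:
        self.reads += 1
        return list(self.data)

    def write(self, data: Sequence[int]) -> None:
        self.writes += 1
        self.data = list(data)


def run_disk_per_pass(store: CountingStore,
                      passes: Sequence[Callable[[int], int]]) -> List[int]:
    """MapReduce-style: every pass reads from storage and writes back."""
    for step in passes:
        store.write([step(x) for x in store.read()])
    return store.data


def run_in_memory(store: CountingStore,
                  passes: Sequence[Callable[[int], int]]) -> List[int]:
    """Spark-style: load once, chain all passes in memory, write once."""
    data = store.read()
    for step in passes:
        data = [step(x) for x in data]
    store.write(data)
    return store.data
```

With three passes, the first runner touches storage six times while the second touches it twice, and both produce the same result. Real jobs differ in many other ways (shuffles, caching, spilling), so treat this only as an illustration of where the I/O savings come from.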
Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature rich query language and allows you to do data analytics that native CQL doesn't provide.
Another use case for Spark is for stream processing. Spark can be set up to ingest incoming real time data and process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.
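As a rough illustration of what "micro-batches" means, the following plain Python sketch groups a time-ordered stream of events into fixed-length windows. It does not use the Spark API; the function name `micro_batches`, the `(timestamp, payload)` event shape and the choice to skip empty windows are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event],
                  interval: float) -> Iterator[List[Event]]:
    """Group time-ordered events into consecutive windows of `interval` seconds.

    The first window starts at the first event's timestamp. Windows that
    contain no events are skipped rather than emitted as empty batches.
    """
    batch: List[Event] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + interval
        if ts >= window_end:
            # The current window is closed: hand it over for processing.
            yield batch
            batch = []
            # Advance the window until it contains this event.
            while ts >= window_end:
                window_end += interval
        batch.append((ts, payload))
    if batch:
        yield batch
```

Each yielded batch would then be processed like a small static data set and its result written to durable storage, which is the pattern the paragraph above describes.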
So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.
I'm writing a paper about Hadoop for university and stumbled over your question. Spark uses Hadoop only for persistence, and only if you want it to. It can also be used with other persistence tiers, such as Amazon S3.
On the other hand, Spark runs in memory and is not primarily built for MapReduce use cases the way Hadoop was/is.
I can recommend this article, if you like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
The README.md file in Spark can solve your puzzle:
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.

When to use Hcatalog and what are its benefits

I'm new to HCatalog (HCAT). We would like to know in which use cases or scenarios HCAT is used, what the benefits of using it are, and whether any performance improvement can be gained from it. Can anyone provide information on when to use HCatalog?
Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache Map/Reduce, and Apache Hive – to more easily read and write data on the grid.
HCatalog creates a table abstraction layer over data stored on an HDFS cluster. This table abstraction layer presents the data in a familiar relational format and makes it easier to read and write data using familiar query language concepts.
HCatalog data structures are defined using Hive's data definition language (DDL) and the Hive metastore stores the HCatalog data structures. Using the command-line interface (CLI), users can create, alter, and drop tables. Tables are organized into databases or are placed in the default database if none are defined for the table. Once tables are created, you can explore the metadata of the tables using commands such as Show Table and Describe Table.
HCatalog commands are the same as Hive's DDL commands.
HCatalog ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.
HCatalog opens up the Hive metadata to other Map/Reduce tools. Every Map/Reduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). Tools supported by HCatalog do not need to care about where the data is stored or in which format.
It assists integration with other tools and supplies read and write interfaces for Pig, Hive and Map/Reduce.
It provides a shared schema and data types for Hadoop tools. You do not have to explicitly type the data structures in each program.
It exposes the information through a REST interface for external data access.
It also integrates with Sqoop, which is a tool designed to transfer data back and forth between Hadoop and relational databases such as SQL Server and Oracle.
It provides APIs and a web service wrapper for accessing metadata in the Hive metastore.
HCatalog also exposes a REST interface so that you can create custom tools and applications to interact with Hadoop data structures.
This allows us to use the right tool for the right job. For example, we can load data into Hadoop using HCatalog, perform some ETL on the data using Pig, and then aggregate the data using Hive. After the processing, you could then send the data to your data warehouse housed in SQL Server using Sqoop. You can even automate the process using Oozie.
How it works:
Pig- HCatLoader and HCatStorer interface
Map/Reduce- HCatInputFormat and HCatOutputFormat interface
Hive- No Interface Necessary. Direct access to metadata
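The shared-schema idea can be pictured with a small Python sketch. This is not the HCatalog API: `ToyMetastore`, `TableInfo` and `parse_row` are hypothetical names, and the table name, path and columns are made up. The point is only that a table is defined once and every tool looks up location, format and column types by table name instead of declaring them again.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class TableInfo:
    """What a shared catalog records about one table (simplified)."""
    location: str                                     # where the files live
    delimiter: str                                    # how fields are separated
    columns: Tuple[Tuple[str, Callable[[str], object]], ...]  # (name, caster)


class ToyMetastore:
    """In-memory stand-in for a shared metadata store."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableInfo] = {}

    def create_table(self, name: str, info: TableInfo) -> None:
        self._tables[name] = info

    def describe(self, name: str) -> TableInfo:
        return self._tables[name]


def parse_row(metastore: ToyMetastore, table: str, line: str) -> Dict[str, object]:
    """Any tool can decode a raw line knowing only the table name."""
    info = metastore.describe(table)
    values = line.split(info.delimiter)
    return {name: cast(value) for (name, cast), value in zip(info.columns, values)}
```

A Pig-like script and a Map/Reduce-like job could both call `parse_row(metastore, "web_logs", line)` without either of them hard-coding the delimiter, the path or the column types.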
References:
Microsoft Big Data Solution
http://hortonworks.com/hadoop/hcatalog/
Answer to your question:
As I described earlier, HCatalog provides a shared schema and data types for Hadoop tools. It simplifies your work during data processing. If you have created a table using HCatalog, you can directly access that Hive table through Pig or Map/Reduce (without HCatalog, you cannot simply access a Hive table through Pig or Map/Reduce). You don't need to create a schema for every tool.
If you are working with shared data that is used by multiple users (some teams using Hive, some using Pig, some using Map/Reduce), then HCatalog will be useful, as they need only the table to access the data for processing.
It is not a replacement for any tool. It is a facility that gives many tools a single point of access.
Performance depends on your Hadoop cluster. You should do some benchmarking on your Hadoop cluster to measure performance.

Difference between MapR-DB and Hbase

I am a bit new to MapR, but I am familiar with HBase. I was going through a video where I found that MapR-DB is a NoSQL DB in MapR and that it is similar to HBase. In addition, HBase can also be run on MapR. I am confused between MapR-DB and HBase. What is the exact difference between them?
When should I use MapR-DB and when should I use HBase?
Basically, I have some Java code that does a bulk load into HBase on MapR. If I use the same code that I have used for Apache Hadoop, will that code work here?
Please help me clear up this confusion.
They are both NOSQL, wide column stores.
HBase is open source and can be installed as a part of a Hadoop installation.
MapR-DB is a proprietary (not open source) NOSQL database that MapR offers. A core difference that MapR will detail with MapR-DB (along with their file system; they do not use HDFS) is that MapR-DB offers significant performance and scalability improvements over HBase (unlimited tables, columns, and a re-architecture, to name a few).
MapR maintains that you can use MapR-DB or HBase interchangeably. I suggest testing on both extensively before committing to one vs the other. You also need to realize that MapR-DB is MapR's proprietary NOSQL HBase equivalent and if you require support for MapR-DB you'll have to get that from MapR (HBase support can come from any of the other Hadoop distributions as well as the open source community).
Some links you should look at:
http://www.theregister.co.uk/2013/05/01/mapr_hadoop_m7_edition_solr/
https://www.mapr.com/blog/get-real-hadoop-enterprise-grade-nosql#.VVfHuvlVhBc
They are similar but not the same. MapR claims that MapR-DB is faster and more efficient because they have migrated the critical functionality to native C/C++ code while keeping the interface the same. But at the end of the day MapR-DB is proprietary, and you depend on MapR's support for anything that is done differently than in HBase. I didn't like MapR-DB because it is not compatible with Apache Phoenix, the SQL way of accessing HBase-style NoSQL databases (HBase coprocessors are not present in MapR-DB).
Limitations that I have taken from the MapR documentation:
Custom HBase filters are not supported.
User permissions for column families are not supported. User permissions for tables and columns are supported.
HBase authentication is not supported.
HBase replication is handled with Mirror Volumes.
Bulk loads using the HFiles workaround are not supported and not necessary.
HBase coprocessors are not supported.
Filters use a different regular expression library.
So I second the previous answer: try out your solution on both (MapR-DB and HBase) before going too far. I didn't like the very idea of MapR-DB because it is proprietary and the code is not open source. If a Hadoop distributor enhances Hadoop, it should also make the enhancement available to the open source community. Why should one rely totally on commercial support when using open source?

Confusion in Apache Nutch, HBase, Hadoop, Solr, Gora

I am new to all these terms and have been given some time to understand them, but I have some confusion about them. Please correct me if I am wrong.
Nutch: It's for web crawling; using it we can crawl web pages. We can store these web pages somewhere in a DB.
Solr: Solr can be used for indexing web pages crawled by Apache Nutch. It helps in searching the indexed web pages.
HBase: It's used as an interface to interact with Hadoop. It helps in getting data in real time from HDFS. It provides a simple SQL-type interface for interacting.
Hadoop: It provides two functionalities: one is HDFS (Hadoop data file system) and the other is Map-Reduce functionality taken from Google algorithms. It's basically used for offline data backup etc.
Gora and ZooKeeper: I am not sure about these.
Confusions:
1). Is HBase a key-value pair DB or just an interface to Hadoop? Or, I should ask, can HBase exist without Hadoop?
If yes, can you explain a bit more about its usage?
2). Is there any use in crawling data using Apache Nutch without indexing it into Solr?
3). For running Apache Nutch, do we need HBase and Hadoop? If not, how can we make it work without them?
4). Is Hadoop part of HBase?
Here is a good short discussion of HBase vs. Hadoop: Difference between HBase and Hadoop/HDFS
Because HBase is built on top of Hadoop you can't really have HBase without Hadoop.
Yes you can run Nutch without Solr; there do not seem to be lots of use cases, however, much less living examples in the wild.
Yes, you can run Nutch without Hadoop, but again there don't seem to be a lot of real-world examples of people doing this.
Yes Hadoop is part of HBase, in that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
Zookeeper is used for configuration, naming, synchronization, etc. in Hadoop stack workflows. Gora is a memory management/persistence framework and is built on top of Hadoop.

What to use.. Impala on HDFS, or Impala on Hbase or just the Hbase?

I am working on Proof of Concept task.
The task is to implement a feature of our product using Hadoop technology.
The feature is quite simple: we have a UI which lets you insert details about a "Network Issue".
All details about such an issue are captured and inserted into a table in an Oracle DB.
We then process the data in this table and calculate a Health Score.
I have to use Hadoop instead of a traditional DB, so my question is: what should I go for?
Impala on HDFS? or
Impala on Hbase ? or
Hbase?
I am using a Cloudera VM for the POC implementation.
As per my understanding, HBase is a NoSQL distributed database, which is actually a layer on HDFS and which provides Java APIs to access data.
Impala is a tool which also provides JDBC access to data over HBase or directly over HDFS.
I am very new to Hadoop. Can someone please help?
Well, it depends on several things, like the kind of processing you are going to perform, the desired response time, etc. But judging by what you have written here, HBase seems to be fine. I don't see any need for Impala as of now. The HBase API is good and will serve most of your needs.
IMHO, it's better to keep things simple initially and add a tool only if it is really required. The same holds here. If you reach a point where you find that the HBase API is not able to serve the purpose, you could definitely add Impala to your stack.
That being said, there is one thing which you should keep in mind. HBase is a NoSQL DB and doesn't follow RDBMS conventions and terminologies. So, you might find it a bit strange initially. It's better to keep this in mind and then proceed as you have to design the schema in a way which is totally different from the RDBMS style of schema design.
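To give a feel for how different that design is, here is a small Python sketch of one common HBase pattern: a composite row key with a reversed timestamp, so that the newest record for a given entity sorts first. It does not talk to HBase; the names `row_key` and `to_cells`, the `device_id` and `severity` fields and the single column family `d` are hypothetical, chosen only to resemble the "Network Issue" record from the question.

```python
from typing import Dict

MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def row_key(device_id: str, ts_millis: int) -> bytes:
    """Build '<device_id>#<reversed timestamp>' as a byte string.

    HBase sorts rows by key bytes, so subtracting the timestamp from
    MAX_LONG and zero-padding it makes newer issues sort before older ones.
    """
    reversed_ts = MAX_LONG - ts_millis
    return f"{device_id}#{reversed_ts:019d}".encode("utf-8")


def to_cells(issue: Dict[str, object], family: str = "d") -> Dict[bytes, bytes]:
    """Flatten a record into 'family:qualifier' -> value byte pairs."""
    return {
        f"{family}:{name}".encode("utf-8"): str(value).encode("utf-8")
        for name, value in issue.items()
    }
```

In an RDBMS you would model the issue as a row with typed columns and add indexes later; here the access pattern ("latest issues for one device") is baked into the key itself, which is why the schema has to be designed around the queries you expect to run.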
