how to set spark RDD StorageLevel in hive on spark? - hadoop

In my hive on spark job , I get this error :
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
thanks for this answer (Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?) , I know it may be my hiveonspark job has the same problem
since hive translates sql to a hiveonspark job, I don't how to set it in hive to make its hiveonspark job change from StorageLevel.MEMORY_ONLY to StorageLevel.MEMORY_AND_DISK ?
thanks for you help~~~~

You can use CACHE/UNCACHE [LAZY] Table <table_name> to manage caching. More details.
If you are using DataFrame's then you can use the persist(...) to specify the StorageLevel. Look at API here..
In addition to setting the storage level, you can optimize other things as well. SparkSQL uses a different caching mechanism called Columnar storage which is a more efficient way of caching data (as SparkSQL is schema aware). There are different set of config properties that can be tuned to manage caching as described in detail here (THis is latest version documentation. Refer to the documentation of version you are using).
spark.sql.inMemoryColumnarStorage.compressed
spark.sql.inMemoryColumnarStorage.batchSize

Related

Impala vs Hive. How Impala circumvents MapReduce?

How is Impala able to achieve lower latency than Hive in query processing?
I was going through http://impala.apache.org/overview.html, where it is stated:
To avoid latency, Impala circumvents MapReduce to directly access the
data through a specialized distributed query engine that is very
similar to those found in commercial parallel RDBMSs. The result is
order-of-magnitude faster performance than Hive, depending on the type
of query and configuration.
How Impala fetches the data without MapReduce (as in Hive)?
Can we say that Impala is closer to HBase and should be compared with HBase instead of comparing with Hive?
Edit:
Or can we say that as classically, Hive is on top of MapReduce and does require less memory to work on while Impala does everything in memory and hence it requires more memory to work by having the data already being cached in memory and acted upon on request?
Just read Impala Architecture and Components
Impala is a massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts.... Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries.
It circumvents MapReduce containers by having a long running daemon on every node that is able to accept query requests. There is no singular point of failure that handles requests like HiveServer2; all impala engines are able to immediately respond to query requests rather than queueing up MapReduce YARN containers.
Impala however does rely on the Hive Metastore service because it is just a useful service for mapping out metadata stored in the RDBMS to the Hadoop filesystem. Pig, Spark, PrestoDB, and other query engines also share the Hive Metastore without communicating though HiveServer.
Data is not "already cached" in Impala. Similar to Spark, you must read the data into a large portion of memory in order for operations to be quick. Unlike Spark, the daemons and statestore services remain active for handling subsequent queries.
Impala can query HBase, but it is not similar in architecture and in my experience, a well designed HBase table is faster to query than Impala. Impala is probably closer to Kudu.
Also worth mentioning that it's not really recommended to use MapReduce Hive anymore. Tez is far better, and Hortonworks states Hive LLAP is better than Impala, although as you quoted, it largely "depends on the type of query and configuration."
Impala use "Impala Daemon" service to read data directly from the dataNode (it must be installed with the same hosts of dataNode) .he cache only the location of files and some statistics in memory not the data itself.
that why impala can't read new files created within the table . you must invalidate or refresh (depend on your case) to tell impala to cache the new files and be able to read them directly
since impala is in memory , you need to have enough memory for the data read by the query , if you query will use more data than your memory (complexe query with aggregation on huge tables),use hive with spark engine not the default map reduce
set hive.execution.engine=spark; just before the query
you can use the same query in hive with spark engine
impala is cloudera product , you won't find it for hortonworks and MapR (or others) .
Tez is not included with cloudera for exemple.
it all depends on the platform you are using

Hive - How to know which execution engine I am currently using

I want to automate my hive ETL workflow in such a way
that I need to execute hive jobs on the basis of execution engine (Tez
or MR) because of memory constraints.
Would you please help, as I wanted to cross-check in-between of my whole work-flow which execution engine currently I'm dealing with.
Thanks in advance.
The Hive execution engine is controlled by hive.execution.engine property. It can be either of the following:
mr (Map Reduce, default)
tez (Tez execution, for Hadoop 2 only)
spark (Spark execution, for Hive 1.1.0 onward).
The property can be read & updated using hive/beeline cli
For reading - SET hive.execution.engine;
For updating - SET hive.execution.engine=tez;
If you want to programmatically get this value out, you must go for HiveClient which supports multiple ways like JDBC, Java, Python, PHP, Ruby, C++, etc.
References
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903061#ConfigurationProperties-hive.execution.engine
https://cwiki.apache.org/confluence/display/Hive/HiveClient

Spark on Parquet vs Spark on Hive(Parquet format)

Our use case is a narrow table(15 fields) but large processing against the whole dataset(billions of rows). I am wondering what combination provides better performance:
env: CDH5.8 / spark 2.0
Spark on Hive tables(as format of parquet)
Spark on row files(parquet)
Without additional context of your specific product and usecase - I'd vote for SparkSql on Hive tables for two reasons:
sparksql is usually better than core spark since databricks wrote different optimizations in sparksql, which is higher abstaction and gives ability to optimize code(read about Project Tungsten). In some cases manually written spark core code will be better, but it demands from the programmer deep understanding of the internals. In addition sparksql sometimes is limited and doesn't permit you to control low-level mechanisms, but you can always fallback to work with core rdd.
hive and not files - I'm assuming hive with external metastore. Metastore saves definitions of partitions of your "tables"(in files it could be some directory). This is one of the most important parts for the good performance. I.e. when working with files spark will need to load this info(which could be time consuming - e.g. s3 list operation is very slow). So metastore permits spark to fetch this info in simple and fast way
There's only two options here. Spark on files, or Spark on Hive. SparkSQL works on both, and you should prefer to use the Dataset API, not RDD
If you can define the Dataset schema yourself, Spark reading the raw HDFS files will be faster because you're bypassing the extra hop to the Hive Metastore.
When I did a simple test myself years ago (with Spark 1.3), I noticed that extracting 100000 rows as a CSV file was orders of magnitude faster than a SparkSQL Hive query with the same LIMIT

Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes. Vertica is co-located on cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
Right now what I did is using HDFS Connector, basically create an external table in Vertica, then link it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM "hdfs://hostname/...../data" PARQUET;
Since the data size is big. This method will not achieve good performance.
I have done some research, Vertica Hadoop Integration
I have tried HCatalog but there's some configuration error on my Hadoop so that's not working.
My use case is to not change data format on HDFS(Parquet), while query it using Vertica. Any ideas on how to do that?
EDIT: The only reason Vertica got slow performance is because it cant use Parquet's partitions. With higher version Vertica(8+), it can utlize hive's metadata now. So no HCatalog needed.
Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');

Hbase in comparison with Hive

Im trying to get a clear understanding on HBASE.
Hive:- It just create a Tabular Structure for the Underlying Files in
HDFS. So that we can enable the user to have Querying Abilities on the
HDFS file. Correct me if im wrong here?
Hbase- Again, we have create a Similar table Structure, But bit more
in Structured way( Column Oriented) again over HDFS File system.
aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce.
Also is that true that we cant create a Hbase table over an Already existing HDFS file?
Hive shares a very similar structures to traditional RDBMS (But Not all), HQL syntax is almost similar to SQL which is good for Database Programmer from learning perspective where as HBase is completely diffrent in the sense that it can be queried only on the basis of its Row Key.
If you want to design a table in RDBMS, you will be following a structured approach in defining columns concentrating more on attributes, while in Hbase the complete design is concentrated around the data, So depending on the type of query to be used we can design a table in Hbase also the columns will be dynamic and will be changing at Runtime (core feature of NoSQL)
You said aren't they both Same considering the type of job they does. except that Hive runs on Mapredeuce .This is not a simple thinking.Because when a hive query is executed, a mapreduce job will be created and triggered.Depending upon data size and complexity it may consume time, since for each mapreduce job, there are some number of steps to do by JobTracker, initializing tasks like maps,combine,shufflesort, reduce etc.
But in case we access HBase, it directly lookup the data they indexed based on specified Scan or Get parameters. Means it just act as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond a SQL way to write map/reduce jobs and with what HortonWorks calls the "stinger initiative" they have added a dedicated file format (Orc) and import Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. relatively fast way to run analytics queries for data stored on Hadoop)
Hive:
It's just create a Tabular Structure for the Underlying Files in HDFS. So that we can enable the user to have SQL-like Querying Abilities on existing HDFS files - with typical latency up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages
stored while crawling the Internet.
The row key is the reversed URL of the pageā€”for example, org.hbase.www. There is a
column family storing the actual HTML code, the contents family, as well as others
like anchor, which is used to store outgoing links, another one to store inbound links,
and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies
of the HTML, and is helpful when you want to analyze how often a page changes, for
example. The timestamps used are the actual times when they were fetched from the
crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows to run HBase on an existing Hadoop cluster, it guarantees redundant storage of data; but it is not a feature in any other sense.
Also is that true that we cant create a Hbase table over an already
existing HDFS file?
No, it's NOT true. Internally HBase stores data in its HFile format.

Resources