Invalidate metadata / refresh Impala from Spark code - hadoop

I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.
Currently this invalidation is done after my Spark code has run.
I would like to speed things up by doing this refresh/invalidate directly from my Spark code.
What would be the most efficient approach?
Oozie is just too slow (a 30-second overhead? No, thanks).
An SSH action to an (edge) node seems like a valid solution but feels "hackish".
I don't see a way to do this from the Hive context in Spark either.

REFRESH and INVALIDATE METADATA commands are specific to Impala.
You must be connected to an Impala daemon to be able to run these, which trigger a refresh of the Impala-specific metadata cache (in your case you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE to rebuild the list of all partitions and all their files from scratch).
You could use the Spark SqlContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:
download the latest Cloudera JDBC driver for Impala
install it on the server where you run your Spark job
list all the JARs in your *.*.extraClassPath properties (i.e. spark.driver.extraClassPath and spark.executor.extraClassPath)
develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way
Hopefully Google will find some examples of JDBC/Scala code such as this one.
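In the meantime, here is a minimal, hedged Scala sketch of that last step (the driver class name com.cloudera.impala.jdbc41.Driver, the default daemon port 21050, and the plain connection string are assumptions based on the Cloudera driver; adjust them to your driver version and security setup):

import java.sql.DriverManager

def refreshImpalaTable(host: String, db: String, table: String): Unit = {
  // Assumed class name for the Cloudera JDBC 4.1 driver; adjust to your driver version.
  Class.forName("com.cloudera.impala.jdbc41.Driver")
  // 21050 is the usual Impala daemon JDBC port (assumption; check your cluster and auth options).
  val conn = DriverManager.getConnection(s"jdbc:impala://$host:21050/$db")
  try {
    val stmt = conn.createStatement()
    stmt.execute(s"REFRESH $db.$table")   // or "INVALIDATE METADATA ..." when the table is brand new
    stmt.close()
  } finally {
    conn.close()
  }
}

Calling something like refreshImpalaTable("impala-daemon.example.com", "somedb", "sometable") right after the Spark write completes keeps the Impala-side latency down to a single JDBC round trip.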

It seems this has been addressed as of Impala 3.3.0 (cf. the section "Metadata Performance Improvements" here):
Automatic invalidation of metadata
With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.
In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:
INSERT into tables and partitions from Impala or from Spark on the same or multiple cluster configuration
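As a hedged sketch of what enabling this looks like (the flag and property names below are my reading of the Impala 3.3 / Hive documentation and should be verified against your distribution):

# catalogd/impalad startup flag: a non-zero polling interval enables HMS event-based metadata sync
--hms_event_polling_interval_s=1

# hive-site.xml: make Hive fire INSERT events to the metastore notification log
hive.metastore.dml.events=true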

All of the above steps are not required; you can write the code below to run an INVALIDATE METADATA query against the Impala table.

import subprocess  # needed to actually run the shell command

impala_node_ip_address = "XX.XX.XX.XX"
impala_query = 'impala-shell -i "' + str(impala_node_ip_address) + '" -k -q "invalidate metadata DBNAME.TableName"'
subprocess.call(impala_query, shell=True)  # runs impala-shell against the given daemon node

Related

How does Hive manage non-Tez and non-MapReduce based queries?

Create table t1(id int)
I was running the above query on Hive 2.3.6 (MapR Hadoop distribution 6.3.0).
The default Hive execution engine was Tez.
After running the query, I was not able to see any Tez application launched on the YARN ResourceManager web UI.
So I changed the execution engine to MapReduce:
set hive.execution.engine=mr
and ran the same query again.
Again, I was not able to see any MR application launched on the YARN ResourceManager web UI.
So my questions are: how does Hive handle such queries?
And where are the details of these queries stored (application ID, start time, and so on)?
CREATE TABLE is a metadata-only operation; no data is being processed. It creates records in the metastore database, so no distributed processing framework like Tez or MR is necessary and YARN is not used.
The compiler translates DDL into metastore-only operations whenever possible.
Some simple DQL queries can also be executed as metastore-only queries, without using Tez or MR, if statistics exist and this feature is enabled: https://stackoverflow.com/a/41021682/2700344.
Small tables can also be queried without a distributed framework, using a fetch-only task; see: Why is Fetch task in Hive works faster than Map-only task?
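To illustrate the statistics point, here is a small, hedged Scala/JDBC sketch against HiveServer2 (the host hs2-host, port 10000, and credentials are placeholders; hive.compute.query.using.stats is the standard Hive property the linked answer refers to):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")          // Hive JDBC driver must be on the classpath
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "hive", "")
val stmt = conn.createStatement()
stmt.execute("create table t1(id int)")                   // metadata-only: no Tez/MR/YARN application
stmt.execute("set hive.compute.query.using.stats=true")   // allow answering simple aggregates from stats
stmt.execute("analyze table t1 compute statistics")       // writes basic stats to the metastore
val rs = stmt.executeQuery("select count(*) from t1")     // with stats present, can be served without a YARN job
while (rs.next()) println(rs.getLong(1))
conn.close()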

Is Hive and Impala integration possible?

Is Hive and Impala integration possible?
After processing data in Hive, I want to store the result data in Impala for better read performance. Is that possible?
If yes, can you please share an example?
Neither Hive nor Impala stores any data itself. The data is stored in an HDFS location, and both Hive and Impala are used just to visualize/transform the data present in HDFS.
So yes, you can process the data using Hive and then read it using Impala, provided both of them have been set up properly. But since Impala's metadata cache needs to be refreshed, you need to run the INVALIDATE METADATA or REFRESH commands.
Impala uses the Hive metastore to read the data. Once you have a table created in Hive, it is possible to read and query it using Impala. All you need is to REFRESH the table or trigger INVALIDATE METADATA in Impala to read the data.
Hope this helps :)
Hive and Impala are two different query engines, each unique in terms of its architecture as well as its performance. We can use the Hive metastore for metadata while running queries with Impala. A common use case is to connect to Impala/Hive from Tableau. If we are querying Hive from Tableau, we get the latest data without any workaround: if we keep loading data continuously, the metadata is updated as well. Impala, however, is not aware of those changes, so we should run a metadata invalidation query against impalad to refresh its state and sync it with the latest information available in the metastore. That way users get the same results as with Hive when they run the same query from Tableau using the Impala engine.
There is currently no configuration parameter available to run this invalidation query periodically. This blog explains how to execute the metadata invalidation query periodically through the Oozie scheduler to handle such problems, or we can simply set up a cron job on the server itself.
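For instance, a hedged crontab sketch (database, table, host, and schedule are placeholders; -i/-k/-q are the same impala-shell options used elsewhere on this page):

# refresh Impala's view of mydb.sometable every 5 minutes (hypothetical table and host)
*/5 * * * * impala-shell -i impalad-host -k -q "REFRESH mydb.sometable"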

Cache Spark table in Thrift Server

When using Jupyter to cache some data in Spark (using sqlContext.cacheTable), I can see the table cached for the SparkContext running within Jupyter. But now I want to access those cached tables from BI tools via ODBC using the Thrift Server. When checking the Thrift Server cache I don't see any table. The question is: how do I get those tables cached so they can be consumed by BI tools?
Do I have to send the same Spark commands via JDBC? In that case, is the context tied to the current session?
regards,
miguel
I found the solution. In order to have the tables cached for use by JDBC/ODBC clients via the Thrift Server, I have to run CACHE TABLE from one of the clients, for example from beeline. Once this is done, the table is in memory for all the different sessions.
It is also important to make sure you are using the right Spark Thrift Server. To check, just run show tables; in beeline: if you get only one column back, you are not using the Spark one and CACHE TABLE won't work.
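If you would rather trigger it programmatically than from beeline, a minimal Scala/JDBC sketch could look like this (the host thriftserver-host, port 10000, credentials, and table name are placeholders; the Spark Thrift Server speaks the HiveServer2 protocol, so the Hive JDBC driver is assumed):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")   // Hive JDBC driver on the classpath
val conn = DriverManager.getConnection("jdbc:hive2://thriftserver-host:10000/default", "user", "")
val stmt = conn.createStatement()
stmt.execute("CACHE TABLE my_table")   // cached in the Thrift Server's SparkContext, visible to all its sessions
stmt.close()
conn.close()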

Hive / Tez job won't start

I am trying to create an ORC table in Hive by importing from a text file in HDFS. I have tried multiple different ways and searched online for help, but regardless, the insert job won't start.
I can get the text file to HDFS, I can read the text file to Hive, but I cannot convert from that to ORC.
I tried many different variations, including this one that can be used as a reference to this question:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from_hdfs_to_hive_external_table_method.html
I have a single-node HDP cluster (being used for development) - version:
HDP-2.3.2.0
(2.3.2.0-2950)
And here are the relevant service versions:
Service Version Status Description
HDFS 2.7.1.2.3 Installed Apache Hadoop Distributed File System
MapReduce2 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
YARN 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
Tez 0.7.0.2.3 Installed Tez is the next generation Hadoop Query Processing framework written on top of YARN.
Hive 1.2.1.2.3 Installed Data warehouse system for ad-hoc queries & analysis of large datasets and table & storage management service
Here is what happens when I run SQL like this (again, I've tried many variations, including some taken directly from online tutorials):
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
My job stays like this:
Total number of applications (application-types: [] and states:
[SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1455989658079_0002 HIVE-3f41161c-b806-4e7d-974e-c18e028d683f TEZ hive root.hive ACCEPTED UNDEFINED 0% N/A
And it just hangs there. (Literally, I've tried a 20-row sample table and let it run for hours before killing it.)
I am by no means a Hadoop expert (yet) and am sure it's probably a config issue, but I have been unable to figure it out.
All other Hive operations I've tried, such as creating and dropping tables, loading a file into a text table, and selects, all work fine. It's only when I populate an ORC table that it does this. And I need an ORC table for my requirement.
Any advice would be helpful.
Most of the time it has to do with increasing your YARN scheduling capacity, but if your resources are already capped you can also reduce the amount of memory requested by individual Tez tasks, by adjusting the following property in the Tez configuration:
tez.task.resource.memory.mb
To increase the cluster's capacity, you can do it in the YARN configuration settings or directly through Ambari or Cloudera Manager.
To monitor what is happening under the hood, you can open the YARN ResourceManager UI and check the Diagnostics tab of the specific application; there are useful, explicit messages about resource allocation, especially when the job is accepted and keeps pending.
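As a hedged illustration, the Tez setting would go into tez-site.xml (the 1024 MB value is arbitrary and must fit within your YARN container limits), and the stuck application's diagnostics can also be pulled from the command line:

<!-- tez-site.xml: reduce the per-task memory request (value is illustrative) -->
<property>
  <name>tez.task.resource.memory.mb</name>
  <value>1024</value>
</property>

# print the application's state and diagnostics (same info as the RM UI's Diagnostics tab)
yarn application -status application_1455989658079_0002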

Cloudera Impala INVALIDATE METADATA

As discussed in the Impala tutorials, Impala uses a metastore shared with Hive. But it is also mentioned that if you create or edit tables using Hive, you should execute the INVALIDATE METADATA or REFRESH command to inform Impala about the changes.
So I got confused, and my question is: if the metadata database is shared, why is there a need to execute INVALIDATE METADATA or REFRESH in Impala?
And if it is because Impala caches the metadata, why don't the daemons update their cache themselves on a cache miss, without the need to refresh the metadata manually?
Any help is appreciated.
OK! Let's start with your question in the comments: what is the benefit of a centralized metastore?
Having a central metastore means the user does not have to maintain metadata in two different locations, one each for Hive and Impala. The user can have a central repository, and both tools can access this location for any metadata information.
Now, the second part: why is there a need to do INVALIDATE METADATA or REFRESH when the metastore is shared?
Impala utilizes a massively parallel processing (MPP) paradigm to get the work done. Instead of reading from the centralized metastore for each and every query, it keeps the metadata on its executor nodes so that it can completely bypass the cold starts where a significant amount of time would be spent reading metadata.
INVALIDATE METADATA/REFRESH propagates the metadata/block information to the executor nodes.
Why do it manually?
In earlier versions of Impala, the catalogd process was not present, and metadata updates needed to be propagated via the aforementioned commands. Starting with Impala 1.2, catalogd was added; this process relays metadata changes from Impala SQL statements to all the nodes in the cluster.
Hence removing the need to do it manually!
Hope that helps.
It is shared, but Impala caches the metadata and uses its statistics in its optimizer. If something is changed in Hive, you have to manually tell Impala to refresh its cache, which is kind of inconvenient.
But if you create/change tables in impala, you don't have to do anything on the hive side.
@masoumeh, when you modify a table via Impala SQL statements there is no need for INVALIDATE METADATA or REFRESH; this job is done by catalogd.
But when you insert:
a NEW table through Hive (i.e. sqoop import .... --hive-import ...), then you have to do INVALIDATE METADATA tableName via impala-shell;
new data files into an existing table (appending data), then you have to do REFRESH tableName, because the only thing you want is the metadata for the newly added files.
