Cloudera Impala INVALIDATE METADATA - hadoop

As discussed in the Impala tutorials, Impala uses the same Metastore as Hive. However, they also mention that if you create or modify tables using Hive, you should execute the INVALIDATE METADATA or REFRESH command to inform Impala about the changes.
So I am confused, and my question is: if the metadata database is shared, why is there a need to execute INVALIDATE METADATA or REFRESH in Impala?
And if it is because Impala caches the metadata, why don't the daemons update their cache themselves on a cache miss, without the need to refresh the metadata manually?
Any help is appreciated.

OK! Let's start with the question from your comment: what is the benefit of a centralized metastore?
Having a central metastore means the user doesn't have to maintain metadata in two different locations, one each for Hive and Impala. The user can have a central repository, and both tools can access it for any metadata information.
Now, the second part: why is there a need to run INVALIDATE METADATA or REFRESH when the metastore is shared?
Impala uses the Massively Parallel Processing paradigm to get the work done. Instead of reading from the centralized metastore for each and every query, it keeps the metadata on the executor nodes so that it can bypass cold starts, where a significant amount of time may be spent reading the metadata.
INVALIDATE METADATA/REFRESH propagates the metadata/block information to the executor nodes.
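To make the caching point concrete, here is a toy model in Python. It is not Impala's implementation; MetastoreStub and CachingEngine are hypothetical names made up for illustration. It shows why a daemon cannot fix this on a cache miss: once a table is cached, a change made by another tool still produces a cache hit, just a stale one.

```python
class MetastoreStub:
    """Stand-in for the shared metastore: table name -> list of data files."""

    def __init__(self):
        self.tables = {}


class CachingEngine:
    """Toy query engine that caches table metadata on first use."""

    def __init__(self, metastore):
        self.metastore = metastore
        self.cache = {}

    def files_for(self, table):
        if table not in self.cache:
            # Cache miss: load a snapshot from the metastore.
            self.cache[table] = list(self.metastore.tables[table])
        # Cache hit: the snapshot may be stale, and the engine cannot tell.
        return self.cache[table]

    def refresh(self, table):
        # Reload the snapshot for one table (the role REFRESH plays).
        self.cache[table] = list(self.metastore.tables[table])

    def invalidate(self, table):
        # Drop the snapshot so the next access reloads it lazily.
        self.cache.pop(table, None)


metastore = MetastoreStub()
metastore.tables["sales"] = ["part-0"]
engine = CachingEngine(metastore)

engine.files_for("sales")                    # first access fills the cache
metastore.tables["sales"].append("part-1")   # change made by another tool
stale = engine.files_for("sales")            # still the old snapshot
engine.refresh("sales")
fresh = engine.files_for("sales")            # now includes the new file
```

Here stale still lists only part-0 even though the metastore has changed, and only the explicit refresh makes part-1 visible.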
Why do it manually?
In earlier versions of Impala, the catalogd process was not present, and metadata updates had to be propagated via the aforementioned commands. Starting with Impala 1.2, catalogd was added; this process relays the metadata changes from Impala SQL statements to all the nodes in a cluster.
Hence there is no need to do it manually for changes made through Impala. Changes made outside Impala, for example through Hive, still require INVALIDATE METADATA or REFRESH.
Hope that helps.

It is shared, but Impala caches the metadata and uses its statistics in its optimizer. If the metadata is changed in Hive, you have to manually tell Impala to refresh its cache, which is somewhat inconvenient.
But if you create or change tables in Impala, you don't have to do anything on the Hive side.

@masoumeh When you modify a table via Impala SQL statements, there is no need for INVALIDATE METADATA or REFRESH; this job is done by catalogd.
But when you add:
a NEW table through Hive (e.g. sqoop import .... --hive-import ...), then you have to run INVALIDATE METADATA tableName via impala-shell.
new data files to an existing table (appending data), then you have to run REFRESH tableName, because the only thing you need is the metadata for the newly added data.
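The rule of thumb above can be written down as a small helper. This is only a sketch: impala_sync_statement and its is_new_table parameter are hypothetical names, and it just builds the statement text that you would then run via impala-shell or any other Impala client.

```python
def impala_sync_statement(table, is_new_table):
    """Pick the Impala statement to run after a change made outside Impala.

    is_new_table=True  -> the table was created outside Impala (e.g. via Hive)
    is_new_table=False -> data files were appended to an existing table
    """
    if is_new_table:
        return "INVALIDATE METADATA " + table
    return "REFRESH " + table


after_hive_import = impala_sync_statement("mydb.new_table", is_new_table=True)
after_append = impala_sync_statement("mydb.existing_table", is_new_table=False)
```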

Related

Is Hive and Impala integration possible?

Is Hive and Impala integration possible?
After processing data in Hive, I want to store the result data in Impala for better read performance. Is it possible?
If yes, can you please share an example?
Neither Hive nor Impala stores any data. The data is stored in HDFS, and Hive and Impala are both just used to query/transform the data present in HDFS.
So yes, you can process the data using Hive and then read it using Impala, provided both have been set up properly. But since Impala's metadata needs to be refreshed, you need to run the INVALIDATE METADATA or REFRESH command.
Impala uses the Hive metastore to read the data. Once you have created a table in Hive, you can read and query it using Impala. All you need to do is refresh the table or trigger INVALIDATE METADATA in Impala to read the data.
Hope this helps :)
Hive and Impala are two different query engines. Each is unique in terms of its architecture as well as its performance. We can use the Hive metastore to get the metadata and run queries using Impala. A common use case is connecting to Impala/Hive from Tableau. If we are visualizing Hive data from Tableau, we get the latest data without any workaround. If we keep loading data continuously, the metadata will be updated as well, but Impala is not aware of those changes. So we should run a metadata invalidation query against impalad to refresh its state and sync with the latest information available in the metastore. That way the user gets the same results as Hive when they run the same query from Tableau using the Impala engine.
There is currently no configuration parameter available to run this invalidation query periodically. This blog explains well how to execute the metadata invalidation query periodically through the Oozie scheduler to handle such problems, or we can simply set up a cron job on the server itself.
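As a rough sketch of the "run it periodically" idea, the Python below builds an impala-shell command and runs it at a fixed interval. The host name, table name and interval are placeholders, and build_invalidate_command and run_periodically are hypothetical helpers; the runner and sleep parameters exist only so the loop can be tried without a cluster. A plain cron entry calling impala-shell would do the same job.

```python
import subprocess
import time


def build_invalidate_command(host, table):
    """Build an impala-shell call: -i picks the impalad host, -q runs one query."""
    return ["impala-shell", "-i", host, "-q", "INVALIDATE METADATA " + table]


def run_periodically(command, interval_seconds, iterations,
                     runner=subprocess.run, sleep=time.sleep):
    """Run the command a fixed number of times, sleeping between runs."""
    for i in range(iterations):
        # check=True raises CalledProcessError if impala-shell exits non-zero.
        runner(command, check=True)
        if i < iterations - 1:
            sleep(interval_seconds)


# Placeholder host and table; nothing is executed until run_periodically is called.
command = build_invalidate_command("impalad-host", "mydb.mytable")
```

Calling run_periodically(command, 600, 6) would then invalidate the table's metadata six times, ten minutes apart.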

What happens when Sqoop fails in the middle of a data transfer

What happens when a Sqoop job fails while importing data from an RDBMS into HDFS, or vice versa?
Sqoop can export data from HDFS into an RDBMS using parallel data transfer tasks. Each task will open a connection to the database, insert into the database via transactions, and commit periodically. This means that before the entire export job is complete, partial data will be available in the database.
If an export map task fails even after multiple retries, the entire job will fail. The reasons for task failures could include network connectivity issues, database integrity constraints, malformed records on HDFS, cluster capacity issues etc. In such a failure case, the already committed data will still be available in the database.
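The effect of periodic commits can be reproduced with a small simulation. This is not Sqoop code: it uses an in-memory SQLite database as a stand-in for the target RDBMS, and export_rows, commit_every and fail_at are hypothetical names. It shows that rows committed before the failure stay in the table, while the uncommitted part of the current batch is lost.

```python
import sqlite3


def export_rows(conn, rows, commit_every, fail_at=None):
    """Insert rows one by one, committing every commit_every rows.

    fail_at simulates a task failure just before inserting that row index.
    """
    cur = conn.cursor()
    for i, row in enumerate(rows):
        if fail_at is not None and i == fail_at:
            conn.rollback()  # the uncommitted part of the batch is discarded
            raise RuntimeError("simulated export task failure")
        cur.execute("INSERT INTO target (id) VALUES (?)", (row,))
        if (i + 1) % commit_every == 0:
            conn.commit()
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY)")
conn.commit()

try:
    export_rows(conn, range(10), commit_every=4, fail_at=6)
except RuntimeError:
    pass

# Rows 0-3 were committed; rows 4-5 were rolled back; rows 6-9 never ran.
remaining = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

After the simulated failure the table holds only the first committed batch, which is the partial data the answer above describes.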

cache spark table in thrift server

When using Jupyter to cache some data in Spark (using sqlContext.cacheTable), I can see the table cached for the SparkContext running within Jupyter. But now I want to access those cached tables from BI tools via ODBC using the Thrift server. When checking the Thrift server cache I don't see any tables. The question is: how do I get those tables cached so they can be consumed from BI tools?
Do I have to send the same Spark commands via JDBC? In that case, is the context tied to the current session?
regards,
miguel
I found the solution. In order to have the tables cached for use by JDBC/ODBC clients via the Thrift server, I have to run CACHE TABLE from one of the clients, for example from beeline. Once this is done, the table is in memory for all the different sessions.
It is also important to be sure you are using the right Spark Thrift server. To check, just run show tables; in beeline. If you get just one column back, you are not using the Spark one and CACHE TABLE won't work.
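If you want to script this instead of typing it into beeline, a minimal sketch could look like the following. The JDBC URL (localhost, port 10000) and the table name are assumptions, and beeline_cache_command is a hypothetical helper; it only builds the beeline invocation, using beeline's -u (connection URL) and -e (query to execute) options.

```python
def beeline_cache_command(jdbc_url, table):
    """Build a beeline call that runs CACHE TABLE through the Thrift server."""
    return ["beeline", "-u", jdbc_url, "-e", "CACHE TABLE " + table]


# Assumed URL and table name; adjust to where your Spark Thrift server listens.
command = beeline_cache_command("jdbc:hive2://localhost:10000", "my_table")
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where beeline is installed.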

Invalidate metadata/refresh Impala from Spark code

I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.
Currently this invalidation is done after my spark code has run.
I would like to speed things up by doing this refresh/invalidate directly from my Spark code.
What would be the most efficient approach?
Oozie is just too slow (30 sec overhead? no thanks)
An SSH action to an (edge) node seems like a valid solution but feels "hackish"
I don't see a way to do this from the hive context in Spark either.
REFRESH and INVALIDATE METADATA commands are specific to Impala.
You must be connected to an Impala daemon to be able to run these -- which trigger a refresh of the Impala-specific metadata cache (in your case you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE to rebuild the list of all partitions and all their files from scratch)
You could use the Spark SqlContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:
download the latest Cloudera JDBC driver for Impala
install it on the server where you run your Spark job
list all the JARs in your *.*.extraClassPath properties
develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way
Hopefully Google will find some examples of JDBC/Scala code such as this one
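For a PySpark job, the same idea can be sketched in Python rather than Scala/JDBC. Note the swap: this uses a DB-API style connection instead of the Cloudera JDBC driver, on the assumption that the impyla package is available (its impala.dbapi.connect function, with 21050 as the usual impalad port for such clients). refresh_impala_table is a hypothetical helper, and the connect function is passed in so the helper itself has no hard dependency.

```python
def refresh_impala_table(connect, host, table, port=21050):
    """Open a session against an Impala daemon and run REFRESH on one table.

    connect is a DB-API style factory, e.g. impala.dbapi.connect from impyla
    (assumed to be installed on the machine running the Spark driver).
    """
    conn = connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        cursor.execute("REFRESH " + table)
        cursor.close()
    finally:
        conn.close()


# Usage at the end of the Spark job (host and table are placeholders):
#   from impala.dbapi import connect
#   refresh_impala_table(connect, "impalad-host", "somedb.sometable")
```

Calling this from the driver right after the write finishes avoids the Oozie or SSH round trip mentioned in the question.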
Seems this has been fixed by Impala 3.3.0 (cf. Section "Metadata Performance Improvements" here):
Automatic invalidation of metadata
With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.
In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:
INSERT into tables and partitions from Impala or from Spark on the same or multiple cluster configuration
All the above steps are not required; you can write the code below and execute an INVALIDATE METADATA query against the Impala table.
import subprocess
impala_node_ip_address = "XX.XX.XX.XX"
impala_query = 'impala-shell -i "' + impala_node_ip_address + '" -k -q "invalidate metadata DBNAME.TableName"'
subprocess.call(impala_query, shell=True)  # needs impala-shell installed on the host running this code

hadoop hive question

I'm trying to create tables programmatically using JDBC. However, I can't see the tables I created from the Hive shell. What's worse, when I access the Hive shell from different directories, I see different contents of the database.
Is there any setting I need to configure?
Thanks in advance.
Make sure you run Hive from the same directory every time, because when you launch the Hive CLI for the first time, it creates a Derby metastore database in the current directory. This Derby DB contains the metadata of your Hive tables. If you change directories, your Hive table metadata will be scattered across several databases. Also, the Derby DB cannot handle multiple sessions. To allow concurrent Hive access, you need to use a real database to manage the metastore rather than the wimpy little Derby DB that comes with it. You can download MySQL for this and change the Hive properties to use a JDBC connection to MySQL with its Type 4 pure-Java driver.
Try emailing the Hive userlist or the IRC channel.
You probably need to set up the central Hive metastore (by default Derby, but it can be MySQL/Oracle/Postgres). The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.
For more information, see http://wiki.apache.org/hadoop/HiveDerbyServerMode
Examine your Hadoop logs. For me this happened when my Hadoop system was not set up properly: the namenode was not able to contact the datanodes on other machines, etc.
Yeah, it's due to the metastore not being set up properly. Metastore stores the metadata associated with your Hive table (e.g. the table name, table location, column names, column types, bucketing/sorting information, partitioning information, SerDe information, etc.).
The default metastore is an embedded Derby database which can only be used by one client at any given time. This is obviously not good enough for most practical purposes. You, like most users, should configure your Hive installation to use a different metastore. MySQL seems to be a popular choice. I have used this link from Cloudera's website to successfully configure my MySQL metastore.
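For reference, switching to a MySQL metastore mostly comes down to four javax.jdo.option.* connection properties in hive-site.xml. The sketch below renders such a fragment from Python; the host, database name, user and password are placeholders, the driver class shown is the classic MySQL Connector/J name (newer connector versions use a different class), and metastore_properties and render_hive_site are hypothetical helpers. Treat the output as a starting point to compare against your distribution's documentation.

```python
import xml.etree.ElementTree as ET


def metastore_properties(host, database, user, password):
    """Connection properties Hive reads to reach an external metastore database."""
    return {
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://%s/%s?createDatabaseIfNotExist=true" % (host, database),
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


def render_hive_site(properties):
    """Render the properties in the <configuration>/<property> layout of hive-site.xml."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
xml_text = render_hive_site(
    metastore_properties("dbhost", "metastore", "hiveuser", "secret"))
```

With every client pointing at the same database, the tables created over JDBC and the ones seen in the Hive shell come from one metastore, regardless of the working directory.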
