Hive: modifying an external table's location takes too long - hadoop

Hive has two kinds of tables, Managed and External; for the difference, you can check Managed vs. External Tables.
Currently, to move an external database from HDFS to Alluxio, I need to change the external tables' locations to alluxio://.
The statement is something like: alter table catalog_page set location "alluxio://node1:19998/user/root/tpcds/1000/catalog_returns"
According to my understanding this should be a simple metastore modification; however, for some tables the modification takes dozens of minutes. The database itself contains about 1 TB of data, by the way.
Is there any way to accelerate the table alter process? If not, why is it so slow? Any comment is welcome, thanks.

I found the suggested way, which is the metatool under $HIVE_HOME/bin:
metatool -updateLocation <new-loc> <old-loc>: Update FS root location in the metastore to the new location. Both new-loc and old-loc should be valid URIs with valid host names and schemes. When run with the dryRun option, changes are displayed but are not persisted. When run with the serdepropKey/tablePropKey option, updateLocation looks for the serde-prop-key/table-prop-key that is specified and updates its value if found.
By using this tool, the location modification is very fast (a few seconds, perhaps).
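For reference, metatool rewrites the filesystem root (scheme and authority) across the whole metastore, so the invocation looks roughly like this (the Alluxio root comes from the URI in my question; hdfs://namenode:8020 is just a placeholder for the old namenode address, so adjust both):
hive --service metatool -listFSRoot
hive --service metatool -updateLocation alluxio://node1:19998 hdfs://namenode:8020 -dryRun    # preview the changes first
hive --service metatool -updateLocation alluxio://node1:19998 hdfs://namenode:8020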
I'll leave this thread here for anyone who might run into the same situation.

Related

Can Hive periodically append or insert incremental data to the same table file in hdfs?

I'm loading network capture data every minute from Spark Streaming (fed by a Flume exec source), aggregating it by IP address, and saving it to Hive at the end. To make it faster I created the Hive ORC table partitioned by IP address, and that works well. The only issue is that every minute it creates many small KB-sized files (one per IP address). For now I use "ALTER TABLE ... CONCATENATE;" to merge them manually, but I think there should be an easier way, so I want to ask whether there is a solution that can incrementally merge/append new data into the first minute's table files instead of creating new files every minute. Any suggestion is appreciated!
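For concreteness, the manual merge mentioned above is run per table or per partition, roughly like this (table and partition names are made up):
ALTER TABLE network_agg PARTITION (ip = '10.0.0.1') CONCATENATE;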
I give up; it looks like there is no direct solution, as Hive can't append content to an existing data file for performance reasons. My alternative for now is still to concatenate every week. The problem is that queries break with an error message (complaining that a data file can't be found) while the concatenation is running, so there is a big business impact. Now I'm thinking of replacing Hive with HBase or Kudu, which are more flexible and provide update/delete operations.

hive external table needing write access

I am trying to load a dataset stored on HDFS (textfile) into hive for analysis.
I am using create external table as follows:
CREATE EXTERNAL table myTable(field1 STRING...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/myusername/datasetlocation';
This works fine, but it requires write access to the HDFS location. Why is that?
In general, what is the right way to load text data to which I do not have write access? Is there a 'read-only' external table type?
Edit: I noticed this issue on the Hive tracker regarding the question. It does not seem to have been resolved.
Partially answering my own question:
Indeed, it seems not to be resolved by Hive at the moment. But here is an interesting fact: Hive does not require write access to the files themselves, only to the folder. For example, you could have a folder with permissions 777 while the files within it, which Hive reads, stay read-only, e.g. 644.
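For example, using the path from the question (777 on the folder is just for illustration; a narrower mode that still lets Hive write to the directory would also do):
hadoop fs -chmod 777 /user/myusername/datasetlocation      # the directory Hive points at
hadoop fs -chmod 644 /user/myusername/datasetlocation/*    # the data files themselves can stay read-only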
I don't have a solution to this, but as a workaround I've discovered that
CREATE TEMPORARY EXTERNAL TABLE
works without write permissions, the difference being that the table (but not the underlying data) will disappear at the end of your session.
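In other words, reusing the DDL from the question (column list elided as above), something along these lines:
CREATE TEMPORARY EXTERNAL TABLE myTable(field1 STRING...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/myusername/datasetlocation';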
If you require write access to the HDFS files, run
hadoop fs -chmod 777 /folder_name
which grants all access permissions on that particular folder.

Hive - Multiple clusters pointing to same metastore

We have two clusters, say one old and one new. Both of them are on AWS EMR. Hive on both clusters points to the same Hive metastore, which is on RDS. We are migrating from the old cluster to the new one.
Now the question is: if I stop the old cluster, will there be any issue accessing the old tables? All the data is on S3 and all tables are EXTERNAL, but the databases themselves are still on HDFS, like
hdfs://old:1234/user/hive/warehouse/myfirst.db
If I stop the old cluster, this location becomes void; does that make the database, and also the tables, invalid, even though they are external?
I am really not sure whether this will be an issue, but this is production, so I am trying to find out whether anyone has already faced it.
Thanks!
As long as all your tables have their LOCATION set to S3, losing the location of the DATABASE/SCHEMA will not impact access to your metadata.
The only impact on your new cluster is that CREATE TABLE statements performed in the custom database ("myfirst" in your example) without an explicit LOCATION will fail to reach the default HDFS path, which is inherited from the DATABASE location.
Tables created in the "default" schema will not fail, as Hive resolves the location for the new table from the property "hive.metastore.warehouse.dir", which is "/user/hive/warehouse" on Elastic MapReduce.
Again, this does not affect tables with an explicit LOCATION set at creation time.
In general, to achieve a completely "portable" Metastore what you will want to do is:
Make sure all the TABLES have LOCATION set to S3 (any data in HDFS is obviously bound to the cluster lifecycle).
This can be achieved by:
explicitly setting LOCATION in the CREATE TABLE statement, or
setting LOCATION for all the DATABASES/SCHEMAS (other than 'default') to a path in S3 (see the sketch after this list)
Optionally (but strongly recommended), use EXTERNAL (user-managed, a.k.a. non-managed) tables to prevent accidental data loss due to DDL statements.
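A minimal sketch of both options (bucket, database, and table names are made up):
-- give the database an S3 location, so tables created in it inherit an S3 path
CREATE DATABASE mydb LOCATION 's3://my-bucket/warehouse/mydb.db';
-- or set an explicit S3 LOCATION on the (external) table itself
CREATE EXTERNAL TABLE mydb.events (id STRING, ts TIMESTAMP)
STORED AS PARQUET
LOCATION 's3://my-bucket/warehouse/mydb.db/events';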

Providing access to unstructured files in Hadoop

So I have a collection of files archived in HDFS with a unique key in the file name. I have a table of records in a Hive table with the same unique key.
How would I provide access to the files to other users? I may need to restrict access to certain users.
I was thinking of providing a reference to the files in the hive table.
I could also look at some sort of web interface for searching for and downloading files.
Hive kicks off a MapReduce job (or several) every time you execute a query. The latency introduced by setting up and tearing down MapReduce jobs exceeds any acceptable standard of responsiveness expected from a web interface.
I recommend you keep the metadata for the files in a relational database. You already need a relational database, such as PostgreSQL, to store the Hive metastore; I sure hope you are not using the default Derby for that!

Basic thing about Hadoop and Hive

I have started working with Hadoop recently. There is a table named Checkout that I access through Hive. Below is the path where the data lands in HDFS, along with some other info. What information can I get if I have to read the three lines below?
Path Size Record Count Date Loaded
/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00 1.13 TB 9,294,245,800 2012-07-05 07:26
/sys/edw/dw_checkout_trans/snapshot/2012/07/03/00 1.13 TB 9,290,477,963 2012-07-04 09:37
/sys/edw/dw_checkout_trans/snapshot/2012/07/02/00 1.12 TB 9,286,199,847 2012-07-03 07:08
So my questions are:
1) First, we load the data into HDFS and then query it through Hive to get the results back, right?
2) Second, looking at the paths above, the only thing I am confused about is: when I query using Hive, will I get data from all three paths above, or only from the most recent one at the top?
As I am new to this stuff, I am having a lot of trouble. Can anyone explain where Hive gets its data from? Do we store all the data in HDFS and then use Hive or Pig to get it back? It would also be great if someone could give a high-level overview of Hadoop and Hive.
I think you need to understand the difference between Hive's native (managed) tables and Hive's external tables.
A Hive native table means that you load data into Hive and it takes care of how the data is stored in HDFS; we usually do not care about the directory structure in this case.
A Hive external table means that we put data in some directory (leaving partitioning aside for the moment) and tell Hive: this is the table's data, please treat it as such. Hive then lets us query it and join it with other external or regular tables, and it remains our responsibility to add data, delete it, etc.
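A minimal sketch of the two flavours (column names and the incoming path are made up; the snapshot path is the one from the question):
-- managed (native) table: Hive moves the data under its warehouse directory and owns it
CREATE TABLE checkout_managed (item_id STRING, amount DOUBLE);
LOAD DATA INPATH '/tmp/checkout_incoming' INTO TABLE checkout_managed;
-- external table: the data stays where it already is; Hive only records the location
CREATE EXTERNAL TABLE checkout_ext (item_id STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00';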
