Hive - Multiple clusters pointing to same metastore - hadoop

We have two clusters, say one old and one new. Both of them are on AWS EMR. Hive on these clusters points to the same Hive metastore, which is on RDS. We are migrating from the old cluster to the new one.
Now the question is: if I stop the old cluster, will there be any issue accessing the old tables? All the data is on S3 and all tables are EXTERNAL, but the databases are still on HDFS, like:
hdfs://old:1234/user/hive/warehouse/myfirst.db
If I stop the old cluster, this location becomes void. Does that make the database invalid, and the tables too, even though they are external?
I am really not sure whether this will be an issue, but this is production, so I am trying to find out if anyone has already faced it.
Thanks!

As long as all your tables have their LOCATION set to S3, losing the location of the DATABASE/SCHEMA will not impact access to your metadata.
The only impact it will have on your new cluster is that CREATE TABLE statements performed in the custom database ("myfirst" in your example) without an explicit LOCATION will fail, because the default HDFS path, which is inherited from the DATABASE location, is no longer reachable.
Tables created in the "default" schema will not fail as Hive will resolve the location for the new table to the value of the property "hive.metastore.warehouse.dir", which is "/user/hive/warehouse" in Elastic MapReduce.
Again, this does not affect tables with an explicit LOCATION set at creation time.
In general, to achieve a completely "portable" Metastore what you will want to do is:
Make sure all the TABLES have LOCATION set to S3 (any data in HDFS is obviously bound to the cluster lifecycle).
This can be achieved by:
explicitly setting LOCATION in the CREATE TABLE statement, or
setting LOCATION for all the DATABASES/SCHEMAS (other than 'default') to a path in S3
Optionally (but strongly recommended), use EXTERNAL (user-managed, a.k.a. non-managed) tables to prevent accidental data loss due to DDL statements
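A sketch of what those two options can look like in HiveQL (the bucket, database, and table names are hypothetical, and ALTER DATABASE ... SET LOCATION requires a reasonably recent Hive version; it only affects tables created after the change):

```sql
-- Option 2: point the database's default location at S3,
-- so future tables inherit an S3 path instead of HDFS.
ALTER DATABASE myfirst SET LOCATION 's3://my-bucket/warehouse/myfirst.db';

-- Option 1: set an explicit S3 LOCATION per table at creation time.
CREATE EXTERNAL TABLE myfirst.events (
  id BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION 's3://my-bucket/warehouse/myfirst.db/events';
```

Either way, nothing in the table's metadata points at the old cluster's HDFS any more, so stopping it is safe.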

Related

Hive: modify external table's location take too long

Hive has two kinds of tables, Managed and External; for the difference, you can check Managed VS External Tables.
Currently, to move an external database from HDFS to Alluxio, I need to modify each external table's location to an alluxio:// URI.
The statement is something like: alter table catalog_page set location "alluxio://node1:19998/user/root/tpcds/1000/catalog_returns"
As I understand it, this should be a simple metastore modification; however, for some tables the modification takes dozens of minutes. The database itself contains about 1 TB of data, by the way.
Is there any way for me to accelerate the table alter process? If not, why is it so slow? Any comment is welcome, thanks.
I found the suggested way, which is the metatool under $HIVE_HOME/bin:
metatool -updateLocation <new-loc> <old-loc>  Update FS root location in the metastore to the new location. Both new-loc and old-loc should be valid URIs with valid host names and schemes. When run with the dryRun option, changes are displayed but are not persisted. When run with the serdepropKey/tablePropKey option, updateLocation looks for the serde-prop-key/table-prop-key that is specified and updates its value if found.
By using this tool, the location modification is very fast (maybe a few seconds).
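The reason it is so much faster is that metatool rewrites the location URIs stored in the metastore's backing database directly, in bulk, instead of going through Hive per table. Conceptually the update amounts to something like the following (a hedged sketch against a MySQL-backed metastore, using the standard metastore tables DBS and SDS; do not hand-edit a production metastore, use metatool instead):

```sql
-- Sketch of what `metatool -updateLocation` effectively does:
-- rewrite the filesystem root in database locations...
UPDATE DBS
SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI,
                              'hdfs://node1:8020',
                              'alluxio://node1:19998')
WHERE DB_LOCATION_URI LIKE 'hdfs://node1:8020%';

-- ...and in table/partition storage descriptors.
UPDATE SDS
SET LOCATION = REPLACE(LOCATION,
                       'hdfs://node1:8020',
                       'alluxio://node1:19998')
WHERE LOCATION LIKE 'hdfs://node1:8020%';
```

A safer first step is a preview run, e.g. metatool -updateLocation alluxio://node1:19998 hdfs://node1:8020 -dryRun, which displays the changes without persisting them.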
Leave this thread here for anyone who might run into the same situation.

Extracting Data from Oracle to Hadoop. Is Sqoop a good idea

I'm looking to extract some data from an Oracle database and transfer it to a remote HDFS file system. There appear to be a couple of possible ways of achieving this:
Use Sqoop. This tool will extract the data, copy it across the network and store it directly into HDFS
Use SQL to read the data and store it on the local file system. When this has been completed, copy (FTP?) the data to the Hadoop system.
My question: will the first method (which is less work for me) cause Oracle to lock tables for longer than required?
My worry is that Sqoop might take out a lock on the database when it starts to query the data, and that this lock isn't going to be released until all of the data has been copied across to HDFS. Since I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency), the lock will remain longer than would otherwise be required.
Sqoop issues the usual SELECT queries against the Oracle database, so it takes the same locks a SELECT query would. No additional locking is performed by Sqoop.
Data will be transferred in several concurrent tasks (mappers). Any expensive function call will put a significant performance burden on your database server. Advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel. This will adversely affect transfer performance.
For efficient advanced filtering, run the filtering query on your database prior to the import, save its output to a temporary table, and run Sqoop to import the temporary table into Hadoop without the --where parameter.
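A sketch of that pattern on the Oracle side (the table, columns, and filter function are hypothetical):

```sql
-- Run the expensive filtering once, inside Oracle, materializing
-- a plain table that Sqoop can then scan cheaply in parallel.
CREATE TABLE orders_to_export AS
SELECT order_id, customer_id, total_amount
FROM   orders
WHERE  complex_filter_fn(order_id) = 1;  -- hypothetical expensive function
```

The import then becomes a straight table copy, something like sqoop import --table ORDERS_TO_EXPORT --target-dir /data/orders, with no --where or free-form query needed.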
A Sqoop import has nothing to do with copying the data across the network afterwards: Sqoop stores the data at one location and, based on the replication factor of the cluster, HDFS replicates it.

Connecting to an existing database (on a HDFS) from Hive

I have a database on HDFS (Hadoop filesystem) along with a schema file.
I'm trying to connect to this existing database from hive.
Any pointers are really appreciated.
Not sure what you mean by database, but using the External Table feature of Hive this is fairly easy. You'll need three things: a location for the data, an Input(Output)Format to read (write) your data (rows), and potentially a SerDe to interpret your data (columns). If you need to keep your Hive schema and the external schema in sync, there isn't really a good way to do it out of the box; you'll have to write some custom code that monitors the source schema and modifies the Hive schema on a schema change. Though non-trivial, it is fairly straightforward to do.
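A minimal sketch of such an external table (the path, the columns, and the choice of comma-delimited text are assumptions about your data; swap in the InputFormat/SerDe that matches your schema file):

```sql
CREATE EXTERNAL TABLE my_existing_data (
  id   BIGINT,
  name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','   -- or a custom SerDe for non-delimited data
STORED AS TEXTFILE
LOCATION 'hdfs:///data/my_existing_data';
```

Dropping this table later only removes the metadata; the files on HDFS stay where they are.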

Hbase in comparison with Hive

I'm trying to get a clear understanding of HBase.
Hive: it just creates a tabular structure over the underlying files in HDFS, so that we can give the user querying abilities on the HDFS files. Correct me if I'm wrong here.
HBase: again, we create a similar table structure, but a bit more structured (column-oriented), again over the HDFS file system.
Aren't they both the same, considering the type of job they do, except that Hive runs on MapReduce?
Also, is it true that we can't create an HBase table over an already existing HDFS file?
Hive shares a very similar structure with traditional RDBMSs (though not entirely); HQL syntax is almost identical to SQL, which is good for a database programmer from a learning perspective, whereas HBase is completely different in the sense that it can be queried only on the basis of its row key.
If you design a table in an RDBMS, you follow a structured approach, defining columns and concentrating on attributes, while in HBase the whole design is centered around the data. So, depending on the type of query to be used, you design the table; the columns are dynamic and can change at runtime (a core feature of NoSQL).
You said "aren't they both the same considering the type of job they do, except that Hive runs on MapReduce". It is not that simple: when a Hive query is executed, a MapReduce job is created and triggered. Depending on the data size and complexity, it may take a while, since each MapReduce job involves a number of steps performed by the JobTracker, initializing tasks like map, combine, shuffle/sort, reduce, and so on.
But when we access HBase, it directly looks up the data it has indexed, based on the specified Scan or Get parameters. In other words, it simply acts as a database.
Hive and HBase are completely different things
Hive is a way to create map/reduce jobs for data that resides on HDFS (can be files or HBase)
HBase is an OLTP oriented key-value store that resides on HDFS and can be used in Map/Reduce jobs
In order for Hive to work it holds metadata that maps the HDFS data into tabular data (since SQL works on tables).
I guess it is also important to note that in recent versions Hive is evolving to go beyond being a SQL way to write map/reduce jobs: with what HortonWorks calls the "Stinger initiative" they have added a dedicated file format (ORC) and are improving Hive's performance (e.g. with the upcoming Tez execution engine) to deliver SQL on Hadoop (i.e. a relatively fast way to run analytic queries on data stored in Hadoop).
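As an illustration of "can be files or HBase": Hive can map a table onto an existing HBase table via the HBase storage handler, so a Hive query can scan HBase data. A hedged sketch (the table name, column family, and columns are hypothetical; the storage handler class and properties are the standard Hive-HBase integration ones):

```sql
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,
  name   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
TBLPROPERTIES ('hbase.table.name' = 'users');
```

This maps the HBase row key to the rowkey column and the info:name cell to the name column, without copying any data out of HBase.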
Hive:
It just creates a tabular structure over the underlying files in HDFS, so that users get SQL-like querying abilities on existing HDFS files, with typical latencies of up to minutes.
However, for best performance it's recommended to ETL data into Hive's ORC format.
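Converting an existing text-backed table to ORC can be as simple as a CREATE TABLE AS SELECT (table names here are hypothetical):

```sql
-- Materialize an ORC copy of an existing table; Hive handles
-- the format conversion as part of the insert.
CREATE TABLE page_views_orc
STORED AS ORC
AS SELECT * FROM page_views;
```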
HBase:
Unlike Hive, HBase is NOT about running SQL queries over existing data in HDFS.
HBase is a strictly-consistent, distributed, low-latency KEY-VALUE STORE.
From The HBase Definitive Guide:
The canonical use case of Bigtable and HBase is the webtable, that is, the web pages stored while crawling the Internet.
The row key is the reversed URL of the page, for example, org.hbase.www. There is a column family storing the actual HTML code, the contents family, as well as others like anchor, which is used to store outgoing links, another one to store inbound links, and yet another for metadata like language.
Using multiple versions for the contents family allows you to store a few older copies of the HTML, and is helpful when you want to analyze how often a page changes, for example. The timestamps used are the actual times when the pages were fetched from the crawled website.
The fact that HBase uses HDFS is just an implementation detail: it allows running HBase on an existing Hadoop cluster and it guarantees redundant storage of the data; but it is not a feature in any other sense.
"Also, is it true that we can't create an HBase table over an already existing HDFS file?"
No, it's NOT true. Internally HBase stores data in its HFile format.

Providing access to unstructured files in Hadoop

So I have a collection of files archived in HDFS, with a unique key in each file name. I have a table of records in a Hive table with the same unique key.
How would I provide access to the files to other users? I may need to restrict access to certain users.
I was thinking of providing a reference to the files in the hive table.
I could also look at some sort of web interface for searching for and downloading files.
Hive kicks off a MapReduce job (or several) every time you execute a query. The latency introduced by setting up and tearing down MapReduce jobs exceeds any acceptable standard of responsiveness expected from a web interface.
I recommend you keep the metadata for the files in a relational database. You would have to have a relational database, like PostgreSQL, to store the Hive metadata anyway. I sure hope you are not using the default Derby for that!
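A hedged sketch of such a metadata table in PostgreSQL (the table, columns, and role are hypothetical); access restrictions can then piggyback on ordinary database grants, while the files themselves stay in HDFS:

```sql
CREATE TABLE file_index (
  file_key  VARCHAR(64) PRIMARY KEY,  -- the unique key from the file name
  hdfs_path TEXT NOT NULL,            -- where the archived file lives in HDFS
  owner     VARCHAR(64)               -- used to restrict who may fetch it
);

-- Only a reporting role may look files up.
GRANT SELECT ON file_index TO reporting_role;
```

A lookup by key is then a millisecond index probe rather than a MapReduce job, which is what a web interface needs.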
