How to integrate manually copied hbase files into an hbase instance? - hadoop

I have copied the files associated with an HBase table into another cluster and stored them in the HBase folder on HDFS. I can see the table when I run list, but when I do scan 'myTable' it can't find the table.
When I go through the HBase Web UI, I see the table including its column-family information, but when I click on the table I get:
org.apache.hadoop.hbase.TableNotFoundException: hbTable
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1024)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:889)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
How do I get the HBase RegionServers to manage the table?
P.S. For the purposes of this exercise I'm not interested in using the Export utility or the CopyTable utility.

Please refer to this link: Complete Bulk Load
I'm assuming here that you copied the HFiles from one cluster to the other.
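For reference, a minimal sketch of what that bulk-load step looks like, assuming the copied HFiles sit under an illustrative HDFS directory /hbase-staging/myTable (the tool's class name may differ slightly between HBase versions):

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hbase-staging/myTable myTable

This assigns the copied HFiles to the regions of an existing table, which is why the table itself must already be known to the cluster.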

Did you create the table in the other cluster? I think the error is due to the missing information in the .META. table.
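If the table does not exist yet on the destination cluster, a hedged sketch of creating it first from the HBase shell (the column-family name 'cf1' is illustrative and must match the families of the copied HFiles):

    hbase shell
    create 'hbTable', 'cf1'

Once the table exists, the bulk-load step above can attach the copied HFiles to its regions.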

Related

How to write incremental data to hive using flink

I use Flink 1.6. I know I can use a custom sink with Hive JDBC to write to Hive, or use JDBCAppendTableSink, but that still goes through JDBC, and the Hive JDBC driver does not support the executeBatch method, so I think it will be very slow.
So I looked for another way: I write a DataSet to HDFS with the writeAsText method, then create a Hive table on top of the HDFS files. But there is still a problem: how to append incremental data.
The API of FileSystem.WriteMode offers two constants:
NO_OVERWRITE - creates the target file only if no file exists at that path already.
OVERWRITE - creates a new target file regardless of any existing files or directories.
For example, in the first batch I write September's data to Hive; then I get October's data and want to append it.
But if I use OVERWRITE on the same HDFS file, September's data will no longer exist; if I use NO_OVERWRITE, I must write to a new HDFS file and hence a new Hive table, yet we need all the data in the same Hive table, and I do not know how to combine two HDFS files into one Hive table.
So how do I write incremental data to Hive using Flink?
As you already wrote, there is no Hive sink. I guess the default pattern is to write (text, Avro, Parquet) files to HDFS and define an external Hive table on that directory. There it doesn't matter whether you have a single file or multiple files, but you will most likely have to repair the table on a regular basis (msck repair table <db_name>.<table_name>;). This updates the metadata so the new files become available.
For bigger amounts of data I would recommend partitioning the table and adding the partitions on demand (this blog post might give you a hint: https://resources.zaloni.com/blog/partitioning-in-hive).
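A minimal sketch of that pattern, assuming the Flink job writes delimited text files into month-based subdirectories such as /data/events/month=2018-09/ (table name, columns and paths are illustrative):

    CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
    PARTITIONED BY (month STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs:///data/events';

    -- after each new batch lands in a new month=... subdirectory
    MSCK REPAIR TABLE events;
    -- or register the partition explicitly
    ALTER TABLE events ADD IF NOT EXISTS PARTITION (month='2018-10');

Each batch then only adds a new directory, so earlier months are never overwritten.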

What are Hive Common Use Cases?

I'm new to Hive, so I'm not sure how companies use it. Let me give you a scenario and see if I'm conceptually correct about the use of Hive.
Let's say my company wants to keep some web server log files and be able to search through and analyze them at any time. So I create a table whose columns correspond to the columns in the log file, then load the log file into the table. Now I can start querying the data. As data comes in on future dates, I just keep adding it to this table, and thus I always have my log files as a Hive table that I can search through and analyze.
Is the scenario above a common use case? And if it is, how do I keep adding new log files to the table? Do I have to keep adding them manually each day?
You can use Hive for analysis over static datasets, but if you have streaming logs, I really wouldn't suggest Hive for this. It's not a search engine and will take minutes just to find any reasonable data you're looking for.
HBase would probably be a better alternative if you must stay within the Hadoop ecosystem (Hive can query HBase).
Use Splunk, or the open-source alternatives Solr / Elasticsearch / Graylog, if you want reasonable tools for log analysis.
But to answer your questions
how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
Use an EXTERNAL Hive table over an HDFS location for your logs. Use Flume to send log data to that path (or send your logs to Kafka, and from Kafka to HDFS as well as to a search/analytics system).
You only need to update the table if you're adding date partitions (which you should, because that's how you get faster Hive queries). You'd use MSCK REPAIR TABLE to detect missing partitions on HDFS, or run ALTER TABLE ADD PARTITION yourself on a schedule. Note: Confluent's HDFS Kafka Connect connector will automatically create Hive table partitions for you.
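For instance, a hedged sketch of that partition maintenance (the table name web_logs, the partition column dt and the path are illustrative):

    ALTER TABLE web_logs ADD IF NOT EXISTS PARTITION (dt='2018-10-01')
    LOCATION '/data/logs/dt=2018-10-01';
    -- or let Hive discover whatever directories have landed on HDFS
    MSCK REPAIR TABLE web_logs;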
If you must use Hive, you can improve query performance by converting the data into ORC or Parquet format.
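A one-off conversion could look like this (assuming the illustrative external text table above):

    CREATE TABLE web_logs_orc STORED AS ORC AS
    SELECT * FROM web_logs;

Queries against the ORC copy scan far less data thanks to columnar storage and predicate pushdown.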

Querying data from har archives - Apache Hive

I am using Hadoop and facing the dreaded problem of large numbers of small files. I need to be able to create har archives out of existing hive partitions and query them at the same time. However, Hive apparently supports archiving partitions only in managed tables and not external tables - which is pretty sad. I am trying to find a workaround for this, by manually archiving the files inside a partition's directory, using hadoop's archive tool. I now need to configure hive to be able to query the data stored in these archives, along with the unarchived data stored in other partition directories. Please note that we only have external tables in use.
The namespace for accessing the files in the created partition HAR corresponds to the HDFS path of the partition directory.
For example, a file in HDFS:
hdfs:///user/user1/data/db1/tab1/ds=2016_01_01/f1.txt
can after archiving be accessed as:
har:///user/user1/data/db1/tab1/ds=2016_01_01.har/f1.txt
Would it be possible for hive to query the har archives from the external table? Please suggest a way if yes.
Best Regards
In practice, the line between "managed" and "external" tables is very thin.
My suggestion:
create a "managed" table
explicitly add partitions for some days in the future, but with ad hoc locations -- i.e. the directories your external process expects to use
let the external process dump its files directly at HDFS level -- they are automagically exposed in Hive queries, "managed" or not (the Metastore does not track individual files and blocks; they are detected on each query; as a side note, you can run backup & restore operations at HDFS level if you wish, as long as you don't mess with the directory structure)
when a partition is "cold" and you are pretty sure there will never be another file dumped there, you can run a Hive command to archive the partition, i.e. move the small files into a single HAR and flag the partition as "archived" in the Metastore (see the example commands after the caveat below)
Bonus: it's easy to unarchive your partition within Hive (whereas there is no hadoop unarchive command AFAIK).
Caveat: it's a "managed" table so remember not to DROP anything unless you have safely moved your data out of the Hive-managed directories.
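For reference, a hedged sketch of the archive/unarchive commands mentioned above, reusing the partition key from the question (archiving has to be enabled in the session first):

    SET hive.archive.enabled=true;
    ALTER TABLE tab1 ARCHIVE PARTITION (ds='2016_01_01');
    -- and later, to go back to plain files
    ALTER TABLE tab1 UNARCHIVE PARTITION (ds='2016_01_01');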

Load Data into Hive from Flat files or existing database

We are setting up Hadoop and Hive in our organization.
We will also have sample data created by a data generator tool; the data will be around 1 TB.
My question is: how do I load that data into Hive and Hadoop? What is the process I need to follow?
We will also have HBase installed alongside Hadoop.
We need to recreate in Hive the same database design that currently exists in SQL Server, because after the data is loaded into Hive we want to use Business Objects 4.1 as a front end to create the reports.
The challenge is to load the sample data into Hive.
Please help me, as we want to do all of this asap.
First, ingest your data into HDFS.
Then use Hive external tables pointing to the location where you ingested the data, i.e. your HDFS directory.
You are all set to query the data from the tables you created in Hive.
Good luck.
For the first case you need to put the data in HDFS:
Transport your data file(s) to a client node (app node).
Put your files into the distributed file system (hdfs dfs -put ...).
Create an external table pointing to the HDFS directory into which you uploaded those files. Your data must be structured in some way, for instance delimited by a semicolon.
Now you can operate over the data with SQL queries, as in the sketch below.
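A hedged end-to-end sketch of those steps (file name, table name and columns are illustrative):

    hdfs dfs -mkdir -p /data/sample
    hdfs dfs -put sample_data.csv /data/sample/

    -- then, in Hive:
    CREATE EXTERNAL TABLE sample_data (
      id INT,
      name STRING,
      amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
    LOCATION '/data/sample';

    SELECT COUNT(*) FROM sample_data;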
For the second case you can create another Hive table backed by HBase (using the HBaseStorageHandler, https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration) and load it from the first table with an INSERT statement.
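A minimal sketch of such an HBase-backed table, assuming the Hive HBase integration jars are on the classpath (table name and column mapping are illustrative):

    CREATE TABLE sample_data_hbase (id INT, name STRING, amount DOUBLE)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:amount')
    TBLPROPERTIES ('hbase.table.name' = 'sample_data');

    INSERT OVERWRITE TABLE sample_data_hbase
    SELECT id, name, amount FROM sample_data;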
I hope this can help you.

Basic thing about Hadoop and Hive

I have started working with Hadoop recently. There is a table named Checkout that I access through Hive. Below is the path where the data goes in HDFS, along with other info. What information can I get if I have to read the three lines below?
Path | Size | Record Count | Date Loaded
/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00 | 1.13 TB | 9,294,245,800 | 2012-07-05 07:26
/sys/edw/dw_checkout_trans/snapshot/2012/07/03/00 | 1.13 TB | 9,290,477,963 | 2012-07-04 09:37
/sys/edw/dw_checkout_trans/snapshot/2012/07/02/00 | 1.12 TB | 9,286,199,847 | 2012-07-03 07:08
So my questions are:
1) First, we load the data into HDFS and then query it through Hive to get results back, right?
2) Second, looking at the paths above, the one thing I am confused about is: when I query using Hive, will I get data from all three paths above, or only from the most recent one at the top?
As I am new to this stuff, I am having a lot of trouble. Can anyone explain where Hive gets its data from? Do we store all the data in HDFS and then use Hive or Pig to get it back from HDFS? It would be great if someone could give a high-level overview of Hadoop and Hive.
I think you need to understand the difference between Hive's native (managed) tables and Hive's external tables.
A Hive native table means that you load data into Hive and it takes care of how the data is stored in HDFS. We usually do not care what the directory structure is in this case.
A Hive external table means that we put data in some directory (ignoring partitioning for the moment) and tell Hive: this is the table's data, please treat it as such. Hive then lets us query it and join it with other external or regular tables, but it is our responsibility to add the data, delete it, etc. A sketch of both kinds of DDL follows below.
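A hedged illustration of the two kinds of table, reusing the snapshot path from the question (column names are illustrative):

    -- native (managed) table: Hive owns the storage under its warehouse directory
    CREATE TABLE checkout_managed (txn_id BIGINT, amount DOUBLE);

    -- external table: Hive only points at a directory we manage ourselves
    CREATE EXTERNAL TABLE checkout_ext (txn_id BIGINT, amount DOUBLE)
    LOCATION '/sys/edw/dw_checkout_trans/snapshot/2012/07/04/00';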
