How to insert data from a file into an HBase table? - hadoop

I made example.nt, which looks like this:
1 "aaaaa1" "bbbbb1" "ccccc1"
2 "aaaaa2" "bbbbb2" "ccccc2"
3 "aaaaa3" "bbbbb3" "ccccc3"
.......
I want to insert this data into an HBase table which consists of
(key int,subject string,predicate string,object string)
(:key,cf1:val1,cf1:val2,cf1:val3)
I want to perform this insert in the HBase shell.
How can I do this?

The HBase shell is not designed for this purpose; it only lets you insert data into HBase one cell at a time with put commands.
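For example, inserting just the first line of example.nt by hand would take one put per cell (assuming a table named so_table with column family cf1, as used below):
put 'so_table', '1', 'cf1:val1', 'aaaaa1'
put 'so_table', '1', 'cf1:val2', 'bbbbb1'
put 'so_table', '1', 'cf1:val3', 'ccccc1'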
Instead, you can use the importtsv tool, which lets you import text data directly into HBase.
Assuming you have already created the HBase table so_table with one column family cf1, and your example.nt file is in the /tmp/example/ HDFS directory, you can use it like this:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:val1,cf1:val2,cf1:val3 so_table /tmp/example/
You may also need to add an option to change the column separator:
-Dimporttsv.separator=';'
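For reference, the setup assumed above might look roughly like this (the paths are the ones from the question; adjust as needed). In the HBase shell:
create 'so_table', 'cf1'
And from the command line, to upload example.nt to HDFS:
hadoop fs -mkdir /tmp/example
hadoop fs -put example.nt /tmp/example/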
Furthermore, you should understand that this approach still inserts the data into HBase directly, via many put commands. There is another way to use the importtsv tool that is better suited to bulk loading large amounts of input data: you can generate StoreFiles (HFiles) and then load them into HBase in one step with the completebulkload tool:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.bulk.output=/tmp/example_output -Dimporttsv.columns=HBASE_ROW_KEY,cf1:val1,cf1:val2,cf1:val3 so_table /tmp/example/
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/example_output so_table
You can read the official documentation of this tool here: https://hbase.apache.org/book.html#_importtsv

Related

HBase Shell - Create a reduced table from an existing HBase table

I want to create a reduced version of an HBase table via the HBase shell. For example:
HBase Table 'test' is already present in HBase with following info:
TableName: 'test'
ColumnFamily: 'f'
Columns: 'f:col1', 'f:col2', 'f:col3', 'f:col4'
I want to create another table in HBase, 'test_reduced', which looks like this:
TableName: 'test_reduced'
ColumnFamily: 'f'
Columns: 'f:col1', 'f:col3'
How can we do this via the HBase shell? I know how to copy the table using the snapshot command, so I am mainly looking for a way to drop columns from an HBase table.
You can't do it in the shell; you need to use the HBase client API.
1- Read the table in.
2- "put" only the columns you want into your new table.
Cloudera came close by enabling users to perform "partial HBase table copies" with the CopyTable tool, but that only lets you change column family names ... (I am not sure you are using Cloudera), and even that is not what you are looking for.
For your reference:
http://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/
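A minimal sketch of that two-step approach with the HBase 1.x+ Java client API, using the table and column names from the question, might look like the following. It assumes test_reduced has already been created with column family 'f' (e.g. create 'test_reduced', 'f' in the shell):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReducedTableCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] f = Bytes.toBytes("f");
        byte[] col1 = Bytes.toBytes("col1");
        byte[] col3 = Bytes.toBytes("col3");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table source = conn.getTable(TableName.valueOf("test"));
             Table target = conn.getTable(TableName.valueOf("test_reduced"))) {

            // 1- read the table in, asking only for the columns we want to keep
            Scan scan = new Scan();
            scan.addColumn(f, col1);
            scan.addColumn(f, col3);

            try (ResultScanner scanner = source.getScanner(scan)) {
                for (Result row : scanner) {
                    // 2- put only those columns into the new table, keeping the row key
                    Put put = new Put(row.getRow());
                    byte[] v1 = row.getValue(f, col1);
                    byte[] v3 = row.getValue(f, col3);
                    if (v1 != null) put.addColumn(f, col1, v1);
                    if (v3 != null) put.addColumn(f, col3, v3);
                    if (!put.isEmpty()) target.put(put);
                }
            }
        }
    }
}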

delete duplicates using pig where there is no primary key

I'm a newbie to Hadoop and I have a use case with 3 columns: name, value, timestamp. The data is comma separated and in CSV format. I need to check for duplicates and delete them using Pig. How can I achieve that?
You can use Pig's DISTINCT operator to remove duplicates.
Please refer to this link to learn about the DISTINCT operator.
Since you say your data resides in a Hive table and you want to access that data through Pig, you can use HCatLoader() to access the Hive table from Pig. HCatalog can be used for both external and internal Hive tables. But before using this loader, please verify that your cluster has HCatalog configured. If you are using Hadoop 2.x, it should be there.
Using HCatalog, your Pig LOAD command will look like this:
A = LOAD 'table_name' using HCatLoader();
If you don't want to use HCatalog, your Hive tables are external tables, and you know the HDFS location of the data, then you can use CSVLoader() to access the data. Using CSVLoader(), your Pig LOAD command will look like this:
REGISTER piggybank.jar
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
--Load data using CSVLoader.
A = LOAD '/user/hdfs/dirtodata/MyData.csv' using CSVLoader AS (
    name:chararray, value:chararray, timestamp:chararray
);
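Once the data is loaded (with either loader), removing the duplicates and storing the result back is straightforward; the output path below is only an example:
B = DISTINCT A;  -- keeps one copy of each fully identical (name, value, timestamp) tuple
STORE B INTO '/user/hdfs/dirtodata/MyData_dedup' USING PigStorage(',');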
Hive external tables are designed so that users can access the data from outside Hive, for example from Pig or MapReduce programs. But if your Hive table is an internal (managed) table and you want to analyze the data using Pig, then you can still use HCatLoader() to access the Hive table data through Pig.
In both scenarios the original data is not affected by the analysis: you are only reading the data, not modifying it.
Please refer to the links below to understand more about HCatalog.
http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/
https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat

How can I know all the columns in an HBase table?

In the HBase shell I use describe 'table_name', but only the column families are returned. How can I find out all the columns in each column family?
As @zsxwing said, you need to scan all the rows, since in HBase each row can have a completely different schema (that's part of the power of Hadoop: the ability to store poly-structured data). You can look at the HFile file structure and see that HBase doesn't track the columns.
Thus the column families and their settings are in fact the schema of the HBase table, and that's what you get when you describe it.
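For a quick, approximate look from the shell, scanning a sample of rows shows which column qualifiers actually exist in the stored data (the LIMIT value here is arbitrary); for the complete set of columns you would still have to scan the entire table:
scan 'table_name', {LIMIT => 10}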

MapReduce & Hive application Design

I have a design question: in my CDH 4.1.2 (Cloudera) installation I have daily rolling log data dumped into HDFS, and I have some reports that calculate the success and failure rates per day.
I have two approaches:
1. Load the daily log data into Hive tables and create a complex query.
2. Run a MapReduce job upfront every day to generate the summary (which is essentially a few lines) and keep appending it to a common file which is a Hive table. Later, while running the report, I could use a simple select query to fetch the summary.
I am trying to understand which of the two would be the better approach, or whether there is a better one altogether.
The second approach adds some complexity in terms of merging files. If they are not merged, I would have lots of very small files, which seems like a bad idea.
Your inputs are appreciated.
Thanks
Hive seems well suited to this kind of task, and it should be fairly simple to do:
Create an EXTERNAL table in Hive, partitioned by day. The goal is for the directory where you dump your data to sit directly under your Hive table's location. You can specify the field delimiter of your daily logs as shown below, where I use commas:
create external table mytable(...) partitioned by (day string) row format delimited fields terminated by ',' location '/user/hive/warehouse/mytable';
When you dump your data into HDFS, make sure you dump it into a subdirectory of that location named day=<value> so it can be recognized as a Hive partition, for example /user/hive/warehouse/mytable/day=2013-01-23.
You need then to let Hive know that this table has a new partition:
alter table mytable add partition (day='2013-01-23')
Now that the Hive metastore knows about your partition, you can run your summary query. Make sure you only query that partition by specifying ... where day='2013-01-23'.
You could easily script this to run daily from cron or something similar: get the current date (for example with the shell date command) and substitute it into a shell script that performs the steps above.
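A minimal sketch, assuming the day's data has already been dumped to /user/hive/warehouse/mytable/day=<today> by some other process before this runs, could be as simple as:
# register today's partition (date +%F prints e.g. 2013-01-23)
hive -e "alter table mytable add partition (day='$(date +%F)')"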

how to load data in hive automatically

I recently wanted to load log files into Hive tables, and I want a tool which can read data from a certain directory and load it into Hive automatically. The directory may include lots of subdirectories; for example, the directory is '/log' and the subdirectories are '/log/20130115', '/log/20130116', '/log/20130117'. Are there ETL tools that can do the following: once new data is stored in that directory, the tool detects it automatically and loads it into a Hive table? Do such tools exist, or do I have to write a script myself?
You can easily do this using Hive external tables and by partitioning your table by day. For example, create your table like this:
create external table mytable(...)
partitioned by (day string)
location '/user/hive/warehouse/mytable';
This will essentially create an empty table in the metastore and make it point to /user/hive/warehouse/mytable.
Then you can load your data into this directory using subdirectories of the form key=value, where key is your partition name (here "day") and value is the value of your partition. For example:
hadoop fs -put /log/20130115 /user/hive/warehouse/mytable/day=20130115
Once your data is loaded there, it is in the HDFS directory, but the Hive metastore doesn't know yet that it belongs to the table, so you can add it this way:
alter table mytable add partition(day='20130115');
And you should be good to go, the metastore will be updated with your new partition, and you can now query your table on this partition.
This should be trivial to script: you can create a cron job that runs once a day, performs these commands in order, and finds the partition to load with the date command, for example by running:
hadoop fs -test -e /log/`date +%Y%m%d`
and checking whether $? is equal to 0 will tell you if the directory is there; if it is, you can transfer it and add the partition as described above.
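Putting those pieces together, a rough sketch of such a daily script (assuming /log already lives in HDFS; if it is on the local file system, use hadoop fs -put as shown earlier) might look like this:
#!/bin/bash
# hypothetical daily loader for the table and paths used in this answer
day=$(date +%Y%m%d)
# -e tests that today's log directory exists in HDFS
if hadoop fs -test -e /log/$day; then
    # move it under the table location as a day= partition directory
    hadoop fs -mv /log/$day /user/hive/warehouse/mytable/day=$day
    # tell the Hive metastore about the new partition
    hive -e "alter table mytable add partition (day='$day')"
fi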
You can make use of the LOAD DATA command provided by Hive. It matches your use case exactly: specify a directory in your local file system and load Hive tables from it.
Example usage -
LOAD DATA LOCAL INPATH '/home/user/some-directory'
OVERWRITE INTO TABLE table
