I am trying to load the data from all files in an HDFS directory into an existing HBase table. Can you please tell me how to load all of the files, and then incremental data, into the HBase table?
I created HBase table as
hbase>create 'sample','cf'
I have to copy
hdfs://ip:port/user/test
into the sample HBase table. Please suggest a solution.
Answer 1:(possible)
ImportTsv: if you provide only the directory path (e.g. /user/hadoop/) instead of a full file path, it should process all files within that directory.
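A rough sketch of the invocation (assuming the files under /user/test are comma-separated, with the first field used as the row key and a second field mapped to a hypothetical column cf:c1):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1 \
  sample /user/test
Re-running the same command against a directory containing only the newly arrived files adds those rows to the existing table, which covers the incremental case.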
Answer 2:(seems not possible)
The special column name HBASE_ROW_KEY is used to designate that this
column should be used as the row key for each imported record. You
must specify exactly one column to be the row key, and you must
specify a column name for every column that exists in the input data.
Related
I have created an external table in Hive using the following:
create external table hpd_txt(
WbanNum INT,
YearMonthDay INT ,
Time INT,
HourlyPrecip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/user/hive/external';
Now this table is created at the location */hive/external.
Step-1: I loaded data into this table using:
load data inpath '/input/hpd.txt' into table hpd_txt;
The data is successfully loaded into the specified path (*/external/hpd_txt).
Step-2: I delete the table from the */hive/external path using the following:
hadoop fs -rmr /user/hive/external/hpd_txt
Questions:
Why is the data deleted from the original path? (*/input/hpd.txt is deleted from HDFS, but the table is created in the */external path.)
After I delete the table from HDFS as in Step-2 and run show tables; again, it still lists the table hpd_txt in the external path.
So where is this coming from?
Thanks in advance.
Hive doesn't know that you deleted the files. Hive still expects to find the files in the location you specified. You can do whatever you want in HDFS, but this doesn't get communicated to Hive; you have to tell Hive if things change.
hadoop fs -rmr /user/hive/external/hpd_txt
For instance, the above command doesn't delete the table; it just removes the files. The table still exists in the Hive metastore. If you want to delete the table, then use:
DROP TABLE IF EXISTS tablename;
Since you created the table as an external table, this will drop the table from Hive. The files will remain if you haven't removed them. If you want to delete an external table and the files the table is reading from, you can do one of the following:
Drop the table and then remove the files
Change the table to managed and drop the table
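For the second option, a minimal sketch (using the table name hpd_txt from the question; note that the EXTERNAL property value is case-sensitive in some Hive versions):
ALTER TABLE hpd_txt SET TBLPROPERTIES('EXTERNAL'='FALSE');  -- convert the external table to a managed one
DROP TABLE hpd_txt;  -- dropping a managed table also deletes its data files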
Finally, the default Hive warehouse directory (hive.metastore.warehouse.dir) is /user/hive/warehouse.
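To check where a given table actually keeps its data and whether it is EXTERNAL or MANAGED, you can inspect it from the Hive shell, for example:
DESCRIBE FORMATTED hpd_txt;  -- look at the Location and Table Type fields in the output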
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for the table. This comes in handy if you already have data generated. Otherwise, data has to be loaded (conventionally, or by creating a file in the directory pointed to by the Hive table).
When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
Source: Hive docs
So, in your Step-2, removing the file /user/hive/external/hpd_txt removes the data source (the data the table points to), but the table still exists and continues to point to hdfs://localhost:9000/user/hive/external as it was created.
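You can confirm this from the Hive shell; the table is still registered in the metastore, it just has no data behind it anymore:
SHOW TABLES;  -- hpd_txt is still listed
SELECT * FROM hpd_txt LIMIT 5;  -- returns no rows, since the backing files are gone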
@Anoop: Not sure if this answers your question. Let me know if you have any further questions.
Do not use the LOAD DATA INPATH command. The load operation MOVES (it does not COPY) the data into the corresponding Hive table location. Use hadoop fs -put or -copyFromLocal to copy the file from the local file system into HDFS, and simply provide that HDFS location in the CREATE TABLE statement after the put command has run.
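A minimal sketch of that workflow (the local path /home/user/hpd.txt is made up; the target directory is the one from the question):
hadoop fs -put /home/user/hpd.txt /user/hive/external/  # copies the file; the local copy stays where it is
The CREATE EXTERNAL TABLE ... LOCATION 'hdfs://localhost:9000/user/hive/external' statement shown above then reads those files in place, with no LOAD DATA step and no files being moved.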
Deleting an external table does not remove the HDFS files from disk; that is the advantage of an external table. An external table stores only the metadata needed to access the data files, so if you drop the table, the data files are untouched in their HDFS location. In the case of internal (managed) tables, both the metadata and the data are removed when you drop the table.
After going through your helpful comments and other posts, I have found the answer to my question.
If I use the LOAD DATA INPATH command, it "moves" the source file to the location where the external table is created. Although that file won't be affected when the table is dropped, having the source file relocated is not good. So use LOAD DATA LOCAL INPATH when loading data into internal tables.
To load data into an external table from a file already in HDFS, point the LOCATION clause of the CREATE TABLE query at the source directory, for example:
create external table hpd(WbanNum string,
YearMonthDay string ,
Time string,
hourprecip string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/input/hpd/';
This location points to the data already present in HDFS at that path, so there is no need to use the LOAD DATA INPATH command here.
It's good practice to store source files in their own dedicated directories, so that there is no ambiguity when external tables are created and the data sits in a properly managed directory structure.
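For example, following that practice for the table above (a sketch, assuming hpd.txt sits in the current local directory):
hadoop fs -mkdir /input/hpd  # a dedicated directory for just this dataset
hadoop fs -put hpd.txt /input/hpd/  # the external table's LOCATION points here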
Thanks a lot for helping me understand this concept guys! Cheers!
Suppose I have data distributed across many computers in my cluster.
How can I load my data using Hive without worrying about its location?
Thanks
See the details below on how to load data into Hive from HDFS.
Loading files into tables
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Syntax
LOAD DATA INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Synopsis
Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
i)-
filepath can be:
a relative path, such as project/data1
an absolute path, such as /user/hive/project/data1
a full URI with scheme and (optionally) an authority, such as
hdfs://namenode:9000/user/hive/project/data1
ii)-
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
iii)-
filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.
iv)-
If the keyword LOCAL is not specified, then Hive will either use the full URI of filepath, if one is specified, or will apply the following rules:
If scheme or authority are not specified, Hive will use the scheme and authority from the Hadoop configuration variable fs.default.name that specifies the NameNode URI.
If the path is not absolute, then Hive will interpret it relative to /user/<username>.
Hive will move the files addressed by filepath into the table (or partition).
v)-
If the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise the files referred by filepath will be added to the table.
Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath, then the existing file will be replaced with the new file.
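Putting the pieces above together, a hedged example (the table name sales and the partition column ds are made up for illustration; the filepath is the sample URI from above):
LOAD DATA INPATH 'hdfs://namenode:9000/user/hive/project/data1'
OVERWRITE INTO TABLE sales
PARTITION (ds='2013-01-15');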
See the full details at the following link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
I have 3 columns: user, datetime, and data
My data is space delimited and each row is delimited by a new line
Right now I'm using the RegexSerDe to read in my input; however, I want to partition by the user. If I do that, user can no longer be a regular column, correct? If so, how do I load my data into my tables?
In Hive, each partition corresponds to a folder in HDFS. You can reload the data from your unpartitioned Hive table into a new partitioned Hive table using a create-table-as-select (CTAS) statement. See https://cwiki.apache.org/Hive/languagemanual-ddl.html#LanguageManualDDL-CreateTable for more details.
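A sketch of one way to do that reload. Note that many Hive versions do not allow CTAS to create a partitioned table directly, so this uses a plain CREATE TABLE followed by an INSERT ... SELECT with dynamic partitioning; the table and column names (logs_raw, logs_by_user, usr, dt, data) are made up:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE logs_by_user (dt STRING, data STRING)
PARTITIONED BY (username STRING);
-- the dynamic partition column must be the last expression in the SELECT list
INSERT OVERWRITE TABLE logs_by_user PARTITION (username)
SELECT dt, data, usr FROM logs_raw;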
You can organize the data in HDFS into sub-directories under the table's directory; each directory name has to be in the format PART_NAME=PART_VALUE.
If your data is split into files where in each file you have only one type of "user" just create directories corresponding to the usernames (e.g. USERNAME=XYZ) and put all the files that match that username in its directory.
Next you can create an external table with partitions (see the example below).
The only problem is that you'll have to define the column "user" that's in your data anyway (but you can just ignore it) and query by the partition column (USERNAME), which will provide the needed partition pruning.
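A hedged sketch of that layout (table and directory names are illustrative, and a plain delimited format is used instead of the RegexSerDe for brevity; the files under each USERNAME=XYZ directory keep their original three space-delimited fields):
CREATE EXTERNAL TABLE logs (usr STRING, dt STRING, data STRING)
PARTITIONED BY (username STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/logs';
-- register each user's directory as a partition
ALTER TABLE logs ADD PARTITION (username='XYZ')
LOCATION '/data/logs/USERNAME=XYZ';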
My requirement is to load XML files, which are collected into a network share folder by different sources, into Hive. I need confirmation of the approach to follow.
As I understand it, I have to:
1. Load all the files into HDFS first.
2. Then, using MapReduce or Sqoop, transform the XML files into the required table layout and load them into Hive.
Please suggest a better approach if one exists.
To process and read XML files:
Mahout has an XmlInputFormat; see the links below to read more:
https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
http://xmlandhadoop.blogspot.com.au/2010/08/xml-processing-in-hadoop.html
Pig has XMLLoader
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/piggybank/storage/XMLLoader.html
After processing with any of the above approaches, you can push the results to the Hive table's location.
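A rough sketch of the overall flow (the paths are made up; the XML-to-columns transformation itself is whichever MapReduce or Pig job you pick from the links above):
hadoop fs -mkdir /data/xml_raw
hadoop fs -put /mnt/share/xml/*.xml /data/xml_raw/  # step 1: stage the raw XML in HDFS
# step 2: run the MapReduce or Pig job, writing its output to e.g. /data/xml_parsed
# step 3: point a Hive external table at the parsed output, with columns matching that output,
#         e.g. CREATE EXTERNAL TABLE xml_data ... LOCATION '/data/xml_parsed';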
Thanks
You are not required to copy the data into HDFS; you can load it directly into the Hive table using this command:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
filepath can be:
1. a relative path, e.g. project/data1
2. an absolute path, e.g. /user/hive/project/data1
3. a full URI with scheme and (optionally) an authority, e.g. hdfs://namenode:9000/user/hive/project/data1
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
filepath can refer to a file (in which case hive will move the file into the table) or it can be a directory (in which case hive will move all the files within that directory into the table). In either case filepath addresses a set of files.
If the keyword LOCAL is specified, then:
1.the load command will look for filepath in the local file system. If a relative path is specified - it will be interpreted relative to the current directory of the user. User can specify a full URI for local files as well - for example: file:///user/hive/project/data1
2.the load command will try to copy all the files addressed by filepath to the target filesystem. The target file system is inferred by looking at the location attribute of the table. The copied data files will then be moved to the table.
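For example (a sketch, using the local URI from above and a made-up table name):
LOAD DATA LOCAL INPATH 'file:///user/hive/project/data1'
INTO TABLE mytable;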
If the keyword LOCAL is not specified, then
Hive will use the full URI of filepath if one is specified. Otherwise, the following rules are applied:
If scheme or authority are not specified, Hive will use the scheme and authority from the Hadoop configuration variable fs.default.name that specifies the NameNode URI.
If the path is not absolute, then Hive will interpret it relative to /user/<username>.
Hive will move the files addressed by filepath into the table (or partition).
if the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced with the files referred to by filepath. Otherwise the files referred by filepath will be added to the table.
Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath - then the existing file will be replaced with the new file.
Recently I have wanted to load log files into Hive tables, and I want a tool that can read data from a certain directory and load it into Hive automatically. This directory may include lots of subdirectories; for example, the certain directory is '/log' and the subdirectories are '/log/20130115', '/log/20130116', '/log/20130117'. Is there an ETL tool that can achieve the following: once new data is stored in the directory, the tool detects it automatically and loads it into the Hive table? Is there such a tool, or do I have to write a script myself?
You can easily do this using Hive external tables and partitioning your table by day. For example, create your table as such:
create external table mytable(...)
partitioned by (day string)
location '/user/hive/warehouse/mytable';
This will essentially create an empty table in the metastore and make it point to /user/hive/warehouse/mytable.
Then you can load your data into this directory, using subdirectories named in the format key=value, where key is your partition name (here "day") and value is the value of your partition. For example:
hadoop fs -put /log/20130115 /user/hive/warehouse/mytable/day=20130115
Once your data is loaded there, it is in the HDFS directory, but the Hive metastore doesn't know yet that it belongs to the table, so you can add it this way:
alter table mytable add partition(day='20130115');
And you should be good to go, the metastore will be updated with your new partition, and you can now query your table on this partition.
This should be trivial to script: you can create a cron job that runs once a day, does these commands in order, and finds the partition to load with the date command, for example by repeatedly running this command:
hadoop fs -test -e /log/`date +%Y%m%d`
Checking whether $? is equal to 0 will tell you if the directory is there; if it is, you can transfer the data and add the partition as described above.
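A rough script sketch of that cron job (assuming the daily logs land in HDFS under /log/YYYYMMDD and the partitioned external table mytable from above already exists):
#!/bin/bash
DAY=$(date +%Y%m%d)
SRC=/log/$DAY
DEST=/user/hive/warehouse/mytable/day=$DAY
if hadoop fs -test -e "$SRC"; then
  # copy the day's directory into the partition location
  # (use hadoop fs -put instead if the logs sit on the local disk)
  hadoop fs -cp "$SRC" "$DEST"
  # tell the metastore about the new partition
  hive -e "ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (day='$DAY');"
fi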
You can make use of the LOAD DATA command provided by Hive; it matches your use case exactly. Specify a directory on your local file system and load Hive tables from it.
Example usage -
LOAD DATA LOCAL INPATH '/home/user/some-directory'
OVERWRITE INTO TABLE tablename;