How to load csv file from hdfs to hbase table using Dimporttsv - shell

I am trying to load a CSV file into an HBase table using the ImportTsv command.
The CSV files reside in a directory in my HDFS (/csvFiles).
The CSV file was generated from a MySQL table with the following fields:
+-------------+
| Field       |
+-------------+
| tweet_id    |
| user_id     |
| screen_name |
| description |
| created_at  |
+-------------+
I created a table in HBase with a single column family, as shown below:
create 'dummyTable', 'cf1'
The command I am using:
ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,cf1:user_id,cf1:tweet_id,cf1:screen_name,cf1:description,cf1:created_at dummyTable /csvFiles/all_users.csv
However, I am getting this syntax error:
SyntaxError: (hbase):8: syntax error, unexpected tSYMBEG
I've looked at the following posts and followed the recommendations in them but to no avail. I would appreciate your help.
Import TSV file into hbase table
https://community.hortonworks.com/articles/4942/import-csv-data-into-hbase-using-importtsv.html
http://hbase.apache.org/book.html#importtsv

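Exit the HBase shell and run the command from bash, adding single quotes around the importtsv.columns value. ImportTsv is a MapReduce tool launched through the hbase command, not an HBase shell command; the HBase shell is a JRuby interpreter, so it most likely reads :user_id as the start of a Ruby symbol, which is what the tSYMBEG error points to.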
bash$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf1:user_id,cf1:tweet_id,cf1:screen_name,cf1:description,cf1:created_at' dummyTable hdfs://<your_name_node_addr>/csvFiles/all_users.csv
(or)
From the HBase shell (this may still be rejected, because the shell parses its input as Ruby):
hbase(main):001:0> ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf1:user_id,cf1:tweet_id,cf1:screen_name,cf1:description,cf1:created_at' dummyTable hdfs://<your_name_node_addr>/csvFiles/all_users.csv
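To make the quoting easier to inspect before launching the job, the command can be assembled one argument per line. This is only a sketch: build_importtsv_cmd is a hypothetical helper name, not part of HBase, and it prints the arguments (a dry run) instead of executing them. The table, file, and column list are the ones from the question.

```shell
# Sketch: print the ImportTsv invocation one argument per line, so it is
# easy to see that the separator and the column list each stay one argument.
# build_importtsv_cmd is a hypothetical helper, not an HBase command.
build_importtsv_cmd() {
  local table="$1"
  local input="$2"
  local separator="$3"
  local columns="$4"
  printf '%s\n' \
    hbase \
    org.apache.hadoop.hbase.mapreduce.ImportTsv \
    "-Dimporttsv.separator=${separator}" \
    "-Dimporttsv.columns=${columns}" \
    "$table" \
    "$input"
}

# Dry run with the values from the question.
build_importtsv_cmd dummyTable /csvFiles/all_users.csv ',' \
  'HBASE_ROW_KEY,cf1:user_id,cf1:tweet_id,cf1:screen_name,cf1:description,cf1:created_at'
```

Dropping the printf '%s\n' prefix inside the function turns the dry run into the real invocation, which has to be started from bash rather than from inside the HBase shell.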

Related

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
Until now I have used Hive to create the database on Hadoop, and I am accessing it at localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal; it gives me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest a way to resolve it?
I suppose the data is initially in the local file system.
So a simple workflow could be: load the data from the local file system into the Hadoop file system (HDFS), create a Hive table over it, and then load the data into the Hive table.
Step 1:
# put in HDFS
$ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
# check files
$ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
-- display table structure
describe mytable;
Step 3:
-- the files are already in HDFS after Step 1, so LOCAL is omitted here
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
-- simple hive statement to fetch top 10 records
SELECT * FROM mytable limit 10;
You should use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into Hive tables.
If you don't specify LOCAL, the load command looks up the given file path in HDFS.
Please refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
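The difference between the two forms can be sketched as a small helper that emits the matching statement. hive_load_statement is a hypothetical function name, and the caller has to say explicitly whether the path is local or in HDFS; nothing here inspects either file system.

```shell
# Sketch: emit the LOAD DATA statement that matches where the file lives.
# hive_load_statement is a hypothetical helper; the first argument must be
# "local" (file on the machine running hive) or "hdfs" (file already in HDFS).
hive_load_statement() {
  local where="$1"
  local path="$2"
  local table="$3"
  case "$where" in
    local) printf "LOAD DATA LOCAL INPATH '%s' INTO TABLE %s;\n" "$path" "$table" ;;
    hdfs)  printf "LOAD DATA INPATH '%s' INTO TABLE %s;\n" "$path" "$table" ;;
    *)     echo "first argument must be 'local' or 'hdfs'" >&2; return 1 ;;
  esac
}
```

The generated statement could then be passed to Hive with hive -e, for example hive -e "$(hive_load_statement hdfs /path/to/your/HDFS_directory mytable)".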

How to do a bulkload to Hbase from CSV from command line

I am trying to bulk load a CSV file from the command line.
This is what I am trying:
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://localhost:9000/transactionsFile.csv bulkLoadtable
The error I am getting is below:
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:9000/transactionsFile.csv
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory hdfs://localhost:9000/transactionsFile.csv. Does it contain files in subdirectories that correspond to column family names?
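Is it possible to do a bulk load from the command line without writing Java MapReduce code?
You are almost there; the only thing missing is that the input to LoadIncrementalHFiles must be a directory. I suggest placing the input under a directory and passing the directory path as the argument to the command. Note that, as the warning says, the tool looks for files in subdirectories named after the column families, so a raw CSV file will likely need to be converted to HFiles first. Please refer to the link below.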
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.html#doBulkLoad(org.apache.hadoop.fs.Path,%20org.apache.hadoop.hbase.client.Admin,%20org.apache.hadoop.hbase.client.Table,%20org.apache.hadoop.hbase.client.RegionLocator)
Hope this helps.
You can do a bulk load from the command line. There are multiple ways to do this:
1. Using the completebulkload tool
a. Prepare your data by creating data files (StoreFiles) from a MapReduce job using HFileOutputFormat.
b. Import the prepared data using the completebulkload tool
eg: hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
more details,
hbase bulk load
2. Using importtsv
eg:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,id,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor hdfs://sandbox.hortonworks.com:/tmp/hbase.csv
more details
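The two approaches can also be chained: ImportTsv accepts -Dimporttsv.bulk.output, which makes it write HFiles to a directory instead of writing to the table, and that directory is the kind of input LoadIncrementalHFiles expects. The sketch below only prints the two commands. bulk_load_plan is a hypothetical helper, and the HFile directory and column list in the usage note are illustrative assumptions, since the question does not show the table's column families.

```shell
# Sketch: print the two commands of a CSV -> HFiles -> table bulk load.
# bulk_load_plan is a hypothetical helper; it runs nothing by itself.
bulk_load_plan() {
  local table="$1"
  local csv="$2"
  local hfile_dir="$3"
  local columns="$4"
  # Step 1: ImportTsv writes HFiles into hfile_dir instead of the table.
  printf '%s\n' "hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=${columns} -Dimporttsv.bulk.output=${hfile_dir} ${table} ${csv}"
  # Step 2: LoadIncrementalHFiles moves the generated HFiles into the table.
  printf '%s\n' "hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles ${hfile_dir} ${table}"
}
```

For the table in the question this might be called as bulk_load_plan bulkLoadtable hdfs://localhost:9000/transactionsFile.csv /tmp/hfiles 'HBASE_ROW_KEY,cf:col1', where /tmp/hfiles and cf:col1 are placeholders to be replaced with a real output directory and the real column mapping.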

ImportTsv command is not working in Hbase

I am using HBase 0.98.1-cdh5.1.3. I am trying to ingest a csv file present in my hdfs at location /user/hdfs/exp to Hbase. My file has data in the following format:
1,abc,xyz
2,def,uvw
3,ghi,rst
I am using the command below:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname tablename /user/hdfs/exp
I have also used different combinations like
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname tablename /user/hdfs/exp '-Dimporttsv.separator=,'
and
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname '-Dimporttsv.separator=,' tablename /user/hdfs/exp
but nothing works. It fails to detect the separator (a comma in my case), and the input is not parsed properly. Can anybody help me figure out where I am going wrong?
This is just one line of the data set:
10000064202896309897,1000006420,2896309897,10180,hdfs://btc5x015:8020/user/mr_test/logsJan/log_jan20_29/10180_log201501260000.log,3.2.3.1,9,2015-01-26,15:46:12.12,REF SHOULDER 4,n,n,SHOULDER,60,17.0,M,487093458,[study_16004_16004_],exam_16004_16004,[patient_16004_1_],Schulter std,SCHULTERGELENK RECHTS,8.10,NOT_EXIST,-8.1,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,n,NOT_EXIST,NOT_EXIST,y,n,HF,NOT_EXIST,NOT_EXIST,NOT_EXIST,,,NOT_EXIST,NOT_EXIST,N,NOT_EXIST,1,NOT_EXIST,IMAG,FFE,T1FFE,4.0,0.72,34,NOT_EXIST,NOT_EXIST,NOT_EXIST,4.0,cor,,NOT_EXIST,no,0,n,NOT_EXIST,1,0.0,,,,,,,NOT_EXIST,102,NOT_EXIST,NOT_EXIST,NOT_EXIST,15:45:29.28,15:46:12.12,15:46:12.12,9.5,9.4,1.002,NOT_EXIST,TRUE,no,NOT_EXIST,0.0,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,0.0,NOT_EXIST,NOT_EXIST,0.0,0.09,0.3,3.3,0,false,false,n,n,26.1,LT,0.33,0.03,NOT_EXIST,null,1,Dres.GrafKernHausmann,E:\Export\DataMonitoring\p_i_20150126_154530.frame,hdfs://btc5x015.code1.emi.philips.com:8020/user/mr_test/logsJan/log_jan20_29/10180_log201501260000.log,317774,883,0,0,1,8,6,2,0,0,0,0,0,0,6014,0,15:44:08.15,15:44:59.93,15:45:23.14,15:45:29.28,00:00:00.00,00:00:00.00,15:45:30.57,15:45:38.45,15:45:29.28,15:46:12.12,15:45:38.45,15:46:12.12,42984,33967,0,7988,6014,00:00:00.00,00:00:00.00,0,00:00:00.00,00:00:00.00,0,169,102,SENSE-SHOULDER8,,SENSE-SHOULDER8/(19) BODY-QUAD,190,190,CLINICAL,0,0,94166709,Radiologische Gemeinschaftspraxis,Dr. med. Michael Graf,Dr. med. Andreas Kern,Dr. med. Hausmann,Wetzlar,35578,Hausertorstr. 
47,6,NOT_EXIST,1,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,1,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,SHORTEST,NOT_EXIST,SHORTEST,1,NO,YES,DEFAULT,FFE,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,3,NOT_EXIST,CARTESIAN,YES,NO,NOT_EXIST,NO,NOT_EXIST,low,FULL,NOT_EXIST,NOT_EXIST,NOT_EXIST,3D,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,YES,NOT_EXIST,no,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,USER_DEF,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,DEF,H,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,MPU_MTC_MODE_NO,NO,NOT_EXIST,T1,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,450,405,YES,Supine,HF,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,DEFAULT,NOT_EXIST,2,SENSE-SHOULDER8 BODY-QUAD,F,400 400,,,100 100,5.625,7.23214293,4,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NO,NOT_EXIST,NOT_EXIST,80,PARALLEL,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,0,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,OFF,NOT_EXIST,NO,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,0,5.625,NOT_EXIST,NOT_EXIST,NOT_EXIST,10180,15:45:29.29,15:46:13.76,null,,,96,,1,,,0,10180,PATTERN_SRN,SHOULDER,SCHULTER,MATCHED_SHOULDER,SHOULDER,UPPER EXTREMITIES,ANATOMY_GROUP_MAPPED,10180,10180,10180,3.2.3,Achieva 3.0T,Achieva,T30,3.0T,NO,F2000,Watercooled2,274-D,Master,NONE,,,0,16,0,1,S26_128,NONE,8,null,CDAS,LOGFOLDER_SYSFOLDER_MATCHED_RELEASE_NOT_CHECKED,FALSE,null,null,null,null,12.15,SENSE-SHOULDER8/(19) Q-BODY,0,SCAN_PARSE_SUCCESS,SHOULDER,-4.64757729 5.37445641,
I just loaded the single line given in the question into an HBase table with the ImportTsv command by providing 414 columns, and it worked perfectly for me. Here is the command that I used:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,CF:c1,CF:c2,CF:c3,CF:c4,CF:c5,CF:c6,CF:c7,CF:c8,CF:c9,CF:c10,CF:c11,CF:c12,CF:c13,CF:c14,CF:c15,CF:c16,CF:c17,CF:c18,CF:c19,CF:c20,CF:c21,CF:c22,CF:c23,CF:c24,CF:c25,CF:c26,CF:c27,CF:c28,CF:c29,CF:c30,CF:c31,CF:c32,CF:c33,CF:c34,CF:c35,CF:c36,CF:c37,CF:c38,CF:c39,CF:c40,CF:c41,CF:c42,CF:c43,CF:c44,CF:c45,CF:c46,CF:c47,CF:c48,CF:c49,CF:c50,CF:c51,CF:c52,CF:c53,CF:c54,CF:c55,CF:c56,CF:c57,CF:c58,CF:c59,CF:c60,CF:c61,CF:c62,CF:c63,CF:c64,CF:c65,CF:c66,CF:c67,CF:c68,CF:c69,CF:c70,CF:c71,CF:c72,CF:c73,CF:c74,CF:c75,CF:c76,CF:c77,CF:c78,CF:c79,CF:c80,CF:c81,CF:c82,CF:c83,CF:c84,CF:c85,CF:c86,CF:c87,CF:c88,CF:c89,CF:c90,CF:c91,CF:c92,CF:c93,CF:c94,CF:c95,CF:c96,CF:c97,CF:c98,CF:c99,CF:c100,CF:c101,CF:c102,CF:c103,CF:c104,CF:c105,CF:c106,CF:c107,CF:c108,CF:c109,CF:c110,CF:c111,CF:c112,CF:c113,CF:c114,CF:c115,CF:c116,CF:c117,CF:c118,CF:c119,CF:c120,CF:c121,CF:c122,CF:c123,CF:c124,CF:c125,CF:c126,CF:c127,CF:c128,CF:c129,CF:c130,CF:c131,CF:c132,CF:c133,CF:c134,CF:c135,CF:c136,CF:c137,CF:c138,CF:c139,CF:c140,CF:c141,CF:c142,CF:c143,CF:c144,CF:c145,CF:c146,CF:c147,CF:c148,CF:c149,CF:c150,CF:c151,CF:c152,CF:c153,CF:c154,CF:c155,CF:c156,CF:c157,CF:c158,CF:c159,CF:c160,CF:c161,CF:c162,CF:c163,CF:c164,CF:c165,CF:c166,CF:c167,CF:c168,CF:c169,CF:c170,CF:c171,CF:c172,CF:c173,CF:c174,CF:c175,CF:c176,CF:c177,CF:c178,CF:c179,CF:c180,CF:c181,CF:c182,CF:c183,CF:c184,CF:c185,CF:c186,CF:c187,CF:c188,CF:c189,CF:c190,CF:c191,CF:c192,CF:c193,CF:c194,CF:c195,CF:c196,CF:c197,CF:c198,CF:c199,CF:c200,CF:c201,CF:c202,CF:c203,CF:c204,CF:c205,CF:c206,CF:c207,CF:c208,CF:c209,CF:c210,CF:c211,CF:c212,CF:c213,CF:c214,CF:c215,CF:c216,CF:c217,CF:c218,CF:c219,CF:c220,CF:c221,CF:c222,CF:c223,CF:c224,CF:c225,CF:c226,CF:c227,CF:c228,CF:c229,CF:c230,CF:c231,CF:c232,CF:c233,CF:c234,CF:c235,CF:c236,CF:c237,CF:c238,CF:c239,CF:c240,CF:c241,CF:c242,CF:c243,CF:c244,CF:c245,CF:c246,CF:c247,CF:c248,CF:c249,CF:c250,CF:c251,CF:c252,CF:c253,
CF:c254,CF:c255,CF:c256,CF:c257,CF:c258,CF:c259,CF:c260,CF:c261,CF:c262,CF:c263,CF:c264,CF:c265,CF:c266,CF:c267,CF:c268,CF:c269,CF:c270,CF:c271,CF:c272,CF:c273,CF:c274,CF:c275,CF:c276,CF:c277,CF:c278,CF:c279,CF:c280,CF:c281,CF:c282,CF:c283,CF:c284,CF:c285,CF:c286,CF:c287,CF:c288,CF:c289,CF:c290,CF:c291,CF:c292,CF:c293,CF:c294,CF:c295,CF:c296,CF:c297,CF:c298,CF:c299,CF:c300,CF:c301,CF:c302,CF:c303,CF:c304,CF:c305,CF:c306,CF:c307,CF:c308,CF:c309,CF:c310,CF:c311,CF:c312,CF:c313,CF:c314,CF:c315,CF:c316,CF:c317,CF:c318,CF:c319,CF:c320,CF:c321,CF:c322,CF:c323,CF:c324,CF:c325,CF:c326,CF:c327,CF:c328,CF:c329,CF:c330,CF:c331,CF:c332,CF:c333,CF:c334,CF:c335,CF:c336,CF:c337,CF:c338,CF:c339,CF:c340,CF:c341,CF:c342,CF:c343,CF:c344,CF:c345,CF:c346,CF:c347,CF:c348,CF:c349,CF:c350,CF:c351,CF:c352,CF:c353,CF:c354,CF:c355,CF:c356,CF:c357,CF:c358,CF:c359,CF:c360,CF:c361,CF:c362,CF:c363,CF:c364,CF:c365,CF:c366,CF:c367,CF:c368,CF:c369,CF:c370,CF:c371,CF:c372,CF:c373,CF:c374,CF:c375,CF:c376,CF:c377,CF:c378,CF:c379,CF:c380,CF:c381,CF:c382,CF:c383,CF:c384,CF:c385,CF:c386,CF:c387,CF:c388,CF:c389,CF:c390,CF:c391,CF:c392,CF:c393,CF:c394,CF:c395,CF:c396,CF:c397,CF:c398,CF:c399,CF:c400,CF:c401,CF:c402,CF:c403,CF:c404,CF:c405,CF:c406,CF:c407,CF:c408,CF:c409,CF:c410,CF:c411,CF:c412,CF:c413,CF:c414 '-Dimporttsv.separator=,' tablename /user/hdfs/exp
I have given random column names; you can update them as needed.
Note: make sure that the number of columns you pass in the command matches your input data source. I got a Bad Line error when I passed 412 columns instead of 414.
Hope this helps. :)
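The column-count check mentioned in the note can be done before the job is submitted. The sketch below works on a local sample of the data and assumes a single-character separator and no quoted fields; count_csv_fields, count_spec_columns, and check_columns are hypothetical helper names.

```shell
# Sketch: compare the number of fields on the first data line with the
# number of entries in the importtsv.columns list.
# Assumes a single-character separator and no quoted or escaped fields.
count_csv_fields() {
  local file="$1"
  local sep="${2:-,}"
  head -n 1 "$file" | awk -F"$sep" '{ print NF }'
}

count_spec_columns() {
  # The columns list itself is always comma-separated.
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

check_columns() {
  local file="$1"
  local spec="$2"
  local sep="${3:-,}"
  local fields cols
  fields="$(count_csv_fields "$file" "$sep")"
  cols="$(count_spec_columns "$spec")"
  if [ "$fields" -eq "$cols" ]; then
    echo "ok: $fields fields"
  else
    echo "mismatch: $fields fields in data, $cols columns in spec"
    return 1
  fi
}
```

For a file that lives in HDFS, a sample can be pulled first with something like hadoop fs -cat /user/hdfs/exp/* | head -n 1 > sample.csv and then checked with check_columns sample.csv 'HBASE_ROW_KEY,CF:firstname,CF:lastname'.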
It looks like the single quotation is misplaced while specifying separator. Try using this: -Dimporttsv.separator=',' instead of '-Dimporttsv.separator=,'
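As far as bash is concerned, both spellings should reach ImportTsv as the same string, because the shell removes the quotes before the program starts. A quick way to check this without touching HBase, using a throwaway helper (show_arg is a made-up name):

```shell
# Sketch: show what a program actually receives for each quoting style.
show_arg() {
  printf '%s\n' "$1"
}

a="$(show_arg '-Dimporttsv.separator=,')"
b="$(show_arg -Dimporttsv.separator=',')"

if [ "$a" = "$b" ]; then
  echo "identical: $a"   # prints: identical: -Dimporttsv.separator=,
else
  echo "different: [$a] vs [$b]"
fi
```

If the two arguments come out identical on your system as well, the quoting style is probably not what breaks the import, and the column count is worth checking next.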
If any column value in the input file contains the field delimiter (in this case, a comma), the import will fail. It is better to use a different delimiter (such as |) when preparing the CSV file.

How to get the hive table output or text file in hdfs on which hive table created to .CSV format.

There is one constraint on the cluster I'm working on: nothing can be taken out of the cluster to a Linux box.
The files on which the Hive tables are built are in sequence file format or text format.
I need to convert those files to CSV format without outputting them to a Linux box. Alternatively, I could create a table from the existing table that is stored as a CSV file, if that is possible (I'm not sure it is).
I have tried a lot of things, but couldn't do it without outputting to a Linux box. Any help is appreciated.
You can create another hive table like this:
CREATE TABLE hivetable_csv ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' as
select * from hivetable;
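Then concatenate the new table's files into a single CSV file in HDFS (the path below assumes the default warehouse location and the default database):
hadoop fs -cat /user/hive/warehouse/hivetable_csv/* | hadoop fs -put - /user/username/hivetable.csv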
Alternatively, you can also try
hadoop fs -cp
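Another option that stays inside the cluster is to have Hive write comma-delimited files straight into an HDFS directory. The sketch below only generates the statement. It assumes Hive 0.11 or later (where INSERT OVERWRITE DIRECTORY accepts a ROW FORMAT clause), csv_export_statement is a hypothetical helper, and the output directory is an example path.

```shell
# Sketch: generate a HiveQL statement that writes a table as comma-delimited
# text files into an HDFS directory (no LOCAL keyword, so nothing leaves HDFS).
# Assumes Hive 0.11+; csv_export_statement is a hypothetical helper.
csv_export_statement() {
  local table="$1"
  local out_dir="$2"
  cat <<EOF
INSERT OVERWRITE DIRECTORY '${out_dir}'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM ${table};
EOF
}
```

It could be run with hive -e "$(csv_export_statement hivetable /user/username/hivetable_csv)". OVERWRITE replaces whatever is already in the target directory, and values that themselves contain commas are not quoted, so this is only suitable for data without embedded delimiters.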

Checking the table existence and loading the data into Hbase and HIve table

I have data in HDFS, and I want to load it into an HBase table and a Hive table.
I have written a bash shell script that runs a Pig script to load the data from HDFS into HBase and a Hive script to load the data from HDFS into a Hive table; both work perfectly fine. My HDFS data files all have the same structure, and I'm loading all of them into a single HBase table and a single Hive table.
My question: suppose I receive more data files in the HDFS directory and run the shell script again. It will try to create the HBase and Hive tables again with the same names and report that the tables already exist. How can I write the Hive and HBase queries so that they first check whether the table exists? If the table does not exist, it should be created and the data loaded from HDFS into the HBase and Hive tables. If the table already exists, the data should just be inserted into the existing tables, without overwriting the data already in them.
How can this be done?
Below is my script file: myScript.sh
echo "create 'goodtable','gt'" | hbase shell
pig -f a.pig -param input=/user/user/d/
hive -f h.hql
Where a.pig :
G = LOAD '$input' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://goodtable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('gt:name gt:state gt:phone_no gt:gender');
h.hql:
create external table hive_table(
id int,
name string,
state string,
phone_no int,
gender string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/user/d/' INTO TABLE hive_table;
I just wanted to add an example for HBase as Hive was already covered before:
if [[ $(echo "exists 'goodtable'" | hbase shell | grep 'not exist') ]];
then
echo "create 'goodtable','gt'" | hbase shell;
fi
For Hive, you can add the IF NOT EXISTS clause to the CREATE TABLE statement. See the documentation
I don't have much experience with HBase, but I believe you can use the exists 'table_name' command to check whether the table exists and then create the table if it doesn't. See here
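Applied to the h.hql from the question, the Hive side might look like the sketch below, which writes a rerunnable version of the script to a file. write_hive_script is a hypothetical helper; the table definition and the HDFS path are copied from the question.

```shell
# Sketch: write a version of h.hql that can be run repeatedly.
# write_hive_script is a hypothetical helper; its argument is the output file.
write_hive_script() {
  local out="$1"
  cat > "$out" <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS hive_table(
  id int,
  name string,
  state string,
  phone_no int,
  gender string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- INTO without OVERWRITE adds the new files to the data already in the table
LOAD DATA INPATH '/user/user/d/' INTO TABLE hive_table;
EOF
}
```

After write_hive_script h.hql, the existing hive -f h.hql line in myScript.sh can stay as it is.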
@visakh is correct: you can check whether a table exists in HBase by entering the HBase shell and typing: exists '<tablename>'
In order to do this without entering the HBase shell interactively, you can create a simple ruby script such as the following:
exists 'mytable'
exit
Let's say you save this to a file called tabletest.rb. You can then execute this script by calling hbase shell tabletest.rb. This will create the following output, which you can then parse from your shell script:
Table tableisthere does exist
0 row(s) in 0.9830 seconds
OR
Table tableisNOTthere does not exist
0 row(s) in 0.9830 seconds
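Parsing that output from the calling shell script could look like the following sketch. It assumes the message wording shown above ("does exist" / "does not exist"), which may differ between HBase versions; table_exists_from_output is a hypothetical helper that reads the hbase shell output on standard input.

```shell
# Sketch: decide from `hbase shell` output whether the table exists.
# Returns 0 when the output says the table exists, non-zero otherwise.
# Relies on the message wording shown above, which is an assumption.
table_exists_from_output() {
  local output
  output="$(cat)"
  # Check the negative message first, since it is the more specific one.
  if printf '%s\n' "$output" | grep -q 'does not exist'; then
    return 1
  fi
  printf '%s\n' "$output" | grep -q 'does exist'
}
```

In myScript.sh this might be used as hbase shell tabletest.rb | table_exists_from_output || echo "create 'goodtable','gt'" | hbase shell, so that the create statement only runs when the table is missing.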
Adding more details for 'all in one' script:
Alternatively, you can create a more advanced script in ruby that checks for table existence and then will create it if needed - this is done calling the HBaseAdmin java api from within the ruby script.
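include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin
conf = HBaseConfiguration.create
hbaseAdmin = HBaseAdmin.new(conf)
# create the table only when it is missing (older HBaseAdmin client API)
if !hbaseAdmin.tableExists('mytable')
  hbaseAdmin.createTable('mytable',...)
end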

Resources