Hive creates an empty table even though there are plenty of files - hadoop

I put some files into HDFS (/path/to/directory/) which contain data like the following:
63 EB44863EA74AA0C5D3ECF3D678A7DF59
62 FABBC9ED9719A5030B2F6A4591EDB180
59 6BF6D40AF15DE2D7E295EAFB9574BBF8
All of them are named like _user_hive_warehouse_file_name_000XYZ_A. These files were downloaded from another HDFS.
I'm trying to create an external table via Hive:
CREATE EXTERNAL TABLE users(
id int,
user string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/path/to/directory/';
It says:
OK
Time taken: 0.098 seconds
select * from users; returns empty.
select count(1) from users; returns 0.
Hive creates the table successfully, but it's always empty. If I add another file such as another.txt, containing the sample data mentioned above, select count(1) from users; returns 3.
What am I missing? Why is the table empty?
Environment:
JDK 7
Hadoop 2.6.0
Hive 0.14.0
Ubuntu 14.04

I think you are encountering an issue that is peripherally discussed in HIVE-6431. In particular, this comment is the important one:
By default, FileInputFormat (which is the superclass of various formats) in Hadoop ignores file names starting with "_" or ".", and it is hard to work around this in the Hive codebase.
The workaround is probably to avoid using filenames that begin with _ or .

When you run any command in Hive, it runs internally as a MapReduce job over the HDFS path where you stored the files. The job uses FileInputFormat to read the HDFS files, and FileInputFormat has a hiddenFileFilter that ignores any file whose name starts with an underscore ("_") or a dot ("."). You can make it skip other files as well by passing a custom PathFilter to FileInputFormat.setInputPathFilter. Hadoop treats files starting with underscores as "special" files for job output and logs, which is probably why they are ignored.
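A practical workaround, then, is to rename the files so they no longer begin with an underscore. A minimal sketch, assuming the files sit in /path/to/directory/ as in the question:
# strip the leading underscore from every file in the directory whose name starts with "_"
hadoop fs -ls /path/to/directory/ | awk '{print $NF}' | grep '/_[^/]*$' | while read f; do
  hadoop fs -mv "$f" "$(dirname "$f")/$(basename "$f" | sed 's/^_//')"
done
Because the table is external and only points at the directory, select count(1) from users; should pick up the renamed files without recreating the table.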

Related

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
Until now I have used Hive to create the database on Hadoop, and I am accessing it via localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal, as it gives me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest a way to resolve it?
I suppose the data is initially in the local file system.
So a simple workflow could be: load the data from the local file system into the Hadoop file system (HDFS), create a Hive table over it, and then load the data into the Hive table.
Step 1:
// put in HDFS
$~ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
// check files
$~ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
// display table structure
describe mytable;
Step 3:
// the files were already copied into HDFS in step 1, so LOCAL is omitted
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
// simple hive statement to fetch top 10 records
SELECT * FROM mytable limit 10;
You should use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into a Hive table.
If you don't specify LOCAL, the load command assumes the given file path is an HDFS location.
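For illustration, a hedged example (assuming a local file /home/user/mydata.csv and the mytable table from the previous answer):
-- LOCAL makes Hive read from the client's local file system
LOAD DATA LOCAL INPATH '/home/user/mydata.csv' OVERWRITE INTO TABLE mytable;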
Please refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables

Hive INSERT OVERWRITE to Google Storage as LOCAL DIRECTORY not working

I use the following Hive Query:
hive> INSERT OVERWRITE LOCAL DIRECTORY "gs:// Google/Storage/Directory/Path/Name" row format delimited fields terminated by ','
select * from <HiveDatabaseName>.<HiveTableName>;
I am getting the following error:
"Error: Failed with exception Wrong FS:"gs:// Google/Storage/Directory/PathName", expected: file:///
What could I be doing wrong?
Remove LOCAL from your syntax.
See the syntax below:
INSERT OVERWRITE DIRECTORY 'gs://Your_Bucket_Path/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY "\n"
SELECT * FROM YourExistingTable;
There's a bug in Hive, including (IIRC) Hive 1.2.1, where it uses the configured fs.default.name or fs.defaultFS for its scratch directory even if the table path is on a different filesystem. In your case, it appears you have the out-of-the-box defaults setting fs.defaultFS to file:///, which is why it says "expected: file:///". On a distributed Hadoop cluster, you might see it say "expected: hdfs://..." instead.
You can fix it within a single hive prompt by overriding fs.default.name and fs.defaultFS:
> set fs.default.name=gs://your-bucket/
> set fs.defaultFS=gs://your-bucket/
You may also want to modify those entries inside your core-site.xml file to point at your GCS location to make it easier.
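For reference, the corresponding core-site.xml entries might look like this (a sketch; your-bucket is a placeholder for your actual bucket name):
<property>
  <name>fs.default.name</name>
  <value>gs://your-bucket/</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>gs://your-bucket/</value>
</property>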

ImportTsv command is not working in HBase

I am using HBase 0.98.1-cdh5.1.3. I am trying to ingest a CSV file, present in HDFS at /user/hdfs/exp, into HBase. My file has data in the following format:
1,abc,xyz
2,def,uvw
3,ghi,rst
I am using the command below:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname tablename /user/hdfs/exp
I have also used different combinations like
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname tablename /user/hdfs/exp '-Dimporttsv.separator=,'
and
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname '-Dimporttsv.separator=,' tablename /user/hdfs/exp
but nothing works. It fails to detect the separator (a comma in my case), so the data is not parsed properly. Can anybody help me figure out where I am going wrong?
This is just one line of the data set:
10000064202896309897,1000006420,2896309897,10180,hdfs://btc5x015:8020/user/mr_test/logsJan/log_jan20_29/10180_log201501260000.log,3.2.3.1,9,2015-01-26,15:46:12.12,REF SHOULDER 4,n,n,SHOULDER,60,17.0,M,487093458,[study_16004_16004_],exam_16004_16004,[patient_16004_1_],Schulter std,SCHULTERGELENK RECHTS,8.10,NOT_EXIST,-8.1,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,n,NOT_EXIST,NOT_EXIST,y,n,HF,NOT_EXIST,NOT_EXIST,NOT_EXIST,,,NOT_EXIST,NOT_EXIST,N,NOT_EXIST,1,NOT_EXIST,IMAG,FFE,T1FFE,4.0,0.72,34,NOT_EXIST,NOT_EXIST,NOT_EXIST,4.0,cor,,NOT_EXIST,no,0,n,NOT_EXIST,1,0.0,,,,,,,NOT_EXIST,102,NOT_EXIST,NOT_EXIST,NOT_EXIST,15:45:29.28,15:46:12.12,15:46:12.12,9.5,9.4,1.002,NOT_EXIST,TRUE,no,NOT_EXIST,0.0,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,0.0,NOT_EXIST,NOT_EXIST,0.0,0.09,0.3,3.3,0,false,false,n,n,26.1,LT,0.33,0.03,NOT_EXIST,null,1,Dres.GrafKernHausmann,E:\Export\DataMonitoring\p_i_20150126_154530.frame,hdfs://btc5x015.code1.emi.philips.com:8020/user/mr_test/logsJan/log_jan20_29/10180_log201501260000.log,317774,883,0,0,1,8,6,2,0,0,0,0,0,0,6014,0,15:44:08.15,15:44:59.93,15:45:23.14,15:45:29.28,00:00:00.00,00:00:00.00,15:45:30.57,15:45:38.45,15:45:29.28,15:46:12.12,15:45:38.45,15:46:12.12,42984,33967,0,7988,6014,00:00:00.00,00:00:00.00,0,00:00:00.00,00:00:00.00,0,169,102,SENSE-SHOULDER8,,SENSE-SHOULDER8/(19) BODY-QUAD,190,190,CLINICAL,0,0,94166709,Radiologische Gemeinschaftspraxis,Dr. med. Michael Graf,Dr. med. Andreas Kern,Dr. med. Hausmann,Wetzlar,35578,Hausertorstr. 47,6,NOT_EXIST,1,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,1,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,SHORTEST,NOT_EXIST,SHORTEST,1,NO,YES,DEFAULT,FFE,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,3,NOT_EXIST,CARTESIAN,YES,NO,NOT_EXIST,NO,NOT_EXIST,low,FULL,NOT_EXIST,NOT_EXIST,NOT_EXIST,3D,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,YES,NOT_EXIST,no,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,USER_DEF,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,DEF,H,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,MPU_MTC_MODE_NO,NO,NOT_EXIST,T1,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,450,405,YES,Supine,HF,NOT_EXIST,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,DEFAULT,NOT_EXIST,2,SENSE-SHOULDER8 BODY-QUAD,F,400 400,,,100 100,5.625,7.23214293,4,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NO,NOT_EXIST,NOT_EXIST,80,PARALLEL,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NO,NOT_EXIST,0,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,OFF,NOT_EXIST,NO,NO,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,NOT_EXIST,0,5.625,NOT_EXIST,NOT_EXIST,NOT_EXIST,10180,15:45:29.29,15:46:13.76,null,,,96,,1,,,0,10180,PATTERN_SRN,SHOULDER,SCHULTER,MATCHED_SHOULDER,SHOULDER,UPPER EXTREMITIES,ANATOMY_GROUP_MAPPED,10180,10180,10180,3.2.3,Achieva 3.0T,Achieva,T30,3.0T,NO,F2000,Watercooled2,274-D,Master,NONE,,,0,16,0,1,S26_128,NONE,8,null,CDAS,LOGFOLDER_SYSFOLDER_MATCHED_RELEASE_NOT_CHECKED,FALSE,null,null,null,null,12.15,SENSE-SHOULDER8/(19) Q-BODY,0,SCAN_PARSE_SUCCESS,SHOULDER,-4.64757729 5.37445641,
I just loaded the single line given in the question into an HBase table with the ImportTsv command by providing 414 columns, and it worked perfectly for me. Here is the command I used:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,CF:c1,CF:c2,CF:c3,CF:c4,CF:c5,CF:c6,CF:c7,CF:c8,CF:c9,CF:c10,CF:c11,CF:c12,CF:c13,CF:c14,CF:c15,CF:c16,CF:c17,CF:c18,CF:c19,CF:c20,CF:c21,CF:c22,CF:c23,CF:c24,CF:c25,CF:c26,CF:c27,CF:c28,CF:c29,CF:c30,CF:c31,CF:c32,CF:c33,CF:c34,CF:c35,CF:c36,CF:c37,CF:c38,CF:c39,CF:c40,CF:c41,CF:c42,CF:c43,CF:c44,CF:c45,CF:c46,CF:c47,CF:c48,CF:c49,CF:c50,CF:c51,CF:c52,CF:c53,CF:c54,CF:c55,CF:c56,CF:c57,CF:c58,CF:c59,CF:c60,CF:c61,CF:c62,CF:c63,CF:c64,CF:c65,CF:c66,CF:c67,CF:c68,CF:c69,CF:c70,CF:c71,CF:c72,CF:c73,CF:c74,CF:c75,CF:c76,CF:c77,CF:c78,CF:c79,CF:c80,CF:c81,CF:c82,CF:c83,CF:c84,CF:c85,CF:c86,CF:c87,CF:c88,CF:c89,CF:c90,CF:c91,CF:c92,CF:c93,CF:c94,CF:c95,CF:c96,CF:c97,CF:c98,CF:c99,CF:c100,CF:c101,CF:c102,CF:c103,CF:c104,CF:c105,CF:c106,CF:c107,CF:c108,CF:c109,CF:c110,CF:c111,CF:c112,CF:c113,CF:c114,CF:c115,CF:c116,CF:c117,CF:c118,CF:c119,CF:c120,CF:c121,CF:c122,CF:c123,CF:c124,CF:c125,CF:c126,CF:c127,CF:c128,CF:c129,CF:c130,CF:c131,CF:c132,CF:c133,CF:c134,CF:c135,CF:c136,CF:c137,CF:c138,CF:c139,CF:c140,CF:c141,CF:c142,CF:c143,CF:c144,CF:c145,CF:c146,CF:c147,CF:c148,CF:c149,CF:c150,CF:c151,CF:c152,CF:c153,CF:c154,CF:c155,CF:c156,CF:c157,CF:c158,CF:c159,CF:c160,CF:c161,CF:c162,CF:c163,CF:c164,CF:c165,CF:c166,CF:c167,CF:c168,CF:c169,CF:c170,CF:c171,CF:c172,CF:c173,CF:c174,CF:c175,CF:c176,CF:c177,CF:c178,CF:c179,CF:c180,CF:c181,CF:c182,CF:c183,CF:c184,CF:c185,CF:c186,CF:c187,CF:c188,CF:c189,CF:c190,CF:c191,CF:c192,CF:c193,CF:c194,CF:c195,CF:c196,CF:c197,CF:c198,CF:c199,CF:c200,CF:c201,CF:c202,CF:c203,CF:c204,CF:c205,CF:c206,CF:c207,CF:c208,CF:c209,CF:c210,CF:c211,CF:c212,CF:c213,CF:c214,CF:c215,CF:c216,CF:c217,CF:c218,CF:c219,CF:c220,CF:c221,CF:c222,CF:c223,CF:c224,CF:c225,CF:c226,CF:c227,CF:c228,CF:c229,CF:c230,CF:c231,CF:c232,CF:c233,CF:c234,CF:c235,CF:c236,CF:c237,CF:c238,CF:c239,CF:c240,CF:c241,CF:c242,CF:c243,CF:c244,CF:c245,CF:c246,CF:c247,CF:c248,CF:c249,CF:c250,CF:c251,CF:c252,CF:c253,CF:c254,CF:c255,CF:c256,CF:c257,CF:c258,CF:c259,CF:c260,CF:c261,CF:c262,CF:c263,CF:c264,CF:c265,CF:c266,CF:c267,CF:c268,CF:c269,CF:c270,CF:c271,CF:c272,CF:c273,CF:c274,CF:c275,CF:c276,CF:c277,CF:c278,CF:c279,CF:c280,CF:c281,CF:c282,CF:c283,CF:c284,CF:c285,CF:c286,CF:c287,CF:c288,CF:c289,CF:c290,CF:c291,CF:c292,CF:c293,CF:c294,CF:c295,CF:c296,CF:c297,CF:c298,CF:c299,CF:c300,CF:c301,CF:c302,CF:c303,CF:c304,CF:c305,CF:c306,CF:c307,CF:c308,CF:c309,CF:c310,CF:c311,CF:c312,CF:c313,CF:c314,CF:c315,CF:c316,CF:c317,CF:c318,CF:c319,CF:c320,CF:c321,CF:c322,CF:c323,CF:c324,CF:c325,CF:c326,CF:c327,CF:c328,CF:c329,CF:c330,CF:c331,CF:c332,CF:c333,CF:c334,CF:c335,CF:c336,CF:c337,CF:c338,CF:c339,CF:c340,CF:c341,CF:c342,CF:c343,CF:c344,CF:c345,CF:c346,CF:c347,CF:c348,CF:c349,CF:c350,CF:c351,CF:c352,CF:c353,CF:c354,CF:c355,CF:c356,CF:c357,CF:c358,CF:c359,CF:c360,CF:c361,CF:c362,CF:c363,CF:c364,CF:c365,CF:c366,CF:c367,CF:c368,CF:c369,CF:c370,CF:c371,CF:c372,CF:c373,CF:c374,CF:c375,CF:c376,CF:c377,CF:c378,CF:c379,CF:c380,CF:c381,CF:c382,CF:c383,CF:c384,CF:c385,CF:c386,CF:c387,CF:c388,CF:c389,CF:c390,CF:c391,CF:c392,CF:c393,CF:c394,CF:c395,CF:c396,CF:c397,CF:c398,CF:c399,CF:c400,CF:c401,CF:c402,CF:c403,CF:c404,CF:c405,CF:c406,CF:c407,CF:c408,CF:c409,CF:c410,CF:c411,CF:c412,CF:c413,CF:c414 '-Dimporttsv.separator=,' tablename /user/hdfs/exp
I have given random column names; you can update them as per your needs.
Note: make sure the number of columns you pass on the command line matches your input data source. I too got a Bad Lines issue when I passed 412 columns instead of 414.
Hope this will help. :)
It looks like the single quotes are misplaced when specifying the separator. Try -Dimporttsv.separator=',' instead of '-Dimporttsv.separator=,'.
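With the quoting fixed, the full invocation from the question would be:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname tablename /user/hdfs/exp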
If the input file is prepared such that any column value contains the field delimiter (in this case, a comma), the import will fail. It is better to use a different delimiter (such as |) when preparing the CSV file.
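If you do switch to a pipe delimiter, the same command only needs the separator flag changed (hypothetical, assuming the data has been re-exported with |):
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator='|' -Dimporttsv.columns=HBASE_ROW_KEY,CF:firstname,CF:lastname tablename /user/hdfs/exp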

How to convert a Hive table's output, or the HDFS text files a Hive table is created on, to CSV format

There is one constraint on the cluster I'm working on: nothing can be taken out of the cluster to a Linux box.
The files the Hive tables are built on are in sequence file format or text format.
I need to change those files to CSV format without outputting them to a Linux box. If possible, I'd also like to create a table from an existing table that can be STORED AS a CSV file (I'm not sure if I can do that).
I have tried a lot of things, but couldn't do it unless I output to a Linux box. Any help is appreciated.
You can create another hive table like this:
CREATE TABLE hivetable_csv ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' as
select * from hivetable;
Then concatenate the table's files into a single CSV file in HDFS:
hadoop fs -cat /user/hive/warehouse/hivetable_csv/* | hadoop fs -put - /user/username/hivetable.csv
Alternatively, you can also try hadoop fs -cp to copy the table's files within HDFS without merging them.
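For example (a sketch, assuming the same warehouse path as above and that the target directory already exists):
hadoop fs -cp /user/hive/warehouse/hivetable_csv/* /user/username/hivetable_csv/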

Add date and time from flat file name (Cloudera)

I started an EC2 cluster on Amazon to install Cloudera. I got it installed and configured, and loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files is as follows:
projectcode, pagename, pageviews, bytes
The files are named with the date and time embedded, like this:
pagecounts-20090430-230000.gz
When loading the data from HDFS into Impala, I do it like this:
CREATE EXTERNAL TABLE wikiPgvws
(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';
One thing I missed is the date and time of each file. The directory
/user/hdfs
contains multiple pagecount files associated with different dates and times. How can one pull that information out and store it in a column when loading into Impala?
I think the thing you are missing is the concept of partitions. If you define the table as partitioned, the data can be divided into different partitions based on the timestamp in the file name. I was able to work this out in Hive; I hope you can adapt it as needed for Impala, since the query syntax is the same.
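The idea in outline, as a hedged HiveQL sketch (the wikiPgvws_part name and the per-timestamp directory layout are assumptions, not from the original; the next answer shows a complete Hive implementation):
CREATE EXTERNAL TABLE wikiPgvws_part (
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY (dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs/partitioned';
-- register one file's directory under its timestamp partition
ALTER TABLE wikiPgvws_part ADD PARTITION (dts = '20090430-230000')
LOCATION '/user/hdfs/partitioned/20090430-230000';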
For me, this problem was not solvable using Hive alone, so I mixed bash with Hive scripting, and it works fine for me. This is how I wrapped it up:
1. Create table wikiPgvws with a partition
2. Create table wikiTmp with the same fields as wikiPgvws, except for the partition
3. For each file:
   i. Load the data into wikiTmp
   ii. grep the timestamp from the file name
   iii. Use sed to replace placeholders in a predefined HQL script that loads the data into the actual table, then run it
4. Drop table wikiTmp and remove tmp.hql
The script is as follows:
#!/bin/bash
hive -e "CREATE EXTERNAL TABLE wikiPgvws(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY(dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE";
hive -e "CREATE TABLE wikiTmp(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE"
for fileName in $(hadoop fs -ls /user/hdfs/bounty/pagecounts-*.txt | grep -Po '(?<=\s)(/user.*$)')
do
echo "currentFile :$fileName"
dst=$(echo $fileName | grep -oE '[0-9]{8}-[0-9]{6}')
echo "currentStamp $dst"
sed "s!sourceFile!'$fileName'!" t.hql > tmp.hql
sed -i "s!targetPartition!$dst!" tmp.hql
hive -f tmp.hql
done
hive -e "DROP TABLE wikiTmp"
rm -f tmp.hql
The HQL script (t.hql) consists of just two lines:
LOAD DATA INPATH sourceFile OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = 'targetPartition') SELECT w.* FROM wikiTmp w;
Epilogue:
Check whether options equivalent to hive -e and hive -f are available in Impala; without them, this script is of no use to you. The grep commands that fetch the file name and timestamp also need to be adjusted to your table location and stamp pattern. It's just one way to show how the job can be done; I couldn't find another.
Enhancement:
If everything works well, consider merging the first two DDLs into another script to make it look cleaner. I'm not sure whether HQL script arguments can be used to define partition values, but you could still have a look at them as a replacement for sed.
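If Hive variable substitution does work for your version, the sed step could be replaced along these lines (a sketch; --hivevar is available in Hive 0.8 and later):
hive --hivevar srcFile="$fileName" --hivevar dts="$dst" -f t.hql
with t.hql rewritten to reference the variables directly:
LOAD DATA INPATH '${hivevar:srcFile}' OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = '${hivevar:dts}') SELECT w.* FROM wikiTmp w;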
