Invalid path because of filename when Hive load data local inpath

The file '/home/hadoop/_user_active_score_small' definitely exists. But when I run LOAD DATA LOCAL as below, I get a SemanticException:
hive> load data local inpath '/home/hadoop/_user_active_score_small' overwrite into table user_active_score_tmp ;
FAILED: SemanticException Line 1:24 Invalid path ''/home/hadoop/_user_active_score_small'': No files matching path file:/home/hadoop/_user_active_score_small
But after cp /home/hadoop/_user_active_score_small /home/hadoop/user_active_score_small, running LOAD DATA again succeeds:
hive> load data local inpath '/home/hadoop/user_active_score_small' overwrite into table user_active_score_tmp ;
Loading data to table user_bg_action.user_active_score_tmp
OK
Time taken: 0.368 seconds
The two files have the same permissions and are in the same directory:
-rw-rw-r-- 1 hadoop hadoop 614 7月 5 13:49 _user_active_score_small
-rw-rw-r-- 1 hadoop hadoop 614 7月 5 11:48 user_active_score_small
I don't know how this happens. Are file names that start with '_' not allowed by Hive?

Files and directories whose names start with an underscore (_) are treated as hidden by MapReduce, which is probably the reason for the behavior you observed.
If you look at the FileInputFormat source code you can find this:
protected static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};
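So a simple workaround, assuming you are free to rename the source file, is to drop the leading underscore before loading (the cp in the question achieves the same thing, only leaving the old copy behind):
$ mv /home/hadoop/_user_active_score_small /home/hadoop/user_active_score_small
hive> load data local inpath '/home/hadoop/user_active_score_small' overwrite into table user_active_score_tmp;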

Unable to load data into parquet file format?

I am trying to parse log data into the Parquet file format in Hive; the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get this result:
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to the Parquet file format I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried
create table log (a String ,b String ,c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into log_par select * from log;
Aman kumar,
To resolve this issue, run the hive query after adding the following jar:
hive> add jar hive-contrib.jar;
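If the bare jar name cannot be resolved in your session, you may need the full path; for example (the location below is only an illustration, adjust it to your installation), and then re-run the failing insert:
hive> add jar /usr/hdp/current/hive-client/lib/hive-contrib.jar;
hive> insert into log_par select * from log;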
To add the jar permanently, do the following:
1. On the Hive Server host, create a /usr/hdp/<version>/hive/auxlib directory.
2. Copy /usr/hdp/<version>/hive/lib/hive-contrib-<version>.jar to /usr/hdp/<version>/hive/auxlib.
3. Restart the HiveServer2 service.
For further reference, see:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.

How to merge orc files for external tables?

I am trying to merge multiple small ORC files. I came across the ALTER TABLE ... CONCATENATE command, but that only works for managed tables.
Hive gave me the following error when I tried to run it:
FAILED: SemanticException
org.apache.hadoop.hive.ql.parse.SemanticException: Concatenate/Merge
can only be performed on managed tables
Following are the table parameters:
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
EXTERNAL TRUE
numFiles 535
numRows 27051810
orc.compress SNAPPY
rawDataSize 20192634094
totalSize 304928695
transient_lastDdlTime 1512126635
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I believe your table is an external table; then there are two ways:
Either you can change it to a managed table (ALTER TABLE <table> SET TBLPROPERTIES('EXTERNAL'='FALSE')), run ALTER TABLE ... CONCATENATE, and then convert it back to external by setting the property to TRUE again (see the sketch below).
Or you can create a managed table using CTAS and insert the data, run the merge query, and then import the data back into the external table.
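A minimal sketch of the first option, assuming a hypothetical table named mytable (for a partitioned table the CONCATENATE step also needs a PARTITION clause):
ALTER TABLE mytable SET TBLPROPERTIES('EXTERNAL'='FALSE');
ALTER TABLE mytable CONCATENATE;
ALTER TABLE mytable SET TBLPROPERTIES('EXTERNAL'='TRUE');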
From my previous answer to this question, here is a small script in Python using PyORC to concatenate the small ORC files together. It doesn't use Hive at all, so you can only use it if you have direct access to the files and are able to run a Python script on them, which might not always be the case in managed hosts.
import pyorc
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', type=argparse.FileType(mode='wb'))
    parser.add_argument('files', type=argparse.FileType(mode='rb'), nargs='+')
    args = parser.parse_args()

    # Use the schema of the first file as the reference schema.
    schema = str(pyorc.Reader(args.files[0]).schema)

    with pyorc.Writer(args.output, schema) as writer:
        for i, f in enumerate(args.files):
            reader = pyorc.Reader(f)
            if str(reader.schema) != schema:
                raise RuntimeError(
                    "Inconsistent ORC schemas.\n"
                    "\tFirst file schema: {}\n"
                    "\tFile #{} schema: {}"
                    .format(schema, i, str(reader.schema))
                )
            for line in reader:
                writer.write(line)


if __name__ == '__main__':
    main()
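Assuming the script above is saved as concat_orc.py and the pyorc package is installed (pip install pyorc), an invocation like this merges the parts into a single file:
python concat_orc.py -o merged.orc part-r-00000.orc part-r-00001.orc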

error while storing in Hbase using Pig

Hadoop DFS input data (cat):
[ituser1@genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
But it gives this error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
But my LOAD works perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; (this gives the correct result):
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
Any help is appreciated.
I am using the Hortonworks stack in distributed mode:
HDP2.3
Apache Pig version 0.15.0
HBase 1.1.1
Also all jars are in place as I have installed them through Ambari.
Solved the data upload: I was missing a RANK on the relation, so the HBase row key becomes the rank.
DATA_FIL_1 = RANK DATA_FIL;
NOTE: this will generate an arbitrary row key.
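For reference, a minimal sketch of the corresponding STORE with the ranked relation: HBaseStorage uses the first field (here the rank) as the row key, so the DETAILS column list from the question maps to the remaining thirteen fields unchanged.
STORE DATA_FIL_1 INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');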
But if you want to define your own row key, use something like the statement below; you have to prepare the relation so that your chosen key is its first field, the STORE function alone won't do that.
HBaseStorage will take the first field of each tuple as the row key (which you have defined):
STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');

read file reference storm from hdfs

Hi, I want to ask something. I've started learning Apache Storm. Is it possible for Storm to read a data file from HDFS?
For example: I have a .txt data file in the /user/hadoop directory on HDFS. Is it possible for Storm to read that file? Thanks in advance.
When I try to run the topology I get an error message saying the file does not exist; when I run it reading the file from my local storage it is successful:
>>> Starting to create Topology ...
---> Read Class: FileReaderSpout , 1387957892009
---> Read Class: Split , 1387958291266
---> Read Class: Identity , 247_Identity_1387957902310_1
---> Read Class: SubString , 1387964607853
---> Read Class: Trim , 1387962789262
---> Read Class: SwitchCase , 1387962333010
---> Read Class: Reference , 1387969791518
File /reff/test.txt .. not exists ..
Of course! Here is an example of how to read a file from HDFS:
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// ...stuff...

FileSystem fs = FileSystem.get(URI.create("hdfs://prodserver"), new Configuration());
String hdfspath = "/apps/storm/conf/config.json";
Path path = new Path(hdfspath);

if (fs.exists(path)) {
    InputStreamReader reader = new InputStreamReader(fs.open(path));
    // do stuff with the reader
} else {
    // LOG is whatever logger your topology already uses
    LOG.info("Does not exist {}", hdfspath);
}
This doesn't use anything specific to Storm, just the Hadoop API (hadoop-common.jar).
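If you build with Maven, that typically means a dependency along these lines (the version property is a placeholder; match the Hadoop version your cluster runs):
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>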
The error you're getting looks like it is because your file path is incorrect.

Creating schema for Tuple in Apache Pig

How can I create a Pig schema for the below tuple data while loading the relation?
]$ cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
I tried the below statement in local mode
A = LOAD '/home/cloudera/data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
If I dump the data, I expect this result:
DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
But what I actually get is:
((3,8,9),)
((1,4,7),)
((2,5,8),)
I am using Apache Pig version 0.11.0-cdh4.7.0
The following works. PigStorage defaults to a tab delimiter, while the two tuples on each line of your file are separated by a space, so the delimiter has to be passed explicitly:
A = load '$input' using PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
describe A;
dump A;
The dump:
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
