Hive - How to load data from a file with filename as a column? - hadoop

I am running the following commands to create my table ABC and insert data from all files that are in my designated file path. Now I want to add a column with filename, but I can't find any way to do that without looping through the files or something. Any suggestions on what the best way to do this would be?
CREATE TABLE ABC
(NAME string
,DATE string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
hive -e "LOAD DATA LOCAL INPATH '${DATA_FILE_PATH}' INTO TABLE ABC;"

Hive does have virtual columns, which include INPUT__FILE__NAME. The Hive documentation on virtual columns shows how to use them in a statement.
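For example, a quick way to see which file each row came from (a minimal sketch against the ABC table above):
SELECT INPUT__FILE__NAME, * FROM ABC LIMIT 10;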
To fill another table with the filename as a column, assuming your data is located at hdfs://hdfs.location:port/data/folder/filename1:
DROP TABLE IF EXISTS ABC2;
CREATE TABLE ABC2 (
filename STRING COMMENT 'this is the file the row was in',
name STRING,
date STRING);
INSERT INTO TABLE ABC2 SELECT split(INPUT__FILE__NAME,'folder/')[1],* FROM ABC;
You can alter the split to change how much of the full path you actually want to store.
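If the directory name varies, a pattern-based alternative is to keep just the base filename; a minimal sketch using regexp_extract (the pattern is an assumption, adjust as needed):
INSERT INTO TABLE ABC2 SELECT regexp_extract(INPUT__FILE__NAME, '[^/]+$', 0), * FROM ABC;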

Related

Failed to make hive table on desired path and insert the values

I want to make a table in Hive containing only one column and two values: 'Y' and 'N'.
I already tried this:
create external table if not exists tx_test_table (FLAG string)
row format delimited fields terminated by ','
stored as textfile location "/user/hdd/data/";
My question is: why is it created in the default location?
How do I make it use the path I desire?
Also, when I query the table I just made (using select * from), it fails to show the field:
Bad status for request TFetchResultsReq(fetchType=0,
operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None,
operationType=0,
operationId=THandleIdentifier(secret='pE\xff\xfdu\xf6B\xd4\xb3\xb7\x1c\xdd\x16\x95\xb85',
guid="\n\x05\x16\xe7'\xe4G \xb6R\xe06\x0b\xb9\x04\x87")),
orientation=4, maxRows=100):
TFetchResultsResp(status=TStatus(errorCode=0,
errorMessage='java.io.IOException: java.io.IOException: Not a file:
hdfs://nameservice1/user/hdd/data/AC22', sqlState=None,
infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException:
java.io.IOException: Not a file: hdfs://nameservice1/user/hdd/data/AC22:14:13',
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:496',
'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:297',
'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:869', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:507',
'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:708',
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717',
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702',
'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39',
'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor:process:HadoopThriftAuthBridge.java:605',
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286',
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149',
'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748',
'*java.io.IOException:java.io.IOException: Not a file: hdfs://nameservice1/user/hdd/data/AC22:18:4',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521'
, 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428',
'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:146',
'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2227',
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:491',
'*java.io.IOException:Not a file: hdfs://nameservice1/user/hdd/data/AC22:21:3',
'org.apache.hadoop.mapred.FileInputFormat:getSplits:FileInputFormat.java:329',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextSplits:FetchOperator.java:372',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getRecordReader:FetchOperator.java:304',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:459'], statusCode=3),
results=None, hasMoreRows=None)
Each table in HDFS has its own location, and the location you specified for your table seems to be used as a common location in which other tables' folders reside.
According to the exception java.io.IOException: Not a file: hdfs://nameservice1/user/hdd/data/AC22, at least one folder (not a file) was found in the /user/hdd/data/ location. It probably belongs to some other table.
You should specify a table location that will hold only the files belonging to this table, not the common data warehouse location that contains other tables' directories.
Usually the table location is named after the table: /user/hdd/data/tx_test_table
Fixed CREATE TABLE statement:
create external table if not exists tx_test_table (FLAG string)
row format delimited fields terminated by ','
stored as textfile location "/user/hdd/data/tx_test_table";
Now the table will have its own location containing only its own files, not mixed with other tables' folders or files.
You can put files into the /user/hdd/data/tx_test_table location directly, or load data into the table using INSERT; either way the files will be created in that location.
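For instance, once the table has its own directory, a load plus a sanity check might look like this (the staging file path is hypothetical):
LOAD DATA INPATH '/user/hdd/staging/flags.csv' INTO TABLE tx_test_table;
SELECT * FROM tx_test_table;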

Insert overwrite to Hive table saves fewer records than actual record number

I have a partitioned table tab and I want to create some tmp table test1 from it. Here is how I created the tmp table:
CREATE TABLE IF NOT EXISTS test1
(
COL1 string,
COL2 string,
COL3 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
I write to this table with:
INSERT OVERWRITE TABLE test1
SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';
Then I count the records in test1: it has 94493486 records, while the following SQL returns a count of 149248486:
SELECT COUNT(*) FROM
(SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01') AS TMP;
Also, when I save the selected partition (PT is the partition column) to HDFS, the record count is correct:
INSERT OVERWRITE directory '/user/me/wtfhive' row format delimited fields terminated by '|'
SELECT TAB.COL1 as COL1,
TAB.COL2 as COL2,
TAB.COL3 as COL3
FROM TAB
WHERE PT='2019-05-01';
My Hive version is 3.1.0, which comes with Ambari 2.7.1.0. Does anyone have any idea what may cause this issue? Thanks.
=================== UPDATE =================
I found something that might be related to this issue.
The table tab uses ORC as its storage format. Its data was imported from the ORC data files of another table, in another Hive cluster, with the following script:
LOAD DATA INPATH '/data/hive_dw/db/tablename/pt=2019-04-16' INTO TABLE tab PARTITION(pt='2019-04-16');
As the two tables share the same format, the loading procedure basically just moves the data files from the HDFS source directory to the Hive directory.
With the following procedure (sketched in HiveQL after the list), I can load without any issue:
export data from ORC table tab to HDFS text file
load from the text file to a Hive temp table
load data back to tab from the temp table
now I can select/export from tab to other tables without any record missing
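A rough HiveQL rendering of those steps, with illustrative paths, column names, and temp table name:
-- 1. export the ORC partition to delimited text on HDFS
INSERT OVERWRITE DIRECTORY '/tmp/tab_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT col1, col2, col3 FROM tab WHERE pt='2019-04-16';
-- 2. load the text files into a temp table with a matching layout
CREATE TABLE tab_tmp (col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/tab_export' INTO TABLE tab_tmp;
-- 3. write the rows back into the ORC table's partition
INSERT OVERWRITE TABLE tab PARTITION (pt='2019-04-16')
SELECT col1, col2, col3 FROM tab_tmp;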
I suspect the issue is in the ORC format. I just don't understand why it can export to an HDFS text file without any problem, but exporting to another table (no matter what storage format the other table uses) loses data.
Use the SerDe properties below; applied to the test1 DDL from the question, the complete statement would be:
CREATE TABLE IF NOT EXISTS test1
(
COL1 string,
COL2 string,
COL3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE;

Partition column equal to current date in Hive

I am trying to load data into a Hive table using partition.
The code is as follow:
CREATE EXTERNAL TABLE URL(url STRING, clicks INT)
COMMENT 'Unique Clicks per URL'
PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/mypath/URL';
LOAD DATA INPATH '/inputpath/' INTO TABLE URL
PARTITION (dt=date_format(CURRENT_TIMESTAMP, "yyyy.MM.dd HH:mm:ss"));
I am getting the following error:
FAILED: ParseException line 4:14 cannot recognize input near
'date_format' '(' 'CURRENT_TIMESTAMP' in constant
I tried using
SET hive.exec.dynamic.partition.mode=nonstrict;
but nothing changed.
Why is it not working?
How to set the current date as partition column?
Thank you in advance.
Lorenzo
Why move the files when you can create the external table on top of them?
LOAD DATA INPATH just moves the files (HDFS metadata operation) "as is", to the table's location.
Why define the partition column as a string when it is clearly a date?
CREATE EXTERNAL TABLE URL ... PARTITIONED BY(dt DATE) ...
Why are you trying to use non-ISO formats (yyyy.MM.dd)?
ISO date format is yyyy-MM-dd
Since it seems the partition information is not part of the data, you have 3 options:
1.
Use a constant (no expressions are allowed, including functions), e.g.
LOAD DATA INPATH '/inputpath/' INTO TABLE URL PARTITION (dt=date '2017-03-04');
2.
Create an additional table, URL_STG, similar to URL but without the partition, and use it to insert the partitions dynamically (an end-to-end sketch appears after option 3).
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table URL partition (dt) select *, current_date from URL_STG;
3.
Supply the date as a variable from the CLI
hive --hivevar dt=$(date +"%Y-%m-%d") -e \
'LOAD DATA INPATH '\''/inputpath/'\'' INTO TABLE URL PARTITION (dt=date '\''${hivevar:dt}'\'')'
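For option 2, the staging table and the dynamic-partition insert might look like this (a sketch; URL_STG is assumed to mirror URL minus the partition column):
CREATE TABLE URL_STG (url STRING, clicks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/inputpath/' INTO TABLE URL_STG;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE URL PARTITION (dt)
SELECT url, clicks, current_date FROM URL_STG;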

hive: external partitioned table without location

Is it possible to create external partitioned table without location? I want to add all the locations later, together with partitions.
I tried:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
PARTITIONED BY day;
but I got ParseException: missing EOF at 'PARTITIONED' near 'TEXTFILE'
I don't think so, as discussed in the documentation on altering a table's location.
But anyway, I think your query has some errors, and the correct script would be:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
PARTITIONED BY (day String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
;
I think the issue is that you have not specified a data type for your partition column "day". And you can create a Hive external table without a location and use ALTER TABLE options later to change the location.
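For instance, adding the partitions with explicit locations afterwards could look like this (the path and date are hypothetical):
ALTER TABLE a.b ADD PARTITION (day='2017-01-01') LOCATION '/data/logs/2017-01-01';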

Hive table not retrieving rows from external file

I have a text file called sample.txt. The file looks like:
abc,23,M
def,25,F
efg,25,F
I am trying to create a table in hive using:
CREATE EXTERNAL TABLE ppldb(name string, age int,gender string)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/sample.txt';
But the data isn't getting into the table. When I run the query:
select count(*) from ppldb
I get 0 in output.
What could be the reason for data not getting loaded into the table?
The location of an external table in Hive should be an HDFS directory, not the full path of a file.
If that directory does not exist, the location you give will be created automatically. In your case /path/to/sample.txt is being treated as a directory.
So just give /path/to/ in the LOCATION clause and keep the sample.txt file inside that directory. It will work.
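Concretely, the corrected DDL from the question would be (same columns, pointing at the directory):
CREATE EXTERNAL TABLE ppldb(name string, age int, gender string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/';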
Hope it helps...!!!
The LOCATION clause indicates where the table will be stored, not where to retrieve data from. After moving the sample.txt file into HDFS with something like
hdfs dfs -copyFromLocal ~/sample.txt /user/tables/
you could load the data into a table in Hive with
create table temp(name string, age int, gender string)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/user/tables/sample.txt' into table temp;
That should work.
