HDFS - extra data after last expected column - hadoop

We have a source and a target system, and we are trying to import data from SQL Server 2012 into Pivotal Hadoop (PHD 3.0) using the Talend tool.
We are getting this error:
ERROR: extra data after last expected column (seg0 slice1 datanode.domain.com:40000 pid=15035)
Detail: External table pick_report_stg0, line 5472 of pxf://masternnode/path/to/hdfs?profile=HdfsTextSimple: "5472;2016-11-28 08:39:54.217;;2016-11-15 00:00:00.0;SAMPLES;0005525;MORGAN -EVENTS;254056;1;IHBL-NHO..."
What we tried
We have identified the bad line as:
[hdfs@mdw ~]$ hdfs dfs -cat /path/to/hdfs|grep 3548
3548;2016-11-28 04:21:39.97;;2016-11-15 00:00:00.0;SAMPLES;0005525;MORGAN -EVENTS;254056;1;IHBL-NHO-13OZ-01;0;ROC NATION; NH;2016-11-15 00:00:00.0;2016-11-15 00:00:00.0;;2.0;11.99;SA;SC01;NH02;EA;1;F2;NEW PKG ONLY PLEASE!! BY NOON
Structure of the external table and FORMAT clause:
CREATE EXTERNAL TABLE schemaname.tablename
(
"ID" bigint,
"time" timestamp without time zone,
"ShipAddress4" character(40),
"EntrySystemDate" timestamp without time zone,
"CorpAcctName" character(40),
"Customer" character(7),
"CustomerName" character(30),
"SalesOrder" character(6),
"OrderStatus" character(1),
"MStockCode" character(30),
"ShipPostalCode" character(9),
"CustomerPoNumber" character(30),
"OrderDate" timestamp without time zone,
"ReqShipDate" timestamp without time zone,
"DateValue" timestamp without time zone,
"MOrderQty" numeric(9,0),
"MPrice" numeric(9,0),
"CustomerClass" character(2),
"ProductClass" character(4),
"ProductGroup" character(10),
"StockUom" character(3),
"DispatchCount" integer,
"MWarehouse" character(2),
"AlphaValue" varchar(100)
)
LOCATION (
'pxf://path/to/hdfs?profile=HdfsTextSimple'
)
FORMAT 'csv' (delimiter ';' null '' quote ';')
ENCODING 'UTF8';
Finding: an extra semicolon appears in the data, which produces the extra column. But I am still unable to come up with the correct FORMAT clause. Please guide me on how to get rid of the "extra data after last expected column" error.
What FORMAT clause should I use?
Any help would be much appreciated!

If you append the following to your external table definition, after the ENCODING clause, it should help resolve the issue where a small number of rows fail because of this:
LOG ERRORS INTO my_err_table SEGMENT REJECT LIMIT 1 PERCENT;
Here is a reference on this syntax: http://gpdb.docs.pivotal.io/4320/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
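For context, here is a minimal sketch of where that clause sits at the end of the definition; the table name, column list, and path below are placeholders rather than the real ones:
CREATE EXTERNAL TABLE schemaname.tablename_demo  -- placeholder name and shortened column list
(
"ID" bigint,
"AlphaValue" varchar(100)
)
LOCATION (
'pxf://path/to/hdfs?profile=HdfsTextSimple'
)
FORMAT 'csv' (delimiter ';' null '')
ENCODING 'UTF8'
LOG ERRORS INTO my_err_table SEGMENT REJECT LIMIT 1 PERCENT;
Rejected rows then land in my_err_table, which you can query afterwards to see which lines were skipped and why.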

Related

HiveAccessControlException Permission denied: user [hive] does not have [ALL] privilege on [hdfs://sandbox-....:8020/user/..] (state=42000,code=40000)

When I try to load a CSV file from local Hadoop on the sandbox into a Hive table, I get the following exception at this line:
LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/practice';
Error: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [hive] does not have [ALL] privilege on [hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/practice] (state=42000,code=40000)
I used the following code; can you please suggest a solution for this?
CREATE TABLE Sales_transactions(
Transaction_date DATE,
Product STRING,
Price FLOAT,
Payment_Type STRING,
Name STRING,
City STRING,
State STRING,
Country STRING,
Account_Created TIMESTAMP,
Last_Login TIMESTAMP,
Latitude FLOAT,
Longitude FLOAT,
Zip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/practice'; -- the error points to this line
It is actually a two-step process, and I think you missed step 1 (assuming your user has all the proper access).
Step 1 - Load the local file into the HDFS file system:
hdfs dfs -put /~/Sales_transactions.csv hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/practice
Step 2 - Then load the above HDFS data into the table:
load data inpath 'hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/practice/Sales_transactions.csv' into table myDB.Sales_transactions_table
Alternatively, you can use this as well:
LOAD DATA LOCAL INPATH '/~/Sales_transactions.csv' INTO TABLE mydb.Sales_transactions_table;
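If the load goes through, a quick sanity check on the table (same database and table names as above) would be something like:
SELECT COUNT(*) FROM mydb.Sales_transactions_table;
SELECT * FROM mydb.Sales_transactions_table LIMIT 5;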

Getting NULL values after loading data into Hive tables from an online dataset

I am trying to load data from an online dataset into my Hive table using the Hue interface, but I am getting NULL values.
Here's my dataset:
https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Here's my code:
CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
Here's how I loaded the data:
LOAD DATA LOCAL INPATH '/home/hadoop/aisles.csv' INTO TABLE aisles;
My workarounds, none of which helped:
FIELDS TERMINATED BY ','
FIELDS TERMINATED BY '\t'
FIELDS TERMINATED BY ''
FIELDS TERMINATED BY ' '
I also tried removing LINES TERMINATED BY '\n'.
This is how I downloaded the data:
[hadoop@ip-172-31-76-58 ~]$ wget -O aisles.csv "https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv"
--2020-10-14 23:50:06-- https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘aisles.csv’
I checked the location of the table I created, and this is what it says:
hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles
I tried browsing the directory to see where the file was saved:
[hadoop@ip-172-31-76-58 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt - arjiesaenz hadoop 0 2020-10-15 00:57 /user/hive/warehouse/aisles
So, I tried to change my load script like this:
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
But I got an error:
Error while compiling statement: FAILED: SemanticException line 6:61 Invalid path ''/user/hive/warehouse/aisles.csv'': No files matching path hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles.csv
Hopefully someone can help me pinpoint the problem with my code.
Thanks.
I tried the same on my hadoop cluster. The code worked without any issues.
Here's my execution snippet:
hive> CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.034 seconds
hive> load data inpath '/user/hirwuser1448/aisles.csv' into table AISLES;
Loading data to table revisit.aisles
Table revisit.aisles stats: [numFiles=1, totalSize=2603]
OK
Time taken: 0.183 seconds
hive> select * from AISLES limit 10;
OK
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
Time taken: 0.038 seconds, Fetched: 10 row(s)
I think you need to cross-check whether your dataset aisles.csv is actually at the HDFS location and not stored in a local directory.
The problem is with your load command.
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
I see you tried browsing the directory to see the saved file. Do you see aisles.csv under that directory? If the file is there, then you are giving the wrong path in your load command; otherwise the file is not there at all.
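If you want to confirm that from within the hive CLI itself, a dfs command against the same paths used above should show whether the file actually landed there:
dfs -ls /user/hive/warehouse;
dfs -ls /user/hive/warehouse/aisles;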
I found a workaround: I downloaded the dataset, uploaded it to an Amazon S3 bucket, and used the S3 path in the LOAD command.

SQL Loader error for opening log file

I have created an external table:
CREATE TABLE XX_Lookup_EXT
(
LOOKUP_TYPE varchar2(200),
LOOKUP_CODE varchar2(200),
MEANING varchar2(200),
ENABLED_FLAG varchar2(10)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY INTF_DIR1
ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINE SKIP 1
NODISCARDFILE
FIELDS TERMINATED BY '|'
OPTIONALLY ENCLOSED BY '"'
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS
)
LOCATION (INTF_DIR1:'LOOKUP_CODE.csv')
)
REJECT LIMIT UNLIMITED
NOPARALLEL
nomonitoring;
When I query this table, it gives me the following error:
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
error opening file /orabin/tst/test/XX_LOOKUP_EXT_30723.log
29913. 00000 - "error in executing %s callout"
*Cause: The execution of the specified callout caused an error.
*Action: Examine the error messages take appropriate action.
I have tried everything, but I am still getting this error.
@Alex Poole is right. The /orabin/tst/test/ directory must be local to the database server, and the database server account, usually 'oracle', needs read and write permissions within the directory.
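On the SQL side, a hedged sketch of how to verify the directory object and its grants (the grantee below is a placeholder for whichever schema queries the external table):
-- Where does INTF_DIR1 actually point on the database server?
SELECT directory_name, directory_path FROM all_directories WHERE directory_name = 'INTF_DIR1';
-- The querying schema needs READ and WRITE so the access driver can create its log file.
GRANT READ, WRITE ON DIRECTORY INTF_DIR1 TO your_schema_user;  -- placeholder grantee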

add date time from flat file name cloudera

I started an EC2 cluster on Amazon to install Cloudera. I got it installed and configured, and loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files is as follows:
projectcode, pagename, pageviews, bytes
The files are named like this:
pagecounts-20090430-230000.gz
where the name contains the date (20090430) followed by the time (230000).
When loading the data from HDFS into Impala, I do it like this:
CREATE EXTERNAL TABLE wikiPgvws
(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';
One thing I missed is the date and time of each file. The directory:
/user/hdfs
contains multiple pagecount files associated with different dates and times. How can one pull that information out of the file name and store it in a column when loading into Impala?
I think the thing you are missing is the concept of partitions. If you define the table as partitioned, the data can be divided into different partitions based on the timestamp in the file name. I was able to work around it in Hive; I hope you can adapt it (if needed) for Impala, as the query syntax is largely the same.
For me, this problem could not be solved using Hive alone, so I mixed bash with Hive scripting, and it works fine for me. This is how I wrapped it up:
1. Create table wikiPgvws with a partition.
2. Create table wikiTmp with the same fields as wikiPgvws, except for the partition.
3. For each file:
   i. load the data into wikiTmp;
   ii. grep the timestamp from the file name;
   iii. use sed to replace placeholders in a predefined HQL script file that loads the data into the actual table, then run it.
4. Drop table wikiTmp and remove tmp.hql.
The script is as follows:
#!/bin/bash
hive -e "CREATE EXTERNAL TABLE wikiPgvws(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY(dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE";
hive -e "CREATE TABLE wikiTmp(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE"
for fileName in $(hadoop fs -ls /user/hdfs/bounty/pagecounts-*.txt | grep -Po '(?<=\s)(/user.*$)')
do
echo "currentFile :$fileName"
dst=$(echo $fileName | grep -oE '[0-9]{8}-[0-9]{6}')
echo "currentStamp $dst"
sed "s!sourceFile!'$fileName'!" t.hql > tmp.hql
sed -i "s!targetPartition!$dst!" tmp.hql
hive -f tmp.hql
done
hive -e "DROP TABLE wikiTmp"
rm -f tmp.hql
The HQL script consists of just two lines:
LOAD DATA INPATH sourceFile OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = 'targetPartition') SELECT w.* FROM wikiTmp w;
Epilogue:
Check whether options equivalent to hive -e and hive -f are available in Impala; without them, this script is of no use to you. Also, the grep commands that fetch the file name and timestamp need to be adjusted to your table location and timestamp pattern. It's just one way to show how the job can be done; I couldn't find another.
Enhancement
If everything works well, consider merging the first two DDLs into another script to make it look cleaner. I'm not sure whether HQL script arguments can be used to define partition values, but you could look into them as a replacement for sed; see the sketch below.
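For what it's worth, a hedged sketch of that idea using Hive variables: invoke the script as hive --hivevar srcFile=/user/hdfs/pagecounts-20090430-230000.txt --hivevar dts=20090430-230000 -f tmp.hql (the variable names and path here are only examples), and reference the variables inside tmp.hql so no sed substitution is needed:
-- srcFile and dts are supplied on the command line via --hivevar
LOAD DATA INPATH '${hivevar:srcFile}' OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = '${hivevar:dts}') SELECT w.* FROM wikiTmp w;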

Hive error: parseexception missing EOF

I am not sure what I am doing wrong here:
hive> CREATE TABLE default.testtbl(int1 INT,string1 STRING)
stored as orc
tblproperties ("orc.compress"="NONE")
LOCATION "/user/hive/test_table";
FAILED: ParseException line 1:107 missing EOF at 'LOCATION' near ')'
while the following query works perfectly fine:
hive> CREATE TABLE default.testtbl(int1 INT,string1 STRING)
stored as orc
tblproperties ("orc.compress"="NONE");
OK
Time taken: 0.106 seconds
Am I missing something here? Any pointers will help. Thanks!
Try putting LOCATION in front of tblproperties, like below; that worked for me.
CREATE TABLE default.testtbl(int1 INT,string1 STRING)
stored as orc
LOCATION "/user/hive/test_table"
tblproperties ("orc.compress"="NONE");
It seems even the sample SQL from the book "Programming Hive" got the order wrong. Please refer to the official definition of the CREATE TABLE command:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
@Haiying Wang pointed out that LOCATION is to be put in front of tblproperties.
But the error also occurs when LOCATION is specified above STORED AS.
It is better to stick to the correct order:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
Refer: Hive Create Table
Check this post:
Loading Data from a .txt file to Table Stored as ORC in Hive
Also check the source files present at the specified directory /user/hive/test_table. In case the files are in .txt or some other non-ORC format, you can follow the steps in the post above to get past the error; a sketch of that staging approach follows.
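A hedged sketch of that staging approach, with a hypothetical staging table and file path, where the staging table mirrors the ORC table's columns:
-- Hypothetical staging table over the delimited text data
CREATE TABLE default.testtbl_txt (int1 INT, string1 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- Hypothetical path to the raw .txt file in HDFS
LOAD DATA INPATH '/user/hive/staging/testtbl.txt' INTO TABLE default.testtbl_txt;
-- Convert into the ORC table
INSERT INTO TABLE default.testtbl SELECT * FROM default.testtbl_txt;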
ParseException line lineNumber missing EOF at '.' near 'schemaName':
I got the above error while trying to execute the following command from a Linux script to truncate a Hive table:
dse -u username -p password hive -e "truncate table keyspace.tablename;"
Fix:
The commands within the script line need to be separated as follows:
dse -u username -p password hive -e "use keyspace; truncate table keyspace.tablename;"
Happy coding!
I got the same error while creating a table in Hive.
I used the DROP command to drop the table and then ran the CREATE TABLE command that I had again.
That worked for me.
If you see this error when running HiveQL from a file with the command "hive -f file.hql", and it points at the first line of your query, it is most likely because of a forgotten semicolon (;) after a previous query, since the parser looks for a semicolon as the terminator of each query.
For example:
DROP TABLE IF EXISTS default.emp
create table default.emp (
field1 type,
field2 type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://gts-promocube/source-data/Lowes/POS/';
If you save the above in a file and execute it with hive -f, then you'll get the error:
FAILED: ParseException line 2:0 missing EOF at 'CREATE' near emp.
Solution: put a semicolon (;) after the DROP TABLE command above, as shown below.
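With the terminator added (and placeholder column types for illustration), the same file parses cleanly:
DROP TABLE IF EXISTS default.emp;
create table default.emp (
field1 STRING,  -- placeholder type
field2 STRING)  -- placeholder type
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://gts-promocube/source-data/Lowes/POS/';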
