I am trying to load a segmented table in HP Vertica through COPY DIRECT from a pipe separated text file.
COPY CSI.MKT_RSRCH_AGG_ALL FROM '/opt/vertica/CSI/MKT_RSRCH_AGG_ALL.txt' DELIMITER '|' NULL '' DIRECT;
Result
Rows Loaded
-------------
582006
dbadmin=> select get_num_rejected_rows();
get_num_rejected_rows
-----------------------
6046
I am unable to figure out what is causing the data to be rejected.
All of my dimensions are defined NOT NULL.
Is there a log or any other information I can check for the rejected records?
I would start by verifying the integrity of the data. Then, I would send any rejected rows to a file. You can specify this in your COPY command:
COPY CSI.MKT_RSRCH_AGG_ALL
FROM '/opt/vertica/CSI/MKT_RSRCH_AGG_ALL.txt'
DELIMITER '|'
NULL ''
REJECTED DATA '/path/to/rejected/data'
DIRECT;
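If you also want the reason each row was rejected, you can add an EXCEPTIONS file alongside the rejected-data file (the path below is just a placeholder):
COPY ... REJECTED DATA '/path/to/rejected/data' EXCEPTIONS '/path/to/exceptions' DIRECT;
The exceptions file should record, for each rejected record, why it was thrown out (bad data type, NOT NULL violation, too many columns, and so on).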
If you haven't already, I would recommend that you start using stream names to identify the process:
COPY ... DIRECT STREAM NAME 'My stream name';
You can then easily monitor the stream:
SELECT * FROM v_monitor.load_streams WHERE stream_name = 'My stream name';
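For example, to watch just the row counts for that load (these column names come from the load_streams system table; adjust if your Vertica version differs):
SELECT stream_name, accepted_row_count, rejected_row_count
FROM v_monitor.load_streams
WHERE stream_name = 'My stream name';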
Documentation
Checking COPY Metrics
Specifying a Rejected Data File (REJECTED DATA)
COPY Exception and Rejected Data Files
COPY
When importing a file into Greenplum, if one line fails, the whole file fails to import. Is there a way to skip the bad line and import the rest of the data into Greenplum successfully?
Here are my SQL execution and error messages:
copy cjh_test from '/gp_wkspace/outputs/base_tables/error_data_test.csv' using delimiters ',';
ERROR: invalid input syntax for integer: "FE00F760B39BD3756BCFF30000000600"
CONTEXT: COPY cjh_test, line 81, column local_city: "FE00F760B39BD3756BCFF30000000600"
Greenplum has an extension to the COPY command that lets you log errors and set how many errors are tolerated before the load is aborted. Here is an example from the documentation for the COPY command:
COPY sales FROM '/home/usr1/sql/sales_data' LOG ERRORS
SEGMENT REJECT LIMIT 10 ROWS;
That tells COPY that up to 10 bad rows can be ignored without stopping the load. The reject limit can be a number of rows or a percentage of the load file. You can check the full syntax in psql with: \h copy
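For instance, the same load with a percentage-based limit would look something like this (a sketch based on the documented syntax, not your actual table):
COPY sales FROM '/home/usr1/sql/sales_data' LOG ERRORS
SEGMENT REJECT LIMIT 2 PERCENT;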
If you are loading a very large file into Greenplum, I would suggest looking at gpload or gpfdist (which also support the segment reject limit syntax). COPY is single-threaded through the master server, whereas gpload/gpfdist load the data in parallel to all segments. COPY will be faster for smaller load files; the others will be faster for millions of rows in a load file (or set of files).
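As a rough sketch of the gpfdist route (host, port, file name, and column list here are all hypothetical): start gpfdist on the machine that holds the file, point a readable external table at it, and insert from that external table.
gpfdist -d /data/loads -p 8081 -l /tmp/gpfdist.log &

CREATE EXTERNAL TABLE ext_sales (id int, amount numeric, sold_on date)
LOCATION ('gpfdist://etl-host:8081/sales_data.txt')
FORMAT 'TEXT' (DELIMITER '|' NULL '')
SEGMENT REJECT LIMIT 10 ROWS;

INSERT INTO sales SELECT * FROM ext_sales;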
I am trying to load data from an online dataset into my Hive table using the Hue interface, but I am getting NULL values.
Here's my dataset:
https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Here's my code:
CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
Here's how I loaded the data:
LOAD DATA LOCAL INPATH '/home/hadoop/aisles.csv' INTO TABLE aisles;
My workarounds, none of which worked:
FIELDS TERMINATED BY ','
FIELDS TERMINATED BY '\t'
FIELDS TERMINATED BY ''
FIELDS TERMINATED BY ' '
I also tried removing LINES TERMINATED BY '\n'.
This is how I downloaded the data:
[hadoop@ip-172-31-76-58 ~]$ wget -O aisles.csv "https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv"
--2020-10-14 23:50:06-- https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘aisles.csv’
I checked the location of the table I created, and this is what it says:
hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles
I tried browsing the directory and see where the file was saved:
[hadoop@ip-172-31-76-58 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt - arjiesaenz hadoop 0 2020-10-15 00:57 /user/hive/warehouse/aisles
So I tried to change my load script like this:
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
But I got an error:
Error while compiling statement: FAILED: SemanticException line 6:61 Invalid path ''/user/hive/warehouse/aisles.csv'': No files matching path hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles.csv
Hopefully someone can help me pinpoint the problem with my code.
Thanks.
I tried the same on my hadoop cluster. The code worked without any issues.
Here's my execution snippet:
hive> CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.034 seconds
hive> load data inpath '/user/hirwuser1448/aisles.csv' into table AISLES;
Loading data to table revisit.aisles
Table revisit.aisles stats: [numFiles=1, totalSize=2603]
OK
Time taken: 0.183 seconds
hive> select * from AISLES limit 10;
OK
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
Time taken: 0.038 seconds, Fetched: 10 row(s)
I think you need to cross-check whether your dataset aisles.csv is actually at the HDFS location rather than stored in a local directory.
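If it turns out the file only exists on the local filesystem, something like this should stage it in HDFS first (paths are just examples):
hdfs dfs -mkdir -p /user/hadoop/staging
hdfs dfs -put /home/hadoop/aisles.csv /user/hadoop/staging/
Then point the load at that HDFS path: load data inpath '/user/hadoop/staging/aisles.csv' into table AISLES;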
The problem is with your load cmd.
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
I see you tried browsing the directory to see the saved file. Do you see aisles.csv under that directory? If the file is there, then you are giving the wrong path in your load command; otherwise the file isn't there at all.
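A quick way to check both at once, using the warehouse location from your question:
hdfs dfs -ls /user/hive/warehouse/aisles/
hdfs dfs -cat /user/hive/warehouse/aisles/* | head -5
If the first lines look like HTML rather than comma-separated values, then the wget above saved the Kaggle web page instead of the real CSV, which would also explain the NULLs.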
I found a workaround: I downloaded the dataset, uploaded it to an Amazon S3 bucket, and used the S3 path in the LOAD command.
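For reference, the load then looked roughly like this (bucket and key are placeholders; on EMR, Hive can read s3:// paths directly):
LOAD DATA INPATH 's3://my-bucket/instacart/aisles.csv' INTO TABLE aisles;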
The "REJECTMAX" parameter is a technique of executing copy command even though there are invalid records in the csv
(so if i have 100 records, 9 of them are invalid & max rejected is 10 the file will upload)
I wonder if there is a way that i can get as a text the rejected records that prints into the rejected file so i can log it into application error log.
Here is an example of how to use REJECTED DATA. Suppose you have a table like this:
SQL> CREATE TABLE public.mydata ( id INTEGER ) ;
CREATE TABLE
and an input file containing:
$ cat /tmp/mydata
1
2
3
ABC
4
5
Clearly ABC won't fit into an integer...
So we run:
SQL> COPY public.mydata FROM '/tmp/mydata' REJECTMAX 2 REJECTED DATA '/tmp/mydata.rejected' ;
NOTICE 7850: In a multi-threaded load, rejected record data may be written to additional files
HINT: Rejected data may be written to files [/tmp/mydata.rejected], [/tmp/mydata.rejected.1], etc
Rows Loaded
-------------
5
And now...
$ cat /tmp/mydata.rejected
ABC
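If you would rather read the rejects from your application by querying the database instead of a file, Vertica can also store them in a table with REJECTED DATA AS TABLE (the rejections table exposes the raw record and the reason; the exact column set may vary by version):
SQL> COPY public.mydata FROM '/tmp/mydata' REJECTMAX 2 REJECTED DATA AS TABLE mydata_rejects ;
SQL> SELECT rejected_data, rejected_reason FROM mydata_rejects ;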
Is this what you were looking for?
We have a source and a target system. We are trying to import data from SQL Server 2012 into Pivotal Hadoop (PHD 3.0) using the Talend tool.
We are getting this error:
ERROR: extra data after last expected column (seg0 slice1 datanode.domain.com:40000 pid=15035)
Detail: External table pick_report_stg0, line 5472 of pxf://masternnode/path/to/hdfs?profile=HdfsTextSimple: "5472;2016-11-28 08:39:54.217;;2016-11-15 00:00:00.0;SAMPLES;0005525;MORGAN -EVENTS;254056;1;IHBL-NHO..."
What we tried:
We identified the bad line as follows:
[hdfs@mdw ~]$ hdfs dfs -cat /path/to/hdfs|grep 3548
3548;2016-11-28 04:21:39.97;;2016-11-15 00:00:00.0;SAMPLES;0005525;MORGAN -EVENTS;254056;1;IHBL-NHO-13OZ-01;0;ROC NATION; NH;2016-11-15 00:00:00.0;2016-11-15 00:00:00.0;;2.0;11.99;SA;SC01;NH02;EA;1;F2;NEW PKG ONLY PLEASE!! BY NOON
Structure of the external table and the FORMAT clause:
CREATE EXTERNAL TABLE schemaname.tablename
(
"ID" bigint,
"time" timestamp without time zone,
"ShipAddress4" character(40),
"EntrySystemDate" timestamp without time zone,
"CorpAcctName" character(40),
"Customer" character(7),
"CustomerName" character(30),
"SalesOrder" character(6),
"OrderStatus" character(1),
"MStockCode" character(30),
"ShipPostalCode" character(9),
"CustomerPoNumber" character(30),
"OrderDate" timestamp without time zone,
"ReqShipDate" timestamp without time zone,
"DateValue" timestamp without time zone,
"MOrderQty" numeric(9,0),
"MPrice" numeric(9,0),
"CustomerClass" character(2),
"ProductClass" character(4),
"ProductGroup" character(10),
"StockUom" character(3),
"DispatchCount" integer,
"MWarehouse" character(2),
"AlphaValue" varchar(100)
)
LOCATION (
'pxf://path/to/hdfs?profile=HdfsTextSimple'
)
FORMAT 'csv' (delimiter ';' null '' quote ';')
ENCODING 'UTF8';
Finding: an extra semicolon appears in the data, which causes the extra column. But I am still unable to supply the correct format clause. Please guide me: how do I get rid of the "extra data after last expected column" error? What format clause should I use?
Any help on this would be much appreciated!
If you append the following to your external table definition, after the ENCODING clause, it should help resolve the issue when only a small number of rows fail in this way:
LOG ERRORS INTO my_err_table SEGMENT REJECT LIMIT 1 PERCENT;
Here is a reference on this syntax: http://gpdb.docs.pivotal.io/4320/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
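To show where the clause goes and how to read the rejects back, here is a sketch using the table from the question (the error-table column names are from the 4.3-era docs, so double-check them against your PHD/GPDB version):
CREATE EXTERNAL TABLE schemaname.tablename ( ... )
LOCATION ('pxf://path/to/hdfs?profile=HdfsTextSimple')
FORMAT 'csv' (delimiter ';' null '' quote ';')
ENCODING 'UTF8'
LOG ERRORS INTO my_err_table SEGMENT REJECT LIMIT 1 PERCENT;

SELECT relname, linenum, errmsg, rawdata FROM my_err_table;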
I ran the create script below and it created the table:
Create writable external table FLTR (like dbname.FLTR)
LOCATION ('gpfdist://172.90.38.190:8081/fltr.out')
FORMAT 'CSV' (DELIMITER ',' NULL '')
DISTRIBUTED BY (fltr_key);
But when I tried inserting into the file, with insert into fltr.out select * from dbname.fltr,
I got the error below: cannot find server connection.
Please help me out.
I think your gpfdist is probably not running. Try:
gpfdist -p 8081 -l ~/gpfdist.log -d ~/ &
on 172.90.38.190.
This will start gpfdist using your home directory as the data directory.
When I do that, my inserts work and create a file ~/fltr.out.
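For completeness: once gpfdist is running there, the insert targets the external table name rather than the output file, so something like this should produce ~/fltr.out (table and schema names taken from the question):
INSERT INTO FLTR SELECT * FROM dbname.FLTR;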