Importing a file into Greenplum fails because of one bad line of data (Navicat)

When I import a file into Greenplum, a single bad line fails and the whole file is rejected. Is there a way to skip the bad line and still import the rest of the data?
Here is the SQL I ran and the resulting error message:
copy cjh_test from '/gp_wkspace/outputs/base_tables/error_data_test.csv' using delimiters ',';
ERROR: invalid input syntax for integer: "FE00F760B39BD3756BCFF30000000600"
CONTEXT: COPY cjh_test, line 81, column local_city: "FE00F760B39BD3756BCFF30000000600"

Greenplum extends the COPY command so that it can log errors and tolerate a set number of bad rows without stopping the load. Here is an example from the documentation for the COPY command:
COPY sales FROM '/home/usr1/sql/sales_data' LOG ERRORS
SEGMENT REJECT LIMIT 10 ROWS;
That tells COPY to discard bad rows and keep loading as long as the reject limit of 10 rows is not reached (the limit is counted per segment). The reject limit can be given as a number of rows or as a percentage. You can check the full syntax in psql with: \h copy
If you are loading a very large file into Greenplum, I would suggest looking at gpload or gpfdist (both also support the segment reject limit syntax). COPY is single-threaded through the master server, whereas gpload and gpfdist load the data in parallel to all segments. COPY will be faster for smaller load files, and the others will be faster for load files with millions of rows.
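If you would rather see which lines are likely to be rejected before running the load, a rough pre-check can be sketched in Python. This is only an illustration: the function name `find_bad_integer_rows`, the zero-based column index, and the plain split on the delimiter are assumptions, and the check only approximates what Greenplum accepts as an integer.

```python
import re

# Rough approximation of a 32-bit integer literal; Greenplum's parser is the authority.
_INT_RE = re.compile(r"^\s*[+-]?[0-9]+\s*$")
_INT_MIN, _INT_MAX = -2**31, 2**31 - 1


def find_bad_integer_rows(lines, column_index, delimiter=","):
    """Return (line_number, value) pairs whose field is not a plausible integer.

    `lines` is any iterable of text lines. Line numbers start at 1,
    matching the numbering used in COPY error messages.
    """
    bad = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        if column_index >= len(fields):
            bad.append((line_number, None))  # row has too few columns
            continue
        value = fields[column_index]
        if value == r"\N":
            continue  # default NULL marker in COPY text format
        if not _INT_RE.match(value) or not _INT_MIN <= int(value) <= _INT_MAX:
            bad.append((line_number, value))
    return bad


# Hypothetical sample data; the second column stands in for local_city.
sample = ["1,100", "2,FE00F760B39BD3756BCFF30000000600", "3,42"]
print(find_bad_integer_rows(sample, column_index=1))
# prints [(2, 'FE00F760B39BD3756BCFF30000000600')]
```

For a real file you would pass an open file object instead of the sample list. The LOG ERRORS approach above remains the more reliable option, since it uses Greenplum's own parsing.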

Related

Upload to HDFS stops with warning "Slow ReadProcessor read"

When I try to upload files of about 20 GB into HDFS, they usually upload until about 12-14 GB, then stop, and I get a bunch of these warnings on the command line:
"INFO hdfs.DataStreamer: Slow ReadProcessor read fields for block BP-222805046-10.66.4.100-1587360338928:blk_1073743783_2960 took 62414ms (threshold=30000ms); ack: seqno: 226662 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0, targets:"
However, if I retry the upload 5-6 times, it sometimes succeeds on the 4th or 5th attempt. I believe I could get consistent uploads by changing some DataNode storage settings, but I don't know which parameters to modify in the Hadoop configuration. Thanks!
Edit: This happens when I put the file into HDFS through a Python program that uses a subprocess call. However, even if I run the command directly from the command line, I run into the same issue.

Faster way of Appending/combining thousands (42000) of netCDF files in NCO

I am having trouble combining thousands of netCDF files (42000+, 3 GB in size for this particular folder/variable). The main variable that I want to combine has the structure (6, 127, 118), i.e. (time, lat, lon).
I'm appending the files one by one since the list of files is too long.
I have tried:
for i in input_source/**/**/*.nc; do ncrcat -A -h append_output.nc $i append_output.nc ; done
but this method is really slow (on the order of kB/s, and it seems to get slower as more files are appended) and also gives a warning:
ncrcat: WARNING Intra-file non-monotonicity. Record coordinate "forecast_period" does not monotonically increase between (input file file1.nc record indices: 17, 18) (output file file1.nc record indices 17, 18) record coordinate values 6.000000, 1.000000
The warning appears because the "forecast_period" values 1-6 simply repeat n times (n = 42000 files), i.e. [1,2,3,4,5,6,1,2,3,4,5,6,...].
Despite this warning I can still open the file, and ncrcat does what it's supposed to; it is just slow, at least with this method.
I have also tried adding the option:
--no_tmp_fl
but this gives an error:
ERROR: nco__open() unable to open file "append_output.nc"
full error attached below
If it helps, I'm using WSL with Ubuntu on Windows 10.
I'm new to bash, and any comments would be much appreciated.
Either of these commands should work:
ncrcat --no_tmp_fl -h *.nc
or
ls input_source/**/**/*.nc | ncrcat --no_tmp_fl -h append_output.nc
Your original command is slow because it opens and closes the output file N times. These commands open it once, fill it up, then close it.
I would use CDO for this task. Given the huge number of files, it is recommended to first sort them by time (assuming you want to merge them along the time axis). After that, you can use:
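The same "open once" approach can also be driven from Python, passing the file names on standard input so that the length of the argument list is not a concern. This is a minimal sketch, assuming ncrcat is on the PATH and that sorting by path puts the files in the order you want along the record dimension. The function names are hypothetical, and the ncrcat options are the ones already used in the commands above.

```python
import glob
import subprocess


def collect_netcdf_files(root):
    """Return every .nc file under `root` (any depth), sorted by path."""
    return sorted(glob.glob(f"{root}/**/*.nc", recursive=True))


def concatenate(files, output="append_output.nc"):
    """Run ncrcat a single time, feeding the input names on standard input.

    Keep `output` outside the searched directory so that it is never
    picked up as one of its own inputs.
    """
    subprocess.run(
        ["ncrcat", "--no_tmp_fl", "-h", output],
        input="\n".join(files) + "\n",
        text=True,
        check=True,  # raise CalledProcessError if ncrcat fails
    )
```

You would call it as `concatenate(collect_netcdf_files("input_source"))`. If the path order does not match the time order, sort the list differently before passing it in.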
cdo cat *.nc outfile

MYSQL bulk insert - Linux

I am trying to load a text file into MySQL, but I get the error below.
Error Code: 1064
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Rank=@Rank' at line 7
LOAD DATA LOCAL INFILE 'F:/keyword/Key_2018-10-06_06-44-09.txt'
INTO TABLE table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n'
IGNORE 0 LINES
(@dump_date,@Rank)
SET dump_date=@dump_date,Rank=@Rank;
The same query works on a Windows server but not on a Linux server.
I am going to suggest here that you try executing that command from the command line on a single line:
LOAD DATA LOCAL INFILE 'F:/keyword/Key_2018-10-06_06-44-09.txt' INTO TABLE
table FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n' IGNORE 0 LINES
(@dump_date,@Rank) SET dump_date=@dump_date,Rank=@Rank;
For formatting reasons, I have added newlines above, but don't do that when you run it from the Linux prompt; just use a single line. The text will wrap around on its own as you type it.

What is the file ORA_DUMMY_FILE.f in Oracle?

Oracle version: 12.2.0.1
As you know, these are the Unix processes for the parallel servers in Oracle:
ora_p000_ora12c
ora_p001_ora12c
....
ora_p???_ora12c
They can also be seen in the view gv$px_process.
The spid of each parallel server can be obtained from there.
Then I look at the open files associated with a parallel server here:
ls -l /proc/<spid>/fd
For several parallel servers I see around 500-10000 file descriptors, all like this one:
991 -> /u01/app/oracle/admin/ora12c/dpdump/676185682F2D4EA0E0530100007FFF5E/ORA_DUMMY_FILE.f (deleted)
I've closed them using the following (actually, I've created a small script to do it, because there are thousands of them):
gdb -p <spid>
gdb> p close(<fd_id>)
But after some hours the file descriptors start being created again (hundreds every day).
If they are not closed, the Linux limit is eventually reached and any parallel query throws an error like this:
ORA-12801: error signaled in parallel query server P001
ORA-01116: error in opening database file 132
ORA-01110: data file 132: '/u02/oradata/ora12c/pdbname/tablespacenaname_ts_1.dbf'
ORA-27077: too many files open
Does anyone have any idea how and why these file descriptors are being created, and how to avoid it?
Edited: Added some more information that could be useful.
I've verified that when a new PDB is created, a directory DATA_PUMP_DIR is created in it (select * from all_directories) that points to:
/u01/app/oracle/admin/ora12c/dpdump/<xxxxxxxxxxxxx>
The Linux directory is also created.
In addition, one file descriptor is created pointing to ORA_DUMMY_FILE.f in the new dpdump subdirectory, like the ones described initially:
lsof | grep "ORA_DUMMY_FILE.f (deleted)"
/u01/app/oracle/admin/ora12c/dpdump/<xxxxxxxxxxxxx>/ORA_DUMMY_FILE.f (deleted)
This may be OK; the problem I face is the continuous growth of file descriptors pointing to ORA_DUMMY_FILE.f, which eventually reach the Linux limit.
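To track how fast the descriptors accumulate, the /proc listing above can be summarized with a small script. This is a sketch only: the function name `count_fd_targets` is hypothetical, it assumes Linux and enough privileges to read /proc/<spid>/fd, and it only measures the growth without explaining or stopping it.

```python
import os
from collections import Counter


def count_fd_targets(fd_dir, suffix="ORA_DUMMY_FILE.f (deleted)"):
    """Count descriptors in `fd_dir` whose link target ends with `suffix`.

    `fd_dir` is expected to be a directory such as /proc/<spid>/fd, where
    every entry is a symbolic link to the file the descriptor refers to.
    """
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # not a link, or the descriptor was closed meanwhile
        if target.endswith(suffix):
            counts[target] += 1
    return counts
```

Calling `count_fd_targets(f"/proc/{spid}/fd")` for each parallel server at regular intervals shows which dpdump subdirectory the growing descriptors belong to.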

Got error 22 from storage engine (MySQL)

mysqldump: Error: 'got error 22 from storage engine' when trying to dump tablespaces
mysqldump: Got error: 23: Out of resources when opening file '.\database\table.MYD' (Errcode: 24) when using LOCK TABLES
I get this error when trying to dump any database I select. It looks like the database is corrupted; is it possible to repair it?
You seem to have reached the maximum number of open files. This limit is either MySQL's or the system's, so you can try one of the following:
Increase the value of open_files_limit in your MySQL configuration file (this directive does not exist in a default installation, so you might need to add it to the [mysqld] section).
Increase the limit at the system level (but I am not sure this applies to Windows).
Here are some reasons for this error:
Type “source path-to-SQL-file”, but follow these rules:
Use the full source command, not the . shortcut.
Have no spaces in your path. I copied mine to the root of a drive. Note that spaces in the file name are OK, just not in the path.
Do not quote the file name, even if it has spaces. Quoting it gave me error 22.
Use forward slashes in the path, e.g., C:/path/to/filename.sql. Otherwise you’ll get error 2.
Do not end with a semicolon.
Please check your read/write access to the drive where your MySQL database is stored.
Error 22 usually occurs when you have no write access to that drive.
