I am allowing a user to upload a .csv file via OHS/mod_plsql. The file is saved as a BLOB in the database, which I then convert to a CLOB. I then want to parse the contents of that CLOB and save them into a table. I already have working code that splits the file on line endings, then splits each resulting string on commas and inserts the records.
What I need is a way to handle the case where a string within the CSV is enclosed in double-quotes and contains a comma. For example:
col1,col2,col3,col4
some,text,more,text
this,text,has,"commas, semicolons, and periods"
My code will know how to process the second line, but not the third. Does anyone have code smart enough to treat "commas, semicolons, and periods" as a single token? I could probably hack something together, but I don't trust my regular expression skills enough, and I figure someone else has probably already written something that does this and would be willing to share it.
There's a good CSV parser in the Alexandria PL/SQL Library - CSV_UTIL_PKG.
https://code.google.com/p/plsql-utils/
More info:
http://ora-00001.blogspot.com.au/2010/04/select-from-spreadsheet-or-how-to-parse.html
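If you'd rather not install the whole library, the quote handling itself is only a small state machine. Here is a minimal, self-contained PL/SQL sketch (the type and function names are just illustrative, not part of CSV_UTIL_PKG): it splits a single line on commas while treating anything inside double quotes, including doubled quotes, as part of the field. The library version handles more cases (CLOB input, embedded newlines), so treat this as a fallback, not a replacement.

CREATE OR REPLACE TYPE t_varchar2_tab AS TABLE OF VARCHAR2(4000);
/

CREATE OR REPLACE FUNCTION split_csv_line (p_line IN VARCHAR2)
  RETURN t_varchar2_tab
IS
  l_result    t_varchar2_tab := t_varchar2_tab();
  l_token     VARCHAR2(4000);
  l_in_quotes BOOLEAN := FALSE;
  l_char      VARCHAR2(1 CHAR);
  i           PLS_INTEGER := 1;

  PROCEDURE push_token IS
  BEGIN
    l_result.EXTEND;
    l_result(l_result.COUNT) := l_token;
    l_token := NULL;
  END;
BEGIN
  WHILE i <= NVL(LENGTH(p_line), 0) LOOP
    l_char := SUBSTR(p_line, i, 1);

    IF l_char = '"' THEN
      IF l_in_quotes AND SUBSTR(p_line, i + 1, 1) = '"' THEN
        -- a doubled quote inside a quoted field becomes a literal quote
        l_token := l_token || '"';
        i := i + 1;
      ELSE
        -- toggle quoted mode; the quote itself is not part of the value
        l_in_quotes := NOT l_in_quotes;
      END IF;
    ELSIF l_char = ',' AND NOT l_in_quotes THEN
      -- an unquoted comma ends the current token
      push_token;
    ELSE
      l_token := l_token || l_char;
    END IF;

    i := i + 1;
  END LOOP;

  push_token;  -- last field
  RETURN l_result;
END split_csv_line;
/

For the third line in the example above it should return four elements, the last one being commas, semicolons, and periods without the surrounding quotes.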
Related
Question from a relative Hadoop/Hive newbie: How can I pass the contents of a Microsoft Word (binary) document as a parameter to a Hive function?
My goal is to be able to provide the full contents of a binary file (a Microsoft Word document in my particular use case) as a binary parameter to a UDTF. My initial approach has been to slurp the file's contents into a staging table and then provide it to the UDTF in a query later on, and this was how I attempted to build that staging table:
create table worddoc(content BINARY);
load data inpath '/path/to/wordfile' into table worddoc;
Unfortunately, there seem to be newlines in the Word document (or something acting enough like newlines) that result in the staging table having many rows instead of a single comprehensive blob, the latter being what I was hoping for. Is there some way of ensuring that the ingest doesn't get exploded into multiple rows? I've seen similar questions here on SO regarding other binary data like image files, so that is why I'm guessing it's the newlines that are tripping me up.
Failing all that, is there a way to skip storing the file's contents in an intermediary Hive table and just provide the content directly to the UDTF at invocation time? Nothing obvious jumped out during my search through Hive's built-in functions, but maybe I am missing something.
Version-wise, the environment is Hive 0.13.1 and Hadoop 1.2.1 (although upgrades to both are pending).
This is a hacky workaround, but here is what I ended up doing:
1) base64 encode the binary document and put the encoded file into HDFS
2) In Hive:
CREATE TABLE staging_table (content STRING);
LOAD DATA INPATH '/path/to/base64_encoded_file' INTO TABLE staging_table;
CREATE TABLE target_table (content BINARY);
INSERT INTO TABLE target_table SELECT unbase64(content) FROM staging_table;
In theory this should work for any arbitrary binary file that you'd want to squish into Hive this way. One gotcha to watch out for: make sure your base64-encoding implementation produces a single-line file (my OS X base64 utility produces one-line output, while the base64 utility in a CentOS 6 VM I was using produced hundreds of lines). If it doesn't, you can join the lines manually before putting the file into HDFS.
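A quick sanity check, using the table names from the snippet above, is to confirm that the staging table really contains a single row and that the encoded string has a plausible size before decoding it:

-- should return 1 if the base64 file was truly a single line
SELECT count(*) FROM staging_table;

-- rough size check on the encoded text (length() works on STRING columns)
SELECT length(content) FROM staging_table;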
I am trying to insert into a Hive table through files. The problem is that the last column in the text file has data that spills across multiple lines.
Example data:
col1|col2|col3|this line is spilling into different line
as is this, this is spilling this is spilling this is sp
iliing and so is this
col1|col2|col3|this can be inserted without problem
So the spilled data is treated as a new row instead of being wrapped into the last column. I tried using the LINES TERMINATED BY option, but I cannot get it to work.
This is a special case of the more general problem of having a newline (end-of-line/record) symbol embedded in a column. Typical CSV file formats put quotation characters around string fields, so detecting embedded newlines is simplified by noting whether the newline falls inside quotes.
You do not have quote characters, but you do know the number of fields, so you can detect when a newline would lead to a premature end of the record. Detecting a newline in the last field is harder: you need to notice that the subsequent lines contain no field separators, and assume that those lines are part of the current record.
I'm trying to import data from a csv file which, unfortunately, contains multiple data tables. Actually, it's not really a pure csv file.
It contains a header section with some metadata, and then the actual CSV data parts, which are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to process and load such data using Kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any Kettle steps that will really help you with data in this format. You probably need to do some preprocessing before bringing the data into a CSV Input step. You could still do this within your job, though, by calling out to the shell and executing a command there first (for example, an awk script that splits the file into its component parts) and then loading those files via the normal Kettle pattern.
I am working on Oracle Data Integrator 11g
I have to create an ODI package where I need to process an incoming file. The file name is not a constant string, as it has a date stamp appended to it, something like this: FILTER_DATA_011413.TXT
Because of the MMDDYY part, I can't hardcode the filename in my package. The way we're handling it right now is that a shell script lists the files in the directory and loads the filename into a table (using a control file). This table is then queried to get the filename, which is passed to the variable that stores the filename for processing.
I am looking for another way, one that avoids having this temporary table just to store the file name.
Can someone suggest an alternative?
I regularly upload various tab-delimited text data files with SQL*Loader. The control files always specify TERMINATED BY X'09'. Certain "cells" in those data files may be null, i.e. there is no character between two consecutive tabs. It always works like clockwork.
Now I have run into a specific case where I have to strip a data column of double quotes that may or may not enclose the actual data (a side effect of a text export from Excel).
I tried simply adding OPTIONALLY ENCLOSED BY '"' after the TERMINATED BY clause, and it did work for the file and column in question. However, with this new option, files with null values are no longer decoded correctly: the loader seems to simply skip those values, which causes column shifts and results in either a corrupt load or a load failure.
For now my workaround is to drop the new option and run an SQL script that strips the double quotes directly in the database, but that obviously cannot last.
Any ideas?
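For context, the control file described above presumably looks roughly like the sketch below (the table and column names are made up for illustration). One untested variation that may be worth trying is moving the enclosure clause from the global FIELDS line down to the single column that actually carries the quotes, so the null-prone columns keep the plain tab-delimited handling:

LOAD DATA
INFILE 'export.txt'
APPEND INTO TABLE my_table
FIELDS TERMINATED BY X'09'          -- tab-delimited, as in the original setup
TRAILING NULLCOLS
(
  col_a,
  col_b,
  -- only this column is exported with (optional) surrounding double quotes;
  -- scoping the enclosure to it, rather than to the whole FIELDS clause,
  -- is a hypothetical variation, not a confirmed fix
  col_c CHAR(4000) TERMINATED BY X'09' OPTIONALLY ENCLOSED BY '"'
)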