Multi-line insert in Hive - hadoop

I am trying to insert into a Hive table through files. But it so happens that the last column in the text file has data which spills across multiple lines.
Example data:
col1|col2|col3|this line is spilling into different line
as is this, this is spilling this is spilling this is sp
iliing and so is this
col1|col2|col3|this can be inserted without problem
So the spilled data is treated as a new row instead of wrapping into the last column. I tried using the LINES TERMINATED BY option, but cannot get it to work.

This is a special case of the more general problem of having a newline (end-of-line/record) symbol embedded in a column. Typical CSV formats put quotation characters around string fields, so detecting embedded newlines is simplified: a newline that appears inside quotes is part of the field.
You do not have quote characters, but you do know the number of fields, so you can detect when a newline would end the record prematurely. A newline in the last field is harder to detect: you have to notice that the subsequent lines contain no field separators, and assume that those lines are part of the current record.
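As a rough illustration of that idea, here is a minimal preprocessing sketch that merges continuation lines back into the previous record before the file is loaded into Hive. It assumes exactly three '|' separators per complete record and that continuation lines never contain '|'; the class name and file arguments are made up for the example.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class MergeSpilledLines {
        public static void main(String[] args) throws IOException {
            int expectedSeparators = 3;   // col1|col2|col3|col4 -> three pipes per record (assumed)
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
                StringBuilder record = null;
                String line;
                while ((line = in.readLine()) != null) {
                    long pipes = line.chars().filter(c -> c == '|').count();
                    if (pipes >= expectedSeparators) {
                        // A new record starts here: flush the previous one first.
                        if (record != null) out.println(record);
                        record = new StringBuilder(line);
                    } else if (record != null) {
                        // No field separators: treat this line as a continuation
                        // of the previous record's last column.
                        record.append(' ').append(line);
                    }
                }
                if (record != null) out.println(record);
            }
        }
    }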

Related

NiFi ReplaceText Processor inserting Empty strings

I'm trying to convert a fixed-width text file to a pipe-delimited text file. I'm using NiFi's ReplaceText processor for this. These are my processor configurations:
Replacement Strategy-Regex Replace
Evaluation Mode-Line-by-Line
Line-by-Line Evaluation Mode-All
Search Value- (.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})
replacement Value- ${'$1':trim()}${literal('|'):unescapeXml()}${'$3':trim()}${literal('|'):unescapeXml()}${'$4':trim()}${literal('|'):unescapeXml()}${'$5':toDecimal()}${literal('|'):unescapeXml()}${'$8':trim()}${literal('|'):unescapeXml()}${'$9':trim():toNumber()}${literal('|'):unescapeXml()}${'$10':trim()}${literal('|'):unescapeXml()}${'$11':toNumber()}${literal('|'):unescapeXml()}${'$12':toDecimal()}${literal('|'):unescapeXml()}${'$13':trim()}${literal('|'):unescapeXml()}${header:substring(63,69)}
I'm trying to split each record according to the column lengths provided to me, trim spaces, and parse fields into different types. In this process I observe that some columns in the output file are randomly empty strings even though the records in the fixed-width file contain data. I can't figure out why the expression evaluation is inserting zero-length strings randomly in the file. When I try with a small set of records (about 100) from the original file, it works fine. My original file has 12 million records in it.
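Setting the NiFi question aside for a moment, the slice-trim-join transformation itself is straightforward; a minimal sketch in plain Java, with made-up field widths (three columns instead of the fourteen capture groups above) purely to illustrate the intent:
    import java.util.ArrayList;
    import java.util.List;

    public class FixedWidthToPipe {
        // Slice a fixed-width line into fields of the given widths, trim each
        // field, and join them with '|'. The widths are illustrative only.
        public static String convert(String line, int[] widths) {
            List<String> fields = new ArrayList<>();
            int pos = 0;
            for (int width : widths) {
                int end = Math.min(pos + width, line.length());
                fields.add(line.substring(pos, end).trim());
                pos = end;
            }
            return String.join("|", fields);
        }

        public static void main(String[] args) {
            int[] widths = {1, 4, 16};   // made-up layout, not the real record layout
            System.out.println(convert("AB123This is a field    ", widths));
            // prints: A|B123|This is a field
        }
    }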

How to handle multiline record for inputsplit?

I have a 100 TB text file with multi-line records, and we are not told how many lines each record takes. One record may span 5 lines, another 6, another 4; the number of lines per record can vary.
So I cannot use the default TextInputFormat. I have written my own InputFormat and a custom RecordReader, but my confusion is: when splits happen, I am not sure that each split will contain a full record. Part of a record could go into split 1 and the rest into split 2, which would be wrong.
So, can you suggest how to handle this scenario so that I can guarantee my full record goes into a single InputSplit?
Thanks in advance
-JE
You need to know if the records are actually delimited by some known sequence of characters.
If you know this you can set the textinputformat.record.delimiter config parameter to separate the records.
If the records aren't character delimited, you'll need some extra logic that, for example, counts a known number of fields (if there is a known number of fields) and presents that as a record. This usually makes things more complex, more error-prone and slower, as there's a lot more text processing going on.
Try determining if the records are delimited. Perhaps posting a short example of a few records would help.
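For example, if each record really does end with a known marker, the delimiter can be set on the job configuration; a minimal sketch, where the "###\n" marker is just a placeholder for whatever sequence actually terminates your records:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DelimitedRecordsJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell TextInputFormat that records end with this sequence rather
            // than a newline. "###\n" is only an example value.
            conf.set("textinputformat.record.delimiter", "###\n");
            Job job = Job.getInstance(conf, "multiline-records");
            // ... set mapper, reducer, input/output paths as usual, then:
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }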
In your record reader you need to define an algorithm by which you can:
Determine whether you're in the middle of a record
Scan over that partial record and read the next full record
This is similar to what the TextInputFormat LineReader already does: when the input split has an offset, the line record reader scans forward from that offset to the first newline it finds, and then reads the next record after that newline as the first record it emits. Tied with this, if the block ends before the end of the record, the line record reader will read up to and past the end of the block to find the line-terminating character for the current record.
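To make that skip-forward/read-past-the-end pattern concrete, here is a minimal RecordReader sketch. It assumes records end with a known delimiter read from a made-up config key (custom.record.delimiter); if your records are not delimited, the readLine calls are where your own boundary-detection logic would go. Class and key names are illustrative, not Hadoop's or the question's.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class MultiLineRecordReader extends RecordReader<LongWritable, Text> {
        private long start, pos, end;
        private LineReader in;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            start = split.getStart();
            end = start + split.getLength();
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream fileIn = fs.open(file);
            byte[] delimiter = conf.get("custom.record.delimiter", "###\n").getBytes();
            fileIn.seek(start);
            in = new LineReader(fileIn, conf, delimiter);
            // If this split does not start at the beginning of the file, the first
            // thing we see is almost certainly the tail of a record owned by the
            // previous split: skip forward past the first delimiter.
            if (start != 0) {
                start += in.readLine(new Text());
            }
            pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // Only start records inside our split; the final read may run past
            // 'end' to pick up the remainder of the last record.
            if (pos > end) {
                return false;
            }
            key.set(pos);
            int consumed = in.readLine(value);
            if (consumed == 0) {
                return false;
            }
            pos += consumed;
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }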

Oracle PL/SQL package/procedure for processing CSV from CLOB

I am allowing a user to upload a .csv file via OHS/mod_plsql. This file gets saved as a BLOB in the database, which I then convert to a CLOB. I then want to take the contents of this CLOB and save them into a table. I already have some working code that splits the file on line endings, then splits each resulting string on commas and inserts the records.
What I need is a way to handle the case where a string within the CSV is enclosed in double-quotes and contains a comma. For example:
col1,col2,col3,col4
some,text,more,text
this,text,has,"commas, semicolons, and periods"
My code will know how to process the second line, but not the third. Does anyone have code that is smart enough to treat "commas, semicolons, and periods" as a single token? I could probably hack something together, but I don't trust my regular expression skills enough, and I figure someone else has probably already written something that does this and would be willing to share it.
There's a good CSV parser in the Alexandria PL/SQL Library - CSV_UTIL_PKG.
https://code.google.com/p/plsql-utils/
More info:
http://ora-00001.blogspot.com.au/2010/04/select-from-spreadsheet-or-how-to-parse.html
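For reference, the heart of what such a parser has to do is a small state machine: commas only split fields when they appear outside double quotes. The sketch below is not the Alexandria package's API, just an illustration of that logic (shown in Java purely for brevity; it also ignores escaped quotes inside quoted fields):
    import java.util.ArrayList;
    import java.util.List;

    public class QuoteAwareSplit {
        public static List<String> splitCsvLine(String line) {
            List<String> fields = new ArrayList<>();
            StringBuilder field = new StringBuilder();
            boolean inQuotes = false;
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == '"') {
                    inQuotes = !inQuotes;          // toggle quoted state, drop the quote itself
                } else if (c == ',' && !inQuotes) {
                    fields.add(field.toString());  // a field boundary only outside quotes
                    field.setLength(0);
                } else {
                    field.append(c);
                }
            }
            fields.add(field.toString());
            return fields;
        }

        public static void main(String[] args) {
            // Prints: [this, text, has, commas, semicolons, and periods]
            System.out.println(splitCsvLine("this,text,has,\"commas, semicolons, and periods\""));
        }
    }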

Avoiding skipped null columns when using terminated and optionally enclosed delimiters

I regularly upload various tab-delimited text data files with SQL*Loader. The control files always specify TERMINATED BY X'09'. Certain "cells" in those data files may be null, i.e. there is no character between two consecutive tabs. It has always worked like clockwork.
Now I have run into a specific case where I have to strip a data column of double quotes that may or may not enclose the actual data (a side effect of a text export from Excel).
I tried simply adding OPTIONALLY ENCLOSED BY '"' after the TERMINATED BY clause, and it did work for the file and column in question. Still, with this new option, files with null values are no longer decoded correctly. The loader seems to simply skip those values, which causes column shifts and results in either a corrupt load or a load failure.
For now, my workaround is to drop the new option and run an SQL script that removes the double quotes directly in the database, but that obviously cannot last.
Any ideas?

how to perform ETL in map/reduce

How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names, and the second and subsequent rows specify values for those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns, values might be specified only for the first 10.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information, like the file name, to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in the following rows?
Can I just split the file into lines, transform (map) each line in parallel, and join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits, each handed to a different mapper. In that case only the mapper receiving the first split will see the first row with the column info; the rest won't.
I would suggest putting the file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could then read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between these, I think you have all the contextual information you are referring to, and you can do file-specific transformations. Even though the transformation logic differs for different files, the mapper output needs to have the same format.
If you are using a reducer, you could set the number of reducers to one to force all output to aggregate into one file.
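A minimal sketch of the mapper side of that, assuming the file-specific rules are looked up by file name; the class name and the applyRules helper are made up for illustration:
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TransformMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private String fileName;

        @Override
        protected void setup(Context context) {
            // Name of the file backing this mapper's split (assumes a plain
            // FileInputFormat split); use it to pick the file-specific rules,
            // e.g. loaded from side data.
            fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String transformed = applyRules(fileName, value.toString());
            context.write(NullWritable.get(), new Text(transformed));
        }

        private String applyRules(String file, String line) {
            // Placeholder: real logic would rewrite the line's columns according
            // to the rules for this particular input file.
            return line;
        }
    }
Setting job.setNumReduceTasks(1) with an identity reducer then funnels everything into a single output file, as suggested above.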
