I'm trying to convert a fixed-width text file to a pipe-delimited text file, using NiFi's ReplaceText processor. These are my processor configurations:
Replacement Strategy: Regex Replace
Evaluation Mode: Line-by-Line
Line-by-Line Evaluation Mode: All
Search Value: (.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})
Replacement Value: ${'$1':trim()}${literal('|'):unescapeXml()}${'$3':trim()}${literal('|'):unescapeXml()}${'$4':trim()}${literal('|'):unescapeXml()}${'$5':toDecimal()}${literal('|'):unescapeXml()}${'$8':trim()}${literal('|'):unescapeXml()}${'$9':trim():toNumber()}${literal('|'):unescapeXml()}${'$10':trim()}${literal('|'):unescapeXml()}${'$11':toNumber()}${literal('|'):unescapeXml()}${'$12':toDecimal()}${literal('|'):unescapeXml()}${'$13':trim()}${literal('|'):unescapeXml()}${header:substring(63,69)}
I'm trying to split each record according to the column lengths provided to me, trimming spaces and parsing fields to different types. In this process I observe that some columns in the output file are randomly empty, even though the corresponding records in the fixed-width file contain data. I can't figure out why the expression evaluation is inserting zero-length strings at random. When I try with a small set of records (some 100 records) from the original file, it works fine. The original file has 12 million records in it.
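The fourteen capture groups above can be exercised outside NiFi with a plain Java sketch (the class name and the fail-loudly behaviour are my own, and unlike the NiFi expression it keeps all fourteen groups rather than dropping $2, $6, $7 and $14 and appending a header substring). Running a sample of the original file through something like this makes any line that doesn't fit the 155-character layout fail loudly instead of silently producing empty columns:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthToPipe {
    // Same group widths as the Search Value above (155 characters total).
    private static final Pattern ROW = Pattern.compile(
        "(.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})");

    // Converts one fixed-width line to pipe-delimited, trimming each field.
    static String convert(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            // A short or overlong line will not match; surface it instead of
            // letting it slip through and produce empty columns downstream.
            throw new IllegalArgumentException(
                "line length " + line.length() + " does not fit the layout");
        }
        StringBuilder out = new StringBuilder();
        for (int g = 1; g <= m.groupCount(); g++) {
            if (g > 1) out.append('|');
            out.append(m.group(g).trim());
        }
        return out.toString();
    }
}
```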
Related
I'm trying to process large records in Hadoop that span multiple lines. Each record looks like this:
>record_id|record_name // Should be the key
JAKSJDUVUAKKSKJJDJBUUBLAKSJDJUBKAKSJSDJB // Should be the value
KSJGFJJASPAKJWNMFKASKLSKJHUUBNFNNAKSLKJD
JDKSKSKALSLDKSDKPBPBPKASJDKSALLKSADJSAKD
I want to read the file containing these records as bytes because reading it as a String is just too memory intensive, as a single record can be well over 100MB. I cannot split these records on anything but the > character that defines a new record in the file.
I've been looking for a default RecordReader and InputFormat that can do these steps for me, but I haven't been able to find one. I'm trying to write my own, but I have no examples/tutorials to follow on this subject.
How should I approach this?
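One way to frame the reader's core job: buffer bytes until the next > marker, then emit everything buffered so far as one record. Below is a stdlib-only Java sketch of just that framing logic (class and method names are mine); a real RecordReader would apply it to the split's byte range and, when a split starts mid-record, first skip forward to the next > before emitting anything. Buffering raw bytes rather than a String roughly halves the memory footprint of a 100MB record:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class GreaterThanRecordSplitter {

    // Emits one byte[] per record; a record starts at each '>' byte.
    // Any bytes before the first '>' are discarded.
    static List<byte[]> split(InputStream in) throws IOException {
        List<byte[]> records = new ArrayList<>();
        ByteArrayOutputStream current = null;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '>') {                         // start of a new record
                if (current != null) records.add(current.toByteArray());
                current = new ByteArrayOutputStream();
            }
            if (current != null) current.write(b);
        }
        if (current != null) records.add(current.toByteArray());
        return records;
    }
}
```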
I am trying to insert into a Hive table through files, but the last column in the text file has data which spills across multiple lines.
Example data:
col1|col2|col3|this line is spilling into different line
as is this, this is spilling this is spilling this is sp
iliing and so is this
col1|col2|col3|this can be inserted without problem
So the spilled data is treated as a new row instead of wrapping into the last column. I tried using the LINES TERMINATED BY option, but cannot get it to work.
This is a special case of the more general problem of having a newline (end-of-line/record) symbol embedded in a column. Typical CSV file formats put quotation characters around string fields, so detecting embedded newlines is simplified: the newline appears inside quotes.
You do not have quote characters, but you do know the number of fields, so you can detect when a newline would lead to a premature end of the record. Detecting a newline in the last field is harder: you need to notice that subsequent lines do not contain field separators, and assume that those lines are part of the current record.
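That heuristic can be sketched in plain Java (names are mine; it assumes a fixed column count, assumes continuation lines never contain the | separator, and concatenates continuation lines without a joining space, since the example above wraps mid-word):

```java
import java.util.ArrayList;
import java.util.List;

public class SpilledRowMerger {
    // Joins physical lines into logical rows: a line without the expected
    // number of separators is treated as a continuation of the previous
    // row's last column. A genuine new row carries exactly cols-1 pipes.
    static List<String> merge(List<String> lines, int cols) {
        List<String> rows = new ArrayList<>();
        for (String line : lines) {
            long pipes = line.chars().filter(c -> c == '|').count();
            if (pipes == cols - 1 || rows.isEmpty()) {
                rows.add(line);                               // new logical row
            } else {
                // Continuation: append directly (the wrap can split a word).
                rows.set(rows.size() - 1, rows.get(rows.size() - 1) + line);
            }
        }
        return rows;
    }
}
```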
I need to read Avro records serialized in Avro files in HDFS. To do that, I use AvroKeyInputFormat, so my mapper is able to work with the read records as keys.
My question is, how can I control the split size? With the text input format, you define the size in bytes. Here I need to define how many records each split will contain.
I would like to treat every file in my input directory as one big file. Do I have to use CombineFileInputFormat? Is it possible to use it with Avro?
Splits honor logical record boundaries, and the min and max split sizes are in bytes - the text input format won't break lines in a text file even though the split boundaries are defined in bytes.
To have each file in its own split, you can either set the minimum split size to Long.MAX_VALUE (the computed split size is max(minSize, min(maxSize, blockSize)), so a huge minimum forces whole-file splits) or override the isSplitable method in your code and return false.
I have a text file of 100 TB and it has multiline records. We are not told how many lines each record spans: one record may be 5 lines, another 6, another 4. The line count varies per record.
So I cannot use the default TextInputFormat. I have written my own input format and a custom record reader, but my confusion is: when splits happen, I am not sure each split will contain a full record. Part of a record could go into split 1 and the rest into split 2, which is wrong.
So, can you suggest how to handle this scenario so that a full record is guaranteed to go into a single InputSplit?
Thanks in advance
-JE
You need to know if the records are actually delimited by some known sequence of characters.
If you know this you can set the textinputformat.record.delimiter config parameter to separate the records.
If the records aren't character-delimited, you'll need some extra logic that, for example, counts a known number of fields (if there is a known number of fields) and presents that as a record. This usually makes things more complex, error-prone, and slow, as there's a lot more text processing going on.
Try determining if the records are delimited. Perhaps posting a short example of a few records would help.
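With Hadoop's TextInputFormat, once you know the delimiter this is a one-liner on the job configuration - conf.set("textinputformat.record.delimiter", ...) - and the effect can be illustrated with a stdlib java.util.Scanner (the "##\n" delimiter below is an invented example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class DelimitedRecords {
    // Stdlib analogue of textinputformat.record.delimiter: split the input
    // into records on an arbitrary delimiter string instead of newline,
    // so each record may itself span several lines.
    static List<String> read(String input, String delimiter) {
        List<String> records = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(Pattern.quote(delimiter));
            while (s.hasNext()) records.add(s.next());
        }
        return records;
    }
}
```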
In your record reader you need to define an algorithm by which you can:
Determine whether you're in the middle of a record
Scan past that partial record and read the next full record
This is similar to what the TextInputFormat's LineReader already does - when the input split has a non-zero offset, the line record reader scans forward from that offset to the first newline it finds, then emits the record after that newline as its first record. Tied with this, if a record straddles the end of the split, the line record reader will read up to and past the end of the split to find the line-terminating character for the current record.
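The scan-forward step can be sketched in a few lines of plain Java (class name mine; real code works on streams rather than byte arrays):

```java
public class SplitAligner {
    // Mimics what the line record reader does at a non-zero split offset:
    // skip the (possibly partial) record the offset lands in - the previous
    // split's reader will have consumed it - and return the position of the
    // first record that belongs to this split.
    static int firstRecordStart(byte[] data, int splitOffset) {
        if (splitOffset == 0) return 0;          // first split starts cleanly
        int i = splitOffset;
        while (i < data.length && data[i] != '\n') i++;
        return Math.min(i + 1, data.length);     // byte after the newline
    }
}
```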
How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names; the second and subsequent rows specify values for those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns then values might be specified only for the first 10.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in subsequent rows?
Split file into lines, transform (map) each line in parallel, join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will see the first row with the column info; the rest won't.
I would suggest putting the file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between these, I think you have all the context information you are referring to, and you can do file-specific transformation. Even though the transformation logic differs between files, the mapper output needs to have the same format.
If you are using a reducer, you can set the number of reducers to one to force all output into a single file.
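A minimal sketch of that arrangement, assuming the side metadata has already been loaded into a map keyed by file name (the file name, column names and output format here are invented for illustration; a real mapper would obtain the file name from its input split via the Context):

```java
import java.util.List;
import java.util.Map;

public class PerFileTransform {
    // Column names come from side metadata keyed by file name, not from the
    // file's first row (which only one mapper would see). Missing trailing
    // columns are emitted as empty values, matching the question's data.
    static String transform(String fileName, String line,
                            Map<String, List<String>> columnsByFile) {
        List<String> cols = columnsByFile.get(fileName);
        String[] values = line.split(",", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < cols.size(); i++) {
            if (i > 0) out.append('|');
            String v = i < values.length ? values[i] : "";  // trailing column missing
            out.append(cols.get(i)).append('=').append(v);
        }
        return out.toString();
    }
}
```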