How to handle multiline record for inputsplit? - hadoop

I have a 100 TB text file with multiline records, and we are not told how many lines each record spans. One record may take 5 lines, another 6, another 4; the number of lines varies from record to record.
So I cannot use the default TextInputFormat. I have written my own InputFormat and a custom RecordReader, but my confusion is: when splits happen, I am not sure that each split will contain a full record. Part of a record could end up in split 1 and the rest in split 2, which would be wrong.
So, can you suggest how to handle this scenario so that I can guarantee each full record is read within a single InputSplit?
Thanks in advance
-JE

You need to know if the records are actually delimited by some known sequence of characters.
If you know this you can set the textinputformat.record.delimiter config parameter to separate the records.
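If the records do turn out to end with a known delimiter, a minimal driver-side sketch might look like the following; the "\n\n" blank-line delimiter and the class name are assumptions, so substitute whatever byte sequence actually terminates your records:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiLineRecordJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed delimiter: a blank line between records. Replace "\n\n" with
        // whatever byte sequence really terminates each record in your file.
        conf.set("textinputformat.record.delimiter", "\n\n");

        Job job = Job.getInstance(conf, "multiline records");
        job.setJarByClass(MultiLineRecordJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        // With the delimiter set, each map() call receives one whole multi-line
        // record as its Text value (identity mapper used here for brevity).
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With that setting the stock LineRecordReader resynchronizes at split boundaries for you, much as it does for plain newlines.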
If the records aren't character delimited, you'll need some extra logic that, for example, counts a known number of fields (if there is a known number of fields) and presents that as a record. This usually makes things more complex, error-prone and slower, since there's an extra round of text processing going on.
Try determining if the records are delimited. Perhaps posting a short example of a few records would help.

In your record reader you need to define an algorithm by which you can:
Determine if you're in the middle of a record
Scan over the remainder of that record and read the next full record
This is similar to what the TextInputFormat's LineReader already does: when the input split has a non-zero offset, the line record reader scans forward from that offset for the first newline it finds, and then reads the record after that newline as the first record it will emit. Tied to this, if the split ends before the EOF, the line record reader will read up to and past the end of the split to find the line-terminating character for the current record.
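A stripped-down sketch of that scanning idea for the new (mapreduce) API is below. The isRecordStart() test (here, a line beginning with "BEGIN") is purely an assumption standing in for however your records actually begin, and the boundary handling is deliberately simplified:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Emits one multi-line record per call. A record is assumed to begin with a line
// for which isRecordStart() returns true; adapt that test to your actual data.
public class MultiLineRecordReader extends RecordReader<LongWritable, Text> {

    private LineReader in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private final Text line = new Text();
    private boolean bufferedStartLine = false;  // read-ahead: first line of the next record
    private long bufferedLineStart;             // offset at which that buffered line begins

    // Hypothetical marker; replace with whatever identifies the first line of a record.
    private boolean isRecordStart(Text l) {
        return l.toString().startsWith("BEGIN");
    }

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext ctx) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = ctx.getConfiguration();
        start = split.getStart();
        end = start + split.getLength();
        FileSystem fs = split.getPath().getFileSystem(conf);
        FSDataInputStream fileIn = fs.open(split.getPath());
        fileIn.seek(start);
        in = new LineReader(fileIn, conf);
        pos = start;

        // A non-zero offset almost certainly lands mid-record: scan forward to the first
        // record-start line; the previous split's reader finishes the record we skip here
        // by reading past its own end. (Boundary handling is simplified; production code
        // should mirror LineRecordReader's off-by-one convention exactly.)
        if (start != 0) {
            while (true) {
                long lineStart = pos;
                int read = in.readLine(line);
                if (read == 0) break;                      // end of file
                pos += read;
                if (lineStart >= end) break;               // no record starts in this split
                if (isRecordStart(line)) { bufferedStartLine = true; bufferedLineStart = lineStart; break; }
            }
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        long recordStart;
        if (bufferedStartLine) {
            recordStart = bufferedLineStart;
            bufferedStartLine = false;
        } else {
            recordStart = pos;
            int read = in.readLine(line);
            if (read == 0) return false;                   // end of file
            pos += read;
        }
        if (recordStart >= end) return false;              // record belongs to the next split

        key.set(recordStart);
        StringBuilder record = new StringBuilder(line.toString());
        // Keep reading, possibly past 'end', until the next record start or EOF;
        // the next record's first line is buffered so it is not lost.
        int read;
        while ((read = in.readLine(line)) > 0) {
            long lineStart = pos;
            pos += read;
            if (isRecordStart(line)) { bufferedStartLine = true; bufferedLineStart = lineStart; break; }
            record.append('\n').append(line);
        }
        value.set(record.toString());
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public void close() throws IOException { if (in != null) in.close(); }

    @Override
    public float getProgress() {
        return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }
}

A matching FileInputFormat subclass only needs to return this reader from createRecordReader(); splitting stays enabled because the reader resynchronizes itself at record starts.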

Related

NiFi ReplaceText Processor inserting Empty strings

I'm trying to convert a fixed-width text file to a pipe-delimited text file, using NiFi's ReplaceText processor. These are my processor configurations:
Replacement Strategy: Regex Replace
Evaluation Mode: Line-by-Line
Line-by-Line Evaluation Mode: All
Search Value: (.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})
Replacement Value: ${'$1':trim()}${literal('|'):unescapeXml()}${'$3':trim()}${literal('|'):unescapeXml()}${'$4':trim()}${literal('|'):unescapeXml()}${'$5':toDecimal()}${literal('|'):unescapeXml()}${'$8':trim()}${literal('|'):unescapeXml()}${'$9':trim():toNumber()}${literal('|'):unescapeXml()}${'$10':trim()}${literal('|'):unescapeXml()}${'$11':toNumber()}${literal('|'):unescapeXml()}${'$12':toDecimal()}${literal('|'):unescapeXml()}${'$13':trim()}${literal('|'):unescapeXml()}${header:substring(63,69)}
I'm trying to split each record according to the column lengths provided to me, trimming spaces and parsing fields to different types. In this process I observe that some columns in the output file are randomly empty strings, even though the records in the fixed-width file contain data. I can't figure out why the expression evaluation is inserting zero-length strings randomly in the file. When I try with a small set of records (around 100) from the original file, it works fine. My original file has 12 million records in it.
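One way to narrow this down is to replay the same fixed-width pattern outside NiFi over the full file and check whether every line actually matches it. This rough harness (the class name and argument are placeholders, and it obviously cannot reproduce NiFi expression-language functions like toDecimal()) just flags lines the regex rejects:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthCheck {
    // Same fixed-width groups as the ReplaceText Search Value above (155 chars total).
    private static final Pattern ROW = Pattern.compile(
        "(.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})");

    public static void main(String[] args) throws Exception {
        try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
            String line;
            long n = 0;
            while ((line = r.readLine()) != null) {
                n++;
                Matcher m = ROW.matcher(line);
                // Lines shorter or longer than the expected 155 characters will not match.
                if (!m.matches()) {
                    System.out.println("Line " + n + " does not match (length " + line.length() + ")");
                }
            }
        }
    }
}

Any lines the pattern rejects are worth inspecting first, since they are a plausible source of blank columns downstream.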

How to load fixed-width data where multiple records are in one line

I have a delimited file like the one below:
donaldtrump 23 hyd tedcruz 25 hyd james 27 hyd
The first set of three fields should be one record, the second set of three fields another record, and so on. What is the best way to load this file into a Hive table like the one below (emp_name, age, location)?
A very, very dirty way to do that could be:
design a simple Perl script (or Python script, or sed command line) that takes source records from stdin, breaks them into N logical records, and pushes these to stdout (a sketch of this reshaping step follows after the caveats below)
tell Hive to use that script/command as a custom Map step, using the TRANSFORM syntax -- the manual is there but it's very cryptic, you'd better Google for some examples such as this or that or whatever
Caveat: this "streaming" pattern is rather slow, because of the necessary serialization/deserialization to plain text. But once you have a working example, the development cost is minimal.
Additional caveat: of course, if source records must be processed in order -- because the logical records can spill onto the next row, for example -- then you have a big problem, because Hadoop may split the source file arbitrarily and feed the splits to different mappers. And you have no criterion for a DISTRIBUTE BY clause in your example. In that case, a very-very-very dirty trick would be to compress the source file with GZIP so that it is de facto un-splittable.
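For illustration, a minimal version of the reshaping step is sketched below in Java rather than Perl or Python, since the rest of this thread is Java-centric; any executable that reads stdin and writes stdout can be used with TRANSFORM. It groups whitespace-separated tokens into threes and emits one tab-separated (emp_name, age, location) line per group; the class name is a placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

// Reads source lines from stdin, splits them into groups of three tokens,
// and writes one tab-separated logical record per group to stdout.
public class Reshape {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer t = new StringTokenizer(line);
            while (t.countTokens() >= 3) {
                System.out.println(t.nextToken() + "\t" + t.nextToken() + "\t" + t.nextToken());
            }
        }
    }
}

On the Hive side this executable (or an equivalent script) is plugged in via TRANSFORM(...) USING ... AS (emp_name, age, location), as described above; the ordering caveat about splits still applies.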

Read text file as bytes, split on a character

I'm trying to process large records in Hadoop that span multiple lines. Each record consists of this:
>record_id|record_name // Should be the key
JAKSJDUVUAKKSKJJDJBUUBLAKSJDJUBKAKSJSDJB // Should be the value
KSJGFJJASPAKJWNMFKASKLSKJHUUBNFNNAKSLKJD
JDKSKSKALSLDKSDKPBPBPKASJDKSALLKSADJSAKD
I want to read the file containing these records as bytes because reading it as a String is just too memory intensive, as a single record can be well over 100MB. I cannot split these records on anything but the > character that defines a new record in the file.
I've been looking for a built-in RecordReader and InputFormat that can do this for me, but I haven't been able to find one. I'm trying to write my own, but I have no examples/tutorials to follow on this subject.
How should I approach this?
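There is no stock reader that does exactly this, so one approach is a custom reader along the lines of the sketch earlier in this thread, but accumulating raw bytes instead of text. Below is a rough sketch of just the per-record scanning; wrap the stream in a BufferedInputStream in practice, and split-boundary handling still means skipping forward to the next '>' when a split does not start at one. The class name is a placeholder:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Core scanning loop for a byte-oriented reader: accumulate bytes until the next
// '>' (which marks the start of the following record), assuming '>' only ever
// appears at the start of a record, as in the sample above.
public class GtDelimitedScanner {

    /** Reads one record's bytes, or returns null at end of stream. */
    public static byte[] readRecord(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '>' && buf.size() > 0) {
                break;                 // the next record begins here
            }
            if (b != '>') {
                buf.write(b);          // drop the leading '>' itself
            }
        }
        return buf.size() == 0 ? null : buf.toByteArray();
    }

    /** Index of the first newline, i.e. where the key ends and the value begins. */
    public static int headerEnd(byte[] record) {
        for (int i = 0; i < record.length; i++) {
            if (record[i] == '\n') return i;
        }
        return record.length;
    }
}

The first newline inside the returned bytes separates the record_id|record_name header (the key) from the sequence data (the value), which can be wrapped in a BytesWritable without ever building a String.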

Interpolate data of a text file (mapreduce)

I have a big text file, every line has a timestamp and some other data, like this:
timestamp1,data
timestamp2,data
timestamp5,data
timestamp7,data
...
timestampN,data
This file is ordered by timestamp but there might be gaps between consecutive timestamps.
I need to fill those gaps and write the new file.
I've thought about reading two consecutive lines of the file. But I have two problems here:
How to read two consecutive lines? NLineInputFormat or MultipleLineTextInputFormat may help with this, but will they read line1+line2, line2+line3, ... or line1+line2, line3+line4?
How to manage lines when I have several mappers running?
Any other algorithm/solution? Maybe this cannot be done with MapReduce?
(Pig/Hive solutions are also valid)
Thanks in advance.
You can use an approach similar to the famous 1 TB sort.
If you know the range of timestamp values in your file, you can do the following:
Mappers should map the data by timestamp region (which will be your key).
Reducers then process the data in the context of one key, and you can implement any desired interpolation logic there.
Also, a secondary sort may help to get the values sorted by timestamp in your reducer.
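A rough sketch of the mapper side of that idea, assuming the overall timestamp range is known up front and passed in through hypothetical interpolation.* configuration keys; each line is keyed by its timestamp region so one reducer sees a contiguous slice it can interpolate over:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buckets each "timestamp,data" line by timestamp region so that a single reducer
// receives a contiguous range and can fill the gaps between consecutive timestamps.
public class RegionMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private long min, max, buckets;
    private final LongWritable outKey = new LongWritable();

    @Override
    protected void setup(Context context) {
        // Hypothetical config keys: the known timestamp range and bucket count.
        min = context.getConfiguration().getLong("interpolation.min.ts", 0L);
        max = context.getConfiguration().getLong("interpolation.max.ts", Long.MAX_VALUE);
        buckets = context.getConfiguration().getLong("interpolation.buckets", 100L);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumes well-formed "timestamp,data" lines, as in the example above.
        long ts = Long.parseLong(line.toString().split(",", 2)[0]);
        double fraction = (ts - min) / (max - min + 1.0);
        outKey.set((long) Math.max(0, Math.min(buckets - 1, fraction * buckets)));
        context.write(outKey, line);
    }
}

Each reducer can then sort its values by timestamp (or rely on a secondary sort), walk consecutive pairs and emit the missing timestamps; gaps that straddle two buckets need either slightly overlapping bucket boundaries or a small follow-up pass.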

Multi-line insert in Hive

I am trying to insert into a Hive table from files, but it so happens that the last column in the text file has data which spills across multiple lines.
Example data:
col1|col2|col3|this line is spilling into different line
as is this, this is spilling this is spilling this is sp
iliing and so is this
col1|col2|col3|this can be inserted without problem
So the spilled data is treated as a new row instead of being wrapped into the last column. I tried using the LINES TERMINATED BY option, but cannot get this to work.
This is a special case of the more general problem of having a newline (end-of-line/record) symbol embedded in a column. Typical CSV file formats put quotation characters around string fields, and thus detecting embedded newlines in fields is simplified by noting that the newline is inside quotes.
You do not have quote characters, but you do have knowledge of the number of fields, so you can detect when a newline would lead to the premature end of the record. But detecting the newline in the last field is harder. You need to notice that subsequent lines do not have field separators, and assume that these following lines are part of the record.
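A sketch of that merging idea as a simple pre-processing step run before loading into Hive: the separator count of 3 matches the col1|col2|col3|... example above, the class name and file arguments are placeholders, and it will mis-split if the spilled text itself ever contains a '|':

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

// Merges continuation lines into the previous record: a line is treated as the
// start of a new record only if it contains the expected number of '|' separators.
public class MergeSpilledLines {
    private static final int EXPECTED_SEPARATORS = 3;   // col1|col2|col3|last_col

    private static int countPipes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) if (s.charAt(i) == '|') n++;
        return n;
    }

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            StringBuilder record = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (countPipes(line) >= EXPECTED_SEPARATORS) {
                    if (record != null) out.println(record);   // previous record is complete
                    record = new StringBuilder(line);
                } else if (record != null) {
                    record.append(' ').append(line);           // continuation of the last column
                }
            }
            if (record != null) out.println(record);
        }
    }
}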
