grep results in more lines than the original file has - amazon-ec2

I am grepping a few extremely large CSV files (around 24 million rows each) using two mutually exclusive regexes to filter rows. I cannot share the regexes or the files (not that you would ever want to download them).
The idea is that rows that match regex A get piped into file A. Rows that match regex B get piped into file B.
What I end up with is about 5 million extra rows in the target files after this process completes.
The regexes are guaranteed to be mutually exclusive, and the line counts are correct.
The task is running on an Amazon EC2 node. Has anyone ever seen this kind of issue when running grep in the cloud?

Using awk instead seems to fix the problem.
Thanks Barmar!
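For reference, a minimal sketch of the awk approach the question settled on, assuming the two patterns and the output file names are placeholders (the real regexes were not shared): one pass over the input, with each matching row appended to its own output file.

```bash
# Hypothetical one-pass split: rows matching PATTERN_A go to file_A.csv,
# rows matching PATTERN_B go to file_B.csv. Replace the patterns and file
# names with the real ones.
awk '/PATTERN_A/ { print > "file_A.csv" } /PATTERN_B/ { print > "file_B.csv" }' input.csv
```

A single awk pass like this also avoids running two separate grep processes over the same 24-million-row file.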

Related

how to split a large file into smaller files for parallel processing in spring batch?

We have a large file which can be split logically (not by range but by the occurrence of the next header record).
For example
HeaderRecord1
...large number of detail records
HeaderRecord2
...large number of detail records
and so on...
We want to split the file into multiple small files at the HeaderRecord level and process them in parallel.
How do we achieve this in Spring Batch? When I googled, I came across SystemCommandTasklet and the suggestion to use the Linux/Unix split command to split the file.
Is that the best approach? Are there any partition options within Spring Batch?
Thanks and Regards
You need to create a custom Partitioner that calculates the indexes of each logical partition (begin/end index). Then use a custom item reader (which could extend FlatFileItemReader) that reads only the lines of the given partition (and ignores the other lines).
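A rough Java sketch of that idea: the Partitioner and ExecutionContext APIs are Spring Batch's own, but the "HeaderRecord" prefix, the class name, and the startLine/endLine context keys are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical partitioner: one partition per header record.
// Assumes header lines are recognizable, e.g. they start with "HeaderRecord".
public class HeaderRecordPartitioner implements Partitioner {

    private final Path inputFile;

    public HeaderRecordPartitioner(Path inputFile) {
        this.inputFile = inputFile;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // gridSize is ignored here; one partition is created per header record.
        try {
            // Fine for an indexing pass; stream the file instead if it is very large.
            List<String> lines = Files.readAllLines(inputFile);
            Map<String, ExecutionContext> partitions = new HashMap<>();
            int partitionIndex = -1;
            for (int i = 0; i < lines.size(); i++) {
                if (lines.get(i).startsWith("HeaderRecord")) {
                    if (partitionIndex >= 0) {
                        partitions.get("partition" + partitionIndex).putInt("endLine", i - 1);
                    }
                    partitionIndex++;
                    ExecutionContext ctx = new ExecutionContext();
                    ctx.putInt("startLine", i);
                    partitions.put("partition" + partitionIndex, ctx);
                }
            }
            if (partitionIndex >= 0) {
                partitions.get("partition" + partitionIndex).putInt("endLine", lines.size() - 1);
            }
            return partitions;
        } catch (IOException e) {
            throw new IllegalStateException("Could not index " + inputFile, e);
        }
    }
}
```

Each partition's reader could then be a step-scoped FlatFileItemReader configured from those keys (for example via setLinesToSkip and setMaxItemCount), or a custom reader that simply skips lines outside its range.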

how to load the fixed width data where multiple records are in one line

I have a delimited file like the below
donaldtrump 23 hyd tedcruz 25 hyd james 27 hyd
The first set of three fields should be one record, the second set of three fields another record, and so on. What is the best way to load this file into a Hive table like the one below (emp_name, age, location)?
A very, very dirty way to do that could be:
design a simple Perl script (or Python script, or sed command line) that takes source records from stdin, breaks them into N logical records, and pushes them to stdout
tell Hive to use that script/command as a custom Map step, using the TRANSFORM syntax -- the manual is there but it's very cryptic, you'd better Google for some examples such as this or that or whatever
Caveat: this "streaming" pattern is rather slow, because of the necessary serialization/deserialization to plain text. But once you have a working example, the development cost is minimal.
Additional caveat: of course, if the source records must be processed in order -- because a logical record can spill onto the next row, for example -- then you have a big problem, because Hadoop may split the source file arbitrarily and feed the splits to different mappers. And you have no criterion for a DISTRIBUTE BY clause in your example. Then a very-very-very dirty trick would be to compress the source file with GZIP so that it is de facto un-splittable.
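For the first step (the script), a minimal Python sketch, assuming every logical record is exactly three whitespace-separated fields; the script name and field count are assumptions, not something given in the question.

```python
#!/usr/bin/env python
# Hypothetical TRANSFORM script: reads the wide source rows from stdin and
# emits one tab-separated (emp_name, age, location) record per group of
# three whitespace-separated fields.
import sys

FIELDS_PER_RECORD = 3  # assumption: each logical record is exactly 3 fields

for line in sys.stdin:
    fields = line.split()
    for i in range(0, len(fields) - FIELDS_PER_RECORD + 1, FIELDS_PER_RECORD):
        print("\t".join(fields[i:i + FIELDS_PER_RECORD]))
```

On the Hive side this would be wired up roughly as ADD FILE explode_records.py; followed by SELECT TRANSFORM(line) USING 'python explode_records.py' AS (emp_name, age, location) FROM raw_lines; where raw_lines is a one-column staging table over the source file (the table and column names are placeholders).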

Interpolate data of a text file (mapreduce)

I have a big text file, every line has a timestamp and some other data, like this:
timestamp1,data
timestamp2,data
timestamp5,data
timestamp7,data
...
timestampN,data
This file is ordered by timestamp but there might be gaps between consecutive timestamps.
I need to fill those gaps and write the new file.
I've thought about reading two consecutive lines of the file. But I have two problems here:
How to read two consecutive lines? NLineInputFormat or MultipleLineTextInputFormat may help with this, but will they read line1+line2, line2+line3, ... or line1+line2, line3+line4?
How to manage lines when I have several mappers running?
Any other algorithm/solution? Maybe this can not be done with mapreduce?
(Pig/Hive solutions are also valid)
Thanks in advance.
You can use an approach similar to the famous 1 TB sort.
If you know the range of timestamp values in your file, you can do the following:
Mappers should map the data by some timestamp region (which will be your key).
Reducers process the data in the context of one key, and you can implement any desired logic there.
Also, a secondary sort may help to get the values sorted by timestamp in your reducer.
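A rough Java sketch of that layout, assuming integer timestamps filled at unit spacing; the region size, filler value, and class names are assumptions, and gaps that straddle a region boundary would still need extra handling (for example by also emitting boundary records to the neighbouring region).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: keys each line by a coarse timestamp region so that
// all lines of one region meet in the same reducer.
class RegionMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private static final long REGION_SIZE = 10_000L; // assumed region width
    private final LongWritable regionKey = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        long timestamp = Long.parseLong(line.toString().split(",", 2)[0]);
        regionKey.set(timestamp / REGION_SIZE);
        context.write(regionKey, line);
    }
}

// Hypothetical reducer: buffers the lines of one region, sorts them by
// timestamp, and emits a filler line for every missing timestamp between
// consecutive ones. With a proper secondary sort the in-memory sort goes away.
class GapFillReducer extends Reducer<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(LongWritable region, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        List<String> buffered = new ArrayList<>();
        for (Text line : lines) {
            buffered.add(line.toString());
        }
        buffered.sort((a, b) -> Long.compare(ts(a), ts(b)));

        String previous = null;
        for (String line : buffered) {
            if (previous != null) {
                for (long t = ts(previous) + 1; t < ts(line); t++) {
                    // Placeholder fill value; real interpolation of the data
                    // columns would go here.
                    context.write(NullWritable.get(), new Text(t + ",interpolated"));
                }
            }
            context.write(NullWritable.get(), new Text(line));
            previous = line;
        }
    }

    private static long ts(String line) {
        return Long.parseLong(line.split(",", 2)[0]);
    }
}
```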

How to handle multiline record for inputsplit?

I have a 100 TB text file with multiline records, and we are not told how many lines each record takes. One record can be 5 lines, another 6 lines, another 4 lines; the number of lines varies from record to record.
So I cannot use the default TextInputFormat. I have written my own InputFormat and a custom RecordReader, but my confusion is: when splits happen, I am not sure each split will contain a full record. Part of a record can go into split 1 and the rest into split 2, which is wrong.
So, can you suggest how to handle this scenario so that I can guarantee a full record goes into a single InputSplit?
Thanks in advance
-JE
You need to know if the records are actually delimited by some known sequence of characters.
If you know this you can set the textinputformat.record.delimiter config parameter to separate the records.
If the records aren't character delimited, you'll need some extra logic that, for example, counts a known number of fields (if there is a known number of fields) and presents that as a record. This usually makes things more complex, error-prone and slow, as there's a lot of extra text processing going on.
Try determining if the records are delimited. Perhaps posting a short example of a few records would help.
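If such a delimiter does exist, wiring it in is a one-line configuration change on a standard MapReduce job; a minimal Java sketch follows (the delimiter string itself is purely hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Minimal sketch of the delimiter suggestion: if every record ends with a
// known character sequence (the "\n##END##\n" delimiter here is made up),
// the stock TextInputFormat treats everything up to that sequence as one
// "line", so a whole multiline record reaches the mapper as a single value
// and records are never cut in half by an input split.
public class MultilineRecordJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", "\n##END##\n");
        Job job = Job.getInstance(conf, "multiline-records");
        // ... configure mapper, reducer and input/output paths as usual
    }
}
```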
In your record reader you need to define an algorithm by which you can:
Determine whether you're in the middle of a record
Scan over that record and read the next full record
This is similar to what the TextInputFormat LineReader already does - when the input split has an offset, the line record reader scans forward from that offset for the first newline it finds and then reads the next record after that newline as the first record it will emit. Tied to this, if the block length falls short of the EOF, the line record reader will read up to and past the end of the block to find the line-terminating character for the current record.

Generate multiple outputs with Hadoop Pig

I've got this file containing a list of data in Hadoop. I've built a simple Pig script which analyzes the file by the id number, and so on...
The last step I'm looking for is this: I'd like to create (store) a file for each unique id number. So this should depend on a group step... however, I haven't understood whether this is possible (maybe there is a custom store module?).
Any idea?
Thanks
Daniele
While keeping in mind what frail said, MultiStorage, in PiggyBank, seems to be what you are looking for.
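A minimal Pig sketch of the MultiStorage suggestion, assuming tab-separated input with the id in the first field; the paths, schema, and field index are placeholders.

```pig
-- Register PiggyBank to get MultiStorage (jar path is a placeholder).
REGISTER /path/to/piggybank.jar;

data = LOAD 'input/data.tsv' USING PigStorage('\t') AS (id:chararray, rest:chararray);

-- MultiStorage writes one output subdirectory per distinct value of field 0 (the id).
STORE data INTO 'output/by_id'
    USING org.apache.pig.piggybank.storage.MultiStorage('output/by_id', '0');
```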
For getting an output (a file or anything else) you need to assign the data to a variable; that's how it works with STORE. If the ids are limited and finite, you can FILTER them one by one and then STORE each result. (I always do that for action types, of which there are about 20-25.)
But if you really need a file for each unique id, then make 2 files: 1 with the whole data grouped by id, and 1 with just the unique ids. Then try generating 1 (or more, if you have too many) Pig scripts that FILTER by that id. But it's a bad solution: assuming you group 10 ids per Pig script, you would have (unique id count / 10) Pig scripts to run.
Beware that HDFS isn't good at handling too many small files.
Edit:
A better solution would be to GROUP and SORT by unique id into one big file. Then, since it's sorted, you can easily divide the contents with a third-party script.
