I have two files with different data formats in HDFS. What would a job setup look like if I needed to reduce across both data files?
For example, imagine the common word count problem, where one file uses a space as the word delimiter and the other an underscore. In my approach I would need a different mapper for each file format, both feeding into a common reducer.
How can I do that?
Or is there a better solution than mine?
Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: for each input path you pass in the InputFormat and, optionally, the Mapper class.
If you are looking for code examples on Google, search for "reduce-side join"; that is where this method is typically used.
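A minimal sketch of that wiring for the word count case, using the newer org.apache.hadoop.mapreduce API (the class names and argument layout here are mine, not from any particular example):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiFormatWordCount {

      // Mapper for the space-delimited file.
      public static class SpaceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String w : value.toString().split(" "))
            if (!w.isEmpty()) ctx.write(new Text(w), ONE);
        }
      }

      // Mapper for the underscore-delimited file.
      public static class UnderscoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          for (String w : value.toString().split("_"))
            if (!w.isEmpty()) ctx.write(new Text(w), ONE);
        }
      }

      // Common reducer: sums counts coming from both mappers.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-format word count");
        job.setJarByClass(MultiFormatWordCount.class);
        // One mapper per input path/format, all feeding the same reducer.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SpaceMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, UnderscoreMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }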
On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space delimited and another that is underscore delimited, load both with the same mapper and TextInputFormat and tokenize on both possible delimiters. Then compare the number of tokens in the two result sets; in the word count example, pick the one with more tokens.
This also works if both files use the same delimiter but have a different number of columns. You can tokenize on the comma and then count the tokens: if there are, say, 5 tokens the record is from data set A; if there are 7, it is from data set B.
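In code, the hack could look roughly like this (a single mapper used for both inputs with plain TextInputFormat; the names are mine):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Tokenize each line on both candidate delimiters and keep
    // whichever split produced more tokens.
    public class DualDelimiterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      public void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String line = value.toString();
        String[] bySpace = line.split(" ");
        String[] byUnderscore = line.split("_");
        String[] words = bySpace.length >= byUnderscore.length ? bySpace : byUnderscore;
        for (String w : words)
          if (!w.isEmpty()) ctx.write(new Text(w), ONE);
      }
    }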
I have several TSV tables (of the characters below ASCII 32, only common ones such as '\a\b\t\n\v\f\r\e' occur). I'd like to put them into a single stream, and I think ASCII control characters could be used to separate them. But I am not sure which ASCII control character (other than the ones already in use, listed above) is standard for this purpose. Does anybody know what the standard is?
I don't know of standards for what you suggest.
Many protocols use envelopes with a count for what is to follow (TCP) or an index with offsets and lengths (tar, zip) or an arbitrary unique separator that is defined in the header (multipart MIME). There are some that are a simple series like yours but that have formatting that makes item separation obvious (XML element stream [which differs from a document because it has no root element]).
Is the idea that you might not know from the start how many tables and/or rows there are? (That is the use case for XML element streams.)
␀ seems reasonable, perhaps with a final extra ␀ to mark the end of the stream (as in a DOS/Windows program environment block).
An empty line made with ␊ also seems reasonable (as in SMTP).
␚ (^Z) is a common file terminator from the days of CP/M and DOS.
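To illustrate the ␀ variant, a minimal sketch in Java (the method is mine); since NUL cannot occur in TSV text, the reader can simply split the stream on it:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class TableStream {
      // Write each TSV table followed by a NUL byte; the final extra NUL
      // (an empty "table") marks the end of the stream.
      static void writeTables(OutputStream out, List<String> tsvTables) throws IOException {
        for (String table : tsvTables) {
          out.write(table.getBytes(StandardCharsets.UTF_8));
          out.write(0);
        }
        out.write(0); // end-of-stream marker
      }
    }

On the read side, splitting the whole stream on '\u0000' recovers the tables (Java's String.split drops the trailing empty pieces).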
In a Hadoop Cascading flow, I have a number of tuples that are processed and finally sunk into a destination.
Now my requirement is to sink those tuples into the destination file with certain constant String values at the beginning and at the end.
For example, I have the following input tuples:
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Now I need output like this:
Certain data before those data
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Certain data after those data
I have searched a bit around the class DelimitedParser and its methods like joinLine and joinFirstLine, but due to the sparse documentation I am unable to work out exactly how to use them.
It may depend on what "Certain data before those data" means.
If you are using TextDelimited, you can have the header values written in the sink. Per the documentation, header values are not written by default, so you will need to enable that. Another thing to remember is that the header values are the names of the output fields.
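If I read the Cascading 2.x javadoc right, enabling it on the sink looks roughly like this (field names and path are placeholders):

    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    // hasHeader = true: on a sink this writes the field names as the
    // first line, so name the output fields to carry your leading text.
    Fields fields = new Fields("f1", "f2", "f3");
    Tap sink = new Hfs(new TextDelimited(fields, true, "|"), "output/path", SinkMode.REPLACE);

Note that this only covers the line before the data; for a trailing line after the data you would most likely need a custom Scheme.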
-Amit
In a scenario we have a bunch of files that we want to read from bottom to top, that is, in reverse order: last line first, then the second-to-last line, and so on.
Looking into the Hadoop API, there are a bunch of RecordReader classes, such as LineRecordReader, which leverages LineReader.
Basically I'd require a ReverseLineRecordReader leveraging a ReverseLineReader. The ReverseLineReader would then read the lines of the input splits in reverse order.
This would be very useful when you have a large file sorted on some key and need both the first and the last entry for that key: you would scan once top-down and once bottom-up.
Since I guess this is not very exotic, but I couldn't find any implementation, I was wondering if someone could help out here.
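To illustrate what I mean, here is a rough sketch of the core of such a ReverseLineReader (not split-aware and not safe for multi-byte encodings; all names are mine):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Walk backwards from the end of the file, byte by byte,
    // and emit the lines in reverse order.
    public class ReverseLineReader implements AutoCloseable {
      private final RandomAccessFile file;
      private long pos;

      public ReverseLineReader(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        pos = file.length() - 1;
        // Skip a trailing newline so the last line is returned intact.
        if (pos >= 0) {
          file.seek(pos);
          if (file.read() == '\n') pos--;
        }
      }

      // Returns the next line reading backwards, or null at start of file.
      public String readLine() throws IOException {
        if (pos < 0) return null;
        StringBuilder sb = new StringBuilder();
        while (pos >= 0) {
          file.seek(pos--);
          int b = file.read();
          if (b == '\n') break;
          if (b != '\r') sb.append((char) b);
        }
        return sb.reverse().toString();
      }

      @Override
      public void close() throws IOException {
        file.close();
      }
    }

A ReverseLineRecordReader would wrap this and hand out (offset, line) pairs just like LineRecordReader does, only starting from the end of the split.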
How can I use WholeFileInputFormat with many files as input?
Many files as one file...
FileInputFormat.addInputPaths(job, String...) doesn't seem to work properly.
You need to override isSplitable in your InputFormat to return false so that an input file doesn't get split and is processed by just one mapper. One small suggestion, though: you could give SequenceFile a try. Combine the multiple files you are trying to process into a single SequenceFile and then process that. It would be more efficient, since SequenceFiles are already in key/value form.
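A sketch of that, following the well-known whole-file pattern (the record reader hands each file to the mapper as a single record):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false; // one whole file per mapper
      }

      @Override
      public RecordReader<NullWritable, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
      }

      // Reads the entire file as a single (NullWritable, BytesWritable) record.
      static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private TaskAttemptContext context;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
          this.split = (FileSplit) split;
          this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (processed) return false;
          byte[] contents = new byte[(int) split.getLength()];
          Path file = split.getPath();
          FileSystem fs = file.getFileSystem(context.getConfiguration());
          FSDataInputStream in = fs.open(file);
          try {
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
          } finally {
            IOUtils.closeStream(in);
          }
          processed = true;
          return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() {}
      }
    }

With isSplitable returning false, FileInputFormat.addInputPaths() can list as many inputs as you like; each matched file just goes whole to one map task.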
I need to generate a flat file with several different sections, each with different record structures. All data is delimited text, single line per record. What would be a good delimiting sequence, or mechanism to differentiate sections, given that records can contain line feeds etc. within quoted text fields?
Well, provided you don't need to edit the file with an ordinary text editor or keep it human-readable, you could use one of the four C0 control codes, ASCII characters 28–31 (the file, group, record and unit separators), which are meant precisely for delimiting text records. They just never caught on, largely because of those first two points.
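As an illustration, a small sketch (class and method names are mine) using GS (29) between sections and RS (30) between records; since nothing is ever split on line feeds, embedded newlines inside quoted fields are safe:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class SectionWriter {
      private static final char GS = 0x1D; // group separator: between sections
      private static final char RS = 0x1E; // record separator: between records

      static void write(OutputStream out, List<List<String>> sections) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int s = 0; s < sections.size(); s++) {
          if (s > 0) sb.append(GS);
          List<String> records = sections.get(s);
          for (int r = 0; r < records.size(); r++) {
            if (r > 0) sb.append(RS);
            sb.append(records.get(r)); // may contain quoted '\n'
          }
        }
        out.write(sb.toString().getBytes(StandardCharsets.UTF_8));
      }
    }

Reading it back is the mirror image: split on "\u001D" for sections, then on "\u001E" for records.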
You haven't specified what you intend to do with the file, what language you use, etc.
If the file will contain different sections with different structures, I would suggest using YAML.
There are many libraries that allow reading and writing YAML.
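For instance, with the SnakeYAML library (assuming that choice; the section contents below are made up), each section becomes its own YAML document, separated by "---" in the stream:

    import java.util.Arrays;
    import java.util.Map;
    import org.yaml.snakeyaml.Yaml;

    public class YamlSections {
      public static void main(String[] args) {
        Yaml yaml = new Yaml();
        // Each section is a separate YAML document with its own structure.
        Map<String, Object> header = Map.of("type", "header", "version", 1);
        Map<String, Object> body = Map.of("type", "records",
            "rows", Arrays.asList("a|b|c", "d|e|f"));
        String out = yaml.dumpAll(Arrays.asList(header, body).iterator());
        System.out.println(out);

        // Reading back: loadAll() yields one object per document.
        for (Object doc : yaml.loadAll(out))
          System.out.println(doc);
      }
    }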