My map function emits two different kinds of key-value pairs for the same data. Naturally, I would need two independent reduce functions to handle this. Is it possible?
Like, can I have multiple output.collect() statements at the end of map with an additional parameter specifying the reducer?
I tried looking it up but couldn't find anything.
You should consider using the MultipleOutputs class. It has nice, self-explanatory documentation.
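As a rough sketch (assuming the new org.apache.hadoop.mapreduce API; the output names "typeA"/"typeB" and the key-prefix convention are made up for illustration), a single reducer can route the two kinds of records to separately named output files:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // In the driver, declare one named output per kind of record:
    //   MultipleOutputs.addNamedOutput(job, "typeA", TextOutputFormat.class, Text.class, IntWritable.class);
    //   MultipleOutputs.addNamedOutput(job, "typeB", TextOutputFormat.class, Text.class, IntWritable.class);
    public class TaggedReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // Assumes the mapper prefixed each key with "A:" or "B:" to mark its kind
            if (key.toString().startsWith("A:")) {
                mos.write("typeA", key, new IntWritable(sum));
            } else {
                mos.write("typeB", key, new IntWritable(sum));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

Note that this still runs a single reduce function: MultipleOutputs splits the output, not the reduce logic, so the usual pattern is to tag the keys (or values) in the mapper and branch on the tag in the reducer.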
I am using the Apache Arrow Go library to read Parquet. Non-repeated columns seem straightforward, but how can I read a repeated field?
For reading repeated fields in Parquet there are really two answers: a complex way and an easy way.
The easy way is to use the pqarrow package and just read directly into an Arrow list array of some kind, letting the complexity be handled for you. (https://pkg.go.dev/github.com/apache/arrow/go/v10@v10.0.1/parquet/pqarrow)
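For instance, here is a minimal sketch against the v10 API (the file name is made up); a repeated field simply comes back as an Arrow list column with the levels already decoded:

    package main

    import (
        "context"
        "fmt"

        "github.com/apache/arrow/go/v10/arrow/memory"
        "github.com/apache/arrow/go/v10/parquet/file"
        "github.com/apache/arrow/go/v10/parquet/pqarrow"
    )

    func main() {
        // Open the Parquet file (name is illustrative)
        rdr, err := file.OpenParquetFile("data.parquet", false)
        if err != nil {
            panic(err)
        }
        defer rdr.Close()

        // Wrap it with the pqarrow reader and pull everything into an Arrow
        // table; repeated fields come back as Arrow list columns.
        arrRdr, err := pqarrow.NewFileReader(rdr, pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
        if err != nil {
            panic(err)
        }
        tbl, err := arrRdr.ReadTable(context.Background())
        if err != nil {
            panic(err)
        }
        defer tbl.Release()

        fmt.Println(tbl.Schema())
    }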
To read them the complex way, you have to understand repetition and definition levels and how Parquet uses them. Instead of trying to explain them here, I'm going to point you to the excellent write-up on the Apache Arrow blog, which explains how to decode definition and repetition levels: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/ (yes, it's in the context of the Rust implementation of Parquet, but the basic concepts are the same for the Go implementation).
All of the ColumnChunkReader types allow you to retrieve those definition and repetition levels in their ReadBatch methods. For an example, have a look at https://pkg.go.dev/github.com/apache/arrow/go/v10@v10.0.1/parquet/file#Float32ColumnChunkReader.ReadBatch
When you call ReadBatch, you can pass []int16 slices for the definition and repetition levels to be filled in alongside the data, and then use those to decode the repeated field accordingly. Personally, I prefer to use the pqarrow package, which does it for you, but sometimes you do need the granular access.
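Here is a rough sketch of the low-level route, assuming rdr is the *file.Reader from the snippet above and that column 0 is a repeated float32 field (the batch size and the printed interpretation are illustrative):

    // Grab the typed column chunk reader for column 0 of the first row group
    rgr := rdr.RowGroup(0)
    col, err := rgr.Column(0)
    if err != nil {
        panic(err)
    }
    fr := col.(*file.Float32ColumnChunkReader)

    maxDef := col.Descriptor().MaxDefinitionLevel()
    values := make([]float32, 100)
    defLvls := make([]int16, 100)
    repLvls := make([]int16, 100)

    // total is the number of levels read; valuesRead is the number of
    // non-null values actually written into the values slice.
    total, valuesRead, err := fr.ReadBatch(100, values, defLvls, repLvls)
    if err != nil {
        panic(err)
    }
    _ = valuesRead

    vi := 0
    for i := int64(0); i < total; i++ {
        if repLvls[i] == 0 {
            fmt.Println("-- new row --") // repetition level 0 starts a new row
        }
        if defLvls[i] == maxDef {
            fmt.Println(values[vi]) // fully defined: an actual list element
            vi++
        } else {
            fmt.Println("null or empty list") // below max: null/empty at some nesting level
        }
    }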
I have some measurement data (like echo measurements) that can potentially have multiple values associated with it. In other words, you have a single measurement, but multiple associated values.
Is there a standard way to represent multiple values for a single measurement as an Observation? If so, what is the best way?
I notice that an Observation can have multiple components. Should I put the LOINC code for my measurement at the Observation level and put each value at the component level? Or do I have to use extensions?
Thanks!
I am not sure exactly what your data looks like, but here are a couple of patterns:
There is the SampledData datatype, which can be used for data streams like an EKG; the specification has an example.
If you have discrete values that are all interpreted together with an observation (they can't stand alone as independent observations), then use components: Observation.code = code, Observation.value[x] is empty, Observation.component.code = code, and Observation.component.value[x] = result value. The specification has an example of this pattern.
In some cases you will have an Observation.value[x] as well.
Note that Observation.component.code is required for each component.
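A minimal sketch of that component pattern, using the well-known blood pressure LOINC codes since I don't know which codes your echo measurements use:

    {
      "resourceType": "Observation",
      "status": "final",
      "code": {
        "coding": [{ "system": "http://loinc.org", "code": "85354-9",
                     "display": "Blood pressure panel with all children optional" }]
      },
      "component": [
        {
          "code": { "coding": [{ "system": "http://loinc.org", "code": "8480-6",
                                 "display": "Systolic blood pressure" }] },
          "valueQuantity": { "value": 120, "unit": "mmHg",
                             "system": "http://unitsofmeasure.org", "code": "mm[Hg]" }
        },
        {
          "code": { "coding": [{ "system": "http://loinc.org", "code": "8462-4",
                                 "display": "Diastolic blood pressure" }] },
          "valueQuantity": { "value": 80, "unit": "mmHg",
                             "system": "http://unitsofmeasure.org", "code": "mm[Hg]" }
        }
      ]
    }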
Using components to group independent observations together is not appropriate. That grouping is done using DiagnosticReport.result or Observation.related. The DiagnosticReport resource is typically used for reporting diagnostics in response to an order.
In my case I do not need a reduce function. So I am assuming that the map function should not worry about choosing and splitting the input text file into key-value pairs.
Yes, it does. Mappers always output a key-value pair. If you don't want to use a reducer, you can write the map output directly to the file system, or you can use an identity reducer. If you're not interested in what the key is, you can just assign some default key. If you can share some more details about what you're trying to do, we can probably help you out better.
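Here is a minimal sketch of a map-only job, assuming the new org.apache.hadoop.mapreduce API (the base Mapper class used here is the identity mapper; substitute your own). Setting the reducer count to zero writes the map output straight to the output path, with no shuffle or sort:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only job");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(Mapper.class);  // identity mapper; replace with your own map logic
            job.setNumReduceTasks(0);          // zero reducers: map output is written directly to HDFS
            job.setOutputKeyClass(LongWritable.class); // TextInputFormat keys: byte offsets
            job.setOutputValueClass(Text.class);       // TextInputFormat values: the lines themselves
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }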
I have two files with different data formats in HDFS. What would a job setup look like if I needed to reduce across both data files?
e.g. imagine the common word count problem, where in one file you have the space as the word delimiter and in the other the underscore. In my approach I need different mappers for the various file formats, which then feed into a common reducer.
How to do that?
Or is there a better solution than mine?
Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: you pass in the InputFormat and, optionally, the Mapper class for each input path.
If you are looking for code examples on Google, search for "reduce-side join", which is where this method is typically used.
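A minimal driver sketch (the SpaceMapper, UnderscoreMapper, and WordCountReducer classes and the paths are made up; new-API MultipleInputs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiInputDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "multi-input word count");
            job.setJarByClass(MultiInputDriver.class);
            // Each input path gets its own InputFormat and Mapper; both mappers
            // must emit the same key/value types to feed the common reducer.
            MultipleInputs.addInputPath(job, new Path("/input/spaces"),
                    TextInputFormat.class, SpaceMapper.class);
            MultipleInputs.addInputPath(job, new Path("/input/underscores"),
                    TextInputFormat.class, UnderscoreMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path("/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }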
On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space-delimited and another that is underscore-delimited, load both with the same mapper and TextInputFormat and tokenize on both possible delimiters. Count the number of tokens in the two result sets and, in the word count example, pick the one with more tokens.
This also works if both files use the same delimiter but have a different number of standard columns. You can tokenize on commas and then see how many tokens there are: if there are, say, 5 tokens it is from data set A, and if 7 tokens it is from data set B.
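A sketch of the dual-delimiter version of that hack (the class name is made up; new API):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One mapper for both formats: tokenize each line on both candidate
    // delimiters and keep whichever split produced more tokens.
    public class EitherDelimiterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] bySpace = line.split(" ");
            String[] byUnderscore = line.split("_");
            String[] tokens = bySpace.length >= byUnderscore.length ? bySpace : byUnderscore;
            for (String t : tokens) {
                word.set(t);
                context.write(word, ONE);
            }
        }
    }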
I am working on a simple MapReduce program. I want to create a different file after the reducer for each distinct word in the key. For example, after executing the MapReduce I have something like
Priority1 x 2
Priority1 y 2
Priority1 z 2
priority2 x 2
priority2 y 2
Now I want different files after the reduce phase, named Priority1 and Priority2, which contain all these values according to the priority. I am using Java and want to know what should be written in the reducer to produce this kind of output.
I just want to know if this is even possible, or, if it is, how to approach or solve this.
I am using Hadoop 0.20.203 and hence MultipleOutputs doesn't work.
Any pointers will be helpful.
Thanks for the help!
Atul
You need to create a partitioner class first that partitions based on your criteria.
You then need to create your own OutputFormat class and a RecordWriter class. The RecordWriter needs to write to different files as per your needs. Further, if you need to sort your values, create a comparator class for your key field.
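Before building a full OutputFormat, note that a custom partitioner alone can get you most of the way, since each reducer writes its own part-r-NNNNN file. A rough sketch, assuming the new org.apache.hadoop.mapreduce API and that the priority is the first token of the key:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative partitioner: routes keys like "Priority1 x" by their first
    // token, so all records for one priority land in the same reducer
    // (and therefore in the same output file).
    public class PriorityPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String priority = key.toString().split("\\s+")[0];
            return (priority.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Register it with job.setPartitionerClass(PriorityPartitioner.class) and run with at least as many reducers as there are priorities. Hash collisions can still put two priorities into one file, which is why the custom RecordWriter approach above is the fully general solution.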
Have a look at MultipleOutputs.