How do I make the mapper process the entire file from HDFS - hadoop

This is the code where I read the file that contains HL7 messages and iterate through them using the HAPI iterator (from http://hl7api.sourceforge.net):
File file = new File("/home/training/Documents/msgs.txt");
InputStream is = new FileInputStream(file);
is = new BufferedInputStream(is);
Hl7InputStreamMessageStringIterator iter = new Hl7InputStreamMessageStringIterator(is);
I want to do this inside the map function. Obviously I need to prevent splitting in the InputFormat, so that the entire file is read at once as a single value and converted to a String (the file size is 7 KB), because, as you know, HAPI can only parse complete messages.
I am a newbie to all of this, so please bear with me.

You will need to implement your own FileInputFormat subclass:
It must override the isSplitable() method to return false, which means the number of mappers will equal the number of input files: one input file per mapper.
You also need to implement the getRecordReader() method (createRecordReader() in the newer mapreduce API). This is exactly the class where you need to put your parsing logic from above.
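For illustration, here is a minimal, hedged sketch of such a subclass using the newer mapreduce API, delivering each whole file as a single Text value. The class names are mine, not from the question; you could equally run the HL7 parsing inside the reader itself.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one file -> one split -> one mapper
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }
}

class WholeFileRecordReader extends RecordReader<NullWritable, Text> {

    private FileSplit fileSplit;
    private Configuration conf;
    private final Text value = new Text();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the entire file into a single value (fine for small files like 7 KB)
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.readFully(in, contents, 0, contents.length);
        }
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
    }
}

Inside map() you can then wrap value.toString().getBytes() in a ByteArrayInputStream and hand it to the Hl7InputStreamMessageStringIterator, exactly as in your snippet above.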

If you do not want your data file to be split, or you want a single mapper to process your entire file (so that one file is processed by only one mapper), then extending the map/reduce InputFormat and overriding the isSplitable() method to return false will help you.
For reference (not based on your code):
https://gist.github.com/sritchie/808035

Since the input comes from a text file, you can override the isSplitable() method of FileInputFormat. Using this, one mapper will process the whole file.
@Override
protected boolean isSplitable(JobContext context, Path file)
{
    return false;
}
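For completeness, a rough sketch of how such a non-splittable format is wired into the job driver (newer mapreduce API; Hl7Driver and Hl7Mapper are placeholder names, and WholeFileInputFormat refers to the sketch under the first answer above):

Job job = Job.getInstance(new Configuration(), "hl7-parse");
job.setJarByClass(Hl7Driver.class);
job.setInputFormatClass(WholeFileInputFormat.class);   // the non-splittable format
job.setMapperClass(Hl7Mapper.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);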

Related

How does RecordReader send data to mapper in Hadoop

I'm new to Hadoop and currently I'm learning MapReduce design patterns from Donald Miner & Adam Shook's MapReduce Design Patterns book. In this book there is the Cartesian Product pattern. My questions are:
When does the record reader send data to the mapper?
Where is the code that sends the data to the mapper?
What I see is that the next() function in the CartesianRecordReader class reads both splits without sending the data.
Here is the source code https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java
That's all, thanks in advance :)
When does the record reader send data to the mapper?
Let me answer by giving you an idea of how the mapper and the RecordReader are related. This is the Hadoop code that sends data to the mapper:
RecordReader<K1, V1> input;
K1 key = input.createKey();
V1 value = input.createValue();
while (input.next(key, value)) {
    // map pair to output
    mapper.map(key, value, output, reporter);
    if (incrProcCount) {
        reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
            SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
    }
}
Basically, Hadoop will call next() until it returns false, and at every call key and value obtain new values; the key is normally the byte offset read so far in the file and the value is the next line.
Where is the code that sends the data to the mapper?
That code is in the Hadoop source code (probably in the MapContextImpl class), but it resembles what I wrote in the snippet above.
EDIT: The source code is in MapRunner.
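For reference, in the newer org.apache.hadoop.mapreduce API the equivalent driving loop lives in Mapper.run(), which pulls each record from the context (ultimately backed by the RecordReader via MapContextImpl); it looks roughly like this:

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}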

ExecuteScript: output two different flowfiles in NiFi

I'm using ExecuteScript with Python, and I have a dataset which may contain some corrupted data. My idea is to process the good data and put it in my flowfile content routed to the success relationship, and redirect the corrupted records to the failure relationship. I have done something like this:
for msg in messages:
    try:
        id = msg['id']
        timestamp = msg['time']
        value_encoded = msg['data']
        hexFrameType = '0x' + value_encoded[0:2]
        matches = re.match(regex, value_encoded)
        ....
    except:
        error_catched.append(msg)
        pass
Any idea how I can do that?
For the purposes of this answer I am assuming you have an incoming flow file called "flowFile" which you obtained from session.get(). If you simply want to inspect the contents of flowFile and then route it to success or failure based on an error occurring, then in your success path you can use:
session.transfer(flowFile, REL_SUCCESS)
And in your error path you can do:
session.transfer(flowFile, REL_FAILURE)
If instead you want new files (perhaps one containing a single "msg" in your loop above) you can use:
outputFlowFile = session.create(flowFile)
to create a new flow file using the input flow file as a parent. If you want to write to the new flow file, you can use the PyStreamCallback technique described in my blog post.
If you create a new flow file, be sure to transfer the latest version of it to REL_SUCCESS or REL_FAILURE using the session.transfer() calls described above (but with outputFlowFile rather than flowFile). Also you'll need to remove your incoming flow file (since you have created child flow files from it and transferred those instead). For this you can use:
session.remove(flowFile)
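Putting those pieces together, here is a minimal Jython sketch under the same assumptions: the incoming flowFile comes from session.get(), good and error_catched are the lists your loop above is expected to build (good is an illustrative name), and it uses OutputStreamCallback since we only write new content; the output formatting is illustrative as well.

from org.apache.nifi.processor.io import OutputStreamCallback

class WriteMessages(OutputStreamCallback):
    def __init__(self, messages):
        self.messages = messages
    def process(self, outputStream):
        # Write one message per line into the new flow file's content
        outputStream.write(bytearray('\n'.join(map(str, self.messages)).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    # ... your parsing loop from above fills good / error_catched ...
    goodFlowFile = session.write(session.create(flowFile), WriteMessages(good))
    badFlowFile = session.write(session.create(flowFile), WriteMessages(error_catched))
    session.transfer(goodFlowFile, REL_SUCCESS)
    session.transfer(badFlowFile, REL_FAILURE)
    # Remove the original, since its children have been transferred instead
    session.remove(flowFile)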

Flume: HDFSEventSink - how to multiplex dynamically?

Summary: I have a multiplexing scenario and would like to know how to multiplex dynamically: not based on a statically configured value, but based on the variable value of a field (e.g. dates).
Details:
I have an input that is separated by an entityId.
As I know the entities that I am working with, I can configure it with the typical Flume multi-channel selection.
agent.sources.jmsSource.channels = chan-10 chan-11 # ...
agent.sources.jmsSource.selector.type = multiplexing
agent.sources.jmsSource.selector.header = EntityId
agent.sources.jmsSource.selector.mapping.10 = chan-10
agent.sources.jmsSource.selector.mapping.11 = chan-11
# ...
Each of the channels goes to a separate HDFSEventSink, "hdfsSink-n":
agent.sinks.hdfsSink-10.channel = chan-10
agent.sinks.hdfsSink-10.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-10.hdfs.filePrefix = entity10
# ...
agent.sinks.hdfsSink-11.channel = chan-11
agent.sinks.hdfsSink-11.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11
# ...
This generates a file per entity, which is fine.
Now I want to introduce a second variable, which is dynamic: a date. Depending on event date, I want to create files per-entity per-date.
Date is a dynamic value, so I cannot preconfigure a number of sinks so each one sends to a separate file. Also, you can only specify one HDFS output per Sink.
So it seems something like a "Multiple Outputs HDFSEventSink" is needed (in a similar way to Hadoop's MultipleOutputs library). Is there such functionality in Flume?
If not, is there any elegant way to fix this or work around it? Another option is to modify HDFSEventSink; it seems this could be implemented by constructing a different "realName" (String) for each event.
Actually, you can use such a variable in your HDFS sink's path or filePrefix.
For example, if the variable's key in the event's headers is "date", then you can configure it like this:
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11-%{date}
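The same header substitution also works in hdfs.path, so, assuming something upstream (e.g. an interceptor or the JMS producer) sets a "date" header on each event, you could get per-entity, per-date output with a configuration along these lines:

agent.sinks.hdfsSink-11.hdfs.path = hdfs://some/path/%{date}
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11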

Save image (via ImageWriter / FileImageOutputStream) to the filesystem without use of a File object

As a learning task I am converting software I use every day to NIO, with the somewhat arbitrary objective of having zero remaining instances of java.io.File.
I have been successful in every case except one. It seems an ImageWriter can only write to a FileImageOutputStream, which requires a java.io.File.
Path path = Paths.get(inputFileName);
InputStream is = Files.newInputStream(path, StandardOpenOption.READ);
BufferedImage bi = ImageIO.read(is);
...
Iterator<ImageWriter> iter = ImageIO.getImageWritersBySuffix("jpg");
ImageWriter writer = iter.next();
ImageWriteParam param = writer.getDefaultWriteParam();
File outputFile = new File(outputFileName);
ImageOutputStream ios = new FileImageOutputStream(outputFile);
IIOImage iioi = new IIOImage(bi, null, null);
writer.setOutput(ios);
writer.write(null, iioi, param);
...
Is there a way to do this with a java.nio.file.Path? The Java 8 API doc for ImageWriter only mentions FileImageOutputStream.
I understand there might be only symbolic value in doing this, but I was under the impression that NIO is intended to provide a complete alternative to java.io.File.
A RandomAccessFile, constructed with just a String for a filename, can be supplied to the FileImageOutputStream constructor.
This doesn't "use NIO" any more than just using the File in the first place, but it doesn't require File to be used directly.
For direct support of Path (or to "use NIO"), FileImageOutputStream (or RandomAccessFile) could be extended, or a type implementing the ImageOutputStream interface created, but how much work is it worth?
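For example, reusing outputFileName, writer, iioi and param from the question, that route might look like this (just a sketch):

RandomAccessFile raf = new RandomAccessFile(outputFileName, "rw");
ImageOutputStream ios = new FileImageOutputStream(raf);
writer.setOutput(ios);
writer.write(null, iioi, param);
ios.close();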
The intended way to instantiate an ImageInputStream or ImageOutputStream in the javax.imageio API is through the ImageIO.createImageInputStream() and ImageIO.createImageOutputStream() methods.
You will see that both of these methods take Object as their parameter. Internally, ImageIO will use a service lookup mechanism and delegate the creation to a provider able to create a stream based on the parameter. By default, there are providers for File, RandomAccessFile and InputStream.
But the mechanism is extensible. See the API doc for the javax.imageio.spi package for a starting point. If you like, you can create a provider that takes a java.nio.file.Path and creates a FileImageOutputStream based on it, or alternatively create your own implementation using some fancier NIO backing (i.e. a SeekableByteChannel).
Here's source code for a sample provider and stream I created to read images from a byte array, that you could use as a starting point.
(Of course, I have to agree with @user2864740's thoughts on the cost/benefit of doing this, but as you are doing this for the sake of learning, it might make sense.)
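As a concrete example, here is a hedged sketch of a no-File route that simply relies on the default OutputStream provider (the returned stream may still be cache-backed internally), again reusing the question's outputFileName, writer, iioi and param:

Path outputPath = Paths.get(outputFileName);
try (OutputStream os = Files.newOutputStream(outputPath);
     ImageOutputStream ios = ImageIO.createImageOutputStream(os)) {
    writer.setOutput(ios);
    writer.write(null, iioi, param);
}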

processing multiple files in minimum time

I am new to Hadoop. Basically I am writing a program which takes two multi-FASTA files (ref.fasta, query.fasta) which are 3+ GB...
ref.fasta:
>gi|12345
ATATTATAGGACACCAATAAAATT..
>gi|5253623
AATTATCGCAGCATTA...
..and so on..
query.fasta:
>query
ATTATTTAAATCTCACACCACATAATCAATACA
AATCCCCACCACAGCACACGTGATATATATACA
CAGACACA...
Now, to each mapper I need to give a single part of the ref file and the whole query file,
i.e.
>gi|12345
ATATTATAGGACACCAATA....
(a single fasta sequence from ref file)
AND the entire query file, because I want to run an exe inside the mapper which takes both of these as input.
So do I process ref.fasta outside and then give it to the mapper? Or something else?
I just need the approach that will take minimum time.
Thanks.
The best approach for your use case may be to have the query file in the distributed cache and get the File object ready in configure()/setup() to be used in map(), and to have the ref file as normal input.
You may do the following:
In your run() add the query file to the distributed cache:
DistributedCache.addCacheFile(new URI(queryFile-HDFS-Or-S3-Path), conf);
Now have the mapper class look something like the following:
public static class MapJob extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private File queryFile;

    @Override
    public void configure(JobConf job) {
        try {
            Path queryFilePath = DistributedCache.getLocalCacheFiles(job)[0];
            queryFile = new File(queryFilePath.toString());
        } catch (IOException e) {
            throw new RuntimeException("Could not load the query file from the distributed cache", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Use the queryFile object and [key, value] from your ref file here to run the exe file as desired.
    }
}
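For context, here is a hedged sketch of a matching driver using the old mapred API. It assumes a Tool/Configured driver class; MyBlastJob and the argument layout are illustrative, not from the original answer, and imports are omitted.

public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyBlastJob.class);
    conf.setJobName("ref-vs-query");

    // Make the query file available to every mapper via the distributed cache
    DistributedCache.addCacheFile(new URI(args[1]), conf);

    conf.setMapperClass(MapJob.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // ref.fasta splits
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
    return 0;
}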
I faced a similar problem.
I'd suggest you pre-process your ref file and split it into multiple files (one per sequence).
Then copy those files to a folder on HDFS that you will set as your input path in your main method.
Then implement a custom input format class and a custom record reader class. Your record reader will just pass the local file split path (as a Text value) to either the key or value parameter of your map method (a rough sketch of such a reader follows at the end of this answer).
For the query file that is required by all map functions, again add your query file to HDFS and then add it to the DistributedCache in your main method.
In your map method you'll then have access to both local file paths and can pass them to your exe.
Hope that helps.
I had a similar problem and eventually re-implemented the functionality of the BLAST exe file so that I didn't need to deal with reading files in my map method, and could instead work entirely with Java objects (Genes and Genomes) that are parsed from the input files by my custom record reader and then passed as objects to my map function.
Cheers, Wayne.
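As referenced above, here is a rough sketch of the record reader that approach describes (newer mapreduce API; names are illustrative, and the matching FileInputFormat would override isSplitable() to return false and return this reader). It emits exactly one record per split, whose value is the split's file path, which your map method can then hand to the exe:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FilePathRecordReader extends RecordReader<NullWritable, Text> {

    private Path path;
    private boolean emitted = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        path = ((FileSplit) split).getPath();
    }

    @Override
    public boolean nextKeyValue() {
        if (emitted) {
            return false;   // only one record per split
        }
        emitted = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public Text getCurrentValue() {
        return new Text(path.toString());   // the file path for this split
    }

    @Override
    public float getProgress() {
        return emitted ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
    }
}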
