Flume: HDFSEventSink - how to multiplex dynamically? - hadoop

Summary: I have a multiplexing scenario and would like to know how to multiplex dynamically - not based on a statically configured value, but based on the variable value of a field (e.g. dates).
Details:
I have an input that is separated by an entityId.
As I know the entities I am working with, I can configure this with Flume's typical multi-channel selection.
agent.sources.jmsSource.channels = chan-10 chan-11 # ...
agent.sources.jmsSource.selector.type = multiplexing
agent.sources.jmsSource.selector.header = EntityId
agent.sources.jmsSource.selector.mapping.10 = chan-10
agent.sources.jmsSource.selector.mapping.11 = chan-11
# ...
Each of the channels goes to a separate HDFSEventSink, "hdfsSink-n":
agent.sinks.hdfsSink-10.channel = chan-10
agent.sinks.hdfsSink-10.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-10.hdfs.filePrefix = entity10
# ...
agent.sinks.hdfsSink-11.channel = chan-11
agent.sinks.hdfsSink-11.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11
# ...
This generates a file per entity, which is fine.
Now I want to introduce a second variable, which is dynamic: a date. Depending on the event date, I want to create files per entity per date.
The date is a dynamic value, so I cannot preconfigure a set of sinks so that each one writes to a separate file. Also, a sink can only write to a single HDFS output.
So it seems something like a "multiple outputs" HDFSEventSink is needed (similar to Hadoop's MultipleOutputs library). Does such functionality exist in Flume?
If not, is there an elegant way to fix this or work around it? Another option is to modify HDFSEventSink; it looks like it could be implemented by building a different "realName" (String) for each event.

Actually, you can reference the variable in your HDFS sink's path or filePrefix.
For example, if the variable is stored under the key "date" in the event's headers, you can configure it like this:
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11-%{date}
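Taking that further, the static multiplexing can often be dropped entirely, because both headers can be escaped in the same sink. A minimal sketch, assuming every event carries EntityId and date headers (the single channel chan-all and the sink name hdfsSink-all are made up for illustration):
agent.sinks.hdfsSink-all.channel = chan-all
agent.sinks.hdfsSink-all.hdfs.path = hdfs://some/path/%{EntityId}
agent.sinks.hdfsSink-all.hdfs.filePrefix = entity%{EntityId}-%{date}
The %{header} escapes are resolved per event when the bucket path is built, so one sink produces separate files per entity per date.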

Related

ruby: howto use multiple boolean options and keep the code readable

I need to process a couple of boolean options, and I am trying to do it like it is usually done in C:
DICT = 0x000020000
FILTER = 0x000040000
HIGH = 0x000080000
KEEP = 0x000100000
NEXT = 0x000200000
I can now assign arbitrary options to an Integer variable and test for them:
action if (opts & (HIGH | KEEP)) != 0
But this looks ugly and gets hard to read. I would prefer writing it like
action if opts.have HIGH|KEEP
This would require adding a have method to the Integer class.
The question now is: where would I do that, in order to keep this method contained to the module where those options are used and the classes that include this module? I don't think it's a good idea to add it globally, as somebody might define another have somewhere.
Or, are there better approaches, for the general task or for the given use-case? Adding a separate Options class looks like overkill - or should I?
You can use anybits?:
action if opts.anybits?(HIGH|KEEP)
The method returns true if any bits from the given mask are set in the receiver, and false otherwise.

Spring integration SFTP - issue with filters and number of messages emits

I started using Spring Integration SFTP and I have some questions.
The filters are not working. I have this example configuration:
Sftp.inboundAdapter(ftpFileSessionFactory())
        .preserveTimestamp(true)
        .deleteRemoteFiles(false)
        .remoteDirectory(integrationProperties.getRemoteDirectory())
        .filter(sftpFileListFilter()) // doesn't work
        .patternFilter("*.xlsx") // doesn't work
And my ChainFileListFilter:
private ChainFileListFilter<ChannelSftp.LsEntry> sftpFileListFilter() {
    ChainFileListFilter<ChannelSftp.LsEntry> chainFileListFilter = new ChainFileListFilter<>();
    chainFileListFilter.addFilter(new SftpPersistentAcceptOnceFileListFilter(metadataStore(), "INT"));
    chainFileListFilter.addFilter(new SftpSimplePatternFileListFilter("*.xlsx"));
    return chainFileListFilter;
}
If I understand correctly, only the XLSX files should be saved in the local directory. If so, it doesn't work with this configuration. Am I doing something wrong, or have I misunderstood?
How can I configure SFTP so that each downloaded file emits a message? I see two parameters in the docs, max-messages-per-poll and max-fetch-size, but I don't know how to set them so that every file emits a message. I would like to sync files once every 24 hours and produce a batch job queue. Maybe there is a workaround?
Is there a built-in filter which would let me fetch only files with changed content? The best solution would be to check the files' checksums.
I will be grateful for your help and explanations.
You cannot combine filter() and patternFilter(). Only one of them can be used: the last one overrides whatever you used before. In other words, either filter() or patternFilter() - not both. By default the logic is like this:
public SftpInboundChannelAdapterSpec patternFilter(String pattern) {
    return filter(composeFilters(new SftpSimplePatternFileListFilter(pattern)));
}

private CompositeFileListFilter<ChannelSftp.LsEntry> composeFilters(FileListFilter<ChannelSftp.LsEntry> fileListFilter) {
    CompositeFileListFilter<ChannelSftp.LsEntry> compositeFileListFilter = new CompositeFileListFilter<>();
    compositeFileListFilter.addFilters(fileListFilter,
            new SftpPersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "sftpMessageSource"));
    return compositeFileListFilter;
}
So, technically you don't need your custom filter if you don't use an external persistent MetadataStore. But if you do, consider swapping the order of SftpSimplePatternFileListFilter and SftpPersistentAcceptOnceFileListFilter, since it is better to check the pattern before recording the file in the MetadataStore.
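As a sketch, using the same metadataStore() bean and "INT" prefix from the question's code (only the order changes), the chain filter would look something like this:
private ChainFileListFilter<ChannelSftp.LsEntry> sftpFileListFilter() {
    ChainFileListFilter<ChannelSftp.LsEntry> chainFileListFilter = new ChainFileListFilter<>();
    // Check the file name pattern first...
    chainFileListFilter.addFilter(new SftpSimplePatternFileListFilter("*.xlsx"));
    // ...and only record files that matched in the persistent metadata store
    chainFileListFilter.addFilter(new SftpPersistentAcceptOnceFileListFilter(metadataStore(), "INT"));
    return chainFileListFilter;
}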
In fact, every synced remote file that passes those filters is stored in the local directory, and a message for that local file is emitted as soon as the poller makes a request.
The maxFetchSize comes into play when remote files are loaded into the local directory. The maxMessagesPerPoll is used by the poller, but those messages are already built from the local files. A message is emitted per local file, not as one batch for all of them; batching is not what messaging is designed for.
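To the 24-hour requirement: that is just a matter of the poller on the inbound adapter, and a negative maxMessagesPerPoll drains every available file per poll, one message each. A minimal sketch (not from the original answer; the local directory and channel name are made-up placeholders, the other beans are the question's own):
@Bean
public IntegrationFlow sftpFlow() {
    return IntegrationFlows
            .from(Sftp.inboundAdapter(ftpFileSessionFactory())
                            .preserveTimestamp(true)
                            .remoteDirectory(integrationProperties.getRemoteDirectory())
                            .filter(sftpFileListFilter())
                            .localDirectory(new File("/tmp/sftp-local")),
                    // poll once every 24 hours and emit one message per synced local file
                    e -> e.poller(Pollers.fixedDelay(24, TimeUnit.HOURS)
                            .maxMessagesPerPoll(-1)))
            .channel("sftpFileMessages")
            .get();
}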
Please share more info about what does not work with the files. The SftpPersistentAcceptOnceFileListFilter checks not only the file name but also the file's mtime, so it is not about a checksum but rather the last-modified timestamp of the file.

ExecutionScript output two different flowfiles NIFI

I'm using ExecuteScript with Python, and I have a dataset which may contain some corrupted data. My idea is to process the good data and put it in my flow file content routed to my success relationship, and redirect the corrupted records to the failure relationship. I have done something like this:
for msg in messages:
    try:
        id = msg['id']
        timestamp = msg['time']
        value_encoded = msg['data']
        hexFrameType = '0x' + value_encoded[0:2]
        matches = re.match(regex, value_encoded)
        ....
    except:
        error_catched.append(msg)
        pass
Any idea how I can do that?
For the purposes of this answer I am assuming you have an incoming flow file called "flowFile" which you obtained from session.get(). If you simply want to inspect the contents of flowFile and then route it to success or failure based on an error occurring, then in your success path you can use:
session.transfer(flowFile, REL_SUCCESS)
And in your error path you can do:
session.transfer(flowFile, REL_FAILURE)
If instead you want new files (perhaps one containing a single "msg" in your loop above) you can use:
outputFlowFile = session.create(flowFile)
to create a new flow file using the input flow file as a parent. If you want to write to the new flow file, you can use the PyStreamCallback technique described in my blog post.
If you create a new flow file, be sure to transfer the latest version of it to REL_SUCCESS or REL_FAILURE using the session.transfer() calls described above (but with outputFlowFile rather than flowFile). Also you'll need to remove your incoming flow file (since you have created child flow files from it and transferred those instead). For this you can use:
session.remove(flowFile)

How do I make the mapper process the entire file from HDFS

This is the code where I read a file that contains HL7 messages and iterate through them using the HAPI iterator (from http://hl7api.sourceforge.net):
File file = new File("/home/training/Documents/msgs.txt");
InputStream is = new FileInputStream(file);
is = new BufferedInputStream(is);
Hl7InputStreamMessageStringIterator iter = new Hl7InputStreamMessageStringIterator(is);
I want to do this inside the map function. Obviously I need to prevent splitting in the InputFormat so that the entire file is read at once as a single value and converted to a String (the file size is 7 KB), because, as you know, HAPI can only parse a complete message.
I am a newbie to all of this, so please bear with me.
You will need to implement your own FileInputFormat subclass:
It must override the isSplitable() method to return false, which means the number of mappers will be equal to the number of input files: one input file per mapper.
You also need to implement the getRecordReader() method (createRecordReader() in the newer org.apache.hadoop.mapreduce API). The record reader is exactly the class where you need to put your parsing logic from above.
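A minimal sketch of such an input format using the newer mapreduce API (the class names WholeFileInputFormat and WholeFileRecordReader are made up for illustration): it emits the whole file as one BytesWritable value, which the mapper can wrap in a ByteArrayInputStream and hand to Hl7InputStreamMessageStringIterator.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hands each input file to exactly one mapper as a single record.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, so one mapper sees the whole file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire split into one BytesWritable value.
    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
In the map() method, new ByteArrayInputStream(value.copyBytes()) then takes the place of the FileInputStream used above.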
If you do not want your data file to be split, or you want a single mapper to process your entire file (so that one file is processed by only one mapper), then extending the MapReduce InputFormat and overriding the isSplitable() method to return false will help you.
For reference (not based on your code): https://gist.github.com/sritchie/808035
As the input comes from a text file, you can override the isSplitable() method of FileInputFormat. With this, one mapper will process the whole file.
@Override
protected boolean isSplitable(JobContext context, Path file) {
    return false;
}

Hadoop Multiple Outputs with CQL3

I need to output the results of a MR job to multiple CQL3 column families.
In my reducer, I specify the CF using MultipleOutputs, but all the results are written to the one CF defined in the job's OutputCQL statement.
Job definition:
...
job.setOutputFormatClass(CqlOutputFormat.class);
ConfigHelper.setOutputKeyspace(job.getConfiguration(), "keyspace1");
MultipleOutputs.addNamedOutput(job, "CF1", CqlOutputFormat.class, Map.class, List.class);
MultipleOutputs.addNamedOutput(job, "CF2", CqlOutputFormat.class, Map.class, List.class);
CqlConfigHelper.setOutputCql(job.getConfiguration(), "UPDATE keyspace1.CF1 SET value = ? ");
...
Reducer class setup:
mos = new MultipleOutputs(context);
Reduce method (pseudo code):
keys = new LinkedHashMap<>();
keys.put("key", ByteBufferUtil.bytes("rowKey"));
keys.put("name", ByteBufferUtil.bytes("columnName"));
List<ByteBuffer> variables = new ArrayList<>();
variables.add(ByteBufferUtil.bytes("columnValue"));
mos.write("CF2", keys, variables);
The problem is that my reducer ignores the CF I specify in mos.write() and instead just runs the output CQL. So in the example above, everything is written to CF1.
I've tried using a prepared statement to inject the CF into the output CQL, along the lines of "UPDATE keyspace1.? SET value = ?", but I don't think it's possible to use a placeholder for the CF like this.
Is there any way I can overwrite the outputCQL inside the reducer class?
So the simple answer is that you cannot output results from an MR job to multiple CFs. However, needing to do this actually highlights a flaw in the approach rather than a missing feature in Hadoop.
Instead of processing a bunch of records and trying to produce two different result sets in one pass, a better approach is to arrive at the desired result sets iteratively. Basically, this means having multiple jobs iterate over the results of previous jobs until the desired results are achieved.
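One way to read that advice, sketched with the helpers the question already uses (CqlOutputFormat, ConfigHelper, CqlConfigHelper): run a separate job per target column family, each with its own output CQL. The job names and the second update statement below are illustrative assumptions, not taken from the question.
Configuration conf = new Configuration();

// First pass: everything destined for CF1
Job jobCf1 = Job.getInstance(conf, "write-to-CF1");
jobCf1.setOutputFormatClass(CqlOutputFormat.class);
ConfigHelper.setOutputKeyspace(jobCf1.getConfiguration(), "keyspace1");
CqlConfigHelper.setOutputCql(jobCf1.getConfiguration(), "UPDATE keyspace1.CF1 SET value = ? ");
// ... set mapper, reducer, input path, key/value classes as before ...
if (!jobCf1.waitForCompletion(true)) {
    System.exit(1);
}

// Second pass: everything destined for CF2, reading the first job's output (or the original input)
Job jobCf2 = Job.getInstance(conf, "write-to-CF2");
jobCf2.setOutputFormatClass(CqlOutputFormat.class);
ConfigHelper.setOutputKeyspace(jobCf2.getConfiguration(), "keyspace1");
CqlConfigHelper.setOutputCql(jobCf2.getConfiguration(), "UPDATE keyspace1.CF2 SET value = ? ");
// ... set mapper, reducer, input path, key/value classes ...
System.exit(jobCf2.waitForCompletion(true) ? 0 : 1);
Each reducer then writes through its own job's output CQL, so MultipleOutputs is no longer needed.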
