ExecuteScript output two different flowfiles NIFI - apache-nifi

I'm using ExecuteScript with Python. I have a dataset which may contain some corrupted data. My idea is to process the good data and put it into my flowfile content for the success relationship, and redirect the corrupted ones to the failure relationship. I have done something like this:
for msg in messages:
    try:
        id = msg['id']
        timestamp = msg['time']
        value_encoded = msg['data']
        hexFrameType = '0x' + value_encoded[0:2]
        matches = re.match(regex, value_encoded)
        # ...
    except:
        error_catched.append(msg)
        pass
Any idea how I can do that?

For the purposes of this answer I am assuming you have an incoming flow file called "flowFile" which you obtained from session.get(). If you simply want to inspect the contents of flowFile and then route it to success or failure based on an error occurring, then in your success path you can use:
session.transfer(flowFile, REL_SUCCESS)
And in your error path you can do:
session.transfer(flowFile, REL_FAILURE)
If instead you want new files (perhaps one containing a single "msg" in your loop above) you can use:
outputFlowFile = session.create(flowFile)
to create a new flow file using the input flow file as a parent. If you want to write to the new flow file, you can use the PyStreamCallback technique described in my blog post.
If you create a new flow file, be sure to transfer the latest version of it to REL_SUCCESS or REL_FAILURE using the session.transfer() calls described above (but with outputFlowFile rather than flowFile). Also you'll need to remove your incoming flow file (since you have created child flow files from it and transferred those instead). For this you can use:
session.remove(flowFile)
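Putting those pieces together, here is a rough sketch of that pattern for this use case. It is only a sketch: it assumes Jython as the script engine and that the incoming flow file content is a JSON array of messages; the parsing in the try block is a placeholder for your own logic.
import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class WriteContent(StreamCallback):
    def __init__(self, text):
        self.text = text
    def process(self, inputStream, outputStream):
        # ignore the (empty) child content and write our own
        outputStream.write(bytearray(self.text.encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    inStream = session.read(flowFile)
    messages = json.loads(IOUtils.toString(inStream, StandardCharsets.UTF_8))
    inStream.close()
    good, bad = [], []
    for msg in messages:
        try:
            # ... your parsing / validation of msg goes here ...
            good.append(msg)
        except Exception:
            bad.append(msg)
    for payload, rel in ((good, REL_SUCCESS), (bad, REL_FAILURE)):
        if payload:
            child = session.create(flowFile)   # child of the incoming flow file
            child = session.write(child, WriteContent(json.dumps(payload)))
            session.transfer(child, rel)
    session.remove(flowFile)   # the parent is no longer needed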

Related

Spring Integration SFTP - issue with filters and number of messages emitted

I started using Spring Integration SFTP and I have some questions.
Filters are not working. I have this example configuration:
Sftp.inboundAdapter(ftpFileSessionFactory())
        .preserveTimestamp(true)
        .deleteRemoteFiles(false)
        .remoteDirectory(integrationProperties.getRemoteDirectory())
        .filter(sftpFileListFilter())   // doesn't work
        .patternFilter("*.xlsx")        // doesn't work
And my ChainFileListFilter:
private ChainFileListFilter<ChannelSftp.LsEntry> sftpFileListFilter() {
    ChainFileListFilter<ChannelSftp.LsEntry> chainFileListFilter = new ChainFileListFilter<>();
    chainFileListFilter.addFilter(new SftpPersistentAcceptOnceFileListFilter(metadataStore(), "INT"));
    chainFileListFilter.addFilter(new SftpSimplePatternFileListFilter("*.xlsx"));
    return chainFileListFilter;
}
If I understand correctly, only XLSX files should be saved to the local directory. If so, it doesn't work with this configuration. Am I doing something wrong, or have I misunderstood?
How can I configure SFTP so that each downloaded file emits a message? I see two params in the doc, max-messages-per-poll and max-fetch-size, but I don't know how to set them up so that every file emits a message. I would like to sync files once every 24 hours and produce a batch job queue. Maybe there is a workaround?
Is there a built-in filter which allows me to fetch only files with changed content? The best solution would be to check the checksums of the files.
I will be grateful for your help and explanations.
You cannot combine filter() and patternFilter(). Only one of them can be used: the last one overrides whatever you used before. In other words: either filter() or patternFilter(), not both. By default the logic is like this:
public SftpInboundChannelAdapterSpec patternFilter(String pattern) {
    return filter(composeFilters(new SftpSimplePatternFileListFilter(pattern)));
}

private CompositeFileListFilter<ChannelSftp.LsEntry> composeFilters(FileListFilter<ChannelSftp.LsEntry> fileListFilter) {
    CompositeFileListFilter<ChannelSftp.LsEntry> compositeFileListFilter = new CompositeFileListFilter<>();
    compositeFileListFilter.addFilters(fileListFilter,
            new SftpPersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "sftpMessageSource"));
    return compositeFileListFilter;
}
So, technically you don't need your custom one if you don't use an external persistent MetadataStore. But if you do, think about flipping the SftpSimplePatternFileListFilter and the SftpPersistentAcceptOnceFileListFilter, since it is better to check the pattern before storing the file into the MetadataStore.
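For example, a minimal sketch of the flipped order, reusing the metadataStore() bean from the question (and then using only .filter(sftpFileListFilter()) on the adapter, without .patternFilter()):
private ChainFileListFilter<ChannelSftp.LsEntry> sftpFileListFilter() {
    ChainFileListFilter<ChannelSftp.LsEntry> chain = new ChainFileListFilter<>();
    // pattern first, so only *.xlsx files ever reach the persistent filter ...
    chain.addFilter(new SftpSimplePatternFileListFilter("*.xlsx"));
    // ... and therefore only matching files get recorded in the external MetadataStore
    chain.addFilter(new SftpPersistentAcceptOnceFileListFilter(metadataStore(), "INT"));
    return chain;
}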
Every synced remote file that passes those filters is stored in the local dir, and the message for that local file is emitted as soon as the poller makes a request.
The maxFetchSize plays its role when remote files are loaded into the local dir. The maxMessagesPerPoll is used by the poller, but those messages are already built from the local files. The message is emitted per local file, not as a batch for all of them; that's not what messaging is designed for.
Please share more info about what does not work with the files. The SftpPersistentAcceptOnceFileListFilter checks not only the file name but also the mtime of the file. So it is not about any checksum, but rather the last modified timestamp of the file.

Rsyslog omprog pass message to scripts

More precisely, I want to filter logs and send some warning emails.
First I tried ommail, but unfortunately that module only supports mail servers which do not need authentication, and my mail server does.
So I tried omprog: I wrote a Python script that logs on to my mail server; it receives one parameter, the log message, and sends it as the mail body.
Then I hit the problem that I cannot pass the log to my script. If I try it like this, $msg is treated as a literal string:
if $fromhost-ip == "x.x.x.x" then {
    action(type="omprog"
           binary="/usr/bin/python3 /home/elancao/Python/sendmail.py $msg")
}
I searched the official doc, which has this sample:
module(load="omprog")
action(type="omprog"
       binary="/path/to/log.sh p1 p2 --param3=\"value 3\""
       template="RSYSLOG_TraditionalFileFormat")
But in that sample what they are passing is a literal string "p1", not a dynamic parameter.
Can you please help? Thanks a lot!
The expected use of omprog is for your program to read stdin and there it will find the full default RSYSLOG_FileFormat template data (with date, host, tag, msg). This is useful, as it means you can write your program so that it is started only once, and then it can loop and handle all messages as they arrive.
This cuts down on the overhead of restarting your program for each message, and makes it react faster. However, if you prefer, your program can exit after reading one line, and then rsyslog will restart it for the next message. (You may want to implement confirmMessages=on).
If you just want the msg part as data, you can use template=... in the action to specify your own minimal template.
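For example, a minimal config sketch; the template name justMsg is arbitrary, and the binary path is the script from the question (note there is no $msg argument any more, because the message arrives on stdin):
module(load="omprog")
template(name="justMsg" type="string" string="%msg%\n")
if $fromhost-ip == "x.x.x.x" then {
    action(type="omprog"
           binary="/usr/bin/python3 /home/elancao/Python/sendmail.py"
           template="justMsg")
}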
If you really must have the msg as an argument, you can use the legacy filter syntax:
^program;template
This will run program once for each message, passing it as argument the output of the template. This is not recommended.
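For completeness, here is a minimal sketch of such a long-lived script in Python, reading stdin in a loop; the SMTP host, port, credentials and addresses are placeholders you would replace:
#!/usr/bin/env python3
import sys
import smtplib
from email.message import EmailMessage

SMTP_HOST = "mail.example.com"   # placeholder
SMTP_USER = "user@example.com"   # placeholder
SMTP_PASS = "secret"             # placeholder

def send_warning(body):
    msg = EmailMessage()
    msg["Subject"] = "rsyslog warning"
    msg["From"] = SMTP_USER
    msg["To"] = "ops@example.com"
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST, 587) as smtp:
        smtp.starttls()
        smtp.login(SMTP_USER, SMTP_PASS)
        smtp.send_message(msg)

# omprog writes one formatted log line per message to our stdin;
# looping here keeps a single long-lived process instead of one per message
for line in sys.stdin:
    line = line.strip()
    if line:
        send_warning(line)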
If your omprog script is not doing anything, or is not saving anything to a file, the problem is that:
rsyslog is sending the full message to that script, so you need to define or use a template
your script needs to read the message from STDIN (and, with confirmMessages=on, return a confirmation)
An example in Perl with omprog:
# my $input = join( '-', @ARGV );   # not working - I lost 5 hours of my life on this
my $input = <STDIN>;                # this is what you need
Hope this is what the perl/python/rsyslog community needs.

spring-integration-aws dynamic file download

I have a requirement to download a file from S3 based on message content. In other words, the file to download is not known in advance; I have to search for and find it at runtime. S3StreamingMessageSource doesn't seem to be a good fit because:
It relies on polling, whereas I need to wait for the message.
I can't find any way to create a S3StreamingMessageSource dynamically in the middle of a flow. gateway(IntegrationFlow) looks interesting but what I need is a gateway(Function<Message<?>, IntegrationFlow>) that doesn't exist.
Another candidate is S3MessageHandler but it has no support for listing files which I need for finding the desired file.
I can implement my own message handler using AWS API directly, just wondering if I'm missing something, because this doesn't seem like an unusual requirement. After all, not every app just sits there and keeps polling S3 for new files.
There is S3RemoteFileTemplate with a list() function which you can use in a handle(). Then split() the result and call an S3MessageHandler for each remote file to download.
Although that handler also has functionality to download the whole remote dir.
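If it helps, here is a rough sketch of that idea; the bucket name, channel name and matching logic are made up for illustration, and it assumes spring-integration-aws with the AWS SDK v1 on the classpath:
import java.util.Arrays;
import java.util.stream.Collectors;

import com.amazonaws.services.s3.AmazonS3;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.aws.support.S3RemoteFileTemplate;
import org.springframework.integration.aws.support.S3SessionFactory;
import org.springframework.integration.dsl.IntegrationFlow;

@Configuration
public class S3LookupFlowConfig {

    @Bean
    public IntegrationFlow s3LookupFlow(AmazonS3 amazonS3) {
        S3RemoteFileTemplate template = new S3RemoteFileTemplate(new S3SessionFactory(amazonS3));
        return f -> f
                // the incoming payload is the file name we have to find
                .handle(String.class, (fileName, headers) ->
                        Arrays.stream(template.list("my-bucket"))
                                .filter(summary -> summary.getKey().endsWith(fileName))
                                .collect(Collectors.toList()))
                .split()
                // each S3ObjectSummary becomes its own message; download it here,
                // e.g. with an S3MessageHandler in DOWNLOAD mode or via the template
                .channel("s3Download");
    }
}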
For anyone coming across this question, this is what I did. The trick is to:
Set the filters later, not at construction time. Note that there is no addFilters or getFilters method, so filters can only be set once and can't be added later. @artem-bilan, this is inconvenient.
Call S3StreamingMessageSource.receive() manually.
.handle(String.class, (fileName, h) -> {
    if (messageSource instanceof S3StreamingMessageSource) {
        S3StreamingMessageSource s3StreamingMessageSource = (S3StreamingMessageSource) messageSource;
        ChainFileListFilter<S3ObjectSummary> chainFileListFilter = new ChainFileListFilter<>();
        chainFileListFilter.addFilters(
                new S3SimplePatternFileListFilter("**/*/*.json.gz"),
                new S3PersistentAcceptOnceFileListFilter(metadataStore, ""),
                new S3FileListFilter(fileName)
        );
        s3StreamingMessageSource.setFilter(chainFileListFilter);
        return s3StreamingMessageSource.receive();
    }
    log.warn("Expected: {} but got: {}.",
            S3StreamingMessageSource.class.getName(), messageSource.getClass().getName());
    return messageSource.receive();
}, spec -> spec
        .requiresReply(false) // in case all messages got filtered out
)

Flume: HDFSEventSink - how to multiplex dynamically?

Summary: I have a multiplexing scenario, and would like to know how to multiplex dynamically - not based on a statically configured value, but based on the variable value of a field (e.g. dates).
Details:
I have an input that is separated by an entityId.
Since I know the entities I am working with, I can configure them in typical Flume multi-channel selection.
agent.sources.jmsSource.channels = chan-10 chan-11 # ...
agent.sources.jmsSource.selector.type = multiplexing
agent.sources.jmsSource.selector.header = EntityId
agent.sources.jmsSource.selector.mapping.10 = chan-10
agent.sources.jmsSource.selector.mapping.11 = chan-11
# ...
Each of the channels goes to a separate HDFSEventSink, "hdfsSink-n":
agent.sinks.hdfsSink-10.channel = chan-10
agent.sinks.hdfsSink-10.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-10.hdfs.filePrefix = entity10
# ...
agent.sinks.hdfsSink-11.channel = chan-11
agent.sinks.hdfsSink-11.hdfs.path = hdfs://some/path/
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11
# ...
This generates a file per entity, which is fine.
Now I want to introduce a second variable, which is dynamic: a date. Depending on event date, I want to create files per-entity per-date.
Date is a dynamic value, so I cannot preconfigure a number of sinks so each one sends to a separate file. Also, you can only specify one HDFS output per Sink.
So it is as if a "Multiple Outputs HDFSEventSink" were needed (similar to Hadoop's MultipleOutputs library). Is there such functionality in Flume?
If not, is there any elegant way to fix or work around this? Another option is to modify HDFSEventSink; it seems this could be implemented by creating a different "realName" (String) for each event.
Actually you can use such a variable in your HDFS sink's path or filePrefix.
For example, if the variable's key is "date" in the event's headers, then you can configure it like this:
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11-%{date}
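Note that %{...} substitutes an event header, so this only works if every event actually carries a "date" header (set upstream, for example by an interceptor or by whatever populates the JMS message). A sketch, assuming such a header:
# assumes every event has a "date" header such as 2016-02-01
agent.sinks.hdfsSink-11.channel = chan-11
agent.sinks.hdfsSink-11.hdfs.path = hdfs://some/path/%{date}
agent.sinks.hdfsSink-11.hdfs.filePrefix = entity11-%{date}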

Manually populate an ImageField

I have a models.ImageField which I sometimes populate with the corresponding forms.ImageField. Sometimes, instead of using a form, I want to update the image field with an ajax POST. I am passing both the image filename and the image content (base64 encoded), so that in my api view I have everything I need. But I do not really know how to do this manually, since I have always relied on form processing, which automatically populates the models.ImageField.
How can I manually populate the models.ImageField having the filename and the file contents?
EDIT
I have reached the following status:
instance.image.save(file_name, File(StringIO(data)))
instance.save()
And this is updating the file reference, using the right value configured in upload_to in the ImageField.
But it is not saving the image. I would have imagined that the first .save call would:
Generate a file name in the configured storage
Save the file contents to the selected file, including handling of any kind of storage configured for this ImageField (local FS, Amazon S3, or whatever)
Update the reference to the file in the ImageField
And the second .save would actually save the updated instance to the database.
What am I doing wrong? How can I make sure that the new image content is actually written to disk, in the automatically generated file name?
EDIT2
I have a very unsatisfactory workaround, which is working but is very limited. This illustrates the problems that using the ImageField directly would solve:
# TODO: workaround because I do not yet know how to correctly populate the ImageField
# This is very limited because:
# - it only uses the local filesystem (no AWS S3, ...)
# - it does not provide the advanced splitting provided by upload_to
local_file = os.path.join(settings.MEDIA_ROOT, file_name)
with open(local_file, 'wb') as f:
    f.write(data)
instance.image = file_name
instance.save()
EDIT3
So, after some more playing around, I have discovered that my first implementation does the right thing, but fails silently if the passed data has the wrong format (I was mistakenly passing the base64 string instead of the decoded data). I'll post this as the solution.
Just save the file and the instance:
instance.image.save(file_name, File(StringIO(data)))
instance.save()
No idea where the docs for this use case are.
You can use InMemoryUploadedFile directly to save data:
file = cStringIO.StringIO(base64.b64decode(request.POST['file']))
image = InMemoryUploadedFile(file,
                             field_name='file',
                             name=request.POST['name'],
                             content_type="image/jpeg",
                             size=sys.getsizeof(file),
                             charset=None)
instance.image = image
instance.save()
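As a side note, the cStringIO/StringIO bits above are Python 2; on Python 3 with current Django the same idea can be written with io.BytesIO or, more simply, with ContentFile, which goes through whatever storage backend and upload_to you have configured. A sketch, assuming data is the base64 string taken from the request:
import base64
from django.core.files.base import ContentFile

def set_instance_image(instance, file_name, b64_data):
    # decode first; saving the raw base64 string silently produces a broken image
    raw = base64.b64decode(b64_data)
    # ImageField.save() writes through the configured storage (local FS, S3, ...),
    # honours upload_to, and saves the model because save defaults to True
    instance.image.save(file_name, ContentFile(raw), save=True)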
