Incremental write of Protocol Buffer object

I have Protocol Buffer for logging data.
message Message {
  required double val1 = 1;
  optional int32 val2 = 2;
}
message BigObject {
  repeated Message message = 1;
}
I receive one message per second. The messages are stored in memory in my BigObject and are used for various tasks. But at the same time I want to store those messages in a file as a backup in case the application crashes. Simply writing the whole BigObject every time would be a waste of time, so I am trying to find a way to write only the messages added since the last write. Is there a way to do that?

Protobuf is an appendable format, and your layout is ideal for this. Just open your file positioned at the end, and start with a new (empty) BigObject. Add/serialize just the new Message instance, and write to the file (from the end onwards).
Now, if you parse your file from the beginning you will get a single BigObject with all the Message instances (old and new).
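For example, restoring the backup later is a single parse call. A minimal sketch, assuming the Java classes generated from the .proto above (the file name is illustrative):
// parseFrom merges all the appended BigObject fragments into one object
BigObject all = BigObject.parseFrom(new java.io.FileInputStream("backup.bin"));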
You could actually do this by logging each individual Message as it arrives, as long as you wrap it in a BigObject each time, i.e. in pseudo-code:
loop {
    msg = await NextMessage();
    wrapper = new BigObject();
    wrapper.Messages.Add(msg);
    file = OpenFileAtEnd();
    wrapper.WriteTo(file);
    file.Close();
}
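A minimal Java sketch of that loop, assuming the classes generated from the .proto above (the repeated message field yields addMessage(...)); the backup.bin file name and the nextMessage() source are illustrative:
while (true) {
    Message msg = nextMessage(); // illustrative: however messages actually arrive
    // Wrap the single new Message in an otherwise empty BigObject
    BigObject wrapper = BigObject.newBuilder()
            .addMessage(msg)
            .build();
    // Open in append mode and write only the new bytes at the end of the file
    try (java.io.OutputStream out = new java.io.FileOutputStream("backup.bin", true)) {
        wrapper.writeTo(out);
    }
}
Because concatenated serialized messages merge on parse, reading the file back yields one BigObject containing every Message written so far.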

Related

Batch processing flowfiles in Apache NiFi

I have written a custom NiFi processor which tries to batch-process input flow files.
However, it does not seem to behave as expected. Here is what is happening:
I copy-paste some files onto the server. FetchFromServerProcessor fetches those files from the server and puts them in queue1. MyCustomProcessor reads files in batches from queue1. I have a batchSize property defined on MyCustomProcessor, and inside its onTrigger() method I get all the flow files from queue1 in the current batch by doing the following:
session.get(context.getProperty(batchSize).asInteger())
The first line of onTrigger() creates a timestamp and adds it to all the flow files, so all files in the batch should have the same timestamp. However, that is not happening. Usually the first flow file gets one timestamp and the rest of the flow files get another.
It seems that when FetchFromServerProcessor fetches the first file from the server and puts it in queue1, MyCustomProcessor gets triggered and fetches everything from the queue. At that moment there happens to be only a single file, which is picked up as the only file in this batch. By the time MyCustomProcessor has processed this file, FetchFromServerProcessor has fetched the remaining files from the server and put them in queue1. So after processing the first file, MyCustomProcessor takes all the files now in queue1 and forms a second batch, whereas I want all the files picked up in a single batch.
How can I avoid two batches getting formed? I see that people discuss wait-notify in this context: 1, 2. But I am not able to make quick sense of those posts. Can someone give me minimal steps to achieve this using wait/notify processors, or point me to a minimal step-by-step tutorial for them? Also, is the wait-notify pattern the standard approach to the batching problem I described, or is there another standard way to get this done?
It sounds as if this batch size is the required count of incoming flowfiles to CustomProcessor, so why not write your CustomProcessor#onTrigger() as follows:
@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    final ComponentLog logger = getLogger();
    // Try to get n flowfiles from the incoming queue
    final Integer desiredFlowfileCount = context.getProperty(batchSize).asInteger();
    final int queuedFlowfileCount = session.getQueueSize().getObjectCount();
    if (queuedFlowfileCount < desiredFlowfileCount) {
        // There are not yet n flowfiles queued up, so don't try to run again immediately
        if (logger.isDebugEnabled()) {
            logger.debug("Only {} flowfiles queued; waiting for {}", new Object[]{queuedFlowfileCount, desiredFlowfileCount});
        }
        context.yield();
        return;
    }
    // If we're here, we do have at least n queued flowfiles
    List<FlowFile> flowfiles = session.get(desiredFlowfileCount);
    try {
        // TODO: Perform work on all flowfiles
        flowfiles = flowfiles.stream().map(f -> session.putAttribute(f, "timestamp", "my static timestamp value")).collect(Collectors.toList());
        session.transfer(flowfiles, REL_SUCCESS);
        // If extending AbstractProcessor, this is handled for you and you don't have to explicitly commit
        session.commit();
    } catch (Exception e) {
        logger.error("Helpful error message");
        if (logger.isDebugEnabled()) {
            logger.error("Further stacktrace: ", e);
        }
        // Penalize the flowfiles if appropriate (also done for you if extending AbstractProcessor and an exception is thrown from this method)
        session.rollback(true);
        // --- OR ---
        // Transfer to failure if they can't be retried
        session.transfer(flowfiles, REL_FAILURE);
    }
}
The Java 8 stream syntax can be replaced by this if it's unfamiliar:
for (int i = 0; i < flowfiles.size(); i++) {
    // Write the same timestamp value onto all flowfiles
    FlowFile f = flowfiles.get(i);
    flowfiles.set(i, session.putAttribute(f, "timestamp", "my timestamp value"));
}
The difference in semantics between penalization (telling the processor to delay performing work on a specific flowfile) and yielding (telling the processor to wait some period of time before trying to perform any work again) is important.
You probably also want the @TriggerSerially annotation on your custom processor to ensure you do not have multiple threads running such that a race condition could arise, as sketched below.
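A minimal sketch of that annotation on the processor class (the class name is assumed from the question):
import org.apache.nifi.annotation.behavior.TriggerSerially;
import org.apache.nifi.processor.AbstractProcessor;

// Only one thread at a time will ever execute onTrigger()
@TriggerSerially
public class MyCustomProcessor extends AbstractProcessor {
    // ... properties, relationships, and the onTrigger() shown above ...
}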

peek in a parallel stream for incrementing a counter

I have a pipeline where files are processed in parallel, but I am a bit suspicious about the peek function.
File file = articles.parallelStream( )
    .map( article -> {
        File converted = converter.getFile( article ); // produce the converted file for this article
        String fileName = processer.getFriendlyName( article, locale );
        currentCount.incrementAndGet();
        return new ImmutablePair<>( fileName, converted );
    } )
    .peek( pair -> statusMessageSender.sendStatusMessage( totalCount, currentCount.get(), pair.getKey( ) ) )
    .collect( new Archiver( archivePath ) );
Reading the javadocs, I am not completely sure that the counter that is supposed to report the current progress is doing its job (basically, I am looking for assurance in the docs here):
For parallel stream pipelines, the action may be called at whatever time and in whatever thread the element is made available by the upstream operation.
It seems to me that an observer would get the current count regardless of whether the file name is correct in relation to the processing order, which is fine. But at the end of the day I find myself distrusting peek and leaning towards synchronizing on sendStatusMessage's receiver.
In the end, I am looking for a way to send status updates from a parallel stream. Any thoughts?
Initially the discussion was mostly about peek and why I was splitting the messaging part out of the mapping expression. This was more a matter of style, as I tend to favor mapping functions that do mapping and nothing more.
I can see why people would defend peek or argue against it. But the bottom line is that it consumes a value and passes it along the pipe. So, since I was looking for a side effect (sending a message), the peek function seemed perfect.
In a parallel stream the issue is that one cannot predict when peek is actually called. But there were two aspects to consider: when the message was sent was irrelevant for the problem at hand, and the message itself could be sent at any time.
In the end the counter could live in the peek part as well; the message receiver was the only real factor here. The receiver could keep its own counter, or consider only the highest count received in the time frame.
Bottom line, the question that began with suggestions around peek ended up with the following:
In terms of functionality, the peek function would do its job just fine, mainly because the sequence in the pipe was not ordered.
However, it is the message consumer that determines whether a message can be consumed correctly. Given that only one consumer was using this information and the others were not, the final conclusion was that we had a problem in the protocol design, not with the peek function. We removed the counter from the status message and the problem was gone. peek could be used in a safe way for this problem.
So it could be:
File archive = articles.parallelStream( )
    .map( article -> {
        File converted = converter.getFile( ... );
        String fileName = converter.getFriendlyName( ... );
        return new ImmutablePair<>( fileName, converted );
    } )
    .peek( pair -> statusMessageSender.sendStatusMessage( pair.getKey() ) )
    .collect( new Archiver( archivePath, deleteArchivedFiles ) );
or:
File archive = articles.parallelStream( )
    .map( article -> {
        File converted = converter.getFile( ... );
        String fileName = converter.getFriendlyName( ... );
        return new ImmutablePair<>( fileName, converted );
    } )
    .peek( pair -> statusMessageSender.sendStatusMessage( currentCount.incrementAndGet(), pair.getKey() ) )
    .collect( new Archiver( archivePath, deleteArchivedFiles ) );
But in the end it was about the protocol, not about peek. peek could definitely be used, and the unordered nature of the problem was the reason why. (Thanks for your help, people of SO.)
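For reference, here is a minimal self-contained sketch of the counter-in-peek variant; all the names and the stand-in work are illustrative, not the actual classes from the question:
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class ProgressDemo {
    public static void main(String[] args) {
        List<String> articles = Arrays.asList("a", "b", "c", "d");
        long totalCount = articles.size();
        AtomicLong currentCount = new AtomicLong();
        articles.parallelStream()
                .map(String::toUpperCase) // stand-in for the real conversion work
                .peek(name -> System.out.printf("%d/%d done: %s%n",
                        currentCount.incrementAndGet(), totalCount, name))
                .forEach(name -> { /* stand-in for archiving */ });
    }
}
The printed counts arrive out of order across threads, which is exactly why this only works when the receiver does not rely on ordering.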

ZeroMQ (clrzmq4) polling issue

What I'm trying to accomplish is to read a message from one of two sockets, whichever one it arrives on first. As far as I understand, polling (zmq_poll) is the right thing to do (as demonstrated by mspoller in the guide). Here is a small pseudo-code snippet:
TimeSpan timeout = TimeSpan.FromMilliseconds(50);
using (var receiver1 = new ZSocket(ZContext.Current, ZSocketType.DEALER))
using (var receiver2 = new ZSocket(ZContext.Current, ZSocketType.PAIR))
{
    receiver1.Bind("tcp://someaddress");
    // Note that the PAIR socket is inproc:
    receiver2.Connect("inproc://otheraddress");
    var poll = ZPollItem.CreateReceiver();
    ZError error;
    ZMessage msg;
    while (true)
    {
        if (receiver1.PollIn(poll, out msg, out error, timeout))
        {
            // ...
        }
        if (receiver2.PollIn(poll, out msg, out error, timeout))
        {
            // ...
        }
    }
}
As you can see, it is actually the exact same implementation as mspoller in the guide.
In my case receiver2 (the PAIR socket) should receive a large number of messages. In fact, I've created a test in which the number of messages sent to it is always greater than the number of messages it is capable of receiving (at least in the implementation shown).
I've run the test for 2 seconds, and I was very surprised by the results:
Number of messages sent to receiver2: 180 (by "sent" I mean that they are handed out to another PAIR socket not shown in the previous snippet);
Number of messages received by receiver2: 21 ??? Only 21 messages in 2 seconds??? 10 messages per second???
Then I tried playing with different timeout values, and I found that the timeout significantly influences the number of messages received. The duration (2 seconds) and the number of messages sent (180) remain the same. The results are:
with a timeout of 200 milliseconds, the number of messages received drops to 10 (5 per second);
with a timeout of 10 milliseconds, the number of messages received rises to 120 (60 per second).
These results tell me that polling simply does not work. If polling were working properly, then as far as I understand the mechanism, the timeout should have no influence in this scenario. Whether we set the timeout to 1 hour or to 5 milliseconds, since there are always messages to receive there is nothing to wait for, so the loop should run at the same speed.
My other big concern is that even with a very small timeout value, receiver2 is not able to receive all 180 messages. I'm struggling to reach a receive rate of 100 messages per second, even though I selected ZeroMQ, which should be very fast (benchmarks mention numbers like 6 million messages per second).
So my question is obvious: am I doing something wrong here? Is there a better way to implement polling?
While browsing the clrzmq4 code I noticed that there is also the possibility of calling the PollIn method on an enumeration of sockets (ZPollItems.cs, line 151), but I haven't found an example anywhere!
Could this be the right approach? Is there any documentation anywhere?
Thanks
I've found the problem / solution for this. Instead of using the PollIn method on each socket separately, we should use the PollIn method on an array of sockets. Obviously the example from the guide is HUGELY MISLEADING. Here's the correct approach:
TimeSpan timeout = TimeSpan.FromMilliseconds(50);
using (var receiver1 = new ZSocket(ZContext.Current, ZSocketType.DEALER))
using (var receiver2 = new ZSocket(ZContext.Current, ZSocketType.PAIR))
{
    receiver1.Bind("tcp://someaddress");
    receiver2.Connect("inproc://otheraddress");
    // We should "remember" the order of sockets within the array
    // because the order of messages in the received array will correspond to it.
    ZSocket[] sockets = { receiver1, receiver2 };
    // Note that we should use two ZPollItem instances:
    ZPollItem[] pollItems = { ZPollItem.CreateReceiver(), ZPollItem.CreateReceiver() };
    ZError error;
    ZMessage[] msg;
    while (true)
    {
        if (sockets.PollIn(pollItems, out msg, out error, timeout))
        {
            if (msg[0] != null)
            {
                // A message arrived on receiver1 (index 0 in the sockets array)
            }
            if (msg[1] != null)
            {
                // A message arrived on receiver2 (index 1 in the sockets array)
            }
        }
    }
}
Now receiver2 reaches 15,000 received messages per second, regardless of the timeout value and regardless of the number of messages received by receiver1.
UPDATE: The clrzmq4 folks have acknowledged this issue, so the example will probably be corrected soon.

Java Stream BufferedReader file stream

I am using Java 8 streams to create a stream from a CSV file.
I am using BufferedReader.lines(); the docs for BufferedReader.lines() say:
After execution of the terminal stream operation there are no guarantees that the reader will be at a specific position from which to read the next character or line.
public class Streamy {
    public static void main(String args[]) {
        Reader reader = null;
        BufferedReader breader = null;
        try {
            reader = new FileReader("refined.csv");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        breader = new BufferedReader(reader);
        long l1 = breader.lines().count();
        System.out.println("Line Count " + l1); // this works correctly
        long l2 = breader.lines().count();
        System.out.println("Line Count " + l2); // this gives 0
    }
}
It looks like after reading the file for the first time, the reader does not return to the beginning of the file. What is the way around this problem?
It looks like after reading the file for the first time, the reader does not return to the beginning of the file.
No - and I don't know why you would expect it to given the documentation you quoted. Basically, the lines() method doesn't "rewind" the reader before starting, and may not even be able to. (Imagine the BufferedReader wraps an InputStreamReader which wraps a network connection's InputStream - once you've read the data, it's gone.)
What is the way around this problem?
Two options:
Reopen the file and read it from scratch
Save the result of lines() to a List<String>, so that you're then not reading from the file at all the second time. For example:
List<String> lines = breader.lines().collect(Collectors.toList());
As an aside, I'd strongly recommend using Files.newBufferedReader instead of FileReader - the latter always uses the platform default encoding, which isn't generally a good idea.
And for that matter, to load all the lines into a list, you can just use Files.readAllLines... or Files.lines if you want the lines as a stream rather than a list. (Note the caveats in the comments, however.)
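A minimal sketch putting those suggestions together; the file name comes from the question, everything else is the standard java.nio.file API:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Stream;

public class StreamyFixed {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("refined.csv");
        // Load everything once, then count (or re-count) from memory
        List<String> lines = Files.readAllLines(path);
        System.out.println("Line Count " + lines.size());
        System.out.println("Line Count " + lines.size());
        // Or stream the file lazily, closing the underlying reader afterwards
        try (Stream<String> stream = Files.lines(path)) {
            System.out.println("Line Count " + stream.count());
        }
    }
}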
Probably the cited fragment from the JavaDoc needs to be clarified. Usually you would expect that after reading the whole file the reader will point to the end of the file. But with streams it depends on whether a short-circuit terminal operation is used and whether the stream is parallel. For example, if you use:
String magicLine = breader.lines()
        .filter(str -> str.startsWith("magic"))
        .findAny()
        .orElse(null);
Your reader will likely stop after the first matching line (because there is no need to read further), or read the whole input file if no such line is found. If you perform the same operation on a parallel stream, the resulting position will be unpredictable, because the input will be split into implementation-dependent chunks in which the search is performed. That's why the documentation is written this way.
As for workarounds, please read @JonSkeet's answer. And consider closing your streams via the try-with-resources construct.
If there are no guarantees that the reader will be at a specific line, why wouldn't you create two readers?
reader1 = new FileReader("refined.csv");
reader2 = new FileReader("refined.csv");

protobuffer file with many sub messages - one big file or imports?

We recently started using protocol buffers at the company I work for, and I was wondering what the best practice is regarding a message that holds other messages as fields.
Is it common to write everything in one big .proto file, or is it better to separate the different messages into different files and import the messages you need in the main file?
For example:
Option 1:
message A {
  message B {
    required int32 id = 1;
  }
  repeated B ids = 1;
}
Option 2:
import "B.proto";

message A {
  repeated B ids = 1;
}
And in a different file:
message B {
  required int32 id = 1;
}
It depends on your dataset and its usage.
If your data set is small, you should prefer option 1. It leads to less coding for serialization and deserialization.
If your data set is big, you should prefer option 2. If the file is too big, you can't load it completely into memory, and it will be very slow if you only need one piece of information but have to read all the information in the file.
Maybe this is helpful.
