Can I parse the chronicle queue file - chronicle

Write data to Chronicle Queue.
This creates/updates the Chronicle Queue file, which is written as 20220525F.cq4.
Query: Is it possible to parse the file 20220525F.cq4, and what data format is used to write the file?

You can use a Chronicle Queue tailer to read the contents of a queue, or net.openhft.chronicle.queue.ChronicleQueue#dump() to dump the queue out as text, but apart from the tools offered by Chronicle-Queue there is no simple way to parse the queue file.
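A minimal sketch of the tailer and dump() approaches, assuming a recent Chronicle Queue version, that the excerpts were written as text, and that "queue-dir" (a placeholder name) is the directory holding the .cq4 files:

import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptTailer;

public class ReadQueue {
    public static void main(String[] args) {
        // "queue-dir" is the directory containing the .cq4 files (assumed name)
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("queue-dir").build()) {
            ExcerptTailer tailer = queue.createTailer();
            String excerpt;
            while ((excerpt = tailer.readText()) != null) { // assumes text excerpts were written
                System.out.println(excerpt);
            }
            // or render the whole queue, including metadata, as text
            System.out.println(queue.dump());
        }
    }
}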

If you don't want to use the built-in tools and are happy to read the file very simply, you can parse it yourself. The main tools for dumping are in Chronicle Wire if you don't want to depend on Chronicle Queue.
The file is size-prefixed bytes: the first 4 bytes are the length, followed by the data in that blob. This repeats until you reach a length of 0.
https://github.com/OpenHFT/RFC/blob/master/Size-Prefixed-Blob/Size-Prefixed-Blob-1.0.adoc
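A very rough sketch of walking those size-prefixed blobs by hand, assuming a little-endian 4-byte header; the exact layout, including the flag bits in the top of the header, is defined by the RFC above, and decoding the blob contents still needs Chronicle Wire:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class RawCq4Walk {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream("20220525F.cq4"))) {
            while (true) {
                int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
                if (b3 < 0) break;                                     // end of file
                int header = b0 | (b1 << 8) | (b2 << 16) | (b3 << 24); // little-endian header (assumed)
                int length = header & 0x3FFFFFFF;                      // low bits are the length; top bits carry flags per the RFC
                if (length == 0) break;                                // a zero length terminates the data
                byte[] blob = new byte[length];
                in.readFully(blob);                                    // the blob itself is a Chronicle Wire document
                System.out.println("blob of " + length + " bytes");
            }
        }
    }
}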

Related

IBM Filenet p8 concurrently reading document content

I want to read document content from FileNet P8 in parallel to reduce my reading time. The catch is that I write into an OutputStream. Is there any way, or any API, to parallelize my reads into an OutputStream? I am asking because I am sure IBM would have provided some way to do it.
Also, if my file is 1 GB, sequential reads are going to be a performance hit.
I think from a Document instance there's only one API to retrieve the content - accessContentStream which gives you an object of InputStream. However, for reading huge files there's a new util class called ExtendedInputStream which you might be interested in.
An ExtendedInputStream is an input stream that can retrieve content at arbitrary positions within the stream. The ExtendedInputStream class includes methods that can read a certain number of bytes from the stream or read an unspecified number of bytes. The stream keeps track of the last byte position that was read. You can specify a position in the input stream to get to a later or earlier position within the stream.
More details at: https://www.ibm.com/support/knowledgecenter/SSGLW6_5.2.1/com.ibm.p8.ce.dev.java.doc/com/filenet/api/util/ExtendedInputStream.html
Edit: ExtendedInputStream was introduced in v5.2.1 and is not available if you are using an older version of P8.
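For illustration, here is a generic parallel-chunk pattern, not the ExtendedInputStream API itself (check the Javadoc above for its real positional-read methods). It emulates a positional read with InputStream.skip, fetches fixed-size chunks concurrently, and writes them to the OutputStream in order, since an OutputStream has no notion of position:

import java.io.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Supplier;

public class ParallelChunkCopy {
    // opener stands in for whatever gives you a fresh content stream,
    // e.g. () -> document.accessContentStream() in FileNet (hypothetical usage)
    public static void copy(Supplier<InputStream> opener, long totalSize,
                            int chunkSize, OutputStream out, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<byte[]>> chunks = new ArrayList<>();
        for (long offset = 0; offset < totalSize; offset += chunkSize) {
            final long off = offset;
            final int len = (int) Math.min(chunkSize, totalSize - off);
            chunks.add(pool.submit(() -> {
                try (InputStream in = opener.get()) {
                    long skipped = 0;                       // emulate a positional read by skipping
                    while (skipped < off) {
                        long s = in.skip(off - skipped);
                        if (s <= 0) throw new EOFException("could not seek to " + off);
                        skipped += s;
                    }
                    byte[] buf = new byte[len];
                    new DataInputStream(in).readFully(buf);
                    return buf;
                }
            }));
        }
        // the OutputStream has no position, so the writes stay sequential and in order
        for (Future<byte[]> chunk : chunks) {
            out.write(chunk.get());
        }
        pool.shutdown();
    }
}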

processing GBs of data in kafka/storm

Is it possible to process GBs of data in Kafka/Storm as a single message? Files arrive every 30 minutes.
If not, can I break the message into chunks of 1 MB each and then process them in Kafka/Storm?
My files are in SEGY format (oil/gas domain), and I will call bin executables (written in C++) through Storm to process them. Can tuples be formed successfully for this file format?
Please help.
Are you sure you want to use Storm to do this processing? Seems like a batch application may be more appropriate.
Regardless, you might be able to get it to work but I would recommend having your spout split the data up into more manageable chunks that can be processed by your bolts.
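As a rough illustration of the chunking idea, assuming the org.apache.storm packages of Storm 1.x/2.x; the file path and chunk size are placeholders, and real SEGY-aware splitting would respect trace boundaries rather than slicing bytes blindly:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class FileChunkSpout extends BaseRichSpout {
    private static final int CHUNK_SIZE = 1024 * 1024; // 1 MB per tuple
    private SpoutOutputCollector collector;
    private DataInputStream in;
    private long chunkId;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            this.in = new DataInputStream(new FileInputStream("/data/input.segy")); // placeholder path
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            byte[] buf = new byte[CHUNK_SIZE];
            int read = in.read(buf);
            if (read <= 0) {
                return; // file exhausted; a real spout would pick up the next file
            }
            byte[] chunk = new byte[read];
            System.arraycopy(buf, 0, chunk, 0, read);
            // each chunk becomes one tuple; a bolt can hand it to the C++ executables
            collector.emit(new Values(chunkId++, chunk));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("chunkId", "payload"));
    }
}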

WebHDFS and SequenceFiles

Is it true WebHDFS does not support SequenceFiles?
I can't find anything that says it does. I have the usual small file problem and believe SequenceFiles would work well enough, but I need to use WebHDFS. I need to create and then append to a SequenceFile via WebHDFS.
I think it's true. There is no web API to append to a sequence file.
However you can append binary data, and if your sequence file is not block-compressed, you should be able to format your data on the client with relatively little effort. You can do it by running your input through a sequence file writer on the client, and then using the output for uploading (either the whole file, or a slice representing the delta since last append).
You can read more about the SequenceFile format here.
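A sketch of that client-side approach, assuming Text keys, BytesWritable values, and a record-compressed (not block-compressed) file as discussed above; paths, port, and key/value types are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ClientSideSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path local = new Path("file:///tmp/batch.seq"); // built on the client, not on HDFS

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(local),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // record compression (or none), since block compression breaks the append trick
                SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD))) {
            writer.append(new Text("small-file-001"), new BytesWritable(new byte[]{1, 2, 3}));
            writer.append(new Text("small-file-002"), new BytesWritable(new byte[]{4, 5, 6}));
        }

        // The resulting bytes can then be shipped over HTTP, e.g. (illustrative host/port):
        //   PUT  http://<namenode>:9870/webhdfs/v1/data/batch.seq?op=CREATE
        //   POST http://<namenode>:9870/webhdfs/v1/data/batch.seq?op=APPEND
    }
}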

hadoop - How can i use data in memory as input format?

I'm writing a MapReduce job, and I have input that I want to pass to the mappers in memory.
The usual way to pass input to the mappers is via HDFS - SequenceFileInputFormat or TextInputFormat. These input formats need files in HDFS, which are loaded and split across the mappers.
I can't find a simple way to pass, let's say, a List of elements to the mappers.
I find myself having to write these elements to disk and then use FileInputFormat.
Any solution?
I'm writing the code in Java, of course.
Thanks.
An input format does not have to load data from the disk or the file system.
There are also input formats that read data from other systems, for example HBase's TableInputFormat (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html), where the data is not assumed to sit on disk. It only has to be available via some API on all nodes of the cluster.
So you need to implement an input format that splits the data with your own logic (since there are no files, the splitting is your own task) and chops the data into records.
Please note that your in-memory data source should be distributed and available on all nodes of the cluster. You will also need some efficient IPC mechanism to pass the data from your process to the Mapper process.
I would also be glad to know what use case leads to this unusual requirement.
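As a hedged illustration of such an input format, here is a sketch that serves records stored in the job Configuration under a made-up key ("inmem.records"); a real in-memory source would need to be reachable from every node, as noted above:

import java.io.DataInput;
import java.io.DataOutput;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class InMemoryInputFormat extends InputFormat<NullWritable, Text> {

    public static final String RECORDS_KEY = "inmem.records"; // made-up configuration key

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        // a single trivial split; a real implementation would partition the data
        List<InputSplit> splits = new ArrayList<>();
        splits.add(new InMemorySplit());
        return splits;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new RecordReader<NullWritable, Text>() {
            private String[] records;
            private int index = -1;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) {
                records = ctx.getConfiguration().getStrings(RECORDS_KEY, new String[0]);
            }

            @Override public boolean nextKeyValue() { return ++index < records.length; }
            @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
            @Override public Text getCurrentValue() { return new Text(records[index]); }
            @Override public float getProgress() {
                return records.length == 0 ? 1f : (float) index / records.length;
            }
            @Override public void close() { }
        };
    }

    // a split that carries no data of its own
    public static class InMemorySplit extends InputSplit implements Writable {
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) { }
        @Override public void readFields(DataInput in) { }
    }
}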

Google Protocol Buffers - Storing messages into file

I'm using Google Protocol Buffers to serialize equity market data (i.e. timestamp, bid, ask fields).
I can store one message in a file and deserialize it without issue.
How can I store multiple messages in a single file? I'm not sure how to separate the messages, and I need to be able to append new messages to the file on the fly.
I would recommend using the writeDelimitedTo(OutputStream) and parseDelimitedFrom(InputStream) methods on Message objects. writeDelimitedTo writes the length of the message before the message itself; parseDelimitedFrom then uses that length to read only one message and no farther. This allows multiple messages to be written to a single OutputStream to then be parsed separately. For more information, see https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/MessageLite#writeDelimitedTo(java.io.OutputStream)
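A small sketch of that approach; MarketData stands in for whatever class your .proto generates, and the timestamp/bid/ask setters are assumed from the question:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class DelimitedExample {
    public static void main(String[] args) throws IOException {
        // append a new message to the end of the file on the fly
        try (FileOutputStream out = new FileOutputStream("ticks.bin", /* append = */ true)) {
            MarketData tick = MarketData.newBuilder()   // hypothetical generated class
                    .setTimestamp(System.currentTimeMillis())
                    .setBid(100.25)
                    .setAsk(100.27)
                    .build();
            tick.writeDelimitedTo(out); // writes a varint length, then the message bytes
        }

        // read every message back until the stream is exhausted
        try (FileInputStream in = new FileInputStream("ticks.bin")) {
            MarketData tick;
            while ((tick = MarketData.parseDelimitedFrom(in)) != null) {
                System.out.println(tick.getBid() + " / " + tick.getAsk());
            }
        }
    }
}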
From the docs:
http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming
Streaming Multiple Messages
If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)
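A sketch of that manual framing in Java, using CodedOutputStream/CodedInputStream from a recent protobuf-java; MarketData is again a stand-in for your generated message class:

import java.io.IOException;
import java.io.OutputStream;

import com.google.protobuf.CodedInputStream;
import com.google.protobuf.CodedOutputStream;

public class ManualFraming {
    // write a varint length prefix, then the message itself
    static void writeOne(OutputStream out, MarketData msg) throws IOException {
        CodedOutputStream coded = CodedOutputStream.newInstance(out);
        coded.writeUInt32NoTag(msg.getSerializedSize());
        msg.writeTo(coded);
        coded.flush();
    }

    // read one length-prefixed message; pushLimit confines the parse to that record
    static MarketData readOne(CodedInputStream in) throws IOException {
        if (in.isAtEnd()) return null;
        int size = in.readRawVarint32();
        int oldLimit = in.pushLimit(size);
        MarketData msg = MarketData.parseFrom(in);
        in.popLimit(oldLimit);
        return msg;
    }

    // typical usage: CodedInputStream in = CodedInputStream.newInstance(new FileInputStream("ticks.bin"));
}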
Protobuf does not include a terminator per outermost record, so you need to do that yourself. The simplest approach is to prefix the data with the length of the record that follows. Personally, I tend to write a string header (for an arbitrary field number) followed by the length as a "varint" - this means the entire document is then itself a valid protobuf and could be consumed as an object with a "repeated" element. However, just a fixed-length (typically 32-bit little-endian) marker would do just as well. With any such storage, it is appendable as you require.
If you're looking for a C++ solution, Kenton Varda submitted a patch to protobuf around August 2015 that adds support for writeDelimitedTo() and readDelimitedFrom() calls that will serialize/deserialize a sequence of proto messages to/from a file in a way that's compatible with the Java version of these calls. Unfortunately this patch hasn't been approved yet, so if you want the functionality you'll need to merge it yourself.
Another option: Google has open-sourced protobuf file reading/writing code through other projects. The or-tools library, for example, contains the classes RecordReader and RecordWriter that serialize/deserialize a proto stream to/from a file.
If you would like stand-alone versions of these classes that have almost no external dependencies, I have a fork of or-tools that contains only these classes. See: https://github.com/moof2k/recordio
Reading and writing with these classes is straightforward:
File* file = File::Open("proto.log", "w");
RecordWriter writer(file);
writer.WriteProtocolMessage(msg1);
writer.WriteProtocolMessage(msg2);
...
writer.Close();
An easier way is to base64-encode each message and store one record per line.
