I want to read a document's content from FileNet P8 in parallel to reduce read time. The complication is that I write into an OutputStream. Is there any way, or any API, that lets me parallelize my reads into an OutputStream? I am asking because I am sure IBM would have provided some way to do it.
Also, if my file is 1 GB, sequential reads are going to be a performance hit.
I think from a Document instance there's only one API to retrieve the content - accessContentStream which gives you an object of InputStream. However, for reading huge files there's a new util class called ExtendedInputStream which you might be interested in.
An ExtendedInputStream is an input stream that can retrieve content at arbitrary positions within the stream. The ExtendedInputStream class includes methods that can read a certain number of bytes from the stream or read an unspecified number of bytes. The stream keeps track of the last byte position that was read. You can specify a position in the input stream to get to a later or earlier position within the stream.
More details at:
https://www.ibm.com/support/knowledgecenter/SSGLW6_5.2.1/com.ibm.p8.ce.dev.java.doc/com/filenet/api/util/ExtendedInputStream.html
Edit:
ExtendedInputStream was introduced in v5.2.1 and is not available in older versions of P8.
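To make the idea of positioned reads feeding a single OutputStream concrete, here is a rough sketch. It does not use the FileNet API: it uses the JDK's FileChannel on a local file as a stand-in for any source that can be read at arbitrary positions, and the names ParallelRangeCopy, readRange and copy are made up for illustration. Whether ExtendedInputStream can be driven this way against a Content Engine is an assumption you would need to verify.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: reads fixed-size ranges in parallel, writes them in order.
class ParallelRangeCopy {

    // Reads exactly `length` bytes starting at `position` using positioned reads,
    // which do not move the channel's own position and are safe across threads.
    static byte[] readRange(FileChannel channel, long position, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, position + buffer.position());
            if (n < 0) {
                throw new IOException("Unexpected end of data at " + (position + buffer.position()));
            }
        }
        return buffer.array();
    }

    static void copy(Path source, OutputStream out, int chunkSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel channel = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = channel.size();
            List<Future<byte[]>> chunks = new ArrayList<>();
            for (long pos = 0; pos < size; pos += chunkSize) {
                final long start = pos;
                final int length = (int) Math.min(chunkSize, size - pos);
                chunks.add(pool.submit(() -> readRange(channel, start, length)));
            }
            // Reads run concurrently, but the writes happen sequentially in
            // chunk order, because an OutputStream has no notion of position.
            for (Future<byte[]> chunk : chunks) {
                out.write(chunk.get());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that this sketch queues every chunk up front, so finished chunks wait in memory until their turn to be written. For a 1 GB document you would want to cap the number of chunks in flight.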
I was getting a segment error while uploading a large file.
I have read the file data in chunks of bytes using the Read method through io.Reader. Now I need to upload the bytes of data continuously into Storj.
Storj, architected as an S3-compatible distributed object storage system, does not allow changing objects once uploaded. Basically, you can delete or overwrite, but you can't append.
You could make something that seemed like it supported append, however, using Storj as the backend. For example, by appending an ordinal number to your object's path, and incrementing it each time you want to add to it. When you want to download the whole thing, you would iterate over all the parts and fetch them all. Or if you only want to seek to a particular offset, you could calculate which part that offset would be in, and download from there.
sj://bucket/my/object.name/000
sj://bucket/my/object.name/001
sj://bucket/my/object.name/002
sj://bucket/my/object.name/003
sj://bucket/my/object.name/004
sj://bucket/my/object.name/005
Of course, this leaves unsolved the problem of what to do when multiple clients are trying to append to your "file" at the same time. Without some sort of extra coordination layer, they would sometimes end up overwriting each other's objects.
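The part-naming scheme above can be sketched as plain arithmetic. This is only an illustration: the class and method names are hypothetical, it performs no Storj calls, and the offset lookup assumes every part except the last has the same fixed size, which the scheme itself does not guarantee.

```java
import java.util.Locale;

// Hypothetical helper for the "numbered parts" layout; no Storj API involved.
class PartKeys {

    // Builds the key of the n-th part, e.g. "my/object.name/003".
    static String partKey(String objectPath, int partIndex) {
        return String.format(Locale.ROOT, "%s/%03d", objectPath, partIndex);
    }

    // Assumption: all parts (except possibly the last) are exactly partSize bytes.
    static int partIndexFor(long offset, long partSize) {
        return (int) (offset / partSize);
    }

    // Position of the requested byte inside the part that contains it.
    static long offsetWithinPart(long offset, long partSize) {
        return offset % partSize;
    }
}
```

If parts can vary in size, you would instead need to store each part's length somewhere (for example in object metadata) and sum the lengths to locate an offset.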
The question may be generic but I am trying to understand the major implications here.
I am trying to do some bytecode engineering using the BCEL library, and part of the workflow requires me to read the same bytecode file multiple times (from the beginning). The flow is the following:
// 1. Get Input Stream
// 2. Do some work
// 3. Finish
// 4. Do some other work.
At step 4, I will need to reset the mark or get the stream as though it's from beginning. I know of the following choices.
1) Wrap the stream using BufferedInputStream - chance of getting "Resetting to invalid mark" IOException
2) Wrap it using ByteArrayInputStream - it always works even though some online research suggests that it's erroneous?
3) Simply call getInputStream() if I need to read from the stream again.
I am trying to understand which option would be better for me. I don't want to use BufferedInputStream because I have no clue where the last mark is called, so calling reset for a higher mark position will cause IOException. I would prefer using ByteArrayInputStream since it requires the minimum code change for me, but could anyone suggest whether option#2 or option#3 will be better?
I know that implementations for mark() and reset() are different for ByteArrayInputStream and BufferedInputStream in JDK.
Regards
The problem of mark/reset is not only that you have to know in advance the maximum amount of data being read between these calls, you also have to know whether the code you’re delegating to will use that feature for itself internally, rendering your mark obsolete. It’s impossible for code using mark/reset to remember and restore a previous mark for the caller.
So while it would be possible to fix the maximum issue by specifying the total file size as the readlimit, you can never rely on a working mark when passing the InputStream to an arbitrary library function, unless that function explicitly documents that it never uses the mark/reset feature internally.
Also, a BufferedInputStream getting a readlimit matching the total file size would not be more efficient than a ByteArrayInputStream wrapping an array holding the entire file, as both end up maintaining a buffer of the same size.
The best solution would be to read the entire class file into an array once and directly use the array, e.g. for code under your control or when you have a choice regarding the library (ASM’s ClassReader supports using a byte array instead of an InputStream, for example).
If you have to feed an InputStream to a library function insisting on it, like BCEL, then wrap the byte array into a ByteArrayInputStream when needed, but create a new ByteArrayInputStream each time you have to re-parse the class file. Constructing the new ByteArrayInputStream costs almost nothing, as it is a lightweight wrapper, and it is reliable, as it does not depend on the state of an older input stream in any way. You could even have multiple ByteArrayInputStream instances reading the same array at the same time.
Calling getInputStream() again would be an option if you had to deal with really large files for which buffering the entire contents is not feasible; however, this is not the case for class files.
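A minimal sketch of the "one array, fresh wrapper per pass" approach follows. The consume method is a hypothetical stand-in for whatever library call insists on an InputStream, and the four bytes are placeholder data. In real code the array would come from reading the class file once, for example with Files.readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReparseFromArray {

    // Stand-in for a library call that reads a stream to the end
    // and may call mark()/reset() internally however it likes.
    static int consume(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content; normally the bytes of the class file, read once.
        byte[] classBytes = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};

        // Each pass gets its own wrapper, so it always starts at offset 0
        // regardless of what the previous pass did to its stream.
        int firstPass = consume(new ByteArrayInputStream(classBytes));
        int secondPass = consume(new ByteArrayInputStream(classBytes));

        System.out.println(firstPass + " " + secondPass);
    }
}
```

The wrappers share the underlying array without copying it, so the extra cost per pass is a single small object.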
I have a fairly big flow which takes a CSV and eventually converts it to SQL statements (via Avro and JSON).
For a file of 5GB, flowfile_repo (while processing) went up to 24 GB and content_repo to 18 GB.
content_repo max 18 GB
flowfile_repo max 26 GB
Is there a way to predict how much space I would need for processing N files?
Why does it take so much space?
The flow file repo is check-pointed every 2 minutes by default, and is storing the state of every flow file as well as the attributes of every flow file. So it really depends how many flow files and how many attributes per flow file are being written during that 2 min window, as well as how many processors the flow files are passing through and how many of them are modifying the attributes.
The content repo is storing content claims, where each content claim contains the content of one or more flow files. Periodically there is a clean up thread that runs and determines if a content claim can be cleaned up. This is based on whether or not you have archiving enabled. If you have it disabled, then a content claim can be cleaned up when no active flow files reference any of the content in that claim.
The flow file content also follows a copy-on-write pattern, meaning the content is immutable and when a processor modifies the content it is actually writing a new copy. So if you had a 5GB flow file and it passed through a processor that modified the content like ReplaceText, it would write another 5GB to the content repo, and the original one could be removed based on the logic above about archiving and whether or not any flow files reference that content.
If you are interested in more info, there is an in-depth document about how all this works here:
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2 GB from what I can see), I am no longer able to write the file to disk and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.
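The bucketing rule above boils down to a small path computation. The sketch below shows it in Java for illustration only (the logic carries over to Ruby directly). The names BucketPath and bucketFor are hypothetical, and it assumes the key contains only characters that are safe in file names.

```java
// Hypothetical helper: maps a record's name to the store file that holds it.
class BucketPath {

    // Uses up to prefixLength leading characters of the name as the bucket,
    // and the first character as the directory, e.g. "r/rojj-datafile.pstore".
    static String bucketFor(String name, int prefixLength) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        String prefix = name.substring(0, Math.min(prefixLength, name.length()));
        return prefix.charAt(0) + "/" + prefix + "-datafile.pstore";
    }
}
```

A longer prefix spreads the data over more, smaller files. How long it needs to be depends on how evenly the names are distributed.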
I'm using Google Protocol Buffers to serialize equity market data (i.e. timestamp, bid, ask fields).
I can store one message into a file and deserialize it without issue.
How can I store multiple messages into a single file? Not sure how I can separate the messages. I need to be able to append new messages to the file on the fly.
I would recommend using the writeDelimitedTo(OutputStream) and parseDelimitedFrom(InputStream) methods on Message objects. writeDelimitedTo writes the length of the message before the message itself; parseDelimitedFrom then uses that length to read only one message and no farther. This allows multiple messages to be written to a single OutputStream to then be parsed separately. For more information, see https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/MessageLite#writeDelimitedTo(java.io.OutputStream)
From the docs:
http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming
Streaming Multiple Messages
If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)
Protobuf does not include a terminator per outermost record, so you need to do that yourself. The simplest approach is to prefix the data with the length of the record that follows. Personally, I tend to write a string header (for an arbitrary field number), then the length as a "varint". This means the entire document is then itself a valid protobuf and could be consumed as an object with a "repeated" element. However, a fixed-length marker (typically 32-bit little-endian) would do just as well. With any such storage, the file is appendable, as you require.
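The fixed-length variant can be sketched with the JDK alone. This is an illustration, not protobuf's own framing: the class and method names are hypothetical, the payload is treated as opaque bytes (in practice, what a message's toByteArray() returns), and the format is a 4-byte little-endian length followed by the payload. It is not compatible with writeDelimitedTo, which uses a varint length.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical framing helper: [4-byte little-endian length][payload] per record.
class LengthPrefixed {

    static void writeRecord(OutputStream out, byte[] payload) throws IOException {
        byte[] header = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(payload.length)
                .array();
        out.write(header);
        out.write(payload);
    }

    // Returns the next payload, or null when the stream ends cleanly
    // between records. A record cut off midway raises EOFException.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int first = in.read();
        if (first == -1) {
            return null;
        }
        byte[] header = new byte[4];
        header[0] = (byte) first;
        in.readFully(header, 1, 3);
        int length = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (length < 0) {
            throw new IOException("Corrupt length header: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }
}
```

Appending works by opening the file in append mode and calling writeRecord again. Existing records are never touched.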
If you're looking for a C++ solution, Kenton Varda submitted a patch to protobuf around August 2015 that adds support for writeDelimitedTo() and readDelimitedFrom() calls that will serialize/deserialize a sequence of proto messages to/from a file in a way that's compatible with the Java version of these calls. Unfortunately this patch hasn't been approved yet, so if you want the functionality you'll need to merge it yourself.
Another option: Google has open-sourced protobuf file reading/writing code through other projects. The or-tools library, for example, contains the classes RecordReader and RecordWriter that serialize/deserialize a proto stream to a file.
If you would like stand-alone versions of these classes that have almost no external dependencies, I have a fork of or-tools that contains only these classes. See: https://github.com/moof2k/recordio
Reading and writing with these classes is straightforward:
File* file = File::Open("proto.log", "w");
RecordWriter writer(file);
writer.WriteProtocolMessage(msg1);
writer.WriteProtocolMessage(msg2);
...
writer.Close();
An easier way is to Base64-encode each message and store it as one record per line.
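A rough sketch of the one-record-per-line idea, using only the JDK's Base64 class. The names Base64Lines, append and readAll are hypothetical, and the byte arrays stand in for serialized messages.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical helper: one Base64-encoded record per line of text.
class Base64Lines {

    // The basic encoder never emits line breaks, so '\n' is a safe separator.
    static void append(Writer out, byte[] message) throws IOException {
        out.write(Base64.getEncoder().encodeToString(message));
        out.write('\n');
    }

    // Decodes every line back into the original bytes, in file order.
    static List<byte[]> readAll(Reader in) throws IOException {
        List<byte[]> messages = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            messages.add(Base64.getDecoder().decode(line));
        }
        return messages;
    }
}
```

The trade-off is size: Base64 output is about a third larger than the raw bytes, in exchange for a file that line-oriented tools can handle.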