Java Stream BufferedReader file stream - java-8

I am using Java 8 Streams to create a stream from a CSV file.
I am using BufferedReader.lines(). The docs for BufferedReader.lines() say:
After execution of the terminal stream operation there are no guarantees that the reader will be at a specific position from which to read the next character or line.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;

public class Streamy {
    public static void main(String args[]) {
        Reader reader = null;
        BufferedReader breader = null;
        try {
            reader = new FileReader("refined.csv");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        breader = new BufferedReader(reader);
        long l1 = breader.lines().count();
        System.out.println("Line Count " + l1); // this works correctly
        long l2 = breader.lines().count();
        System.out.println("Line Count " + l2); // this gives 0
    }
}
It looks like after reading the file for the first time, the reader does not go back to the beginning of the file. What is the way around this problem?

It looks like after reading the file for the first time, the reader does not go back to the beginning of the file.
No - and I don't know why you would expect it to given the documentation you quoted. Basically, the lines() method doesn't "rewind" the reader before starting, and may not even be able to. (Imagine the BufferedReader wraps an InputStreamReader which wraps a network connection's InputStream - once you've read the data, it's gone.)
What is the way around this problem?
Two options:
Reopen the file and read it from scratch
Save the result of lines() to a List<String>, so that you're then not reading from the file at all the second time. For example:
List<String> lines = breader.lines().collect(Collectors.toList());
As an aside, I'd strongly recommend using Files.newBufferedReader instead of FileReader - the latter always uses the platform default encoding, which isn't generally a good idea.
And for that matter, to load all the lines into a list, you can just use Files.readAllLines... or Files.lines if you want the lines as a stream rather than a list. (Note the caveats in the comments, however.)
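For example, a minimal sketch of both options (assuming refined.csv is UTF-8 encoded; adjust the charset to your file, and note the java.nio.file / java.util.stream imports are omitted here):
// Option 1: reopen the file for each pass (Files.lines returns a fresh stream)
try (Stream<String> lines = Files.lines(Paths.get("refined.csv"), StandardCharsets.UTF_8)) {
    System.out.println("Line Count " + lines.count());
}
// Option 2: read everything once and reuse the in-memory list
List<String> allLines = Files.readAllLines(Paths.get("refined.csv"), StandardCharsets.UTF_8);
System.out.println("Line Count " + allLines.size());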

Probably the cited fragment from the JavaDoc needs to be clarified. Usually you would expect that after reading the whole file the reader will point to the end of the file. But with streams it depends on whether a short-circuit terminal operation is used and whether the stream is parallel. For example, if you use
String magicLine = breader.lines()
                          .filter(str -> str.startsWith("magic"))
                          .findAny()
                          .orElse(null);
Your reader will likely stop after the first matching line (because there is no need to read further), or read the whole input file if no such line is found. If you perform the same operation on a parallel stream, the resulting position will be unpredictable, because the input will be split into implementation-dependent chunks where the search is performed. That's why the documentation is written this way.
As for workarounds, please read @JonSkeet's answer. And consider closing your streams via the try-with-resources construct.
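For instance, a minimal sketch combining Files.newBufferedReader with try-with-resources (UTF-8 is assumed here; imports from java.nio.file and java.nio.charset omitted):
try (BufferedReader br = Files.newBufferedReader(Paths.get("refined.csv"), StandardCharsets.UTF_8)) {
    long count = br.lines().count();
    System.out.println("Line Count " + count);
} // the reader is closed here even if the stream pipeline throws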

If there are no guarantees that the reader will be at a specific line, why wouldn't you create two readers?
Reader reader1 = new FileReader("refined.csv");
Reader reader2 = new FileReader("refined.csv");

Related

Batch processing flowfiles in apache nifi

I have written a custom NiFi processor which tries to batch-process input flow files.
However, it seems it is not behaving as expected. Here is what is happening:
I copy-paste some files onto the server. FetchFromServerProcessor fetches those files from the server and puts them in queue1. MyCustomProcessor reads files in batches from queue1. I have a batchSize property defined on MyCustomProcessor, and inside its onTrigger() method I am getting all flow files from queue1 in the current batch by doing the following:
session.get(context.getProperty(batchSize).asInteger())
The first line of onTrigger() creates a timestamp and adds it to all flow files, so all files in the batch should have the same timestamp. However, that is not happening: usually the first flow file gets one timestamp and the rest of the flow files get another.
It seems that when FetchFromServerProcessor fetches the first file from the server and puts it in queue1, MyCustomProcessor gets triggered and fetches everything from the queue. At that point there happens to be only a single file, which is picked up as the only file in this batch. By the time MyCustomProcessor has processed that file, FetchFromServerProcessor has fetched all the remaining files from the server and put them in queue1. So after processing the first file, MyCustomProcessor takes all the files in queue1 and forms a second batch, whereas I want all files picked up in a single batch.
How can I avoid two batches getting formed? I see that people discuss Wait/Notify in this context (1, 2), but I am not able to make quick sense of those posts. Can someone give me the minimal steps to achieve this using Wait/Notify processors, or point me to a minimal step-by-step tutorial on using them? Also, is the Wait/Notify pattern the standard approach to the batching problem I explained, or is there another standard approach to get this done?
It sounds as if this batch size is the required count of incoming flowfiles to CustomProcessor, so why not write your CustomProcessor#onTrigger() as follows:
@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    final ComponentLog logger = getLogger();
    // Try to get n flowfiles from incoming queue
    final Integer desiredFlowfileCount = context.getProperty(batchSize).asInteger();
    final int queuedFlowfileCount = session.getQueueSize().getObjectCount();
    if (queuedFlowfileCount < desiredFlowfileCount) {
        // There are not yet n flowfiles queued up, so don't try to run again immediately
        if (logger.isDebugEnabled()) {
            logger.debug("Only {} flowfiles queued; waiting for {}", new Object[]{queuedFlowfileCount, desiredFlowfileCount});
        }
        context.yield();
        return;
    }

    // If we're here, we do have at least n queued flowfiles
    List<FlowFile> flowfiles = session.get(desiredFlowfileCount);

    try {
        // TODO: Perform work on all flowfiles
        flowfiles = flowfiles.stream().map(f -> session.putAttribute(f, "timestamp", "my static timestamp value")).collect(Collectors.toList());
        session.transfer(flowfiles, REL_SUCCESS);

        // If extending AbstractProcessor, this is handled for you and you don't have to explicitly commit
        session.commit();
    } catch (Exception e) {
        logger.error("Helpful error message");
        if (logger.isDebugEnabled()) {
            logger.error("Further stacktrace: ", e);
        }
        // Penalize the flowfiles if appropriate (also done for you if extending AbstractProcessor and an exception is thrown from this method)
        session.rollback(true);
        // --- OR ---
        // Transfer to failure if they can't be retried
        session.transfer(flowfiles, REL_FAILURE);
    }
}
The Java 8 stream syntax can be replaced by this if it's unfamiliar:
for (int i = 0; i < flowfiles.size(); i++) {
    // Write the same timestamp value onto all flowfiles
    FlowFile f = flowfiles.get(i);
    flowfiles.set(i, session.putAttribute(f, "timestamp", "my timestamp value"));
}
The semantics between penalization (telling the processor to delay performing work on a specific flowfile) and yielding (telling the processor to wait some period of time to try performing any work again) are important.
You probably also want the @TriggerSerially annotation on your custom processor to ensure you do not have multiple threads running such that a race condition could arise.
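For reference, a minimal sketch of where that annotation goes (the class name and tags here are illustrative):
@TriggerSerially
@Tags({"custom", "batch"})
public class MyCustomProcessor extends AbstractProcessor {
    // property descriptors, relationships and the onTrigger() shown above
}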

Deallocate memory of reducer Iterable input

I am facing a GC overhead problem because the reducer input is huge. There is no way for me to filter out any of this input data as all entries in the Iterable argument are of use. I tried increasing the heap size on the EMR cluster where I am running this job, but even that didn't help.
Basically, my reducer takes a list of strings and converts them into a list of objects of type A. These objects are then combined to form a bigger object B. The solution I could think of was to dump the entire Iterable input onto disk and free the Iterable object completely, then read the dumped strings back from disk one at a time and keep building the larger object B, building an intermediate object A each time and releasing it before fetching the next string from disk. This way, I would keep only about half as much data on the heap as before. However, it seems I am unable to free the Iterable input, as I am still getting GC overhead errors.
The way I tried to do this is as follows:
public void reduce(final Text key, Iterable<MapWritable> inputValues,
        final Context context) throws IOException, InterruptedException
{
    BufferedWriter bw = new BufferedWriter(
            new FileWriterWithEncoding(this.fileName, this.encoding));
    for (Iterator<MapWritable> iterator = inputValues.iterator(); iterator.hasNext();)
    {
        MapWritable mapWritable = iterator.next();
        // ----
        // Put string contained in mapWritable into a file on disk
        // ----
    }
    bw.close();

    // Release the input Iterable instance
    inputValues = null; // This doesn't seem to work :'(

    BufferedReader br = new BufferedReader(new InputStreamReader(
            new FileInputStream(this.fileName), this.encoding));
    for (String line; (line = br.readLine()) != null;)
    {
        // ----
        // Then read from the file saved on disk one line at a time
        // and process it to build the object B
        // ----
    }
    br.close();
}
My question is: is there a way to deallocate the memory of the reducer's Iterable input from within the reduce method?

SuperCSV with null delimiter

I'm creating a file that isn't really a CSV file, but SuperCSV can help me make its creation easier. The structure of the file uses a different length for each line, following a layout that doesn't separate the individual fields. So, to know which information is in a line, you need to look at the first 2 characters (the name of the register), count the characters, and extract each field by its size.
I've configured SuperCSV to use an empty delimiter; however, the created file contains a space where it should have nothing.
public class TarefaGerarArquivoRegistrosFiscais implements ITarefa {

    private static final CsvPreference FORMATO_ANEXO_IV = new CsvPreference.Builder('"', '\0', "\r\n").build();

    public void processar() {
        try {
            writer = new CsvListWriter(getFileWriter(), FORMATO_ANEXO_IV);
            writer.write(geradorRegistroU1.gerar());
        } finally {
            if (writer != null)
                writer.close();
        }
    }
}
Am I doing something wrong? Is '\0' the correct code for a null char?
It's probably not what you want to hear, but I wouldn't recommend using Super CSV for this (and I'm a committer!). Its sole purpose is to deal with delimited files - and you're not using delimiters.
You could misuse Super CSV by creating a wrapper object (containing your List) whose toString() method simply concatenates all of the values together, then passing that single object to writer.write(), but it's an awful hack.
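A rough sketch of that (not recommended) hack, with hypothetical names, might look like this:
// Wrapper whose toString() glues the fixed-width fields together with no separator
public class RegistroWrapper {
    private final List<String> campos;

    public RegistroWrapper(List<String> campos) {
        this.campos = campos;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (String campo : campos) {
            sb.append(campo);
        }
        return sb.toString();
    }
}

// usage: writer.write(new RegistroWrapper(campos)); // written as a single "column"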
I'd recommend either finding another library more suited to your problem, or writing your own solution.

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is to map a shard of the database (please don't ask why I need to do that, etc.) to a mapper, then do certain operations on this data, output the results to reducers, and use that output again to do a second-phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method to send a shard of the database. You can only send data line by line using LineInputFormat and LineRecordReader. NLineInputFormat also doesn't help in this case. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with the FileSplits and all the problems associated with splitting the files.
All I need to do now is to override the nextKeyValue() method, which I don't exactly know how to do.
for (int i = 0; i < shard_size; i++) {
    if (lineRecordReader.nextKeyValue()) {
        lineValue.append(lineRecordReader.getCurrentValue().getBytes(), 0,
                lineRecordReader.getCurrentValue().getLength());
    }
}
The above code snippet is the one that I wrote, but somehow it doesn't work well.
I would suggest putting connection strings and other indications of where to find the shard into your input files.
The mapper will take this information, connect to the database, and do the job. I would not suggest converting result sets to Hadoop's Writable classes - it will hinder performance.
The problem I see to be addressed is having enough splits of this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build small splits. The second way will be more flexible.
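For illustration, a rough sketch of that idea, assuming each input line looks like "jdbc:mysql://host/db|shard_42" (the line format and class names here are hypothetical, not from the question):
public class ShardMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line carries a connection string and a shard reference
        String[] parts = value.toString().split("\\|");
        String jdbcUrl = parts[0];
        String shardId = parts[1];
        // Connect to the database here (e.g. via JDBC using jdbcUrl), process the
        // shard, and emit results as plain text rather than converting whole
        // result sets into Writable objects
        context.write(new Text(shardId), new Text("processed"));
    }
}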
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to mappers as input.
public boolean nextKeyValue() throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 5; i++) {
        if (!lineRecordReader.nextKeyValue()) {
            return false;
        }
        lineKey = lineRecordReader.getCurrentKey();
        lineValue = lineRecordReader.getCurrentValue();
        sb.append(lineValue.toString());
        sb.append(eol);
    }
    lineValue.set(sb.toString());
    //System.out.println(lineValue.toString());
    return true;
    // throw new UnsupportedOperationException("Not supported yet.");
}

Incremental write of Protocol Buffer object

I have a Protocol Buffer definition for logging data:
message Message {
    required double val1 = 1;
    optional int32 val2 = 2;
}

message BigObject {
    repeated Message message = 1;
}
I receive messages at a rate of one per second. They are stored in memory in my BigObject and used for some tasks. At the same time I want to store those messages in a file as a backup in case the application crashes. Simply writing the whole BigObject every time would be a waste of time, so I am trying to find a way to write only the messages added since the last write to the file. Is there a way to do that?
Protobuf is an appendable format, and your layout is ideal for this. Just open your file positioned at the end, and start with a new (empty) BigObject. Add/serialize just the new Message instance, and write to the file (from the end onwards).
Now, if you parse your file from the beginning you will get a single BigObject with all the Message instances (old and new).
You could actually do this by logging each individual Message as it arrives, as long as you wrap it in a BigObject each time, i.e. in pseudo-code
loop {
    msg = await NextMessage();
    wrapper = new BigObject();
    wrapper.Messages.Add(msg);
    file = OpenFileAtEnd();
    wrapper.WriteTo(file);
    file.Close();
}
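In Java, for example, a rough equivalent might look like this (assuming classes generated from the .proto above; the file name is illustrative):
Message msg = Message.newBuilder().setVal1(42.0).build();
BigObject wrapper = BigObject.newBuilder().addMessage(msg).build();

// Append only the new wrapper to the backup file; parsing the whole file
// later yields a single BigObject containing every Message written so far
try (FileOutputStream out = new FileOutputStream("backup.dat", true)) {
    wrapper.writeTo(out);
}

// Recovering after a crash:
try (FileInputStream in = new FileInputStream("backup.dat")) {
    BigObject all = BigObject.parseFrom(in);
}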
