Camel Bindy: parallel writing - parallel-processing

I have a collection of objects that I would like to serialize into the same CSV file.
What is the fastest way to write these objects to the same file?
Is using parallelProcessing() a safe approach?

I'd rather implement it with two routes: one that collects the data with parallel processing, and a second one that writes the collected data to the CSV file. The latter is, of course, not parallel. A sketch of this layout is shown below.
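A minimal sketch of that two-route layout, assuming a recent Camel version where BindyCsvDataFormat takes the model class; the endpoint URIs, file name, and record fields are illustrative:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.annotation.CsvRecord;
import org.apache.camel.dataformat.bindy.annotation.DataField;
import org.apache.camel.dataformat.bindy.csv.BindyCsvDataFormat;

public class CsvRoutes extends RouteBuilder {

    @CsvRecord(separator = ",")
    public static class MyRecord {
        @DataField(pos = 1)
        private String id;
        @DataField(pos = 2)
        private double value;
    }

    @Override
    public void configure() {
        // Route 1: split the incoming collection and process the elements in parallel.
        from("direct:records")
            .split(body()).parallelProcessing()
                // per-element transformation/enrichment would go here
                .to("seda:collected")
            .end();

        // Route 2: a single seda consumer, so only one thread writes the CSV file.
        from("seda:collected")
            .marshal(new BindyCsvDataFormat(MyRecord.class))
            .to("file:target/output?fileName=records.csv&fileExist=Append");
    }
}

Because a seda: endpoint has a single consumer by default, the file is appended to by one thread only, so the parallel splitter cannot interleave partial lines.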

Related

Why does a Beam IO need beam.AddFixedKey + beam.GroupByKey to work properly?

I'm working on a Beam IO for Elasticsearch in Go. At the moment I have a working draft version, but I only managed to make it work by doing something that isn't clear to me.
Basically, I looked at existing IOs and found that writes only work if I add the following:
x := beam.AddFixedKey(s, pColl)
y := beam.GroupByKey(s, x)
A full example is in the existing BigQuery IO.
Basically, I would like to understand why I need AddFixedKey followed by GroupByKey to make it work. I also checked issue BEAM-3860, but it doesn't have much more detail about it.
Those two transforms essentially function as a way to group all elements of a PCollection into one list. For example, their usage in the BigQuery example you posted groups the entire input PCollection into a list that gets iterated over in the ProcessElement method.
Whether to use this approach depends on how you are implementing the IO. The BigQuery example you posted performs its writes as a single batch once all elements are available, but that may not be the best approach for your use case. You might prefer to write elements one at a time as they come in, especially if you can parallelize writes among different workers. In that case you would want to avoid grouping the input PCollection together.
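For illustration, here is a minimal sketch (not the actual Elasticsearch IO) of how the grouping pattern is typically wired in the Go SDK: AddFixedKey gives every element the same key (0), and GroupByKey then collapses the whole PCollection into one iterable that a single bundle can batch-write. The esDoc type, writeFn, and bulkIndex helper are illustrative assumptions.

package esio

import (
	"context"
	"reflect"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
)

func init() {
	beam.RegisterType(reflect.TypeOf((*esDoc)(nil)).Elem())
	beam.RegisterType(reflect.TypeOf((*writeFn)(nil)).Elem())
}

type esDoc struct {
	ID, Body string
}

// Write groups the input so the batch write sees every document at once.
func Write(s beam.Scope, docs beam.PCollection) {
	s = s.Scope("esio.Write")
	keyed := beam.AddFixedKey(s, docs)   // KV<int, esDoc>, key is always 0
	grouped := beam.GroupByKey(s, keyed) // KV<int, iter<esDoc>>
	beam.ParDo0(s, &writeFn{}, grouped)
}

type writeFn struct{}

// ProcessElement receives the single fixed key and an iterator over all docs.
func (f *writeFn) ProcessElement(ctx context.Context, _ int, iter func(*esDoc) bool) error {
	var doc esDoc
	var batch []esDoc
	for iter(&doc) {
		batch = append(batch, doc)
	}
	// Hypothetical bulk call; replace with a real Elasticsearch client request.
	return bulkIndex(ctx, batch)
}

func bulkIndex(ctx context.Context, docs []esDoc) error {
	// ... issue a single _bulk request here ...
	return nil
}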

Comparing two koalas dataframes for testing purposes

Pandas has a testing module that includes assert_frame_equal. Does Koalas have anything similar?
I am writing tests for a whole set of transformations on Koalas dataframes. At first, since my test CSV files have only a few (<10) rows, I thought about just using pandas. Unfortunately, the files are quite wide (close to 200 columns) and have a variety of data types that are specified when Spark reads the files. Since the type specification is very different for pandas than it is for Koalas, I would have to write a list of ~200 dtypes in addition to the type schema we already wrote for Spark. That is why we decided it would be more efficient to use Spark and Koalas to create the dataframes for the tests. But then, I can't find in the docs a way to compare the dataframes to see whether the result of the transformations matches the expected one we created.
I ended up using this:
assert_frame_equal(kdf1.to_pandas(), kdf2.to_pandas())
This works, and I think it is okay because the data frames are "small." I suspect the reason nothing like this has been implemented natively in Koalas is that the main use of such an assertion would be in tests, and test data frames should be small anyway.
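For context, a minimal test sketch along these lines (my_transformation and the CSV paths are hypothetical):

# Compare by converting both (small) Koalas frames to pandas.
import databricks.koalas as ks
from pandas.testing import assert_frame_equal

def test_transformations():
    result_kdf = my_transformation(ks.read_csv("tests/data/input.csv"))
    expected_kdf = ks.read_csv("tests/data/expected.csv")
    assert_frame_equal(result_kdf.to_pandas(), expected_kdf.to_pandas())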

Writing a large amount of data using Akka in a more efficient way

I've implemented a Scala Akka application that streams 4 different types of data from a biomodule sensor (ECG, EEG, breath, and general data). These data (timestamp and value) are typically stored in 4 different CSV files. However, sometimes I have to store each sample in two different files with different timestamps, so the application writes to 8 different CSV files at the same time.
Initially I implemented one Akka actor responsible for persisting data, which receives the path of the file to write to, a timestamp, and a value. However, this was a bottleneck, since the number of samples I need to store is large (e.g. one ECG sample is received every 4 ms). As a result, even for a very short experiment, this actor finished recording 1-2 minutes after the experiment was over.
I've also tried 4 actors for the 4 different message types, with the idea of distributing the work. I didn't notice a significant improvement in performance.
I'm wondering if someone has an idea of how to improve the performance. Is it better to use one actor for storing files, a few actors, or is it most efficient to have one actor per file? Or maybe it doesn't make any difference? Could I improve my code for storing data?
This is my method responsible for storing data:
def processValue(sample: WaveformValue): Unit = {
  val csvfilewriter = new PrintWriter(new BufferedWriter(new FileWriter(sample.filepath, true)))
  csvfilewriter.append(sample.timestamp.toString)
  csvfilewriter.append(",")
  csvfilewriter.append(sample.value.toString)
  csvfilewriter.append("\r\n")
  csvfilewriter.flush()
  csvfilewriter.close()
}
It seems to me that your bottleneck is I/O -- disk access. It looks like you are opening, writing to, and closing a file for each sample, which is very expensive. I would suggest:
Open each file just once, and close it at the end of all processing. You might need to store the file writer in a member variable, or if you have an arbitrary collection of files, store the writers in a map held in a member variable.
Don't flush after every sample write.
Use buffered writes for each file writer. This avoids flushing data to the filesystem with every write, which involves a system call and waiting for the data to be written to disk. I see that you're already using a BufferedWriter, but the benefit is lost since you are flushing and closing the file after each sample anyway. A sketch combining these suggestions follows below.
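A minimal sketch of a persisting actor that follows these suggestions, keeping one buffered writer per file path, opened on first use and closed only at the end; the CloseFiles message and the WaveformValue field types are illustrative, so adjust them to match your own message class:

import java.io.{BufferedWriter, FileWriter}

import akka.actor.Actor

case class WaveformValue(filepath: String, timestamp: Long, value: Double)
case object CloseFiles

class PersistActor extends Actor {

  // One writer per file path, reused across samples.
  private var writers = Map.empty[String, BufferedWriter]

  private def writerFor(path: String): BufferedWriter =
    writers.getOrElse(path, {
      val writer = new BufferedWriter(new FileWriter(path, true))
      writers += path -> writer
      writer
    })

  def receive: Receive = {
    case sample: WaveformValue =>
      // No flush here: the BufferedWriter batches the small writes for us.
      writerFor(sample.filepath).write(s"${sample.timestamp},${sample.value}\r\n")

    case CloseFiles =>
      writers.values.foreach { w => w.flush(); w.close() }
      writers = Map.empty
  }

  // Safety net in case the actor stops before CloseFiles arrives.
  override def postStop(): Unit = {
    writers.values.foreach { w => w.flush(); w.close() }
    writers = Map.empty
  }
}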

How to use common data in MapReduce?

I want to load data into memory and have each Mapper use this data.
How do I do it?
Should I just use the setup method in Mapper?
Then, will each Mapper be able to use the common data once it is loaded?
Yes, that is exactly the way to go.
You read the data in setup() and keep it in your task's memory, for example as shown below.
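A minimal sketch, assuming the shared data is a small key/value file on HDFS whose path is passed via a job configuration property; the property name, file format, and output types are illustrative. setup() runs once per mapper task, so every call to map() in that task reuses the in-memory map.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the common data once per task.
        Path path = new Path(context.getConfiguration().get("lookup.path"));
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                lookup.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Every record processed by this task reuses the same in-memory lookup.
        String enriched = lookup.getOrDefault(value.toString(), "unknown");
        context.write(value, new Text(enriched));
    }
}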

Hector API SliceQuery versus ColumnQuery performance

I'm writing an application that uses Hector to access a Cassandra database. I have some situations where I only need to query one column, and some where I need to query multiple columns at once. Writing one method that takes an array of column names and returns a list of columns using SliceQuery would be simplest in terms of code, but I'm wondering whether there's a significant drawback to using SliceQuery for one column compared to using ColumnQuery.
In short, are there enough (or any) performance benefits of using ColumnQuery over SliceQuery for one column to make it worth the extra code to deal with a one-column case separately?
Looking at Hector's code, the difference between using a ColumnQuery (ThriftColumnQuery.java) and a SliceQuery (ThriftSliceQuery.java) is the Thrift command being sent: "get" or "get_slice", respectively.
I didn't find exact documentation of how each of those operations is implemented by Cassandra's server, but I took a quick look at Cassandra's sources, and after examining CassandraServer.java I got the impression that the "get" operation is there more for the client's convenience than for better performance when querying a single column:
For a "get" request, a SliceByNamesReadCommand instance is created and executed.
For a "get_slice" request (assuming you're using Hector's setColumnNames method and not setRange), a SliceByNamesReadCommand instance is created for each of the wanted columns and then executed (the row is read only once though).
Bottom line: as far as I can see, there's not much more than the (negligible) overhead of creating a few collections meant for handling multiple columns.
If you're still worried, however, I believe it shouldn't be too difficult to handle the two cases differently when wrapping the use of Hector in your DAOs. Otherwise, a single slice-based helper like the one sketched below keeps the code simple.
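A hedged sketch of such a single slice-based helper; the DAO class, the keyspace/column-family wiring, and the use of String serializers throughout are illustrative assumptions:

import java.util.List;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

public class ColumnDao {

    private final Keyspace keyspace;
    private final String columnFamily;

    public ColumnDao(Keyspace keyspace, String columnFamily) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
    }

    /** Fetches one or more named columns of a row with a single get_slice call. */
    public List<HColumn<String, String>> getColumns(String rowKey, String... columnNames) {
        SliceQuery<String, String, String> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
        query.setColumnFamily(columnFamily);
        query.setKey(rowKey);
        query.setColumnNames(columnNames);
        QueryResult<ColumnSlice<String, String>> result = query.execute();
        return result.get().getColumns();
    }
}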
Hope I managed to help.

Resources