Comparing two koalas dataframes for testing purposes - spark-koalas

Pandas has a testing module that includes assert_frame_equal. Does Koalas have anything similar?
I am writing tests for a whole set of transformations on Koalas dataframes. At first, since my test CSV files have only a few (<10) rows, I thought about just using pandas. Unfortunately, the files are quite wide (close to 200 columns) and have a variety of data types that are specified when Spark reads the files. Since the type specification for pandas is very different from the one for Spark, I would have to write a list of ~200 dtypes in addition to the type schema we already wrote for Spark, which is why we decided it would be more efficient to use Spark and Koalas to create the dataframes for the tests. But then I can't find anything in the docs for comparing the dataframes, to check whether the result of the transformations matches the expected output we created.
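For illustration, a minimal sketch of that setup (the schema, file path, and column names below are placeholders, not from the original post): reuse the Spark schema when reading the small test CSV and convert the result to Koalas, instead of re-declaring ~200 pandas dtypes.

import databricks.koalas as ks  # importing koalas attaches .to_koalas() to Spark DataFrames
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Illustrative two-column stand-in for the real ~200-column schema.
test_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Reuse the Spark schema for the small test CSV, then convert to Koalas.
kdf = spark.read.csv("tests/data/input.csv", header=True, schema=test_schema).to_koalas()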

I ended up using this:
assert_frame_equal(kdf1.to_pandas(), kdf2.to_pandas())
This works, and I think it is okay because the dataframes are "small." I suspect the reason nothing like this has been implemented natively in Koalas is that the main use of such an assertion is in tests, where the dataframes should be small anyway.
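As a minimal sketch of that comparison (kdf1 and kdf2 below are illustrative dataframes, not the real test data):

import databricks.koalas as ks
from pandas.testing import assert_frame_equal

kdf1 = ks.DataFrame({"a": [1, 2], "b": ["x", "y"]})
kdf2 = ks.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Convert both sides to pandas and reuse pandas' testing helper.
# check_like=True ignores the order of index and column labels, if that matters for the tests.
assert_frame_equal(kdf1.to_pandas(), kdf2.to_pandas(), check_like=True)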

Related

Validating datasets produced by identical Apache Airflow workflows

I have the same workflow in two different environments. To validate that both workflows are identical, I feed the same input data to both. If they are identical, I expect the output dataset of each workflow to be the same.
For this requirement, I cannot alter the workflows in any way (add/remove DAGs, etc.).
Which tool is best suited for this use case? I have been reading up on data validation frameworks like Apache Griffin and Great Expectations. Can either of these be used here, or is there a simpler alternative?
Update: I forgot to mention that I want the validation process to be as non-interactive as possible. Reading the Great Expectations tutorial, I see it talks about manually opening and running Jupyter notebooks, and I want to minimize steps like that as much as possible. If that makes sense.
Update 2:
Dataset produced by the workflow in the first environment:

Name  Value
ABC   10
DEF   20

Dataset produced by the workflow in the second environment:

Name  Value
DEF   20
ABC   10
After running the validation, I want the output to say that both datasets are identical even though their rows are in a different order.
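Whatever framework ends up doing the validation, the order-insensitive check itself can be sketched in a few lines of pandas (file names below are hypothetical): sort both datasets on all columns before comparing.

import pandas as pd
from pandas.testing import assert_frame_equal

def load_normalized(path):
    # Sort rows by every column and reset the index so row order is irrelevant.
    df = pd.read_csv(path)
    return df.sort_values(by=list(df.columns)).reset_index(drop=True)

df_env1 = load_normalized("env1_output.csv")  # hypothetical export from environment 1
df_env2 = load_normalized("env2_output.csv")  # hypothetical export from environment 2

# Raises an AssertionError describing the difference if the datasets diverge;
# passes silently when they contain the same rows, in any order.
assert_frame_equal(df_env1, df_env2)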

Why does a Beam IO need beam.AddFixedKey + beam.GroupByKey to work properly?

I'm working on a Beam IO for Elasticsearch in Golang. At the moment I have a working draft version, but I only managed to make it work by doing something whose purpose is not clear to me.
Basically, I looked at existing IOs and found that writes only work if I add the following:
x := beam.AddFixedKey(s, pColl)
y := beam.GroupByKey(s, x)
A full example is in the existing BigQuery IO.
Basically, I would like to understand why I need AddFixedKey followed by a GroupByKey to make it work. I also checked the issue BEAM-3860, but it doesn't have much more detail about it.
Those two transforms essentially function as a way to group all elements in a PCollection into one list. For example, their usage in the BigQuery example you posted allows grouping the entire input PCollection into a list that gets iterated over in the ProcessElement method.
Whether to use this approach depends on how you are implementing the IO. The BigQuery example you posted performs its writes as a single batch once all elements are available, but that may not be the best approach for your use case. You might prefer to write elements one at a time as they come in, especially if you can parallelize writes among different workers. In that case you would want to avoid grouping the input PCollection together.
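For illustration only, here is the same pattern sketched with Beam's Python SDK (not the Go SDK from the question): a fixed key gives GroupByKey something to collapse the whole PCollection onto, so the write step sees one iterable it can flush as a batch.

import apache_beam as beam

class WriteBatch(beam.DoFn):
    # Receives (key, iterable of elements) after GroupByKey.
    def process(self, keyed_elements):
        _, elements = keyed_elements
        batch = list(elements)  # the entire PCollection gathered into one list
        # ... a real IO would issue a single bulk write to the external system here ...
        yield len(batch)

with beam.Pipeline() as p:
    _ = (
        p
        | "Create" >> beam.Create(["doc1", "doc2", "doc3"])
        # Rough equivalent of the Go SDK's beam.AddFixedKey: every element gets the same key...
        | "AddFixedKey" >> beam.Map(lambda e: (0, e))
        # ...so GroupByKey gathers the whole PCollection under that single key.
        | "GroupByKey" >> beam.GroupByKey()
        | "WriteBatch" >> beam.ParDo(WriteBatch())
    )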

Spark handling small files (coalesce vs CombineFileInputFormat)

I have a use case where I have millions of small files in S3 that need to be processed by Spark. I have two options to reduce the number of tasks:
1. Use Coalesce
2. Extend CombineFileInputFormat
But I'm not clear on the performance implications of both, or when to use one over the other.
Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a parameter, so I'm not sure how to pass a configurable maxSplitSize.
Another great option to consider for such scenarios is SparkContext.wholeTextFiles(), which makes one record for each file, with its name as the key and its content as the value -- see the documentation.
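A minimal PySpark sketch of that suggestion (the S3 path and partition count are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="small-files-example")

# One record per file: (file path, full file content).
# minPartitions bounds how finely the input is split, so millions of small
# files are packed into far fewer tasks than one task per file.
files = sc.wholeTextFiles("s3a://my-bucket/small-files/", minPartitions=64)

# Example downstream use: count lines per file.
line_counts = files.mapValues(lambda content: len(content.splitlines()))
print(line_counts.take(5))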

How to do Lazy Map deserialization in Haskell

Similar to this question by Gabriel Gonzalez: How to do fast data deserialization in Haskell
I have a big Map full of Integers and Text that I serialized using cereal. The file is about 10M.
Every time I run my program I deserialize the whole thing just so I can look up a handful of the items. Deserialization takes about 500 ms, which isn't a big deal, but I always seem to like profiling on Fridays.
It seems wasteful to always deserialize 100k to 1M items when I only ever need a few of them.
I tried decodeLazy and also changing the map to a Data.Map.Lazy (not really understanding how a Map can be lazy, but OK, it's there), and this had no effect on the time, except maybe making it a little slower.
I'm wondering if there's something that can be a bit smarter, only loading and decoding what's necessary. Of course a database like sqlite can be very large but it only loads what it needs to complete a query. I'd like to find something like that but without having to create a database schema.
Update
You know what would be great? Some fusion of Mongo with Sqlite. Like you could have a JSON document database using flat-file storage ... and of course someone has done it https://github.com/hamiltop/MongoLiteDB ... in Ruby :(
Thought mmap might help. Tried the mmap library and segfaulted GHCi for the first time ever. No idea how I can even report that bug.
Tried the bytestring-mmap library, and that works but gives no performance improvement. Just replacing this:
ser <- BL.readFile cacheFile
With this:
ser <- unsafeMMapFile cacheFile
Update 2
keyvaluehash may be just the ticket. Performance seems really good. But the API is strange and documentation is missing so it will take some experimenting.
Update 3: I'm an idiot
Clearly what I want here is not lazier deserialization of a Map. I want a key-value database and there's several options available like dvm, tokyo-cabinet and this levelDB thing I've never seen before.
Keyvaluehash looks to be a native-Haskell key-value database, which I like, but I still don't know about its quality. For example, you can't ask the database for a list of all keys or all values (the only real operations are readKey, writeKey and deleteKey), so if you need that, you have to store it somewhere else. Another drawback is that you have to tell it a size when you create the database. I used a size of 20M so I'd have plenty of room, but the actual database it created occupies 266M. No idea why, since there isn't a line of documentation.
One way I've done this in the past is to just make a directory where each file is named by a serialized key. One can use unsafeInterleaveIO to "thunk" the deserialized contents of each read file, so that values are only forced on read...

Hector API SliceQuery versus ColumnQuery performance

I'm writing an application that uses Hector to access a Cassandra database. I have some situations where I only need to query one column, and some where I need to query multiple columns at once. Writing one method that takes an array of column names and returns a list of columns using SliceQuery would be simplest in terms of code, but I'm wondering whether there's a significant drawback to using SliceQuery for one column compared to using ColumnQuery.
In short, are there enough (or any) performance benefits of using ColumnQuery over SliceQuery for one column to make it worth the extra code to deal with a one-column case separately?
Looking at Hector's code, the difference between using a ColumnQuery (ThriftColumnQuery.java) and a SliceQuery (ThriftSliceQuery.java) is the Thrift command being sent - "get" or "get_slice", respectively.
I didn't find exact documentation of how each of those operations is implemented by Cassandra's server, but I took a quick look at Cassandra's sources, and after examining CassandraServer.java I got the impression that the "get" operation is there more for the client's convenience than for better performance when querying a single column:
For a "get" request, a SliceByNamesReadCommand instance is created and executed.
For a "get_slice" request (assuming you're using Hector's setColumnNames method and not setRange), a SliceByNamesReadCommand instance is created for each of the wanted columns and then executed (the row is read only once though).
Bottom line, as far as I see it there's not much more than the (negligible) overhead of creating some collections meant for handling the multiple columns.
If you're still worried however, I believe it shouldn't be too difficult to handle the two cases differently when wrapping the use of Hector in your DAOs.
Hope I managed to help.
