Partial Vertical Caching of DataFrame

Partial Vertical Caching of DataFrame - caching

I use spark with parquet.
I'd like to be able to cache the columns we use most often for filtering, while keeping the other on disk.
I'm running something like:
myDataFrame.select("field1").cache
myDataFrame.select("field1").count
myDataFrame.select("field1").where($"field1">5).count
myDataFrame.select("field1", "field2").where($"field1">5).count
The fourth line doesn't use the cache.
Any simple solutions that can help here?

The reason this will not cache is that whenever you do a transformation on a dataframe (e.g. select) you are actually creating a new one. What you basically did is cached a dataframe containing only field1 and a dataframe containing only field1 where it is larger than 5 (probably you meant field2 here but it doesn't matter).
On the fourth line you are creating a third dataframe which has no lineage to the original two, just to the original dataframe.
If you generally do strong filtering (i.e. you get a very small number of elements) you can do something like this:
cachedDF = myDataFrame.select("field1", "field2", ... "fieldn").cache
cachedDF.count()
filteredDF = cachedDF.filter(some strong filter)
res = myDataFrame.join(broadcast(filteredDF), cond)
i.e. cachedDF has all the fields you filter on, then you filter very strongly and then do an inner join (with cond being all relevant selected fields or some id field) which would give all relevant data.
That said, in most cases, assuming you use a file format such as parquet, caching will not help you much.

Related

Sorting after Repartitioning PySpark Dataframe

We have a giant file which we repartitioned according to one column, for example, say it is STATE. Now it seems like after repartitioning, the data cannot be sorted completely. We are trying to save our final file as a text file but instead of the first state listed being Alabama, now California shows up first. OrderBy doesn't seem to have an effect after running the repartition.
df = df.repartition(100, ['STATE_NAME'])\
.sortWithinPartitions('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID')

I can't find a clear statement in the documentation about this, only this hint for pyspark.sql.DataFrame.repartition:
The resulting DataFrame is hash partitioned.
Obviously, repartition doesn't bring the rows in a specific (namely alphabetic) order (not even if they were ordered previously), it only groups them. That .sortWithinPartitions imposes no global order is no wonder considering the name, which implies that the sorting only occurs within the partitions, not on them. You can try .sort instead.

Is there a simple way to use PySpark DataFrame filter() to split a frame into exactly two frames, based on condition?

I'm working on a data warehouse project. I'm reading input data into a frame, and then I want to filter out the bad rows. However, I want to print some sample bad rows. What I have now is
df_good = df_input.filter(((df_input.info.isNull()) | (df_input.info == '')))
This filter works, but I cannot print out a sample of the dropped records. What I would like is something like:
df_keep, df_reject = df_input.filter_split(((df_input.info.isNull()) | (df_input.info == '')))
print("Sample rejected records:")
df_reject.show(5)
I found one method which involves running the filter, then joining the good data back to the original data with an outer join, then filtering to find original data not-in the good data set. But this iterates over the original data twice; I would like to pass through the list just once.
Any ideas? I am doing this in AWS Glue, so I may be able to use a Dynamic Frame function.

CouchDb filter and sort in one view

I'm new to the CouchDb.
I have to filter records by date (date must be between two values) and to sort the data by the name or by the date etc (it depends on user's selection in the table).
In MySQL it looks like
SELECT * FROM table WHERE date > "2015-01-01" AND date < "2015-08-01" ORDER BY name/date/email ASC/DESC
I can't figure out if I can use one view for all these issues.
Here is my map example:
function(doc) {
emit(
[doc.date, doc.name, doc.email],
{
email:doc.email,
name:doc.name,
date:doc.date,
}
);
}
I try to filter data using startkey and endkey, but I'm not sure how to sort data in this way:
startkey=["2015-01-01"]&endkey=["2015-08-01"]
Can I use one view? Or I have to create some views with keys order depending on my current order field: [doc.date, doc.name, doc.email], [doc.name, doc.date, doc.email] etc?
Thanks for your help!

As Sebastian said you need to use a list function to do this in Couch.
If you think about it, this is what MySQL is doing. Its query optimizer will pick an index into your table, it will scan a range from that index, load what it needs into memory, and execute query logic.
In Couch the view is your B-tree index, and a list function can implement whatever logic you need. It can be used to spit out HTML instead of JSON, but it can also be used to filter/sort the output of your view, and still spit out JSON in the end. It might not scale very well to millions of documents, but MySQL might not either.
So your options are the ones Sebastian highlighted:
view sorts by date, query selects date range and list function loads everything into memory and sorts by email/etc.
views sort by email/etc, list function filters out everything outside the date range.
Which one you choose depends on your data and architecture.
With option 1 you may skip the list function entirely: get all the necessary data from the view in one go (with include_docs), and sort client side. This is how you'll typically use Couch.
If you need this done server side, you'll need your list function to load every matching document into an array, and then sort it and JSON serialize it. This obviously falls into pieces if there are soo many matching documents that they don't even fit into memory or take to long to sort.
Option 2 scans through preordered documents and only sends those matching the dates. Done right this avoids loading everything into memory. OTOH it might scan way too many documents, trashing your disk IO.
If the date range is "very discriminating" (few documents pass the test) option 1 works best; otherwise (most documents pass) option 2 can be better. Remember that in the time it takes to load a useless document from disk (option 2), you can sort tens of documents in memory, as long as they fit in memory (option 1). Also, the more indexes, the more disk space is used and the more writes are slowed down.

you COULD use a list function for that, in two ways:
1.) Couch-View is ordered by dates and you sort by e-amil => but pls. be aware that you'd have to have ALL items in memory to do this sort by e-mail (i.e. you can do this only when your result set is small)
2.) Couch-View is ordered by e-mail and a list function drops all outside the date range (you can only do that when the overall list is small - so this one is most probably bad)
possibly #1 can help you

Reading XML-files with StAX / Kettle (Pentaho)

I'm doing an ETL-process with Pentaho (Spoon / Kettle) where I'd like to read XML-file and store element values to db.
This works just fine with "Get data from XML" -component...but the XML file is quite big, several giga bytes, and there fore reading the file takes too long.
Pentaho Wiki says:
The existing Get Data from XML step is easier to use but uses DOM
parsers that need in memory processing and even the purging of parts
of the file is not sufficient when these parts are very big.
The XML Input Stream (StAX) step uses a completely different approach
to solve use cases with very big and complex data stuctures and the
need for very fast data loads...
There fore I'm now trying to do the same with StAX, but it just doesn't seem to work out like planned. I'm testing this with XML-file which only has one element group. The file is read and then mapped/inserted to table...but now I get multiple rows to table where all the values are "undefined" and some rows where I have the right values. In total I have 92 rows in the table, even though it should only have one row.
Flow goes like:
1) read with StAX
2) Modified Java Script Value
3) Output to DB
At step 2) I'm doing as follow:
var id;
if ( xml_data_type_description.equals("CHARACTERS") &&
xml_path.equals("/labels/label/id") ) {
id = xml_data_value; }
...
I'm using positional-staz.zip from http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&p=261230#post261230 as an example.
How to use StAX for reading XML-file and storing the element values to DB?
I've been trying to look for examples but haven't found much. The above example uses "Filter Rows" -component before inserting the rows. I don't quite understand why it's being used, can't I just map the values I need? It might be that this problem occurs because I don't use, or know how to use, Filter Rows -component.
Cheers!

I posted a possible StAX-based solution on the forum listed above, but I'll post the gist of it here since it is awaiting moderator approval.
Using the StAX parser, you can select just those elements that you care about, namely those with a data type of CHARACTERS. For the forum example, you basically need to denormalize the rows in sets of 4 (EXPR, EXCH, DATE, ASK). To do this you add the row number to the stream (using an Add Sequence step) then use a Calculator to determine a "bucket number" = INT((rownum-1)/4). This will give you a grouping field for a Row Denormaliser step.
When the post is approved, you'll see a link to a transformation that uses StAX and the method I describe above.
Is this what you're looking for? If not please let me know where I misunderstood and maybe I can help.

Using LINQ to query flat text files with fixed-length records?

I've got a file filled with records like this:
NCNSCF1124557200811UPPY19871230
The codes are all fixed-length, and some of them link to other flat files (sort of like a relational database). What's the best way of querying this data using LINQ?
This is what I came up with intuitively, but I was wondering if there's a more elegant way:
var records = File.ReadAllLines("data.txt");
var table = from record in records
select new { FirstCode = record.Substring(0, 2),
OtherCode = record.Substring(18, 4) };

For one thing I wouldn't read it all into memory to start with. It's very easy to write a LineReader class which iterates over a file a line at a time. I've got a version in MiscUtil which you can use.
Unless you only want to read the results once, however, you might want to call ToList() at the end to avoid reading the file multiple times. (This is still nicer than reading all the lines and keeping that in memory - you only want to do the splitting once.)
Once you've basically got in-memory collections of all the tables, you can use normal LINQ to Objects to join them together etc. You might want to go to a more sophisticated data model to get indexes though.

I don't think there's a better way out of the box.
One could define a Flat-File Linq Provider which could make the whole thing much simpler, but as far as I know, no one has yet.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio