Dynamic parameters on datasets in Kedro

I would like to call an API to enrich an existing dataset.
The existing dataset is a CSVDataSet configured in the catalog.
Now I would like to create a Node that enriches the CSVDataSet with data from the API, which I have to call for every row in the CSV file, and then saves the data into a database (SQLTableDataSet). My approach is to create an APIDataSet entry in the catalog and provide it as an input for the node, next to the CSVDataSet.
The issue here is that the APIDataSet is static (in general the DataSets seem to be very static). I need to call the load function at runtime within the Node for every entry in the CSV file.
I didn't find a way to do this. Is it just a bad approach? Do I have to call the API within the Node instead of creating an APIDataSet?

So typically, we don't like our nodes having knowledge of IO configuration. The belief is that functionally pure python functions are easier to test, maintain and build.
Typically the way we would keep this distinction would be for you to subclass our APIDataSet / CSVDataSet or both and then add your custom logic to do it all there.

I have done this in my GDALRasterDataSet implementation. The idea is that if you need to enrich a dataset on the go, you can overload the load() method in a custom dataset and pass additional parameters there.
You can see an implementation here and an example of usage here.
The only extra thing you need to do is to re-write the load() method to accept kwargs (line 143) and write your own _load method that enriches your dataset. Everything else is boilerplate.
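A minimal sketch of that pattern, written as a plain Python class so it runs without Kedro installed. The class name EnrichedCSVDataSet, the fetch callable and the key_column parameter are assumptions made for illustration; in a real project the class would subclass one of Kedro's dataset classes, and fetch would wrap the actual API call.

```python
import csv
import io
from typing import Callable, Dict, List


class EnrichedCSVDataSet:
    """Hypothetical stand-in for a custom dataset (not Kedro's own API)."""

    def __init__(self, csv_text: str, fetch: Callable[[str], Dict[str, str]]):
        self._csv_text = csv_text  # a real dataset would read from a filepath
        self._fetch = fetch        # called once per row, e.g. an HTTP request

    def load(self, **kwargs) -> List[Dict[str, str]]:
        # Public entry point: accepts runtime parameters and forwards them.
        return self._load(**kwargs)

    def _load(self, key_column: str = "id") -> List[Dict[str, str]]:
        # Read every CSV row and merge in the API result for that row.
        rows = csv.DictReader(io.StringIO(self._csv_text))
        return [{**row, **self._fetch(row[key_column])} for row in rows]


def fake_api(key: str) -> Dict[str, str]:
    # Placeholder for the real API call; returns canned data.
    return {"score": str(len(key))}


dataset = EnrichedCSVDataSet("id,name\nab,Ann\nxyz,Bob\n", fake_api)
enriched = dataset.load(key_column="id")
```

The node that receives enriched stays a pure function: it sees rows that already carry the API fields and knows nothing about how they were fetched.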


How do I 'destroy all' of a given Resource type in redux-saga?

I'm new to Redux-Saga, so please assume very shaky foundational knowledge.
In Redux, I am able to define an action and a subsequent reducer to handle that action. In my reducer, I can do just about whatever I want, such as 'delete all' of a specific state tree node, e.g.
switch (action.type) {
  // ...
  case 'DESTROY_ALL_ORDERS':
    return {
      ...state,
      orders: []
    };
}
However, it seems to me (after reading the docs) that reducers are defined by Saga, and you have access to them in the form of certain given CRUD verb prefixes with invocation postfixes, e.g.
fetchStart, destroyStart
My instinct is to use destroyStart, but the method accepts a model instance, not a collection, i.e. it can only destroy a given resource instance (in my case, one Order).
TL;DR
Is there a destroyStart equivalent for a group of records at once?
If not, is there a way I can add custom behavior to the Saga-created reducers?
What have I missed? Feel free to be as mean as you want, I have no idea what I'm doing, but when you are done roasting me, do me a favor and point me in the right direction.
EDIT:
To clarify, I'm not trying to delete records from my database. I only want to clear the Redux store of all 'Order' Records.
Two key bits of knowledge were gained here.
My team is using a library called redux-api-resources, which I was to some extent conflating with Saga. This library was created by a former employee, and adds about as much complexity as it removes. I would not recommend it. destroyStart is provided by this library and is not specifically related to Saga. However, the answer for anyone using this library (redux-api-resources) is no, there is no bulk destroy action.
Reducers are not created by Saga, as pointed out in the above comments by @Chad S. The mistake in my thinking was that I believed I should somehow crack open this reducer and fill it with complex logic. The 'Saga' way to do this is to put the logic in your generator function, which is where you (can) define your control flow. I make no claim that this is best practice, only that this is how I managed to get my code working.
I know very little about Saga and Redux in general, so please take these answers with a grain of salt.

Trying to identify if a data injection method has a name already

Let's say we have a class "Car" that has different pieces of data (maker, model, color, fabrication date, registration date, etc.). The class has no method of its own to get the data, but it knows to ask for it from another object (sent via the constructor, let's call it DS for short), and the same goes for when it needs to save changes.
A method getColor() would be implemented like this
if (!$this->loaded('color')) {
    $this->askDS('color'); // this will do the necessary work to generate a request to DS
}
return $this->information('color');
Nothing too fancy so far. Now comes the part where I want to find out if it has a name, or if there are libraries / frameworks that do this already.
DS has a list of methods registered dynamically based on the class that needs data. For the car we have:
1. input: car serial number, output: method to use to read the numbers to extract raw values
2. input: car raw color value, output: color code
3. input: car color code, manufacturer, year, model, output: human-readable color (for example navy blue)
Now, neither DS nor any method has an ordered list of commands to get from the serial number to the color blue, but if DS can construct a chain of methods starting from one set of data, it can run them in order and get the desired data.
For our example above, DS runs 1, 2, 3 in that order and injects the data resulting from all methods into the class object that needed it.
Now if the car needs registration info, we have method (4) that gets that from the police database with an api request.
So, given:
- a type of model (class/object)
- a list of methods that take a fixed list of input(object properties) and give out a fixed list of output (object properties)
- a class DS that can glue the methods together and run the needed ones for a model to get from property A (serial) to property B (human-readable colour), without the model or DS having a preconfigured way to get this data, but finding it as needed.
Does this have a name, or is it already implemented somewhere?
I've implemented a very basic prototype and it works very well, and I think this implementation method has useful features:
- if you have a set of methods that do SQL queries and then your app switches to using an API, you only need to change the methods and don't have to touch any other part of the application
- when looking for a chain of methods that resolves the 'need' the object has, you can find a method chain, run it, and if it fails keep looking for another list of methods based on the currently available data, so if you have multiple sources for a piece of data, it can try multiple versions
- starting from the above point, I could start with an app that only has SQL queries for data retrieval; when I find out that a part of the app overloads the SQL server, I could add a method to retrieve data from a cache with a lower cost than the one from the database (or multiple layered caches, each with different costs)
- I could probably add business logic to the mix the same way as the cache, and present different data based on the user location / options
- this requires less coding overall, and decouples the data source from the object, making each piece easier to mock/test
- what is needed to make this fast is a caching solution for the discovered method chains, since matching hundreds of thousands of methods per model type would be time-consuming, but I don't think this is very hard to do: just store all found chains in memory as you find them, plus some metadata to be able to resume a search from any point in time; when you update the methods, just clear the cache and take a performance hit for the first requests
Thank you for your time
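The chain discovery described above can be sketched as simple forward chaining over methods that declare their input and output properties. This is only an illustration under assumptions, not the prototype mentioned in the question: the names Method, find_chain and resolve, the property names, and the canned return values are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, NamedTuple, Optional


class Method(NamedTuple):
    name: str
    inputs: FrozenSet[str]    # properties the method needs
    outputs: FrozenSet[str]   # properties the method produces
    run: Callable[[Dict[str, object]], Dict[str, object]]


def find_chain(methods: Iterable[Method], known: Iterable[str],
               target: str) -> Optional[List[Method]]:
    """Greedily pick runnable methods until the target property is known."""
    known_props = set(known)
    remaining = list(methods)
    chain: List[Method] = []
    while target not in known_props:
        runnable = [m for m in remaining
                    if m.inputs <= known_props and not m.outputs <= known_props]
        if not runnable:
            return None  # no method can add anything new: target unreachable
        chosen = runnable[0]
        chain.append(chosen)
        known_props |= chosen.outputs
        remaining.remove(chosen)
    return chain


def resolve(methods: Iterable[Method], data: Dict[str, object],
            target: str) -> Dict[str, object]:
    """Find a chain for the target, run it in order, return all gathered data."""
    chain = find_chain(methods, data.keys(), target)
    if chain is None:
        raise LookupError(f"no chain of methods produces {target!r}")
    result = dict(data)
    for method in chain:
        result.update(method.run(result))
    return result


# Hypothetical methods mirroring steps 1-3 of the car example.
METHODS = [
    Method("read_raw", frozenset({"serial"}), frozenset({"raw_color"}),
           lambda d: {"raw_color": 17}),
    Method("decode", frozenset({"raw_color"}), frozenset({"color_code"}),
           lambda d: {"color_code": "B4"}),
    Method("humanize",
           frozenset({"color_code", "manufacturer", "year", "model"}),
           frozenset({"color_name"}),
           lambda d: {"color_name": "navy blue"}),
]

car = resolve(METHODS, {"serial": "X1", "manufacturer": "Acme",
                        "year": 2010, "model": "Z"}, "color_name")
```

Because the search is greedy, it may run methods that are not strictly needed for the target; a backward search from the target property, or caching the chains that were found, would be natural refinements.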
What you describe sounds like a somewhat roundabout way of doing Dependency Injection. Quote:
"Passing the service to the client, rather than allowing a client to
build or find the service, is the fundamental requirement of the
pattern."
Depending on what language you're using, there should be several Dependency Injection frameworks/libraries available.

Does hive instantiate a new UDF object for each record?

Say I'm building a UDF class called StaticLookupUDF that has to load some static data from a local file during construction.
In this case I want to ensure that I'm not replicating work more than I need to be, in that I don't want to re-load the static data on every call to the evaluate() method.
Clearly each mapper uses its own instantiation of the UDF, but does a new instance get generated for each record processed?
For example, a mapper is going to process 3 rows. Does it create a single StaticLookupUDF and call evaluate() 3 times, or does it create a new StaticLookupUDF for each record, and call evaluate only once per instance?
If the second example is true, in what alternate way should I structure this?
Couldn't find this anywhere in the docs, I'm going to look through the code, but figured I'd ask the smart people here at the same time.
Still not totally sure about this, but I got around it by having a static lazy value that loaded data as needed.
This way you have one instance of the static value per mapper. So if you're reading in a dataset and you have 6 map tasks, you'll read in the data 6 times. Not ideal, but better than once per record.
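The lazy static value can be illustrated as follows. Hive UDFs are written in Java, so this Python sketch only shows the pattern, not Hive's API: the class StaticLookup, the load counter and the in-memory table are assumptions standing in for the UDF and the local file.

```python
from typing import Dict, Optional


class StaticLookup:
    """Pattern sketch: data shared by all instances, loaded on first use."""

    _table: Optional[Dict[str, str]] = None  # shared across instances
    load_count = 0                           # only here to observe the loading

    @classmethod
    def _get_table(cls) -> Dict[str, str]:
        if cls._table is None:               # first call in this process
            cls._table = cls._load_table()
        return cls._table

    @classmethod
    def _load_table(cls) -> Dict[str, str]:
        cls.load_count += 1
        # Stand-in for reading the static data from a local file.
        return {"a": "alpha", "b": "beta"}

    def evaluate(self, key: str) -> Optional[str]:
        return self._get_table().get(key)


# Even with a new instance per record, the table is loaded only once.
results = [StaticLookup().evaluate(key) for key in ("a", "b", "c")]
```

With this structure it no longer matters whether one instance or many are created per mapper: the expensive load happens once per process either way.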

Displaying computed data with external dependencies

I'm building a report that needs to include an 'estimate' column, which is based on data that's not available in the dataset.
Ideally I'd like to be able to define a Java interface
public int getEstimate(int foo_id, int bar_id, int quantity);
where foo_id, bar_id and quantity are available in the row I want the estimate presented.
There will be multiple strategies for producing the estimate so it would be good to use an interface to allow swapping them when needed.
Looking at the BIRT docs, I think it's possible I ought to be using the event handler mechanisms, but that seems to only allow defining a class to use and I'd somehow like to inject a configured estimator.
A non-obfuscated example might be to say that I have a dataset which includes an IP address column, and I'd like to be able to use some GeoIP service to resolve the country from the IP address. In that case I'd have an interface public String getCountryName(String address) and the actual implementations may use MaxMind, a local cache or some other system.
How would I go about doing this?
Or.. would I be better off by writing a scripted data source that can integrate the computed data before delivering it to BIRT?
Or.. some sort of scripted data source that is then used to create a join data set?
I think a Scripted Data Source would work fine, but a Java-based event handler would be more straightforward. You can implement it as a simple POJO and get access to any and all the complex objects and tools that will allow you to calculate your estimate. The simplest solution of all may simply be to add a calculated field to the data set.
When creating the calculated field, you can get pretty complex in terms of the scripting logic you can leverage in order to produce the resultant value. The nicest thing about this route is that all the other column values in the row (which I assume you need to calculate the estimate) are made available via the Expression editor. You can pull in complex objects (POJOs) to help in your calculations here as well by using the "Packages" object (i.e. var red = new Packages.redwood.HelloWorld())
If you want to create the Event Handler class, here is what I would do. I would create a text object, bind its onCreate event to your POJO (by extending the TextItemEventAdapter), and override the "onCreate" method. There you can do any work you want to, and at the end simply call 'text.setText(theEstimateResult);' to make the estimate itself visible. As far as accessing data values to do your calculations, you can get to those in the POJO too. I assume the estimate will be a part of a larger table of values. You can access any specific row value via the reportContext.
Those are the two ideas I would give a try first. The computed column is the fastest to implement and the least likely to throw you a curve during deployment. Let me know which way you choose and we can hash it out further if needed.

Appropriate data structure for flat file processing?

Essentially, I have to get a flat file into a database. The flat files come in with the first two characters on each line indicating which type of record it is.
Do I create a class for each record type with properties matching the fields in the record? Should I just use arrays?
I want to load the data into some sort of data structure before saving it in the database so that I can use unit tests to verify that the data was loaded correctly.
Here's a sample of what I have to work with (BAI2 bank statements):
01,121000358,CLIENT,050312,0213,1,80,1,2/
02,CLIENT-STANDARD,BOFAGB22,1,050311,2359,,/
03,600812345678,GBP,fab1,111319005,,V,050314,0000/
88,fab2,113781251,,V,050315,0000,fab3,113781251,,V,050316,0000/
88,fab4,113781251,,V,050317,0000,fab5,113781251,,V,050318,0000/
88,010,0,,,015,0,,,045,0,,,100,302982205,,,400,302982205,,/
16,169,57626223,V,050311,0000,102 0101857345,/
88,LLOYDS TSB BANK PL 779300 99129797
88,TRF/REF 6008ABS12300015439
88,102 0101857345 K BANK GIRO CREDIT
88,/IVD-11 MAR
49,1778372829,90/
98,1778372839,1,91/
99,1778372839,1,92
I'd recommend creating classes (or structs, or whatever value type your language supports), as
record.ClientReference
is so much more descriptive than
record[0]
and, if you're using the (wonderful!) FileHelpers Library, then your terms are pretty much dictated for you.
Validation logic usually has at least 2 levels, the grosser level being "well-formatted" and the finer level being "correct data".
There are a few separate problems here. One issue is that of simply verifying the data, or writing tests to make sure that your parsing is accurate. A simple way to do this is to parse into a class that accepts a given range of values, and throws the appropriate error if not,
e.g.
public void setField1(int i)
{
if (i>100) throw new InvalidDataException...
}
Creating different classes for each record type is something you might want to do if the parsing logic is significantly different for different codes, so you don't have conditional logic like
public void setField2(String s)
{
if (field1==88 && s.equals ...
else if (field2==22 && s
}
yechh.
When I have had to load this kind of data in the past, I have put it all into a work table with the first two characters in one field and the rest in another. Then I have parsed it out to the appropriate other work tables based on the first two characters. Then I have done any cleanup and validation before inserting the data from the second set of work tables into the database.
In SQL Server you can do this through a DTS (2000) or an SSIS package. Using SSIS, you may be able to process the data on the fly without storing it in work tables first, but the process is similar: use the first two characters to determine the data flow branch to use, then parse the rest of the record into some type of holding mechanism, and then clean up and validate before inserting. I'm sure other databases also have some type of mechanism for importing data and would use a similar process.
I agree that if your data format has any sort of complexity you should create a set of custom classes to parse and hold the data, perform validation, and do any other appropriate model tasks (for instance, return a human readable description, although some would argue this would be better to put into a separate view class). This would probably be a good situation to use inheritance, where you have a parent class (possibly abstract) define the properties and methods common to all types of records, and each child class can override these methods to provide their own parsing and validation if necessary, or add their own properties and methods.
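A minimal sketch of that inheritance approach, using two lines from the sample in the question. The class names, the registry and the property names (sender_id, receiver_id, number_of_records) are assumptions for illustration, based on one reading of the BAI2 layout, and should be checked against the specification before use.

```python
from typing import Dict, List, Type


class Record:
    """Parent class: holds the raw fields and dispatches on the record code."""

    code = ""
    _registry: Dict[str, Type["Record"]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Record._registry[cls.code] = cls   # register each child by its code

    def __init__(self, code: str, fields: List[str]):
        self.record_code = code
        self.fields = fields
        self.validate()

    def validate(self) -> None:
        """Children override this to check their own layout."""

    @staticmethod
    def parse(line: str) -> "Record":
        parts = line.rstrip().rstrip("/").split(",")
        record_cls = Record._registry.get(parts[0], Record)
        return record_cls(parts[0], parts[1:])


class FileHeader(Record):
    code = "01"

    def validate(self) -> None:
        if len(self.fields) != 8:
            raise ValueError("file header must have 8 fields")

    @property
    def sender_id(self) -> str:
        return self.fields[0]

    @property
    def receiver_id(self) -> str:
        return self.fields[1]


class FileTrailer(Record):
    code = "99"

    def validate(self) -> None:
        if len(self.fields) != 3:
            raise ValueError("file trailer must have 3 fields")

    @property
    def number_of_records(self) -> int:
        return int(self.fields[2])


header = Record.parse("01,121000358,CLIENT,050312,0213,1,80,1,2/")
trailer = Record.parse("99,1778372839,1,92")
other = Record.parse("02,CLIENT-STANDARD,BOFAGB22,1,050311,2359,,/")
```

Record types without a dedicated class fall back to the generic Record, so the parser can be grown one record type at a time, and each class can be unit tested on its own before anything is written to the database.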
Creating a class for each type of row would be a better solution than using Arrays.
That said, however, in the past I have used Arraylists of Hashtables to accomplish the same thing. Each item in the arraylist is a row, and each entry in the hashtable is a key/value pair representing column name and cell value.
Why not start by designing the database that will hold the data? Then you can use the Entity Framework to generate the classes for you.
here's a wacky idea:
if you were working in Perl, you could use DBD::CSV to read data from your flat file, provided you gave it the correct values for separator and EOL characters. you'd then read rows from the flat file by means of SQL statements; DBI will make them into standard Perl data structures for you, and you can run whatever validation logic you like. once each row passes all the validation tests, you'd be able to write it into the destination database using DBD::whatever.
-steve
