statsmodels - create model from params

I am trying to create an empty model from params saved from a previously trained model, but the constructor stubbornly wants me to provide both endogenous and exogenous variables, which I don't have. Is there any way to get around this?
For example, I only want to do:
logit = sm.Logit()
pred = logit.predict(params, X)
But the first line won't work.

No, this is not supported in statsmodels. Models are always associated with data.
However, for the use case of prediction, it is possible to pickle the model and optionally delete all full-length arrays, including the data, from the model instance and from the results instance before pickling. This doesn't work with formulas.
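For example, a minimal sketch of that workflow (assuming a recent statsmodels version; the toy data and file name are made up):

import numpy as np
import statsmodels.api as sm

# Fit once, on a machine that still has the training data.
X_train = sm.add_constant(np.random.rand(100, 3))
y_train = (np.random.rand(100) > 0.5).astype(float)
results = sm.Logit(y_train, X_train).fit()

# Drop the full-length arrays (data, fitted values, ...) and pickle the results.
results.save("logit_results.pickle", remove_data=True)

# Later, elsewhere: load and predict without re-supplying endog/exog.
loaded = sm.load("logit_results.pickle")
X_new = sm.add_constant(np.random.rand(5, 3))
pred = loaded.predict(X_new)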
On the other hand, since this is Python, there might be several ways to cheat, at your own risk.
It would be helpful if you opened an issue on GitHub (https://github.com/statsmodels/statsmodels/issues) with a description of your use case; it might be possible to get the relevant features into a future version.

Related

Trying to identify if a data injection method has a name already

Let's say we have a class "Car" that has different pieces of data (maker, model, color, fabrication date, registration date, etc.). The class has no method to get data, but it knows to ask for it from another object (sent via constructor; let's call it DS for short), and the same goes for pushing updates back.
A method getColor() would be implemented roughly like this:
if (!this->loaded('color')) {
    this->askDS('color')  // this will do the necessary work to generate a request to DS
}
return this->information('color')
Nothing too fancy so far. Now comes the part where I want to find out if it has a name, or whether there are libraries/frameworks that do this already.
DS has a list of methods registered dynamically based on the class that needs data. For Car we have:
1. input: car serial number; output: method to use to read the numbers and extract raw values
2. input: car raw color value; output: color code
3. input: car color code, manufacturer, year, model; output: human-readable color (for example, navy blue)
Now, neither DS nor any single method has a preconfigured, ordered list of commands to start from the serial number and return the color blue, but DS can construct a chain of methods that, starting from one set of data, can be run in order to get the desired data.
For our example above, DS runs 1, 2, and 3 in that order and injects the data resulting from all the methods into the object that needed it.
Now, if the car needs registration info, we have a method (4) that gets it from the police database with an API request.
So, given:
- a type of model (class/object)
- a list of methods that each take a fixed list of inputs (object properties) and give out a fixed list of outputs (object properties)
- a class DS that can glue the methods together and run the needed ones for a model to get from property A (serial number) to property B (human-readable colour), without the model or DS having a preconfigured way to get this data but finding it as needed
does this have a name, or is it already implemented somewhere?
I've implemented a very basic prototype and it works very nicely, and I think this approach has some useful features:
- if you have a set of methods that do SQL queries and your app later switches to using an API, you only need to change the methods and don't have to touch any other part of the application
- when looking for a chain of methods that resolves the 'need' the object has, you can find a method chain, run it, and if it fails keep looking for another chain based on the currently available data - so if you have multiple sources for a piece of data, it can try multiple versions
- starting from the point above, I could begin with an app that only has SQL queries for data retrieval; when I find that a part of the app overloads the SQL server, I could add a method to retrieve data from a cache with a lower cost than the one from the database (or multiple layered caches, each with different costs)
- I could probably add business logic into the mix the same way as caching, and present different data based on the user location/options
- this requires less coding overall, and decouples the data source from the object, making each piece easier to mock/test
What is needed to make this fast is a caching solution for the discovered method chains, since matching hundreds of thousands of methods per model type would be time-consuming. I don't think this is very hard to do: just store all found chains in memory as you find them, plus some metadata to be able to resume a search from any point, and when you update the methods, clear the cache and take a performance hit on the first requests.
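To make the idea concrete, here is a rough Python sketch of the chain-resolution part (all method names, property keys, and sample values are made up; a real DS would hold a much larger registry):

from collections import deque

# Hypothetical registry: each entry maps a set of required input properties
# to the properties it can produce, plus the function that does the work.
METHODS = [
    ({"serial_number"}, {"raw_color_value"}, lambda d: {"raw_color_value": d["serial_number"][4:6]}),
    ({"raw_color_value"}, {"color_code"}, lambda d: {"color_code": "C" + d["raw_color_value"]}),
    ({"color_code", "manufacturer", "year", "model"}, {"color_name"},
     lambda d: {"color_name": "navy blue"}),  # stand-in for a real lookup
]

def find_chain(known, wanted):
    """Breadth-first search for a sequence of methods that derives `wanted`
    starting from the property names in `known`."""
    queue = deque([(frozenset(known), [])])
    seen = {frozenset(known)}
    while queue:
        available, chain = queue.popleft()
        if wanted in available:
            return chain
        for inputs, outputs, fn in METHODS:
            if inputs <= available:
                new = frozenset(available | outputs)
                if new not in seen:
                    seen.add(new)
                    queue.append((new, chain + [(inputs, outputs, fn)]))
    return None

def resolve(data, wanted):
    """Run the discovered chain, injecting each method's outputs into `data`."""
    chain = find_chain(data.keys(), wanted)
    if chain is None:
        raise LookupError(f"no method chain found for {wanted!r}")
    for _, _, fn in chain:
        data.update(fn(data))
    return data[wanted]

car = {"serial_number": "XY12BL99", "manufacturer": "Acme", "year": 2014, "model": "Roadster"}
print(resolve(car, "color_name"))  # -> navy blue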
Thank you for your time
What you describe sounds like a somewhat roundabout way of doing Dependency Injection. Quote:
"Passing the service to the client, rather than allowing a client to
build or find the service, is the fundamental requirement of the
pattern."
Depending on what language you're using, there should be several Dependency Injection frameworks/libraries available.
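To illustrate the quote with a minimal constructor-injection sketch (Python; the class names are hypothetical): the client is handed the service rather than building or finding it.

from abc import ABC, abstractmethod

class DataSource(ABC):
    """Abstraction the client depends on; concrete sources are injected."""
    @abstractmethod
    def fetch(self, prop: str) -> str: ...

class SqlDataSource(DataSource):
    def fetch(self, prop: str) -> str:
        return f"<{prop} loaded from SQL>"

class Car:
    # The service is passed in; Car neither builds nor looks it up.
    def __init__(self, ds: DataSource):
        self._ds = ds
        self._data = {}

    def get_color(self) -> str:
        if "color" not in self._data:
            self._data["color"] = self._ds.fetch("color")
        return self._data["color"]

car = Car(SqlDataSource())  # swap in an API-backed source, a fake for tests, etc.
print(car.get_color())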

Laravel / Eloquent special relation type based on parsed string attribute

I have developed a system where various classes have attributes consisting of a custom formula. The formula can contain special tokens which refer to different types of object. For example, an object of class FruitSalad may have the following attribute:
$contents = "[A12] + [B76]";
In somewhat abstract terms, this means "add apple 12 to banana 76". It can also get significantly more complex than that with as many as 15 or 20 references to other objects involved in one formula.
I have a trait which parses formulae such as this, and each time it finds a reference to a model (e.g. "[A12]") it gets it from the database with A::find(12) and adds it to an array of component objects which can be used for other processes later on in the request.
So, in essence, it's a relationship. But instead of a pivot table to describe the relationship, there is a formula on the parent model which can include references to child models.
This is all working. Yay! But it's really inefficient because there are so many tiny queries to get single models as formulae are parsed. One request may quite easily result in hundreds of queries. Oops.
I see two potential options;
1. Get all my apples and bananas from the database at the start of the request and get them from an in-memory store instead of from the database when parsing a formula (is this the repository pattern??).
2. Create a custom relation type (something like hasManyFromFormula) which makes eager loading work so that the parsing becomes much simpler because the relevant apples and bananas would already be loaded into the parent model.
Is there a precedent for this? As for why I am doing it like this, it would be a bit tough to explain briefly, but suffice it to say that it is to support a highly configurable data retrieval system which supports as-yet unknown input data configurations.
Help!
Thanks,
Geoff
I'm not completely sure if it is the best solution, but in the end I created a new 'directory' class for basic components and set it up in the app service provider as a singleton. The constructor for the directory class loads all models of several relevant classes and makes them available as collections throughout the app.
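Not the actual Laravel code, but a rough, framework-agnostic sketch of that approach in Python (the token format follows the question; the loader callables stand in for bulk database queries):

import re

class Directory:
    """Preloads all rows of the referenced model types and serves them by key,
    so formula parsing never hits the database one row at a time."""
    def __init__(self, loaders):
        # loaders: {"A": callable returning {id: object}, "B": ...}
        self._store = {prefix: load() for prefix, load in loaders.items()}

    def get(self, prefix, obj_id):
        return self._store[prefix][obj_id]

TOKEN = re.compile(r"\[([A-Z])(\d+)\]")

def parse_formula(formula, directory):
    """Return the component objects referenced by tokens like [A12]."""
    return [directory.get(p, int(i)) for p, i in TOKEN.findall(formula)]

# Usage with stand-in loaders (a real app would fetch apples/bananas in bulk):
directory = Directory({
    "A": lambda: {12: "apple #12"},
    "B": lambda: {76: "banana #76"},
})
print(parse_formula("[A12] + [B76]", directory))  # ['apple #12', 'banana #76']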

Laravel Repository pattern and many to many relation

In our new project we decided to use a hexagonal architecture. We decided to use the repository pattern to gain more data-access abstraction, and we are using the command bus pattern as the service layer.
On our dashboard page we need a lot of data, and because of that we have to use a three-level many-to-many relation (user -> projects -> skills -> reviews), where skills should also be active (status = 1).
The problem arises here: where should I put this?
1. $userRepository->getDashboardData($userId);
2. $userRepository->getUser($userId)->withProjects()->withActiveSkills()->withReviews();
3. $user = $userRepository->getById($userId);
   $projects = $projectRepository->getByUserId($user->id);
   $skills = $skillRepository->getActiveSkillsByProjectsIds($projectIds);
In this case, I couldn't find the benefits of the repository pattern except coding to an interface, which can be achieved with a model interface.
I think solution 3 is perfect, but it adds a lot of work.
You have to decide (for example) from an object-oriented perspective if a "User" returned is one that has a collection of skills within it. If so, your returned user will already have those objects.
In the case of using regular objects, try to avoid child entities unless it makes good sense. Like, for example.. The 'User' entity is responsible for ensuring that the child entities play by the business rules. Prefer to use a different repository to select the other types of entities based on whatever other criteria.
Talking about a "relationship" in this way makes me feel like you're using ActiveRecord because otherwise they'd just be child objects. The "relationship" exists in the relational database. It only creeps into your objects if you're mixing database record / object like with AR.
In the case of using ActiveRecord objects, you might consider having specific methods on the repository to load the correctly configured member objects. $members->allIncludingSkills() or something perhaps. This is because you have to solve for N+1 when returning multiple entities. Then, you need to use eager-loading for the result set and you don't want to use the same eager loading configuration for every request.. Therefore, you need a way to delineate configurations per request.. One way to do this is to call different methods on the repository for different requests.
However, for me.. I'd prefer not to have a bunch of objects with just.. infinite reach.. For example.. You can have a $member->posts[0]->author->posts[0]->author->posts[0]->author->posts[0].
I prefer to keep things as 'flat' as possible.
$member = $members->withId($id);
$posts = $posts->writtenBy($member->id);
Or something like that. (just typing off the top of my head).
Nobody likes tons of nested arrays and ActiveRecord can be abused to the point where its objects are essentially arrays with methods and the potential for infinite nesting. So, while it can be a convenient way to work with data. I would work to prevent abusing relationships as a concept and keep your structures as flat as possible.
It's not only very possible to code without ORM 'relationship' functionality.. It's often easier.. You can tell that this functionality adds a ton of trouble because of just how many features the ORM has to provide in order to try to mitigate the pain.
And really, what's the point? It just keeps you from having to use the ID of a specific Member to do the lookup? Maybe it's easier to loop over a ton of different things I guess?
Repositories are really only particularly useful in the ActiveRecord case if you want to be able to test your code in isolation. Otherwise, you can create scopes and whatnot using Laravel's built-in functionality to prevent the need for redundant (and consequently brittle) query logic everywhere.
It's also perfectly reasonable to create models that exist SPECIFICALLY for the UI. You can have more than one ActiveRecord model that uses the same database table, for example, that you use just for a specific user-interface use-case. Dashboard for example. If you have a new use-case.. You just create a new model.
This, to me.. Is core to designing systems. Asking ourselves.. Ok, when we have a new use-case what will we have to do? If the answer is, sure our architecture is such that we just do this and this and we don't really have to mess with the rest.. then great! Otherwise, the answer is probably more like.. I have no idea.. I guess modify everything and hope it works.
There's many ways to approach this stuff. But, I would propose to avoid using a lot of complex tooling in exchange for simpler approaches / solutions. Repository is a great way to abstract away data persistence to allow for testing in isolation. If you want to test in isolation, use it. But, I'm not sure that I'm sold much on how ORM relationships work with an object model.
For example, do we have some massive Member object that contains the following?
- All comments ever left by that member
- All skills the member has
- All recommendations that the member has made
- All friend invites the member has sent
- All friends that the member has established
I don't like the idea of these massive objects that are designed to just be containers for absolutely everything. I prefer to break objects into bits that are specifically designed for use-cases.
But, I'm rambling. In short..
Don't abuse ORM relationship functionality.
It's better to have multiple small objects that are specifically designed for a use-case than a few large ones that do everything.
Just my 2 cents.
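To illustrate the 'flat', use-case-specific idea, here is a rough sketch (Python rather than PHP; the table, column, and class names are invented to match the dashboard example, not a real schema):

from dataclasses import dataclass
from typing import List, Protocol

@dataclass(frozen=True)
class DashboardRow:
    """A flat, read-only shape built specifically for the dashboard use case."""
    project_name: str
    skill_name: str
    review_count: int

class DashboardQueries(Protocol):
    # The interface the dashboard use case codes against.
    def active_skill_rows(self, user_id: int) -> List[DashboardRow]: ...

class SqlDashboardQueries:
    """One purpose-built query (joins done in SQL) instead of walking
    user -> projects -> skills -> reviews object by object."""
    def __init__(self, connection):
        self._conn = connection

    def active_skill_rows(self, user_id: int) -> List[DashboardRow]:
        sql = """
            SELECT p.name, s.name, COUNT(r.id)
            FROM projects p
            JOIN project_skill ps ON ps.project_id = p.id
            JOIN skills s ON s.id = ps.skill_id AND s.status = 1
            LEFT JOIN reviews r ON r.skill_id = s.id
            WHERE p.user_id = ?
            GROUP BY p.name, s.name
        """
        rows = self._conn.execute(sql, (user_id,)).fetchall()
        return [DashboardRow(*row) for row in rows]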

Achieve Multi-tenancy with GATE

I am using GATE in one of my applications and I have a few queries related to multi-tenancy. My requirements are as given below.
- I have a set of keywords specific to each user, and depending on which user is signed in, I need to initialise the gazetteer with the applicable set of keywords.
- At a given time there could be multiple users logged into my application, and I want to make sure that the multi-tenancy approach will not be inefficient.
- I don't want to store the keywords for each user in the .lst file(s), but rather store them in a DB (MongoDB) and inject them only at runtime.
I searched the web for a few samples, and though I found some thoughts on working with Processing Resources, I have no idea how the performance will be affected.
Your help is much appreciated.
Thanks in advance,
Sajith
That's an interesting use-case for a GATE gazetteer.
One thing I believe you should definitely do is add the user ID as a feature when you're creating the document. This way you'll be able to make your MongoDB query in a processing resource later on.
When you're processing the document, you have several options:
Create a custom PR which calls MongoDB and replicates the DefaultGazetteer code but with an overridden "init" method (or inherit from it or wrap it; I haven't looked in much detail into whether that's possible). Instead of the default init method, you should provide your list of keywords, then set the needed fields and call execute().
If you don't have too many keywords, create a custom PR (or a Groovy scripting PR) which calls MongoDB and does a simple regex search like the one in this thread (a rough sketch follows below).
They also suggest the stringsearch library in the comments. Then just use the start and end indices to create Lookup annotations on your own.
You said you don't want that, but still, several million words can be handled by both the default and the Hash gazetteer. You should be careful, though, as GATE documents can be very memory-intensive if you have too many annotations - in your case, Lookups for all user keywords.
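As a rough illustration of the second option (plain Python with pymongo rather than an actual GATE PR; the database, collection, and field names are assumptions):

import re
from pymongo import MongoClient

def user_keyword_spans(text, user_id):
    """Fetch the user's keywords from MongoDB and return (start, end, keyword)
    spans found in the document text."""
    # Assumed schema: db "app", collection "keywords", docs {"user_id": ..., "term": ...}
    client = MongoClient()
    terms = [doc["term"] for doc in client.app.keywords.find({"user_id": user_id})]
    if not terms:
        return []
    # One alternation pattern; escape terms so punctuation is matched literally.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

# Each (start, end) pair could then be turned into a Lookup annotation
# on the GATE document for that user.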
Hope this helps.

Which one do you prefer for Searching/Reporting: DataTable, DTO or Domain Class?

The project I am currently working on requires a lot of searching/filtering pages. For example, I have a complex search page to get Issues by date, category, unit, ...
The Issue domain class is complex and contains lots of value objects and child objects.
I am wondering how people deal with searching/filtering/reporting for the UI. As far as I know I have 3 options, but none of them makes me happy.
1.) Send parameters to a Repository/DAO to get a DataTable and bind the DataTable to UI controls, for example to an ASP.NET GridView:
DataTable dataTable = issueReportRepository.FindBy(specs);
// ...
grid.DataSource = dataTable;
grid.DataBind();
In this option I can simply bypass the domain layer and query the database for the given specs, and I don't have to get a fully constructed, complex domain object. There is no need for value objects, child objects, etc.; I get the data to be displayed in the UI in a DataTable directly from the database and show it in the UI.
But if I have to show a calculated field in the UI, like a method return value, I have to do this in the database because I don't have the full domain object. I have to duplicate logic, and there are DataTable problems like no IntelliSense, etc.
2.) Send parameters to a Repository/DAO to get DTOs and bind the DTOs to UI controls:
IList<IssueDTO> issueDTOs = issueReportRepository.FindBy(specs);
// ...
grid.DataSource = issueDTOs;
grid.DataBind();
This option is the same as the one above, but I have to create anemic DTO objects for every search page. Also, for different Issue search pages I have to show different parts of the Issue objects: IssueSearchDTO, CompanyIssueTO, MyIssueDTO, ...
3.) Send parameters to a real Repository class to get fully constructed domain objects:
IList<Issue> issues = issueRepository.FindBy(specs);
// Bind to grid...
I like Domain-Driven Design and patterns. There is no DTO or duplicated logic in this option, but I have to create lots of child and value objects that will not be shown in the UI. It also requires lots of joins to get the full domain object, with a performance cost for needless child objects and value objects.
I don't use any ORM tool. Maybe I could implement lazy loading by hand for this version, but it seems a bit overkill.
Which one do you prefer? Or am I doing it wrong? Are there any suggestions or a better way to do this?
I have a few suggestions, but of course the overall answer is "it depends".
First, you should be using an ORM tool or you should have a very good reason not to be doing so.
Second, implementing Lazy Loading by hand is relatively simple so in the event that you're not going to use an ORM tool, you can simply create properties on your objects that say something like:
private Foo _foo;
public Foo Foo
{
    get
    {
        if (_foo == null)
        {
            _foo = _repository.Get(id);
        }
        return _foo;
    }
}
Third, performance is something that should be considered initially but should not drive you away from an elegant design. I would argue that you should use (3) initially and only deviate from it if its performance is insufficient. This results in writing the least amount of code and having the least duplication in your design.
If performance suffers you can address it easily in the UI layer using Caching and/or in your Domain layer using Lazy Loading. If these both fail to provide acceptable performance, then you can fall back to a DTO approach where you only pass back a lightweight collection of value objects needed.
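For illustration, a small sketch of that DTO fallback (shown in Python for brevity; the table, column, and class names are hypothetical, and the same shape maps directly to C#):

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class IssueRowDTO:
    """Carries only what the grid displays, not the full Issue aggregate."""
    issue_id: int
    title: str
    category: str
    unit: str

class IssueReportRepository:
    def __init__(self, connection):
        self._conn = connection

    def find_by_category(self, category: str) -> List[IssueRowDTO]:
        # One flat, parameterized query instead of loading child/value objects.
        sql = "SELECT id, title, category, unit FROM issues WHERE category = ?"
        rows = self._conn.execute(sql, (category,)).fetchall()
        return [IssueRowDTO(*row) for row in rows]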
This is a great question and I wanted to provide my answer as well. I think the technically best answer is to go with option #3. It provides the ability to best describe and organize the data along with scalability for future enhancements to reporting/searching requests.
However, while this might be the overall best option, there is a huge cost, IMO, versus the other two options: the additional design time for all the classes and relationships needed to support the reporting needs (again, under the premise that no ORM tool is being used).
I struggle with this in a lot of my applications as well, and the reality is that #2 is the best compromise between time and design. Now, if you were asking about your business objects and all their needs, there is no question that a fully laid-out and properly designed model is important and there is no substitute. However, when it comes to reporting and searching, this to me is a different animal. #2 provides strongly typed data in the anemic classes and is not as primitive as hardcoded values in DataSets like #1, and it still greatly reduces the amount of time needed to complete the design compared to #3.
Ideally I would love to extend my object model to encompass all reporting needs, but sometimes the effort required to do this is so extensive, that creating a separate set of classes just for reporting needs is an easier but still viable option. I actually asked almost this identical question a few years back and was also told that creating another set of classes (essentially DTOs) for reporting needs was not a bad option.
So to wrap it up, #3 is technically the best option, but #2 is probably the most realistic and viable option when considering time and quality together for complex reporting and searching needs.
