How to get the number of clusters used in a Panel OLS regression in Python

I'm running a PanelOLS regression with time and entity fixed effects, and I'm also supposed to use clustered standard errors. I can obtain most of the results I need; however, I'm asked to report the number of clusters used in the regression, which linearmodels does not display for PanelOLS.
Thanks for your help!
Here is a screenshot of my results in case it might help. Thank you for your time.
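One way to think about it: the number of clusters is the number of distinct values of the clustering variable among the observations that actually enter the estimation. The following is only a minimal sketch with pandas, not something taken from linearmodels itself. It assumes clustering by entity, an (entity, time) MultiIndex, and that rows with missing values are excluded from the estimation sample; the column names `firm`, `year`, `y` and `x` are hypothetical.

```python
import pandas as pd

# Hypothetical toy panel; the index levels are (entity, time).
df = pd.DataFrame(
    {
        "firm": ["a", "a", "b", "b", "c", "c"],
        "year": [2019, 2020, 2019, 2020, 2019, 2020],
        "y": [1.0, 1.5, 2.0, None, 0.5, 0.7],
        "x": [0.1, 0.2, 0.3, 0.4, None, None],
    }
).set_index(["firm", "year"])

# Keep only the rows that can enter the regression (assumption: rows with
# a missing dependent or explanatory variable are dropped).
used = df.dropna(subset=["y", "x"])

# With entity clustering, each distinct entity left in the sample is one cluster.
n_clusters = used.index.get_level_values("firm").nunique()
print(n_clusters)
```

If the standard errors are clustered on some other variable, the same idea applies to that column, for example `used["cluster_col"].nunique()` with a hypothetical `cluster_col`.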

Related

How to detect which data are affecting the result of a feature with machine learning?

First, the scenario: I have a dataset with fields such as:
ProductID, ProductType, MachineID, MachineModel, MachineSpeed, RejectDate, RejectVolume etc.
I want to find which field(s) are responsible for increases in RejectVolume. In this scenario every product has a RejectVolume: it is always nonzero and takes continuous, varying values. Knowing the cause(s) would let me find a way to reduce RejectVolume.
Can you give me any ideas for creating the model?
Thank you.
You want to look at feature selection methods.
In this scenario you could start with linear regression using Lasso (L1 regularization) for feature selection. As you successively increase the Lasso regularization term, the weights of unimportant features shrink to zero and drop out of the model, leaving you with the features that have the most impact.
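To make the idea concrete, here is a rough sketch of a Lasso path on synthetic data. It hand-rolls a small coordinate-descent solver with NumPy only to stay dependency-light; in practice a library implementation such as scikit-learn's `Lasso` would be the usual choice. The data, the feature names and the chosen penalty values are all made up for illustration.

```python
import numpy as np

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave feature j's current contribution out.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            # Soft-thresholding: coefficients with |rho| <= alpha become exactly 0.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_scale[j]
    return w

# Synthetic data: only the first two of four columns actually drive the target.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise so one penalty fits all columns
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = y - y.mean()

names = ["strong", "weak", "noise_a", "noise_b"]  # hypothetical feature names
path = {}
for alpha in [0.01, 0.1, 1.0, 4.0]:
    path[alpha] = lasso_coordinate_descent(X, y, alpha)
    kept = [name for name, coef in zip(names, path[alpha]) if coef != 0.0]
    print(f"alpha={alpha}: surviving features = {kept}")
```

The features that survive the largest penalties are the candidates with the most impact. Categorical fields such as ProductType or MachineModel would need to be encoded numerically (for example one-hot) before a model like this can use them.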

Designing an algorithm for detecting anomalies and statistical significance for ordinal data using Python

Firstly, I would like to apologise for the detailed problem statement. Being a novice, I couldn't express it in fewer words.
Environment Setup Details:
To give some background, I work at a cloud company where we have multiple servers located across all continents. So, we have a hierarchy like this:
Several partitions
Each partition has 7 PoPs
Each PoP has multiple nodes, all set up with redundancy.
TURN servers connecting traffic to each node depending on the client location
Actual clients: iOS, Android, Mac, Windows, etc.
Every time a user uses our product/service, they leave a rating out of 5, 5 being outstanding. This data is stored in our databases, and we mine and analyse it to pinpoint the exact issue on any particular day.
For example, if users from Asia are giving more bad ratings on Tuesday this week than on a usual Tuesday, what factors could cause this? Is it something to do with the client app version, the server release, physical factors, loss, increased round-trip delay, etc.?
What we have done:
Till now we have been using visualization tools to track each of these metrics separately per day to see the trends and detect the issues manually.
But with the growing number of microservices, this is becoming more difficult day by day. Now we want to automate it using Python/pandas.
What I want to do:
If the ratings drop on a particular day/hour, I run the script and it should do all the manual work by going through all the combinations of all factors and listing the exact combinations which could have led to the drop.
The second step would be to check whether the drop is statistically significant, given that the number of ratings varies.
What I know:
I understand that I can do this using pandas by creating a dataframe for each predictor variable and trying to do it per variable.
And then I can apply tests such as the Mann-Whitney test for ordinal data.
What I need help with:
But I just wanted to know if there is a better way to do it? It is perfectly fine if there is a learning curve involved. I can learn and do it. I just wanted some help in choosing the right approach for this.
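As a rough illustration of the two steps described above (enumerate factor combinations, then test each one), here is a sketch in plain Python. Everything in it is an assumption made for the example: the record layout, the factor names `region` and `client`, the `baseline`/`today` split and the 0.05 threshold are hypothetical, and the Mann-Whitney test is hand-rolled with a normal approximation and tie correction (a library routine such as `scipy.stats.mannwhitneyu` would normally be used instead).

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def mann_whitney_p_lower(sample, baseline):
    """One-sided p-value (normal approximation, tie-corrected) for the
    hypothesis that `sample` tends to be lower than `baseline`."""
    n1, n2 = len(sample), len(baseline)
    pooled = sorted(sample + baseline)
    # Average rank per distinct value; ratings on a 1-5 scale have many ties.
    ranks, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    u1 = sum(ranks[v] for v in sample) - n1 * (n1 + 1) / 2
    n = n1 + n2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    var = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(var)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)

def flag_drops(records, factors, threshold=0.05):
    """Test every combination of factor values; return those whose ratings dropped."""
    flagged = []
    for size in range(1, len(factors) + 1):
        for combo in combinations(factors, size):
            groups = defaultdict(lambda: {"baseline": [], "today": []})
            for rec in records:
                key = tuple(rec[f] for f in combo)
                groups[key][rec["period"]].append(rec["rating"])
            for key, g in groups.items():
                if g["today"] and g["baseline"]:
                    p = mann_whitney_p_lower(g["today"], g["baseline"])
                    if p < threshold:
                        flagged.append((combo, key, p))
    return sorted(flagged, key=lambda item: item[2])

# Hypothetical data: Android ratings collapse today, everything else is unchanged.
good = [5, 4, 5, 4, 5, 3, 4, 5]
bad = [2, 1, 3, 2, 1, 2, 3, 1]
records = []
for region in ["asia", "europe"]:
    for client in ["ios", "android"]:
        for rating in good:
            records.append({"region": region, "client": client,
                            "period": "baseline", "rating": rating})
        for rating in (bad if client == "android" else good):
            records.append({"region": region, "client": client,
                            "period": "today", "rating": rating})

flagged = flag_drops(records, ["region", "client"])
for combo, key, p in flagged:
    print(dict(zip(combo, key)), f"p={p:.4g}")
```

Note that testing many combinations inflates the number of false positives, so with a realistic number of factors the threshold would need a multiple-testing correction (for example Bonferroni).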

Odd correlated posterior traceplots in multilevel model

I'm trying out PyMC3 with a simple multilevel model. When using both fake and real data the traces of the random effect distributions move with each other (see plot below) and appear to be offsets of the same trace. Is this an expected artifact of NUTS or an indication of a problem with my model?
Here is a traceplot on real data:
Here is an IPython notebook of the model and the functions used to create the fake data. Here is the corresponding gist.
I would expect this to happen in accordance with the group mean distribution on alpha. If you think about it, if the group mean shifts around it will influence all alphas to the same degree. You could confirm this by doing a scatter plot of the group mean trace against some of the alphas. Hierarchical models are in general difficult for most samplers because of these complex interdependencies between group mean and variance and the individual RVs. See http://arxiv.org/abs/1312.0906 for more information on this.
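The check suggested above can be sketched numerically as well as with a scatter plot. The arrays below are synthetic stand-ins, not output from the model in the question: `mu_trace` plays the role of the group mean trace and `alpha_trace` the per-group alphas, and both names are hypothetical. With a real PyMC3 trace, the same arrays would come from indexing the trace by variable name.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_groups = 2000, 5

# Synthetic stand-ins: a slowly wandering group mean, and alphas that are
# the group mean plus independent group-level noise.
mu_trace = np.cumsum(rng.normal(scale=0.05, size=n_samples))
alpha_trace = mu_trace[:, None] + rng.normal(scale=0.1, size=(n_samples, n_groups))

# Correlation of each alpha's trace with the group mean trace.
corr = np.array(
    [np.corrcoef(mu_trace, alpha_trace[:, g])[0, 1] for g in range(n_groups)]
)
print(np.round(corr, 3))
```

Correlations close to 1 for every group would indicate that the alphas are moving together because the group mean is moving, which is the pattern described above.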
In your specific case, the trace doesn't look too worrisome to me, especially after iteration 1000. So you could probably just discard those as burn-in and keep in mind that you have some sampling noise but probably got the right posterior overall. In addition, you might want to perform a posterior predictive check to see if the model can reproduce the patterns in your data you are interested in.
Alternatively, you could try to estimate a better hessian using pm.find_hessian(), e.g. https://github.com/pymc-devs/pymc/blob/3eb2237a8005286fee32776c304409ed9943cfb3/pymc/examples/hierarchical.py#L51
I also found this paper which looks interesting (haven't read it yet but might be cool to implement in PyMC3): arxiv-web3.library.cornell.edu/pdf/1406.3843v1.pdf

Cluster analysis in a Hadoop/MapReduce environment

We are currently trying to create some very basic personas based on our user database (a few million profiles). The goal at this stage is to find out what the characteristics of our users are, for example what they look like and what they are looking for, and to create several "typical" user profiles.
I believe the best way to achieve this would be to run a cluster analysis in order to find similarities among users.
The big roadblock however is how to get there. We are tracking our data in a Hadoop environment and I am being told that this could be potentially achieved with our tools.
I have familiarised myself with the theory of the topic and know that it can be done for example in SPSS (quite hard to use and limited to samples of large data sets).
The big question: Is it possible to perform one or several types of cluster analysis in a Hadoop environment and then visualise the results as in SPSS? It is my understanding that we would need to run several types of analysis in order to find the best way to cluster the data, including different distance measures between the clusters.
I have not found any information on the internet with regards to this, so I wonder if this is possible at all, without a major programming effort (meaning literally implementing for example all the standard tools available in SPSS: Dendrograms, the different result tables and cluster graphs etc.).
Any input would be much appreciated. Thanks.

Use clustering for prediction in Weka

Can I use clustering (e.g. using k-means) to make predictions in Weka?
I have some data from a survey on presidential elections. I have answers from questionnaires (numeric attributes), and I have one attribute that is the answer to the question "Who are you going to vote for?" (1, 2 or 3).
I make predictions using some classifiers (e.g. Bayes) in Weka. My results are based on that answer (vote intention) and I have about 60% recall (rate of correct predictions).
I understand that clustering is a different thing, but can I use clustering to make predictions? I've already tried this, but I've realized that clustering always selects its own centroids and does not use my vote intention question.
The author of "Explain results of K-means" must be a colleague of yours. He seems to use the same data set, and it would be helpful if we could all have a look at the data.
In general, clustering is not classification or prediction.
However, you can try to improve your classification by using the information gained from clustering. Two such techniques:
substitute your data set with the cluster centers, and use this for classification (at least if your clusters are reasonably pure with respect to the class label!)
train a separate classifier on each cluster, and build an ensemble out of them (in particular, if your clusters are inhomogeneous)
But I believe your understanding of classification and clustering is not yet advanced enough to try these out. You need to handle them carefully, and know your data very well.
Yes. You can use the Weka interface to do prediction via clustering. First, load your training data using the Preprocess tab. Then go to the Classify tab; under Classifier, click Choose and, under meta, choose ClassificationViaClustering. The default clustering algorithm used by Weka is SimpleKMeans, but you can change that by clicking on the options string (i.e. the text next to the Choose button). Weka will display a dialog box; click Choose and a set of clustering algorithms will be listed to choose from (e.g. EM). After that, you can run cross-validation or supply a test set by clicking Set, as you normally do when you use Weka for classification.
Hope this will help anyone having the same question!
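For readers who want to see the idea behind prediction via clustering outside the GUI, here is a small NumPy sketch: cluster without looking at the class attribute, label each cluster with the majority class of its training members, then predict by nearest cluster. This is only a conceptual illustration on made-up data, not Weka's implementation; the k-means routine, its farthest-point initialisation and the blob data are assumptions of the example.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for every row of X."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        assign = nearest_center(X, centers)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return centers

# Made-up data: three well-separated groups standing in for vote intentions 1, 2, 3.
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([m + rng.normal(scale=0.5, size=(60, 2)) for m in blob_means])
y = np.repeat([1, 2, 3], 60)
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_train, y_train, X_test, y_test = X[:120], y[:120], X[120:], y[120:]

# 1) Cluster the training data; the class attribute is not used here.
centers = kmeans(X_train, k=3)
assign = nearest_center(X_train, centers)

# 2) Map each cluster to the majority class among its training members.
cluster_to_class = np.array(
    [np.bincount(y_train[assign == c]).argmax() for c in range(3)]
)

# 3) Predict: find the nearest cluster, then look up its class.
pred = cluster_to_class[nearest_center(X_test, centers)]
accuracy = (pred == y_test).mean()
print(accuracy)
```

On real questionnaire data the clusters will rarely line up this cleanly with the class attribute, which is why the cluster purity caveat mentioned above matters.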
