Can the algorithm in "partykit" ensure causal explanation? - party

Can the exposure-response relationship estimated within each subgroup, generated by using "partykit", have causal interpretation?

Yes, but it may need additional work.
If the data come from a randomized controlled trial, then fitting a treatment-response model in every subgroup has the same type of causal interpretation it has in the full sample.
However, in observational data, it is necessary to first estimate the propensity of treatment (i.e., the probability of treatment given the regressors) and use that in the treatment-response model in every subset. This is also known as "local centering" of the treatment indicator. Additionally, local centering of the dependent response variable may improve the performance of the model further.
See Dandl et al. (2022) for more details and comparisons. For the setup in randomized controlled trials, there is also a dedicated interface package model4you that facilitates fitting "personalized" treatment-response models using trees and random forests. See the Seibold et al. publications for details on the software and underlying methods.
Susanne Dandl, Torsten Hothorn, Heidi Seibold, Erik Sverdrup, Stefan Wager, Achim Zeileis (2022). “What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?.” arXiv:2206.10323, arXiv.org E-Print Archive. doi:10.48550/arXiv.2206.10323
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2019). “model4you: An R Package for Personalised Treatment Effect Estimation.” Journal of Open Research Software, 7(17), 1-6. doi:10.5334/jors.219
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2018). “Individual Treatment Effect Prediction for Amyotrophic Lateral Sclerosis Patients.” Statistical Methods in Medical Research, 27(10), 3104-3125. doi:10.1177/0962280217693034
Heidi Seibold, Achim Zeileis, Torsten Hothorn (2016). “Model-Based Recursive Partitioning for Subgroup Analyses.” The International Journal of Biostatistics, 12(1), 45-63. doi:10.1515/ijb-2015-0032

Related

How to understand F-test based lmfit confidence intervals

The excellent lmfit package lets one to run nonlinear regression. It can report two different conf intervals - one based on the covarience matrix the other using a more sophisticated tecnique based on an F-test. Details can be found on the doc. I would like to understand he reasoning behind this technique in depth. Which topics should i read about? Note: i have sufficient stats knowledge
F stats and other associated methods for obtaining confidence intervals are far superior to a simple estimation of te co variance matrix for non-linear models (and others).
The primary reason for this is the lack of assumptions about the Gaussian nature of error when using these methods. For non-linear systems, confidence intervals can (they don't have to be) be asymmetric. This means that the parameter value can effect the error surface differently and therefore the one, two, or three sigma limits have different magnitudes in either direction from the best fit.
The analytical ultracentrifugation community has excellent articles involving error analysis (Tom Laue, John J. Correia, Jim Cole, Peter Schuck are some good names for article searches). If you want a good general read about proper error analysis, check out this article by Michael Johnson:
http://www.researchgate.net/profile/Michael_Johnson53/publication/5881059_Nonlinear_least-squares_fitting_methods/links/0deec534d0d97a13a8000000.pdf
Cheers!

When to use a certain Reinforcement Learning algorithm?

I'm studying Reinforcement Learning and reading Sutton's book for a university course. Beside the classic PD, MC, TD and Q-Learning algorithms, I'm reading about policy gradient methods and genetic algorithms for the resolution of decision problems.
I have never had experience before in this topic and I'm having problems understanding when a technique should be preferred over another. I have a few ideas, but I'm not sure about them. Can someone briefly explain or tell me a source where I can find something about typical situation where a certain methods should be used? As far as I understand:
Dynamic Programming and Linear Programming should be used only when the MDP has few actions and states and the model is known, since it's very expensive. But when DP is better than LP?
Monte Carlo methods are used when I don't have the model of the problem but I can generate samples. It does not have bias but has high variance.
Temporal Difference methods should be used when MC methods need too many samples to have low variance. But when should I use TD and when Q-Learning?
Policy Gradient and Genetic algorithms are good for continuous MDPs. But when one is better than the other?
More precisely, I think that to choose a learning methods a programmer should ask himlself the following questions:
does the agent learn online or offline?
can we separate exploring and exploiting phases?
can we perform enough exploration?
is the horizon of the MDP finite or infinite?
are states and actions continuous?
But I don't know how these details of the problem affect the choice of a learning method.
I hope that some programmer has already had some experience about RL methods and can help me to better understand their applications.
Briefly:
does the agent learn online or offline? helps you to decide either using on-line or off-line algorithms. (e.g. on-line: SARSA, off-line: Q-learning). On-line methods have more limitations and need more attention to pay.
can we separate exploring and exploiting phases? These two phase are normally in a balance. For example in epsilon-greedy action selection, you use an (epsilon) probability for exploiting and (1-epsilon) probability for exploring. You can separate these two and ask the algorithm just explore first (e.g. choosing random actions) and then exploit. But this situation is possible when you are learning off-line and probably using a model for the dynamics of the system. And it normally means collecting a lot of sample data in advance.
can we perform enough exploration? The level of exploration can be decided depending on the definition of the problem. For example, if you have a simulation model of the problem in memory, then you can explore as you want. But real exploring is limited to amount of resources you have. (e.g. energy, time, ...)
are states and actions continuous? Considering this assumption helps to choose the right approach (algorithm). There are both discrete and continuous algorithms developed for RL. Some of "continuous" algorithms internally discretize the state or action spaces.

Software metrics to identify developers by their coding style

Traditionional software metrics deal with quality of software. I'm looking for metrics that can be used to identify developers by their code, in the same vein as plagiarism software and stylometry can be used to identify authors by their writing style. I can imagine that certain existing metrics can be used here as well, such as comment ratio. I can also imagine metrics that would irrelevant from a quality point of view, such as the (over)use of certain methods or design patterns, average length of variable names, etc.
I'm interested either in a pointer to a collection of such metrics or studies, or individual metrics. They may be language-agnostic or related to a language or programming paradigm.
I want to use it to understand and analyze different coding styles, not to detect plagiarism.
I see there are already a couple of studies that looked into this. They might help.
Kothari, J., Shevertalov, M., Stehle, E., Mancoridis, S., "A probabilistic approach to source code authorship identification", In Proceedings of the International Conference on Information Technology, pp.243-248, IEEE, 2007.
Available online here
Quoting from the abstract:
We begin by computing a set of metrics to build profiles for a population of known authors using code samples that are verified to be authentic. We then compute metrics on unidentified source code to determine the closest matching profile. [...] In our case study we are able
to determine authorship with greater than 70% accuracy in choosing the single nearest match and greater than 90% accuracy in choosing the top three ordered nearest matches.
Shevertalov, M., Kothari, J., Stehle, E., Mancoridis, S., "On the use of discretized source code metrics for author identification", In Proceedings of the 1st International Symposium on Search Based Software Engineering, pp.69-78, IEEE, 2009.
Available online here, this is a follow-up of the previous study.
Lange, R., Mancoridis, S., "Using code metric histograms and genetic algorithms to perform author identification for software forensics", In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp.2082-2089, ACM, 2007.
Available online here
This is also related to the first reference (common author), and discusses the metrics in more detail. Again quoting from the abstract:
Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics.
You can also use Google Scholar for other references, and for finding other papers based on the ones above (using the "cited by" option).
If you're looking for potential metrics, you might try reviewing some coding standards. Since these dictate a particular style, it follows that the things they talk about (spacing, placement of braces, identifier lengths, mandatory comments, etc.) are things that might be used to identify developers from their code.
Also, if you're interested in .NET code, you might find NDepend to be a useful tool. It enables you to run queries against a code base, and supports 82 metrics.

Comparing two English strings for similarities

So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic
decomposition of buried dead organisms. The age of the organisms and
their resulting fossil fuels is typically millions of years, and
sometimes exceeds 650 million years. The fossil fuels, which contain
high percentages of carbon, include coal, petroleum, and natural gas.
Fossil fuels range from volatile materials with low carbon:hydrogen
ratios like methane, to liquid petroleum to nonvolatile materials
composed of almost pure carbon, like anthracite coal. Methane can be
found in hydrocarbon fields, alone, associated with oil, or in the
form of methane clathrates. It is generally accepted that they formed
from the fossilized remains of dead plants by exposure to heat and
pressure in the Earth's crust over millions of years. This biogenic
theory was first introduced by Georg Agricola in 1556 and later by
Mikhail Lomonosov in the 18th century.
Second:
Fossil fuel reforming is a method of producing hydrogen or other
useful products from fossil fuels such as natural gas. This is
achieved in a processing device called a reformer which reacts steam
at high temperature with the fossil fuel. The steam methane reformer
is widely used in industry to make hydrogen. There is also interest in
the development of much smaller units based on similar technology to
produce hydrogen as a feedstock for fuel cells. Small-scale steam
reforming units to supply fuel cells are currently the subject of
research and development, typically involving the reforming of
methanol or natural gas but other fuels are also being considered such
as propane, gasoline, autogas, diesel fuel, and ethanol.
That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.
In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!
You can also have a look on Latent Dirichlet Allocation (LDA) model in machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion using a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by looking into the similarity of the hidden topics vector, you can find whether a given two paragraphs are related or not.
Of course, the baseline is to not use the LDA, and instead use the term frequencies (augmented with tf/idf) to measure similarities (vector space model).

What are some of the applications of Order Statistics Algorithms?

I am watching the MIT lectures and Eric Demaine says that they discussed some of the applications of Order Statistics Algorithms. I was wondering if the SO community would help me figure out some of the applications of the selection algorithms.
Finding the median is a common application of such algorithm. E.g. I've used it in image processing for the median filter. Min, max, k-NN also use order statistic algorithms, so that's another application.
Here are some other applications that I can think of in addition to what Jacob said:
Most of the services care about 95-th or 99-th percentile of latency rather than mean 'coz they want to keep most of the users happy.
In machine-learning if you want to convert a continuous valued feature into Boolean features by bucketing it, one common approach is to partition it by percentile so that the cardinality of each Boolean feature is somewhat similar.
There are probably hundreds of applications of order statistics. The algorithms to compute them can change based on what kind of scaling you need and what approximations you can tolerate. If you can give more context on what Eric Demaine says, probably you can get better answers.

Resources