How does one arrive at "fair" priors for spatial and non-spatial effects

A basic BYM model may be written as
$$y_i \sim \text{Poisson}(E_i \rho_i), \qquad \log \rho_i = \beta_0 + s_i + u_i,$$
sometimes with covariates, but that doesn't matter much here, where $s$ are the spatially structured effects and $u$ the unstructured effects over the areal units.
In Congdon (2020) they refer to the fair prior on these as one in which
$$\sigma_u \approx \frac{\sigma_s}{0.7\sqrt{\bar{m}}},$$
where $\bar{m}$ is the average number of neighbors in the adjacency matrix.
It is defined similarly (in terms of precision, I think) in Bernardinelli et al. (1995).
However, for the gamma distribution, scaling appears to affect only the scale term: if $\tau \sim \text{Gamma}(a, b)$ with shape $a$ and rate $b$, then $c\,\tau \sim \text{Gamma}(a, b/c)$, so only $b$ changes.
I haven't been able to find a worked example of this, and I don't understand how the priors are arrived at, for example, in the well-known lip cancer data.
I am hoping someone could help me understand how these are reached in this setting, even in the simple case of two gamma hyperpriors.
References
Congdon, P. D. (2019). Bayesian Hierarchical Models: With Applications Using R (2nd ed.). Chapman and Hall/CRC.
Bernardinelli, L., Clayton, D. and Montomoli, C. (1995). Bayesian estimates of disease maps: How important are priors? Statistics in Medicine, 14, 2411-2431.


How is IBM Watson Tradeoff Analytics any different from simple constrained decision making?

I am continuously astounded by the technological genius of the IBM Watson package. The tools do things from recognizing the subjects in images to extracting the emotion in a letter, and they're amazing. And then there's Tradeoff Analytics. In their Nests demo, you select a state and then a series of constraints (price must be between W and X, square footage must be between Y and Z, there must be Insured Escrow financing available, etc.) and they rank the houses based on how well they fit your constraints.
It would seem that all Tradeoff Analytics does is run a simple query on the order of:
SELECT * FROM House WHERE price >= W AND price <= X AND square_footage >= Y
AND square_footage <= Z AND ...
Am I not understanding Tradeoff Analytics correctly? I have tremendous respect for the people over at IBM that built all of these amazing tools, but Tradeoff Analytics seems like simple constrained decision making, which appears in any Intro to Programming course as you're learning if statements. What am I missing?
As #GuyGreer pointed out, the service indeed uses Pareto Optimization, which is quite different from simple constraints.
For example:
Say you have three houses:

            Sqr Footage    Price
House A     6000           1000K
House B     9000           750K
House C     8000           800K
Now say your constraints are Sqr Footage > 5000 and Price < 900K.
Then you are left with House B and House C.
Tradeoff Analytics, however, will return only House B,
since according to Pareto, given your objectives of Price and Footage,
House B dominates House C: it has more footage and is cheaper.
Obviously, this is a made-up example, and in real life there are more objectives (attributes) that you take into account when buying a house.
The idea with Pareto is to find the Pareto frontier; a small sketch of the dominance check follows below.
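To make the dominance idea concrete, here is a minimal Python sketch (my own illustration, not the actual Tradeoff Analytics service) that applies the constraints and then keeps only the non-dominated houses from the toy data above:

# Toy data from the example above.
houses = {
    "House A": {"sqft": 6000, "price": 1000},
    "House B": {"sqft": 9000, "price": 750},
    "House C": {"sqft": 8000, "price": 800},
}

def dominates(a, b):
    # a dominates b if it is at least as good on both objectives
    # (more footage, lower price) and strictly better on at least one.
    return (a["sqft"] >= b["sqft"] and a["price"] <= b["price"]
            and (a["sqft"] > b["sqft"] or a["price"] < b["price"]))

# Hard constraints first (sqft > 5000, price < 900K), then Pareto filtering.
feasible = {n: h for n, h in houses.items()
            if h["sqft"] > 5000 and h["price"] < 900}
frontier = [n for n, h in feasible.items()
            if not any(dominates(o, h) for m, o in feasible.items() if m != n)]
print(frontier)  # ['House B'] -- House C is dominated by House B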
Tradeoff Analytics adds home-grown algorithms on top of Pareto Optimization to give you more insight into the tradeoffs.
Finally, the service is accompanied by a client-side widget that uses a novel method for visualizing Pareto frontiers, which is a hard problem in its own right given that such frontiers are multi-dimensional.
The page you link to says they use Pareto Optimisation, which tries to optimise all the parameters to arrive at a Pareto-optimal solution - a solution, or set of solutions, for when you can't optimise every individual parameter and so have to settle for some sub-optimal ones.
Rather than just finding anything that matches the criteria, they are trying to find some sort of optimal solution(s) given the constraints. That's how it's different from simple constrained decision-making.
Note I'm basing this answer completely off of their statement:
The service uses a mathematical filtering technique called “Pareto Optimization,”...
and what I've read about Pareto problems. I have no experience with this technology or Pareto problems myself.

How to understand F-test based lmfit confidence intervals

The excellent lmfit package lets one run nonlinear regression. It can report two different confidence intervals - one based on the covariance matrix, the other using a more sophisticated technique based on an F-test. Details can be found in the docs. I would like to understand the reasoning behind this technique in depth. Which topics should I read about? Note: I have sufficient stats knowledge.
F statistics and other associated methods for obtaining confidence intervals are far superior to a simple estimate from the covariance matrix for non-linear models (and others).
The primary reason for this is the lack of assumptions about the Gaussian nature of the error when using these methods. For non-linear systems, confidence intervals can be (but don't have to be) asymmetric. This means that the parameter value can affect the error surface differently in each direction, and therefore the one-, two-, or three-sigma limits can have different magnitudes on either side of the best fit.
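As a rough sketch of the profile/F-test idea (my own illustration of the general technique, not lmfit's internal code): fix one parameter at trial values, re-fit the remaining parameters, and ask via an F-test whether the resulting increase in chi-square is significant. Scanning trial values on either side of the best fit yields the (possibly asymmetric) interval.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * np.exp(-0.4 * x) + rng.normal(scale=0.05, size=x.size)

def chi2(theta):
    a, b = theta
    return np.sum((y - a * np.exp(-b * x)) ** 2)

best = optimize.minimize(chi2, x0=[1.0, 1.0])
chi2_0, (a_hat, b_hat) = best.fun, best.x
n, p = x.size, 2                       # data points, free parameters

def boundary_level(b_fixed):
    # Confidence level at which b_fixed would sit on the interval boundary:
    # re-fit 'a' with 'b' held fixed, then compare chi-squares via F.
    chi2_f = optimize.minimize_scalar(lambda a: chi2([a, b_fixed])).fun
    F = (chi2_f / chi2_0 - 1.0) * (n - p) / 1.0    # one parameter fixed
    return stats.f.cdf(F, 1, n - p)

# The 1-sigma bound is roughly where this level reaches 0.683; scanning
# b_fixed on both sides of b_hat gives possibly asymmetric limits.
for b_trial in np.linspace(b_hat, b_hat + 0.05, 6):
    print(f"b = {b_trial:.4f}   level = {boundary_level(b_trial):.3f}")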
The analytical ultracentrifugation community has excellent articles involving error analysis (Tom Laue, John J. Correia, Jim Cole, Peter Schuck are some good names for article searches). If you want a good general read about proper error analysis, check out this article by Michael Johnson:
http://www.researchgate.net/profile/Michael_Johnson53/publication/5881059_Nonlinear_least-squares_fitting_methods/links/0deec534d0d97a13a8000000.pdf
Cheers!

Finding an optimum learning rule for an ANN

How do you find an optimum learning rule for a given problem, say a multi-category classification?
I was thinking of using genetic algorithms, but I know there are issues surrounding performance. I am looking for real-world examples where you have not used the textbook learning rules, and how you found those learning rules.
Nice question BTW.
Classification algorithms can be compared along many characteristics, such as:
What the algorithm strongly prefers (or what type of data is most suitable for it).
Training overhead (does it take a lot of time to train?).
The data size at which it is effective (large, medium, or small).
The complexity of analysis it can deliver.
Therefore, for your problem of classifying multiple categories, I would use online logistic regression (trained with SGD), because it works well with small to medium data sizes (fewer than tens of millions of training examples) and it's really fast. A quick sketch of what that looks like is below.
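For illustration only (the answer doesn't prescribe a library; this sketch assumes scikit-learn and synthetic data), online logistic regression with SGD could look like this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic multi-class data standing in for your real features.
X, y = make_classification(n_samples=10_000, n_features=20,
                           n_informative=10, n_classes=4, random_state=0)

clf = SGDClassifier(loss="log_loss", random_state=0)  # "log" on older sklearn
classes = np.unique(y)

# partial_fit streams the data in chunks -- the "online" part.
for start in range(0, len(X), 1_000):
    chunk = slice(start, start + 1_000)
    clf.partial_fit(X[chunk], y[chunk], classes=classes)

print("training accuracy:", clf.score(X, y))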
Another example:
Let's say that you have to classify a large amount of text data. Then Naive Bayes is your baby, because it strongly prefers text analysis. SVM and SGD are faster and, in my experience, easier to train, but those methods are best applied when the data size is medium or small rather than large.
In general, any data mining person will ask themselves the four aforementioned questions before starting any ML or simple mining project.
After that, you have to measure AUC, or any other relevant metric, to see how well you have done, because you might use more than one classifier in a single project. Sometimes, when you think you have found your perfect classifier, the results turn out to be poor under some measurement techniques, so you start checking your questions again to find where you went wrong.
Hope that I helped.
When you input a vector x to the net, the net will give an output that depends on all the weights (vector w). There will be an error between the output and the true answer. The average error e is a function of w, say e = F(w). Suppose you have a one-layer, two-weight network; then F is a surface over the two-dimensional weight space.
When we talk about training, we are actually talking about finding the w which makes e minimal. In other words, we are searching for the minimum of a function. To train is to search.
So, your question is how to choose the search method. My suggestion would be: it depends on what the surface of F(w) looks like. The wavier it is, the more randomized a method you should use, because a simple method based on gradient descent has a bigger chance of leaving you trapped in a local minimum - so you lose the chance to find the global minimum. On the other hand, if the surface of F(w) looks like one big pit, then forget the genetic algorithm; simple backpropagation, or anything based on gradient descent, will do very well in this case.
You may ask: how can I know what the surface looks like? That's largely a matter of experience. Or you might randomly sample some values of w and calculate F(w) to get an intuitive view of the surface.
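For instance (a toy sketch with made-up data, just to show the sampling idea), one could probe the error surface of a tiny one-layer network like this:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)   # toy labels

def F(w):
    # Mean squared error of a one-layer sigmoid unit with weights w.
    out = 1.0 / (1.0 + np.exp(-X @ w))
    return np.mean((out - y) ** 2)

# Sample random weight vectors and look at the spread of F(w).  A wide,
# bumpy spread hints that randomized search may help; a smooth single
# funnel suggests plain gradient descent will do fine.
samples = rng.uniform(-5, 5, size=(1000, 2))
errors = np.array([F(w) for w in samples])
print("min / median / max of F(w):",
      errors.min(), np.median(errors), errors.max())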

Comparing two English strings for similarities

So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic
decomposition of buried dead organisms. The age of the organisms and
their resulting fossil fuels is typically millions of years, and
sometimes exceeds 650 million years. The fossil fuels, which contain
high percentages of carbon, include coal, petroleum, and natural gas.
Fossil fuels range from volatile materials with low carbon:hydrogen
ratios like methane, to liquid petroleum to nonvolatile materials
composed of almost pure carbon, like anthracite coal. Methane can be
found in hydrocarbon fields, alone, associated with oil, or in the
form of methane clathrates. It is generally accepted that they formed
from the fossilized remains of dead plants by exposure to heat and
pressure in the Earth's crust over millions of years. This biogenic
theory was first introduced by Georg Agricola in 1556 and later by
Mikhail Lomonosov in the 18th century.
Second:
Fossil fuel reforming is a method of producing hydrogen or other
useful products from fossil fuels such as natural gas. This is
achieved in a processing device called a reformer which reacts steam
at high temperature with the fossil fuel. The steam methane reformer
is widely used in industry to make hydrogen. There is also interest in
the development of much smaller units based on similar technology to
produce hydrogen as a feedstock for fuel cells. Small-scale steam
reforming units to supply fuel cells are currently the subject of
research and development, typically involving the reforming of
methanol or natural gas but other fuels are also being considered such
as propane, gasoline, autogas, diesel fuel, and ethanol.
That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.
In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!
You can also have a look at the Latent Dirichlet Allocation (LDA) model in machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion on a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by looking at the similarity of the hidden topic vectors you can find whether two given paragraphs are related or not.
Of course, the baseline is not to use LDA at all, and instead use term frequencies (weighted with tf-idf) to measure similarity (the vector space model).
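As a sketch of that baseline (using scikit-learn purely as an example; any tf-idf implementation would do), the two paragraphs above could be compared like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraph_1 = "Fossil fuels are fuels formed by natural processes ..."
paragraph_2 = "Fossil fuel reforming is a method of producing hydrogen ..."

# Build tf-idf vectors for both paragraphs and compare them.
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform([paragraph_1, paragraph_2])

similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {similarity:.3f}")   # closer to 1 = more similar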

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., the Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (each modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, it will also let you put a p-value on a point being an outlier if you need (or want) that capability at some point.
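For readers who prefer Python, here is a rough sketch of the same GMM + BIC recipe using scikit-learn (not the MCLUST package recommended above, and with made-up data):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
history = np.concatenate([rng.normal(10, 1, 200),
                          rng.normal(40, 2, 165)]).reshape(-1, 1)

# Fit GMMs with 1..8 components and keep the one with the lowest BIC.
models = [GaussianMixture(n_components=k, random_state=0).fit(history)
          for k in range(1, 9)]
best = min(models, key=lambda m: m.bic(history))
print("chosen number of clusters:", best.n_components)

# Score a new day's number: a very low log-density under the fitted
# mixture flags it as a probable outlier.
new_point = np.array([[75.0]])
print("log-density of new point:", best.score_samples(new_point)[0])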
You could try prediction by line fitting using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you could calculate the standard deviation of the residuals around the line.
If the novel point lies within the trend line +- the standard deviation, it should not be regarded as an abnormality.
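A minimal sketch of that recipe (assuming numpy and synthetic daily values, so the numbers are illustrative only):

import numpy as np

days = np.arange(365)
values = 100 + 0.5 * days + np.random.default_rng(2).normal(0, 3, 365)

# Fit the trend line and measure the spread of the residuals around it.
slope, intercept = np.polyfit(days, values, deg=1)
residual_sd = np.std(values - (slope * days + intercept))

def is_abnormal(day, value, n_sd=1.0):
    # True if the new value lies more than n_sd residual-SDs off the trend.
    return abs(value - (slope * day + intercept)) > n_sd * residual_sd

print(is_abnormal(365, 100 + 0.5 * 365))        # on trend  -> False
print(is_abnormal(365, 100 + 0.5 * 365 + 20))   # far off   -> True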
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning. This is a family of machine learning techniques that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in any of the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers that change direction abruptly" => compute u_{n+1} - u_n, and expect it to have constant sign, or to fall within some range. You may want to keep this flexible and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should follow some distribution P(a, b) (Uniform([a, b]), or Beta(a, b), possibly something more complex), you put prior laws on a and b, and you adjust them as successive information arrives. Then, the posterior likelihood of the information provided by the last point added should give you some insight into whether it is normal or not. The relative entropy between the posterior and prior laws at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
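A toy sketch of that idea for one derived quantity - the sign of u_{n+1} - u_n - modeled as Bernoulli(theta) with a Beta prior (an assumption of mine for illustration; the answer leaves the distribution open):

import numpy as np
from scipy.special import betaln, digamma

a0, b0 = 1.0, 1.0                          # Beta(1, 1) prior on theta
rng = np.random.default_rng(3)
diffs = np.abs(rng.normal(size=364))       # toy: an always-ascending series
signs = (diffs > 0).astype(int)

# Conjugate update: posterior is Beta(a0 + #ups, b0 + #downs).
a1 = a0 + signs.sum()
b1 = b0 + len(signs) - signs.sum()

# Posterior predictive probability of the newest observation (here: a drop).
p_up = a1 / (a1 + b1)
new_diff_is_up = 0
p_new = p_up if new_diff_is_up else 1.0 - p_up
print(f"predictive probability of the new observation: {p_new:.4f}")

# Relative entropy (KL divergence) between posterior and prior Beta laws.
kl = (betaln(a0, b0) - betaln(a1, b1)
      + (a1 - a0) * digamma(a1) + (b1 - b0) * digamma(b1)
      + (a0 - a1 + b0 - b1) * digamma(a1 + b1))
print(f"KL(posterior || prior): {kl:.2f}")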
I see little point in complex, traditional machine learning machinery (multilayer perceptrons or SVMs, to name only two) if you want to detect outliers. These methods work great for classifying data that is known to be reasonably clean.
