"rcorr" function and multiple testing correction - correlation

Do p values produced by rcorr have to be corrected for multiple comparison or does this function account for multiple testing?
In the output of p-values table I have correlations significant at different levels (e.g. .05, .001) and I don't seem to be able to find if these are corrected for multiple testing.

Related

Which test to use to compare differences in frequency?

So basically, I have been looking at Fisher's exact / barnard testing to establish which method is more appropriate for my hypothesis. However, first of all my contigency table is 2x6 (example below) and not the 2 x 2 that you must have for these tests.
I've estimated the number of modes that two groups of individuals (G1 and G2) have within a distribution of a certain variable. In essence, I want to say
'The types of distribution were observed in G1 was significantly different to G2' - Which test is best to do this for my table?
[My contingency table]

repeatedly computing the fft over a varying number of rows

I am interested in computing the fft of the first rows of a matrix, but I do not know in advance how many rows I need. I need to do this repeatedly but the number of rows I need to transform can change.
I will illustrate with the following example. Suppose I have a 100 by 128 array. If I plan for 1-dimensional fft's on each row, FFTW produces the following plan:
(dft-ct-dit/8
(dftw-direct-8/28-x100 "t2fv_8_sse2")
(dft-vrank>=1-x8/1
(dft-direct-16-x100 "n1fv_16_sse2")))
Although I don't fully understand this output, I do see the key ingredients: 1) This is a single Cooley-Tucker pass, note that 8*16=128. 2) Because of the x100 postfix on two lines, the plan states that this needs to happen for 100 rows.
I see three possibilities:
One-size-fits-all planning: plan for the 100 by 128 array, and execute this big plan even when only the first (say) 20 rows are needed.
Pros: we need only one plan so there is little planning overhead. Cons: potentially substantial performance loss in the execution phase (transforming more than I need).
Exhaustive planning: obtain plans using the same input/output array but for a all possible number of rows. In the example I would have 100 plans, where plan i carries out the fft for each of the first i rows. Pros: Transforming exactly what I need. Cons: Experiments show that I have to pay the planning penalty over and over, even though say for i=50 the plan will be the same as above but with x50 instead of x100. (I suppose there is no guarantee this will indeed be the plan identified by the FFTW planner, but I wouldn't mind losing "optimality" if I can cut out the planning time.)
Single-row planning: plan for a single row and use a loop to move data into input, transform, and move data out of output. Pros: I'm transforming exactly what I need. Cons: it seems to me this removes a lot of the FFTW optimizations, for instance when I use multiple threads. (I'm generally confused how multithreading works in FFTW since it is ill-documented... I know threading information is part of the plan, but printing the plan doesn't display any of it. This is off-topic though.)
I was thinking that I would combine all three ideas by first creating the one-size-fits-all plan, modifying this plan 99 times in a for loop instead of planning for the different sizes, and executing as under the exhaustive-planning approach. However I can't find any documentation on the plan/wisdom format, the output of wisdom with hexadecimal numbers is impenetrable. So I am wondering how I can carry out this hybrid approach.

AMPL: what's a good way to specify equality constraints for large list of pairs of variable-size sets?

I'm working on a problem that involves reconciling data that represents estimates of the same system under two different classification hierarchies. I want to enforce the requirement that equivalent classes or groups of classes have the same sum.
For example, say Classification A divides industries into: Agriculture (sheep/cattle), Agriculture (non-sheep/cattle), Mining, Manufacturing (textiles), Manufacturing (non-textiles), ...
Meanwhile, Classification B has a different breakdown: Agriculture, Mining (iron ore), Mining (non-iron-ore), Manufacturing (chemical), Manufacturing (non-chemical), ...
In this case, any total for A_Agric_SheepCattle + A_Agric_NonSheepCattle should match the equivalent total for B_Agric; A_Mining should match B_MiningIronOre + B_Mining_NonIronOre; and A_MFG_Textiles+A_MFG_NonTextiles should match B_MFG_Chemical+B_MFG_NonChemical.
For bonus complication, one category may be involved in multiple equivalencies, e.g. B_Mining_IronOre might be involved in an equivalency with both A_Mining and A_Mining_Metallic.
I will be working with multi-dimensional tables, with this sort of concordance applied to more than one dimension - e.g. I might be compiling data on Industry x Product, so each equivalency will be used in multiple constraints; hence I need an efficient way to define them once and invoke repeatedly, instead of just setting a direct constraint "A_Agric_SheepCattle + A_Agric_NonSheepCattle = B_Agric".
The most natural way to represent this sort of concordance would seem to be as a list of pairs of sets. The catch is that the set sizes will vary - sometimes we have a 1:1 equivalence, sometimes it's "these 5 categories equate to those 7 categories", etc.
I found this related question which offers two answers for dealing with variable-sized sets. One is to define all set members in a single ordered set with indices, then define the starting index for each set within that. However, this seems unwieldy for my problem; both classifications are likely to be long, so I'd need to be hopping between two loooong lists of industries and two looong lists of indices to see a single equivalency. This seems like it would be a nuisance to check, and hard to modify (since any change to membership for one of the early sets changes the index numbers for all following sets).
The other is to define pairs of long fixed-length sets, and then pad each set to the required length with null members.
This would be a much better option for my purposes since it lets me eyeball a single line and see the equivalence that it represents. But it would require a LOT of padding; most of the equivalence groups will be small but a few might be quite large, and everything has to be padded to the size of the largest expected length.
Is there a better approach?

How many simulations need to do?

Hello my problem is more related with the validation of a model. I have done a program in netlogo that i'm gonna use in a report for my thesis but now the question is, how many repetitions (simulations) i need to do for justify my results? I already have read some methods using statistical approach and my colleagues have suggested me some nice mathematical operations, but i also want to know from people who works with computational models what kind of statistical test or mathematical method used to know that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else. If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2 then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tishbiani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
specially the sections on resampling methods (Cross-Validation and bootstrap).
They also have a shorter book that covers the possible relevant methods to your case along with the commands in R to run this. However, this book, as a far as a I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, could perturb the initial conditions to see you the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can break down the space of parameters with regard to final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variance Cv = s / u, here s and u are standard deviation and mean of the result respectively. It is explained in detail in this paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide scrupulous analyzing methods and refer to other papers which may be relevant to your question and your research.

Comparing two large datasets using a MapReduce programming model

Let's say I have two fairly large data sets - the first is called "Base" and it contains 200 million tab delimited rows and the second is call "MatchSet" which has 10 million tab delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex rules, aka not a simple string match, involving some proprietary methods. Let's say for now Match(row1,row2) is written in psuedo-code so implementation in another language is not a problem (though it's in C++ today).
In a linear model, aka program running on one giant processor - we would read each line from MatchSet and each line from Base and compare one to the other using Match() and write out our match stats. For example we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non values to separate files for inspection. Aka, a nested loop of sorts:
for each row1 in MatchSet
{
for each row2 in Base
{
var type = Match(row1,row2);
switch(type)
{
//do something based on type
}
}
}
I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hardtime getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record compare in parallel across the 200 million records so 200 million/n nodes * 10 million iterations per node. Is that that most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than for me to ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite a small if the problem is indeed a join). As an example, consider the way CDDB computes the 32-bit ID for any music CD CDDB1 calculation. Sometimes, a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times). But by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
Check the Section 3.5 - Relational Joins in the paper 'Data-Intensive Text Processing
with MapReduce'. I haven't gone in detail, but it might help you.
This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd like apply to both the single stream and parallel versions. Examples include blocking so that you need to do fewer than 200M * 10M computations or precomputing constant portions of the algorithm for the 10M match set.

Resources