What kind of statistics to use to compare logFold values? - rna-seq

I recently did an RNA-seq experiment in which I had control and experimental samples at 3 different time periods. My samples were distributed as follows, for a total of 10 samples:
T1_control1, T2_control1, T2_control2, T3_control1,
T1_exp1, T1_exp2, T2_exp1, T2_exp2, T3_exp1, T3_exp2
I did differential expression analysis with DESeq2 and obtained 3 files, one for each time period T1, T2, and T3, showing the logFold change value from control to experimental for each gene. My question is how I can statistically compare the logFold change value for one gene in one time period vs another time period. I am not sure what test to use since there is only one logFold change value per time period for each gene.
Thank you in advance.
I am not sure what test to use since there is only one logFold change value per time period for each gene.

Since RNA sequencing is costly (for many purposes), I think most groups ensure their sequencing run is accurate by including multiple biological replicates in each group or by sequencing deeply. An argument could be made that reporting only one data point per gene at each time point is appropriate, given that standard protocols were followed.
One option, though, is to increase the sample size for each time period by computing a fold change for each control-experimental pair. It would be important, however, to consult the literature and colleagues on whether this is appropriate for the specific type of analysis you are doing.
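For illustration only, a minimal sketch of that per-pair idea, assuming you already have normalized counts for one gene (the method and variable names here are my own, and naive per-pair log2 ratios are not a substitute for DESeq2's moderated estimates):

    public class PairwiseFoldChange {
        // Hypothetical sketch: naive log2 fold changes for one gene, computed
        // from normalized counts of matched control-experimental pairs.
        // A pseudocount avoids taking log2 of zero.
        static double[] pairwiseLog2FoldChanges(double[] controls, double[] treated) {
            double pseudo = 1.0;
            int n = Math.min(controls.length, treated.length);
            double[] lfc = new double[n];
            for (int i = 0; i < n; i++) {
                lfc[i] = (Math.log(treated[i] + pseudo) - Math.log(controls[i] + pseudo)) / Math.log(2);
            }
            return lfc;
        }
    }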


Algorithms for finding coincidences in sets of data

Background & Motivation
I have three lists of timestamps (UNIX timestamps with a 10-digit subsecond part, e.g. 1451606400.9475839201). They are not necessarily the same size, but each list is sorted in non-descending order.
Each list corresponds to data from a real-life instrument, of which there are 3. The instruments record a timestamp each time they observe an "event". The issue is that the instruments are very sensitive, so they record timestamps on the order of 10+ Hz, and I'm working with about a year of data, of which only a tiny portion corresponds to actual events.
Unfortunately, it's difficult to put a number on precisely how many timestamps should be real events.
We may assume that the "random" timestamps are uniformly distributed (and, in practice, this seems to be the case). There are, however, gaps in the data (e.g. all three instruments may have gone down during the months of March, April, May). These gaps will be the same for all three lists.
Unfortunately, we cannot assume that the clocks of the three instruments are well synchronized. A drift of a fraction of a microsecond to a few microseconds can be expected. Further, the "events" in question are light, and the instruments are close together, so we can calculate the maximum difference in observation time, on the order of 10 to 15 microseconds (assuming the clocks were synchronized and there is no noise).
Goal:
Using only the timestamps, I want to identify those which are most "likely" to correspond to a "real" event, such that further analysis can be conducted.
What I've Tried:
My first inclination was to produce, from the three lists, a list of triplets, with one timestamp contributed by each list/instrument, such that max(A,B,C) - min(A,B,C) was minimized. Something like this simple algorithm.
Unfortunately, this found very few, and sometimes no, "coincidences" that fell within a reasonable time window. Further, upon closer analysis, the few that did were found not to correspond to real events.
Next, I tried the above, but this time minimizing the RSS error, which I defined for a triplet A,B,C as (A-B)**2 + (A-C)**2 + (B-C)**2. This found only slightly more triplets within a reasonable time window, and none corresponded to real events.
Lastly, I tried simply iterating over all elements of the first vector, and finding the closest match in the second and third vectors (by binary search), then repeating for the second and third vectors. This gave me identical results to the RSS minimization code.
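For reference, here is a rough sketch of that last approach (my own illustration, not the asker's actual code): for each timestamp in the first list, binary-search the nearest timestamp in the other two lists and keep the triplet only if its spread fits inside a chosen coincidence window (e.g. the ~10-15 microsecond bound plus expected clock drift):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class CoincidenceFinder {
        // For each timestamp in a, find the nearest timestamps in b and c (both
        // sorted ascending) and keep the triplet if its spread fits in 'window'.
        // Note: double loses precision on 10-digit subsecond timestamps; a real
        // implementation might use integer nanoseconds or BigDecimal instead.
        static List<double[]> coincidences(double[] a, double[] b, double[] c, double window) {
            List<double[]> out = new ArrayList<>();
            for (double t : a) {
                double tb = nearest(b, t);
                double tc = nearest(c, t);
                double lo = Math.min(t, Math.min(tb, tc));
                double hi = Math.max(t, Math.max(tb, tc));
                if (hi - lo <= window) {
                    out.add(new double[] { t, tb, tc });
                }
            }
            return out;
        }

        // Binary search for the element of 'sorted' closest to 'target'.
        static double nearest(double[] sorted, double target) {
            int idx = Arrays.binarySearch(sorted, target);
            if (idx >= 0) return sorted[idx];
            int ins = -idx - 1;                          // insertion point
            if (ins == 0) return sorted[0];
            if (ins == sorted.length) return sorted[sorted.length - 1];
            double below = sorted[ins - 1], above = sorted[ins];
            return (target - below <= above - target) ? below : above;
        }
    }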
Is there a better or a "standard" approach?
I'm not concerned about anything other than effectively finding these "real events". That includes efficiency: if it works well, speed is of little concern.

java.time.temporal.ChronoUnit VS java.time.temporal.ChronoField

While looking at the Java 8 time API, I see that a lot of methods expect a ChronoUnit (an implementation of TemporalUnit) as a parameter, as here, while others expect a ChronoField (an implementation of TemporalField), as here.
Could anyone help me understand the designers' decision about when a method expects a ChronoUnit and when it expects a ChronoField, and what the differences between them are?
Thanks.
Units are used to measure a quantity of time - years, months, days, hours, minutes, seconds. For example, the second is an SI unit.
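For example, a ChronoUnit is what you pass when you want to measure or add an amount of time (a minimal sketch using the standard java.time API):

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;

    public class UnitDemo {
        public static void main(String[] args) {
            Instant start = Instant.parse("2016-01-01T00:00:00Z");
            Instant end   = Instant.parse("2016-01-01T00:02:30Z");
            // A unit measures an amount of time between two points...
            System.out.println(ChronoUnit.SECONDS.between(start, end));   // 150
            // ...and is also what you pass when adding an amount of time.
            System.out.println(start.plus(90, ChronoUnit.MINUTES));
        }
    }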
By contrast, fields are how humans generally refer to time, which is in parts. If you look at a digital clock, the seconds count from 0 up to 59 and then go back to 0 again. This is a field - "second-of-minute" in this case, formed by counting seconds within a minute. Similarly, days are counted within a month, and months within a year. To define a complete point on the time-line you have to have a set of linked fields, e.g.:
second-of-minute
minute-of-hour
hour-of-day
day-of-month
month-of-year
year (-of-forever)
The ChronoField API exposes the two parts of second-of-minute. Use getBaseUnit() to get "seconds" and getRangeUnit() to get "minutes".
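A quick illustration of that decomposition (a minimal sketch using the standard java.time API):

    import java.time.LocalDateTime;
    import java.time.temporal.ChronoField;

    public class FieldDemo {
        public static void main(String[] args) {
            LocalDateTime now = LocalDateTime.now();
            // A field is one part of a larger whole, e.g. second-of-minute:
            System.out.println(now.get(ChronoField.SECOND_OF_MINUTE));        // 0..59
            System.out.println(ChronoField.SECOND_OF_MINUTE.getBaseUnit());   // Seconds
            System.out.println(ChronoField.SECOND_OF_MINUTE.getRangeUnit());  // Minutes
        }
    }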
The Chrono part of the name refers to the fact that the definitions are chronology-neutral. Specifically, the unit or field has a full meaning only when associated with a calendar system, or Chronology. An example of this is the Coptic chronology, where there are 13 months in a year. Despite this being different from the common civil/ISO calendar system, the ChronoField.MONTH_OF_YEAR constant can still be used.
The TemporalUnit and TemporalField interfaces provide the higher level abstraction, allowing units/fields that are not chronology-neutral to be added and processed.
A TemporalUnit serves as a general unit of time measurement. It can therefore be used to determine the size of the temporal amount between two given points in time (in an abstract sense).
However, a TemporalField is not necessarily related to any kind of (abstract) time axis and usually represents a detail value of a point in time. Example: a month is only one component of a complete calendar date consisting of year, month and day-of-month.
Some people might argue that a calendar month and the month unit could be interpreted as more or less equivalent. Older libraries like java.util.Calendar don't make this distinction. However, field and unit are used in very different ways, as shown above (composing points in time versus measuring a temporal amount).
Interestingly, the JDK 8 designers decided that a field must have a base unit which is not null (I am personally not happy about this narrowing decision because I can imagine fields that don't necessarily have a base unit). In the case of months it is quite trivial. In the case of days, we have different fields with the same base unit DAYS, for example day-of-month, day-of-year and day-of-week. This 1:n relationship justifies the separation of units and fields in the context of JSR-310 (a.k.a. the java.time package).
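That 1:n relationship is easy to see in code (a minimal sketch):

    import java.time.temporal.ChronoField;
    import java.time.temporal.ChronoUnit;

    public class BaseUnitDemo {
        public static void main(String[] args) {
            // Several distinct fields share the same base unit, DAYS:
            System.out.println(ChronoField.DAY_OF_MONTH.getBaseUnit() == ChronoUnit.DAYS); // true
            System.out.println(ChronoField.DAY_OF_YEAR.getBaseUnit()  == ChronoUnit.DAYS); // true
            System.out.println(ChronoField.DAY_OF_WEEK.getBaseUnit()  == ChronoUnit.DAYS); // true
        }
    }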

Gene representation for production planning with constraints

I'm trying to improve the throughput of a production system. The exact type of the system isn't relevant (I think).
Description
The system consists of a LINE of stations (numbered 1, 2, 3...) and an ARM.
The system receives an ITEM at random times.
Each ITEM has a PLAN associated with it (for example, ITEM1 may have a PLAN which
says it needs to go through station 3, then 1, then 5). The PLAN includes timing information on
how long the ITEM would be at each station (a range of hard max/min values).
Every STATION can hold one ITEM at a time.
The ARM is used to move each ITEM from one STATION to the next. Each PLAN includes
timing information for the ARM as well, which is a fixed value.
Current Practice
I have two current (working) planning solutions.
The first maintains a master list of usage for each STATION; consider this a 'booking' approach.
As each new ITEM-N enters, the system searches ahead to find the earliest possible slot where
PLAN-N would fit. So, for example, it would try to fit it at t=0, then progressively try higher
delays until it found a fit (in practice I have some heuristics here to cut down processing time,
but the approach holds).
The second maintains a list for each ITEM specifying when it is to start. When a new ITEM-N
enters, the system compares its PLAN-N with all existing lists to find a suitable time to
start. Again, it starts at t=0 and then progressively tries higher delays.
Neither of the two solutions takes advantage of the range of times an ITEM is allowed at each
station. A fixed time is assumed (midpoint or minimum).
Ideal Solution
It's quite self-evident that there exist situations where an incoming ITEM would be able to
start earlier than otherwise possible if some of the current ITEMs changed the duration they
spend at certain STATIONs, whether by shortening that duration (so the new ITEM could enter
the STATION instead) or lengthening it (so the ARM has time to move the ITEM).
I'm trying to implement a Genetic Algorithm solution to the problem. My current gene contains N
numbers (between 0 and 1), where N is the total number of stations among all items currently in
the system as well as a new item which is to be added. It's trivial to convert this gene to
actual durations (0 would be the min duration, 1 would be the max, scaled linearly in between).
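For concreteness, a minimal sketch of that decoding step (my own illustration; minDur and maxDur are hypothetical per-station bounds taken from the PLANs):

    public class GeneDecoder {
        // Decode a gene of values in [0, 1] into per-station durations by linear
        // interpolation between each station's hard minimum and maximum duration.
        static double[] decode(double[] gene, double[] minDur, double[] maxDur) {
            double[] durations = new double[gene.length];
            for (int i = 0; i < gene.length; i++) {
                durations[i] = minDur[i] + gene[i] * (maxDur[i] - minDur[i]);
            }
            return durations;
        }
    }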
However, this gene representation consistently produces unusable plans which overlap with each
other. The reason is that when multiple items are already arranged ideally (consecutively in
time, planning-wise), no variation in durations is possible. This is unavoidable because once
items are already being processed, they cannot be delayed or brought forward.
An example of the above situation: say ITEMA is in STATION3 from t1 to t2 and from t3 to t4.
ITEMB then comes along and occupies STATION3 from t2 to t3 (so STATION3 is fully utilized
between t1 and t4). With my current gene representation, I'm virtually guaranteed never to
find a valid solution, since that would require certain elements of the gene to have exactly the
correct value so as not to generate an overlap.
Questions
Is there a better gene representation than the one I describe above?
Would I be better served doing some simple hill-climbing to find modifiable timings? Or is a GA
actually suited to this problem?

How many simulations do I need to do?

Hello, my problem is related to the validation of a model. I have written a program in NetLogo that I'm going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to do to justify my results? I have already read about some statistical methods, and my colleagues have suggested some useful mathematical approaches, but I also want to know from people who work with computational models what kind of statistical test or mathematical method they use for this.
There are two aspects to this: (1) how many parameter combinations, and (2) how many runs for each parameter combination.
(1) Generally you would do experiments where you vary some of your input parameter values and see how some model output changes. Take the well-known Schelling segregation model as an example: you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete values), or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely driven by how finely you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2, then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
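As a concrete sketch of that calculation (my own illustration, not from the original answer):

    public class RunStats {
        // 95% confidence interval for the mean of N simulation results:
        // mean +/- 1.96 * sd / sqrt(N).
        static double[] confidenceInterval95(double[] results) {
            int n = results.length;
            double mean = 0;
            for (double r : results) mean += r;
            mean /= n;
            double ss = 0;
            for (double r : results) ss += (r - mean) * (r - mean);
            double sd = Math.sqrt(ss / (n - 1));      // sample standard deviation
            double half = 1.96 * sd / Math.sqrt(n);   // half-width of the interval
            return new double[] { mean - half, mean + half };
        }
    }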
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books by Hastie and Tibshirani:
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and the bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the R commands to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to check that the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, sometimes you can partition the parameter space according to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation, Cv = s / u, where s and u are the standard deviation and mean of the results, respectively. It is explained in detail in this paper: Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide thorough analysis methods and refer to other papers which may be relevant to your question and your research.

How to notice unusual news activity

Suppose you were able to keep track of the news mentions of different entities, like, say, "Steve Jobs" and "Steve Ballmer".
What are some ways you could tell whether the number of mentions per entity in a given time period was unusual relative to its normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of, say, 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% or more for a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, assume that you are already able to get a continuous news stream, identify entities in each news item, and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you can see whether a change is substantially outside the usual variance.
You could also try some normalization. One very simple approach: each category has a total number of mentions (m), a percent change from the last time period (δ), and a normalized value (z) where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name            m     m0    δ     z
Steve Jobs      4950  4500  .10   495
Steve Ballmer   400   300   .33   132
Larry Ellison   50    10    4.0   400
Andy Nobody     50    40    .20   10
Here, a 400% change for unknown Larry Ellison results in a z value of 400, a 10% change for the much better known Steve Jobs is 495, and my spike of 20% is still a low 10. You could tweak this algorithm depending on what you feel are good weights, or use standard deviation or the rolling average to find if this is far away from their "expected" results.
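A minimal sketch of that normalization (my own illustration; the example call uses the Steve Jobs row from the table):

    public class MentionChange {
        // z = m * delta, where delta = (m - m0) / m0 is the fractional change
        // from the previous period's count m0 to the current count m.
        static double normalizedChange(double m, double m0) {
            double delta = (m - m0) / m0;
            return m * delta;
        }

        public static void main(String[] args) {
            // Steve Jobs row: delta = 0.10, so z = 495.
            System.out.println(normalizedChange(4950, 4500));
        }
    }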
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others, so you may need to adjust X appropriately. X = 1 is a reasonable starting point.
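A minimal sketch of that test (my own illustration), which also covers the rolling-average idea mentioned earlier:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class RollingBaseline {
        // Keep a rolling window of the last 'size' counts; flag a new count as
        // unusual if it lies more than x standard deviations from the window mean.
        private final Deque<Double> window = new ArrayDeque<>();
        private final int size;

        RollingBaseline(int size) { this.size = size; }

        boolean isUnusual(double count, double x) {
            boolean unusual = false;
            if (window.size() >= 2) {
                double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
                double ss = window.stream().mapToDouble(v -> (v - mean) * (v - mean)).sum();
                double sd = Math.sqrt(ss / (window.size() - 1));
                unusual = sd > 0 && Math.abs(count - mean) > x * sd;
            }
            window.addLast(count);                       // update the history
            if (window.size() > size) window.removeFirst();
            return unusual;
        }
    }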
Way oversimplified:
Store people's names and the number of articles created in the past 24 hours that mention them. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? Searching through articles, how do you grab names? Once you grab a new name, do you search all articles for them? How do you separate Steve Jobs of Apple from Steve Jobs the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you insert yourself. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables in this, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
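If I had to sketch that resampling procedure in code, it might look roughly like this (my own illustration; the decay rate and number of draws are arbitrary placeholders):

    import java.util.Random;

    public class MentionResampler {
        private final Random rng = new Random();

        // Rough sketch of the weighted-resampling idea: older daily counts get
        // exponentially smaller sampling weights, we draw from them with
        // replacement, and then see what fraction of draws exceed today's count.
        // history[0] is the oldest daily count, history[history.length - 1] the newest.
        double fractionAbove(int[] history, int todayCount, double decay, int draws) {
            int n = history.length;
            double[] weights = new double[n];
            double total = 0;
            for (int i = 0; i < n; i++) {
                // Most recent day gets weight 1; each older day is multiplied by 'decay'.
                weights[i] = Math.pow(decay, (n - 1) - i);
                total += weights[i];
            }
            int above = 0;
            for (int d = 0; d < draws; d++) {
                double u = rng.nextDouble() * total;     // weighted draw with replacement
                double acc = 0;
                int pick = n - 1;
                for (int i = 0; i < n; i++) {
                    acc += weights[i];
                    if (u <= acc) { pick = i; break; }
                }
                if (history[pick] > todayCount) above++;
            }
            return (double) above / draws;               // rough "probability" of a larger count
        }
    }

For example, fractionAbove(dailyCounts, 1500, 0.99, 100000) would return the share of weighted resamples that exceed 1,500 mentions, which plays the role of the rough probability described above.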
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.

Resources