gnomAD database allele frequency - bioinformatics

I need help in understanding the gnomAD allele frequency.
I need to filter the variants having < 1 % in population. I have seen in some annotated file "gnomAD highest frequency" column, on the basis of this the other scientists have filtered out < 1% variants.
In my file i can only see AF in my gnomAD table. Also the numbers in AF column are like 0.9876 , 0.087 but not in percentage form.
My question is should i take AF column for selecting <1 % . Also for that i need to first multiply the numbers in AF column by 100 to get it in percentage.
Please guide me if i am on the right path or not.
Thanks in advance!

The gnomAD data sets contain many different allele frequencies for different population. AF in your file could be any of these. If it is not documented you should be able to work it out by comparing values with those on the gnomAD website. With regards which frequency to use, it all depends on what you are trying to achieve. I use AF_popmax (Maximum allele frequency across populations) to filter out common variants in the context of genomic testing for rare disease but if you want to filter out rare variants then the best frequency to use would depend on your specific use case. All the values are expressed as frequencies, ie 1 = 100%, 0.01 = 1%.

Related

Tableau - Calculated fields / grouping / Custom Dim

Tableau:
This may seem simple, but I ran out of the usual tricks I've used in other systems.
I want a variance column. Essentially adding a member 'Variance' to the Act/Plan dimension which only contains the members 'Actual' and 'Plan'
I've come in where the data structure and reporting is set up like so:
Actual | Plan
Profit measure
measure 2
measure 3
etc
The goal is to have a Variance column (calculated and not part of the Actual/Plan dimension)
Actual | Plan | Variance
Profit measure
measure 2
measure 3
etc
There are solutions where it works for one measure only, and I've looked into that.
ie, create calculated field as such
Profit_Actual | Profit_Plan | Variance
You put this on the columns, and you get a grid that I want... except a grid with only 1 measure.
This does not work if I want to run several measures on rows. Essentially the solution above will only display the Profit measure, not Measure 1_Actual , Measure 2_Plan etc.
So I tried a trick where I grouped a the 3 calculated measures, ie Profit_Actual | Profit_Plan | Profit_Variance as 'Profit_Measure'
Created a parameter list - 'Actual', 'Plan', 'Variance'
Now I can half achieve my goal, by having the parameter on columns and the 'Profit Measure' on Rows (so I can have Measure 123_group etc down on rows too). Trouble is, I found that parameters are single select only. Only if it can display all options in the custom paramater at once, I would've solved my problem.
Any ideas on how I can achieve the Variance column I want?
Virtually adding a member to a dimension/Calculated fieds/tricks/workaround
Thank you
Any leads is appreciated
Gemmo
Okay. First thing, I had a really hard time trying to understand how your data is organized, try to be more clear (say how each entry in your database looks like, and not how a specific view in Tableau looks like).
But I think I got it. I guess you have a collection of entries, and each entry has a number of measure fields (profits and etc.) and an Act/Plan field, to identify whether that entry is an actual value or a planned value. Is that correct?
Well, if that's the case, I'm sorry to say you have to calculate a variance field for each dimension. Think about it, how your original dataset is structured. Do you think you can add a single field "Variance" to represent the variance of each measure? Well, you can, store the values in a string, and then collect it back using some string functions, but it's not very practical. The problem is that each entry have many measures, if it had only 1 measure, than 1 single variance field would suffice.
So, if you can re-organize your data, what would be an easier to work set (but with many more entries) is something with the fields: Measure, Value, Actual/Plan. The measure field would have a string to identify what you're measuring in that entry. Value would be a number to represent the actual measure. And the Actual/Plan is the same. For instance:
Measure Value Actual/Plan
Profit 100 Actual
So, each line in your current model would become n entries, where n is the number of measures you have right now. So a larger dataset in a way, but easier to work with. Think about, now you can have a calculated field, and use some table calculations to calculate the variance only for that measure and/or Actual/Plan. Just use WINDOW_VAR, and put Measure and/or Actual/Plan in the partition.
Table calculations are awesome, take a look at this to understand it better. http://onlinehelp.tableausoftware.com/current/pro/online/en-us/help.htm#calculations_tablecalculations_understanding_addressing.html
I generally like to have my data staged such that Actual is its own column and Plan is its own column in the data being fed to Tableau. It makes calculations so much easier.
If your data is such that there is a column called "Actual/Plan" and every row is populated with either "Actual" or "Plan" and there is another column called "Value" or "Measure" that is populated with the values, you can force Tableau to make them columns assuming you can't or won't rearrange your data.
Create a calculated field called "Actual" with the following calc:
IF [Actual/Plan] = 'Actual' THEN [Value] END
Similarly, create a calculated field called "Plan" with the following calc:
IF [Actual/Plan] = 'Plan' THEN [Value] END
Now, you can finally create your "Variance" and "Variance %" calculations (respectively):
SUM([Actual]) - SUM([Plan])
[Variance] / SUM([Plan])

How to convert raw score to standardized score "sensibly"?

I want to convert about 7,000 raw scores (0 to 20,000) to a standardized score (0 to 100).
The distribution of the scores is not normal. The median is 270 but the top dozen scores are:
18586,
17151,
9690,
8034,
7723,
7026,
7027,
6725,
6722,
5637,
4996,
4452.
How do I do this conversion in a "sensible" way such that the standardized scores (from 0 to 100) both reflect the raw scores AND the fact that half of the scores are below 270?
I don't have a definition of "sensible" and want to have your suggestions as to what is sensible in this case.
I would suggest doing a histogram. It is basically bin numbers together to get the frequency. Here is the link to the Wiki http://en.wikipedia.org/wiki/Histogram. If you google histogram, there are links to several other site which may be helpful.
Given the range of the data, you may want to take to log of each number and used that for the historgram. That may help decrease the range and then scale that between 0 and 100.

Statistical estimation algorithm

I'm not sure if this question is appropriate for Stack Overflow but I'll give it a try anyway.
I have some data as follows:
I also have another set of data that I believe follows a similar distribution but I only know the total percent (e.g. 30% rather than 17%.) Can anyone suggest an algorithm to estimate the %s for each individual tier based on the new total % and the original distribution?
You question is unclear. If you want to estimate a new total percent by including the additinal data you are getting you must have quantity associated with your percentage so that you can create a meaninful weighted average.
If you want to determine if the new set of data has a different distribution than the historical data there are several tests mostly doing obtuse calculations on cummulative actual vs. expected percentages of values falling underneath a particular value. There is a lot of literature on the subject on comparing the distributions of two populations.
For paired samples Wilcoxon-Rank is a standard method if you can make no assumptions about the distribuion of the data. For non paired data non-parametric statistics exist but they require some in depth study.
Step-1: If your overall percentage 17% → 30% then, Actual (total) 105 → ~189.
Step-2: This number needs to be distributed over all elements in Actual column
From here things become non-linear, and we need some formula for arriving at Actual from POssible. And this needs to be a function of total.
i.e., function (possible, total (actual)) = actual.
If we can arrive at the above, then it might work ;)
If your new total is x, then put (22/627)*x as possible for tier 1, and (21/627)*x as actual for tier 1, which will give you the same percentage as before for tier 1. Then do the same thing for the other tiers (so possible for tier 2 is (45/627)*x, etc.).

Algorithm to score similarness of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to diferentiate between good matches and bad matches. We have a lot of historical data, so I would like to try to narrow down the amount of days the users need to look through by automatically throwing out sets that aren't close and trying to put the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing the temperature can not be compared to numbers generated with other data such as Wind Speed or Precipitation because the scale of the data is different. Some of the non-weather data being is very large, so the mean square error algorithm generates numbers in the hundreds of thousands compared to the tens or hundreds that is generated by using temperature.
I think the mean square error metric might work for applications such as weather compares. It's easy to calculate and gives numbers that do make sense.
Since your want to compare measurements over time you can just leave out missing values from the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
Use the pearson correlation coefficient. I figured out how to calculate it in an SQL query which can be found here: http://vanheusden.com/misc/pearson.php
In finance they use Beta to measure the correlation of 2 series of numbers. EG, Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
distance = sqrt(
pow(currentpoint.temp - historicalpoint.temp, 2) +
pow(currentpoint.wind - historicalpoint.wind, 2) +
pow(currentpoint.precip - historicalpoint.precip, 2))
if distance is smaller than the largest distance in our match collection
add historicalpoint to our match collection
remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situation where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
double c = correlation(historical_set, forecast_set);
double avg_history = average(historical_set);
double avg_forecast = average(forecast_set);
double penalty = abs(avg_history - avg_forecast) / avg_forecast
return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degree F, with 2000km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As example, let's say that a day it's a windy, hot day: that might have a temp quantile of .75, and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3kmh away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days (but different locations), for example, you can regress the first data set against the second,
and then test that the slope is equal to 1, and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y=values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
Maybe you can see your set of numbers as a vector (each number of the set being a componant of the vector).
Then you can simply use dot product to compute the similarity of 2 given vectors (i.e. set of numbers).
You might need to normalize your vectors.
More : Cosine similarity

How to organise and rank observations of a variable?

I have this dataset containing world bilateral trade data for a few years.
I would like to determine which goods were the most exported ones in the timespan considered by the dataset.
The dataset is composed by the following variables:
"year"
"hs2", containing a two-digit number that tells which good is exported
"exp_val", giving the value of the export in a certain year, for that good
"exp_qty", giving the exported quantity of the good in a certain year
Basically, I would like to get the total sum of the quantity exported for a certain good, so an output like
hs2 exp_qty
01 34892
02 54548
... ...
and so forth. Right now, the column "hs2" gives me a very large number of observations and, as you can understand, they repeat themselves multiple times (as the variables vary across both time and country of destination). So, the task would be to have every hs2 number just once, with the correspondent value of "total" exports.
Also (but that would be just a plus, I could just check the numbers by myself) it would be nice to get a result sorted by exp_qty, so to have a ranking of the most exported goods by quantity.
The following might be a start at what you need.
collapse (sum) exp_qty, by(hs2)
gsort -exp_qty
collapse summarizes the data in memory to one observation per value of hs2, summing the values of exp_qty. gsort then sorts the collapsed data by descending value of exp_qty so the first observation will be the largest. See help collapse and help gsort for further details.

Resources