What is the maximum quality score in a VCF file - bioinformatics

Does anybody know if the maximum VCF quality score is explicitly defined somewhere?
Thanks in advance :)
I have a VCF file containing roughly 8.3 million variations. I was wondering if there is a limit to the quality score in the VCF file. The highest I found was a quality of 999. Roughly 20% of my VCF file has this 999 quality score, so I am assuming that this is the maximum. I'm just not sure and want to use this information for my graduation thesis.

There is no maximum value for quality defined within the VCF specification (https://samtools.github.io/hts-specs/VCFv4.2.pdf), and different variant callers handle this differently, so a ceiling such as 999 comes from the caller that produced the file, not from the format. This is, however, not a problem in practice, because one would never apply filters at such high levels of confidence. The real question you should be asking is: what is the lowest quality score that I am prepared to accept? Unfortunately there is no universal answer to this question, as it depends on the sequencing technology, pipeline and application. That being said, filtering out variants with a quality score of less than 30 is a common strategy that works well in a variety of scenarios, and it would be uncommon to use a value significantly higher than this. QUAL is Phred-scaled (-10 log10 of the probability that the call is wrong), so 30 corresponds to an estimated 1 in 1000 chance of a wrong call.
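As a rough illustration of the QUAL >= 30 filter mentioned above, here is a minimal sketch in plain Python. It assumes an uncompressed, tab-separated VCF with QUAL in the sixth column; the function name `filter_by_qual` and the example records are made up for illustration, and dropping records with a missing QUAL (".") is a choice of this sketch, not something the specification requires.

```python
def filter_by_qual(vcf_lines, min_qual=30.0):
    """Yield header lines unchanged and records whose QUAL is >= min_qual."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line          # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]        # QUAL is the 6th tab-separated column
        if qual == ".":         # missing QUAL: dropped here (an assumption)
            continue
        if float(qual) >= min_qual:
            yield line


# Made-up records, only to exercise the function.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t999\t.\t.",
    "1\t200\t.\tC\tT\t12.5\t.\t.",
    "1\t300\t.\tG\tA\t.\t.\t.",
]
kept = list(filter_by_qual(example))
```

For real files, an established tool is usually the better choice; the sketch only makes the threshold logic explicit.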

Related

Method to choose overall winner across multiple categories

I have four numeric variables. All of them are measures of soil quality: the higher the variable, the higher the quality. The range is different for each of them:
Var1 from 1 to 10
Var2 from 1000 to 2000
Var3 from 150 to 300
Var4 from 0 to 5
I need to combine the four variables into a single soil quality score that rank-orders the samples correctly.
My idea is very simple: standardize all four variables, sum them up, and use the result as the score for rank-ordering. Do you see any problem with this approach? Is there any other (better) approach that you would recommend?
Thanks
Edit:
Thanks guys. A lot of the discussion went into "domain expertise"... agriculture stuff... whereas I expected more stats talk. As for the technique I will be using, it will probably be simple z-score summation plus logistic regression as an experiment. Because the vast majority of samples (90%) are of poor quality, I'm going to combine the 3 quality categories into one and basically have a binary problem (some quality vs. no quality). That kills two birds with one stone: I increase my sample in terms of event rate, and I make use of the experts by getting them to classify my samples. The expert-classified samples will then be used to fit a logistic regression model to maximize concordance (and minimize discordance) with the experts. How does that sound to you?
The proposed approach may give a reasonable result, but only by accident. At this distance--that is, taking the question at face value, with the meanings of the variables disguised--some problems are apparent:
It is not even evident that each variable is positively related to "quality." For example, what if a 10 for 'Var1' means the "quality" is worse than the quality when Var1 is 1? Then adding it to the sum is about as wrong a thing as one can do; it needs to be subtracted.
Standardization implies that "quality" depends on the data set itself. Thus the definition will change with different data sets or with additions and deletions to these data. This can make the "quality" into an arbitrary, transient, non-objective construct and preclude comparisons between datasets.
There is no definition of "quality". What is it supposed to mean? Ability to block migration of contaminated water? Ability to support organic processes? Ability to promote certain chemical reactions? Soils good for one of these purposes may be especially poor for others.
The problem as stated has no purpose: why does "quality" need to be ranked? What will the ranking be used for--input to more analysis, selecting the "best" soil, deciding a scientific hypothesis, developing a theory, promoting a product?
The consequences of the ranking are not apparent. If the ranking is incorrect or inferior, what will happen? Will the world be hungrier, the environment more contaminated, scientists more misled, gardeners more disappointed?
Why should a linear combination of variables be appropriate? Why shouldn't they be multiplied or exponentiated or combined as a posynomial or something even more esoteric?
Raw soil quality measures are commonly re-expressed. For example, log permeability is usually more useful than the permeability itself and log hydrogen ion activity (pH) is much more useful than the activity. What are the appropriate re-expressions of the variables for determining "quality"?
One would hope that soil science would answer most of these questions and indicate what the appropriate combination of the variables might be for any objective sense of "quality." If not, then you face a multi-attribute valuation problem. The Wikipedia article lists dozens of methods for addressing this. IMHO, most of them are inappropriate for addressing a scientific question. One of the few with a solid theory and potential applicability to empirical matters is Keeney & Raiffa's multiple attribute valuation theory (MAVT). It requires you to be able to determine, for any two specific combinations of the variables, which of the two should rank higher. A structured sequence of such comparisons reveals (a) appropriate ways to re-express the values; (b) whether or not a linear combination of the re-expressed values will produce the correct ranking; and (c) if a linear combination is possible, it will let you compute the coefficients. In short, MAVT provides algorithms for solving your problem provided you already know how to compare specific cases.
Has anyone looked at Russell G. Congalton's 'Review of Assessing the Accuracy of Classifications of Remotely Sensed Data' (1990)? It describes a technique known as the error matrix for varying matrices, and also what he calls 'normalizing data', whereby you take all the different vectors and 'normalize' them, that is, rescale each one to the range 0 to 1.
One other thing you did not discuss is the scale of the measurements. V1 and V5 look like they are rank-ordered and the others do not, so standardization may be skewing the score. You may be better off transforming all of the variables into ranks and determining a weight for each variable, since it is highly unlikely that they all carry the same weight. Equal weighting is more of a "know-nothing" default. You might want to do some correlation or regression analysis to come up with a priori weights.
I had a similar problem recently and thought I'd add my approach to the nice answers. A simple way to determine how much each variable should contribute to the ranking is to turn your problem into a grid search:
Basically, use a combined score for the ranking, composed like this:
Final_score = Var1 * A + Var2 * B + Var3 * C ....
Then you can compute the final score with different values for A, B, C (sklearn's grid search could be used) and compare the resulting ranking to an expected ranking (some ground truth is needed to determine the goodness of your ranking). The best parameters give you the weights of your individual variables.
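The grid search over weights described above could look roughly like the sketch below. It uses `itertools.product` instead of sklearn's grid search, and a simple pairwise agreement measure as the "goodness" of a ranking. The data, the weight grid, and the names `rank_agreement` and `grid_search_weights` are assumptions made for illustration. The variables are assumed to be rescaled to 0..1 beforehand so that the weights are comparable.

```python
from itertools import product


def rank_agreement(scores, truth):
    """Fraction of sample pairs that scores and truth order the same way."""
    agree = total = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if truth[i] == truth[j]:
                continue  # ground truth expresses no preference for this pair
            total += 1
            if (scores[i] - scores[j]) * (truth[i] - truth[j]) > 0:
                agree += 1
    return agree / total if total else 0.0


def grid_search_weights(rows, truth, grid):
    """Try every weight combination from grid; return (best_weights, agreement)."""
    best_w, best_a = None, -1.0
    for w in product(grid, repeat=len(rows[0])):
        scores = [sum(wi * xi for wi, xi in zip(w, row)) for row in rows]
        a = rank_agreement(scores, truth)
        if a > best_a:
            best_w, best_a = w, a
    return best_w, best_a


# Made-up example: two variables already scaled to 0..1, four samples,
# and a hypothetical expert ranking (higher = better).
rows = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.2), (0.3, 0.1)]
truth = [0, 2, 3, 1]
best_w, best_a = grid_search_weights(rows, truth, grid=[0.0, 0.5, 1.0])
```

With many variables the number of combinations grows quickly, so a coarse grid or a proper optimizer may be needed.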
Following up on Ralph Winters' answer, you might use PCA (principal component analysis) on the matrix of suitably standardized scores. This will give you a "natural" weight vector that you can use to combine future scores.
Do this also after all scores have been transformed into ranks. If the results are very similar, you have good reasons to continue with either method. If there are discrepancies, this will lead to interesting questions and a better understanding.
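To make the PCA suggestion concrete, here is a minimal sketch that computes the first principal component of the standardized variables, once on the raw values and once on ranks. It uses plain power iteration on the correlation matrix rather than a library PCA routine. The measurements are made-up numbers, and the helper names are hypothetical. Ties in the ranks are not handled.

```python
from statistics import mean, pstdev


def standardize(col):
    """Z-score one variable (population standard deviation)."""
    m, s = mean(col), pstdev(col)
    return [(x - m) / s for x in col]


def to_ranks(col):
    """Replace values by their rank, 1 = smallest (ties not handled)."""
    order = sorted(range(len(col)), key=lambda i: col[i])
    ranks = [0.0] * len(col)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks


def first_pc_weights(columns, iters=200):
    """Leading eigenvector of the correlation matrix, by power iteration."""
    z = [standardize(c) for c in columns]
    n, k = len(z[0]), len(z)
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
            for i in range(k)]
    w = [1.0 + 0.1 * i for i in range(k)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return w


# Hypothetical measurements for five soil samples (made-up numbers).
var1 = [2, 4, 5, 7, 9]
var2 = [1100, 1300, 1500, 1600, 1900]
var3 = [160, 200, 210, 280, 290]

w_raw = first_pc_weights([var1, var2, var3])
w_rank = first_pc_weights([to_ranks(c) for c in (var1, var2, var3)])
```

Comparing `w_raw` with `w_rank` is the check suggested above: similar weight vectors mean either method is defensible, while large differences are worth investigating.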

Mahout - Naive Bayes Model Very Slow

I have about 44 Million training examples across about 6200 categories.
After training, the model comes out to be ~ 450MB
And while testing, with 5 parallel mappers (each given enough RAM), the classification proceeds at a rate of ~ 4 items a second which is WAY too slow.
How can I speed things up?
One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
Another way I thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
Any other thoughts?
Edit :
After some of the answers given below, I started contemplating some form of down-sampling: running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those "highly" close groups and other samples that are not that close to one another.
I also started thinking about using some form of data normalization that incorporates edit distances while using n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html)
I'm also considering using the Hadoop streaming API to leverage some of the ML libraries available in Python, listed here http://pydata.org/downloads/ and here http://scikit-learn.org/stable/modules/svm.html#svm (these, I think, use liblinear, mentioned in one of the answers below)
Prune stopwords and otherwise useless words (too low support etc.) as early as possible.
Depending on how you use clustering, it may actually make the test phase in particular even more expensive.
Try tools other than Mahout. I found Mahout to be really slow in comparison. It seems to carry a really high overhead somewhere.
Using fewer training examples would be an option. You will see that beyond a certain number of training examples, your classification accuracy on unseen examples won't increase. I would recommend training with 100, 500, 1000, 5000, ... examples per category and using 20% for cross-validating the accuracy. When it doesn't increase anymore, you have found the amount of data you need, which may be a lot less than you use now.
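The "train on growing subsets until accuracy stops improving" procedure can be sketched as a small loop. Here `evaluate` is a hypothetical callable that trains on `n` examples per category and returns held-out accuracy. The accuracies below are a made-up stand-in for real runs, and the tolerance `tol` is an assumption.

```python
def find_plateau(sizes, evaluate, tol=0.005):
    """Return (size, accuracy) of the last step that improved by more than tol.

    sizes    -- increasing training-set sizes (examples per category)
    evaluate -- callable: size -> held-out accuracy (hypothetical here)
    """
    prev = None
    for n in sizes:
        acc = evaluate(n)
        if prev is not None and acc - prev[1] <= tol:
            return prev  # more data no longer helps noticeably
        prev = (n, acc)
    return prev


# Stand-in for real train-and-validate runs: made-up accuracies.
fake_curve = {100: 0.60, 500: 0.72, 1000: 0.78, 5000: 0.781, 10000: 0.782}
plateau = find_plateau(sorted(fake_curve), fake_curve.get)
```

In a real run, `evaluate` would wrap the actual training job, and averaging accuracy over a few random subsets per size makes the plateau less sensitive to noise.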
Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout.
"but I fear losing accuracy" Have you actually tried using fewer features or fewer documents? You may not lose as much accuracy as you fear. There may be a few things at play here:
Such a large number of documents is not likely to be from the same time period. Over time, the content of a stream will inevitably drift, and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data is just confusing it. You may get much better performance if you train on less data.
The majority of features are not helpful, as @Anony-Mousse said already. You might want to perform some form of feature selection before you train your classifier. This will also speed up training. I've had good results in the past with mutual information.
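As a sketch of feature selection by mutual information, the code below scores each term by the mutual information (in bits) between "term is present in the document" and the class label, then keeps the top k terms. The tiny corpus and the helper names are made up for illustration. At the scale discussed here, a library implementation would be the practical choice.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equally long sequences of discrete values."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # sum over observed (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def top_terms(docs, labels, k):
    """Return the k terms whose presence is most informative about the label."""
    vocab = sorted({t for d in docs for t in d})
    scored = [(mutual_information([t in d for d in docs], labels), t)
              for t in vocab]
    scored.sort(key=lambda p: (-p[0], p[1]))  # high score first, then by name
    return [t for _, t in scored[:k]]


# Made-up corpus: each document is a set of terms.
docs = [{"goal", "match", "the"}, {"goal", "team", "the"},
        {"vote", "party", "the"}, {"vote", "law", "the"}]
labels = ["sport", "sport", "politics", "politics"]
selected = top_terms(docs, labels, 2)
```

A term such as "the", which occurs in every document, gets a score of zero and is dropped, which is the pruning effect described above.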
I've previously trained classifiers for a data set of similar scale and found the system worked best with only 200k features, and using any more than 10% of the data for training did not improve accuracy at all.
PS Could you tell us a bit more about your problem and data set?
Edit after question was updated:
Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.
I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run for each pair of words in the corpus. That's a long time!
I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there's a reason Naive Bayes is called Naive) because they are much faster than the alternative but performance will often be just a tiny bit worse. Let me give you an example from my experience:
RBF SVM- 85% F1 score - training time ~ month
Linear SVM- 83% F1 score - training time ~ day
Naive Bayes- 82% F1 score - training time ~ day
You find the same thing in the literature: paper. Out of curiosity, what kind of accuracy are you getting?

Remove noisy and redundant features

I have extracted features from a video sequence based on facial markers as means and standard deviations of those markers over a video sequence. They need to be classified into four different classes based on those markers.
In all I have a feature set of around 260 features. How should I determine which features in my set are noisy and redundant? I read about it in some research papers, and some of them used the plus-l take-away-r algorithm, which seemed quite appropriate, but such algorithms always rate one feature against another and say it's good or bad compared to it.
How do I rate my features as good or bad? What criteria are generally used for that?
I researched a lot for a couple of days but found nothing clear cut and useful. Would be grateful for the help, Thanks.
Think of your 260 features as a basis for a 260-dimensional space. However, your basis vectors are not orthogonal to each other, so they contain a lot of redundant information. You'd like to transform these vectors into a set in which all vectors are orthogonal to each other, thus reducing the number of dimensions without losing (much) information.
This is what Principal component analysis does.
Linear discriminant analysis may also be of interest to you.
You can use PCA, or you can train some classifiers and then loop over all your features, adding a large value to each feature in turn and testing whether this alteration changes the precision of the classifier. If it does not, you can remove that feature. After removing all the redundant features, retrain your classifiers!
It's a good idea to train not one classifier but many of them, and then make your prediction based on votes. You can use the mode function in MATLAB to do this!
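The perturbation test described in this answer can be sketched as follows: shift one feature at a time by a large constant and record how much the accuracy drops. The toy classifier `predict`, the data, and the size of the shift are assumptions made for illustration. Permutation importance, which shuffles a feature instead of shifting it, is a more common variant of the same idea.

```python
def perturbation_sensitivity(predict, X, y, bump=1e6):
    """Accuracy drop per feature when that feature is shifted by bump."""
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    drops = []
    for j in range(len(X[0])):
        bumped = [r[:j] + [r[j] + bump] + r[j + 1:] for r in X]
        drops.append(base - accuracy(bumped))
    return drops


# Toy stand-in for a trained classifier: it only looks at feature 0.
def predict(row):
    return int(row[0] > 0.5)


# Made-up data: feature 1 is ignored by the classifier.
X = [[0.1, 5.0], [0.9, 3.0], [0.2, 7.0], [0.8, 1.0]]
y = [0, 1, 0, 1]
drops = perturbation_sensitivity(predict, X, y)
```

Features whose drop is zero are candidates for removal, after which the classifier should be retrained as the answer says.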
Use the classification rate to determine how good a subset of features is. You have 260 features and therefore 2^260 subsets, which is far too many, and searching this space is very difficult. Thus it's better to first remove some features with a filter method (for example FA, t-test, Fisher and so on) and then use your search method to find the best subset of features.
The plus-l take-away-r algorithm (or another search algorithm) finds various subsets, rates each one (at this stage, use the classification rate), and finally tells you which subset is best.

Data mining: Apriori issue. Min-support

I wrote an Apriori data mining algorithm. It works well on small test data, but I am having trouble running it on bigger data sets.
I am trying to generate rules of items which were bought together frequently.
My small test data is 5 transactions and 10 products.
My big test data is 11 million transactions and around 2700 products.
Problem: Min-support and Filter non frequent items.
Let's imagine we are interested in items whose frequency is 60% or more.
frequency = 0.60;
When I compute min-support for the small data set with a 60% frequency, the algorithm removes all items which were bought fewer than 3 times. Min-support = numberOfTransactions * frequency;
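For reference, the first Apriori pass described above (compute min-support from the frequency, then drop infrequent single items) can be sketched as below. The five transactions are made up to mirror the small test case, and `frequent_items` is a hypothetical helper name.

```python
from collections import Counter


def frequent_items(transactions, frequency):
    """First Apriori pass: keep items bought in at least min_support transactions."""
    min_support = len(transactions) * frequency  # e.g. 5 * 0.60 = 3.0
    # Count each item once per transaction, even if it appears twice in a basket.
    counts = Counter(item for t in transactions for item in set(t))
    kept = {item: c for item, c in counts.items() if c >= min_support}
    return min_support, kept


# Made-up data in the spirit of the small test set (5 transactions).
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
    ["bread", "eggs"],
]
min_support, kept = frequent_items(transactions, 0.60)
```

With the same 0.60 on 11 million transactions, an item would need to appear in 6.6 million of them to survive this pass, which explains why almost nothing remains.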
But when I try to do the same thing for the large data set, the algorithm filters out almost all itemsets after the first iteration; just a couple of items are able to meet such a threshold.
So I started lowering that threshold further and further, running the algorithm many times. But not even 5% gives the desired results. I had to lower my frequency to 0.0005 to get at least 50% of the items involved in the first iteration.
What do you think about the current situation? Might it be a data problem, since the data is generated artificially (Microsoft AdventureWorks version)?
Or is it a problem with my code or my min-support computation?
Can you suggest any other solution or a better way of doing this?
Thanks!
Maybe that is just what your data is like.
If you have a lot of different items, and few items per transaction, the chances of items co-occurring are low.
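A back-of-the-envelope calculation shows how low the supports can get. The sketch assumes an average basket of 3 distinct items (a guess; the question does not state basket sizes) and, for the pair estimate, that items are equally popular and bought independently, which real data will not satisfy exactly.

```python
def mean_item_support(n_products, avg_basket_size):
    """Average relative support of a single item.

    Each transaction contributes avg_basket_size distinct items on average,
    spread over n_products items, so item supports average to basket / products.
    """
    return avg_basket_size / n_products


def pair_support_if_independent(p_a, p_b):
    """Expected relative support of the pair {a, b} under independence."""
    return p_a * p_b


p = mean_item_support(2700, 3)           # basket size 3 is an assumption
pair = pair_support_if_independent(p, p)
expected_pair_count = pair * 11_000_000  # transactions in the big data set
```

Under these assumptions a typical single item has a support of about 0.0011, and a typical pair is expected in only about 14 of the 11 million transactions, so thresholds far below 1% are not surprising.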
Did you verify the result? Is it pruning incorrectly, or is the algorithm correct and your parameters bad?
Can you actually name an itemset that Apriori pruned but shouldn't have?
The problem is, yes, choosing the parameters is hard. And no, apriori cannot use an adaptive threshold, because that wouldn't satisfy the monotonicity requirement. You must use the same threshold for all itemset sizes.
Actually, it all depends on your data. For some real datasets, I had to set the support threshold lower than 0.0002 to get some results. For other datasets, I used 0.9.
By the way, there exist variations of Apriori and FPGrowth that can consider multiple minimum supports at the same time, in order to use different thresholds for different items, for example CFP-Growth or MIS-Apriori. There also exist algorithms specialized for mining rare itemsets or rare association rules. If you are interested in this topic, you could check my software, which offers some of these algorithms: http://www.philippe-fournier-viger.com/spmf/

Optimal Document Size for LSI Similarity Model

I'm using Gensim's excellent library to compute similarity queries on a corpus using LSI. However, I have a distinct feeling that the results could be better, and I'm trying to figure out whether I can adjust the corpus itself in order to improve the results.
I have a certain amount of control over how to split the documents. My original data has a lot of very short documents (mean length is 12 words in a document, but there exist documents that are 1-2 words long...), and there are a few logical ways to concatenate several documents into one. The problem is that I don't know whether it's worth doing this or not (and if so, to what extent). I can't find any material addressing this question, but only regarding the size of the corpus, and the size of the vocabulary. I assume this is because, at the end of the day, the size of a document is bounded by the size of the vocabulary. But I'm sure there are still some general guidelines that could help with this decision.
What is considered a document that is too short? What is too long? (I assume the latter is a function of |V|, but the former could easily be a constant value.)
Does anyone have experience with this? Can anyone point me in the direction of any papers/blog posts/research that address this question? Much appreciated!
Edited to add:
Regarding the strategy for grouping documents - each document is a text message sent between two parties. The potential grouping is based on this, where I can also take into consideration the time at which the messages were sent. Meaning, I could group all the messages sent between A and B within a certain hour, or on a certain day, or simply group all the messages between the two. I can also decide on a minimum or maximum number of messages grouped together, but that is exactly what my question is about - how do I know what the ideal length is?
Looking at number of words per document does not seem to me to be the correct approach. LSI/LSA is all about capturing the underlying semantics of the documents by detecting common co-occurrences.
You may want to read:
1. LSI: Probabilistic Analysis
2. Latent Semantic Analysis (particularly section 3.2)
A valid excerpt from 2:
An important feature of LSI is that it makes no assumptions
about a particular generative model behind the data. Whether
the distribution of terms in the corpus is “Gaussian”, Poisson, or
some other has no bearing on the effectiveness of this technique, at
least with respect to its mathematical underpinnings. Thus, it is
incorrect to say that use of LSI requires assuming that the attribute
values are normally distributed.
The thing I would be more concerned about is whether the short documents share similar co-occurring terms that allow LSI to form an appropriate topic, grouping all of those documents that, for a human, share the same subject. This can hardly be done automatically (maybe with WordNet or an ontology, by substituting rare terms with more frequent and general ones), but this is a very long shot requiring further research.
More specific answer on heuristic:
My best bet would be to treat conversations as your documents. So the grouping would be based on the time proximity of the exchanged messages. Anything up to a few minutes apart (a quarter of an hour?) I would group together. There may be false positives though (strongly depending on the actual contents of your dataset). As with any hyper-parameter in NLP, your mileage will vary, so it is worth doing a few experiments.
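The time-proximity grouping suggested above might be sketched like this: messages between the same two parties are merged into one document as long as the gap to the previous message stays under a threshold. The 15-minute default, the `(timestamp, text)` tuple layout, and the function name are assumptions; the threshold is exactly the kind of hyper-parameter worth experimenting with.

```python
def group_by_time_gap(messages, max_gap_seconds=15 * 60):
    """Merge messages into conversation documents by time proximity.

    messages -- list of (timestamp_in_seconds, text) for ONE pair of parties,
                sorted by timestamp
    """
    docs, current, last_ts = [], [], None
    for ts, text in messages:
        if last_ts is not None and ts - last_ts > max_gap_seconds:
            docs.append(" ".join(current))  # gap too long: close the conversation
            current = []
        current.append(text)
        last_ts = ts
    if current:
        docs.append(" ".join(current))
    return docs


# Made-up messages between two parties.
messages = [(0, "hi"), (60, "hello"), (120, "lunch?"),
            (4000, "back"), (4100, "ok")]
docs = group_by_time_gap(messages)
```

Each resulting string can then be tokenized and fed to the LSI model in place of the individual messages.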
Short documents are indeed a challenge when it comes to applying LDA, since the estimates for the word co-occurrence statistics are significantly worse for short documents (sparse data). One way to alleviate this issue is, as you mentioned, to somehow aggregate multiple short texts into one longer document by some heuristic measure.
One particularly nice test case for this situation is topic modeling of Twitter data, since it's limited by definition to 140 characters. In Empirical Study of Topic Modeling in Twitter (Hong et al., 2010), the authors argue that
Training a standard topic model on aggregated user messages leads to a
faster training process and better quality.
However, they also mention that different aggregation methods lead to different results:
Topics learned by using different aggregation strategies of
the data are substantially different from each other.
My recommendations:
If you are using your own heuristic for aggregating short messages into longer documents, make sure to experiment with different aggregation techniques (potentially all the "sensical" ones).
Consider using a "heuristic-free" LDA variant that is better tailored for short messages, e.g., Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words.
