Is there a plink1.9 option to find out the right filter --geno to keep at least 100k SNPs? - genetics

I am quite new to the plink world.
I am performing QC on some genotypes and I have been asked to find the right --geno filter to apply so that at least 100k SNPs pass the final QC.
Is there any particular plink1.9 option that tells me which --geno threshold to apply in order to retain at least 100k SNPs in my final dataset?
Thank you
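
As far as I know, plink 1.9 has no single option that picks a --geno value for a target SNP count, but a common workaround is to run --missing first and derive the threshold from the per-variant missingness report. A rough Python sketch (the fileset prefix mydata is just a placeholder):

```python
# Sketch: derive a --geno threshold that keeps at least 100k variants,
# based on plink 1.9's per-variant missingness report. Assumes you have
# already run something like:
#   plink --bfile mydata --missing --out mydata
# which writes mydata.lmiss with columns: CHR SNP N_MISS N_GENO F_MISS.
TARGET = 100_000

f_miss = []
with open("mydata.lmiss") as fh:
    next(fh)                                   # skip the header line
    for line in fh:
        f_miss.append(float(line.split()[4]))  # F_MISS column

f_miss.sort()
if len(f_miss) < TARGET:
    raise SystemExit("Fewer than 100k variants to begin with; no --geno value can help.")

# --geno X removes variants whose missing call rate exceeds X, so any
# threshold at or above the 100,000th-smallest F_MISS keeps >= 100k SNPs.
threshold = f_miss[TARGET - 1]
print(f"--geno {threshold:.4f} (or slightly higher) retains at least {TARGET} SNPs")
```

Keep in mind that other filters in the same run (--maf, --hwe, --mind) also remove variants, so you may need a slightly looser --geno to still end up with 100k SNPs after the full QC.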

Related

In SPSS Modeler (17/18), what are the criteria for evaluating ties encountered while sorting a particular column using the sorting nugget?

I am sorting by a particular column using the sorting nugget in SPSS Modeler 17/18. However, I do not understand how ties are evaluated when values are repeated in the sorting column; none of the other columns have any sequence associated with them. Can someone throw some light on this?
I have attached an illustration here where I am sorting on col3 (the Excel file is the original data). However, after sorting, none of the other columns (Key) seem to follow any sequence/order. How was the final data arrived at, then?
I have not been able to find any documentation to answer this question, but I believe that the order of ties after the sort is essentially random, or at least determined by a number of factors outside of the user's control. Generally, I think it is determined by the order of the records in the source, but if you are querying a database or similar without specifying a sort order, the data may be sorted differently depending on the source system, and it may even differ between executions.
If your processing depends on the sort order of the data (including the order of the ties), the best approach is to specify the sort order in such detail that ties cannot happen.
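
To illustrate that last point outside of SPSS Modeler: deterministic ordering just means adding enough tie-breaker keys that no two records compare equal. A tiny Python sketch with made-up column names:

```python
# Made-up records: two rows tie on col3, so a sort on col3 alone leaves
# their relative order up to the engine (or the source system).
records = [
    {"key": "B", "col3": 5},
    {"key": "A", "col3": 5},
    {"key": "C", "col3": 2},
]

# Adding a secondary key ("key" here) breaks every tie, so the result is
# the same no matter how the input happened to be ordered.
records.sort(key=lambda r: (r["col3"], r["key"]))
print(records)  # col3 ascending, then key ascending within ties
```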

Nested sorting in dimension hierarchies (Tableau)

I am working on a visualization in Tableau that has a dimension hierarchy (Product category, product sub-category, product type, etc.) sorted descending by number of orders. I want my viz to show only the first product level (product category) by default, sorted the same way, but give an option to drill down (using "+" on the dimension) to the detailed product levels, again using nested sorting (descending by number of orders).
superstore data sample
I tried using the nested sorting option for each product level, but when I drill up and down again, the sorting is wrong again, as if it clears out. I cannot find an option to keep the sorts fixed unless I keep all product levels visible in the viz (without the drill-down option).
Does anyone know how I can do this? I also tried different indexing and ranking calculations, but nothing seems to work. I know there is an option to combine the hierarchy dimensions and sort on the combined field, but it makes the viz really untidy.
Thanks in advance!
Tableau will always sort based on the leftmost column. With the newer nested sorting you can apply a secondary sort more easily. However, when you expand and collapse the hierarchy as you describe, the sort order might not be retained.
The "classic" way to do this is to create a Rank by number of orders (sounds like you were close on this one). rank(COUNT([Order ID]),'desc'). Make this a discrete measure and put it to the left of all the other dimensions.
To clean it up, you can uncheck "Show Header" on the rank pill.
And if you expand/collapse the hierarchy, it keeps the sorting.
EDIT: Here is another way to try to accomplish this. It seems to work for 3 levels but starts to break down after that. (It also didn't seem to work well on grouped dimensions.)
Expand the hierarchy to all three levels.
On each dimension, enforce a sort order of Count of Order ID Descending.

Fast algorithm for approximate lookup on multiple keys

I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters, one for each combination: A&A, A&B, B&A and B&B. For criteria like these I could concatenate the fields (or something similar) to create a unique key and look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is that I need to include a number of each kind of criterion in the same table. In other words, I could have three criteria: two with A/B entries, and one with less-than-x/greater-than-x entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary, because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, the closest boundary might be 100, but if the first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, I want it to recognise that 50 is the closest boundary that applies, not 100.
I have a procedure to do the lookup, but it gets very slow as the tables get bigger. It basically goes through each criterion and checks whether a match is still possible; if so, it looks at more criteria, and if not, it moves on to the next entry in the table. In other words, my procedure cycles through the table entries one by one and checks each for a match. I have tried to optimise this by ensuring that the tables passed to the procedure are as small as possible, and by checking the criteria that are least likely to match first (so that each entry is rejected as quickly as possible), but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
This is basically what a query optimizer does in SQL land. There are fast, free, in-memory databases for exactly this purpose; check out SQLite's in-memory mode: https://www.sqlite.org/inmemorydb.html
It sounds like you are doing what is called a 'full table scan' for each query, which is the last resort for a query optimizer.
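
For a concrete sense of that suggestion, here is a minimal in-memory SQLite sketch in Python (table and column names are invented); with an index on the lookup columns, the engine can answer the "nearest applicable boundary" question without scanning every row:

```python
import sqlite3

# Minimal sketch of the in-memory-database suggestion.
# Table and column names are invented; "x" is the numeric boundary column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE params (crit1 TEXT, crit2 TEXT, x REAL, value REAL)")
con.execute("CREATE INDEX idx_lookup ON params (crit1, crit2, x)")
con.executemany(
    "INSERT INTO params VALUES (?, ?, ?, ?)",
    [("A", "A", 50, 1.0), ("A", "B", 100, 2.0), ("B", "A", 75, 3.0)],
)

# "Closest boundary at or below the value, among rows matching the other
# criteria" becomes one indexed query instead of a row-by-row scan.
# Looking up A, A, 101 lands on the 50 boundary, not A&B's 100.
row = con.execute(
    "SELECT x, value FROM params "
    "WHERE crit1 = ? AND crit2 = ? AND x <= ? "
    "ORDER BY x DESC LIMIT 1",
    ("A", "A", 101),
).fetchone()
print(row)  # (50.0, 1.0)
```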
As I've understood it, you want to select entries by criteria like
A & not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < upper_x2 & ...
The easiest way is to keep the entries sorted by each possible xi (i = 1, 2, ...) in separate sets, and to have a separate 'word' for each combination of A, B, ...
The search works as follows:
Select the proper word by the Boolean criteria combination.
For each i, find the population of the lower_xi..upper_xi range in the corresponding set (this operation is O(log N)).
Select the i where the population is lowest.
While iterating through the instances in the lower_xi..upper_xi range, filter the results by checking the other upper/lower bound criteria (for all xj where j != i).
Note that this is a general solution. Of course, if you know some relation between your bounds, you may use a list sorted by the respective combination(s) of item values.
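
A Python sketch of this recipe, with an invented entry layout (Boolean flags A and B plus numeric fields x1 and x2), just to make the steps concrete:

```python
import bisect
from collections import defaultdict

# Invented sample data: each entry has Boolean flags and numeric fields.
entries = [
    {"A": True,  "B": False, "x1": 10.0, "x2": 3.0},
    {"A": True,  "B": False, "x1": 55.0, "x2": 7.0},
    {"A": False, "B": True,  "x1": 20.0, "x2": 1.0},
]
bool_keys = ["A", "B"]
numeric_keys = ["x1", "x2"]

# Bucket entries by their Boolean combination (the "word"); inside each
# bucket keep one list per numeric key, sorted by that key.
buckets = defaultdict(lambda: {k: [] for k in numeric_keys})
for e in entries:
    word = tuple(e[k] for k in bool_keys)
    for k in numeric_keys:
        buckets[word][k].append((e[k], e))
for word in buckets:
    for k in numeric_keys:
        buckets[word][k].sort(key=lambda pair: pair[0])

def lookup(word, ranges):
    """ranges maps each numeric key to a half-open (lower, upper) interval."""
    bucket = buckets.get(word)
    if bucket is None:
        return []
    # Population of each range via two binary searches (O(log N) each).
    spans = {}
    for k, (lo, hi) in ranges.items():
        keys = [pair[0] for pair in bucket[k]]
        spans[k] = (bisect.bisect_left(keys, lo), bisect.bisect_left(keys, hi))
    # Iterate over the narrowest range, filtering on the remaining bounds.
    k_min = min(spans, key=lambda k: spans[k][1] - spans[k][0])
    lo_i, hi_i = spans[k_min]
    return [e for _, e in bucket[k_min][lo_i:hi_i]
            if all(lo <= e[k] < hi for k, (lo, hi) in ranges.items())]

print(lookup((True, False), {"x1": (0.0, 60.0), "x2": (5.0, 10.0)}))
```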

Setting priority in lucene.net results

I am using a Lucene.Net query like this
(PropertyID:1 OR PropertyID:25 OR PropertyID:5 OR PropertyID:10 OR PropertyID:15)
I want the results from Lucene.Net in the order of the PropertyID values I passed: for example, the first record should be for PropertyID 1, the second for 25, and the third for 5. But currently Lucene.Net arranges the result set in a different order.
The order of fields in the query has no effect on sorting.
There are 2 ways to achieve the sorting you're looking for:
Use boosts in your query. You can boost PropertyID:1 higher than the rest so that those matches are scored higher and thus appear first in the results, give PropertyID:25 the next-highest boost, and so on. For example:
(PropertyID:1^5 OR PropertyID:25^4 OR PropertyID:5^3 OR PropertyID:10^2 OR PropertyID:15)
This is simple to implement, but it may not work as expected if you include other criteria in your query, because that other criteria will affect the scoring.
Implement custom sorting via your own Comparator class. This may take quite a bit of work, especially given the lack of resources on the web for doing this, but it will give you the greatest control over your sorting. Here is an example of a custom Comparator used to sort by a string value alphabetically, which may be a good place for you to start.

Amortizing the calculation of distribution (and percentile), applicable on App Engine?

This is applicable to Google App Engine, but not necessarily constrained for it.
On Google App Engine, the database isn't relational, so aggregate functions (such as sum, average, etc.) can't be used. Each row is independent of the others. To calculate a sum or average, the app simply has to amortize the calculation by updating it on each individual new write to the database so that it's always up to date.
How would one go about calculating percentile and frequency distribution (i.e. density)? I'd like to make a graph of the density of a field of values, and this set of values is probably on the order of millions. It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
Is there some algorithm to calculate or approximate density/frequency/percentile distribution that can be calculated over a period of time?
By the way, the data is indeterminate in that the maximum and minimum may be all over the place. So the distribution would have to take approximately 95% of the data and only do a density based on that.
Getting whole rows (with that limit of 1000 at a time...) over and over again just to obtain a single number per row is surely unappealing. So denormalize the data by recording that single number in a separate entity that holds a list of numbers (up to a limit of, I believe, 1 MB per entity, so with 4-byte numbers no more than about 250,000 numbers per list).
So when adding a number, also fetch the latest "added data values" list entity; if it's full, make a new one instead. Append the new number and save it. This probably doesn't need to be transactional if a tiny error in the statistics is no killer, as you appear to imply.
If the data for an item can be changed, keep separate entities of the same kind recording the "deleted" data values; to change one item's value from 23 to 45, add 23 to the latest "deleted values" list and 45 to the latest "added values" one. This covers item deletion as well.
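
A plain-Python sketch of that bookkeeping; the datastore plumbing is left out, the names are illustrative, and nothing here is an App Engine API:

```python
import bisect

# Illustrative sketch of the "added values / deleted values lists" idea.
# The storage layer (one datastore entity per list, the ~1 MB entity cap)
# is omitted; each inner Python list stands in for one entity.
MAX_PER_LIST = 250_000   # roughly what fits in one entity with 4-byte numbers

added_lists = [[]]       # "added data values" entities
deleted_lists = [[]]     # "deleted data values" entities

def _append(lists, value):
    if len(lists[-1]) >= MAX_PER_LIST:   # current "entity" is full: start a new one
        lists.append([])
    lists[-1].append(value)

def record_add(value):
    _append(added_lists, value)

def record_change(old, new):
    # Changing 23 to 45 = add 23 to "deleted", add 45 to "added";
    # a plain deletion is just the first half of this.
    _append(deleted_lists, old)
    _append(added_lists, new)

def percentile(p):
    # Rebuild the current multiset: everything added minus everything deleted,
    # then read off the requested percentile.
    values = sorted(v for lst in added_lists for v in lst)
    for lst in deleted_lists:
        for v in lst:
            i = bisect.bisect_left(values, v)
            if i < len(values) and values[i] == v:
                values.pop(i)
    if not values:
        return None
    return values[int(p / 100.0 * (len(values) - 1))]

record_add(23)
record_add(7)
record_change(23, 45)    # the item that was 23 is now 45
print(percentile(50))    # current values are [7, 45] -> 7
```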
It may be feasible to loop through the whole dataset (the limit for each query is 1000 rows returned), and calculate based on that, but I'd rather do some smart approach.
This is the most obvious approach to me; why are you trying to avoid it?
