Genotype data simulation - genetics

I would like to simulate genotype data. The desired output is a PLINK MAP and PED file (BED/BIM/FAM is also fine). Each person has 10,000 SNPs, and the data should contain two populations: 6,000 individuals in the first population and 4,000 in the second. Within the first population, the 6,000 individuals are randomly paired into 3,000 couples; each couple has two children, whose genotypes are generated by gene dropping. Likewise, within the second population, the 4,000 individuals are randomly paired into 2,000 couples, each again with two children generated by gene dropping. Finally, I want a PLINK MAP and PED file with 20,000 individuals, each with 10,000 SNPs.
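Since no linkage or mutation model is specified, here is a minimal sketch of such a simulation in Python/NumPy under simplifying assumptions: unlinked biallelic SNPs with uniformly drawn founder allele frequencies. The A/B allele coding, SNP positions, and the flat pedigree columns in the PED are placeholders to adapt (e.g. to record real parent IDs):

import numpy as np

rng = np.random.default_rng(1)
N_SNPS = 10_000

def founders(n, freqs):
    # n founder individuals, each stored as two haplotypes of 0/1 alleles
    return rng.binomial(1, freqs, size=(n, 2, N_SNPS)).astype(np.int8)

def gene_drop(mother, father):
    # each parent transmits one randomly chosen allele per SNP
    # (SNPs treated as unlinked; no recombination map)
    idx = np.arange(N_SNPS)
    return np.stack([mother[rng.integers(0, 2, N_SNPS), idx],
                     father[rng.integers(0, 2, N_SNPS), idx]])

def simulate_pop(n_founders, freqs):
    pop = founders(n_founders, freqs)
    order = rng.permutation(n_founders)                  # random pairing into couples
    kids = [gene_drop(pop[order[i]], pop[order[i + 1]])  # two children per couple
            for i in range(0, n_founders, 2) for _ in range(2)]
    return np.concatenate([pop, np.stack(kids)])

pop1 = simulate_pop(6000, rng.uniform(0.05, 0.95, N_SNPS))  # 6000 founders + 6000 children
pop2 = simulate_pop(4000, rng.uniform(0.05, 0.95, N_SNPS))  # 4000 founders + 4000 children
genos = np.concatenate([pop1, pop2])                        # 20000 x 2 x 10000, ~400 MB

with open("sim.map", "w") as f:                # chrom, SNP id, cM, bp (placeholder values)
    for j in range(N_SNPS):
        f.write(f"1 snp{j} 0 {j + 1}\n")

with open("sim.ped", "w") as f:
    for i, g in enumerate(genos):
        # FID, IID, father, mother, sex, phenotype; everyone is written as a
        # founder here -- fill in real parent IDs if you need the fam structure
        cols = [f"fam{i}", f"ind{i}", "0", "0", "0", "-9"]
        cols += [("A", "B")[a] for pair in zip(g[0], g[1]) for a in pair]
        f.write(" ".join(cols) + "\n")

From there, plink --file sim --make-bed should produce the BED/BIM/FAM version.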


How to select highly variable genes in bulk RNA seq data?

As a pre-processing step, I need to select the top 1000 highly variable genes (rows) from a bulk RNA-seq dataset that contains about 60k genes across 100 different samples (columns). Each column value is already the mean of triplicates. The table contains normalized values in FPKM. (Note: I don't have access to raw counts and cannot use the common R packages, as these take raw counts as input.)
In this case, what is the best way to select the top 1000 variable genes?
I have tried to filter out genes using the rowSums() function (to remove genes with low row sums) and narrowed the set down from 60k genes to 10k, but I am not sure this is the right way to select highly variable genes. Any input is appreciated.
Filtering on row sums is a reasonable first step. After that, you can filter further with a log2 fold-change cutoff and an adjusted p-value threshold (0.05 or 0.01, depending on your goal). You can repeat this procedure with different row-sum cutoffs and compare the results. I personally discard genes whose row sums are zero.
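If differential-expression-style cutoffs aren't what you're after, a common alternative is to rank genes by their variance after a log transform and keep the top 1000. A minimal sketch in Python, assuming a genes-by-samples table in fpkm_matrix.csv (the file name and pseudo-count are illustrative):

import numpy as np
import pandas as pd

# hypothetical genes-by-samples matrix of normalized FPKM values
fpkm = pd.read_csv("fpkm_matrix.csv", index_col=0)

log_fpkm = np.log2(fpkm + 1)        # damp the mean-variance relationship of FPKM
gene_var = log_fpkm.var(axis=1)     # per-gene variance across the 100 samples
top1000 = fpkm.loc[gene_var.nlargest(1000).index]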

Gene representation for production planning with constraints

I'm trying to improve the throughput of a production system. The exact type of the system isn't relevant (I think).
Description
The system consists of a LINE of stations (numbered 1, 2, 3...) and an ARM.
The system receives an ITEM at random times.
Each ITEM has a PLAN associated with it (for example, ITEM1 may have a PLAN which
says it needs to go through station 3, then 1, then 5). The PLAN includes timing information on
how long the ITEM would be at each station (a range of hard max/min values).
Every STATION can hold one ITEM at a time.
The ARM is used to move each ITEM from one STATION to the next. Each PLAN includes
timing information for the ARM as well, which is a fixed value.
Current Practice
I have two current (working) planning solutions.
The first maintains a master list of usage for each STATION; consider this a 'booking' approach.
As each new ITEM-N enters, the system searches ahead to find the earliest possible slot where
PLAN-N would fit. So, for example, it would try to fit it at t=0, then progressively try higher
delays until it found a fit (well, actually I have some heuristics here to cut down processing
time, but the approach holds).
The second maintains a list for each ITEM specifying when it is to start. When a new ITEM-N
enters, the system compares its PLAN-N with all existing lists to find a suitable time to
start. Again, it starts at t=0 and then progressively tries higher delays.
Neither of the two solutions takes advantage of the range of times an ITEM is allowed at each
station. A fixed time is assumed (midpoint or minimum).
Ideal Solution
It's quite self-evident that there exist situations where an incoming ITEM would be able to
start earlier than otherwise possible if some of the current ITEMs changed the duration they
spend in certain STATIONs, whether by shortening that duration (so the new ITEM could enter
the STATION instead) or lengthening it (so the ARM has time to move the ITEM).
I'm trying to implement a Genetic Algorithm solution to the problem. My current gene contains N
numbers (between 0 and 1), where N is the total number of stations across all items currently in
the system, plus the new item which is to be added. It's trivial to convert this gene to actual
durations (0 is the min duration, 1 is the max, scaled linearly in between).
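For concreteness, that decoding step might look like the following sketch (the function and parameter names are illustrative):

def decode(gene, min_dur, max_dur):
    # 0 maps to the minimum duration, 1 to the maximum, linear in between
    return [lo + g * (hi - lo) for g, lo, hi in zip(gene, min_dur, max_dur)]

# decode([0.0, 0.5, 1.0], [10, 20, 30], [20, 40, 60]) -> [10.0, 30.0, 60.0]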
However, this gene representation consistently produces unusable plans that overlap with each
other. The reason is that when multiple items are already arranged ideally (consecutive in
time, planning-wise), no variation in durations is possible. This is unavoidable, because once
items are already being processed, they cannot be delayed or brought forward.
An example of the above situation, say ITEMA is in STATION3 for durations t1 to t2 and t3 to
t4. ITEMB then comes along and occupies STATION3 for duration t2 to t3 (so STATION3 is fully
utilized between t1 and t4). With my current gene representation, I'm virtually guaranteed never to
find a valid solution, since that would require certain elements of the gene to have exactly the
correct value so as not to generate an overlap.
Questions
Is there a better gene representation than the one I describe above?
Would I be better served doing some simple hill-climbing to find modifiable timings? Or is a GA
actually suited to this problem?

pair-wise aggregation of input lines in Hadoop

A bunch of driving cars produce traces (sequences of ordered positions):
car_id  order_id  position
car1    0         (x0,y0)
car1    1         (x1,y1)
car1    2         (x2,y2)
car2    0         (x0,y0)
car2    1         (x1,y1)
car2    2         (x2,y2)
car2    3         (x3,y3)
car2    4         (x4,y4)
car3    0         (x0,y0)
I would like to compute the distance (path length) driven by the cars.
At the core, I need to process all records line by line pair-wise. If the
car_id of the previous line is the same as the current one then I need to
compute the distance to the previous position and add it up to the aggregated
value. If the car_id of the previous line is different from the current line
then I need to output the aggregate for the previous car_id, and initialize the
aggregate of the current car_id with zero.
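In plain sequential form, that core logic would be something like this sketch (assuming records are already sorted by car_id and order_id; the tuple layout is an assumption):

import math

def total_distances(records):
    # records: (car_id, order_id, x, y) tuples, sorted by car_id, then order_id
    prev_car, prev_pos, total = None, None, 0.0
    for car, _, x, y in records:
        if car == prev_car:
            total += math.dist(prev_pos, (x, y))  # add distance to previous position
        elif prev_car is not None:
            yield prev_car, total                 # output aggregate of the previous car
            total = 0.0                           # initialize aggregate of the new car
        prev_car, prev_pos = car, (x, y)
    if prev_car is not None:
        yield prev_car, total                     # aggregate of the last car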
What should the architecture of the Hadoop program look like? Is it possible to
achieve the following:
Solution (1):
(a) Every mapper computes the aggregated distance of the trace (per physical
block)
(b) Every mapper aggregates the distances further in case the trace was split
among multiple blocks and nodes
Comment: this solution requires knowing whether I am on the last record (line)
of the block. Is this information available at all?
Solution (2)
(a) The mappers read the data line by line (do no computations) and send the
data to the reducer based on the car_id.
(b) The reducers sort the data for individual car_ids based on order_id,
compute the distances, and aggregate them
Comment: high network load due to the laziness of the mappers
Solution (3)
(a) implement a custom reader that defines a logical record to be the whole
trace of one car
(b) each mapper computes the distances and the aggregate
(c) reducer is not really needed as everything is done by the mapper
Comment: high main memory costs as the whole trace needs to be loaded into main
memory (although only two lines are used at a time).
I would go with Solution (2), since it is the cleanest to implement and reuse.
You certainly want to sort based on car_id AND order_id, so you can compute the distances on the fly without loading everything into memory.
Your concern about high network usage is valid; however, you can pre-aggregate the distances in a combiner.
What would that look like? Let's use some pseudo-code:
Mapper:
  foreach record:
    emit((car_id, order_id), (x, y))
Combiner:
  if prev_order_id + 1 == order_id: // subsequent measures
    // compute the distance and emit it as the last possible order
    emit((car_id, MAX_VALUE), distance(prev, cur))
  else:
    // send to the reducer, since it is probably crossing block boundaries
    emit((car_id, order_id), (x, y))
The reducer then has two main parts:
(a) compute the sum over subsequent measures, as the combiner did
(b) sum over all partial sums tagged with order_id = MAX_VALUE
That's already the best effort you can get from a network-usage point of view.
From a software point of view, you would be better off using Spark: your logic will be five lines instead of 100 spread across three class files.
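For illustration, the whole job in PySpark might look roughly like the following sketch (the input file name and column names are assumptions):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
# hypothetical input with columns car_id, order_id, x, y
traces = spark.read.csv("traces.csv", header=True, inferSchema=True)

w = Window.partitionBy("car_id").orderBy("order_id")
step = F.sqrt((F.col("x") - F.lag("x").over(w)) ** 2 +
              (F.col("y") - F.lag("y").over(w)) ** 2)

result = (traces.withColumn("step", step)   # distance to the previous position
                .groupBy("car_id")
                .agg(F.sum("step").alias("distance")))  # first row per car is null, ignored by sum
result.show()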
For your other question:
this solution requires knowing whether I am on the last record (line)
of the block. Is this information available at all?
Hadoop only guarantees that it does not split through records when reading; it may very well be that your record already touches two different blocks underneath. The way to find out is basically to rewrite your input format to make this information available to your mappers, or, even better, to take your logic into account when splitting blocks.

random group generator in R

I have a data set consisting of 250 students from different schools and classrooms. For the experimental design, I would like to randomly generate 35 groups of approx. 7 students each, and then, after the first activity, break up the students as randomly as possible into 25 groups of 10 students each. Is there a package and an example of how I can do this in R?
If there's no relation between which groups students are in during the first and second activities, then it's the same problem twice over.
Assuming a student can only belong to one group at a time, just shuffle the array and pull out elements per your group size until there are no more left.
students <- 1:250
groups1 <- split(sample(students), rep(1:35, length.out = 250))  # first round: ~7 per group
groups2 <- split(sample(students), rep(1:25, each = 10))         # second round: 25 groups of 10

How to organise and rank observations of a variable?

I have this dataset containing world bilateral trade data for a few years.
I would like to determine which goods were the most exported ones in the timespan considered by the dataset.
The dataset is composed of the following variables:
"year"
"hs2", containing a two-digit number that tells which good is exported
"exp_val", giving the value of the export in a certain year, for that good
"exp_qty", giving the exported quantity of the good in a certain year
Basically, I would like to get the total sum of the quantity exported for a certain good, so an output like
hs2 exp_qty
01 34892
02 54548
... ...
and so forth. Right now, the column "hs2" gives me a very large number of observations and, as you can understand, the values repeat themselves many times (as the variables vary across both time and country of destination). So, the task would be to have every hs2 number just once, with the corresponding value of total exports.
Also (this would be just a plus; I could check the numbers myself), it would be nice to get the result sorted by exp_qty, so as to have a ranking of the most exported goods by quantity.
The following might be a start at what you need.
collapse (sum) exp_qty, by(hs2)
gsort -exp_qty
collapse summarizes the data in memory to one observation per value of hs2, summing the values of exp_qty. gsort then sorts the collapsed data by descending value of exp_qty so the first observation will be the largest. See help collapse and help gsort for further details.
