Randomness Comparison Experiment

I have a drug analysis experiment that needs to generate a value based on a given drug database and a set of 1000 random experiments.
The original database looks like this, where the numbers in the columns represent the rank of each gene for that drug. This is a simplified version; the actual database will have more drugs and more genes.
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 1 | 3 |
| B | 2 | 1 |
| C | 4 | 5 |
| D | 5 | 4 |
| E | 3 | 2 |
+-------+-------+-------+
A score is calculated from the user's input, A and C, using the following function:
# Compute Function
# ['A','C'] as array input
computeFunction(array) {
    # do some stuff with the array ...
}
The same formula is used for any provided values.
For the randomness test, each experiment set requires the algorithm to provide randomized values of A and C, so both A and C can take any value from 1 to 5.
Now I have two methods of selecting values to generate the 1000 sets for the p-value calculation, but I need someone to point out whether one is better than the other, or whether there is any way to compare the two methods.
Method 1
Generate 1000 randomized databases based on the given database shown above, meaning each table should contain a different set of value pairs.
Example of one database from the 1000 randomized databases:
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 2 | 3 |
| B | 4 | 4 |
| C | 3 | 2 |
| D | 1 | 5 |
| E | 5 | 1 |
+-------+-------+-------+
Next we run computeFunction() with the new A and C values.
Method 2
Pick random genes from the original database and use their values as the newly randomized gene values.
For example, we pick the values of E and B as the new values for A and C.
In the original database, E is 3 and B is 2.
So now A is 3 and C is 2. Next we run computeFunction() with the new A and C values.
Summary
Since both methods produce completely randomized input, it seems to me that they will produce similar 1000-value outcomes. Is there any way I could prove they are similar?
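For concreteness, here is a minimal sketch of how I could generate the 1000 scores with each method and compare the two distributions. The score function below is only a placeholder (the sum of the two ranks, not my real computeFunction), only the DrugA column is used, and scipy's two-sample Kolmogorov-Smirnov test is assumed as the comparison:

import random
from scipy.stats import ks_2samp

drug_a = {"A": 1, "B": 2, "C": 4, "D": 5, "E": 3}   # original DrugA column

def compute_function(values):
    return sum(values)   # placeholder for the real formula

def method1():
    # Method 1: shuffle the whole rank column, then read the new ranks of A and C
    shuffled = random.sample(list(drug_a.values()), k=len(drug_a))
    permuted = dict(zip(drug_a.keys(), shuffled))
    return compute_function([permuted["A"], permuted["C"]])

def method2():
    # Method 2: pick two distinct genes and reuse their original ranks for A and C
    g1, g2 = random.sample(list(drug_a.keys()), k=2)
    return compute_function([drug_a[g1], drug_a[g2]])

scores1 = [method1() for _ in range(1000)]
scores2 = [method2() for _ in range(1000)]

# Two-sample Kolmogorov-Smirnov test: a large p-value means the two score
# distributions cannot be distinguished at this sample size (ranks are
# discrete, so treat the p-value as a rough check only).
stat, p = ks_2samp(scores1, scores2)
print(stat, p)

Note that both methods draw two ranks without replacement from the same column, so their score distributions should agree; a test like this can only fail to detect a difference, it cannot strictly prove equality.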

Related

BigQuery: Sample a varying number of rows per group

I have two tables. One has a list of items, and for each item, a number n.
item | n
--------
a | 1
b | 2
c | 3
The second one has a list of rows containing item, uid, and other columns.
item | uid | data
------------------
a | x | foo
a | x | baz
a | x | bar
a | z | arm
a | z | leg
b | x | eye
b | x | eye
b | x | eye
b | x | eye
b | z | tap
c | y | tip
c | z | top
I would like to sample, for each (item, uid) pair, n rows (arbitrary; it's better if this is uniformly random, but it doesn't have to be). In the example above, I want to keep at most one row per user for item a, two rows per user for item b, and three rows per user for item c:
item | uid | data
------------------
a | x | baz
a | z | arm
b | x | eye
b | x | eye
b | z | tap
c | y | tip
c | z | top
ARRAY_AGG with LIMIT n doesn't work for two reasons: first, I suspect that given that n can be large (on the order of 100,000), this won't scale. The second, more fundamental problem is that n needs to be a constant.
Table sampling also doesn't seem to solve my problem, since it's per-table, and also only supports sampling a fixed percentage of rows, rather than a fixed number of rows.
Are there any other options?
Consider the solution below:
select * except(n)
from rows_list
join items_list
using(item)
where true
qualify row_number() over win <= n
window win as (partition by item, uid order by rand())
If applied to the sample data in your question, the output contains at most n rows per (item, uid) pair, as required.

Is there any calibration tool for comparing performance between two languages?

I'm measuring the performance of programs A and B. A is written in Golang, B is written in Python. The important point here is that I'm interested in how the performance value increases over time, not in the absolute performance values of the two programs.
For example,
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 3 | 500 |
+------+-----+-----+
| 2 | 5 | 800 |
+------+-----+-----+
| 3 | 9 | 1300|
+------+-----+-----+
| 4 | 13 | 1800|
+------+-----+-----+
The values in columns A and B (A: 3, 5, 9, 13 / B: 500, 800, 1300, 1800) are the execution times of each program. This execution time can be seen as performance, and the difference between the absolute values for A and B is very large, so directly comparing the slopes of the two performance graphs would be meaningless (Python is very slow compared to Golang).
I want to compare the performance of program A written in Golang with program B written in Python, and I'm looking for a calibration tool or formula, based on benchmarks, that estimates the execution time program A would have if it were written in Python.
Is there any way to solve this problem?
If you are interested in the relative change, you should normalize the data for each programming language. In other words, divide the Golang values by 3 and the Python values by 500, i.e. each series by its first measurement.
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 1 | 1 |
+------+-----+-----+
| 2 | 1.66| 1.6 |
+------+-----+-----+
| 3 | 3 | 2.6 |
+------+-----+-----+
| 4 |4.33 | 3.6 |
+------+-----+-----+
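As a minimal sketch of that normalization in Python (using only the sample values from your table):

# Divide each series by its first measurement so both start at 1 and only
# the relative growth remains.
a = [3, 5, 9, 13]            # Golang execution times
b = [500, 800, 1300, 1800]   # Python execution times

norm_a = [x / a[0] for x in a]   # 1.0, 1.67, 3.0, 4.33 (rounded)
norm_b = [x / b[0] for x in b]   # 1.0, 1.6, 2.6, 3.6

print(norm_a)
print(norm_b)

The normalized slopes can then be compared directly, since both series are dimensionless growth factors.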

How to filter by pattern in a variable length Cypher query

I have a very simple graph with 5 nodes (named n1 - n5), 1 node type (:Node) and 2 relationship types (:r1, :r2). The nodes and relationships are arranged as follows (apologies for the ascii art):
(n1)-[:r1]->(n2)-[:r1]->(n3)
(n1)-[:r2]->(n4)-[:r2]->(n3)
(n1)-[:r1]->(n5)-[:r2]->(n3)
I have a query using a variable length path. I expected to be able to restrict the paths returned by describing a specific pattern in the WHERE clause:
MATCH p = (n:Node {name: 'n1'})-[*..2]->()
WHERE (n)-[:r1]->()-[:r1]->()
RETURN p
The problem is that the response returns all possible paths. My question: is it possible to filter the returned paths when specifying a variable-length path in a query?
If all relationships or nodes have to adhere to the same predicate, this is easy. You'll need a variable for the path, and you'll need to use all() (or none()) in your WHERE clause to apply the predicate for all relationships or nodes in your path:
MATCH p = (n:Node {name: 'n1'})-[*..2]->()
WHERE all(rel in relationships(p) WHERE type(rel) = 'r1')
RETURN p
That said, when all you want is for all relationships in the var-length path to be of the same type (or types, if you want multiple), that's best done in the pattern itself:
MATCH p = (n:Node {name: 'n1'})-[:r1*..2]->()
RETURN p
For more complicated cases, such as multiple relationship types (where the order of those types matters in the path), or repeating sequences of types or node labels in the path, then alternate approaches are needed. APOC path expanders may help.
EDIT
You mentioned in the comments that your case deals with sequences of relationships of varying lengths. While the APOC path expanders may help, there are a few restrictions:
The path expanders currently operate on node labels and relationship types, but not properties, so if your expansions rely on predicates on properties, the path expanders won't be able to handle that for you during expansion; that would have to be done by filtering the path expander results afterwards.
There are limits to the relationship sequence support for path expanders. We can define sequences of any length, and can accept multiple relationship types at each step in the sequence, but we don't currently support diverging sequences ((r1 then r2 then r3) or (r2 then r5 then r6)).
If we wanted to do a 3-step sequence of r1 (incoming), r2 (outgoing), then r3 or r4 (with r3 in either direction and r4 outgoing), repeating the sequence up to 3 times we could do so like this:
MATCH (n:Node {name: 'n1'})
CALL apoc.path.expandConfig(n, {relationshipFilter:'<r1, r2>, r3 | r4>', minLevel:1, maxLevel:9}) YIELD path
RETURN path
Note that we can provide differing directions per relationship in the filter, or leave off the arrow entirely if we don't care about the direction.
Label filtering is more complex, but I didn't see any need for that present in the examples so far.
Your query returns all paths because your WHERE clause (Filter operator) is applied before the VarLengthExpand operator:
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+----------------------------+------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Page Cache Hits | Page Cache Misses | Page Cache Hit Ratio | Variables | Other |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+----------------------------+------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 0 | 4 | 0 | 0 | 0 | 0.0000 | anon[32], anon[41], n, p | |
| | +----------------+------+---------+-----------------+-------------------+----------------------+----------------------------+------------------------------------------------------------------------------------------------------------+
| +Projection | 0 | 4 | 0 | 0 | 0 | 0.0000 | p -- anon[32], anon[41], n | {p : PathExpression(NodePathStep(Variable(n),MultiRelationshipPathStep(Variable(),OUTGOING,NilPathStep)))} |
| | +----------------+------+---------+-----------------+-------------------+----------------------+----------------------------+------------------------------------------------------------------------------------------------------------+
| +VarLengthExpand(All) | 0 | 4 | 7 | 0 | 0 | 0.0000 | anon[32], anon[41] -- n | (n)-[:*..2]->() |
| | +----------------+------+---------+-----------------+-------------------+----------------------+----------------------------+------------------------------------------------------------------------------------------------------------+
| +Filter | 0 | 1 | 6 | 0 | 0 | 0.0000 | n | n.name = { AUTOSTRING0}; GetDegree(Variable(n),Some(RelTypeName(KNOWS)),OUTGOING) > 0 |
| | +----------------+------+---------+-----------------+-------------------+----------------------+----------------------------+------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 4 | 4 | 5 | 0 | 0 | 0.0000 | n | :Crew |
+-----------------------+----------------+------+---------+-----------------+-------------------+----------------------+----------------------------+------------------------------------------------------------------------------------------------------------+
This should get you going:
MATCH p = (n:Node {name: 'n1'})-[*..2]->()
WITH n, relationships(p)[0] as rel0, relationships(p)[1] as rel1, p
MATCH (n)-[rel0:r1]->()-[rel1:r1]->()
RETURN p

MapReduce matrix multiplication complexity

Assume that we have a large file which contains descriptions of the cells of two matrices (A and B):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 10 | A |
| 1 | 2 | 20 | A |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 1 | 1 | 5 | B |
| 1 | 2 | 7 | B |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
And we want to calculate the product of these matrices: C = A x B
By definition: C_i_j = sum over k ( A_i_k * B_k_j )
And here is a two-step MapReduce algorithm for calculating this product (in pseudocode):
First step:
function Map (input is a single row of the file from above):
    i = row[0]
    j = row[1]
    value = row[2]
    matrix = row[3]
    if (matrix == 'A')
        emit(j, {i, value, 'A'})   # key on A's column index (k = j)
    else
        emit(i, {j, value, 'B'})   # key on B's row index (k = i)
The complexity of this Map function is O(1).
function Reduce (Key, List of tuples from the Map function):
    Matrix_A_tuples =
        filter( List of tuples from the Map function, where matrix == 'A' )
    Matrix_B_tuples =
        filter( List of tuples from the Map function, where matrix == 'B' )
    for each tuple_A from Matrix_A_tuples
        i = tuple_A[0]
        value_A = tuple_A[1]
        for each tuple_B from Matrix_B_tuples
            j = tuple_B[0]
            value_B = tuple_B[1]
            emit({i, j}, {value_A * value_B, 'C'})
The complexity of this Reduce function is O(N^2) per key, so the reduce phase produces O(N^3) partial products overall.
After the first step we will get something like the following file (which contains O(N^3) lines):
+---------------------------------+
| i | j | value | matrix |
+---------------------------------+
| 1 | 1 | 50 | C |
| 1 | 1 | 45 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
| 2 | 2 | 70 | C |
| 2 | 2 | 17 | C |
| | | | |
| ... | ... | ... | ... |
| | | | |
+---------------------------------+
So all we have to do is sum the values from the lines that contain the same i and j.
Second step:
function Map (input is a single row of the file produced in the first step):
    i = row[0]
    j = row[1]
    value = row[2]
    emit({i, j}, value)

function Reduce (Key, List of values from the Map function):
    i = Key[0]
    j = Key[1]
    result = 0
    for each Value from List of values from the Map function
        result += Value
    emit({i, j}, result)
After the second step we will get a file which contains the cells of the matrix C.
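For illustration only, here is a minimal single-machine sketch of the two jobs in Python, with small hard-coded 2x2 matrices (not part of the description above) so the data flow can be followed end to end:

from collections import defaultdict

# input rows: (i, j, value, matrix)
rows = [
    (1, 1, 10, 'A'), (1, 2, 20, 'A'),
    (2, 1, 30, 'A'), (2, 2, 40, 'A'),
    (1, 1, 5, 'B'), (1, 2, 7, 'B'),
    (2, 1, 6, 'B'), (2, 2, 8, 'B'),
]

# First step, map: key A cells by their column index k = j, B cells by their row index k = i
step1 = defaultdict(list)
for i, j, value, matrix in rows:
    if matrix == 'A':
        step1[j].append((i, value, 'A'))
    else:
        step1[i].append((j, value, 'B'))

# First step, reduce: one partial product per (A row, B column) pair sharing the same k
partials = []
for k, tuples in step1.items():
    a_tuples = [t for t in tuples if t[2] == 'A']
    b_tuples = [t for t in tuples if t[2] == 'B']
    for i, value_a, _ in a_tuples:
        for j, value_b, _ in b_tuples:
            partials.append(((i, j), value_a * value_b))

# Second step, map + reduce: sum the partial products that share the same (i, j)
c = defaultdict(int)
for (i, j), value in partials:
    c[(i, j)] += value

print(sorted(c.items()))
# -> [((1, 1), 170), ((1, 2), 230), ((2, 1), 390), ((2, 2), 530)]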
So the question is:
Taking into account that there are multiple instances in the MapReduce cluster, what is the most appropriate way to estimate the complexity of the provided algorithm?
The first estimate that comes to mind is this:
Assume that the number of instances in the MapReduce cluster is K.
Because the number of lines in the file produced after the first step is O(N^3), the overall complexity can be estimated as O(N^3 / K).
But this estimation doesn't take into account many details, such as the network bandwidth between instances of the MapReduce cluster, the ability to distribute data between instances and perform most of the calculations locally, etc.
So I would like to know what the best approach is for estimating the efficiency of the provided MapReduce algorithm, and whether it makes sense to use Big-O notation to estimate the efficiency of MapReduce algorithms at all.
As you said, Big-O estimates computational complexity and does not take into consideration networking issues such as bandwidth, congestion, and delay.
If you want to measure how efficient the communication between instances is, you need other networking metrics.
However, note that if your file is not big enough, you will not see an improvement in execution speed, because MapReduce only works efficiently with big data. Moreover, your code has two steps, which means two jobs; between one job and the next, MapReduce takes time to upload the file and start the job again, which can slightly affect performance.
You can evaluate efficiency in terms of speed and time, since the MapReduce approach is certainly faster than a sequential algorithm once the data is big.
Moreover, efficiency can also be considered with regard to fault tolerance: MapReduce handles failures by itself, so programmers do not need to handle instance or network failures.

Data management with several variables

Currently I am facing the following problem, which I'm working to solve in Stata. I have added the algorithm tag because it's mainly the steps that I'm interested in, rather than the Stata code.
I have some variables, say, var1 - var20 that can possibly contain a string. I am only interested in some of these strings, let us call them A,B,C,D,E,F, but other strings can occur also (all of these will be denoted X). Also I have a unique identifier ID. A part of the data could look like this:
ID | var1 | var2 | var3 | .. | var20
1 | E | | | | X
1 | | A | | | C
2 | X | F | A | |
8 | | | | | E
Now I want to create an entry for every ID and for every occurrence of one of the strings A,B,C,D,E,F in any of the variables. The above data should look like this:
ID | var1 | var2 | var3 | .. | var20
1 | E | | | .. |
1 | | A | | |
1 | | | | | C
2 | | F | | |
2 | | | A | |
8 | | | | | E
Here we ignore any string X that is NOT A,B,C,D,E or F. My attempt so far was to create a variable that, for each entry, counts the number N of occurrences of A,B,C,D,E,F. In the original data above that variable would be N=1,2,2,1. Then each entry is expanded into N copies. This results in the data:
ID | var1 | var2 | var3 | .. | var20
1 | E | | | | X
1 | | A | | | C
1 | | A | | | C
2 | X | F | A | |
2 | X | F | A | |
8 | | | | | E
My problem is: how do I attack this from here? And sorry for the poor title, but I couldn't word it any more specifically.
Sorry, I thought the final block was your desired output (now I understand that it's what you've accomplished so far). You can get the middle block with two calls to reshape (long, then wide).
First I'll generate data to match yours.
clear
set obs 4
* ids
generate n = _n
generate id = 1 in 1/2
replace id = 2 in 3
replace id = 8 in 4
* generate your variables
forvalues i = 1/20 {
    generate var`i' = ""
}
replace var1 = "E" in 1
replace var1 = "X" in 3
replace var2 = "A" in 2
replace var2 = "F" in 3
replace var3 = "A" in 3
replace var20 = "X" in 1
replace var20 = "C" in 2
replace var20 = "E" in 4
Now the two calls to reshape.
* reshape to long, keep only desired obs, then reshape to wide
reshape long var, i(n id) string
keep if inlist(var, "A", "B", "C", "D", "E", "F")
tempvar long_id
generate int `long_id' = _n
reshape wide var, i(`long_id') string
The first reshape converts your data from wide to long. The var specifies that the variables you want to reshape to long all start with var. The i(n id) specifies that each unique combination of n and id is a unique observation. The reshape call provides one observation for each n-id combination for each of your var1 through var20 variables. So now there are 4*20=80 observations. Then I keep only the strings that you'd like to keep with inlist().
For the second reshape call var specifies that the values you're reshaping are in variable var and that you'll use this as the prefix. You wanted one row per remaining letter, so I made a new index (that has no real meaning in the end) that becomes the i index for the second reshape call (if I used n-id as the unique observation, then we'd end up back where we started, but with only the good strings). The j index remains from the first reshape call (variable _j) so the reshape already knows what suffix to give to each var.
These two reshape calls yield:
. list n id var1 var2 var3 var20
+-------------------------------------+
| n id var1 var2 var3 var20 |
|-------------------------------------|
1. | 1 1 E |
2. | 2 1 A |
3. | 2 1 C |
4. | 3 2 F |
5. | 3 2 A |
|-------------------------------------|
6. | 4 8 E |
+-------------------------------------+
You can easily add back variables that don't survive the two reshapes.
* if you need to add back dropped variables
forvalues i = 1/20 {
    capture confirm variable var`i'
    if _rc {
        generate var`i' = ""
    }
}
