Data management with several variables - algorithm

Currently I am facing the following problem, which I'm trying to solve in Stata. I have added the algorithm tag because it's mainly the steps that I'm interested in, rather than the Stata code.
I have some variables, say var1-var20, that may each contain a string. I am only interested in some of these strings, call them A, B, C, D, E, F; other strings can also occur (all of these will be denoted X). I also have a unique identifier ID. Part of the data could look like this:
ID | var1 | var2 | var3 | .. | var20
 1 | E    |      |      |    | X
 1 |      | A    |      |    | C
 2 | X    | F    | A    |    |
 8 |      |      |      |    | E
Now I want to create an entry for every ID and for every occurrence of one of the strings A, B, C, D, E, F in any of the variables. The above data should look like this:
ID | var1 | var2 | var3 | .. | var20
 1 | E    |      |      |    |
 1 |      | A    |      |    |
 1 |      |      |      |    | C
 2 |      | F    |      |    |
 2 |      |      | A    |    |
 8 |      |      |      |    | E
Here we ignore every string X, i.e. anything that is NOT A, B, C, D, E or F. My attempt so far was to create a variable that, for each row, counts the number N of occurrences of A, B, C, D, E, F. In the original data above that variable would be N = 1, 2, 2, 1. Then I expand each row to N copies. This results in the data:
ID | var1 | var2 | var3 | .. | var20
 1 | E    |      |      |    | X
 1 |      | A    |      |    | C
 1 |      | A    |      |    | C
 2 | X    | F    | A    |    |
 2 | X    | F    | A    |    |
 8 |      |      |      |    | E
My problem is: how do I attack it from here? And sorry for the poor title, but I couldn't word it any more specifically.

Sorry, I thought the final block was your desired output (now I understand that it's what you've accomplished so far). You can get the middle block with two calls to reshape (long, then wide).
First I'll generate data to match yours.
clear
set obs 4
* ids
generate n = _n
generate id = 1 in 1/2
replace id = 2 in 3
replace id = 8 in 4
* generate your variables
forvalues i = 1/20 {
    generate var`i' = ""
}
replace var1 = "E" in 1
replace var1 = "X" in 3
replace var2 = "A" in 2
replace var2 = "F" in 3
replace var3 = "A" in 3
replace var20 = "X" in 1
replace var20 = "C" in 2
replace var20 = "E" in 4
Now the two calls to reshape.
* reshape to long, keep only desired obs, then reshape to wide
reshape long var, i(n id) string
keep if inlist(var, "A", "B", "C", "D", "E", "F")
tempvar long_id
generate int `long_id' = _n
reshape wide var, i(`long_id') string
The first reshape converts your data from wide to long. The var stub specifies that the variables you want to reshape to long all start with var. The i(n id) option specifies that each unique combination of n and id identifies an observation. The reshape call produces one observation for each n-id combination for each of your var1 through var20 variables, so now there are 4*20 = 80 observations. Then I keep only the strings that you'd like to keep, using inlist().
For the second reshape call, var specifies that the values you're reshaping are in the variable var and that it will be used as the prefix. You wanted one row per remaining letter, so I made a new index (which has no real meaning in the end) that becomes the i index for the second reshape call (if I used the n-id combination as the unique observation, we'd end up back where we started, only with the bad strings removed). The j index remains from the first reshape call (variable _j), so reshape already knows what suffix to give each var.
These two reshape calls yield:
. list n id var1 var2 var3 var20

     +-------------------------------------+
     | n   id   var1   var2   var3   var20 |
     |-------------------------------------|
  1. | 1    1      E                       |
  2. | 2    1             A                |
  3. | 2    1                            C |
  4. | 3    2             F                |
  5. | 3    2                    A         |
     |-------------------------------------|
  6. | 4    8                            E |
     +-------------------------------------+
You can easily add back variables that don't survive the two reshapes.
* if you need to add back dropped variables
forvalues i = 1/20 {
    capture confirm variable var`i'
    if _rc {
        generate var`i' = ""
    }
}
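Since the question notes that the steps matter more than the Stata code, here is a rough Python/pandas sketch of the same wide -> long -> filter -> wide idea, purely as an illustration (the column names mirror the toy example above; none of this is needed for the Stata solution):
import pandas as pd

# Toy data mirroring the question (only var1-var3 and var20 shown)
df = pd.DataFrame({
    'id':    [1, 1, 2, 8],
    'var1':  ['E', None, 'X', None],
    'var2':  [None, 'A', 'F', None],
    'var3':  [None, None, 'A', None],
    'var20': ['X', 'C', None, 'E'],
})
wanted = {'A', 'B', 'C', 'D', 'E', 'F'}

# Step 1: wide -> long, one row per (id, variable) pair
long_df = df.melt(id_vars='id', var_name='variable', value_name='value')

# Step 2: keep only the strings of interest
long_df = long_df[long_df['value'].isin(wanted)].copy()

# Step 3: long -> wide again, one output row per surviving letter
long_df['row'] = range(len(long_df))
wide_df = long_df.pivot(index=['id', 'row'], columns='variable', values='value')
print(wide_df.reset_index(level='row', drop=True))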

Related

Delete column and add column with default value (Amplify GraphQL, DynamoDB, AppSync)

I want to delete a whole column at once. With thousands of records, deleting items one by one is time consuming, so is there a better way to do it?
My next question is that, after deleting, I need to add a new column with a default value, i.e.:
id | x | y | z
| A | B | C
| A | B | C
| A | B | C
| A | B | C
If the above is the table, with thousands of records, I want to delete "z" and add "newColumn" with a default value of "D", as below:
id | x | y | newColumn
| A | B | D
| A | B | D
| A | B | D
| A | B | D
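Since DynamoDB is schemaless, "deleting a column" really means removing that attribute from every item, and adding a column with a default means writing it to every item; there is no single-statement way to do either, so each item has to be updated once. Below is a minimal Python/boto3 sketch of that idea; the table name MyTable and the partition key id are placeholders for your actual table:
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyTable')   # hypothetical table name

start_key = None
while True:
    # Scan one page of items, fetching only the key attribute
    kwargs = {
        'ProjectionExpression': '#pk',
        'ExpressionAttributeNames': {'#pk': 'id'},
    }
    if start_key:
        kwargs['ExclusiveStartKey'] = start_key
    page = table.scan(**kwargs)

    for item in page['Items']:
        # One update per item: add newColumn with default 'D' and drop z
        table.update_item(
            Key={'id': item['id']},
            UpdateExpression='SET #new = :d REMOVE #old',
            ExpressionAttributeNames={'#new': 'newColumn', '#old': 'z'},
            ExpressionAttributeValues={':d': 'D'},
        )

    start_key = page.get('LastEvaluatedKey')
    if start_key is None:
        break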

Randomness Comparison Experiment

I have a drug analysis experiment that needs to generate a value based on a given drug database and a set of 1000 random experiments.
The original database looks like this, where the numbers in the columns represent the rank for each drug. This is a simplified version; the actual database will have more drugs and more genes.
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 1 | 3 |
| B | 2 | 1 |
| C | 4 | 5 |
| D | 5 | 4 |
| E | 3 | 2 |
+-------+-------+-------+
A score is calculated based on the user's input (A and C), using the following formula:
# Compute Function
# ['A','C'] as array input
computeFunction(array) {
# do some stuff with the array ...
}
The same formula is used for any provided values.
For the randomness test, each experiment requires the algorithm to provide randomized values for A and C, so both A and C can take any number from 1 to 5.
Now I have two methods of selecting values to generate the 1000 sets for the p-value calculation, but I need someone to point out whether one method is better than the other, or whether there is any way to compare the two methods.
Method 1
Generate 1000 randomized databases based on the given database shown above, meaning each table should contain a different set of value pairs.
Example of one database out of the 1000 randomized databases:
+-------+-------+-------+
| Genes | DrugA | DrugB |
+-------+-------+-------+
| A | 2 | 3 |
| B | 4 | 4 |
| C | 3 | 2 |
| D | 1 | 5 |
| E | 5 | 1 |
+-------+-------+-------+
Next we perform computeFunction() with the new A and C values.
Method 2
Pick random genes from the original database and use their values as the newly randomized gene values.
For example, we pick the values of E and B as the new values for A and C.
In the original database, E is 3 and B is 2.
So now A is 3 and C is 2. Next we perform computeFunction() with the new A and C values.
Summary
Since both methods produce completely randomized input, it seems to me that they will produce similar 1000-value outcomes. Is there any way I could show that they are similar?
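One practical way to check is to run both methods many times, collect the resulting scores, and compare the two score distributions with a two-sample Kolmogorov-Smirnov test: a large p-value means the test cannot tell them apart. A rough Python sketch is below; compute_function is just a stand-in (sum of the ranks), since the real formula isn't given:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simplified database: ranks of genes A-E for DrugA (DrugB would be handled the same way)
genes = ['A', 'B', 'C', 'D', 'E']
drug_a = np.array([1, 2, 4, 5, 3])

def compute_function(values):
    # Placeholder for the real scoring formula
    return values.sum()

def method1_score():
    # Method 1: shuffle the whole rank column, then read off the new A and C
    shuffled = rng.permutation(drug_a)
    return compute_function(shuffled[[0, 2]])   # positions of A and C

def method2_score():
    # Method 2: pick two random genes and reuse their original ranks as the new A and C
    picks = rng.choice(len(genes), size=2, replace=False)
    return compute_function(drug_a[picks])

scores1 = np.array([method1_score() for _ in range(1000)])
scores2 = np.array([method2_score() for _ in range(1000)])

# A large p-value means the two score distributions are statistically indistinguishable
print(stats.ks_2samp(scores1, scores2))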

hive rows preceding unexpected behavior

Given this ridiculously simple data set:
+--------+-----+
| Bucket | Foo |
+--------+-----+
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
+--------+-----+
I want to see the value of Foo in the previous row:
select
foo,
max(foo) over (partition by bucket order by foo rows between 1 preceding and 1 preceding) as prev_foo
from
...
Which gives me:
+--------+-----+----------+
| Bucket | Foo | Prev_Foo |
+--------+-----+----------+
| 1 | A | A |
| 1 | B | A |
| 1 | C | B |
| 1 | D | C |
+--------+-----+----------+
Why do I get 'A' back for the first row? I would expect it to be null. It's throwing off calculations where I'm looking for that null. I can work around it by throwing a row_number() in there, but I'd prefer to handle it with fewer calculations.
Use the LAG function to get the previous row's value; it returns NULL when there is no preceding row:
LAG(foo) OVER(partition by bucket order by foo) as Prev_Foo
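For what it's worth, the same previous-row logic in Python/pandas (a grouped shift) also yields a missing value for the first row of each bucket, which is the behavior you were expecting from the window frame:
import pandas as pd

df = pd.DataFrame({'bucket': [1, 1, 1, 1], 'foo': ['A', 'B', 'C', 'D']})

# Previous value of foo within each bucket, ordered by foo; the first row
# of each bucket has no predecessor and comes back as NaN (the SQL NULL analogue)
df = df.sort_values(['bucket', 'foo'])
df['prev_foo'] = df.groupby('bucket')['foo'].shift(1)
print(df)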

How can I save a matrix from one dataset to another?

I have created a matrix A from dataset1 and I want to use this later in dataset2.
How can I programmatically save this matrix and import it to dataset2?
Consider the following toy datasets:
/* create dataset 1 */
clear
set obs 5
forvalues i = 1 / 5 {
    generate norm`i' = rnormal(10, 20)
}
list
+----------------------------------------------------------+
| norm1 norm2 norm3 norm4 norm5 |
|----------------------------------------------------------|
1. | 29.184 47.57735 -6.06845 47.43953 12.10697 |
2. | 9.9639 65.09492 31.92023 18.47133 39.01292 |
3. | 20.88154 -2.251937 1.185946 22.67908 -11.98451 |
4. | 10.03257 13.94616 -10.22853 18.34467 37.34412 |
5. | 17.15362 42.20448 30.38455 -.5586708 20.34926 |
+----------------------------------------------------------+
save data1, replace
/* create dataset 2 */
clear
set obs 5
forvalues i = 1 / 5 {
    generate unif`i' = runiform()
}
list
+------------------------------------------------------+
| unif1 unif2 unif3 unif4 unif5 |
|------------------------------------------------------|
1. | .4398566 .222692 .359981 .8840723 .840627 |
2. | .8955406 .7279246 .7385288 .1269085 .2610574 |
3. | .6760237 .5028067 .9236897 .2413106 .8938763 |
4. | .9666038 .0491344 .0098985 .4427792 .8565752 |
5. | .4118744 .368421 .1528643 .8636661 .0944128 |
+------------------------------------------------------+
save data2, replace
One can do this using the mkmat and svmat commands:
use data1, clear
mkmat norm*, matrix(A)
use data2, clear
matrix list A
A[5,5]
norm1 norm2 norm3 norm4 norm5
r1 29.184 47.577354 -6.0684505 47.439529 12.106971
r2 9.9638996 65.094917 31.920233 18.471329 39.01292
r3 20.88154 -2.2519367 1.1859455 22.679077 -11.984506
r4 10.032575 13.946158 -10.228531 18.344669 37.344124
r5 17.153618 42.204475 30.384546 -.55867082 20.349257
svmat A, names(norm)
list
+-----------------------------------------------------------------------------------------------------------------+
| unif1 unif2 unif3 unif4 unif5 norm1 norm2 norm3 norm4 norm5 |
|-----------------------------------------------------------------------------------------------------------------|
1. | .4398566 .222692 .359981 .8840723 .840627 29.184 47.57735 -6.06845 47.43953 12.10697 |
2. | .8955406 .7279246 .7385288 .1269085 .2610574 9.9639 65.09492 31.92023 18.47133 39.01292 |
3. | .6760237 .5028067 .9236897 .2413106 .8938763 20.88154 -2.251937 1.185946 22.67908 -11.98451 |
4. | .9666038 .0491344 .0098985 .4427792 .8565752 10.03257 13.94616 -10.22853 18.34467 37.34412 |
5. | .4118744 .368421 .1528643 .8636661 .0944128 17.15362 42.20448 30.38455 -.5586708 20.34926 |
+-----------------------------------------------------------------------------------------------------------------+
Note that this solution will only work as long as clear matrix or clear all has not been invoked in between, since the matrix is held in memory rather than stored with the dataset.

Regex that matches valid Ruby local variable names

Does anyone know the rules for valid Ruby variable names? Can they be matched using a regex?
UPDATE: This is what I could come up with so far:
^[_a-z][a-zA-Z0-9_]+$
Does this seem right?
Identifiers are pretty straightforward. They begin with letters or an underscore, and contain letters, underscore and numbers. Local variables can't (or shouldn't?) begin with an uppercase letter, so you could just use a regex like this.
/^[a-z_][a-zA-Z_0-9]*$/
It's possible for variable names to be unicode letters, in which case most of the existing regexes don't match.
varname = "\u2211" # => "∑"
eval(varname + '= "Tony the Pony"') # => "Tony the Pony"
puts varname # => ∑
local_variable_identifier = /Insert large regular expression here/
varname =~ local_variable_identifier # => nil
See also "Fun with Unicode" in either the Ruby 1.9 Pickaxe or at Fun with Unicode.
According to http://rubylearning.com/satishtalim/ruby_names.html a Ruby variable consists of:
A name is an uppercase letter, lowercase letter, or an underscore ("_"), followed by Name characters (this is any combination of upper- and lowercase letters, underscore and digits).
In addition, global variables begin with a dollar sign, instance variables with a single at-sign, and class variables with two at-signs.
A regular expression to match all that would be:
%r{
(\$|@{1,2})? # optional leading punctuation ($, @ or @@)
[A-Za-z_] # at least one upper case, lower case, or underscore
[A-Za-z0-9_]* # optional characters (including digits)
}x
Hope that helps.
I like #aboutruby's answer, but just to complete it, here's the equivalent using POSIX bracket expressions.
/^[_[:lower:]][_[:alnum:]]*$/
Or, since a-z is actually shorter than [:lower:]:
/^[_a-z][_[:alnum:]]*$/
I think /^(\$){0,1}[_a-zA-Z][a-zA-Z0-9_]*([?!]){0,1}$/ is a bit closer to what you will need...
It depends on whether you want to match method names as well.
If you are trying to match a name that might be encountered in an expression, then it might start with $ and it might end with ? or !. If you know for sure that it is just a local variable then the rule will be much simpler.
I was trying to figure one out for a Rails patch, and Matthew Draper wrote this one, using the Ruby parser as a reference:
/\A(?![A-Z0-9])(?:[[:alnum:]_]|[^\0-\177])+\z/
And here it is, straight from the horse's mouth. (The horse in this case is the Draft ISO Ruby Specification):
local-variable-identifier → ( lowercase-character | _ ) identifier-character *
identifier-character → lowercase-character | uppercase-character | decimal-digit | _
uppercase-character → A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
lowercase-character → a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
decimal-digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
In Ruby 1.9, using named groups, you can translate this literally:
local_variable_identifier = %r{
(?<uppercase_character> A | B | C | D | E | F | G | H | I | J | K | L | M
| N | O | P | Q | R | S | T | U | V | W | X | Y | Z
){0}
(?<lowercase_character> a | b | c | d | e | f | g | h | i | j | k | l | m
| n | o | p | q | r | s | t | u | v | w | x | y | z
){0}
(?<decimal_digit> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9){0}
(?<identifier_character> \g<lowercase_character>
| \g<uppercase_character>
| \g<decimal_digit>
| _
){0}
( \g<lowercase_character> | _ ) \g<identifier_character>*
}x
Of course, this is not how you would really write it.
