Stata: Extracting values and saving them as scalars (and more) - for-loop

This question is a follow-up to Stata: replace, if, forvalues. Consider this data:
set seed 123456
set obs 5000
g firmid = "firm" + string(_n) /* Observation (firm) id */
g nw = floor(100*runiform()) /* Number of workers in a firm */
g double lat = 39+runiform() /* Latitude in decimal degree of a firm */
g double lon = -76+runiform() /* Longitude in decimal degree of a firm */
The first 10 observations are:
+--------------------------------------+
| firmid nw lat lon |
|--------------------------------------|
1. | firm1 81 39.915526 -75.505018 |
2. | firm2 35 39.548523 -75.201567 |
3. | firm3 10 39.657866 -75.17988 |
4. | firm4 83 39.957938 -75.898837 |
5. | firm5 56 39.575881 -75.169157 |
6. | firm6 73 39.886184 -75.857255 |
7. | firm7 27 39.33288 -75.724665 |
8. | firm8 75 39.165549 -75.96502 |
9. | firm9 64 39.688819 -75.232764 |
10. | firm10 76 39.012228 -75.166272 |
+--------------------------------------+
I need to calculate the distances between firm 1 and all other firms. So, the vincenty command looks like:
. scalar theLat = 39.915526
. scalar theLon = -75.505018
. vincenty lat lon theLat theLon, hav(distance_km) inkm
The vincenty command creates the distance_km variable that has distances between each observation and firm 1. Here, I manually copy and paste the two numbers that are 39.915526 and -75.505018.
Question 1: What's the syntax that extracts those numbers?
Now, I can keep observations where distance_km <= 2. And,
. egen near_nw_sum = sum(nw)
will create the sum of workers within 2 kilometers of firm 1. (Or, the collapse command may do the job.)
Question 2: I have to do this for all firms, and the final data should look like:
+-----------------------------------------------------------------+
| firmid nw lat lon near_nw_sum |
|-----------------------------------------------------------------|
1. | firm1 81 39.915526 -75.505018 (# workers near firm1) |
2. | firm2 35 39.548523 -75.201567 (# workers near firm2) |
3. | firm3 10 39.657866 -75.17988 (# workers near firm3) |
4. | firm4 83 39.957938 -75.898837 (# workers near firm4) |
5. | firm5 56 39.575881 -75.169157 (# workers near firm5) |
6. | firm6 73 39.886184 -75.857255 (# workers near firm6) |
7. | firm7 27 39.33288 -75.724665 (# workers near firm7) |
8. | firm8 75 39.165549 -75.96502 (# workers near firm8) |
9. | firm9 64 39.688819 -75.232764 (# workers near firm9) |
10. | firm10 76 39.012228 -75.166272 (# workers near firm10) |
+-----------------------------------------------------------------+
Creating the near_nw_sum variable is my final goal. I need your help here for my weak data management skill.

The following is basically the same strategy found here and is based on your "final goal". Again, it can be useful depending on the size of your original dataset. joinby creates observations, so you may exceed the Stata limit. However, I believe it does what you want.
clear all
set more off
set seed 123456
set obs 10
g firmid = _n /* Observation (firm) id */
g nw = floor(100*runiform()) /* Number of workers in a firm */
g double lat = 39+runiform() /* Latitude in decimal degree of a firm */
g double lon = -76+runiform() /* Longitude in decimal degree of a firm */
gen dum = 1
list
* joinby procedure
tempfile main
save "`main'"
rename (firmid lat lon nw) =0
joinby dum using "`main'"
drop dum
* Pretty print
sort firmid0 firmid
order firmid0 firmid
list, sepby(firmid0)
* Uncomment if you do not want to include workers in the "base" firm.
*drop if firmid0 == firmid
* Compute distance
vincenty lat0 lon0 lat lon, hav(distance_km) inkm
keep if distance_km <= 40 // an arbitrary distance
list, sepby(firmid0)
* Compute workers of nearby-firms
collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)
list
What it does is form pairwise combinations of firms to compute distances and sum the workers of nearby firms. There is no need here to extract scalars as asked in Question 1. Also, there is no need to complicate the variable firmid by converting it to a string.
The following overcomes the problem of the Stata limit on number of observations.
clear all
set more off
* Create empty database
gen x = .
tempfile results
save "`results'", replace
* Create input for exercise
set seed 123456
set obs 500
g firmid = _n /* Observation (firm) id */
g nw = floor(100*runiform()) /* Number of workers in a firm */
g double lat = 39+runiform() /* Latitude in decimal degree of a firm */
g double lon = -76+runiform() /* Longitude in decimal degree of a firm */
gen dum = 1
*list
* Save number of firms
local size = _N
display "`size'"
* joinby procedure
tempfile main
save "`main'"
timer clear 1
timer clear 2
timer clear 3
timer clear 4
quietly {
timer on 1
forvalues i=1/`size'{
timer on 2
use "`main'" in `i', clear // assumed sorted on firmid
rename (firmid lat lon nw) =0
joinby dum using "`main'", unmatched(using)
drop _merge dum
order firmid0 firmid
timer off 2
timer on 3
vincenty lat0 lon0 lat lon, hav(dist) inkm
timer off 3
keep if dist <= 40 // an arbitrary distance
timer on 4
collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)
append using "`results'"
save "`results'", replace
timer off 4
}
timer off 1
}
use "`results'", clear
sort firmid0
drop x
list
timer list
However inefficient this is, some testing using timer shows that most of the computation time goes into the vincenty command, which you won't be able to escape. The following is the time (in seconds) for 10,000 observations with an Intel Core i5 processor and a conventional hard drive (not SSD). Timer 1 is the total while 2, 3, 4 are the components (approx.). Timer 3 corresponds to vincenty:
. timer list
1: 1953.99 / 1 = 1953.9940
2: 169.19 / 10000 = 0.0169
3: 1669.95 / 10000 = 0.1670
4: 94.47 / 10000 = 0.0094
Of course, note that in both codes duplicate computations of distances are made (e.g. both the firm1-firm2 and the firm2-firm1 distances are computed), and this you can probably avoid. As it stands, for 110,000 observations it will take a long time. On the positive side, I noticed this second setup demands very little RAM compared to the same number of observations in the first setup. In fact, my 4GB machine freezes with the first setup.
Also note that even though I use the same seed as you do, data is different because I create different numbers of observations (not 5000), which makes a difference in the variable creation process.
(By the way, if you wanted to save the value as a scalar you could use subscripting: scalar latitude = lat[1]).
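For readers who want to check the logic outside Stata, the pairwise strategy above can be sketched in Python. This is only an illustration under stated assumptions: it uses a plain haversine formula with an assumed mean Earth radius of 6371 km (not the vincenty command itself), the function names are hypothetical, and the three firms are the first three rows listed in the question. It also visits each pair of firms once, which avoids the duplicate distance computations mentioned above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points in decimal degrees.
    radius_km is an assumed mean Earth radius."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def near_nw_sum(firms, max_km):
    """For each firm, sum nw over all firms (itself included) within max_km."""
    totals = [f["nw"] for f in firms]  # every firm is 0 km from itself
    for i in range(len(firms)):
        for j in range(i + 1, len(firms)):  # each unordered pair only once
            d = haversine_km(firms[i]["lat"], firms[i]["lon"],
                             firms[j]["lat"], firms[j]["lon"])
            if d <= max_km:
                totals[i] += firms[j]["nw"]
                totals[j] += firms[i]["nw"]
    return totals

# First three firms from the question's listing.
firms = [
    {"firmid": 1, "nw": 81, "lat": 39.915526, "lon": -75.505018},
    {"firmid": 2, "nw": 35, "lat": 39.548523, "lon": -75.201567},
    {"firmid": 3, "nw": 10, "lat": 39.657866, "lon": -75.17988},
]
print(near_nw_sum(firms, max_km=20))  # [81, 45, 45]
```

With a 20 km radius only firm 2 and firm 3 are neighbours (roughly 12 km apart), so each picks up the other's workers while firm 1 keeps only its own.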


How to decide the probability percentage in question

I have the below question:
In the first part of the question, it says the probability that the selected person will be a male is 0.44, which means the number of males is 25*0.44 = 11. That's OK.
In the second part, the probability that the selected person will be a male who was born before 1960 is 0.28. Does that mean 0.28 out of the total number, which is 25, or out of the number of males?
I mean, should the number of males born before 1960 equal 25*0.28 or 11*0.28?
I find it easiest to think of these sorts of problems as contingency tables.
You use a matrix layout to express the distributions in terms of two or more factors or characteristics, each having two or more categories. The table can be constructed either with probabilities (proportions) or with counts, and switching back and forth is easy based on the total count in the table. Entries in the table are the intersections of the categories, corresponding to "and" in a verbal description. The numbers to the right or at the bottom of the table are called marginals, because they're found in the margins of the table, and are always the sum of the table row or column entries in which they occur. The total probability (or count) in the table is found by summing across all the rows and columns. The marginal distribution of gender would be found by summing across rows, and the marginal distribution of birthdays would be found by summing across the columns.
Based on this, you can inferentially determine other values as indicated by the entries in parentheses below. With one more entry, either for gender or in the marginal row for birthdays, you'd be able to fill in the whole table inferentially. (This is related to the concept of degrees of freedom - how many pieces of info can you fill in independently before the others are determined by the known constraint that the totals are fixed or that probability adds to 1.)
Probabilities
Birthday
< 1960 | >= 1960
_______________________
G | | |
e F | | | (0.56)
n __|_________|__________|
d | | |
e M | 0.28 | (0.16) | 0.44
r __|_________|__________|______
? ? | 1.00
Counts
Birthday
< 1960 | >= 1960
_______________________
G | | |
e F | | | (14)
n __|_________|__________|
d | | |
e M | 7 | (4) | 11
r __|_________|__________|_____
? ? | 25
Conditional probability corresponds to limiting yourself to the subset of rows or columns specified in the condition. If you had been asked what is the probability of a birthday < 1960 given the gender is male, i.e., P{birthday < 1960 | M} in relatively standard notation, you'd be restricting your focus to just the M row, so the answer would be 7/11 = 0.28/0.44. Computationally, you take the probabilities or counts in the qualifying table entries and express them as a proportion of the probabilities or counts of the specified (given) marginal entries. This is often written in prob & stats texts as P(A|B) = P(AB)/P(B), where AB is a set shorthand for A and B (intersection).
0.44 = 11 / 25 people are male.
0.28 = 7 / 25 people are male & born before 1960.
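The table arithmetic can be double-checked with a few lines of Python. This is a small sketch using exact fractions; the variable names are illustrative and only the numbers 25, 0.44 and 0.28 come from the question.

```python
from fractions import Fraction

total = 25
p_male = Fraction(44, 100)              # P(M)
p_male_and_pre1960 = Fraction(28, 100)  # P(M and birthday < 1960)

# Joint and marginal probabilities are proportions of the whole table (25 people).
males = p_male * total                       # 11
males_pre1960 = p_male_and_pre1960 * total   # 7
males_1960_on = males - males_pre1960        # 4
females = total - males                      # 14

# Conditional probability restricts attention to the M row.
p_pre1960_given_male = p_male_and_pre1960 / p_male  # 7/11

print(males, males_pre1960, males_1960_on, females, p_pre1960_given_male)
# 11 7 4 14 7/11
```

So 0.28 is taken out of all 25 people (7 people), while 7/11 is the answer to the different, conditional question.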

Using Arrays to Calculate Previous and Next Values

Is there a way I can use ClickHouse (arrays?) to calculate sequential values that depend on previously calculated values?
For example:
On day 1, I start with 0 -- consume 5 -- add 100 -- ending up with 0 - 5 + 100 = 95
Day 2 starts with what I ended up with on day 1, which is 95 -- again consume 10 -- add 5 -- ending up with 95 - 10 + 5 = 90 (which will be the start for day 3)
Given
ConsumeArray [5,10,25]
AddArray [100,5,10]
Calculate EndingPosition and (= StartingPosition for Next day)
-
Day1 Day2 Day3
--------------------------------------------------------------------
StartingPosition (a) = Previous Ending Position | 0 95 90 Calculate
Consumed (b) | 5 10 25
Added (c) | 100 5 10
EndingPosition (d) = a-b+c | 95 90 75 Calculate
Just finish all the add/consume operations first and then do an accumulation.
WITH [5,10,25] as ConsumeArray,
[100,5,10] as AddArray
SELECT
arrayCumSum(arrayMap((c, a) -> a - c, ConsumeArray, AddArray));
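To see why this works, the same computation can be sketched in Python: take the net change per day (added minus consumed), then accumulate. The variable names are illustrative; the starting position of 0 on day 1 comes from the question.

```python
from itertools import accumulate

consume = [5, 10, 25]
add = [100, 5, 10]

# Net change per day, mirroring arrayMap((c, a) -> a - c, ...).
net = [a - c for c, a in zip(consume, add)]  # [95, -5, -15]

# Running total, mirroring arrayCumSum: the ending position for each day.
ending = list(accumulate(net))               # [95, 90, 75]

# Each day starts where the previous day ended; day 1 starts at 0.
starting = [0] + ending[:-1]                 # [0, 95, 90]

print(starting, ending)
```

Because each ending position is just the initial position plus the sum of all net changes so far, no explicit day-by-day dependency is needed.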

Determine max slope of slowly descending signal

I have an analog power signal from a motor. The signal ramps up quickly, but powers off slowly over the course of several seconds. The signal looks almost like a series of plateaus on the descent. The problem is that the signal doesn't settle back to zero. It settles back to an intermediate level unknown, and varying from motor to motor. See chart below.
I'm trying to find a way to determine when the motor is off and at that intermediate level.
My thought is to find and store the max point, and calculate the slopes thereafter until the max slope is greater than some large negative slope value like -160 (~ -60 degrees), and declare that the motor must be powering off. The sample points below are with all duplicates removed. (there's about 5000 samples typically).
My problem is determining the X values. In the formula (y2 - y1) / (x2 - x1), the x values could be far enough apart in time that the slope never appears greater than -30 degrees. Picking an absolute number like 10 would fix this, but is there a more mathematically correct method?
The data below shows the slope calculated with the method described above, relative to the max of 921, i.e. (y2 - y1) / ((10+1) - 10). In this scheme, at data point 9, I would say the motor is "Off". I'm looking for a more precise way to determine an X value than arbitrarily picking 10, for instance.
+---+-----+----------+
| X | Y | Slope |
+---+-----+----------+
| 1 | 65 | 856.000 |
| 2 | 58 | 863.000 |
| 3 | 57 | 864.000 |
| 4 | 638 | 283.000 |
| 5 | 921 | 0.000 |
| 6 | 839 | -82.000 |
| 7 | 838 | -83.000 |
| 8 | 811 | -110.000 |
| 9 | 724 | -197.000 |
+---+-----+----------+
EDIT: A much simpler answer:
Since your motor is either ON or OFF, and ON wattages are strictly higher than OFF wattages, you should be able to discriminate between ON and OFF wattages by maintaining an average wattage, reporting ON if the current measurement is higher than the average and OFF if it is lower.
Count = 0
Average = 500
Whenever a measurement comes in,
Count = Count + 1
Average = Average + (Measurement - Average) / Count
Return Measurement > Average ? ON : OFF
This represents an average of all the values the wattage has ever been. If we want to eventually "forget" the earliest values (before the motor was ever turned on), we could either keep a buffer of recent values and use that for a moving average, or approximate a moving average with an IIR like
Average = (1-X) * Average + X * Measurement
for some X between 0 and 1 (closer to 0 to change more slowly).
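A minimal Python sketch of this threshold idea follows. The class name is hypothetical; it supports both the cumulative average and the IIR variant (the `alpha` parameter plays the role of X above), and it is fed the nine sample Y values from the question's table.

```python
class RunningAverageDetector:
    """Classify each measurement as ON/OFF against a running average."""

    def __init__(self, initial_average=500.0, alpha=None):
        self.average = initial_average
        self.count = 0
        self.alpha = alpha  # None: cumulative mean; else IIR weight in (0, 1)

    def update(self, measurement):
        self.count += 1
        if self.alpha is None:
            # Cumulative mean of every measurement seen so far.
            self.average += (measurement - self.average) / self.count
        else:
            # Exponentially weighted average that gradually forgets old values.
            self.average = (1 - self.alpha) * self.average + self.alpha * measurement
        return "ON" if measurement > self.average else "OFF"


samples = [65, 58, 57, 638, 921, 839, 838, 811, 724]  # Y column from the question
detector = RunningAverageDetector()
states = [detector.update(y) for y in samples]
print(states)
# ['OFF', 'OFF', 'OFF', 'ON', 'ON', 'ON', 'ON', 'ON', 'ON']
```

On this short excerpt the last samples are still above the running average, so the detector keeps reporting ON; it would only flip to OFF once the signal settles below the average, which the nine points shown do not yet reach.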
Original answer:
You could treat this as an online clustering problem, where you expect three clusters (before the motor turns on, when the motor is on, and when the motor is turned off), or perhaps four (before the motor turns on, peak power, when the motor is running normally, and when the motor turns off). In effect, you're trying to learn what it looks like when a motor is on (or off).
If you don't have any other information about whether the motor is on or off (which could be used to train a model), here's a simple approach:
Define an "Estimate" to contain:
float Value
int Count
Define an "Estimator" to contain:
float TotalError = 0.0
Estimate COLD_OFF = {Value = 0, Count = 1}
Estimate ON = {Value = 1000, Count = 1}
Estimate WARM_OFF = {Value = 500, Count = 1}
a function Update_Estimate(float Measurement)
Find the Estimate E such that E.Value is closest to Measurement
Update TotalError = TotalError + (E.Value - Measurement)*(E.Value - Measurement)
Update E.Value = (E.Value * E.Count + Measurement) / (E.Count + 1)
Update E.Count = E.Count + 1
return E
This takes initial guesses for what the wattages of these stages should be and updates them with the measurements. However, this has some problems. What if our initial guesses are off?
You could initialize some number of Estimators with different possible (e.g. random) guesses for COLD_OFF, ON, and WARM_OFF; after receiving a measurement, let each Estimator update itself and aggregate their values somehow. This aggregation should reward the better estimates. Since you're storing TotalError for each Estimator, you could just pick the output of the Estimator that has the lowest TotalError so far, or you could let the Estimators vote (giving each Estimator's vote a weight proportional to 1/(TotalError + 1) or something like that).

How to calculate one certain value from a rolling-window estimation in Stata

I'm using Stata to estimate Value-at-risk (VaR) with the historical simulation method. Basically, I will create a rolling window with 100 observations, to estimate VaR for the next 250 days (repeat 250 times). Hence, as I've known, the rolling window with time series command in Stata would be useful in this case. Here is the process:
Input: 350 values
1. Ascending sort the very first 100 values (by magnitude).
2. Then I need to take the 5th smallest for each window.
3. Repeat 250 times.
Output: a list of the 5th values (250 in total).
Sounds simple, but I cannot do it the right way. This was my attempt below:
program his,rclass
sort lnreturn
return scalar actual=lnreturn in 5
end
tsset stt
time variable: stt, 1 to 350
delta: 1 unit
rolling actual=r(actual), window(100) saving(C:\result100.dta, replace) : his
(running his on estimation sample)
And the result is:
Start end actual
1 100 -.047856
2 101 -.047856
3 102 -.047856
4 103 -.047856
.... ..... ......
251 350 -.047856
What I want is 250 different 5th values in the "actual" column, not the same value repeated like that.
If I understand this correctly, you want the 5th percentile of values in a window of 100. That should yield to summarize, detail or centile. I see no need to write a program.
Your bug is that your program his calculates the same thing each time it is called. There is no communication about windows other than what is explicit in your code. It is like saying
move here: now add 2 + 2
move there: now add 2 + 2
move to New York: now add 2 + 2
The result is invariant to your supposed position.
Note that I doubt that
return scalar actual=lnreturn in 5
really is your code. lnreturn[5] should work.
UPDATE You don't even need rolling here. Looping over data is easy enough. The data in this example are clearly fake.
clear
* sandpit
set obs 500
set seed 2803
gen y = ceil(exp(rnormal(3,2)))
l y in 1/5
* initialise
gen p5 = .
* rolling windows of length 100: 1..100, 2..101, ..., 401..500
quietly forval j = 1/401 {
local J = `j' + 99
su y in `j'/`J', detail
replace p5 = r(p5) in `j'
}
* check first calculation
su y in 1/100, detail
l in 1/5
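The rolling logic itself is language-independent, so here is a short Python sketch of it for comparison. It takes the k-th smallest value in each window literally, as described in the question, whereas a percentile routine may interpolate between neighbouring order statistics; the function name and the toy data are illustrative only.

```python
def rolling_kth_smallest(values, window=100, k=5):
    """Return the k-th smallest value of every full window of the given length.

    Window i covers values[i : i + window], so n values yield
    n - window + 1 results (e.g. 350 values and window=100 give 251).
    """
    if not 1 <= k <= window:
        raise ValueError("k must be between 1 and window")
    return [
        sorted(values[start:start + window])[k - 1]  # sort a copy of this window only
        for start in range(len(values) - window + 1)
    ]


# Toy data: windows of length 4, second-smallest value in each.
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(rolling_kth_smallest(toy, window=4, k=2))
# [2, 3, 4, 5, 6, 7, 8]
```

The key point matches the answer above: the sort happens inside each window, so the result changes as the window moves instead of being computed once on the whole series.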

split quantities algorithm (stock exchanges order)

I have a problem where I have multiple (few thousand) quantities that I need to split between a set number of recipients such that each quantity must be split into whole numbers and using the same proportion.
I need to find an algorithm that implements this reliably and efficiently (don't we all? :-) )
This is to solve a problem in financial markets (stock exchange orders) where an order might get thousands of "fills" and at the end of the day must be distributed to a few clients while maintaining the order's average price. Here's an example:
Total Order Quantity 37300
Quantities filled by the Stock Exchange
Execution 1. 16700 shares filled at price 75.84
Execution 2. 5400 shares filled at price 75.85
Execution 3. 4900 shares filled at price 75.86
Execution 4. 10300 shares filled at price 75.87
Total 37300 shares filled at average price = (16700*75.84 + 5400*75.85 + 4900*75.86 + 10300*75.87) / 37300 = 75.85235925
Suppose I need to split these quantities between 3 clients such that :
Client1: 15000 shares
Client2: 10000 shares
Client3: 12300 shares
Each execution must be split individually (I can't just take each clients requested quantity priced at average price)
My first thought was to split each proportionately :
Client 1 gets 15000/37300=0.402144772
Client 2 gets 10000/37300=0.268096515
Client 3 gets 12300/37300=0.329758713
Which would lead to
Client1 - 15000 Client2 - 10000 Client3 - 12300
Ratio : 0.402144772 Ratio : 0.268096515 Ratio : 0.329758713
Splits (Sorry about the formatting - this was the best I could do in the Post editor)
+-------------+-------------+-------------+
| Client 1 | Client 2 | Client 3 |
+-------------+-------------+-------------+
| 6715.817694 | 4477.211796 | 5506.970509 |
| 2171.581769 | 1447.72118 | 1780.697051 |
| 1970.509383 | 1313.672922 | 1615.817694 |
| 4142.091153 | 2761.394102 | 3396.514745 |
+-------------+-------------+-------------+
| Totals: | | |
| 15000 | 10000 | 12300 |
+-------------+-------------+-------------+
The problem with this is that I can't assign fractional quantities to clients so I need a smart algorithm which adjusts the quantities such that the fractional part of these splits is 0. I understand that this may be impossible in many scenarios so this requirement can be relaxed a little bit so that a certain client gets a little more (or less).
Does anybody know of an algorithm that I can use as a starting point for this problem ?
You can round all the numbers (ratio[n] * totalQuantity) except the last one (possibly the smallest). The last one must be totalQuantity minus the sum of the others. This gives you whole-number quantities with a correct total, as close as possible to the ratios you chose.
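A minimal Python sketch of this rounding scheme, applied to each execution from the question, might look like the following. The function name is hypothetical, and treating the last client in the list as the remainder recipient is an assumption made for illustration.

```python
def split_fill(fill_qty, client_qtys):
    """Split one execution into whole shares, proportional to client_qtys.

    Every client except the last gets a rounded proportional share;
    the last client receives whatever is left so the row sums to fill_qty.
    """
    total = sum(client_qtys)
    shares = [round(fill_qty * q / total) for q in client_qtys[:-1]]
    shares.append(fill_qty - sum(shares))
    return shares


clients = [15000, 10000, 12300]
fills = [16700, 5400, 4900, 10300]

rows = [split_fill(f, clients) for f in fills]
for row in rows:
    print(row)
# [6716, 4477, 5507]
# [2172, 1448, 1780]
# [1971, 1314, 1615]
# [4142, 2761, 3397]

print([sum(col) for col in zip(*rows)])  # [15001, 10000, 12299]
```

Each execution is fully allocated, but the per-client totals drift by one share here (15001 and 12299 instead of 15000 and 12300), which is the kind of relaxation the question says is acceptable; a correction pass would be needed if exact client totals were required.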
Try to look at this from a different angle. You already know how many shares each client is getting. You want to calculate what's the fair total amount each has to pay and do this without rounding errors.
You therefore want these total dollar amounts to have no rounding issues, i.e. to be accurate to 0.01.
Prices can then be computed using the dollar amounts and displayed to the required precision.
The opposite (calculate prices, then derive amounts) will always yield rounding issues with the dollar amounts.
Assuming price is per 100 units, here's one way to accomplish this:
Calculate total $ for the order (16,700*75.84/100 + 5,400*75.85/100 + 4,900*75.86/100 + 10,300*75.87/100) = $28,292.93
Allocate all clients except 1, based on the ratio quantities ordered / quantities filled:
Client 2 = $28,292.93 / 37,300 * 10,000 = $7,585.24
Price = $7,585.24 / 10,000 * 100 = 75.8524.
Client 3 = $28,292.93 / 37,300 * 12,300 = $9,329.84
Price = $9,329.84 / 12,300 * 100 = 75.85235772
Calculate the last client as the remaining $$$:
$28,292.93 - ($7,585.24 + $9,329.84) = $11,377.85.
Price = $11,377.85 / 15,000 * 100 = 75.85233333
Here I arbitrarily picked Client 1, the one with the largest quantity to be the object of the remainder calculation.
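The steps above can be sketched in Python with the decimal module, which keeps the cent arithmetic exact. This is an illustration under the same assumptions as the worked numbers: prices are per 100 units, amounts are rounded half-up to the cent, the client quantities add up to the filled quantity, and the function and parameter names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def allocate_dollars(fills, client_qtys, remainder_index=0, price_unit=100):
    """Allocate the order's total dollar amount across clients to the cent.

    fills: list of (quantity, price-as-string) executions.
    The client at remainder_index receives whatever is left after the
    others are rounded, so the amounts always sum to the order total.
    """
    total_qty = sum(qty for qty, _ in fills)
    total = sum(Decimal(qty) * Decimal(price) / price_unit for qty, price in fills)
    total = total.quantize(CENT, rounding=ROUND_HALF_UP)

    amounts = [None] * len(client_qtys)
    for i, qty in enumerate(client_qtys):
        if i != remainder_index:
            amounts[i] = (total * qty / total_qty).quantize(CENT, rounding=ROUND_HALF_UP)
    amounts[remainder_index] = total - sum(a for a in amounts if a is not None)
    return total, amounts


fills = [(16700, "75.84"), (5400, "75.85"), (4900, "75.86"), (10300, "75.87")]
clients = [15000, 10000, 12300]

total, amounts = allocate_dollars(fills, clients)
print(total)    # 28292.93
print(amounts)  # [Decimal('11377.85'), Decimal('7585.24'), Decimal('9329.84')]
```

Per-client prices can then be derived for display as amount / quantity * 100, as in the worked example.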
