Transposing Data Variable names in Stata - format

Hello, I want to use a National Science Foundation dataset, but the raw Excel file's variable names do not import in a usable layout. Does anyone have sample code showing how to transpose the dataset so that it fits the Stata format for analysis? Here is the raw Excel file so you can understand the problem.
Raw Excel File from NSF website

*EDITED to add view of import
Keeping in mind Nick's apt point about the nature of SO, I do have some example code that may be helpful here. Without knowing what you mean by "transpose" I cannot give an exact answer, but you can adapt the reshape commands below to your purposes.
Import and view
// Import Excel File defining TL and BR of table
import excel "nsb20197-tabs02-012.xlsx", cellrange(A6:K177) clear
list in 1/15
+----------------------------------------------------------------------------------------------------------------------------------------+
| A B C D E F G H I J K |
|----------------------------------------------------------------------------------------------------------------------------------------|
1. | Associate's level |
2. | American Indian or Alaska Native                     |
3. | All fields 6282 4131 2151 65.8 34.2 8935 5697 3238 63.8 36.2 |
4. | S&E 608 386 222 63.5 36.5 942 495 447 52.5 47.5 |
5. | Engineering 17 4 13 23.5 76.5 50 8 42 16 84 |
|----------------------------------------------------------------------------------------------------------------------------------------|
6. | Natural sciences 384 220 164 57.3 42.7 453 174 279 38.4 61.6 |
7. | Social and behavioral sciences 207 162 45 78.3 21.7 439 313 126 71.3 28.7 |
8. | Non-S&E 5674 3745 1929 66 34 7993 5202 2791 65.09999999999999 34.9 |
9. | Asian or Pacific Islander                     |
10. | All fields 27313 15522 11791 56.8 43.2 54809 30916 23893 56.4 43.6 |
|----------------------------------------------------------------------------------------------------------------------------------------|
11. | S&E 2649 1284 1365 48.5 51.5 7862 3492 4370 44.4 55.6 |
12. | Engineering 160 23 137 14.4 85.59999999999999 574 111 463 19.3 80.7 |
13. | Natural sciences 2010 939 1071 46.7 53.3 4419 1562 2857 35.3 64.7 |
14. | Social and behavioral sciences 479 322 157 67.2 32.8 2869 1819 1050 63.4 36.6 |
15. | Non-S&E 24664 14238 10426 57.7 42.3 46947 27424 19523 58.4 41.6 |
Structuring the data
// Create supercategories
* level = Column A if column A contains the word level (ignoring case)
gen level = word(A,1) if ustrregexm(A, "level", 1), before(A)
* demographic = Column A if next observation in column A contains the words "all fields" (ignoring case)
gen demographic = A if ustrregexm(A[_n+1], "all fields", 1), before(A)
* fill down demographic and level, and drop blank rows
foreach v of varlist level demographic {
replace `v' = `v'[_n-1] if missing(`v')
}
drop if mi(demographic) | demographic == A | regexm(A, level)
// rename variables
* rename A
rename A field
* rename count columns
local list "B C D E F"
local year = 2000
rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
local list "G H I J K"
local year = 2017
rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
* destring
destring *_2000 *_2017, replace
Reshaping to long
* reshape long
drop perc*
reshape long all_ male_ female_, i(level demographic field) j(year)
rename *_ degrees*
reshape long degrees, i(level demographic field year) j(gender) string
An example of how one might reshape wide by field
* test reshape wide by field + ensure the variable name is less than 32 characters post reshape
replace field = lower(strtoname(field))
replace field = substr(field, 1, 32 - strlen("degrees") - 1)
reshape wide degrees, i(level demographic year gender) j(field) string
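For readers who want to mirror this pipeline outside Stata, here is a rough pandas equivalent of the two long reshapes. The column names follow the renames above; the tiny frame is made up for illustration, not taken from the NSF file.

```python
import pandas as pd

# Made-up miniature of the post-rename wide data: one row per field,
# columns <stub>_<year> as produced by the rename() calls above
df = pd.DataFrame({
    "field": ["All fields", "S&E"],
    "all_2000": [6282, 608], "female_2000": [4131, 386], "male_2000": [2151, 222],
    "all_2017": [8935, 942], "female_2017": [5697, 495], "male_2017": [3238, 447],
})

# First reshape: wide years -> long (one row per field x year)
long = pd.wide_to_long(df, stubnames=["all", "female", "male"],
                       i="field", j="year", sep="_").reset_index()

# Second reshape: melt the count columns into gender/degrees pairs,
# mirroring the second Stata -reshape long-
long = long.melt(id_vars=["field", "year"],
                 value_vars=["all", "female", "male"],
                 var_name="gender", value_name="degrees")
```

Each field ends up with one row per year and gender category, the same long shape the Stata code produces.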


Oracle SQL or PL/SQL: How to identify candlestick pattern only in end of uptrend or downtrend and set a flag in column?

This question and its answers are for educational and learning purposes only.
This question is quite different from my other post and is not a duplicate. Since the other thread was creating confusion, and as suggested by @MT0, I am posting this as a new question here.
I have the table below, into which I upload stock data on a daily basis.
/* CREATE TABLE */
CREATE TABLE RAW_SOURCE(
Stock VARCHAR(100),
Close_Date DATE,
Open NUMBER,
High NUMBER,
Low NUMBER,
Close NUMBER,
Volume NUMBER
);
/* INSERT QUERY NO: 1 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '01/01/2021', 40, 40.5, 38.5, 38.8, 83057
);
/* INSERT QUERY NO: 2 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '02/01/2021', 39.2, 39.2, 37.2, 37.8, 181814
);
/* INSERT QUERY NO: 3 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '03/01/2021', 38, 38.5, 36.5, 37, 117378
);
/* INSERT QUERY NO: 4 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '04/01/2021', 36.5, 36.6, 35.6, 35.7, 93737
);
/* INSERT QUERY NO: 5 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '05/01/2021', 35.35, 36.8, 35.1, 36.7, 169106
);
/* INSERT QUERY NO: 6 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '06/01/2021', 36.5, 38.5, 36.5, 38, 123179
);
/* INSERT QUERY NO: 7 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '07/01/2021', 37.5, 39.5, 37.3, 39.4, 282986
);
/* INSERT QUERY NO: 8 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '08/01/2021', 39, 40.5, 38.5, 40, 117437
);
/* INSERT QUERY NO: 9 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '09/01/2021', 39.7, 39.8, 39.3, 39.4, 873009
);
/* INSERT QUERY NO: 10 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '10/01/2021', 39.2, 39.2, 37.2, 37.8, 62522
);
/* INSERT QUERY NO: 11 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '11/01/2021', 38, 38.5, 36.5, 37, 114826
);
/* INSERT QUERY NO: 12 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '12/01/2021', 36.5, 37.9, 36.3, 37.8, 281461
);
/* INSERT QUERY NO: 13 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '13/01/2021', 37.5, 39.5, 37.3, 39.4, 77334
);
/* INSERT QUERY NO: 14 */
INSERT INTO RAW_SOURCE(Stock, Close_Date, Open, High, Low, Close, Volume)
VALUES
(
'XYZ', '14/01/2021', 39, 40.5, 38.5, 40, 321684
);
Below is the sample data for one stock "XYZ":
+-------+------------+-------+------+------+-------+--------+
| Stock | Close Date | Open | High | Low | Close | Volume |
+-------+------------+-------+------+------+-------+--------+
| XYZ | 01-01-2021 | 40 | 40.5 | 38.5 | 38.8 | 83057 |
| XYZ | 02-01-2021 | 39.2 | 39.2 | 37.2 | 37.8 | 181814 |
| XYZ | 03-01-2021 | 38 | 38.5 | 36.5 | 37 | 117378 |
| XYZ | 04-01-2021 | 36.5 | 36.6 | 35.6 | 35.7 | 93737 |
| XYZ | 05-01-2021 | 35.35 | 36.8 | 35.1 | 36.7 | 169106 |
| XYZ | 06-01-2021 | 36.5 | 38.5 | 36.5 | 38 | 123179 |
| XYZ | 07-01-2021 | 37.5 | 39.5 | 37.3 | 39.4 | 282986 |
| XYZ | 08-01-2021 | 39 | 40.5 | 38.5 | 40 | 117437 |
| XYZ | 09-01-2021 | 39.7 | 39.8 | 39.3 | 39.4 | 873009 |
| XYZ | 10-01-2021 | 39.2 | 39.2 | 37.2 | 37.8 | 62522 |
| XYZ | 11-01-2021 | 38 | 38.5 | 36.5 | 37 | 114826 |
| XYZ | 12-01-2021 | 36.5 | 37.9 | 36.3 | 37.8 | 281461 |
| XYZ | 13-01-2021 | 37.5 | 39.5 | 37.3 | 39.4 | 77334 |
| XYZ | 14-01-2021 | 39 | 40.5 | 38.5 | 40 | 321684 |
+-------+------------+-------+------+------+-------+--------+
Over time there will be thousands of records for each stock symbol, and I would like to identify candlestick patterns only at the top of an upmove/uptrend or at the bottom of a downmove/downtrend, but NOT in sideways movement (since those would be false positives). Below is a sample screenshot:
Assuming today is 12th Jan 2021, below is the expected output:
+-------+-------------------+------------+------------+--------------+--------+---------------+
| Stock | Consecutive Count | Start Date | End Date | Latest Close | Volume | Pattern |
+-------+-------------------+------------+------------+--------------+--------+---------------+
| XYZ | 3 | 09-01-2021 | 12-01-2021 | 37.8 | 281461 | Piercing Line |
+-------+-------------------+------------+------------+--------------+--------+---------------+
Since the source table will have many other stocks, I would like to show results on 12th Jan 2021 for the other stocks as well, if any pattern is identified.
I feel this requires quite challenging and complex logic, hence I am seeking help here. Thanks in advance.
Update: Thank you @JustinCave
Here's the formula for calculation:
For Bullish Engulfing:
O1 > C1 and C > O and C > H1 and O < L1
where,
O1 = Previous day Open price
C1 = Previous day Close price
C = Today's Close price
O = Today's Open price
H1 = Previous day High price
L1 = Previous day Low price
For Bearish Harami:
(O1 < C1) and (O > C) and (O < C1) and (C > O1) and (H < H1) and (L > L1)
where,
O1 = Previous day Open price
C1 = Previous day Close price
C = Today's Close price
O = Today's Open price
H1 = Previous day High price
L1 = Previous day Low price
H = Today's High price
L = Today's Low price
For Piercing Line:
(O < C) and (O1 > C1) and (C > (C1 + O1)/2) and (O < C1) and (C < O1)
where,
O1 = Previous day Open price
C1 = Previous day Close price
C = Today's Close price
O = Today's Open price
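As a language-neutral reference, the three formulas above can be transcribed directly into small Python predicates. Each bar is an (open, high, low, close) tuple; the values in the comment are taken from the sample table in the question.

```python
# Direct transcription of the formulas above; each bar is (open, high, low, close)
def bullish_engulfing(prev, cur):
    o1, h1, l1, c1 = prev
    o, h, l, c = cur
    return o1 > c1 and c > o and c > h1 and o < l1

def bearish_harami(prev, cur):
    o1, h1, l1, c1 = prev
    o, h, l, c = cur
    return o1 < c1 and o > c and o < c1 and c > o1 and h < h1 and l > l1

def piercing_line(prev, cur):
    o1, h1, l1, c1 = prev
    o, h, l, c = cur
    return o < c and o1 > c1 and c > (c1 + o1) / 2 and o < c1 and c < o1

# e.g. 04/01 -> 05/01 from the sample data is a bullish engulfing:
# prev = (36.5, 36.6, 35.6, 35.7), cur = (35.35, 36.8, 35.1, 36.7)
```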
Patterns in MATCH_RECOGNIZE work in a similar fashion to regular expressions; you want something like:
(Note: your PIERCING_LINE formula does not give the expected output so I have assumed you want C > (C1 + O1)/2 rather than C > C1 + (O1/2).)
SELECT *
FROM raw_source
MATCH_RECOGNIZE (
PARTITION BY stock
ORDER BY Close_Date
MEASURES
CLASSIFIER() AS pttrn
ALL ROWS PER MATCH
PATTERN (
^initial_value
|
down+ (bullish_engulfing | piercing_line | $)
|
up+ (bearish_harami | $)
|
other
)
DEFINE
down AS
PREV(open) > open
AND PREV(close) > close
AND PREV(open) > PREV(close)
AND open > close,
up AS
PREV(open) < open
AND PREV(close) < close
AND PREV(open) < PREV(close)
AND open < close,
bullish_engulfing AS
-- O1 > C1 and C > O and C > H1 and O < L1
PREV(open) > PREV(close)
AND close > open
AND close > PREV(high)
AND open < PREV(low),
bearish_harami AS
-- O1 < C1 and O > C and O < C1 and C > O1 and H < H1 and L > L1
PREV(open) < PREV(close)
AND open > close
AND open < PREV(close)
AND close > PREV(open)
AND high < PREV(high)
AND low > PREV(low),
piercing_line AS
-- O < C and O1 > C1 and C > (C1 + O1)/2 and O < C1 and C < O1
open < close
AND PREV(open) > PREV(close)
AND close > (PREV(close) + PREV(open))/2
AND open < PREV(close)
AND close < PREV(open)
)
Which outputs:
+-------+------------+-------------------+-------+------+------+-------+--------+
| STOCK | CLOSE_DATE | PTTRN             | OPEN  | HIGH | LOW  | CLOSE | VOLUME |
+-------+------------+-------------------+-------+------+------+-------+--------+
| XYZ   | 01/01/2021 | INITIAL_VALUE     | 40    | 40.5 | 38.5 | 38.8  | 83057  |
| XYZ   | 02/01/2021 | DOWN              | 39.2  | 39.2 | 37.2 | 37.8  | 181814 |
| XYZ   | 03/01/2021 | DOWN              | 38    | 38.5 | 36.5 | 37    | 117378 |
| XYZ   | 04/01/2021 | DOWN              | 36.5  | 36.6 | 35.6 | 35.7  | 93737  |
| XYZ   | 05/01/2021 | BULLISH_ENGULFING | 35.35 | 36.8 | 35.1 | 36.7  | 169106 |
| XYZ   | 06/01/2021 | UP                | 36.5  | 38.5 | 36.5 | 38    | 123179 |
| XYZ   | 07/01/2021 | UP                | 37.5  | 39.5 | 37.3 | 39.4  | 282986 |
| XYZ   | 08/01/2021 | UP                | 39    | 40.5 | 38.5 | 40    | 117437 |
| XYZ   | 09/01/2021 | BEARISH_HARAMI    | 39.7  | 39.8 | 39.3 | 39.4  | 873009 |
| XYZ   | 10/01/2021 | DOWN              | 39.2  | 39.2 | 37.2 | 37.8  | 62522  |
| XYZ   | 11/01/2021 | DOWN              | 38    | 38.5 | 36.5 | 37    | 114826 |
| XYZ   | 12/01/2021 | PIERCING_LINE     | 36.5  | 37.9 | 36.3 | 37.8  | 281461 |
| XYZ   | 13/01/2021 | UP                | 37.5  | 39.5 | 37.3 | 39.4  | 77334  |
| XYZ   | 14/01/2021 | UP                | 39    | 40.5 | 38.5 | 40    | 321684 |
+-------+------------+-------------------+-------+------+------+-------+--------+
db<>fiddle here
I've upvoted @MT0's answer, and I would use match_recognize for this sort of thing myself, since this is squarely the sort of problem it is designed to deal with. However, match_recognize is a pretty sophisticated construct and the patterns you're looking for are pretty simple, so, as expressed, you could solve your problem with a simpler query that just uses a few lag analytic functions. As the patterns you're looking for get more sophisticated, you'll find it easier to express them using match_recognize and harder to handle them with lag alone, but the current problem is relatively easy to express this way.
Note that I'm making the same change to the "Piercing Line" formula that @MT0 did:
with data as (
select src.stock,
src.close_date,
src.open o,
src.close c,
src.high h,
src.low l,
lag(src.open) over (partition by src.stock order by src.close_date) o1,
lag(src.close) over (partition by src.stock order by src.close_date) c1,
lag(src.high) over (partition by src.stock order by src.close_date) h1,
lag(src.low) over (partition by src.stock order by src.close_date) l1
from raw_source src
)
select d.*,
case when o1 > c1 and c > o and c > h1 and o < l1
then 'Bullish Engulfing'
when (O1 < C1) and (O > C) and (O < C1) and (C > O1) and (H < H1) and (L > L1)
then 'Bearish Harami'
when (O < C) and (O1 > C1) and (C > (C1 + O1)/2) and (O < C1) and (C < O1)
then 'Piercing Line'
end pattern
from data d
which produces the same results in the pattern column in this dbfiddle. Since we can use the same syntax you're using to express the formulas, though, it may be easier to follow the logic in this query than to understand the match_recognize syntax.
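For anyone prototyping outside the database, the same lag logic translates almost mechanically to pandas shift(). This is a sketch, not a drop-in replacement for the SQL; it assumes a frame with the RAW_SOURCE column names.

```python
import pandas as pd

def label_patterns(df):
    """Label candlestick patterns using shifted (lagged) columns.

    df needs the RAW_SOURCE columns: stock, close_date, open, high, low, close.
    Mirrors the lag-based SQL query above.
    """
    g = df.sort_values(["stock", "close_date"]).copy()
    o, h, l, c = g["open"], g["high"], g["low"], g["close"]
    # previous-day values, restarting at each stock (like LAG ... PARTITION BY)
    o1, h1, l1, c1 = (x.groupby(g["stock"]).shift() for x in (o, h, l, c))

    bull = (o1 > c1) & (c > o) & (c > h1) & (o < l1)
    harami = (o1 < c1) & (o > c) & (o < c1) & (c > o1) & (h < h1) & (l > l1)
    pierce = (o < c) & (o1 > c1) & (c > (c1 + o1) / 2) & (o < c1) & (c < o1)

    g["pattern"] = None
    g.loc[bull, "pattern"] = "Bullish Engulfing"
    g.loc[harami, "pattern"] = "Bearish Harami"
    g.loc[pierce, "pattern"] = "Piercing Line"
    return g
```

Rows whose lagged values are missing (the first day of each stock) simply compare as False and stay unlabelled.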

How do I generate age category? My PATIENT_YOB is given as 01jan1956 and I want to get exact age

I'm trying to use the following code, but it gives an error. My dates of birth look like this:
01jan1986
05jan2001
07mar1983
and so on. I need to get their exact ages.
gen agecat=1
if age 0-20==1
if age 21-40==2
if age 41-60==3
if age 61-64==4
Here's one way:
gen age_cat = cond(age <= 20, 1, cond(age <= 40, 2, cond(age <= 60, 3, cond(age <= 64, 4, .))))
You might also want to look into egen, cut, see help egen.
To build off of Wouter's answer, you could do something like this to calculate the age to the tenth of a year:
clear
set obs 12
set seed 12352
global today = date("18Jun2021", "DMY")
* Sample Data
gen dob = runiformint(0,17000) // random Dates
format dob %td
* Create Age
gen age = round((ym(year(${today}),month(${today})) - ym(year(dob), month(dob)))/ 12,0.1)
* Correct age if dob in current month, but after today's date
replace age = age - 0.1 if (month(${today}) == month(dob)) & (day(dob) > day(${today}))
* age category
gen age_cat = cond(age <= 20, 1, cond(age <= 40, 2, cond(age <= 60, 3, cond(age <= 64, 4, .))))
The penultimate step is important as it decrements the age if their DOB is in the same month as the comparison date but has yet to be realised.
+----------------------------+
| dob age age_cat |
|----------------------------|
1. | 30jan2004 17.4 1 |
2. | 14aug1998 22.8 2 |
3. | 06aug1998 22.8 2 |
4. | 31aug1994 26.8 2 |
5. | 27mar1990 31.3 2 |
|----------------------------|
6. | 12jun1968 53 3 |
7. | 05may1964 57.1 3 |
8. | 06aug1994 26.8 2 |
9. | 21jun1989 31.9 2 |
10. | 10aug1984 36.8 2 |
|----------------------------|
11. | 22oct2001 19.7 1 |
12. | 03may1972 49.1 3 |
+----------------------------+
Note that the decimal is just approximate as it uses the month of the birthday and not the actual date.
You got some good advice in other answers, but this can be as simple as you want.
Consider this example, noting that presenting data as code we can run is a really helpful detail.
* Example generated by -dataex-. For more info, type help dataex
clear
input str9 sdate float dob
"01jan1986" 9497
"05jan2001" 14980
"07mar1983" 8466
end
format %td dob
The age at end 2020 is just 2020 minus the year people were born. Use any other year if it makes more sense.
. gen age = 2020 - year(dob)
. l
+-----------------------------+
| sdate dob age |
|-----------------------------|
1. | 01jan1986 01jan1986 34 |
2. | 05jan2001 05jan2001 19 |
3. | 07mar1983 07mar1983 37 |
+-----------------------------+
For 20-year bins, why not make them self-describing? Thus with this code, 20, 40, etc. are the upper limits of each bin. (You might need to tweak that if you have children under 1 year old in your data.)
. gen age2 = 20 * ceil(age/20)
. l
+------------------------------------+
| sdate dob age age2 |
|------------------------------------|
1. | 01jan1986 01jan1986 34 40 |
2. | 05jan2001 05jan2001 19 20 |
3. | 07mar1983 07mar1983 37 40 |
+------------------------------------+
This paper is a review of rounding and binning using Stata.
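The self-describing bin trick (20 * ceil(age/20)) is language-agnostic; here is a minimal Python sketch of the same idea, with a hypothetical helper name.

```python
import math

def age_bin(age, width=20):
    """Upper limit of the width-year bin containing age, e.g. 34 -> 40."""
    return width * math.ceil(age / width)
```

Note that an age of exactly 0 maps to bin 0, which is the under-1-year caveat mentioned above.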

Algorithm to produce number series

I am not sure how to attack this problem... I have tried many things, and it seems it shouldn't be so difficult, but I am not getting there...
Is it possible to create a function series(_x) that produces this series?
For example, myfunction(11) => 211.
The terms become suffixes for the subsequent terms. See the picture below for more clarity. The boxes with the same color get repeated, so we can just keep prepending 1 and 2 to the previous results.
Code (in Java):
import java.util.ArrayList;
import java.util.List;

public class Solution {
    public static void main(String[] args) {
        List<String> ans = solve(10);
        for (int i = 0; i < ans.size(); ++i) System.out.println(ans.get(i));
    }

    private static List<String> solve(int terms) {
        List<String> ans = new ArrayList<>();
        String[] digits = new String[]{"1", "2"};
        ans.add("1");
        if (terms == 1) return ans;
        ans.add("2");
        if (terms == 2) return ans;
        List<String> final_result = new ArrayList<>();
        final_result.addAll(ans);
        terms -= 2; // since 2 numbers are already added
        while (terms > 0) {
            List<String> temp = new ArrayList<>();
            for (String s : digits) {
                for (int j = 0; j < ans.size() && terms > 0; ++j) {
                    temp.add(s + ans.get(j));
                    terms--;
                }
            }
            ans = temp;
            final_result.addAll(ans);
        }
        return final_result;
    }
}
This hint should help you... It isn't quite binary, but it is close. Let me know if you need any further help
0 -> - -> -
1 -> - -> -
10 -> 0 -> 1
11 -> 1 -> 2
100 -> 00 -> 11
101 -> 01 -> 12
110 -> 10 -> 21
111 -> 11 -> 22
1000 -> 000 -> 111
1001 -> 001 -> 112
1010 -> 010 -> 121
1011 -> 011 -> 122
1100 -> 100 -> 211
1101 -> 101 -> 212
1110 -> 110 -> 221
1111 -> 111 -> 222
Edit: I didn't like the way I ordered the columns, so I swapped 2 and 3
Python approach
The first thing we need to do is produce binary strings.
In Python this can be done with bin(number).
However, this returns a number in the form 0b101.
We can easily strip away the 0b from the beginning by telling Python that we don't want the first two characters but do want all the rest. The code for that is bin(number)[2:]: the left side of the : says start two characters in, and since the right side is blank, go to the end.
Now we have the binary numbers, but we still need to strip away the first digit. Luckily we already know how to strip away leading characters, so change that line to bin(number)[3:].
All that is left to do now is add one to every position in the number. To do that, let's build a new string, adding each character from the old string after incrementing it by one.
# we already had this
binary = bin(user_in + 1)[3:]
new = ""
for char in binary:
# add to the string the character + 1
new += str(int(char) + 1)
And we are done. That snippet converts from decimal to whatever this system is. One thing you might notice is that this solution is offset by one (2 will be 1, 3 will be 2); we can fix this by simply adding one to the user input before we begin.
Final code, with some conveniences (a while loop and a print statement):
while True:
user_in = int(input("enter number: "))
binary = bin(user_in + 1)[3:]
new = ""
for char in binary:
new += str(int(char) + 1)
print(user_in, "\t->\t", binary, "\t->\t", new)
According to OEIS A007931,
we should perform 3 steps:
Convert value + 1 to base 2
Remove the first 1
Add 1 to each remaining digit
For instance, for 11 we have:
Converting 11 + 1 == 12 to binary: 1100
Removing the first 1: 100
Adding 1 to each remaining digit: 211
So 11 has the representation 211.
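The three steps can be sketched in Python before looking at the C# version below:

```python
def series_term(n):
    """n-th term of 1, 2, 11, 12, 21, 22, 111, ... via the 3 steps above."""
    bits = bin(n + 1)[2:]                           # step 1: n + 1 in base 2
    rest = bits[1:]                                 # step 2: drop the leading 1
    return "".join(str(int(b) + 1) for b in rest)   # step 3: add 1 to each digit
```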
C# code:
private static String MyCode(int value) =>
string.Concat(Convert
.ToString(value + 1, 2) // To Binary
.Skip(1) // Skip (Remove) 1st 1
.Select(c => (char)(c + 1))); // Add 1 to the rest digits
Demo:
var result = Enumerable
.Range(1, 22)
.Select(value => $"{MyCode(value),4} : {value,2}");
Console.Write(string.Join(Environment.NewLine, result));
Outcome:
1 : 1
2 : 2
11 : 3
12 : 4
21 : 5
22 : 6
111 : 7
112 : 8
121 : 9
122 : 10
211 : 11
212 : 12
221 : 13
222 : 14
1111 : 15
1112 : 16
1121 : 17
1122 : 18
1211 : 19
1212 : 20
1221 : 21
1222 : 22
In VB.NET, showing both the counting in base-3 and OEIS formula ways, with no attempts at optimisation:
Module Module1
Function OEIS_A007931(n As Integer) As Integer
' From https://oeis.org/A007931
Dim m = Math.Floor(Math.Log(n + 1) / Math.Log(2))
Dim x = 0
For j = 0 To m - 1
Dim b = Math.Floor((n + 1 - 2 ^ m) / (2 ^ j))
x += CInt((1 + b Mod 2) * 10 ^ j)
Next
Return x
End Function
Function ToBase3(n As Integer) As String
Dim s = ""
While n > 0
s = (n Mod 3).ToString() & s
n \= 3
End While
Return s
End Function
Function SkipZeros(n As Integer) As String
Dim i = 0
Dim num = 1
Dim s = ""
While i < n
s = ToBase3(num)
If s.IndexOf("0"c) = -1 Then
i += 1
End If
num += 1
End While
Return s
End Function
Sub Main()
Console.WriteLine("A007931 Base3 ITERATION")
For i = 1 To 22
Console.WriteLine(OEIS_A007931(i).ToString().PadLeft(7) & SkipZeros(i).PadLeft(7) & i.ToString().PadLeft(11))
Next
Console.ReadLine()
End Sub
End Module
Outputs:
A007931 Base3 ITERATION
1 1 1
2 2 2
11 11 3
12 12 4
21 21 5
22 22 6
111 111 7
112 112 8
121 121 9
122 122 10
211 211 11
212 212 12
221 221 13
222 222 14
1111 1111 15
1112 1112 16
1121 1121 17
1122 1122 18
1211 1211 19
1212 1212 20
1221 1221 21
1222 1222 22

glmnet input error, format of the input matrix incorrect

I get the following error:
Error in storage.mode(y) <- "double" : invalid to change the storage mode of a factor
which points to an issue with getting the format of the input matrix right.
Here is the code:
library("glmnet")
daten = read.csv("test.csv",header = 1)
# Sex Age Weight Height Other
# 0 22 54 154 1.51
# 1 34 76 178 1.94
# 1 38 75 178 1.93
# 1 32 102 178 2.19
# ...
# 1 35 94 184 2.18
trainX <- daten
# outcome variable
Y <- c(0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,0,0,1)
trainY <- factor(Y)
fit.lasso=glmnet(trainX,trainY,alpha=1)
trainY seems to be formatted correctly as a factor, but what is wrong with trainX?
Any comments would be highly appreciated.
Problem solved!
Simply skip the step
trainY <- factor(Y)
and use
fit.lasso=glmnet(trainX,Y,alpha=1)
and it works fine!
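For completeness: glmnet expects its x argument to be a numeric matrix (not a data frame), and for a gaussian fit y stays numeric rather than a factor. The same shape requirements can be illustrated with scikit-learn in Python; the data below is a made-up stand-in for the CSV, not the asker's file.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical stand-in for the CSV: predictors as a numeric matrix,
# outcome as a numeric vector (no factor/categorical conversion)
X = np.array([[0, 22, 54, 154, 1.51],
              [1, 34, 76, 178, 1.94],
              [1, 38, 75, 178, 1.93],
              [1, 32, 102, 178, 2.19]])
y = np.array([0.0, 0.0, 1.0, 1.0])

fit = Lasso(alpha=1.0).fit(X, y)  # L1-penalised fit, analogous to alpha=1 in glmnet
```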

Quickest way to determine range overlap in Perl

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).
$a_ranges =
{
a_1 =>
{
start => ...,
end => ...,
},
a_2 =>
{
start => ...,
end => ...,
},
a_3 =>
{
start => ...,
end => ...,
},
# and so on
};
$b_ranges =
{
b_1 =>
{
start => ...,
end => ...,
},
b_2 =>
{
start => ...,
end => ...,
},
b_3 =>
{
start => ...,
end => ...,
},
# and so on
};
I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.
I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.
Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?
Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.
This sounds like the perfect use case for an interval tree, which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, then you can build one of them into an interval tree in time O(m lg m) and then do n intersection queries in time O(n lg m + k), where k is the total number of intersections you find. This gives a net runtime of O((m + n) lg m + k). Remember that in the worst case k = O(nm), so this isn't any better than what you have, but for cases where the number of intersections is sparse this can be substantially better than the O(mn) runtime you have right now.
I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description it seems like they shouldn't be that hard to build. I'd be pretty surprised if one didn't exist already.
Hope this helps!
In case you are not exclusively tied to Perl: the IRanges package in R deals with interval arithmetic. It has very powerful primitives, and it would probably be easy to code a solution with them.
A second remark is that the problem could possibly become very easy if the intervals have additional structure; for example, if within each set of ranges there is no overlap (in that case a linear approach sifting through the two ordered sets simultaneously is possible). Even in the absence of such structure, the least you can do is to sort one set of ranges by start point, and the other set by end point, then break out of the inner loop once a match is no longer possible. Of course, existing and general algorithms and data structures such as the interval tree mentioned earlier are the most powerful.
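The sort-and-early-exit idea in the paragraph above can be sketched in Python. This is a hypothetical helper; the dict-of-tuples layout mirrors the Perl hashes in the question.

```python
def find_overlaps(a_ranges, b_ranges):
    """Map each range id in a_ranges to the ids of overlapping b_ranges.

    a_ranges / b_ranges: dicts of id -> (start, end), inclusive endpoints,
    mirroring the Perl hash-of-hashes in the question.
    """
    # sort B once by start point so the inner loop can stop early
    b_sorted = sorted(b_ranges.items(), key=lambda kv: kv[1][0])
    result = {}
    for a_id, (a_start, a_end) in a_ranges.items():
        hits = []
        for b_id, (b_start, b_end) in b_sorted:
            if b_start > a_end:
                break              # no later B range can overlap this A range
            if b_end >= a_start:   # overlap test: max(starts) <= min(ends)
                hits.append(b_id)
        result[a_id] = hits
    return result
```

This keeps the double loop but prunes it; an interval tree remains the better choice when both sets are large and unstructured.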
There are several existing CPAN modules that solve this issue; I have developed two of them: Data::Range::Compare and Data::Range::Compare::Stream.
Data::Range::Compare only works with arrays in memory, but supports generic range types.
Data::Range::Compare::Stream Works with streams of data via iterators, but it requires understanding OO Perl to extend to generic data types. Data::Range::Compare::Stream is recommended if you are processing very very large sets of data.
Here is an excerpt from the Examples folder of Data::Range::Compare::Stream.
Given these 3 sets of data:
Numeric Range set: A contained in file: source_a.src
+----------+
| 1 - 11 |
| 13 - 44 |
| 17 - 23 |
| 55 - 66 |
+----------+
Numeric Range set: B contained in file: source_b.src
+----------+
| 0 - 1 |
| 2 - 29 |
| 88 - 133 |
+----------+
Numeric Range set: C contained in file: source_c.src
+-----------+
| 17 - 29 |
| 220 - 240 |
| 241 - 250 |
+-----------+
The expected output would be:
+--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+
| 0 - 0 | No Data | 0 - 1 | No Data |
| 1 - 1 | 1 - 11 | 0 - 1 | No Data |
| 2 - 11 | 1 - 11 | 2 - 29 | No Data |
| 12 - 12 | No Data | 2 - 29 | No Data |
| 13 - 16 | 13 - 44 | 2 - 29 | No Data |
| 17 - 29 | 13 - 44 | 2 - 29 | 17 - 29 |
| 30 - 44 | 13 - 44 | No Data | No Data |
| 55 - 66 | 55 - 66 | No Data | No Data |
| 88 - 133 | No Data | 88 - 133 | No Data |
| 220 - 240 | No Data | No Data | 220 - 240 |
| 241 - 250 | No Data | No Data | 241 - 250 |
+--------------------------------------------------------------------+
The Source code can be found here.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use lib qw(./ ../lib);
# custom package from FILE_EXAMPLE.pl
use Data::Range::Compare::Stream::Iterator::File;
use Data::Range::Compare::Stream;
use Data::Range::Compare::Stream::Iterator::Consolidate;
use Data::Range::Compare::Stream::Iterator::Compare::Asc;
my $source_a=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_a.src');
my $source_b=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_b.src');
my $source_c=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_c.src');
my $consolidator_a=new Data::Range::Compare::Stream::Iterator::Consolidate($source_a);
my $consolidator_b=new Data::Range::Compare::Stream::Iterator::Consolidate($source_b);
my $consolidator_c=new Data::Range::Compare::Stream::Iterator::Consolidate($source_c);
my $compare=new Data::Range::Compare::Stream::Iterator::Compare::Asc();
my $src_id_a=$compare->add_consolidator($consolidator_a);
my $src_id_b=$compare->add_consolidator($consolidator_b);
my $src_id_c=$compare->add_consolidator($consolidator_c);
print " +--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+\n";
my $format=' | %-12s | %-13s | %-13s | %-13s |'."\n";
while($compare->has_next) {
my $result=$compare->get_next;
my $string=$result->to_string;
my @data=($result->get_common);
next if $result->is_empty;
for(0 .. 2) {
my $column=$result->get_column_by_id($_);
unless(defined($column)) {
$column="No Data";
} else {
$column=$column->get_common->to_string;
}
push @data,$column;
}
printf $format,@data;
}
print " +--------------------------------------------------------------------+\n";
Try Tree::RB, but for mutually exclusive ranges with no overlaps.
The performance is not bad: I had about 10,000 segments and had to find the segment for each discrete number. My input had 300 million records, and I needed to put them into separate buckets, like partitioning the data. Tree::RB worked out great.
$var = [
[0,90],
[91,2930],
[2950,8293]
.
.
.
]
My lookup values were 10, 99, 991, ...
and basically I needed the position of the range for the given number.
My comparison function looks something like this:
my $cmp = sub
{
my ($a1, $b1) = @_;
if(ref($b1) && ref($a1))
{
return ($$a1[1]) <=> ($$b1[0]);
}
my $ret = 0;
if(ref($a1) eq 'ARRAY')
{
#
if($$a1[0] <= $b1 && $b1 <= $$a1[1])
{
$ret = 0;
}
if($$a1[0] < $b1)
{
$ret = -1;
}
if($$a1[1] > $b1)
{
$ret = 1;
}
}
else
{
if($$b1[0] <= $a1 && $a1 <= $$b1[1])
{
$ret = 0;
}
if($$b1[0] > $a1)
{
$ret = -1;
}
if($$b1[1] < $a1)
{
$ret = 1;
}
}
return $ret;
}
I should benchmark it to know whether it is the fastest way, but given the structure of your data you could try this:
use strict;
my $fromA = 12;
my $toA = 15;
my $fromB = 7;
my $toB = 35;
my @common_range = get_common_range($fromA, $toA, $fromB, $toB);
my $common_range = $common_range[0]."-".$common_range[-1];
sub get_common_range {
my @A = $_[0]..$_[1];
my %B = map {$_ => 1} $_[2]..$_[3];
my @common = ();
foreach my $i (@A) {
if (defined $B{$i}) {
push (@common, $i);
}
}
return sort {$a <=> $b} @common;
}
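A side note on get_common_range above: there is no need to enumerate every integer in both ranges, because the intersection endpoints follow directly from a max/min comparison. A Python sketch of the constant-time version:

```python
def common_range(from_a, to_a, from_b, to_b):
    """Endpoints of the intersection of [from_a, to_a] and [from_b, to_b],
    or None if the ranges do not overlap."""
    lo, hi = max(from_a, from_b), min(to_a, to_b)
    return (lo, hi) if lo <= hi else None
```

This runs in constant time and memory regardless of range width, unlike building a hash of every integer in range B.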
