Correcting my interpretation of SAS code

I am pretty new to SAS. Can you please help me interpret the following lines of code:
proc means data=crsp1 noprint;
var ret;
by gvkey datadate year;
output out=exec_roll_vol_fyear n=nrollingstd std=rollingstd;
run;
data volatility;
set exec_roll_vol_fyear;
where &start_year <= year <= &end_year;
* we have volatility of monthly returns,
converting to annual volatility;
estimated_volatility=rollingstd*(12**0.5);
proc sort nodupkey;
by gvkey year;
run;
Does it mean the following: take the data set "crsp1" and create a data set "exec_roll_vol_fyear" that will contain the rolling standard deviation of "ret"? (I don't quite see what "proc means" stands for here.)
Second part: use the data set "exec_roll_vol_fyear" to create a data set "volatility", where estimated_volatility=rollingstd*(12**0.5), and drop duplicates of gvkey year. Am I right?

PROC MEANS is a summarization procedure. Here it calculates the count (n) and standard deviation of ret for each unique combination of gvkey, datadate, and year, and writes the results to a dataset exec_roll_vol_fyear. This might be a "rolling" standard deviation if the incoming data is structured appropriately for that (basically, if datadate defines the rolling windows and any given record is duplicated once for each window it falls in); it's impossible to tell from this code alone. There are better tools for time series analysis in SAS, though.
Then the DATA step keeps only the years between &start_year and &end_year, applies a formula to create a new variable from the standard deviation (annualizing the monthly volatility by multiplying by the square root of 12), and the PROC SORT removes duplicates by gvkey and year.
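If it helps to see the same shape of computation in another notation, here is a rough SQL analogue of the two steps (a sketch only: table and column names come from the question, the macro variables &start_year/&end_year are replaced by placeholder years, and SAS details such as the automatic _TYPE_ and _FREQ_ variables are ignored):
-- Analogue of the PROC MEANS step: one output row per (gvkey, datadate, year)
-- with the count and (sample) standard deviation of ret.
CREATE TABLE exec_roll_vol_fyear AS
SELECT gvkey,
       datadate,
       year,
       COUNT(ret)  AS nrollingstd,
       STDDEV(ret) AS rollingstd   -- some dialects call this STDDEV_SAMP
FROM crsp1
GROUP BY gvkey, datadate, year;

-- Analogue of the DATA step plus PROC SORT NODUPKEY: filter the years,
-- annualize the monthly volatility, and keep one row per (gvkey, year).
-- NODUPKEY keeps the first row it meets per key; ordering by datadate
-- here is only an assumption, made to make that choice explicit.
CREATE TABLE volatility AS
SELECT gvkey, year, estimated_volatility
FROM (
    SELECT gvkey,
           year,
           rollingstd * SQRT(12) AS estimated_volatility,
           ROW_NUMBER() OVER (PARTITION BY gvkey, year ORDER BY datadate) AS rn
    FROM exec_roll_vol_fyear
    WHERE year BETWEEN 2000 AND 2020   -- stands in for &start_year <= year <= &end_year
) t
WHERE rn = 1;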

Related

SPSS: automatic counting of follow-up moments in a longitudinal long format database

I would like to structure my long-format SPSS file so I can clean it and get a better overview. However, I run into some problems.
How can I create a new variable counting the completion moments/waves/follow-up moments? I only have a completion date available in my dataset. Please open my image for more explanation.
Preferably a numbering that continues counting if a year is missing.
If I understand right, you want the new variable to be an index of the year of involvement for each patient, as opposed to an index of data row per patient. To do this we can calculate for each entry the difference in years between the entry and the first entry of that patient:
(this assumes your dates are in date format)
compute year=xdate.year(OpenInvulMomenten).
aggregate /outfile=* mode=addvariables /break=PatientIdPseudo /firstYear=min(year).
compute newvar=1+year-firstYear.
exe.

Hive: filtering data between specified dates in string format. Which one is optimised and why?

Which one should be preferred and why? my_date_column is of type string and of format YYYY-MM-DD.
SELECT *
FROM my_table
WHERE my_date_column >= '2019-08-31' AND my_date_column <= '2019-09-02';
OR
SELECT *
FROM my_table
WHERE my_date_column in ('2019-08-31','2019-09-01','2019-09-02') ;
Lastly, in general, should I be storing my dates as a DATE data type or simply as strings? I chose the string type simply to handle any corrupt/badly formatted data.
Always store dates as a date type. If you store them as strings, it adds overhead to do any kind of arithmetic with the values (including those inequalities).
If you decide to ignore this advice and store dates as strings anyway, then I believe the second form (the IN list) will be faster, as it is comparing text string against text string. The other option, using inequalities, may not give the results you expect unless you explicitly CAST(my_date_column AS DATE) in both comparisons. That cast will be applied to every record, which adds to the total cost of the query, and would be avoidable if the column were stored as a date instead.
A third option that you didn't mention is the BETWEEN operator (i.e. WHERE my_date BETWEEN start_date AND end_date). This is the same as using the two inequalities, but a bit cleaner and more idiomatic.
When in doubt, take a look at the EXPLAIN plans to understand how the query will be executed.
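For concreteness, here is a minimal sketch of the forms discussed above, reusing the table and column names from the question (the DATE-typed version assumes the column has been converted as recommended):
-- Preferred: my_date_column stored as DATE, range filter with BETWEEN.
SELECT *
FROM my_table
WHERE my_date_column BETWEEN DATE '2019-08-31' AND DATE '2019-09-02';

-- If the column has to stay a STRING, the cast runs for every row:
SELECT *
FROM my_table
WHERE CAST(my_date_column AS DATE)
      BETWEEN DATE '2019-08-31' AND DATE '2019-09-02';

-- Compare the approaches by looking at their execution plans:
EXPLAIN
SELECT *
FROM my_table
WHERE my_date_column BETWEEN DATE '2019-08-31' AND DATE '2019-09-02';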

How can you add FLOAT measures in Tableau formatted as a time stamp (hh:mm:ss)?

The fields look as described above. They are time fields from SQL imported as a varchar. I had to format them as dates in Tableau. There can be NULL values, so I am having a tough time getting over that. The Tableau statement I have is only ([time spent])+([time waited])+([time solved]).
Thank you!
If you only want to use the result for a graphical visualization of what took the longest, you can split each field, convert all the values into seconds, and add them up for use in your view. E.g.
In this case the HH:MM:SS fields are strings in Tableau.
The formula used to sum the three fields is:
//transforms everything into seconds for each variable
zn(INT(SPLIT([Time Spent],':',1)) * 3600)
+
zn(INT(SPLIT([Time Spent],':',2)) * 60)
+
zn(INT(SPLIT([Time Spent],':',3)))
+
zn(INT(SPLIT([Time Waited],':',1)) * 3600)
+
zn(INT(SPLIT([Time Waited],':',2)) * 60)
+
zn(INT(SPLIT([Time Waited],':',3)))
+
zn(INT(SPLIT([Time Solved],':',1)) * 3600)
+
zn(INT(SPLIT([Time Solved],':',2)) * 60)
+
zn(INT(SPLIT([Time Solved],':',3)))
Quick explanation of the formula:
I SPLIT every field three times, once each for the hours, minutes, and seconds, and add all the values together.
INT converts each string piece into an integer.
ZN wraps every piece so that NULL fields become zeros.
You can also use the value as an integer if you want; e.g. Case A has a total time of 5310 seconds.
The best approach is usually to store dates in the database in a date field instead of in a string. That might mean a data prep/cleanup step before you get to Tableau, but it will help with efficiency, simplicity and robustness ever after.
You can present dates in many formats, including hh:mm, when the underlying representation is a date datatype. See the custom date options on the format pane in Tableau for example. But storing dates as formatted strings and converting them to something else for calculations is really doing things the hard way.
If you have no choice but to read in strings and convert them to dates, then you should look at the DATEPARSE function.
Either way, decide what a null date means and make sure your calculations behave well in that case -- unless you can enforce that the date field not contain nulls in the database.
One example would be a field called Completed_Date in a table of Work_Orders. You could determine that a null Completed_Date meant the work order had not been fulfilled yet, and thus allow nulls for that field. But you could also have the database enforce that another field, say Submitted_Date, could never be null.
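As an illustrative-only sketch of that convention (SQL Server syntax, restating the example above; nothing here comes from an actual schema):
-- Submitted_Date may never be null, while a null Completed_Date means
-- "work order not fulfilled yet" by convention.
create table dbo.Work_Orders (
    Work_Order_ID  int primary key,
    Submitted_Date date not null,
    Completed_Date date null
)
go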

How do I specify invalid time in a time dimension?

I am building a time dimension for only time in my data warehouse. I already have a date dimension.
How do I denote an unknown time? In my DimDate dimension, I marked 01/01/1753 as being reserved for unknown dates, but I think a time will be a bit harder. We don't allow NULLs in our fact tables. How do I do this, and what might that row look like?
You state the "We don't allow NULLs in our fact tables " but ask "How do I denote an unknown time?"
Assuming you are using in your FACT table a data type TIME + enforce a NOT NULL constraint on data arriving from source system => you simply cannot insert unknown\invalid time into your fact and hence should have no problem.
The obvious exception to the above is an invalid business wise value reported by the source system such as Sunil proposed ('00:59:59.9999999') but this is very uncommon, unstable solution for obvius reasons (changing requirements can easily turn this value into a valid one)
If you chose to allow (and i hope you did) records with NULL values or invalid dates from your source system to enter the fact then the best practice would be using surrogate keys on our DimTime and insert them as FK into your FACT tables – this will easily allow you to represent valid + invalid values in your dimension.
This approach can easily also support the approach of an invalid business wise value ('00:59:59.9999999'), such a value gets an FK_DimTime=-1.
I strongly advise on allowing specific types of garbage from source systems to enter the FACT (i.e – invalid\missing\NULL date\time values) tables as long as you clearly mark it in relevant DIMs as this tends to drive Users to improve data quality in source systems.
Here is some background on the matter
https://www.kimballgroup.com/1997/07/its-time-for-time/
https://www.kimballgroup.com/2004/02/design-tip-51-latest-thinking-on-time-dimension-tables/
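A minimal, self-contained sketch of that surrogate-key approach (the fact table, column, and constraint names here are purely illustrative, not from the question):
-- Illustrative only: an "Unknown" member in DimTime with surrogate key -1,
-- and a fact table whose time FK is NOT NULL but may point at that member.
create table dbo.DimTime (TimeID int primary key, TimeValue time null, DisplayTime nvarchar(20))
go
insert into dbo.DimTime values (-1, null, 'Unknown')
go
create table dbo.FactEvent (
    EventID    int identity primary key,
    FK_DimTime int not null
        constraint DF_FactEvent_Time default (-1)
        references dbo.DimTime (TimeID)
)
go
-- A row arriving with a missing or invalid source time simply falls back
-- to the default, i.e. the Unknown member:
insert into dbo.FactEvent default values
go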
It can look like anything you want. Most dimensions have a 'display name' of some kind, so your dimensions could look something like this:
create table dbo.DimDate (DateID int, DateValue date, DisplayDate nvarchar(20))
go
-- this is an unknown date; 1753-01-01 is only there because we need some valid date value
insert into dbo.DimDate values (1, '1753-01-01', 'Unknown')
go
-- this is the real date 1 Jan 1753
insert into dbo.DimDate values (2, '1753-01-01', '01 Jan 1753')
go
create table dbo.DimTime (TimeID int, TimeValue time, DisplayTime nvarchar(20))
go
-- this is an unknown time; 00:00 is only there because we need some valid time value
insert into dbo.DimTime values (1, '00:00', 'Unknown')
go
-- this is the real time value for midnight
insert into dbo.DimTime values (2, '00:00', 'Midnight')
go
Of course, this assumes that your reporting tool and users use the DisplayDate and DisplayTime columns for filtering instead of the DateValue and TimeValue columns directly, but that's simply a matter of training and standards, and whatever solution you adopt needs to be understood anyway.
There are other alternatives such as a flag column for unknown values, or a convention that a negative TimeID indicates an unknown value. But those are less obvious and harder to maintain than an explicit row value, in my opinion.
Just create a DimTime record with a -1 technical surrogate key and populate the time column with the value '00:59:59.9999999'. This is a time that is unlikely ever to be captured (to the last digit of precision) by your DWH, so it will always equate to "unknown" in your reports and in queries where you want to apply a filter like:
EventTime < #ReportTime AND EventTime <> '00:59:59.9999999'
Hope this is a viable solution to your problem.

Rapidminer: Memory issues transforming nominal to binominal attributes

I want to analyze a large dataset (2,000,000 records, 20,000 customer IDs, 6 nominal attributes) using the Generalized Sequential Pattern algorithm.
This requires all attributes, aside from the time and customer ID attributes, to be binominal. Since I have 6 nominal attributes that I want to analyze for patterns, I need to transform them into binominal attributes using the "Nominal to Binominal" operator. This is causing memory problems on my workstation (16GB RAM, of which I allocated 12 to the Java instance running RapidMiner).
Ideally I would like to set up my project so that it writes temporarily to disk, or uses temporary tables in my Oracle database from which my model also reads the data directly. In order to use the "write database" or "update database" function, I need to already have an existing table in my database with boolean columns (if I'm not mistaken).
I tried to write the results of the binominal conversion step by step into CSV files on my local disk. I started with the nominal attribute with the fewest distinct values, resulting in a CSV file containing my dataset ID and 7 binominal attributes. I was seriously surprised to see the file size already exceeding 200MB. This is caused by RapidMiner writing the strings "true"/"false" for the binominal values. Wouldn't it be far more memory efficient to just write 0/1?
Is there a way to either use the Oracle database directly or work with 0/1 values instead of "true"/"false"? My next column has 3,000 distinct values to be transformed, which would end in a nightmare...
I'd highly appreciate recommendations on how to use memory more efficiently or work directly in the database. If anyone knows how to easily transform a varchar2 column in Oracle into boolean columns for each distinct value, that would also be appreciated!
Thanks a lot,
Holger
edit:
My goal is to get from such a structure:
column_a; column_b; customer_ID; timestamp
value_aa; value_ba; 1; 1
value_ab; value_ba; 1; 2
value_ab; value_bb; 1; 3
to this structure:
customer_ID; timestamp; column_a_value_aa; column_a_value_ab; column_b_value_ba; column_b_value_bb
1; 1; 1; 0; 1; 0
1; 2; 0; 1; 1; 0
1; 3; 0; 1; 0; 1
This answer is too long for a comment.
If you have thousands of levels for the six variables you are interested in, then you are unlikely to get useful results from that data. A typical approach is to categorize the data going in, which results in fewer "binominal" variables. For instance, instead of "1 Gallon Whole Milk", you use "dairy products". This can produce more actionable results. Remember, Oracle only allows 1,000 columns in a table, so the database has other limiting factors.
If you are working with lots of individual items, then I would suggest other approaches, notably an approach based on association rules. This will not limit you by the number of variables.
Personally, I find that I can do much of this work in SQL, which is why I wrote a book on the topic ("Data Analysis Using SQL and Excel").
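If you do take the SQL route, here is a minimal hand-written sketch of the pivot for the small example in the question's edit (the table name my_source_table and the column name timestamp_col are hypothetical; with thousands of distinct values you would generate the CASE expressions programmatically, which is exactly why categorizing first matters):
-- Each distinct value of column_a / column_b becomes its own 0/1 column,
-- matching the target layout shown in the question's edit.
SELECT customer_id,
       timestamp_col,
       CASE WHEN column_a = 'value_aa' THEN 1 ELSE 0 END AS column_a_value_aa,
       CASE WHEN column_a = 'value_ab' THEN 1 ELSE 0 END AS column_a_value_ab,
       CASE WHEN column_b = 'value_ba' THEN 1 ELSE 0 END AS column_b_value_ba,
       CASE WHEN column_b = 'value_bb' THEN 1 ELSE 0 END AS column_b_value_bb
FROM my_source_table;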
You can use the Nominal to Numeric operator to convert true and false values to 1 or 0. Set the coding type parameter to unique integers.
