Creating n samples from multiple datasets in SAS - random

I have 6 datasets, x201201, x201202, ..., and I'm looking for a way to take a random sample of n=200 drawn from all 6 datasets.
I have been looking at proc surveyselect, but it only takes one dataset. I could make a temporary dataset by concatenating the 6 sets, but is this the easiest/only way to do it?

You don't have to create a copy. You can generate a view instead.
data x2012 / view=x2012;
  set x201201 x201202 x201203 x201204 x201205 x201206;
run;

proc surveyselect data=x2012 method=srs n=200 out=SampleSRS;
run;

Related

Correction of the Interpretation of the SAS code

I am pretty new to SAS. Can you please help me interpret the following lines of code:
proc means data=crsp1 noprint;
  var ret;
  by gvkey datadate year;
  output out=exec_roll_vol_fyear n=nrollingstd std=rollingstd;
run;

data volatility;
  set exec_roll_vol_fyear;
  where &start_year <= year <= &end_year;
  * we have volatility of monthly returns,
    converting to annual volatility;
  estimated_volatility=rollingstd*(12**0.5);

proc sort nodupkey;
  by gvkey year;
run;
Does it mean the following: take the data "crsp1" and create a dataset "exec_roll_vol_fyear" that will contain the rolling standard deviation of "ret"? (I don't quite see what "proc means" stands for here.)
Second part: use the data "exec_roll_vol_fyear" to create a dataset "volatility", where estimated_volatility=rollingstd*(12**0.5), and drop duplicates of gvkey year. Am I right?
PROC MEANS is a summarization procedure. In this case, it calculates the n and the standard deviation of ret for each unique combination of gvkey, datadate and year, and outputs them to the dataset exec_roll_vol_fyear. This might be a "rolling" standard deviation if the incoming data is structured appropriately for that (basically, if datadate defines the rolling windows and any given record is duplicated once for each window it falls in); it is impossible to tell from this snippet. There are better tools for time series analysis in SAS, though.
Then the data step applies a formula to create a new variable from the standard deviation, and PROC SORT sorts the resulting dataset, removing duplicate gvkey/year combinations.

Looking up a "key" in an 8GB+ text file

I have some 'small' text files that contain about 500,000 entries/rows. Each row also has a 'key' column. I need to find these keys in a big file (8GB, at least 219 million entries). When a key is found, I need to append the 'value' from the big file to the corresponding row of the small file, at the end of the row as a new column.
The big file looks like this:
KEY VALUE
"WP_000000298.1" "abc"
"WP_000000304.1" "xyz"
"WP_000000307.1" "random"
"WP_000000307.1" "text"
"WP_000000308.1" "stuff"
"WP_000000400.1" "stuffy"
Simply put, I need to look up 'key' in the big file.
Obviously I need to load the whole table into RAM (not a problem, I have 32GB available). The big file seems to be already sorted; I still have to check this.
The problem is that I cannot do a fast lookup using something like TDictionary because, as you can see, the key is not unique.
Note: This is probably a one-time computation. I will use the program once, then throw it away. So it doesn't have to be the BEST algorithm (difficult to implement). It just needs to finish in a decent time (like 1-2 days). PS: I prefer doing this without a DB.
I was thinking of this possible solution: TList.BinarySearch. But it seems that TList is limited to only 134,217,727 (MaxInt div 16) items, so TList won't work.
Conclusion:
I chose Arnaud Bouchez's solution. His TDynArray is impressive! I totally recommend it if you need to process large files.
Aleksey Kharlanov provided another nice solution, but TDynArray was already implemented.
Instead of re-inventing the wheel with a binary search or a B-Tree, try an existing implementation.
Feed the content into a SQLite3 in-memory DB (with the proper index, and with a transaction every 10,000 INSERTs) and you are done. Ensure you target Win64, to have enough space in RAM. You may even use file-based storage: a bit slower to create, but with indexes, queries by Key will be instant. If you do not have SQLite3 support in your edition of Delphi (via the latest FireDAC), you may use our OpenSource unit and its associated documentation.
Using SQLite3 will be definitely faster, and use fewer resources, than a regular client-server SQL database - BTW the "free" edition of MS SQL is not able to handle as much data as you need, AFAIR.
Update: I've written some sample code to illustrate how to use SQLite3, with our ORM layer, for your problem - see this source code file in github.
Here is some benchmark info:
with index defined before insertion:
INSERT 1000000 rows in 6.71s
SELECT 1000000 rows per Key index in 1.15s
with index created after insertion:
INSERT 1000000 rows in 2.91s
CREATE INDEX 1000000 in 1.28s
SELECT 1000000 rows per Key index in 1.15s
without the index:
INSERT 1000000 rows in 2.94s
SELECT 1000000 rows per Key index in 129.27s
So for a huge data set, an index is worth it, and creating the index after the data insertion reduces the resources used! Even though insertion with an index is slower, the gain of an index is huge when selecting per key. You may try to do the same with MS SQL, or with another ORM, and I guess you will cry. ;)
Posting another answer, since it uses another solution.
Instead of using a SQLite3 database, I used our TDynArray wrapper, and its sorting and binary search methods.
type
  TEntry = record
    Key: RawUTF8;
    Value: RawUTF8;
  end;
  TEntryDynArray = array of TEntry;

const
  // used to create some fake data, with some multiple occurrences of Key
  COUNT = 1000000;    // million rows insertion !
  UNIQUE_KEY = 1024;  // should be a power of two

procedure Process;
var
  entry: TEntryDynArray;
  entrycount: integer;
  entries: TDynArray;

  procedure DoInsert;
  var
    i: integer;
    rec: TEntry;
  begin
    for i := 0 to COUNT-1 do begin
      // here we fill with some data
      rec.Key := FormatUTF8('KEY%', [i and pred(UNIQUE_KEY)]);
      rec.Value := FormatUTF8('VALUE%', [i]);
      entries.Add(rec);
    end;
  end;

  procedure DoSelect;
  var
    i, j, first, last, total: integer;
    key: RawUTF8;
  begin
    total := 0;
    for i := 0 to pred(UNIQUE_KEY) do begin
      key := FormatUTF8('KEY%', [i]);
      assert(entries.FindAllSorted(key, first, last));
      for j := first to last do
        assert(entry[j].Key = key);
      inc(total, last-first+1);
    end;
    assert(total = COUNT);
  end;
Here are the timing results:
one million rows benchmark:
INSERT 1000000 rows in 215.49ms
SORT ARRAY 1000000 in 192.64ms
SELECT 1000000 rows per Key index in 26.15ms
ten million rows benchmark:
INSERT 10000000 rows in 2.10s
SORT ARRAY 10000000 in 3.06s
SELECT 10000000 rows per Key index in 357.72ms
It is more than 10 times faster than the SQLite3 in-memory solution. The 10 million rows stay in the memory of the Win32 process with no problem.
It is also a good sample of how the TDynArray wrapper works in practice, and of how its SSE4.2-optimized string comparison functions give good results.
Full source code is available in our github repository.
Edit: with 100,000,000 rows (100 million rows), under Win64, using more than 10GB of RAM during the process:
INSERT 100000000 rows in 27.36s
SORT ARRAY 100000000 in 43.14s
SELECT 100000000 rows per Key index in 4.14s
Since this is a one-time task, the fastest way is to load the whole file into memory, scan it line by line, parse the key, compare it with the search key(s), and print (save) the found positions.
UPD: If the source file is sorted (and assume you have 411,000 keys to look up), you can use this trick: sort your search keys in the same order as the source file. Read the first key from both lists and compare them. If they differ, read the next key from the source until they are equal. Save the position; if the next key in the source is also equal, save it too, etc. If the next key differs, read the next key from the search-keys list. Continue until EOF.
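A minimal Delphi sketch of this merge-style scan, assuming both key lists are already sorted and loaded into string arrays (all names below are illustrative, not from the original post):

procedure MergeLookup(const SearchKeys, BigKeys, BigValues: array of string);
var
  i, j, k: Integer;
begin
  i := 0;  // index into the sorted search keys
  j := 0;  // index into the sorted big-file keys
  while (i < Length(SearchKeys)) and (j < Length(BigKeys)) do
    if BigKeys[j] < SearchKeys[i] then
      Inc(j)                      // big-file key too small: advance in the source
    else if BigKeys[j] > SearchKeys[i] then
      Inc(i)                      // no match for this search key: try the next one
    else
    begin
      // match: report every duplicate of this key in the big file
      k := j;
      while (k < Length(BigKeys)) and (BigKeys[k] = SearchKeys[i]) do
      begin
        Writeln(SearchKeys[i], ' -> ', BigValues[k]);
        Inc(k);
      end;
      Inc(i);                     // j stays put, in case SearchKeys also has duplicates
    end;
end;

Each key of both lists is touched once, so after the initial sorts the whole pass is linear.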
Use memory-mapped files. Just treat your file as if it were already read into memory in its entirety, and do that very binary search in memory that you wanted. Let Windows take care of reading the portions of the file as your in-memory search touches them.
https://en.wikipedia.org/wiki/Memory-mapped_file
https://msdn.microsoft.com/en-us/library/ms810613.aspx
https://stackoverflow.com/a/9609448/976391
https://stackoverflow.com/a/726527/976391
http://msdn.microsoft.com/en-us/library/aa366761%28VS.85.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366537.aspx
You may take any of those sources as a start; just do not forget to update them for Win64.
http://torry.net/quicksearchd.php?String=memory+mapped+files&Title=No
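For reference, a minimal sketch of mapping a file read-only with the Win32 API (CreateFile / CreateFileMapping / MapViewOfFile from the Windows unit); an 8 GB file needs a 64-bit build so the whole view fits in the address space:

uses
  Winapi.Windows, System.SysUtils;

procedure ScanMappedFile(const FileName: string);
var
  hFile, hMap: THandle;
  view: PAnsiChar;
begin
  hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if hFile = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  try
    hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
    if hMap = 0 then
      RaiseLastOSError;
    try
      view := MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0); // 0 = map the whole file
      if view = nil then
        RaiseLastOSError;
      try
        // 'view' now points at the first byte of the file: scan it, or binary
        // search the sorted lines, as if the whole file were one big buffer
      finally
        UnmapViewOfFile(view);
      end;
    finally
      CloseHandle(hMap);
    end;
  finally
    CloseHandle(hFile);
  end;
end;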
A method that needs the file to be sorted, but avoids data structures entirely:
You only care about one line, so why even read the bulk of the file?
Open the file and move the "get pointer" (apologies for talking C) halfway through the file. You'll have to figure out whether you're at a number or a word, but a number should be close by. Once you know the closest number, you know whether it's higher or lower than what you want, and you can continue with the binary search.
Idea based on Aleksey Kharlanov's answer (which I accepted).
I am only copying his idea here because he did not elaborate on it (no pseudo-code or deeper analysis of the algorithm). I want to confirm it works before implementing it.
We sort both files (once).
We load the Big file into memory (once).
We read the Small file line by line from disk (once).
Code:
In the code below, sKey is the current key in the Small file and bKey is the current key in the Big file:
LastPos := 0
for sKey in SmallFile do
  for CurPos := LastPos to BigFile.Count do
    if sKey = bKey
    then
      begin
        SearchNext   // search (down) next entries for possible duplicate keys
        LastPos := CurPos
      end
    else
      if sKey < bKey
      then break
It works because I know the last position (in the Big file) of the previous key. The next key can only be somewhere BELOW that last position; ON AVERAGE it should be within the next 440 entries. However, I don't even always have to read 440 entries below LastPos, because if my sKey does not exist in the big file it will be smaller than the current bKey, so I quickly break the inner loop and move on.
Thoughts?
If I were doing this as a one-time thing, I'd create a set with all the keys I need to look up. Then I'd read the file line by line, check whether the key exists in the set, and output the value if so.
Briefly, the algorithm is:
mySet = dictionary of keys to look up
for each line in the file
  key = parse key from line
  if key in mySet
    output key and value
end for
Since Delphi doesn't have a generic set, I'd use TDictionary and ignore the value.
The dictionary lookup is O(1), so should be very fast. Your limiting factor will be file I/O time.
I figure that'd take about 10 minutes to code up, and less than 10 minutes to run.
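A rough Delphi sketch of that approach, assuming the big file holds one "KEY" "VALUE" pair per line; ParseLine is a hypothetical helper that splits the two quoted fields:

uses
  System.SysUtils, System.Classes, System.Generics.Collections;

procedure LookupKeys(const SmallKeys: array of string; const BigFileName: string);
var
  wanted: TDictionary<string, Byte>;  // used as a set; the Byte value is ignored
  reader: TStreamReader;
  line, key, value: string;
  i: Integer;
begin
  wanted := TDictionary<string, Byte>.Create;
  try
    for i := 0 to High(SmallKeys) do
      wanted.AddOrSetValue(SmallKeys[i], 0);
    reader := TStreamReader.Create(BigFileName, TEncoding.UTF8);
    try
      while not reader.EndOfStream do
      begin
        line := reader.ReadLine;
        ParseLine(line, key, value);     // hypothetical: extracts KEY and VALUE
        if wanted.ContainsKey(key) then  // O(1) hash lookup per line
          Writeln(key, ' -> ', value);
      end;
    finally
      reader.Free;
    end;
  finally
    wanted.Free;
  end;
end;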

Create Rows depending on count in Informatica

I am new to the Informatica PowerCenter tool and am working on an assignment.
I have input data in a flat file.
data.csv contains
A,2
B,3
C,2
D,1
And the required output will be:
output.csv should look like
A
A
B
B
B
C
C
D
This means I need to create output rows depending on the value in the second column. I tried it using a Java transformation and got the result.
Is there any other way to do it?
Please help.
Java transformation is a very good approach, but if you insist on an alternative implementation, you can use a helper table and a Joiner transformation.
Create a helper table and populate it with an appropriate number of rows (you need to know the maximum value that may appear in the input file): one row with COUNTER=1, two rows with COUNTER=2, three rows with COUNTER=3, etc.
COUNTER
-------------
1
2
2
3
3
3
4
4
4
4
(...)
Use a Joiner transformation to join the data from the input file and the helper table - since the latter contains multiple rows for a single COUNTER value, the input rows will be multiplied.
Depending on your RDBMS, you may be able to produce the contents of the helper table using a SQL query in a source qualifier.

Rapidminer: Memory issues transforming nominal to binominal attributes

I want to analyze a large dataset (2,000,000 records, 20,000 customer IDs, 6 nominal attributes) using the Generalized Sequential Pattern algorithm.
This requires all attributes, aside from the time and customer ID attributes, to be binominal. Since I have 6 nominal attributes that I want to analyze for patterns, I need to transform them into binominal attributes using the "Nominal to Binominal" operator. This is causing memory problems on my workstation (16GB RAM, of which I allocated 12GB to the Java instance running RapidMiner).
Ideally I would like to set up my project in a way that it writes temporarily to disk, or uses temporary tables in my Oracle database from which my model also reads the data directly. In order to use the "Write Database" or "Update Database" operator, I need to already have an existing table in my database with boolean columns (if I'm not mistaken).
I tried to write the results of the binominal conversion step by step into CSV files on my local disk. I started with the nominal attribute with the fewest distinct values, resulting in a CSV file containing my dataset ID and 7 binominal attributes. I was seriously surprised to see the file size already being >200MB. This is caused by RapidMiner writing the strings "true"/"false" for the binominal values. Wouldn't it be way more efficient just to write 0/1?
Is there a way to either use the Oracle database directly or work with 0/1 values instead of "true"/"false"? My next column would have 3,000 distinct values to be transformed, which would end in a nightmare...
I'd highly appreciate recommendations on how to use memory more efficiently or how to work directly in the database. If anyone knows how to easily transform a varchar2 column in Oracle into boolean columns for each distinct value, that would also be appreciated!
Thanks a lot,
Holger
edit:
My goal is to get from such a structure:
column_a; column_b; customer_ID; timestamp
value_aa; value_ba; 1; 1
value_ab; value_ba; 1; 2
value_ab; value_bb; 1; 3
to this structure:
customer_ID; timestamp; column_a_value_aa; column_a_value_ab; column_b_value_ba; column_b_value_bb
1; 1; 1; 0; 1; 0
1; 2; 0; 1; 1; 0
1; 3; 0; 1; 0; 1
This answer is too long for a comment.
If you have thousands of levels for the six variables you are interested in, then you are unlikely to get useful results using that data. A typical approach is to categorize the data going in, which results in fewer "binominal" variables. For instance, instead of "1 Gallon Whole Milk", you use "dairy products". This can yield more actionable results. Remember, Oracle only allows 1,000 columns in a table, so the database has other limiting factors.
If you are working with lots of individual items, then I would suggest other approaches, notably an approach based on association rules. This will not limit you by the number of variables.
Personally, I find that I can do much of this work in SQL, which is why I wrote a book on the topic ("Data Analysis Using SQL and Excel").
You can use the Nominal to Numeric operator to convert true and false values to 1 or 0. Set the coding type parameter to unique integers.

Any tips to making excel dsum faster?

I am currently using DSUM to calculate some totals and I noticed Excel has become really slow (it needs 2 seconds per cell change).
This is the situation:
- I am trying to calculate 112 DSUMs to show in a chart;
- all DSUMs are queries on a table with 15 columns and 32k+ rows;
- all DSUMs have multiple criteria (5-6 constraints);
- the criteria use both numerical and alpha-numerical constraints;
- I have the source table/range sorted;
- the Excel file is 3.4 MB in size.
(I am using Excel 2007 on a 4-year-old Windows laptop.)
Any ideas on what can be done to make it faster?
...other than reducing the number of DSUMs :P ====>>> already working on that one.
Thanks!
Some options are:
Change Calculation to Manual and press F9 whenever you want to calculate
Try SUMIFS rather than DSUM
Exploit the fact that the data is sorted by using MATCH and COUNTIF to find the first row and count of rows, then use OFFSET or INDEX to get the relevant subset of data to feed to SUMIFS for the remaining constraints
Instead of DSUMs you could also put it all in one or more Pivot tables - and then use GETPIVOTDATA to extract the data you need. Reading the table will take a bit of time (though 32k rows should be done in under 1 second) - and then GETPIVOTDATA is lightning fast!
Downsides:
You need to manually refresh the pivot when you get new data
The pivot(s) need to be laid out so the requested data is shown
File size will increase (unless the Pivot cache is not stored, in which case file loading takes longer)
