Storage, computation, and unpacking multiple Boolean values in Oracle and Power BI

In my work, I often need to translate the data I find in the database into Pass/Fail or True/False values. For example, we may want to know if a patient's blood pressure was taken during a particular visit, with the only possible values being "True" or "False". In fact, there are dozens of such fields. Additionally, we actually don't need to do all of these things at every visit. There is another set of matching Boolean values that indicate if something, such as a blood pressure, was required.
I have researched this problem before and came up with little. Oracle does not support a Boolean column datatype (PL/SQL has a BOOLEAN type, but it cannot be stored in a table). However, I recently learned that Oracle DOES support bit-wise operations (AND, OR, XOR, NOT, etc.) on RAW values, and that the best way to store the data and use them is to:
Use the RAW datatype to store a large number of bits. Each bit corresponds to a particular real-world concept such as "Blood Pressure Done" or "Height Measurement Done".
Use the bit-wise functions in the Oracle-supplied UTL_RAW package (BIT_AND, BIT_OR, BIT_XOR, BIT_COMPLEMENT) to combine multiple RAW values and derive results such as "what was required AND done" (see the sketch below).
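For example, assuming a hypothetical table visit_checks with two RAW columns, required_flags and done_flags (names invented purely for illustration), the combination could be as simple as:
SELECT visit_id,
       UTL_RAW.BIT_AND(required_flags, done_flags) AS required_and_done,
       UTL_RAW.BIT_OR(required_flags, done_flags)  AS required_or_done
  FROM visit_checks;
Both columns would need to be RAWs of the same length so the bit positions line up.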
I have not yet decided to go all the way down this rabbit hole. But if I do, I have one challenge, still unsolved: how do I unpack the RAW values into individual truth values, elegantly? Oracle displays RAW values only as hexadecimal, which is not convenient when we want to see the actual bits. It would be nice if I could carry out this operation in SQL for testing purposes. It is also necessary to do this in Power BI so the results are formatted for the customer's needs. I could write a function, if one does not already exist.
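One sketch for testing in plain SQL (not something I have implemented; the table and column names are made up) is to AND the RAW with a single-bit mask and compare the result to an all-zero RAW:
SELECT t.flags,
       CASE
         WHEN UTL_RAW.BIT_AND(t.flags, HEXTORAW('0000000000000001'))
              = HEXTORAW('0000000000000000')
         THEN 0
         ELSE 1
       END AS bit_0   -- 1 if the low-order bit of the last byte is set
  FROM some_table t;
For a RAW of up to 8 bytes, converting via TO_NUMBER(RAWTOHEX(flags), 'XXXXXXXXXXXXXXXX') and then using BITAND should also work, but the mask approach keeps everything in RAW.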
In resolving this challenge, I do not want to increase the size of the solution considerably. I am dealing with millions of rows and want the space savings of the RAW datatype (storing each value as a single bit), while still having an output layer that unpacks the bits into the dozens of True/False columns the customer needs in order to see the details.
I feel like this problem has been present for me since I began working on these kinds of business problems over ten years ago. Certainly, I am not the only analyst who wonders: Has it been solved yet?
Edit 4/28:
I completed the Oracle side of the work. I implemented a package with the following specification (I don't think I can show all the code in the package body without permission from my employer, but feel free to ask for a peek).
With the Oracle side of the project wrapped up, I have yet to figure out how to unpack these RAW values (called 'Binary' in Power BI) into their individual bits within Power BI. I need a visualization that can carry out a "bits to columns" operation on the fly, or something like that.
It would also be nice to be able to aggregate a column of RAWs on a single bit position, so we can, for example, determine what percentage of rows have a particular bit set to 1 without explicitly transforming all the data into columns, one column per bit (see the SQL sketch after the package spec below).
CREATE OR REPLACE PACKAGE bool AS
    byte_count   CONSTANT INTEGER := 8;  -- must be an integer <= 2000
    nibble_count CONSTANT INTEGER := byte_count * 2;

    -- Construct a bitstring with a single '1', or with all zeros (pass 0).
    FUNCTION raw_bitstring ( bit_to_set_to_one_in INTEGER ) RETURN RAW;

    -- Visualize the bits as zeros and ones in a VARCHAR2 field.
    FUNCTION binary_string ( input_in RAW ) RETURN VARCHAR2;

    -- Take an input RAW, set a single bit to zero or one,
    -- and return the altered RAW.
    FUNCTION set_bit ( raw_bitstring_in RAW,
                       bit_loc_in       INTEGER,
                       set_to_in        INTEGER ) RETURN RAW;

    -- Return the value (0 or 1) of the indicated bit as an INTEGER.
    FUNCTION bit_to_integer ( raw_bitstring_in RAW, bit_loc_in INTEGER ) RETURN INTEGER;

    -- Count all the 1's in a RAW and return the count as an INTEGER.
    FUNCTION bit_sum ( raw_bitstring_in IN RAW ) RETURN INTEGER;
END bool;
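With the package in place, both the "bits to columns" layer and the per-bit aggregation can be sketched in plain SQL (visit_checks and done_flags are hypothetical names, and I am assuming whatever bit-position convention my bit_to_integer uses):
-- unpack a few bits into customer-facing columns
SELECT visit_id,
       bool.bit_to_integer(done_flags, 1) AS bp_done,
       bool.bit_to_integer(done_flags, 2) AS height_done,
       bool.binary_string(done_flags)     AS all_bits
  FROM visit_checks;
-- percentage of rows with a particular bit set, without expanding every bit
SELECT AVG(bool.bit_to_integer(done_flags, 1)) * 100 AS pct_bp_done
  FROM visit_checks;
The open question is how to do the equivalent on the Power BI side rather than pushing it into an Oracle view.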

Related

64 bit integer and 64 bit float homogeneous representation

Assume we have some sequence as input. For performance reasons we may want to convert it to a homogeneous representation, and to do that we try to convert everything to the same type. Here let's consider only 2 input types - int64 and float64 (in my simple code I will use numpy and Python; that is not the point of this question - one may think only about 64-bit integers and 64-bit floats).
First we may try to cast everything to float64.
So we want an input like:
31 1.2 -1234
to be converted to float64. If we had all int64 we could leave it unchanged ("already homogeneous"), and if something else were found we would return "not homogeneous". Pretty straightforward.
But here is the problem. Consider a slightly modified input:
31000000 1.2 -1234
The idea is clear - we need to check that our "caster" is able to handle int64 values that are large in absolute value:
format(np.float64(31000000), '.0f') # just convert to float64 and print
'31000000'
Seems like not a problem at all. So lets go to the deal right away:
im = np.iinfo(np.int64).max   # maximum of the int64 type
format(np.float64(im), '.0f')      # -> '9223372036854775808'
format(np.float64(im-100), '.0f')  # -> '9223372036854775808'
Now this is really undesirable - we lose information which may be needed; i.e. we want to preserve all the information provided in the input sequence.
So our im and im-100 values cast to the same float64 representation. The reason is clear - float64 has only a 53-bit significand out of its 64 bits. That precision is enough for log10(2^53) ~= 15.95, i.e. about all 16-digit int64 values can be represented without information loss, but the int64 type holds up to 19 digits.
So we end up with a range of about [10^16; 10^19] (more precisely (2^53; int64.max]) in which an int64 may be represented with information loss.
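To make the boundary concrete (my own worked numbers): 2^53 = 9007199254740992 ~ 9.0*10^15, while int64.max = 2^63 - 1 = 9223372036854775807 ~ 9.2*10^18. Above 2^53 consecutive float64 values are at least 2 apart, and the spacing doubles at every power of two; in the binade [2^62, 2^63) the spacing is 2^(62-52) = 1024, so both im = 2^63 - 1 and im - 100 round to the nearest representable value, 2^63 = 9223372036854775808, exactly as printed above.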
Q: What decision should one make in such a situation in order to represent int64 and float64 homogeneously?
I see several options for now:
1. Just convert the whole int64 range to float64 and "forget" about possible information loss.
The motivation here is "the majority of input will hardly ever contain int64 values > 10^16".
EDIT: This clause was misleading. In the clear formulation we don't consider such solutions (but I leave it here for completeness).
2. Do not make such automatic conversions at all; convert only if explicitly told to.
I.e. we accept the performance drawbacks, for any int-float arrays, even ones as simple as the 1st case.
3. Calculate the threshold below which conversion to float64 is possible without information loss, and use it when making the casting decision. If an int64 above this threshold is found - do not convert (return "not homogeneous").
We have already calculated this threshold: it is 2^53 (about 16 decimal digits).
4. Create a new type, "fint64". This is an exotic option, but I'm considering even this one for completeness.
The motivation consists of 2 points. First: it is a frequent situation that a user wants to store int and float values together. Second: the structure of the float64 type. I don't quite understand why one would need a ~308-digit value range if the significand holds only ~16 of those digits and the other ~292 are essentially noise. So we might use one of the float64 exponent bits to indicate whether a float or an int is stored there. For int64 it would definitely be a drawback to lose 1 bit, because it would halve our integer range, but we would gain the ability to store ints freely alongside floats without any additional overhead.
EDIT: While I initially thought of this as an "exotic" option, it is in fact just a variant of another alternative - a composite type for our representation (see option 5). I should add that my first composition has a definite drawback - losing some range for both float64 and int64. What we could do instead is not subtract 1 bit but add one bit that acts as a flag for whether an int or a float is stored in the following 64 bits.
5. As @Brendan proposed, one may use a composite type consisting of "a combination of 2 or more primitive types". Using additional primitives we could cover the "problem" range of int64, for example, and get a homogeneous representation in this "new" type.
EDITs:
Because questions arose here, I need to be very specific: the application in question does the following - it converts a sequence of int64 or float64 values to some homogeneous representation, losslessly if possible. The solutions are compared by performance (e.g. total extra RAM needed for the representation). That is all. No other requirements are considered here (because we should consider the problem in its minimal state - we are not writing a whole application). Correspondingly, an algorithm that represents our data in a homogeneous state losslessly (so we are sure we have not lost any information) fits our app.
I've decided to remove the words "app" and "user" from the question - they were also misleading.
When choosing a data type there are 3 requirements:
if values may have different signs
needed precision
needed range
Of course hardware doesn't provide a lot of types to choose from, so you'll need to select the next largest provided type. For example, if you want to store values ranging from 0 to 500 with 8 bits of precision, then hardware won't provide anything like that and you will need to use either a 16-bit integer or 32-bit floating point.
When choosing a homogeneous representation there are 3 requirements:
if values may have different signs; determined from the requirements from all of the original types being represented
needed precision; determined from the requirements from all of the original types being represented
needed range; determined from the requirements from all of the original types being represented
For example, if you have integers from -10 to +10000000000 you need a 35-bit integer type that doesn't exist, so you'll use a 64-bit integer; and if you need floating point values from -2 to +2 with 31 bits of precision then you'll need a 33-bit floating point type that doesn't exist, so you'll use a 64-bit floating point type. From the requirements of these two original types you know that a homogeneous representation will need a sign flag, a 33-bit significand (with an implied bit), and a 1-bit exponent; that doesn't exist either, so you'll use a 64-bit floating point type as the homogeneous representation.
However, if you don't know anything about the requirements of the original data types (and only know that whatever the requirements were they led to the selection of a 64-bit integer type and a 64-bit floating point type), then you'll have to assume "worst cases". This leads to needing a homogeneous representation that has a sign flag, 62 bits of precision (plus an implied 1 bit) and an 11-bit exponent (float64's full exponent width). Of course this 74-bit floating point type doesn't exist, so you need to select the next largest type.
Also note that sometimes there is no "next largest type" that hardware supports. When this happens you need to resort to "composed types" - a combination of 2 or more primitive types. This can include anything up to and including "big rational numbers" (numbers represented by 3 big integers in "numerator / divisor * (1 << exponent)" form).
Of course if the original types (the 64-bit integer type and 64-bit floating point type) were primitive types and your homogeneous representation needs to use a "composed type"; then your "for performance reasons we may want to convert it in homogeneous representation" assumption is likely to be false (it's likely that, for performance reasons, you want to avoid using a homogeneous representation).
In other words:
If you don't know anything about the requirements of the original data types, it's likely that, for performance reasons, you want to avoid using a homogeneous representation.
Now...
Let's rephrase your question as "How to deal with design failures (choosing the wrong types which don't meet requirements)?". There is only one answer, and that is to avoid the design failure. Run-time checks (e.g. throwing an exception if the conversion to the homogeneous representation caused precision loss) serve no purpose other than to notify developers of design failures.
It is actually very basic: use 64-bit floating point. Floating point is an approximation, and you will lose precision for many ints. But there are no uncertainties other than "might this originally have been integral" and "does the original value deviate by more than 1.0".
I know of one non-standard floating point representation that would be more powerful (to be found on the net). That might (or might not) help cover the ints.
The only way to have an exact int mapping would be to reduce the int range and guarantee that (say) 60-bit ints are exact, with the remaining range approximated by floating point. Floating point would have to be reduced too, either in exponent range as mentioned, or in precision (the mantissa).

How to get a random integer in BigQuery?

I want to get a random integer between 0 and 9 in BigQuery. I tried the classic
SELECT CAST(10*RAND() AS INT64)
but it's producing numbers between 0 and 10
Adding this question as the results might surprise programmers used to CAST doing a TRUNC in most other languages.
Note this weird distribution of results:
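A quick way to see the skew for yourself (a sketch; the row count is arbitrary):
SELECT CAST(10*RAND() AS INT64) AS n, COUNT(*) AS c
FROM UNNEST(GENERATE_ARRAY(1, 100000))
GROUP BY n
ORDER BY n
-- expect 0 and 10 to each appear about half as often as each of 1 through 9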
Update 2019:
Now you can just do this:
SELECT fhoffa.x.random_int(0,10)
(blog post about persisted UDFs)
To get random integers between 0 and n (9 in this case) you need to FLOOR before CAST:
SELECT CAST(FLOOR(10*RAND()) AS INT64)
This is because the SQL standard doesn't specify whether a CAST to integer should TRUNC or ROUND the float being cast. BigQuery's standard SQL implementation chooses to ROUND, so the classic formula with a CAST won't work as intended. Make sure to FLOOR (or TRUNC) your random number first, and then CAST (to get an INT64 instead of a FLOAT64). A quick check of the rounding behavior is sketched below the standard excerpt.
From the SQL standard:
Whenever an exact or approximate numeric value is assigned to an exact numeric value site, an approximation of its value that preserves leading significant digits after rounding or truncating is represented in the declared type of the target. The value is converted to have the precision and scale of the target. The choice of whether to truncate or round is implementation-defined.
https://github.com/twitter/mysql/blob/master/strings/decimal.c#L42
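A quick, hedged check of the rounding behavior (expected results shown as comments, not verified here):
SELECT CAST(0.5 AS INT64)        AS a,  -- 1, halfway cases round away from zero
       CAST(9.5 AS INT64)        AS b,  -- 10
       CAST(FLOOR(9.5) AS INT64) AS c   -- 9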
Another option would be
SELECT MOD(CAST(10*RAND() AS INT64), 10)

Convert a long character field to numeric, NOT scientific notation (SAS)

I need to join two tables - one table has householdid as CHAR(30), which appears to be center-aligned, and the other has householdid as numeric 20. I need to convert to the numeric 20, but when I do that it appears truncated, perhaps because of the strange alignment (not all of the 30 positions are actually needed).
When I try to keep the full 30 positions as a numeric, I instead get a conversion to scientific notation, so of course this will not work as a key id for later operations.
As long as the number is converted properly, it doesn't matter what format it has. A format just tells SAS how to show you the number. Behind the scenes, it is just a DOUBLE.
1.0 = 1 = 1e0
Now if you have converted to a number and cannot get a join, then look at the informat you used to read it in.
try
num_id = input(strip(char_id),best32.);
Strip removes leading and trailing blanks. The BEST32. INFORMAT tries its "best" to read the number up to 32 characters in length.
You cannot store a 20 digit number as a numeric in SAS. SAS stores all numbers as 8 byte floating point and so does not have enough bits to represent that many digits uniquely. You can ask SAS what is the largest integer it can represent exactly by using the CONSTANT() function.
1 data _null_;
2 x=constant('EXACTINT',8);
3 put x = comma32. ;
4 run;
x=9,007,199,254,740,992
Read and store your 20 and 30 digit strings as character variables.
Use the bestd32. format. Tends to work out pretty well for long key variables. Depending on the length of the variable, you can change 32 to whichever length you need.
Based on the comments under the original question, the only thing you can do is convert all ID fields to strings and use the strings to do the joins. @Reeza suggested this in one of the comments, but it should have been posted as an answer.
I assume you are pulling this information out of another database/system that allows for greater numeric precision than SAS does. If you don't convert the values to strings when they are read into SAS, then you run the risk of losing precision.
If you lose precision, the ID in SAS is likely to become very slightly different from the ID in the original system, which can cause problems when searching the original system for an ID obtained from SAS.
Be sure you don't read the numbers into SAS as numeric and then convert to string. If you do it this way you are still losing precision as soon as the numbers are stored in SAS as numeric variables.

sas treatment of seed when generating random distributions

I needed to generate a Poisson distribution in Excel and found a method (the inverse transform method). I implemented it in Excel and then in SAS (just for fun, so I do not need a quick answer) to compare with the SAS RANPOI function.
Here is my code (which works):
data Poisson(keep=mean Poisson PoissonSas);
  mean=0.2;
  confronta=exp(-mean);
  do obs=1 to 100;
    found=0;
    Poisson=0;
    ranuni=1;
    do until(found=1);
      ranuni=ranuni*ranuni(12547);
      if ranuni<confronta then found=1;
      else Poisson=Poisson+1;
    end;
    PoissonSas=ranpoi(012584,mean);
    output;
  end;
run;
proc means data=Poisson(drop=mean);run;
So I initialized the seed in both random functions to replicate results.
The strange thing is that I get different results depending on whether I submit the data step with both methods or with only one of them (commenting out the other), but I get the same results over and over for each type of submission.
I expected the same results always! Why is this not so?
(I am using sas 9.3)
Thanks!
It looks like SAS is interleaving the calls to the PRNGs as a single stream. Pseudo-random numbers are a sequence of values that are actually deterministic. If you seed and use the sequence in one algorithm, you'll get the same results every time for that algorithm. If you use the sequence alternating between two or more algorithms, the set of algorithms will always yield the same set of results (which seems to be the case for you), but the results for a given algorithm will be different because some of the underlying PRNs it was drawing before are now being used by the other algorithms. This is at the core of the synchronization requirement when using so-called variance reduction techniques based on common random numbers. In general, if you want identical results the solution is to have multiple instances of your PRNG, one for each "source" of randomness in your program, and to seed the multiple sources independently of each other but identically across runs. It looks like you tried to do this, but SAS doesn't behave the way you think it does. According to their documentation, it appears that they produce a single PRN stream based on the first seed entry in your code! This is a subset of one of their examples:
/* This DATA step calls the RANUNI and the RANNOR functions */
/* and produces a single stream of random numbers based on */
/* a seed value of 7. */
data d;
d = ranuni (7); f = ' '; output;
d = ranuni (8); f = ' '; output;
d = rannor (9); f = 'n'; output;
/* they actually have more... */
run;
By the way, your Poisson algorithm is not generally regarded as an inverse transform algorithm. Inversion is 1-to-1, i.e., a single input uniform produces a single random variable. The loop you're performing is actually doing acceptance/rejection, and you use a variable number of uniforms to come up with each Poisson value.
PJS's answer is essentially correct, but a few clarifications.
SAS does indeed use a single seed when you do it the way you did; all of what I'd call 'primitive' random functions work off of one PRNG stream, and only the first seed matters (and only matters the first time it's encountered).
However, RANPOI is a little different - probably because of how SAS creates poissons. It's not made clear in the documentation, but it appears that it uses up two random numbers (not sure if it's always two, or just coincidence). See the following test:
data test;
U=ranuni(7);
P=ranpoi(8,100);
put u= p=;
run;
data test2;
p=ranpoi(8,100);
u=ranuni(7);
put u= p=;
run;
data test3;
u=ranuni(8);
p=ranuni(7);
put u= p=;
run;
data test4;
u=ranuni(7);
p=ranuni(8);
put u= p=;
run;
data test5;
do _t = 1 to 5;
u=ranuni(8);
put u=;
end;
run;
Now, in test4, we see the first two ranuni's when starting with seed 7, and indeed the first one matches the first one from test. However, test3 has the first two starting with seed 8, and the second one does not match the one from test2! test5 shows that in fact the third matches, meaning ranpoi in test2 used up 2 numbers from the stream.
In any event, if you want to change the seed midstream, you have two options.
One is to use CALL RANPOI (and CALL RANUNI), which allow you to store the seed in a variable. Two is to use RAND function, which works with CALL STREAMINIT to set seeds whenever you want to. The RAND function is considered 'better' than the more primitive RANPOI and such - it uses a better PRNG algorithm.

Can dbms_utility.get_time rollover?

I'm having problems with a mammoth legacy PL/SQL procedure which has the following logic:
l_elapsed := dbms_utility.get_time - l_timestamp;
where l_elapsed and l_timestamp are of type PLS_INTEGER and l_timestamp holds the result of a previous call to get_time
This line suddenly started failing during a batch run with an ORA-01426: numeric overflow
The documentation on get_time is a bit vague, possibly deliberately so, but it strongly suggests that the return value has no absolute significance, and can be pretty much any numeric value. So I was suspicious to see it being assigned to a PLS_INTEGER, which can only support 32 bit integers. However, the interweb is replete with examples of people doing exactly this kind of thing.
The smoking gun is found when I invoke get_time manually: it is returning a value of -214512572, which is suspiciously close to the min value of a 32-bit signed integer. I'm wondering if, during the time elapsed between the first call to get_time and the next, Oracle's internal counter rolled over from its max value to its min value, resulting in an overflow when trying to subtract one from the other.
Is this a likely explanation? If so, is this an inherent flaw in the get_time function? I could just wait and see if the batch fails again tonight, but I'm keen to get an explanation for this behaviour before then.
Maybe late, but this may benefit someone searching on the same question.
The underlying implementation is a simple 32 bit binary counter, which is incremented every 100th of a second, starting from when the database was last started.
This binary counter is being mapped onto a PL/SQL BINARY_INTEGER type - a signed 32-bit integer (there is no sign of it being changed to 64-bit on 64-bit machines).
So, presuming the clock starts at zero, it will hit the +ve integer limit after about 248 days, and then flip over to a large -ve value and climb back up towards zero.
The good news is that provided both numbers are the same sign, you can do a simple subtraction to find duration - otherwise you can use the 32-bit remainder.
IF SIGN(:now) = SIGN(:then) THEN
RETURN :now - :then;
ELSE
RETURN MOD(:now - :then + POWER(2,32),POWER(2,32));
END IF;
Edit: This code will blow the int limit and fail if the gap between the times is too large (248 days), but you shouldn't be using GET_TIME to compare durations measured in days anyway (see below).
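As a worked check of the wrap-around branch (my own numbers, not from the original post): if :then = 2147483000 just before the counter flips and :now = -2147483000 just after, then :now - :then = -4294966000, and MOD(-4294966000 + 4294967296, 4294967296) = 1296, i.e. 12.96 seconds, which is what you would expect.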
Lastly - there's the question of why you would ever use GET_TIME.
Historically, it was the only way to get a sub-second time, but since the introduction of SYSTIMESTAMP, the only reason you would ever use GET_TIME is because it's fast - it is a simple mapping of a 32-bit counter, with no real type conversion, and doesn't make any hit on the underlying OS clock functions (SYSTIMESTAMP seems to).
As it only measures relative time, its only use is for measuring the duration between two points. For any task that takes a significant amount of time (you know, over 1/1000th of a second or so) the cost of using a timestamp instead is insignificant.
The number of occasions where it is actually useful is minimal (the only one I've found is checking the age of data in a cache, where doing a clock hit for every access becomes significant).
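For comparison, a minimal sketch of timing a block with SYSTIMESTAMP instead (assuming DBMS_OUTPUT is enabled; the work being timed is just a placeholder):
DECLARE
  t0      TIMESTAMP WITH TIME ZONE := SYSTIMESTAMP;
  elapsed INTERVAL DAY TO SECOND;
BEGIN
  -- ... the work being timed goes here ...
  elapsed := SYSTIMESTAMP - t0;
  DBMS_OUTPUT.PUT_LINE('Elapsed: ' || elapsed);
END;
/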
From the 10g doc:
Numbers are returned in the range -2147483648 to 2147483647 depending on platform and machine, and your application must take the sign of the number into account in determining the interval. For instance, in the case of two negative numbers, application logic must allow that the first (earlier) number will be larger than the second (later) number which is closer to zero. By the same token, your application should also allow that the first (earlier) number be negative and the second (later) number be positive.
So while it is safe to assign the result of dbms_utility.get_time to a PLS_INTEGER it is theoretically possible (however unlikely) to have an overflow during the execution of your batch run. The difference between the two values would then be greater than 2^31.
If your job takes a lot of time (therefore increasing the chance that the overflow will happen), you may want to switch to a TIMESTAMP datatype.
Assigning a negative value to your PLS_INTEGER variable does raise an ORA-01426:
SQL> l
1 declare
2 a pls_integer;
3 begin
4 a := -power(2,33);
5* end;
SQL> /
declare
*
ERROR at line 1:
ORA-01426: numeric overflow
ORA-06512: at line 4
However, you seem to suggest that -214512572 is close to -2^31, but it's not, unless you forgot to type a digit. Are we looking at a smoking gun?
Regards,
Rob.
