Randomly sampling data in Denodo

What is the Denodo (VQL) analog of SQL Server's newid() function? My ultimate goal is to randomly sample data in Denodo.

You can try the encrypt function. It returns a GUID-like value for its input, and it generates different values for the same input.

You can get a random sample of a query by combining RAND() with a LIMIT clause:
SELECT *, RAND() AS rnd
FROM table
ORDER BY rnd
LIMIT 100;
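To make the idea behind that query concrete, here is a minimal sketch in Python rather than VQL. It attaches a random key to every row, sorts by that key, and keeps the first n rows. The function name sample_by_random_key and the toy data are made up for illustration; this says nothing about how Denodo executes the query.

```python
import random

def sample_by_random_key(rows, n, rng=None):
    """Return n rows chosen by sorting on a per-row random key."""
    rng = rng or random.Random()
    # Pair every row with its own random number (the "rnd" column).
    keyed = [(rng.random(), row) for row in rows]
    # ORDER BY rnd
    keyed.sort(key=lambda pair: pair[0])
    # LIMIT n
    return [row for _, row in keyed[:n]]

rows = list(range(1000))  # hypothetical stand-in for the table
sample = sample_by_random_key(rows, 100, random.Random(42))
print(len(sample))  # 100
```

Note that every row gets a key and the whole set is sorted, so this approach can be expensive on large tables.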

Related

Can different set of records be randomly generated with the same seed

I have 2 questions related to the package dbms_random in Oracle:
I wonder if there is a difference between initialize and seed in the dbms_random package.
I need to randomly select a fixed number of records from a larger set, and I would like the selected records to remain stable until I change the seed.
After reviewing the records, is there a way to generate a different set of records but still keep the seed? Is there a function to reset or reinitialize the package with the same seed?
Below is my sample code for illustration. Can I generate a different set of records but still keep the original seed?
begin
--dbms_random.initialize(100);
dbms_random.seed(100);
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.value)
WHERE rownum < 21;
end;
"I wonder if there is a difference between initialize and seed in the dbms_random package"
Yes, there is a difference. initialize() is deprecated and is only retained in DBMS_RANDOM for backwards compatibility. All new code should use seed(). This is covered in the Oracle documentation.
"is there a way to generate a different set of records but still keep the seed"
Not with the code sample you posted. That will set the seed each time, which will give you the same set of records each time. To get multiple sets of different records spawned from the same seed you need to separate the seed setting from the record generation.
Here is a demonstration on db<>fiddle
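The principle of separating the seed setting from the record generation can be sketched in Python as an analogy (this is not DBMS_RANDOM, and the names population and draw_sets are invented for the example). The generator is seeded once, then several samples are drawn from it: the samples differ from each other, yet the whole sequence repeats whenever the same seed is used.

```python
import random

population = list(range(1000))  # hypothetical stand-in for the table's rows

def draw_sets(seed, how_many, size=20):
    """Seed once, then draw several samples from the same generator."""
    rng = random.Random(seed)  # the seed is set a single time
    return [rng.sample(population, size) for _ in range(how_many)]

first_run = draw_sets(100, 3)
second_run = draw_sets(100, 3)

# Within one run, consecutive sets differ from each other.
print(first_run[0] != first_run[1])
# Re-running with the same seed reproduces the whole sequence of sets.
print(first_run == second_run)
```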

Want to create serial numbers

I want to generate serial numbers.
For example, I have
NID
-----
ABD90
BGJ89
HSA76
and I want,
ID NID
---------
1 ABD90
2 BGJ89
3 HSA76
What code should I run for this outcome?
Please help me.
Since you tagged SAS, I'll answer with SAS.
Based on your question, getting that result from that input would be as simple as this
data result;
ID=_N_;
set input;
run;
or
proc sql;
select monotonic() as ID
,NID
from input
;
quit;
In pure Oracle you would do this
select rownum, NID
from input
However, you might want to add an ORDER BY, because without one the row order, and therefore the numbering, is not guaranteed to be the same every time you run it.

Import blob through SAS from ORACLE DB

Good day, everyone.
I ran into a big problem at work last week.
Here is the deal:
I need to download an Excel file (BLOB) from an Oracle database through SAS.
I am using:
As a first step, I need to get the data from Oracle. I used this construction (the BLOB is nearly 100 KB):
proc sql;
connect to oracle;
create table SASTBL as
select * from connection to oracle (
select dbms_lob.substr(myblobfield,1,32767) as blob_1,
dbms_lob.substr(myblobfield,32768,32767) as blob_2,
dbms_lob.substr(myblobfield,65535,32767) as blob_3,
dbms_lob.substr(myblobfield,97302,32767) as blob_4
from my_tbl;
);
quit;
And the result is:
blob_1 = 70020202020202...02
blob_2 = 02020202020...02
blob_3 = 02020202...02
I do not understand why the field consists of "02" (for the whole file).
Also, the length of each variable in SAS is 1024 (instead of 37767), with the $HEX2024 format.
If I take
dbms_lob.substr(my_blob_field,2000,900) from the same object, the result is much closer to the truth:
blob = "A234ABC4536AE7...."
The question is: how can I get binary data from a BLOB field correctly through SAS? What is my mistake?
Thank you.
EDIT 1:
I now get the data, but the maximum string length is 2000 kb.
Use the DBMAX_TEXT option on the CONNECT statement (or a LIBNAME statement) to get up to 32,767 characters. The default is probably 1024.
PROC SQL uses SQL to interact with SAS datasets (create tables, query tables, aggregate data, connect externally, etc.). The procedure mostly follows the ANSI standard with a few SAS-specific extensions. Every RDBMS extends ANSI SQL, and Oracle does so with, for example, its XML handling, such as saving content in a BLOB column. Possibly, SAS cannot properly read the Oracle-specific (non-ANSI) binary large object type. Typically, SAS processes string, numeric, datetime, and a few other types.
As an alternative, consider saving the XML content from Oracle externally as an .xml file and using SAS's XML engine to read the content into a SAS dataset:
** STORING XML CONTENT;
libname tempdata xml 'C:\Path\To\XML\File.xml';
** APPEND CONTENT TO SAS DATASET;
data Work.XMLData;
set tempdata.NodeName; /* CHANGE TO REPEAT PARENT NODE OF XML. */
run;
Adding this as another answer because I can't comment yet. The issue you experienced is that dbms_lob.substr actually returns a varchar, so SAS limits it to 2,000. To avoid this, you could wrap it in to_clob( ... ) and set the DBMAX_TEXT option as previously answered.
Another alternative is below...
The code below is an effective method for retrieving a single record with a large CLOB. Instead of calculating how many fields to split the CLOB into, which results in a very wide record, it splits the CLOB into multiple rows. See the expected output at the bottom.
Disclaimer: although effective, it may not be efficient, i.e. it may not scale well to multiple rows; the generally accepted approach in that case is row pipelining in PL/SQL. That said, the code below got me out of a pinch, and it may help if you can't create a procedure.
PROC SQL;
connect to oracle (authdomain=YOUR_Auth path=devdb DBMAX_TEXT=32767 );
create table clob_chunks (compress=yes) as
select *
from connection to Oracle (
SELECT id
, key
, level clob_order
, regexp_substr(clob_value, '.{1,32767}', 1, level, 'n') clob_chunk
FROM (
SELECT id, key, clob_value
FROM schema.table
WHERE id = 123
)
CONNECT BY LEVEL <= regexp_count(clob_value, '.{1,32767}',1,'n')
)
order by id, key, clob_order;
disconnect from oracle;
QUIT;
Expected output:
ID  KEY  CHUNK  CLOB
1   1    1      short_clob
2   2    1      long clob chunk1of3
2   2    2      long clob chunk2of3
2   2    3      long clob chunk3of3
3   3    1      another_short_one
Explanation:
DBMAX_TEXT tells SAS to raise the default length of 1024 for a CLOB field.
The regex .{1,32767} tells Oracle to match at least one but no more than 32767 characters. This splits the input into chunks and also captures the last chunk, which is likely to be shorter than 32767 characters.
The regexp_substr pulls a chunk from the CLOB (param1), searching from the start of the CLOB (param2), skipping to the 'level'th occurrence (param3), and treating the CLOB as one large string (param4 'n').
The connect by re-runs the regex to count the chunks, which stops level from incrementing beyond the end of the CLOB.
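The chunking behaviour of that regex can be tried out on a small scale. The sketch below uses Python's re module instead of Oracle's regexp functions, with a chunk size of 4 instead of 32767 so the result is easy to read. The helper name split_into_chunks is made up, and re.DOTALL is used here as the counterpart of Oracle's 'n' match parameter.

```python
import re

def split_into_chunks(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size characters."""
    # re.DOTALL lets "." match newlines too, so line breaks do not end a chunk.
    pattern = re.compile(".{1,%d}" % chunk_size, re.DOTALL)
    return pattern.findall(text)

chunks = split_into_chunks("abcdefghij", 4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

The last chunk is shorter than the others, and joining the chunks in order gives back the original string.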
References:
SAS KB article for DBMAX_TEXT
Oracle docs for REGEXP_COUNT
Oracle docs for REGEXP_SUBSTR
Oracle regex syntax
Stackoverflow example of regex splitting

generating unique ids in hive

I have been trying to generate unique ids for each row of a table (30 million+ rows).
Using sequential numbers obviously does not work due to the parallel nature of Hadoop.
The built-in UDFs rand() and hash(rand(),unixtime()) seem to generate collisions.
There has to be a simple way to generate row ids, and I was wondering if anyone has a solution.
My next step is just creating a Java MapReduce job to generate a real hash string with a secure random + host IP + current time as a seed, but I figured I'd ask here before doing it ;)
Use the reflect UDF to generate UUIDs.
reflect("java.util.UUID", "randomUUID")
Update (2019)
For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:
create table customer (
id bigint default surrogate_key(),
name string,
city string,
primary key (id) disable novalidate
);
To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:
-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;
Not sure if this is all that helpful, but here goes...
Consider the native MapReduce analog: assuming your input data set is text based, the input Mapper's key (and hence unique ID) would be, for each line, the name of the file plus its byte offset.
When you are loading the data into Hive, if you can create an extra 'column' that has this info, you get your rowID for free. It's semantically meaningless, but so too is the approach you mention above.
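As a rough illustration of why file name plus byte offset is unique per line, here is a small Python sketch (not MapReduce or Hive code). The function row_keys and the file name part-0000.txt are hypothetical.

```python
def row_keys(file_name, data):
    """Yield (key, line) pairs where key is 'file_name:byte_offset'."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield f"{file_name}:{offset}", line
        offset += len(line)  # data is bytes, so len() counts bytes

data = b"a,1\nbb,2\nccc,3\n"
keys = [key for key, _ in row_keys("part-0000.txt", data)]
print(keys)  # ['part-0000.txt:0', 'part-0000.txt:4', 'part-0000.txt:9']
```

Two lines can only share a key if they start at the same byte of the same file, which cannot happen.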
Elaborating on the answer by jtravaglini:
since 0.8.0 there are two built-in Hive virtual columns that can be used to generate a unique identifier:
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
Use like this:
select
concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey,
...
;
...
OK
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342
...
Or you can anonymize that with md5 or similar. Here's a link to an md5 UDF:
https://gist.github.com/dataminelab/1050002
(note the function class name is initcap 'Md5')
select
Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey,
...
Use the ROW_NUMBER function to generate monotonically increasing integer ids.
select ROW_NUMBER() OVER () AS id from t1;
See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html
reflect("java.util.UUID", "randomUUID")
I could not vote up the other one. I needed a pure binary version, so I used this:
unhex(regexp_replace(reflect('java.util.UUID','randomUUID'), '-', ''))
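For comparison, the same transformation (strip the hyphens, then decode the hex digits) looks like this in Python. This is only an illustration of what the expression above computes, not Hive code.

```python
import uuid

u = uuid.uuid4()
# Strip the hyphens and decode the 32 hex digits into 16 raw bytes.
raw = bytes.fromhex(str(u).replace("-", ""))
print(len(raw))  # 16
```

The binary form takes 16 bytes instead of the 36 characters of the textual UUID.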
Depending on the nature of your jobs and how frequently you plan on running them, using sequential numbers may actually be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.
Write a custom Mapper that keeps a counter for every Map task and builds each row's ID by concatenating JobID() (as obtained from the MR API) with the current value of the counter. Increment the counter before the next row is examined.
If you want to work with multiple mappers and a large dataset, try using this UDF: https://github.com/manojkumarvohra/hive-hilo
It uses ZooKeeper as a central repository to maintain the state of the sequence and generates unique, incrementing numeric values.

Sampling in oracle

I'm trying to take a sample from an insurance claims database.
For example, a 20% random sample from 1 million claims where provider type is '25' and year is '2012'. The data is in SQL Developer. I am a statistician with basic SQL knowledge.
You can use SAMPLE to get a random set of rows from a table.
SELECT *
FROM claim SAMPLE(20)
WHERE type ='25'
AND year = 2012;
Oracle SQL has a SAMPLE clause built in. Example:
SELECT * FROM emp SAMPLE(25)
means each row in emp has a 25% chance of being included in the result set. Note: this does not mean that exactly 25% of the rows are necessarily selected.
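The difference between "each row has a 25% chance" and "exactly 25% of the rows" can be seen in a small Python simulation. This is only a sketch of per-row (Bernoulli) sampling with an invented function name, bernoulli_sample, and is not a description of how Oracle implements SAMPLE internally.

```python
import random

def bernoulli_sample(rows, percent, rng):
    """Keep each row independently with probability percent / 100."""
    return [row for row in rows if rng.random() < percent / 100.0]

rows = list(range(10_000))  # hypothetical stand-in for the emp table
sample = bernoulli_sample(rows, 25, random.Random(0))
# The size is close to, but usually not exactly, 25% of the rows.
print(len(sample))
```

If an exact sample size is required, a different technique is needed, such as ordering by a random value and limiting the row count.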
This blog is a quick read with more details on sampling.
With this you get a single row from a random sample:
SELECT * FROM TABLE# SAMPLE(10)
FETCH NEXT 1 ROWS ONLY
