The output of my SQL SELECT statement is a list of customer IDs in column A.
In column B I want to randomly assign each customer ID to either bucket A or bucket B to perform A/B testing.
How do I use a random function to generate A or B in a new column in my SELECT statement?
You could use a solution like the one in Generating Random Number In Each Row In Oracle Query, which addresses issues with the use of dbms_random.value.
If you want to get a value that is either 0 or 1, then you can do that like this, assuming your customers are coming from a table named customers:
SELECT customer_id, FLOOR(dbms_random.value + 0.5) AS bucket
FROM customers
The random value is between 0 (inclusive) and 1 (exclusive), so adding 0.5 and taking FLOOR means you will get roughly half 0's and half 1's.
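You can sanity-check the bucketing rule outside the database. Here is a small Python sketch (my own simulation, not Oracle code) that mimics FLOOR(dbms_random.value + 0.5):

```python
import math
import random

def assign_bucket(rnd: float) -> int:
    # Mimic Oracle's FLOOR(dbms_random.value + 0.5):
    # values in [0, 0.5) map to 0, values in [0.5, 1) map to 1.
    return math.floor(rnd + 0.5)

# Check the rough 50/50 split over many draws.
random.seed(42)
buckets = [assign_bucket(random.random()) for _ in range(100_000)]
share_of_ones = sum(buckets) / len(buckets)
```

The boundary at 0.5 is what splits the unit interval into two equal halves, so the two buckets come out balanced in expectation.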
Notes on using dbms_random in Oracle 11g are available in their manual here: https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/d_random.htm
You may need to initialize the seed to guarantee or improve randomness.
And if 0 and 1 are not appropriate, you could then wrap the FLOOR in a CASE or DECODE to turn the numbers 0 and 1 into the letters A and B.
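For example, a sketch using DECODE (assuming the customers table from above):

```sql
SELECT customer_id,
       DECODE(FLOOR(dbms_random.value + 0.5), 0, 'A', 1, 'B') AS bucket
FROM customers;
```

A CASE expression would work the same way and is more portable across databases.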
We have a database with a vast number of tables and columns that was set up by a 3rd party.
Many of these columns are entirely unused. I am trying to create a query that returns a list of all the columns that are actually used (contain > 0 values).
My current attempt -
SELECT table_name, column_name
FROM ALL_TAB_COLUMNS
WHERE OWNER = 'XUSER'
AND num_nulls < 1
;
Using num_nulls < 1 dramatically reduces the number of returned values, as expected.
However, on inspection of some of the tables, there are columns missing from the results of the query that appear to have values in them.
Could anybody explain why this might be the case?
First of all, statistics are not always 100% accurate. They can be gathered on a subset of the table rows, since they are, after all, statistics. Just like pollsters do not have to ask every American how they feel about a given politician, Oracle can get an accurate-enough sense of the data in a table by reading only a portion of it.
Even if the statistics were gathered on 100% of the rows in a table (and they can be gathered that way, if you want), the statistics will become outdated as soon as there are any inserts, updates, or deletes on the table.
Second of all, num_nulls < 1 wouldn't tell you the columns that had no data. Imagine a table with 100 rows and "column X" having num_nulls equal to 80. That would imply the column has 20 non-null values, but would NOT pass your filter. A better approach (if you trust your statistics are not stale and based on a 100% sample of the rows), might be to compare DBA_TAB_COLUMNS.NUM_NULLS < DBA_TABLES.NUM_ROWS. For example, a column that has 99 nulls in a 100 row table has data in 1 row.
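That comparison could be sketched like this (assuming statistics are current and gathered on all rows):

```sql
SELECT c.table_name, c.column_name
FROM dba_tab_columns c
JOIN dba_tables t
  ON t.owner = c.owner
 AND t.table_name = c.table_name
WHERE c.owner = 'XUSER'
  AND c.num_nulls < t.num_rows;  -- at least one non-null value recorded
```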
"there are columns missing from the results of the query that appear to have values in them."
Potentially every non-mandatory column could appear in this set, because it is likely that some rows will have values but not all rows. "Some rows" being greater than zero means such columns won't pass your test for num_nulls < 1.
So maybe you should search for columns which aren't in use. This query will find columns where every row is null:
select t.table_name
, tc.column_name
from user_tables t
join user_tab_cols tc on t.table_name = tc.table_name
where t.num_rows > 0
and t.num_rows = tc.num_nulls;
Note that if you are using Partitioning you will need to scan user_tab_partitions.num_rows and user_part_col_statistics.num_nulls.
Also, I second the advice others have given regarding statistics. The above query may throw out some false positives. I would treat the results generated from that query as a list of candidates to be investigated further. For instance you could generate queries which counted the actual number of nulls for each column.
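One way to generate those verification queries is to have SQL write the SQL for you (a sketch; run the generated statements separately and compare the counts against the statistics):

```sql
SELECT 'SELECT ''' || t.table_name || '.' || tc.column_name
       || ''' AS col, COUNT(' || tc.column_name
       || ') AS non_null_rows FROM ' || t.table_name || ';' AS verify_sql
FROM user_tables t
JOIN user_tab_cols tc ON t.table_name = tc.table_name
WHERE t.num_rows > 0
  AND t.num_rows = tc.num_nulls;
```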
I want to get the row ID or record ID of the last inserted record in a table in Trafodion.
Example:
1 | John
2 | Michael
When executing an INSERT statement, I want it to return the created ID, meaning 3 in this example.
Could anyone tell me how to do that in Trafodion, or is it not possible?
Are you using a sequence generator to generate unique ids for this table? Something like this:
create table idcol (a largeint generated always as identity not null,
b int,
primary key(a desc));
Either way, with or without sequence generator, you could get the highest key with this statement:
select max(a) from idcol;
The problem is that this statement could be very inefficient. Trafodion has a built-in optimization to read the min of a key column, but it doesn't use the same optimization for the max value, because HBase didn't have a reverse scan until recently. We should make use of the reverse scan; please feel free to file a JIRA. To make this more efficient with the current code, I added a DESC to the primary key declaration. With a descending key, getting the max key will be very fast:
explain select max(a) from idcol;
However, having the data grow from higher to lower values might cause issues in HBase, I'm not sure whether this is a problem or not.
Here is yet another solution: Use the Trafodion feature that allows you to select the inserted data, showing you the inserted values right away:
select * from (insert into idcol(b) values (11),(12),(13)) t(a,b);
A B
-------------------- -----------
1 11
2 12
3 13
--- 3 row(s) selected.
I have a table with 3 columns a, b and c. I want to know how to update the value of the third column with the concatenation of the other two columns in each row.
before update
A B c
-------------
1 4
2 5
3 6
after update
A B c
-------------
1 4 1_4
2 5 2_5
3 6 3_6
How can I do this in oracle?
Use the concatenation operator ||:
update mytable set
c = a || '_' || b
Or better, to avoid having to rerun this whenever rows are inserted or updated:
create view myview as
select a, b, a || '_' || b as c
from mytable
Firstly, you are violating the rules of normalization; you should rethink the design. If you already have the values in the table columns, then to get a computed value all you need is a SELECT statement that returns the result the way you want. Storing computed values is generally a bad idea and is considered bad design.
Anyway,
Since you are on 11g, if you really want a computed column, I would suggest a VIRTUAL COLUMN rather than manually updating the column. There is a lot of overhead involved in an UPDATE statement; a virtual column avoids most of it. You also get rid of the manual effort and the extra lines of code to do the update, since Oracle does the job for you.
Of course, you will use the same condition of concatenation in the virtual column clause.
Something like,
Column_c varchar2(50) GENERATED ALWAYS AS (column_a||'_'||column_b) VIRTUAL
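In full, the DDL might look like this (a sketch; table and column names are illustrative, and it assumes c does not already exist as a regular column):

```sql
ALTER TABLE mytable
  ADD (c VARCHAR2(50) GENERATED ALWAYS AS (a || '_' || b) VIRTUAL);
```

The expression is evaluated on read, so the value can never drift out of sync with columns a and b the way a manually maintained column can.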
Note: there are certain restrictions on its use, so please refer to the documentation before implementing it. However, for the simple use case in the question, a virtual column is a straight fit.
Update: I did a small test and made a few observations. Please read this question for a better understanding of how to implement my suggestion.
I have tried this UDF in Hive: UDFRowSequence.
But it's not generating unique values, i.e. it repeats the sequence depending on the number of mappers.
Suppose I have one file (with 4 records) available on HDFS. One mapper will be created for the job and the result will be like
1
2
3
4
but when there are multiple (large) files at the HDFS location, multiple mappers will be created for the job, and each mapper will generate a repeating sequence, like below:
1
2
3
4
1
2
3
4
1
2
.
Is there any solution for this, so that a unique number is generated for each record?
I think you are looking for ROW_NUMBER(). You can read about it and other "windowing" functions here.
Example:
SELECT *, ROW_NUMBER() OVER ()
FROM some_database.some_table
@GoBrewers14: Yes, I did try that. We tried to use the ROW_NUMBER function, and when we query small data, e.g. a file containing 500 rows, it works perfectly. But when it comes to large data, the query runs for a couple of hours and finally fails to generate output.
I have since come to know the following:
Generating a sequential order in a distributed processing query is not possible with simple UDFs, because the approach would require some centralised entity to keep track of the counter. That would also cause severe inefficiency for distributed queries, so it is not a recommended approach.
If you want to work with multiple mappers and a large dataset, try using this UDF: https://github.com/manojkumarvohra/hive-hilo
It makes use of ZooKeeper as a central repository to maintain the state of the sequence.
A query to generate sequences. We can use this as a surrogate key in a dimension table as well:
WITH TEMP AS
(SELECT if(max(seq) IS NULL, 0, max(seq)) max_seq
FROM seq_test)
SELECT col_id,
col_val,
row_number() over() + max_seq AS seq
FROM source_table
INNER JOIN TEMP ON 1 = 1;
seq_test: your target table.
source_table: your source table.
seq: the surrogate key / sequence number / key column.
If I have a table with columns a, b, c, d and a PK B-tree index on (a, b, c) in that order, I want to query like so:
(1)
select b, d from table
where a = :p1
and c = :p2
I.e. the WHERE clause is missing a predicate on the b column that would let the query leverage the index perfectly. Now, the b column can only take one of a few possible values (20 unique), but c (and a) can take many (hundreds of thousands). I figured it would be more efficient to rewrite the query as:
(2)
select /*+USE_NL(table)*/ b, d from table
where a = :p1
and b IN (<allPossibleValues>)
and c = :p2
but I haven't been able to find any oracle documentation that explains how the range scan in (1) works when a non-leading column is missing from the composite index. All the sources seem to only cover the case where the leading column is missing. Those sources suggest using a skip scan like so:
(3)
select /*+INDEX_SS(table <theIndex>)*/ b, d from table
where a = :p1
and c = :p2
Would that work when the missing column is not the leading one but the second one (b)? As I said, all the sources I've found explaining skip scans have the leading column missing. Would query (2) and/or (3) be better than query (1)?
Before starting premature optimization, check the explain plan and the performance of your original query in your production environment.
The Oracle query optimizer is quite good at choosing the correct plan. Depending on the size of your table it will probably choose either a full index scan (I guess the table has way too many rows for this to happen) or an index range scan.
If Oracle fails to choose a plan with good performance, then you can start optimizing.
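To see which plan the optimizer actually chose, you can use DBMS_XPLAN (a sketch; substitute your real table name and bind values):

```sql
EXPLAIN PLAN FOR
SELECT b, d
FROM mytable
WHERE a = :p1
  AND c = :p2;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

If the plan shows INDEX RANGE SCAN on the composite index with an access predicate on a and a filter predicate on c, the index is already being used reasonably well without any hints.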