CQL3 change primary key - sorting

Suppose I have a table of the following composition.
CREATE TABLE rollup (
    hashname text,
    key text,
    day timestamp,
    counter_value counter,
    PRIMARY KEY (hashname, key, day)
) WITH
...
I want to run a query that looks like
SELECT * FROM rollup WHERE hashname='some_namespaced_hash' AND day>='2013-07-15' AND day<='2013-07-25';
However, this doesn't work, because (I think) the following restriction also applies to >, <, and the other range operators:
Composite keys means it now makes sense for CQL to sport the ORDER BY syntax in SELECT queries as well, but it’s still not nearly as flexible as you might be used to, doing ad-hoc queries in SQL. ORDER BY clauses can only select a single column, and that column has to be the second column in a composite PRIMARY KEY. This holds even for tables with more than 2 column components in the primary key
and here, day is the third column in the compound primary key. The only way I can figure to do this is to change the compound primary key to PRIMARY KEY (hashname, day, key). I can't find any documentation that tells me how to do this. Is it possible?
Alternatively, am I missing the "correct" way to solve this problem/am I misinterpreting the problem?

You are right; the only way is to switch the order of the columns in your primary key. The reason is that, currently, your keys are stored in this order:
hashname1 key1 day1: counter_value111
hashname1 key1 day2: counter_value112
hashname1 key2 day1: counter_value121
hashname1 key2 day2: counter_value122
hashname1 key3 day1: counter_value131
hashname1 key3 day2: counter_value132
so to retrieve a range of days, Cassandra would need to 'skip' for each key. This is inefficient and not supported. You need to have them in this order:
hashname1 day1 key1: counter_value111
hashname1 day1 key2: counter_value121
hashname1 day1 key3: counter_value131
hashname1 day2 key1: counter_value112
hashname1 day2 key2: counter_value122
hashname1 day2 key3: counter_value132
Unfortunately the only way to do this is to rewrite all your data; there is no built-in way of doing this in Cassandra. You could write a script that reads the data from one column family (CF), switches the key order, and writes it out to a new CF, then switch over. If this needs to be done live it's harder, but still possible.
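For illustration, a minimal sketch of such a migration, assuming a new table named rollup_by_day (the name is hypothetical). Note that counter columns cannot be written with a plain INSERT; each old value has to be re-applied as an increment:
CREATE TABLE rollup_by_day (
    hashname text,
    day timestamp,
    key text,
    counter_value counter,
    PRIMARY KEY (hashname, day, key)
);
-- For every row read from the old table, re-apply its counter value
-- (42 stands in for the value read from rollup):
UPDATE rollup_by_day
SET counter_value = counter_value + 42
WHERE hashname = 'some_namespaced_hash' AND day = '2013-07-15' AND key = 'some_key';
After the copy, the original day-range query works against the new table.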

Related

Check multiple conditions with a LIKE statement

I want to check two columns to see whether the string "done" or "completed" appears in either of them. At the moment I write:
case when col1 like '%done%'
       or col1 like '%completed%'
       or col2 like '%done%'
       or col2 like '%completed%'
     then 'done'
end as status
This is a sample, so it's only four conditions; however, when there are many strings to check, it takes a lot of effort to write code like this. Can we write something shorter that gives the same result?
(col1||"-"||col2) like (%done%,%completed%)
I know the above code is not possible, but is there any alternative?
If you have a lengthy list, create a table with one column and insert a row for each keyword. Then write your query to join to this table using the LIKE operator:
SELECT DISTINCT m.*
FROM maintable m,
keyword_table k
WHERE m.col1 LIKE '%'||k.keyword||'%'
Just be aware that this will not perform well in bulk (millions of rows). That would require more advanced techniques, but for small tables this is fine.
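A minimal end-to-end sketch of that approach (the table and column names are assumptions); concatenating both columns lets a single LIKE cover them:
CREATE TABLE keyword_table (keyword varchar(50));
INSERT INTO keyword_table VALUES ('done');
INSERT INTO keyword_table VALUES ('completed');

-- The '-' separator prevents accidental matches across the column boundary.
SELECT DISTINCT m.*
FROM maintable m,
     keyword_table k
WHERE m.col1 || '-' || m.col2 LIKE '%' || k.keyword || '%';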

Get the last inserted row ID in Trafodion

I want to get the row ID or record ID of the last inserted record in a table in Trafodion.
Example:
1 | John
2 | Michael
When executing an INSERT statement, I want it to return the newly created ID, i.e. 3.
Could anyone tell me how to do that in Trafodion, or is it not possible?
Are you using a sequence generator to generate unique ids for this table? Something like this:
create table idcol (a largeint generated always as identity not null,
                    b int,
                    primary key(a desc));
Either way, with or without sequence generator, you could get the highest key with this statement:
select max(a) from idcol;
The problem is that this statement could be very inefficient. Trafodion has a built-in optimization to read the min of a key column, but it doesn't use the same optimization for the max value, because HBase didn't have a reverse scan until recently. We should make use of the reverse scan; please feel free to file a JIRA. To make this more efficient with the current code, I added DESC to the primary key declaration above. With a descending key, getting the max key will be very fast:
explain select max(a) from idcol;
However, having the data grow from higher to lower values might cause issues in HBase; I'm not sure whether this is a problem or not.
Here is yet another solution: Use the Trafodion feature that allows you to select the inserted data, showing you the inserted values right away:
select * from (insert into idcol(b) values (11),(12),(13)) t(a,b);
A                    B
-------------------- -----------
                   1          11
                   2          12
                   3          13
--- 3 row(s) selected.
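Using the same feature, the generated ID of a single inserted row can be selected directly; a sketch based on the idcol table above:
select a from (insert into idcol(b) values (14)) t(a, b);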

How can I use a variable to pick which column is selected?

How can we use a variable as a column name?
In my table, the days (MONDAY, TUESDAY, ...) are the column names.
I want to pick the day dynamically and use it as the column name in my query.
My query :
SELECT EMP FROM SCHEDULE WHERE "DAY"(Dynamically I want) =1;
You simply can't use variables to change the actual text of a query. Variables can be used only in place of literal values (dates, strings, times, numbers); they can't change the text of the command itself.
The technical reason is that (oversimplifying things) Oracle FIRST parses the text and establishes an execution plan, and only after that considers the values of the variables. More or less, you can think of it (this is just an analogy, of course; it is not really the same thing!) as Oracle "compiling" the query the way a C++ compiler compiles the source code of a function: it is not possible to pass the function a variable that modifies the text of the function itself.
What you have to do is rethink your approach, taking into consideration what I just said:
SELECT EMP FROM SCHEDULE
WHERE
(case :DAY_I_WANT
    WHEN 'MONDAY' then -- 'MONDAY' is the string value of the variable :DAY_I_WANT
        monday         -- monday, here, is the column whose value I want CASE to return
    WHEN 'TUESDAY' then tuesday
    ...
    WHEN 'SUNDAY' then sunday
 end) = 1
Keep in mind that this solution will not take advantage of any index on the MONDAY..SUNDAY columns. The best approach would be to create a different data structure with a separate row for each day and a proper day-of-week column. If you do this, you will be able to write:
select EMP from schedule
where schedule.DAY = :DAY_I_WANT
and it will allow you to create an index on the DAY column, speeding up searches.
Having a separate column for each day is just asking for trouble.
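For illustration, a minimal sketch of that normalized structure (the column names and types are assumptions):
create table schedule (
    emp varchar2(50),
    day varchar2(9)   -- 'MONDAY' .. 'SUNDAY'; one row per employee per scheduled day
);
create index schedule_day_idx on schedule(day);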

Generating unique IDs in Hive

I have been trying to generate unique ids for each row of a table (30 million+ rows).
Using sequential numbers obviously does not work due to the parallel nature of Hadoop.
The built-in UDFs rand() and hash(rand(), unixtime()) seem to generate collisions.
There has to be a simple way to generate row ids, and I was wondering if anyone has a solution.
My next step would be to create a Java MapReduce job that generates a real hash string with a secure random + host IP + current time as a seed, but I figured I'd ask here before doing it ;)
Use the reflect UDF to generate UUIDs.
reflect("java.util.UUID", "randomUUID")
Update (2019)
For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF that you can use to generate unique values, which will be far more performant than UUID strings. Documentation is still a bit sparse, but here is one example:
create table customer (
    id bigint default surrogate_key(),
    name string,
    city string,
    primary key (id) disable novalidate
);
To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:
-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;
Not sure if this is all that helpful, but here goes...
Consider the native MapReduce analog: assuming your input data set is text based, the input Mapper's key (and hence unique ID) would be, for each line, the name of the file plus its byte offset.
When you are loading the data into Hive, if you can create an extra 'column' that has this info, you get your rowID for free. It's semantically meaningless, but so too is the approach you mention above.
Elaborating on the answer by jtravaglini: since 0.8.0 there are two built-in Hive virtual columns that can be used to generate a unique identifier:
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
Use them like this:
select
concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey,
...
;
...
OK
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342
...
Or you can anonymize that with md5 or similar; here's a link to an md5 UDF:
https://gist.github.com/dataminelab/1050002
(note the function class name is initcap 'Md5')
select
Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey,
...
Use the ROW_NUMBER function to generate monotonically increasing integer ids.
select ROW_NUMBER() OVER () AS id from t1;
See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html
reflect("java.util.UUID", "randomUUID")
I could not vote up the other one. I needed a pure binary version, so I used this:
unhex(regexp_replace(reflect('java.util.UUID','randomUUID'), '-', ''))
Depending on the nature of your jobs and how frequently you plan on running them, using sequential numbers may actually be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.
Write a custom Mapper that keeps a counter for every map task and creates, as the row ID for each row, the concatenation of the JobID() (as obtained from the MR API) and the current value of the counter. Before the next row is examined, increment the counter.
If you want to work with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo
It uses ZooKeeper as a central repository to maintain the sequence state and generates unique, incrementing numeric values.

How can I constrain multiple columns to prevent duplicates, but ignore null values?

Here's a little experiment I ran in an Oracle database (10g). Aside from (Oracle's) implementation convenience, I can't figure out why some insertions are accepted and others rejected.
create table sandbox(a number(10,0), b number(10,0));
create unique index sandbox_idx on sandbox(a,b);
insert into sandbox values (1,1); -- accepted
insert into sandbox values (1,2); -- accepted
insert into sandbox values (1,1); -- rejected
insert into sandbox values (1,null); -- accepted
insert into sandbox values (2,null); -- accepted
insert into sandbox values (1,null); -- rejected
insert into sandbox values (null,1); -- accepted
insert into sandbox values (null,2); -- accepted
insert into sandbox values (null,1); -- rejected
insert into sandbox values (null,null); -- accepted
insert into sandbox values (null,null); -- accepted
Assuming that it makes sense to occasionally have some rows with some column values unknown, I can think of two possible use cases involving preventing duplicates:
1. I want to reject duplicates, but accept when any constrained column's value is unknown.
2. I want to reject duplicates, even in cases when a constrained column's value is unknown.
Apparently Oracle implements something different though:
3. Reject duplicates, but accept (only) when all constrained column values are unknown.
I can think of ways to make use of Oracle's implementation to get to use case (2) -- for example, have a special value for "unknown", and make the columns non-nullable. But I can't figure out how to get to use case (1).
In other words, how can I get Oracle to act like this?
create table sandbox(a number(10,0), b number(10,0));
create unique index sandbox_idx on sandbox(a,b);
insert into sandbox values (1,1); -- accepted
insert into sandbox values (1,2); -- accepted
insert into sandbox values (1,1); -- rejected
insert into sandbox values (1,null); -- accepted
insert into sandbox values (2,null); -- accepted
insert into sandbox values (1,null); -- accepted
insert into sandbox values (null,1); -- accepted
insert into sandbox values (null,2); -- accepted
insert into sandbox values (null,1); -- accepted
insert into sandbox values (null,null); -- accepted
insert into sandbox values (null,null); -- accepted
Try a function-based index:
create unique index sandbox_idx on sandbox(CASE WHEN a IS NULL THEN NULL WHEN b IS NULL THEN NULL ELSE a||','||b END);
There are other ways to skin this cat, but this is one of them.
create unique index sandbox_idx on sandbox
  (case when a is null or b is null then null else a end,
   case when a is null or b is null then null else b end);
A function-based index! Basically I just needed to make sure all the tuples I want to ignore (i.e. accept) get translated to all nulls. Ugly, but not butt ugly. Works as desired.
Figured it out with the help of a solution to another question: How to constrain a database table so only one row can have a particular value in a column?
So go there and give Tony Andrews points too. :)
I'm not an Oracle guy, but here's an idea that should work, if you can include a computed column in an index in Oracle.
Add an additional column to your table (and your UNIQUE index) that is computed as follows: it's NULL if both a and b are non-NULL, and it's the table's primary key otherwise. I call this additional column "nullbuster" for obvious reasons.
alter table sandbox add nullbuster as
  case when a is null or b is null then pk else null end;
create unique index sandbox_idx on sandbox(a,b,nullbuster);
I gave this example a number of times around 2002 or so in the Usenet group microsoft.public.sqlserver.programming. You can find the discussions if you search groups.google.com for the word "nullbuster". The fact that you're using Oracle shouldn't matter much.
P.S. In SQL Server, this solution is pretty much superseded by filtered indexes:
create unique index sandbox_idx on sandbox(a,b)
where a is not null and b is not null;
The thread you referenced suggests that Oracle doesn't give you this option. Does it also not have the possibility of an indexed view, which is another alternative?
create view sandbox_for_unique as
  select a, b from sandbox
  where a is not null and b is not null;
create unique index sandbox_for_unique_idx on sandbox_for_unique(a,b);
I guess you can then.
Just for the record though, I leave my paragraph to explain why Oracle behaves like that if you have a simple unique index on two columns:
Oracle will never accept two (1, null) pairs if the columns are uniquely indexed.
A pair of 1 and a null is considered an "indexable" pair. A pair of two nulls cannot be indexed, which is why it lets you insert as many (null, null) pairs as you like.
(1, null) gets indexed because 1 can be indexed. Next time you try to insert (1, null) again, 1 is picked up by the index and the unique constraint is violated.
(null,null) isn't indexed because there is no value to be indexed. That's why it doesn't violate the unique constraint.
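For reference, a minimal sketch showing the two-expression function-based index from above in action, using the same sandbox table as the question:
create table sandbox(a number(10,0), b number(10,0));
create unique index sandbox_idx on sandbox
  (case when a is null or b is null then null else a end,
   case when a is null or b is null then null else b end);
insert into sandbox values (1, null); -- accepted
insert into sandbox values (1, null); -- accepted: the index sees (null, null)
insert into sandbox values (1, 1);    -- accepted
insert into sandbox values (1, 1);    -- rejected: ORA-00001 unique constraint violated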
