I am new to Hive and have some problems. I try to find a answer here and other sites but with no luck... I also tried many different querys that come to my mind, also without success.
I have my source table and i want to create new table like this.
Were:
id would be number of distinct counties as auto increment numbers and primary key
counties as distinct names of counties (from source table)
You could follow this approach.
A CTAS(Create Table As Select)
with your example this CTAS could work
CREATE TABLE t_county
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE AS
WITH t AS(
SELECT DISTINCT county, ROW_NUMBER() OVER() AS id
FROM counties)
SELECT id, county
FROM t;
You cannot have primary key or foreign keys on Hive as you have primary key on RBDMSs like Oracle or MySql because Hive is schema on read instead of schema on write like Oracle so you cannot implement constraints of any kind on Hive.
I can not give you the exact answer because of it suppose to you must try to do it by yourself and then if you have a problem or a doubt come here and tell us. But, what i can tell you is that you can use the insertstatement to create a new table using data from another table, I.E:
create table CARS (name string);
insert table CARS select x, y from TABLE_2;
You can also use the overwrite statement if you desire to delete all the existing data that you have inside that table (CARS).
So, the operation will be
CREATE TABLE ==> INSERT OPERATION (OVERWRITE?) + QUERY OPERATION
Hive is not an RDBMS database, so there is no concept of primary key or foreign key.
But you can add auto increment column in Hive. Please try as:
Create table new_table as
select reflect("java.util.UUID", "randomUUID") id, countries from my_source_table;
So my use case is i have to find the location of the primary key column so that i can write query like select * from my_table where ID <='00000536-37ee-471c-a8e0-3d233b8102f5'
So my table has a primary key which is varchar type and values of the column is GUID generated by an application.
Here is an example of primary key
000000bd-104e-4fd6-a791-c5422f29e1b5
0000016e-7e68-4453-b360-7ffd1627dc22
00000196-2dba-4532-8cba-1e853c466697
0000025a-cfae-41b4-b8e7-ef854d49e54a
00000260-8bdb-4b30-acdb-5a67efd4dbfe
00000366-552d-48a0-b8a1-20190ccd087c
000003f2-d6d8-4a51-96cc-407063bc568b
000003ff-3d16-4e88-9cf3-bcdf01c39a2b
00000487-1e6c-4d6d-a683-6f11d517962c
000004cc-6359-4a9a-aa2a-70a6b73a06b1
00000536-37ee-471c-a8e0-3d233b8102f5
Now i need to use this table in aws DMS which accepts only query like select * from table where column =,<=,>=
My use case is to find the exact location of the millions of GUID so that i can divide table into multiple query and select based on GUID .
For example if we have 100th GID is 00000536-37ee-471c-a8e0-3d233b8102f5 then i can write query like select * from my table where GUID <=100
The limitation is i can not add any new columns in the existing table because application impact is huge .
How can i do this ?
One Option that i thought but wanted to confirm is below
Create a temp table
Temp table will have auto generated sequence and ID column
Inset into temp table select only GUID from main table with order of GUID .
In this case the value will be stored on order and i an first select GUID based on 100th number and then i can pass that GUID and write my oroginal query
But i am not sure whether this will work on not
Can some one suggest on this or suggest some other option ?
So let me explain what i want .
I want DMS to read may main table in parallel and migrate .
So lets say one DMS task can read nd migrate from 1 to 100,another 100 to 200 another >200 like that .
Currently i can not do because we dont know the position of the primary key and write the query .
If you want to divide your table into chunks of equal sizes, I would take advantage of the hexadecimal nature of the GUIDs. It will be 256 instead of 100 chunks, but this might be acceptable.
CREATE TABLE t (pk VARCHAR2(36) PRIMARY KEY);
INSERT INTO t VALUES ('000000bd-104e-4fd6-a791-c5422f29e1b5');
The easiest option would be
SELECT * FROM t WHERE pk LIKE '%b5';
A bit more advanced:
SELECT pk, to_number(substr(pk, -2),'xx') FROM t;
If you have millions of rows, this is probably faster:
ALTER TABLE t ADD (mycol GENERATED ALWAYS AS (to_number(substr(pk, -2),'xx')));
CREATE INDEX i ON t(mycol);
SELECT * FROM t WHERE mycol=181;
Once your migration is done, you can undo the additional virtual column:
DROP INDEX i;
ALTER TABLE t DROP (mycol);
I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table. What is the best way to solve this when using Cassandra ?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this ?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The SELECT * FROM posts ORDER BY creation; will results in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement I need to get all posts regardless of the username, in order of time is very hard to achieve in a distributed system, it supposes to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1. require a full table scan. Indeed as long as you don't fetch all records, the ordering can not be achieve. Unless you use Cassandra clustering column to order at insertion time. But in this case, it means that all posts are being stored in the same partition and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
Question 1:
Depending on your use case I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the primary key a year,year-month, or year-month-day depending on your use case (or finer time intervals)
The basic idea is that you bucket changes for what suites your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
)
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can then insert the Partition Key (PK): creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update1';
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update2';
now() is a CQL function to generate a timeUUID, you would probably want to generate this in the application instead, and parse out the yyyy-mm-dd for the PK and then insert the timeUUID in the clustered column.
For a usage case using this table, let's say you wanted to see all of the changes today, your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 5:00-0600') ;
minTimeuuid() is another cql function, it will create the smallest possible timeUUID for the given time, this will guarantee that you get all of the changes from that time.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly if your not on Cassandra 3.x+ or don't want to use materialized views you can use Atomic batches to ensure data consistency across your several de-normalized tables (that's what it was designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
The solution is to create another tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need some special column for grouping it, maybe by month and year, e.g. PRIMARY KEY((year, month), timestamp) this way the cassandra will have a better performance on read because it doesn't need to scan the whole cluster to get all data, it will also save the data transfer between nodes too.
Same as SELECT * FROM posts WHERE username='luke' ORDER BY content;, you must create another table for this query too. All column may be same as your first table but with the different Primary Key, because you cannot order by the column that is not the clustering column.
In my application I have created a column with sequence to have unique id for each and every record that I add,I doubt will there be any unique row-number created by oracle itself for every record we updated in table,if yes then how to access that row number like
SELECT row-number from table where employee_name='name';
here I want to get unique row number created by oracle,
I have searched on net but haven't got proper information
Oracle does maintain a ROWID for each row; however, in my opinion using ROWID in user-written code is both poor practice and dangerous. ROWID is only guaranteed to be constant for the duration of a single transaction. ROWID is not guaranteed to be constant forever and the database can change it if and when it determines that a change is necessary. If your data does not supply a value or combination of values which are unique and unchanging I strongly suggest you learn how to create an artificial key which is automatically set using sequences and triggers. I believe 12c supplies auto-increment columns which you can use if you're using the latest version of Oracle.
Share and enjoy.
There are 2 unique identifiers in Oracle named as ROWNUM and ROWID. You can use them in such ways:-
SELECT ROWNUM
FROM table
WHERE employee_name = 'name';
and
SELECT ROWID
FROM table
WHERE employee_name = 'name';
You can read further about them.
Rownum -
http://docs.oracle.com/cd/B12037_01/server.101/b10759/pseudocolumns008.htm
Rowid - http://docs.oracle.com/cd/B19306_01/server.102/b14200/pseudocolumns008.htm
I have a db table with a column that is a String. I do not consider the case to be significant (e.g. "TEST == "test"). Unfortunately, it appears that JPA2 does, because both values are inserted into my table; I would like the second one to be rejected.
Is there a generic way to annotate an "ignore-case" unique constraint on a string column?
As an alternative, I could also consider putting a unique "ignore-case" constraint on the actual db column. Is that possible in Oracle 10?
What I don't want to do is write code, because this occurs often in this particular db.
All help is greatly appreciated.
you can achieve this with a function-based unique index
create unique index <index_name> on <table_name> (UPPER(<column_name>));
for Example
create table t111( col varchar2(10));
create unique index test_idx on t111 (UPPER(col));
insert into t111 values('test');
insert into t111 values ('TEST');