How can we uniquely identify each record in a Vertica database in the absence of a primary key? - uniqueidentifier

There are some tables in which the primary key, unique key, or composite key is not specified.
How can we uniquely identify each record in Vertica?

Vertica, in this respect, is a real relational database: If there is no declared identifier, there is no identifier.
And it is columnar.
Row-based DBMSs have a way to physically identify the location of a certain row. That's where, for example, Oracle's ROWID comes from.
Vertica being columnar, you cannot even locate the actual position of a row within the first column of the sort order:
If, for example, you have gender as the first column in the sort order and the column is run-length encoded (RLE), then the file that contains the column holds, for a table of 1000 rows, something like the value 'F' with an integer of 502, and the value 'M' with an integer of 498.
You could calculate the hash of all columns (with a small risk of hash collisions), but if you have two rows like this:
42 | Arthur Dent | 2022-01-25
42 | Arthur Dent | 2022-01-25
There is no way of discerning one row from the other.
Even if you apply a
ROW_NUMBER() OVER (PARTITION BY <all_columns_of_the_table>)
which would lead one of the above rows to get a 1 and the other a 2, there will be no way of determining which number was assigned to which row.
These are the two, not completely satisfactory, ways of working around this behaviour, which is not a problem but a behavioural feature.
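A minimal sketch of both workarounds, assuming a hypothetical table my_table(id, name, dob); Vertica's HASH() accepts multiple expressions, and ROW_NUMBER() numbers identical rows arbitrarily:

SELECT
    HASH(id, name, dob) AS row_hash,                            -- small risk of collisions
    ROW_NUMBER() OVER (PARTITION BY id, name, dob) AS dup_seq,  -- arbitrary among duplicates
    id, name, dob
FROM my_table;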

Related

continuous values in oracle primary key

Is there a way in Oracle to create a column with auto-increment such that, when a row is deleted, its value is reused for the next inserted row?
That behavior you are describing (having "holes" in the sequence after deletes) will always happen with a SEQUENCE. For most applications it is a good thing and works perfectly, because most of the time the id of the table is artificial and meaningless. Its only use is to connect tables with relations, and for that, holes are unimportant.
In your case, if you want to create a continuous sequence and fill gaps as they appear, you need to create a trigger on insert that updates your ID with the value of the first "hole" found in your sequence, using a SELECT like this:
SELECT MIN(tb.id) + 1 "first_seq_hole"
  FROM yourTable tb
 WHERE NOT EXISTS
       (SELECT 1 FROM yourTable tb2 WHERE tb2.id = tb.id + 1)
That said, I'm not sure exactly what your requirement is, but at some point you might still have holes in your sequence (say you delete 10 random rows and never insert enough to fill them). That's unavoidable unless you work on a way to change existing IDs to instantly fill gaps when rows are deleted, which would be complicated and risky if you have child tables using that ID.
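A hedged sketch of such a trigger, assuming a hypothetical table yourTable(id NUMBER, ...) and a fallback sequence your_seq. Note that a row-level trigger querying its own table is only safe for single-row INSERT ... VALUES statements; multi-row DML can raise the mutating-table error (ORA-04091).

CREATE OR REPLACE TRIGGER yourtable_fill_gaps
BEFORE INSERT ON yourTable
FOR EACH ROW
DECLARE
    first_hole NUMBER;
BEGIN
    -- Find the first gap in the existing ids (NULL if the table is empty).
    SELECT MIN(tb.id) + 1
      INTO first_hole
      FROM yourTable tb
     WHERE NOT EXISTS
           (SELECT 1 FROM yourTable tb2 WHERE tb2.id = tb.id + 1);

    -- Reuse the gap if one exists, otherwise fall back to the sequence.
    :NEW.id := COALESCE(first_hole, your_seq.NEXTVAL);
END;
/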

Why doesn't Cassandra UPDATE violate the no read before writes rule

I am confused by two seemingly contradictory statements about Cassandra
No reads before writes (presumably this is because writes are sequential whereas reads require scanning a primary key index)
INSERT and UPDATE have identical semantics (stated in an older version of the CQL manual but presumably still considered essentially true)
Suppose I've created the following simple table:
CREATE TABLE data (
id varchar PRIMARY KEY,
names set<text>
);
Now I insert some values:
insert into data (id, names) values ('123', {'joe', 'john'});
Now if I do an update:
update data set names = names + {'mary'} where id = '123';
The results are as expected:
id | names
-----+-------------------------
123 | {'joe', 'john', 'mary'}
Isn't this a case where a read has to occur before a write? The "cost" would seem to be the following:
The cost of reading the column
The cost of creating a union of the two sets (negligible here but could be noticeable with larger sets)
The cost of writing the data with the key and new column data
An insert would merely involve the last of these.
There is no need for a read before writing. Internally, each collection stores its data using one column per entry -- when you add a new entry to a collection, the operation is performed on that single column*: if the column already exists it will be overwritten, otherwise it will be created (InsertOrUpdate).
This is the reason why each entry in a collection can have custom ttl and writetime.
*While with Map and Set this is transparent, there is some internal trick that allows multiple columns with the same name inside a List.
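A hedged illustration of this per-entry model, reusing the table from the question (the value 'zaphod' and the TTL are made up): each set element can carry its own TTL because it is stored as its own internal column.

UPDATE data USING TTL 3600 SET names = names + {'zaphod'} WHERE id = '123';

-- 'zaphod' expires after an hour while 'joe', 'john' and 'mary' remain,
-- which would not be possible if the whole set had to be read back and rewritten.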

Fastest way to SELECT a row from a table in a database (Microsoft SQL server)

I have a huge table with one int PRIMARY KEY IDENTITY column.
I guess making the SELECT query using that primary key is the fastest way for the database to find the row in the table, isn't it?
If that is true, I still have a question.
Is that query as fast as a dictionary lookup by key, or does the database still have to read all the rows from the beginning (of the primary key column) until it finds the row itself?
Thanks in advance ^^
Using the primary key is obviously the fastest way to access a particular row.
If you want to understand how it works, you have to understand how an index works.
In general it works like this:
Let's say you have a table t1(col1, col2, ..., col10) and you have an index on col1.
An index on col1 means that you have a data structure containing pairs (col1, rec_id),
where rec_id allows direct access to the row with the corresponding col1.
The data structure is ordered by col1 and therefore allows efficient searching by col1.
Searching such a structure works much like a dictionary lookup, i.e. something closer to a binary search than a sequential scan.
When you declare a column as the primary key of a table, that column is indexed (in SQL Server, by default with a clustered B-tree index rather than a hash), so the search is definitely NOT row by row as you suggested.
Finally, yes, it is the common and fast way, but you should be selective about the number of columns and rows you need in your SQL query. Avoid fetching a large number of rows per SELECT call.
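A hedged sketch (the table and column names below are hypothetical): a lookup by the clustered primary key shows up in the execution plan as an index seek rather than a scan.

SELECT OrderDate, CustomerID
FROM dbo.Orders
WHERE OrderID = 42;   -- OrderID is the int IDENTITY PRIMARY KEY

-- With "Include Actual Execution Plan" enabled in SSMS, this query should show
-- a Clustered Index Seek, not a Clustered Index Scan.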

oracle- index organized table

What is the use case of an IOT (Index Organized Table)?
Let's say I have a table like:
id
Name
surname
I know what an IOT is, but I am a bit confused about its use case.
Your three columns don't make a good use case.
IOTs are most useful when you often access many consecutive rows from a table. Then you define a primary key such that the required order is represented.
A good example could be time series data such as historical stock prices. In order to draw a chart of the stock price of a share, many rows are read with consecutive dates.
So the primary key would be stock ticker (or security ID) and the date. The additional columns could be the last price and the volume.
A regular table - even with an index on ticker and date - would be much slower because the actual rows would be distributed over the whole disk. This is because you cannot influence the order of the rows and because data is inserted day by day (and not ticker by ticker).
In an index-organized table, the data for the same ticker ends up on a few disk pages, and the required disk pages can be easily found.
Setup of the table:
CREATE TABLE MARKET_DATA
(
    TICKER     VARCHAR2(20 BYTE) NOT NULL ENABLE,
    P_DATE     DATE NOT NULL ENABLE,
    LAST_PRICE NUMBER,
    VOLUME     NUMBER,
    CONSTRAINT MARKET_DATA_PK PRIMARY KEY (TICKER, P_DATE) ENABLE
)
ORGANIZATION INDEX;
Typical query:
SELECT TICKER, P_DATE, LAST_PRICE, VOLUME
FROM MARKET_DATA
WHERE TICKER = 'MSFT'
AND P_DATE BETWEEN SYSDATE - 1825 AND SYSDATE
ORDER BY P_DATE;
Think of index-organized tables as indexes. We all know the point of an index: to improve access speed to particular rows of data. A common performance optimisation trick is to build compound indexes on sub-sets of columns which can satisfy commonly-run queries. If an index can completely satisfy the columns in a query's projection, the optimizer knows it doesn't have to read from the table at all.
IOTs are just this approach taken to its logical conclusion: build the index and throw away the underlying table.
There are two criteria for deciding whether to implement a table as an IOT:
It should consist of a primary key (one or more columns) and at most one other column (okay, perhaps two other columns at a stretch, but that's a warning flag).
The only access route for the table is the primary key (or its leading columns).
That second point is the one which catches most people out, and is the main reason why the use cases for IOTs are pretty rare. Oracle doesn't recommend building other indexes on an IOT, so any access which doesn't drive from the primary key will be a full table scan. That might not matter if the table is small and we don't need to access it through some other path very often, but it's a killer for most application tables.
It is also likely that a candidate table will have a relatively small number of rows, and is likely to be fairly static. But this is not a hard'n'fast rule; certainly a huge, volatile table which matched the two criteria listed above could still be considered for implementation as an IOT.
So what makes a good candidate for index organization? Reference data. Most code lookup tables are something like this:
code        number       not null primary key
description varchar2(30) not null
Almost always we're only interested in getting the description for a given code. So building it as an IOT will save space and reduce the access time to get the description.
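A minimal sketch of that reference-data case as an IOT (the table and column names are hypothetical):

CREATE TABLE COUNTRY_CODES
(
    CODE        NUMBER       NOT NULL,
    DESCRIPTION VARCHAR2(30) NOT NULL,
    CONSTRAINT COUNTRY_CODES_PK PRIMARY KEY (CODE)
)
ORGANIZATION INDEX;

-- Typical access: the whole lookup is satisfied from the primary-key B-tree,
-- with no separate table segment to visit.
SELECT DESCRIPTION FROM COUNTRY_CODES WHERE CODE = 42;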

Is an Index Organized Table appropriate here?

I recently was reading about Oracle Index Organized Tables (IOTs) but am not sure I quite understand WHEN to use them. So I have a small table:
create table categories
(
    id       VARCHAR2(36),
    group    VARCHAR2(100),
    category VARCHAR2(100)
);
create unique index (group, category, id) COMPRESS 2;
The id column is a foreign key from another table entries and my common query is:
select e.id, e.time, e.title from entries e, categories c where e.id=c.id AND e.group=? AND c.category=? ORDER by e.time
The entries table is indexed properly.
Both of these tables have millions (16M currently) of rows and currently this query really stinks (note: I have it wrapped in a pagination query also so I only get back the first 20, but for simplicity I omitted that).
Since I am basically indexing the entire table, does it make sense to create this table as an IOT?
EDIT by popular demand:
create table entries
(
    id    VARCHAR2(36),
    time  TIMESTAMP,
    group VARCHAR2(100),
    title VARCHAR2(500),
    ....
);
create index (group, time) compress 1;
I don't think my real question depends on this, though. Basically, if you have a table with few columns (3 in this example) and you are planning to put a composite index on all three columns, is there any reason not to use an IOT?
IOTs are great for a number of purposes, including this case where you're gonna have an index on all (or most) of the columns anyway - but the benefit only materialises if you don't have the extra index - the idea is that the table itself is an index, so put the columns in the order that you want the index to be in. In your case, you're accessing category by id, so it makes sense for that to be the first column. So effectively you've got an index on (id, group, category). I don't know why you'd want an additional index on (group, category, id).
Your query:
SELECT e.id, e.time, e.title
FROM entries e, categories c
WHERE e.id=c.id AND e.group=? AND c.category=?
ORDER by e.time
You're joining the tables by ID, but you have no index on entries.id - so the query is probably doing a hash or sort merge join. I wouldn't mind seeing a plan for what your system is doing now to confirm.
If you're doing a pagination query (i.e. only interested in a small number of rows) you want to get the first rows back as quickly as possible; for this to happen you'll probably want a nested loop on entries, e.g.:
NESTED LOOPS
  ACCESS TABLE BY ROWID - ENTRIES
    INDEX RANGE SCAN - (index on ENTRIES.group, time)
  ACCESS TABLE BY ROWID - CATEGORIES
    INDEX RANGE SCAN - (index on CATEGORIES.ID)
Since the join to CATEGORIES is on ID, you'll want an index on ID; if you make it an IOT, and make ID the leading column, that might be sufficient.
The performance of the plan I've shown above will be dependent on how many rows match the given "group" - i.e. how selective an average "group" is.
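A hedged sketch of that suggestion, rebuilding the table as an IOT with ID as the leading primary-key column (GROUP is renamed GRP here because GROUP is a reserved word in Oracle; the NOT NULL constraints are an assumption, since primary-key columns cannot be null):

CREATE TABLE categories
(
    id       VARCHAR2(36)  NOT NULL,
    grp      VARCHAR2(100) NOT NULL,
    category VARCHAR2(100) NOT NULL,
    CONSTRAINT categories_pk PRIMARY KEY (id, grp, category)
)
ORGANIZATION INDEX;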
Have you looked at dba-oracle.com, asktom.com, IOUG, another asktom.com?
There are penalties to pay for IOTs - e.g., poorer insert performance
Can you prototype it and compare performance?
Also, perhaps you might want to consider a hash cluster.
IOTs are a trade-off. You gain access performance at the cost of decreased insert/update performance. We typically use them for reference data that is batch loaded daily and not updated during the day. This is not to say that's the only way to use them, just how we use them.
Few things here:
You mention pagination - have you considered the first_rows hint?
Is that the order your index is in, with group as the first field? If so I'd consider moving ID to be the first column since that index will not be used.
Foreign keys should have an index on the column. Consider adding an index on the foreign key (the id column).
Are you sure it's not the ORDER BY causing slowness?
What version of Oracle are you using?
I ASSUME there is a primary key on table entries for field id, correct?
Why does the WHERE condition not include "c.group = e.group"?
Try to:
Remove the ORDER BY condition
Change the index definition from "create unique index (group, category, id)" to "create unique index (id, group, category)"
Reorganise table categories as an IOT on (group, category, id)
Reorganise table categories as an IOT on (id, group, category)
In each of the above cases, use EXPLAIN PLAN to review the cost (a sketch of that step follows below)
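A hedged sketch of that EXPLAIN PLAN step, using the query from the question (the bind names :grp and :cat are placeholders):

EXPLAIN PLAN FOR
SELECT e.id, e.time, e.title
  FROM entries e, categories c
 WHERE e.id = c.id
   AND e.group = :grp
   AND c.category = :cat
 ORDER BY e.time;

-- Display the plan that was just explained.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);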

Resources