SQL data versioning in DuckDB - jdbc

I'm using MonetDB in a project and currently evaluating moving to DuckDB. As part of that I'm also reevaluating how we do versioning of data and if there's a better way to do it in DuckDB.
We manage data versions by having tables structured like this:
create table data(
    row int,
    key varchar(50),
    value varchar(50),
    version int
);
create table data_version(
    row int,
    next_version int
);
and then querying like this (to show a surface of all data up to and including version 2):
select d.key, d.value from data d
left join data_version dv on dv.row = d.row
where d.version <= 2 and (dv.next_version > 2 or dv.next_version is null);
Minimal working example here
This has the advantage of being append-only (no table updates, just inserts) and seems to be quite performant. Bulk loading can be tricky because you have to keep track of what was already written in order to update the data_version table, but it's not too bad.
DuckDB has a lot of great functionality above and beyond standard SQL (like window functions), and I'm wondering whether that means there's a better way to do versioning of data. I'm hoping someone more familiar with DuckDB might know. (Maybe there's just a better way to do versioning anyway!)
(Note: the example above isn't really showing off why we need a column-oriented database, but in practice that data table will have lots of other columns that we run grouping-style queries on, combined with the versioning clause.)
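For example, I was wondering whether something along these lines would be a reasonable alternative (just a sketch, untested): keep the single append-only data table, insert every change as a new version of the row (with deletions marked explicitly, e.g. via a deleted flag, which the schema above doesn't have), and let a window function pick the newest version at query time with DuckDB's QUALIFY clause:
-- latest state of each row as of version 2, assuming every change is appended to data
select d.key, d.value
from data d
where d.version <= 2
qualify row_number() over (partition by d.row order by d.version desc) = 1;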

Related

Change schema in an Impala/Hive table with a very large amount of data?

We have a Hive table stored on HDFS with 800+ columns and >65 billion rows (and growing) and need to:
Remove a column with a complex type (small array)
Add a column with a complex type (small array)
Possibly add a handful of other columns (simple type, e.g. string or int)
Modify the contents of 3 columns for every row in the database (effectively read it in, make a simple change, write it back out to the same column and row that it came from). I realise this is probably a separate operation to the other three requirements above.
We could set up a new empty table with the new schema and copy the data over (using CREATE TABLE xxxxx AS SELECT ... or INSERT INTO xxxx SELECT ...), but tests suggest it would take 1-3 weeks running non-stop. And it's possible we may need to make further minor modifications of this kind in future.
Is there an efficient, sensible alternative to copying the whole table? Would ALTER TABLE work (at least for the structural changes, items 1-3 above)? What are the pros and cons of each option?
Table is going to be queried using Impala, in a Zeppelin-based interface.
Thanks for any advice.
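(For concreteness, my understanding is that the structural changes in items 1-3 would be expressed roughly like this in Hive; this is only a sketch with a hypothetical table name big_table and made-up column names, and I realise the behaviour may depend on the file format and serde.)
-- Adding columns (items 2 and 3) only changes the table definition in the metastore:
ALTER TABLE big_table ADD COLUMNS (new_attrs array<string>, source_system string);
-- Hive has no DROP COLUMN; removing a column (item 1) means restating every column
-- to keep via REPLACE COLUMNS, which is also metadata-only but not supported for every serde:
ALTER TABLE big_table REPLACE COLUMNS (id bigint, name string);
-- Item 4 (rewriting values in three columns) cannot be expressed with ALTER TABLE at all;
-- it requires rewriting the data.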

Insert into a view in Hive

Can we insert into a view in Hive?
I have done this in the past with Oracle and Teradata.
But it doesn't seem to work in Hive.
create table t2 (id int, key string, value string, ds string, hr string);
create view v2 as select id, key, value, ds, hr from t2;
insert into v2 values (1,'key1','value1','ds1','hr1')
***Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if null is encrypted: java.lang.NullPointerException***
There seems to be some sort of update support for views, but I can't see anything about inserting into a view:
https://cwiki.apache.org/confluence/display/Hive/UpdatableViews
Thanks for the feedback. Makes sense. The reason behind needing this functionality is that we use an ETL tool that has problems handling high-precision decimals (>15 digits). If the object (table->column in this case) is represented as a string within the tool, we don't have a problem. So I thought I'd define a bunch of views with string datatypes and use those in the tool instead. But you can't insert into a view in Hive, so maybe I need to think of something else. I have done it this way before with Oracle and Teradata.
Can we have two tables with different structures point to the same underlying HDFS content? That probably wouldn't work because of the Parquet storage, which stores the schema. Sorry, not a Hadoop expert.
Thanks a lot for your time.
It is not possible to insert data into a Hive view; a Hive view is just a projection of a Hive table (you can think of it as a pre-saved query). From the Hive documentation:
Note that a view is a purely logical object with no associated storage. (No support for materialized views is currently available in Hive.) When a query references a view, the view's definition is evaluated in order to produce a set of rows for further processing by the query. (This is a conceptual description; in fact, as part of query optimization, Hive may combine the view's definition with the query's, e.g. pushing filters from the query down into the view.)
The link (https://cwiki.apache.org/confluence/display/Hive/UpdatableViews) seems to be for a proposed feature.
Per the official documentation:
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER.
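To illustrate, for the use case described in the comments (exposing high-precision decimals as strings to an ETL tool), one common pattern is to read through a casting view and write to the base table directly. A sketch, with hypothetical table, view and column names:
CREATE TABLE amounts (id int, amount decimal(30,10));
-- the view exposes the decimal as a string for the ETL tool to read
CREATE VIEW amounts_str AS
SELECT id, CAST(amount AS string) AS amount
FROM amounts;
-- reads can go through the view...
SELECT id, amount FROM amounts_str;
-- ...but inserts must target the base table, never the view
INSERT INTO TABLE amounts VALUES (1, 123456789.0123456789);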

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single columns or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table? What is the best way to solve this when using Cassandra?
Example Scenario
Let's say I have a simple table containing posts that users have written:
CREATE TABLE posts (
    username varchar,
    creation timestamp,
    content varchar,
    PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The query SELECT * FROM posts ORDER BY creation; will result in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement "I need to get all posts regardless of the username, in order of time" is very hard to achieve in a distributed system; it would require the system to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1 requires a full table scan. Indeed, as long as you don't fetch all records, the ordering cannot be achieved, unless you use a Cassandra clustering column to order at insertion time. But in that case all posts are stored in the same partition, and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
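For reference, a rough sketch of what such a materialized view could look like on Cassandra 3.x (the view name is made up; every column of the base primary key plus the new clustering column must be filtered with IS NOT NULL):
CREATE MATERIALIZED VIEW posts_by_content_mv AS
    SELECT username, content, creation
    FROM posts
    WHERE username IS NOT NULL AND content IS NOT NULL AND creation IS NOT NULL
    PRIMARY KEY ((username), content, creation);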
Question 1:
Depending on the range of times you're interested in, I bet you could model this with time buckets.
You can do this by making the partition key a year, year-month, or year-month-day, depending on your use case (or finer time intervals).
The basic idea is that you bucket changes in whatever way suits your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed-out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
    creation_year int,
    creation_month int,
    creation_day int,
    creation timeuuid,
    username text, -- using text instead of varchar, they're essentially the same
    content text,
    PRIMARY KEY ((creation_year, creation_month, creation_day), creation)
);
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can insert rows, deriving the partition key (PK) columns creation_year, creation_month, creation_day from the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');
now() is a CQL function to generate a timeUUID; you would probably want to generate this in the application instead, parse out the yyyy-mm-dd for the PK, and then insert the timeUUID in the clustering column.
As a usage example for this table, let's say you wanted to see all of the changes made today; your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 5:00-0600') ;
minTimeuuid() is another CQL function; it creates the smallest possible timeUUID for the given time, which guarantees that you get all of the changes from that time.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). So in your case it would be a BATCH statement with 3 inserts of the same data into 3 different tables that support your query patterns.
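A sketch of such a batch, assuming the original posts table, the posts_by_creation table above and a hypothetical posts_by_content table; in practice the timestamp/timeuuid values would be generated and bound by the application:
BEGIN BATCH
    INSERT INTO posts (username, creation, content)
        VALUES ('luke', '2016-04-02 17:00:00+0000', 'some content');
    INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content)
        VALUES (2016, 4, 2, now(), 'luke', 'some content');
    INSERT INTO posts_by_content (username, content, creation)
        VALUES ('luke', 'some content', '2016-04-02 17:00:00+0000');
APPLY BATCH;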
The solution is to create other tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need some special column to group by, maybe month and year, e.g. PRIMARY KEY ((year, month), timestamp). This way Cassandra will have better read performance because it doesn't need to scan the whole cluster to get all the data, and it also saves on data transfer between nodes.
The same goes for SELECT * FROM posts WHERE username='luke' ORDER BY content: you must create another table for this query too. All columns may be the same as in your first table but with a different primary key, because you cannot order by a column that is not a clustering column.
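A sketch of what that second table could look like (the table name is made up; creation is kept in the key so that two posts with identical content don't overwrite each other):
CREATE TABLE posts_by_content (
    username varchar,
    content varchar,
    creation timestamp,
    PRIMARY KEY ((username), content, creation)
);
-- served query: SELECT * FROM posts_by_content WHERE username = 'luke';
-- rows come back ordered by content within the partition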

can oracle types be updated like tables?

I am converting GTTs to Oracle types as explained in an excellent answer by APC. However, some GTTs are being updated based on a select query from another table. For example:
UPDATE my_gtt_1 c
   SET (street, city, STATE, zip) =
       (SELECT src.unit_address,
               src.unit_city,
               src.unit_state,
               src.unit_zip_code
          FROM (SELECT mbr.ROWID row_id,
                       unit_address,
                       RTRIM(a.unit_city) unit_city,
                       RTRIM(a.unit_state) unit_state,
                       RTRIM(a.unit_zip_code) unit_zip_code
                  FROM table_1 b,
                       table_2 a,
                       my_gtt_1 mbr
                 WHERE type = 'ABC'
                   AND id = b.ssn_head
                   AND a.h_id = b.h_id
                   AND row_id >= v_start_row
                   AND row_id <= v_end_row) src
         WHERE c.ROWID = src.row_id)
 WHERE state IS NULL
    OR state = ' ';
If my_gtt_1 were not a global temporary table but an Oracle collection type, would it be possible to do updates this complex? Or in these cases are we better off using the global temporary table?
You cannot perform set-based UPDATE operations on object types. You will have to do it row by row, as in:
FOR i IN l_tab.FIRST..l_tab.LAST LOOP
   SELECT src.unit_address,
          src.unit_city,
          src.unit_state,
          src.unit_zip_code
     INTO l_tab(i).street,
          l_tab(i).city,
          l_tab(i).STATE,
          l_tab(i).zip
     FROM (your_query) src;
END LOOP;
You should therefore try to do all computations at creation time (where you can BULK COLLECT). Obviously, if your process needs many steps you might find that a global temporary table outperforms an in-memory structure.
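For illustration, a rough sketch of doing the work at creation time with BULK COLLECT instead of a per-row update pass (the object type my_addr_obj and the collection type are hypothetical, and the query is a simplified version of the one in the question):
DECLARE
    TYPE t_addr_tab IS TABLE OF my_addr_obj;   -- nested table of a SQL object type
    l_tab t_addr_tab;
BEGIN
    SELECT my_addr_obj(b.ssn_head,
                       a.unit_address,
                       RTRIM(a.unit_city),
                       RTRIM(a.unit_state),
                       RTRIM(a.unit_zip_code))
      BULK COLLECT INTO l_tab
      FROM table_1 b
      JOIN table_2 a ON a.h_id = b.h_id
     WHERE b.type = 'ABC';
    -- l_tab is populated in one round trip; no row-by-row UPDATE of the
    -- collection is needed afterwards
END;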
From the last questions you have asked, it seems you are trying to replace all global temporary tables with object tables. I would suggest caution because in general, they are not interchangeable:
Object tables are in-memory structures: you don't want to load a million+ row table into memory. They are mainly used as a buffer: you load a few (100 for example) rows into the structure, perform what you need to do with these rows, then load the next batch. You cannot easily treat this structure as a regular table: for example, you can only search this structure efficiently with the standard indexing key (you cannot search by rowid in your example unless you define the structure to be indexed by rowid).
Temporary tables on the other hand are very similar to ordinary tables. You can load millions of rows in them, perform joins, complex set operations. You can index the temporary table for further optimization.
In my opinion, the change you are trying to conduct will take a massive overhaul of your logic and it may not perform better. In general, you would not replace GTTs with object tables. You may be able to remove GTTs with a significant gain in performance by using set operations directly (perform massive UPDATE/DELETE/INSERT on your data directly without a staging table).
I would suggest performing benchmarks before choosing a solution (this is probably what you are doing right now :)
I think this part of APC's answer to your previous question is relevant here:
Global temporary tables are also good if we have a lot of intermediate processing which is just too complicated to be solved with a single SQL query. Especially if that processing must be applied to subsets of the retrieved rows.
You cannot update the in-memory data with an UPDATE statement like you can a GTT; you would need to write procedural code to locate and change the array elements in question.

How can I improve the performance of LinqToSql queries that use EntitySet properties?

I'm using LinqToSql to query a small, simple SQL Server CE database.
I've noticed that any operations involving sub-properties are disappointingly slow.
For example, if I have a Customer table that is referenced by an Order table, LinqToSql will automatically create an EntitySet<Order> property. This is a nice convenience, allowing me to do things like Customer.Order.Where(o => o.ProductName == "Stopwatch"), but for some reason SQL Server CE hangs up pretty badly when I try to do stuff like this. One of my queries, which isn't really that complicated, takes 3-4 seconds to complete.
I can get the speed up to acceptable, even fast, if I just grab the two tables individually, convert them to List<Customer> and List<Order>, and then join them manually with my own query, but this requires a lot of extra code. LinqToSql generates these EntitySet<T> properties automatically; I'd like to use them.
So, how can I improve the performance? For example, are there any DataContext options that would help?
Note: My database in its initial state is only about 250 KB and I don't expect it to grow to more than 1-2 MB. So it's not like there are a lot of records.
Update
Here are the table definitions for the example I used in my question:
create table [Order]
(
    Id int identity(1, 1) primary key,
    ProductName ntext null,
    Quantity int null
)
create table Customer
(
    Id int identity(1, 1) primary key,
    OrderId int null references [Order] (Id)
)
I've not used SQL Server CE with Linq to SQL before, but with such a small DB, my gut tells me the performance issue has more to do with poor query optimization than disk access.
Try getting the SQL query from your Linq to SQL objects to see what might be happening. Perhaps run those queries manually against the CE database to see how they perform.
It's very possible that just adding the correct indexes will solve the problem.
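For instance, with the schema in the question, the foreign key column Customer.OrderId gets no index by default, so queries that navigate the association from that side have to scan Customer; adding an index is a cheap experiment (the index name is arbitrary):
-- SQL Server CE: index the foreign key side of the association
CREATE INDEX IX_Customer_OrderId ON Customer (OrderId);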
You might also try LinqToSQL Profiler. http://l2sprof.com/
