Table joins in BigQuery - go

Do you know if there's a way to join two tables by, for
example, using a foreign key constraint like in MySQL (I don't seem to
find anything about this) ? If not, is there a replacement ?
Thanks!

I interpret your question as below -
Is there a way to limit the values that can be used on the tableX to only the IDs that exist on the tableY? For example via using a foreign key constraint like in MySQL!
BigQuery does not provide any direct mechanism for this to happen.
But you can easily achieve this by yourself.
For example, assume you need to insert some data to tableX, but you want to make sure that only those rows will be inserted where id in that new data is in tableY
So, you can "enforce" this via below query
#standardSQL
SELECT n.*
FROM newData AS n
JOIN tableY AS y
ON n.id = y.id
... you can run this query with tableX as destination and ONLY needed rows will be inserted
Hope you got an idea
Also you can check existing related feature requests -
https://issuetracker.google.com/issues/35906045
https://issuetracker.google.com/issues/35906043

Since you asked two questions (Stack Overflow suggests asking 1 question per question), I'll answer one:
Also, do you know if there's a way to join two tables by, for example,
using a foreign key
In BigQuery you can join tables by any key - even by keys defined on the fly (this is pretty useful when you need to join two tables from different datasets that choose to encode identical values in different ways).
Why would you need a foreign key to do these joins?

Related

Why Access and Filter Predicates are the same here?

When I get the autotrace output of the query above using the Oracle SQL Developer, I see that the join condition is used for access and filter predicates. My question is, does it read all the department_ids from the DEPT_ID_PK and then use these IDs to access and filter the employees table? If so, why the employees table has full table scan? Why does it read the employees table again by using the department_ids of the departments table? Could anyone please read this execution plan step by step simply, and explain the reason why the access and filter predicates are used here?
Best Regards
it is a merge join (a bit like hash join, Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and uses less memory than hash joins).
so Oracle do a full table scan of in outer table (EMPLOYEES) and the it read the inner table in a ordred manner.
the filtre predicates is the column on which the projection will be done
more details: https://datacadamia.com/db/oracle/merge_join
It uses the primary key to avoid sorting, otherwise the plan would be like this
The distinction between "Access predicates" and "Filter predicates" is not particularly consistent, so take them with healthy amount of skepticism. For example, if you remove the USE_MERGE hint, then there would be no Fiter Predicates in the plan any more, and the Access Predicates node would be relocated under the HASH_JOIN node (where it makes more sense for MERGE_JOIN as well):

Data consistency between Oracle tables

I have one big table A who has PK (C1, C2, C3) and many other columns, to make the select faster, a smaller table B was created with PK (C1, C2). So we can do a select by joining the two tables to find a row in A.
But the problem now is that it can happen that if data is corrupted in B which results in a joint select returns nothing but we still have a row in A.
Am I doing something wrong with this design and how can I ensure the data in those two tables are consistent?
Thanks a lot.
Standard way - if those tables are in a master-detail relationship - is to create a foreign key constraint which will prevent deleting master if details exist.
If you can fix it now, do it - then create the constraint.
If you can't, then create foreign key constraint using INITIALLY DEFERRED DEFERRABLE option so that current values aren't checked, but future DML will be.
Finally, to fetch data although certain rows don't exist any more, use outer join.
"Am I doing something wrong with this design"
Well it's hard to be sure without more details about your scenario but probably you just needed a non-unique index on A(C1, C2).
Although I would like to see some benchmarking which proves an index-range scan on your primary key index was not up to the job. Especially as it seems likely the join on table B is using that access path.
Performance tuning an Oracle database is a matter of understanding and juggling many variables. It's not just a case of "bung on another index". We need to understand what the database is actually doing and why the optimiser made that choice. So, please read this post on asking Oracle tuning questions which will give you some insight into how to approach query optimisation.

How does Hive partition works

Lets assume the below table:
as schema:
ID,NAME,Country and my partition key is country.
If my query is like:
select * from table where id between 155555756 to 10000000000;
The partition will not work in that case, right? .
On a simple note .What if I do not use partition key in my query . So table full scan will be there, right?
Answer to your first question is yes, this query plan will not do partition pruning.
You can use the following statement to check if the query does partition pruning:
explain dependency <your query>
Answer to your second question - It depends!
If the hive.mapred.mode is set to strict, then hive will not allow to do full table scans, and few other "risky" operations like cross joins etc.,
Depending on the version of hive you are using, these settings also affect the number of partitions that can be scanned by a single query
hive.metastore.limit.partition.request (or)
hive.limit.query.max.table.partition

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table. What is the best way to solve this when using Cassandra ?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this ?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The SELECT * FROM posts ORDER BY creation; will results in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement I need to get all posts regardless of the username, in order of time is very hard to achieve in a distributed system, it supposes to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1. require a full table scan. Indeed as long as you don't fetch all records, the ordering can not be achieve. Unless you use Cassandra clustering column to order at insertion time. But in this case, it means that all posts are being stored in the same partition and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
Question 1:
Depending on your use case I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the primary key a year,year-month, or year-month-day depending on your use case (or finer time intervals)
The basic idea is that you bucket changes for what suites your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
)
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can then insert the Partition Key (PK): creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update1';
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update2';
now() is a CQL function to generate a timeUUID, you would probably want to generate this in the application instead, and parse out the yyyy-mm-dd for the PK and then insert the timeUUID in the clustered column.
For a usage case using this table, let's say you wanted to see all of the changes today, your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 5:00-0600') ;
minTimeuuid() is another cql function, it will create the smallest possible timeUUID for the given time, this will guarantee that you get all of the changes from that time.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly if your not on Cassandra 3.x+ or don't want to use materialized views you can use Atomic batches to ensure data consistency across your several de-normalized tables (that's what it was designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
The solution is to create another tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need some special column for grouping it, maybe by month and year, e.g. PRIMARY KEY((year, month), timestamp) this way the cassandra will have a better performance on read because it doesn't need to scan the whole cluster to get all data, it will also save the data transfer between nodes too.
Same as SELECT * FROM posts WHERE username='luke' ORDER BY content;, you must create another table for this query too. All column may be same as your first table but with the different Primary Key, because you cannot order by the column that is not the clustering column.

DB project - improving performance with relationships

I have two tables, let's call them TableA and TableB. One record in TableA is related to one or more in TableB. But there's also one special record within them in TableB for each record from TableA (for example with lowest ID), and I want to have quick access to that special one. Data from both tables aren't deleted - it's a kind of history rarely cleared. How do that the best in terms of performance?
I thought of:
1) two-way relationship, but it will affect insert performance
2) design next table, with primary key as FK_TableA (for TableA record exactly one is "special") and second column FK_TableB and then create view
3) design next table, with primary key as FK_TableA, FK_TableB, make FK_TableA unique and then create view
I'm open for all other ideas :)
4) I'd consider an indexed view to hide the JOIN and row restriction
This is similar to your options 2 and 3 but the DB engine will maintain it for you. With a new table you'll either compromise data integrity or have to manage the data via triggers

Resources