RethinkDB composite primary key join

I was wondering how would you do a join on a table with a composite primary key.
The composite key is achieved by using an array in the primary key field
Table 1
{id: key1, other: data}
Table 2
{id: [key1, key2], other: data}
So what I want is to join on table2.id[0] with table1
r.table("table1").eq_join("id[0]", r.table("table2")).run()

You cannot use eqJoin here because it requires keys to be strictly equal (strings are not arrays and vice versa).
This strict equality is also what gives eqJoin the best performance of all the join operations, which is why eqJoin is designed to accept a field name only, not an expression.
You seem to want innerJoin, which can handle your case at the cost of some performance (actually, I'm not sure about the real performance implications):
r.table('table1')
.innerJoin(
r.table('table2'),
(doc1, doc2) => doc1('id').eq(doc2('id').nth(0))
)
Note that with innerJoin you can use the kind of expression you were trying to use in your question ("id[0]" is treated merely as a field name by eqJoin).

Related

Which Postgresql index is most efficient for text column with queries based on similarity

I would like to create an index on text column for the following use case. We have a table of Segment with a column content of type text. We perform queries based on the similarity by using pg_trgm. This is used in a translation editor for finding similar strings.
Here are the table details:
CREATE TABLE public.segments
(
id integer NOT NULL DEFAULT nextval('segments_id_seq'::regclass),
language_id integer NOT NULL,
content text NOT NULL,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT segments_pkey PRIMARY KEY (id),
CONSTRAINT segments_language_id_fkey FOREIGN KEY (language_id)
REFERENCES public.languages (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT segments_content_language_id_key UNIQUE (content, language_id)
)
And here is the query (Ruby + Hanami):
def find_by_segment_match(source_text_for_lookup, source_lang, sim_score)
aggregate(:translation_records)
.where(language_id: source_lang)
.where { similarity(:content, source_text_for_lookup) > sim_score/100.00 }
.select_append { float::similarity(:content, source_text_for_lookup).as(:similarity) }
.order { similarity(:content, source_text_for_lookup).desc }
end
---EDIT---
These are the generated queries:
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity" FROM "segments" WHERE (("language_id" = 2) AND (similarity("content", 'This will not work.') > 0.45)) ORDER BY SIMILARITY("content", 'This will not work.') DESC
SELECT "translation_records"."id", "translation_records"."source_segment_id", "translation_records"."target_segment_id", "translation_records"."domain_id",
"translation_records"."style_id",
"translation_records"."created_by", "translation_records"."updated_by", "translation_records"."project_name", "translation_records"."created_at", "translation_records"."updated_at", "translation_records"."language_combination", "translation_records"."uid",
"translation_records"."import_comment" FROM "translation_records" INNER JOIN "segments" ON ("segments"."id" = "translation_records"."source_segment_id") WHERE ("translation_records"."source_segment_id" IN (27548)) ORDER BY "translation_records"."id"
---END EDIT---
---EDIT 1---
What about re-indexing? Initially we'll import about 2 million legacy records. When and how often, if at all, should we rebuild the index?
---END EDIT 1---
Would something like CREATE INDEX ON segment USING gist (content) be ok? I can't really find which of the available index types would be best suited for our use case.
Best, seba
The 2nd query you show seems to be unrelated to this question.
Your first query can't use a trigram index, as the query would have to be written in operator form, not function form, to do that.
In operator form, it would look like this:
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity"
FROM segments
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';
In order for % to be equivalent to similarity("content", 'This will not work.') > 0.45, you would first need to do a set pg_trgm.similarity_threshold TO 0.45;.
Now how you get ruby/hanami to generate this form, I don't know.
The % operator can be supported by either a gin_trgm_ops index or a gist_trgm_ops index. The <-> operator can only be supported by gist_trgm_ops. But it is pretty hard to predict how efficient that support will be. If your "content" column is long or your text to compare is long, it is unlikely to be very efficient, especially in the case of gist.
Ideally you would partition your table by language_id. If not, then it might be helpful to build a multicolumn index having both columns.
CREATE INDEX segment_language_id_idx ON segments USING btree (language_id);
CREATE INDEX segment_content_gin ON segments USING gin (content gin_trgm_ops);
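If you do want a single index covering both columns, a multicolumn GIN index is also possible once the btree_gin extension is installed (a sketch, assuming that extension can be created; the index name is made up). Either way, you can check which index the planner actually uses by running EXPLAIN (ANALYZE, BUFFERS) on the operator-form query.
CREATE EXTENSION IF NOT EXISTS btree_gin;  -- provides GIN operator classes for scalar types such as integer
CREATE INDEX segments_language_content_gin ON segments
    USING gin (language_id, content gin_trgm_ops);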

ClickHouse: how to enable performant queries against increasing user-defined attributes

I am designing a system that handles a large number of buried-point events. An event record contains:
buried_point_id, for example: 1 means app_launch, 2 means user_register.
happened_at: the event timestamp.
user_id: the user identifier.
other attributes, including basic ones (phone_number, city, country) and user-defined ones (click_item_id, for example; it can literally be any piece of context information). PMs will add more and more user-defined attributes to the event record.
The query pattern is like:
SELECT COUNT(DISTINCT user_id) FROM buried_points WHERE buried_point_id = 1 AND city = 'San Francisco' AND click_item_id = 123;
Since my team has invested heavily in ClickHouse, I want to leverage it for this problem. Is it good practice to use the experimental Map data type to store all attributes in a Map-type column such as {city: San Francisco, click_item_id: 123, ...}, or is there another recommendation? Thanks.
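For reference, here is a minimal sketch of what the Map-column approach described above could look like (the table layout, engine, and setting are illustrative assumptions, not from the original post):
SET allow_experimental_map_type = 1;  -- only needed on versions where Map is still experimental

CREATE TABLE buried_points
(
    buried_point_id UInt32,
    happened_at     DateTime,
    user_id         UInt64,
    attrs           Map(String, String)  -- user-defined attributes; values are stored as strings here
)
ENGINE = MergeTree
ORDER BY (buried_point_id, happened_at);

SELECT count(DISTINCT user_id)
FROM buried_points
WHERE buried_point_id = 1
  AND attrs['city'] = 'San Francisco'
  AND attrs['click_item_id'] = '123';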

Query a table with primary key and two conditions on sort key

I'm trying to query a DynamoDB table using the partition key and a sort key. The sort key is a Unix timestamp, so I want to request a given partition key with the sort key between two dates. I am currently able to achieve this with a table scan, but I have to move this to a query for the speed benefit. I am unable to find any decent examples online of people using a partition key and sort key to query their table.
I have carefully read through this https://docs.aws.amazon.com/sdk-for-go/api/service/dynamodb/#DynamoDB.Query and understand that my params must go within the KeyConditionExpression.
I have read through https://github.com/aws/aws-sdk-go/blob/master/service/dynamodb/expression/examples_test.go and understand it on the whole, but I just can't find the syntax for KeyConditionExpression.
I'd have thought it was something like this:
keyCond := expression.Key("accountId").
Equal(expression.Value(accountId)).
And(expression.Key("sortKey").
Between(expression.Value(fromDateDec), expression.Value(toDateDec)))
But this throws:
ValidationException: Invalid KeyConditionExpression: Incorrect operand type for operator or function; operator or function: BETWEEN, operand type: NULL
First, you need KeyAnd to combine the hash key condition and the sort key condition.
// keyCondition represents the key condition where the partition key
// "TeamName" is equal to value "Wildcats" and sort key "Number" is equal
// to value 1
keyCondition := expression.KeyAnd(expression.Key("TeamName").Equal(expression.Value("Wildcats")), expression.Key("Number").Equal(expression.Value(1)))
Now, in place of the second equal condition, you can use your between condition as follows. Combined with KeyAnd as above, that gives you accountId equal to your value and sortKey between your two dates.
// keyCondition represents the boolean key condition of whether the value
// of the key "foo" is between values 5 and 10
keyCondition := expression.KeyBetween(expression.Key("foo"), expression.Value(5), expression.Value(10))

different data types associated data skew

Today I read an article about Hive tuning. One paragraph is as follows:
Scene: user_id in the user table the field user_id INT, log table field both of type string type int. When two tables in accordance with the user_id Join operation, the default Hash operation will be allocated int id, this will cause all records of the string type id assigned to a reducer.
Solution: numeric type is converted to a string type
select * from users a
left outer join logs b
on a.user_id = cast(b.user_id as string)
Can anybody give me some more explanation of the above point? I really cannot understand what the author is describing. Why does "this will cause all records of the string type id assigned to a reducer" happen? Thanks in advance!
For starters you did not copy and paste / transcribe the original properly. Here is the more likely wording:
this will cause all records of the string type id assigned to a single reducer.
The reason that would happen is that the implicit conversion of the string to int, without the explicit cast, is probably turning it into 0. The hashing therefore puts all of those ids into the same partition (the one for the value 0), so they all end up on a single reducer.

Oracle 9i Sub query

Hi, can anyone help me out with the logic for forming this query?
SELECT C.CPPID, c.CPP_AMT_MANUAL
FROM CPP_PRCNT CC,CPP_VIEW c
WHERE
CC.CPPYR IN (
SELECT C.YEAR FROM CPP_VIEW_VIEW C WHERE UPPER(C.CPPNO) = UPPER('123')
AND C.CPP_CODE ='CPP000000000053'
and TO_CHAR(c.CPP_DATE,'YYYY/Mon')='2012/Nov'
)
AND UPPER(C.CPPNO) = UPPER('123')
AND C.CPP_CODE ='CPP000000000053'
and TO_CHAR(c.CPP_DATE,'YYYY/Mon') = '2012/Nov';
Please correct me if I have formed the query structure incorrectly, in terms of query performance and standards. Thanks in advance.
If you have indexes or partitioned tables, I would not apply functions to the columns but to the variables/literals instead, so that indexes can still be used and partitions selected.
Also, I use ANSI-92 SQL syntax. You don't specify a join condition (at least not directly) between cpp_prcnt and cpp_view, so it is actually a Cartesian product (cross join):
SELECT C.CPPID, c.CPP_AMT_MANUAL
FROM CPP_PRCNT CC
CROSS JOIN CPP_VIEW c
WHERE
CC.CPPYR IN (
SELECT C.YEAR
FROM CPP_VIEW_VIEW C
WHERE C.CPPNO = '123'
AND C.CPP_CODE ='CPP000000000053'
AND trunc(c.CPP_DATE,'MM')=to_date('2012/Nov','YYYY/Mon')
)
AND C.CPPNO = '123'
AND C.CPP_CODE ='CPP000000000053'
AND trunc(c.CPP_DATE,'MM')=to_date('2012/Nov','YYYY/Mon')
If you show us the definition of cpp_view_view (it seems to be a view over cpp_view), the definition of CPP_VIEW (if it is simple), and what you're trying to achieve, I bet there are more things that could be improved or fixed.
There are a couple of things you could improve (a quick sketch of the first two points follows below):
if possible, get rid of the UPPER() in the comparison - it renders any plain index on CPPNO useless. If that's not possible, consider a function-based index on UPPER(CPPNO)
do not convert your DATE column to a string to compare it with a string - do it the other way round (i.e. convert your string to a date => only one conversion needed instead of one per table row, use of indices possible)
play around with EXISTS instead of IN, as suggested by Dileep - might be faster
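A quick sketch of the first two points (the index name and the base-table name are placeholders; since CPP_VIEW is a view, the function-based index has to be created on whichever base table actually holds CPPNO):
-- function-based index so UPPER(CPPNO) = UPPER('123') can still use an index
CREATE INDEX cpp_cppno_upper_idx ON cpp_base_table (UPPER(CPPNO));

-- the date filter rewritten as a range on the raw DATE column, so an index on CPP_DATE stays usable
SELECT C.YEAR
FROM CPP_VIEW_VIEW C
WHERE UPPER(C.CPPNO) = UPPER('123')
  AND C.CPP_CODE = 'CPP000000000053'
  AND C.CPP_DATE >= to_date('2012/Nov', 'YYYY/Mon')
  AND C.CPP_DATE <  add_months(to_date('2012/Nov', 'YYYY/Mon'), 1);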
