Which PostgreSQL index is most efficient for a text column with queries based on similarity - ruby

I would like to create an index on a text column for the following use case. We have a segments table with a column content of type text. We perform queries based on similarity using pg_trgm. This is used in a translation editor for finding similar strings.
Here are the table details:
CREATE TABLE public.segments
(
    id integer NOT NULL DEFAULT nextval('segments_id_seq'::regclass),
    language_id integer NOT NULL,
    content text NOT NULL,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL,
    CONSTRAINT segments_pkey PRIMARY KEY (id),
    CONSTRAINT segments_language_id_fkey FOREIGN KEY (language_id)
        REFERENCES public.languages (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE CASCADE,
    CONSTRAINT segments_content_language_id_key UNIQUE (content, language_id)
)
And here is the query (Ruby + Hanami):
def find_by_segment_match(source_text_for_lookup, source_lang, sim_score)
  aggregate(:translation_records)
    .where(language_id: source_lang)
    .where { similarity(:content, source_text_for_lookup) > sim_score/100.00 }
    .select_append { float::similarity(:content, source_text_for_lookup).as(:similarity) }
    .order { similarity(:content, source_text_for_lookup).desc }
end
---EDIT---
These are the queries:
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity" FROM "segments" WHERE (("language_id" = 2) AND (similarity("content", 'This will not work.') > 0.45)) ORDER BY SIMILARITY("content", 'This will not work.') DESC
SELECT "translation_records"."id", "translation_records"."source_segment_id", "translation_records"."target_segment_id", "translation_records"."domain_id",
"translation_records"."style_id",
"translation_records"."created_by", "translation_records"."updated_by", "translation_records"."project_name", "translation_records"."created_at", "translation_records"."updated_at", "translation_records"."language_combination", "translation_records"."uid",
"translation_records"."import_comment" FROM "translation_records" INNER JOIN "segments" ON ("segments"."id" = "translation_records"."source_segment_id") WHERE ("translation_records"."source_segment_id" IN (27548)) ORDER BY "translation_records"."id"
---END EDIT---
---EDIT 1---
What about re-indexing? Initially we'll import about 2 million legacy records. When and how often, if at all, should we rebuild the index?
---END EDIT 1---
Would something like CREATE INDEX ON segments USING gist (content) be OK? I can't really figure out which of the available index types would be best suited to our use case.
Best, seba

The 2nd query you show seems to be unrelated to this question.
Your first query can't use a trigram index, as the query would have to be written in operator form, not function form, to do that.
In operator form, it would look like this:
SELECT "id", "language_id", "content", "created_at", "updated_at", SIMILARITY("content", 'This will not work.') AS "similarity"
FROM segments
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';
In order for % to be equivalent to similarity("content", 'This will not work.') > 0.45, you would first need to run SET pg_trgm.similarity_threshold TO 0.45;.
Now how you get ruby/hanami to generate this form, I don't know.
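For reference, a minimal sketch of checking this at the psql prompt, assuming one of the trigram indexes shown further down has already been created:

-- set the session threshold so % matches the 0.45 cutoff, then check the plan
SET pg_trgm.similarity_threshold TO 0.45;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, content, similarity(content, 'This will not work.') AS similarity
FROM segments
WHERE language_id = 2 AND content % 'This will not work.'
ORDER BY content <-> 'This will not work.';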
The % operator can be supported by either the gin_trgm_ops or the gist_trgm_ops operator class. The <-> operator can only be supported by gist_trgm_ops. But it is pretty hard to predict how efficient that support will be. If your content column is long or your text to compare is long, it is unlikely to be very efficient, especially in the case of GiST.
Ideally you would partition your table by language_id. If not, then it might be helpful to build a multicolumn index having both columns.

CREATE INDEX segment_language_id_idx ON segments USING btree (language_id);
CREATE INDEX segment_content_gin ON segments USING gin (content gin_trgm_ops);
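If you do try the multicolumn route mentioned above, a rough sketch could look like this; the btree_gin extension (needed so the integer column can participate in a GIN index) and the index name are assumptions, not part of the original setup:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS btree_gin;
-- combines the language filter and the trigram matching in one GIN index
CREATE INDEX segments_language_content_trgm_gin
    ON segments USING gin (language_id, content gin_trgm_ops);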

AggregatingMergeTree order by column not in the sorting key

What are some options to have AggregatingMergeTree merge by a column but ordered by a column that's not in the sorting key?
My application is similar to Zendesk tickets. A ticket has a category, status, and ID. The application emits ticket status change events to ClickHouse, and I'm calculating statistics on the time it took to close a ticket since it was created, given some time range R, grouped by some time period P.
For example, events look like this:
{
  "ticket": "A",
  "event_time": "2022-12-08T15:00:00Z",
  "category": "bug",
  "status": "created"
},
{
  "ticket": "A",
  "event_time": "2022-12-08T15:30:00Z",
  "category": "bug",
  "status": "reviewing"
},
{
  "ticket": "A",
  "event_time": "2022-12-08T16:00:00Z",
  "category": "bug",
  "status": "reviewed"
}
My AggregatingMergeTree (more specifically, it's replicated) has a sorting key on the ticket ID to aggregate two states into one.
CREATE TABLE ticket_created_to_reviewed
(
    `ticket` String,
    `created_ticket_event_id` SimpleAggregateFunction(max, String),
    `created_ticket_event_time` SimpleAggregateFunction(max, DateTime64(9)),
    `created_ticket_category` SimpleAggregateFunction(max, String),
    `close_ticket_event_id` SimpleAggregateFunction(max, String),
    `close_ticket_event_time` SimpleAggregateFunction(max, DateTime64(9)),
    `close_ticket_category` SimpleAggregateFunction(max, String)
)
ENGINE = ReplicatedAggregatingMergeTree('<path>', '{replica}')
PARTITION BY toYYYYMM(close_ticket_event_time)
PRIMARY KEY ticket
ORDER BY ticket
TTL date_trunc('second', if(close_ticket_event_time > created_ticket_event_time,
    close_ticket_event_time, created_ticket_event_time)) + toIntervalMonth(12)
SETTINGS index_granularity = 8192
Two MVs SELECT from the raw events and insert into ticket_created_to_reviewed: one for WHERE status = 'created' and another for WHERE status = 'reviewed'.
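For illustration only, a sketch of what those two MVs might look like; the raw events table name (ticket_events) and its column names are assumptions, not part of the actual setup:

-- MV for the 'created' leg; column names of the raw stream are assumed
CREATE MATERIALIZED VIEW ticket_created_mv TO ticket_created_to_reviewed AS
SELECT
    ticket,
    event_id   AS created_ticket_event_id,
    event_time AS created_ticket_event_time,
    category   AS created_ticket_category
FROM ticket_events
WHERE status = 'created';

-- MV for the 'reviewed' leg
CREATE MATERIALIZED VIEW ticket_reviewed_mv TO ticket_created_to_reviewed AS
SELECT
    ticket,
    event_id   AS close_ticket_event_id,
    event_time AS close_ticket_event_time,
    category   AS close_ticket_category
FROM ticket_events
WHERE status = 'reviewed';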
So far the data populates correctly, although I have to exclude rows where only one of the status events is populated. Getting the hourly p99 of ticket time-to-close over the past day for each category looks something like this:
SELECT
    quantile(0.9)(date_diff('second', created_ticket_event_time, close_ticket_event_time)),
    date_trunc('hour', close_ticket_event_time) AS t,
    close_ticket_category AS category
FROM
(
    SELECT
        ticket,
        max(created_ticket_event_id) AS created_ticket_event_id,
        max(created_ticket_event_time) AS created_ticket_event_time,
        max(created_ticket_category) AS created_ticket_category,
        max(close_ticket_event_id) AS close_ticket_event_id,
        max(close_ticket_event_time) AS close_ticket_event_time,
        max(close_ticket_category) AS close_ticket_category
    FROM ticket_unreviewed_to_reviewed
    GROUP BY ticket
)
WHERE close_ticket_event_id != '' AND created_ticket_event_id != ''
  AND close_ticket_event_time > addDays(now(), -1)
GROUP BY t, category
The problem is that close_ticket_event_time is not in the sorting key, so the query scans the full table; but I also can't include that column in the sorting key, because then the table would no longer aggregate by ticket ID.
Any suggestions?
Things tried:
Adding an index and/or projection that orders by close_ticket_event_time (see the sketch after this list). However, I think the main problem is that the sorting key is on ticket ID, so the data is not ordered by time in a way that lets the matching time range be found efficiently; at the same time, adding close_ticket_event_time to the sorting key breaks the aggregation behavior of AggregatingMergeTree.
MV that joins created ticket and closed ticket, with a different destination table that has close_ticket_event_time as the sorting key. The destination table doesn't contain all the data if the right side of the JOIN isn't yet available when the MV is triggered (the MV only fires on inserts to the left side). This can happen if events are ingested out of order.
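For completeness, a sketch of the projection variant from the first point above (the projection name is an assumption):

ALTER TABLE ticket_created_to_reviewed
    ADD PROJECTION by_close_time
    (
        SELECT *
        ORDER BY close_ticket_event_time
    );
ALTER TABLE ticket_created_to_reviewed MATERIALIZE PROJECTION by_close_time;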
Ideally, what I'm looking for is something like this in AggregatingMergeTree, but it appears this isn't possible due to the nature of how the data is stored.
PRIMARY KEY ticket
ORDER BY close_ticket_event_time
Thanks in advance

New column indicating if row is the first instance of the value for the Entity ID using SQL instead of DAX

I currently have a column created using the following DAX formula (a calculation language used by platforms such as Power BI), which indicates whether the listed activity is the first one ever for that Entity ID. Below is my DAX script, if it helps at all:
// "Declares column name"
First Time Activity =
// "if the column 'Timestamp' is equal to..."
if('Activity Table'[Timestamp]=
// "...is equal to the earliest Timestamp for that Entity ID and Activity Name"
CALCULATE(min('Activity Table'[Timestamp]),
filter('Activity Table',
'Activity Table'[Entity ID] = earlier('Activity Table'[Entity ID]) &&
'Activity Table'[Activity Name] = earlier('Activity Table'[Activity Name])
)
)
// "...then return a 1. If not, then return a blank/null"
,1,BLANK())
But I now need this to be a column created in PL/SQL rather than in DAX. Any help with the SQL script would be much appreciated, since I'm fairly new to SQL.
Thanks
You don't actually need a column. You can write your query as:
SELECT a.*,
       DECODE(activity_date,
              MIN(activity_date) OVER (PARTITION BY activity_id),
              'Y',
              'N') AS first_record_indicator
FROM activity_table a
But if your table is too huge to query like this every time, you can create a column named first_record_indicator and populate it in a BEFORE INSERT trigger.
e.g. https://www.techonthenet.com/oracle/triggers/before_insert.php
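A minimal sketch of that trigger approach, assuming column names entity_id, activity_name and a target column first_record_indicator; note that a row-level trigger cannot query its own table during multi-row DML (mutating-table restriction), so this is only safe for single-row INSERT ... VALUES statements:

CREATE OR REPLACE TRIGGER activity_first_record_trg
BEFORE INSERT ON activity_table
FOR EACH ROW
DECLARE
  v_existing NUMBER;
BEGIN
  -- count earlier activities for the same entity and activity name
  SELECT COUNT(*)
    INTO v_existing
    FROM activity_table
   WHERE entity_id = :NEW.entity_id
     AND activity_name = :NEW.activity_name;

  :NEW.first_record_indicator := CASE WHEN v_existing = 0 THEN 'Y' ELSE 'N' END;
END;
/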

Strange behaviour when using FILTER to filter a different table with no direct relationship?

I have two facts tables, First and Second, and two dimension tables, dimTime and dimColour.
Fact table First looks like this:
and fact table Second looks like this:
Both dim-tables have 1:* relationships to both fact tables and the filtering is one-directional (from dim to fact), like this:
dimColour[Color] 1 -> * First[Colour]
dimColour[Color] 1 -> * Second[Colour]
dimTime[Time] 1 -> * First[Time]
dimTime[Time] 1 -> * Second[Time_]
Adding the following measure, I would expect the FILTER function not to have any effect on the calculation, since Second does not filter First, right?
Test_Alone =
CALCULATE (
SUM ( First[Amount] );
First[Alone] = "Y";
FILTER(
'Second';
'Second'[Colour]="Red"
)
)
So this should evaluate to 7, since only two rows in First have [Alone] = "Y" (with values 1 and 6) and there is no direct relationship between First and Second. However, it evaluates to 6. If I remove the FILTER argument from the CALCULATE, it evaluates to 7.
There are three additional measures in the attached pbix file which show the same type of behaviour.
How is filtering one fact table which has no direct relationship to a second fact table affecting the calculation done on the second table?
Zipped Power BI file: PowerBIFileDownload
Evaluating the table reference 'Second' produces a table that includes the columns of the Second table itself, as well as those of all its (transitive) parents.
In this case, this is a table with all of the columns in dimColour, dimTime, Second.
You can't see this if you just run:
evaluate 'Second'
as when 'evaluate' returns the results to the user, these "Parent Table" (or "Related") columns are not included.
Even so, these columns are certainly present.
When a table is converted to a row context, these related columns become available via RELATED.
See the following queries:
evaluate FILTER('Second', ISBLANK(RELATED(dimColour[Color])))
evaluate 'Second' order by RELATED(dimTime[Hour])
Similarly, when arguments to CALCULATE are used to update the filter context, these hidden "Related" columns are not ignored; hence, they can end up filtering First in your example. You can see this by using a function that strips the related columns, such as INTERSECT:
Test_ActuallyAlone = CALCULATE (
SUM ( First[Amount] ),
First[Alone] = "Y",
//This filter now does nothing, as none of the columns in Second
//have an impact on 'SUM ( First[Amount] )'; and the related columns
//are removed by the INTERSECT.
FILTER(
INTERSECT('Second', 'Second'),
'Second'[Colour]="Red"
)
)
(See these resources describing the "Expanded Table" concept, which give an alternative but equivalent explanation of this behaviour:
https://www.sqlbi.com/articles/expanded-tables-in-dax/
https://www.sqlbi.com/articles/context-transition-and-expanded-tables/)

Dynamically expand ALL lists and records from json

I want to expand all lists and records in a JSON response.
The columns look like this, for example (this is dynamic; it could also be 10 records and 5 lists):
Text, Text, [List], [List], Text, [Record], [Record], String, [Record]
I wrote a function for getting all columns of a specific type:
Cn.GetAllColumnsWithType = (table as table, typ as type) as list =>
let
ColumnNames = Table.ColumnNames(table),
ColumnsOfType = List.Select(ColumnNames, (name) =>
List.AllTrue(List.Transform(Table.Column(table, name), (cell) => Type.Is(Value.Type(cell), typ))))
in
ColumnsOfType;
and a function to expand all lists from a table
Cn.ExpandAllListsFromTable = (table as table, columns as list) =>
let
expandedListTable = List.Accumulate(columns, table, (state, columnToExpand) =>
Table.ExpandListColumn(state, columnToExpand))
in
expandedListTable;
All lists are now records, and I want to dynamically expand all of these records.
I think I need a foreach to iterate through the list (which contains only record columns, because of Cn.GetAllColumnsWithType), call Table.ExpandRecordColumn on each element with its Table.ColumnNames, and add it to the table, but I don't know how to do it.
Maybe you can help me out, because it's driving me crazy.
Cheers
Edit:
I recently opened a thread, but there I wanted to expand a specific column, like
#"SelectItems" = Table.SelectColumns(records,{"$items"}),
#"$items1" = #"SelectItems"{0}[#"$items"],
but now I want to do it all dynamically.
Chris Webb wrote a function to do this for Table-type columns:
http://blog.crossjoin.co.uk/2014/05/21/expanding-all-columns-in-a-table-in-power-query/
I've shared a tweaked version that I made for Record-type columns:
https://gist.github.com/Mike-Honey/0a252edf66c3c486b69b
You do not need a function for that. Assuming that the previous step in M was named Removed Other Columns, and that the column to expand is named Data, make a regular Expand step and replace its generated #"Expanded Data" code with the following:
#"Expanded Data"
= Table.ExpandTableColumn(
#"Removed Other Columns",
"Data",
List.Union(List.Transform(#"Removed Other Columns"[Data], each Table.ColumnNames(_)))
)
It expands all columns without referencing their names.

Play Framework: How to render a table structure from plain SQL table

I would be happy to get a good way to get the "table" structure from a plain SQL table.
In my specific case, I need to render JSON structure used by Google Visualization API "datatable" object:
http://code.google.com/apis/chart/interactive/docs/reference.html#DataTable
However, having an example in HTML would also help.
My "source" is a plain SQL table of "DailySales": its columns are "Day" (date), "Product" and "DailySaleTotal" (daily sale for that product). Please recall that my "model" reflects the 3-column table above.
The table columns should be "products" (suppose we have very small number of such). Each row should represent a specific date, and the row data are the actual sales for that day.
Date        Product1  Product2  Product3
01/01/2012  30        50        60
01/02/2012  35        3         15
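(For reference only: if one wanted to produce this pivoted shape directly in SQL rather than in the template, a conditional-aggregation sketch could look like the following; it assumes the product names are known and fixed.)

SELECT Day,
       SUM(CASE WHEN Product = 'Product1' THEN DailySaleTotal ELSE 0 END) AS Product1,
       SUM(CASE WHEN Product = 'Product2' THEN DailySaleTotal ELSE 0 END) AS Product2,
       SUM(CASE WHEN Product = 'Product3' THEN DailySaleTotal ELSE 0 END) AS Product3
FROM DailySales
GROUP BY Day
ORDER BY Day;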
I was trying to use nested #{list} tags in a template, but unfortunately I failed to find a natural way to provide a template with a "list" to represent the "row data".
Of course, I can build a "helper object" in Java that will build a list of the "sales data" items per date - but this looks very weird to me.
I would be thankful to anyone who can provide an elegant solution.
Max
When you load your model, order it by date and product name. Then, in your controller, build a map with the date as the key and the list of model objects sharing that date as the value.
Then, in your template, use a first list iteration over the map keys for the rows and a second list iteration over each list value for the columns.
Something like
[
#{list modelMap.keys, as: 'date'}
[${date},#{list modelMap.get(date), as: 'product'}${product.dailySaleTotal}#{ifnot product_isLast},#{/ifnot}#{/list}]#{ifnot date_isLast},#{/ifnot}
#{/list}
]
You can then adapt your JSON rendering to the exact structure you want to have. Here it is an array of arrays.
Instead of generating the JSON yourself, as Seb suggested, you can have it generated for you:
private static Result queryToJsonResult(String sql) {
    // run the raw SQL via Ebean and serialize the resulting rows to JSON
    SqlQuery sqlQuery = Ebean.createSqlQuery(sql);
    return ok(Json.toJson(sqlQuery.findList()));
}
