Inconsistent handling of accents in binary fulltext search in MariaDB

I am puzzled by how MariaDB fulltext search handles accents. I find it inconsistent and I would like to understand why.
To illustrate the problem, let's create a test table like this:
CREATE TABLE `fulltext_test` (
`title` varchar(128) COLLATE utf8_czech_ci NOT NULL,
FULLTEXT KEY `title` (`title`)
) ENGINE=InnoDB AUTO_INCREMENT=277 DEFAULT CHARSET=utf8 COLLATE=utf8_czech_ci
I defined the collation because in the real-life table I need to sort by that column, so I really want utf8_czech_ci.
Let's insert one row:
INSERT INTO `fulltext_test` VALUES ('klíč');
Now let's test how it behaves. This is the expected behaviour:
> SELECT * FROM `fulltext_test` WHERE MATCH (`title`) AGAINST ("klíč" IN BOOLEAN MODE);
+--------+
| title |
+--------+
| klíč |
+--------+
1 row in set (0.00 sec)
And this is what puzzles me. From the first result below (a search for "klíc") I would say the fulltext search is accent-sensitive, but the second result (a search for "klič"; notice the subtle difference: i instead of í) proves it is not.
> SELECT * FROM `fulltext_test` WHERE MATCH (`title`) AGAINST ("klíc" IN BOOLEAN MODE);
Empty set (0.00 sec)
> SELECT * FROM `fulltext_test` WHERE MATCH (`title`) AGAINST ("klič" IN BOOLEAN MODE);
+--------+
| title |
+--------+
| klíč |
+--------+
1 row in set (0.00 sec)
Why is this happening? How can I configure it?

With the existing collations, I don't think there is a way to do either of these for Czech:
Always be insensitive to acute and caron accents, or
Always be sensitive to them.
Here is a clumsy workaround:
Add another column
search TEXT NOT NULL
then put into search a copy of the text to search, but with all the accents stripped off. Or at least all the carons stripped off. You can use a tedious set of REPLACE(...) functions to do so.
Then have that column have the FULLTEXT index, but the original column is what you display.
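As a sketch of that accent-stripped copy (the character list here is illustrative, not exhaustive; extend the nesting with whichever Czech diacritics matter to you):

```sql
-- Keep `search` as a diacritic-stripped copy of `title`.
-- Illustrative subset of Czech accented letters only.
UPDATE fulltext_test
SET search = REPLACE(REPLACE(REPLACE(REPLACE(
             REPLACE(REPLACE(REPLACE(REPLACE(title,
             'á', 'a'), 'č', 'c'), 'ď', 'd'), 'é', 'e'),
             'í', 'i'), 'ř', 'r'), 'š', 's'), 'ž', 'z');
```

You would run the same expression in a BEFORE INSERT/UPDATE trigger or in application code to keep the column in sync with title.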
Or...
It may suffice for search to be a copy of the original column, except for the collation:
search TEXT COLLATE utf8_bin NOT NULL
(and have FULLTEXT(search))
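Putting that second idea together (a sketch; the column and index names are mine):

```sql
-- Binary-collated shadow column: fulltext matching on it is byte-wise,
-- so it is sensitive to every accent, while `title` keeps utf8_czech_ci
-- for sorting and display.
ALTER TABLE fulltext_test
  ADD COLUMN search TEXT COLLATE utf8_bin NOT NULL,
  ADD FULLTEXT KEY search_ft (search);

UPDATE fulltext_test SET search = title;
```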

Related

Utilize function-based spatial index in SELECT list

I have an Oracle 18c table called LINES with 1000 rows. The DDL for the table can be found here: db<>fiddle.
The data looks like this:
create table lines (shape sdo_geometry);
insert into lines (shape) values (sdo_geometry(2002, 26917, null, sdo_elem_info_array(1, 2, 1), sdo_ordinate_array(574360, 4767080, 574200, 4766980)));
insert into lines (shape) values (sdo_geometry(2002, 26917, null, sdo_elem_info_array(1, 2, 1), sdo_ordinate_array(573650, 4769050, 573580, 4768870)));
insert into lines (shape) values (sdo_geometry(2002, 26917, null, sdo_elem_info_array(1, 2, 1), sdo_ordinate_array(574290, 4767090, 574200, 4767070)));
insert into lines (shape) values (sdo_geometry(2002, 26917, null, sdo_elem_info_array(1, 2, 1), sdo_ordinate_array(571430, 4768160, 571260, 4768040)));
...
I've created a function that's intentionally slow — for testing purposes. The function takes the SDO_GEOMETRY lines and outputs an SDO_GEOMETRY point.
create or replace function slow_function(shape in sdo_geometry) return sdo_geometry
deterministic is
begin
  return
    -- Deliberately make the function slow for testing purposes:
    -- convert from SDO_GEOMETRY to JSON and back, several times, for no reason.
    sdo_util.from_json(sdo_util.to_json(sdo_util.from_json(sdo_util.to_json(
    sdo_util.from_json(sdo_util.to_json(sdo_util.from_json(sdo_util.to_json(
    sdo_util.from_json(sdo_util.to_json(
      sdo_lrs.geom_segment_start_pt(shape)
    ))))))))));
end;
/
As an experiment, I want to create a function-based spatial index, as a way to pre-compute the result of the slow function.
Steps:
Create an entry in USER_SDO_GEOM_METADATA:
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'lines',
'infrastr.slow_function(shape)',
-- 🡅 Important: Include the function owner.
sdo_dim_array (
sdo_dim_element('X', 567471.222, 575329.362, 0.5), --note to self: these coordinates are wrong.
sdo_dim_element('Y', 4757654.961, 4769799.360, 0.5)
),
26917
);
commit;
Create a function-based spatial index:
create index lines_idx on lines (slow_function(shape)) indextype is mdsys.spatial_index_v2;
Problem:
When I use the function in the SELECT list of a query, the index isn't being used. Instead, it's doing a full table scan...so the query is still slow when I select all rows (CTRL+ENTER in SQL Developer).
You might ask, "Why select all rows?" Answer: That's how mapping software often works...you display all (or most) of the points in the map — all at once.
explain plan for
select
slow_function(shape)
from
lines
select * from table(dbms_xplan.display);
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 34 | 7 (0)| 00:00:01 |
| 1 | TABLE ACCESS FULL| LINES | 1 | 34 | 7 (0)| 00:00:01 |
---------------------------------------------------------------------------
Likewise, in my mapping software (ArcGIS Desktop 10.7.1), the map doesn't utilize the index either. I can tell, because the points are slow to draw in the map.
I'm aware that it's possible to create a view, and then register that view in USER_SDO_GEOM_METADATA (in addition to registering the index). And use that view in the map. I've tried that, but the mapping software still doesn't use the index.
I've also tried an SQL hint, but no luck — I don't think the hint is being used:
create or replace view lines_vw as (
select
/*+ INDEX (lines lines_idx) */
cast(rownum as number(38,0)) as objectid, --the mapping software needs a unique ID column
slow_function(shape) as shape
from
lines
where
slow_function(shape) is not null --https://stackoverflow.com/a/59581129/5576771
)
Question:
How can I utilize the function-based spatial index in the SELECT list in a query?
A spatial index is invoked only by the WHERE clause, not the SELECT list. A function in the SELECT list is invoked for every row returned by the WHERE clause, which in your case is SDO_ANYINTERACT( ) returning all rows.
You don't appear to be firing the index; just adding the function call as an attribute is insufficient:
select
slow_function(shape)
from
lines
Should be....
select slow_function(shape)
from lines
where sdo_anyinteract(
        slow_function(shape),
        sdo_geometry(2003, 26917, null,
                     sdo_elem_info_array(1, 1003, 3),
                     sdo_ordinate_array(1, 2, 3, 4))
      ) = 'TRUE';
Where 1,2,3,4 are the values of an optimized rectangle.
I tried using sdo_anyinteract() in the WHERE clause, as @SimonGreener suggested.
Unfortunately, the query still seems to be doing a full table scan (in addition to using the index). I was hoping to only use the index.
select
slow_function(shape) as shape
from
lines
where
sdo_anyinteract(slow_function(shape),
mdsys.sdo_geometry(2003, 26917, null, mdsys.sdo_elem_info_array(1, 1003, 1), mdsys.sdo_ordinate_array(573085.8702, 4771088.3813, 566461.6349, 4768833.3225, 570335.0629, 4757455.1278, 576959.2982, 4759710.1866, 573085.8702, 4771088.3813))
) = 'TRUE'
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 46 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID | LINES | 1 | 46 | 1 (0)| 00:00:01 |
|* 2 | DOMAIN INDEX (SEL: 0.000000 %)| LINES_IDX | | | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
PLAN_TABLE_OUTPUT
-----------------------------------
2 - access("MDSYS"."SDO_ANYINTERACT"("INFRASTR"."SLOW_FUNCTION"("SHAPE"),"MDSYS"."
SDO_GEOMETRY"(2003,26917,NULL,"MDSYS"."SDO_ELEM_INFO_ARRAY"(1,1003,1),"MDSYS"."SDO_OR
DINATE_ARRAY"(573085.8702,4771088.3813,566461.6349,4768833.3225,570335.0629,4757455.1
278,576959.2982,4759710.1866,573085.8702,4771088.3813)))='TRUE')
I played around with a SQL hint: /*+ INDEX (lines lines_idx) */. But that didn't seem to make a difference.
The function takes the SDO_GEOMETRY lines and outputs an SDO_GEOMETRY point.
A possible alternative might be:
Instead of returning/indexing a geometry column, maybe I could return/index X&Y numeric columns (using a regular non-spatial index). And then either:
A) Convert the XY columns to SDO_GEOMETRY after-the-fact in a query on-the-fly, or...
B) Use GIS software to display the XY data as points in a map. For example, in ArcGIS Pro, create an "XY Event Layer".
That technique seemed to work ok here: Improve performance of startpoint query (ST_GEOMETRY). I was able to utilize the function-based index (non-spatial) in a SELECT clause — making my query significantly faster.
Of course, that technique would work best for points — since converting XYs after-the-fact to point geometries is easy/performant/practical. Whereas converting lines or polygons (maybe from WKT?) to geometries after-the-fact likely wouldn't make much sense. Even if that were possible, it would likely be too slow, and defeat the purpose of precomputing the data in the function-based index in the first place.
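A rough sketch of option (A), under some stated assumptions: start_x/start_y are hypothetical wrapper functions I'm introducing here (they are not in the original post), and this assumes the start point returned by SDO_LRS.GEOM_SEGMENT_START_PT populates its SDO_POINT attribute (otherwise the coordinates would have to be pulled out with SDO_UTIL.GETVERTICES):

```sql
-- Hypothetical deterministic wrapper; start_y would be defined analogously.
create or replace function start_x(shape in sdo_geometry) return number
deterministic is
  pt sdo_geometry;
begin
  pt := sdo_lrs.geom_segment_start_pt(shape);
  return pt.sdo_point.x;  -- assumes SDO_POINT is populated
end;
/

-- Plain (non-spatial) function-based index on the numeric value:
create index lines_sx_idx on lines (start_x(shape));

-- Rebuild a point geometry on the fly in queries:
select sdo_geometry(2001, 26917,
         sdo_point_type(start_x(shape), start_y(shape), null),
         null, null) as shape
from lines;
```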

Oracle Domain index and sorting

The following query performs very poorly, due to the "order by". My goal is to get only a small subset of the resultset (using ROWNUM, for example). However, when I add "order by" it goes through the entire resultset performing an index lookup for each record, which makes it extremely slow. Without sorting the query is about 100 times faster when I limit the resultset to, for example, 1000 records.
QUERY:
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field;
THIS IS HOW I CREATED THE INDEX:
CREATE INDEX myindex ON mytable (text_field) INDEXTYPE IS ctxsys.context FILTER BY another_field
EXECUTION PLAN:
---------------------------------------------------------------
| Id | Operation | Name |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT ORDER BY | |
| 2 | TABLE ACCESS BY INDEX ROWID| MYTABLE |
|* 3 | DOMAIN INDEX | MYINDEX |
---------------------------------------------------------------
I also used CTXCAT instead of CONTEXT, and no improvement. I think the problem is, when I want the results sorted (only top 1000), it performs an index lookup for each record in the "entire" resultset. Is there a way to avoid that?
Thank you.
To have the ordering applied before the rownum filter, you need to use an in-line view:
SELECT text_field
from (
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field
)
where rownum <= 1000;
With your index in place Oracle should optimise this to do as little work as possible. You should see 'sort order by stopkey' and 'count stopkey' steps in the plan, which is Oracle being clever and knowing it only needs to get 1000 values from the index.
If you don't use the in-line view but just add the rownum to your original query it will still optimise it but as you state it will order the first 1000 random (or indeterminate, anyway) rows it finds, because of the sequence of operations it performs.
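If you are on Oracle 12c or later, the same top-N shape can also be written with the row-limiting clause, which the optimizer handles with the same stopkey-style optimisation (a sketch, assuming 12c+ is available):

```sql
-- Equivalent top-N query using the 12c row-limiting clause.
SELECT text_field
FROM mytable
WHERE contains(text_field, 'ABC', 1) > 0
ORDER BY another_field
FETCH FIRST 1000 ROWS ONLY;
```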

Oracle PLSQL: Performance issue when using TABLE functions

I am currently facing a performance problem when using the table functions. I will explain.
I am working with Oracle types and one of them is defined like below:
create or replace TYPE TYPESTRUCTURE AS OBJECT
(
ATTR1 VARCHAR2(30),
ATTR2 VARCHAR2(20),
ATTR3 VARCHAR2(20),
ATTR4 VARCHAR2(20),
ATTR5 VARCHAR2(20),
ATTR6 VARCHAR2(20),
ATTR7 VARCHAR2(20),
ATTR8 VARCHAR2(20),
ATTR9 VARCHAR2(20),
ATTR10 VARCHAR2(20),
ATTR11 VARCHAR2(20),
ATTR12 VARCHAR2(20),
ATTR13 VARCHAR2(10),
ATTR14 VARCHAR2(50),
ATTR15 VARCHAR2(13)
);
Then I have one table of this type like:
create or replace TYPE TYPESTRUCTURE_ARRAY AS TABLE OF TYPESTRUCTURE ;
I have one procedure with the following variables:
arr TYPESTRUCTURE_ARRAY;
arr2 TYPESTRUCTURE_ARRAY;
ARR contains a single instance of TYPESTRUCTURE with all its attributes set to NULL, except ATTR4, which is set to 'ABC'.
ARR2 is completely empty.
Here comes the part which is giving me the performance issue.
The purpose is to take some values from a view (depending on the value on ATTR4) and fill those in same or similar structure. So I do the following:
SELECT TYPESTRUCTURE(MV.A,null,null,MV.B,MV.C,MV.D,null,null,MV.E,null,null,MV.F,MV.F,MV.G,MV.H)
BULK COLLECT INTO arr2
FROM TABLE(arr) PARS
JOIN MYVIEW MV
ON MV.B = PARS.ATTR4;
The code here works correctly, except that the query takes 15 seconds to execute...
This query fills ARR2 with around 20 instances of TYPESTRUCTURE (i.e. rows).
It could look like there is a lot of data in the view. But what strikes me as strange is that if I change the query and hardcode the value, as below, then it is extremely fast (milliseconds):
SELECT TYPESTRUCTURE(MV.A,null,null,MV.B,MV.C,MV.D,null,null,MV.E,null,null,MV.F,MV.F,MV.G,MV.H)
BULK COLLECT INTO arr2
FROM (SELECT 'ABC' ATTR4 FROM DUAL) PARS
JOIN MYVIEW MV
ON MV.B = PARS.ATTR4;
In this new query I hardcode the value directly but keep the join, to test something as similar as possible to the query above, just without the TABLE() function.
So here is my question: is it possible that this TABLE() function creates such a big delay with only a single record inside? I would like to know whether someone can give me advice on what is wrong in my approach and whether there is some other way to achieve this.
Thanks!!
This problem is likely caused by a poor optimizer estimate for the number of rows returned by the TABLE function. The CARDINALITY or DYNAMIC_SAMPLING hints may be the best way to solve the problem.
Cardinality estimate
Oracle gathers statistics on tables and indexes in order to estimate the cost of accessing those objects. The most important estimate is how many rows will be returned by an object. Procedural code does not have statistics, by default, and Oracle does not make any attempt to parse the code and estimate how many rows will be produced. Whenever Oracle sees a procedural row source it uses a static number. On my database, the number is 16360. On most databases the estimate is 8192, as beherenow pointed out.
explain plan for
select * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 16360 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 16360 |
--------------------------------------------------------------
Fix #1: CARDINALITY hint
As beherenow suggested, the CARDINALITY hint can solve this problem by statically telling Oracle how many rows to estimate.
explain plan for
select /*+ cardinality(1) */ * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 16360 |
--------------------------------------------------------------
Fix #2: DYNAMIC_SAMPLING hint
A more "official" solution is to use the DYNAMIC_SAMPLING hint. This hint tells Oracle to sample some data at run time before it builds the explain plan. This adds some cost to building the explain plan, but it will return the true number of rows. This may work much better if you don't know the number ahead of time.
explain plan for
select /*+ dynamic_sampling(2) */ * from table(sys.odcinumberlist(1,2,3));
select * from table(dbms_xplan.display(format => 'basic +rows'));
Plan hash value: 2234210431
--------------------------------------------------------------
| Id | Operation | Name | Rows |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 3 |
--------------------------------------------------------------
But what's really slow?
We don't know exactly what was slow in your query. But whenever things are slow, it's usually best to focus on the worst cardinality estimate. Row estimates are never perfect, but being off by several orders of magnitude can have a huge impact on an execution plan. In the simplest case it may change an index range scan into a full table scan.
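Applied to the original query, the first fix would look something like this (a sketch; the alias form of the hint is used so it targets the TABLE() row source):

```sql
-- Tell the optimizer the collection holds ~1 row instead of ~8192.
SELECT /*+ cardinality(pars 1) */
       TYPESTRUCTURE(MV.A, null, null, MV.B, MV.C, MV.D, null, null,
                     MV.E, null, null, MV.F, MV.F, MV.G, MV.H)
BULK COLLECT INTO arr2
FROM TABLE(arr) PARS
JOIN MYVIEW MV
  ON MV.B = PARS.ATTR4;
```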

Efficiently query table with conditions including array column in PostgreSQL

I need to come up with a way to efficiently execute a query with an array column and an integer column in the WHERE clause, ordered by a timestamp column. Using PostgreSQL 9.2.
The query we need to execute is:
SELECT id
from table
where integer = <int_value>
and <text_value> = any (array_col)
order by timestamp
limit 1;
int_value is an integer value, and text_value is a 1 - 3 letter text value.
The table structure is like this:
Column | Type | Modifiers
---------------+-----------------------------+------------------------
id | text | not null
timestamp | timestamp without time zone |
array_col | text[] |
integer | integer |
How should I design indexes / modify the query to make it as efficient as possible?
Thanks so much! Let me know if more information is needed and I'll update ASAP.
PG can use indexes on arrays, but you have to use array operators for that, so instead of <text_value> = any (array_col) use array_col @> ARRAY[<text_value>] (https://stackoverflow.com/a/4059785/2115135). You can use the command SET enable_seqscan=false; to force PG to use indexes where possible, to check whether the ones you created are valid. Unfortunately a GIN index can't be created on an integer column, so you will have to create two different indexes for those two columns.
See the execution plans here: http://sqlfiddle.com/#!12/66a71/2
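The two-index approach from this answer would look roughly like this (index names are mine; the table and column names are the placeholders from the question, quoted because they are reserved words):

```sql
-- Btree index for the equality filter, GIN index for the array containment.
CREATE INDEX tbl_integer_idx ON "table" ("integer");
CREATE INDEX tbl_array_idx   ON "table" USING gin (array_col);

SELECT id
FROM "table"
WHERE "integer" = 42
  AND array_col @> ARRAY['abc']
ORDER BY "timestamp"
LIMIT 1;
```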
Unfortunately a GIN index can't be created on an integer column so you will have to create two different indexes for those two columns.
That's not entirely true: you can use btree_gin or btree_gist
-- btree_gin adds GIN operator classes for scalar types such as integer,
-- so a single GIN index can cover both the integer and the array column
-- (plain GiST has no built-in operator class for text[]).
CREATE EXTENSION btree_gin;
CREATE INDEX ON "table" USING gin ("integer", array_col);
VACUUM ANALYZE "table";
Now you can run the operation on the index itself:
SELECT id
FROM "table"
WHERE "integer" = ? AND array_col @> ARRAY[?]
ORDER BY "timestamp"
LIMIT 1;

Will this type of pagination scale?

I need to paginate on a set of models that can/will become large. The results have to be sorted so that the latest entries are the ones that appear on the first page (and then, we can go all the way to the start using 'next' links).
The query to retrieve the first page is the following, 4 is the number of entries I need per page:
SELECT "relationships".* FROM "relationships" WHERE ("relationships".followed_id = 1) ORDER BY created_at DESC LIMIT 4 OFFSET 0;
Since this needs to be sorted and since the number of entries is likely to become large, am I going to run into serious performance issues?
What are my options to make it faster?
My understanding is that an index on 'followed_id' will simply help the where clause. My concern is on the 'order by'
Create an index that contains these two fields in this order (followed_id, created_at)
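That first suggestion is simply (a sketch; the index name is mine):

```sql
-- Composite index: serves the equality filter on followed_id and
-- lets the ORDER BY created_at read rows in index order.
CREATE INDEX idx_followed_created
  ON relationships (followed_id, created_at);
```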
Now, how large is the "large" we are talking about here? If it will be on the order of millions, how about something like the following:
Create an index on keys followed_id, created_at, id (This might change depending upon the fields in select, where and order by clause. I have tailor-made this to your question)
SELECT relationships.*
FROM relationships
JOIN (SELECT id
      FROM relationships
      WHERE followed_id = 1
      ORDER BY created_at DESC
      LIMIT 10 OFFSET 10) itable
  ON relationships.id = itable.id
ORDER BY relationships.created_at DESC
An explain would yield this:
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | relationships | ref | sample_rel2 | sample_rel2 | 5 | | 1 | Using where; Using index |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
If you examine carefully, the sub-query containing the order, limit and offset clauses will operate on the index directly instead of the table and finally join with the table to fetch the 10 records.
It makes a difference once your query reaches something like LIMIT 10 OFFSET 10000: without this trick it reads all 10010 records from the table and discards the first 10000, whereas the sub-query restricts that traversal to just the index.
An important note: I tested this in MySQL. Other database might have subtle differences in behavior, but the concept holds good no matter what.
You can index these fields, but it depends:
You can assume (mostly) that created_at is already in insertion order, so indexing it might be unnecessary. But that depends on your app.
In any case, you should index followed_id (unless it's the primary key).
