Elasticsearch for uncertain amount i18n

I am using elasticsearch-php in a Laravel project.
I have a products table like the one below:
CREATE TABLE `products` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`title` JSON NOT NULL,
PRIMARY KEY (`id`)
);
Each product's title can be stored in several languages, and the set of languages is not known in advance:
+------------+---------------------------------------------------------------------------+
| id | title |
+------------+---------------------------------------------------------------------------+
| 1 |{"en-US":"Toyota Cruiser","ja-JP":"トヨタクルーザー","zh-CN":"丰田酷路泽"} |
+------------+---------------------------------------------------------------------------+
| 2 |{"en-US":"Subaru Outback","ja-JP":"スバルアウトバック"} |
+------------+---------------------------------------------------------------------------+
| 3 |{"zh-CN":"路虎 揽胜","ja-JP":"ランドローバーレンジローバー"} |
+------------+---------------------------------------------------------------------------+
| 4 |{"en-US":"BMW X5"} |
+------------+---------------------------------------------------------------------------+
How can I create an Elasticsearch index for the products table that supports i18n search?
Thanks a lot!

Your database already stores a language identifier (en-US, zh-CN, and so on) with each title, which means you know the language of every title before you index it. You can therefore use multi-fields and add one language sub-field for each available language in Elasticsearch.
Have a look at the languages Elasticsearch currently supports, then either add all the sub-fields up front or update the mapping with a new language sub-field whenever a new language appears in your system.
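A minimal sketch of that idea, written in Python only for illustration (the resulting request bodies are plain JSON and could equally be sent through elasticsearch-php). Because each locale carries a different string, the sketch maps `title` as an object with one text property per locale rather than as a literal multi-field, which is a variation on the suggestion above. The locale-to-analyzer table and both function names are hypothetical, and the body shape assumes typeless mappings (Elasticsearch 7.x or later). `english` and `cjk` are built-in analyzers; unknown locales fall back to `standard`.

```python
# Hypothetical locale -> analyzer table; extend it as new languages appear.
ANALYZERS = {"en-US": "english", "ja-JP": "cjk", "zh-CN": "cjk"}


def build_index_body(locales):
    """Build an index-creation body with one text sub-field per locale."""
    properties = {
        locale: {"type": "text", "analyzer": ANALYZERS.get(locale, "standard")}
        for locale in locales
    }
    return {"mappings": {"properties": {"title": {"properties": properties}}}}


def build_search_body(text, locales):
    """Build a multi_match query that searches every per-locale title field."""
    fields = [f"title.{locale}" for locale in locales]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


index_body = build_index_body(["en-US", "ja-JP", "zh-CN"])
search_body = build_search_body("Toyota", ["en-US", "ja-JP", "zh-CN"])
```

With this layout a product document can be indexed with its `title` JSON object as stored in the database, and a locale that first shows up later only needs one more property added to the mapping.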

Related

MemSQL search performance: JSON vs varchar

I'm running some MemSQL performance tests on sample data and see very poor performance when querying JSON data.
I have 2 tables looking very similar and containing exactly the same information (loaded from the same csv file). The difference is that the segments column is JSON vs varchar(255).
CREATE TABLE `test_events` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` JSON COLLATE utf8_bin NOT NULL,
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
);
CREATE TABLE `test_events_string` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
);
Example data (the number of items in the array varies from 1 to 20):
memsql> select * from test_events limit 1;
+---------------------+---------+------------------------+
| timestamp | user_id | segments |
+---------------------+---------+------------------------+
| 2017-01-04 00:00:00 | 26834 | [19,18,9,6,7,22,34,43] |
+---------------------+---------+------------------------+
Below are two queries that fetch the same information, but at very different speeds. Each query was executed twice; the output shown is from the second run:
memsql> select count(*) from test_events where json_array_contains_double(segments, 42);
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (15.86 sec)
memsql> select count(*) from test_events_string where segments like '%42%';
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (1.96 sec)
memsql> select count(*) from test_events;
+-----------+
| count(*) |
+-----------+
| 306939340 |
+-----------+
1 row in set (0.02 sec)
So the JSON scan is 8 times slower than a LIKE '%x%' scan. Is there anything that can improve it?
Or can you advise a different approach to the underlying business problem? We log events for users, and to each event we want to attach an array of IDs of some entities. That array changes frequently during a user's lifecycle. We want to run queries that filter by one or many IDs, much like the example above.
Some tech specs, just in case: 3 identical bare-metal servers, 1 for the aggregator and 2 for data. Each machine has NUMA, so there are 4 leaf nodes in total. Fast SSDs, 32 cores (2 x E5-2650v2 @ 2.60GHz), 32GB RAM.

I'm not surprised that this is slow.
MemSQL uses Parquet-based compression for columnar JSON, and we don't have these sorts of fast lookups quite yet (but stay tuned!).
There are a few options.
One is, if you're always going to be searching for 42, to use a persisted computed column (https://docs.memsql.com/docs/persistent-computed-columns).
This seems unlikely to be your use case.
The other option, if you are always looking at the same array, is to create a normalized table (https://en.wikipedia.org/wiki/Database_normalization).
Something like
create table test_events_array (timestamp datetime not null, user_id bigint not null, segment bigint, shard key (user_id), key (timestamp) using clustered columnstore);
Then
select count(*) from test_events_array where segment = 42;
will be lightning fast.
It'll probably also compress down to almost nothing with that schema.
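As a rough illustration of the normalization step, the sketch below turns each event row into one row per array element, which is the shape the test_events_array table expects. It is written in Python purely as an example; the function name `explode_segments` is hypothetical, and the sample row is the one shown in the question.

```python
import json


def explode_segments(events):
    """Yield one (timestamp, user_id, segment) row per element of the JSON array."""
    for timestamp, user_id, segments_json in events:
        for segment in json.loads(segments_json):
            yield (timestamp, user_id, segment)


# Sample row taken from the question's test_events table.
events = [("2017-01-04 00:00:00", 26834, "[19,18,9,6,7,22,34,43]")]
rows = list(explode_segments(events))
```

Each input event produces as many output rows as its array has items, so filtering by an ID becomes a plain equality predicate on the segment column.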

Force use of primary key in Oracle during search

I have a scenario where I need to search and display records from huge tables with lots of rows. The tables have predefined search criteria: the user fills in the filters and clicks Search.
Consider a sample table:
CREATE TABLE suppliers
( supplier_name varchar2(50) NOT NULL,
address varchar2(50),
city varchar2(50) NOT NULL,
state varchar2(25),
zip_code varchar2(10),
CONSTRAINT "suppliers_pk" PRIMARY KEY (supplier_name, city)
);
INSERT INTO suppliers VALUES ('ABCD','XXXX','YYYY','ZZZZ','95012');
INSERT INTO suppliers VALUES ('EFGH','MMMM','NNNN','OOOO','95010');
INSERT INTO suppliers VALUES ('IJKL','EEEE','FFFF','GGGG','95009');
The search fields I give the user are the primary key columns, supplier_name and city.
If the user fills in both fields, the query performs well because it uses an index scan:
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' and city = 'ZZZZ';
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 102 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| SUPPLIERS | 1 | 102 | 1 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | suppliers_pk | 1 | | 1 (0)| 00:00:01 |
However, if the user fills in only one of the search fields, the query performs badly because it falls back to a full table scan:
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' ;
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 102 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| SUPPLIERS | 1 | 102 | 3 (0)| 00:00:01 |
Is there a way to force Oracle to treat this as a primary key search when not all of the key fields are supplied? Something like the following, which obviously does not work:
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' and city = city;
Thanks.

You are thinking about this in the wrong way.
The query optimiser chooses what it thinks is the best execution plan for the query, based on the information available at the time the query is parsed (or sometimes when the parameters change). Generally, if you give it the right information (statistics and so on), it will do a good job.
You might think that you know better than it does, but remember that you won't be monitoring this for the life of the database. The data changes, and you want the database to be able to react and change the execution plan when it needs to.
That said, if you are set on forcing it to use the index, you can use a hint:
SELECT /*+ INDEX(suppliers suppliers_pk) */
supplier_name, address, city, state, zip_code FROM suppliers where
supplier_name = 'ABCD' ;

A full table scan is not necessarily bad. You have only a few rows in your table, so the optimizer thinks a full table scan is better than an index range scan. It will start using the PK index as soon as the RDBMS thinks that is better, i.e. when you have lots of rows and the restriction on a certain supplier reduces the result significantly. If you want to search on city only instead of supplier, you need another index on city only (or at least one starting with city). Keep in mind that you might have to update the table statistics after you have bulk-loaded data into the table. It is always important to test query performance with reasonably realistic amounts of data.

The index is organised first on supplier_name and second on city, so it cannot be used for a query based on city only.
Create a second index on city alone; that will help such a query.
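The leading-column rule can be seen in a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in, so it only illustrates the general principle: SQLite's planner is not Oracle's, which has additional access paths and cost-based decisions. The index name `idx_suppliers_city` and the helper `plan` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE suppliers (
        supplier_name TEXT NOT NULL,
        address TEXT,
        city TEXT NOT NULL,
        state TEXT,
        zip_code TEXT,
        PRIMARY KEY (supplier_name, city)
    )
""")


def plan(sql):
    """Return the detail text of the first step reported by EXPLAIN QUERY PLAN."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]


# Filter on the leading key column: the composite primary key index is usable.
by_name = plan("SELECT * FROM suppliers WHERE supplier_name = 'ABCD'")
# Filter on the second key column only: the composite index does not help.
by_city = plan("SELECT * FROM suppliers WHERE city = 'YYYY'")

# A separate index that starts with city makes the second query indexable.
conn.execute("CREATE INDEX idx_suppliers_city ON suppliers (city)")
by_city_indexed = plan("SELECT * FROM suppliers WHERE city = 'YYYY'")

for step in (by_name, by_city, by_city_indexed):
    print(step)
```

In SQLite the first and third plans are reported as SEARCH steps that use an index, while the second is a SCAN of the whole table.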

Avoid full table scan on a table Oracle

I have a table with 70 columns whose primary key is a combination of 15 columns (a mix of NUMBER and VARCHAR2). Please see the query below:
select * from tab1 where k1=1234567889;
Plan hash value: 1179808636
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6044 | 2201K| 4585K (1)| 15:17:04 |
|* 1 | TABLE ACCESS FULL| tab1 | 6044 | 2201K| 4585K (1)| 15:17:04 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("K1"=30064825087)
Here tab1 is the table mentioned above and k1 is a column that is part of the primary key. The table is not partitioned, and it was analyzed (table, indexes and columns) after the data was inserted. The query returns more than 100,000 records. The problem is that even though k1 is part of the PK, the query does a full table scan, which is not acceptable. Using index hints does not really speed it up either.
Please advise on a possible solution.

For this query:
select *
from tab1
where k1 = 1234567889;
The best index is one that has k1 as its first key. It can be a composite index, but k1 has to be the first key. It sounds like you have a composite primary key and k1 is not the first key.
I would recommend that you simply define another index:
create index idx_tab1_k1 on tab1(k1);

There are several ways to avoid a full table scan:
Indexes: ensure that indexes exist on the key value and that the index has been analyzed with dbms_stats.
use_nl hint: you can direct the optimizer to use a nested loops join (which requires indexes).
index hint: you can specify the indexes that you want to use.

DataMapper naming convention clashes with existing MySQL table

I am working with a MySQL table that is used by another program. I want to put a web interface on this database using Sinatra and DataMapper. However, when I declare my properties in the DataMapper model, I run into a problem with the naming convention.
For example, the field in the MySQL table is ControlStationID and I declared it as such, but when DataMapper runs it changes the name to control_station_id. Is there any way to fix this? I can't change the table structure.
Thanks.
Error seen:
DataObjects::SQLError: Unknown column 'control_station_id' in 'field list' (code: 1054, sql state: 42S22, query: SELECT `id`, `control_station_id` FROM `returnmessage` ORDER BY `id`)
MySQL table structure
mysql> show fields from returnmessage;
+-------------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+----------------+
| ID | bigint(20) | NO | PRI | NULL | auto_increment |
| ControlStationID | int(11) | YES | | NULL | |
+-------------------+---------------+------+-----+---------+----------------+
My code
class ReturnMessage
  include DataMapper::Resource
  property :ID, Serial
  property :ControlStationID, Integer
end
repository(:default).adapter.resource_naming_convention = lambda do |value|
  value.downcase
end

Solution: set the column name explicitly with the :field option, keeping the Integer type from the model:
property :ControlStationID, Integer, :field => 'ControlStationID'
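The rename seen in the error message is consistent with an underscore-style field naming convention being applied to the property name. The sketch below approximates such a conversion in Python, purely to show why ControlStationID ends up as control_station_id; it is not DataMapper's actual code, and the function name `underscore` is chosen here for illustration.

```python
import re


def underscore(name):
    """Approximate a CamelCase -> snake_case naming convention."""
    # Split an acronym from a following capitalized word, e.g. "HTTPServer".
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital.
    name = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", name)
    return name.lower()


converted = underscore("ControlStationID")
print(converted)
```

An explicit :field option sidesteps any such conversion, because the column name is then taken literally.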

How should i design my tables for concurrent table scan access?

I need to hold multiple pairs of 70,000-row sets and compute a difference between the two sets of each pair using the MINUS operator. Comparisons (table scans) could be running at any time.
I currently have one table with this sort of design:
primary key (sequenced)
foreign key to identify set
key to identify set #1 or set #2
the data that I need to run the MINUS against
The data would look something like this:
| PK | FK | Key | Data |
| 1 | 1 | Left | Some data |
| 1 | 1 | Left | Diff data |
| 1 | 1 | Right | Some data |
My query would be:
SELECT data
FROM diffTable
WHERE pk = 1
AND fk = 1
AND key = 'Left'
MINUS
SELECT data
FROM diffTable
WHERE pk = 1
AND fk = 1
AND key = 'Right'
I fear the full table scans will monopolise resources, and that subsequent inserts and MINUS queries will suffer.
How should I design my tables and why?

create index PK_FK on diffTable
(PK, FK, Key);
The query you posted in your question would run very fast with this index.
By the way, the PK column is not, by itself, the primary key (see the other comments). Perhaps you want:
alter table diffTable
add constraint PK_FK primary key (PK, FK, Key);
though you may want to pick a better name.
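A small, self-contained sketch of the query and the composite index working together, using Python's built-in sqlite3 module as a stand-in for the real database. Two assumptions to note: SQLite spells MINUS as EXCEPT, and its planner is not the one the question is about, so this shows the shape of the solution rather than its performance. The helper name `left_minus_right` is hypothetical; the rows are the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "key" is quoted because it is also an SQL keyword.
conn.execute('CREATE TABLE diffTable (pk INTEGER, fk INTEGER, "key" TEXT, data TEXT)')
conn.execute('CREATE INDEX pk_fk ON diffTable (pk, fk, "key")')
conn.executemany(
    "INSERT INTO diffTable VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Left", "Some data"),
        (1, 1, "Left", "Diff data"),
        (1, 1, "Right", "Some data"),
    ],
)


def left_minus_right(conn, pk, fk):
    """Return data values present on the Left side but not on the Right side."""
    sql = """
        SELECT data FROM diffTable WHERE pk = ? AND fk = ? AND "key" = 'Left'
        EXCEPT
        SELECT data FROM diffTable WHERE pk = ? AND fk = ? AND "key" = 'Right'
    """
    return [row[0] for row in conn.execute(sql, (pk, fk, pk, fk))]


difference = left_minus_right(conn, 1, 1)
print(difference)
```

Both halves of the query filter on exactly the indexed columns, so each side can be located through the index instead of by scanning every stored pair.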
