Efficient way to join by levenshtein in Hive or Impala

Efficient way to join by levenshtein in Hive or Impala - hadoop

I have two tables one includes about 17K (NLIST) records while the other 57K (FNAMES).
I would like to join the both by comparing the records using levenshtein formula.
Here is the example for the content of tables:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from NICKNAMES
LEFT OUTER JOIN NLIST
ON(true)
WHERE levenshtein (FNAMES.NICKNAMES, NLIST.S_NAME) <=4
The above code runs for a very long time and I stopped its running.
How can I make it run in a reasonable time?
In addition, I think the levenshtein distance depends on the length of the words. How can I find the optimal value for the distance (in this case I chose 4 arbitrarily)?

Hive Table performance is depends upon various point .
Query enginee
File format
use VECTORIZATION set hive.vectorized.execution.enabled = true;set hive.vectorized.execution.reduce.enabled = true;
If you have good server you can try with Impala and definitely it is faster than Hive.
You can do the fine tuning of impala which will give you an edge to execute this query faster .Tuning Impala for Performance

Related

How to properly apply index_desc hint for a remote database object in oracle database?

We were facing some issues with execution plans while accessing remote database objects with dblink. Here is the query itself run on the remote database:
select --+ index_desc (d DAY_OPERATIONAL_PK)
d.oper_day
from day_operational d
where rownum = 1
The plan for this query is the following :
Plan Hash Value : 2761870770
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 2 | 00:00:01 |
| * 1 | COUNT STOPKEY | | | | | |
| 2 | INDEX FULL SCAN DESCENDING | DAY_OPERATIONAL_PK | 1 | 8 | 2 | 00:00:01 |
---------------------------------------------------------------------------------------------
This one works correct that is it returns the last operational day. In this case 14.09.2021. However if execute this exact same query from other database connecting to this one via dblink, wrong results are returned . In this case the first row of the table is returned - 05.09.2009.
Here is the query:
select --+ index_desc (d DAY_OPERATIONAL_PK)
d.oper_day
from day_operational#iabs d
where rownum = 1
The plan generated for this query in local database is the following:
Plan Hash Value :
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT REMOTE | | 1 | 8 | 2 | 00:00:01 |
| * 1 | COUNT STOPKEY | | | | | |
| 2 | INDEX FAST FULL SCAN | XPKDAY_OPERATIONAL | 1 | 8 | 2 | 00:00:01 |
---------------------------------------------------------------------------------------
As it can be seen, the plan generated when connected via dblink uses full table scan and ignores index_desc hint. How could we enforce oracle to use this index? Tried adding driving_site hint but it didn't help

Sorry for confusions. It occured that index names in local database and remote database are not same and therefore hint was ignoring since there was no index with that name in the remote db. With right index names, both the plan and result were correct

How to implement Oracle's "func(...) keep (dense_rank ...)" In Hive

I have a table abcd in Oracle DB
+-------------+----------+
| abcd.speed | abcd.ab |
+-------------+----------+
| 4.0 | 2 |
| 4.0 | 2 |
| 7.0 | 2 |
| 7.0 | 2 |
| 8.0 | 1 |
+-------------+----------+
And I'm using a query like this:
select min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST) MOD from abcd;
I'm trying to convert the code to Hive, but it looks like keep is not available in Hive.
Could you suggest an equivalent statement?

select -max(struct(ab,-speed)).col2 as mod
from abcd
;
+------+
| mod |
+------+
| 4.0 |
+------+
Let start by explaining min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST):
Find the row(s) with the max value of ab.
For this/those row(s), find the min value of speed.
We are using 2 tricks here.
The 1st is based on the ability to get the max value of a struct.
max(struct(c1,c2,c3,...)) returns the same result as if you have sorted the structs by c1, then by c2, then by c3 etc. and then chose the last element.
The 2nd trick is to use -speed (which is the same of -1*speed).
Finding the max of -speed and then taking the minus of that value (which gives us speed), is the same of finding the min of speed.
If we would have ordered the structs, it would have looked like this (since 2 is bigger than 1 and -4 is bigger than -7):
+----+-------+
| ab | speed |
+----+-------+
| 1 | -8.0 |
| 2 | -7.0 |
| 2 | -7.0 |
| 2 | -4.0 |
| 2 | -4.0 |
+----+-------+
The last struct in this case in struct(2,-4.0), therefore this is the result of the max function.
The fields names for a struct are col1, col2, col3 etc., so
struct(2,-4.0).col2 is -4.0. and preceding it with minus (which is the same as multiple it by -1) as in -struct(2,-4.0).col2 is 4.0.

How to arrange multi partitions in hive?

say i have a order table, which contains multi time column(spend_time,expire_time,withdraw_time),
usually,i will query the table with the above column independently,so how do i create the partitions?
order_no | spend_time | expire_time | withdraw_time | spend_amount
A001 | 2017/5/1 | 2017/6/1 | 2017/6/2 | 100
A002 | 2017/4/1 | 2017/4/19 | 2017/4/25 | 500
A003 | 2017/3/1 | 2017/3/19 | 2017/3/25 | 1000
Usually the business situation is to calculate total spend_amount between certain spend_time or expire_time or withdraw_time, or the combination of the 3.
But with 3 time dimensions cross combination(each has about 1000 partitions) can be a lot of partitions(1000*1000*1000),is that ok and efficient?
my solution is that i create 3 tables with 3 different columns.Is this a efficient way to solve this problem?

How to Move a whole partition to another tabel on another database?

Database: Oracle 12c
I want to take single partition, or a set of partitions, disconnect it from a Table, or set of tables on DB1 and move it to another table on another database. I would like to avoid doing DML to do this for performance reasons (It needs to be fast).
Each Partition will contain between three and four hundred million records.
Each Partition will be broken up into approximately 300 Sub-Partitions.
The task will need to be automated.
Some thoughts I had:
Somehow put each partition in it's own datafile upon creation, then detaching from the source and attaching it to the destination?
Extract the whole partition (not record-by-record)
Any other non-DML Solutions are also welcom
Example (Move Part#33 from both to DB#2, preferably with a single, operation):
__________________ __________________
| DB#1 | | DB#2 |
|------------------| |------------------|
|Table1 | |Table1 |
| Part#1 | | Part#1 |
| ... | | ... |
| Part#33 | ----> | Part#32 |
| Subpart#1 | | |
| ... | | |
| Subpart#300 | | |
|------------------| |------------------|
|Table2 | |Table2 |
| Part#1 | | Part#1 |
| ... | | ... |
| Part#33 | ----> | Part#32 |
| Subpart#1 | | |
| ... | | |
| Subpart#300 | | |
|__________________| |__________________|

Please read the document below with all the examples of exchanging partitions of table.
https://oracle-base.com/articles/misc/partitioning-an-existing-table-using-exchange-partition

Display record count in listbox using multiple tables and fields

i need help with a query, can't get it to work correctly. What i'm trying to achieve is to have a select box displaying the number of records associated with a particular theme, for some theme it works well for some it displays (0) when infact there are 2 records, I'm wondering if someone could help me on this, your help would be greatly appreciated, please see below my actual query + table structure :
SELECT theme.id_theme, theme.theme, calender.start_date,
calender.id_theme1,calender.id_theme2, calender.id_theme3, COUNT(*) AS total
FROM theme, calender
WHERE (YEAR(calender.start_date) = YEAR(CURDATE())
AND MONTH(calender.start_date) > MONTH(CURDATE()) )
AND (theme.id_theme=calender.id_theme1)
OR (theme.id_theme=calender.id_theme2)
OR (theme.id_theme=calender.id_theme3)
GROUP BY theme.id_theme
ORDER BY theme.theme ASC
THEME table
|---------------------|
| id_theme | theme |
|----------|----------|
| 1 | Yoga |
| 2 | Music |
| 3 | Taichi |
| 4 | Dance |
| 5 | Coaching |
|---------------------|
CALENDAR table
|---------------------------------------------------------------------------|
| id_calender | id_theme1 | id_theme2 | id_theme3 | start_date | end_date |
|-------------|-----------|-----------|-----------|------------|------------|
| 1 | 2 | 4 | | 2015-07-24 | 2015-08-02 |
| 2 | 4 | 1 | 5 | 2015-08-06 | 2015-08-22 |
| 3 | 1 | 3 | 2 | 2014-10-11 | 2015-10-28 |
|---------------------------------------------------------------------------|
LISTBOX
|----------------|
| |
| Yoga (1) |
| Music (1) |
| Taichi (0) |
| Dance (2) |
| Coaching (1) |
|----------------|
Thanking you in advance

I think that themes conditions should be into brackets
((theme.id_theme=calender.id_theme1)
OR (theme.id_theme=calender.id_theme2)
OR (theme.id_theme=calender.id_theme3))
Hope this help

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Efficient way to join by levenshtein in Hive or Impala - hadoop

Related

How to properly apply index_desc hint for a remote database object in oracle database?

How to implement Oracle's "func(...) keep (dense_rank ...)" In Hive

How to arrange multi partitions in hive?

How to Move a whole partition to another tabel on another database?

Display record count in listbox using multiple tables and fields

Categories

Resources