Search for similar words using an index - oracle

I need to search over a DB table using some kind of fuzzy search like the one from Oracle, and it has to use indexes since I do not want a table scan (there is a lot of data).
I want to ignore case, language-specific characters (ñ, ß, ...), and special characters like _, (), -, etc.
Search for "maria (cool)" should get "maria- COOL" and "María_Cool" as matches.
Is that possible in Oracle in some way?
About the case, I think it can be solved by creating the index directly in lower case and always searching lower-cased. But I do not know how to solve the special-characters issue.
I thought about storing the data without special characters in a separate column, searching on that and returning the real one, but I am not 100% sure whether that is the right solution.
Any ideas?

Maybe UTL_MATCH can help.
But you can also create a function-based index on, let's say, something like this:
regexp_replace(your_column, '[^0-9a-zA-Z]+', ' ')
And try to match like this:
...
WHERE regexp_replace(your_column, '[^0-9a-zA-Z]+', ' ') =
regexp_replace('maria (cool)' , '[^0-9a-zA-Z]+', ' ')
Here is a sqlfiddle demo. It's not complete, but it can be a start.
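For the index itself, a minimal sketch could look like the following (your_table, your_column and the index name are placeholders; folding LOWER() into the same expression would also cover the case-insensitive part, as long as the query repeats the exact same expression):
-- Hypothetical names; the indexed expression must match the WHERE clause expression exactly.
CREATE INDEX your_column_norm_idx
  ON your_table (regexp_replace(your_column, '[^0-9a-zA-Z]+', ' '));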

Related

oracle - can I use contain and near with a clob? Need to speed up query

We have a query that takes 48 minutes to run a search on a clob. The query is written as if it is not a clob column and uses contains and near. This search for 3 words within a certain word distance of each other is important. I need to speed this up and want to put an index on the clob, but don't know if that would work and don't fully understand how to do it. I found this from Tom Burleson
http://www.dba-oracle.com/t_clob_search_query.htm OR https://asktom.oracle.com/pls/apex/asktom.search?tag=oracle-text-contains-search-with-near-is-very-slow
, but can't figure out how to do it with contains and near to enable the search for 3 words within a certain distance of each other.
current script:
SELECT clob_field
FROM clob_table
WHERE contains(clob_field,'NEAR (((QUICK),(FOX),(LAZY)),5)') > 0;
I want to use something like this, if it will act like indexing:
SELECT clob_field
FROM clob_table
WHERE contains(dbms_lob.substr(clob_field,'near(((QUICK),(FOX),(LAZY)),5)')) > 0;
If not, I need to do indexing, but I don't quite understand how to use CTXCAT and CONTEXT (https://docs.oracle.com/cd/A91202_01/901_doc/text.901/a90122/ind4.htm). I also don't like what I read here that says that if one uses CTXCAT for indexing a clob you have to use CONTEXT, or something like that. It can't affect the other queries that are done on this field.
Thanks in advance!
CONTAINS won't work unless it is globally indexed, so I had to index the field, and then I could get the original query working.
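For reference, a sketch of the kind of Oracle Text index that makes CONTAINS work here (the index name is a placeholder; storage and sync options depend on your environment):
-- CONTAINS needs an Oracle Text index on the column; hypothetical index name.
CREATE INDEX clob_field_ctx_idx ON clob_table (clob_field)
  INDEXTYPE IS CTXSYS.CONTEXT;
-- The original NEAR query can then use this index:
SELECT clob_field
FROM clob_table
WHERE contains(clob_field, 'NEAR (((QUICK),(FOX),(LAZY)),5)') > 0;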

searching in CLOB for words in a list/table

I have a large table with a clob column (+100,000 rows) from which I need to search for specific words within a certain timeframe.
select id, clob_field,
       dbms_lob.instr(clob_field, '.doc', 1, 1)  as doc,   -- ideally want .doc
       dbms_lob.instr(clob_field, '.docx', 1, 1) as docx,  -- ideally want .docx
       dbms_lob.instr(clob_field, '.DOC', 1, 1)  as DOC,   -- ideally want .DOC
       dbms_lob.instr(clob_field, '.DOCX', 1, 1) as DOCX   -- ideally want .DOCX
from clob_table, search_words s
where (to_char(date_entered, 'DD-MON-YYYY')
       between to_date('01-SEP-2018') and to_date('30-SEP-2018'))
  and (contains(clob_field, s.words) > 0);
The set of words is '.doc', '.DOC', '.docx', and '.DOCX'. When I use CONTAINS() it seems to ignore the dot, and so it gives me lots of rows, but not ones with the document extensions in them. It finds emails with .doc as part of the address, so the doc will have a period on either side of it.
i.e. mail.doc.george#here.com
I don't want those occurrences. I have tried it with a space at the end of the word and it ignores the spaces. I have put these in a search table I created, as shown above, and it still ignores the spaces. Any suggestions?
Thanks!!
Here are two suggestions.
The simple, inefficient way is to use something besides CONTAINS. Context indexes are notoriously tricky to get right. So instead of the last line, you could do:
AND regexp_instr(clob_field, '\.docx', 1,1,0,'i') > 0
I think that should work, but it might be very slow, which is when you'd use an index. But Oracle Text indexes are more complicated than normal indexes. This old doc explains that punctuation characters (as defined in the index parameters) are not indexed, because the point of Oracle Text is to index words. If you want special characters to be indexed as part of a word, you need to add them to the set of printjoins characters. This doc explains how, but I'll paste it here. You need to drop your existing CONTEXT index and re-create it with this preference:
begin
  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
  ctx_ddl.set_attribute('mylex', 'printjoins', '._-'); -- periods, underscores, dashes can be parts of words
end;
/
CREATE INDEX myindex on clob_table(clob_field) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('LEXER mylex');
Keep in mind that CONTEXT indexes are case-insensitive by default; I think that's what you want, but FYI you can change it by setting the 'mixed_case' attribute to 'Y' on the lexer, right below where you set the printjoins attribute above.
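For reference, a hedged sketch of that change (only if you actually want case-sensitive matching; set the attribute before creating the index):
begin
  ctx_ddl.set_attribute('mylex', 'mixed_case', 'Y'); -- default behaviour is case-insensitive
end;
/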
Also it seems like you're trying to search for words which end in .docx, but CONTAINS isn't INSTR - by default it matches entire words, not strings of characters. You'd probably want to modify your query to do AND contains(clob_field, '%.docx')>0
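Putting the pieces together, a sketch of the kind of query you might end up with against the re-created index (the plain date literals here are an assumption, replacing the to_char comparison from the question):
SELECT id, clob_field
FROM clob_table
WHERE date_entered BETWEEN DATE '2018-09-01' AND DATE '2018-09-30'
  AND contains(clob_field, '%.docx OR %.doc') > 0; -- wildcard word search; relies on the printjoins lexer above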

VB 6 Advanced Find Method in View Code

Let's say I have 3 tables: A, B, C.
For every table I have some insert queries.
I want to use Find (Ctrl+F) to find every insert query with a certain format.
Example: I want to find code that contains "insert [table_name] value" no matter what the table name is (A, B, or C), so I want to search for some code but skip the word in the middle of it.
I have been googling with various keywords, but I don't get any solution that is even close to what I want.
Is it possible to do something like this?
You need to use what are known as "wildcard" characters.
In the find window, you'll notice there is a check box called "Use Pattern Matching".
If you check this, then you can use some special characters to expand your search.
? is a wildcard that indicates any character can take this place.
* is a wildcard that indicates a string of any length could take this place.
e.g. ca? would match cat, car, cam, etc.
ca* would match cat, car, catastrophe, called, etc.
So something along the lines of insert * value should find what you are interested in.

Reversing a string using an index in Oracle

I have a table that has IDs and strings, and I need to be able to properly index searches for the end of the strings. The way we currently handle it is by copying the information into another table, reversing each string, and indexing it normally. What I would like to do is use some kind of index that allows searching in reverse.
Example
Data:
F7421kFSD1234
d7421kFSD1235
F7541kFSD1236
d7421kFSD1234
F7421kFSD1235
b8765kFSD1235
d7421kFSD1234
The way our users usually input their search is something along the lines of...
*1234
By reversing the strings (and the search string: 4321*) I could find what I am looking for without completely scanning the whole table. My question is: Is making a second table the best way of doing this?
Is there a way to reverse index?
I've tried an index like this...
create index REVERSE_STR_IDX on TABLE(STRING) REVERSE;
but Oracle doesn't seem to be using it, according to the explain plan.
Thanks in advance for the help.
Update:
I did have a problem with Unicode characters not being reversed correctly. The solution was to cast them.
Example:
select REVERSE(cast(string AS varchar2(2000)))
from tbl
where id = 1
There is a myth that a reverse-key index can be used for that; however, I've never seen that in action.
I would try a "manual" function-based index.
CREATE INDEX REVERSE_STR_IDX on TBL(reverse(string));
SELECT *
FROM TBL
WHERE reverse(string) LIKE '4321%';
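As a usage note, if the part after the user's leading * arrives as a bind variable, the same reversal can be applied to it in the query; a sketch assuming the wildcard has already been stripped in the application:
SELECT *
FROM TBL
WHERE reverse(string) LIKE reverse(:search_term) || '%'; -- :search_term = '1234' for a *1234 search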

Full-text search in Postgres or CouchDB?

I took geonames.org and imported all their data of German cities with all districts.
If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :(
The city with all of its districts works as an autocomplete, so each keystroke results in an XHR request, and so on.
Now my customer asked whether it is possible to have all the data of the world in it; in the end, that is about 5,000,000 rows with 45,000,000 alternative names, etc.
Postgres needs about 3 seconds per query, which makes the autocomplete unusable.
Now I thought of CouchDB, which I have already worked with. My question:
I would like to post "Ham" and I want CouchDB to get all documents starting with "Ham". If I enter "Hamburg" I want it to return Hamburg and so forth.
Is CouchDB the right database for this? Which other DBs can you recommend that respond with low latency (maybe in-memory) over millions of records? The dataset doesn't change regularly; it's rather static!
If I understand your problem right, probably all you need is already built into CouchDB.
To get a range of documents with names beginning with, e.g., "Ham", you can use a request with a string range: startkey="Ham"&endkey="Ham\ufff0"
If you need a more comprehensive search, you may create a view containing names of other places as keys. So you again can query ranges using the technique above.
Here is a view function to make this:
function(doc) {
  for (var name in doc.places) {
    emit(name, doc._id);
  }
}
Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.
Optimized search with PostgreSQL
Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.
If it gets more fuzzy or your search is not anchored at the start, look here for more:
Similar UTF-8 strings for autocomplete field
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular look at operator classes and indexes on expressions.
1) text_pattern_ops
Assuming your column is of type text, you would use a special index for text pattern operators like this:
CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);
SELECT name
FROM tbl
WHERE name ~~ ('Hambu' || '%');
This is assuming that you operate with a database locale other than C - most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:
If you do use the C locale, you do not need the xxx_pattern_ops
operator classes, because an index with the default operator class is
usable for pattern-matching queries in the C locale.
2) Index on expression
I'd imagine you would also want to make that search case-insensitive, so let's take another step and make that an index on an expression:
CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(name) ~~ (lower('Hambu') || '%');
To make use of the index, the WHERE clause has to match the index expression.
3) Optimize index size and speed
Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:
CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%');
left() was introduced with Postgres 9.1. Use substring(name, 1,10) in older versions.
4) Cover all possible requests
What about strings with more than 10 characters?
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND lower(name) ~~ (lower('Hambu678910') || '%');
This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, the additional clause filters the rest. Experiment to find the sweet spot. Depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
5) Optimize disc representation with CLUSTER
So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. The manual:
When a table is clustered, it is physically reordered based on the index information.
With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.
First call:
CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;
The information about which index to use is saved, and successive calls will re-cluster the table:
CLUSTER tbl;
CLUSTER; -- cluster all tables in the db that have previously been clustered.
If you don't want to repeat it:
ALTER TABLE tbl SET WITHOUT CLUSTER;
However, CLUSTER takes an exclusive lock on the table. If that's a problem, look into pg_repack or pg_squeeze, which can do the same without exclusive lock on the table.
6) Prevent too many rows in the result
Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness; you probably do it anyway.
And LIMIT the number of rows returned:
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT 501;
If your query returns more than 500 rows, tell the user to narrow down his search.
7) Optimize filter method (operators)
If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:
SELECT name
FROM tbl
WHERE lower(left(name, 10)) ~>=~ lower('Hambu')
AND lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(2097151));
You gain very little with this last stunt. Normally, standard operators are the better choice.
If you do all that, search time will be reduced to a matter of milliseconds.
I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine like Lucene, Solr, or Elasticsearch.
Having said that, there's a project integrating CouchDB with Lucene.