Fuzzy search on Oracle database [closed] - oracle

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I need to implement fuzzy search in our web application. At first we had hardcoded options on the frontend, but now we have to work with database data (returning only the top 10 candidates). I just want to ask about the best approach for the data flow behind such an input field. My thoughts are:
User types any character in a search field
Frontend issues POST request to the Backend
Backend asks database to start some fuzzy search procedure
Backend returns a list of the 10 best results to the frontend
Frontend displays them
My two biggest concerns are:
Can I make this work on a large data set (for example 100,000 rows) in under 1 second?
What is the best algorithm for fuzzy searching this amount of data in a relatively short time? (I am searching tool names that can consist of two or more words, for example My Example Tool.) I have already looked at UTL_MATCH and SOUNDEX, but I don't know if they are the right choice.
Is it okay to issue a request every time the user types a character? Won't that amount of HTTP requests bring our application to a halt?

Oracle does have a powerful feature for text searches: Oracle Text. It allows searching within large chunks of text or documents, and it supports stopwords, synonyms, fuzzy search, etc.
Subsecond searches should be no problem.
Check Fuzzy Matching and Stemming, or google "Oracle text fuzzy search".
What do you think yourself? Firing off a request against more than 100,000 rows every time each user types a single character? You can probably scale your backend for that, but it does not make sense... check @Randy's comment too.
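As a rough sketch of what such an Oracle Text query could look like (the table name, column name, and connection details are assumptions, and a CONTEXT index on the column must already exist; this is an illustration, not the answerer's code):

# Sketch: top-10 fuzzy search with Oracle Text via python-oracledb.
# Assumptions: a TOOLS table with a TOOL_NAME column, an existing CONTEXT index
# (CREATE INDEX tools_name_ctx ON tools(tool_name) INDEXTYPE IS CTXSYS.CONTEXT),
# and Oracle 12c+ for FETCH FIRST.
import oracledb

def top_10_fuzzy(conn, user_input):
    # Build a CONTAINS expression like: fuzzy(my, 60, 100, weight) AND fuzzy(exmaple, 60, 100, weight)
    # Real code must sanitize/escape Oracle Text reserved characters in user input.
    words = [w for w in user_input.split() if w]
    contains_expr = " AND ".join(f"fuzzy({w}, 60, 100, weight)" for w in words)
    sql = """
        SELECT tool_name, SCORE(1) AS score
          FROM tools
         WHERE CONTAINS(tool_name, :expr, 1) > 0
         ORDER BY SCORE(1) DESC
         FETCH FIRST 10 ROWS ONLY
    """
    with conn.cursor() as cur:
        cur.execute(sql, expr=contains_expr)
        return cur.fetchall()

with oracledb.connect(user="app", password="secret", dsn="localhost/XEPDB1") as conn:
    print(top_10_fuzzy(conn, "My Exmaple Tool"))

The SCORE ordering plus FETCH FIRST keeps the result set at 10 rows, which matches the top-10 requirement from the question.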

Related

Is it good to delete original data if the data has been indexed into Elasticsearch? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
Hi, I'm new to Elasticsearch.
Recently I have been trying to design a search engine platform.
One thing that really confuses me is Elasticsearch's storage.
As far as I know, once a document has been indexed, users can retrieve its contents from Elasticsearch.
Does that mean I can delete the original document files, since users can get the data from Elasticsearch?
Or do I have to store the original data somewhere?
Or does it depend on the use case?
Please help me.
I am hoping you are using an RDBMS or some other robust data store for your application and using Elasticsearch to improve search or read performance; Elasticsearch is not a database, and a primary data store should be robust.
Please refer to the robustness section of the official Elasticsearch blog, where they clearly mention that it is not as robust as typical databases, that out-of-memory errors do occur, and that it is built for speed on the assumption that you have an ample amount of memory.
In short, you should not delete the original data if you can't afford data loss in your application, as data loss can happen in ES.

Scanning and finding sensitive data in an Elasticsearch index in an efficient way

What I have: an Elasticsearch database used for full-text search purposes.
What my requirement is: in a given Elasticsearch index, I need to detect sensitive data like IBAN numbers, credit card numbers, passport numbers, social security numbers, addresses, etc. and report them to the client. There will be checkboxes as input parameters. For instance, the client can select credit card number and passport number, then click the detect button. After that, the system will start scanning the index and report the documents which include credit card and passport numbers. The aim is to support more than 200 sensitive data types, and clients will be able to make multiple selections over these types.
What I have done: I have created a C# application and used the Nest library for ES queries. In order to detect each sensitive data type, I have created regular expressions and some special validation rules in my C# app, which work well for a manually given input string.
In my C# app, I have created a match-all query with the scroll API. When the user clicks the detect button, my app iterates over all the source records returned from the scroll API, and for each record it runs the sensitive data finder code based on the client's selection.
The problem here is searching all source records in the ES index, extracting the sensitive data, and preparing the report as fast as possible for a large number of documents. I know ES is designed for full-text search, not for scanning the whole system and bringing back data. However, all the data is in Elasticsearch right now and I need to use this DB for the detection operation.
I am wondering if I can do this in a different and more efficient way. Can this problem be solved by writing an Elasticsearch plugin instead of a C# app? Or is there a better solution for scanning the whole source data in the ES index?
Thanks for suggestions.
The passport number (and other sensitive information) detection algorithms should run once, at indexing time, or perhaps asynchronously as a separate job that updates documents with flags representing the presence of sensitive information. Based on those flags, the relevant documents can then be searched.
Search-time analysis in this case would be very costly and should be avoided.
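A rough sketch of that index-time / backfill approach, using the official Python client purely for illustration (the index name, field handling, and regexes are assumptions and are nowhere near production-grade detectors):

# Sketch: backfill a "sensitive_types" flag field, then report via a cheap filter.
import re
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

DETECTORS = {
    # Deliberately naive patterns -- real IBAN / credit card detection needs
    # proper validation (checksums, country formats, ...).
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def backfill_flags(index):
    # One-off / asynchronous job: scan every document once and store the detected types.
    actions = []
    for hit in helpers.scan(es, index=index, query={"query": {"match_all": {}}}):
        text = " ".join(str(v) for v in hit["_source"].values())
        found = [name for name, rx in DETECTORS.items() if rx.search(text)]
        if found:
            actions.append({
                "_op_type": "update",
                "_index": index,
                "_id": hit["_id"],
                "doc": {"sensitive_types": found},
            })
    helpers.bulk(es, actions)

backfill_flags("documents")

# Reporting is then a cheap terms filter instead of a full rescan:
# es.search(index="documents", query={"terms": {"sensitive_types": ["credit_card", "iban"]}})

New documents would get the same flags at ingest time (for example in the indexing path of the existing C# app), so the expensive scan only ever runs once per document.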

Are there downsides to indexing every field in an Elasticsearch index? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I found information about how the number of shards and the number of fields in a mapping affect the performance of Elasticsearch, but I could not find any information about how indexing or not indexing a field affects cluster performance.
Imagine I have a document like:
{
"name":"John Doe",
"photoPath":"/var/www/public/images/i1/j/d/123456.jpg",
"passwordHash":"$12$3b$abc...354",
"bio":"..."
}
I need to put 10 to 100 such documents into the cluster each second. When I put such a document into the index, I am pretty sure I will need full-text search on name and on bio. I will never search on photoPath, and I will never need full-text search on the password hash.
When I do my mapping I have several options:
make all fields text and analyze them with the simple analyzer (i.e. tokenize on any non-letter character) - in that case I will have terms like "i1", "3b" or "123456" in my index
make name and bio text, make passwordHash keyword, and make photoPath non-indexed
So my questions are:
In what ways, if any, am I improving performance in case I use the second option with custom-tailored field types?
Am I correct in my assumption that having fewer fields indexed helps performance?
Am I correct in my assumption that indexing fewer fields will improve indexing performance?
Am I correct in my assumption that actual search will be faster if I index only what I need?
Here we go with the answers:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types? --> see detailed explanation below
Am I correct in my assumption that having fewer fields indexed helps performance? --> Yes
Am I correct in my assumption that indexing fewer fields will improve indexing performance? --> Yes
Am I correct in my assumption that actual search will be faster if I index only what I need? --> Most likely
Detailed explanation:
Every index comes with a mapping in which you specify not only what data should get indexed, but also in how many fields your data is stored and how to process the data before storing it. In its default configuration, Elasticsearch will dynamically create this mapping for you based on the type of data you send to it.
Every single entry in your mapping consumes some bytes, which adds to the size of the cluster state (the data structure that contains all the meta-information about your cluster, such as information about nodes, indices, fields, shards, etc., and lives in RAM). For some users the cluster state simply got too big, which severely affected performance. As a safety measure, Elasticsearch by default does not allow you to have more than 1000 fields in a single index.
Depending on the mapping type and optional mapping parameters, Elasticsearch will create one or more data structures for every single field you store. There are cheaper and more "expensive" types: keyword and boolean are rather "cheap", whereas text (for full-text search) is a rather expensive type, as it also requires preprocessing (analysis) of your strings. By default Elasticsearch maps strings to a multi-field made up of 2 fields: one that goes by <fieldname>, which is of type text and supports full-text search, and one that goes by <fieldname>.keyword, of type keyword, which only supports exact-match search. On top of that, keyword fields and some other field types allow you to do analytics and use them for sorting.
If you don't need one or the other, then please customize your mapping and store each field only for the use cases you need. It makes a huge difference whether you only need to display a field (no need to create any additional data structures for that field), whether you need to be able to search in a field (requiring specific data structures like inverted indices), or whether you need to do analytics on a field or sort by it (requiring specific data structures like doc_values). Besides the field types you specify in your mapping, you can also control the data structures that should get created with the following mapping parameters: index, doc_values, enabled (just to name a few).
At search time it also makes a difference how many fields you are searching over and how big your index is. The fewer the fields and the smaller the index, the better for fast search requests.
Conclusion:
So your idea to customize your mapping by storing some fields as keyword fields, some as text fields, and some as multi-fields makes perfect sense!
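For illustration, a minimal mapping along the lines of option 2 could look like the sketch below (the index name is made up, and it is written against the 8.x Python client; older clients take a single body argument instead):

# Sketch of option 2: text for searchable fields, keyword for exact match,
# indexing disabled for photoPath. Index name is an assumption.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="people",
    mappings={
        "properties": {
            "name": {"type": "text"},                          # full-text search
            "bio": {"type": "text"},                           # full-text search
            "passwordHash": {"type": "keyword"},               # exact match only
            "photoPath": {"type": "keyword", "index": False},  # kept in _source, not searchable
        }
    },
)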
As the question has several parts, I will try to answer them with official Elasticsearch (ES) resources. Before that, let's break down what the OP has in the ES index and each field's use case:
{
"name":"John Doe", //used for full text search
"photoPath":"/var/www/pub/images/i1/j/d/123.jpg", // not used for search or retrival.
"passwordHash":"$12$3b$abc...354", // not used for search or retrival.
"bio":"..." //used for full text search**
}
Now, as the OP mentioned that photoPath and passwordHash aren't used for full-text search, I am assuming that these fields will not be used even for retrieval purposes.
So first we need to understand the difference between indexing a field and storing a field, which is explained very well in this and this article. In short, if you have _source disabled (it is enabled by default), you will not be able to retrieve a field unless it is stored.
Now coming to the optimization and performance part: it's very simple. If you send and store more data than you actually need, you waste resources (network, CPU, memory, disk), and ES is no different here.
Now coming to the OP's assumptions/questions:
In what ways, if any, am I improving performance in case I use the second option with custom-tailored field types? This option is definitely better than the first, as you are not indexing the fields you don't need for search, but there is still room for optimization: if you don't need to retrieve them, it's better not to store them at all and to remove them from the index mapping.
Am I correct in my assumption that having fewer fields indexed helps performance? Yes: this way your inverted index will be smaller, you will be able to cache more of it in the filesystem cache, and searching a smaller amount of data is always faster. Apart from that, it helps improve the relevance of your search by not indexing fields that are unnecessary for it.
Am I correct in my assumption that indexing fewer fields will improve indexing performance? Explained in the previous answer.
Am I correct in my assumption that the actual search will be faster if I index only what I need? It not only improves search speed but also improves indexing speed (as there will be fewer segments, and merging them takes less time).
I could add a lot more information, but I wanted to keep this short and simple. Let me know if anything isn't clear and I will be happy to add more.
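To make the "don't store what you never retrieve" point concrete, the earlier mapping sketch could additionally exclude those fields from _source (again just an assumption-laden sketch, only appropriate if the fields truly never need to be returned):

# Variation on the earlier sketch: photoPath and passwordHash are neither
# indexed nor kept in _source, so they are not searchable and not retrievable.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="people_minimal",
    mappings={
        "_source": {"excludes": ["photoPath", "passwordHash"]},
        "properties": {
            "name": {"type": "text"},
            "bio": {"type": "text"},
            "passwordHash": {"type": "keyword", "index": False},
            "photoPath": {"type": "keyword", "index": False},
        },
    },
)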

GraphQL vs Elasticsearch: what should I use for fast search performance that returns many different schemas? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed last year.
I am making a real-time search that will detect the correct pattern from the search string. It will then search with this pattern and dynamically return results with the correct database schema.
Example: Google Assistant
You are comparing apples with oranges if you are comparing GraphQL with Elasticsearch. They are totally different technologies.
GraphQL is an API-layer technology, comparable to REST. It mainly defines the request/response format and structure of your HTTP-based API. It is not another NoSQL store that helps you store and query data efficiently.
If you are using GraphQL, you still need to fetch the data yourself; the data may actually be stored in and come from NoSQL, a SQL DB, Elasticsearch, another web service, or anything else. GraphQL does not care where you store the data; the data can even be stored across multiple data sources. What it cares about is that you tell it how to get the data.
Back to your case: you can most probably use Elasticsearch for storing and searching the data efficiently, and put GraphQL in front of Elasticsearch so that users/developers interact with the service through a GraphQL API and enjoy GraphQL's benefits.
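A rough sketch of that layering, just to illustrate the idea (the schema, index name, and field names are made-up assumptions; it uses the graphene and elasticsearch Python libraries):

# Sketch: GraphQL as the API layer, Elasticsearch as the search backend.
import graphene
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

class Result(graphene.ObjectType):
    title = graphene.String()

class Query(graphene.ObjectType):
    search = graphene.List(Result, q=graphene.String(required=True))

    def resolve_search(root, info, q):
        # GraphQL only defines the API shape; the actual searching is done by ES.
        res = es.search(index="content", query={"match": {"title": q}}, size=10)
        return [Result(title=h["_source"]["title"]) for h in res["hits"]["hits"]]

schema = graphene.Schema(query=Query)
# Example: schema.execute('{ search(q: "google assistant") { title } }')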
It depends on the use case.
Recently I realized that (for just this use case) I could have used GraphQL for searching instead of Elasticsearch, considering the cost of running two services: the one that GraphQL was reading from, and Elasticsearch.
All in all, it's good that you can use these two technologies, because you may need them in different use cases.

Will a lot of deleted docs influence query speed? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I need to delete docs frequently, but ES only flags them as deleted. If there are a lot of deleted docs, will query speed suffer? Are there other problems?
EDIT
In other words: I often delete a lot of docs from an index and never use the force merge API to release disk usage; will I have query performance issues after a period of time?
Simply send an HTTP POST request to your Elasticsearch node, with the following structure:
$ curl -XPOST 'http://localhost:9200/your_index_name/_forcemerge'
For more details you can read this page.
If there are a lot of deleted docs, will query speed suffer?
The answer is yes.
In other words: I often delete a lot of docs from an index and never use the force merge API to release disk usage; will I have query performance issues after a period of time?
Elasticsearch automatically runs the merge process when insert or update activity is high (which causes segments to become dirty). On the other hand, you can use the force merge API to get some control over the merging process yourself.
Documents are stored in the index as segments, which are formed when the documents are created in Lucene. Deleting a document from Elasticsearch doesn't actually delete it from the underlying segment, which is the basic storage unit for ES.
Yes, having a lot of deleted documents will cause query performance issues, as queries still have to work through the deleted documents inside the segments as well.
Force merge (called optimize in older versions) of the index is usually the way to deal with this, but you should take a little care with it, as it is a heavy disk I/O operation.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_forcemerge'
$ curl -XPOST 'http://localhost:9200/_forcemerge?only_expunge_deletes=true'
Can you explain more about why you have so many frequent deletes? We also had huge numbers of frequent deletes, but we handled them at the index level. Our deletes happen for documents in a certain date range, so we index the documents based on dates, and when the time comes to delete the docs for a certain date we simply drop that index.
If you have any pattern to the documents being deleted, I suggest you separate them out into their own index and just drop the index.
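A minimal sketch of that index-per-date-range pattern (index names and client setup are assumptions; 8.x Python client):

# Sketch: write each document into a date-based index, then drop whole indices
# when their data expires, instead of deleting individual documents.
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc_day = date(2024, 5, 1)
index_name = f"docs-{doc_day.isoformat()}"  # e.g. docs-2024-05-01

es.index(index=index_name, document={"message": "hello"})

# When that day's data should go away, drop the index -- no deleted-docs
# overhead, no force merge needed.
es.indices.delete(index=index_name)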

Resources