Fuzzy search over encrypted data - elasticsearch

I have a schema where a couple of fields must be encrypted. I was wondering if someone has done this, or can point me to a resource, to know whether Elasticsearch gives me a way to implement fuzzy search over this encrypted data.
For example when I have
{
"last_name": "encryptedLastName",
}
and two documents where last_name was encrypted, one with the encrypted value of last_name=Ferdinand and the other with the encrypted value of last_name=Ferdadian,
I'd like to be able to search with a string and fetch both documents as long as, for example, the Levenshtein similarity is above 80%. Is this at all possible?
On another note, I would also like to be able to do 'like' searches over the encrypted data, for example where last_name like 'Fer%'.

You can build an index over encrypted data, but it would mean the data sits unencrypted in the index. Whatever reason the fields are encrypted in the database itself likely means they can't be stored unencrypted in the Elasticsearch index either.
And if the encryption is any good, similar values look completely different after encryption.
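A quick way to see this is a minimal Python sketch (assuming the cryptography package as the cipher; any strong encryption behaves the same way):

from difflib import SequenceMatcher
from cryptography.fernet import Fernet

def similarity(a, b):
    # Ratcliff/Obershelp ratio in [0, 1]; close enough to a Levenshtein-style score
    return SequenceMatcher(None, a, b).ratio()

f = Fernet(Fernet.generate_key())
token_a = f.encrypt(b"Ferdinand").decode()
token_b = f.encrypt(b"Ferdadian").decode()

print(similarity("Ferdinand", "Ferdadian"))  # ~0.78 on the plaintext
print(similarity(token_a, token_b))          # far lower on the ciphertext; only the token header overlaps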

Generally (not specific to Elasticsearch):
To search over encrypted data, you will need to decrypt it. If you want to make it fast, you need to keep a decrypted index. You can have either fast search or good encryption, but not both.
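The practical consequence is that any fuzzy matching has to happen on plaintext in the application, e.g. after decrypting a (hopefully small) candidate set server-side. A hypothetical sketch:

from difflib import SequenceMatcher

def fuzzy_filter(decrypted_names, query, threshold=0.8):
    # Brute-force fallback: compare the query against already-decrypted values
    return [name for name in decrypted_names
            if SequenceMatcher(None, name.lower(), query.lower()).ratio() >= threshold]

print(fuzzy_filter(["Ferdinand", "Ferdadian", "Smith"], "Ferdinand", threshold=0.75))
# ['Ferdinand', 'Ferdadian']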

Related

Scanning and finding sensitive data in an Elasticsearch index in an efficient way

What I have: an Elasticsearch database used for full-text search purposes.
What my requirement is: in a given Elasticsearch index, I need to detect some sensitive data like IBANs, credit card numbers, passport numbers, social security numbers, addresses, etc. and report them to the client. There will be checkboxes as input parameters. For instance, the client can select credit card number and passport number, then click the Detect button. After that, the system will start scanning the index and report the documents which include a credit card number and passport number. The goal is to support more than 200 sensitive data types, and clients will be able to make multiple selections over these types.
What I have done: I have created a C# application and used the NEST library for ES queries. To detect each sensitive data type, I have created regular expressions and some special validation rules in my C# app, which work well for manually supplied input strings.
In my C# app, I have created a match-all query with the scroll API. When the user clicks the Detect button, my app iterates over all the source records returned by the scroll API and, for each record, executes the sensitive data finder code based on the client's selection.
The problem here is searching all the source records in the ES index, extracting the sensitive data, and preparing the report as fast as possible over a large number of documents. I know ES is designed for full-text search, not for scanning the whole system and pulling back all the data. However, all the data is in Elasticsearch right now, and I need to use this DB for the detection operation.
I am wondering if I can do this in a different, more efficient way. Can this problem be solved by writing an Elasticsearch plugin, without a C# app? Or is there a better solution for scanning the whole source data in an ES index?
Thanks for suggestions.
The passport number and other sensitive information detection algorithm should run once, at indexing time, or perhaps asynchronously as a separate job that updates documents with flags representing the presence of sensitive information. The relevant documents can then be searched based on those flags.
Search-time analysis in this case would be very costly and should be avoided.
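A minimal sketch of that one-off/asynchronous flagging job in Python (using the elasticsearch client; the patterns and the sensitive_types flag field are hypothetical, and the real rule set with its extra validation would be far richer):

import re
from elasticsearch import Elasticsearch, helpers

# Hypothetical patterns; the real app has ~200 types with extra validation rules.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

def flag_sensitive(index):
    # One-off or scheduled job: scan every document once and attach flags.
    actions = []
    for hit in helpers.scan(es, index=index, query={"query": {"match_all": {}}}):
        text = " ".join(str(v) for v in hit["_source"].values())
        flags = [name for name, rx in PATTERNS.items() if rx.search(text)]
        if flags:
            actions.append({
                "_op_type": "update",
                "_index": index,
                "_id": hit["_id"],
                "doc": {"sensitive_types": flags},
            })
    helpers.bulk(es, actions)

# Reporting then becomes a cheap filter instead of a full rescan:
# es.search(index="my_index", query={"terms": {"sensitive_types": ["credit_card", "iban"]}})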

Cassandra table structure for query to get all usernames

In my Cassandra database I have a table with users, and I want a function to search for users by their unique usernames. For that I need to query all usernames from the user table so that I can filter them server-side: for the input "nark" I should also find the usernames "Mark", "Narkis" and so on, so I can't just use the username as a partition key and search for the exact value.
If I put them all in the same partition, it results in a hot partition. If I distribute them over multiple partitions, I have to search in all of them.
How can I query that efficiently for millions of users? Is there a way to search like that without querying all usernames?
Thank you for your help!
Cassandra natively is not a good fit for such a use case. Even extensive use of secondary indexes will be of minimal help here.
Nevertheless, if you already have all your data in C*, achieving such functionality essentially requires an indexing framework on top of it; the most widely used is Apache Solr (built on Lucene). I have seen Solr work like magic for fuzzy searching on C*, though nothing beats having something like Elasticsearch for this use case.
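For illustration, a fuzzy Solr query against a hypothetical "users" core (the ~2 suffix is Lucene's fuzzy operator, max edit distance 2, so "nark" matches "Mark" and "Narkis", assuming a lowercasing analyzer on the field):

import requests

resp = requests.get(
    "http://localhost:8983/solr/users/select",   # hypothetical core name
    params={"q": "username:nark~2", "wt": "json"},
)
for doc in resp.json()["response"]["docs"]:
    print(doc["username"])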

Should I be using database IDs as Elastic IDs

I am new to Elastic and starting to sync my database tables into Elastic indexes. I have started by using the table ID (a UUID) as the Elastic ID, but I am starting to wonder whether this is a mistake in terms of performance or flexibility in the long term. Any advice would be appreciated.
I think this approach should actually be a best practice. When you update data in your ES index from the (changed) DB, you can address the document directly.
It has worked great for us to use the _bulk update API, which requires an explicit id per item.
On every change on the DB side, we enqueue change notifications; the changed object gets JSON-serialized and sent to ES asynchronously, in larger batches. That makes a huge performance difference. Search performance, on the other hand, does not depend on the length of the _id AFAIK, not even when you look up by _id. So your DB UUID should be just fine, especially since _ids can be alphanumeric; they are not limited to just numbers.
Having a 1:1 relationship via _id between the ES result and your system of record (I assume that's what your DB is) is also advantageous for transparency. In any case, you want to store the database ID as some field, ideally indexed, at the very least to help you understand where each document came from.
So, rather than creating your own ID field, you may as well use the built-in _id field right away, with your DB-supplied data.
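As a minimal sketch of that flow (Python's elasticsearch bulk helpers; the index name and row shape are assumptions):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

def push_changes(rows):
    # rows: JSON-serializable DB records, each carrying its table UUID under "id"
    actions = (
        {
            "_op_type": "index",   # overwrites the document at that _id (upsert-style)
            "_index": "users",     # hypothetical index name
            "_id": row["id"],      # the DB UUID doubles as the ES _id
            "_source": row,        # keep the UUID in the body too, for transparency
        }
        for row in rows
    )
    helpers.bulk(es, actions)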

Elasticsearch string values vs IDs - performance best practices

I wonder what's better from a performance perspective:
to store string values in Elasticsearch, or to keep a mapping table in MySQL (for example) and store only the ID.
For example, 10M documents with a field 'browser', either with all the browsers' names, or with IDs that I pre-generated and saved in MySQL.
The use case is analyzing server traffic.
Thanks!

Solr Direct Search Vs Oracle DB Text Field Direct Search

I am working on a large-scale Oracle database. There is a requirement to validate whether an email address exists in the database when the user enters it. If a direct database call is made, it would be an exact match, e.g.
Select email from Users where emailaddress = "sampleemail@domain.com"
It is not a LIKE.
It has been suggested that rather than doing a direct exact-match search in the database, it would be better to do a Solr search for this. Even so, it will be an exact match.
I would like to understand: can there be a significant advantage in using Solr in this scenario, given that it is an exact match? If so, how?
No, don't do that. Build an index in the database (a unique B-tree) and query that. Whoever suggested Solr for this is highly misinformed about the trade-offs. This is literally and exactly why databases have indexes at all.
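A sketch of the idea (sqlite3 as a stand-in engine for brevity; the same DDL applies to Oracle):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (emailaddress TEXT)")
# The unique B-tree index is the whole answer: an exact-match lookup
# becomes a single index probe, no external search engine needed.
con.execute("CREATE UNIQUE INDEX ux_users_email ON users (emailaddress)")
con.execute("INSERT INTO users VALUES ('sampleemail@domain.com')")

hit = con.execute(
    "SELECT emailaddress FROM users WHERE emailaddress = ?",
    ("sampleemail@domain.com",),
).fetchone()
print(hit is not None)  # True -> the email exists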
