Does the Polybase shard range constitute data knowledge?

In the Polybase database, does the client encrypt a record and then determine which shard it belongs in?
I was trying to understand how you know the correct location of the data.

Yes, that’s how it will work initially. You can use zk-SNARKs to prove that this is the correct location for the data. This is detailed further in the Polybase whitepaper.
There is also ongoing research into homomorphic encryption (computing over encrypted data without access to the secret key), which would allow this to happen on the indexer.
Note: I'm a founder of Polybase.
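As a minimal sketch of the mechanism (my assumption, not Polybase's actual code; the shard count and hash choice are made up):

```python
import hashlib

NUM_SHARDS = 16  # assumption: a shard count every client agrees on

def shard_for(record_id: bytes) -> int:
    """Any client holding the record id can recompute the same shard,
    so the indexer only ever sees the ciphertext and a shard number."""
    digest = hashlib.sha256(record_id).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The client encrypts the record body and writes the ciphertext to
# shard_for(record_id); a zk-SNARK could then prove the placement was
# computed correctly without revealing record_id itself.
```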

Related

Spring Data MongoDB: how do I check email duplication when the data is encrypted?

I use this library https://github.com/agoston/spring-data-mongodb-encrypt for encrypting users' personal data. The problem is that I don't know how the library works, but even when you encrypt the same email address it yields a different result each time, so I have no way to check whether the email is a duplicate. One way I could think of is to query all the email addresses from the database, which are automatically decrypted by the library, and then check whether the requested email already exists in the records, but this approach would require a lot of I/O and resources on a large database. Are there any better solutions?
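The thread has no answer, but as a hedged sketch of a common workaround (not from this thread): randomized encryption deliberately produces a different ciphertext for the same email on every write, so duplicate checks are usually done against a separate deterministic HMAC ("blind index") field rather than against the ciphertext. The key handling below is an assumption:

```python
import hashlib
import hmac

# Assumption: a dedicated, fixed indexing key from your key store,
# kept separate from the encryption key the library uses.
INDEX_KEY = bytes.fromhex("00" * 32)  # placeholder only, never hard-code

def email_blind_index(email: str) -> str:
    """Same email -> same tag, so a unique index on this field catches
    duplicates while the stored email stays randomly encrypted."""
    normalized = email.strip().lower()
    return hmac.new(INDEX_KEY, normalized.encode(), hashlib.sha256).hexdigest()

# Store both the encrypted email and email_blind_index(email), and put
# a unique index on the blind-index field; duplicate inserts then fail fast.
```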

How to know when data has been inserted in ClickHouse

I understood that ClickHouse is eventually consistent, so once an insert call returns, it doesn't mean that the data will appear in a select query.
1. Does that apply to stand-alone ClickHouse (no distribution, no replication)?
2. I understand the concept of eventual consistency for data replication, but does it apply with distribution but no replication?
3. Using a distributed+replicated ClickHouse, what is a recommended way to know that some insert(s) can be safely looked up?
Basically I didn't find much information on this topic, so maybe I am not asking the best questions. Feel free to enlighten me.
1. No, but a single-node setup shouldn't be considered reliable either.
2. By default, yes: you'll insert into the node the client is connected to (probably via some load balancer), and the Distributed table will asynchronously forward each piece of data to the node where it belongs. The insert_distributed_sync=1 setting will make the client wait synchronously.
3. On insert, use the *MergeTree shard tables directly (not the Distributed table) with the insert_quorum=2 setting (if there are 3 replicas), and retry infinitely with exactly the same batch if there are errors (you can use different replicas on retry, since there is deduplication based on the batch hash). Then on reads, use the select_sequential_consistency=1 setting.
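A sketch of that last recipe using the Python clickhouse-driver client (the host and table names are hypothetical; the two settings are the ones named above):

```python
from clickhouse_driver import Client

client = Client("replica-1.example.com")  # hypothetical host

# Insert into the shard table directly, waiting until 2 of 3 replicas
# have the batch; on error, retry the *same* batch and rely on
# block-hash deduplication to avoid double inserts.
rows = [(1, "a"), (2, "b")]
client.execute(
    "INSERT INTO events_local (id, payload) VALUES",
    rows,
    settings={"insert_quorum": 2},
)

# Reads that must see every quorum-acknowledged insert:
result = client.execute(
    "SELECT count() FROM events_local",
    settings={"select_sequential_consistency": 1},
)
```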

Using Elasticsearch as permanent storage

Recently I have been working on a project which produces a huge amount of data every day. In this project there are two functionalities: one is storing data into HBase for future analysis, and the second is pushing data into Elasticsearch for monitoring.
As the data is huge, we have to store it in two platforms (HBase, Elasticsearch)!
I have no experience with either of them. I want to know: is it possible to use Elasticsearch instead of HBase as persistent storage for future analytics?
I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is only a search engine. But it depends whether your data is critical, or whether you can accept losing some of it, like non-critical logs.
If you don't want to use an additional database for such large data, you can probably store it as files in something like HDFS.
You should also check Phoenix (https://phoenix.apache.org/), which may provide the monitoring features you are looking for.

Encrypting data in an Oracle database

What are the ways in which data can be encrypted? Say, for example, a salary column: even the admin should not be able to see the encrypted columns if possible. Data should be visible only through the application, to users who have access as defined in the application; changes to the application (adding new functionality to encrypt/decrypt at the application level) would be a last resort and should be minimal.
So far I have thought of 2 ways; any fresh ideas, or pros and cons of the ones below, would be much appreciated:
1. Using Oracle TDE (Transparent Data Encryption).
- Con: the admin can possibly grant himself rights to see the data.
2. Creating a trigger to encrypt before insert, and something along the lines of a pipeline to retrieve.
Oracle Database Vault is the only way to prevent a DBA from being able to access data stored in the database. That is an extra cost product, however, and it requires you to have an additional set of security admins whose job it is to grant the DBAs whatever privileges they actually need.
Barring that, you'd be looking at solutions that encrypt and decrypt the data in the application, outside the database. That would involve making changes to the database structure (i.e. the salary column would be declared as a RAW rather than a NUMBER). And it involves application changes to call the encryption and decryption routines. And that requires that you solve the key management problem, which is generally where these sorts of solutions fail. Storing the encryption key somewhere that the application can retrieve it but no admin can access is generally non-trivial. And then you need to ensure that the key is backed up and restored separately, since the encrypted data in the database is useless without the key.
Most of the time, though, I'd tend to suggest that the right approach is to allow the DBA to see the data and audit the queries they run instead. If you see that one particular DBA is running queries for fun rather than occasionally looking at bits of data in the course of doing her job, you can take action at that point. Knowing that their queries are being audited is generally enough to keep the DBA from accessing data that she doesn't really need.
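A minimal sketch of that application-side approach, in Python with the cryptography package (the key source and names are assumptions), which also shows why the salary column becomes a RAW blob:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumption: the key comes from a secrets manager the DBA cannot read;
# it must never live inside the database it protects.
key = bytes.fromhex(os.environ["SALARY_KEY_HEX"])  # 64 hex chars -> 32-byte AES-256 key

def encrypt_salary(salary: int) -> bytes:
    """Return nonce || ciphertext, ready for a RAW column."""
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, str(salary).encode(), None)

def decrypt_salary(blob: bytes) -> int:
    nonce, ct = blob[:12], blob[12:]
    return int(AESGCM(key).decrypt(nonce, ct, None))
```

The backup caveat above applies directly: blobs produced this way are unreadable without the key, so the key needs its own backup and restore path.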

Serializable In-Memory Full-Text Index Tool for Ruby

I am trying to find a way to build a full-text index stored in-memory in a format that can be safely passed through Marshal.dump/Marshal.load so I can take the index and encrypt it before storing it to disk.
My rationale for needing this functionality: I am designing a system where a user's content needs to be both encrypted using their own key, and indexed for full text searching. I realize there would be significant overhead and memory usage if for each user of the system I had to un-marshal and load the entire index of their content into memory. For this project security is far more important than efficiency.
A full text index would maintain far too many details about a user's content to leave unencrypted, and simply storing the index on an encrypted volume is insufficient as each user's index would need to be encrypted using the unique key for that user to maintain the level of security desired.
User content will be encrypted and likely stored in a traditional RDBMS. My thought is that loading/unloading the serialized index would be less overhead for a user with large amounts of content than decrypting all the DB rows belonging to them and doing a full scan for every search.
My trials with Ferret got me to the point of successfully creating an in-memory index. However, the index failed Marshal.dump due to its use of Mutex. I am also evaluating Xapian and Solr but seem to be hitting roadblocks there as well.
Before I go any further, I would like to know if this approach is even a sane one, and what alternatives I might want to consider if it's not. I also want to know if anyone has had success serializing a full-text index in this manner, what tool you used, and any pointers you can provide.
Why not use a standard full-text search engine and keep each client's index on a separate encrypted disk image, like TrueCrypt? Each client's disk image could have a unique key, it would use less RAM, and would probably take less time to implement.
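For what it's worth, the question's serialize-then-encrypt round trip, sketched in Python rather than Ruby (pickle standing in for Marshal.dump/Marshal.load; the index shape and key handling are assumptions):

```python
import os
import pickle
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def save_index(index: dict, user_key: bytes, path: str) -> None:
    """Serialize a plain-dict inverted index (term -> posting list),
    then encrypt it with the user's own key before it touches disk."""
    blob = pickle.dumps(index)          # analogous to Marshal.dump
    nonce = os.urandom(12)
    with open(path, "wb") as f:
        f.write(nonce + AESGCM(user_key).encrypt(nonce, blob, None))

def load_index(user_key: bytes, path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()
    blob = AESGCM(user_key).decrypt(data[:12], data[12:], None)
    return pickle.loads(blob)           # analogous to Marshal.load

# A plain dict sidesteps the Mutex problem the question hit with Ferret,
# since it contains no unserializable synchronization primitives.
```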
