Possible column encryption algorithms for Azure Dedicated SQL pool

There is a need to encrypt personal-data columns in some tables due to GDPR requirements.
We are using an Azure Dedicated SQL pool.
Could you please recommend algorithms for encrypting these columns that would not prevent complex queries from running or break external users who query this data, and list the pros and cons of each algorithm?
Thanks a lot!

Related

Database strategy for concurrent read/write operations

I have 6 services talking to the same SQL Server 2016 database (Payments), where some services are doing write operations and some are doing read operations. The database server hosts other databases besides the Payments database. We do not have any archival job in place on the Payments database. We recently hit 99% CPU usage as well as memory pressure on the database server.
Obvious steps I can take include:
Create archival jobs to migrate old data to an archive database
Scale up the database server.
But I still want to explore other solutions. I have the following questions:
Can we use different databases for read and write operations? If yes, how?
Can we migrate data on the fly from the RDBMS to a NoSQL database, since it is faster for read operations?
What is the best design for applications where concurrent read and write operations happen?
Storage is all about trade-offs, so it is extremely tricky to find the correct "storage" solution without diving deep into different aspects such as latency, availability, concurrency, access patterns and security requirements. In this particular case, payments data is being stored, which must be kept confidential, and that straight away rules out some storage solutions. In general, you should:
Cache the read data, but if the same data is being modified constantly this will not work. Caching also doesn't work well when your reads are not public (i.e., cannot be reused across multiple read calls, preferably across multiple users), which is possible in this case since we are dealing with payments data.
The read/write master database with read-only slaves is also a common pattern to scale reads. It doesn't scale the writes, though. Again, it depends on whether the application can work with replication lag.
Sharding is the common pattern for scaling writes. It comes with the additional burden of cross-node query aggregation, etc. (in some databases).
Finally, based on the data access pattern, refactor the schema and employ different databases. CQRS (Command Query Responsibility Segregation) is one way to achieve this, but it has its own pros and cons (a rough sketch follows below this answer). For more details: https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
A few years back I read this book, which helped me immensely in understanding these concepts: https://www.amazon.com/Scalability-Startup-Engineers-Artur-Ejsmont/dp/0071843655
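To make the CQRS / read-write split idea above a bit more concrete, here is a minimal T-SQL sketch (not from the original answer; all table, column and job names are hypothetical): a normalized write-side table plus a denormalized read-side summary that a scheduled job refreshes, accepting some lag.

-- Hypothetical write-side ("command") table.
CREATE TABLE dbo.Payments (
    PaymentId  BIGINT IDENTITY PRIMARY KEY,
    CustomerId INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL,
    CreatedAt  DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Hypothetical denormalized read-side ("query") table, so reporting
-- queries never touch the write-side table directly.
CREATE TABLE dbo.PaymentDailySummary (
    CustomerId   INT            NOT NULL,
    PaymentDate  DATE           NOT NULL,
    TotalAmount  DECIMAL(18, 2) NOT NULL,
    PaymentCount INT            NOT NULL,
    CONSTRAINT PK_PaymentDailySummary PRIMARY KEY (CustomerId, PaymentDate)
);

-- A scheduled job (SQL Agent or similar) rebuilds the read model; the
-- application must tolerate the resulting lag, as noted above.
TRUNCATE TABLE dbo.PaymentDailySummary;
INSERT INTO dbo.PaymentDailySummary (CustomerId, PaymentDate, TotalAmount, PaymentCount)
SELECT CustomerId, CAST(CreatedAt AS DATE), SUM(Amount), COUNT(*)
FROM dbo.Payments
GROUP BY CustomerId, CAST(CreatedAt AS DATE);

In a fuller CQRS setup the read model would usually be fed from events rather than rebuilt wholesale, but the trade-off (freshness versus query cost) is the same.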

Cassandra vs. Elasticsearch vs. any other design suggestions

We have a need to run analytics queries on the data stored in RDS, and that is becoming very slow because of GROUP BY queries and the ever-increasing size of the tables.
For example, we have the following 3 tables in RDS:
alm(id,name,cli, group_id, con_id ...)
group(id, type,timestamp ...)
con(id,ip,port ...)
Each of the tables holds a very large amount of data and is updated several times a minute as new data comes in.
Now we want to run aggregation queries like:
SELECT name FROM alm, group, con WHERE alm.group_id = group.id AND alm.con_id = con.id GROUP BY name, group.type, con.ip
We also want users to be able to run custom aggregation queries in the future, as opposed to only the fixed queries we provide.
So far the options we are considering are moving to Cassandra, Elasticsearch, or DynamoDB so that aggregation would be faster. Can someone give guidance on how to approach this problem, or share any experience? Does anyone know whether any of these technologies has a significant advantage over the others?
Cassandra and DynamoDB are quite different from Elasticsearch. And all three are very different from relational database offerings.
For ad-hoc analytics, relational databases, with a well designed schema, can be pretty good up to the point where you need to split your data across multiple servers (then replication issues start to dominate the benefits). And that's really the primary motivation for non-relational databases. But the catch is that in order to solve the horizontal scaling problem, they generally trade some features such as joining and aggregating.
Elasticsearch is really great at answering search queries, but not particularly good at aggregations (other than very basic counts, sums and their estimates). It's amazing at indexing copious amounts of data, but it can't answer queries that require complex cross-index operations. It is also not as robust (rebuilding indexes may be needed from time to time).
If you have high volumes of data and you need aggregation, you pretty much have two options:
if you can get away with offline analytics, then distributed data processing frameworks such as Spark can get you the answers you need very efficiently
if you need online analytics, the most common approach is to pre-compute the aggregations and update them as you get more data, so that answers to queries can be very fast without having to process a lot of data for each query (a minimal sketch follows this answer)
Don't be afraid to mix and match, though. Relational databases have their purpose, as do non-relational ones. There is no silver bullet.
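As a minimal illustration of the pre-computed-aggregation option above (a sketch only; the rollup and staging table names are hypothetical, and the question's group table is renamed grp here to avoid the reserved word), each ingest batch is folded into a small rollup table that online queries read instead of the raw data:

-- Hypothetical rollup table: one row per (name, type, ip) combination.
CREATE TABLE alm_rollup (
    name      VARCHAR(100) NOT NULL,
    grp_type  VARCHAR(50)  NOT NULL,
    con_ip    VARCHAR(45)  NOT NULL,
    event_cnt BIGINT       NOT NULL,
    PRIMARY KEY (name, grp_type, con_ip)
);

-- Fold each new batch (staged in alm_staging, a hypothetical table) into
-- the rollup, so queries never scan the ever-growing base tables.
MERGE INTO alm_rollup AS tgt
USING (
    SELECT a.name, g.type AS grp_type, c.ip AS con_ip, COUNT(*) AS event_cnt
    FROM alm_staging a
    JOIN grp g ON a.group_id = g.id
    JOIN con c ON a.con_id = c.id
    GROUP BY a.name, g.type, c.ip
) AS src
  ON tgt.name = src.name AND tgt.grp_type = src.grp_type AND tgt.con_ip = src.con_ip
WHEN MATCHED THEN
    UPDATE SET event_cnt = tgt.event_cnt + src.event_cnt
WHEN NOT MATCHED THEN
    INSERT (name, grp_type, con_ip, event_cnt)
    VALUES (src.name, src.grp_type, src.con_ip, src.event_cnt);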
One more option is column-oriented databases. This kind of DB is better suited to analytics cases where you have many data fields and you want to perform aggregations or extract a subset of fields from a large amount of data.
Recently, Yandex ClickHouse has become very popular, and Amazon offers a column-oriented service, Redshift. There are several other solutions as well.
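For a sense of what the column-oriented route looks like, here is a rough Redshift-style sketch (the table and column names are made up for illustration): the DDL picks distribution and sort keys to match the join/aggregation pattern, and aggregations then scan only the columns they reference.

-- Hypothetical columnar table with distribution and sort keys.
CREATE TABLE alm_events (
    id         BIGINT,
    name       VARCHAR(100),
    group_type VARCHAR(50),
    con_ip     VARCHAR(45),
    created_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (name)         -- co-locate rows that are aggregated together
SORTKEY (created_at);  -- prune blocks when filtering by time range

-- The aggregation reads only the columns it names, not whole rows.
SELECT name, group_type, con_ip, COUNT(*) AS events
FROM alm_events
GROUP BY name, group_type, con_ip;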
Store the data in Parquet and use Spark, partitioned efficiently.

Using ElasticSearch and Kibana for Business Intelligence

We are using ElasticSearch for search capability in our product. This works fine.
Now we want to provide self-service business intelligence to our customers. Reporting on the operational database sucks due to the performance impact. At run time, calculating the average 'order resolution time' across 10 million records would not return results in time. The traditional way is to create a data mart by loading the operational data using ETL and summarizing it, then use a reporting engine to offer metrics and reports to customers. This approach works but increases the total cost of ownership for our customers.
I am wondering if anybody has used Elasticsearch as the intermediate data layer for reporting. Can Kibana serve the data exploration and visualization need?
We have the same needs.
Tools like Qlik, Power BI and Tableau require you to grow the overall infrastructure stack, and when you are designing a solution to ship to customers without the possibility of sharing anything, they may not be the best option in terms of both cost and complexity.
I have used DevExtreme by DevExpress. Its server-side approach using a custom store is very efficient for handling and performing operations on large amounts of data. With MySQL and MSSQL databases, I have myself performed grouping, sorting, filtering and summaries on 10 million rows using DevExtreme.
Apache Superset seems to be an answer. https://superset.apache.org/docs/intro

Dealing with Gigabytes of Data

I am going to start on a new project. I need to deal with hundreds of gigabytes of data in a .NET application. It is too early to give much detail about this project, but here is an overview:
Lots of writes and lots of reads on the same tables, very real-time
Scaling is very important, as the client insists on expanding the database servers very frequently, and thus the application servers as well
Foreseeably, lots and lots of aggregate queries could be implemented
Each row of data may contain lots of attributes to deal with
I am suggesting the following as a solution:
Use a distributed-hash-table sort of persistence (not S3, but an in-house one)
Use Hadoop/Hive or the like (any replacement in .NET?) for any analytical processing across the nodes
Implement the GUI in ASP.NET/Silverlight (with lots of Ajax, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external hard drive and a sample data generator (Red Gate's is good), you can simulate that kind of workload quite easily.
To simulate that workload on a non-relational or cloud database, you might end up writing your own tooling.
"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
Get Ralph Kimball's Data Warehouse Toolkit books.
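To illustrate the query shape described above (a sketch with hypothetical fact and dimension names, not taken from Kimball), every report stays a simple aggregate over the fact table grouped by dimension attributes:

-- Hypothetical dimension table.
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,   -- e.g. 20240131
    year     INT NOT NULL,
    month    INT NOT NULL
);

-- Hypothetical flat fact table: loaded in bulk, rarely updated.
CREATE TABLE fact_sales (
    date_key   INT NOT NULL REFERENCES dim_date (date_key),
    product_id INT NOT NULL,
    quantity   INT NOT NULL,
    amount     DECIMAL(18, 2) NOT NULL
);

-- Every reporting query keeps the same SUM()/COUNT() ... JOIN ... GROUP BY form.
SELECT d.year, d.month, SUM(f.amount) AS total_amount, COUNT(*) AS sale_count
FROM fact_sales AS f
JOIN dim_date  AS d ON f.date_key = d.date_key
GROUP BY d.year, d.month;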
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean; you don't need to use a lot of third-party tools: use the RDBMS tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts the data from the DB using well-written stored procedures against a well-designed physical data model (indexes, partitions, etc.).
Teradata can handle a lot of data and is scalable.

Recommendation for a large-scale data warehousing system

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this, obviously it needs to be reliable, and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part; you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production-quality, enterprise-ready database, e.g. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
@Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
Use the right data type for the timestamps; make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional and data warehouse
Seriously consider a periodic ETL from transactional db to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time); see the sketch after this list.
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
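As a rough example of the partitioning point above (SQL Server syntax; the partition boundaries and table name are hypothetical), monthly range partitions on the event timestamp make archiving and removing old data much cheaper:

-- Hypothetical monthly range partitioning on the event timestamp.
CREATE PARTITION FUNCTION pf_EventMonth (DATETIME2)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME ps_EventMonth
AS PARTITION pf_EventMonth ALL TO ([PRIMARY]);

-- The partition column must be part of the clustered key.
CREATE TABLE dbo.WebEvents (
    EventId    BIGINT IDENTITY NOT NULL,
    EventType  VARCHAR(50)     NOT NULL,
    OccurredAt DATETIME2       NOT NULL,
    CONSTRAINT PK_WebEvents PRIMARY KEY (OccurredAt, EventId)
) ON ps_EventMonth (OccurredAt);

Old partitions can then be switched out or dropped according to the archiving/retention policy, instead of deleting rows one by one.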
I'm surprised none of the answers here cover Hadoop and HDFS; I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you would use HDFS (a distributed storage format on EC2) to store your data and run batch queries (i.e., analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run MapReduce queries against your data to produce reports.
Wow.. This is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this in two ways:
Throw money at the problem: buy best-in-class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes, I'm not sure how it would fit into a data storage strategy. The processing need is limited, which is where EC2 is strong. Your primary goal is efficient storage and retrieval.
