Using Cassandra in a data grid to sort and filter data - sorting

We are converting from SQL Server to Cassandra for various reasons. The back-end system has been converted and is working, and now we are focusing on the front-end systems.
In the current system we have a number of Telerik data grids where the app loads all the data and search/sort/filter is done in the grid itself. We want to avoid this and push the search/sort/filter down to the database. In SQL Server this is not a problem because of ad-hoc queries; in Cassandra, however, it becomes very confusing.
If every operation had to be allowed, then of course a Cassandra table would have to model the data that way. But I was wondering how this is handled in real-world scenarios with large amounts of data and many columns.
For instance, if I had a grid with columns 1, 2, 3, 4 what is the best course of action?
Highly control what the user can do
Create a lot of tables to model the data and pick the one to select from
Don't allow the user to do any data operations

As with any NoSQL system, Cassandra performs queries on primary keys best. You can of course use secondary indexes, but they will be a lot slower.
So the recommended way is to create materialized views for all possible queries.
Another way is to use something like Apache Ignite on top of Cassandra to do analytics, but as I understand it you don't want to use in-memory grids for some reason.
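The "one table (or materialized view) per query" pattern can be sketched in plain Python. This is a toy model, not real CQL: each dict stands in for a Cassandra table whose partition key is a different grid column, and the same rows are written to all of them at insert time, so each grid filter becomes a single-partition lookup.

```python
# Toy model of Cassandra's table-per-query denormalization pattern.
# Each dict plays the role of a table partitioned by a different column.
rows = [
    {"id": 1, "status": "open", "owner": "alice"},
    {"id": 2, "status": "closed", "owner": "bob"},
    {"id": 3, "status": "open", "owner": "bob"},
]

# Denormalize: write every row into one "table" per queryable column.
by_status, by_owner = {}, {}
for row in rows:
    by_status.setdefault(row["status"], []).append(row)
    by_owner.setdefault(row["owner"], []).append(row)

# A grid filter on status or owner is now a key lookup, not a scan.
open_rows = by_status.get("open", [])
bobs_rows = by_owner.get("bob", [])
```

The cost, exactly as in Cassandra, is that every write fans out to every copy; that is the trade the materialized-view approach makes.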


Oracle materialized view vs JPA query

My case is that a third party prepares a table in our schema, on which we run various Spring Batch jobs that look for mutations (the diff between the third-party table and our own tables). This table will contain about 200k records on average.
My question is simply: does generating a materialized view up front provide any benefit versus running the query at runtime?
Since the third-party table is populated on our command (basically a boolean field in the DB is set to 1, after which a scheduler picks it up and populates the table; don't ask me why it's done this way), the query needs to run anyway.
From an application point of view, it obviously seems more performant to query a flat materialized view. However, I'm not sure there is any real performance benefit, since the materialized view needs to be built at the DB level.
Thanks.
The benefit of a materialized view here comes if you are running the query multiple times (more so if the query is expensive and/or there is a big drop in cardinality).
If you are only hitting the query once, then there isn't going to be much in it. You are running the same query either way, and you have the overhead of inserting into the materialized view. On the other hand, you can tune a materialized view much more easily than a query issued via JPA, and you can apply things like compression so less data is transferred back to the application; but for 200k rows any difference is likely to be small.
All in all, unless you are running the same query multiple times, I wouldn't bother.
Update
One other thing to consider is coupling. Referencing a materialized view directly from JPA lets you change the logic without updating the application; the flip side is that the logic is hidden outside the application, which can make debugging a pain.
Also, if you are just referencing the materialized view directly and not using any query-rewrite or rollup features, then a simple table created via CTAS would actually be better: you still have the precomputed data without the (small) overhead of maintaining the materialized view.
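The "run once vs. run many times" trade-off can be illustrated with a small Python sketch. The tables and the diff query here are hypothetical stand-ins: the point is only that a precomputed snapshot (materialized view or CTAS table) pays the query cost once, while querying at runtime pays it on every access.

```python
# Toy illustration: precompute once (MV / CTAS) vs. re-running the diff
# query every time. Table contents and names are made up for illustration.
third_party = {1: "a", 2: "b", 3: "c"}
our_table = {1: "a", 2: "x"}

query_runs = 0

def mutation_diff():
    """The 'expensive' query: rows that differ from, or are missing in, ours."""
    global query_runs
    query_runs += 1
    return {k: v for k, v in third_party.items() if our_table.get(k) != v}

# Runtime querying: the cost is paid on every call.
mutation_diff()
mutation_diff()
runtime_cost = query_runs

# Precomputed snapshot: the cost is paid once, at build time.
query_runs = 0
snapshot = mutation_diff()   # build step, like CTAS
_ = snapshot                 # each later read is free
_ = snapshot
precomputed_cost = query_runs
```

With a single batch run per population of the table, both paths execute the query exactly once, which is the answer's point: there isn't much in it.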

Advice on Setup

I started my first data-analysis job a few months ago. I am in charge of a SQL database, taking that data and creating dashboards in Power BI. Our SQL database is replicated from an online web portal we use for data entry: we do not add data to the database ourselves; instead, data is written to tables based on what is entered into the web portal. Since this database is replicated by another company, I created our own database connected via a linked server, and I have built many views to pull only the needed data from the initial database (I did this to limit the amount of data sent to Power BI, for performance). My view count is climbing, and I am wondering whether, in terms of performance, this is the best way forward. The highest row count of a view is 32,000 and the lowest is around 1,000 rows.
Some of the views I am writing end up joining 5-6 tables, due to the structure built by the web-portal company that controls the database.
My suggestion would be to create a data-warehouse (star) schema, keeping as a principle one star schema per domain: for example, one for sales, one for subscriptions, one for purchases, etc. Use the logic of data marts.
Identify your dimensions and your facts and keep evolving that schema. You will find that you end up with far fewer tables.
Your data is not that big, so you can use whatever ETL strategy you like: truncate-and-load or incremental.
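A minimal star-schema sketch in Python may make the fact/dimension split concrete. All table and column names here are invented for illustration: one fact table holds the measures, with keys into small dimension tables, and a dashboard query resolves attributes through those keys.

```python
# Minimal star schema: a fact table with surrogate keys into dimensions.
# Table and column names are hypothetical.
dim_product = {10: {"name": "widget", "category": "tools"},
               11: {"name": "gadget", "category": "toys"}}
dim_date = {20240101: {"year": 2024, "month": 1}}

fact_sales = [
    {"product_key": 10, "date_key": 20240101, "amount": 5.0},
    {"product_key": 11, "date_key": 20240101, "amount": 3.0},
    {"product_key": 10, "date_key": 20240101, "amount": 2.0},
]

# A typical dashboard query: revenue per category, via dimension lookups.
revenue_by_category = {}
for f in fact_sales:
    cat = dim_product[f["product_key"]]["category"]
    revenue_by_category[cat] = revenue_by_category.get(cat, 0) + f["amount"]

# Date attributes are resolved the same way.
year = dim_date[fact_sales[0]["date_key"]]["year"]
```

In the real warehouse each dict becomes a table and each lookup a join, but the shape is the same: many views over 5-6 joined source tables collapse into one fact table plus a handful of dimensions.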

"Saving" BigQuery Views for use in Tableau

I'm trying to make my Tableau dashboards faster by creating views of my calculations directly in BigQuery.
Based on my understanding of the Google Cloud documentation, a view re-executes its query each time it is accessed, which rather defeats my goal.*
*My goal is to eliminate calculations on the fly, be it in Tableau or BigQuery.
Is it possible to "save" these views, by way of scheduled scripts or workflows?
Thanks,
A view is best thought of as a way to reformat a table to make it look more convenient for further queries. The query still has to run in BigQuery, so the benefit is that the view may look simpler to Tableau than the raw table (particularly convenient if the view uses complex SQL to create some of its columns). But it won't save calculation time.
However, if your view does some complex consolidation of a larger table, it might be worth saving the results as a new table instead of creating a view. This works if your underlying table doesn't change frequently (as a rule of thumb, if you use the results every day and the table changes weekly, it is probably worthwhile, and certainly so if the changes are monthly). Tableau will then query pre-consolidated results rather than the much larger raw table. BigQuery storage and processing are cheap, so this is often a reasonable solution.
Another alternative is to use a Tableau extract to bring the data onto your local drive or server. This is only practical if the table is small enough to fit locally, and it will only really be fast if it fits into local memory (which can be a lot more than you might think). But extracts, at least on Tableau Server, can be refreshed on a schedule, making user interaction much faster and absolving you of having to remember to update the consolidated table manually.
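The "save the view's results as a table" idea amounts to a scheduled job that rebuilds a results table, which dashboards then read directly. A small Python sketch of what that nightly job does, with hypothetical data and names:

```python
# Sketch of "saving" a view on a schedule: a job periodically rebuilds a
# results table so the dashboard queries precomputed rows, not the view.
raw_orders = [("2024-01-01", 5), ("2024-01-01", 3), ("2024-01-02", 7)]

def build_daily_totals(rows):
    """The view's query: total amount per day."""
    totals = {}
    for day, amount in rows:
        totals[day] = totals.get(day, 0) + amount
    return totals

# What the scheduled job would do each night: rebuild the saved table.
saved_daily_totals = build_daily_totals(raw_orders)

# Tableau then reads saved_daily_totals; the consolidation query only
# runs when the job refreshes the table, not on every dashboard view.
```

In BigQuery itself this would be a scheduled query writing to a destination table (or a Tableau Server extract refresh); the sketch only shows the shape of the trade.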

Searching/selecting query in cache

I have been using caches for a long time: we store data against a key and fetch it from the cache whenever required. I know that Stack Overflow and many other sites rely heavily on caching. My question is: do they always use a key-value mechanism for caching, or do they form some SQL-like query within the cache? For instance, say I want to view last week's report, and the report's content varies each day. Do I need to store a different report against each day (with the day as the key), or can I form some query that aggregates results across different keys? Does any caching product (like Redis) provide this functionality?
Thanks in advance.
A cache is always a key-value hash table; that is how it stays so fast. If you're querying, you're not really caching.
What you may be trying to ask is: you could have in your database a table that contains aggregated report data, and query against that pre-calculated table.
One of the reasons a cache (e.g. memcached) is fast is the simplicity of its data-access and querying protocol.
The more functionality you add, the more you trade off on efficiency. A full-fledged SQL engine in a "caching" database is not a good design. You can, however, use a data-structure-oriented database like Redis and design your cached data to suit your querying needs: for example, one set or one hash per date.
A step further, you can use databases like MongoDB or MemSQL, which are quite fast and have rich querying support, so an occasional aggregation report won't be an issue.
However, as a design decision, you will have to accept that their throughput will not match memcached or Redis.
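The "one hash per date" design mentioned above can be sketched in Python, with a plain dict standing in for Redis (in real Redis these would be `HSET`/`HGETALL` calls on keys like `report:2024-01-01`). Aggregating last week's report then means fetching seven keys and summing in the application, not querying the cache:

```python
# Key-per-day cache design, with a dict standing in for Redis hashes.
cache = {
    "report:2024-01-01": {"views": 10, "clicks": 2},
    "report:2024-01-02": {"views": 20, "clicks": 5},
    "report:2024-01-03": {"views": 15, "clicks": 3},
}

def weekly_report(dates):
    """Fetch each day's key and sum the fields in the application."""
    total = {"views": 0, "clicks": 0}
    for d in dates:
        day = cache.get(f"report:{d}", {})  # missing days contribute nothing
        for field, value in day.items():
            total[field] += value
    return total

week = weekly_report(["2024-01-01", "2024-01-02", "2024-01-03"])
```

The cache stays a pure key-value store; the "query" lives in application code, which is exactly the trade the answer describes.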

Windows Azure Application high volume of records insertions

We are meant to be developing a web-based application on the Azure platform. I have a basic understanding of it, but still have many questions.
The application will have a lot of database interaction and will need to insert a large volume of records every day.
What is the best way to interact with the DB here: via a queue (i.e. the web role writes to a queue, then a worker role reads the queue and saves the data to the DB), or directly to SQL Server?
And should it be a multi-tenant application?
I've been playing around with Windows Azure SQL Database for a little while now, and this is a blog post I wrote about inserting large amounts of data:
http://alexandrebrisebois.wordpress.com/2013/02/18/ingesting-massive-amounts-of-relational-data-with-windows-azure-sql-database-70-million-recordsday/
My recipe to insert/update data is as follows:
◾Split your data into reasonably sized DataTables
◾Store the DataTables as blobs in the Windows Azure Blob Storage service
◾Use SqlBulkCopy to insert the data into write tables
◾Once you have reached a reasonable amount of records in your write tables, merge the records into your read tables using reasonably sized batches. Depending on the complexity and the indexes/triggers present on the read tables, batches should be of about 100,000 to 500,000 records
◾Before merging each batch, be sure to remove duplicates, keeping only the most recent records
◾Once a batch has been merged, remove the data from the write table; keeping this table reasonably small is quite important
◾Once your data has been merged, be sure to check your index fragmentation
◾Rinse & repeat
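The merge step of the recipe above can be sketched in Python. This is a toy model of the write-table/read-table flow, not the SQL `MERGE` itself: staged rows are processed in fixed-size batches, duplicates within a batch are collapsed to the most recent record per key, and the survivors are upserted into the read table before the write table is cleared.

```python
# Toy sketch of the write-table -> read-table merge with deduplication.
# Row shapes and sizes are hypothetical; real batches would be 100k-500k.
write_table = [
    {"id": 1, "value": "old", "ts": 1},
    {"id": 1, "value": "new", "ts": 2},   # duplicate key: newer wins
    {"id": 2, "value": "b",   "ts": 1},
]
read_table = {}

def merge_batch(batch, target):
    # Remove duplicates, keeping the most recent row per id.
    latest = {}
    for row in batch:
        if row["id"] not in latest or row["ts"] > latest[row["id"]]["ts"]:
            latest[row["id"]] = row
    # Upsert the surviving rows into the read table.
    for row in latest.values():
        target[row["id"]] = row["value"]

BATCH_SIZE = 2
for i in range(0, len(write_table), BATCH_SIZE):
    merge_batch(write_table[i:i + BATCH_SIZE], read_table)

write_table.clear()   # keep the write table small once merged
```

Small batches keep each merge transaction cheap on the read table's indexes and triggers, which is why the original recipe caps batch size rather than merging everything at once.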
