BigQuery Dashboard Design - Cost Optimization and Caching

I have to design a dashboard with BigQuery as the data source.
The dashboard should show 3 graphs/tables:
A table: total payments to purchase, with a date-wise breakage.
A two-row table / pie chart, basically showing done payments vs. pending payments.
The last and biggest one: a bar chart of total sales and redeemed amounts, aggregated
a. day-wise for the last 30 days
b. week-wise for the last 12 weeks
c. month-wise for the last 6 months
Now, what I really want to ask about is #3, not the UI; that was just for completeness.
Design 1: Create date-wise tables by joining the dumps of all the services' and application tables (we have microservices) via a scheduled job in the merchant BFF, then rely on BigQuery's query cache and build the bigger aggregates from those tables.
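For illustration, a minimal sketch of what the Design 1 scheduled aggregation could look like in BigQuery SQL; every dataset, table, and column name below (analytics.daily_payment_summary, staging.payments_dump, status, amount, and so on) is a placeholder for the sketch, not a real table of ours.

-- One-time setup: a date-partitioned summary table (illustrative names).
CREATE TABLE IF NOT EXISTS analytics.daily_payment_summary (
  summary_date     DATE,
  total_sales      NUMERIC,
  total_redeemed   NUMERIC,
  payments_done    NUMERIC,
  payments_pending NUMERIC
)
PARTITION BY summary_date;

-- Run once a day by the scheduled job in the merchant BFF:
-- aggregate yesterday's joined dumps into one row per day.
INSERT INTO analytics.daily_payment_summary
SELECT
  DATE(p.created_at)                          AS summary_date,
  SUM(p.sale_amount)                          AS total_sales,
  SUM(p.redeemed_amount)                      AS total_redeemed,
  SUM(IF(p.status = 'DONE', p.amount, 0))     AS payments_done,
  SUM(IF(p.status = 'PENDING', p.amount, 0))  AS payments_pending
FROM staging.payments_dump AS p
WHERE DATE(p.created_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY summary_date;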
Design 2: Run the scheduled job but don't create date-wise tables; instead load that data into a Redis cache for the front end.
I'm in a bit of a fix, leaning more toward Design 1, but I have reasons and a few doubts:
My assumption is that BigQuery's query cache is good. Is it really better than Redis?
Still, in Design 1 there will be two queries (one load and one select), so is my choice wrong? Is Design 2 better?
Did I miss anything in Design 1, if it's the one I should pursue? One thing I noticed while writing down the solution is that I haven't taken the merchant ID into account in the date table schema.
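Presumably the fix is just to add merchant_id as a key column of the date-wise table. With that in place, the select side for, say, the week-wise bar chart could look like this (still with placeholder names):

-- Week-wise totals for the last 12 weeks for one merchant.
-- Swap WEEK for DAY or MONTH (and adjust @start_date) for the other two views.
SELECT
  DATE_TRUNC(summary_date, WEEK) AS period,
  SUM(total_sales)               AS total_sales,
  SUM(total_redeemed)            AS total_redeemed
FROM analytics.daily_payment_summary
WHERE merchant_id = @merchant_id
  AND summary_date >= @start_date
GROUP BY period
ORDER BY period;

(My understanding of the query cache: BigQuery serves an identical repeated query from its result cache for roughly 24 hours as long as the referenced table has not changed, and cached results are not billed; non-deterministic functions such as CURRENT_DATE() disable caching, which is why the date bound above is a parameter rather than computed inside the query.)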
Is Redis an unnecessary moving part / extra complexity here?
My data is not petabyte-scale big data, but I hope it will be.

Related

Vertica: how to restrict database size

Could you please help me with the following issue?
I have installed a Vertica cluster. I can't figure out how to restrict the database by time or by size. For example, data in the database must be deleted once it is older than 30 days or once the database reaches 100 GB in size (whichever comes first).
There is no automated way of doing this, and no logical way of "restricting database size". You can't just trim "data" from a "database".
What you are talking about (in terms of limiting data to the last 30 days) needs to be done at the table level. You would need some kind of date field and delete anything older than 30 days. However, I would advise against deleting rows in this way. It is non-performant and can cause queries against the table to be slow: see DELETE and UPDATE Performance Considerations. The best way of doing this would be to partition the table by day and create an automated script (bash, Python, etc.) that each day drops the partition corresponding to the date 30 days ago: see Dropping Partitions.
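A hedged sketch of that approach (table, column, and date values are placeholders; the nightly script would compute the cutoff date, and the exact drop function depends on your Vertica version):

-- Day-partitioned table (illustrative schema).
CREATE TABLE public.events (
    event_date DATE NOT NULL,
    payload    VARCHAR(1000)
)
PARTITION BY event_date;

-- Issued nightly by the script, with the date 30 days in the past filled in.
-- Newer Vertica versions provide DROP_PARTITIONS(table, min_range, max_range);
-- older versions have DROP_PARTITION(table, partition_value).
SELECT DROP_PARTITIONS('public.events', '2014-01-01', '2014-01-01');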
As for deleting data—if the size of the "database" goes above 100GB—this requirement is extremely vague and would be impossible to enforce. Let's say you have 50 tables, and the size of several of those tables grows so that the total size of the database is over 100GB, how would you decide which table to prune? This also must be done on a table by table level (or in this case—technically—on a projection level, since that is where the data is actually stored).
To see the compressed size (size on disk) of the database you can use this query:
SELECT SUM(used_bytes) / ( 1024^3 ) AS database_size_gb
FROM projection_storage;
However, since data can only be deleted with a DELETE or DROP PARTITION statement on a table, it would also be helpful to see the size of each table. You can do this by using this query:
SELECT projection_schema, anchor_table_name, SUM(used_bytes) / ( 1024^3 ) AS table_size_gb
FROM projection_storage
GROUP BY 1, 2
ORDER BY 3 DESC;
From the results you can decide which tables you want to prune.
A couple of notes (as a Vertica DBA):
Data is stored in projections. Having too many projections on a single table can not only cause queries to be slow but will also increase the overall data footprint. Avoid using too many projections (especially superprojections: don't have more than two per table, and most tables will only need one). Use the Database Designer or follow the guidelines in the documentation for creating custom projections: Design Fundamentals.
Also, another trick to keep database size down is to use the DESIGNER_DESIGN_PROJECTION_ENCODINGS function. Unless your projections are created with the database designer, they will likely only contain the auto encoding. Using the DESIGNER_DESIGN_PROJECTION_ENCODINGS function will help you to pick the most optimal encoding for each column. I have seen properly encoded projections take up a mere 2% disk size compared to the previously un-optimized projection. That is rare, but in my experience you will still see at least a 20-40% reduction in size. Do not be afraid to use this function liberally. It is one of my favorite tools as a Vertica DBA.
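For reference, a call looks roughly like this (the projection name and output path are placeholders, and you should check the argument list against your version's documentation): the function analyzes the projection's columns, writes the proposed encoding DDL to a file, and can optionally deploy it.

SELECT DESIGNER_DESIGN_PROJECTION_ENCODINGS(
    'public.big_fact_table_super',  -- projection (or anchor table) to analyze
    '/home/dbadmin/encodings.sql',  -- file to write the proposed script to
    FALSE,                          -- do not deploy automatically, just write the script
    TRUE                            -- re-analyze columns that already have encodings
);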
Also

Reasons against using Elasticsearch as an OLAP cube

At first glance, it seems that with Elasticsearch as a backend it is easy and fast to build reports with pivot-like functionality as used in traditional business intelligence environments.
By "pivot-like" I mean that in SQL-terms, data is grouped by one to two dimensions, filtered, ordered by one or two dimensions and aggregated by several metrics e.g. with sum or count.
By "easy" I mean that with a sufficiently large cluster, no pre-aggregation of the data is required, which saves ETLs and data engineering time.
By "fast" I mean that, due to Elasticsearch's near-real-time capability, report latency can be reduced in many instances compared to traditional business intelligence systems.
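To make "pivot-like" concrete, the workload in question is queries of roughly this shape (tables and columns are purely illustrative):

SELECT region, product_category,
       SUM(revenue)  AS total_revenue,
       COUNT(*)      AS order_count
FROM   orders
WHERE  order_date >= '2015-01-01'
GROUP  BY region, product_category
ORDER  BY total_revenue DESC;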
Are there any reasons not to use Elasticsearch for the above purpose?
ElasticSearch is a great alternative to a cube, we use it for that same purpose today. One huge benefit is that with a cube you need to know what dimensions you want to create reports on. With ES you just shove in more and more data and figure out later how you want to report on it.
At our company we regularly have data go through the following life cycle.
record is written to SQL
primary key from SQL is written to RabbitMQ
we respond back to the customer very quickly
When Rabbit has time, it uses the primary key to gather up all the data we want to report on
That data is written to ElasticSearch
A word of advice: If you think you might want to report on it, get it from the beginning. Inserting 1M rows into ES is very easy, updating 1M rows is a bigger pain.

Governor limits with reports in SFDC

We have a business requirement to show a cost summary for all our projects in a single table.
In order to tabulate these costs we have to query through all the client tasks, regions, job roles, pay rates, cost tables, deliverables, efforts, and hour records (clients and tasks are in the same table, tasks and regions are in the same table, and deliverables, effort, and hours are stored as monthly totals).
Since I have to query all of this before I go for-looping through everything, it hits a large number of executed script statements very quickly. Computationally it's like O(m * n * o * p), and for some of our projects all four variables grow very quickly. My estimates for how to do this have ranged from 90 million script statements to 600 billion.
Using Batch Apex we could break this up by task regions into 200 batches, but that would only reduce the computational profile to 600 billion / 200 = 3 billion script statements per batch (well above the Salesforce limit).
We have been playing around with using Informatica to do these massive calculations, but we have several problems, including (1) our end users cannot wait more than five or so minutes, but just transferring the data (90% of all records if all the projects got updated at once) would take 15 minutes over Informatica or the web API, and (2) we have noticed these massive calculations need to happen in several places (changing a deliverable forecast value, creating an initial forecast, etc.).
Is there a governor-limit workaround that will meet our requirements here (a massive volume of data with a response in 5 or so minutes)? Is Force.com a good platform for us to use here?
This is the way I've been doing it for a similar calculation:
An ERD would help, but have you considered doing this in smaller pieces and with reports in salesforce instead of custom code?
By smaller pieces I mean: use roll-up summary fields to get some totals higher in your tree of objects.
Or use Apex triggers so that, as hours are entered, cost * hours is calculated and placed onto the time record and then rolled up to the deliverables.
Basically get your values calculated at the time the data is entered instead of having to run your calculations every time.
Then you can simply run a report that says show me all my projects and their total cost or total time because those total costs/times are stored/calculated already.
Roll-up summaries only work with master-detail relationships.
Triggers work anytime, but you'll want to account for insert, update as well as delete and undelete! Aggregate Functions will be your friend assuming that the trigger context has fewer than 50,000 records to aggregate - which I'd hope it does b/c if there are more than 50,000 time entries for a single deliverable that's a BIG deliverable :)
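As a rough illustration (object and field API names here are hypothetical), the aggregate query a trigger could run to roll costs and hours up per deliverable would look something like:

SELECT Deliverable__c, SUM(Line_Cost__c) totalCost, SUM(Hours__c) totalHours
FROM Time_Entry__c
WHERE Deliverable__c != null
GROUP BY Deliverable__c

In the trigger you would bind the WHERE clause to just the deliverable Ids in the trigger context, which is what keeps you under the 50,000-row query limit.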
Hope that helps a bit?

Would you recommend using Hadoop/HBASE?

We have a SQL Server 2008 database, and one of the tables, say table A, has the following characteristics:
Every day we get several heterogeneous feeds from other systems with numerical data.
Feeds are staged elsewhere, converted to a format compliant with A's schema.
Inserted into A.
Schema looks like:
<BusinessDate> <TypeId> <InsertDate> <AxisX> <AxisY> <Value>
The table has a variable number of rows. Essentially we have to purge it at the weekends, otherwise the size affects performance. So the size ranges from 3m-15m rows during the week. Due to some new requirements we expect this number to increase by 10m by the end of 2012, so we would be talking about 10m-25m rows.
Now, in addition:
Data in A never changes. The middle tier may use A's data, but it will be a read-only operation. Typically the middle tier doesn't even care about the contents; it usually (not always, but in 80% of cases) runs stored procs to generate reports and delivers the reports to other systems.
Clients of this table would typically want to do long sequential reads for one business date and type, i.e. "get me all type 1 values for today".
Clients will want to join this table with 3-5 more tables and then deliver reports to other systems.
The above assumptions are not necessarily valid for all tables with which A is joined. For example we usually join A with a table B and do a computation like B.value*A.value. B.value is a volatile column.
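To make the access pattern concrete, a typical report query is roughly of this shape (the join key is an assumption; in reality A is joined with 3-5 tables):

SELECT a.AxisX, a.AxisY, a.Value * b.Value AS ComputedValue
FROM   A AS a
JOIN   B AS b ON b.AxisX = a.AxisX   -- join key assumed for the sketch
WHERE  a.BusinessDate = '2012-03-01'
  AND  a.TypeId = 1;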
Question
A's characteristics do sound very much like what HBase and other column-oriented stores can offer.
However some of the joins are with volatile data.
Would you recommend migrating A to an HBase schema?
And also, if we were to move A, I assume we would also have to migrate B and other dependent tables, which (unlike A) are used in several other places by the middle tier. Wouldn't this complicate things a lot?
25 million rows doesn't sound big enough to justify using HBase, although the usage pattern fits. You need a name node, a job tracker, a master and then your region servers, so you'll need a minimum of maybe 5 nodes to run HBase in any reasonable way. Your rows are so small that I'm guessing it's maybe 10 GB of data, so storing this across 5 servers seems like overkill.
If you do go this route (perhaps you want to store more than a week's data at once) there are ways to integrate HBase with relational DBs. Hive, for example, provides ODBC/JDBC connectivity and can query HBase. Oracle and Teradata both provide integration between their relational DB software and non-relational storage. I know Microsoft has recently announced that they are dropping Dryad in favor of integrating with Hadoop, but I am not certain how far along that process is wrt SQL Server. And if all you need is "get a list of IDs to use in my SQL query" you can of course write something yourself easily enough.
I think HBase is very exciting, and there may be things you haven't mentioned which would drive you towards it (e.g. high availability). But my gut says you can probably scale out your relational db much more cheaply than switching to HBase.

Database speed optimization: few tables with many rows, or many tables with few rows?

I have a big question.
Let's take as an example a database for some company's orders.
Let's say that this company takes around 2,000 orders per month, so around 24K orders per year, and they don't want to delete any orders, even if they're 5 years old (hey, this is an example, the numbers don't mean anything).
In terms of good query speed, is it better to have just one table, or would it be faster to have a table for every year?
My idea was to create a new table for the orders of each year, calling them orders_2008, orders_2009, etc.
Would this be a good idea to speed up DB queries?
Usually the data being used is that of the current year, so the fewer rows there are, the better.
Obviously, this would cause problems when I need to search across all the order tables at once, because I would have to run some complex UNION queries, but that happens very rarely in normal activity.
I think it's better to have an application that is fast for 95% of the queries and somewhat slow for the rest, rather than an application that is always slow.
My current database has about 130 tables; the new version of my application should have about 200-220 tables, of which about 40% would be replicated annually.
Any suggestions?
EDIT: the RDBMS will probably be PostgreSQL, maybe (I hope not) MySQL.
Smaller tables are faster. Period.
If you have history that is rarely used, then getting the history into other tables will be faster.
This is what a data warehouse is about -- separate operational data from historical data.
You can run a periodic extract from operational and a load to historical. All the data is kept, it's just segregated.
Before you worry about query speed, consider the costs.
If you split the data into separate tables, you will have to have code that handles it. Every bit of code you write has the chance to be wrong. You are asking for your code to be buggy at the expense of some unmeasured and imagined performance win.
Also consider the cost of machine time vs. programmer time.
If you use indexes properly, you probably need not split it into multiple tables. Most modern DBs will optimize access.
Another option you might consider is to have a table for the current year, and at the end of the year append its data to another table that holds all the previous years.
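A minimal sketch of that year-end move (table names assumed):

-- Run once at year end.
BEGIN;
INSERT INTO orders_history
  SELECT * FROM orders_current WHERE order_date < '2009-01-01';
DELETE FROM orders_current WHERE order_date < '2009-01-01';
COMMIT;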
I would not split tables by year.
Instead I would archive data to a reporting database every year, and use that when needed.
Alternatively, you could partition the data among drives, thus maintaining performance, although I'm unsure whether this is possible in PostgreSQL.
For the volume of data you're looking at, splitting the data seems like a lot of trouble for little gain. Postgres can do partitioning, but the fine manual [1] says that as a rule of thumb you should probably only consider it for tables that exceed the physical memory of the server. In my experience, that's at least a million rows.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
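For what it's worth, in newer PostgreSQL releases (10 and later) the per-year split can be written as declarative partitioning rather than hand-managed tables; the linked manual page covers the details for whichever version you are on. Names below are illustrative:

CREATE TABLE orders (
    order_id   bigint        NOT NULL,
    order_date date          NOT NULL,
    amount     numeric(12,2)
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2008 PARTITION OF orders
    FOR VALUES FROM ('2008-01-01') TO ('2009-01-01');

CREATE TABLE orders_2009 PARTITION OF orders
    FOR VALUES FROM ('2009-01-01') TO ('2010-01-01');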
I agree that smaller tables are faster. But it depends on your business logic whether it makes sense to split a single entity over multiple tables. If you need a lot of code to manage all the tables, then it might not be a good idea.
It also depends on the database what logic you're able to use to tackle this problem. In Oracle a table can be partitioned (by year, for example). Data is stored physically in different tablespaces, which should make it faster to address (as I would assume that all data of a single year is stored together).
An index will speed things up, but if the data is scattered across the disk then a load of block reads are required, which can make it slow.
Look into partitioning your tables in time slices. Partitioning is good for the log-like table case where no foreign keys point to the tables.

Resources