Is the drill-down feature at the OLAP or the BI layer? - business-intelligence

Drill-down, by definition, requires the use of hierarchical data where values are grouped into levels. (reference)
My understanding is that the drill-down feature is provided by the OLAP engine (e.g. ClickHouse, Apache Druid, Apache Pinot, etc.) and not by the BI/visualisation tool (e.g. Tableau, Superset, Grafana, etc.). However, the presentation slides Seeing Is Believing: Popular BI Tools for ClickHouse (link) say that Grafana supports interactive drill-down on data but Superset does not.
My questions:
Is the drill-down feature at the OLAP or the BI layer?
Does the drill-down feature require pre-computed dimensional aggregations?
How are the dimensions that can be drilled down identified? Manually or automatically?
Thanks.

Is the drill-down feature at the OLAP or the BI layer?
Not sure what you mean by the two layers... but you might check the response to the third question.
Does the drill-down feature require pre-computed dimensional aggregations?
Not necessarily. For example, icCube computes the aggregated values on-the-fly while processing a query; whether the query is related to a drill-down or not does not matter at all.
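To make the "no pre-computation needed" point concrete, here is a minimal sketch with pandas (not icCube itself, just an illustration of on-the-fly aggregation over toy data):

    # Minimal pandas sketch (not icCube): drill-down as on-the-fly aggregation.
    # Nothing is pre-computed; each level is just another GROUP BY over the raw rows.
    import pandas as pd

    sales = pd.DataFrame({
        "country": ["US", "US", "US", "FR", "FR"],
        "city":    ["NYC", "NYC", "LA", "Paris", "Lyon"],
        "amount":  [100, 50, 70, 40, 60],
    })

    # Top level: totals per country.
    by_country = sales.groupby("country", as_index=False)["amount"].sum()

    # Drill-down into one member ("US"): re-aggregate the raw rows at the child level.
    us_by_city = (sales[sales["country"] == "US"]
                  .groupby("city", as_index=False)["amount"].sum())

    print(by_country)
    print(us_by_city)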
How are the dimensions that can be drilled down identified? Manually or automatically?
The visualization tool should somehow know that a dimension exists, that the members of that dimension (e.g., a country, a year) have some children (e.g., cities, quarters), and have a way to query those children. This is automatic for client tools (e.g., Excel, Tableau) connecting, for example, to icCube, SSAS, ... Those tools are able to discover the dimension structure first.
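As a purely hypothetical illustration of the kind of dimension metadata a client tool might discover and walk before offering drill-down (the structure and names below are invented, not icCube's or SSAS's actual metadata API):

    # Hypothetical dimension catalogue a BI client could fetch from an OLAP engine.
    # The names and structure are invented for illustration only.
    catalogue = {
        "Geography": {"levels": ["Country", "City"]},
        "Time":      {"levels": ["Year", "Quarter", "Month"]},
    }

    def next_level(dimension, current_level):
        """Return the child level to drill into, or None if already at the leaf."""
        levels = catalogue[dimension]["levels"]
        i = levels.index(current_level)
        return levels[i + 1] if i + 1 < len(levels) else None

    print(next_level("Geography", "Country"))  # -> "City"
    print(next_level("Time", "Month"))         # -> None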

Related

Migrate BI from SSAS Olap to the free ElasticSearch

I work at a company that is thinking about migrating its BI structure (SSAS OLAP and Power BI) to ElasticSearch/Kibana. Our BI works basically with aggregations such as sum, max, min, count, first, last and average at multiple hierarchy levels (organization level, customer level, unit level, etc.), and with expressions between multiple measures (obtained by aggregation, or by finding the first/last value based on datetime) such as sum, subtraction, multiplication, division and percentage; we also do some date manipulations. We work with more than 1 billion rows in total. Before SSAS OLAP we worked with SQL Server OLTP, and our queries took many minutes. Because of that, we changed to OLAP and we now get our aggregate measures in a few seconds. But the Microsoft license is bursting our budget and the company now requires free software. At the moment, our Business Intelligence solution doesn't have data mining or machine learning, only aggregation measures at multiple hierarchy levels and operations between these aggregations.
And our solution is the entire on-premises.
I have 2 questions:
1. With the free/basic Elasticsearch license, will we be able to migrate our business intelligence solution to the Elastic engine and perform all these hierarchy, aggregation and multi-level search operations?
2. With the free/basic Elasticsearch engine, will we get these measure aggregations over millions of records, and operations between measures, in a time similar to our OLAP cube?
Best Regards,
Luis
In general, it is possible to use ElasticSearch in the way you described. However, this really makes sense only if you often use 'like'-based filtering on text columns; that use case is perfect for ElasticSearch. All facts needed for the reports should be included in one ES document; avoid sub-collections, as aggregate queries on fields from sub-collections (nested queries) may kill performance.
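As a rough illustration of the "one flat document per fact" advice - the field names below are invented, and the aggregation body is just the standard terms/sum pattern you would send to the _search endpoint:

    # Sketch of the "keep each fact in one flat document" advice (field names invented).
    # Everything needed for reporting is denormalized onto the fact document.
    flat_doc = {
        "order_id": 1,
        "customer": "ACME",
        "organization": "North",
        "unit": "U-12",
        "amount": 125.5,
        "ts": "2020-01-15T10:00:00Z",
    }

    # A simple terms + sum aggregation over such flat documents - no nested fields,
    # so Elasticsearch can aggregate without the cost of nested queries.
    agg_request = {
        "size": 0,
        "aggs": {
            "by_organization": {
                "terms": {"field": "organization.keyword"},
                "aggs": {"total_amount": {"sum": {"field": "amount"}}},
            }
        },
    }

    # This body can be POSTed to http://<es-host>:9200/<index>/_search with any
    # HTTP client, e.g. requests.post(f"{es_url}/orders/_search", json=agg_request)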
If all you need is damn-fast aggregations and 'exact match' filtering, it might be much better to use columnar SQL-compatible databases:
MemSQL - the free version should be enough for 1 billion rows; this is a good choice if you need to support both OLAP and OLTP usage (say, frequent row updates)
Yandex ClickHouse - free/open source, suitable if your data is append-only (no frequent updates/deletes). Its aggregate query performance is outstanding, and it can handle queries over 1 billion rows fast enough even in a single-node configuration
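As a feel for the ClickHouse path, here is a minimal sketch using the clickhouse-driver Python client; the table and column names are made up:

    # Minimal sketch of a ClickHouse aggregation (table/columns invented),
    # using the clickhouse-driver package: pip install clickhouse-driver
    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Typical OLAP-style query: aggregate a large fact table at two hierarchy levels.
    rows = client.execute(
        """
        SELECT organization, customer, sum(amount) AS total, count() AS events
        FROM sales_facts
        GROUP BY organization, customer
        ORDER BY total DESC
        LIMIT 20
        """
    )
    for organization, customer, total, events in rows:
        print(organization, customer, total, events)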
You mentioned that your existing BI infrastructure uses SSAS OLAP, so I assume that many reports are in fact pivot tables. Most free BI tools (including Kibana) either don't support pivot tables at all or support only very primitive ones. Fortunately, a good alternative to Power BI/Excel PivotTable exists, though it is not completely free: SeekTable. It can connect to all the DBs I mentioned (MemSQL, ClickHouse) and even to ElasticSearch [Disclaimer: I'm affiliated with this BI tool].

Modern Business Intelligence solution

What is the modern way of building a Business Intelligence solution? I have looked at Power BI, but I'm wondering what would be the best data source for it. Should traditional data warehouse solutions still be used as the data source? I also hear a lot of talk about data lakes, but I don't know much about them. Or should I just use a regular relational database as the source? Does anyone have any opinions and tips on this?
I think the starting point of your thinking is wrong. You don't choose a front-end BI/dashboard tool and then think about what source would be best to connect to it.
You start from the data & information that you want to analyze, report on & visualize. Think of the structure & variety of the data and the complexity of the analysis, correlations, integrations & business logic.
Then decide how you are going to:
1. Store the data
2. Process/transform the data to correlate, integrate or enrich it
3. Report on or visualize the data
And it's only at step 3 of these high-level tasks that you start thinking about which analysis/visualization tool is the best fit for such data, its integrations with the data storage platform you have, and the nature of the data itself.
That will most likely bring you more success than thinking about it the way you posed that question.
I hope it helps.
Start with your data.
1. Do you have a data warehouse now? If not:
2. Where is your data - databases, Excel, email? Data in databases, like MySQL, is structured. Data in email or other documents is unstructured. Where your data lives affects how you will analyze it (which is what BI is all about, in the end). (And a side note: data lakes are best for analyzing structured, semi-structured and unstructured data together - for example, if you queried data from documentation, a SQL DB and older MS Access data dumps.)
3. If you have data in different databases and systems, then I would recommend you start with a data warehouse. There are many options; one of the easier ones today is a cloud-based solution (AWS, Microsoft, etc.).
4. Once your data is in a location (or locations) where it can be queried and analyzed as a total data set, you can look at the BI tools that fit your needs.
4.a. What type of analysis do you need? Queries? Trends? Complex data calculations and transformations?
5. Based on 4.a, look at the tools on the market. Power BI is just one of a whole variety of data analysis tools and systems; there are many resources on the web - Google "ETL tools".
6. After all of this you can narrow down your choices and select the solution that works best for you.

What is the difference between BO universes and Cognos cubes?

I'm currently trying to understand the mechanics behind BO and Cognos, and to understand a bit more deeply how they actually work, in order to start a BI project.
On the side of traditional BI tools, I have trouble seeing the difference (other than the names) between the universes in BO and the cubes in Cognos.
Is there a real difference?
Thanks
A BO Universe is a metadata layer. When you query it, you query the database.
In addition to a similar metadata layer, Cognos BI has two types of cubes. A cube is data in a dimensional structure, with dimensions and measures. In addition to the basic measure values, it contains aggregates for the higher levels of every dimension.
Transformer Cube: contains all the data inside the cube file. You don't even need a database to query it.
Dynamic Cube: using Dynamic Cube technology, you load data into memory and do the calculations there. You still need a database, but it should be faster - if you have enough memory.

Cube design - ROLAP considerations vs. MOLAP

Does anyone have resources that give a list of things to consider when designing a ROLAP cube, as opposed to MOLAP? (I'm doing it in Pentaho, but I guess the principles are not dissimilar for other implementations.) For example, I'm thinking of things like:
should extra transformational work be done at the ETL stage to reduce computational work when querying the cube?
should all my dimension tables be in the same database as my cube?
I'm a Pentaho implementor in Indonesia. First, of course, you should try to aggregate all your measures grouped by the surrogate keys involved.
And in Mondrian, you can "cache" some computations using additional aggregate tables. You can do this with Pentaho Aggregate Designer, but after that you will need additional work in your data warehouse / ETL stage (a rough sketch of the aggregate-table idea follows after this answer).
Regards,
Feris
http://pentaho-en.phi-integration.com
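To make the aggregate-table idea from the answer above concrete, here is a rough pandas sketch of what such a pre-computed table holds; the fact table and key names are assumptions, and a real Mondrian aggregate table would follow its own naming conventions and be built in SQL during the ETL stage:

    # Rough sketch of pre-computing an aggregate table from a fact table,
    # grouped by surrogate keys (names are assumptions, not Mondrian conventions).
    import pandas as pd

    fact_sales = pd.DataFrame({
        "time_key":    [1, 1, 2, 2, 2],
        "product_key": [10, 11, 10, 10, 11],
        "amount":      [5.0, 7.5, 3.0, 4.0, 6.0],
        "quantity":    [1, 2, 1, 1, 3],
    })

    # The "cached" computation: measures summed per (time_key, product_key),
    # plus a fact_count column, which aggregate tables typically carry.
    agg_sales = (fact_sales
                 .groupby(["time_key", "product_key"], as_index=False)
                 .agg(amount=("amount", "sum"),
                      quantity=("quantity", "sum"),
                      fact_count=("amount", "size")))

    print(agg_sales)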
First off - the designs are similar but they are driven by different performance & scalability strategies.
Secondly - the ETL process is pretty much the same. Except you'll typically see a lot more data in a ROLAP cube than a MOLAP cube because of the scalability features of relational databases. And you'll often see a ROLAP cube within a non-ROLAP database (a warehouse, even a transactional database) that does more than just support ROLAP.
Lastly, you'll typically generate aggregate tables if you've got much data volume. That aggregation can be done in a lot of different ways, but I'd say it is not typically driven by your ETL process unless you lack the ability to manage a separate asynchronous process or have data volumes that make it impractical to run periodic summary jobs.
Thanks to Feris for the link and input, but in the end I went for this book:
http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/dp/0470484322/ref=sr_1_1?ie=UTF8&s=books&qid=1258408259&sr=8-1
I had a good long look at the Mondrian site + docs, but the book seems more comprehensive.

schema-less data warehouse and reporting

We have a system that generates many events as the result of a phone call/web request/SMS/email etc. Each of these events needs to be stored and made available for reporting (for MI/BI etc.), and each event has many variables and does not fit any one specific schema.
The structure of the event document is a key-value pair list (cdr=1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) documents, with reports that run against that data. For many different reasons I am looking at other solutions.
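For illustration, a payload like the one above can be parsed into a flat document with the Python standard library, assuming it really is URL-query-style encoded:

    # Parse the key-value event string from the question into a flat dict,
    # assuming it is URL-query-style encoded.
    from urllib.parse import parse_qs

    raw_event = "cdr=1&name=Paul&duration=123&postcode=l21"
    event = {k: v[0] for k, v in parse_qs(raw_event).items()}
    print(event)  # {'cdr': '1', 'name': 'Paul', 'duration': '123', 'postcode': 'l21'}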
I am looking for suggestions for a system (open or closed) that allows us to push these events in (regardless of the schema) and provides reporting and analytics on top of them.
I have seen Pentaho and Jasper, but most of them seem to connect to an existing system to get the data out and then report on it. I really just want to be able to push a document in and have it available to be reported on.
As much as I love CouchDB, I am looking for a system that allows schema-less submission of data and reporting on top of it (much like Pentaho, Jasper, SQL Reporting/Analytics Server etc.).
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business is prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking. It's even less trivial when you try to build such a system for high uptimes.
Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.
If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
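A rough sketch of that nightly roll-over, assuming PostgreSQL-style declarative partitioning (the table, column and partition names are invented; the same idea applies to other engines and to cube partitions):

    # Rough sketch of the nightly lead-partition roll-over (PostgreSQL-style
    # declarative partitioning assumed; table/column/index names are invented).
    from datetime import date, timedelta

    def rollover_sql(today):
        """Return the DDL for one nightly roll-over of the lead partition."""
        tomorrow = today + timedelta(days=1)
        day_after = tomorrow + timedelta(days=1)
        old_lead = f"fact_events_{today:%Y%m%d}"     # partition that was taking inserts
        new_lead = f"fact_events_{tomorrow:%Y%m%d}"  # new, unindexed lead partition
        return [
            # 1. New lead partition: no secondary indexes, so trickle inserts stay cheap.
            f"CREATE TABLE {new_lead} PARTITION OF fact_events "
            f"FOR VALUES FROM ('{tomorrow}') TO ('{day_after}');",
            # 2. Consolidate the old lead: build the indexes used by analytic queries.
            f"CREATE INDEX ON {old_lead} (caller_id);",
            f"CREATE INDEX ON {old_lead} (event_type, event_time);",
        ]

    for stmt in rollover_sql(date(2020, 1, 15)):
        print(stmt)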
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding. Each row on the dimension table corresponds to a distinct combination that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
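A small pandas sketch of building such a junk dimension from the raw events; the column names and the placeholder convention are assumptions for illustration:

    # Sketch: build a junk dimension from the optional coding columns of the raw
    # events and link the fact rows to it via a surrogate key (names are assumptions).
    import pandas as pd

    events = pd.DataFrame({
        "caller":   ["A", "B", "A", "C"],
        "channel":  ["sms", "web", "sms", "phone"],   # optional coding #1
        "postcode": ["l21", "", "l21", "m1"],         # optional coding #2, '' = not recorded
        "duration": [123, 45, 60, 200],
    })

    coding_cols = ["channel", "postcode"]

    # One row per distinct combination of coding values that occurs in the data.
    junk_dim = events[coding_cols].drop_duplicates().reset_index(drop=True)
    junk_dim["junk_key"] = junk_dim.index + 1

    # Replace the coding columns on the fact with the surrogate key.
    fact = (events.merge(junk_dim, on=coding_cols, how="left")
            .drop(columns=coding_cols))

    print(junk_dim)
    print(fact)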
Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.
I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey they can import it into excel and pivot it from there.
This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
Wrap the exporter in a web page so they can download it themselves and get up-to-date data.
Note that the wheels will come off if you have larger data volumes over 65535 respondents per survey as this won't fit onto a spreadsheet tab. Excel 2007 increases this limit to 1048575. However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
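A minimal sketch of the "CSV with named headers, sampled down if it will not fit in Excel" idea; the in-memory data structure and function name are assumptions:

    # Sketch: export one survey's responses as a CSV with named headers, sampling
    # down if the row count would not fit on a pre-2007 Excel worksheet.
    # The in-memory data structure is an assumption for illustration.
    import csv
    import random

    EXCEL_2003_ROW_LIMIT = 65_535  # data rows that fit on one old-style sheet

    def export_survey_csv(responses, fieldnames, path, limit=EXCEL_2003_ROW_LIMIT):
        """responses: list of dicts keyed by the survey's configured field names."""
        rows = responses
        if len(rows) > limit:
            rows = random.sample(rows, limit)  # random sample small enough for Excel
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)

    # Example usage with made-up data:
    export_survey_csv(
        responses=[{"respondent": 1, "q1": "yes", "q2": 4},
                   {"respondent": 2, "q1": "no",  "q2": 2}],
        fieldnames=["respondent", "q1", "q2"],
        path="survey_42.csv",
    )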
Edit: I don't think there are other solutions that are sufficiently flexible for this type of application. You've described a holy grail of survey statistics.
I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy but this is certainly quite feasible. I have written VBA code that programmatically creates pivot tables in the past, so I can say from personal experience that this is feasible to do.
The problem becomes a bit more complex if you want to compute and report distributions of (say) response times, as you have to construct the displays. You can programmatically construct pivot charts if necessary, but automating report construction through Excel in this way will be a fair bit of work.
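If driving Excel through OLE is too hairy, a swapped-in alternative for producing the "partially digested" deliverable is to pre-pivot the data before export. The sketch below uses pandas rather than the OLE/VBA route described above, and the field names are assumptions:

    # Sketch of pre-pivoting the export instead of driving Excel via OLE/VBA.
    # Field names are assumptions for illustration.
    import pandas as pd

    responses = pd.DataFrame({
        "region":   ["North", "North", "South", "South"],
        "question": ["q1", "q2", "q1", "q2"],
        "score":    [4, 3, 5, 2],
    })

    # Partially digested output: mean score per region and question.
    pivot = pd.pivot_table(responses, values="score",
                           index="region", columns="question", aggfunc="mean")

    pivot.to_csv("survey_pivot.csv")  # ready for the client to open in Excel
    print(pivot)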
You might get some mileage from R (www.r-project.org) as you can construct a framework that lets you import data and generate bespoke reports with a bit of R Code. This is not an end-user tool but your client base sounds like they want canned reports anyway.
