We use Metabase to monitor our KPIs, such as total current active items.
It's a great tool, but when we need to look at trends over time, such as how the total of current active items changes by day, Metabase can't help much, since it only shows whatever the database currently holds.
I know there are many data-warehousing stacks I could use to store time-series data, but I'm curious whether there is any framework that stores time-series snapshots of configured KPIs out of the box.
What I imagine is something like an advanced Metabase; the user scenario would be like this:
I can set up the metrics I'd like to track (like DAU or total active items)
The framework automatically saves these metrics at the configured time slot (minute, day, week, or month)
It visualizes the stored history (a rough sketch of what I have in mind is below)
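What I picture under the hood is basically a scheduled snapshot job writing each configured metric into a history table. A minimal sketch of that idea in plain SQL (the kpi_history and items tables and their columns are made up for illustration; the insert would run once per configured time slot, e.g. from cron):

-- history table the framework would maintain (hypothetical schema)
CREATE TABLE kpi_history (
    captured_at  TIMESTAMP    NOT NULL,  -- start of the time slot
    metric_name  VARCHAR(100) NOT NULL,  -- e.g. 'total_active_items', 'dau'
    metric_value BIGINT       NOT NULL,
    PRIMARY KEY (captured_at, metric_name)
);

-- run once per time slot for each configured metric
INSERT INTO kpi_history (captured_at, metric_name, metric_value)
SELECT CURRENT_TIMESTAMP, 'total_active_items', COUNT(*)
FROM items
WHERE status = 'active';

-- the "change by day" trend is then just a query over the history
SELECT CAST(captured_at AS DATE) AS day, MAX(metric_value) AS total_active_items
FROM kpi_history
WHERE metric_name = 'total_active_items'
GROUP BY CAST(captured_at AS DATE)
ORDER BY day;

Metabase could chart kpi_history directly; I'm just hoping a framework already does this bookkeeping for me.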
Would love to hear some input and thoughts :D
Related
I am doing some research into the best possible state for data to be in so that reporting and BI analytics perform well, but can still be produced by business users from a set of various data collections that align with a business data glossary I have worked through.
We have not chosen a specific BI tool but have been playing around with Power BI and Sisense
We have not decided on a data store technology to use for reporting purposes
Origin Data
Our business application that the data will originate from has a normalised SQL relational database. There are quite a few tables and joins to consider, which work fine from an application perspective, but I have recommended supplying the output of those queries as a flat, denormalised set of data, accepting the extra redundancy in order to remove the joins entirely.
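As a rough illustration of what I mean by a flat output (all table and column names below are invented, not our actual schema), the joins would be resolved once into a denormalised view or extract that business users query directly:

-- hypothetical normalised source tables: position, instrument, counterparty
CREATE VIEW reporting_positions AS
SELECT
    p.position_id,
    p.trade_date,
    p.quantity,
    p.price,
    i.asset_type,            -- dimension attributes copied onto every row
    i.instrument_name,
    c.counterparty_name,
    c.counterparty_country
FROM position p
JOIN instrument   i ON i.instrument_id   = p.instrument_id
JOIN counterparty c ON c.counterparty_id = p.counterparty_id;

Business users would only ever see reporting_positions (or a materialised extract of it), never the underlying joins.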
Business Data Glossary
As we work through the business data glossary, the number of columns increases, but I do not anticipate more than 100 columns per row for a complete reporting set of data. I want to ensure that each row of data is at transactional depth (level 0) and that roll-ups through the data are done via aggregations by distinct key values and dimensional taxonomy.
Architecture
I want some advice around what a modern architecture looks like and what works for business users rather than users who are comfortable with SQL queries and a myriad of joins on a physical data model.
I read an article about setting up dataflows for Power BI, which looked like the type of thing I want to do from a data availability perspective, but it doesn't advise on how the data should be stored or what type of database to use.
Data Sets
The data we have that needs to be reported on are transactions where level 0 is trade positions (individual transactions from either a local or counterparty entity), level 1 is reconciliations (relating the local and counterparty entities via a trade-linking identifier), and level 2 is where it can be rolled up by taxonomy such as asset type or status.
The current data set is a snapshot of positions taken every business day, so the data is duplicated each day with a snapshot date applied. The reports need to be able to move across dates and show changes over time.
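To make the "move across dates" part concrete, a level 2 roll-up and a day-over-day comparison on such a snapshotted flat table could look roughly like this (the table, columns and dates are illustrative only):

-- roll-up by taxonomy for a single snapshot date
SELECT snapshot_date, asset_type, COUNT(*) AS positions, SUM(quantity * price) AS exposure
FROM reporting_positions_snapshot
WHERE snapshot_date = DATE '2020-06-30'
GROUP BY snapshot_date, asset_type;

-- change in exposure between two snapshot dates
SELECT cur.asset_type, cur.exposure - prev.exposure AS exposure_change
FROM (SELECT asset_type, SUM(quantity * price) AS exposure
      FROM reporting_positions_snapshot
      WHERE snapshot_date = DATE '2020-06-30'
      GROUP BY asset_type) cur
JOIN (SELECT asset_type, SUM(quantity * price) AS exposure
      FROM reporting_positions_snapshot
      WHERE snapshot_date = DATE '2020-06-29'
      GROUP BY asset_type) prev
  ON prev.asset_type = cur.asset_type;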
Any advice on how to tackle reporting and BI in 2020 would be greatly appreciated. Oooh, one last thing: there is a possibility that we won't be allowed to process this type of data in the public cloud. We have our own infrastructure on a private cloud, so that might need to be a consideration. Thanks
So I will be embarking on designing a dashboard that will display KPIs and other relevant information for my team. Since I am in the early stages of this project and not very familiar with the technical process behind designing a dashboard, I need some questions vetted first before I go and shop for solutions, to avoid reinventing the wheel.
Here are some of my questions:
We want a dashboard that can provide real-time information from our data sources (or as close to real time as possible). What functionality allows a dashboard to update itself from several data sources concurrently? From a conceptual standpoint, I can understand creating a dashboard out of Microsoft Excel and having the dashboard depend on the values you have set within your pivot table.
How do you make a dashboard request information from multiple data sources on its own? In the Excel example, a user may have to go into the pivot tables to update values, but I want to know how a dashboard would request this by itself and what the exact mechanism is from a programming standpoint. Does the code execute every time you refresh the web page?
How do you create data sources organically? I know that for some solutions, such as SharePoint BI Center, there are pre-supported data sources like an Excel sheet or SharePoint, and it's as easy as uploading your document and letting the design handle the rest. However, there are going to be some data sources that I know will need to be fetched. Do I need to understand something else, like an event recorder, in order to navigate this issue?
Introduction
The dashboard (or a report, respectively) is usually the result of a long chain of steps. Very much simplified, it could look like this:
src1
 |------\
src2    |              /---- Dashboards
 |------+---[DWH]-[BR]-+
src n   |      |       \---- Reports etc.
 |------/ [Big Data]
Keep in mind, this is only a very, very simple structure of a data backend / frontend.
DWH means Data Warehouse, where data might be stored temporarily (you referred to this as fetching). This could be a database, could be a Big Data engine, could be a combination of both...
Afterwards, there are Business Rules (BR). Those might be specific rules for how different departments calculate and relate data, but also simple things like algebra.
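As a tiny illustration (the table, columns and rule below are invented purely to show the idea), such a rule can be captured once centrally, for example as a view on top of the DWH, so that every department reports the same figure:

-- hypothetical rule: "net sales" = gross amount minus returns and rebates
CREATE VIEW br_net_sales AS
SELECT order_id,
       department,
       gross_amount - returned_amount - rebate_amount AS net_sales
FROM dwh_sales;

Dashboards and reports then query br_net_sales instead of re-implementing the calculation in each tool.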
Questions
So, the main questions should not be about the technology:
What software should we choose?
How can we create a dashboard?
but should instead focus on your business processes (think of it as a top-down view):
What does our core process look like? Where would I like to measure data?
How does department A calculate sales differently from department B? Should they all use the same rule?
Where does everyone store the data? Can we access it? Do we need structured data?
And, easy to forget but sometimes one of the biggest parts: is the identifier of a business object (say, a sales ID) built and formatted in the same way everywhere? (See the sketch below for the kind of check I mean.)
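A rough sketch of such a check (the source tables and columns here are invented): normalise the identifiers from two systems in the same way and list the ones that do not match up:

-- normalise both sides the same way (trim and uppercase) before comparing
SELECT a.raw_sales_id
FROM system_a_sales a
LEFT JOIN system_b_sales b
       ON UPPER(TRIM(a.raw_sales_id)) = UPPER(TRIM(b.raw_sales_id))
WHERE b.raw_sales_id IS NULL;  -- IDs present in system A but unmatched in B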
Conclusion
When you keep those questions at least in the back of your head and keep working in this direction, data will more or less automatically spill out at certain points of that process.
Then it won't matter whether you use Excel, a small-to-medium app like Tableau, TIBCO Spotfire, QlikView or Power BI, or you want to go full scale with a big Hadoop backend, databases and JasperReports, Apache Drill, Pentaho or SSIS on top of it... it will come out eventually.
TL;DR
Focus on the processes first. Make sure to understand them. Draft in Excel. Then proceed to get the data and the tools you need to support your use cases. It will work out much better with a "top-down" approach than trying to solve your requirements with tools alone.
I would like to store a large amount of time series data from devices. These time series have to be validated, can be modified by an operator, and have to be exported to other systems. Holes in the time series must be found. Time series must be shown in the UI, filtered by serial number and date range.
We have thought about using Hadoop, HBase, OpenTSDB and Spark for this scenario.
What do you think about it? Can Spark connect to OpenTSDB easily?
Thanks
OpenTSDB is really great for storing large amounts of time series data. Internally, it is underpinned by HBase, which means that it had to find a way around HBase's limitations in order to perform well. As a result, the representation of time series is highly optimized and not easy to decode. AFAIK, there is no out-of-the-box connector that would allow you to fetch data from OpenTSDB into Spark.
The following GitHub project might provide you with some guidance:
Achak1987's connector
If you are looking for libs that would help you with time series, have a look at spark-ts - it contains useful functions for missing data imputation as well.
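For the "holes in the time series" requirement specifically, once the data is in a Spark DataFrame you can also find gaps with a plain window function. A rough Spark SQL sketch, assuming the readings are registered as a temporary view called readings with columns serial_number and ts, and a nominal sampling interval of 60 seconds:

SELECT serial_number,
       prev_ts AS gap_start,
       ts      AS gap_end
FROM (
    SELECT serial_number,
           ts,
           LAG(ts) OVER (PARTITION BY serial_number ORDER BY ts) AS prev_ts
    FROM readings
) t
WHERE prev_ts IS NOT NULL
  AND unix_timestamp(ts) - unix_timestamp(prev_ts) > 60;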
Warp 10 offers the WarpScript language which can be used from Spark/Pig/Flink to manipulate time series and access data stored in Warp 10 via a Warp10InputFormat.
Warp 10 is Open Source and available at www.warp10.io
Disclaimer: I'm CTO of Cityzen Data, maker of Warp 10.
Take a look at Axibase Time Series Database which has a rather unique versioning feature to maintain a history of value changes for the same timestamp. Once enabled with per-metric granularity, the database keeps track of source, status and times of value modifications for audit trail or data reconciliation.
We have customers streaming data from Spark apps using the Network API, typically once data is enriched with additional metadata (aka series tags) for downstream reporting.
You can query data from ATSD with REST API or SQL.
Disclaimer: I work for Axibase.
I'm having fun learning about Hadoop and the various projects around it, and I currently have 2 different strategies I'm thinking about for building a system to store a large collection of market tick data. I'm just getting started with both Hadoop/HDFS and HBase, but hoping someone can help me plant a system seed that I won't have to junk later using these technologies. Below is an outline of my system and requirements, with some query and data usage use cases, and lastly my current thinking about the best approach from the little documentation I have read. It is an open-ended question and I'll gladly welcome any answer that is insightful and accept the best one; feel free to comment on any or all of the points below. - Duncan Krebs
System Requirements - Be able to leverage the data store for historical back testing of systems, historical data charting and future data mining. Once stored, data will always be read-only; fast data access is desired but not a must-have when back testing.
Static Schema - Very simple; I want to capture 3 types of messages from the feed:
Timestamp including date,day,time
Quote including Symbol,timestamp,ask,askSize,bid,bidSize,volume....(About 40 columns of data)
Trade including Symbol,timestamp,price,size,exchange.... (About 20 columns of data)
Data Insert Use Cases - Either from a live market stream of data or lookup via broker API
Data Query Use Cases - Below demonstrates how I would like to logically query my data.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
Get me the number of trades for these 50 symbols for each day over the last 90 days.
The Holy Grail - Can MapReduce be used for use cases like these below??
Generate meta-data from the raw market data through distributed agents. For example, write a job that will compute the average trading volume on a 1-minute interval for all stocks and all sessions stored in the database. Create the job to have an agent for each stock/session that I tell which stock and session it should compute this value for. (Is this what MapReduce can do???)
Can I add my own util code to the agents' classpath so that, for example, the use case above could publish its value to a central repo or messaging server? Can I deploy an agent as an OSGi bundle?
Create different types of agents for different types of metrics and scores that are executed every morning before pre-market trading?
High Frequency Trading
I'm also interested if anyone can share some experience using Hadoop in the context of high frequency trading systems. Just getting into this technology, my initial sense is that Hadoop can be great for storing and processing large volumes of historic tick data; if anyone is using this for real-time trading I'd be interested in learning more! - Duncan Krebs
Based on my understanding of your requirements, Hadoop would be a really good solution for storing your data and running your queries on it using Hive.
Storage: You can store the data in Hadoop in a directory structure like:
~/stock_data/years=2014/months=201409/days=20140925/hours=01/file
Inside the hours folder, the data specific to that hour of the day can reside.
One advantage of using such structure is that you can create external tables in Hive over this data with your partitions on years, months, days and hours. Something like this:
Create external table stock_data (schema) PARTITIONED BY (years bigint, months bigint, days bigint, hours int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION
'~/stock_data'
Coming to the queries part, once you have the data stored in the format mentioned above you can easily run simple queries.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
select * from stock_data where stock = 'GOOG' and days = 20140922
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901
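Aggregations work the same way; for example, the "number of trades for these 50 symbols for each day over the last 90 days" use case could look roughly like this (message_type is an assumed column distinguishing trades from quotes, and the symbol list and date bound are placeholders):

select stock, days, count(*) as trade_count
from stock_data
where message_type = 'TRADE'
  and stock in ('GOOG', 'FB')   -- extend with the rest of the 50 symbols
  and days >= 20140624          -- roughly 90 days before 20140922
group by stock, days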
You can run any such aggregation queries once a day and use the output to come up with the metrics before pre-market trading. Since Hive internally runs MapReduce, these queries won't be very fast.
In order to get faster results, you can use one of the in-memory projects like Impala or Spark. I have used Impala myself to run queries on my Hive tables and have seen a major improvement in the run time of my queries (around 40x). Also, you wouldn't need to make any changes to the structure of the data.
Data Insert Use Cases: You can use tools like Flume or Kafka for inserting data in real time into Hadoop (and thus into the Hive tables). Flume is linearly scalable and can also help in processing events on the fly while transferring.
Overall, a combination of multiple big data technologies can provide a really decent solution to the problem you proposed, and these solutions will scale to huge amounts of data.
I'm looking at web-based visualization tools pulling large data sets directly from Hive.
My use case is this:
Say we have done some offline analysis, the results of which are stored as tables in a storage box (Hadoop) and can be queried via Hive.
These tables contain only the fields I am interested in visualizing. Since I need to expose this visualization to multiple stakeholders, I need it hosted on the web, possibly on one of our internal web servers. At this point in time, the data should be accessed over a secure connection, directly via Hive.
My criteria are these:
Cost of license (vs. a one-time purchase)
Learning curve & adaptability
(Low priority, but important) Visualization formats suited to digital advertising as a use case, like funnels, lift attribution, etc.
I like Tableau, but it's very expensive (upward of 10,000 USD per year) - I was looking for something good but cheaper. I evaluated Datameer and it looks promising; have you used it for similar use cases, and what were your experiences?
I haven't tried it yet, but perhaps something like Zeppelin (http://zeppelin-project.org) might be useful to look at.