methods to calculate a representative index value using multiple different types of metrics - metrics

I am trying to create a index metric of multiple metric to show a representative score for a specific category.
example:
In the datacenter, there are multiple components such as compute, storage, power, space, network, firewalls
each of these have multiple NODES and associated metrics like usage, performance, health, number of xxx processed etc. and these metrics are over time.
Another example:
Developers have multiple activities, like checkin, number of line modified, bug fixed per checkin, bugs, failed regression
Is there a mathematical model that can take these metrics & some additional configurations like importance, relevance and create a single score that I can track over time - and then if I wish to double click to understand blips on the graph, I can look at it by looking at the individual metrics ?
We have ELK stack collecting all this datacenter data and other ITSMC tools that have support metrics, SLA etc.
are there any applications that do this ? sort of like the tracker of S&P500 or FINTECH stocks etc. like the index fund that tracks multiple types of stocks based on invested $$.
thanks

Related

Prometheus metric and label naming for high cardinality dimensions

I want to report a metric for each item that is viewed in our system. We have tens of millions of items.
In the Prometheus documentation it warns not to label high cardinality metrics
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
So in this case what is the best practice?
The practice is to use a logs-based system such as the ELK stack, rather than a metrics-based system like Prometheus or Graphite.

Service architecture using technologies which provide parallelism and high scalability

I'm working on a booking system with a single RDBMS. This system has units (products) with several characteristics (attributes) like: location, size [m2], has sea view, has air conditioner…
On the top of that there is pricing with its prices for different periods e.g. 1/1/2018 – 1/4/2018 -> 30$ ... Also, there is capacity with its own periods 1/8/2017 – 1/6/2018… Availability which is the same as capacity.
Each price can have its own type: per person, per stay, per item… There are restrictions for different age groups, extra bed, …
We are talking about 100k potential units. The end user can make request to search all units in several countries, for two adults and children of 3 and 7 years, for period 1/1/2018 – 1/8/2018, where are 2 rooms with one king size bed and one single bed + one extra bed. Also, there can be other rules which are handled by rule engine.
In classical approach filtering would be done in several iterations, trying to eliminate as much as possible in each iteration. There could be done several tables with semi results which must be synchronized with every change when something has been changed through administration.
Recently I was reading about Hadoop and Storm which are highly scalable and provide parallelism. I was wondering if this kind of technology is suitable for solving described problem. Main idea is to write “one method” which validates each unit, if satisfies given filter search. Later this function is easy to extend with additional logic. Each cluster could take its own portion of the load. If there are 10 cluster, each of them could process 10k units.
In Cloudera tutorial there is a moment when with Sqoop, content from RDBMS has been transferred to HDFS. This process takes some time, so it seems it’s not a good approach to solve this problem. Given problem is highly deterministic and it requires to have immediate synchronization and to operates with fresh data. Maybe to use in some streaming service and to parallelly write into HDFS and RDBMS? Do you recommend some other technology like Storm?
What could be possible architecture, starting point, to satisfy all requirements to solve this problem.
Please point me into right direction if this problem is improper for the site.

ElasticSearch Analytical queries

I am evaluating a few different options for powering an analytics application using an open-source technology. One of the options is using ElasticSearch, though I haven't been able to find any examples of companies using it for large-scale implementations of analytics, thus my question here.
For datasets of 1B-10B points, what limitations (if any, or would it be possible?) would ElasticSearch have? For example, in having a feature-set like Google Analytics, with it.
Here's one user who seems to do analytics on largeish amounts of data - https://digitalgov.gov/2015/01/07/elk - plus description of what they do including downsides.
With Elasticsearch there is no black-white answer to a question as open-ended as yours. The amount of records is not everything: how much disk space are we talking about, how many nodes, how many indices, the number of shards for each, what kind of analytics you need, hardware specs etc etc. Two things are certain from the data you mentioned: you need dedicated master nodes and more importantly good client nodes and depending on queries and the concurrent searches count you will need more or less of them.
In Elasticsearch 5 the client node is called coordinating node but it has the same role. One limitation I can think of is the heap/RAM memory of such coordinating node. The heap of an Elasticsearch node shouldn't be set to values larger than ~30GB due to the longer garbage collection cycles of the JVM (larger memory to clean, more time it takes, more unusable the node is). During GC nothing else runs on that JVM. So you could be limited by the size of the memory.
I said that you most likely will need coordinating nodes because heavy aggregations (what will probably be the most used feature in an analytics platform) will use cpu and memory in the final phase of a query where it gathers the results from all shards involved and performs a final sorting and aggregation. Thus it will need more memory than a normal data node would only for aggregations.
I doubt though that a single aggregation will use so many GBs of memory but it could theoretically use it if the query/aggregation being used is built in a reckless way. Depending on how many concurrent searches there are and how much memory they use you might need more or less coordinating nodes so that the GC cycles are not very frequent.
Bottom line: I think this is possible but some common sense is needed (see my comment about reckless aggregations) and some as close to reality as possible estimations regarding the load.
Google Analytics Pros:
Easy to Install
Can be used in multiple environments (e.g. web, mobile, other)
Customized data collection
Google Analytics Cons:
Custom reporting is limited
Upgrading to Premium is expensive
Requires continual traning
Slices data into smaller samples to deal with large sampling issues
ElasticSearch Pros:
Distributed by design
Easier to scale horizontally
Good at full text search
Fast indexing & querying
ElasticSearch Cons:
Not a relational database therefore does not benefit from things like foreign-key constaints
Data consistency can be affected
No built-in authentication or authorization system

Reasons against using Elasticsearch as an OLAP cube

At first glance, it seems that with Elasticsearch as a backend it is easy and fast to build reports with pivot-like functionality as used in traditional business intelligence environments.
By "pivot-like" I mean that in SQL-terms, data is grouped by one to two dimensions, filtered, ordered by one or two dimensions and aggregated by several metrics e.g. with sum or count.
By "easy" I mean that with a sufficiently large cluster, no pre-aggregation of the data is required, which saves ETLs and data engineering time.
By "fast" I mean that due to Elasticsearch's near real time capability report latency can be reduced in many instances, when compared to traditional business intelligence systems.
Are there any reasons, not to use Elasticsearch for the above purpose?
ElasticSearch is a great alternative to a cube, we use it for that same purpose today. One huge benefit is that with a cube you need to know what dimensions you want to create reports on. With ES you just shove in more and more data and figure out later how you want to report on it.
At our company we regularly have data go through the following life cycle.
record is written to SQL
primary key from SQL is written to RabbitMQ
we respond back to the customer very quickly
When Rabbit has time, it uses the primary key to gather up all the data we want to report on
That data is written to ElasticSearch
A word of advice: If you think you might want to report on it, get it from the beginning. Inserting 1M rows into ES is very easy, updating 1M rows is a bigger pain.

max number of couchbase views per bucket

How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had an fairly large amount of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
The less design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rules to give you an exact member.
But based on the best practice document that you have used and some discussion(here) you should be able to design your database/views properly.
Let's start with the last question:
YES the reason why Couchbase advice to have a small number of bucket is for performance - and more importantly resources consumption- reason. I am inviting you to read these blog posts that help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
yes most the time your will organize the design document/and views by type of document.
It is NOT a problem to have all the document "types" in a single(few) buckets, this is in fact the way your work with Couchbase
The most important part to look is, the size of your doc (to see how "long" will be the parsing of the JSON) and how often the document will be created/updated, and also deleted, since the JS code of the view is ONLY executed when you create/change the document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how any views in each document you will have?
In fact the most expensive part is not during the indexing or quering it is more when you have to rebalance the data and indices between nodes (add, remove , failure of nodes)
Finally, but it looks like you already know it, this chapter is quite good to understand how views works (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.

Resources