In BigQuery, does a single unit consume multiple slots? - performance

While reading the timeline information in a query plan, I found that "activeUnits" seemed to conflict with "totalSlotMs".
I assumed that each "unit" in the timeline consumes a single slot, so totalSlotMs should equal the product of "activeUnits" and elapsed time. However, in my timeline, totalSlotMs was more than twice that product.
The timeline information looks like this:
{"activeUnits": "84", "completedUnits": "2673", "elapsedMs": "102776", "pendingUnits": "81", "totalSlotMs": "46827346"},
{"activeUnits": "84", "completedUnits": "2673", "elapsedMs": "103776", "pendingUnits": "81", "totalSlotMs": "47040505"},
The elapsed time between the two samples is 1000 ms, and totalSlotMs increased by 213159 ms.
If each unit consumed a single slot, totalSlotMs would have increased by 84000 ms, which is much less than 213159 ms.
Does a single unit consume multiple slots?

A unit of work is just a concept that means "a piece of work", and it's not really useful for measuring anything.
The documentation says:
Within the query plan, the terms work units and workers are used to
convey information specifically about parallelism. Elsewhere within
BigQuery, you may encounter the term "slot", which is an abstracted
representation of multiple facets of query execution, including
compute, memory, and I/O resources. Top level job statistics provide
the estimate of individual query cost using the totalSlotMs estimate
of the query using this abstracted accounting.
So, as I understand it, a work unit can use many slots, which explains the discrepancy you saw.
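As a rough check, the two samples from the question can be turned into an average slot count. This is only a sketch of the arithmetic: it assumes activeUnits stayed at 84 for the whole interval, which the timeline does not guarantee.

```python
# Two consecutive timeline samples copied from the question.
earlier = {"activeUnits": 84, "elapsedMs": 102776, "totalSlotMs": 46827346}
later = {"activeUnits": 84, "elapsedMs": 103776, "totalSlotMs": 47040505}

interval_ms = later["elapsedMs"] - earlier["elapsedMs"]        # 1000
slot_ms_used = later["totalSlotMs"] - earlier["totalSlotMs"]   # 213159

# Average number of slots in use during the interval.
avg_slots = slot_ms_used / interval_ms
# Average slots per active unit (assumption: activeUnits was constant at 84).
slots_per_unit = avg_slots / later["activeUnits"]

print(f"{avg_slots:.1f} slots on average, {slots_per_unit:.2f} per active unit")
```

With these numbers the query averaged about 213 slots over the second, or roughly 2.5 slots per active unit, which is consistent with a unit not mapping to exactly one slot.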

Related

How does Anomaly detection in Elasticsearch work?

I have a question about the anomaly detection module provided by the Elastic Stack. As I understand machine learning, the more data a model is fed, the better it learns, provided the data is sound. I want to use the anomaly detection module in Kibana. From some testing and reading, I found that it is best to have at least 3 weeks of data or 20 buckets' worth. Now let's say we receive about 40 million records a day. Training the model on a single day of data already takes a long time, so 3 weeks' worth of data at this volume will put a lot of pressure on the node. But if I feed the model less data and reduce the bucket span, the model will become more sensitive. What is my best bet here? How can I make the most of the anomaly detection module?
Just FYI: I do have a dedicated machine learning node equipped with more than enough memory, but it still takes a long time to process one day of records, so my concern is that processing 3 weeks' worth of data will take far longer.
So my question is: if we train one model on a large amount of data covering a short period, say 1 week, and another on a large amount of data covering a slightly longer period, say 3 weeks, will the two models detect anomalies with the same accuracy?
If you have a dedicated ML node with ample memory, I don't see what the problem could be. Common sense has it that the more data you have, the better the model can learn, and the more accurate your predictions will be. Also, seasonality might not be well captured with just one week of data. If you have the data and are not using it out of fear that it will take "some time" to analyze, what's the point of gathering it in the first place?
It is true that it will take "some time" to build the model initially. Afterwards, though, the ML process runs at an interval that depends on your chosen (configurable) bucket span and only processes the new documents that arrived in the meantime, which is really fast. Regarding sensitivity, your mileage may vary: it depends not only on the amount of data you feed in, but also on the bucket span you choose.

Approach to measuring end-to-end latency from a sales transaction to a stock level in a database

I have a system in which sales transactions are written to a Kafka topic in real time. One of the consumers of this data is an aggregator program which maintains a database of stock quantities for all locations, in real time; it will consume data from multiple other sources as well. For example, when a product is sold from a store, the aggregator will reduce the quantity of that product in that store by the quantity sold.
This aggregator's database will be presented via an API to allow applications to check stock availability (the inventory) in any store in real time.
(Note for context - yes, there is an ERP behind all this which does a lot more; the purpose of this inventory API is to consume data from multiple sources, including the ERP and the ERP's data feeds, and potentially other ERPs in future, to give a single global information source for this singular purpose).
What I want to do is to measure the end-to-end latency: how long it takes from a sales transaction being written to the topic, to being processed by the aggregator (not just read from the topic). This will give an indicator of how far behind real-time the inventory database is.
The sales transaction topic will probably be partitioned, so the transactions may not arrive in order.
So far I have thought of two methods.
Method 1 - measure latency via stock level changes
Here, the sales producer injects a special "measurement" sale each minute, for an invalid location like "SKU 0 in branch 0". The sale quantity would be based on the time of day, using a numerical sequence of some kind. A program would then poll the inventory API, or directly read the database, to check for changes in the level. When it changes, the magnitude of the change will indicate the time of the originating transaction. The difference between then and now gives us the latency.
Problem: If multiple transactions are queued and are then later all processed together, the change in inventory value will be the sum of the queued transactions, giving a false reading.
To solve this, the sequence of numbers would have to be chosen such that when they are added together, we can always determine which was the lowest number, giving us the oldest transaction and therefore the latency.
We could use powers of 2 for this, so the lowest bit set would indicate the earliest transaction time. Our sequence would have to reset every 30 or 60 minutes and we'd have to cope with wraparound and lots of edge cases.
Assuming we can solve the wraparound problem and that a maximum measurable latency of, say, 20 minutes is OK (after which we just say it's "too high"), then with this method, it does not matter whether transactions are processed out of sequence or split into partitions.
This method gives a "true" picture of the end-to-end latency, in that it's measuring the point at which the database has actually been updated.
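The powers-of-2 idea can be sketched as follows. The names `WINDOW_MINUTES`, `marker_quantity` and `oldest_minute` are hypothetical, and the sketch assumes each minute's marker is applied at most once per window; it does not attempt to solve the wraparound problem mentioned above.

```python
WINDOW_MINUTES = 20  # assumed maximum measurable latency, as suggested in the text


def marker_quantity(minute_index: int) -> int:
    """Quantity for the measurement sale sent in a given minute of the window."""
    return 1 << (minute_index % WINDOW_MINUTES)


def oldest_minute(stock_change: int) -> int:
    """Recover the earliest minute index from a summed stock change."""
    if stock_change <= 0:
        raise ValueError("expected a positive sum of marker quantities")
    lowest_bit = stock_change & -stock_change  # isolates the lowest set bit
    return lowest_bit.bit_length() - 1


# Markers from minutes 3, 4 and 7 were queued and then applied together.
combined = marker_quantity(3) + marker_quantity(4) + marker_quantity(7)
print(oldest_minute(combined))  # 3
```

Because every marker is a distinct power of 2, the sum never carries into another bit, so the lowest set bit always identifies the oldest unprocessed marker regardless of arrival order.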
Method 2 - measure latency via special timestamp record
Instead of injecting measurement sales records, we use a timestamp which the producer is adding to the raw data. This timestamp is just the time at which the producer transmitted this record.
The aggregator would maintain a measurement of the most recently seen timestamp. The difference between that and the current time would give the latency.
Problem: If transactions are not processed in order, the latency measurement will be unstable, because it relies on the timestamps arriving in sequence.
To solve this, the aggregator would not just output the last timestamp it saw, but instead would output the oldest timestamp it had seen in the past minute across all of its threads (assuming multiple threads potentially reading from multiple partitions). This would give a less "lumpy" view.
This method gives an approximate picture of the end-to-end latency, since it's measuring the point at which the aggregator receives the sales record, not the point at which the database has been updated.
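A minimal sketch of the "oldest timestamp seen in the past minute" smoothing is shown below. The class `LatencyTracker` and its method names are hypothetical. It assumes records are passed to `observe` in arrival order, and it is not thread-safe, so a multi-threaded aggregator would need a lock or one tracker per thread whose results are merged.

```python
from collections import deque


class LatencyTracker:
    """Reports latency based on the oldest producer timestamp seen recently."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._seen = deque()  # entries are (arrival_time, producer_timestamp)

    def observe(self, producer_timestamp: float, arrival_time: float) -> None:
        """Record one sales record as it is consumed."""
        self._seen.append((arrival_time, producer_timestamp))

    def latency(self, now: float):
        """Return now minus the oldest timestamp in the window, or None if empty."""
        # Drop records that arrived before the start of the window.
        while self._seen and self._seen[0][0] < now - self.window_seconds:
            self._seen.popleft()
        if not self._seen:
            return None
        oldest = min(ts for _arrival, ts in self._seen)
        return now - oldest


tracker = LatencyTracker()
tracker.observe(producer_timestamp=100.0, arrival_time=105.0)
tracker.observe(producer_timestamp=98.0, arrival_time=106.0)   # out of order
tracker.observe(producer_timestamp=104.0, arrival_time=107.0)
print(tracker.latency(now=110.0))  # 12.0, driven by the record stamped 98.0
```

Taking the minimum over the window makes the reading pessimistic but stable: a late, out-of-order record raises the reported latency for up to one window instead of making it jump back and forth.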
The questions
Is one method more likely to get usable results than the other?
For method 1, is there a sequence of numbers which would be more efficient than powers of 2 in allowing us to work out the earliest value when multiple ones arrive at once, requiring fewer bits so that the time before sequence reset would be longer?
Would method 1 have the same problem of "lumpy" data as method 2, in the case of a large number of partitions or data arriving out of order?
Given that method 2 seems simpler, is the method of smoothing out the lumps in the measurement a plausible one?

Algorithms for establishing baselines from time series data

In my app I collect a lot of metrics: hardware/native system metrics (such as CPU load, available memory, swap memory, network IO in terms of packets and bytes sent/received, etc.) as well as JVM metrics (garbage collections, heap size, thread utilization, etc.) as well as app-level metrics (instrumentations that only have meaning to my app, e.g. # orders per minute, etc.).
Throughout the week, month, year I see trends/patterns in these metrics. For instance when cron jobs all kick off at midnight I see CPU and disk thrashing as reports are being generated, etc.
I'm looking for a way to assess/evaluate metrics as healthy/normal vs unhealthy/abnormal but that takes these patterns into consideration. For instance, if CPU spikes around (+/- 5 minutes) midnight each night, that should be considered "normal" and not set off alerts. But if CPU pins during a "low tide" in the day, say between 11:00 AM and noon, that should definitely cause some red flags to trigger.
I have the ability to store my metrics in a time-series database, if that helps kickstart this analytical process at all, but I don't have the foggiest clue as to what algorithms, methods and strategies I could leverage to establish these cyclical "baselines" that act as a function of time. Obviously, such a system would need to be pre-seeded or even trained with historical data that was mapped to normal/abnormal values (which is why I'm leaning towards a time-series DB as the underlying store), but this is new territory for me and I don't even know what to begin Googling so as to get back meaningful/relevant/educated solution candidates in the search results. Any ideas?
You could label each metric value (CPU load, available memory, swap memory, network IO), together with its day and time, as good or bad.
Build a data set for a given time frame that contains the metric values and whether each one is good or bad. Train a model on 70% of the data, including the good/bad labels.
Then test the trained model on the other 30% of the data, without the labels, to see whether the model predicts the expected results (good, bad). You could use a classification algorithm.
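The 70/30 workflow can be illustrated with a deliberately tiny stand-in for a real classification algorithm: a per-hour threshold learned from the samples labelled good. Everything here is an assumption made for illustration, including the synthetic data, the `margin` value and the function names. The split is by day rather than random, which is a common choice for time-ordered data.

```python
from collections import defaultdict


def make_samples(days):
    """Synthetic labelled CPU samples: (day, hour, cpu_percent, label)."""
    samples = []
    for day in days:
        for hour in range(24):
            # Midnight cron jobs make high CPU normal at hour 0.
            normal = 90 if hour == 0 else 20
            samples.append((day, hour, normal + day % 3, "good"))
        # One labelled anomaly per day: CPU pinned at 11:00.
        samples.append((day, 11, 95, "bad"))
    return samples


def train(samples, margin=10):
    """Learn, per hour of day, a threshold just above the highest good value."""
    highest_good = defaultdict(float)
    for _day, hour, cpu, label in samples:
        if label == "good":
            highest_good[hour] = max(highest_good[hour], cpu)
    return {hour: top + margin for hour, top in highest_good.items()}


def predict(thresholds, hour, cpu):
    return "bad" if cpu > thresholds[hour] else "good"


train_set = make_samples(range(0, 7))   # first 70% of the days
test_set = make_samples(range(7, 10))   # remaining 30%, labels used only for scoring
thresholds = train(train_set)
correct = sum(predict(thresholds, h, cpu) == label for _d, h, cpu, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on held-out days: {accuracy:.2f}")
```

The same 95% CPU reading is classified as good at midnight and bad at 11:00, which is the time-dependent behaviour the question asks for. Real metrics would need a model that tolerates noise and unseen hours.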

Service architecture using technologies which provide parallelism and high scalability

I'm working on a booking system with a single RDBMS. This system has units (products) with several characteristics (attributes) like: location, size [m2], has sea view, has air conditioner…
On top of that there is pricing, with prices for different periods, e.g. 1/1/2018 – 1/4/2018 -> 30$ ... There is also capacity, with its own periods, 1/8/2017 – 1/6/2018…, and availability, which works the same way as capacity.
Each price can have its own type: per person, per stay, per item… There are restrictions for different age groups, extra beds, …
We are talking about 100k potential units. The end user can search all units in several countries, for two adults and children aged 3 and 7, for the period 1/1/2018 – 1/8/2018, with 2 rooms, one king-size bed, one single bed and one extra bed. There can also be other rules, which are handled by a rule engine.
In the classical approach, filtering would be done in several iterations, each eliminating as many units as possible. Several tables of intermediate results could be built, but they would have to be re-synchronized whenever something is changed through administration.
Recently I was reading about Hadoop and Storm, which are highly scalable and provide parallelism. I was wondering whether this kind of technology is suitable for the problem described. The main idea is to write “one method” which checks whether a single unit satisfies the given search filter. This function would later be easy to extend with additional logic. Each cluster could take its own portion of the load: with 10 clusters, each could process 10k units.
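To make the "one method" idea concrete, here is a small in-memory sketch. The `Unit` and `Search` fields and the function names are hypothetical, and a thread pool merely stands in for the cluster nodes. It shows the shape of the partitioning only and says nothing about how Hadoop or Storm would perform.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    unit_id: int
    country: str
    rooms: int
    has_sea_view: bool


@dataclass(frozen=True)
class Search:
    countries: frozenset
    min_rooms: int
    sea_view_required: bool = False


def matches(unit: Unit, search: Search) -> bool:
    """The "one method": decides whether a single unit satisfies the search."""
    if unit.country not in search.countries:
        return False
    if unit.rooms < search.min_rooms:
        return False
    if search.sea_view_required and not unit.has_sea_view:
        return False
    return True


def search_partition(units, search):
    """Filter one partition of units; each worker runs this independently."""
    return [u.unit_id for u in units if matches(u, search)]


def parallel_search(units, search, partitions=10):
    size = max(1, -(-len(units) // partitions))  # ceiling division
    chunks = [units[i:i + size] for i in range(0, len(units), size)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = pool.map(search_partition, chunks, [search] * len(chunks))
    return [unit_id for part in results for unit_id in part]


units = [Unit(i, "HR" if i % 2 == 0 else "IT", 1 + i % 3, i % 5 == 0) for i in range(100)]
query = Search(frozenset({"HR"}), min_rooms=2, sea_view_required=True)
print(parallel_search(units, query))  # [10, 20, 40, 50, 70, 80]
```

Since `matches` looks at one unit at a time and shares no state, new rules can be added to it without changing how the work is partitioned.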
In the Cloudera tutorial, Sqoop is used to transfer content from an RDBMS to HDFS. This process takes some time, so it does not seem like a good approach here. The problem is highly deterministic and requires immediate synchronization so that it operates on fresh data. Maybe use a streaming service and write to HDFS and the RDBMS in parallel? Would you recommend some other technology, like Storm?
What could be a possible architecture, as a starting point, that satisfies all these requirements?
Please point me in the right direction if this question is not appropriate for this site.

How can I determine the appropriate number of tasks with GCD or similar?

I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases, the number of operations is so large compared to the actual time each operation takes that simply creating a task for each operation is inappropriate due to overhead, even though GCD overhead is typically low.
So what you'd want to do is split up the number of operations into nice chunks where each task operates on a chunk. But how can I determine the appropriate number of tasks/chunks?
Testing, and profiling. What makes sense, and what works well is application specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker size/task chunk size, and output it at the end. The highest throughput is your best combination.
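The automated benchmark described above might look like the following sketch. It uses a Python thread pool as a stand-in for GCD queues, and `small_operation` is a hypothetical placeholder for the real work, so the absolute numbers mean nothing outside this toy setup. Throughput here is measured as total operations per second for each combination.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def small_operation(x: int) -> int:
    return x * x  # stand-in for one tiny unit of real work


def run_chunk(chunk):
    """One task: process a whole chunk of small operations."""
    return [small_operation(x) for x in chunk]


def measure(items, workers: int, chunk_size: int) -> float:
    """Return throughput in operations per second for one combination."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(run_chunk, chunks):
            pass  # results are discarded; only the timing matters here
    elapsed = time.perf_counter() - start
    return len(items) / elapsed


def benchmark(items, worker_counts, chunk_sizes):
    """Try every combination and report the one with the highest throughput."""
    results = {}
    for workers in worker_counts:
        for chunk_size in chunk_sizes:
            results[(workers, chunk_size)] = measure(items, workers, chunk_size)
    best = max(results, key=results.get)
    return best, results


best, results = benchmark(list(range(20_000)), [1, 2, 4], [10, 100, 1000])
for (workers, chunk_size), throughput in sorted(results.items()):
    print(f"workers={workers} chunk={chunk_size}: {throughput:,.0f} ops/s")
print("best combination:", best)
```

Running each combination several times and averaging would reduce noise; a single pass like this is only a starting point.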
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time, while some take X*3 time), then you can take a couple of approaches. Depending on the nature of your incoming work, you can try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can read the data in a task, and the data will give you an idea of whether or not that task will take X time, or X*3 (or something in between) you can use that information before processing the tasks themselves to dynamically adjust the worker/task size to achieve the best throughput depending on current workload. This approach is taken with Amazon EC2 where customers will spin-up extra VMs when needed to handle higher load, and spin them back down when load drops, for example.
Whatever you choose, any unknown speed issue should almost always involve some kind of demo benchmarking, if the speed at which it runs is critical to the success of your application (sometimes the time to process is so small, that it's negligible).
Good luck!
