Why isn't my Storm topology keeping up? - performance

My hardware configuration:
2 x Nimbus: 2 x 1 CPU # 10 Core/20 Thread (CentOS reports 40 cores)
19 x Supervisors: 2 x 1 CPU # 12 Core/24 Thread (CentOS reports 48 cores)
All disks are 10K spindles or faster
All chassis are at 128GB RAM
10gbit interconnects
Kafka has 40 partitions
My topology is doing simple, low-CPU work. It takes a JSON object of about 25 kilobytes, unpacks it, queries a few webservices (each in a different bolt), transforms it to a new object, and sends this new object to a final webservice.
I'm not timing out any tuples, and the CPU usage on all of my machines is very low.
I wrote a small application to ping ZooKeeper and get offsets, and compare these with Kafka queue depths. The timestamp is from this morning. The offset is summed across all 40 partitions of Kafka; the ZooKeeper offsets are summed across all 20 consumers. The "lag" column is the difference, and the values in parentheses are the differences between each heartbeat (every 15 seconds). The first line shows that the offsets grew by 7 messages, but the lag increased by 13. This means 20 messages came in, but only 7 were processed. When the topology first starts, it keeps up, and then it slowly falls behind.
[7/26/2017 10:34:50 AM] Offset 35535228, lag 53983 (7, 13)
[7/26/2017 10:35:05 AM] Offset 35535234, lag 53990 (6, 7)
[7/26/2017 10:35:21 AM] Offset 35535237, lag 53992 (3, 2)
[7/26/2017 10:35:36 AM] Offset 35535243, lag 53998 (6, 6)
[7/26/2017 10:35:54 AM] Offset 35535247, lag 54004 (4, 6)
[7/26/2017 10:36:10 AM] Offset 35535251, lag 54013 (4, 9)
[7/26/2017 10:36:27 AM] Offset 35535258, lag 54018 (7, 5)
[7/26/2017 10:36:43 AM] Offset 35535267, lag 54024 (9, 6)
[7/26/2017 10:36:59 AM] Offset 35535276, lag 54028 (9, 4)
[7/26/2017 10:37:15 AM] Offset 35535283, lag 54041 (7, 13)
[7/26/2017 10:37:31 AM] Offset 35535293, lag 54063 (10, 22)
[7/26/2017 10:37:46 AM] Offset 35535310, lag 54078 (17, 15)
[7/26/2017 10:38:02 AM] Offset 35535320, lag 54084 (10, 6)
[7/26/2017 10:38:17 AM] Offset 35535326, lag 54091 (6, 7)
[7/26/2017 10:38:33 AM] Offset 35535330, lag 54100 (4, 9)
[7/26/2017 10:38:48 AM] Offset 35535334, lag 54103 (4, 3)
[7/26/2017 10:39:04 AM] Offset 35535339, lag 54116 (5, 13)
[7/26/2017 10:39:21 AM] Offset 35535342, lag 54120 (3, 4)
[7/26/2017 10:39:36 AM] Offset 35535349, lag 54124 (7, 4)
[7/26/2017 10:39:52 AM] Offset 35535351, lag 54134 (2, 10)
[7/26/2017 10:40:08 AM] Offset 35535355, lag 54142 (4, 8)
[7/26/2017 10:40:23 AM] Offset 35535357, lag 54146 (2, 4)
[7/26/2017 10:40:39 AM] Offset 35535359, lag 54149 (2, 3)
[7/26/2017 10:40:54 AM] Offset 35535365, lag 54160 (6, 11)
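As a rough cross-check of the arithmetic above, the first and last log lines can be turned into average rates. This is only a sketch: it assumes, as described above, that the offset delta counts processed messages and that processed plus lag growth equals arrivals. The names `FIRST`, `LAST` and `rates` are made up for this illustration.

```python
from datetime import datetime

# First and last heartbeat lines copied from the log above:
# (timestamp, summed consumer offset, lag).
FIRST = ("7/26/2017 10:34:50 AM", 35535228, 53983)
LAST = ("7/26/2017 10:40:54 AM", 35535365, 54160)

def rates(first, last):
    """Return (elapsed seconds, processed per second, arrived per second)."""
    fmt = "%m/%d/%Y %I:%M:%S %p"
    elapsed = (datetime.strptime(last[0], fmt)
               - datetime.strptime(first[0], fmt)).total_seconds()
    processed = last[1] - first[1]      # growth of the committed offsets
    lag_growth = last[2] - first[2]     # messages that arrived but were not processed
    arrived = processed + lag_growth
    return elapsed, processed / elapsed, arrived / elapsed

elapsed, processed_rate, arrived_rate = rates(FIRST, LAST)
print(f"{elapsed:.0f} s: processed {processed_rate:.2f}/s, arrived {arrived_rate:.2f}/s")
```

For these two lines that works out to roughly 0.38 messages per second processed against roughly 0.86 messages per second arriving, over 364 seconds.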
Where do I look next?

Bryan, try setting the number of ackers for your topology as well. I can see that only 2 acker executors are running for it; you can set the number of ackers with conf.setNumAckers(10).
This should solve your problem.
If I'm not wrong, the topology starts failing tuples after some time, right?

Related

How to efficiently insert into 2-D sort matrix with O(n) time

I need to insert an element into a sorted 2-D n x n matrix, and the insertion should run in O(n). The catch is that the matrix as a whole is not sorted; only its rows and columns are sorted.
Example:
9 10 12 14
11 12 14 15
21 22 28 Null
23 24 Null Null
After inserting a new element into the above matrix, the matrix should still be sorted row-wise and column-wise.
My approach
I am thinking of inserting the new element at the end of the matrix, at (3, 3). From there I compare it with (3, 3-1) and (3-1, 3), and if (3, 3) < max[(3, 3-1), (3-1, 3)] I swap (3, 3) with that maximum. I keep following the element through the matrix like that until no more swapping is required.
Please let me know if there is a better way to do it.
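The approach described above can be sketched in Python as follows. This is an illustration rather than a definitive implementation: it assumes empty cells are stored as None and treated as +infinity, and the function name `insert_sorted` is made up here. Each swap moves the new element one step up or one step left, so there are at most 2(n - 1) swaps, which is O(n).

```python
import math

def insert_sorted(matrix, value):
    """Insert value into an n x n matrix whose rows and columns are sorted.

    Empty cells are None and are treated as +infinity (an assumption of
    this sketch). The matrix is modified in place and returned.
    """
    n = len(matrix)
    if matrix[n - 1][n - 1] is not None:
        raise ValueError("matrix is full")

    def key(i, j):
        # Neighbours outside the matrix never win the comparison.
        if i < 0 or j < 0:
            return -math.inf
        cell = matrix[i][j]
        return math.inf if cell is None else cell

    i, j = n - 1, n - 1
    matrix[i][j] = value
    while True:
        up, left = key(i - 1, j), key(i, j - 1)
        if max(up, left) <= value:
            break  # both neighbours are no larger, so the order is restored
        # Swap with the larger neighbour; ties go to the cell above.
        ni, nj = (i - 1, j) if up >= left else (i, j - 1)
        matrix[i][j], matrix[ni][nj] = matrix[ni][nj], matrix[i][j]
        i, j = ni, nj
    return matrix

m = [[9, 10, 12, 14],
     [11, 12, 14, 15],
     [21, 22, 28, None],
     [23, 24, None, None]]
for row in insert_sorted(m, 13):
    print(row)
```

Swapping with the larger of the two neighbours is what keeps both the row and the column of the vacated cell sorted.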

Maximum number of subsets of overlapping intervals

Given a set of intervals S, what is an efficient way to find, for each interval in S, the number of other intervals from S that it contains?
For example:
S = (11,75), (14,62), (17,32), (24,48), (31,71), (34,74), (40,97), (41,58)
The expected output is:
(11, 75) => 6 -> (14,62), (17,32), (24,48), (31,71), (34,74), (41,58)
(14, 62) => 3 -> (17,32), (24,48), (41,58)
(17, 32) => 0
(24, 48) => 0
(31, 71) => 1 -> (41,58)
(34, 74) => 1 -> (41,58)
(40, 97) => 1 -> (41,58)
(41, 58) => 0
Is it possible to get this mapping in O(n log n), or in substantially less than O(n^2)?
There seems to be an O(n*log(n)) way to do this. The intuition is that we need some sort of way to organize the intervals in a way where, at the current step, all intervals that the current one could contain have been considered.
The algorithm is as follows:
Sort the intervals by end times in ascending order, and sort tied end times by their start times in descending order.
Iterate over the intervals and maintain a sorted set of all start times seen. The idea is that, when looking at the current interval, all intervals that it could contain have already been examined, and the number of intervals that the current one does contain is simply the number of elements in our built set that have a start time later than the current one.
Walking through the example, we first find that
sortedIntervals = [(17,32), (24,48), (41,58), (14,62), (31,71), (34,74), (11,75),(40,97)]
And let our sorted set of intervals (sorting now by start time) be
examinedIntervals = []
Now let's step through sortedIntervals
Consider (17,32). examinedIntervals is empty, so it doesn't contain anything.
Consider (24, 48). examinedIntervals = [(17, 32)] . Because there are no intervals that start after 24, we have that (24, 48) contains 0 intervals.
Consider (41, 58). examinedIntervals = [(17, 32), (24, 48)]. No intervals have a start time after 41, so (41, 58) contains 0 intervals
Consider (14, 62). examinedIntervals = [(17, 32), (24, 48), (41, 58)]. Now all three intervals have a start time after 14, so (14, 62) contains 3 intervals
Consider (31, 71). examinedIntervals = [(14, 62), (17, 32), (24, 48), (41, 58)]. Only one interval comes after 31, so (31, 71) contains 1 interval
Consider (34, 74). examinedIntervals = [(14, 62), (17, 32), (24, 48), (31, 71), (41, 58)]. One interval comes after 34, so (34, 74) contains 1 interval
Consider (11, 75). examinedIntervals = [(14, 62), (17, 32), (24, 48), (31, 71), (34, 74), (41, 58)]. All 6 intervals have a start time after 11, so (11, 75) contains 6 intervals.
Consider (40, 97). examinedIntervals = [(11, 75), (14, 62), (17, 32), (24, 48), (31, 71), (34, 74), (41, 58)]. Only one interval comes after 40, so (40, 97) contains 1 interval.
Summarizing we do indeed get the correct results:
(40, 97) -> 1
(11, 75) -> 6
(34, 74) -> 1
(31, 71) -> 1
(14, 62) -> 3
(41, 58) -> 0
(24, 48) -> 0
(17, 32) -> 0
It can also be verified easily that the runtime is O(n*log(n)), assuming the use of an efficient sort and a balanced tree in the second part. The initial sort runs in the given amount of time. The second portion involves n insertions into a binary tree of height O(log(n)) and n counting queries; each count also takes O(log(n)) provided the tree stores subtree sizes (an order-statistic tree), giving a runtime of O(n*log(n)). Because we're summing two steps that run in O(n*log(n)), the overall runtime is O(n*log(n)).
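A minimal Python sketch of this algorithm follows. Instead of a balanced tree it uses a Fenwick (binary indexed) tree over the ranked start times, a swapped-in structure that supports the same insert and count operations in O(log(n)). The function name `count_contained` is made up here, and the sketch assumes the intervals are distinct.

```python
def count_contained(intervals):
    """Map each (start, end) interval to the number of other intervals it contains."""
    # Rank the start times so they can index a Fenwick tree (1-based).
    starts = sorted({s for s, _ in intervals})
    rank = {s: i + 1 for i, s in enumerate(starts)}
    tree = [0] * (len(starts) + 1)

    def add(pos):
        # Record one more seen start time at this rank.
        while pos < len(tree):
            tree[pos] += 1
            pos += pos & -pos

    def prefix(pos):
        # Number of seen start times with rank <= pos.
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    result = {}
    seen = 0
    # End ascending; tied ends are ordered by start descending.
    for s, e in sorted(intervals, key=lambda iv: (iv[1], -iv[0])):
        # Seen intervals whose start is strictly later than s.
        result[(s, e)] = seen - prefix(rank[s])
        add(rank[s])
        seen += 1
    return result

S = [(11, 75), (14, 62), (17, 32), (24, 48), (31, 71), (34, 74), (40, 97), (41, 58)]
for interval, count in sorted(count_contained(S).items()):
    print(interval, "->", count)
```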

Moving maximum variant

Yesterday, I got asked the following question during a technical interview.
Imagine that you are working for a news agency. At every discrete point of time t, a story breaks. Some stories are more interesting than others. This "hotness" is expressed as a natural number h, with greater numbers representing hotter news stories.
Given a stream S of n stories, your job is to find the hottest story out of the most recent k stories for every t >= k.
So far, so good: this is the moving maximum problem (also known as the sliding window maximum problem), and there is a linear-time algorithm that solves it.
Now the question gets harder. Of course, older stories are usually less hot compared to newer stories. Let the age a of the most recent story be zero, and let the age of any other story be one greater than the age of its succeeding story. The "improved hotness" of a story is then defined as max(0, min(h, k - a)).
Here's an example:
n = 11, k = 4
S indices:                 0  1  2  3  4  5  6  7  8  9 10
S values:                  1  3  1  7  1  3  9  3  1  3  1
mov max hot indices:                3  3  3  6  6  6  6  9
mov max hot values:                 7  7  7  9  9  9  9  3
mov max imp-hot indices:            3  3  5  6  7  7  9  9
mov max imp-hot values:             4  3  3  4  3  3  3  3
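To make the definition concrete, here is a brute-force sketch that applies max(0, min(h, k - a)) directly to every story in the window. It runs in O(n*k) time, so it is only a reference to check faster solutions against; the function name `brute_force_improved` is made up for this illustration.

```python
def brute_force_improved(hots, k):
    """Maximum improved hotness for every t >= k - 1, straight from the definition."""
    results = []
    for t in range(k - 1, len(hots)):
        best = 0
        # Stories of age k or more have k - a <= 0, so only the window matters.
        for age in range(k):
            h = hots[t - age]
            best = max(best, max(0, min(h, k - age)))
        results.append(best)
    return results

print(brute_force_improved([1, 3, 1, 7, 1, 3, 9, 3, 1, 3, 1], 4))
# prints [4, 3, 3, 4, 3, 3, 3, 3]
```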
I was at a complete loss with this question. I thought about adding the index to every element before computing the maximum, but that gives you the answer for when the hotness of a story decreases by one at every step, regardless of whether it reached the hotness bound or not.
Can you find an algorithm for this problem with sub-quadratic (ideally: linear) running time?
I'll sketch a linear-time solution to the original problem involving a double-ended queue (deque) and then extend it to improved hotness with no loss of asymptotic efficiency.
Original problem: keep a deque that contains the stories that are (1) newer or hotter than every other story so far (2) in the window. At any given time, the hottest story in the queue is at the front. New stories are pushed onto the back of the deque, after popping every story from the back until a hotter story is found. Stories are popped from the front as they age out of the window.
For example:
S indices: 0 1 2 3 4 5 6 7 8 9 10
S values: 1 3 1 7 1 3 9 3 1 3 1
deque: (front) [] (back)
push (0, 1)
deque: [(0, 1)]
pop (0, 1) because it's not hotter than (1, 3)
push (1, 3)
deque: [(1, 3)]
push (2, 1)
deque: [(1, 3), (2, 1)]
pop (2, 1) and then (1, 3) because they're not hotter than (3, 7)
push (3, 7)
deque: [(3, 7)]
push (4, 1)
deque: [(3, 7), (4, 1)]
pop (4, 1) because it's not hotter than (5, 3)
push (5, 3)
deque: [(3, 7), (5, 3)]
pop (5, 3) and then (3, 7) because they're not hotter than (6, 9)
push (6, 9)
deque: [(6, 9)]
push (7, 3)
deque: [(6, 9), (7, 3)]
push (8, 1)
deque: [(6, 9), (7, 3), (8, 1)]
pop (8, 1) and (7, 3) because they're not hotter than (9, 3)
push (9, 3)
deque: [(6, 9), (9, 3)]
push (10, 1)
pop (6, 9) because it exited the window
deque: [(9, 3), (10, 1)]
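The trace above corresponds to the following sketch of the plain sliding window maximum. It stores indices in the deque rather than (index, hotness) pairs, which is a small simplification; the function name `window_max` is made up here.

```python
from collections import deque

def window_max(hots, k):
    """Hottest story among the most recent k stories, for every t >= k - 1."""
    q = deque()  # indices; their hotness strictly decreases from front to back
    out = []
    for t, hot in enumerate(hots):
        # Pop from the back every story that is not hotter than the new one.
        while q and hots[q[-1]] <= hot:
            q.pop()
        q.append(t)
        # Pop the front story once it has aged out of the window.
        if q[0] <= t - k:
            q.popleft()
        if t + 1 >= k:
            out.append(hots[q[0]])
    return out

print(window_max([1, 3, 1, 7, 1, 3, 9, 3, 1, 3, 1], 4))
# prints [7, 7, 7, 9, 9, 9, 9, 3]
```

Every index is pushed once and popped at most once, which is where the linear running time comes from.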
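To handle the new problem, we modify how we handle aging stories. Instead of popping stories as they slide out of the window, we pop the front story as soon as its age bound k - a drops to its hotness h or below, that is, once its improved hotness is determined by its age rather than by h. From then on its improved hotness simply decreases by one per step, so when determining the top story, only the most recently popped story needs to be considered alongside the front of the deque.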
In Python:
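import collections

Elem = collections.namedtuple('Elem', ('hot', 't'))

def winmaximphot(hots, k):
    q = collections.deque()
    oldtop = 0  # improved hotness of the most recently popped front story
    for t, hot in enumerate(hots):
        # Drop stories from the back that are not hotter than the new one.
        while q and q[-1].hot <= hot:
            del q[-1]
        q.append(Elem(hot, t))
        # Pop the front while its age bound k - a has reached its hotness.
        while q and q[0].hot >= k - (t - q[0].t) > 0:
            oldtop = k - (t - q[0].t)
            del q[0]
        if t + 1 >= k:
            yield max(oldtop, q[0].hot) if q else oldtop
        # The popped story loses one point of improved hotness per step.
        oldtop = max(0, oldtop - 1)

print(list(winmaximphot([1, 3, 1, 7, 1, 3, 9, 3, 1, 3, 1], 4)))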
The idea is the following: each breaking story will beat all previous stories after k-h steps. For example, with k == 30 and hotness h == 28, the story will be hotter than all previous stories after 2 steps.
Let's keep track of the moments in time when the next story becomes the hottest. At step i, the moment when the current story will beat all previous ones is i+k-h.
So we will have a sequence of objects {news_date | news_beats_all_previous_ones_date}, kept in increasing order of news_beats_all_previous_ones_date:
{i1 | i1+k-h} {i3 | i3+k-h} {i4 | i4+k-h} {i7 | i7+k-h} {i8 | i8+k-h}
At the current step we compute i9+k-h and add it to the end of this list, first removing all values that are bigger (since the sequence is increasing, this is easy).
Once the first element's news_beats_all_previous_ones_date becomes equal to the current date (i), that story becomes the answer to the sliding window query and we remove it from the sequence.
So you need a data structure that supports adding to the end and removing from both the beginning and the end. This is a deque. The time complexity of the solution is O(n).

How to implement a data structure which takes an integer and its weight such that a query to it returns that integer its weight% times?

Data is entered as the number followed by the weight. For example, if the data structure has data entered (1, 9) and (2, 1) then it should return the number 1 90% of the time and the number 2 10% of the time.
Which data structure should be used for this implementation? Also, what would the basic code for the query function look like?
Edit: I was considering a tree which stores the cumulative sum for every subtree. Say I have (1, 4), (2, 7), (3, 1), and (4, 11).
The tree would look like:
23
/ \
11 12
/ \ / \
4 7 1 11
I do not know if this tree should be binary. Also, does it make sense to store the weights in the tree and map them to the number or somehow use the numbers given as data input?
From the value/weight tuples (1, 4), (2, 7), (3, 1), (4, 11), build an array with cumulative weight sums
[(1, 4), (2, 11), (3, 12), (4, 23)]
and look up the value with a binary search on the cumulative weight field.
It is not clear from the question how the query should work. Randomly?
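If the query is meant to be random, the cumulative-sum array could be used roughly as in the sketch below. It assumes positive integer weights, and the class name `WeightedPicker` and its methods `lookup` and `pick` are made up for this illustration. Building the array is O(n) and each query is O(log(n)).

```python
import bisect
import itertools
import random

class WeightedPicker:
    def __init__(self, pairs):
        # pairs: iterable of (value, weight) with positive integer weights.
        pairs = list(pairs)
        self.values = [v for v, _ in pairs]
        self.cumulative = list(itertools.accumulate(w for _, w in pairs))
        self.total = self.cumulative[-1]

    def lookup(self, r):
        """Value whose cumulative range contains r, for 0 <= r < total."""
        # First position whose cumulative sum is strictly greater than r.
        return self.values[bisect.bisect_right(self.cumulative, r)]

    def pick(self):
        """Return a value with probability weight / total."""
        return self.lookup(random.randrange(self.total))

picker = WeightedPicker([(1, 4), (2, 7), (3, 1), (4, 11)])
print(picker.cumulative)   # [4, 11, 12, 23]
print(picker.pick())
```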

Suggestions for optimizing length of bins for a time period

I have an optimisation problem where I need to optimize the lengths of a fixed number of bins over a known period of time. The bins should contain minimal overlapping items with the same tag (see definition of items and tags later).
If the problem can be solved heuristically that is fine, the exact optimum is not important.
I was wondering if anybody had any suggestions for approaches to try, or any idea what this problem is called.
The problem
Let's say we have n items, each with two attributes: a tag and a time range.
For example, we have the following items:
(tag: time range (s))
(1: 0, 2)
(1: 4, 5)
(1: 7, 8)
(1: 9, 15)
(2: 0, 5)
(2: 7, 11)
(2: 14, 20)
(3: 4, 6)
(3: 7, 11)
(4: 5, 15)
When plotted this would look as follows:
Let's say we have to bin this 20-second period of time into 4 groups. We could do this by having 4 groups of length 5.
And would look something like this:
The number of overlapping items with the same tag would be
Group 1: 1 (tag 1)
Group 2: 2 (tag 1 and tag 3)
Group 3: 2 (tag 2)
Group 4: 0
Total overlapping items: 5
Another grouping selection for 4 groups would then be of lengths 4, 3, 2 and 11 seconds.
The number of overlapping items with the same tag would be:
Group 1: 0
Group 2: 0
Group 3: 0
Group 4: 1 (tag 2)
Attempts to solve (brute force)
I can find the optimum solution by binning the whole period of time into small segments (say 1 seconds, for the above example there would be 20 bins).
I can then find all the integer compositions of the integer 20 that use 4 components.
This would provide 127 different compositions, e.g.
(1, 1, 4, 14), (9, 5, 5, 1), (1, 4, 4, 11), (13, 3, 3, 1), (3, 4, 4, 9), (10, 5, 4, 1), (7, 6, 6, 1), (1, 3, 5, 11), (2, 4, 4, 10) ......
For (1, 1, 4, 14) the grouping would be 4 groups of 1, 1, 4 and 14 seconds.
I then find the composition with the best score (smallest number of overlapping tags).
The problem with this approach is that it can only be done on relatively small numbers as the number of compositions of an integer gets incredibly large when the size of the integer increases.
Therefore, if my data is 1000 seconds and I have to put bins of size 1 second the run time would be too long.
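The enumeration step of the brute-force approach can be sketched as below. Choosing parts - 1 cut points among the segment boundaries yields every ordered composition whose parts are each at least one segment long. The function name `compositions` is made up here and the scoring step is left out. Under exactly these assumptions the number of compositions of 20 into 4 parts is C(19, 3) = 969, which is larger than the figure quoted above, so that figure may reflect additional constraints not stated here.

```python
from itertools import combinations

def compositions(total, parts):
    """Yield every ordered tuple of `parts` positive integers summing to `total`."""
    # Each choice of parts - 1 interior cut points defines one composition.
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))

all_four = list(compositions(20, 4))
print(len(all_four))   # 969
print(all_four[:3])
```

The count grows as C(total - 1, parts - 1), which is why this stops being practical for 1000 one-second segments.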
Attempts to solve (heuristically)
I have tried a genetic-algorithm-style approach, where chromosomes are randomly created compositions of lengths and genes are the individual lengths of each group. Due to the nature of the data, I am struggling to do any meaningful crossover or mutation, though.
Does anyone have any suggestions?
