Storm bolt connecting to a database - apache-storm

I have a spout which reads from a source at 40K qps.
I have two bolts. The first one reads from the source and opens a database connection to build a cache, which is refreshed every hour. The database only allows 2 open connections for the user, so the executor count I have for this bolt is 2.
The other bolt is assigned 200 executors and 200 tasks to process the requests.
I can't increase the number of connections to the db, and I see that all the requests are going to a single worker. The other workers keep waiting and print "0 send message".
kafkaSpoutConfigList:
  - executorsCount: 30
    taskCount: 30
    spoutName: 'kafka_consumer_spout'
    topicName: 'request'
processingBoltConfigList:
  - executorsCount: 2
    taskCount: 2
    boltName: 'db_bolt'
    boltClassName: 'com.Bolt1Class'
    boltSourceList:
      - 'kafka_consumer_spout'
  - executorsCount: 200
    taskCount: 200
    boltName: 'bolt2'
    boltClassName: 'com.Bolt2Class'
    boltSourceList:
      - 'db_bolt::streamx'
kafkaBoltConfigList:
  - executorsCount: 15
    taskCount: 15
    boltName: 'kafka_producer_bolt'
    topicName: 'consumer_topic'
    boltSourceList:
      - 'bolt2::Stream1'
  - executorsCount: 15
    taskCount: 15
    boltName: 'kafka_producer_bolt'
    topicName: 'data_test'
    boltSourceList:
      - 'bolt2::Stream2'
I am using localOrShuffleGrouping.

When you use LocalOrShuffleGrouping, the following happens:
If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
So let's say your workers look like this:
worker1: {"bolt1 task 1", "bolt2 task 0-50"}
worker2: { "bolt1 task 2", "bolt2 task 50-100"}
worker3: { "bolt2 task 100-150"}
worker4: { "bolt2 task 150-200"}
In this case, because you're telling Storm to use a local grouping when sending from bolt1 to bolt2, all the tuples will go to workers 1 and 2. Workers 3 and 4 will sit idle.
If you want tuples to also reach workers 3 and 4, you need to switch to shuffle grouping.
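To make the routing rule concrete, here is a small Python sketch of the decision described above. It is a toy model, not Storm's actual code, and the function name and task layout are purely illustrative:

import random

def choose_target_task(local_tasks, all_tasks):
    # Toy model of localOrShuffle routing: prefer a task in the same worker
    # process when one exists, otherwise shuffle across all tasks.
    return random.choice(local_tasks if local_tasks else all_tasks)

# bolt2 has 200 tasks spread over 4 workers, 50 per worker (as in the layout above).
all_bolt2_tasks = list(range(200))
worker1_local_tasks = list(range(0, 50))  # bolt2 tasks co-located with bolt1 task 1

# With localOrShuffle, every tuple from bolt1 task 1 lands on tasks 0-49 (worker 1 only).
hits = {choose_target_task(worker1_local_tasks, all_bolt2_tasks) for _ in range(10000)}
print(max(hits) < 50)   # True: workers 3 and 4 never see a tuple

# With plain shuffle grouping there are no preferred local tasks, so all 200 tasks get traffic.
hits = {choose_target_task([], all_bolt2_tasks) for _ in range(10000)}
print(len(hits))        # ~200: tuples reach every worker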

Related

Nomad Job - Failed to place all allocations

I'm trying to deploy an AWS EBS volume via Nomad but I'm getting the error below. How do I resolve it?
$ nomad job plan -var-file bambootest.vars bamboo2.nomad
+/- Job: "bamboo2"
+/- Stop: "true" => "false"
+/- Task Group: "main" (1 create)
    Volume {
      AccessMode: "single-node-writer"
      AttachmentMode: "file-system"
      Name: "bambootest"
      PerAlloc: "false"
      ReadOnly: "false"
      Source: "bambootest"
      Type: "csi"
    }
    Task: "web"

Scheduler dry-run:
WARNING: Failed to place all allocations.
  Task Group "main" (failed to place 1 allocation):
    Class "system": 3 nodes excluded by filter
    Class "svt": 2 nodes excluded by filter
    Class "devtools": 2 nodes excluded by filter
    Class "bambootest": 2 nodes excluded by filter
    Class "ambt": 2 nodes excluded by filter
    Constraint "${meta.namespace} = bambootest": 9 nodes excluded by filter
    Constraint "missing CSI Volume bambootest": 2 nodes excluded by filter
Below is an excerpt of the volume block that seems to be the problem.
group main {
  count = 1
  volume "bambootest" {
    type            = "csi"
    source          = "bambootest"
    read_only       = false
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }
  task web {
    driver = "docker"

Explain output of tail [tube] from beanstool (beanstalkd)

I'm struggling to unpick what the output from beanstool's tail on a beanstalkd tube means exactly, specifically age, reserves & releases.
stat shows one job in this tube, but tail spits out thousands of these with the same job id:
id: 1, length: 184, priority: 1024, delay: 0, age: 45, ttr: 60
reserves: 101414, releases: 101413, buries: 0, kicks: 0, timeouts: 0
body:{snip}
age - how long ago the job was put into the tube, in seconds
reserves - the number of times this job has been reserved by a worker
releases - the number of times this job has been released from a reservation back into the ready queue
The huge number of reserves on the same job ID was caused by the consuming process breaking on a timeout that was never caught: beanstalkd treated the job as failed, so it kept being put back and reserved again in a loop.
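If you want to watch those counters while the loop is happening, the raw beanstalkd protocol exposes them via stats-job. A minimal Python sketch, assuming beanstalkd on localhost:11300 and job id 1 as in the output above:

import socket

def stats_job(job_id, host="localhost", port=11300):
    # Send a raw stats-job command; the reply is "OK <bytes>\r\n" followed by a
    # YAML dict with the same reserves/releases/timeouts counters beanstool prints.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(f"stats-job {job_id}\r\n".encode())
        return sock.recv(4096).decode()

print(stats_job(1))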

Python Futures error after process pool finished - atexit._run_exitfuncs

I'm using a minimal example to test how futures work when using ProcessPoolExecutor.
First, I want to know the result of my processing functions, then I would like to add complexity to the scenario.
import time
import string
import random
import traceback
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor


def process(*args):
    was_ok = True
    try:
        name = args[0][0]
        tiempo = args[0][1]
        print(f"Task {name} - Sleeping {tiempo}s")
        time.sleep(tiempo)
    except:
        was_ok = False
    return name, was_ok


def program():
    amount = 10
    workers = 2
    data = [''.join(random.choices(string.ascii_letters, k=5)) for _ in range(amount)]
    print(f"Data: {len(data)}")
    tiempo = [random.randint(5, 15) for _ in range(amount)]
    print(f"Times: {len(tiempo)}")

    with ProcessPoolExecutor(max_workers=workers) as pool:
        try:
            index = 0
            futures = [pool.submit(process, zipped) for zipped in zip(data, tiempo)]
            for future in concurrent.futures.as_completed(futures):
                name, ok = future.result()
                print(f"Task {index} with code {name} finished: {ok}")
                index += 1
        except Exception as e:
            print(f'Future failed: {e}')


if __name__ == "__main__":
    program()
If I run this program, the output is as expected and I obtain all the future results. However, right at the end I also get a failure:
Data: 10
Times: 10
Task utebu - Sleeping 14s
Task klEVG - Sleeping 10s
Task ZAHIC - Sleeping 8s
Task 0 with code klEVG finished: True
Task RBEgG - Sleeping 9s
Task 1 with code utebu finished: True
Task VYCjw - Sleeping 14s
Task 2 with code ZAHIC finished: True
Task GDZmI - Sleeping 9s
Task 3 with code RBEgG finished: True
Task TPJKM - Sleeping 10s
Task 4 with code GDZmI finished: True
Task CggXZ - Sleeping 7s
Task 5 with code VYCjw finished: True
Task TUGJm - Sleeping 12s
Task 6 with code CggXZ finished: True
Task THlhj - Sleeping 11s
Task 7 with code TPJKM finished: True
Task 8 with code TUGJm finished: True
Task 9 with code THlhj finished: True
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/concurrent/futures/process.py", line 101, in _python_exit
thread_wakeup.wakeup()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/concurrent/futures/process.py", line 89, in wakeup
self._writer.send_bytes(b"")
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 183, in send_bytes
self._check_closed()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 136, in _check_closed
raise OSError("handle is closed")
OSError: handle is closed
AFAIK the code doesn't have an error in itself. I've been researching really old questions like the following, without any luck finding a fix:
this one related to the same error msg but for Python 2,
this issue in GitHub which seems to be exactly the same I'm having (and seems not fixed yet...?),
this other issue where the comment I linked to seems to point to the actual problem, but doesn't find a solution to it,
of course the official docs for Python 3.7,
...
And so forth. However, I still haven't found how to solve this behaviour. I even found an old question here on SO which suggested avoiding the as_completed function and using submit instead (and from there came this actual test; before, I just had a map to my process function).
Any idea, fix, explanation or workaround is welcome. Thanks!
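For reference, a sketch of the map-based variant mentioned above (reusing process, data, tiempo and workers from the program; this is an assumed shape, not the exact original code):

# executor.map yields results in submission order rather than completion order,
# and each zipped (name, time) pair is passed to process as a single argument.
with ProcessPoolExecutor(max_workers=workers) as pool:
    for index, (name, ok) in enumerate(pool.map(process, zip(data, tiempo))):
        print(f"Task {index} with code {name} finished: {ok}")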

Web or console UI for filtering loglines on multiple dimensions?

I'm writing a tool to help with analysis of small-ish logfiles (e.g. 1-2 MB, in rare cases up to 50 MB).
The logfiles come from a file-syncing application, and contain a variety of different loglines:
2016-02-22 21:18:03,872 +0200 INFO pid=670 4664029184:PerfReporter perf_reporter.pyo:71 Current Stats: sync_bo=0, dio=266945, blacklist_len=0, uptime=1601770, pc=60, sync_x=0, sync_y=0, prs=78368, sync_percent=0, corpus=8819, c0=1510, pvm=3095812
2016-02-22 21:18:03,874 +0200 INFO pid=670 4664029184:PerfReporter sync_http_client.pyo:237 Opening direct connection to csi.gstatic.com:443.
2016-02-22 21:19:13,185 +0200 INFO pid=670 4650881024:SyncClientImpressionsThread impression_logger.pyo:278 Heartbeat was added.
2015-06-23 12:15:29,860 +0300 INFO pid=33914 4634906624:Worker-2 snapshot_sqlite.pyo:143 Adding local entry inode=57033344, filename=None
2015-06-23 12:15:29,861 +0300 INFO pid=33914 4634906624:Worker-2 snapshot_sqlite.pyo:171 Adding cloud entry resource_id=file:0B_JGPr4BzMr4dmdCbFBibms5WFk, filename=None
2015-06-23 12:15:29,862 +0300 INFO pid=33914 4634906624:Worker-2 snapshot_sqlite.pyo:253 Updating cloud entry doc_id=0B_JGPr4BzMr4dmdCbFBibms5WFk, filename=~$Foo Bar.xlsx
2015-06-23 12:15:30,247 +0300 INFO pid=33914 4651732992:Batcher batcher.pyo:849 Batcher Stats = file_count = Counter({_COUNT_KEY(direction=_DownloadDirectionType(Direction.DOWNLOAD), action=_FSChangeActionType(Action.CREATE), batch=False, successful=True): 1}), byte_count = Counter({_COUNT_KEY(direction=_DownloadDirectionType(Direction.DOWNLOAD), action=_FSChangeActionType(Action.CREATE), batch=False, successful=True): 165}), batch_operation_count = Counter(), process_seconds = Counter({_COUNT_KEY(direction=_DownloadDirectionType(Direction.DOWNLOAD), action=_FSChangeActionType(Action.CREATE), batch=False, successful=True): 0.6173379421234131}), duration seconds = 1 (start_time = 1435050929, end_time = 1435050930)
I'll be parsing out any key-value pairs, as well as several key attributes (e.g. inode number, filename, doc_id).
I would then like a UI (either console or Web UI) that lets you filter by various things, and display the full loglines:
Filtering by time ranges
Filtering by inode number, filename, event-type etc.
Are there any existing UI elements/toolkits/frameworks that allow easy filtering along multiple dimensions?
So for example, you could select an inode number and an event type, and see the full history over time for that combination?
Probably similar to what Splunk/Kibana and their ilk let you do, but available as a stand-alone component (console or web)?
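For the parsing side (separate from the UI question), here is a minimal Python sketch that pulls out the header fields and any key=value pairs from lines like the samples above. Both regular expressions are assumptions based only on those samples:

import re

# Header: timestamp, UTC offset, level, pid, thread, source file:line, free-form message.
HEADER = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?P<tz>[+-]\d{4}) "
    r"(?P<level>\w+) pid=(?P<pid>\d+) (?P<thread>\S+) (?P<src>\S+) (?P<msg>.*)$"
)
KV = re.compile(r"(\w+)=([^,\s]+)")  # naive key=value extraction from the message

def parse_line(line):
    m = HEADER.match(line.rstrip("\n"))
    if not m:
        return None
    record = m.groupdict()
    record["kv"] = dict(KV.findall(record["msg"]))
    return record

sample = ("2015-06-23 12:15:29,860 +0300 INFO pid=33914 4634906624:Worker-2 "
          "snapshot_sqlite.pyo:143 Adding local entry inode=57033344, filename=None")
print(parse_line(sample)["kv"])   # {'inode': '57033344', 'filename': 'None'}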
With MASSALYZER you can do this in the console. If you need help, ask me!

UnavailableShardsException when running tests with 1 shard and 1 node

We are running our tests (PHP application) in Docker. Some tests use Elasticsearch.
We have configured Elasticsearch to have only 1 node and 1 shard (for simplicity). Here is the config we added to the default:
index.number_of_shards: 1
index.number_of_replicas: 0
Sometimes when the tests run, they fail because of the following Elasticsearch response:
{
   "_indices":{
      "acme":{
         "_shards":{
            "total":1,
            "successful":0,
            "failed":1,
            "failures":[
               {
                  "index":"acme",
                  "shard":0,
                  "reason":"UnavailableShardsException[[acme][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: delete_by_query {[acme][product], query [{\"query\":{\"term\":{\"product_id\":\"3\"}}}]}]"
               }
            ]
         }
      }
   }
}
The error message extracted from the response:
UnavailableShardsException[[acme][0] Primary shard is not active or isn't assigned to a known node. Timeout: [1m], request: delete_by_query {[acme][product], query [{\"query\":{\"term\":{\"product_id\":\"3\"}}}]}]
Why would our client randomly fail to reach Elasticsearch's node or shard? Does this have something to do with the fact that we have only 1 shard? Is that a bad thing?
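One common mitigation, assuming the random failures are a race where the tests hit the index before its primary shard has been allocated (e.g. right after the container or index is created), is to wait for the cluster to reach at least yellow health before the test suite starts. A minimal Python sketch; the URL is an assumption about your Docker setup:

import requests

ES_URL = "http://elasticsearch:9200"  # adjust to your Docker service name/port

# Block until all primary shards are allocated ("yellow" is enough with 0 replicas).
resp = requests.get(
    f"{ES_URL}/_cluster/health",
    params={"wait_for_status": "yellow", "timeout": "60s"},
)
resp.raise_for_status()
if resp.json().get("timed_out"):
    raise RuntimeError("Elasticsearch shards were not allocated in time")
# Safe to run the tests that issue delete_by_query from here on.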