Tornado app halting regularly for few seconds with 100% CPU

Tornado app halting regularly for few seconds with 100% CPU - amazon-ec2

I am trying to troubleshoot an app running on tornado 2.4 on Ubuntu 11.04 on EC2. It appears to be hitting 100% CPU regularly and halts at that request for few seconds.
Any help on this is greatly appreciated.
Symptoms:
top shows 100% cpu just at the time it halts. Normally server is about 30-60% cpu utilization.
It halts every 2-5 minutes just for one request. I have checked that there are no cronjobs affecting this.
It halts for about 2 to 9 seconds. Problem goes away on restarting tornado and worsens with tornado uptime. Longer the server is up, for longer duration it halts.
Http requests for which the problem appears do not seem to have any pattern.
Interestingly, next request in log sometimes sometimes matches the halting duration and some times does not. Example:
00:00:00 GET /some/request ()
00:00:09 GET /next/request (9000ms)
00:00:00 GET /some/request ()
00:00:09 GET /next/request (1ms)
# 9 seconds gap in requests is certainly not possible as clients are constantly polling.
Database (mongodb) shows no expensive or large number of queries. No page faults. Database is on the same machine - local disk.
vmstat shows no change in read/write sizes compared to last few minutes.
tornado is running behind nginx.
sending SIGINT when it was most likely halting, gives different stacktrace everytime. Some of them are below:
Traceback (most recent call last):
File "chat/main.py", line 3396, in <module>
main()
File "chat/main.py", line 3392, in main
tornado.ioloop.IOLoop.instance().start()
File "/home/ubuntu/tornado/tornado/ioloop.py", line 515, in start
self._run_callback(callback)
File "/home/ubuntu/tornado/tornado/ioloop.py", line 370, in _run_callback
callback()
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/iostream.py", line 303, in wrapper
callback(*args)
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 298, in _on_request_body
self.request_callback(self._request)
File "/home/ubuntu/tornado/tornado/web.py", line 1421, in __call__
handler = spec.handler_class(self, request, **spec.kwargs)
File "/home/ubuntu/tornado/tornado/web.py", line 126, in __init__
application.ui_modules.iteritems())
File "/home/ubuntu/tornado/tornado/web.py", line 125, in <genexpr>
self.ui["_modules"] = ObjectDict((n, self._ui_module(n, m)) for n, m in
File "/home/ubuntu/tornado/tornado/web.py", line 1114, in _ui_module
def _ui_module(self, name, module):
KeyboardInterrupt
Traceback (most recent call last):
File "chat/main.py", line 3398, in <module>
main()
File "chat/main.py", line 3394, in main
tornado.ioloop.IOLoop.instance().start()
File "/home/ubuntu/tornado/tornado/ioloop.py", line 515, in start
self._run_callback(callback)
File "/home/ubuntu/tornado/tornado/ioloop.py", line 370, in _run_callback
callback()
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/iostream.py", line 303, in wrapper
callback(*args)
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 285, in _on_headers
self.request_callback(self._request)
File "/home/ubuntu/tornado/tornado/web.py", line 1408, in __call__
transforms = [t(request) for t in self.transforms]
File "/home/ubuntu/tornado/tornado/web.py", line 1811, in __init__
def __init__(self, request):
KeyboardInterrupt
Traceback (most recent call last):
File "chat/main.py", line 3351, in <module>
main()
File "chat/main.py", line 3347, in main
tornado.ioloop.IOLoop.instance().start()
File "/home/ubuntu/tornado/tornado/ioloop.py", line 571, in start
self._handlers[fd](fd, events)
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/netutil.py", line 342, in accept_handler
callback(connection, address)
File "/home/ubuntu/tornado/tornado/netutil.py", line 237, in _handle_connection
self.handle_stream(stream, address)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 156, in handle_stream
self.no_keep_alive, self.xheaders, self.protocol)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 183, in __init__
self.stream.read_until(b("\r\n\r\n"), self._header_callback)
File "/home/ubuntu/tornado/tornado/iostream.py", line 139, in read_until
self._try_inline_read()
File "/home/ubuntu/tornado/tornado/iostream.py", line 385, in _try_inline_read
if self._read_to_buffer() == 0:
File "/home/ubuntu/tornado/tornado/iostream.py", line 401, in _read_to_buffer
chunk = self.read_from_fd()
File "/home/ubuntu/tornado/tornado/iostream.py", line 632, in read_from_fd
chunk = self.socket.recv(self.read_chunk_size)
KeyboardInterrupt
Any tips on how to troubleshoot this is greatly appreciated.
Further observations:
strace -p, during the time it hangs, shows empty output.
ltrace -p during hang time shows only free() calls in large numbers:
free(0x6fa70080) =
free(0x1175f8060) =
free(0x117a5c370) =

It sounds like you're suffering from garbage collection (GC) storms. The behavior you've described is typical of that diagnosis, and the ltrace further supports the hypothesis.
Lots of objects are being allocated and disposed of in the main/event loops being exercised by your usage ... and the periodic flurries of calls to free() result from that.
One possible approach would be to profile your code (or libraries on which you are depending) and see if you can refactor it to use (and re-use) objects from pre-allocated pools.
Another possible mitigation would be to make your own, more frequent, calls to trigger the garbage collection --- more expensive in aggregate but possibly less costly at each call. (That would be a trade-off for more predictable throughput).
You might be able to use the Python: gc module both for investigating the issue more deeply (using gc.set_debug()) and for a simple attempted mitigation (calls to gc.collect() after each transaction for example). You might also try running your application with gc.disable() for a reasonable length of time to see if further implicates the Python garbage collector. Note that disabling the garbage collector for an extended period of time will almost certainly cause paging/swapping ... so use it only for validation of our hypothesis and don't expect that to solve the problem in any meaningful way. It may just defer the problem 'til the whole system is thrashing and needs to be rebooted.
Here's an example of using gc.collect() in another SO thread on Tornado: SO: Tornado memory leak on dropped connections

Related

Rasa Timeout Issue

When running Rasa (tried on versions 1.3.3, 1.3.7, 1.3.8) I encounter this timeout exception message almost every time I make a call. I am running a simple program that recognises when a user offers their age, and stores the age in a database through an action response:
Bot loaded. Type a message and press enter (use '/stop' to exit):
Your input -> I am 24 years old
2019-10-10 13:29:33 ERROR asyncio - Task exception was never retrieved
future: <Task finished coro=<configure_app.<locals>.run_cmdline_io() done, defined at /Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/rasa/core/run.py:123> exception=TimeoutError()>
Traceback (most recent call last):
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/rasa/core/run.py", line 127, in run_cmdline_io
server_url=constants.DEFAULT_SERVER_FORMAT.format("http", port)
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/rasa/core/channels/console.py", line 138, in record_messages
async for response in bot_responses:
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/async_generator/_impl.py", line 366, in step
return await ANextIter(self._it, start_fn, *args)
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/async_generator/_impl.py", line 205, in throw
return self._invoke(self._it.throw, type, value, traceback)
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/async_generator/_impl.py", line 209, in _invoke
result = fn(*args)
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/rasa/core/channels/console.py", line 103, in send_message_receive_stream
async for line in resp.content:
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 40, in __anext__
rv = await self.read_func()
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 329, in readline
await self._wait('readline')
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 297, in _wait
await waiter
File "/Users/Kami/Documents/rasa/venv/lib/python3.7/site-packages/aiohttp/helpers.py", line 585, in __exit__
raise asyncio.TimeoutError from None
concurrent.futures._base.TimeoutError
Transport closed # ('127.0.0.1', 63319) and exception experienced during error handling
Originally I thought this timeout was being caused by using large lookup tables for another part of my Rasa program, but for age recognition I am using a simple regex:
## regex:age
- ^(0?[1-9]|[1-9][0-9]|[1][1-9][1-9])$
And even this also causes the timeout.
Please help me solve this. I don't even need to avoid the timeout, I just want to know where I can catch/ignore this exception.
Thanks!

I was fetching data from an API wherein I was getting a Timeout error because it was not able to fetch the data in the default time limit :
Go to the directory: venv/Lib/site-packages/rasa/core/channels/console.py
Change the default value of DEFAULT_STREAM_READING_TIMEOUT_IN_SECONDS to more than 10, in my case I changed it to 30 it worked.
Another reason could be fetching of data again and again within a short period of time which could result in a timeout.
Observations :
When DEFAULT_STREAM_READING_TIMEOUT_IN_SECONDS is set to 10 i get timeout error
When DEFAULT_STREAM_READING_TIMEOUT_IN_SECONDS is set to 30 and keep on running rasa shell again and again I get a timeout error
When DEFAULT_STREAM_READING_TIMEOUT_IN_SECONDS is set to 30 and run rasa shell not frequently it functions perfectly.

Make sure that you uncomment the below code
action_endpoint:
url: "http://localhost:5055/webhook"
in the endpoints.yml. It is used when you are making custom actions to query database.

I had the same problem and it was not solved by increasing timeout.
Make sure you are sending back a 'string' to the rasa shell from rasa action sever. What I mean is, if you are using 'text = ' in your utter_message, make sure that the async result is also a string and not just an object or something else. Change the type if required.
dispatcher.utter_message(text='has to be a string')
Running 'rasa shell -vv' showed me that it is receiving an object and that is why it is not able to parse it, hence timeout.

I can't comment now, but add followup to Vishal response. To check that hooks are present and waiting for connection you can use -vv command line switch. This display all available hooks at startup. For example:
2020-04-21 14:05:56 DEBUG rasa.core.utils - Available web server routes:
/webhooks/rasa GET custom_webhook_RasaChatInput.health
/webhooks/rasa/webhook POST custom_webhook_RasaChatInput.receive
/webhooks/rest GET custom_webhook_RestInput.health
/webhooks/rest/webhook POST custom_webhook_RestInput.receive
/ GET hello

Golang: Preview of managed VM app returns error

I'm trying to preview a Go docker (App Engine ManagedVM) app using the gcloud preview app run command.
But I keep getting this error:
Traceback (most recent call last):
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 83, in <module>
_run_file(__file__, globals())
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/dev_appserver.py", line 79, in _run_file
execfile(_PATHS.script_file(script_name), globals_)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 985, in <module>
main()
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 978, in main
dev_server.start(options)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/devappserver2.py", line 774, in start
self._dispatcher.start(options.api_host, apis.port, request_data)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/dispatcher.py", line 182, in start
_module, port = self._create_module(module_configuration, port)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/dispatcher.py", line 262, in _create_module
threadsafe_override=threadsafe_override)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/module.py", line 1463, in __init__
super(ManualScalingModule, self).__init__(**kwargs)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/module.py", line 514, in __init__
self._module_configuration)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/module.py", line 237, in _create_instance_factory
module_configuration=module_configuration)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/devappserver2/vm_runtime_factory.py", line 78, in __init__
timeout=self.DOCKER_D_REQUEST_TIMEOUT_SECS)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/google/appengine/tools/docker/containers.py", line 740, in NewDockerClient
client.ping()
File "/Users/jwesonga/google-cloud-sdk/./lib/docker/docker/client.py", line 711, in ping
return self._result(self._get(self._url('/_ping')))
File "/Users/jwesonga/google-cloud-sdk/./lib/docker/docker/client.py", line 76, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/lib/requests/requests/sessions.py", line 468, in get
return self.request('GET', url, **kwargs)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/lib/requests/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/lib/requests/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine/lib/requests/requests/adapters.py", line 384, in send
raise Timeout(e, request=request)
requests.exceptions.Timeout: (<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x10631c7d0>, 'Connection to 192.168.59.104 timed out. (connect timeout=60)')
ERROR: (gcloud.preview.app.run) DevAppSever failed with error code [1]
I've confirmed that docker is up and running using boot2docker status which returns running This was working before but after a machine reboot, nothing seems to work. Any ideas?

The main issue is:
File "/Users/jwesonga/google-cloud-sdk/platform/google_appengine
/lib/requests/requests/adapters.py", line 384, in send
raise Timeout(e, request=request)
requests.exceptions.Timeout:
(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object
at 0x10631c7d0>, 'Connection to 192.168.59.104 timed out.
(connect timeout=60)')
ERROR: (gcloud.preview.app.run) DevAppSever failed with error code [1]
Which is often the case when you have a proxy, and is discussed in pip issue 1805
It is supposed to be fixed in pip1.6, but just in case, you can try the workaround of alexandrem
/opt/venvs/ironic/lib/python2.6/site-packages/pip/_vendor/requests
/adapters.patch.py /opt/venvs/ironic/lib/python2.6/site-packages
/pip/_vendor/requests/adapters.py
209c209
if True or not proxy in self.proxy_manager:
^^^^
basically I just add a True to the condition on line 209 of the adapter.py to always create a ProxyManager instance, thus skipping the pool manager logic.

The gcloud command enable the ah_host process and also created the docker image of your app and passes it to the Docker daemon, in your case it seems that your docker daemon is not responding to the request. So to make sure,perform "sudo docker -d" to check if the Docker daemon is running on your machine or not.
Also check that, the path of the certificate you set correctly and value of the TLS_VERIFY is TRUE.
Go through the documentation [1] for the installation of Docker on MacOS
[1] https://docs.docker.com/installation/mac/

Computing Sky View Factor in GrassGis

Hy community,
I´m currently working on my Master Thesis and I have to compute the "sky view factor". Since ESRI Arcmap is not a helpful choice to do that, I found that it is fairly easy to compute with GrassGIS (V.7) using the r.skyview command.
But I get an error message in the logfile i can´t really deal with. Hope that someone of you is experienced with that kind of problem and can help me out with this.
Here is what the GrassGIS output says:
*(Fri Jan 09 16:17:10 2015)
r.skyview input=Subset#PERMANENT output=Subset_SVF ndir=16 maxdistance=15.0
Unknown module parameter "keyword" at line 21
Unknown module parameter "keyword" at line 22
FEHLER: Value <rast> ambiguous for parameter <type>
Valid options: raster,raster_3d,vector,old_vector,ascii_vector,labels,region,group,all
Traceback (most recent call last):
File "C:\Users\Axel-HP\AppData\Roaming\GRASS7\addons/scripts/r.skyview.py", line 120, in <module>
sys.exit(main())
File "C:\Users\Axel-HP\AppData\Roaming\GRASS7\addons/scripts/r.skyview.py", line82, in main
old_maps = _get_horizon_maps()
File "C:\Users\Axel-HP\AppData\Roaming\GRASS7\addons/scripts/r.skyview.py", line 114, in_get_horizon_maps
pattern=TMP_NAME + "*")[gcore.gisenv()['MAPSET']]
File "C:\Temp\GRASSGIS7\etc\python\grass\script\core.py", line 1176, in list_grouped
type=types, pattern=pattern,
exclude=exclude).splitlines():
File "C:\Temp\GRASSGIS7\etc\python\grass\script\core.py", line 425, in read_command
return handle_errors(returncode, stdout, args, kwargs)
File "C:\Temp\GRASSGIS7\etc\python\grass\script\core.py", line 308, in handle_errors
returncode=returncode)
grass.exceptions.CalledModuleError: Module run None
['g.list', '--q', '-m', 'type=rast', 'pattern=tmp_horizon_2340*'] ended with error
Process ended with non-zero return code 1. See errors in the (error) output.
(Fri Jan 09 16:17:11 2015) Befehl ausgeführt (1 Sek)*

I just tested r.skyview and it is working. There were big changes in GRASS module parameter names recently which caused the trouble, but now it should work without problems.

How do I get useful diagnostics from boto?

How can I get useful diagnostics out of boto? All I ever seem to get is the infuriatingly useless "400 Bad Request". I recognize that boto is just passing along what the underlying API makes available, but surely there's some way to get something more useful than "Bad Request".
Traceback (most recent call last):
File "./mongo_pulldown.py", line 153, in <module>
main()
File "./mongo_pulldown.py", line 24, in main
print "snap = %r" % snap
File "./mongo_pulldown.py", line 149, in __exit__
self.connection.delete_volume(self.volume.id)
File "/home/roy/deploy/current/python/local/lib/python2.7/site-packages/boto/ec2/connection.py", line 1507, in delete_volume
return self.get_status('DeleteVolume', params, verb='POST')
File "/home/roy/deploy/current/python/local/lib/python2.7/site-packages/boto/connection.py", line 985, in get_status
raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request

I didn't have much luck with putting the debug setting in the config file, but the call to ec2.connect_to_region() takes a debug parameter, with the same values as in j0nes' answer.
ec2 = boto.ec2.connect_to_region("eu-west-1", debug=2)
Everything that connection object sends/receives will get dumped to stdout.

You can configure the boto.cfg file to be more verbose:
[Boto]
debug = 2
debug: Controls the level of debug messages that will be printed by
the boto library. The following values are defined:
0 - no debug messages are printed
1 - basic debug messages from boto are printed
2 - all boto debugging messages plus request/response messages from httplib

HUE Web UI will not login first time

I've installed CDH 4.2.1 and now I'm trying to access HUE Web UI for the first time. I enter a new user name and password, click Sign Up, and wait and wait and nothing happens 20 minutes. If I open another window and try to access the login page then I get a message that the database is locked.
I'm running on a single node. And following is the error message for the second window:
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/eventlet-0.9.14-py2.6.egg/eventlet/wsgi.py", line 336, in handle_one_response
result = self.application(self.environ, start_response)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/core/handlers/wsgi.py", line 245, in __call__
response = middleware_method(request, response)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/contrib/sessions/middleware.py", line 36, in process_response
request.session.save()
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/contrib/sessions/backends/db.py", line 63, in save
obj.save(force_insert=must_create, using=using)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/db/models/base.py", line 434, in save
self.save_base(using=using, force_insert=force_insert, force_update=force_update)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/db/models/base.py", line 500, in save_base
rows = manager.using(using).filter(pk=pk_val)._update(values)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/db/models/query.py", line 491, in _update
return query.get_compiler(self.db).execute_sql(None)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/db/models/sql/compiler.py", line 861, in execute_sql
cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/db/models/sql/compiler.py", line 727, in execute_sql
cursor.execute(sql, params)
File "/opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/db/backends/sqlite3/base.py", line 200, in execute
return Database.Cursor.execute(self, query, params)
DatabaseError: database is locked
Any idea?
Thank you,
Roberto.

The error here means it is trying to connect to Sqlite and fails as the first connection is still not finished (and Sqlite is not concurrent) so it is not very useful here.
If would look in the hue logs, especially 'runcpserver.log' if there is more information.
Adding 'export DESKTOP_DEBUG=1' in the environment and restarting Hue might give more details.
I would go on http://HUE_SERVER:HUE_PORT/dump_config, look at the 'database' value and delete the file and run a /opt/cloudera/parcels/CDH-4.2.1-1.cdh4.2.1.p0.5/share/hue syncb or sync database if in CM.
It will recreate the database and make sure no other process is using it.
If it still does not work I would give a try to MySQL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_15_8.html

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio