Dask tutorial failing with distributed.nanny - WARNING - Restarting worker - client

Interested in the possibilities offered by Dask, I started with the Dask tutorial and prepared my laptop by following its instructions: cloning the repo and creating a new conda env with:
conda env create -f binder/environment.yml
conda activate dask-tutorial
Everything goes fine and the packages are installed. Then I start JupyterLab and open the first notebook:
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
The output is a seemingly endless stream of "distributed.nanny - WARNING - Restarting worker", some followed by more errors (see below). I spent the last couple of hours trying to figure out why I'm having this problem, but I can't find the cause.
I tried creating a LocalCluster() explicitly, which didn't help. I tried limiting the memory to 1GB: same problem.
I tried updating the packages and rebooting the laptop: still nothing.
In case it is useful: I'm on Windows, I use conda, and this is a company laptop on which I don't have admin rights.
Would anyone know why I have this issue?
Thanks!
Emek
P.S: Amongst the plethora of "distributed.nanny - WARNING - Restarting worker", I also get a few:
2023-02-10 16:04:36,283 - distributed.nanny - WARNING - Restarting worker
2023-02-10 16:04:36,408 - distributed.nanny - WARNING - Restarting worker
2023-02-10 16:04:36,425 - distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\site-packages\distributed\nanny.py", line 853, in _wait_until_connected
msg = self.init_result_q.get_nowait()
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\multiprocessing\queues.py", line 135, in get_nowait
return self.get(False)
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\multiprocessing\queues.py", line 116, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\site-packages\distributed\utils.py", line 741, in wrapper
return await func(*args, **kwargs)
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\site-packages\distributed\nanny.py", line 545, in _on_worker_exit
await self.instantiate()
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\site-packages\distributed\nanny.py", line 442, in instantiate
result = await self.process.start()
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\site-packages\distributed\nanny.py", line 714, in start
msg = await self._wait_until_connected(uid)
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\site-packages\distributed\nanny.py", line 855, in _wait_until_connected
await asyncio.sleep(self._init_msg_interval)
File "C:\Users\xxx\Anaconda3\envs\dask-tutorial\lib\asyncio\tasks.py", line 605, in sleep
return await future
asyncio.exceptions.CancelledError
2023-02-10 16:04:37,021 - distributed.nanny - WARNING - Restarting worker
2023-02-10 16:04:37,024 - distributed.nanny - WARNING - Restarting worker
2023-02-10 16:04:37,027 - distributed.nanny - WARNING - Restarting worker

I followed this answer, which suggests downgrading ipykernel, and that solved the issue for now.
The following packages will be DOWNGRADED:
ipykernel 6.21.1-pyh025b116_0 --> 6.15.0-pyh025b116_0
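To check whether an environment is running an ipykernel newer than the 6.15.0 that worked here, a minimal sketch could look like the following. The `KNOWN_GOOD` threshold is taken only from the downgrade shown above and is an assumption, not a documented compatibility boundary; `parse_version`, `newer_than_known_good` and `installed_ipykernel_version` are illustrative helper names.

```python
import re
from importlib import metadata

# Version that worked after the downgrade described above (assumption:
# taken from this post only, not from any official compatibility note).
KNOWN_GOOD = (6, 15, 0)


def parse_version(text):
    """Turn '6.21.1' into (6, 21, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def newer_than_known_good(version_text):
    """True if the given version is strictly newer than KNOWN_GOOD."""
    return parse_version(version_text) > KNOWN_GOOD


def installed_ipykernel_version():
    """Return the installed ipykernel version string, or None if absent."""
    try:
        return metadata.version("ipykernel")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_ipykernel_version()
    if version is None:
        print("ipykernel is not installed in this environment")
    elif newer_than_known_good(version):
        print(f"ipykernel {version} is newer than the version that worked here")
    else:
        print(f"ipykernel {version} is at or below the version that worked here")
```

Run it inside the activated dask-tutorial environment so that it inspects the same packages the notebook kernel uses.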

dev_appserver.py BadArgumentError: app must not be empty

Hey all,
For context: I had this dev_appserver setup working late last year in 2021, and upon trying to set it up again, I'm getting odd errors.
BadArgumentError: app must not be empty.
I've solved quite a lot of errors up to this point, and this is where I'm at:
JDK 1.11+ installed (for Cloud Datastore Emulator)
Golang 1.15+ installed (for gops & dev_appserver.py - go build)
Gcloud Components:
I run my dev_appserver like this:
export DATASTORE_DATASET=dev8celbux
export DATASTORE_PROJECT_ID=dev8celbux
export DATASTORE_USE_PROJECT_ID_AS_APP_ID=true
dev_appserver.py --enable_console --admin_port=8001 --port=8081 --go_debugging=true --support_datastore_emulator=true --datastore_path=./datastore/local_db.bin setuptables-app.yaml
INFO 2022-09-09 13:26:30,233 devappserver2.py:317] Skipping SDK update check.
INFO 2022-09-09 13:26:30,250 datastore_emulator.py:156] Starting Cloud Datastore emulator at: http://localhost:58946
INFO 2022-09-09 13:26:32,381 datastore_emulator.py:162] Cloud Datastore emulator responded after 2.131000 seconds
INFO 2022-09-09 13:26:32,381 <string>:384] Starting API server at: http://localhost:59078
INFO 2022-09-09 13:26:32,384 <string>:374] Starting gRPC API server at: http://localhost:59079
INFO 2022-09-09 13:26:32,394 instance_factory.py:184] Building with dependencies from go.mod.
INFO 2022-09-09 13:26:32,397 dispatcher.py:280] Starting module "setuptables" running at: http://localhost:8081
INFO 2022-09-09 13:26:32,397 admin_server.py:70] Starting admin server at: http://localhost:8001
WARNING 2022-09-09 13:26:32,398 devappserver2.py:414] No default module found. Ignoring.
2022/09/09 13:26:35 STARTING
INFO 2022-09-09 13:26:37,220 instance.py:294] Instance PID: 9656
This error appears when I try to view the contents of the local datastore at localhost:8001/datastore.
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\lib\webapp2\webapp2\__init__.py", line 1526, in __call__
rv = self.handle_exception(request, response, e)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\lib\webapp2\webapp2\__init__.py", line 1520, in __call__
rv = self.router.dispatch(request, response)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\lib\webapp2\webapp2\__init__.py", line 1270, in default_dispatcher
return route.handler_adapter(request, response)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\lib\webapp2\webapp2\__init__.py", line 1094, in __call__
return handler.dispatch()
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\admin\admin_request_handler.py", line 88, in dispatch
super(AdminRequestHandler, self).dispatch()
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\lib\webapp2\webapp2\__init__.py", line 588, in dispatch
return self.handle_exception(e, self.app.debug)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\lib\webapp2\webapp2\__init__.py", line 586, in dispatch
return method(*args, **kwargs)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\admin\datastore_viewer.py", line 661, in get
kinds = self._get_kinds(namespace)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\tools\devappserver2\admin\datastore_viewer.py", line 597, in _get_kinds
return sorted([x.kind_name for x in q.run()])
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\ext\db\__init__.py", line 2077, in run
raw_query = self._get_query()
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\ext\db\__init__.py", line 2482, in _get_query
_app=self._app)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\api\datastore.py", line 1371, in __init__
self.__app = datastore_types.ResolveAppId(_app)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\api\datastore_types.py", line 238, in ResolveAppId
ValidateString(app, 'app', datastore_errors.BadArgumentError)
File "C:\Users\user\AppData\Local\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\api\datastore_types.py", line 186, in ValidateString
raise exception('%s must not be empty.' % name)
BadArgumentError: app must not be empty.
I figured out that it is related to the APPLICATION_ID environment variable being missing. After setting it, I am able to view the Database page. However, although writing my data to the emulator produces no errors (I debugged line by line to confirm, and local_db.bin is created), nothing is there when I look at the data. From the code's point of view I successfully write 15 entities, but none appear on the admin page. I think this is due to setting APPLICATION_ID manually, which I did not do before; perhaps it should be set automatically somehow. I thought this environment variable might do that: export DATASTORE_USE_PROJECT_ID_AS_APP_ID=true, but it doesn't seem to change anything.
Before calling creation of entities:
After calling creation of entities:
I write the data like this, and I have no doubt that this part works correctly.
ctx, err := appengine.Namespace(appengine.BackgroundContext(), "celbux101")
...
userKeyOut, err := datastore.Put(ctx, userKey, &F1_4{...})
Also, looked in both the default & the designated namespace (celbux101):
Super stumped. :( Help appreciated!
I really think it may somehow be related to APPLICATION_ID
Yes!
... I managed to come to a solution! As suspected, the data was being written correctly, as confirmed by the line-by-line debugging and the creation of local_db.bin. The issue is that the dev_appserver's UI cannot show the database entities because of the incorrect or missing APPLICATION_ID, as deduced.
I figured out that the dev_appserver's UI uses both the APPLICATION_ID and the namespace to determine where to look for your entities. Also, the dev_appserver has its own default APPLICATION_ID.
Solution
The fix is to export this environment variable BEFORE running your dev_appserver.py.
export APPLICATION_ID=dev~None
This magic export allows everything to work as expected. You can see the APPLICATION_ID that the UI is trying to use at the top-left of the interface.
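If the server is started from a script rather than an interactive shell, the same variable can be set on the child process environment. This is a minimal sketch: `dev_appserver_env` is a hypothetical helper name, and the launch step is left as a comment because the exact command line depends on your setup.

```python
import os


def dev_appserver_env(base_env=None, application_id="dev~None"):
    """Return a copy of the environment with APPLICATION_ID set.

    The default value "dev~None" is the one from the fix above; the
    original mapping is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["APPLICATION_ID"] = application_id
    return env


if __name__ == "__main__":
    env = dev_appserver_env()
    print("APPLICATION_ID =", env["APPLICATION_ID"])
    # To launch with this environment, pass it to the child process, e.g.
    # subprocess.run(your_dev_appserver_command, env=env)
    # where your_dev_appserver_command is the argument list you already use.
```

Building a copy instead of mutating os.environ keeps the change scoped to the dev_appserver process.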
EDIT: I just came back to running this on a new computer, and want to add this for future reference:
If you are getting IOError: emulator did not respond within 10s
Install Python 2.7 and add it to your PATH (alongside your bundled Python)!
From Google's documentation
export DATASTORE_DATASET=my-project-id
export DATASTORE_EMULATOR_HOST=::1:8432
export DATASTORE_EMULATOR_HOST_PATH=::1:8432/datastore
export DATASTORE_HOST=http://::1:8432
export DATASTORE_PROJECT_ID=my-project-id
This will display your project name (instead of dev~None)

Ansible: error when deploying playbooks in parallel

I am setting up a Kubernetes cluster with Ansible.
This is running fine.
I usually have 2 or 3 clusters that I can test different things with.
At some point a cluster or server often gets broken. When that happens, I usually recreate the servers and start the playbook again. Because this takes some time, I want to be able to run 2 or more playbooks in parallel.
But every time I do this, I get the following error:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: FileNotFoundError: [Errno 2] No such file or directory
I run my playbook like this:
"$ansible_playbook" \
  -i "${ANSIBLE_HOSTS}" \
  "${ANSIBLE_YML}" \
  --flush-cache \
  --user root \
  --become \
  --become-user root \
  --ask-sudo-pass
What could be the reason for the error?
I can imagine that Ansible creates some files in the background that are shared between the different playbook runs. But which files could those be?
Thanks in advance!
Update: more detailed error log (-vvv)
ansible-playbook 2.7.8
config file = /home/mod/cod/wo/thingylabs/kubernetes-provisioning/playbooks/test1/ansible.cfg
configured module search path = ['/home/mod/cod/wo/thingylabs/kubernetes-provisioning/vendors/kubespray/library']
ansible python module location = /usr/lib/python3.7/site-packages/ansible
executable location = /usr/bin/ansible-playbook
python version = 3.7.2 (default, Jan 10 2019, 23:51:51) [GCC 8.2.1 20181127]
Using /home/mod/cod/wo/thingylabs/kubernetes-provisioning/playbooks/test1/ansible.cfg as config file
SUDO password:
ERROR! Unexpected Exception, this is probably a bug: [Errno 2] No such file or directory
the full traceback was:
Traceback (most recent call last):
File "/usr/bin/ansible-playbook",
exit_code = cli.run()
File "/usr/lib/python3.7/site-packages/ansible/cli/playbook.py", line 104, in run
loader, inventory, variable_manager = self._play_prereqs(self.options)
File "/usr/lib/python3.7/site-packages/ansible/cli/__init__.py", line 786, in _play_prereqs
inventory = InventoryManager(loader=loader, sources=options.inventory)
File "/usr/lib/python3.7/site-packages/ansible/inventory/manager.py", line 148, in __init__
self.parse_sources(cache=True)
File "/usr/lib/python3.7/site-packages/ansible/inventory/manager.py", line 207, in parse_sources
source = unfrackpath(source, follow=False)
File "/usr/lib/python3.7/site-packages/ansible/utils/path.py", line 47, in unfrackpath
basedir = op.getcwd()
FileNotFoundError: [Errno 2] No such file or directory
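The traceback ends in a getcwd() call, which is consistent with the working directory of the ansible-playbook process having been deleted (for example by another run that removes and recreates the directory it was started from). That is only a reading of the traceback, not a confirmed diagnosis. A minimal sketch, assuming a POSIX system such as Linux, shows that Python raises the same error in that situation; `getcwd_fails_after_removal` is an illustrative name.

```python
import os
import tempfile


def getcwd_fails_after_removal():
    """Return True if os.getcwd() raises FileNotFoundError once the
    current working directory has been deleted (POSIX behaviour assumed)."""
    original = os.getcwd()
    doomed = tempfile.mkdtemp()
    try:
        os.chdir(doomed)
        os.rmdir(doomed)  # remove the directory we are standing in
        try:
            os.getcwd()
        except FileNotFoundError:
            return True
        return False
    finally:
        os.chdir(original)  # always return to a directory that exists


if __name__ == "__main__":
    print("getcwd failed after removal:", getcwd_fails_after_removal())
```

If this matches what happens here, starting each playbook from a directory that no parallel run deletes or recreates would be one thing to try.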

HAWQ stop cluster failed

I installed HAWQ from source code. After initializing and starting the HAWQ cluster, I tried to stop it with "hawq stop cluster". However, it failed.
The error shows:
[hadoop#Master ~]$ hawq stop cluster
20161217:19:59:31:004594 hawq_stop:Master:hadoop-[INFO]:-Prepare to do 'hawq stop'
20161217:19:59:31:004594 hawq_stop:Master:hadoop-[INFO]:-You can check log in /home/hadoop/hawqAdminLogs/hawq_stop_20161217.log
20161217:19:59:31:004594 hawq_stop:Master:hadoop-[INFO]:-Stop hawq with args: ['stop', 'cluster']
Continue with HAWQ service stop Yy|Nn (default=N):
20161217:19:59:38:004594 hawq_stop:Master:hadoop-[INFO]:-No standby host configured
20161217:19:59:38:004594 hawq_stop:Master:hadoop-[INFO]:-Stop hawq cluster
Traceback (most recent call last):
File "/home/hadoop/hawq/bin/hawq_ctl", line 1276, in <module>
stop_hawq(opts, hawq_dict)
File "/home/hadoop/hawq/bin/hawq_ctl", line 1043, in stop_hawq
instance.run()
File "/home/hadoop/hawq/bin/hawq_ctl", line 891, in run
check_return_code(self._stopAll())
File "/home/hadoop/hawq/bin/hawq_ctl", line 816, in _stopAll
master_result = self._stop_master()
File "/home/hadoop/hawq/bin/hawq_ctl", line 760, in _stop_master
self._stop_master_checks()
File "/home/hadoop/hawq/bin/hawq_ctl", line 712, in _stop_master_checks
self.conn = dbconn.connect(self.dburl, utility=True)
File "/home/hadoop/hawq/lib/python/gppylib/db/dbconn.py", line 211, in connect
cnx = pgdb._connect_(cstr, dbhost, dbport, dbopt, dbtty, dbuser, dbpasswd)
AttributeError: 'module' object has no attribute '_connect_'
For now, I use an alternative way to stop the cluster: stopping the master and the segments separately with pg_ctl.
pg_ctl stop -D <master_data_dir>/<segment_data_dir>
Anything about this error is helpful. Thanks!
Running 'pip install pygresql' directly installs the latest version of pygresql (5.0.3). In the errors above, pgdb._connect_() is the routine from the old version (4.2.2); in 5.0.3 it is pgdb._connect().
The solution is:
pip install pygresql==4.2.2
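The rename described above can be illustrated with a small lookup that accepts either attribute name. This is only a sketch of the mismatch, using stand-in objects instead of the real pgdb module; `resolve_connect` is a hypothetical helper, and it does not assume the two functions take the same arguments. Pinning the version as shown above remains the fix from this answer.

```python
from types import SimpleNamespace


def resolve_connect(pgdb_module):
    """Return the low-level connect function under whichever name exists."""
    for name in ("_connect_", "_connect"):
        func = getattr(pgdb_module, name, None)
        if func is not None:
            return func
    raise AttributeError("pgdb exposes neither _connect_ nor _connect")


if __name__ == "__main__":
    # Stand-ins for the two module layouts described above (not real pgdb).
    old_style = SimpleNamespace(_connect_=lambda *args: "old-style connect")
    new_style = SimpleNamespace(_connect=lambda *args: "new-style connect")
    print(resolve_connect(old_style)())
    print(resolve_connect(new_style)())
```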
Before stopping the cluster, unless it is a '-M immediate' stop, HAWQ connects to the database to check for running connections.
From your log, the connection to the master node failed because of a Python module issue. It seems the pygresql module is not installed properly. Please try to reinstall it.

Python (boto) TypeError launching Spark Cluster

The following is an attempt to launch a cluster with ten slaves.
12:13:44/sparkup $ec2/spark-ec2 -k sparkeast -i ~/.ssh/myPem.pem \
-s 10 -z us-east-1a -r us-east-1 launch spark2
Here is the output. Note that the same command had been successful with the February master code. Today I updated to the latest 1.4.0-SNAPSHOT.
Setting up security groups...
Searching for existing cluster spark2 in region us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Launched 10 slaves in us-east-1a, regid = r-68a0ae82
Launched master in us-east-1a, regid = r-6ea0ae84
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.........unable to load cexceptions
TypeError
p0
(S''
p1
tp2
Rp3
(dp4
S'child_traceback'
p5
S'Traceback (most recent call last):\n File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1280, in _execute_child\n sys.stderr.write("%s %s (env=%s)\\n" %(executable, \' \'.join(args), \' \'.join(env)))\nTypeError\n'
p6
sb.Traceback (most recent call last):
File "ec2/spark_ec2.py", line 1444, in <module>
main()
File "ec2/spark_ec2.py", line 1436, in main
real_main()
File "ec2/spark_ec2.py", line 1270, in real_main
cluster_state='ssh-ready'
File "ec2/spark_ec2.py", line 869, in wait_for_cluster_state
is_cluster_ssh_available(cluster_instances, opts):
File "ec2/spark_ec2.py", line 833, in is_cluster_ssh_available
if not is_ssh_available(host=dns_name, opts=opts):
File "ec2/spark_ec2.py", line 807, in is_ssh_available
stderr=subprocess.STDOUT # we pipe stderr through stdout to preserve output order
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 709, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1328, in _execute_child
raise child_exception
TypeError
The AWS console shows that the instances are actually running, so it is unclear what actually failed.
Any hints or workarounds appreciated.
UPDATE: The same error occurs when running the login command. It seems to be a problem with the boto API, but the cluster itself appears to be OK.
ec2/spark-ec2 -i ~/.ssh/sparkeast.pem login spark2
Searching for existing cluster spark2 in region us-east-1...
Found 1 master, 10 slaves.
Logging into master ec2-54-87-46-170.compute-1.amazonaws.com...
unable to load cexceptions
TypeError
p0
(.. same exception stacktrace as above )
The issue is that the Python 2.7.6 installation on my Yosemite MacBook appears to have become corrupted.
I reset PATH and PYTHONPATH to point to a custom Homebrew-installed Python, and then boto and other Python commands, including building the spark performance project, work fine.
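To confirm which interpreter is actually being picked up after changing PATH and PYTHONPATH, a short check like the one below can help. It is a generic sketch written for Python 3 (`describe_python` is an illustrative name) and is not part of spark-ec2.

```python
import shutil
import sys


def describe_python():
    """Collect the facts that show which Python installation is in use."""
    return {
        "executable": sys.executable,            # interpreter running this script
        "version": sys.version.split()[0],       # e.g. "3.11.4"
        "first_on_path": shutil.which("python"),  # what the shell would run, or None
        "sys_path_head": sys.path[:3],           # where imports are searched first
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
```

If `executable` still points at the system framework path seen in the traceback, the PATH change has not taken effect for that shell.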

nginx + python app -- how to enable error logging/stack trace

I have a Flask app running on nginx + uWSGI.
On my local server (non-nginx), I get a nice stack trace + error reporting for exceptions.
Like this:
$ python run.py
Traceback (most recent call last):
File "run.py", line 1, in <module>
from myappname import app
File "/home/me/myappname/myappname/__init__.py", line 27, in <module>
file_handler.setLevel(logging.debug)
File "/usr/lib/python2.7/logging/__init__.py", line 710, in setLevel
self.level = _checkLevel(level)
File "/usr/lib/python2.7/logging/__init__.py", line 190, in _checkLevel
raise TypeError("Level not an integer or a valid string: %r" % level)
On nginx, there is next to no logging whatsoever (in /var/log/nginx/error.log).
This post suggests adding app.logger.exception('Failed') to my script, which didn't help.
How do I enable this sort of logging for debugging purposes?
Nginx will capture your app's console output, but you must make the app recover from exceptions. Otherwise, you'll only get 500 or 400 errors from Nginx.
Try running the app off Nginx until it seems stable.
Use the logging module to capture app status information to your own log file. This strategy will be useful in the long run.
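A minimal sketch of that logging setup follows, using only the standard library. Note that the traceback in the question comes from passing the function logging.debug to setLevel; the level constant is logging.DEBUG. The logger name, log file path, and the `build_file_logger` and `risky` helpers are placeholders for illustration, and wiring the handler into Flask's app.logger is left out.

```python
import logging
import os
import tempfile


def build_file_logger(path, name="myappname"):
    """Create a logger that writes DEBUG and above to the given file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(path)
    handler.setLevel(logging.DEBUG)  # the constant, not the logging.debug function
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger


def risky():
    """Stand-in for application code that raises."""
    return 1 / 0


if __name__ == "__main__":
    log_path = os.path.join(tempfile.gettempdir(), "myappname-debug.log")
    log = build_file_logger(log_path)
    try:
        risky()
    except ZeroDivisionError:
        # logger.exception records the message plus the full stack trace.
        log.exception("Request failed")
    print("wrote", log_path)
```

Calling logger.exception inside an except block is what puts the stack trace in the file; outside an except block there is no active exception to record.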
