Ansible: error when deploying playbooks in parallel

I am setting up a Kubernetes cluster with Ansible.
This runs fine.
I usually have two or three clusters that I test different things with.
It often happens that a cluster/server breaks at some point. When that happens, I recreate the servers and start the playbook again. Because this takes some time, I want to be able to run two or more playbooks in parallel.
But every time I do this, I get the following error:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: FileNotFoundError: [Errno 2] No such file or directory
I run my playbook like this:
"$ansible_playbook"
-i "${ANSIBLE_HOSTS}"
"${ANSIBLE_YML}"
--flush-cache
--user root
--become
--become-user root
--ask-sudo-pass
What could be the reason for this error?
I can imagine that Ansible creates some files in the background that are shared between the playbook runs, but which files could those be?
Thanks in advance!
Update: more detailed error log (-vvv)
ansible-playbook 2.7.8
config file = /home/mod/cod/wo/thingylabs/kubernetes-provisioning/playbooks/test1/ansible.cfg
configured module search path = ['/home/mod/cod/wo/thingylabs/kubernetes-provisioning/vendors/kubespray/library']
ansible python module location = /usr/lib/python3.7/site-packages/ansible
executable location = /usr/bin/ansible-playbook
python version = 3.7.2 (default, Jan 10 2019, 23:51:51) [GCC 8.2.1 20181127]
Using /home/mod/cod/wo/thingylabs/kubernetes-provisioning/playbooks/test1/ansible.cfg as config file
SUDO password:
ERROR! Unexpected Exception, this is probably a bug: [Errno 2] No such file or directory
the full traceback was:
Traceback (most recent call last):
File "/usr/bin/ansible-playbook",
exit_code = cli.run()
File "/usr/lib/python3.7/site-packages/ansible/cli/playbook.py", line 104, in run
loader, inventory, variable_manager = self._play_prereqs(self.options)
File "/usr/lib/python3.7/site-packages/ansible/cli/__init__.py", line 786, in _play_prereqs
inventory = InventoryManager(loader=loader, sources=options.inventory)
File "/usr/lib/python3.7/site-packages/ansible/inventory/manager.py", line 148, in __init__
self.parse_sources(cache=True)
File "/usr/lib/python3.7/site-packages/ansible/inventory/manager.py", line 207, in parse_sources
source = unfrackpath(source, follow=False)
File "/usr/lib/python3.7/site-packages/ansible/utils/path.py", line 47, in unfrackpath
basedir = op.getcwd()
FileNotFoundError: [Errno 2] No such file or directory
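For what it's worth, the last frame of the traceback is a getcwd() call failing, which is what happens when a process's current working directory is deleted out from under it. A minimal reproduction of that symptom (a sketch, assuming a Linux shell; /tmp/demo is a made-up path for illustration):

$ mkdir /tmp/demo && cd /tmp/demo
$ rm -rf /tmp/demo    # e.g. a parallel run removes or recreates this directory
$ python3 -c 'import os; os.getcwd()'
...
FileNotFoundError: [Errno 2] No such file or directory

So if the parallel runs share a working directory, or one run deletes and recreates the directory the other was started from, it would produce exactly this error.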

Related

How do I set up the _SERVER_MODEL_PATH variable?

I'm trying to replicate the quickstart save-and-serve example.
I go to the example folder, run the Python script, and can see the model runs and artifacts when I type mlflow ui.
However, when I try the mlflow serve command with different model run IDs and ports, I get a 404 in my browser, even though the command seems successful:
mlflow models serve -m runs:/e1dabe8fc6e84286af5bee28ca89cdde/model --port 1234
2022/07/11 07:40:01 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2022/07/11 07:40:02 INFO mlflow.utils.conda: Conda environment mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4 already exists.
2022/07/11 07:40:02 INFO mlflow.pyfunc.backend: === Running command 'conda activate mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4 & waitress-serve --host=127.0.0.1 --port=1234 --ident=mlflow mlflow.pyfunc.scoring_server.wsgi:app'
INFO:waitress:Serving on http://127.0.0.1:1234
I tried running directly from anaconda prompt, and I get the following error:
conda activate mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4 & waitress-serve --host=127.0.0.1 --port=1234 --ident=mlflow mlflow.pyfunc.scoring_server.wsgi:app
Traceback (most recent call last):
File "C:\Users\sergio ferro.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sergio ferro.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\runpy.py", line 87, in run_code
exec(code, run_globals)
File "C:\Users\sergio ferro.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\Scripts\waitress-serve.exe_main.py", line 7, in
File "C:\Users\sergio ferro.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\site-packages\waitress\runner.py", line 283, in run
app = resolve(module, obj_name)
File "C:\Users\sergio ferro.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\site-packages\waitress\runner.py", line 218, in resolve
obj = import(module_name, fromlist=segments[:1])
File "C:\Users\sergio ferro.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\site-packages\mlflow\pyfunc\scoring_server\wsgi.py", line 6, in
app = scoring_server.init(load_model(os.environ[scoring_server._SERVER_MODEL_PATH]))
File "C:\Users\sergio ferro.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\os.py", line 679, in getitem
raise KeyError(key) from None
KeyError: 'pyfunc_model_path'
I have tried deleting and creating a new Anaconda environment, and I have run the command from Git Bash and Anaconda Prompt and added the anaconda3 environment variables. I know it has something to do with the _SERVER_MODEL_PATH variable, but I don't know how to set it up or which path to add to my environment variables so it can be read from there.
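From the traceback, mlflow's wsgi.py reads the model location from an environment variable whose key is 'pyfunc_model_path' (that is the KeyError above); mlflow models serve apparently sets it before launching waitress-serve, so when running the command by hand you would have to set it yourself. A hedged sketch for the Anaconda Prompt (the model path below is a placeholder based on the default mlruns layout, not a value from the question):

set pyfunc_model_path=C:\path\to\mlruns\0\e1dabe8fc6e84286af5bee28ca89cdde\artifacts\model
conda activate mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4 & waitress-serve --host=127.0.0.1 --port=1234 --ident=mlflow mlflow.pyfunc.scoring_server.wsgi:app

Note this only addresses the KeyError when launching waitress-serve manually; it does not by itself explain the 404 from the seemingly successful mlflow models serve run.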

Shared connection to host closed when running Ansible playbook as unprivileged user?

I am using Ansible v2.9.2 and have recently been facing issues with the npm Ansible module: it gives me "shared connection to host closed" errors. I have tried both Python 2 and 3, and the results were the same. Below is a doc containing my error and my playbook; please have a look.
Link: https://docs.google.com/document/d/1iaNMIjR3EVFYVvSoJEPTmjhSrDsnfZc5VCvUUamdKps/edit?usp=sharing
fatal: [1.0.3.99]: FAILED! => {"changed": false, "module_stderr": "Shared connection to 1.0.3.99 closed.\r\n", "module_stdout": "Traceback (most recent call last):\r\n File \"/var/tmp/ansible-tmp-1577345183.7290096-173113890020428/AnsiballZ_npm.py\", line 114, in \r\n _ansiballz_main()\r\n File \"/var/tmp/ansible-tmp-1577345183.7290096-173113890020428/AnsiballZ_npm.py\", line 106, in _ansiballz_main\r\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\r\n File \"/var/tmp/ansible-tmp-1577345183.7290096-173113890020428/AnsiballZ_npm.py\", line 49, in invoke_module\r\n imp.load_module('main', mod, module, MOD_DESC)\r\n File \"/tmp/ansible_npm_payload_6EJdAk/main.py\", line 310, in \r\n File \"/tmp/ansible_npm_payload_6EJdAk/main.py\", line 287, in main\r\n File \"/tmp/ansible_npm_payload_6EJdAk/main.py\", line 200, in list\r\n File \"/usr/lib/python2.7/json/init.py\", line 339, in loads\r\n return _default_decoder.decode(s)\r\n File \"/usr/lib…
Ansible Playbook: 
- hosts: all
  remote_user: abhinav
  become: yes
  tasks:
    - name: npm command
      npm:
        path: /data/codebase/test/api
        executable: /home/test/.nvm/versions/node/v8.15.0/bin/npm
        state: present
      become_user: test
      become: yes
The problem is becoming an unprivileged user.
"When both the connection user and the become_user are unprivileged, the module file is written as the user that Ansible connects as, but the file needs to be readable by the user Ansible is set to become. In this case, Ansible makes the module file world-readable ... Starting in Ansible 2.1, Ansible defaults to issuing an error if it cannot execute securely with become."
Ways to resolve this include:
Use pipelining (pipelining = true; see the sketch after this list).
Install POSIX.1e filesystem acl support on the managed host.
Avoid becoming an unprivileged user.
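A minimal sketch of the first option, assuming an ansible.cfg next to the playbook:

[ssh_connection]
# With pipelining enabled, Ansible executes modules over the open SSH
# connection instead of writing a temporary module file on the remote host,
# so no world-readable file is needed for the unprivileged become_user.
pipelining = true

For the second option, installing the acl package on the managed host (for example apt-get install acl on Debian-based systems) lets Ansible use setfacl to grant the become_user read access instead of falling back to world-readable module files.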

Using command/shell modules causes: ValueError: Key name may not begin with an underscore on multiple runs

Ansible throws an error on every task that uses the shell or command module, but not when running for the first time on a new machine.
The process I use is to image a new Raspberry Pi and then use Ansible to set up the services that I need. Running Ansible for the first time works fine, but if I run it again (without changing anything) it fails, saying ValueError: Key name may not begin with an underscore.
Here is an example of a task that throws the error. Running /usr/local/bin/pigpiod -v on the remote machine works as expected:
- name: see if pigpiod is the correct version
  command: "/usr/local/bin/pigpiod -v"
  register: pigpiod_version
Here is the error:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ValueError: Key name may not begin with an underscore
fatal: [issacs_box]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n File \"<stdin>\", line 113, in <module>\n File \"<stdin>\", line 105, in _ansiballz_main\n File \"<stdin>\", line 48, in invoke_module\n File \"/usr/lib/python3.5/imp.py\", line 234, in load_module\n return load_source(name, filename, file)\n File \"/usr/lib/python3.5/imp.py\", line 170, in load_source\n module = _exec(spec, sys.modules[name])\n File \"<frozen importlib._bootstrap>\", line 626, in _exec\n File \"<frozen importlib._bootstrap_external>\", line 673, in exec_module\n File \"<frozen importlib._bootstrap>\", line 222, in _call_with_frames_removed\n File \"/tmp/ansible_command_payload_hc3z4iej/__main__.py\", line 292, in <module>\n File \"/tmp/ansible_command_payload_hc3z4iej/__main__.py\", line 199, in main\n File \"/tmp/ansible_command_payload_hc3z4iej/ansible_command_payload.zip/ansible/module_utils/basic.py\", line 901, in __init__\n File \"/tmp/ansible_command_payload_hc3z4iej/ansible_command_payload.zip/ansible/module_utils/basic.py\", line 2243, in _log_invocation\n File \"/tmp/ansible_command_payload_hc3z4iej/ansible_command_payload.zip/ansible/module_utils/basic.py\", line 2201, in log\n File \"systemd/_journal.pyx\", line 68, in systemd._journal.send\n File \"systemd/_journal.pyx\", line 32, in systemd._journal._send\nValueError: Key name may not begin with an underscore\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
I had the wrong PyPI package installed. Coincidentally, there is a package called systemd (version 0.16.1) which is different from the official systemd-python (version 234) package. Running pip3 uninstall systemd and then pip3 install systemd-python --user solved the problem.
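If you suspect the same mix-up, a quick way to check which package is actually installed (a sketch, not specific to the question's machine):

$ pip3 list 2>/dev/null | grep -i systemd   # 'systemd 0.16.1' is the impostor; 'systemd-python 234' is the official binding
$ pip3 uninstall systemd
$ pip3 install systemd-python --user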
Seems like a weird Ansible bug. What version are you using? Can you try downgrading and/or upgrading a version? If that fixes the error, please notify the Ansible developers by creating an issue in their repo.
Otherwise, try updating and/or downgrading your Python version. It could be that something is wrong with the file /usr/lib/python3.5/imp.py.

FileNotFoundError: [Errno 2] No such file or directory while deleting the minidcos vagrant cluster

I have created a local minidcos vagrant cluster using the command below.
$ sudo minidcos vagrant create ./dcos_generate_config.sh --agents 0
The above command was not successful; it failed abruptly due to "No space left on device".
When I list the clusters, I see the cluster still exists.
$ sudo minidcos vagrant list
default
I'm not able to access the cluster using sudo minidcos vagrant web, and I get the same error when I try to destroy the cluster, as below:
$ sudo minidcos vagrant destroy
Traceback (most recent call last):
File "/usr/local/bin/minidcos", line 10, in <module>
sys.exit(minidcos())
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/destroy.py", line 59, in destroy
cluster_vms.destroy()
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/_common.py", line 294, in destroy
self.vagrant_client.destroy()
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/_common.py", line 274, in vagrant_client
item for item in self.workspace_dir.iterdir()
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/_common.py", line 274, in <listcomp>
item for item in self.workspace_dir.iterdir()
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1074, in iterdir
for name in self._accessor.listdir(self):
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/acaa37838a534dc0ae51c3fcc059f650'
How can I successfully delete the cluster?
The issue here was that the workspace directory was deleted, and yet the VMs were still detected.
The workspace is configurable as per the documentation.
This could happen because the workspace directory is somehow deleted while the VMs are running, but it also happens when the host is shut down (assuming the default workspace temporary directory is used).
The behaviour is now changed as of minidcos version 2019.04.08.1.
In particular, minidcos vagrant list no longer lists VMs which are not in the running state.
There is also a new minidcos vagrant clean command which cleans up all VMs, including leftover ones.
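With a version that includes these changes, the recovery would look like this (a sketch using only the commands named above):

$ sudo minidcos vagrant clean   # destroys all VMs, including leftovers
$ sudo minidcos vagrant list    # stopped or leftover VMs are no longer listed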

Python (boto) TypeError launching Spark Cluster

The following is an attempt to launch a cluster with ten slaves.
12:13:44/sparkup $ec2/spark-ec2 -k sparkeast -i ~/.ssh/myPem.pem \
-s 10 -z us-east-1a -r us-east-1 launch spark2
Here is the output. Note that the same command had been successful with the February master code; today I updated to the latest 1.4.0-SNAPSHOT.
Setting up security groups...
Searching for existing cluster spark2 in region us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Launched 10 slaves in us-east-1a, regid = r-68a0ae82
Launched master in us-east-1a, regid = r-6ea0ae84
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.........unable to load cexceptions
TypeError
p0
(S''
p1
tp2
Rp3
(dp4
S'child_traceback'
p5
S'Traceback (most recent call last):\n File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1280, in _execute_child\n sys.stderr.write("%s %s (env=%s)\\n" %(executable, \' \'.join(args), \' \'.join(env)))\nTypeError\n'
p6
sb.Traceback (most recent call last):
File "ec2/spark_ec2.py", line 1444, in <module>
main()
File "ec2/spark_ec2.py", line 1436, in main
real_main()
File "ec2/spark_ec2.py", line 1270, in real_main
cluster_state='ssh-ready'
File "ec2/spark_ec2.py", line 869, in wait_for_cluster_state
is_cluster_ssh_available(cluster_instances, opts):
File "ec2/spark_ec2.py", line 833, in is_cluster_ssh_available
if not is_ssh_available(host=dns_name, opts=opts):
File "ec2/spark_ec2.py", line 807, in is_ssh_available
stderr=subprocess.STDOUT # we pipe stderr through stdout to preserve output order
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 709, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1328, in _execute_child
raise child_exception
TypeError
The AWS console shows that the instances are actually running, so it is unclear what actually failed.
Any hints or workarounds appreciated.
UPDATE: The same error occurs when running the login command. It seems to be a problem with the boto API, but the cluster itself appears to be OK.
ec2/spark-ec2 -i ~/.ssh/sparkeast.pem login spark2
Searching for existing cluster spark2 in region us-east-1...
Found 1 master, 10 slaves.
Logging into master ec2-54-87-46-170.compute-1.amazonaws.com...
unable to load cexceptions
TypeError
p0
(.. same exception stacktrace as above )
The issue is that the Python 2.7.6 installation on my Yosemite MacBook appears to have become corrupted.
I reset PATH and PYTHONPATH to point to a custom Homebrew-installed Python version, and then boto and the other Python commands, including building the Spark performance project, work fine.
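Roughly what that fix looks like (a sketch, assuming Homebrew's default install prefix; the site-packages path may differ on your machine):

# put Homebrew's python ahead of the system one on the PATH
export PATH="/usr/local/bin:$PATH"
# point PYTHONPATH at Homebrew's site-packages instead of the corrupted system one
export PYTHONPATH="/usr/local/lib/python2.7/site-packages:$PYTHONPATH"
which python && python --version   # verify the Homebrew python is picked up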
