Python (boto) TypeError launching Spark Cluster - amazon-ec2

The following is an attempt to launch a cluster with ten slaves.
12:13:44/sparkup $ec2/spark-ec2 -k sparkeast -i ~/.ssh/myPem.pem \
-s 10 -z us-east-1a -r us-east-1 launch spark2
Here is the output. Note that the same command had succeeded with the February master code. Today I updated to the latest 1.4.0-SNAPSHOT.
Setting up security groups...
Searching for existing cluster spark2 in region us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Launched 10 slaves in us-east-1a, regid = r-68a0ae82
Launched master in us-east-1a, regid = r-6ea0ae84
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.........unable to load cexceptions
TypeError
p0
(S''
p1
tp2
Rp3
(dp4
S'child_traceback'
p5
S'Traceback (most recent call last):\n File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1280, in _execute_child\n sys.stderr.write("%s %s (env=%s)\\n" %(executable, \' \'.join(args), \' \'.join(env)))\nTypeError\n'
p6
sb.Traceback (most recent call last):
File "ec2/spark_ec2.py", line 1444, in <module>
main()
File "ec2/spark_ec2.py", line 1436, in main
real_main()
File "ec2/spark_ec2.py", line 1270, in real_main
cluster_state='ssh-ready'
File "ec2/spark_ec2.py", line 869, in wait_for_cluster_state
is_cluster_ssh_available(cluster_instances, opts):
File "ec2/spark_ec2.py", line 833, in is_cluster_ssh_available
if not is_ssh_available(host=dns_name, opts=opts):
File "ec2/spark_ec2.py", line 807, in is_ssh_available
stderr=subprocess.STDOUT # we pipe stderr through stdout to preserve output order
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 709, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1328, in _execute_child
raise child_exception
TypeError
The AWS console shows that the instances are actually running, so it is unclear what actually failed.
Any hints or workarounds would be appreciated.
UPDATE: The same error occurs when running the login command. It seems to be a problem with the boto API, but the cluster itself appears to be OK.
ec2/spark-ec2 -i ~/.ssh/sparkeast.pem login spark2
Searching for existing cluster spark2 in region us-east-1...
Found 1 master, 10 slaves.
Logging into master ec2-54-87-46-170.compute-1.amazonaws.com...
unable to load cexceptions
TypeError
p0
(.. same exception stacktrace as above )

The issue is that the Python 2.7.6 installation on my Yosemite MacBook appears to have become corrupted.
I reset PATH and PYTHONPATH to point to a custom Homebrew-installed Python, and after that boto and the other Python commands, including building the Spark performance project, work fine.
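Since the fix was to switch interpreters, it can help to confirm which Python a script is actually running under. The following is a minimal sketch, not part of spark-ec2; the function name describe_interpreter is made up for this example.

```python
import shutil
import sys


def describe_interpreter():
    """Collect basic facts about the Python interpreter running this code."""
    return {
        # Absolute path of the interpreter binary currently executing.
        "executable": sys.executable,
        # Version string such as "2.7.6" or "3.11.4".
        "version": sys.version.split()[0],
        # First "python" found on PATH, or None if there is none.
        "python_on_path": shutil.which("python"),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key, "=", value)
```

If "executable" still points under /System/Library/Frameworks after PATH has been changed, the script is probably being started with a hard-coded interpreter path rather than the one found on PATH.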

Related

How do I setup the _SERVER_MODEL_PATH variable?

I'm trying to replicate the quickstart save and serve example.
I go to the example folder, run the python script and can see the model runs and artifacts when I type mlflow ui.
However, when I try the mlflow serve command with different model run Ids and ports I get a 404 in my browser, even though the command seems successful:
mlflow models serve -m runs:/e1dabe8fc6e84286af5bee28ca89cdde/model --port 1234
2022/07/11 07:40:01 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2022/07/11 07:40:02 INFO mlflow.utils.conda: Conda environment mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4 already exists.
2022/07/11 07:40:02 INFO mlflow.pyfunc.backend: === Running command 'conda activate mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4 & waitress-serve --host=127.0.0.1 --port=1234 --ident=mlflow mlflow.pyfunc.scoring_server.wsgi:app'
INFO:waitress:Serving on http://127.0.0.1:1234
I tried running directly from anaconda prompt, and I get the following error:
conda activate mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4 & waitress-serve --host=127.0.0.1 --port=1234 --ident=mlflow mlflow.pyfunc.scoring_server.wsgi:app
Traceback (most recent call last):
File "C:\Users\sergio ferro\.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sergio ferro\.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\sergio ferro\.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\Scripts\waitress-serve.exe\__main__.py", line 7, in <module>
File "C:\Users\sergio ferro\.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\site-packages\waitress\runner.py", line 283, in run
app = resolve(module, obj_name)
File "C:\Users\sergio ferro\.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\site-packages\waitress\runner.py", line 218, in resolve
obj = __import__(module_name, fromlist=segments[:1])
File "C:\Users\sergio ferro\.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\site-packages\mlflow\pyfunc\scoring_server\wsgi.py", line 6, in <module>
app = scoring_server.init(load_model(os.environ[scoring_server._SERVER_MODEL_PATH]))
File "C:\Users\sergio ferro\.conda\envs\mlflow-ddf4db606beaa0e9bb42ff0ed98e8f4c4c7cb1f4\lib\os.py", line 679, in __getitem__
raise KeyError(key) from None
KeyError: '__pyfunc_model_path__'
I have tried deleting and recreating the Anaconda environment, running from Git Bash and from the Anaconda prompt, and adding the anaconda3 environment variables. I know it has something to do with the _SERVER_MODEL_PATH variable, but I don't know how to set it or which path to add to my environment variables so that it can be read from there.
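The traceback shows wsgi.py reading the model location from os.environ, so running waitress-serve by hand fails when that variable is missing from the shell's environment. The following is a rough sketch of building an environment that contains it. The helper name serving_env is hypothetical, var_name should be the key reported in the KeyError, and whether setting this one variable is enough for the scoring server to start is an assumption.

```python
import os


def serving_env(model_path, var_name, base_env=None):
    """Return a copy of the environment with the model-path variable set.

    var_name is whatever key the KeyError reports; model_path is the local
    directory of the saved model. Both are supplied by the caller.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[var_name] = str(model_path)
    return env
```

The resulting dictionary could then be passed as the env argument of subprocess.run when starting the waitress-serve command by hand.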

Ansible: error when deploying playbooks in parallel

I am setting up a Kubernetes cluster with Ansible.
This is running fine.
I usually have 2 or 3 clusters that I can test different things with.
Often, at some point, a cluster or server gets broken. When that happens, I usually recreate the servers and start the playbook again. Because this takes some time, I want to be able to run 2 or more playbooks in parallel.
But every time I do this, I get the following error:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: FileNotFoundError: [Errno 2] No such file or directory
I run my playbook like this:
"$ansible_playbook" \
-i "${ANSIBLE_HOSTS}" \
"${ANSIBLE_YML}" \
--flush-cache \
--user root \
--become \
--become-user root \
--ask-sudo-pass
What could be the reason for the error?
I can imagine that Ansible creates some files in the background that are shared by the different playbook runs. But which files could those be?
Thanks in advance!
Update: more detailed error log (-vvv)
ansible-playbook 2.7.8
config file = /home/mod/cod/wo/thingylabs/kubernetes-provisioning/playbooks/test1/ansible.cfg
configured module search path = ['/home/mod/cod/wo/thingylabs/kubernetes-provisioning/vendors/kubespray/library']
ansible python module location = /usr/lib/python3.7/site-packages/ansible
executable location = /usr/bin/ansible-playbook
python version = 3.7.2 (default, Jan 10 2019, 23:51:51) [GCC 8.2.1 20181127]
Using /home/mod/cod/wo/thingylabs/kubernetes-provisioning/playbooks/test1/ansible.cfg as config file
SUDO password:
ERROR! Unexpected Exception, this is probably a bug: [Errno 2] No such file or directory
the full traceback was:
Traceback (most recent call last):
File "/usr/bin/ansible-playbook",
exit_code = cli.run()
File "/usr/lib/python3.7/site-packages/ansible/cli/playbook.py", line 104, in run
loader, inventory, variable_manager = self._play_prereqs(self.options)
File "/usr/lib/python3.7/site-packages/ansible/cli/__init__.py", line 786, in _play_prereqs
inventory = InventoryManager(loader=loader, sources=options.inventory)
File "/usr/lib/python3.7/site-packages/ansible/inventory/manager.py", line 148, in __init__
self.parse_sources(cache=True)
File "/usr/lib/python3.7/site-packages/ansible/inventory/manager.py", line 207, in parse_sources
source = unfrackpath(source, follow=False)
File "/usr/lib/python3.7/site-packages/ansible/utils/path.py", line 47, in unfrackpath
basedir = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory
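The last frame of the traceback is a getcwd() call. On POSIX systems, os.getcwd() raises FileNotFoundError when the process's current working directory has been removed. The sketch below reproduces that outside Ansible. It is only an illustration, and whether a parallel run deletes or recreates the directory another run was started from is an assumption, not something the log confirms.

```python
import os
import tempfile


def getcwd_after_delete():
    """Return the exception os.getcwd() raises once the cwd is removed.

    Assumes a POSIX system; returns None if no exception is raised.
    """
    original = os.getcwd()
    doomed = tempfile.mkdtemp()
    os.chdir(doomed)
    os.rmdir(doomed)  # the process now sits in a directory that is gone
    try:
        os.getcwd()
        return None
    except FileNotFoundError as exc:
        return exc
    finally:
        os.chdir(original)  # restore a valid working directory


if __name__ == "__main__":
    print(repr(getcwd_after_delete()))
```

If this is the cause, starting each playbook run from a directory that no other run removes or recreates should avoid the error.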

FileNotFoundError: [Errno 2] No such file or directory: while deleting the minidcos vagrant cluster

I created a local minidcos Vagrant cluster using the command below.
$ sudo minidcos vagrant create ./dcos_generate_config.sh --agents 0
The command did not succeed. It failed abruptly with "No space left on device".
When I list the clusters, I see that the cluster exists.
$ sudo minidcos vagrant list
default
I'm not able to access the cluster using sudo minidcos vagrant web. I get the same error when I try to destroy the cluster, as shown below:
$ sudo minidcos vagrant destroy
Traceback (most recent call last):
File "/usr/local/bin/minidcos", line 10, in <module>
sys.exit(minidcos())
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/destroy.py", line 59, in destroy
cluster_vms.destroy()
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/_common.py", line 294, in destroy
self.vagrant_client.destroy()
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/_common.py", line 274, in vagrant_client
item for item in self.workspace_dir.iterdir()
File "/usr/local/lib/python3.7/site-packages/dcos_e2e_cli/dcos_vagrant/commands/_common.py", line 274, in <listcomp>
item for item in self.workspace_dir.iterdir()
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1074, in iterdir
for name in self._accessor.listdir(self):
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/acaa37838a534dc0ae51c3fcc059f650'
How can I successfully delete the cluster?
The issue here was that the workspace directory was deleted, and yet the VMs were still detected.
The workspace is configurable as per the documentation.
This could happen because the workspace directory is somehow deleted while the VMs are running, but it also happens when the host is shut down (assuming the default workspace temporary directory is used).
The behaviour is now changed as of minidcos version 2019.04.08.1.
In particular, minidcos vagrant list no longer lists VMs which are not in the running state.
There is also a new minidcos vagrant clean command which cleans all VMs and leftover VMs.
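The failure in the traceback comes from iterating over a workspace directory that no longer exists. The following is a minimal sketch of guarding such a listing. It is not how minidcos implements its fix, and list_workspace is a hypothetical helper name.

```python
from pathlib import Path


def list_workspace(workspace_dir):
    """Return the sorted entry names in workspace_dir, or [] if it is gone."""
    path = Path(workspace_dir)
    if not path.is_dir():
        # The directory was deleted (for example a temporary directory
        # cleared on reboot), so there is nothing to list.
        return []
    return sorted(item.name for item in path.iterdir())
```

With this guard, a missing workspace is reported as empty instead of raising FileNotFoundError.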

graphlab create: unable to start cluster in aws

At the moment I'm trying to create a cluster in AWS EC2 with GraphLab Create. The code is as follows:
import graphlab as gl
ec2config = gl.deploy.Ec2Config(region='us-west-2', instance_type='m3.large',
                                aws_access_key_id='secret-acces-key-id',
                                aws_secret_access_key='secret-access-key')
ec2 = gl.deploy.ec2_cluster.create(name='Test Cluster',
                                   s3_path='s3://test-big-data-2016', ec2_config=ec2config,
                                   idle_shutdown_timeout=3600, num_hosts=1)
When the above code is executed I get the following error:
Traceback (most recent call last):
File "test.py", line 59, in <module>
ec2 = gl.deploy.ec2_cluster.create(name='Test Cluster', s3_path='s3://test-big-data-2016', ec2_config=ec2config, idle_shutdown_timeout=36000, num_hosts=1)
File "/Users/remco/anaconda/envs/gl-env/lib/python2.7/site-packages/graphlab/deploy/ec2_cluster.py", line 83, in create
cluster.start()
File "/Users/remco/anaconda/envs/gl-env/lib/python2.7/site-packages/graphlab/deploy/ec2_cluster.py", line 233, in start
self.idle_shutdown_timeout
File "/Users/remco/anaconda/envs/gl-env/lib/python2.7/site-packages/graphlab/deploy/_executionenvironment.py", line 372, in _start_commander_host
raise RuntimeError('Unable to start host(s). Please terminate '
RuntimeError: Unable to start host(s). Please terminate manually from the AWS console.
When I look in the EC2 Management Console, a new instance has been launched and is running, but I still get the error in the terminal.
I really don't know what I'm doing wrong here. I followed the exact instructions from: https://turi.com/learn/userguide/deployment/pipeline-example.html

HAWQ stop cluster failed

I installed HAWQ from source code. After initializing and starting the HAWQ cluster, I tried to stop it with "hawq stop cluster". However, it failed.
The error shows:
[hadoop@Master ~]$ hawq stop cluster
20161217:19:59:31:004594 hawq_stop:Master:hadoop-[INFO]:-Prepare to do 'hawq stop'
20161217:19:59:31:004594 hawq_stop:Master:hadoop-[INFO]:-You can check log in /home/hadoop/hawqAdminLogs/hawq_stop_20161217.log
20161217:19:59:31:004594 hawq_stop:Master:hadoop-[INFO]:-Stop hawq with args: ['stop', 'cluster']
Continue with HAWQ service stop Yy|Nn (default=N):
20161217:19:59:38:004594 hawq_stop:Master:hadoop-[INFO]:-No standby host configured
20161217:19:59:38:004594 hawq_stop:Master:hadoop-[INFO]:-Stop hawq cluster
Traceback (most recent call last):
File "/home/hadoop/hawq/bin/hawq_ctl", line 1276, in <module>
stop_hawq(opts, hawq_dict)
File "/home/hadoop/hawq/bin/hawq_ctl", line 1043, in stop_hawq
instance.run()
File "/home/hadoop/hawq/bin/hawq_ctl", line 891, in run
check_return_code(self._stopAll())
File "/home/hadoop/hawq/bin/hawq_ctl", line 816, in _stopAll
master_result = self._stop_master()
File "/home/hadoop/hawq/bin/hawq_ctl", line 760, in _stop_master
self._stop_master_checks()
File "/home/hadoop/hawq/bin/hawq_ctl", line 712, in _stop_master_checks
self.conn = dbconn.connect(self.dburl, utility=True)
File "/home/hadoop/hawq/lib/python/gppylib/db/dbconn.py", line 211, in connect
cnx = pgdb._connect_(cstr, dbhost, dbport, dbopt, dbtty, dbuser, dbpasswd)
AttributeError: 'module' object has no attribute '_connect_'
At present, I use an alternative way to stop the cluster: stopping the master and the segments separately with pg_ctl.
pg_ctl stop -D <master_data_dir>/<segment_data_dir>
Any information about this error would be helpful. Thanks!
Running 'pip install pygresql' directly installs the latest version of PyGreSQL (5.0.3). The pgdb._connect_() call in the error above is the routine from the old version (4.2.2); in 5.0.3 it is pgdb._connect().
The solution is:
pip install pygresql==4.2.2
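The version difference described above can be checked by looking up whichever of the two routine names the installed module exposes. The following is a sketch only: find_connect is a hypothetical helper, the two names are taken from the explanation above, and the example uses stand-in objects instead of the real pgdb module. It does not make HAWQ work with the newer release, since the arguments the routine expects may differ as well.

```python
from types import SimpleNamespace


def find_connect(pgdb_module):
    """Return (name, function) for the connect routine the module exposes."""
    # "_connect_" is the older name and "_connect" the newer one,
    # according to the explanation above.
    for name in ("_connect_", "_connect"):
        func = getattr(pgdb_module, name, None)
        if callable(func):
            return name, func
    raise AttributeError("module exposes neither _connect_ nor _connect")


if __name__ == "__main__":
    # Stand-ins for the two module versions; a real check would pass pgdb.
    old_style = SimpleNamespace(_connect_=lambda *args: "old")
    new_style = SimpleNamespace(_connect=lambda *args: "new")
    print(find_connect(old_style)[0])
    print(find_connect(new_style)[0])
```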
Before stopping the cluster, unless it is a '-M immediate' stop, HAWQ connects to the database to check for running connections.
From your log, the connection to the master node failed because of a Python module issue. It seems the pygresql module is not installed properly. Please try reinstalling it.
