EMR notebooks install additional libraries - bash

I'm having a surprisingly hard time working with additional libraries via my EMR notebook. The AWS interface for EMR allows me to create Jupyter notebooks and attach them to a running cluster. I'd like to use additional libraries in them. SSHing into the machines and installing manually as ec2-user or root will not make the libraries available to the notebook, as it apparently uses the livy user. Bootstrap actions install things for hadoop. I can't install from the notebook because its user apparently doesn't have sudo, git, etc., and it probably wouldn't install to the slaves anyway.
What is the canonical way of installing additional libraries for notebooks created through the EMR interface?

For the sake of an example, let's assume you need the librosa Python module on a running EMR cluster. We're going to use Python 2.7 as the procedure is simpler: Python 2.7 is guaranteed to be on the cluster and it is the default runtime for EMR.
Create a script that installs the package:
#!/bin/bash
sudo easy_install-2.7 pip
sudo /usr/local/bin/pip2 install librosa
and save it to your home directory, e.g. /home/hadoop/install_librosa.sh. Note the name; we're going to use it later.
In the next step you're going to run this script through another script inspired by the Amazon EMR docs: emr_install.py. It uses AWS Systems Manager to execute your script on the nodes.
import sys
import time
from boto3 import client

try:
    clusterId = sys.argv[1]
except IndexError:
    print("Syntax: emr_install.py [ClusterId]")
    sys.exit(1)

emrclient = client('emr')

# Get the list of core nodes
instances = emrclient.list_instances(ClusterId=clusterId, InstanceGroupTypes=['CORE'])['Instances']
instance_list = [x['Ec2InstanceId'] for x in instances]

# Attach a tag to the core nodes so SSM can target them
ec2client = client('ec2')
ec2client.create_tags(Resources=instance_list, Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

ssmclient = client('ssm')

# Run the shell script to install the libraries on every tagged node
command = ssmclient.send_command(
    Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
    DocumentName='AWS-RunShellScript',
    Parameters={"commands": ["bash /home/hadoop/install_librosa.sh"]},
    TimeoutSeconds=3600)['Command']['CommandId']

# Give the command some time to run, then report its overall status
time.sleep(30)
command_status = ssmclient.list_commands(CommandId=command)['Commands'][0]['Status']
print("Command:" + command + ": " + command_status)
To run it:
python emr_install.py [cluster_id]
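If you also want per-node results rather than the single status snapshot above, one option (a sketch using boto3's get_command_invocation; the helper name and polling interval are my own) is to poll each tagged instance until the command reaches a terminal state:
import time
from boto3 import client

ssmclient = client('ssm')

def wait_for_command(command_id, instance_ids, poll_seconds=15):
    # Poll every instance until the SSM command leaves the in-progress states.
    for instance_id in instance_ids:
        while True:
            invocation = ssmclient.get_command_invocation(
                CommandId=command_id, InstanceId=instance_id)
            if invocation['Status'] not in ('Pending', 'InProgress', 'Delayed'):
                print(instance_id + ": " + invocation['Status'])
                break
            time.sleep(poll_seconds)

# e.g. wait_for_command(command, instance_list)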

What is the canonical way of installing additional libraries for notebooks created through the EMR interface?
EMR Notebooks recently launched 'notebook-scoped libraries', which let you install additional Python libraries on your cluster from a public or private PyPI repository and use them within the notebook session.
Notebook-scoped libraries provide the following benefits:
You can use libraries in an EMR notebook without having to re-create the cluster or re-attach the notebook to a cluster.
You can isolate library dependencies of an EMR notebook to the individual notebook session. The libraries installed from within the notebook cannot interfere with other libraries on the cluster or libraries installed within other notebook sessions.
Here are more details:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
Technical blog:
https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
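As a rough illustration of what the linked pages describe (assuming the notebook is attached with the PySpark kernel, where sc is the SparkContext the session provides, and using librosa only as an example package), a single notebook cell can install and later remove a library for that session only:
# Run inside a PySpark notebook cell attached to the cluster
sc.list_packages()                   # packages currently visible to this session
sc.install_pypi_package("librosa")   # install from public PyPI for this session only
# ... use the library, then optionally clean up ...
sc.uninstall_package("librosa")      # the cluster itself is left untouched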

What I usually do in this case is delete my cluster and create a new one with bootstrap actions. Bootstrap actions allow you to install additional libraries on your cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
For example, writing the following script and saving it in S3 will allow you to use datadog from your notebook running on top of your cluster (at least it works with EMR 5.19):
#!/bin/bash -xe
#install datadog module for using in pyspark
sudo pip-3.4 install -U datadog
Here is the command line I would run for launching this cluster :
aws emr create-cluster --release-label emr-5.19.0 \
    --name 'EMR 5.19 test' \
    --applications Name=Hadoop Name=Spark Name=Hive Name=Livy \
    --use-default-roles \
    --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
      InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
    --region eu-west-1 \
    --log-uri s3://<path-to-logs> \
    --configurations file://config-emr.json \
    --bootstrap-actions Path=s3://<path-to-bootstrap-in-aws>,Name=InstallPythonModules
And the config-emr.json that is stored locally on your computer:
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
I assume that you could do exactly the same thing when creating a cluster through the EMR interface, by going to the advanced options during creation.
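For reference, the same cluster definition can also be expressed with boto3 instead of the AWS CLI. This is only a sketch: it assumes the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole, which is what --use-default-roles resolves to) and reuses the S3 placeholders from above.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="EMR 5.19 test",
    ReleaseLabel="emr-5.19.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}, {"Name": "Livy"}],
    LogUri="s3://<path-to-logs>",
    ServiceRole="EMR_DefaultRole",          # assumes the default EMR roles exist in the account
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceCount": 1, "InstanceType": "m4.large"},
            {"InstanceRole": "CORE", "InstanceCount": 2, "InstanceType": "m4.large"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "InstallPythonModules",
            "ScriptBootstrapAction": {"Path": "s3://<path-to-bootstrap-in-aws>"},
        }
    ],
    Configurations=[
        {"Classification": "spark", "Properties": {"maximizeResourceAllocation": "true"}},
        {
            "Classification": "spark-env",
            "Configurations": [
                {"Classification": "export", "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"}}
            ],
        },
    ],
)
print(response["JobFlowId"])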

I spent way too long on this; the AWS documentation and support did not help at all, but I did get it working so that you can install Python libraries directly in the notebook.
If you can do the items below, then you can install libraries through running pip install commands in a single-line Jupyter cell, with the Python runtime, like so:
!pip install pandas
One item that confused me a lot was that I could SSH into the cluster and reach the internet (ping and pip both worked), but the notebook was not able to reach out, nor were any libraries actually available. Instead you need to make sure that the notebook itself can reach out. One good test is just to see if you can ping out, with the same structure as above: a single line starting with !
!ping google.com
If that is taking too long and timing out then you still need to figure out your VPN/subnet rules.
Notes below on cluster creation:
(step 1) This does not work for every version of EMR. I have it working on 5.30.0, but last I checked 5.30.1 did not work.
(step 2 -> Networking) You need to make sure you're on a private subnet and your VPN can reach out to the public internet. Again, don't let SSHing into the server fool you: the notebook is either inside a Docker container there or running somewhere else. The only relevant tests are the ones you run directly from the notebook.
Once you have this working and install a package, it will work for any notebook on that cluster. I have a notebook just called install that has one line per package that I run through whenever I spin up a new cluster.
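If ping is not available in the notebook environment, a rough equivalent of that reachability test written as plain Python in a cell (the host and port are just examples) is:
import socket

# Open an outbound TCP connection from the notebook itself;
# an exception here means the subnet/NAT rules still block egress.
try:
    socket.create_connection(("pypi.org", 443), timeout=5).close()
    print("outbound connectivity OK")
except OSError as exc:
    print("no outbound connectivity:", exc)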

Related

How to configure docker swarm using jenkins?

I have got an assignment. The assignment is "Write a shell script to install and configure docker swarm(one master/leader and one node) and automate the process using Jenkins." I am new to this technology and finding it difficult to proceed. Can anyone help me in explaining step-by-step process of how to proceed?
#Rajnish Kumar Singh, have you tried to check resources online? I understand you are very new to this technology, but googling some keywords like
what is docker swarm
what is jenkins
etc. would definitely help.
Having said that, basically you need to do the below set of steps to complete your assignment.
Prerequisites
2 or more Ubuntu 20.04 servers
(You can use any Linux distro like Ubuntu, Red Hat, etc., but make sure your install and execute commands change accordingly.
Here we need two nodes, mainly to configure the manager and worker node cluster.)
E.g.:
manager --- 132.92.41.4
worker --- 132.92.41.5
You can create these nodes with any public cloud provider, such as AWS EC2 instances or GCP VMs.
Next, you need to do the below set of steps:
Configure Hosts
Install Docker-ce
Docker Swarm Initialization
You can refer to this article for more info: https://www.howtoforge.com/tutorial/ubuntu-docker-swarm-cluster/
This completes the first part of your assignment.
Next, you can create one small shell script and include all those install and configuration commands in it. Basically, a shell script is a collection of Linux commands; instead of running each command separately, you run the script alone and all of the setup is done for you.
You can create a small script using the touch command:
touch docker-swarm-install.sh
Give the script the proper privileges to make it executable:
chmod +x docker-swarm-install.sh
Next, include in the script all of the install and configure commands you used earlier to do the Docker swarm setup (you can refer to the link shared above).
Now, when your script is ready, you can configure it in a Jenkins job; whenever the Jenkins job runs, the script gets executed and the Docker swarm cluster gets created.
You need a Jenkins server. Jenkins is open-source software; you can install it on any public cloud instance (AWS EC2).
Reference: https://devopsarticle.com/how-to-install-jenkins-on-aws-ec2-ubuntu-20-04/
Next, once installation is completed, you need to configure a job in Jenkins.
Reference: https://www.toolsqa.com/jenkins/jenkins-build-jobs/
Add your 'docker-swarm-install.sh' as a build step in the created job.
Reference: https://faun.pub/jenkins-jobs-hands-on-for-the-different-use-cases-devops-b153efb483c7
If all of the setup is successful, then when you run your Jenkins job, your Docker swarm cluster should get created.

Run a PowerShell script on Azure AKS nodes

I have a PowerShell script that I want to run on some Azure AKS nodes (running Windows) to deploy a security tool. There is no daemon set for this by the software vendor. How would I get it done?
Thanks a million
Abdel
A similar question has been asked here. User philipwelz has written:
Hey,
although there could be ways to do this, I would recommend that you don't. The reason is that your AKS setup should not allow executing scripts inside containers directly on AKS nodes. This would imply a huge security issue IMO.
I suggest finding a way to execute your script directly on your nodes, for example with PowerShell remoting or any way that suits you.
BR,
Philip
This user is right. You should avoid executing scripts on your AKS nodes. In your situation, if you want to deploy Prisma Cloud, you need to go with the following doc. You are right that install scripts work only on Linux:
Install scripts work on Linux hosts only.
But for the Windows and Mac hosts you have specific YAML files:
For macOS and Windows hosts, use twistcli to generate Defender DaemonSet YAML configuration files, and then deploy it with kubectl, as described in the following procedure.
The entire procedure is described in detail in the document I have quoted. Pay attention to step 3 and step 4. As you can see, there is no need to run any PowerShell script:
STEP 3:
Generate a defender.yaml file, where:
The following command connects to Console (specified in [--address](https://docs.paloaltonetworks.com/prisma/prisma-cloud/prisma-cloud-admin-compute/install/install_kubernetes.html#)) as user <ADMIN> (specified in [--user](https://docs.paloaltonetworks.com/prisma/prisma-cloud/prisma-cloud-admin-compute/install/install_kubernetes.html#)), and generates a Defender DaemonSet YAML config file according to the configuration options passed to [twistcli](https://docs.paloaltonetworks.com/prisma/prisma-cloud/prisma-cloud-admin-compute/install/install_kubernetes.html#). The [--cluster-address](https://docs.paloaltonetworks.com/prisma/prisma-cloud/prisma-cloud-admin-compute/install/install_kubernetes.html#) option specifies the address Defender uses to connect to Console.
$ <PLATFORM>/twistcli defender export kubernetes \
--user <ADMIN_USER> \
--address <PRISMA_CLOUD_COMPUTE_CONSOLE_URL> \
--cluster-address <PRISMA_CLOUD_COMPUTE_HOSTNAME>
- <PLATFORM> can be linux, osx, or windows.
- <ADMIN_USER> is the name of a Prisma Cloud user with the System Admin role.
and then STEP 4:
kubectl create -f ./defender.yaml
I think that the above answer is not completely correct.
The twistcli command does not export a DaemonSet for Windows nodes. The "PLATFORM" option is for choosing the OS of the computer that the command will run on.
After testing, I have come to the conclusion that there is no Docker image of Prisma Cloud for Windows Kubernetes nodes, as it is deployed as a service on the Windows OS and not as a container (as on Linux). Wrapping up, the DaemonSet does not work on the Windows hosts.
I believe the only solution is this -> Windows
This is the PowerShell script that Wytrzymały Wiktor has mentioned.
Unfortunately this cannot be automated easily, as you have to deploy an Azure VM per AKS cluster (on the same network), and RDP to the AKS Windows node and run the script.
If anyone has another suggestion or solution, feel free to share.

How to run components in AWS Greengrass?

In AWS Greengrass Documentation it says you can test components like this
sudo /greengrass/v2/bin/greengrass-cli deployment create \
--recipeDir ~/greengrassv2/recipes \
--artifactDir ~/greengrassv2/artifacts \
--merge "com.example.HelloWorld=1.0.0"
But if I want to run a component from another script, should I use the same command? For example, I have a component that publishes some data to MQTT, and right now I am using os.system like this:
os.system("sudo /greengrass/v2/bin/greengrass-cli deployment create \
--recipeDir ~/greengrassv2/recipes \
--artifactDir ~/greengrassv2/artifacts \
--merge "com.example.HelloWorld=1.0.0"")
But I am not sure if it's the right solution. It does not seem like a nice solution.
I wouldn't recommend using the greengrass-cli deployment create command to run the component:
It's for local development only.
The command runs through all the lifecycle steps defined in the component recipe file before running the component, which can be a big overhead.
If "another script" is also a Greengrass component, you can use the AWS IoT Device SDK for interprocess communication (IPC).
If "another script" is not a Greengrass component, you can use the restart command to trigger a run of the component. It has less overhead than the create command:
sudo /greengrass/v2/bin/greengrass-cli component restart --names "HelloWorld"
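If the calling code is a plain Python script rather than another component, a minimal sketch of wrapping that restart command with subprocess instead of os.system (the component name is just an example) could look like this:
import subprocess

# Restart an already-deployed component instead of re-creating a local deployment each time.
subprocess.run(
    ["sudo", "/greengrass/v2/bin/greengrass-cli",
     "component", "restart", "--names", "com.example.HelloWorld"],
    check=True,
)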

Can't add admin in Shield Elasticsearch - [Error]Could not find or load main class org.elasticsearch.shield.authc.esusers.tool.ESUsersTool

I am trying out Shield as a security measure for my Kibana and Elasticsearch. Running on Mac OS X 10.9.5
I followed the documentation from Elastic and managed to install Shield. Since my Elasticsearch is running automatically, I skipped step 2 (start Elasticsearch).
For step 3, I tried adding an admin by running the following command in my terminal: bin/shield/esusers useradd admin -p password -r admin.
Unfortunately, I'm getting this error:
Error: Could not find or load main class org.elasticsearch.shield.authc.esusers.tool.ESUsersTool
Below are the additional steps I took:
Double-checked that the bin/shield/esusers path existed
Manually starting elasticsearch before adding users
Tried a variety of different commands based on the documentation.
bin/shield/esusers useradd admin -r admin and
bin/shield/esusers useradd es_admin -r admin
Ran those commands with sudo
The same error is generated. I can't seem to find the problem on Google either. Not really sure what I'm missing here, as the documentation seems pretty straightforward.
You must restart the node because new Java classes were added to it (from the Shield plugin) and the JVM behind Elasticsearch needs to reload those classes. It can only do that if you restart it.
Kill the process and start it up again, or use curl -XPOST "http://localhost:9200/_shutdown" to shut the cluster down.
Also, the Shield plugin needs to be installed on all the nodes in the cluster.

How do I install security updates on an Amazon Linux AMI EC2 instance?

I see the following notices displayed on login:
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
See /usr/share/doc/system-release/ for latest release notes.
There are 30 security update(s) out of 39 total update(s) available
How do I install these updates on my machine?
As outlined in the Security Updates section within Amazon Linux AMI Basics, Amazon Linux AMIs are configured to download and install security updates at launch time. That is, if you do not need to preserve data or customizations on your running Amazon Linux AMI instances, you can simply launch new instances from the latest updated Amazon Linux AMI (see the Product Life Cycle section for details).
This currently includes only Critical or Important security updates though; see the AWS team's response to Best practices for Amazon Linux image security updates:
The default on Amazon Linux AMI is to install any Critical or Important security updates on launch. This is a function of cloud-init and can be modified in cloud.cfg on the box or by passing in user data. This is why you see some security updates still available at launch.
Consequently, if you want to install all security updates, or indeed need to preserve data or customizations on your running Amazon Linux AMI instances, you can maintain those instances through the Amazon Linux AMI yum repositories, i.e. you use the regular yum update mechanism as outlined for the yum-security plugin:
# yum update --security
Please note: this does not work if only security updates are selected, because security updates are not properly flagged in CentOS and Amazon Linux. This may be a matter of Red Hat making security a paid feature which, if I'm being frank, is bullshit.
For this to work you must update the yum-cron config file to install all updates. This makes security updates less likely to run reliably, which makes everyone less secure.
update_cmd = default
Amazon Linux runs updates when the host boots for the first time.
If you plan to have hosts up long-term you may also want to enable automatic security updates. I recommend using yum-cron:
sudo yum install yum-cron
The config file is here (you probably want to just run security updates):
/etc/yum/yum-cron.conf
You can then enable yum-cron like so:
sudo service yum-cron start
Edit, from a useful comment below:
"If you're creating/destroying instances with an auto-scaling group, etc., the command should be something like 'sudo yum update -y' in user data."
The answer above is correct; here are the 4 commands you can copy and paste to run:
# Install the package yum-cron
sudo yum install yum-cron -y
# Change the config file /etc/yum/yum-cron.conf and modify the line apply_updates from no to yes
sudo sed -i "s/apply_updates = no/apply_updates = yes/" /etc/yum/yum-cron.conf
# Enable the yum-cron service to start automatically upon system boot
sudo systemctl enable yum-cron
# Start the yum-cron service now
sudo systemctl start yum-cron
These commands also work on Red Hat 7 and CentOS 7.
If you are running as the root user, you can simply run the commands without sudo:
yum install yum-cron -y
sed -i "s/apply_updates = no/apply_updates = yes/" /etc/yum/yum-cron.conf
systemctl enable yum-cron
systemctl start yum-cron
For more information see https://linuxize.com/post/configure-automatic-updates-with-yum-cron-on-centos-7/
https://www.howtoforge.com/tutorial/how-to-setup-automatic-security-updates-on-centos-7/
