How to download the 'Million Song Dataset' to my local disk?

The dataset is at http://labrosa.ee.columbia.edu/millionsong/
Is there a direct way to download it, for example with 'wget'?
Or how can I get it from an Amazon EC2 virtual machine?
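A minimal sketch of the wget route, assuming you only want the 10,000-song subset; the tarball path below is a placeholder, so check the dataset's "getting the dataset" page for the current link:

# Hypothetical subset URL -- verify it on the site above before running.
wget -c http://labrosa.ee.columbia.edu/millionsong/path/to/millionsongsubset.tar.gz

The full dataset (roughly 280 GB) is distributed as an Amazon EBS public snapshot rather than a plain download, so on EC2 the usual route is to attach that snapshot as a volume instead of using wget.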

Related

Create VM from VMDK in VMware using Ansible

I can create a VM in VMware from an existing VMDK using the vSphere client, but I am not able to do it with Ansible. Can anyone suggest how to create a VM from an existing VMDK with Ansible?
From my experience, you can't do it out of the box with Ansible.
I edited the vsphere_guest module to be able to accomplish this for my client.
Unfortunately I can't disclose this custom module's code.
You need to modify the add_disk method: remove the file-creation operation and set the existing VMDK filename as the disk backing.

Download a file from the Internet directly to my S3 bucket

I'm working with EMR (Elastic MapReduce) on AWS infrastructure and the default way to provide input files (large datasets) for programs is to upload them to an S3 bucket and reference those buckets from within EMR.
Usually I download the datasets to my local development machine and then upload them to S3, but this is getting harder to do with larger files, as upload speeds are generally much lower than download speeds.
My question is: is there a way to download files from the internet (given their URLs) directly into S3, so I don't have to download them to my local machine and then manually upload them?
No. You need an intermediary; typically an EC2 instance is used rather than your local machine, for speed.
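A minimal sketch of that pattern, run from an EC2 instance with the AWS CLI configured; the URL and bucket name are placeholders. The download is piped straight into S3 so it never touches the instance's disk:

# "-" tells "aws s3 cp" to read the object body from stdin.
curl -L http://example.com/big-dataset.tar.gz | aws s3 cp - s3://my-bucket/datasets/big-dataset.tar.gz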

sync EBS volumes via S3

I am looking to have multiple Amazon EC2 instances use the same data store. Amazon does not condone mounting an S3 Bucket as a file system, so I am trying to avoid this solution. Is there a way to synchronize an EBS volume with S3 or would it be best to use rsync and cron?
Do you really have to have the files locally available from within EBS? What if instead you served them to yourself via CloudFront, and restricted the permissions so that only your instances (or only your Security Group) could see the files?
Come Fall 2015 you'll be able to use Elastic File System (EFS) for this. But until then, I would suppose the next best thing is to use the AWS command line to sync down from S3 to your volume:
aws s3 sync s3://my-bucket/my/folder /mnt/my-ebs/
After the initial run, that sync command is surprisingly fast. So from there you could just cron it to run hourly or so?
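For example, a crontab entry along these lines (same bucket and mount point as above, and assuming the aws CLI is on the cron user's PATH with credentials configured) would pull changes down every hour:

0 * * * * aws s3 sync s3://my-bucket/my/folder /mnt/my-ebs/ >> /var/log/s3-sync.log 2>&1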

Syncing between Amazon EBS Devices

I have 2 EC2 instances, each with its own EBS volume attached. Sitting in front of the EC2 instances is a load balancer.
These instances run CMS-driven sites where users can upload files.
What would be the best solution to the problem of a file getting uploaded to one EBS volume, and the load balancer then sending a visitor to the EC2 instance whose EBS volume does not have the file? Some sort of cron job which runs an rsync?
Suggestions very welcome!
I believe the best solution would be to use a single shared store like Amazon S3. It's best to use a plugin for your CMS that stores users' files on S3, but if there is no such plugin you can use the FUSE s3fs adapter to mount the bucket as a file system on both instances and configure your CMS to store those files in that directory.
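A rough sketch of the s3fs route, assuming s3fs-fuse is installed, credentials are in /etc/passwd-s3fs, and the bucket name and upload directory are placeholders:

# Run on both instances; the CMS then writes uploads into the mounted bucket.
s3fs my-cms-uploads /var/www/uploads -o passwd_file=/etc/passwd-s3fs -o allow_other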
There are several solutions to this problem; off the top of my head:
an NFS/Samba shared directory between instances
svn deploy
cluster file systems such as OCFS/GFS
deployment tools such as Capistrano, triggering a deploy when you need one
and of course cron jobs, where you can use ftp, scp, rsync, s3sync/copy, etc.
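As a sketch of the cron-plus-rsync option, assuming key-based SSH between the instances and placeholder paths, user, and addresses, each instance could push newly uploaded files to its peer every few minutes:

*/5 * * * * rsync -az /var/www/uploads/ deploy@10.0.0.2:/var/www/uploads/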
Or possibly, create one EC2 instance as an NFS server and share its directories with your other instances.
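A minimal sketch of that NFS variant, with hypothetical private IPs and assuming an NFS server package is installed on the exporting instance:

# On the instance acting as the NFS server (10.0.0.1):
echo "/var/www/uploads 10.0.0.0/24(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -ra

# On each of the other instances:
mount -t nfs 10.0.0.1:/var/www/uploads /var/www/uploads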
There are multiple solutions for keeping the data on both EC2 instances in sync, with or without EBS volumes.
You can use the AWS EFS service instead of EBS volumes. An EFS volume can be shared between EC2 instances within a VPC, and both instances will see the same data at the path where EFS is mounted.
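For example, once an EFS file system exists in the same VPC, each instance can mount it with the standard NFSv4 client; the file-system ID and region below are placeholders:

sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs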
Another solution is GlusterFS. This can also work between EBS volumes in different AWS regions. See this link: http://sanketdangi.com/post/5601762671/gluster-config-aws-multi-az
You can mount an S3 bucket on your EC2 instances using s3fs (FUSE). See this link: https://github.com/s3fs-fuse/s3fs-fuse/wiki/Fuse-Over-Amazon
You could also run "aws s3 sync" on both EBS volumes; that way both volumes stay in sync via S3. See this link: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

How to rsync to all Amazon EC2 servers?

I have a Scalr EC2 cluster, and want an easy way to synchronize files across all instances.
For example, I have a bunch of files in /var/www on one instance, and I want to be able to identify all of the other hosts and then rsync to each of those hosts to update their files.
ls /etc/aws/hosts/app/
returns the IP addresses of all of the other instances
10.1.2.3
10.1.33.2
10.166.23.1
Ideas?
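A hedged sketch of the direct rsync approach, assuming the entries under /etc/aws/hosts/app/ are named after the peer IPs and that key-based SSH is already set up for some user (here 'deploy', a placeholder) on each host:

#!/bin/bash
# Push /var/www to every other app instance listed in /etc/aws/hosts/app/.
for entry in /etc/aws/hosts/app/*; do
    ip=$(basename "$entry")
    rsync -az --delete /var/www/ "deploy@${ip}:/var/www/"
done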
As Zach said you could use S3.
You could download one of the many clients out there for mapping drives to S3 (search for S3 and WebDAV).
If I were going to go this route, I would set up an S3 bucket with all my shared files and use JetS3t in a cron job to sync each node's local drive with the bucket (pulling down S3 bucket updates). Then, since I normally use Eclipse and Ant for building, I would create an Ant job for deploying updates to the S3 bucket (pushing updates up to the S3 bucket).
From http://jets3t.s3.amazonaws.com/applications/synchronize.html
Usage: Synchronize [options] UP <S3Path> <File/Directory>
(...)
or: Synchronize [options] DOWN
UP : Synchronize the contents of the Local Directory with S3.
DOWN : Synchronize the contents of S3 with the Local Directory
...
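As a sketch, each node could then run something like the following from cron to pull updates down; the bucket name, install path, and schedule are placeholders, and the synchronize script expects your AWS credentials to be configured (typically via its synchronize.properties file):

*/15 * * * * /opt/jets3t/bin/synchronize.sh DOWN my-shared-bucket/www /var/www >> /var/log/jets3t-sync.log 2>&1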
I would recommend the above solution, if you don't need cross-node file locking. It's easy and every system can just pull data from a central location.
If you need more cross-node locking:
An ideal solution would be to use IBM's GPFS, but IBM doesn't just give it away (at least not yet). Even though it's designed for high-performance interconnects, it can also be used over slower connections. We used it as a replacement for NFS and it was amazingly fast (about 3 times faster than NFS). There may be something similar that is open source, but I don't know of one. EDIT: OpenAFS may work well for building a clustered filesystem over many EC2 instances.
Have you evaluated using NFS? Maybe you could dedicate one instance as an NFS host.
