How do you import Big Data public data sets into AWS? - amazon-ec2

Loading any of Amazon's listed public data sets (http://aws.amazon.com/datasets) would take a lot of resources and bandwidth. What's the best way to import them into AWS so you start working with them quickly?

You will need to create a new EBS volume from the snapshot ID of the public dataset. That way you won't need to pay for transfer.
But be careful: some data sets are only available in one region, usually denoted by a note like the one below. In that case you should launch your EC2 instance in the same region.
These datasets are hosted in the us-east-1 region. If you process these from other regions you will be charged data transfer fees.
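As a minimal sketch of the volume-creation step, assuming the boto3 SDK and hypothetical snapshot/instance IDs (the dataset's detail page lists the real snapshot ID for each region):

import boto3

# Hypothetical snapshot ID; look up the real one on the dataset's detail page.
SNAPSHOT_ID = "snap-0123456789abcdef0"

# The client region must match the region hosting the public snapshot.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a volume from the public snapshot, in the same AZ as your instance,
# then attach it so the data is available without any bulk transfer.
volume = ec2.create_volume(SnapshotId=SNAPSHOT_ID, AvailabilityZone="us-east-1a")
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(VolumeId=volume["VolumeId"],
                  InstanceId="i-0123456789abcdef0",  # hypothetical instance ID
                  Device="/dev/sdf")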

FYI: SDBExplorer uses multithreaded BatchPutAttributes calls to achieve high write throughput while uploading bulk data to Amazon SimpleDB. It allows multiple parallel uploads, so if you have the bandwidth, you can take full advantage of it by running a number of BatchPutAttributes processes at once in a parallel queue, which reduces processing time. SDBExplorer also supports importing data from MySQL and CSV into Amazon SimpleDB.
http://www.sdbexplorer.com
Disclosure : I am the developer of SDBExplorer.
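This is not SDBExplorer's actual code, but a rough sketch of the multithreaded BatchPutAttributes idea using the legacy boto SDK (boto3 does not expose SimpleDB); the domain name and worker count are assumptions:

from concurrent.futures import ThreadPoolExecutor
import boto

sdb = boto.connect_sdb()  # legacy boto SDK; reads credentials from the environment
domain = sdb.get_domain("mydomain")  # hypothetical SimpleDB domain name

def put_batch(items):
    # BatchPutAttributes accepts at most 25 items per call.
    domain.batch_put_attributes(items, replace=True)

def bulk_upload(all_items, workers=8):
    # Split the items into 25-item batches and issue the calls in parallel,
    # which is what produces the high aggregate write throughput.
    keys = list(all_items)
    batches = [{k: all_items[k] for k in keys[i:i + 25]}
               for i in range(0, len(keys), 25)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(put_batch, batches))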

Related

Best method to persist data from an AWS Lambda invocation?

I use AWS Simple Email Service (SES) for email. I've configured SES to save incoming email to an S3 bucket, which triggers an AWS Lambda function. This function reads the new object and forwards the object contents to an alternate email address.
I'd like to log some basic info from my AWS Lambda function during invocation: who the email is from, to whom it was sent, whether it contained any links, etc.
Ideally I'd save this info to a database, but since AWS Lambda invocations are costly (relative to other AWS operations), I'd like to do this as efficiently as possible.
I was thinking I could issue an HTTPS GET request to a private endpoint with a query string containing the info I want logged. Since I could fire the request asynchronously at the outset and continue processing, I thought this might be a cheap and efficient approach.
Is this a good method? Are there any alternatives?
My Lambda function fires irregularly, so despite Lambda containers being kept alive for 10 minutes or so after firing, it seems a database connection is likely to be slow and costly, since AWS charges per 100ms of usage.
Since I could conceivably get thousands of emails per month, keeping my Lambda function efficient is paramount for cost. I maintain hundreds of domain names, so my numbers aren't exaggerated. Thanks in advance.
I do not think that thousands of emails per month should be a problem; these cloud services have been developed with scalability in mind and can go way beyond the numbers you are suggesting.
In terms of persisting, I cannot really tell, for lack of logs and metrics, why your DB connection would be slow. From the moment you use AWS, traffic stays on its own internal infrastructure, so speeds will be high and not something you should be worrying about.
I am not an expert on billing, but from what you are describing, it seems like Lambda + S3 + DynamoDB is well suited to your use case.
From the type of data you are describing (email data), it doesn't seem that you would have either a memory issue (Lambdas have memory constraints, which can be a pain) or an I/O bottleneck. If you can share more details on the memory used during invocation and the time taken, that would be great, as well as how much data you store on each Lambda invocation.
I think you could store JSON-ified strings of your email data in DynamoDB easily; it should be pretty seamless and not that costly.
I have not used SES, but you could put a trigger on DynamoDB whenever you store a record, in case you want to follow up with another Lambda.
You could also combine S3 + DynamoDB: when you store a record, simply upload a file containing the record to a new S3 key and update the row in DynamoDB with a pointer to the new S3 object.
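As a minimal sketch of that DynamoDB + S3 combination, assuming the boto3 SDK and a hypothetical email_log table with a messageId partition key (none of these names come from the question):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("email_log")  # hypothetical table name

def log_email(message_id, sender, recipients, bucket, key):
    # Store the lightweight metadata in DynamoDB and keep a pointer to the
    # full message body that SES already wrote to S3.
    table.put_item(Item={
        "messageId": message_id,
        "sender": sender,
        "recipients": recipients,
        "body": {"bucket": bucket, "key": key},
    })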
You can now persist data using AWS EFS.

Amazon Web Services: Spark Streaming or Lambda

I am looking for some high-level guidance on an architecture. I have a provider writing "transactions" to a Kinesis stream (about 1MM/day). I need to pull those transactions off one at a time, validate the data, hit other SOAP or REST services for additional information, apply some business logic, and write the results to S3.
One approach that has been proposed is to use a Spark job that runs forever, pulling data and processing it within the Spark environment. The benefits were enumerated as shareable cached data, availability of SQL, and in-house knowledge of Spark.
My thought was to have a series of Lambda functions process the data. As I understand it, I can have a Lambda watching the Kinesis stream for new data. I want to run the pulled data through a bunch of small steps (Lambdas), each one doing a single step in the process. This seems like an ideal use of Step Functions. As for caches, if any are needed, I thought Redis on ElastiCache could be used.
Can this be done using a combination of Lambda and Step Functions? If so, is it the best approach? What other alternatives should I consider?
This can be achieved using a combination of Lambda and Step Functions. As you described, the Lambda would monitor the stream and kick off a new execution of a state machine, passing the transaction data to it as input. You can find more documentation on using Kinesis with Lambda here: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html.
The state machine would then pass the data from one Lambda function to the next, where it is processed and written to S3. Note that you will need to contact AWS for an increase on the default 2-per-second StartExecution API limit to support 1MM/day.
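As a minimal sketch of the triggering Lambda, assuming a hypothetical state machine ARN (the per-step validation and enrichment then live inside the state machine):

import base64
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine ARN; substitute your own.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:TxnPipeline"

def handler(event, context):
    # The Kinesis event source hands the Lambda a batch of records;
    # start one state-machine execution per transaction.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"transaction": payload}),
        )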
Hope this helps!

Data-aware load balancing with embedded and distributed caches/datagrids

Sorry, I'm a beginner in load balancing.
In distributed environments we tend more and more to send the computation (map/reduce) to the data, so that the result gets computed locally and then aggregated.
What I'd like to do applies to partitioned/distributed data, not replicated data.
Following the same principle, I'd like to be able to send a user request to the server where that user's data is cached.
When using an embedded cache or datagrid to get low response times, and the dataset is large, we tend to avoid replication and use distributed/partitioned caches.
The partitioning algorithms are generally hash-based and permit replicas, to handle server failures.
So in the end, a user's data is generally hosted on something like 3 servers (1 primary copy and 2 replicas).
On a local cache miss, the caches are generally able to search for the entry on other cache peers.
This works fine but requires a network call.
I'd like a load-balancing strategy that avoids this unnecessary network call.
What I'd like to know is: is it possible to have a load balancer that is aware of the cache's partitioning mechanism, so that it always forwards to one of the web servers holding a local copy of the data we need?
For example, say I have a request for www.mywebsite.com/user=387.
The load balancer would check the userId 387 and know that this user's data is stored on servers 1, 6 and 12, so it can round-robin between them (or apply some other strategy).
If there's no generic solution, are there open-source or commercial, software or hardware load balancers that permit defining custom routing strategies?
How much does extracting data from a request slow down the load balancer? What is the cost of extracting a URL parameter (as in my example with user=387) and following some rules to pick the right web server, compared to a plain round-robin strategy, for example?
Is there an abstraction library on top of cache vendors so that the partitioning data can easily be retrieved and made available to the load balancer?
Thanks!
Interesting question. I don't think there is a readily available solution for your requirements, but it would be pretty easy to build if your hashing criterion is relatively simple and depends only on the request (a URL parameter, as in your example).
If I were building this, I would use Varnish (http://varnish-cache.org), but you could do the same in other reverse proxies.
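As a minimal sketch of the routing computation such a proxy would need, assuming a hypothetical 12-node cluster and a simple hash; in practice you must reuse your datagrid's real partitioning function for the targets to line up:

import hashlib

SERVERS = ["web%d" % i for i in range(1, 13)]  # hypothetical 12-node cluster
REPLICAS = 3  # 1 primary copy + 2 replicas, as in the question

def owners(user_id):
    # Hash the routing key the same way the cache partitions its entries,
    # then take the primary and the next REPLICAS-1 nodes as replica holders.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    primary = h % len(SERVERS)
    return [SERVERS[(primary + i) % len(SERVERS)] for i in range(REPLICAS)]

# A request for /user=387 can then be round-robined across just these servers:
print(owners("387"))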

How to make one EC2 instance call another instance?

I have two EC2 instances. I want that, when one finishes a job, it signals the other one to do other stuff.
So how do I make them communicate? I don't want to use cURL because it seems expensive. I think AWS should have some simple solution, but I still can't find relevant help in the documentation.
Also, how do I send data between two instances quickly, without going through SSH? I know SSH can be done, but it seems slow. Once again, is there any tool EC2 provides to do that?
Actually, I need two methods:
1) Instance A tells Instance B to grab the data from Instance A.
This is answered by Adrian: I can use SQS. I will try that.
2) Once Instance B gets the signal, the data (on EBS) in Instance A needs to be transferred to Instance B. The amount of data can be big even if I zip it, around 50 MB. And I need Instance B to get the data fast, so that it has enough time to process it before the next interval comes in.
So I am thinking of one of these methods:
a) Instance A dumps the data from the DB and uploads it to S3, then signals Instance B. Instance B gets the data from S3.
b) Instance A dumps the data from the DB, then signals Instance B. Instance B establishes an SSH (or other) connection to Instance A and grabs the data.
The data may need to be stored permanently, but that is not a concern at this moment; it is mainly for Instance B to process.
This is a simple scenario. I'm also thinking about what the proper approach would be if I scale out to multiple instances. :)
Thanks.
Amazon has a special service for this: it's called SQS, and it allows instances to send messages to each other through queues. There are SDKs for SQS in various languages, like Java and PHP. This should serve your signaling needs.
For actually sending the bulky data over, it's best to use S3 (and send the object key in the SQS message). You're right that you're introducing latency by adding the extra middleman, but you'll find that S3 is very fast from EC2 instances (if they are in the same region, that is), and, more importantly than performance, S3 is very reliable. If you try to manage the transfer yourself through SSH, you'll have to work out a lot of the error checking and retry logic that S3 handles for you. You can use S3FS to easily read from and write to S3 from EC2.
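As a minimal sketch of that pattern with boto3 (the bucket and queue names here are hypothetical):

import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "my-transfer-bucket"  # hypothetical bucket name
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/instance-signals"

# Instance A: upload the dump, then signal B with the object key.
def send(dump_path, key):
    s3.upload_file(dump_path, BUCKET, key)
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"bucket": BUCKET, "key": key}))

# Instance B: long-poll the queue, fetch the object, delete the message.
def receive(local_path):
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        s3.download_file(body["bucket"], body["key"], local_path)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])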
Edited to address your updated question.
You may want to look at SNS... which is kind of like push SQS.
How fast do you need this communication to be? SSH is pretty darn fast. The only thing that I can think of that might be faster is raw sockets (from within whatever program is running the jobs).
You could use a distributed workflow management service.
If Instance B has already completed its task, it can go on to pick another one. Usually, you would want Instance B to signal that it has "picked up" a task and is working on it; other instances should then try to pick up other tasks on your list. You need a central service that knows which tasks have already been picked up and which ones are left for grabs.
When Instance B completes a task successfully, it should signal the central service that it is free for a new task, and pick one up if anything is left.
If it fails to complete the task, the central service should be able to detect this (via heartbeats and timeouts you define) and put the task back on the list so that some other instance can pick it up.
Amazon SWF is the central service which will provide you with all of this.
For the data required by each instance, you should put it in a central store like S3, and configure the S3 paths in such a way that each task knows where to download its data from, without having to sync up.
e.g. data for task 1 could be placed in something like s3://my-bucket/task1
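As a minimal sketch of the worker side with boto3 (the domain and task-list names are hypothetical, and the decider that schedules activities is omitted):

import boto3

swf = boto3.client("swf")

DOMAIN = "my-domain"                  # hypothetical SWF domain
TASK_LIST = {"name": "worker-tasks"}  # hypothetical task list

# Each worker long-polls for the next activity task; SWF is the central
# service tracking which tasks are picked up, completed, or timed out.
task = swf.poll_for_activity_task(domain=DOMAIN, taskList=TASK_LIST)
if task.get("taskToken"):
    # ... do the work, e.g. download input from the S3 path in task["input"] ...
    swf.respond_activity_task_completed(taskToken=task["taskToken"], result="done")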

Azure scalability over XML File

What is the best-practice solution for programmatically changing the XML file where the number of instances is defined? I know this is somehow possible with csmanage.exe for the Windows Azure API.
How can I measure which Worker Role VMs are actually working? I asked this question on the MSDN Community forums as well: http://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/02ae7321-11df-45a7-95d1-bfea402c5db1
To modify the configuration, you might want to look at the PowerShell Azure Cmdlets. This really simplifies the task. For instance, here's a PowerShell snippet to increase the instance count of 'WebRole1' in Production by 1:
$cert = Get-Item cert:\CurrentUser\My\<YourCertThumbprint>
$sub = "<YourAzureSubscriptionId>"
$servicename = '<YourAzureServiceName>'
Get-HostedService $servicename -Certificate $cert -SubscriptionId $sub |
Get-Deployment -Slot Production |
Set-DeploymentConfiguration {$_.RolesConfiguration["WebRole1"].InstanceCount += 1}
Now, as far as actually monitoring system load and throughput: You'll need a combination of Azure API calls and performance counter data. For instance: you can request the number of messages currently in an Azure Queue:
http://yourstorageaccount.queue.core.windows.net/myqueue?comp=metadata
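For example, with the Python storage SDK this is a one-liner (a sketch, assuming a hypothetical connection string; the raw REST call above would need a signed Authorization header):

from azure.storage.queue import QueueClient

# The ?comp=metadata call returns x-ms-approximate-messages-count,
# which the SDK surfaces as a queue property.
queue = QueueClient.from_connection_string("<YourConnectionString>", "myqueue")
props = queue.get_queue_properties()
print(props.approximate_message_count)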
You can also set up your role to capture specific performance counters. For example:
public override bool OnStart()
{
    var diagObj = DiagnosticMonitor.GetDefaultInitialConfiguration();
    AddPerfCounter(diagObj, @"\Processor(*)\% Processor Time", 60.0);
    AddPerfCounter(diagObj, @"\ASP.NET Applications(*)\Request Execution Time", 60.0);
    AddPerfCounter(diagObj, @"\ASP.NET Applications(*)\Requests Executing", 60.0);
    AddPerfCounter(diagObj, @"\ASP.NET Applications(*)\Requests/Sec", 60.0);
    // Set the service to transfer logs every minute to the storage account
    diagObj.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1.0);
    // Start the Diagnostics Monitor with the new storage account configuration
    DiagnosticMonitor.Start("DiagnosticsConnectionString", diagObj);
    return base.OnStart();
}

// Small helper (assumed; not shown in the original answer) that registers
// one performance counter with the given sample rate in seconds.
private static void AddPerfCounter(DiagnosticMonitorConfiguration config, string counterName, double seconds)
{
    config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
    {
        CounterSpecifier = counterName,
        SampleRate = TimeSpan.FromSeconds(seconds)
    });
}
So this code captures a few performance counters into local storage on each role instance, then every minute those values are transferred to table storage.
The trick, now, is to retrieve those values, parse them, evaluate them, and then tweak your role instances accordingly. The Azure API will let you easily pull the perf counters from table storage. However, parsing and evaluating will take some time to build out.
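As a rough sketch of the retrieval step, using the modern Python tables SDK (an assumption; any Table storage client works), given that Azure Diagnostics writes counter samples to the WADPerformanceCountersTable table:

from azure.data.tables import TableClient

# Hypothetical connection string; the table name is the standard one
# that Azure Diagnostics uses for performance counters.
table = TableClient.from_connection_string("<YourConnectionString>",
                                           "WADPerformanceCountersTable")
for row in table.list_entities():
    # Each entity carries the counter name, the sampled value, and the
    # role instance that produced it.
    print(row["RoleInstance"], row["CounterName"], row["CounterValue"])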
Which leads me to my suggestion that you look at the Azure Dynamic Scaling Example on the MSDN code site. This is a great sample that provides:
A demo line-of-business app hosting a WCF service
A load-generation tool that pushes messages to the service at a rate you specify
A load-monitoring web UI
A scaling engine that can either be run locally or in an Azure role.
It's that last item you want to take a careful look at. The engine compares your performance counter data, as well as queue-length data, against configured thresholds, and based on those comparisons it scales your instances up or down accordingly.
Even if you end up not using this engine, you can see how data is grabbed from table storage, massaged, and used for driving instance changes.
Quantifying the load is actually very application specific - particularly when thinking through the Worker Roles. For example, if you are doing a large parallel processing application, the expected/hoped for behavior would be 100% CPU utilization across the board and the 'scale decision' may be based on whether or not the work queue is growing or shrinking.
Further complicating the decision is the lag time for the various steps - increasing the Role Instance Count, joining the Load Balancer, and/or dropping from the load balancer. It is very easy to get into a situation where you are "chasing" the curve, constantly churning up and down.
As to your question about specific VMs: since all VMs in a Role definition are identical, measuring a single VM (unless the deployment starts with a VM count of 1) should not really tell you much; all VMs sit behind a load balancer and/or pull from the same queue, so any variance should be transitory.
My recommendation would be to avoid monitoring anything inherently highly variable (e.g. CPU). Generally, you want to find a trending point: for web apps it may be the response queue, for parallel apps it may be Azure queue depth, etc., but in either case watch the trend, not the absolute number. I would also suggest measuring at fairly broad intervals: minutes, not seconds. If you have a load you need to respond to within seconds, then realistically you will need to increase your running instance count ahead of time.
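To make the trend idea concrete, here is a toy sketch (the function name and five-minute window are assumptions, not from the answer):

def should_scale_up(samples, window=5):
    # samples: metric readings (e.g. queue depth) taken once per minute,
    # oldest first. Scale on sustained growth over the window, not on a
    # single absolute reading, to avoid "chasing the curve".
    recent = samples[-window:]
    return len(recent) == window and all(b > a for a, b in zip(recent, recent[1:]))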
With regard to your first question, you can also use the Autoscaling Application Block to dynamically change instance counts based on a set of predefined rules.
