I regularly need to move large files to and from an EC2 instance connected via Session Manager. File transfers within AWS are fast, as are transfers between local machines and non-AWS assets over our fiber connection.
However, upstream and downstream speeds with EC2 over Session Manager are really slow -- like around 1MB/s. I proxy ssh over Session Manager, which allows me to use regular utilities to move things around. Is this a Session Manager thing, a function of how I'm using it, or something else?
If this is the best I can do, I'll have to deal with it, but I'd love to use a better way if there's one available.
I discovered exactly the same issue when using rsync and other file transfer tools via SSM. Upload speeds to an EC2 instance that were ~15 MB/s when connecting directly (using its public IP, not using SSM) appeared limited to between 300 and 800 KB/s when going via SSM.
I contacted AWS support for clarifications, and their response included:
"After discussing this situation with our SSM service team, they have mentioned that there will be some delay in SCP over Session Manager compared to direct SCP as there are extra hops in communication in SCP via SSM. Apart from the extra hops, there are other limits imposed in this feature which controls the rate of packet transfer and size of packet. These restrictions are placed to prevent misuse on the feature.
Therefore, there is not a way to mitigate this speed limitation you have encountered due to this."
This GitHub issue from 2019 on the aws-ssm-agent repo reports slow performance, which they claimed was resolved, but it seems they do not expect users to handle large file uploads/downloads via SSM.
We are building a reporting app on Laravel that needs to fetch user data from a third-party server that allows 1 request per second.
We need to fetch 100K to 1000K rows based on user and we can fetch max 250 rows per request.
So the restrictions are:
1. We can send 1 request per second
2. 250 rows per request
That means 400 to 4,000 requests (jobs) to fetch one user's data, so loading data for multiple users is very time-consuming and the server gets slow.
So now we are planning to load the data using multiple servers, say 4-10 of them, so that we can send 10 requests per second from 10 servers.
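For a sense of scale, the numbers above can be worked through in a few lines. This is only a rough sketch: it assumes the third-party limit is applied per server (for example per IP address) rather than per account, which the question does not confirm, and the function names are made up for illustration.

```python
import math

ROWS_PER_REQUEST = 250      # limit stated in the question
REQUESTS_PER_SECOND = 1     # limit stated in the question, assumed to apply per server


def requests_needed(rows: int) -> int:
    """Number of API calls needed to fetch `rows` rows."""
    return math.ceil(rows / ROWS_PER_REQUEST)


def fetch_minutes(rows: int, servers: int = 1) -> float:
    """Best-case wall-clock minutes if every server gets its own allowance."""
    seconds = requests_needed(rows) / (REQUESTS_PER_SECOND * servers)
    return seconds / 60


if __name__ == "__main__":
    for rows in (100_000, 1_000_000):
        print(rows, requests_needed(rows),
              round(fetch_minutes(rows, 1), 1),
              round(fetch_minutes(rows, 10), 1))
```

If the limit turns out to be per API key rather than per server, adding servers would not raise the total request rate.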
How can we design the system and process jobs from multiple servers?
Is it possible to use a dedicated server for hosting Redis and connect to that Redis server from multiple servers and execute jobs? Can any conflict/race-condition happen?
Any hint or prior experience related to this would be really helpful.
The short answer is yes, this is absolutely possible and is something I've implemented in production apps many times before.
Redis is just like any other service and can run anywhere, with clients connecting to it from anywhere. It's all up to your configuration of the server to dictate how exactly that happens (adding passwords, configuring spiped, limiting access via the firewall, etc.). I'd recommend reading up on the documentation they have in the Administration section here: https://redis.io/documentation
Also, when you do make the move to a dedicated Redis host, with multiple clients accessing it, you'll likely want to look into having more than just one Redis server running for reliability, high availability, etc. Redis has efficient and easy replication available with a few simple configuration commands, which you can read more about here: https://redis.io/topics/replication
Last thing on Redis, if you do end up implementing a master-slave set up, you may want to look into high availability and auto-failover if your Master instance were to go down. Redis has a really great utility built into the application that can monitor your Master and Slaves, detect when the Master is down, and automatically re-configure your servers to promote one of the slaves to the new master. The utility is called Redis Sentinel, and you can read about that here: https://redis.io/topics/sentinel
For your question about race conditions, it depends on how exactly you write your jobs that are pushed onto the queue. For your use case though, it doesn't sound like this would be too much of an issue, but it really depends on the constraints of the third-party system. Either way, if you are subject to a race condition, you can still implement a solution for it, but would likely need to use something like a Redis Lock (https://redis.io/topics/distlock). Taylor recently added a new feature to the upcoming Laravel version 5.6 that I believe implements a version of the Redis Lock in the scheduler (https://medium.com/#taylorotwell/laravel-5-6-preview-single-server-scheduling-54df8e0e139b). You can look into how that was implemented, and adapt for your use case if you end up needing it.
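If you do end up needing a lock, here is a minimal sketch of the single-instance pattern described on the distlock page linked above: set a key only if it does not exist, with an expiry, and delete it only if it still holds your token. It is written in Python purely for illustration (a Laravel app would do the equivalent in PHP). `client` is assumed to be a redis-py `redis.Redis` connection, and the function names and key name are hypothetical.

```python
import uuid

# Lua script: delete the key only if it still holds our token, so that we
# never release a lock that expired and was then taken by another worker.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def acquire_lock(client, name, ttl_seconds=30):
    """Try to take the lock; return a token on success, None otherwise."""
    token = uuid.uuid4().hex
    # SET name token NX EX ttl: succeeds only when the key does not exist yet.
    if client.set(name, token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_lock(client, name, token):
    """Release the lock only if we still own it; return True if released."""
    return client.eval(RELEASE_SCRIPT, 1, name, token) == 1
```

A worker would call `acquire_lock(client, "lock:user:42")` before fetching that user's rows and skip the job (or retry later) when it gets `None` back. The expiry is there so that a crashed worker cannot hold the lock forever.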
So far I get an average of 700 kilobytes per second for downloads via chrome hitting an ec2 instance in virginia (us-east region). If I download directly from s3 in virginia (us-east region) I get 2 megabytes per second.
I've simplified this way down to simply running apache and reading a file from a mounted ebs volume. Less than one percent of the time I've seen the download hit around 1,800 kilobytes per second.
I also tried nginx, no difference. I also tried running a large instance with 7GB of RAM. I tried allocating 6GB of RAM to the jvm and running tomcat, streaming the files in memory from s3 to avoid the disk. I tried enabling sendfile in apache. None of this helps.
When I run from apache reading from the file system, and use a download manager such as downthemall, I always get 2 megabytes per second when downloading from an ec2 instance in virginia (us-east region). It's as if my apache is configured to only allow 700 kilobytes per second per thread. I don't see any configuration options relating to this though.
What am I missing here? I also benchmarked dropbox downloads as they use ec2 as well, and I noticed I get roughly 700 kilobytes per second there too, which is way slow as well. I imagine they must host their ec2 instances in virginia / us-east region as well, based on the speed. If I use a download manager to download files from dropbox I get 2 megabytes a second as well.
Is this just the case with tcp, where if you are far away from the server you have to split transfers into chunks and download them in parallel to saturate your network connection?
I think your last sentence is right: your 700 KB/s is probably a limitation of a given tcp connection ... maybe a throttle imposed by EC2, or perhaps your ISP, or the browser, or a router along the way -- dunno. Download managers likely split the request over multiple connections (I think this is called "multi-source"), gluing things together in the right order after they arrive. Whether this is the case depends on the software you're using, of course.
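One way to see why a single connection can top out well below the link speed: a TCP connection can only have about one window of data in flight per round trip, so its throughput is capped at roughly window size divided by round-trip time. The sketch below just does that arithmetic. The 64 KiB window and 90 ms round trip are illustrative assumptions, not measurements from the question, and window scaling or throttling along the path can change the picture completely.

```python
import math


def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection's throughput, in bytes per second."""
    return window_bytes / rtt_seconds


def connections_needed(link_bytes_per_s: float, window_bytes: int,
                       rtt_seconds: float) -> int:
    """Parallel connections needed to fill the link under the same bound."""
    per_connection = max_throughput(window_bytes, rtt_seconds)
    return math.ceil(link_bytes_per_s / per_connection)


if __name__ == "__main__":
    window = 64 * 1024   # assumed window size in bytes
    rtt = 0.09           # assumed round-trip time in seconds
    print(round(max_throughput(window, rtt) / 1000), "KB/s per connection")
    print(connections_needed(2_000_000, window, rtt), "connections for 2 MB/s")
```

With those assumed values one connection is capped in the neighbourhood of 700 KB/s, and a handful of parallel connections is enough to reach 2 MB/s, which would be consistent with what the download manager shows.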
Is there a utility for Windows that allows you to test different aspects of file transfer operations across a LAN or a WAN?
Example...
How long does it take to move a file of a known size (500 MB or 1 GB) from Server A (on site) to Server B (on site) or to Server C (off site-Satellite location)?
D-ITG will allow you to test many aspects of your links. It does not necessarily allow you transfer a file directly, but it allows you to control almost all aspects of the transmission of data across the wire.
If all you are interested in is bulk transfer time (and not all the nitty-gritty details) you could just use a basic FTP application and time the transfer.
Probably nothing you've not already figured out. You could get some coarse-grained metrics using a batch file to coordinate:
start monitoring
copy file
stop monitoring
Copy file might just be initiating a file copy between two nodes on the LAN, or it might initiate an FTP copy between two nodes on the WAN.
Monitoring could be as basic as writing the current time to output or file, or it could be as complex as adding performance counter metrics from the network adapter on the two machines.
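The same start/copy/stop idea can be sketched in a few lines of Python instead of a batch file, assuming a Python interpreter is available on the machine doing the copy. The function name and the paths in the usage note are hypothetical.

```python
import shutil
import time
from pathlib import Path


def timed_copy(source: str, destination: str) -> dict:
    """Copy a file and report its size, elapsed time and average rate."""
    size_bytes = Path(source).stat().st_size
    started = time.perf_counter()              # start monitoring
    shutil.copyfile(source, destination)       # copy file
    elapsed = time.perf_counter() - started    # stop monitoring
    rate = size_bytes / elapsed / 1_000_000 if elapsed > 0 else float("inf")
    return {"bytes": size_bytes, "seconds": elapsed, "mb_per_second": rate}
```

For example, `timed_copy(r"C:\temp\test_1gb.bin", r"\\ServerB\share\test_1gb.bin")` would time a copy to a share on another machine. Run it several times and with different files, since caching on either end can flatter the numbers.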
A commercial WAN emulator would also give you the information you're looking for. I've used the Shunra Appliance successfully in the past. It's pretty expensive, so I'd really only recommend it if critical business success is riding on understanding how application behavior could change based on network conditions and it is something you could incorporate into regular testing activities.
I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk on each blade. Each server is running Windows Server 2003 and the files are on a share which can be accessed remotely as \\server\Logs. Each machine holds many files which can be several MB each, and the size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?
I guess it depends what you do with them ... if you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines to parse and load into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)
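As a rough illustration of the "compress locally, then copy" step, here is a small Python sketch that could be run from a scheduled task on each server. It assumes a Python interpreter is installed there (a .cmd file calling a zip tool would do the same job), and the function name and `*.log` pattern are assumptions.

```python
import zipfile
from pathlib import Path


def zip_logs(log_dir: str, archive_path: str, pattern: str = "*.log") -> int:
    """Compress every matching file in log_dir into one zip archive.

    Returns the number of files added.
    """
    files = sorted(Path(log_dir).glob(pattern))
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store only the file name so the archive has no local paths.
            archive.write(path, arcname=path.name)
    return len(files)
```

The resulting archive is what gets copied to the central location, so the slow WAN link only carries the compressed bytes.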
The first improvement that comes to mind is to not ship entire log files, but only the records from after the last shipment. This of course is assuming that the files are being accumulated over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that removes the older records from consideration and dumps the remainder would be sufficient. If there is no such discriminator available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
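A minimal sketch of the "keep track of the last byte sent" variant, in Python. The caller is assumed to store the returned offset somewhere durable (a small state file, say) only after the shipment has succeeded. The function name is hypothetical, and the handling of a file that shrank (treated as rotated and re-read from the start) is an assumption about how the logs behave.

```python
import os


def read_since(log_path: str, offset: int):
    """Return (new_bytes, new_offset) for everything written after `offset`."""
    size = os.path.getsize(log_path)
    if size < offset:
        # The file is shorter than what we already shipped: assume it was
        # rotated or truncated and start again from the beginning.
        offset = 0
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
    return data, offset + len(data)
```

On each run the shipper calls `read_since(path, last_offset)`, sends the returned bytes, and then saves the new offset for next time.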
Either way, the goal is to only ship new content. In our own system logs are shipped via a service that replicates the logs as they are written. That required a small service that handled the log files to be written, but reduced latency in capturing logs and cut bandwidth use immensely.
Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB, and the Windows ftp command can be automated with "-s:scriptfile")
notify you when it cannot push its log for any reason
do all the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant to offset in minutes from midnight?
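The octet idea can be sketched like this. The 5-minute step is an assumption (with 255 as the largest octet it keeps every offset inside one day), and the function names are made up for illustration.

```python
def upload_offset_minutes(ip_address: str, minutes_per_step: int = 5) -> int:
    """Minutes after midnight at which this server should start its upload."""
    last_octet = int(ip_address.split(".")[-1])
    # Wrap around so the result is always a valid minute of the day.
    return (last_octet * minutes_per_step) % (24 * 60)


def upload_time(ip_address: str, minutes_per_step: int = 5) -> str:
    """Same offset formatted as HH:MM for use in a scheduler."""
    hours, minutes = divmod(upload_offset_minutes(ip_address, minutes_per_step), 60)
    return f"{hours:02d}:{minutes:02d}"
```

Servers on different subnets that share a last octet will land on the same slot, so with 500 machines some overlap is unavoidable. The aim is only to spread the load, not to make every slot unique.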
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore or reprocess?)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
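For the "same log file twice" case, one simple approach is to fingerprint each received file and skip any content already seen. The sketch below takes the "ignore" option. It keeps the fingerprints in an in-memory set purely for illustration (a real collector would persist them), and the function names are hypothetical.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large logs do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_process(path: str, seen_digests: set) -> bool:
    """Return True the first time given content arrives, False for repeats."""
    digest = file_digest(path)
    if digest in seen_digests:
        return False
    seen_digests.add(digest)
    return True
```

Because the check is on content rather than file name, a client that retries an upload under a different name is still recognised as a duplicate.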
We have a similar product on a smaller scale here. Our solution is to have the machines generating the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read-write times that kept a server busy for days.
It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is, what is the bottleneck that slows the whole process down?
I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server
Compress them at a particular defined schedule
Pass information to the analysis server.
Write another program which sits on the core server which does the following:
Pulls compressed files when the network/cpu is not too busy.
(This can be multi-threaded.)
This uses the information passed to it from the end computers to determine which log to get next.
Uncompress and upload to your database continuously.
This should give you a solution which provides up to date information, with a minimum of downtime.
The downside will be relatively consistent network/computer use, but tbh that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues which need resolving.
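The multi-threaded pull on the core server could look something like this sketch. `fetch` stands in for whatever actually retrieves one server's compressed logs (an SMB copy, an FTP download, and so on); it is a hypothetical callable supplied by the caller, as are the function name and the default of 8 workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def pull_all(servers, fetch, max_workers=8):
    """Call fetch(server) for every server using a pool of worker threads.

    Returns (results, failures): results maps server -> fetch return value,
    failures maps server -> the exception that fetch raised.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, server): server for server in servers}
        for future in as_completed(futures):
            server = futures[future]
            try:
                results[server] = future.result()
            except Exception as error:  # one bad server must not stop the rest
                failures[server] = error
    return results, failures
```

Threads suit this job because the work is mostly waiting on the network. The failures dictionary gives you the list of servers to retry or investigate.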
NetBIOS copies are not as fast as, say, FTP. The problem is that you don't want an FTP server on each server. If you can't process the log files locally on each server, another solution is to have all the servers upload the log files via FTP to a central location, which you can process from. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program which automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.
We're running a lightweight web app on a single EC2 server instance, which is fine for our needs, but we're wondering about monitoring and restarting it if it goes down.
We have a separate non-Amazon server we'd like to use to monitor the EC2 and start a fresh instance if necessary and shut down the old one. All our user data is on Elastic Storage, so we're not too worried about losing anything.
I was wondering if anyone has any experience of using EC2 in this way, and in particular of automating the process of starting the new instance? We have no problem creating something from scratch, but it seems like it should be a solved problem, so I was wondering if anyone has any tips, links, scripts, tutorials, etc to share.
Thanks.
You should have a look at puppet and its support for AWS. I would also look at the RightScale AWS library as well as this post about starting a server with the RightScale scripts. You may also find this article on web serving with EC2 useful. I have done something similar to this but without the external monitoring, the node monitored itself and shut down when it was no longer needed then a new one would start up later when there was more work to do.
Couple of points:
You MUST MUST MUST back up your Amazon EBS volume.
They claim "better" reliability, but not 100%, and it's SEVERAL orders of magnitude off of S3's "11 9's" of durability. S3 durability >> EBS durability. That's a fact. EBS supports a "snapshots" feature which backs up your storage efficiently and incrementally to S3. Also, with EBS snapshots, you only pay for the compressed deltas, which is typically far far less than the allocated volume size. In another life, I've sent lost-volume emails to smaller customers like you who "thought" that EBS was "durable" and trusted it with the only copy of a mission-critical database... it's heartbreaking.
Your Q: automating start-up of a new instance
The design path you mention is relatively untraveled; here's why... Lots of companies run redundant "hot-spare" instances where the second instance is booted and running. This allows rapid failover (seconds) in the event of "failure" (could be hardware or software). The issue with a "cold-spare" is that it's harder to keep the machine up to date and ready to pick up where the old box left off. More important, it's tricky to VALIDATE that the spare is capable of successfully recovering your production service. Hardware is more reliable than untested software systems. TEST TEST TEST. If you haven't tested your fail-over, it doesn't work.
The simple automation of starting a new EBS-backed instance is easy, bordering on trivial. It's just a one-line bash script calling the EC2 command-line tools. What's tricky is everything on top of that. Such a solution pretty much implies a fully 100% automated deployment process. And this is all specific to your application. Can your app pull down all the data it needs to run (maybe it's stored in S3)? Can you kill your instance today and boot a new instance with 0.000 manual setup/install steps?
Or, you may be talking about a scenario I'll call "re-instancing an EBS volume":
EC2 box dies (root volume is EBS)
Force detach EBS volume
Boot new EC2 instance with the EBS volume
... That mostly works. The gotchas:
Doesn't protect against EBS failures, either total volume loss or an availability loss
Recovery time is O(minutes) assuming everything works just right
Your services need to be configured to restart automatically. It does no good to bring the box back if Nginx isn't running.
Your DNS routes or other services or whatever need to be ok with the IP address changing. This can be worked around with an Elastic IP.
How are your host SSH keys handled? Same name, new host key can break SSH-based automation when it gets the strong-warning for host-key-changed.
I don't have proof of this (other than seeing it happen once), but I believe that EC2/EBS _already_does_this_ automatically for boot-from-EBS instances
Again, the hard part here is on your plate. Can you stop your production service today and bring it up RELIABLY on a new instance? If so, the EC2 part of the story is really really easy.
As a side point:
All our user data is on Elastic Storage, so we're not too worried about losing anything.
I'd strongly suggest regularly snapshotting your EBS (Elastic Block Store) volumes to S3 if you are not doing that already.
You can use an autoscale group with a min/max/desired quantity of 1. Place the instance behind an ELB and have the autoscale group be triggered by the ELB healthy node count. This gives you built-in monitoring by CloudWatch and the ELB health check. Any time there is an issue, the instance will be replaced by the autoscale service.
If you have not checked 'Protect against accidental termination' you might want to do so.
Even if you have disabled 'Detailed Monitoring' for your instance, you should still see the 'StatusCheckFailed' metric for it, on which you can configure an alarm (in the CloudWatch dashboard).
Your application (hosted in a different server) should receive the alarm and start the instance using the AWS API (or CLI)
Since you have protected against accidental termination you would never need to spawn a new instance.
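The "receive the alarm and start the instance" step might look roughly like the sketch below on the monitoring server. `ec2` is assumed to be a boto3 EC2 client (`boto3.client("ec2")`), passed in so the function can be exercised without AWS access. The function name is hypothetical, and the decision to reboot a running instance whose status checks are impaired is an assumption, not something the steps above prescribe.

```python
def recover_instance(ec2, instance_id: str) -> str:
    """Start a stopped instance, or reboot a running one with failed checks."""
    response = ec2.describe_instance_status(
        InstanceIds=[instance_id],
        IncludeAllInstances=True,   # also report instances that are not running
    )
    status = response["InstanceStatuses"][0]
    state = status["InstanceState"]["Name"]

    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])
        return "started"

    impaired = "impaired" in (
        status["InstanceStatus"]["Status"],
        status["SystemStatus"]["Status"],
    )
    if state == "running" and impaired:
        ec2.reboot_instances(InstanceIds=[instance_id])
        return "rebooted"

    return "no action"
```

A reboot will not help if the underlying host is the problem. In that case a stop followed by a start is the usual way to get the instance onto different hardware, and the sketch would need extending to cover it.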