Situation:
To replace a 10+ year old Windows 2000 two-node cluster (shared MSA SCSI storage) with a newer Windows 2003 two-node cluster (shared FC storage).
The shared storage is currently split into two drives: X (data) and Q (quorum).
The X drive holds a flat-file DB of 13 million+ files in 1.3 million+ folders. These files need to be copied from the old cluster to the new cluster with minimal downtime.
File Count: 13,023,328
Total Filesize: 8.43 GB (File Size not Size on Disk)
Folder Count: 1,308,153
The old Win 2000 cluster has been up for over 10 years, continually reading and writing, and is now heavily fragmented. The X drive on the Win 2000 cluster also contains 7 backups of the DB, which are created/updated via Robocopy once per day; this currently takes 4-5 hours and adds real lag to system performance.
Old Cluster
- 2 x HP DL380 G4
- 1 x HP MSA 500 G2 (SCSI) | RAID 5 (4 disks + spare) | Win 2k
New Cluster
- 2 x HP DL380 G7
- 1 x HP StorageWorks P2000 G2 MSA (Fibre Channel) | Win 2k3
The database can be offline for 5 to 8 hours comfortably, and 15 hours absolute maximum, due to the time-sensitive data it provides.
Options We've Tried:
Robocopy / FastCopy: both seemed to sit at around 100-300 files copied per second, with the database offline.
PeerSync copy from a local node backup (D: drive): this completed in 17 hours at an average of 250 files per second.
Question/Options:
Block by Block Copy - We think this might be the fastest, but it will also copy the backups from the original X drive.
Redirect Daily Backup - Redirect the daily backup from the local X drive to a network share on the new X drive. Slow to begin with, but the copy would then be at most 12 hours out of date when we come to switch over, as it can run while the old system is live. A final sync on the move day, to confirm the old and new systems are 100% identical, should take no more than 10 hours.
Custom Copy Script - We have access to C# and Python (a rough sketch of the idea follows this list).
Robocopy / FastCopy / other file copy tools - open to suggestions and settings.
Disk Replace / RAID Rebuild - The risky (or impossible) option: replace each of the older disks with a new smaller form factor disk in an old G2 caddy, allow the RAID to rebuild, and repeat until all drives are replaced. On the day of migration, move the 4 disks to the new P2000 MSA, in the same RAID order?
Give Up - And leave it running on the old hardware until it dies a fiery death.*
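For completeness, the custom copy script option (option 3 above) would look something like the rough Python sketch below. The paths, share name and worker count are made up and we have not benchmarked this; it is only meant to show the shape of the idea (a threaded, incremental tree copy).

import os
import shutil
from concurrent.futures import ThreadPoolExecutor

SRC = r"X:\db"                    # hypothetical source root on the old cluster
DST = r"\\NEWCLUSTER\X$\db"       # hypothetical share on the new cluster
WORKERS = 8                       # tune against what the old MSA can tolerate

def needs_copy(src, dst):
    # Copy if the destination is missing or older than the source.
    try:
        return os.path.getmtime(src) > os.path.getmtime(dst)
    except OSError:
        return True

def copy_one(rel_path):
    src = os.path.join(SRC, rel_path)
    dst = os.path.join(DST, rel_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if needs_copy(src, dst):
        shutil.copy2(src, dst)    # copy2 keeps timestamps for later diff runs

def walk_relative(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            yield os.path.relpath(os.path.join(dirpath, name), root)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for _ in pool.map(copy_one, walk_relative(SRC)):
            pass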
We seem to be gravitating to Option 2, but thought we should put this to some of the best minds in the world before committing.
PS: Backups on the new cluster go to a new (M) drive using Shadow Copy.
* Unfortunately not a real option, as we do need to move to the newer hardware; the old storage and cluster can no longer cope with demand.
We went with Option 2, and redirected the twice-daily backup from the original cluster to the new MSA RAID on the new cluster.
It was run as a pull from the new cluster using PeerSync and a Windows share on the old cluster.
We tried to use the PeerSync TCP client, which would have been faster and more efficient, but it wasn't compatible with Windows 2000. PeerSync was chosen over most other copy tools due to its compatibility and non-locking file operations, allowing the original cluster to stay online throughout with minimal performance impact.
The initial copy took around 13.5 hours, and the incremental diff copies took around 5.5 hours each. The major limiting factor was the original cluster's shared MSA RAID set: the drives were online and being accessed during the backups, so normal operation slowed the backup times.
The final sync took about 5 hours, which was the total time the database was offline for the hardware upgrade.
Related
The documentation says:
It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.
If I have as many tablets as data disks (for instance 3 tablet servers with 10 disks per node, so I split the table into 30 partitions), will Kudu be smart enough to put one tablet per disk, or am I actually limiting performance?
I wonder, in theory (assuming a very big table), which would be best:
3 partitions (1 per tablet server)
30 partitions (1 per disk)
more than 30 (because my table is really big)
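For reference, the tablet count is set when the table is created, so these options really come down to the partitioning declared up front. A minimal sketch with the Kudu Python client for the 30-partition case (table name, column names and master address are made up):

import kudu
from kudu.client import Partitioning

# Connect to the Kudu master (host/port are placeholders).
client = kudu.connect(host='kudu-master.example.com', port=7051)

# A trivial two-column schema, just to illustrate.
builder = kudu.schema_builder()
builder.add_column('key').type(kudu.int64).nullable(False).primary_key()
builder.add_column('value').type(kudu.string)
schema = builder.build()

# 30 hash buckets = 30 tablets (option 2: one per data disk).
partitioning = Partitioning().add_hash_partitions(column_names=['key'], num_buckets=30)

client.create_table('hypothetical_big_table', schema, partitioning)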
I can answer to the best of my knowledge. We have 24 tablet servers but we created tables with more than 400 tablets (partitions) - that is the 3rd option in your list. We talked to the Kudu dev group a couple of times and were told that scans will be better if disk rowsets are evenly distributed and are not more than a few gigabytes per tablet (partition).
We have seen that table queries/writes performed better when we distributed the big table across many tablets rather than restricting it to a number of tablets equal to the number of tablet servers.
Things that impacted us:
Table writes were better when we did asynchronous writes and checked the write status when we flushed the data (a minimal sketch of this pattern follows at the end of this answer).
Table scans were better when the tablets held smaller, evenly sized amounts of data.
There is a lot of disk I/O when the amount of data in each tablet is very high, which caused read and write issues. We saw Kudu RPC issues and queue backups when there were too few tablets.
We also had a table with more than 15 GB of data in one tablet; queries against it were so bad that we had to redistribute the data.
Our experience was that it's always better to have more tablets with an even distribution (evenly distributed, compacted disk rowsets) and under 10 GB per tablet. Make sure compaction is running well and not hitting issues.
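A minimal sketch of the write pattern mentioned above (buffer writes, flush in batches, then check write status), using the Kudu Python client; the master address, table and columns are placeholders, and the flush-mode naming should be checked against your client version:

import kudu

# Connect and open the table (host and table name are placeholders).
client = kudu.connect(host='kudu-master.example.com', port=7051)
table = client.table('hypothetical_big_table')

# Manual flush: apply() only buffers the operation, nothing is sent yet.
session = client.new_session(flush_mode='manual')

for i in range(100000):
    session.apply(table.new_insert({'key': i, 'value': 'row-%d' % i}))
    if i % 1000 == 999:
        # Send the batch, then check for per-row write errors.
        try:
            session.flush()
        except kudu.KuduBadStatus:
            print(session.get_pending_errors())

# Flush whatever is still buffered.
try:
    session.flush()
except kudu.KuduBadStatus:
    print(session.get_pending_errors())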
I'm trying to find a way to run a batch script on Windows that backs up my project directory to our local network file share server.
Example of what I would usually run:
robocopy /mir "C:\PROJECT_FOLDER_PATH" "\\NETWORK_FOLDER_PATH"
But, every now and then, my IT admin approaches me about a massive copy operation that is slowing down the network.
As my projects folder grows over time, this becomes more of an annoyance. I try to run the script only while signing off later in the day to minimize the number of people affected in the office, but I want to come up with a better solution.
I've written a script that uses 7-Zip to create an archive and split it into volumes of 250 MB. So now I have a folder that just contains several smaller files and no folders to worry about. But if I batch-copy all of these to the server, I'm concerned I'll still run into the same problem.
So my initial idea was to copy one file at a time every 5-10 seconds rather than all at once, but I would only want the script to run once. I know I could write a loop and rely on robocopy's /mir switch to skip files that have already been backed up, but I don't want to have to monitor the script once I start it.
I want to run the script when I'm ready to do a backup and then have it copy the files up to the network at intervals to avoid overtaxing our small network.
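Something like this Python sketch is what I have in mind: push each archive volume to the share with a pause in between, skipping volumes that are already there (the paths and the pause length are placeholders):

import os
import shutil
import time

SRC = r"C:\PROJECT_BACKUP"               # folder holding archive.7z.001, .002, ...
DST = r"\\NETWORK_FOLDER_PATH\backup"
PAUSE_SECONDS = 10                       # breathing room for the network

for name in sorted(os.listdir(SRC)):
    src = os.path.join(SRC, name)
    dst = os.path.join(DST, name)
    # Skip volumes that already exist with the same size (crude resume).
    if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
        continue
    shutil.copy2(src, dst)
    time.sleep(PAUSE_SECONDS)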
Robocopy has a special option to throttle data traffic while copying.
/ipg:n - Specifies the inter-packet gap to free bandwidth on slow lines.
The number n is the number of milliseconds for Robocopy to wait after each block of 64 KB.
The higher the number, the slower Robocopy gets, but also: the less likely you will run into a conflict with your IT admin.
Example:
robocopy /mir /ipg:50 "C:\PROJECT_FOLDER_PATH" "\\NETWORK_FOLDER_PATH"
On a file of 1 GB (about 16,000 blocks of 64 KB each), this will increase the time it takes to copy the file by 800 seconds (16,000 x 50 ms).
Suppose it normally takes 80 seconds to copy this file; this might well be the case on a 100 Mbit connection.
Then the total time becomes 80 + 800 = 880 seconds (almost 15 minutes).
The bandwidth used is 8000 Mbit / 880 sec = 9.1 Mbit/s.
This leaves more than 90 Mbit/s of bandwidth for other processes to use.
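If you want to try other /ipg values, the same arithmetic is easy to put in a few lines of Python (a rough estimate using the same round numbers as above, ignoring protocol overhead):

# Rough estimate of robocopy throughput for a given /ipg value,
# using ~16,000 64 KB blocks and ~8,000 Mbit per gigabyte as above.
def throttled_rate_mbit(link_mbit, ipg_ms, file_gb=1.0):
    blocks = file_gb * 16000                       # 64 KB blocks in the file
    base_seconds = file_gb * 8000 / link_mbit      # unthrottled transfer time
    total_seconds = base_seconds + blocks * ipg_ms / 1000.0
    return file_gb * 8000 / total_seconds

print(throttled_rate_mbit(link_mbit=100, ipg_ms=50))   # ~9.1 Mbit/s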
Other options you may find useful:
/rh:hhmm-hhmm - Specifies run times when new copies may be started.
/pf - Checks run times on a per-file (not per-pass) basis.
Source:
https://technet.microsoft.com/en-us/library/cc733145(v=ws.11).aspx
http://www.zeda.nl/index.php/en/copy-files-on-slow-links
http://windowsitpro.com/windows-server/robocopy-over-network
I have a file at location "A" which will be downloaded by multiple clients via FTP. The clients can access the file at the same time. The host server (where the file is stored) is a Solaris server with a 100BT link. The clients support up to 1 Gbps. The size of the file is nearly ~700 mb.
When 5 to 6 clients downloaded the file, the download took around 20 minutes. But when the number of clients was increased to ~40, the download took more than an hour.
My question is: when the number of clients increases, will it have an impact on download speed? If yes, what factors are responsible for this impact? Please clarify...
This question would be better asked on Super User because it is not about programming.
But if your server has a 100 BT link, it can support about 10 MB / sec. Distribute this over 5 clients and each gets 2 MB/sec. Use 40 clients and each gets 250 KB/sec. Of course it gets slower the more clients you have.
Imagine a load of sections of pipe of varying thicknesses joined together with your server at one end and your client(s) at the other. The pieces of pipe here are:
the disk where your file is stored on the server
the CPU and memory bandwidth on your server
the network connection from your server (and all switches and hubs on the way)
the CPU and memory bandwidth on your client
the disk where the file will be saved on your client
Basically, the transfer is going to go as fast as the thinnest piece of pipe allows data to flow through it. As a rough guide, the performance of each will be roughly:
60-150 MBytes/s
several GBytes/s
100 Mbits/s or around 10-12 MBytes/s
several GBytes/s
60-150 MBytes/s
As you can see, the server's 100Mb/s network interface is the biggest bottleneck by a massive factor (5-15x). Also, you say your file is 700mb (millibits), but I suspect you mean 700MB (megabytes). So, if your server's network interface is only 100 Mb/s (or 10MB/s) the 700MB file is going to take at least 70s to pass through the network and it will need to do so once for each client, so 5 clients are going to take at least 350s assuming no overheads.
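If it helps, the same back-of-the-envelope arithmetic as a few lines of Python (assuming the ~10 MB/s effective rate on the server's link used above, and that the server link is the only bottleneck):

def total_transfer_seconds(file_mb, effective_mb_per_sec, clients):
    # The data has to cross the server's uplink once per client.
    return file_mb * clients / effective_mb_per_sec

print(total_transfer_seconds(700, 10, 5))    # 350 s for 5 clients
print(total_transfer_seconds(700, 10, 40))   # 2800 s (~47 minutes) for 40 clients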
Short answer:
try compressing the file,
or go on eBay to get a faster network interface for the server,
or distribute from the server to one (or more) of your 1 Gb/s clients and then from there to the other clients.
I have a server with 2 hard disks in it, each 400 GB. One hosts the SQL Server files; the other is used purely for backups. The backup disk fills up from time to time and I have to go in and delete old backups. I am no DBA, so I am still trying to figure out a way to delete old backups automatically.
Can SQL Server perform slowly if the backup disk is almost full and has less than 100 MB space left on it; even though it doesn't have the database files on it only backups?
The first disk which holds the database files is never full.
Thanks
You have a performance problem ('SQL Server performs poorly'), so investigate it the way a performance problem should be investigated. Follow a methodology like Waits and Queues. Follow the Performance Troubleshooting Flow Chart. Stop making guesses and taking random actions. Measure.
What gives the best performance for running PostgreSQL on EC2? EBS in RAID? PGDATA on /mnt?
Do you have any preferences or experiences? The main "plus" for running PostgreSQL on EBS is being able to switch from one instance to another. Could this be the reason it is slower than using the /mnt partition?
PS: I'm running PostgreSQL 8.4 with a data size of about 50 GB on an Amazon EC2 xlarge (64) instance.
There is some linked info here. The main takeaway is this post from Bryan Murphy:
Been running a very busy 170+ GB OLTP Postgres database on Amazon for 1.5 years now. I can't say I'm "happy" but I've made it work and still prefer it to running downtown to a colo at 3am when something goes wrong.
There are two main things to be wary of:
1) Physical I/O is not very good, which is why that first system used a RAID0.
Let's be clear here, physical I/O is at times terrible. :)
If you have a larger database, the EBS volumes are going to become a real bottleneck. Our primary database needs 8 EBS volumes in a RAID drive and we use Slony to offload requests to two slave machines and it still can't really keep up. There's no way we could run this database on a single EBS volume.
I also recommend you use RAID10, not RAID0. EBS volumes fail. More frequently, single volumes will experience very long periods of poor performance. The more drives you have in your RAID, the more you'll smooth things out. However, there have been occasions where we've had to swap out a poorly performing volume for a new one and rebuild the RAID to get things back up to speed. You can't do that with a RAID0 array.
2) Reliability of EBS is terrible by database standards; I commented on this a bit already at http://archives.postgresql.org/pgsql-general/2009-06/msg00762.php. The end result is that you must be careful about how you back your data up, with a continuous streaming backup via WAL shipping being the recommended approach. I wouldn't deploy into this environment in a situation where losing a minute or two of transactions in the case of an EC2/EBS failure would be unacceptable, because that's something that's a bit more likely to happen here than on most database hardware.
Agreed. We have three WAL-shipped spares. One streams our WAL files to a single EBS volume which we use for worst-case-scenario snapshot backups. The other two are exact replicas of our primary database (one in the west coast data center, and the other in an east coast data center) which we have for failover.
If we ever have to worst-case-scenario restore from one of our EBS snapshots, we're down for six hours because we'll have to stream the data from our EBS snapshot back over to an EBS RAID array. 170 GB at 20 MB/sec (if you're lucky) takes a LONG time. It takes 30 to 60 minutes for one of those snapshots to become "usable" once we create a drive from it, and then we still have to bring up the database and wait an agonizingly long time for hot data to stream back into memory.
We had to fail over to one of our spares twice in the last 1.5 years. Not fun. Both times were due to instance failure.
It's possible to run a larger database on EC2, but it takes a lot of work, careful planning and a thick skin.
Bryan