I have a complex legacy data migration problem: MS Access data going into MySQL. I'm using a Rake task. There's a lot of data and it requires a lot of transforming and examining. The Rake task is hundreds of lines of code spread across about 12 files. The whole thing takes about two hours to run. It has to run on Windows (I'm using an XP VMware VM hosted on an OS X Leopard system) because the Ruby libraries that can talk to MS Access only work on Windows.
I'm finding that sometimes, not every time, I'll start the task, come back later, and find it stalled. No error message. I put numerous print statements in it, so I should see lots of reporting going by, but there's just the last thing it did, sitting there: "11% done" or whatever.
When I hit Ctrl-C, instead of dropping back to the command prompt, the task starts up again where it left off and the reported output starts going by again.
I'm sorry for the abstract question, but I'm hoping someone might have an idea or notion of what's happening. Maybe suggestions for troubleshooting.
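One low-tech troubleshooting option is a watchdog wrapper that at least timestamps when the output goes quiet. A rough sketch in Python (the task name, the threshold, and the use of Python at all are just placeholders, not part of the original setup):

    # Sketch of a watchdog wrapper: run the Rake task as a child process and
    # note when its output has gone quiet, so a stall is at least timestamped.
    # Task name and 5-minute threshold are placeholders. Note that piping may
    # cause the child to buffer its output; setting "$stdout.sync = true" in
    # the Rake task itself avoids that.
    import subprocess
    import sys
    import threading
    import time

    STALL_SECONDS = 300          # assumed: 5 minutes of silence counts as a stall
    last_output = time.time()

    def monitor():
        while True:
            time.sleep(30)
            quiet = time.time() - last_output
            if quiet > STALL_SECONDS:
                sys.stderr.write("no output for %d seconds -- possible stall\n" % quiet)

    proc = subprocess.Popen(
        "rake migrate:legacy",   # hypothetical task name; shell=True so rake.bat resolves on Windows
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    threading.Thread(target=monitor, daemon=True).start()

    for line in proc.stdout:     # echo the task's output and refresh the timestamp
        last_output = time.time()
        sys.stdout.write(line)

    sys.exit(proc.wait())

That doesn't fix anything by itself, but it tells you exactly when and where the task went silent.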
Well, if the Access side seems to be freezing, consider shoving the data into MySQL first and see if that eliminates the problem. In other words, the data has to go over eventually, so you might as well move it into that system from the get-go. There are a number of utilities around that let you move the data into MySQL (or just export the Access data to CSV files).
That way you're not doing data transformations during the transfer while you move it into MySQL (so there's no labor or programming cost here; just transfer the data).
Once you have the data in MySQL, your code is doing the migration (transformation) of data from one MySQL database (or table) to another. And you're out of the VM environment and running things natively (faster performance, and likely more stable).
So I would vote to get your data over into MySQL; then you're down to a single platform.
The fewer systems involved, the less chance of problems.
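For the plain transfer step, a rough sketch of one way to do it from the Windows side (Python with pyodbc; the driver string, paths, and table names below are placeholders, not from the original setup):

    # Sketch: dump MS Access tables to CSV on the Windows side, then bulk-load
    # the CSVs into MySQL with no transformation yet.
    import csv
    import pyodbc

    ACCESS_FILE = r"C:\data\legacy.mdb"           # hypothetical path
    TABLES = ["customers", "orders", "products"]  # hypothetical table names

    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=" + ACCESS_FILE
    )
    cur = conn.cursor()

    for table in TABLES:
        cur.execute(f"SELECT * FROM [{table}]")
        with open(f"{table}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur.fetchall())

    conn.close()

    # Then on the MySQL side (raw transfer only, transformations come later):
    #   LOAD DATA LOCAL INFILE 'customers.csv' INTO TABLE customers
    #   FIELDS TERMINATED BY ',' ENCLOSED BY '"' IGNORE 1 LINES;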
We have a product with a server running as a Windows service that uses a very old database called QDBM. It is quite fast, but as soon as our service crashes (or similar), the database ends up corrupted.
We are now searching for a replacement. We don't need a complex database, but it should be possible to deliver it with our software and install everything automatically. It should be able to store several million keys and corresponding values (from 50 bytes up to several hundred bytes per entry) and still be performant.
Some of the information we always hold in memory, so it would be great if we could retrieve it quickly. Up to 2000 clients can send data to us once per second, which means we need to update (most of the time the same) 2000 entries in the database each second.
We don't need transactions, and it is OK if some of the latest information is lost, as we rediscover the latest data in the background and after a while this gets repaired.
I already tried LevelDB, which is OK from a performance perspective, but it seems to hang after some hours, maybe due to internal reorganization; this has already been reported by others.
MongoDB was 50 times slower, and I don't know why. My test just drops the database, creates 100k entries, and reads them back, which takes 21 seconds, while the same took 0.4 seconds with LevelDB.
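Roughly, the test does the following (sketched here in Python with the plyvel LevelDB binding just to illustrate; the real test differs in details, and the 200-byte value size is a placeholder):

    # Rough sketch of the benchmark: drop the database, write 100k entries,
    # read them back, and time both phases.
    import shutil
    import time

    import plyvel

    DB_PATH = "bench_db"
    N = 100_000
    VALUE = b"x" * 200            # payload in the 50..several-hundred-byte range

    shutil.rmtree(DB_PATH, ignore_errors=True)       # "drops the database"
    db = plyvel.DB(DB_PATH, create_if_missing=True)

    t0 = time.time()
    for i in range(N):
        db.put(b"key-%d" % i, VALUE)
    write_s = time.time() - t0

    t0 = time.time()
    for i in range(N):
        assert db.get(b"key-%d" % i) == VALUE
    read_s = time.time() - t0

    db.close()
    print(f"write: {write_s:.2f}s  read: {read_s:.2f}s")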
Redis sounds like a good solution, but I have trouble finding a free, still-supported Windows version (Memurai seems to cost something). Most people propose installing it under Windows' Linux shell (WSL), but that sounds more like a solution for a single database server, not for software that has to be delivered to and run on hundreds of computers.
As I said, maybe a simpler solution would be sufficient for us.
BTW: It would be nice if the database could be synced to another computer (we have a failover service), but that's not a must.
It would be really great if you have more proposals.
Sorry if this question is a bit vague. I don't know the right technical terms.
Basically, in my research group we use a shared Windows machine with a lot of RAM to run models, using Remote Desktop to access it from our own computers.
It would be great if we could build a queue so that we get the most use out of the machine, especially if we could rearrange the order once it is up and running. Often someone will want to do, say, 50 runs of a 2-hour model, while someone else will just want to run once and check the results immediately, so they should get priority; but it's a pain stopping and restarting large sets of runs.
We run the models via the command line. Any ideas?
You could store the total time each user has spent on the machine, and it would also be a good feature to let users estimate how long they intend to use it. The queue could be built based on these data and, if possible, once 110% of the estimated time has elapsed, the current user is automatically kicked out and the next one is allowed to use the machine. I think you could implement a very basic system without too much effort. Once all of you see it in use, you will have ideas about the optimal direction the project should take.
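A very basic version could be as small as the sketch below: a single runner on the shared machine that executes queued command lines one at a time, lowest priority number first (the folder layout and file format are just placeholders; the time accounting above could be layered on top):

    # Minimal single-machine job queue sketch. Users drop one-line job files
    # (priority<TAB>command) into a shared folder; this runner executes them
    # one at a time, lowest priority number first, so the order can be
    # rearranged by editing priorities while jobs wait.
    import os
    import subprocess
    import time

    QUEUE_DIR = r"C:\model_queue\pending"   # hypothetical shared folder
    DONE_DIR = r"C:\model_queue\done"

    def next_job():
        jobs = []
        for name in os.listdir(QUEUE_DIR):
            path = os.path.join(QUEUE_DIR, name)
            with open(path) as f:
                priority, command = f.readline().rstrip("\n").split("\t", 1)
            jobs.append((int(priority), path, command))
        return min(jobs) if jobs else None   # lowest priority number wins

    while True:
        job = next_job()
        if job is None:
            time.sleep(10)                   # nothing queued; poll again
            continue
        priority, path, command = job
        print(f"running (priority {priority}): {command}")
        subprocess.run(command, shell=True)  # run the model's command line
        os.makedirs(DONE_DIR, exist_ok=True)
        os.replace(path, os.path.join(DONE_DIR, os.path.basename(path)))

Someone with a single quick run would submit with priority 0 and jump ahead of a batch of 50 without anyone stopping and restarting anything.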
Let me start by saying that I think there is a better way of doing things than I'm doing now... so, please don't post comments and answers saying that I should be using a different technology, etc. I have a "reasonably" specific question.
A little background:
Basically, I have a system where I'm processing a lot of varied but fairly structured data feeds each day (CSV files). It's a fairly generic ETL type of system. I started off writing Python scripts to do it all in memory, but I found that I was writing a lot of code to check and enforce rules that could easily be described by a DB schema. So I've got a series of SQS queues (one for each source) that hold file locations (on S3) to process, and a PostgreSQL DB script to do the loading. Hacky? Yes, probably. But, in a way, it's pretty easy to just define all of your rules in PostgreSQL. At least for me, with approximately 15 years of RDBMS experience (what's that old saying about when you only have a hammer, everything looks like a nail?).
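For context, each worker is conceptually something like the sketch below (queue URL, bucket, and table names are placeholders, and boto3/psycopg2 stand in for whatever libraries are actually in use):

    # Rough sketch of one worker: receive an S3 key from its SQS queue,
    # download the CSV, and COPY it into a staging table so the schema's
    # constraints enforce the rules.
    import boto3
    import psycopg2

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/feed-orders"  # hypothetical
    BUCKET = "my-feed-bucket"                                                   # hypothetical

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    pg = psycopg2.connect("dbname=etl user=etl")

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            key = msg["Body"]                      # assume the body is the S3 key
            local_path = "/tmp/" + key.replace("/", "_")
            s3.download_file(BUCKET, key, local_path)

            with pg, pg.cursor() as cur, open(local_path) as f:
                # COPY into a staging table; NOT NULL / CHECK / FK constraints
                # do the validation when rows are moved to the real tables.
                cur.copy_expert("COPY staging_orders FROM STDIN WITH CSV HEADER", f)

            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])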
So, it all works pretty well. But when creating EC2 instances, I have a choice of an image_id and a type/size. I have my base "PostgreSQL worker image" that I use, but it's really geared for one size (micro).
Now I'm thinking about playing around to see what kind of gains I could get if I went with small or medium. My initial thought is that I would just create separate image_ids with postgres conf settings geared to each size, but that seems a bit messy (then again, the whole thing is a bit messy and hacky).
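For illustration, the size-specific settings amount to something like the sketch below (instance types and values are rough placeholders, not tuning advice); the same script could also run at boot on a single AMI rather than being baked into separate images:

    # Illustration only: the kind of per-size postgresql.conf overrides the
    # "image per size" idea implies. Run at image-build time or at boot.
    SETTINGS_BY_TYPE = {
        "t1.micro":  {"shared_buffers": "64MB",  "work_mem": "4MB",  "maintenance_work_mem": "32MB"},
        "m1.small":  {"shared_buffers": "256MB", "work_mem": "8MB",  "maintenance_work_mem": "64MB"},
        "m1.medium": {"shared_buffers": "512MB", "work_mem": "16MB", "maintenance_work_mem": "128MB"},
    }

    def write_overrides(instance_type, path="/etc/postgresql/9.1/main/size.conf"):
        # The include-file path is an assumption; on 9.1 the settings may need
        # to be appended to postgresql.conf directly instead.
        with open(path, "w") as f:
            for key, value in SETTINGS_BY_TYPE[instance_type].items():
                f.write(f"{key} = {value}\n")

    write_overrides("m1.small")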
Given what I have in place, is there a better way to accomplish this than just separate AMIs?
Final notes:
My AMIs are all PostgreSQL 9.1 and Ubuntu 12.04. And the DBs are just temporary storage. They only exist for the 15 or 20 minutes they are needed to load/process/output the data.
If you feel like this question could be better answered on the SE's DBA site, then please feel free to add a comment. I usually start with StackOverflow because it's a bigger community and it's a community that I feel more at home with. I'm much more of a developer than a DBA.
I am writing my first macOS application that uses SQLite (via FMDB: https://github.com/ccgus/fmdb).
I could either open/close the database connection for each transaction (CRUD), or open it in init and close it in dealloc. What is the best way?
I'm not sure I have the definitive answer, but having looked into this a bit myself, I've seen numerous people who say it's ok to leave the database open.
Also, if you look at the SQLite site, you'll see they've done a lot of work on ensuring a database will not get corrupted by crashes, power failures, etc.:
http://www.sqlite.org/testing.html
http://www.sqlite.org/atomiccommit.html
My experience using SQLite and FMDB is that it seems to be fine to open a connection and just leave it open. Remember, this is a "connection" to a file on a local file system backed by flash memory. That's a very different situation from a connection over the network. I think the chances of failure are extremely slim, as SQLite is clearly designed to handle crashes, power failures, etc., even if they occur during an actual database operation; outside of a database operation they are not an issue.
You could of course argue that it's bad practice to keep a database connection open when not in use, and I wouldn't recommend it in a typical client-server setup, but on the iPhone/iPad I think it's a non-issue. Keeping it open seems to work fine and is one less thing to worry about.
You don't want your app to keep the DB open from start to finish, unless all it does is start, do DB stuff, then quit. The reason is that on rare occasions the app may be terminated by a system problem, loss of power, etc.; since SQLite is file-based, this may result in an unclosed file or some other out-of-sync condition. Open the DB when you need it, do your thing, and close it when you no longer need it open. You can't protect against a crash while you're actually doing DB ops, but you can see to it that the DB was stable and closed when your last set of DB ops ran. Just as an aside, SQLite opens and closes very quickly. Well, let me amend that: the SQLite3 I have compiled into my app does; I don't actually know about other versions.
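For what it's worth, here are the two patterns side by side, sketched with Python's built-in sqlite3 module rather than FMDB (so treat it as the general shape of each approach, not the FMDB API; file name and schema are made up):

    # The two patterns under discussion, shown with Python's sqlite3 module.
    import sqlite3

    DB_PATH = "app.sqlite"

    # Pattern 1: open once, keep the connection for the lifetime of the object.
    class Store:
        def __init__(self):
            self.conn = sqlite3.connect(DB_PATH)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"
            )

        def add(self, body):
            with self.conn:   # wraps the statement in a transaction
                self.conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))

        def close(self):
            self.conn.close()

    # Pattern 2: open/close around each unit of work
    # (assumes the table already exists).
    def add_note(body):
        conn = sqlite3.connect(DB_PATH)
        try:
            with conn:
                conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
        finally:
            conn.close()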
We use SourceSafe 6.0d and have a DB that is about 1.6GB. We haven't had any problems yet, and there is no plan to change source control programs right now, but how big can the SourceSafe database be before it becomes an issue?
Thanks
I've had VSS problems start as low as 1.5-2.0 gigs.
The meta-answer is, don't use it. VSS is far inferior to a half-dozen alternatives that you have at your fingertips. Part of source control is supposed to be ensuring the integrity of your repository. If one of the fundamental assumptions of your source control tool is that you never know when it will start degrading data integrity, then you have a tool that invalidates its own purpose.
I have not seen a professional software house using VSS in almost a decade.
1 byte!
:-)
Sorry, dude, you set me up.
Do you run the built-in ssarchive utility to make backups? If so, 2GB is the maximum size that can be restored. (http://social.msdn.microsoft.com/Forums/en-US/vssourcecontrol/thread/6e01e116-06fe-4621-abd9-ceb8e349f884/)
NOTE: the ssarchive program won't tell you this; it's just that if you try to restore a DB over 2GB, it will fail. Beware! All the people telling you they are running fine with larger DBs are either using another archive program or haven't tested the restore feature.
I've actually run a VSS DB that was around 40 GB. I don't recommend it, but it is possible. Really, the larger you let it get, the more you're playing with fire. I've heard of instances where the DB got corrupted and the items in source control were unrecoverable. I would definitely back it up on a daily basis and start looking to change source control systems. Having been in the position of the guy they call when it fails, I can tell you that it really starts to get stressful when you realize that it could just go down and never come back.
Considering the number of problems SourceSafe can generate on its own, I would say the size only has to be in the category "present on disk" for it to develop problems.
I've administered a VSS DB over twice that size. As long as you are vigilant about running Analyze, you should be OK.
SourceSafe recommends 3-5 GB, with a "don't ever go over 13 GB".
In practice, however, ours is over 20 GB and seems to be running fine.
The larger you get, though, the more problems Analyze will find, including lost files, etc.
EDIT: Here is the official word: http://msdn.microsoft.com/en-us/library/bb509342(VS.80).aspx
I have found that Analyze/Fix starts getting annoyingly slow at around 2 GB on a reasonably powerful server. We run Analyze once per month on databases used by 20 or so developers. The utility finds occasional fixes to perform, but actual use has been basically problem-free for years at my workplace.
The main thing, according to Microsoft, is to make sure you never run out of disk space, whatever the size of the database.
http://msdn.microsoft.com/en-us/library/bb509342(VS.80).aspx
quote:
Do not allow Visual SourceSafe or the Analyze tool to run out of disk space while running. Running out of disk space in the middle of a complex operation can create serious database corruption