Truncate a Riak Database - ruby

I am writing a bit of code which uses Riak DB and I want to reset my database into a known state at the beginning of each test.
Is there a way to truncate a riak database cleanly? What about a way to execute inside of a transaction and rollback at the end of the test?
Currently I use a bit of code like this:
riak.buckets.each do |bucket|
  bucket.keys.each do |key|
    bucket.delete(key)
  end
end
But I imagine doing this at the beginning of every test would be pretty slow.

I think every test-oriented developer faces this dilemma when working with Riak. As Christian has mentioned, there is no concept of rollbacks in Riak. And there is no single "truncate database" command that you can issue.
You have 3 approaches available to you:
Clear all the data on your test cluster. This essentially means issuing shell commands (assuming your test server is running on the same machine as your test suite). If you're using a Memory backend, this means issuing riak restart between each test. For other backends, you'd have to stop the node and delete the whole data directory and start it again: riak stop && rm -rf <...>/data/* && riak start. PROS: Wipes the cluster data clean between each test. CONS: This is slow (when you take into account shutdown and restart times), and issuing shell commands from your test suite is often awkward. (Sidenote: while it may be slow to do between each test, you can certainly feel free to clear the data directory before each run of the whole test suite.)
Loop through all the buckets and keys and delete them, on your test cluster, as you've suggested above. PROS: Simple to understand and implement. CONS: Also slow (to run between each test).
Have each test clean up after itself. So, if your test creates a User object, make sure to issue a DELETE command for that object at the end of the test. Optionally, test that a user doesn't exist initially, before creating one. (To make doubly sure that the previous test cleaned up). PROS: Simple to understand and implement. Fast (definitely faster than looping through all the buckets and keys between each test). CONS: Easy for developers to forget to clean up after each insert.
After having debated these approaches, I've settled on using #3 (combined, frequently, with wiping the test server data directory before each test suite run).
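For illustration, here is a minimal sketch of what approach #3 can look like with Minitest and the Riak Ruby client (the bucket, key, and the riak client helper are assumptions, not code from the question):

require 'minitest/autorun'

class UserPersistenceTest < Minitest::Test
  def setup
    @bucket = riak.bucket('users')   # assumes a `riak` client helper, as in the question
    @key    = 'test-user-1'
  end

  def test_user_can_be_stored_and_fetched
    obj = @bucket.new(@key)
    obj.content_type = 'application/json'
    obj.data = { 'name' => 'Alice' }
    obj.store

    assert_equal 'Alice', @bucket.get(@key).data['name']
  end

  def teardown
    # the test cleans up exactly what it created
    @bucket.delete(@key)
  end
end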
Some thoughts on mitigating the CONS of the 'each test cleans up after itself, manually' approach:
Use a testing framework that runs tests in random order. Many frameworks, like Ruby's Minitest, do this out of the box. This often helps catch tests that depend on other tests having conveniently forgotten to clean up.
Periodically examine your test cluster (via a list-buckets call) after the tests run, to make sure there's nothing left. In fact, you can do this programmatically at the end of each test suite (something as simple as listing the buckets and making sure they're empty).
(This is good testing practice in general, but especially relevant with Riak.) Write fewer tests that hit the database. Maintain a strict division between Unit Tests (that test object state and behavior without hitting the db) and Integration or Functional Tests (that do hit the db). Make sure there's a lot more of the former than the latter. In other words -- you don't have to test that the database works with each unit test. Trust it (though obviously, verify, during the integration tests).
For example, if you're using Riak with Ruby on Rails, and you're testing your models, don't call test_user.save! to verify that a user instance is valid (like I once did, when first getting started). You can simply test for test_user.valid?, and understand that the call to save will work (or fail) accordingly, during actual use. Consider using Mockist-style testing, which verifies whether or not a save! function was actually invoked, instead of actually saving to the db and then reading back. And so on.
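As a rough sketch of that mockist style (RSpec syntax; the Registration class and its API are invented for illustration, not part of the question):

# Hedged sketch: verify the interaction instead of the persisted row,
# so the test never touches Riak at all.
it "persists the user when registration succeeds" do
  user = double("User", :valid? => true)
  expect(user).to receive(:save!)

  Registration.new(user).run   # hypothetical class under test
end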

There are a few possible answers here.
Are you testing that data is persisted by querying Riak using its key? If so, you can set up a test server. Documentation, such as it is, is here: http://rubydoc.info/github/basho/riak-ruby-client/Riak/TestServer
Are you testing access by secondary index? If so, why? Do you not trust Riak or the Ruby driver?
In all probability, your tests shouldn't be coupled to the data store in any case. It slows things down.
If you do insist and the TestServer isn't working for you, set up a new bucket for every test run. Each bucket is its own namespace, so it's pretty much a clean slate. Periodically, stop the nodes and clear out the data directories as per Christian's answer above.
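A minimal sketch of that bucket-per-run idea, assuming the riak-ruby-client and a riak client instance as in the question:

require 'securerandom'

# Each test run gets its own namespace, so leftover keys from a previous
# run can never collide with the current run's data.
TEST_BUCKET = riak.bucket("test-#{SecureRandom.uuid}")

# read and write against TEST_BUCKET everywhere in the tests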

As there is no concept of transactions or rollbacks in Riak, that is not possible. The memory backend is however commonly used for testing as it supports the features of Bitcask (auto-expiry) and LevelDB (secondary indexes). Whenever the database needs to be cleared, the nodes just need to be restarted.
If using Bitcask or LevelDB when testing, the most efficient method to clear the database is to shut down the node and simply remove the data directories.

Related

PHPUnit test dependencies

A similar question has been asked before but I don't quite understand the answer. My specific case is that I have a unit test which tests the registration of a user via a REST API endpoint. User registration however depends on a few records which must exist in the database, otherwise it will fail. Inserting these records into the database is most definitely a test case by itself too. So my question is, should I execute my tests in a specific order in order for the records to exist, or should I explicitly insert the records again in every testcase that depends on it?
It might be somewhat irrelevant but I'm using Laravel 5, so testing is done in PHPUnit.
should I execute my tests in a specific order in order for the records to exist, or should I explicitly insert the records again in every testcase that depends on it?
I think the correct answer here is that you should not do either (but please read on, it might still be ok to do the latter, though not perfect).
You say registering the user is a test case in itself. Very well then: write that test, and let's assume you have it in what follows.
Creating tests so that they run in order
Let's deal with the first option: creating those rows once and then running multiple tests against them.
I think this is a very flawed approach no matter the circumstances. All of a sudden all tests depend on one another.
Say you run tests A, B, and C on those rows. Maybe it's even the case that right now none of them alters the rows. But there is no way you can be sure that no bug is ever introduced into B that alters the data (it doesn't even have to be a bug; it could just be that the underlying functionality changes).
Now you're in a situation where test C might pass, but only if B did not run before. This is an entirely unacceptable situation, especially when the reverse is true, C only passing if B ran.
This could show up as, say, a fresh installation of your app throwing errors in real life, while your development setup, which contains a bunch of data, works fine and so do the tests, because B created a certain state in your database (one that maybe also happens to exist in your dev database).
Then you give it out to some poor customer and all of a sudden "option X" is not set, or the initial admin user does not exist or whatever :)
=> bad plan
Running the Setup for Every Test that depends on it
This is a significantly better plan. Now you at least have full control of your database state in every test and they all run independent of one another.
The order in which they run will not affect the outcome.
=> good
Also, this is a relatively standard thing to do for a subset of tests. Just subclass your main test case class and make all tests that depend on that setup subclasses of it, like so:
abstract class NeedsDbSetupTestCase extends MyAppMainTestCase {

    function setUp(){
        parent::setUp();
        $this->setupDb();
    }

    private function setupDb(){
        // add your rows and tables and such
    }
}
=> acceptable idea
The Optimal Approach
The above still comes with some drawbacks. For one, it isn't really a unit test anymore once it depends on very specific database interactions, which makes it less valuable in exactly pinpointing an issue. Admittedly, though, this is in many cases more a theoretical than a practical issue :)
What will much more likely become a practical issue is performance. You are adding a bunch of database writes that might need to be run hundreds of times once your test suite grows. At the beginning of your project this might mean that it takes 4s to run instead of 2s :P ... once the project grows, you might find yourself losing a lot of time because of this.
One last issue you might also face is that your test suite becomes dependent on the database it's run against. Maybe it passes running against MySQL 5.5 and fails against 5.6 (academic example, I guess :P) => you might get all kinds of strange behavior with tests passing locally but failing in CI and whatnot (somewhat likely, depending on your setup).
Since you are interested in this in a more generic sense, let me outline the proper way of handling it generically here too :)
What it will always come down to is that a situation like this causes you trouble:
class User {

    private $id;

    public function get_data(){
        return make_a_sql_call_and_return_row_as_array(
            "SELECT propertya, propertyb FROM users WHERE id = " . $this->id
        );
    }
}
Now some other method is to be tested that actually uses the return of get_data() and you need that data in the db :) ... or you just mock your User object!
Assuming you have some method in another class that uses that User object.
And your test looks a little something like this:
// run this in the context of the class that sets up the db for you
$user = new User($user_id);
$this->assertTrue(some_method_or_function($user);
All you need here from $user is for its get_data() to return, say, the array [1, 2]. Instead of inserting this into the db and then using a real instance of User, just create the mock:
// this one doesn't do anything yet; it returns null on every method
$user = $this->getMockBuilder('User')->disableOriginalConstructor()->getMock();

// now just make it return what you want it to return
$user->method('get_data')->willReturn(array(1, 2));

// and run your test lightning fast, without ever touching the database but getting the same result :)
$this->assertTrue(some_method_or_function($user));
Another hidden (but valuable) benefit of this approach is that setting up the mocks actually forces you to think about the details that go into every class's behavior, giving you a significantly more detailed understanding of your app in the end.
Obviously the downside is that it (not always but often) requires a lot more work to code your tests this way and the benefit might not be worth the trouble.
Especially when working with other frameworks like WordPress and such that your code depends on, it might be somewhat unfeasible to really mock all db interaction, while existing libraries provide slower but trivial to implement database testing capabilities for your code :)
But in general option 3 is the way to go, option one is just wrong and option two might be what everyone eventually does in real life :D

Is there a fast way to reinitialize or clear etcd keyspace for testing?

I am writing a small ruby application that utilizes etcd via the etcd-ruby gem to coordinate activities across a cluster.
One problem I have is how to write specs for it.
My first approach was to attempt to mock out the etcd calls at the client level; however, this is sub-optimal because the responses returned by the client are quite complex, with metadata. I thought about writing a wrapper over the etcd client to strip away the metadata and make a mocking approach easier, but the problem is the algorithm does depend on this metadata at times, so the abstraction becomes very leaky and just a painful layer of indirection.
Another approach is to use VCR to record actual requests. This has the benefit of allowing specs to run without etcd, but it becomes a mess of initializing state and managing cassettes.
This brings me to my question. etcd is fast enough as a solo node that it seems easiest and most straightforward to just use it directly in tests and not attempt to stub it at all. The only problem here is that I can't see any easy way to clear the keyspace between tests. Recursive delete on the root key is not allowed. Also, this doesn't reset the indices. I checked the etcd-ruby gem specs, and it appears to bypass the issue by using keys based on uuids so that keys simply never collide. I suppose that is a viable approach, but is there something better?
I would test against an etcd Docker container, which you can tear down and restore very quickly for tests.
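A hedged sketch of that approach driven from Ruby (the image name, port, and flags are assumptions; adjust them to the etcd version you actually run):

# Recreate a throwaway etcd container before each example so every test
# starts from an empty keyspace with fresh indices.
RSpec.configure do |config|
  config.before(:each) do
    system('docker rm -f etcd-test > /dev/null 2>&1')
    started = system('docker run -d --name etcd-test -p 2379:2379 ' \
                     'quay.io/coreos/etcd etcd ' \
                     '--listen-client-urls http://0.0.0.0:2379 ' \
                     '--advertise-client-urls http://localhost:2379')
    raise 'could not start etcd container' unless started
    sleep 0.5 # crude wait for the node to accept connections
  end
end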

Rails, how to migrate large amount of data?

I have a Rails 3 app running an older version of Spree (an open source shopping cart). I am in the process of updating it to the latest version. This requires me to run numerous migrations on the database to be compatible with the latest version. However, the app's current database is roughly 300MB, and running the migrations on my local machine (Mac OS X 10.7, 4GB RAM, 2.4GHz Core 2 Duo) takes over three days to complete.
I was able to decrease this time to only 16 hours using an Amazon EC2 instance (High-I/O On-Demand Instances, Quadruple Extra Large). But 16 hours is still too long as I will have to take down the site to perform this update.
Does anyone have any other suggestions to lower this time? Or any tips to increase the performance of the migrations?
FYI: using Ruby 1.9.2, and Ubuntu on the Amazon instance.
Dropping indices beforehand and adding them again afterwards is a good idea.
Also replacing .where(...).each with .find_each and perhaps adding transactions could help, as already mentioned.
Replace .save! with .save(:validate => false), because during the migrations you are not getting random inputs from users, you should be making known-good updates, and validations account for much of the execution time. Or using .update_attribute would also skip validations where you're only updating one field.
Where possible, use fewer AR objects in a loop. Instantiating and later garbage collecting them takes CPU time and uses more memory.
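To make those suggestions concrete, a hedged sketch of a data-migration loop in the style described above (the model and columns are invented; Rails 3 option-hash syntax):

# Batch the rows, wrap each batch in a transaction, and skip validations,
# since the migration applies known-good updates.
Order.find_in_batches(:batch_size => 500) do |batch|
  Order.transaction do
    batch.each do |order|
      order.total_cents = (order.legacy_total * 100).round
      order.save(:validate => false)
    end
  end
end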
Maybe you have already considered this:
Tell the db not to bother making sure everything is on disk (no WAL, no fsync, etc.); you now effectively have an in-memory db, which should make a very big difference. (Since you have taken the db offline, you can just restore from a backup in the unlikely event of power loss or similar.) Turn fsync/WAL back on when you are done.
It is likely that you can do some of the migrations before you take the db offline. Test this in a staging env, of course. That big user migration might very well be possible to do live. Make sure that you don't do it in a transaction; you might need to modify the migrations a bit.
I'm not familiar with your exact situation but I'm sure there are even more things you can do unless this isn't enough.
This answer is more about approach than a specific technical solution. If your main criterion is minimum downtime (and data integrity, of course), then the best strategy for this is to not use Rails!
Instead you can do all the heavy work up-front and leave just the critical "real time" data migration (I'm using "migration" in the non-Rails sense here) as a step during the switchover.
So you have your current app with its db schema and the production data. You also (presumably) have a development version of the app based on the upgraded Spree gems with the new db schema but no data. All you have to do is figure out a way of transforming the data between the two. This can be done in a number of ways, for example using pure SQL and temporary tables where necessary, or using SQL and Ruby to generate insert statements. These steps can be split up so that data that is fairly "static" (reference tables, products, etc.) can be loaded into the db ahead of time, and the data that changes more frequently (users, sessions, orders, etc.) can be done during the migration step.
You should be able to script this export-transform-import procedure so that it is repeatable and have tests/checks after it's complete to ensure data integrity. If you can arrange access to the new production database during the switchover then it should be easy to run the script against it. If you're restricted to a release process (eg webistrano) then you might have to shoe-horn it into a rails migration but you can run raw SQL using execute.
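For example, a transform step run as raw SQL from a migration might look roughly like this (table and column names are made up; the point is simply that execute bypasses ActiveRecord entirely):

class BackfillOrderTotals < ActiveRecord::Migration
  def up
    execute <<-SQL
      UPDATE orders
         SET total_cents = ROUND(legacy_total * 100)
       WHERE total_cents IS NULL
    SQL
  end

  def down
    # data backfill; nothing to reverse
  end
end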
Take a look at this gem.
https://github.com/zdennis/activerecord-import/
data = []
data << Order.new(:order_info => 'test order')
Order.import data  # issues multi-row INSERTs instead of one INSERT per record
Unfortunately the downvoted solution is the only one. What is really slow in Rails are the ActiveRecord models; they are not suited for tasks like this.
If you want a fast migration, you will have to do it in SQL.
There is another approach, but you will always have to rewrite most of the migrations...

Testing concurrency features

How would you test Ruby code that has some concurrency features? For instance, let's assume I have a synchronization mechanism that is expected to prevent deadlocks. Is there a viable way to test what it really does? Could controlled execution in fibers be the way forward?
I had the exact same problem and have implemented a simple gem for synchronizing subprocesses using breakpoints: http://github.com/remen/fork_break
I've also documented an advanced usage scenario for rails3 at http://www.hairoftheyak.com/testing-concurrency-in-rails/
I needed to make sure a gem (redis-native_hash) I authored could handle concurrent writes to the same Redis hash, detect the race condition, and elegantly recover. I found that to test this I didn't need to use threads at all.
it "should respect changes made since last read from redis" do
concurrent_edit = Redis::NativeHash.find :test => #hash.key
concurrent_edit["foo"] = "race value"
concurrent_edit.save
#hash["yin"] = "yang"
#hash["foo"] = "bad value"
#hash.save
hash = Redis::NativeHash.find :test => #hash.key
hash["foo"].should == "race value"
hash["yin"].should == "yang"
end
In this test case I just instantiated another object which represents the concurrent edit of the Redis hash, had it make a change, then made sure that saving the already-existing object pointing to the same hash respected those changes.
Not all problems involving concurrency can be tested without actually USING concurrency, but in this case it was possible. You may want to try looking for something similar to test your concurrency solutions. If it's possible, it's definitely the easier route to go.
It's definitely a difficult problem. I started writing my test using threads, and realized that the way the code I was testing was implemented, I needed the process IDs (PIDs) to actually be different. Threads run using the same PID as the process that kicked off the thread. Lesson learned.
It was at that point I started exploring forks, came across this Stack Overflow thread, and played with fork_break. Pretty cool, and easy to set up. Though I didn't need the breakpoints for what I was doing, I just wanted processes to run through concurrently; using breakpoints could be very useful in the future. The problem I ran into was that I kept getting an EOFError and I didn't know why. So I started implementing forking myself, instead of going through fork_break, and found out that the cause was an exception happening in the code under test. It's a shame the stack trace was hidden from me by the EOFError, though I understand that the child process ended abruptly and that's kinda how it goes.
The next problem I came across was with the DatabaseCleaner. No matter which strategy it used (truncation, or transaction), the child process's data was truncated/rolled back when the child process finished, so the data that was inserted by child processes was gone and the parent process couldn't select and verify that it was correct.
After banging my head on that and trying many other unsuccessful things, I came across this post http://makandracards.com/makandra/556-test-concurrent-ruby-code which was almost exactly what I was already doing, with one little addition. Calling "Process.exit!" at the end of the fork. My best guess (based on my fairly limited understanding of forking) is that this causes the process to end abruptly enough that it completely bypasses any type of database cleanup when the child process ends. So my parent process, the actual test, can continue and verify the data it needs to verify. Then during the normal after hooks of the test (in this case cucumber, but could easily be rspec too), the database cleaner kicks in and cleans up data as it normally would for a test.
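A condensed sketch of that pattern (the model and the final assertion are placeholders; each child re-establishes its own database connection, which forked children generally need):

# Fork the concurrent writers; each child ends with Process.exit!, which skips
# at_exit handlers, so the child never triggers the test suite's cleanup hooks.
pids = 2.times.map do |i|
  fork do
    ActiveRecord::Base.establish_connection
    Widget.create!(:name => "written by child #{i}")
    Process.exit!(0)
  end
end
pids.each { |pid| Process.wait(pid) }

# Back in the test process, the children's rows are still there to assert on.
expect(Widget.count).to eq(2)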
So, I just thought I'd share some of my own lessons learned in this discussion of how to test concurrent features.

Is it bad practice to run tests on a database instead of on fake repositories?

I know what the advantages are and I use fake data when I am working with more complex systems.
What if I am developing something simple, I can easily set up my environment with a real database, the data being accessed is so small that access time is not a factor, and I am only running a few tests?
Is it still important to create fake data or can I forget the extra coding and skip right to the real thing?
When I said real database I do not mean a production database, I mean a test database, but using a real live DBMS and the same schema as the real database.
The reasons to use fake data instead of a real DB are:
Speed. If your tests are slow you aren't going to run them. Mocking the DB can make your tests run much faster than they otherwise might.
Control. Your tests need to be the sole source of your test data. When you use fake data, your tests choose which fakes you will be using. So there is no chance that your tests are spoiled because someone left the DB in an unfamiliar state.
Order Independence. We want our tests to be runnable in any order at all. The input of one test should not depend on the output of another. When your tests control the test data, the tests can be independent of each other.
Environment Independence. Your tests should be runnable in any environment. You should be able to run them while on the train, or in a plane, or at home, or at work. They should not depend on external services. When you use fake data, you don't need an external DB.
Now, if you are building a small little application, and by using a real DB (like MySQL) you can achieve the above goals, then by all means use the DB. I do. But make no mistake, as your application grows you will eventually be faced with the need to mock out the DB. That's OK, do it when you need to. YAGNI. Just make sure you DO do it WHEN you need to. If you let it go, you'll pay.
It sort of depends what you want to test. Often you want to test the actual logic in your code, not the data in the database, so setting up a complete database just to run your tests is a waste of time.
Also consider the amount of work that goes into maintaining your tests and test database. Testing your code with a database often means you are testing your application as a whole instead of the different parts in isolation. This often results in a lot of work keeping both the database and the tests in sync.
And the last problem is that the test should run in isolation so each test should either run on its own version of the database or leave it in exactly the same state as it was before the test ran. This includes the state after a failed test.
Having said that, if you really want to test on your database you can. There are tools that help setting up and tearing down a database, like dbunit.
I've seen people trying to create unit tests like this, but almost always it turns out to be much more work than it is actually worth. Most abandoned it halfway through the project, many abandoning TDD completely, thinking the experience transfers to unit testing in general.
So I would recommend keeping tests simple and isolated and encapsulating your code well enough that it becomes possible to test it in isolation.
As long as the real DB does not get in your way, and you can go faster that way, I would be pragmatic and go for it.
In unit-test, the "test" is more important than the "unit".
I think it depends on whether your queries are fixed inside the repository (the better option, IMO), or whether the repository exposes composable queries; for example - if you have a repository method:
IQueryable<Customer> GetCustomers() {...}
Then your UI could request:
var foo = GetCustomers().Where(x=>SomeUnmappedFunction(x));
bool SomeUnmappedFunction(Customer customer) {
    return customer.RegionId == 12345 && customer.Name.StartsWith("foo");
}
This will pass for an object-based fake repo, but will fail for actual db implementations. Of course, you can nullify this by having the repository handle all queries internally (no external composition); for example:
Customer[] GetCustomers(int? regionId, string nameStartsWith, ...) {...}
Because this can't be composed, you can check the DB and the UI independently. With composable queries, you are forced to use integration tests throughout if you want it to be useful.
It rather depends on whether the DB is automatically set up by the test, and also whether the database is isolated from other developers.
At the moment it may not be a problem (e.g. only one developer). However, with manual database setup, setting up the database is an extra impediment to running tests, and this is a very bad thing.
If you're just writing a simple one-off application that you absolutely know will not grow, I think a lot of "best practices" just go right out the window.
You don't need to use DI/IOC or have unit tests or mock out your db access if all you're writing is a simple "Contact Us" form. However, where to draw the line between a "simple" app and a "complex" one is difficult.
In other words, use your best judgment as there is no hard-and-set answer to this.
It is OK to do that for this scenario, as long as you don't see them as "unit" tests. Those would be integration tests. You also want to consider whether you will be manually testing through the UI again and again, as you might just automate your smoke tests instead. Given that, you might even consider not doing the integration tests at all, and just working at the functional/UI test level (as they will already be covering the integration).
As others have pointed out, it is hard to draw the line between complex and non-complex, and you usually only know once it is too late :(. If you are already used to writing these tests, I am sure you won't get much overhead. If that were not the case, you could learn from it :)
Assuming that you want to automate this, the most important thing is that you can programmatically generate your initial condition. It sounds like that's the case, and even better you're testing real world data.
However, there are a few drawbacks:
Your real database might not cover certain conditions in your code. With fake data, you can force that behavior to happen.
And as you point out, you have a simple application; when it becomes less simple, you'll want to have tests that you can categorize as unit tests and system tests. The unit tests should target a simple piece of functionality, which will be much easier to do with fake data.
One advantage of fake repositories is that your regression / unit testing is consistent since you can expect the same results for the same queries. This makes it easier to build certain unit tests.
There are several disadvantages if your code modifies data (i.e., if it is not read-only):
- If you have an error in your code (which is probably why you're testing), you could end up breaking the production database.
- Even if you don't break it, if the production database changes over time, and especially while your code is executing, you may lose track of the test materials that you added and have a hard time cleaning them out of the database later.
- Production queries from other systems accessing the database may treat your test data as real data and this can corrupt results of important business processes somewhere down the road. For example, even if you marked your data with a certain flag or prefix, can you assure that anyone accessing the database will adhere to this schema?
Also, some databases are regulated by privacy laws, so depending on your contract and who owns the main DB, you may or may not be legally allowed to access real data.
If you need to run on a production database, I would recommend running on a copy which you can easily create during off-peak hours.
If it's a really simple application and you can't see it growing, I see no problem running your tests on a real DB. If, however, you think this application will grow, it's important that you account for that in your tests.
Keep everything as simple as you can, and if you require more flexible testing later on, make it so. Plan ahead though, because you don't want to have a huge application in 3 years that relies on old and hacky (for a large application) tests.
The downsides to running tests against your database is lack of speed and the complexity for setting up your database state before running tests.
If you have control over this, there is no problem in running the tests directly against the database; it's actually a good approach because it simulates your final product better than running against fake data. The key is to have a pragmatic approach and see best practices as guidelines, not rules.

Resources