Running cucumber scenario outline in parallel - ruby

I have a feature with one scenario outline that takes too long to run because of the number of examples I use. I would like to take advantage of the multiple threads available and run these outlines in parallel. One way is to split up the feature/scenario outline into multiple features. This works but then it leaves me with multiple copies of the same scenario with only a difference in examples. My question is if there is a way to split this scenario outline to run in parallel without having to create multiple features?

Jay, I doubt there is a way to do exactly what you are asking (splitting a single scenario outline into separate parallel runs). I would say that if your primary concern is speed, it would be worth having duplicate features and separating the examples out so that the coverage completes faster. I generally like to keep my scenarios below a minute unless I am doing something 'expansive' like the one you are describing, but those tests I run overnight and don't expect quick results from.
If the feature, as you say, "takes too long to run" but you want it to be quicker, then I would suggest simply cutting it into quicker, bite-size tests that you can then run in parallel.
Even if there were a way to do what you asked, I believe you would be better off splitting up the test anyway.

Related

Testing runtimes of different versions of the same code

I have been wondering about the time complexity of the programs that I run and I would like to know how I can compare their speeds. I usually like to do problems on platforms like CodeAbbey, and when I'm done I get the chance to see other people's attempts at the problem as well. Given that diversity, people use many different methods, and my hope is to pick up tricks from what they have done to improve my skills. So is there some platform/website/app/program etc. that I can use to check which finishes first by running them simultaneously?
One other curious question I've been wondering about:
Does each program take exactly the same time to run every time? If I code something up and run it once, then run it a second time, will it execute in the same time as before, or will it be slightly different, so that the best I can hope for is an average?
I have tried to Google it, but I don't really understand the results or how to use the sites. I just want to be able to see how much time it takes to run.
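Run-to-run times are never exactly identical: scheduling, caches, and other processes all add noise, so the best you can hope for is indeed a summary over many runs. A minimal sketch in Python using the standard library's timeit module; the solve function is a hypothetical stand-in for whatever program you want to measure:

import timeit
import statistics

def solve():
    # hypothetical stand-in for the code you want to time
    return sum(i * i for i in range(100_000))

# repeat the measurement: each run takes a slightly different time
runs = timeit.repeat(solve, repeat=10, number=1)

print("min :", min(runs))              # least noisy single estimate
print("mean:", statistics.mean(runs))  # the "average" you can hope for
print("sd  :", statistics.stdev(runs)) # run-to-run variation

The minimum is usually the most stable figure to compare between two programs, since noise only ever adds time.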

Competitive Programming: Generating test cases and validating program correctness

I have been doing competitive programming for a while and am still improving day by day. But one thing I have always wondered is that it would be really nice if I could automate the test-case generation process and the cross-validation of my program. It would certainly be a brute-force approach, as some test cases would be algorithm-specific.
A Google search gives me a nice link on Quora: How do programming contest problem setters make test cases? and the popular testlib used by problem setters.
But isn't this a chicken-and-egg problem?
Assume I generated 1 million input test cases: what would I check them against? How would I generate the outputs? Because I am still in the process of validating the program... If my script generates the correct outputs as well, then what's the point of writing the program in the first place? I could submit the script itself. Also, it's not possible to write 1 million outputs for generated test cases manually. Can anyone please clear up this confusion?
I hope I have described the problem clearly.
It's common to generate the answer with a slow but obviously correct solution (like an exhaustive search). It can't be used as the main solution, as it's too slow for large test cases, but you can check the output of your fast (but possibly incorrect) program against it.
Well, the thing is, it is not as broad as you think it to be. Test generation in competitive programming is guided by the problem's algorithm and its correctness proof.
So while you may think there are millions of test cases, if you analyze the different situations the program can be in, you will likely cover all of them. Maybe in a certain algorithm you sometimes process the even-indexed elements of an array and sometimes the odd-indexed ones. What do you do then? Divide it into two cases, even and odd, and consider the smallest case for each. This way you are essentially visiting every control-flow path of the program.
In competitive programming, since we first determine the algorithm and then decide on proper input sizes before all these test cases and validations, it is often easy to think of the corner cases: a test case with 1,000,000 elements, or one where the input is 0 or 1, and so on.
Another approach: most of the time we write a brute-force solution that is much slower than the real one. Then we just generate random, medium-sized test cases, run them through the slow program, and check the fast solution's output against it (see the sketch below).
Correctness is also guided by mathematical proof (heuristics, induction, the pigeonhole principle, number theory, etc.). That way we can be sure the solution is correct.
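Putting the two ideas above together (random generation plus a slow-but-correct reference), a stress test is only a few lines. A hedged Python sketch; the ./slow and ./fast executables and the input format are hypothetical placeholders:

import random
import subprocess

def random_case():
    # keep cases small so the brute-force solution still finishes quickly
    n = random.randint(1, 10)
    xs = [random.randint(-100, 100) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, xs))}\n"

def run(cmd, case):
    # feed the case on stdin, capture the program's answer from stdout
    return subprocess.run(cmd, input=case, capture_output=True, text=True).stdout.strip()

for i in range(1000):
    case = random_case()
    expected = run(["./slow"], case)   # obviously correct, exhaustive solution
    actual = run(["./fast"], case)     # the solution being validated
    if expected != actual:
        print(f"Mismatch on:\n{case}expected {expected!r}, got {actual!r}")
        break
else:
    print("All 1000 random cases agree.")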
I faced the same issue earlier this year and saw some of my colleagues also trying to figure out a way to deal with this. There were times when I just couldn't think of any more test cases, and that's when I decided to make a test-case generator tool of my own. It's free and open source, so anyone can use it.
You can easily generate a lot of test cases using this tool and validate the results against the output of the correct but slower approach (in terms of time and space complexity). You can either run the programs in parallel and compare outputs by hand, or write a simple script that compares the output of both programs (the slower but correct one and the faster but unverified one).
I believe good coders won't need it anyway, but for middle-level (Div. 2, Div. 3) coders and newbies it can prove very helpful.
You can access it on GitHub: Test Case Generator.
Both the Python source code and .exe files are there, with instructions.
If you want to make changes of your own, work with the Python file.
If you just want to generate some test cases directly, use the .exe file (inside the zip).
It'll help a lot if you're beginning your competitive programming journey.
Any suggestions or improvements are always welcome. You can also contribute to the project by adding a feature you think it is lacking, adding new test-case formats yourself, or opening a request for them.

Why are BDD test scenarios separated?

When I'm writing tests using lettuce, I want to create one huge scenario in which a user performs every possible action on the website. But testing tools push me toward separating them. What is the benefit of that?
You put "BDD" into the title of this question, and tag it with both "BDD" and "TDD" tags. So you're interested in Behavior Driven Development and Test Driven Development.
Why should you drive development a little bit at a time instead of driving the entire application all at once? That's what your question amounts to in the context of BDD and TDD.
You are going to write one method, one additional bit of functionality, next. Of course that bit will contribute to the overall behavior, and it's good to have an understanding of the overall behavior you're trying to develop, but you need focus. You need to know when that next bit is working and complete, so you can move on to the next bit. A full-blown test of the entire app will fail at the beginning; it will fail after your first bit of functionality is implemented; it will fail when you are half-done, and it will fail when you are 99% done. Unfortunately, it will probably also fail when you are 100% done - and now you'll have to find where you went wrong, and fix that bit (or those bits).
But if you write a test of just that new bit of functionality, it will fail right now, and it will pass in five minutes or ten minutes or twenty minutes or maybe an hour. Then you'll know it's time for the next bit. And it will keep passing as you write more tests and implement more functionality. When you're 99% done, you'll have 99% of your ultimate tests passing - and 100%, minus one, of your current tests. You can see your real progress and know that what you have written up to date is really working.
That's why you should write small tests, one at a time, and make them pass one at a time.
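To make that concrete, a minimal sketch in Python's built-in unittest (the Cart class is a hypothetical example): one small test for the one bit of behavior you are adding now, rather than a script that walks through the whole site.

import unittest

class Cart:
    # hypothetical "next bit of functionality" being driven out by the test
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

class TestCartAdd(unittest.TestCase):
    def test_added_item_appears_in_cart(self):
        # fails right now, passes in five or ten minutes, keeps passing after
        cart = Cart()
        cart.add("book")
        self.assertIn("book", cart.items)

if __name__ == "__main__":
    unittest.main()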
Three important things come to my mind:
readability: when a scenario fails, it is easier to understand at first glance from the scenario's name what went wrong, and significantly easier to fix, when the scenario is small and focused
maintainability: it is easier to modify or update a small scenario
independence: large scenarios tend to make steps dependent on each other. The further into a scenario an action is, the more it depends on previous actions, in increasingly complicated ways that are hard to comprehend. This directly affects the two previous points.

Boundaries of acceptance tests

My application, among other things, uses some crawlers to read information exposed as a remote XML feed by another application (which we are not responsible for). The crawled data is later displayed to the user.
The XML might contain simple data and links, which we follow if we need additional data.
The tests in our system are both unit tests, which check that we parse the XML documents correctly, and acceptance tests, which are meant to verify what we display in our UI.
I was reasoning about the acceptance tests, and that's what this question is about.
Right now, for each acceptance test, we bring up an embedded HTTP server that serves test data specific to that test. We then start our application, crawl the test data, and verify the criteria for the test. While this approach has the advantage of testing the whole system end to end, it also has the side effect of increasing the build time considerably each time we add a new acceptance test.
Is this the right approach for the acceptance tests?
I was wondering: since the system that provides the feeds is an external one, wouldn't it be better to test the network communication layer and the crawlers at the unit level, and run the acceptance tests assuming the data has already been crawled?
I'd like to hear some thoughts from somebody else. :-)
Thanks!
Acceptance tests do tend to run slowly, require more setup, and tend to be far more brittle than unit or integration tests. If you search the web for "test pyramid" you will find plenty of information on this. The general consensus is that you should have tests at the unit, integration, and acceptance levels, with most tests being unit tests and just a few acceptance tests doing the end-to-end work. Development teams often set up their CI servers to run any long-running acceptance tests only during their nightly builds so that they don't affect the turnaround of the unit test runs.
I agree with what Andrew wrote, but wanted to add a different angle to the answer, one which I think is quite often missed in such discussions.
Your team is building a product and your company wants to get the best value for money from this endeavor.
At the beginning you might think that tests slow you down: your system is simple and everyone understands it, so why waste time? It might feel like you get little value for money from writing tests. But this is obviously wrong if you take a longer-term view of your product's development. I'll stop here, though, as I'm preaching to the converted.
However, if you adopt the same mindset when trying to answer your question, you will see that actually the answer depends a lot on your circumstances. I'm going to use a rather simplified math model to explain my thinking:
Let P(bug | test) denote the probability of a bug given that you are running the test, let C(test) denote the cost of running the test, and let C(bug) denote the cost of a bug.
If you focus on a particular bug*, you want to minimize the following:
P(bug | test_1)*C(bug) + C(test_1) + ... + P(bug | test_n)*C(bug) + C(test_n)
Where your suite consists of n tests.
If you were to disregard the test cost, then clearly the more tests you have the better, right? But because tests need to be maintained, executed, etc., they have a non-zero cost. This means that you have a trade-off, and in the end you are performing a U-curve optimization (a bit like the classic picture of finding the optimal trade-off between release and holding costs).
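As a toy illustration of that U-curve (every number below is invented purely for illustration): suppose each extra test costs a fixed amount to maintain and cuts the residual bug probability geometrically.

# toy cost model with invented numbers - purely illustrative
BUG_COST = 10_000.0        # cost of a bug escaping to production
TEST_COST = 50.0           # cost of maintaining and running one test
P0, REDUCTION = 0.9, 0.7   # initial bug probability; factor per extra test

def total_cost(n_tests):
    p_bug = P0 * REDUCTION ** n_tests
    return p_bug * BUG_COST + n_tests * TEST_COST

costs = {n: total_cost(n) for n in range(30)}
best = min(costs, key=costs.get)
print(f"cost is minimized at {best} tests ({costs[best]:.0f})")
# fewer tests than that and the bug cost dominates; more and the test cost does

This is Python rather than anything domain-specific, but it shows why the optimum sits in the middle rather than at "no tests" or "all the tests".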
The actual costs depend a lot on a particular domain, product areas and test types.
If you are in banking, the cost of a bug can be enormous, so it will dwarf the test costs. But if you are writing a recommendation engine for music, having suggestions that are off for a few hours will not be a problem. Actually, in the latter case, you probably want the freedom to experiment with different algorithms and the ability to iterate quickly, so the cost of a test might overshadow the cost of a bug.
Let's say you work on a particular product. Even that is not homogeneous: there will be areas of your product that are more critical than others. Take Twitter, for example: if a person could not tweet or load the tweets of the people they follow, it would be a big problem. On the other hand, if the "who to follow" suggestions are empty, the impact on the product will be much smaller.
Finally, the cost of tests is not uniform either. But as I said earlier, it is not negligible and needs to be considered with care. I have worked both in places where poor test coverage slowed teams down because they lacked the confidence to push their changes to production, and in places where test runs were so long that people complained they were constantly building and hardly working.
One last thing: it is good to build with resiliency to failure in mind - it will lower the cost of bugs for you.

Preventing performance regressions in R

What is a good workflow for detecting performance regressions in R packages? Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
What is a good workflow in general? What other languages provide good tools? Is it something that can be built on top of unit testing, or is it usually done separately?
This is a very challenging question, and one that I'm frequently dealing with, as I swap out different code in a package to speed things up. Sometimes a performance regression comes along with a change in algorithms or implementation, but it may also arise due to changes in the data structures used.
What is a good workflow for detecting performance regressions in R packages?
In my case, I tend to have very specific use cases that I'm trying to speed up, with different fixed data sets. As Spacedman wrote, it's important to have a fixed computing system, but that's almost infeasible: sometimes a shared computer may have other processes that slow things down 10-20%, even when it looks quite idle.
My steps:
Standardize the platform (e.g. one or a few machines, a particular virtual machine, or a virtual machine + specific infrastructure, a la Amazon's EC2 instance types).
Standardize the data set that will be used for speed testing.
Create scripts and fixed intermediate data output (i.e. saved to .rdat files) that involve very minimal data transformations. My focus is on some kind of modeling, rather than data manipulation or transformation. This means that I want to give exactly the same block of data to the modeling functions. If, however, data transformation is the goal, then be sure that the pre-transformed/manipulated data is as close as possible to standard across tests of different versions of the package. (See this question for examples of memoization, cacheing, etc., that can be used to standardize or speed up non-focal computations. It references several packages by the OP.)
Repeat tests multiple times.
Scale the results relative to fixed benchmarks, e.g. the time to perform a linear regression, to sort a matrix, etc. This can allow for "local" or transient variations in infrastructure, such as may be due to I/O, the memory system, dependent packages, etc. (See the sketch after this list.)
Examine the profiling output as vigorously as possible (see this question for some insights, also referencing tools from the OP).
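A sketch of the "repeat and scale" steps above, in Python for illustration (the R workflow would be analogous); baseline and workload are hypothetical stand-ins:

import time
import statistics

def baseline():
    # fixed reference task; its time tracks machine and load changes
    sorted(range(1_000_000, 0, -1))

def workload():
    # hypothetical stand-in for the package code under test
    sum(i * i for i in range(2_000_000))

def median_time(fn, repeats=10):
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# report the workload as a multiple of the baseline, so transient
# machine-wide slowdowns largely cancel out of the comparison
print(f"workload / baseline = {median_time(workload) / median_time(baseline):.2f}")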
Ideally, I'm looking for something that integrates with R CMD check that alerts me when I have introduced a significant performance regression in my code.
Unfortunately, I don't have an answer for this.
What is a good workflow in general?
For me, it's quite similar to general dynamic code testing: is the output (execution time in this case) reproducible, optimal, and transparent? Transparency comes from understanding what affects the overall time. This is where Mike Dunlavey's suggestions are important, but I prefer to go further, with a line profiler.
Regarding a line profiler, see my previous question, which refers to options in Python and Matlab for other examples. It's most important to examine clock time, but also very important to track memory allocation, number of times the line is executed, and call stack depth.
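For instance, Python's standard-library cProfile gives a function-level breakdown (dedicated line-level profilers go one step further); a minimal sketch with a hypothetical workload:

import cProfile
import pstats

def transform(xs):
    return [x * x for x in xs]

def workload():
    # hypothetical workload whose hot spots we want to find
    for _ in range(20):
        transform(list(range(200_000)))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# show the five most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)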
What other languages provide good tools?
Almost all other languages have better tools. :) Interpreted languages like Python and Matlab have good and possibly familiar examples of tools that can be adapted for this purpose. Although dynamic analysis is very important, static analysis can help identify where there may be serious problems. Matlab has a great static analyzer that can report, for instance, when objects (e.g. vectors, matrices) are growing inside of loops. It is terrible to find this only via dynamic analysis - you've already wasted execution time to discover it, and it's not always discernible if your execution context is pretty simple (e.g. just a few iterations, or small objects).
As far as language-agnostic methods, you can look at:
Valgrind & cachegrind
Monitoring of disk I/O, dirty buffers, etc.
Monitoring of RAM (Cachegrind is helpful, but you could also just monitor RAM allocation and other details of RAM usage)
Usage of multiple cores
Is it something that can be built on top of unit testing, or is it usually done separately?
This is hard to answer. For static analysis, it can occur before unit testing. For dynamic analysis, one may want to add more tests. Think of it as sequential design (i.e. from an experimental design framework): if the execution costs appear to be, within some statistical allowances for variation, the same, then no further tests are needed. If, however, method B seems to have an average execution cost greater than method A, then one should perform more intensive tests.
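A minimal sketch of that sequential idea in Python (method_a and method_b are hypothetical): take a handful of timing samples per method and only escalate to a bigger experiment when the gap is large relative to the run-to-run noise.

import time
import statistics

def samples(fn, n=15):
    out = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        out.append(time.perf_counter() - t0)
    return out

def method_a():
    sum(i * i for i in range(500_000))

def method_b():
    total = 0
    for i in range(500_000):
        total += i * i

a, b = samples(method_a), samples(method_b)
gap = statistics.mean(b) - statistics.mean(a)
noise = max(statistics.stdev(a), statistics.stdev(b))

# crude allowance for variation; a proper t-test would be more principled
if abs(gap) < 2 * noise:
    print("no clear difference - stop here")
else:
    print("apparent difference - run a more intensive experiment")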
Update 1: If I may be so bold, there's another question that I'd recommend including, which is: "What are some gotchas in comparing the execution time of two versions of a package?" This is analogous to assuming that two programs that implement the same algorithm should have the same intermediate objects. That's not exactly true (see this question - not that I'm promoting my own questions, here - it's just hard work to make things better and faster...leading to multiple SO questions on this topic :)). In a similar way, two executions of the same code can differ in time consumed due to factors other than the implementation.
So, some gotchas that can occur, either within the same language or across languages, within the same execution instance or across "identical" instances, which can affect runtime:
Garbage collection - different implementations or languages can hit garbage collection under different circumstances. This can make two executions appear different, though it can be very dependent on context, parameters, data sets, etc. The GC-obsessive execution will look slower.
Cacheing at the level of the disk, motherboard (e.g. L1, L2, L3 caches), or other levels (e.g. memoization). Often, the first execution will pay a penalty.
Dynamic voltage scaling - This one sucks. When there is a problem, this may be one of the hardest beasties to find, since it can go away quickly. It looks like cacheing, but it isn't.
Any job priority manager that you don't know about.
One method uses multiple cores or does some clever stuff about how work is parceled among cores or CPUs. For instance, getting a process locked to a core can be useful in some scenarios. One execution of an R package may be luckier in this regard, another package may be very clever...
Unused variables, excessive data transfer, dirty caches, unflushed buffers, ... the list goes on.
The key result is: Ideally, how should we test for differences in expected values, subject to the randomness created due to order effects? Well, pretty simple: go back to experimental design. :)
When the empirical differences in execution times are different from the "expected" differences, it's great to have enabled additional system and execution monitoring so that we don't have to re-run the experiments until we're blue in the face.
The only way to do anything here is to make some assumptions. So let us assume an unchanged machine, or else require a 'recalibration'.
Then use a unit-test-like framework, and treat 'has to be done in X units of time' as just another testing criterion to be fulfilled. In other words, do something like
stopifnot(system.time(someExpression)[["elapsed"]] < savedValue + fudge)
so we would have to associate prior timings with given expressions. Equality-testing comparisons from any one of the three existing unit testing packages could be used as well.
Nothing that Hadley couldn't handle, so I think we can almost expect a new package timr after the next long academic break :). Of course, this would either have to be optional, because on an "unknown" machine (think: CRAN testing the package) we have no reference point, or else the fudge factor would have to "go to 11" to automatically accept on a new machine.
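The same pattern outside R, as a hedged Python sketch (the file name, fudge factor, and check_timing helper are all arbitrary inventions): store a reference timing per expression, fail when the current run exceeds it by more than the fudge, and treat the first run on a new machine as the 'recalibration'.

import json
import time
import pathlib

BASELINE = pathlib.Path("timings.json")  # arbitrary file name
FUDGE = 1.5  # tolerate a 50% slowdown before failing; tune per machine

def check_timing(name, fn):
    t0 = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - t0

    saved = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    if name in saved:
        # the timing assertion: just another test criterion to fulfill
        assert elapsed < saved[name] * FUDGE, (
            f"{name}: {elapsed:.3f}s vs baseline {saved[name]:.3f}s")
    else:
        saved[name] = elapsed  # unknown machine: record a new reference point
        BASELINE.write_text(json.dumps(saved))

check_timing("model_fit", lambda: sum(i * i for i in range(1_000_000)))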
A recent change announced on the R-devel feed could give a crude measure for this.
CHANGES IN R-devel UTILITIES
‘R CMD check’ can optionally report timings on various parts of the check: this is controlled by environment variables documented in ‘Writing R Extensions’.
See http://developer.r-project.org/blosxom.cgi/R-devel/2011/12/13#n2011-12-13
The overall time spent running the tests could be checked and compared to previous values. Of course, adding new tests will increase the time, but dramatic performance regressions could still be seen, albeit manually.
This is not as fine-grained as timing support within individual test suites, but it also does not depend on any one specific test suite.
