My Java application has almost 800 JUnit 5 test cases, and my team is adding around 30 new cases a week. This component is the heart of the whole system, so every case needs a lot of input data objects. It was initially designed to read the input data objects (main objects) from JSON files; every test class reads at least 2-3 JSONs from its @BeforeClass method. We are expected to add more workflows this year, and this does not look like a viable way to keep preparing input data. All these reads are also causing performance issues and inflating the build time.
Can someone suggest alternate approaches to preparing the input data for JUnit test cases?
We probably cannot prepare the data manually in each test class, as that would be a huge effort; one of the JSONs in each test case is around 200 lines of data. (We use Gradle 6.x and JUnit 5.)
I'm new to Spark and I'm trying to understand the metrics in the Web UI that relate to my Spark application (developed through the Dataset API). I've watched a few videos from Spark Summit and Databricks, and most of them give a general overview of the Web UI: definitions of stage/job/task, how to tell when something is not working properly (e.g. unbalanced work between executors), suggestions about things to avoid while programming, etc.
However, I couldn't find a detailed explanation of each performance metric. In particular, I'm interested in understanding the items in the following images, which relate to a query that contains a groupBy(Col1, Col2), an orderBy(Col1, Col2) and a show().
Job 0
If I understood correctly, the default max partition size is 128 MB. Since my dataset is 1378 MB, I get 11 tasks (1378 / 128 ≈ 10.8, rounded up), each working on roughly 128 MB, right? And since in the first stage I did some filtering (before applying the groupBy), tasks write to memory, so the Shuffle Write is 108.3 KB. But why do I get 200 tasks for the second stage?
After the groupBy I used an orderBy. Is the number of tasks related to how my dataset is structured, or to its size?
UPDATE: I found this spark.sql.shuffle.partitions of 200 default partitions conundrum and some other questions, but now I'm wondering whether there is a specific reason for it to be 200.
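As far as I know, the 200 is simply the default value of the spark.sql.shuffle.partitions setting, a fixed default rather than something derived from the data. A minimal sketch of overriding it with the Java API follows; the application name and the value 24 are arbitrary choices for illustration.

```java
import org.apache.spark.sql.SparkSession;

public class ShufflePartitionsDemo {
    public static void main(String[] args) {
        // spark.sql.shuffle.partitions controls how many partitions (and
        // therefore tasks) every shuffle produces; it defaults to 200
        // regardless of the input size.
        SparkSession spark = SparkSession.builder()
                .appName("shuffle-partitions-demo")           // arbitrary name
                .master("local[*]")
                .config("spark.sql.shuffle.partitions", "24") // override the default 200
                .getOrCreate();

        // The setting can also be changed at runtime, affecting later shuffles:
        spark.conf().set("spark.sql.shuffle.partitions", "24");

        spark.stop();
    }
}
```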
Stage 0
Why do some tasks show result serialization here? If I understood correctly, serialization relates to the output, so to any show(), count(), collect(), etc. But those actions are not present in this stage (which comes before the groupBy).
Stage 1
Is it normal that result serialization time takes such a large share here? I called show() (which takes 20 rows by default, and there is an orderBy), so all tasks run in parallel and that one task serialized all of its records?
Why does only one task have a considerable Shuffle Read Time? I expected all of them to have at least a small amount of Shuffle Read Time. Again, is this something related to my dataset?
Is the deserialization time related to reading my dataset file? I'm asking because I wouldn't have expected it here: this is stage 1, and it was already present in stage 0.
Job 1 - caching
Since I'm dealing with 3 queries that start from the same dataset, I used cache() at the beginning of the first query. I was wondering why it shows 739.9 MB / 765 [input size / records] ... In the first query it showed 1378.8 MB / 7647431 [input size / records].
I guess it has 11 tasks since the size of the cached dataset is still 1378 MB, but 765 is a really low number compared to the initial 7647431, so I don't think it is really about records/rows, right?
Thanks for reading.
Spark 2.1.0 and Scala 2.11.
Our project has hundreds of unit tests that perform relatively simple operations, such as creating datasets of 3 or 4 objects and applying simple transformations to them. Many of these tests take as long as 5-10 seconds to run, which with hundreds of tests adds up to many minutes and is becoming a problem for our CI build. The operations are so simple that I wonder if there is a Spark configuration we can use to speed things up.
For example, simply creating a dataset like this:
val histData = Seq(
  FooType(id = "id1", code = "code1", orgId = 1L),
  FooType(id = "id2", code = "code2", orgId = 1L),
  FooType(id = "id3", code = "code3", orgId = 1L)
).toDS()
takes 800 ms (FooType is a case class). After creating 2 or 3 datasets like this and applying a few filter/map/join operations (I really don't think those details matter, but if you do, let me know and I will post them), the collect() takes 1000-2000 ms. Add up a few operations like this and a test can take 5-10 seconds.
For the unit tests we are only concerned with the functional aspects; we do not need threading, scaling, caching, on-disk storage, etc. The test data is small (usually less than 1 KB) and is created in memory (not read from disk or any external source), and assertions are performed on the transformed objects in memory. I understand that behind the scenes Spark may be invoking the DAGScheduler, the code generator, etc., and I wonder if there is a way to execute the jobs without that machinery, or, if it does have to run, to run it once at the start of the unit test suite and reuse it throughout.
The session is created with something like this:
session = SparkSession.builder.config("spark.sql.shuffle.partitions","10").getOrCreate()
and the same session is used in each unit test. We invoke the Spark API methods directly from the unit tests, so there is no spark-submit, separate process or job; everything runs in the JVM created by the IDE or by Gradle when it invokes the unit tests.
It just seems to me that these operations should take a few ms each, and I'm looking for a way to pare down the Spark configuration so it evaluates everything in memory in the fastest way possible. Thanks for any tips or ideas.
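Not an authoritative answer, but a sketch of the kind of paring-down that is usually tried in this situation, shown with the Java API. The property keys are real Spark settings; the particular combination and values are just a starting point, and spark.sql.shuffle.partitions=1 assumes the test data really is tiny.

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical holder so the pared-down session is built once per JVM and
// shared by every test; getOrCreate() returns the same session thereafter.
public final class TestSparkSession {

    private static final SparkSession SESSION = SparkSession.builder()
            .master("local[1]")                              // one thread, no cluster overhead
            .config("spark.sql.shuffle.partitions", "1")     // tiny data: one partition is enough
            .config("spark.ui.enabled", "false")             // don't start the web UI
            .config("spark.sql.codegen.wholeStage", "false") // worth experimenting with
            .getOrCreate();

    private TestSparkSession() {
    }

    public static SparkSession get() {
        return SESSION;
    }
}
```

This will not get each operation down to a few ms, since query planning and scheduling still run per query, but it removes shuffle and UI overhead and confines the JVM warm-up cost to the first tests in the suite.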
I am load testing (baseline, capacity, longevity) a bunch of APIs (e.g. user service, player service, etc.) using JMeter. Each of these services has several endpoints (e.g. create, update, delete), and I am trying to figure out a good way to organize my test plans in JMeter so that I can load test all of them.
1) Is it a good idea to create a separate JMeter test plan (.jmx) for each API, rather than creating one test plan and adding thread groups like "Thread Group for User Service", "Thread Group for Player Service", etc.? I was thinking of one test plan per API, each with several thread groups for the different types of load testing (baseline, capacity, longevity, etc.).
2) When JMeter calculates the sample time (response time), does it also include the time taken by BeanShell processors?
3) Is it a good idea to put a listener inside each Simple Controller? I am using JMeter Plugins for reporting, and I want to view the reports for each endpoint.
Answers to any or all of the questions would be much appreciated :)
I am using a structure like the one below when creating a test plan in JMeter.
1) I like a test plan to look like a test suite. JMeter has several ways of separating components and test requirements, so it is hard to set a single rule, but one test plan is likely to be more efficient than several and can be configured to satisfy most requirements. I find there can be a lot of repetition between plans, which often means maintaining the same code in different places. It is better to use Module Controllers within the same plan to reduce code duplication; Include Controllers are equivalent and can be combined with Test Fragments to reduce duplication across plans.
Thread groups are best used as user groups, but they can be used to separate tests any way you please. Consider the scaling you need for different pages/sites: user and administrator tests can go in different thread groups, for example, so you can simulate, say, 50 users and 2 admins testing concurrently. Or you may distinguish front end from back end, or even individual pages/sites.
2) No, the sample time does not include the time taken by BeanShell pre- and post-processors. (If you use a BeanShell sampler, though, it depends on the code.)
3) Listeners are expensive, so fewer is better. To separate the results, you can give each sampler a different title, and the listeners/graphs can then group them as required. You can include timestamps or indexes as part of a sampler title using variables, properties and functions such as ${__javaScript} (e.g. a sampler named Create User ${__threadNum}); this will produce more or fewer groups depending on the naming you choose.
I'm familiarising myself with JMeter and I've thought of something that's either pretty cool or a very dumb idea.
Whilst reading about Listeners I noticed the following:
Note that all Listeners save the same data; the only difference is in the way the data is presented on the screen.
And this:
Graph Results MUST NOT BE USED during load test as it consumes a lot of resources (memory and CPU). Use it only for either functional testing or during Test Plan debugging and Validation.
So I was wondering: if all listeners receive the same data, why not save that data to a CSV or even an XML file, and feed it to a listener afterwards? It would be very resource-friendly to have the Graph Results listener display a graph after the tests are done, instead of while testing.
Am I missing something, or is this a good possibility?
Yes, you can do that, and I think most people use it exactly that way. Instead of CSV or XML files, use the JTL file format to save the results. In the normal scenario one uses the command line to run the test and saves the data to a file (preferably a JTL). After the test is done, you can use the JTL file to generate reports with the JMeter UI or with other tools like this.
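For reference, a typical non-GUI run looks like the following; the file names are placeholders.

```
# -n = non-GUI mode, -t = test plan to run, -l = file to log results to
jmeter -n -t test_plan.jmx -l results.jtl
```

Afterwards, the JTL can be loaded into any listener in the GUI (via the listener's Filename/Browse field), so the graphs are rendered after the run instead of being maintained during it.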
I have unit tests defined for my Visual Studio 2008 solution. These tests are defined in multiple methods and in multiple classes across several files.
I've read in a blog article that when using MSTest, it is a mistake to think that you can depend on the order of execution of your tests:
Execution Interleaving: Since each instance of the test class is instantiated separately on a different thread, there are no guarantees regarding the order of execution of unit tests in a single class, or across classes. The execution of tests may be interleaved across classes, and potentially even assemblies, depending on how you chose to execute your tests. The key thing here is – all tests could be executed in any order, it is totally undefined.
That said, I need a pre-execution step to run before any of these tests. That is, I actually want to define an order of execution somehow. For example: 1) first create the database; 2) test that it was created; then 3) run the remaining 50 tests in arbitrary order.
Any ideas on how I can do that?
I wouldn't test that the database is successfully created; I would assume that all subsequent tests will fail if it is not, and testing it feels a bit like testing the test code.
Regarding a pre-test step to set up the database: you can do that by creating a method and decorating it with the ClassInitialize attribute. That will make the test framework execute the method before any other method within the test class:
[ClassInitialize()]
public static void InitializeClass(TestContext testContext)
{
    // your init code here
}
Unit tests should all work standalone and should not depend on each other; otherwise you can't run a single test in isolation.
Every test that needs the database should then just create it on demand (if it hasn't already been created). You can use a singleton/static class to ensure that, when multiple tests are executed in a batch, the database is only actually created once.
Then it won't matter which test executes first; the database will simply be created the first time a test needs it.
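The ensure-once idea is language-agnostic; here is a minimal sketch of it in Java for illustration (the answer's own context is MSTest/C#, and DatabaseFixture and createSchema() are hypothetical names):

```java
// Hypothetical fixture: every test calls ensureCreated() before touching the
// database, and the schema is created at most once per test run, regardless
// of which test happens to execute first.
public final class DatabaseFixture {

    private static boolean created = false;

    private DatabaseFixture() {
    }

    public static synchronized void ensureCreated() {
        if (!created) {
            createSchema(); // hypothetical: run DDL scripts, seed data, etc.
            created = true;
        }
    }

    private static void createSchema() {
        // create tables, indexes, seed rows ...
    }
}
```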
In theory, it is correct that tests should be independent of each other and able to run standalone. But in practice there is a difference between theory and practice, and VS2010 gives me a hard time with its fixed order of execution (a random order that is always the same).
Here are some examples:
I have a unit test that cross-checks the dates between some tables and verifies that everything is in agreement. Obviously it is of no use to run this test on an empty database, so I want it to run SOME TIME AFTER the unit test that inserts data. Sorry, VS2010 doesn't let you do this.
OK, fine, then I will add it to the insert unit test as an epilogue. But then I want to cross-check ten other things, and instead of having a focused unit test ("make sure that entities with various parameters can be inserted without crashes") I end up with a mega-test.
Then another case.
My unit test inserts entities, just inserts, to make sure that this part of the logic works OK. Then I have a multi-threaded version of the test, to make sure that there are no deadlocks and the like. Clearly I need the multi-threaded test to run SOME TIME AFTER the single-threaded test, and ONLY if the single-threaded test succeeds. Sorry, VS2010 can't do this.
Another case. I have a unit test that deletes ALL entities of a given kind in the database. This should result in a bunch of empty tables and lots of zeros in other tables. Clearly it is useless to run it on an empty database, so the test inserts 10,000 entities if it finds the DB empty. However, if it runs AFTER the multi-threaded test, it will find 250,000 entities, and deleting ALL of them takes TIME. Sorry, VS2010 won't let me do anything about it.
The funny thing is that, because of this situation, my unit tests slowly started turning into mega-tests that took more than 30 minutes each to complete, at which point VS2010 would time them out, because the default test timeout is 30 minutes. OMG, please help! :-)