The effect of println statements in production? - performance

I was reviewing the Grails application code and I found println statements in a lot of places. These were used for debugging. I am wondering: does leaving these statements in affect the production app's performance?

Yes, it affects performance in a production environment, because println statements are synchronous: execution does not move forward until each println call completes. If you are printing large objects such as a Map, a List, or file contents, this adds execution time and increases the log file size as well, so it will definitely affect your production performance.
If you want to keep the logging, the better approach is to use a library such as Log4j, which supports asynchronous appenders, for auditing the important logs in your application.
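As a minimal sketch (assuming plain Log4j 1.x; the class name and message are illustrative), the replacement for a println looks like this:

import java.util.Map;
import org.apache.log4j.Logger;

public class OrderService {
    private static final Logger log = Logger.getLogger(OrderService.class);

    void process(Map<String, Object> order) {
        // Instead of: println "processing ${order}"
        // A leveled call can be filtered out (or routed to an asynchronous
        // appender) in production, and the guard skips rendering large
        // objects entirely when DEBUG is off.
        if (log.isDebugEnabled()) {
            log.debug("processing " + order);
        }
    }
}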
Log4j reference

Related

Spark Performance Tuning Question - Resetting all caches for performance testing

I'm currently working on performance and memory tuning for a Spark process. As part of this I'm performing multiple runs of different versions of the code and trying to compare their results side by side.
I've got a few questions to ask, so I'll post each separately so they can be addressed separately.
Currently, it looks like getOrCreate() is re-using the Spark Context each run. This is causing me two problems:
Caching from one run may be affecting the results of future runs.
All of the tasks are bundled into a single 'job', and I have to guess at which tasks correspond to which test run.
I'd like to ensure that I'm properly resetting all caches in each run to ensure that my results are comparable. I'd also ideally like some way of having each run show up as a separate job in the local job history server so that it's easier for me to compare.
I'm currently relying on spark.catalog.clearCache() but not sure if this is covering all of what I need. I'd also like a way to ensure that the tasks for each job run are clearly grouped in some fashion for comparison so I can see where I'm losing time and hopefully see total memory used by each run as well (this is one thing I'm currently trying to improve on).
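A sketch of one way to approach both problems (Java API; the run IDs and app name are illustrative, and whether stopping the session between runs is acceptable depends on your harness): clear the cache, label each run with a job group so its tasks show up together in the UI, and stop the session so the next run starts clean.

import org.apache.spark.sql.SparkSession;

public class BenchmarkRunner {
    static void runOnce(String runId) {
        // getOrCreate() alone would reuse the previous context; stopping the
        // session at the end of each run (below) forces a fresh one here.
        SparkSession spark = SparkSession.builder()
                .appName("perf-test-" + runId)
                .getOrCreate();
        try {
            // Label this run's jobs so they are grouped in the UI/history server.
            spark.sparkContext().setJobGroup(runId, "perf test run " + runId, false);
            spark.catalog().clearCache();  // drop cached tables/DataFrames
            // ... run the workload under test here ...
        } finally {
            spark.stop();  // tear the context down so the next run starts clean
        }
    }
}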

executePackage seems to take a long time to launch subpackage

I am a relative beginner at SSIS so I may be doing something silly.
I have a process that involves looping over a heterogeneous queue and processing the objects one at a time. The process is currently done in 'set logic' and it's dropping stuff. I was asked to rework it in a looping manner, so that decision has been made for me.
I have chosen to implement queue logic in 1 package and the actual processing in another package.
This is all going relatively well considering...
I now have the process up and running, but it's slow. 9 seconds per item. Clearly I can't present this solution. :-)
One thing I notice: 1.5 to 2 seconds of each loop are spent on the Execute Package Task in the queue loop.
I can't figure out how to get a hard number; I am using the flashing green box method of performance tuning. The other steps seem to be very fast. Adding indexes, changing SQL to stored procedures, all the usual tricks have helped.
Is the UI reliable at all with regard to boxes turning white/yellow/green? Some tasks report times in the progress tab, some don't seem to. So I am counting yellow time.
Should calling a subpackage be that expensive? One change I made was setting 'RunInASeparateProcess' to FALSE. I did that because the subpackage otherwise produces the following message:
Error: 0xC0012024 at Script Task: The task "Script Task" cannot run on this edition of Integration Services. It requires a higher level edition.
Task failed: Script Task
The reading I have done seems to advocate multiple packages. Does anyone have any counter-patterns? Should I stay the course? I started changing to one package; copy/paste doesn't seem to work well with Sequence Containers, and I would also need to recreate all the variables in the parent package. Doable, but I'm not sure that is the answer.
Does anyone know of any tuning resources/websites/books they would be willing to share?
Update - I have been tearing things down in an effort to figure out what the problem is. I was thinking it was the package configurations passing variable values, but I don't think that is it: I can pass variables to another package with nothing in it and it is fast.
I can make the trivial subpackage slow by adding the two connection managers to it.
I suddenly realize I may be making and breaking a connection to both an Oracle Server and a SQL server in both the main package and then the sub package.
Am I correct in this observation?
Is there any way I can reuse the connection between the two packages?
When I google it, most of what I see are suggestions for passing the connection string.
UPDATE - I combined the two packages into one. Performance is now about 1.25 seconds per item, down from about 9. The only thing I can point to that changed is that I am now reusing a single connection instead of making multiple connections.
Thanks, I appreciate any help you are kind enough to offer.
Greg
Once you enable logging, I'd suggest running the package from a command window using dtexec. While that doesn't perfectly duplicate the server environment, it does have the advantages of (a) eliminating BIDS as a potential performance issue and (b) being something you can do without jumping through change control hoops.
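For example (the package path is hypothetical; /F names the package file to run):

dtexec /F "C:\Packages\QueueLoop.dtsx"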

Effect of logging on Apache's performance

I am developing an Apache module. During development I found it convenient to use logging functions at various points in my code.
For example after opening a file, I would log an error if the operation was not successful so that I may know exactly where the problem occurred in my code.
Now, I am about to deliver my module to my boss (I am on an internship). I wanted to know what the best practices are regarding logging. Is it good for maintenance purposes, or is it bad because it may hamper the response time of the server?
It really depends on how you wrote those logging instructions. If you wrote:
logger.debug(computeSomeCostlyDebugOutput());
You might affect performance badly even if the logger is not set to the DEBUG level: computeSomeCostlyDebugOutput will always take time to execute, and its result is then simply discarded by the logger if it doesn't match the DEBUG level.
If you write it like this instead:
if (logger.isDebugEnabled()) {
    logger.debug(computeSomeCostlyDebugOutput());
}
then the costly operations and logging will occur only if the correct logger level is set (i.e. the logger won't ignore it). It basically acts like another switch for the logger, the first switch being the configured logger level.
As Andrzej Doyle very well pointed out, the logger will check its level internally, but this happens inside the debug method, after time has already been wasted in computeSomeCostlyDebugOutput. If you are going to spend time in computeSomeCostlyDebugOutput, you'd better do it when you know its result won't be in vain.
A lot of software ships with logging instructions which you can activate if you need more details into the inner workings of the thing, execution path and stuff. Just make sure they can be deactivated and only take computing time if the appropriate level is set.
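To illustrate that "switch" (assuming classic Log4j 1.x; the class name is illustrative), the level can also be raised programmatically so the statements stay in the code but become cheap no-ops:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class LoggingSetup {
    public static void quietForProduction() {
        // With the root level at WARN, debug()/trace() calls return almost
        // immediately, and isDebugEnabled() guards skip the costly argument
        // computation entirely.
        Logger.getRootLogger().setLevel(Level.WARN);
    }
}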
One of the design goals of Log4J (for good reason) was performance; Ceki Gülcü wanted the library to be usable in production, enterprise software, and the overhead of Log4J itself is actually pretty minimal (measured on my own webapp project with a profiler).
Two things that are liable to take some time though, are:
Forming the arguments to pass into the logging method, for some calls. As dpb says, this should be avoided by wrapping any computation of complex output in a logging check so you're not deriving complex debug output when the logger's going to throw it away as it's set to only record errors.
The sheer I/O required to record the log data. If you've got your application logging 200Mb of debug logs per second (it might sound infeasible, but it's happened to me before) then that's likely to put a strain on how fast it runs, as IIRC the file writing happens synchronously. In various webapp-type projects I've developed, I can usually actually notice a slight difference in responsiveness when I set the logs to debug level (and IMO this is a good thing, as you should be generating a lot of output when you ask for debug logs).

Is it bad practice to run tests on a database instead of on fake repositories?

I know what the advantages are and I use fake data when I am working with more complex systems.
What if I am developing something simple, I can easily set up my environment with a real database, the data being accessed is so small that access time is not a factor, and I am only running a few tests?
Is it still important to create fake data or can I forget the extra coding and skip right to the real thing?
When I said real database I do not mean a production database, I mean a test database, but using a real live DBMS and the same schema as the real database.
The reasons to use fake data instead of a real DB are:
Speed. If your tests are slow you aren't going to run them. Mocking the DB can make your tests run much faster than they otherwise might.
Control. Your tests need to be the sole source of your test data. When you use fake data, your tests choose which fakes you will be using. So there is no chance that your tests are spoiled because someone left the DB in an unfamiliar state.
Order Independence. We want our tests to be runnable in any order at all. The input of one test should not depend on the output of another. When your tests control the test data, the tests can be independent of each other.
Environment Independence. Your tests should be runnable in any environment. You should be able to run them while on the train, or in a plane, or at home, or at work. They should not depend on external services. When you use fake data, you don't need an external DB.
Now, if you are building a small little application, and by using a real DB (like MySQL) you can achieve the above goals, then by all means use the DB. I do. But make no mistake, as your application grows you will eventually be faced with the need to mock out the DB. That's OK, do it when you need to. YAGNI. Just make sure you DO do it WHEN you need to. If you let it go, you'll pay.
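As a minimal sketch of what "fake data" looks like in practice (the repository interface and Customer type here are hypothetical), the test double is often just an in-memory map:

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

interface CustomerRepository {
    Optional<Customer> findById(long id);
    void save(Customer customer);
}

// The fake gives each test full control of its data: fast, order-independent,
// and runnable with no external DB.
class InMemoryCustomerRepository implements CustomerRepository {
    private final Map<Long, Customer> store = new HashMap<>();

    public Optional<Customer> findById(long id) {
        return Optional.ofNullable(store.get(id));
    }

    public void save(Customer customer) {
        store.put(customer.getId(), customer);
    }
}

class Customer {
    private final long id;
    private final String name;
    Customer(long id, String name) { this.id = id; this.name = name; }
    long getId() { return id; }
    String getName() { return name; }
}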
It sort of depends on what you want to test. Often you want to test the actual logic in your code, not the data in the database, so setting up a complete database just to run your tests is a waste of time.
Also consider the amount of work that goes into maintaining your tests and your test database. Testing your code against a database often means you are testing your application as a whole instead of the different parts in isolation, which often results in a lot of work keeping both the database and the tests in sync.
And the last problem is that the test should run in isolation so each test should either run on its own version of the database or leave it in exactly the same state as it was before the test ran. This includes the state after a failed test.
Having said that, if you really want to test against your database you can. There are tools that help with setting up and tearing down a database, like DbUnit.
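A rough sketch of that approach (the JDBC URL and dataset file are hypothetical): load a known DbUnit dataset before each test so every run starts from the same state.

import java.sql.Connection;
import java.sql.DriverManager;
import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.operation.DatabaseOperation;

public class TestDatabase {
    public static void reset() throws Exception {
        Connection jdbc = DriverManager.getConnection(
                "jdbc:h2:mem:testdb", "sa", "");  // hypothetical test DB
        IDatabaseConnection db = new DatabaseConnection(jdbc);
        try {
            IDataSet fixtures = new FlatXmlDataSetBuilder()
                    .build(TestDatabase.class.getResourceAsStream("/fixtures.xml"));
            // Empty the dataset's tables, then insert the fixtures, so each
            // test sees exactly the same starting state.
            DatabaseOperation.CLEAN_INSERT.execute(db, fixtures);
        } finally {
            db.close();
        }
    }
}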
I've seen people try to create unit tests like this, but almost always it turns out to be much more work than it is actually worth. Most abandoned it halfway through the project, and many abandoned TDD completely, assuming the experience transfers to unit testing in general.
So I would recommend keeping tests simple and isolated, and encapsulating your code well enough that it becomes possible to test it in isolation.
As long as the real DB does not get in your way, and you can go faster that way, I would be pragmatic and go for it.
In unit-test, the "test" is more important than the "unit".
I think it depends on whether your queries are fixed inside the repository (the better option, IMO), or whether the repository exposes composable queries; for example - if you have a repository method:
IQueryable<Customer> GetCustomers() {...}
Then your UI could request:
var foo = GetCustomers().Where(x => SomeUnmappedFunction(x));

bool SomeUnmappedFunction(Customer customer) {
    return customer.RegionId == 12345 && customer.Name.StartsWith("foo");
}
This will pass for an object-based fake repo, but will fail for actual db implementations. Of course, you can nullify this by having the repository handle all queries internally (no external composition); for example:
Customer[] GetCustomers(int? regionId, string nameStartsWith, ...) {...}
Because this can't be composed, you can check the DB and the UI independently. With composable queries, you are forced to use integration tests throughout if you want it to be useful.
It rather depends on whether the DB is automatically set up by the test, and also whether the database is isolated from other developers.
At the moment it may not be a problem (e.g. only one developer). However, if the database has to be set up manually, that is an extra impediment to running the tests, and this is a very bad thing.
If you're just writing a simple one-off application that you absolutely know will not grow, I think a lot of "best practices" just go right out the window.
You don't need to use DI/IOC or have unit tests or mock out your db access if all you're writing is a simple "Contact Us" form. However, where to draw the line between a "simple" app and a "complex" one is difficult.
In other words, use your best judgment, as there is no hard-and-fast answer to this.
It is OK to do that for this scenario, as long as you don't see them as "unit" tests: those would be integration tests. You also want to consider whether you will be manually testing through the UI again and again, as you might just automate your smoke tests instead. Given that, you might even consider not doing the integration tests at all and just working at the functional/UI test level (as they will already be covering the integration).
As others have pointed out, it is hard to draw the line between complex and not complex, and you usually only know once it is too late :(. If you are already used to writing these tests, I am sure you won't incur much overhead. If not, you could learn from it :)
Assuming that you want to automate this, the most important thing is that you can programmatically generate your initial condition. It sounds like that's the case, and even better, you're testing real-world data.
However, there are a few drawbacks:
Your real database might not cover certain conditions in your code. With fake data, you can force that behavior to happen.
And as you point out, you have a simple application; when it becomes less simple, you'll want to have tests that you can categorize as unit tests and system tests. The unit tests should target a simple piece of functionality, which will be much easier to do with fake data.
One advantage of fake repositories is that your regression / unit testing is consistent since you can expect the same results for the same queries. This makes it easier to build certain unit tests.
There are several disadvantages if your code modifies data (i.e. is not read-query only):
- If you have an error in your code (which is probably why you're testing), you could end up breaking the production database, or at least leaving it in a questionable state even if nothing obviously broke.
- If the production database changes over time, especially while your code is executing, you may lose track of the test data you added and have a hard time cleaning it out of the database later.
- Production queries from other systems accessing the database may treat your test data as real data, and this can corrupt the results of important business processes somewhere down the road. For example, even if you marked your data with a certain flag or prefix, can you assure that anyone accessing the database will adhere to this schema?
Also, some databases are regulated by privacy laws, so depending on your contract and who owns the main DB, you may or may not be legally allowed to access real data.
If you need to run against a production database, I would recommend running against a copy, which you can easily create during off-peak hours.
If it's a really simple application and you can't see it growing, I see no problem running your tests on a real DB. If, however, you think this application will grow, it's important that you account for that in your tests.
Keep everything as simple as you can, and if you require more flexible testing later on, make it so. Plan ahead though, because you don't want to have a huge application in 3 years that relies on old and hacky (for a large application) tests.
The downsides to running tests against your database are lack of speed and the complexity of setting up your database state before running the tests.
If you have control over this there is no problem in running the tests directly against the database; it's actually a good approach because it simulates your final product better than running against fake data. The key is to have a pragmatic approach and see best practice as guidelines and not rules.

How to keep your own debug lines without checking them in?

When working on some code, I add extra debug logging of some kind to make it easier for me to trace the state and values that I care about for this particular fix.
But if I checked this in to the source code repository, my colleagues would get angry with me for polluting the log output and polluting the code.
So how do I locally keep these lines of code that are important to me, without checking them in?
Clarification:
Many answers relate to the log output and point out that you can filter it with log levels, and I agree with that.
But I also mentioned the problem of polluting the actual code. If someone puts a log statement between every other line of code, printing the value of every variable all the time, it really makes the code hard to read. So I would really like to avoid that as well, basically by not checking in the logging code at all. So the question is: how do you keep your own special-purpose log lines so you can use them in your debug builds, without cluttering up the checked-in code?
If the only objective of the debugging code you are having problems with is to trace the values of some variables, I think what you really need is a debugger. With a debugger you can watch the state of any variable at any moment.
If you cannot use a debugger, then you can add some code to print the values to some debug output. But this should be only a few lines whose objective is to make the fix you are working on easier. Once it's committed to trunk, the fix is done, and you shouldn't need those debug lines any more, so you must delete them. Don't delete all the debug code (good debug code is very useful); delete only your "personal" tracing debug code.
If the fix is so long that you want to save your progress by committing to the repository, then what you need is a branch; in this branch you can add as much debugging code as you want, but you should still remove it when merging into trunk.
But if I checked this in to the source code repository, my colleagues would get angry with me for polluting the log output and polluting the code.
I'm hoping that your Log framework has a concept of log levels, so that your debugging could easily be turned off. Personally I can't see why people would get angry at more debug logging - because they can just turn it off!
Why not wrap them in preprocessor directives (assuming the construct exists in the language of your choice)?
#if DEBUG
    logger.debug("stuff I care about");
#endif
Also, you can use a log level like trace, or debug, which should not be turned on in production.
if (logger.isTraceEnabled()) {
    logger.trace("My expensive logging operation");
}
This way, if something in that area does crop up one day, you can turn logging at that level back on and actually get some (hopefully) helpful feedback.
Note that both of these solutions would still allow the logging statements to be checked in, but I don't see a good reason not to have them checked in. I am providing solutions to keep them out of production logs.
If this was really an ongoing problem, I think I'd assume that the central repository is the master version, and I'd end up using patch files to contain the differences between the official version (the last one that I worked on) and my version with the debugging code. Then, when I needed to reinstate my debug, I'd check out the official version, apply my patch (with the patch command), fix the problem, and before check in, remove the patch with patch -R (for a reversed patch).
However, there should be no need for this. You should be able to agree on a methodology that preserves the information in the official code line, with mechanisms to control the amount of debugging that is produced. And it should be possible regardless of whether your language has conditional compilation in the sense that C or C++ does, with the C pre-processor.
I know I'm going to get negative votes for this...
But if I were you, I'd just build my own tool.
It'll take you a weekend, yes, but you'll keep your coding style, and your repository clean, and everyone will be happy.
Not sure what source control you use. With mine, you can easily get a list of the things that are "pending to be checked in". And you can trigger a commit, all through an API.
If I had that same need, I'd write a program to do the commit instead of using the built-in command in the source control GUI. The program would go through the list of pending things, take all the files you added/changed, make a copy of them, remove all the log lines, commit, and then replace them with your original versions.
Depending on what your log lines look like, you may have to add a special comment at the end of them for your program to recognize them.
Again, shouldn't take too much work, and it's not much of a pain to use later.
I don't expect you'll find something already built that does this for you (and for your source control); it's pretty specific, I think.
Similar to the #if DEBUG ... #endif approach above, but that will still mean that anyone running with the 'Debug' configuration will hit those lines.
If you really want them skipped then use a log level that no one else uses, or....
Create a separate build configuration called MYDEBUGCONFIG (with a conditional compilation symbol of the same name) and then put your debug code between blocks like this:
#if MYDEBUGCONFIG
    ...your debugging code
#endif
What source control system are you using? Git allows you to keep local branches. If worse comes to worst, you could just create your own 'Andreas' branch in the repository, though branch management could become pretty painful.
If you really are doing something like:
putting a log statement between every other line of code, to print the value of every variable all the time, making the code hard to read,
then that's the problem. Consider using a test framework instead, and write the debug code there.
On the other hand, if you are writing just a few debug lines, then you can manage them by hand (e.g. removing the relevant lines with the editor before the commit and undoing the change after it's done) - but of course this has to be very infrequent!
IMHO, you should avoid the #if solution. That is the C/C++ way of doing conditional debugging. Instead, mark all of your logging/debugging functions with the ConditionalAttribute. The constructor of the attribute takes a string, and the method will only be called if a pre-processor definition with the same name as that string is defined. This has exactly the same runtime implications as the #if/#endif solution, but it looks a heck of a lot better in code.
This next suggestion is madness, do not do it, but you could...
Surround your personal logging code with comments such as
// ##LOG-START##
logger.print("OOh A log statement");
// ##END-LOG##
And before you commit your code, run a shell script that strips out your logs.
I really wouldn't recommend this, as it's a rubbish idea, but that never stops anyone.
Alternatively, you could just add a comment at the end of every log line and have a script remove those lines...
logger.print("My inane log message"); //##LOG
Personally I think that using a proper logging framework with a debug logging level etc should be good enough. And remove any superfluous logs before you submit your code.
Treat it as first-class code: keep it with the code, use a proper logging API, and provide a build option to compile it out or disable it completely.
