Calling a Perl script with system vs. implementing a package: performance

Let me start with giving an example of what I'm dealing with first:
I often call existing Perl scripts written by previous engineers to process some data, and then proceed further with my own script. I use either system or backticks to call other people's scripts from within my script.
Now I'm wondering: if I rewrite those scripts as packages and use require or use to include them in my script, will it increase the processing speed? How big a difference would it make?

Benefits:
It would save the time taken to load the shell, load perl, compile the script and the module it uses. That's a couple of seconds minimum, but it could be much larger.
If you had to serialize data to pass to the child, you also save the time taken to serialize and deserialize the data.
It would allow more flexible interfaces.
It would make error handling easier and more flexible.
Downsides:
Since everything is now in the same process, the child can have a much larger effect on the parent. For example, a crash in the child will take down the parent.
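To get a feel for the per-call cost described above, here is a minimal sketch. It is written in Python rather than Perl, purely as an illustration of the same trade-off; `process`, `CHILD_SOURCE`, and `timed` are hypothetical names, and the actual numbers depend on the machine and on how much the child script has to load.

```python
import subprocess
import sys
import time


def process(data):
    """Stand-in for the work an existing script does (hypothetical)."""
    return data.upper()


# The same work packaged as a standalone script that runs in a child interpreter.
CHILD_SOURCE = "import sys; sys.stdout.write(sys.stdin.read().upper())"


def via_child_process(data):
    """Run the work in a separate interpreter, passing data over stdin/stdout."""
    # Every call pays for process creation, interpreter start-up and compilation.
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=data,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def timed(func, data, repeats=5):
    """Return the last result and the average wall-clock seconds per call."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(data)
    return result, (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    sample = "some data\n"
    _, child_seconds = timed(via_child_process, sample)
    _, call_seconds = timed(process, sample)
    print(f"child process: {child_seconds * 1000:.3f} ms per call")
    print(f"in-process:    {call_seconds * 1000:.3f} ms per call")
```

Measuring both paths on real data is probably the only reliable way to answer "how big of a difference", since the start-up cost is fixed per call while the useful work varies.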

Related

Performance implications of function calls in PSM1 Modules

I have a function that does a find/replace on text files, and it has worked well for some time. Until I needed to process a 12 million line file.
My initial code used Get-Content and Write-Content, and with the massive file it was going to take hours to process, not to mention the memory implications of loading 12 million lines into RAM.
So, I wrote a little test script to compare that approach with StreamReader/StreamWriter. Streaming looked like it was going to be a massive performance improvement, dropping processing time to 30 seconds. I then added a .Replace() on each line, and total processing time only went up to maybe a minute. All good. So then I went to implement it in my real code, and performance tanked again. That code is a PS1 that loads a number of PSM1 files. The function that does the find/replace is in one of those PSM1 files, and that code calls functions in another PSM1. The test script had everything in a single small PS1.
Given that my test script didn't use a function call at all, I tested that first, so there is a function in the PS1 that gets called 12 million times from the loop in the same PS1. No real performance impact.
So, my thought then was that calling a function in one PSM1 that then calls a function in another PSM1 (12 million times) might be the issue. So I made a dummy function (which just returns the passed string, as if no replacement was needed) in the same PSM1 as the loop. And that is orders of magnitude slower.
I have not tested this with everything in the PS1, mostly because these functions are needed in three different scripts with very different argument requirements, so implementing it with Modules really made a lot of sense logistically, and changing that would be a massive undertaking.
That said, is there a known performance hit when calling a function that lives in a module? I was under the impression that once the modules are loaded, it's basically the same as if everything were in a single PS1, but perhaps not? FWIW, I am not using namespaces. All of my functions just have a prefix on the noun side of the function name to avoid conflicts.
I also can't really post minimally functional code very easily since that's in a single file that doesn't exhibit the behavior. If there is no obvious answer to someone I guess my next step is to implement the test script with some modules, but that's not really apples to apples either, since my real modules are rather large.
To add a little context: When the function (in a PSM1) does not call a function and simply sets $writeLine = $originalLine total time is 15 seconds.
When doing an actual find and replace inline (no call to a function) like this $writeLine = $originalLine.Replace($replace, $with) total processing time is 16 seconds.
When calling a function in the same PSM1 that just returns the original string total time is 17 minutes.
But again, when it's all in a PS1 file with no modules, calling a function has minimal impact. So it certainly seems like calling a function in a PSM1, even from a function in that same PSM1, has a massive performance overhead.
And more context:
I moved the replace function in the test script into a Module. No appreciable change. So I moved the main code, including the loop, into a function in that module, and called it from the main script. Again, no real change. Both took around 15 seconds.
So, it's not something innate in modules. That raises the question: what could I be doing in my other modules that would trigger this behavior? These modules are 3,000-10,000 lines of code, so there is a lot going on. Hopefully someone has some insight as to best practices with modules to mitigate this. And hopefully it's not "Don't use big modules". ;)
Final update:
It seems it IS a function of how big the module is. I deleted all the other functions in the module that contains the loop, and performance is fine: 17 seconds. So, basically, even as of PS5.0, the implementation of modules is pretty useless for anything large. Rather disconcerting. I wonder if the same would be true if all the functions were in a single file, and PowerShell performance with large files with lots of functions is just bad? Anyone have any experience down this road?

How can I pass Selenium WebDriver objects between separate Ruby processes?

I want to pass an instance of an object between two Ruby processes. Specifically, I want to pass an instance of a Selenium WebDriver from one process to another process. The reason I want to do this is because it takes a lot of time for Ruby to create this object, but I want it to be used by the other process.
I've found some related questions here and here that seem to point towards using DRb, but I've been unable to find any useful examples or sample code.
Is there a tool other than DRb that I should be using? Does anyone have an example similar to this that I could copy from?
It looks like you're going to have to use DRb, although the documentation for it seems to be lacking. There is however an interesting article here. You might also want to consider purchasing The dRuby Book by Masatoshi Seki to get a better idea of how to do this effectively.
Another option to investigate, if you do not need simultaneous access and just want to send the object from one process to another, is to serialize the object (that is, encode it in a way that Ruby can read back) with YAML (for a human-readable file) or Marshal (for a binary-encoded file) and send it through a pipe. This was mentioned in another answer that has since been deleted.
Note that either of these solutions requires modifying the Selenium code heavily, since the objects you want to manipulate natively support neither copying nor simultaneous access.
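As a rough illustration of the serialize-and-pipe idea and of its limits, here is a minimal sketch in Python, where `pickle` plays the role that Marshal plays in Ruby. The function names and sample values are made up, and the open file handle is only a stand-in for whatever live connection a driver object holds; the sketch does not establish whether a real WebDriver instance can be serialized.

```python
import os
import pickle


def can_serialize(obj):
    """Return True if pickle can turn obj into bytes."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


def send_through_pipe(obj):
    """Serialize obj, write the bytes into an OS pipe, rebuild it from the read end.

    Both ends live in one process here to keep the sketch short; in real use the
    read end would belong to the other process. Small payloads only: a large one
    would need a reader running concurrently so the pipe buffer does not fill up.
    """
    read_fd, write_fd = os.pipe()
    with os.fdopen(write_fd, "wb") as writer:
        writer.write(pickle.dumps(obj))
    with os.fdopen(read_fd, "rb") as reader:
        return pickle.loads(reader.read())


if __name__ == "__main__":
    settings = {"browser": "firefox", "timeout": 30}  # plain data, made-up values
    print("plain data serializable:", can_serialize(settings))
    print("round trip equal:", send_through_pipe(settings) == settings)

    # An object that wraps a live OS resource cannot simply be copied across.
    with open(os.devnull, "rb") as handle:
        session = {"settings": settings, "connection": handle}
        print("live handle serializable:", can_serialize(session))
```

Plain data survives the round trip, while anything holding an open handle is rejected by the serializer, which is the practical obstacle with objects that wrap a running browser session.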
TL;DR
Most queue or distributed processes are going to require some sort of serialization to work properly. If you want to pass objects rather than messages, then this will be a limiting factor in how you approach the problem.
DRb
I don't know if you can marshal a WebDriver object. If you can't, then DRb may be a good choice for your distributed Ruby programs because it supports DRbObject references for things that can't be marshaled. There are some examples provided in the DRb documentation.
Selenium Wire Protocol
Depending on what you're really trying to do, it may be worth taking a closer look at using the remote bindings for the Remote WebDriver client/server, or Selenium's JSON Wire Protocol as an alternative to passing objects between processes.
Other Alternatives: Fixtures, Factories, Stubs, and Mocks
Whether or not these work in your specific case will depend a lot on why you want to pass objects instead of simply driving the remote server. If it's largely an issue of how long it takes to build your object, then the serialization/de-serialization cycle may not necessarily be faster in all cases.
You might want to revisit why your object is so slow to create. If gathering and processing the data for it is what's taking too long, you can use some sort of test fixture or factory to trim that time, either by using a smaller set of fixed data, or using a pre-serialized object that's optimized for speed.
You might also consider whether you actually need real data or objects for your test at all. In many cases, you can speed up your tests a lot by stubbing methods or creating mock objects that will return the values you need for your integration tests without needing to perform expensive calculations or long-running operations.
There are certainly cases where you need to drive the full stack and perform acceptance tests on real data. Even then, you may be able to devise a set of fixture data that will take less time or memory to process. It's certainly worth at least thinking about.

calling shell commands from code by design?

The Unix philosophy teaches that we should develop small programs that do one thing well. It also teaches that we should separate policy from mechanics. I guess one way to take this is to design a text-based shell command first and build a gui on top of that later (if desired).
I truly like the idea that small programs can be composed (piped together) into more complex systems. I also like the fact that simple, focused designs should theoretically need less maintenance than a monolithic system that binds all its rules together.
How sound would it be to program something (in Ruby or Python, for example) that delegates some of its functionality to shell commands called straight from the code? Taking this a step further, does it make sense to deliberately design a shell command that is intended to be called directly from code (compiled or scripted)? Obviously, this would only make sense if the shell command had some worthy console use.
I can't say from my experience that this is a practice I've seen much of. More times than not task-specific code relies on task-specific libraries. Of course, it's possible that, unbeknownst to me, I have made use of libraries which are actually just wrappers around shell commands. (Or rather the shell command is a wrapper around some library.)
The Unix paradigm is modularity. You should write your program as a bunch of modules, which can then be extracted into multiple programs if you want to. However, executing a new program whenever you'd like to make a function call is slow and impractical.

How to test-drive software that uses external command-line tools

I'm trying to figure out how to test-drive software that launches external processes that take file paths as input and, after lengthy processing, write output to stdout or to some file. Are there common patterns for writing tests in this kind of situation? It is hard to create fast-executing tests that verify correct usage of external tools without launching the actual tools in the tests and inspecting the results.
You could memoize (http://en.wikipedia.org/wiki/Memoization) the external processes. Write a wrapper in Ruby that computes the md5 sum of the input file and checks it against a database of known checksums. If it matches one, copy over the right output; otherwise, invoke the tool normally.
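A minimal sketch of that memoizing wrapper, in Python rather than Ruby. An in-memory dictionary stands in for the database of known checksums, and `run_tool_memoized` and `file_md5` are hypothetical names; the convention that the tool takes the input path as its last argument and writes its result to stdout is also an assumption.

```python
import hashlib
import os
import subprocess
import sys
import tempfile

# Stand-in for the "database of known checksums": (command, md5) -> tool output.
_known_outputs = {}


def file_md5(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_tool_memoized(command, input_path, cache=None):
    """Return the tool's stdout for input_path, invoking the tool only on a cache miss."""
    cache = _known_outputs if cache is None else cache
    key = (tuple(command), file_md5(input_path))
    if key not in cache:
        completed = subprocess.run(
            list(command) + [input_path],
            capture_output=True,
            text=True,
            check=True,
        )
        cache[key] = completed.stdout
    return cache[key]


if __name__ == "__main__":
    # A tiny Python one-liner plays the part of the slow external tool.
    fake_tool = [sys.executable, "-c",
                 "import sys; sys.stdout.write(open(sys.argv[1]).read().upper())"]
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "input.txt")
        with open(sample, "w") as handle:
            handle.write("hello")
        print(run_tool_memoized(fake_tool, sample))  # runs the tool
        print(run_tool_memoized(fake_tool, sample))  # served from the cache
```

Keying on the command as well as the checksum keeps outputs from different tools, or from the same tool with different flags, from being mixed up.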
Test right up to your boundaries. In your case, the boundary is the command-line that you construct to invoke the external program (which you can capture by monkey patching). If you're gluing yourself in to that program's stdout (or processing its result by reading files) that's another boundary. The test is whether your program can process that "input".
The 90%-case answer would be to mock the external command-line tools and verify that the right input is being passed to them at the dividing interface between the two. This helps keep the test suite fast. Also, you shouldn't have to bring in the command-line tools, since they are not 'your code under test'; doing so brings in the possibility that the unit test could fail due either to changes in your code or to some change in behavior in the command-line utility.
But it seems like you're having trouble defining the 'right input', in which case using optimizations like memoization (as Dave suggests) might give you the best of both worlds.
Assuming the external programs are well-tested, you should just test that your program is passing the correct data to them.
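The "mock the tool and check what gets passed to it" approach can be sketched as follows, in Python with the standard `unittest.mock` module. The `converter` command, its `--in`/`--out` flags, and the `convert` function are all hypothetical; only the pattern is the point.

```python
import subprocess
from unittest import mock


def convert(input_path, output_path):
    """Code under test: builds the command line for a hypothetical 'converter' tool."""
    completed = subprocess.run(
        ["converter", "--in", input_path, "--out", output_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


def test_convert_builds_expected_command():
    # Canned result, so the real tool is never launched.
    fake_result = subprocess.CompletedProcess(
        args=[], returncode=0, stdout="done\n", stderr=""
    )
    with mock.patch("subprocess.run", return_value=fake_result) as fake_run:
        result = convert("a.txt", "b.txt")

    # Boundary 1: the command line handed to the external tool.
    fake_run.assert_called_once_with(
        ["converter", "--in", "a.txt", "--out", "b.txt"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Boundary 2: how the tool's output is processed.
    assert result == "done"


if __name__ == "__main__":
    test_convert_builds_expected_command()
    print("ok")
```

A test like this runs in milliseconds and fails only when your own code changes, at the price of not noticing when the real tool's behavior changes.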
I think you are running into a common issue with unit testing: correctness is really determined by whether the integration works, so how does the unit test help you?
The basic answer is that the unit test checks that the parameters you intend to pass to the command-line tool are in fact getting passed that way, and that the results you anticipate getting back are in fact processed the way you intend to process them.
Then there is a second level of tests, which may or may not be automated (preferably they are, but it does depend on if it is practical), which are at the functional level where the real utilities are called so that you can see that what you intend to pass and what you anticipate getting back match what actually happens.
There would also be nothing wrong with a set of tests which "tests" the external tools (which perhaps run on a different schedule, or only when you upgrade those tools) which establish your assumptions, passing in the raw input and asserting that you get back the raw output. That way if you upgrade the tool you can catch any behavior changes which may affect you.
You have to decide whether that last set of tests is worthwhile or not. It very much depends on the tools involved.

NSThread or Python's threading module in PyObjC?

I need to do some network-bound calls (e.g., fetch a website) and I don't want them to block the UI. Should I be using NSThread or Python's threading module if I am working in PyObjC? I can't find any information on how to choose one over the other. Note, I don't really care about Python's GIL since my tasks are not CPU-bound at all.
It will make no difference; you will get the same behavior with slightly different interfaces. Use whichever fits best into your system.
Learn to love the run loop. Use Cocoa's URL-loading system (or, if you need plain sockets, NSFileHandle) and let it call you when the response (or failure) comes back. Then you don't have to deal with threads at all (the URL-loading system will use a thread for you).
Pretty much the only time to create your own threads in Cocoa is when you have a large task (>0.1 sec) that you can't break up.
(Someone might say NSOperation, but NSOperationQueue is broken and RAOperationQueue doesn't support concurrent operations. Fine if you already have a bunch of NSOperationQueue code or really want to prepare for working NSOperationQueue, but if you need concurrency now, run loop or threads.)
I'm fonder of the native Python threading solution, since I can join threads and pass references to them around. AFAIK, NSThreads don't support thread joining and cancelling, and you can get a variety of things done with Python threads.
Also, it's a bummer that NSThreads can't have multiple arguments, and though there are workarounds for this (like using NSDictionarys and NSArrays), it's still not as elegant and as simple as invoking a thread with arguments laid out in order / corresponding parameters.
But yeah, if the situation demands you to use NSThreads, there shouldn't be any problem at all. Otherwise, it's cool to stick with native python threads.
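For reference, a minimal sketch of the two conveniences mentioned in this answer, passing several arguments in order and joining, using Python's standard `threading` module. The `fetch` function is a hypothetical stand-in that sleeps instead of touching the network, and nothing here is specific to PyObjC.

```python
import threading
import time


def fetch(url, timeout, results):
    """Stand-in for a network-bound call; sleeps instead of using the network."""
    time.sleep(0.05)
    results[url] = f"fetched {url} (timeout={timeout})"


def fetch_all(urls, timeout=5):
    """Start one thread per URL, wait for all of them, and return their results."""
    results = {}
    workers = [
        # Arguments are laid out in order, matching fetch's parameters.
        threading.Thread(target=fetch, args=(url, timeout, results))
        for url in urls
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # block until this worker has finished
    return results


if __name__ == "__main__":
    pages = fetch_all(["http://example.com/a", "http://example.com/b"])
    for url, body in sorted(pages.items()):
        print(url, "->", body)
```

In a GUI program the `join` calls would not go on the UI thread, since they block; they are shown here only to illustrate the interface.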
I have a different suggestion, mainly because python threading is just plain awful because of the GIL (Global Interpreter Lock), especially when you have more than one cpu core. There is a video presentation that goes into this in excruciating detail, but I cannot find the video right now - it was done by a Google employee.
Anyway, you may want to think about using the subprocess module instead of threading (have a helper program that you can execute, or use another binary on the system). Or use NSThread; it should give you more performance than what you can get with CPython threads.

Resources