Is it possible to use CompUnit modules for collected data? - performance

I have developed a module for processing a collection of documents.
One run of the software collects information about them. The data is stored in two structures called %processed and %symbols. The data needs to be cached for subsequent runs of the software on the same set of documents, some of which can change. (The documents are themselves cached using CompUnit modules).
Currently the data structures are stored / restored as follows:
# storing
'processed.raku'.IO.spurt: %processed.raku;
'symbols.raku'.IO.spurt: %symbols.raku;
# restoring
my %processed = EVALFILE 'processed.raku';
my %symbols = EVALFILE 'symbols.raku';
Outputting these structures into files, which can be quite large, can be slow because the hashes must be stringified into Raku source on the way out, and slow on input because that source has to be recompiled.
It is not intended for the cached files to be inspected, only to save state between software runs.
In addition, although this is not a problem for my use case, this technique cannot be used in general because stringification (serialisation) does not work for Raku closures, as far as I know.
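As a point of comparison, the store-then-restore pattern described above, with cache invalidation by modification time, can be sketched in a language-neutral way. The Python below is illustrative only; the file names and the rebuild step are placeholders, and JSON stands in for any fast, closure-free serialisation format:

```python
import json
import os

def load_or_rebuild(cache_path, source_paths, rebuild):
    """Return cached data when the cache file is newer than every source
    document; otherwise call rebuild() and refresh the cache."""
    if os.path.exists(cache_path):
        cache_mtime = os.path.getmtime(cache_path)
        # Use the cache only if no source document changed since it was written.
        if all(os.path.getmtime(p) <= cache_mtime for p in source_paths):
            with open(cache_path) as f:
                return json.load(f)
    data = rebuild()
    with open(cache_path, "w") as f:
        json.dump(data, f)
    return data
```

The second and later runs then pay only the deserialisation cost, not the document-processing cost.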
I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?
Is there already a way to do this?
If there isn't, is there any technical reason it might NOT be possible?

(There's a good chance that you've already tried this and/or it isn't a good fit for your usecase, but I thought I'd mention it just in case it's helpful either to you or to anyone else who finds this question.)
Have you considered serializing the data to/from JSON with JSON::Fast? It has been optimized for (de)serialization speed in a way that basic stringification hasn't been / can't be. That doesn't allow for storing Blocks or other closures – as you mentioned, Raku doesn't currently have a good way to serialize them. But, since you mentioned that isn't an issue, it's possible that JSON would fit your usecase.
[EDIT: as you pointed out below, this can make support for some Raku datastructures more difficult. There are typically (but not always) ways to work around the issue by specifying the datatype as part of the serialization step:
use JSON::Fast;
my $a = <a a a b>.BagHash;
my $json = $a.&to-json;
my BagHash() $b = from-json($json);
say $a eqv $b; # OUTPUT: «True»
This gets more complicated for datastructures that are harder to represent in JSON (such as those with non-string keys). The JSON::Class module could also be helpful, but I haven't tested its speed.]
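The same "specify the datatype when deserialising" workaround exists outside Raku too. As an illustrative sketch (Python here, since JSON round-tripping is language-neutral), a mapping with non-string keys has to be re-typed on the way back in:

```python
import json

# JSON objects only allow string keys, so an int-keyed dict must be
# re-typed during deserialisation, just as the BagHash example above
# re-types the decoded JSON on the Raku side.
original = {1: "one", 2: "two"}
encoded = json.dumps(original)          # keys become "1", "2"
restored = {int(k): v for k, v in json.loads(encoded).items()}
assert restored == original
```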

After looking at other answers and looking at the code for Precompilation, I realised my original question was based on a misconception.
The Rakudo compiler generates an intermediate "byte code", which is then used at run time. Since modules are self-contained units for compilation purposes, they can be precompiled. This intermediate result can be cached, thus significantly speeding up Raku programs.
When a Raku program uses code that has already been compiled, the compiler does not compile it again.
I had thought of the precompilation cache as a sort of storage of the internal state of a program, but it is not quite that. That is why - I think - @ralph was confused by the question, because I was not asking the right sort of question.
My question is about the storage (and restoration) of data. JSON::Fast, as discussed by @codesections, is very fast because it is used by the Rakudo compiler at a low level and so is highly optimised. Consequently, restructuring data upon restoration will be faster than restoring native data types, because the slow, rate-determining step is storage to and restoration from "disk", which JSON handles very quickly.
Interestingly, the CompUnit modules I mentioned use low level JSON functions that make JSON::Fast so quick.
I am now considering other ways of storing data using optimised routines, perhaps using a compression/archiving module. It will come down to testing which is fastest. It may be that the JSON route is the fastest.
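A benchmark for that comparison can be as small as a round-trip timing harness. The sketch below is Python, with json and pickle standing in for whatever serialisation routes are actually being compared; only the shape of the test matters:

```python
import json
import pickle
import timeit

# A stand-in payload; substitute the real cached structures when measuring.
data = {f"key{i}": list(range(20)) for i in range(1000)}

def roundtrip_json():
    return json.loads(json.dumps(data))

def roundtrip_pickle():
    return pickle.loads(pickle.dumps(data))

# Verify both routes preserve the data before timing them.
assert roundtrip_json() == data
assert roundtrip_pickle() == data

for name, fn in [("json", roundtrip_json), ("pickle", roundtrip_pickle)]:
    print(name, timeit.timeit(fn, number=20))
```

Timing the full write-to-disk/read-from-disk cycle on representative data is the only way to know which route wins for a given workload.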
So this question does not have a clear answer because the question itself is "incorrect".

Update As @RichardHainsworth notes, I was confused by their question, though I felt it would be helpful to answer as I did. Based on his reaction, and his decision not to accept @codesections's answer, which at that point was the only other answer, I concluded it was best to delete this answer to encourage others to answer. But now that Richard has provided an answer that resolves things well, I'm undeleting this one in the hope it's more useful.
TL;DR Instead of using EVALFILE, store your data in a module which you then use. There are simple ways to do this that would be minimal but useful improvements over EVALFILE. There are more complex ways that might be better.
A small improvement over EVALFILE
I've decided to first present a small improvement so you can solidify your shift in thinking from EVALFILE. It's small in two respects:
It should only take a few minutes to implement.
It only gives you a small improvement over EVALFILE.
I recommend you properly consider the rest of this answer (which describes more complex improvements with potentially bigger payoffs instead of this small one) before bothering to actually implement what I describe in this first section. I think this small improvement is likely to turn out to be redundant beyond serving as a mental bridge to later sections.
Write a program, say store.raku, that creates a module, say data.rakumod:
use lib '.';
my %hash-to-store = :a, :b;
my $hash-as-raku-code = %hash-to-store.raku;
my $raku-code-to-store = "unit module data; our %hash = $hash-as-raku-code";
spurt 'data.rakumod', $raku-code-to-store;
(My use of .raku is of course overly simplistic. The above is just a proof of concept.)
This form of writing your data will have essentially the same performance as your current solution, so there's no gain in that respect.
Next, write another program, say, using.raku, that uses it:
use lib '.';
use data;
say %data::hash; # {a => True, b => True}
Using the module will entail compiling it. So the first time you use this approach for reading your data instead of EVALFILE, it'll be no faster, just as it was no faster to write it. But it should be much faster for subsequent reads. (Until you next change the data and have to rebuild the data module.)
This section also doesn't deal with closure stringification, and means you're still doing a data writing stage that may not be necessary.
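For what it's worth, this "store the data as a module and let the compiler's cache do the work" idea has a direct analogue in Python, where importing a module compiles it to bytecode cached under __pycache__. The sketch below is only an analogy, not Raku; the module name and data are placeholders:

```python
import importlib
import os
import sys
import tempfile

# Write the data out as an importable module -- the analogue of data.rakumod.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "data_cache.py"), "w") as f:
    f.write(f"HASH = {dict(a=True, b=True)!r}\n")

# Importing compiles the module to bytecode (cached under __pycache__), so
# later runs that import it skip recompilation, much as Raku's
# precompilation cache does for a data module.
sys.path.insert(0, tmpdir)
data_cache = importlib.import_module("data_cache")
print(data_cache.HASH)  # {'a': True, 'b': True}
```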
Stringifying closures; a hack
One can extend the approach of the previous section to include stringifications of closures.
You "just" need to access the source code containing the closures; use a regex/parse to extract the closures; and then write the matches to the data module. Easy! ;)
For now I'll skip filling in details, partly because I again think this is just a mental bridge and suggest you read on rather than try to do as I just described.
Using CompUnits
Now we arrive at:
I was wondering whether the CompUnit modules could be used because they are used to store compiled versions of modules. So perhaps, they could be used to store a 'compiled' or 'internal' version of the data structures?
I'm a bit confused by what you're asking here for two reasons. First, I think you mean the documents ("The documents are themselves cached using CompUnit modules"), and that documents are stored as modules. Second, if you do mean the documents are stored as modules, then why wouldn't you be able to store the data you want stored in them? Are you concerned about hiding the data?
Anyhow, I will presume that you are asking about storing the data in the document modules, and that you're interested in ways to "hide" that data.
One simple option would be to write the data as I did in the first section, but insert the our %hash = $hash-as-raku-code; etc code at the end, after the actual document, rather than at the start.
But perhaps that's too ugly / not "hidden" enough?
Another option might be to add Pod blocks with Pod block configuration data at the end of your document modules.
For example, putting all the code into a document module and throwing in a say just as a proof-of-concept:
# ...
# Document goes here
# ...
# At end of document:
=begin data :array<foo bar> :hash{k1=>v1, k2=>v2} :baz :qux(3.14)
=end data
say $=pod[0].config<array>; # foo bar
That said, that's just code being executed within the module; I don't know if the compiled form of the module retains the config data. Also, you need to use a "Pod loader" (cf Access pod from another Raku file). But my guess is you know all about such things.
Again, this might not be hidden enough, and there are constraints:
The data can only be literal scalars of type Str, Int, Num, or Bool, or aggregations of them in Arrays or Hashes.
Data can't have actual newlines in it. (You could presumably have double quoted strings with \ns in them.)
Modifying Rakudo
As I understand it, presuming RakuAST lands, it'll be relatively easy to write Rakudo plugins that can do arbitrary work with a Raku module. And it seems like a short hop from RakuAST macros to basic is parsed macros, which in turn seem like a short hop from extracting source code (eg the source of closures) as it goes through the compiler and then spitting it back out into the compiled code as data, possibly attached to Pod declarator blocks that are in turn attached to code as code.
So, perhaps just wait a year or two to see if RakuAST lands and gets the hooks you need to do what you need to do via Rakudo?

Related

How to pass functions as arguments to other functions in Julia without sacrificing performance?

EDIT to try to address @user2864740's edit and comment: I am wondering if there is any information particularly relevant to 0.4rc1/rc2, or in particular a strategy or suggestion from one of the Julia developers more recent than those cited below (particularly @StefanKarpinski's Jan 2014 answer in #6 below). Thanks.
Please see e.g.
https://groups.google.com/forum/#!topic/julia-users/pCuDx6jNJzU
https://groups.google.com/forum/#!topic/julia-users/2kLNdQTGZcA
https://groups.google.com/forum/#!msg/julia-dev/JEiH96ofclY/_amm9Cah6YAJ
https://github.com/JuliaLang/julia/pull/10269
https://github.com/JuliaLang/julia/issues/1090
Can I add type information to arguments that are functions in Julia?
Performance penalty using anonymous function in Julia
(As a fairly inexperienced Julia user) my best synthesis of this information, some of which seem to be dated, is that the best practice is either "avoid doing this" or "use FastAnonymous.jl."
I'm wondering what the bleeding edge latest and greatest way to handle this is.
[Longer version:]
In particular, suppose I have a big hierarchy of functions. I would like to be able to do something like
function transform(function_one::Function{from A to B},
                   function_two::Function{from B to C},
                   function_three::Function{from A to D})
    function::Function{from Set{A} to Dict{C,D}}(set_of_As::Set{A})
        Dict{C,D}([function_two(function_one(a)) => function_three(a)
                   for a in set_of_As])
    end
end
Please don't take the code too literally. This is a narrow example of a more general form of transformation I'd like to be able to do regardless of the actual specifics of the transformation, BUT I'd like to do it in such a way that I don't have to worry (too much) about checking the performance (that is, beyond the normal worries I'd apply in any non-function-with-function-as-parameter case) each time I write a function that behaves this way.
For example, in my ideal world, the correct answer would be "so long as you annotate each input function with @anon before you call this function with those functions as arguments, then you're going to do as well as you can without tuning to the specific case of the concrete arguments you're passing."
If that's true, great--I'm just wondering if that's the right interpretation, or if not, if there is some resource I could read on this topic that is closer to a "logically" presented synthesis than the collection of links here (which are more a stream of collective consciousness or history of thought on this issue).
The answer is still "use FastAnonymous.jl," or create "functor types" manually (see NumericFuns.jl).
If you're using julia 0.4, FastAnonymous.jl works essentially the same way that official "fast closures" will eventually work in base julia. See https://github.com/JuliaLang/julia/issues/11452#issuecomment-125854499.
(FastAnonymous is implemented in a very different way on julia 0.3, and has many more weaknesses.)

Is there an easy way to replace a deprecated method call in Xcode?

So iOS 6 deprecates presentModalViewController:animated: and dismissModalViewControllerAnimated:, and it replaces them with presentViewController:animated:completion: and dismissViewControllerAnimated:completion:, respectively. I suppose I could use find-replace to update my app, although it would be awkward with the present* methods, since the controller to be presented is different every time. I know I could handle that situation with a regex, but I don't feel comfortable enough with regex to try using it with my 1000+-files-big app.
So I'm wondering: Does Xcode have some magic "update deprecated methods" command or something? I mean, I've described my particular situation above, but in general, deprecations come around with every OS release. Is there a better way to update an app than simply to use find-replace?
You might be interested in Program Transformation Systems.
These are tools that can automatically modify source code, using pattern-directed source-to-source transformations ("if you see this source-level pattern, replace it by that source-level pattern") that operate on code structures rather than text. Done properly, these transformations can be reliable and semantically correct, and they're a lot easier to write than low-level procedural code that navigates and smashes nanoscopic actual tree structures.
It is not the case that using such tools is easy; such tools have to know how to parse the language of interest (e.g., ObjectiveC) into compiler data structures, process the patterns, and regenerate compilable source code from the modified structures. Even with the basic transformation engine, somebody needs to carefully define parsers (and unparsers!) for the dialects of the languages of interest. And it takes time to learn how to use such a tool even if you have such parsers/unparsers. This is worth it if the changes you need to make are "regular" (in the program transformation sense, not the regexp sense) and widespread (as yours seem to be).
Our DMS Software Reengineering toolkit has an ObjectiveC front end, and can carry out such transformations.
No, there is no magic like that.

Abstracting away from data structure implementation details in Clojure

I am developing a complex data structure in Clojure with multiple sub-structures.
I know that I will want to extend this structure over time, and may at times want to change the internal structure without breaking different users of the data structure (for example I may want to change a vector into a hashmap, add some kind of indexing structure for performance reasons, or incorporate a Java type)
My current thinking is:
Define a protocol for the overall structure with various accessor methods
Create a mini-library of functions that navigate the data structure e.g. (query-substructure-abc param1 param2)
Implement the data structure using defrecord or deftype, with the protocol methods defined to use the mini-library
I think this will work, though I'm worried it is starting to look like rather a lot of "glue" code. It probably also reflects my greater familiarity with object-oriented approaches.
What is the recommended way to do this in Clojure?
I think that deftype might be the way to go, however I'd take a pass on the accessor methods. Instead, look into clojure.lang.ILookup and clojure.lang.Associative; these are interfaces which, if you implement them for your type, will let you use get / get-in and assoc / assoc-in, making for a far more versatile solution (not only will you be able to change the underlying implementation, but perhaps also to use functions built on top of Clojure's standard collections library to manipulate your structures).
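For readers more at home in Python, the same idea, implement the language's standard lookup interface so generic code works against your type regardless of its internals, looks like this; the Config class is a made-up example:

```python
from collections.abc import Mapping

class Config(Mapping):
    """A custom type implementing the standard lookup interface, so generic
    mapping-consuming code works on it even if the internal representation
    later changes (the analogue of implementing ILookup/Associative)."""
    def __init__(self, **entries):
        self._entries = dict(entries)   # internal detail, free to change

    def __getitem__(self, key):
        return self._entries[key]

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)

# Mapping supplies get, keys, items, etc. for free on top of these three.
c = Config(host="localhost", port=8080)
print(c["host"], c.get("port"), dict(c))
```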
A couple of things to note:
You should probably start with defrecord, using get, assoc & Co. with the standard defrecord implementations of ILookup, Associative, IPersistentMap and java.util.Map. You might be able to go a pretty long way with it.
If/when these are no longer enough, have a look at the sources for emit-defrecord (a private function defined in core_deftype.clj in Clojure's sources). It's pretty complex, but it will give you an idea of what you may need to implement.
Neither deftype nor defrecord currently define any factory functions for you, but you should probably do it yourself. Sanity checking goes inside those functions (and/or the corresponding tests).
The more conceptually complex operations are of course a perfect fit for protocol functions built on the foundation of get & Co.
Oh, and have a look at gvec.clj in Clojure's sources for an example of what some serious data structure code written using deftype might look like. The complexity here is of a different kind from what you describe in the question, but still, it's one of the few examples of custom data structure programming in Clojure currently available for public consumption (and it is of course excellent quality code).
Of course this is just what my intuition tells me at this time. I'm not sure that there is much in the way of established idioms at this stage, what with deftype not actually having been released and all. :-)

Cross version line matching

I'm considering how to do automatic bug tracking and as part of that I'm wondering what is available to match source code line numbers (or more accurate numbers mapped from instruction pointers via something like addr2line) in one version of a program to the same line in another. (Assume everything is in some kind of source control and is available to my code)
The simplest approach would be to use a diff tool/lib on the files and do some math on the line number spans, however this has some limitations:
It doesn't handle cross file motion.
It might not play well with lines that get changed
It doesn't look at the information available in the intermediate versions.
It provides no way to manually patch up lines when the diff tool gets things wrong.
It's kinda clunky
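The "diff the files and do some math on the line number spans" approach can be sketched with a standard diff library. This Python sketch (using difflib) maps an old 1-based line number through the matching blocks of a diff, returning None for changed or deleted lines; it shares all the limitations listed above:

```python
import difflib

def map_line(old_lines, new_lines, old_lineno):
    """Map a 1-based line number in the old version to the new version
    via the diff's matching blocks; None means the line changed or vanished."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for a_start, b_start, size in matcher.get_matching_blocks():
        if a_start <= old_lineno - 1 < a_start + size:
            return b_start + (old_lineno - 1 - a_start) + 1
    return None

old = ["def f():", "    x = 1", "    return x", ""]
new = ["import os", "", "def f():", "    x = 1", "    return x", ""]
print(map_line(old, new, 2))  # the "x = 1" line moved from line 2 to line 4
```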
Before I start diving into developing something better:
What already exists to do this?
What features do similar system have that I've not thought of?
Why do you need to do this? If you use decent source version control, you should have access to old versions of the code, you can simply provide a link to that so people can see the bug in its original place. In fact the main problem I see with this system is that the bug may have already been fixed, but your automatic line tracking code will point to a line and say there's a bug there. Seems this system would be a pain to build, and not provide a whole lot of help in practice.
My suggestion is: instead of trying to track line numbers, which as you observed can quickly get out of sync as software changes, you should decorate each assertion (or other line of interest) with a unique identifier.
Assuming you're using C, in the case of assertions, this could be as simple as changing something like assert(x == 42); to assert(("check_x", x == 42)); -- this is functionally identical, due to the semantics of the comma operator in C and the fact that a string literal will always evaluate to true.
Of course this means that you need to identify a priori those items that you wish to track. But given that there's no generally reliable way to match up source line numbers across versions (by which I mean that for any mechanism you could propose, I believe I could propose a situation in which that mechanism does the wrong thing) I would argue that this is the best you can do.
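The same trick, attaching a stable identifier to a check so that reports reference the identifier rather than a line number that drifts between versions, can be expressed in most languages. Here is an illustrative Python version (the check helper is a made-up name):

```python
def check(check_id: str, condition: bool) -> bool:
    """Assert with a stable identifier that survives line renumbering --
    the analogue of C's assert(("check_x", x == 42)) comma-operator trick."""
    if not condition:
        raise AssertionError(f"failed check: {check_id}")
    return True

x = 42
check("check_x", x == 42)   # passes silently; failures name the check, not a line
```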
Another idea: If you're using C++, you can make use of RAII to track dynamic scopes very elegantly. Basically, you have a Track class whose constructor takes a string describing the scope and adds this to a global stack of currently active scopes. The Track destructor pops the top element off the stack. The final ingredient is a static function Track::getState(), which simply returns a list of all currently active scopes -- this can be called from an exception handler or other error-handling mechanism.
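In languages without RAII the same dynamic-scope tracking can be approximated with other cleanup mechanisms; for instance, here is a Python sketch using a context manager in place of the constructor/destructor pair (track and get_state are hypothetical names for the Track class and Track::getState() described above):

```python
import contextlib

_active_scopes = []   # global stack of currently active scope descriptions

@contextlib.contextmanager
def track(description):
    """Push a scope description on entry, pop it on exit -- the
    context-manager analogue of the C++ RAII Track class."""
    _active_scopes.append(description)
    try:
        yield
    finally:
        _active_scopes.pop()

def get_state():
    """Snapshot of all currently active scopes, for error handlers."""
    return list(_active_scopes)

with track("loading config"):
    with track("parsing section [db]"):
        print(get_state())  # ['loading config', 'parsing section [db]']
```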

Is checking Perl function arguments worth it?

There's a lot of buzz about MooseX::Method::Signatures and even before that, modules such as Params::Validate that are designed to type check every argument to methods or functions. I'm considering using the former for all my future Perl code, both personal and at my place of work. But I'm not sure if it's worth the effort.
I'm thinking of all the Perl code I've seen (and written) before that performs no such checking. I very rarely see a module do this:
my ($a, $b) = @_;
defined $a or croak '$a must be defined!';
!ref $a or croak '$a must be a scalar!';
...
@_ == 2 or croak "Too many arguments!";
Perhaps because it's simply too much work without some kind of helper module, but perhaps because in practice we don't send excess arguments to functions, and we don't send arrayrefs to methods that expect scalars - or if we do, we have use warnings; and we quickly hear about it - a duck typing approach.
So is Perl type checking worth the performance hit, or are its strengths predominantly shown in compiled, strongly typed languages such as C or Java?
I'm interested in answers from anyone who has experience writing Perl that uses these modules and has seen benefits (or not) from their use; if your company/project has any policies relating to type checking; and any problems with type checking and performance.
UPDATE: I read an interesting article on the subject recently, called Strong Testing vs. Strong Typing. Ignoring the slight Python bias, it essentially states that type checking can be suffocating in some instances, and even if your program passes the type checks, it's no guarantee of correctness - proper tests are the only way to be sure.
If it's important for you to check that an argument is exactly what you need, it's worth it. Performance only matters when you already have correct functioning. It doesn't matter how fast you can get a wrong answer or a core dump. :)
Now, that sounds like a stupid thing to say, but consider some cases where it isn't. Do I really care what's in @_ here?
sub looks_like_a_number { $_[0] !~ /\D/ }
sub is_a_dog { eval { $_[0]->DOES( 'Dog' ) } }
In those two examples, if the argument isn't what you expect, you are still going to get the right answer because the invalid arguments won't pass the tests. Some people see that as ugly, and I can see their point, but I also think the alternative is ugly. Who wins?
However, there are going to be times when you need guard conditions because your situation isn't so simple. The next thing you have to pass your data to might expect it to be within certain ranges or of certain types, and won't fail elegantly if it isn't.
When I think about guard conditions, I think through what could happen if the inputs are bad and how much I care about the failure. I have to judge that by the demands of each situation. I know that sucks as an answer, but I tend to like it better than a bondage-and-discipline approach where you have to go through all the mess even when it doesn't matter.
I dread Params::Validate because its code is often longer than my subroutine. The Moose stuff is very attractive, but you have to realize that it's a way for you to declare what you want and you still get what you could build by hand (you just don't have to see it or do it). The biggest thing I hate about Perl is the lack of optional method signatures, and that's one of the most attractive features in Perl 6 as well as Moose.
I basically concur with brian. How much you need to worry about your method's inputs depends heavily on how much you are concerned that a) someone will input bad data, and b) bad data will corrupt the purpose of the method. I would also add that there is a difference between external and internal methods. You need to be more diligent about public methods because you're making a promise to consumers of your class; conversely you can be less diligent about internal methods as you have greater (theoretical) control over the code that accesses it, and have only yourself to blame if things go wrong.
MooseX::Method::Signatures is an elegant solution to adding a simple declarative way to explain the parameters of a method. Method::Signatures::Simple and Params::Validate are nice but lack one of the features I find most appealing about Moose: the Type system. I have used MooseX::Declare and by extension MooseX::Method::Signatures for several projects and I find that the bar to writing the extra checks is so minimal it's almost seductive.
Yes, it's worth it - defensive programming is one of those things that are always worth it.
The counterargument I've seen presented to this is that checking parameters on every single function call is redundant and a waste of CPU time. This argument's supporters favor a model in which all incoming data is rigorously checked when it first enters the system, but internal methods have no parameter checks because they should only be called by code which will pass them data which has already passed the checks at the system's border, so it is assumed to still be valid.
In theory, I really like the sound of that, but I can also see how easily it can fall like a house of cards if someone uses the system (or the system needs to grow to allow use) in a way that was unforeseen when the initial validation border is established. All it takes is one external call to an internal function and all bets are off.
In practice, I'm using Moose at the moment and Moose doesn't really give you the option to bypass validation at the attribute level, plus MooseX::Declare handles and validates method parameters with less fuss than unrolling @_ by hand, so it's pretty much a moot point.
I want to mention two points here.
The first are the tests, the second the performance question.
1) Tests
You mentioned that tests can do a lot, and that they are the only way to be sure your code is correct. In general I would say this is absolutely correct. But tests themselves only solve one problem.
If you write a module, you have two problems, or let's say two different people who use your module: you as the developer, and a user who just uses it. Tests help with the first, ensuring your module is correct and does the right thing, but they don't help the user who just uses your module.
For the latter, I have an example. I had written a module using Moose and some other stuff, and my code always ended in a segmentation fault. I began to debug my code and search for the problem, and spent around 4 hours finding the error. In the end the problem was that I had used Moose with the Array trait: I called the "map" function but didn't provide a subroutine, just a string or something else.
Sure, this was an absolutely stupid error of mine, but I spent a long time debugging it. A check that the argument is a subref would have cost the developer 10 seconds, and would have saved me, and probably others, a lot more time.
I also know of other examples. I had written a REST client for an interface, completely OOP with Moose. You always got objects back, and you could change their attributes, but of course it didn't call the REST API for every change you made. Instead, you change your values and at the end call an update() method that transfers the data and applies the changes.
Then I had a user who wrote:
$obj->update({ foo => 'bar' })
Sure, he got an error back that update() did not work. But of course it didn't work, because the update() method doesn't accept a hashref; it only synchronises the actual state of the object with the online service. The correct code would be:
$obj->foo('bar');
$obj->update();
The first call runs without complaint because I never checked the arguments, and I don't throw an error if someone gives more arguments than I expect. The method just starts normally, like:
sub update {
    my ( $self ) = @_;
    ...
}
Sure, all my tests pass 100% fine. But handling these errors that are not really errors costs me time too. And it probably costs the user a lot more time.
So in the end: yes, tests are the only correct way to ensure that your code works correctly. But that doesn't mean type checking is meaningless. Type checking is there to help everyone who isn't a developer of your module use it correctly, and it saves you and others time spent finding dumb errors.
2) Performance
The short version: you don't care about performance until you have to.
That means that until your module actually runs too slowly, performance is fast enough and you don't need to worry about it. If your module really is too slow, you need further investigation, and for that investigation you should use a profiler like Devel::NYTProf to see what is actually slow.
And I would say that in 99% of cases the slowness is not because you do type checking; it is your algorithm. You do a lot of computation, call functions too often, etc. Often it helps to take a completely different approach: use a better algorithm, do caching, or something else, and it turns out the performance hit was never your type checking. But even if the checking is the performance hit, then just remove it where it matters.
There is no reason to drop type checking where performance doesn't matter. Do you think type checking matters in a case like the one above, where I had written a REST client? 99% of the performance issues there are the number of requests going to the web service, or the time each request takes. Dropping type checking or MooseX::Declare etc. would probably speed up absolutely nothing.
And even if you do see performance disadvantages, sometimes they are acceptable, because the speed doesn't matter or because something gives you greater value in return. DBIx::Class is slower than pure SQL with DBI, but DBIx::Class gives you a lot for it.
Params::Validate works great, but of course checking args slows things down. Tests are mandatory (at least in the code I write).
Yes it's absolutely worth it, because it will help during development, maintenance, debugging, etc.
If a developer accidentally sends the wrong parameters to a method, a useful error message will be generated, instead of the error being propagated down to somewhere else.
I'm using Moose extensively for a fairly large OO project I'm working on. Moose's strict type checking has saved my bacon on a few occasions. Most importantly it has helped avoid situations where "undef" values are incorrectly passed to a method. In those instances alone it saved me hours of debugging time.
The performance hit is definitely there, but it can be managed. Two hours of using NYTProf helped me find a few Moose attributes that I was grinding too hard; I refactored my code and got a 4x performance improvement.
Use type checking. Defensive coding is worth it.
Patrick.
Sometimes. I generally do it whenever I'm passing options via hash or hashref. In these cases it's very easy to misremember or misspell an option name, and checking with Params::Check can save a lot of troubleshooting time.
For example:
sub revise {
    my ($file, $options) = @_;
    my $tmpl = {
        test_mode       => { allow => [0,1], 'default' => 0 },
        verbosity       => { allow => qr/^\d+$/, 'default' => 1 },
        force_update    => { allow => [0,1], 'default' => 0 },
        required_fields => { 'default' => [] },
        create_backup   => { allow => [0,1], 'default' => 1 },
    };
    my $args = check($tmpl, $options, 1)
        or croak "Could not parse arguments: " . Params::Check::last_error();
    ...
}
Prior to adding these checks, I'd forget whether the names used underscores or hyphens, pass require_backup instead of create_backup, etc. And this is for code I wrote myself--if other people are going to use it, you should definitely do some sort of idiot-proofing. Params::Check makes it fairly easy to do type checking, allowed value checking, default values, required options, storing option values to other variables and more.
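The same options-validation idea (defaults, allowed values, and rejection of unknown keys) translates readily to other languages. Here is a rough, illustrative Python analogue of the template above; check_options is a made-up helper, not part of any library:

```python
def check_options(template, options):
    """Validate an options dict against a template of defaults and allowed
    values, rejecting misspelled/unknown keys -- a rough analogue of the
    Params::Check usage shown above."""
    unknown = set(options) - set(template)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    result = {}
    for key, spec in template.items():
        value = options.get(key, spec.get("default"))
        if "allow" in spec and value not in spec["allow"]:
            raise ValueError(f"bad value for {key}: {value!r}")
        result[key] = value
    return result

tmpl = {
    "test_mode":     {"allow": [0, 1], "default": 0},
    "create_backup": {"allow": [0, 1], "default": 1},
}
print(check_options(tmpl, {"test_mode": 1}))
# {'test_mode': 1, 'create_backup': 1}
```

A misspelled key such as "create_backups" raises immediately instead of being silently ignored, which is exactly the troubleshooting time the answer above describes saving.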
