Under the hood, Scikit-Learn performs a lot of input validation checks (defined here), such as checking (X, y) shapes, validating expected scalars/arrays, ensuring an estimator is already fitted, etc. These checks are extremely useful for catching bugs early when designing models, but they become runtime overhead in large, mature pipelines once in production.
Is there currently any way, perhaps via a global setting, keyword arguments, or similar, to disable these internal checks within pipelines, estimators, and transformers?
From this scikit-learn discussion, it turns out we can change the global settings via:
import sklearn

sklearn.set_config(
    assume_finite=True,  # skip validation that data is finite (no NaN/inf)
)
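Note that assume_finite=True only skips the (relatively costly) check that arrays contain no NaN or infinity values; the cheaper structural checks still run. If you'd rather skip validation around a specific hot path than globally, scikit-learn also provides a context manager. A minimal sketch (the estimator choice is arbitrary):

import numpy as np
from sklearn import config_context
from sklearn.linear_model import LinearRegression

X, y = np.random.rand(100, 3), np.random.rand(100)

# Finiteness validation is skipped only inside this block and restored on exit.
with config_context(assume_finite=True):
    model = LinearRegression().fit(X, y)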
I'm writing a Terraform provider for a piece of software which has a large set of instance-specific global configuration values (approximately 300 of them). When you use the provider, you define your endpoint and credentials and then operate within this instance. What I'm struggling to decide is how exactly to manage this config. It's not a resource that is created or destroyed, so I'm not sure if creating a global_config resource would be the best approach: all the values are already initialised during the setup of the system and can only be overridden, the config cannot be destroyed, and you can't have more than one config resource. Since you should be able to override all entries, it can't be a data source either.
I haven't managed to find any relevant documentation (or even similar examples) so far, so I would be very grateful if someone could point me to anything relevant or suggest how best to achieve this. Thanks.
Terraform's provider model is designed primarily for objects that Terraform itself can create or destroy. There is no built-in support for automatically "adopting" an existing object to be under Terraform's management, because Terraform generally assumes that each object is managed by exactly one declared resource instance and Terraform aims to preserve that assumption by being the one to have created the object.
However, there are some existing examples in other systems of this sort of "singleton" object that is implicitly created but can have its settings changed. Key examples for study are the resource types for default VPCs and their default public subnets in AWS.
There are currently two broad ways to represent this situation in Terraform; neither is perfect, and each has some advantages and disadvantages to consider:
Mandatory terraform import: you can potentially build your resource type so that its "create" action always fails immediately, telling the user to import the existing object, and then implement the "import" action to allow users to explicitly bind their existing object to their Terraform resource instance using the terraform import command.
This is the more explicit of the two options in that it requires the user to intentionally declare that the existing object should be managed by this Terraform configuration, in the same way that users normally take that action in Terraform. This means that the user remains in control and can (as they must always do when importing) take care to import that object into only one resource instance in one Terraform configuration, thereby preserving Terraform's uniqueness assumption.
However, it also adds a mandatory extra setup step to any Terraform configuration which uses this resource type. That extra step does not fit well into typical automation around Terraform, and so that step will often need to be taken in an exceptional way outside of a team's normal workflow.
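A minimal sketch of this "mandatory import" approach using the terraform-plugin-sdk/v2 (the resource name and the read/update helpers are hypothetical):

package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceGlobalConfig() *schema.Resource {
	return &schema.Resource{
		// "Create" always fails, directing the user to terraform import.
		CreateContext: func(ctx context.Context, d *schema.ResourceData, m interface{}) diag.Diagnostics {
			return diag.Errorf("the global config already exists and cannot be created; " +
				"adopt it with: terraform import example_global_config.main singleton")
		},
		ReadContext:   resourceGlobalConfigRead,
		UpdateContext: resourceGlobalConfigUpdate,
		// The config cannot be destroyed; "delete" just forgets it from state.
		DeleteContext: func(ctx context.Context, d *schema.ResourceData, m interface{}) diag.Diagnostics {
			return nil
		},
		Importer: &schema.ResourceImporter{
			StateContext: schema.ImportStatePassthroughContext,
		},
	}
}

func resourceGlobalConfigRead(ctx context.Context, d *schema.ResourceData, m interface{}) diag.Diagnostics {
	// Hypothetical: refresh the ~300 settings from the API into state here.
	return nil
}

func resourceGlobalConfigUpdate(ctx context.Context, d *schema.ResourceData, m interface{}) diag.Diagnostics {
	// Hypothetical: push the overridden settings to the API here.
	return resourceGlobalConfigRead(ctx, d, m)
}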
Treat "create" as if it were "adopt": since the actions a provider is expected to implement for a resource type are just a matter convention, there's no technical reason that your "create" action cannot just verify that the configured object exists and return success without creating anything. I call that "adopting" here to represent the idea that Terraform will then assume that this existing object is now under the exclusive management of whatever resource instance claimed to have created it, but "adopting" is not actually a formal part of Terraform's workflow.
This has the advantage of fitting well into an existing Terraform workflow, requiring no unusual additional steps on the part of the operator.
However, it also means that it's easier to accidentally adopt the same object into two different resource instances, either in the same configuration or in separate configurations. The consequences of doing that will vary depending on what the object represents, but at minimum it will likely result in the different resource instances "fighting" one another, constantly undoing each other's work on each new Terraform run and thus never converging on a stable desired state.
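A sketch of the "adopt" variant, reusing the hypothetical names from the sketch above; the only change is that "create" verifies the object and claims a fixed ID instead of failing:

// "Create" adopts the pre-existing singleton: verify it is reachable, bind a
// fixed ID (there is exactly one config object per instance), apply the
// configured overrides, then read the result back into state.
func resourceGlobalConfigCreate(ctx context.Context, d *schema.ResourceData, m interface{}) diag.Diagnostics {
	client := m.(*apiClient) // hypothetical API client from the provider block
	if _, err := client.GetGlobalConfig(ctx); err != nil {
		return diag.FromErr(err)
	}
	d.SetId("singleton")
	return resourceGlobalConfigUpdate(ctx, d, m)
}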
The second of these is the more convenient of the two and so is the one that existing providers have typically chosen as long as the consequences of incorrect multiple-adoption are just the risk of a non-converging system: that situation is confusing and kinda annoying, but also often not super harmful.
The first is the safer of the two because it guards against the accidental multiple-adoption problem. It could be appropriate if two configurations fighting to control a single object would have more significant consequences, such as one configuration breaking the other by changing its settings in a way that is invalid for the other use-case.
I have a project running Ember 3.20. We are currently in the process of migrating from classic to Glimmer-based components and have come across some expensive computational patterns which would benefit from caching.
My question is: what is the best approach to adding caching to getters for Glimmer components? It looks like there are currently a few ways to do this:
@cached via tracked-toolbox - I believe this was released prior to the Ember cached API. I didn't peek under the hood, but it has a @cached decorator which might collide with the future Ember @cached.
ember-cache-primitive-polyfill - mentioned in the Ember docs as a polyfill for the Ember cached API (3.22), but the syntax isn't as concise as the @cached decorator.
ember-cached-decorator-polyfill - related to RFC 0566; appears to be based on option 2, with a more ergonomic syntax.
Upgrade to 3.22 - trying to avoid bumping Ember unless there is a significant benefit. At a glance, I didn't see @cached included there, though.
Any additional insight/guidelines into how expensive a getter should be to warrant it being cached? For example, preventing re-renders seems a fairly obvious use case but there can be a wide range of what developers might consider an "expensive" computation.
There are two categories of things here:
The two @cached decorators.
The caching primitives introduced via RFC 0566.
In the vast majority of Ember or Glimmer app or normal library code, you’ll just be using the decorator. You’d only ever really reach for the caching primitives if you were building some low-level library code yourself (not never, but not exactly common, either).
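For reference, the primitives from RFC 0566 look roughly like this in use (a minimal sketch; the Doubler class is hypothetical):

import { tracked } from '@glimmer/tracking';
import { createCache, getValue } from '@glimmer/tracking/primitives/cache';

class Doubler {
  @tracked value = 1;

  // The callback re-runs only when a tracked value it consumed (this.value)
  // has changed since the last getValue call.
  doubled = createCache(() => this.value * 2);
}

let d = new Doubler();
getValue(d.doubled); // 2 (computed)
getValue(d.doubled); // 2 (cached)
d.value = 3;
getValue(d.doubled); // 6 (recomputed)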
As for the @cached decorators, they have basically the same semantics. The tracked-toolbox version was research that fed into the development of the primitive that Glimmer ships (and Ember uses), and so ember-cached-decorator-polyfill is implemented using the actual public API, polyfilling it via ember-cache-primitive-polyfill if necessary.
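In day-to-day component code, usage looks like this (a minimal sketch against ember-cached-decorator-polyfill, which provides the @glimmer/tracking import; the component and its argument are hypothetical):

import Component from '@glimmer/component';
import { cached } from '@glimmer/tracking';

export default class SortedPeopleList extends Component {
  // Recomputed only when a tracked value it reads (this.args.people) changes;
  // other accesses in the same render get the memoized result.
  @cached
  get sortedPeople() {
    return [...this.args.people].sort((a, b) => a.name.localeCompare(b.name));
  }
}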
In terms of the performance characteristics, you don’t even actually need to think about it in terms of preventing re-renders: that’s not how the system works anyway. (See this blog post I wrote last year (2020) for a deep dive on how re-rendering gets scheduled in Ember and Glimmer using the autotracking concepts.) It’s also worth remembering that caching is not free! So it’s not as simple as “this thing costs something, so I should cache it”—the caching has to pay for itself to be worth it, and it costs both memory use and CPU time to create and to check caches.
With that caveat firmly in mind, I tend to think of “expense” here in the following categories:
am I rendering this hundreds or thousands of times?
does rendering this cause a long-running computation that will impact render (i.e. on the order of multiple milliseconds)?
does this trigger asynchronous behavior?
(especially) does this trigger an API call?
In a lot of normal app code, the only getters you’ll really need to decorate with @cached are getters which produce API calls based on the components’ arguments. Since the getter will otherwise be invoked every time it is referenced, you will end up with multiple API calls, which can produce a situation where the apparent state in the UI flips back and forth as references to different promises resolve.
I have a processor that generates time-series data in JSON format. Based on the received data, I need to make a forecast using machine learning algorithms in Python, then write the new forecast values to another flow file.
The problem is that when you run such a Python script, it must perform many expensive preprocessing operations: queries to a database, creating a complex data structure, initializing forecasting models, etc.
If you use ExecuteStreamCommand, the script will be run anew for each flow file. Is this true?
Can I make, in NiFi, a Python script that starts once and then receives flow files many times, storing the history of previously received data? Or do I need to make an HTTP service that will receive data from NiFi?
You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have Python dependencies, etc. However, I have seen examples of this approach for ML model application (see Tim Spann's examples) and this is generally very effective. The initialization and individual flowfile trigger logic is cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc.; onScheduled in NiFi processor parlance) from the execution phase (onTrigger); see the sketch after this list. Some examples exist, but I have not personally pursued this with Python specifically. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application in such a way that it ran a long-lived "server" component and each ESC command sent data to it and returned an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example using CDSW to host and deploy the model and NiFi to send it data via HTTP to evaluate.
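To illustrate option 2, here is a heavily trimmed Jython sketch for InvokeScriptedProcessor (load_model and the forecasting details are hypothetical; the interface methods shown are required by NiFi's Processor contract). The script body runs once when the processor is started, so expensive setup in __init__ is reused across flow files, while onTrigger runs per trigger:

from org.apache.nifi.processor import Processor, Relationship

class ForecastProcessor(Processor):
    def __init__(self):
        self.REL_SUCCESS = Relationship.Builder().name("success").build()
        self.model = load_model()  # hypothetical: expensive one-time initialization
        self.history = []          # state preserved across flow files

    def initialize(self, context):
        pass

    def getRelationships(self):
        return set([self.REL_SUCCESS])

    def validate(self, context):
        return set()

    def getPropertyDescriptors(self):
        return []

    def getPropertyDescriptor(self, name):
        return None

    def onPropertyModified(self, descriptor, oldValue, newValue):
        pass

    def getIdentifier(self):
        return None

    def onTrigger(self, context, sessionFactory):
        session = sessionFactory.createSession()
        flowFile = session.get()
        if flowFile is not None:
            # Read the incoming JSON, append it to self.history, run the model,
            # and write the forecast to the flow file content (details omitted).
            session.transfer(flowFile, self.REL_SUCCESS)
        session.commit()

processor = ForecastProcessor()  # InvokeScriptedProcessor looks up this variable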
Make a custom processor that can do that; Java is more appropriate. I believe you can do pretty much everything with Java, you just need to find the libraries. There might be some initialization and preprocessing concerns, but those can all be handled in the processor's init method, which allows you to preserve the state of certain components.
In my use case, I had to build a custom processor that could take in images and count the number of people in each image. For that, I had to load a deep learning model once in the init method; afterwards, the onTrigger method could take a reference to that model every time it processed an image.
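That one-time-initialization pattern in a Java custom processor looks roughly like this (a hedged sketch: ForecastModel and the processor name are hypothetical; the NiFi annotations and base class are real):

import java.util.Collections;
import java.util.Set;

import org.apache.nifi.annotation.lifecycle.OnScheduled;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class ForecastProcessor extends AbstractProcessor {
    static final Relationship REL_SUCCESS =
            new Relationship.Builder().name("success").build();

    private volatile ForecastModel model; // hypothetical model wrapper

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @OnScheduled
    public void setup(final ProcessContext context) {
        // Runs once when the processor is scheduled, not once per flow file:
        // DB queries, data structures, model initialization, etc. go here.
        model = ForecastModel.load();
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Read the time-series JSON, apply the model, write the forecast (omitted).
        session.transfer(flowFile, REL_SUCCESS);
    }
}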
The Problem:
I'm trying to have Puppet manage some of the details of several scheduled tasks without managing whether those tasks are enabled. To that end, I declare scheduled_task resources without any explicit enabled attributes, with the intention of communicating that whether the tasks are enabled is not to be altered by Puppet. Puppet, however, is persistent in re-enabling the tasks on every run, just as if I had specified enabled => true for each of them. How can I make it stop doing that?
Already Tried:
I've considered setting the attributes for each datacenter via Hiera, but the reality is that this makes failovers and switches more complicated than necessary. I don't want to change my Puppet code every time that needs to happen. I also can't shut off the puppet-agent runs; I need to maintain the integrity of our rolling deployment system.
Providers?
I've read a little about providers; it seems like I can handle the behavior there. However, I'm having a hard time figuring out where there is documentation (if any) that explains how to use them to override specific resource properties.
Notify/Subscribe:
I've thought about using notify/subscribe to only set the triggers to enabled on creation. I'm not thinking this is the right solution, because it's not one resource subscribing to/notifying another, it's properties being set on a single resource. If there's some magical way of doing this or something similar, I'd love to know.
Bottom line: I just need Puppet to create disabled tasks and let me turn them on/off without it changing that state in subsequent runs.
Is there a way to ignore resource attribute defaults in Puppet entirely?
Only by explicitly declaring a value for every attribute of every resource.
But that doesn't seem to be the question you really wanted to ask. You seem to be exploring ways to assign attribute values without specifying them as literals in your resource declarations, and I guess you're looking for some kind of layered approach, with the bottom layer replacing or overriding types' and providers' built-in defaults.
As you thought, you could conceivably do this at least in part via providers. You would need to write a custom provider for your resource type and specify that it be used by each resource instance. But don't. This way is complicated to implement and confusing to anyone who later has to read your manifests (maybe including future you).
Notify / subscribe, on the other hand, are simply the wrong tools for the job. They do not do what you seem to think they do, or perhaps you just haven't thought that idea through.
I think you're selling Hiera short and/or inappropriately minimizing the complexity of the task. Probably a mixture of both -- I'm inclined to guess that Hiera can do more for you than you appreciate, but also that the complications you envision will manifest to some extent, in some form, no matter what you do.
Nevertheless, there is an approach that seems to match pretty closely what I think you want: resource default declarations (I link to the latest docs, but this feature is present, in substantially the same form, in every Puppet version released in at least the last nine years). Thus, you might cause all scheduled_task resources to be disabled unless you explicitly say otherwise by putting this resource default declaration at an appropriate place:
Scheduled_task {
  enabled => false,
}
When choosing where to put such a statement, do note that, unlike anything else in modern Puppet, resource default statements have dynamic scope. The manual discusses that in somewhat more detail.
Revisited:
In light of the clarification of the question, I'll clarify that resource types' providers are where resource management behavior lives. Therefore, if Puppet's behavior on target machines is not what you want in the event of an altogether missing property declaration, then modifying the provider or writing your own are pretty much your only alternatives. Of course, if you have a support contract then perhaps you don't have to do that in-house.
If you don't want to hack on providers -- an altogether reasonable position -- then you're left with managing the properties to their desired values. Supposing that you employ Hiera effectively, however, you should not need to modify any manifests to control which servers have their tasks enabled.
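For example (a sketch; the class, task, and Hiera key names are hypothetical, and the scheduled_task attributes follow the puppetlabs-scheduled_task module), per-server enablement can live entirely in Hiera node data:

# In Hiera node data, e.g. nodes/web01.example.com.yaml:
#   profile::tasks::enabled: true
class profile::tasks {
  # Defaults to false wherever the node data doesn't say otherwise.
  $enabled = lookup('profile::tasks::enabled', Boolean, 'first', false)

  scheduled_task { 'nightly-report':
    ensure  => present,
    command => 'C:\\scripts\\report.bat',
    trigger => [{ 'schedule' => 'daily', 'start_time' => '23:00' }],
    enabled => $enabled,
  }
}

A failover then means editing the node's Hiera data, not the manifests.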
I'm relatively new to software development, and I'm on my way to completing my first app for the iPhone.
While learning Swift, I learned that I could add functions outside the class definition and have them accessible across all views. After a while, I found myself making many global functions for setting app preferences (registering defaults, UIAppearance, etc.).
Is this bad practice? The only alternative I could think of was creating a custom class to encapsulate them, but then the class itself wouldn't serve any purpose and I'd have to think of ways to pass it around between views.
Global functions: good (IMHO anyway, though some disagree)
Global state: bad (fairly universally agreed upon)
By which I mean, it’s probably a good practice to break up your code into lots of small utility functions, to make them general, and to re-use them, so long as they are “pure functions”.
For example, suppose you find yourself checking if all the entries in an array have a certain property. You might write a for loop over the array checking them. You might even re-use the standard reduce to do it. Or you could write a re-useable function, all, that takes a closure that checks an element, and runs it against every element in the array. It’s nice and clear when you’re reading code that goes let allAboveGround = all(sprites) { $0.position.y > 0 } rather than a for…in loop that does the same thing. You can also write a separate unit test specifically for your all function, and be confident it works correctly, rather than a much more involved test for a function that embeds a version of all amongst other business logic.
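Here's a minimal sketch of that all function (Sprite is a stand-in type; modern Swift's allSatisfy method provides the same behavior out of the box):

// Returns true when every element satisfies the predicate.
func all<S: Sequence>(_ elements: S, _ predicate: (S.Element) -> Bool) -> Bool {
    for element in elements where !predicate(element) {
        return false
    }
    return true
}

struct Sprite {
    var position: (x: Double, y: Double)
}

let sprites = [Sprite(position: (x: 0, y: 1)), Sprite(position: (x: 2, y: 3))]
let allAboveGround = all(sprites) { $0.position.y > 0 }  // true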
Breaking up your code into smaller functions can also help avoid needing to use var so much. For example, in the above example you would probably need a var to track the result of your looping but the result of the all function can be assigned using let. Favoring immutable variables declared with let can help make your program easier to reason about and debug.
What you shouldn’t do, as @drewag points out in his answer, is write functions that change global variables (or access singletons, which amount to the same thing). Any global function you write should operate only on its inputs and produce the exact same results every time, regardless of when it is called. Global functions that mutate global state (i.e. change global variables, or change the values of variables passed to them by reference) can be incredibly confusing to debug due to the unexpected side effects they might cause.
There is one downside to writing pure global functions,* which is that you end up “polluting the namespace” – that is, you have all these functions lying around that might have specific relevance to a particular part of your program, but accessible everywhere. To be honest, for a medium-sized application, with well-written generic functions named sensibly, this is probably not an issue. If a function is purely of use to a specific struct or class, maybe make it a static method. If your project really is getting too big, you could perhaps factor out your most general functions into a separate framework, though this is quite a big overhead/learning exercise (and Swift frameworks aren’t entirely fully-baked yet), so if you are just starting out, I’d suggest leaving this for now until you get more confident.
* edit: ok two downsides – member functions are more discoverable (via autocomplete when you hit .)
Updated after discussion with @AirspeedVelocity
Global functions can be ok and they really aren't much different than having type methods or even instance methods on a custom type that is not actually intended to contain state.
The entire thing comes down mostly to personal preference. Here are some pros and cons.
Cons:
They can sometimes cause unintended side effects; that is, they can change some global state that you or the caller forgets about, causing hard-to-track-down bugs. As long as you are careful about not using global variables and ensure that your function always returns the same result for the same input regardless of the state of the rest of the system, you can mostly ignore this con.
They make code that uses them difficult to test, which is important once you start unit testing (which is a definite good policy in most circumstances). It is hard to test because you can't easily mock out the implementation of a global function, for example, to change the value of a global setting. Instead, your test will start to depend on the other class that sets this global setting. Being able to inject a setting into your class, instead of having to fake out a global function, is generally preferable.
They sometimes hint at poor code organization. All of your code should be separable into small, single-purpose, logical units. This ensures your code will remain understandable as your code base grows in size and age. The exception to this is truly universal functions built on very high-level and reusable concepts: for example, a function that lets you test all of the elements in a sequence. You can also still separate global functions into logical units by grouping them into well-named files.
Pros:
High-level global functions can be very easy to test. However, you cannot ignore the need to still test their logic where they are used, because your unit tests should not be written with knowledge of how your code is actually implemented.
Easily accessible. It can often be a pain to inject many types into another class (passing objects into an initializer and probably storing them as properties). Global functions can often remove this boilerplate code (even if that has the trade-off of being less flexible and less testable).
In the end, every code architecture decision is a balance of trade offs each time you go to use it.
I have a Framework.swift that contains a set of common global functions, like local(str: String), to get rid of the second parameter of NSLocalizedString. There are also a number of alert functions, internally using local, with varying numbers of parameters, which make using NSAlert for modal dialogs easier.
So for that purpose global functions are good. They are a bad habit when it comes to information hiding, where you would expose internal class knowledge to some global functionality.
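For what it's worth, the wrapper described above can be as small as this (a sketch; the exact behavior of local is assumed from the description):

import Foundation

// Hides NSLocalizedString's comment parameter behind a shorter call.
func local(str: String) -> String {
    return NSLocalizedString(str, comment: "")
}

let title = local(str: "settings.title")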