Best practice, where to put request_mem_region - linux-kernel

I have the code of two drivers: in the former, request_mem_region is called during device probe; in the latter, it is called during device open. Of course, you have to call release_mem_region at the matching point, depending on where the request was made, but I was wondering what the pros and cons of these two choices are. Any suggestions?

It depends on how the region is being used. If it is only needed while the device is open, then it makes sense to put it in open. If the device is constantly working in the background, even if it is not open, then it will need to be mapped in probe. Basically, request it for the smallest scope that is required for your device's needs.
Putting it in probe, even if it is only needed while the device is open, won't always cause problems, but it does mean you're eating up memory/address space that you don't have to. (edit: I forgot, this doesn't perform the mapping, it just reserves the region, so it isn't really an issue.)
Putting it in open can lead to more complexity - for example, you need to make sure that it only gets mapped once, even if the device can be opened by two different processes at once. Not a hard problem to solve, but something that can be overlooked when learning to write modules and working from example code.
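For the "only once even with multiple openers" point, here is a minimal sketch of the open/release pair, assuming a hypothetical character device (the struct layout, field names and the "mydev" string are invented for illustration): a mutex plus an open count makes sure the region is requested on the first open and released on the last close.

    /*
     * Minimal sketch only: "mydev" and its fields are placeholders.
     * A mutex plus an open count ensures the region is requested on the
     * first open and released on the last close, no matter how many
     * processes have the device open at once.
     */
    #include <linux/cdev.h>
    #include <linux/fs.h>
    #include <linux/ioport.h>
    #include <linux/module.h>
    #include <linux/mutex.h>

    struct mydev {
        struct cdev cdev;
        struct mutex lock;
        int open_count;
        unsigned long phys_base;    /* start of the device's register window */
        unsigned long region_size;  /* length of that window */
    };

    static int mydev_open(struct inode *inode, struct file *filp)
    {
        struct mydev *dev = container_of(inode->i_cdev, struct mydev, cdev);
        int ret = 0;

        mutex_lock(&dev->lock);
        if (dev->open_count == 0 &&
            !request_mem_region(dev->phys_base, dev->region_size, "mydev")) {
            ret = -EBUSY;               /* someone else owns the region */
        } else {
            dev->open_count++;
            filp->private_data = dev;
        }
        mutex_unlock(&dev->lock);
        return ret;
    }

    static int mydev_release(struct inode *inode, struct file *filp)
    {
        struct mydev *dev = filp->private_data;

        mutex_lock(&dev->lock);
        if (--dev->open_count == 0)     /* last close gives the region back */
            release_mem_region(dev->phys_base, dev->region_size);
        mutex_unlock(&dev->lock);
        return 0;
    }

    static const struct file_operations mydev_fops = {
        .owner   = THIS_MODULE,
        .open    = mydev_open,
        .release = mydev_release,
    };

With this shape, the probe-time variant is just the same request/release pair moved into probe and remove, with no counting needed.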

Related

simulate atomic operation in independent systems

I have two independent systems. At some point I would like to be able to perform an operation that affects both systems, and I would like to simulate atomicity even though this is technically impossible. To illustrate the problem, let's say that we would like to move an object from one system to the other.
First, because every operation might fail at any point, I am adding a tentative record to both systems indicating the intention. The algorithm is:
1. Set the object in system 1 to tentative mode for removal
2. Set the object in system 2 to tentative mode for addition
3. Move the object from system 1 to system 2
4. Remove the tentativeness from system 2
5. Remove the tentativeness from system 1
The lack of an atomic operation, though, might result in the object ending up in both systems or in neither, depending on the order of steps 4 and 5 and on a crash between them.
My question is: is there an algorithm that could somehow resolve the lack of atomicity and allow me to guarantee it? I kind of see that it seems impossible, but I hope it is not.
Quite possible (though not perfect). Databases do this all the time. See https://en.wikipedia.org/wiki/Distributed_transaction and https://en.wikipedia.org/wiki/Two-phase_commit_protocol for an introduction.
It is, of course, a meaty subject, so I can't supply a quick thumbnail sketch in code. But yes, you can do this.
Your approach has some merit. What you need is more communication between the two systems.
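If it helps to see the shape of it, here is a minimal two-phase commit sketch (the Participant interface and all names are mine, purely illustrative; a real coordinator also durably logs its decision and handles timeouts and restarts, which is where most of the hard work lives):

    #include <iostream>
    #include <vector>

    // Purely illustrative names. Each "system" is a participant that can
    // durably record its intention (prepare) before being told to make the
    // change permanent (commit) or to undo it (abort).
    struct Participant {
        virtual bool prepare() = 0;
        virtual void commit() = 0;
        virtual void abort() = 0;
        virtual ~Participant() = default;
    };

    // Phase 1: every participant votes. Phase 2: commit only if all voted yes.
    // A real coordinator durably logs its decision between the two phases so
    // that, after a crash, it can re-issue commit()/abort(); that is why both
    // operations must be idempotent.
    bool two_phase_commit(std::vector<Participant*>& participants) {
        bool all_prepared = true;
        for (auto* p : participants) {
            if (!p->prepare()) { all_prepared = false; break; }
        }
        for (auto* p : participants) {
            if (all_prepared) p->commit(); else p->abort();
        }
        return all_prepared;
    }

    // Toy participant standing in for "system 1" / "system 2".
    struct ToySystem : Participant {
        const char* name;
        bool can_prepare;
        ToySystem(const char* n, bool ok) : name(n), can_prepare(ok) {}
        bool prepare() override { std::cout << name << ": prepared\n"; return can_prepare; }
        void commit() override  { std::cout << name << ": committed\n"; }
        void abort() override   { std::cout << name << ": rolled back\n"; }
    };

    int main() {
        ToySystem a("system 1", true), b("system 2", true);
        std::vector<Participant*> both{&a, &b};
        std::cout << (two_phase_commit(both) ? "moved\n" : "move aborted\n");
    }

The key property is the same one your tentative records are reaching for: each system first durably promises it can perform its half, and only then is either side told to make the change permanent, so a crash can always be resolved by replaying the recorded decision.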

A C++11 based signals/slots with ordering

I may be a bit in over my head here, but if you never try new things, you'll never learn, I suppose. I'm working with some multi-touch stuff and have built myself a small but functional GUI library. Up until now I've been using Boost's Signals2 library to have my detected gestures distributed to all active GUI elements (whether on screen or not). I'm a big fan of avoiding premature optimization, so things have been hunky-dory until now.
I've used VS2013's profiler to find out that when the user goes touch-crazy (the device supports up to 41 simultaneous touches), my system grinds to a halt, and Signals2 is the culprit. Keep in mind that each touch can trigger a number of Gestures, which are all communicated to every GUI element that has registered to interact with this type of Gesture.
Now there are a number of ways to deal with this bottleneck:
Have GUI elements work more cleverly and disconnect them when they're off-screen.
Optimize the signals/slots system so the calls are resolved faster.
Prioritization of events.
I'm not a big fan of ever having to deal with 3 - if avoidable - as it'll directly impact the responsiveness of my application. Nr. 1 should probably be implemented, but I'm more interested in getting to Nr. 2 first.
I don't really need any big fancy stuff. The signals/slots system I need really only has to do the core emission work, along with these two features:
Slots must be able to return a value ending the emission - effectively cancelling any subsequent handling of a signal.
Slots must be orderable - and fairly efficiently so. GUI elements that are interacted with will pop up above others, so this type of change in order is bound to happen quite often.
I stumbled across this really interesting implementation
https://testbit.eu/2013/cpp11-signal-system-performance/
which seems to have everything except the 'ordering' I need. I've only looked over the code once, and it does seem a little intimidating. If I were to try to add ordering capabilities, I'd prefer not to change too much around unless necessary. Does anyone have experience with this stuff? I'm fairly certain that a linked list is not optimal for constant removal and insertion, but then again, it probably needs to be optimized most for constant emission calls.
Any thoughts are most welcome!
--- Update ---
I've spent a little time adding the features I needed to the code placed in the public domain above and pasted the complete (and somewhat hacky) version here:
SimpleSignal with added features
In short, I've added:
Blockable connections (implemented via a simple if statement)
Depth/order parameter (linear-search insertion)
With those additions, keep in mind it has these current issues:
Blocked connections are simply skipped, not actively removed from the data-structure, so having a lot of blocked connections will affect run-time performance.
Depth is only maintained during insertion. So if you'd like to change the depth you'll have to disconnect and reconnect your slot.
Since the SignalLink interface has become exposed (as a result of my block implementation), it's less safe from a user perspective. It's way easier for you to shoot yourself in the foot with this version by messing with existing references and pointers.
This implementation hasn't been as thoroughly tested as the original I'm sure. I did try out the new functionality a bit. User beware.
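For anyone who just wants to see the shape of those two features without reading the full implementation, here is a minimal, self-contained C++11 sketch (the names are mine, not the linked library's API): slots are kept sorted by a depth value supplied at connect time, and a slot returning true ends the emission.

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <vector>

    // Minimal ordered signal, names invented for this sketch. A slot returns
    // true to say "handled, stop here", which cancels the rest of the emission.
    template <typename... Args>
    class OrderedSignal {
    public:
        using Slot = std::function<bool(Args...)>;

        // Linear-search insertion keeps the vector sorted by depth; slots with
        // equal depth stay in connection order.
        void connect(int depth, Slot slot) {
            auto pos = std::find_if(slots_.begin(), slots_.end(),
                                    [depth](const Entry& e) { return e.depth > depth; });
            slots_.insert(pos, Entry{depth, std::move(slot)});
        }

        // Walk slots in depth order; stop at the first slot that handled it.
        void emit(Args... args) const {
            for (const auto& e : slots_) {
                if (e.slot(args...))
                    return;
            }
        }

    private:
        struct Entry { int depth; Slot slot; };
        std::vector<Entry> slots_;
    };

    int main() {
        OrderedSignal<int> tap;
        tap.connect(1, [](int x) { std::cout << "topmost got " << x << "\n"; return true; });
        tap.connect(5, [](int)   { std::cout << "never reached\n"; return true; });
        tap.emit(42);   // prints only "topmost got 42"
    }

A sorted std::vector is cheap here because connects and disconnects are rare compared to emissions; if reordering on every touch really does dominate, the same interface could sit on top of a different container.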

Fail Fast vs. Robustness

Our product is a distributed system. The modules I work on are fairly new, quite rigorous, well tested. They were developed with recent best practices in mind. Other modules can be considered as legacy software.
While I'm vigilant about everything that happens within the modules I'm responsible for, I'm under constant pressure to work with bad data sent to me from the other modules. At heart, I'm a "Fail Fast" developer, and as a result, when problems arise I am usually able to eliminate the possibility of error in my modules. It's not so much about blame, just saving wasted effort in chasing bugs in the wrong places.
But the argument I keep coming up against is: "We can't let this stuff fail in production, the customer expects this to work, why don't you work around this problem". And this would be an argument for robustness: be liberal in what you accept, conservative in what you send.
I should also note that these are mostly intermittent problems. We see them in integration tests but they are hard to reproduce. Timing and concurrency are involved.
I'm having a hard time balancing between the two principles. Part of it is my worry that if I start allowing and propagating exceptional data, I'm inviting trouble and I won't have as much confidence in my system. But I can't argue against keeping the system working even if other modules are sending me wrong data. The reason other modules aren't getting fixed is that they are too complex and fragile, while mine still appear clear and safe. But if I don't resist the pressure, my modules will slowly be saddled with the same problems I've been rejecting until now.
I should say that the system is not "crashing" in production, but my module may simply display an error to the operator and ask them to contact support. A crash would be a big problem, but if I'm reporting the error clearly, then isn't this the right thing to do? I suspect that my peers just don't want the customer to see any problems, period. But my module is rejecting data from other modules within our product, not customer input. So it seems to me that we are just not tackling problems.
So, do I need to be more pragmatic or hold my ground?
I share the "fail fast" preference/principle. Don't think of this as a conflict of principles though, its more a conflict of understanding. Your counterpart has some unspoken requirement ("dont show the user a bad time") that implies some missed requirement. You did not have a chance to think about/implement this requirement beforehand, so the requirement has left a bad taste in your mouth. Forget this viewpoint, re-approach it as a new project with a fixed requirement you can work against.
Maybe the best result is to give an error message like the one you displayed. But it sounds like you implemented it before getting buy-in from your counterpart, while they still had a choice about accepting it. Earlier communication about what you were doing could have addressed something like that.
Be careful in how you present the ideas. Constantly referring to the other systems as "too complex and fragile" might be rubbing people the wrong way. Simply say the systems are new to you and take longer to understand. Do put the time into understanding them, so you do not reduce people's expectations of your capability.
I'd say that it depends on what happens if you don't halt. Does someone's paycheck get processed wrong? Does the wrong order get sent out? That would be worth stopping for.
If possible, have your cake and eat it too - don't report the error to the user, get the customer to agree to send diagnostic reports and report every failure back. Bug the developer(s) who own the faulting module(s) to fix them. And by bug I mean file a bug against them. Or, if management doesn't think it's worth the cost of fixing, don't.
I'd also write up unit tests against those modules that fail, especially if you can tell what the original input was that caused them to generate the wrong output.
What it really comes down to though is what the person who reviews your performance wants from you, especially after you explain the problem to them, via email.
Simply put, this sounds like a case of "don't check for something you can't handle". The fact that you're catching the error and are able to report it means you're not propagating it. But it also means that, since you can report it, you have some mechanism to trap the error, and therefore could potentially handle and correct it yourself rather than report it.
Mind, I'm assuming that your error report is more interesting than a random exception you caught some place deep in the system. But even then, if it's an exception you're testing for and you're creating (i.e. you check if the denominator is zero and send an error rather than simply inadvertently dividing by zero and catching the exception higher up), then that suggests you may well have a way of correcting the problem.
Bottom line, you need both. You need to try to make the data as error free as practical, but also report the unexpected.
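To make the "both" part concrete, a tiny boundary-check sketch (the record and field names are invented): repair what is harmlessly off, and reject and report what you cannot trust, so raw bad data never travels further in.

    #include <iostream>
    #include <optional>
    #include <string>

    // Hypothetical record arriving from another module; names are invented.
    struct OrderMessage {
        std::string customer_id;
        int quantity = 0;
    };

    // Boundary check: repair what is harmlessly off, reject (and report) what
    // cannot be trusted, and never let raw bad data travel further in.
    std::optional<OrderMessage> sanitize(OrderMessage msg) {
        if (msg.customer_id.empty()) {
            std::cerr << "rejected: missing customer_id\n";            // fail fast, loudly
            return std::nullopt;
        }
        if (msg.quantity < 0) {
            std::cerr << "repaired: negative quantity clamped to 0\n"; // tolerated, but logged
            msg.quantity = 0;
        }
        return msg;
    }

    int main() {
        std::cout << (sanitize({"C42", -3}) ? "accepted\n" : "rejected\n"); // repaired
        std::cout << (sanitize({"", 5})     ? "accepted\n" : "rejected\n"); // rejected
    }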
I don't think that you can lock the door and cross your arms saying "it's not my problem". The fact that it's coming from "old, fragile systems" is meaningless. YOUR code is not old and fragile, and it is clearly the efficient place, in terms of the entire integrated system, to "fix" the data once you've detected the problem. Yes, the old modules will continue to GIGO to other, lesser systems, but those legacy modules combined with your new module are a cohesive whole and thus make up "the system".
The typical real problem here is simply the time/value equation of writing all this fix up code vs new features. That's a different debate. But if you have time, and you know things that you can do to clean up incoming data, "be liberal in what you accept" is sound policy.
I won't get into the reasons, but you are right.
In my experience, PHBs are missing the part of the brain required to understand why fail fast has merit and why "robustness", defined as do-whatever-it-takes-and-eat-errors-if-necessary, is a bad idea. It is hopeless. They just don't have the hardware to grok it. They tend to say things like "OK, you make a good point, but what about the user?" It's just their version of "think of the children", and it signals the end of the conversation for me any time it's brought up.
My advice is to stand your ground. Eternally.
Thanks everyone. The case that prompted this question ended well, and partly thanks to insights I got from the answers above.
My initial reaction was to stick to fail fast, but I thought about this some more, and had reached the conclusion that one of the roles of my module is to provide a stabilizing anchor to the rest of the system. That does not necessarily mean accepting bad data, but surfacing problems, isolating them and handling them in a transparent manner until we find a solution.
I planned to add a new handler and code path for this case, which would execute properly, as if it were a special use case that had previously been undocumented.
We had a discussion where I reiterated the need to deal with the problem at the boundary, but was also willing to help. I outlined my plan to the other side, because I had a suspicion that my position was viewed as overly pedantic, and that the solution was perceived as me only having to turn off spurious validation of harmless data, even if it was incorrect. In reality though, the way I work is largely data driven, so I explained why it has to be correct and how behavior is driven by it and how in accommodating this data I will be implementing a special code path.
I think this gave weight to my position and it led to a more thorough discussion of the other side's aversion to fixing the data. It turned out that it was more of a weariness of dealing with an error prone legacy system than an actual obstacle. There was a relatively simple solution, it was just scary to make a change, a mindset that's fairly entrenched.
But having aired all challenges and possible solutions, we eventually agreed to fix the data, and so far it seems to have solved our problem. Our integration tests are now passing consistently, but we have also added logging and will continue to monitor it.
In summary, I think that for me, the synthesis of both principles is that fail fast is essential for surfacing problems. But once they do surface, robustness means providing a transparent path to continue operation in a way that does not compromise the system. I was able to offer that, and by doing so, won some goodwill from the other side and got the data fixed in the end.
Again, thanks to everyone that responded. I'm too new to rate comments, but I do appreciate all the perspectives presented.
That's a tricky one. If your module receives bad data and it's "OK" for you to just do nothing with it and return, then I would suggest writing to an error log instead of showing an error to the user.
It kind of depends on the class of error you are getting. If the way the system is breaking means you can keep going without feeding bad data to any other parts of the system, you should do everything in your power to work with whatever input is given.
To my mind, though, data purity trumps working systems: you cannot allow bad data to propagate elsewhere and corrupt other systems. To the extent that you can massage the data to be correct and then keep going, you should do so, on the theory that the data is then safe and you must keep the system running...
I like to think of things in terms of data streams. Passing bad data along is polluting the whole stream, and that is bad because, just like real pollution, a drop can spoil a whole river of data (if one element is bad, what else can you trust?). But equally bad is blocking the flow, letting nothing pass because you spotted something you could easily remove. Filter it out, and if everyone at every stage is also filtering, you get clear, clean data out the other end even if a few impurities got in in the middle.
The question from your peers is: "why don't you work around this problem"
You say that it's possible for you to detect the bad data and report an error to the user. This is the normal approach: once you know the data coming into your functions is bad, you should fail fast (and this is the recommendation in the other answers I have read here).
However, your question doesn't specify the domain in which your software is operating. If you know the data coming in is erroneous, is it possible for you to request that data again? Is it actually possible to recover from the situation?
I mentioned that the "domain" here is important. So if you have an app which displays streamed video data for example, and maybe your wireless signal is weak so the stream is corrupt, should the system "fail fast" and display an error message? Or should a poorer image be displayed, and an attempt to reconnect made if needed, depending on the magnitude of the problem?
Depending on your domain, it may be possible for you to detect bad data and make a second request for the data without inconveniencing the user. (This is clearly only relevant in cases where you'd expect the data to be better the second time, but you do say the issues you are experiencing are intermittent and possibly concurrency-related)...
So, fail-fast is good, and is definitely something you should do if you can't recover. And you should definitely not propagate bad data. But if you can recover, which in some domains you can, then failing straight away is not necessarily the best thing to do.

Can this kernel function be more readable? (Ideas needed for an academic research!)

Following my previous question regarding the rationale behind extremely long functions, I would like to present a specific question regarding a piece of code I am studying for my research. It's a function from the Linux kernel which is quite long (412 lines) and complicated (an MCC index of 133). Basically, it's a long and nested switch statement.
Frankly, I can't think of any way to improve this mess. A dispatch table seems both huge and inefficient, and any subroutine call would require an inconceivable number of arguments in order to cover a large-enough segment of code.
Do you think of any way this function can be rewritten in a more readable way, without losing efficiency? If not, does the code seem readable to you?
Needless to say, any answer that will appear in my research will be given full credit - both here and in the submitted paper.
Link to the function in an online source browser
I don't think that function is a mess. I've had to write such a mess before.
That function is the translation into code of a table from a microprocessor manufacturer. It's very low-level stuff, copying the appropriate hardware registers for the particular interrupt or error reason. In this kind of code, you often can't touch registers which have not been filled in by the hardware - that can cause bus errors. This prevents the use of code that is more general (like copying all registers).
I did see what appeared to be some code duplication. However, at this level (operating at interrupt level), speed is more important. I wouldn't use Extract Method on the common code unless I knew that the extracted method would be inlined.
BTW, while you're in there (the kernel), be sure to capture the change history of this code. I have a suspicion that you'll find there have not been very many changes in here, since it's tied to hardware. The nature of the changes over time of this sort of code will be quite different from the nature of the changes experienced by most user-mode code.
This is the sort of thing that will change, for instance, when a new consolidated IO chip is implemented. In that case, the change is likely to be copy and paste and change the new copy, rather than to modify the existing code to accommodate the changed registers.
Utterly horrible, IMHO. The obvious first-order fix is to make each case in the switch a call to a function. And before anyone starts mumbling about efficiency, let me just say one word - "inlining".
Edit: Is this code part of the Linux FPU emulator by any chance? If so, this is very old code that was a hack to get Linux to work on Intel chips like the 386, which didn't have an FPU. If it is, it's probably not a suitable study for academics, except for historians!
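Just to illustrate the shape of that first-order fix (everything below is a placeholder, not the real kernel code): each case body becomes a small, well-named static function, and with static linkage and a single call site the compiler will normally inline it, so the readability gain costs nothing at run time.

    /*
     * Shape of the refactoring only: the struct, enum and function names are
     * placeholders. Each case body moves into a small, well-named static
     * function; the compiler will normally inline them back into the caller.
     */
    struct excp_state {
        unsigned int status;
    };

    enum excp_reason { EXCP_DIVIDE, EXCP_INVALID /* , ... */ };

    static inline void handle_divide_error(struct excp_state *st)
    {
        st->status |= 0x1;      /* stands in for the real case body */
    }

    static inline void handle_invalid_op(struct excp_state *st)
    {
        st->status |= 0x2;      /* stands in for the real case body */
    }

    static void dispatch_exception(struct excp_state *st, enum excp_reason reason)
    {
        switch (reason) {
        case EXCP_DIVIDE:  handle_divide_error(st); break;
        case EXCP_INVALID: handle_invalid_op(st);   break;
        default:           break;
        }
    }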
There's a kind of regularity here; I suspect that for a domain expert this actually feels very coherent.
Also, having the variations in close proximity allows immediate visual inspection.
I don't see a need to refactor this code.
I'd start by defining constants for the various classes. Coming into this code cold, it's a mystery what the switching is for; if the switching was against named constants, I'd have a starting point.
Update: You can get rid of about 70 lines where the cases return MAJOR_0C_EXCP; simply let them fall through to the end of the routine. Since this is kernel code I'll mention that there might be some performance issues with that, particularly if the case order has already been optimized, but it would at least reduce the amount of code you need to deal with.
I don't know much about kernels or about how re-factoring them might work.
The main thing that comes to my mind is taking that switch statement and breaking each sub-step into a separate function with a name that describes what the section is doing. Basically, more descriptive names.
But I don't think this optimizes the function any further. It just breaks it into smaller functions, which might be helpful... I don't know.
That is my 2 cents.

How to test application's startup time or performance

There is a free tool called PassMark AppTimer, but I think it doesn't quite fit my needs.
Windows provides a tool called xperf; is there a way to use it to test/benchmark application startup time?
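I can't speak to xperf specifics, but one low-tech complement, assuming you control the application's source, is to have the app time its own startup phases and log them; a minimal sketch (the phase names are placeholders):

    #include <chrono>
    #include <iostream>
    #include <string>

    // Self-instrumentation sketch: timestamp the phases of your own startup
    // and print how long each took.
    class StartupTimer {
        using Clock = std::chrono::steady_clock;
    public:
        StartupTimer() : last_(Clock::now()) {}

        void mark(const std::string& phase) {
            const auto now = Clock::now();
            const auto ms =
                std::chrono::duration_cast<std::chrono::milliseconds>(now - last_).count();
            std::cout << phase << ": " << ms << " ms\n";
            last_ = now;
        }

    private:
        Clock::time_point last_;
    };

    int main() {
        StartupTimer t;
        // load_configuration();   // placeholder startup phases
        t.mark("load configuration");
        // build_main_window();
        t.mark("build main window");
        // connect_to_services();
        t.mark("connect to services");
    }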
If I'm helping to develop an app, and it gets too slow on startup (or any other phase), I just do this.
Common wisdom is that measuring performance of various routines is necessary for finding performance problems.
I go the other way - I locate the biggest problems (because their very slowness exposes them), and then I can roughly estimate how much time they take, if I care to. Here's an example of how it works.
The kinds of things I have found are, for example 1) fetching and converting strings from resources, which were in resources so that they could be internationalized, but did not really need to be internationalized, or 2) creating and deleting (along with serializing) deep data structures for no real reason in the process of setting up UI controls.
The things found are almost never what you might guess, so it is a mistake to guess. Just see what the process tells you.
The interesting thing about this is that the problem is almost never the kind of thing a profiler could easily tell you about. The problem is nearly always some innocent-looking function or method call, somewhere in the middle of the call stack, that only gets your attention because 1) it shows up a lot, and 2) by looking at what it is doing and why, you can see that it can be done without. Getting rid of it saves roughly the fraction of time that it was on the stack.
