How does RFE identify feature importances in feature engineering? - algorithm

I am curious to know what actually happens in the backend of Recursive Feature Elimination. It is known that RFE eliminates the least important variables in each iteration, but how does it identify feature importance?
I have searched in many places, but nowhere am I able to find how the feature importance is calculated. Is it simply relationship checking that works in the backend?
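For anyone landing here, a minimal sketch of the elimination loop, written against scikit-learn's documented behaviour: RFE does not compute importance itself, it reads it off the wrapped estimator (|coef_| for linear models, feature_importances_ for tree ensembles) after each refit. The estimator choice and target count below are only for illustration.

    # Minimal sketch of the recursive-feature-elimination loop, assuming a
    # scikit-learn-style estimator that exposes coef_ or feature_importances_
    # after fitting (this is what sklearn.feature_selection.RFE relies on).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rfe_sketch(X, y, n_features_to_select=3):
        estimator = LogisticRegression(max_iter=1000)
        remaining = list(range(X.shape[1]))          # indices of surviving features
        while len(remaining) > n_features_to_select:
            estimator.fit(X[:, remaining], y)
            # Importance comes from the fitted model itself:
            # |coef_| for linear models, feature_importances_ for trees.
            if hasattr(estimator, "feature_importances_"):
                importance = estimator.feature_importances_
            else:
                importance = np.abs(estimator.coef_).sum(axis=0)
            weakest = int(np.argmin(importance))     # least important surviving feature
            del remaining[weakest]                   # eliminate it and refit
        return remaining                             # indices of the selected features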

Related

How do you mitigate proposal-number overflow attacks in Byzantine Paxos?

I've been doing a lot of research into Paxos recently, and one thing I've always wondered about, I'm not seeing any answers to, which means I have to ask.
Paxos includes an increasing proposal number (and possibly also a separate round number, depending on who wrote the paper you're reading). And of course, two would-be leaders can get into duels where each tries to out-increment the other in a vicious cycle. But as I'm working in a Byzantine, P2P environment, it makes me wonder what to do about proposers that would attempt to set the proposal number extremely high - for example, to the maximum 32-bit or 64-bit word.
How should a language-agnostic, platform-agnostic Paxos-based protocol deal with integer maximums for proposal number and/or round number? Especially intentional/malicious cases, which make the modular-arithmetic approach of overflowing back to 0 a bit unattractive?
From what I've read, I think this is still an open question that isn't addressed in literature.
Byzantine Proposer Fast Paxos addresses denial of service, but only of the sort that would delay message sending through attacks not related to flooding with incrementing (proposal) counters.
Having said that, integer overflow is probably the least of your problems. Instead of thinking about integer overflow, you might want to consider membership attacks first (via DoS). Learning about membership after consensus from several nodes may be a viable strategy, but probably still vulnerable to Sybil attacks at some level.
Another strategy may be to incorporate some proof-of-work system for proposals to limit the flood of requests. However, it's difficult to know what metric to balance the work against (in Bitcoin, for example, the work of mining the block chain is balanced against free currency). It really depends on what type of system you're trying to build. You should consider the value of information in your system, then create a proof-of-work system that requires slightly more cost to circumvent.
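To make that concrete, here is a hedged sketch of the hash-based proof-of-work idea: a proposer must find a nonce so that the hash of the proposal starts with a given number of zero bits, and acceptors can verify it cheaply. The difficulty value and the proposal encoding are made-up knobs you would tune against the value of information in your system.

    import hashlib
    from itertools import count

    def leading_zero_bits(digest: bytes) -> int:
        bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
        return len(bits) - len(bits.lstrip("0"))

    def mine(proposal_id: int, difficulty: int = 16) -> int:
        # Proposer burns CPU searching for a nonce that meets the difficulty.
        for nonce in count():
            digest = hashlib.sha256(f"{proposal_id}:{nonce}".encode()).digest()
            if leading_zero_bits(digest) >= difficulty:
                return nonce                      # attached to the proposal

    def verify(proposal_id: int, nonce: int, difficulty: int = 16) -> bool:
        # Acceptors check the work with a single hash.
        digest = hashlib.sha256(f"{proposal_id}:{nonce}".encode()).digest()
        return leading_zero_bits(digest) >= difficulty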
However, once you have the ability to slow down a proposal counter, you still need to worry about integer maximums in any system with a high number of (valid) operations. You should have a strategy for number wrapping, or a multiple-precision scheme, in place where you can clearly determine how many years/decades your network can run without blowing out a fixed-precision counter. If you can determine that your system will run for 100 years (or whatever) without blowing out its fixed-precision counter, even with malicious entities, then you can choose to simplify things.
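As a back-of-the-envelope illustration of that last point (the sustained proposal rate is a made-up assumption; plug in your own worst case):

    SECONDS_PER_YEAR = 365.25 * 24 * 3600
    proposals_per_second = 1_000_000          # assumed worst-case sustained rate

    years_64 = 2**64 / proposals_per_second / SECONDS_PER_YEAR
    seconds_32 = 2**32 / proposals_per_second

    print(f"64-bit counter exhausted after ~{years_64:,.0f} years")    # ~584,542 years
    print(f"32-bit counter exhausted after ~{seconds_32:,.0f} seconds")  # ~4,295 s

In other words, a rate-limited 64-bit counter is effectively safe for any realistic lifetime, while a 32-bit one is not.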
On another (important) note, the system model used in most papers doesn't reflect everything that makes a real-life implementation practical (Raft is a nice exception to this). If anything, some authors are guilty of creating a system model that is designed to avoid a hard problem that they haven't found an answer to. So, if someone says that X will solve everything, please be aware that they only mean it solves everything in the very specific system model that they defined. On the other side of this, you should consider that a system model is closely tied to any statement that says "Y is impossible". A nice example of this concept is the completely asynchronous message passing of the Ben-Or consensus algorithm, which uses nondeterminism in the system model's state machine to avoid the limits of the FLP impossibility result (which says that consensus requires partially synchronous message passing when the system model's state machine is deterministic).
So, you should continue to consider the "impossible" after you read a proof that says it can't be done. Nancy Lynch did a nice writeup on this concept.
I guess what I'm really saying is that a good solution to your question doesn't really exist yet. If you figure it out, please publish it (or let me know if you find an existing paper).

Automate Finding Pertinent Methods in Large Project

I have tried to be disciplined about decomposing into small reusable methods when possible. As the project grows, I find myself re-implementing the exact same methods.
I would like to know how to deal with this in an automated way. I am not looking for an IDE-specific solution. Relying on method names may not be sufficient. Unix and scripting solutions would be extremely beneficial. Answers such as "take care" etc. are not the solutions I am seeking.
I think the cheapest solution to implement might be to use Google Desktop. A more accurate solution would probably be much harder to implement - treat your code base as a collection of documents where the identifiers (or tokens in the identifiers) are words of the document, and then use document clustering techniques to find the closest matching code to a query. I'm aware of some research similar to that, but nothing close to out-of-the-box code that you could use. You might try looking on Google Code Search for something. I don't think they offer a desktop version, but you might be able to find some applicable code you can adapt.
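A rough sketch of that "code base as a collection of documents" idea, assuming Python with scikit-learn available; the source path, file glob and query string are hypothetical. Identifiers are split into tokens, each file becomes a TF-IDF vector, and files are ranked by cosine similarity to a query.

    import re
    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def identifier_tokens(text: str):
        # split snake_case and camelCase identifiers into lowercase word tokens
        words = re.findall(r"[A-Za-z]+", text)
        tokens = []
        for w in words:
            tokens += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", w)
        return [t.lower() for t in tokens]

    files = list(Path("src").rglob("*.py"))              # hypothetical code base
    docs = [f.read_text(errors="ignore") for f in files]

    vectorizer = TfidfVectorizer(tokenizer=identifier_tokens, token_pattern=None)
    matrix = vectorizer.fit_transform(docs)

    query = "parse configuration file and return settings map"
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    for i in scores.argsort()[::-1][:5]:
        print(f"{scores[i]:.3f}  {files[i]}")            # closest-matching files first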
Edit: And here's a list of somebody's favorite code search engines. I don't know whether any are adaptable for local use.
Edit2: Source Code Search Engine is mentioned in the comments following the list of code search engines. It appears to be a commercial product (with a free evaluation version) that is intended to be used for searching local code.

How can I avoid people using my code for evil? [closed]

I'm not sure if this is quite the right place, but it seems like a decent place to ask.
My current job involves manual analysis of large data sets (at several levels, each more refined and done by increasingly experienced analysts). About a year ago, I started developing some utilities to track analyst performance by comparing results at earlier levels to final levels. At first, this worked quite well - we used it in-shop as a simple indicator to help focus training efforts and do a better job overall.
Recently though, the results have been taken out of context and used in a way I never intended. It seems management (one person in particular) has started using the results of these tools to directly affect EPRs (enlisted performance reports - it's an Air Force thing, but I assume something similar exists in other areas) and similar paperwork. The problem isn't who is using these results, but how. I've made it clear to everyone that the results are, quite simply, error-prone.
There are numerous unavoidable obstacles to generating this data, which I have worked to minimize with some nifty heuristics and such. Taken in the proper context, they're a useful tool. Out of context however, as they are now being used, they do more harm than good.
The manager(s) in question are taking the results as literal indicators of whether an analyst is performing well or poorly. The results are being averaged and individual scores are being ranked as above (good) or below (bad) average. This is being done with no regard for inherent margins of error and sample bias, with no regard for any sort of proper interpretation. I know of at least one person whose performance rating was marked down for an 'accuracy percentage' less than one percentage point below average (when the typical margin of error from the calculation method alone is around two to three percent).
I'm in the process of writing a formal report on the errors present in the system ("Beginner's Guide to Meaningful Statistical Analysis" included), but all signs point to this having no effect.
Short of deliberately breaking the tools (a route I'd prefer avoiding but am strongly considering under the circumstances), I'm wondering if anyone here has effectively dealt with similar situations before? Any insight into how to approach this would be greatly appreciated.
Update:
Thanks for the responses - plenty of good ideas all around.
If anyone is curious, I'm moving in the direction of 'refine, educate, and take control of interpretation'. I've started rebuilding my tools to better negate or track error and to automatically generate any numbers and graphs they could want, with documentation included throughout (while hiding away, as obscure references, the raw data they currently seem so eager to import into the 'magical' Excel sheets).
In particular, I'm hopeful that visual representations of error and properly created ranking systems (taking into account error, standard deviations, etc.) will help the situation.
Either modify the output to include error information (so if the error is ±5%, don't output 22%, output 17% - 27%), or educate those against whom this is being used about the error, so that they can defend themselves when it is used against them.
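A trivial sketch of that interval output; the ±5-point margin is just the example above, substitute whatever your own error analysis gives:

    def score_range(score: float, margin: float = 5.0) -> str:
        # report a band instead of a point value so the error travels with the number
        low, high = max(score - margin, 0.0), min(score + margin, 100.0)
        return f"{low:.0f}% - {high:.0f}%"

    print(score_range(22.0))   # "17% - 27%", never a bare "22%"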
Well, you seem to have run afoul of the Law of Unintended Consequences in the context of human behavior.
Unfortunately, once the cat is out of the bag, it's pretty hard to put back in. You have a few options (which are not mutually exclusive, by the way) to consider, including:
Alter the reports so that their data can no longer be abused in the way you describe.
Work with management to help them understand why their use of your data is improper or misleading.
Work with those whose performance is being measured to pressure management to rethink their policy on the matter.
Work with management/analysts to come up with a viable means to measure performance in a way that is fair to everyone.
Break the reports in a manner that makes them unusable for any purpose.
Clearly there is a desire on the part of management to get analytics on performance of analysts. Likely there is a real need for this ... and your reports happened to fill a void in the available information. The best option for everyone would be to find a way to effectively and fairly fill this need. There are many possible ways to achieve this - from dropping dense rankings in favor of performance tiers to using time-over-time variance to refine performance measurements.
Now, it's entirely possible that the existing reports you've provided simply cannot be applied in a fair and accurate manner to address this problem. In which case, you should work with your management team to make sure they understand why this is the case - and either redefine the way performance is measured or take the time to develop an appropriate and fair methodology.
One of the strongest means to convince management that their (ab)use of the data in your report is unwise is to remind them of the concept of perverse incentives. It's entirely possible that over time, analysts will modify their behavior in a way that results in higher rankings in performance reports at the cost of real performance or quality of results that are not otherwise captured or expressed. You seem to have a good understanding of your domain - so I would hope that you could provide specific and dramatic examples of such consequences to help make your case.
All you can do is to try and educate the managers as to why what they're doing is incorrect.
Beyond that, you can't stop idiots from being idiotic, and you'll just go mad trying.
I definitely wouldn't "break" code that people are relying on, even if it's not a specific deliverable. That will only cause them to complain about you, a move which may affect your own EPR :-)
I really think the key here is good communication with your managers.
Besides that, I like PatrickV's idea. You could also try some other ways to engineer your tool around the problem so that it will seem silly, or be hard, to use it as a performance measurement - change the name of the statistics to mean something other than "how good programmer X is", make it hard to get data per person, show error statistics.
You can also try to display the data in another way (this may actually make your managers think you are trying to help them). Show a graph - a difference of several pixels in position is harder to latch onto than a numeric result (my guess: your managers are using Excel and coloring everything below average red). Draw the error margin so it doesn't make sense to obsess over fractions of a percentage.
Give the result as a scale - a low and a high bound that take your error information into account; that is harder to compare.
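A sketch of the graphical idea: plot each analyst's score with its error bar so that overlapping intervals are visually obvious. The names, scores and margin below are made-up illustration data.

    import matplotlib.pyplot as plt

    analysts = ["A", "B", "C", "D"]
    scores   = [71.5, 72.3, 70.9, 73.0]
    margin   = 2.5                                # +/- points from the error analysis

    fig, ax = plt.subplots()
    ax.errorbar(range(len(analysts)), scores, yerr=margin, fmt="o", capsize=4)
    ax.axhline(sum(scores) / len(scores), linestyle="--", label="average")
    ax.set_xticks(range(len(analysts)))
    ax.set_xticklabels(analysts)
    ax.set_ylabel("accuracy (%)")
    ax.legend()
    plt.show()                                    # overlapping bars => no meaningful difference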
Edit: Oh yeah, and read about "social interfaces". You can start with Spolsky's Not Just Usability and Building Communities with Software.
I would echo #paxdiablo's advice, as a first step:
Work on the report on the inherent errors. In fact, make it the introduction to every copy generated.
When you refer to the measurement errors, indicate they are the lower limit of the errors (unless there actually aren't any).
Try to educate the manager(s) in the error of his/her ways.
If possible, discuss the issue with your manager, and perhaps with the offending managers' management; depending on how familiar you are with them, you would probably limit it to just "expressing some concerns" and giving a heads-up.
Consult your HR department, or whoever is in charge of fairness in the performance reviews.
Good luck.
The problem is that the code is not yours; it belongs to your company. They really can do whatever they want with it.
I hate to say this, but if you have an issue with the ethics of your company you will have to leave that company.
One thing you could do is implement the comparison yourself. If he really wants to check whether somebody is performing significantly worse than the rest, it should be tested formally as well.
Now, choosing the right test is a bit tricky without knowing the data and the structure, so I can't really advise you on that one. Just take into account that if you do pairwise comparisons, or compare multiple scores against an average, you run into the multiple-testing problem. A classic way of correcting for it is the Bonferroni correction. If you implement that one, you can be sure that at a certain point no one will jump out any more; the Bonferroni correction is very conservative. Another option is the Dunn-Sidak correction, which is supposed to be less conservative.
The correct approach would be an ANOVA - if the assumptions are met and the data is suitable, of course - with a post-hoc comparison like Tukey's Honest Significant Difference test. That way at least the uncertainty in the results is taken into account.
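A hedged sketch of that ANOVA-plus-Tukey route, assuming the scores can be grouped per analyst; the data below is fabricated purely for illustration.

    import numpy as np
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)
    groups = {name: rng.normal(loc=72, scale=3, size=30) for name in "ABCD"}

    # Omnibus test: is there any difference between analysts at all?
    stat, p = f_oneway(*groups.values())
    print(f"ANOVA F={stat:.2f}, p={p:.3f}")

    # Only if the omnibus test is significant, look at pairwise differences,
    # with a correction for multiple comparisons built in.
    scores = np.concatenate(list(groups.values()))
    labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05))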
If you don't have a clue on which test to use, describe your data in detail on stats.stackexchange.com and ask for help on which test to use.
Cheers
I just wanted to elaborate on the Perverse Incentives answer of LBushkin. I can easily see your problem extending to where analysts will avoid difficult topics for fear of reducing their score, or will provide the same answer as earlier stages to avoid hurting a friend's score, even if that answer is not correct. An interesting question is what happens if the later answer is incorrect - you have no truth, just successive analytic opinions - in this case I assume the first answer is marked as "incorrect", right?
Maybe presenting some of these extensions to the manager will help.

How can I estimate a Web site build (refresh) when I don't know all of the site's features?

I know there are several estimating questions here, and I have read through most of them, but this one is slightly different. If you're doing a refresh on a Web site, it might include usability enhancements that increase the hours for page production and development. We'll never look at a Web site and say to ourselves the way it is now is the way it will be in the future. If that were the case, then our clients wouldn't be looking for our expertise. Should it always be a requirement to do a team brainstorm before responding to an RFP or creating a formal statement of work? What if those doing the brainstorming are not doing the final work? We can only inventory the current site to a certain extent, and I'm starting to think we should make estimates only for what we know, letting the potential client tell us where we're missing certain elements.
In your proposal, be fairly specific about what you saw on the current web site: how many pages/resources are there? Are they of low/medium/high complexity? What high-level features do you see already existing (i.e. search, security, AJAX, profiles, etc.) and what would you consider adding? Give ranges, rather than specific estimates.
The more detail you give about what you saw, even without knowing the requirements, the more you help the client believe you didn't just shoot the proposal through an RFP chute and that you're serious about the work, without tying yourself to a commitment to deliver more/faster than you can. Clients do understand that you can't really make a reliable estimate, but the more they believe you have made a serious attempt, the more they're likely to commit their time to helping you understand the requirements towards making a final proposal and SOW.
When you don't know what you're estimating, expect the estimate to be inaccurate (and, at best, approximate).
State your assumptions ("X if we do Foo or Y if we don't").
State what you would need to reduce uncertainty ("we need to spend an hour with the client to gather requirements before we can provide any estimate").

How to assemble a project with software products and your own code

Let's say you have a specific project at hand; it can be divided into parts, and you are not completely sure about all the difficulties that will arise.
Time is of the essence.
How do you decide whether a part should use a software product or your own code? (considering that some tools are awesome, but will require much time to learn)
How do you choose the right software product?
How much time (as a percentage) should this stage of choosing the right product, if any, take, and how much time should you spend evaluating a single product?
Is there a way back? Is it OK to change your mind after putting effort into a product and finding it not suitable?
I would love to hear any rules of thumb about those.
Changing your decisions is like changing your blueprint for a house while it's already being built.
It will entirely depend on what you have spent in time and money to that point.
Some considerations:
0) Understand the problem in clear and simple terms before beginning. Know what's critical to its success, then use that list to see if any software, language, or tool will aid it, at what cost, and whether the cost outweighs the benefit.
1) Use a crammer's schedule. Build it in the order of what you would build if you only had 1 day or 1 week and no more to work on it. It's amazing how much doesn't matter anymore when you have to do 50% of the features at 100% of the quality. Focus on value, value, value. Read something like 37signals' book Getting Real for more on this.
2) Do not re-invent the wheel, even though it always seems easier to build something from scratch. Only consider it if you need just a fraction of the implementation and your version would be truly simpler, so you can avoid piling on abstraction until you forget what you were building. If you can build it faster, better, and cheaper in the same amount of time, do it.
3) Know the features of your tools, and the benefits any tool needs to give your solution. You should be familiar with, or at least aware of, many of the tools out there that you may or may not integrate.
4) Pick a language that is used to solve a lot of problems. Chances are you will find many great libraries and tools to build your software that will save you time. If you need something that delivers and can run, and you want to lean on the smarts of others, use something established, or a language that can access .NET or Java easily if need be.
For each part of your software you recognize as a software component/package:
How do you decide whether a part should use software product or your own code?
(considering, that some tools are awesome, but will require much time to learn)
Ask yourself whether the component you are considering is a part of your product's main business core.
If not, then it is usually better to use an existing solution and not spend too much time on it.
If it is, then make sure there is no existing product that is better than what you are planning. If there is, consider purchasing licenses for it instead of developing your own.
Search online for similar components (commercial, open source and even articles/demo-source-code).
Do any of them implement all of your requirements from the components?
How much do they cost, would it cost you more to develop and maintain a similar component?
What are the license conditions? - Are they OK for your product?
If the component includes a user interface, is it pleasant to look at and easy to use?
If you answered yes to all the above then do not develop the component yourself.
If not:
Is the component open source, or published in an article / demo code? If so, is it robust? Could you take the code and improve it, or use it as an example to help you write code that is more suitable for your requirements? If so, write your own code, reusing that code so your component is not developed entirely from scratch.
If your answer to the above is no, then you'll have to develop your own (or you're searching in the wrong places).
How do you choose the right software product?
See answers to 1.
How much time (as a percentage) should this stage of choosing the right product, if any, take, and how much time to evaluate a single product?
Clear an entire day: search for existing components, read about them (features, prices, reviews) and download + install up to 5 of them.
Clear another day to evaluate 2-3 products: compare demos/examples, look at the code, and write 2 small examples of using each (same example, different product).
If you chose more than 3, clear another day and test the others.
Is there a way-back, is it o.k to change your mind, after putting efforts in a product, and finding it not suitable?
Always design your software so that every component is replaceable.
This guarantees that there is always "a way back".
(Use interfaces and the adapter design pattern, divide the code into many assemblies, and connect all components as loosely as possible - using events, binding, etc. - in other words, loose coupling; see the sketch below.)
Even if you implement something yourself, make sure there is a way back - sometimes you may use the wrong technology/design and have to replace a component with a new one you develop/purchase.
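A sketch of that "every component is replaceable" idea: the application talks only to a small interface, and each candidate product (or your own code) sits behind an adapter. The names and the vendor call here are hypothetical.

    from abc import ABC, abstractmethod

    class ReportRenderer(ABC):
        """The only surface the rest of the application is allowed to see."""
        @abstractmethod
        def render(self, data: dict) -> bytes: ...

    class ThirdPartyPdfAdapter(ReportRenderer):
        def __init__(self, vendor_client):          # wraps some purchased library
            self._client = vendor_client
        def render(self, data: dict) -> bytes:
            return self._client.build_pdf(data)     # hypothetical vendor call

    class HomeGrownHtmlRenderer(ReportRenderer):
        def render(self, data: dict) -> bytes:
            rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in data.items())
            return f"<table>{rows}</table>".encode()

    def export_report(renderer: ReportRenderer, data: dict) -> bytes:
        # Application code depends only on the interface, so swapping products
        # later means writing one new adapter, not touching the callers.
        return renderer.render(data)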
Other rules of thumb:
Consider which application-wide technologies to use before considering each component.
Writing in assembly would take the longest, in C less, in C++ even less, and in more modern languages such as C#, Java, or Delphi even less.
Which has more off-the-shelf components that are relevant to you? What does your team have experience in?
If you are using .NET (C#), then WPF could help you lower the coupling between GUI and business logic and make a better-looking GUI; however, it takes time to learn how to use it (a 5-day minimum course is very much recommended).
As in any art, the difficulty is composing a good solution from a very large space of possible solutions. There are as many ways to go about this as there are developers.
I’d normally spend some time understanding the problem and stating it as clearly and succinctly as possible, preferably in written form. The problem description should be completely abstracted away from any possible solutions. Next I’d normally list the constraints that will need to be applied to the solution (time, budget, legal, political, performance, usability, skill availability within the team and so on).
Then, in theory, you look to the market for something that solves the problem and meets the constraints at the same time. In practice, the process is not that straightforward: you try to identify market categories that are likely to be useful, then research them, see what is available, and continuously try to reduce the gap between the constraints and the capabilities as much as possible, often by going back to revisit and re-negotiate the constraints.
A few generic tips:
During the research keep coming back to the original problem.
There is always more than one solution; try to extend the breadth of the search space (concentrating on very different ways of solving the problem) before going deeper.
Be clear on the number of options worth researching, and the amount of time worth spending on each of them, before deciding whether to investigate further.
It’s seldom worth finding an optimal solution, especially when the technological landscape keeps changing very rapidly. Look for a solution that is good enough: “The Paradox of Choice - Why More is Less”.
It’s rarely worth turning to users for help (unless they are software experts) in choosing between several options. If you’ve got a number of options all looking equally attractive, that means you need to go back and understand the original problem better; it’s likely you’ve missed a requirement or two.
Some further notes on using third-party components (refers to GUI components, but easy to apply to other software areas as well).
And even more notes on scoping, composing and researching for a project.
How do you decide whether a part should use software product or your own code? (considering, that some tools are awesome, but will require much time to learn)
Ask yourself two questions.
1) Is it a mature product? If yes, then:
2) How long would it take to create the functionality it provides on your own? If that time multiplied by your hourly rate is greater than the cost of the product, then use the product.
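A back-of-the-envelope version of that rule; all three numbers are assumptions you would replace with your own estimates.

    estimated_hours_to_build = 120      # time to reproduce the needed functionality
    hourly_rate = 60.0                  # fully loaded cost per developer hour
    license_cost = 2500.0               # price of the mature product

    build_cost = estimated_hours_to_build * hourly_rate
    print(f"build ~ ${build_cost:,.0f} vs buy ~ ${license_cost:,.0f}")
    print("buy" if build_cost > license_cost else "build")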
How do you choose the right software product?
Consult your network of other developers. Have they used it? Did they run into problems? Consult the interweb. Create a prototype using the product. Does it work well? Any major bugs?
How much time (as a percentage) should this stage of choosing the right product, if any, take, and how much time to evaluate a single product?
It depends on the size of the project, and the criticality of the product to the success. Most of the time, you are going to be able to get a high level view of the product in a very short amount of time.
It may be just a few minutes of using it before you say "nope - not ready for prime time". If it makes it past that, a day or two of experimentation may tell you that it passes muster for your project.
If it's a huge project with many developers, then you probably want to spend more time doing a prototype application with it to be sure it's worth investing all that time in.
Is there a way-back, is it o.k to change your mind, after putting efforts in a product, and finding it not suitable?
If you find it's not working out, there's nothing wrong with going back. In fact you probably have to. Ideally you will find this out early. Not at the 11th hour. Again, this is the purpose of prototyping.
There are already some really good answers here, so I won't repeat them. However, there is one point you should definitely consider, and though I would have thought it's obvious, I haven't seen it mentioned here yet:
The personnel you have available to implement the solution, their core competency, and their general level of competence.
Who you have to implement this (assuming it's a team, and not just yourself - but relevant even if it is just you, too...) can have a HUGE effect on the outcome. If you don't have experienced programmers to help you develop this, you're better off looking for some OTS product to do the work for you. And even if you do have programmers but they are not likely to succeed, you still might want to find a solution with lower overall project risk.
