Virus Signatures and Genetic Algorithms

I would like to know how one arrives at the following signature. I have read online that (at least in the past) researchers would take the "suspected" file's binary code, convert it to assembly, examine it, pick sections of code that appear unusual, and identify the corresponding bytes in the machine code.
But then how is the virus string signature below achieved?
MIRC.Julie=6463632073656e6420246e69636b20433a5c57696e646f77735c4a756c696531362c4a50472e636f6d0a0d6e31333d207d0a0d6e31343d200a0d6e31353d206374637020313a70696e673a2f6463632073656e6420246e69636b20433a5c57696e646f77735c4a756c696531362c4a50
Also (although this might sound completely crazy), that string above must mean something; I can only guess a sequence of actions, actual code, etc. So if it was once "translated" into this form (a virus signature) from assembly, is it possible to convert it back?
Just in case you are wondering why I am asking what even I think is a weird question: I am preparing my BSc final-year computer science project, and at this point I am wondering whether it would be possible to generate/estimate/evaluate/predict virus signatures using GAs (Genetic Algorithms). I hope that makes the question a bit easier to understand.
Thanks!

You cannot revert it back, because virus signatures are normally encrypted. The way they are obtained is by extracting the malicious binary code from an executable and then converting it to a hexadecimal representation.
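For what it's worth, the last step there (the hexadecimal representation) is just an encoding, and that part on its own can be reversed mechanically. A minimal Python sketch, using only a prefix of the string above for brevity:

    # Undo only the hex encoding of the signature shown above.
    # (Only the first 26 bytes of the string are used here, for brevity.)
    sig_prefix = "6463632073656e6420246e69636b20433a5c57696e646f77735c"
    raw = bytes.fromhex(sig_prefix)
    print(raw.decode("ascii", errors="replace"))  # -> dcc send $nick C:\Windows\

This undoes the hex step only; it tells you nothing about how the scanner chose those bytes or how it stores signatures internally.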
Hope this helps.

The virus signature shown is probably dependent on the scanner that generated it. I find it extremely easy to believe that all virus scanners create their signatures in different ways. Without a source, there's no way to explain how it was developed, and even with a source I doubt this is something that AV companies will reveal, since it allows virus developers the opportunity to avoid detection.
"Generate/estimate/evaluate/predict" are four different problems and not all of them are best done with a GA. You need to select your problem before selecting an algorithm.

Related

Competitive Programming : Generating test cases and validating the program correctness

I have been doing sport programming for a while and am still improving day by day. One thing I have always wondered is whether I could automate the test-case generation process and the cross-validation of my program. It would definitely be a brute-force approach, since some test cases would be algorithm-specific.
A Google search gives me a nice link on Quora: "How do programming contest problem setters make test cases?", and the popular testlib used by problem setters.
But isn't this a chicken-egg problem?
Assume I generated one million input test cases; what would I check them against? How would I generate the outputs? I am still in the process of validating the program. If my script generated the correct outputs as well, then what's the point of writing the program in the first place? I could just submit the script itself. Also, it's not possible to write one million outputs for generated test cases by hand. Can anyone please clarify this confusion?
I hope I have described the problem clearly.
It's common to generate the answer with a slow but obviously correct solution (like an exhaustive search). It can't be used as the main solution because it's too slow for large test cases, but you can use it to check the output of your fast (but possibly incorrect) program.
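For illustration, the usual way to wire that up is a small stress-test loop. This is only a sketch in Python: max-subarray-sum stands in for the real problem, and every function name here is made up.

    # Stress test: compare a fast solution against a slow, obviously
    # correct brute force on many small random inputs.
    import random

    def random_case(rng):
        n = rng.randint(1, 8)                      # keep cases small for the brute force
        return [rng.randint(-10, 10) for _ in range(n)]

    def brute_solve(a):
        # obviously correct but slow: try every non-empty subarray
        return max(sum(a[i:j]) for i in range(len(a)) for j in range(i + 1, len(a) + 1))

    def fast_solve(a):
        # the solution under test (Kadane's algorithm, just as an example)
        best = cur = a[0]
        for x in a[1:]:
            cur = max(x, cur + x)
            best = max(best, cur)
        return best

    rng = random.Random(0)
    for _ in range(10000):
        case = random_case(rng)
        expected, got = brute_solve(case), fast_solve(case)
        if expected != got:
            print("counterexample:", case, "expected", expected, "got", got)
            break
    else:
        print("no mismatch found")

The small input sizes matter: the brute force only has to be fast enough for tiny cases, and tiny counterexamples are also the easiest ones to debug.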
Well, the thing is, it is not as broad as you think. Test generation in competitive programming is guided by the problem's algorithm and its correctness proof.
So while it may seem that there are a million possible test cases, if you analyze the different situations the program can be in, you are likely to cover all of them. For example, some algorithms process the even-indexed and odd-indexed elements of an array differently. What do you do? Split it into two cases, even and odd, and consider the smallest case for each. This way you are essentially visiting every control-flow path of the program.
In competitive programming, since we first determine the algorithm, then decide on proper input sizes, and only then write the test cases and validation, it is often easy to think of the corner cases: a test with 1,000,000 elements, an input of 0 or 1, and so on.
Another approach: most of the time we write a brute-force solution that is much slower than the real one. Then we generate random medium-sized test cases, run them against the slow program, and check the results with our checker solution, etc.
Correctness is also guided by mathematical proof (heuristics, induction, the box (pigeonhole) principle, number theory, etc.). That way we can be sure the solution is correct.
I faced the same issue earlier this year and saw some of my colleagues also trying to figure out a way to deal with it. Sometimes I just couldn't think of any more test cases, and that's when I decided to make a test-case generator tool of my own. It's free and open source, so anyone can use it.
You can easily generate a lot of test cases with this tool and validate the results against the output of the correct but slower approach (in terms of time and space complexity). You can either run both programs in parallel and compare the outputs, or write a simple script to compare the output of the two (the slower but correct one and the faster but unverified one).
I believe good coders won't be needing it anyway, but for middle-level (Div 2, Div 3) coders and newbies it can prove to be very helpful.
You can access it from GitHub : Test Case generator.
Both the Python source code and .exe files are there, with instructions.
If you want to make some changes of your own, you can work with the Python file.
If you just want it to generate some test cases directly, use the .exe file (inside the zip).
It'll help a lot if you're beginning your competitive programming journey.
Also, any suggestions or improvements are always welcome. You can also contribute to this project by adding a feature you think it is lacking, or by adding new test-case formats yourself or requesting them.

Downloading the binary code from an AT89S52 chip

I have an AT89S52, and I want to read the program burned on it.
Is there a way to do it with the programming interface?
(I am well aware it will be assembly code, but I think I can handle it, since I'm looking for a specific string in that code)
Thanks
You may not be able to (at all easily anyway) -- that chip has three protection bits that are intended to prevent you from doing so. If you're dealing with some sort of commercial product, chances are pretty good that those bits will be set.
Reference: Datasheet, page 20, section 17.

Writing fool-proof applications vs writing for performance

I'm biased towards writing fool-proof applications. For example, with a PHP site, I validate all the inputs on the client side using JS, then validate again on the server side. On both sides I check for emptiness and for patterns (email, phone, URL, number, etc.). Then, on the server side, I strip malicious tags or characters and trim the values. Later I convert the input into the desired formats/data types (string, int, float, etc.). If the library is meant for the server side only, I even give developers a chance at graceful degradation: I tolerate the worst inputs and normalize them to acceptable ones (I have a predefined set of acceptable values).
Now I'm re-reading a library that I wrote one and a half years ago, and I'm wondering whether developers are really so evil, or so clueless, that I need to do this much graceful degradation, finding every possible chance to make them right even when they give crappy input, which seriously hurts performance. Or should I do minimal checking and expect developers to be able and willing to provide proper input? I have no hope for end users, but should I trust developers more and give them an application/library with better performance?
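As a rough illustration of that kind of pipeline (the post is about PHP, but the idea is language-agnostic), here is a minimal Python sketch; the field names and rules are made up:

    # Sketch of a server-side validate/normalize pipeline: trim, neutralise
    # markup, check the pattern, coerce to the target type, and fall back
    # gracefully for a field where a default is acceptable.
    import html
    import re

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def clean_email(raw):
        value = html.escape(raw.strip())           # trim and strip/neutralise tags
        if not value:
            raise ValueError("empty input")
        if not EMAIL_RE.match(value):
            raise ValueError("not an email address")
        return value.lower()                       # normalise to a canonical form

    def clean_quantity(raw, default=1):
        try:
            return max(1, int(str(raw).strip()))   # coerce to the desired type, clamp
        except (TypeError, ValueError):
            return default                         # graceful degradation to a sane value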
Common policy is to validate on the server anything sent from the client because you can't be totally sure it really was your client that sent it. You don't want to "trust developers more" and in the process find that you've "trusted hackers of your site more".
Fixing invalid input automatically can be as much a curse as a blessing: you've essentially committed to accepting the invalid input as a valid part of your protocol (i.e., if a future version makes a change that breaks the invalid input you have been correcting, it is no longer backwards compatible with the client code that has been written). In extremis, you might paint yourself into a corner that way. Also, invalid calls tend to propagate to new code: people often copy and paste example code and then modify it to meet their needs. If they've copied bad code that you've been silently correcting at the server, you might find that proportionally more and more bad data comes in, as well as confusing new programmers who think "that just doesn't look like it should be right, but it's the example everyone is using -- maybe I don't understand this after all".
Never expect diligence from developers. Always validate, if you can, any input that comes into your code, especially if it comes across a network.
End users (whether they're programmers using your tool, or non-programmers using your application) don't have to be stupid or evil to type the wrong thing in. As programmers we all too often make wrong assumptions about what's obvious for them.
That's the first thing, which justifies comprehensive validation all on its own. But validation isn't the same as guessing what they meant from what they typed, and inferring correct input from incorrect - unless the inference rules are also well known to the users (like Word's auto-correct, for instance).
But what is this performance you seek? There's no bit of client-side (or server-side, for that matter) validation that takes longer to run than the second or so that is an acceptable response time.
Validate, and ensure it doesn't break as the first priority. Then worry about making it clever enough to know (reliably) what they meant. After that, worry about how fast it is. In the real world, syntax validation doesn't make a measurable difference to anything where user input takes most of the total time.
Microsoft made the mistake of trusting programmers to do the right thing back in the days of Windows 3.1 and to a lesser extent Windows 95. You need only read a few posts from Raymond Chen to see where that road ultimately leads.
(P.S. This is not a dig against Microsoft - it's a statement of fact about how programmers abused the more liberal Win16 environment, either deliberately or through ignorance.)
I think you are right in being biased toward fool-proof applications. I would not assume that that degrades performance enough to be of much concern. Rather I would address performance concerns separately, starting by profiling or my favorite method, stackshots. There must be a way to get those in PHP.

How do you decide whether to use a library or write your own implementation

Inspired by this question which started out innocently but is turning into a major flame war.
Let's say you need a utility method - reasonably straightforward, but not a one-liner. The quoted question was how to repeat a string X times. How do you decide whether to use a third-party implementation or write your own?
The obvious downside to the third-party approach is that you're adding a dependency to your code.
But if you write your own, you need to code it, test it, and (maybe) profile it, so you'll likely end up spending more time.
I know the decision itself is subjective, but the criteria you use to arrive at it should not be.
So, what criteria do you use to decide when to write your own code?
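For reference, the utility from the quoted question is about as small as these things get. A Python sketch of the write-it-yourself route, including the tests you then owe it (in Python it is essentially the built-in * operator anyway):

    # The "write your own" cost in miniature: the function plus its tests.
    def repeat_string(s, times):
        if times < 0:
            raise ValueError("times must be non-negative")
        return s * times

    assert repeat_string("ab", 3) == "ababab"
    assert repeat_string("ab", 0) == ""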
General Decision
Before deciding on what to use, I will create a list of criteria that must be met by the library. This could include size, simplicity, integration points, speed, problem complexity, dependencies, external constraints, and license. Depending on the situation the factors involved in making the decision will differ.
Generally, I will hunt for a suitable library that solves the problem before writing my own implementation. If I have to write my own, I will read up on appropriate algorithms and seek ideas from other implementations (e.g., in a different language).
If, after all the aspects described below, I can find no suitable library or source code, and I have searched (and asked on suitable forums), then I will develop my own implementation.
Complexity
If the task is relatively simple (e.g., a MultiValueMap class), then:
Find an existing open-source implementation.
Integrate the code.
Rewrite it, or trim it down, if it is excessive.
If the task is complex (e.g., a flexible object-oriented graphing library), then:
Find an open-source implementation that compiles (out-of-the-box).
Execute its "Hello, world!" equivalent.
Perform any other evaluations as required.
Determine its suitability based on the problem domain criteria.
Speed
If the library is too slow, then:
Profile it.
Optimize it.
Contribute the results back to the community.
If the code is too complex to be optimized, and speed is a factor, discuss it with the community and provide profiling details. Otherwise, look for an equivalent, but faster (possibly less feature-rich) library.
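As a quick illustration of gathering those profiling details, here is a minimal sketch using Python's standard profiler; the workload function is just a placeholder for whatever library call you suspect is slow:

    # Profile a suspect call and print the ten most expensive functions
    # by cumulative time.
    import cProfile
    import pstats

    def workload():
        # placeholder for the library call under investigation
        sorted(range(500_000), key=str)

    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)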
API
If the API is not simple, then:
Write a facade and contribute it back to the community.
Or find a simpler API.
Size
If the compiled library is too large, then:
Compile only the necessary source files.
Or find a smaller library.
Bugs
If the library does not compile out of the box, seek alternatives.
Dependencies
If the library depends on scores of external libraries, seek alternatives.
Documentation
If there is insufficient documentation (e.g., user manuals, installation guides, examples, source code comments), seek alternatives.
Time Constraints
If there is ample time to find an optimal solution, then do so. Often there is not sufficient time to write from scratch. And usually there are a number of similar libraries to evaluate. Keep in mind that, by meticulous loose coupling, you can always swap one library for another. Find what works, initially, and if it later becomes a burden, replace it.
Development Environment
If the library is tied to a specific development environment, seek alternatives.
License
Open source.
10 questions ...
+++ (use library) ... --- (write own library)
Is the library exactly what I need? Customizable in a few steps? +++
Does it provide almost all functionality? Easily extensible? +++
No time? +++
It covers half of what I need and plays well with others? ++
Hard to extend, but excellent documentation? ++
Hard to extend, yet most of the functionality? +
Functionality ok, but outdated? -
Functionality ok, .. but weird (crazy interface, not robust, ...)? --
Library works, but the person who needs to decide is in a state of hubris? ---
Library works, manageable code size, portfolio needs update? ---
Some thoughts ...
If it is something small but useful, probably for others too, then why not write a library and put it on the web? The cost of publishing this kind of small library has decreased, as has the hurdle for others to tune in (see Bitbucket or GitHub). So what are the criteria?
Maybe it should not exactly replicate an existing, already-known library. If it replicates something existing, it should approach the problem from a new angle, or better, it should provide a shorter, more condensed (or more fun) solution.
If it's a trivial function, it's not worth pulling in an entire library.
If it's a non-trivial function, then it may be worth it.
If it's multiple functions which can all be handled by pulling in a single library, it's almost definitely worth it.
Keep it in balance
You should keep several criteria in balance. I'd consider a few topics and ask a few questions.
Developing time VS maintenance time
Can I develop what I need in a few hours? If yes, why do I need a library? And if I pull in a lib, am I sure it will not cost me hours of debugging and documentation reading? The answer: if I need something obvious and straightforward, I don't need an extra-flexible lib.
Simplicity VS flexibility
If I just need an error wrapper, do I need a lib with flexible types and stack tracing and colored printing and... Nope! Even beautifully designed but flexible, multipurpose libs can slow your code down. If you plan to use 2% of the functionality, you don't need it.
Big task VS small task
Am I facing a huge task that needs external code to solve it? AMQP or SQL operations are definitely too big to develop from scratch, but tiny logging can be handled in place. Don't use external libs to solve small tasks.
My own libs VS external libs
Sometimes it is better to grow your own library, because it is 100% used, it is 100% appropriate to your goals, you know it best, and it is always up to date with your applications. But don't build your own lib just to be cool, and keep in mind that a lot of the libs in your vendor directory were developed "just to be cool".
For me this would be a fairly easy answer.
If you need to be cost effective, then it would probably be best to try and find a library/framework that does what you want. If you can't find it, then you will be forced to write it or find a different approach.
If you have the time and find it fun, write one. You will learn a lot along the way, and you can give back to the open-source community with your killer new bundle of code. If you don't, well, then don't. But if you can't find one, then you have to write it anyway ;)
Personally, if I can justify writing a library, I always opt for that. It's fun, you learn a lot about what you are directing your focus towards, and you have another tool to add to your arsenal and put on your CV.
If the functionality is only a small part of the app, or if your needs are the same as everyone else's, then a library is probably the way to go. If you need to consume and output JSON, for example, you can probably knock something together in five minutes to handle your immediate needs. But then you start adding to it, bit by bit. Eventually you have all the functionality that you would find in any library, but 1) you had to write it yourself and 2) it isn't as robust and well documented as what you would find in a library.
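To make that concrete, the "knock something together in five minutes" version might look like the Python sketch below. It handles only flat dicts of strings and numbers, which is exactly the kind of gap that makes it grow bit by bit:

    # A five-minute, deliberately incomplete JSON writer: flat dicts of
    # strings and numbers only, no nesting, no None, no lists.
    def to_json(d):
        parts = []
        for key, value in d.items():
            if isinstance(value, str):
                value = '"' + value.replace('"', '\\"') + '"'
            parts.append('"%s": %s' % (key, value))
        return "{" + ", ".join(parts) + "}"

    print(to_json({"name": "demo", "count": 3}))   # {"name": "demo", "count": 3}

Every feature it lacks is a future patch, and each patch moves it closer to the library you decided not to use.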
If the functionality is a big part of the app, and if your needs aren't exactly the same as everyone else's, then think much more carefully. For example, if you are doing machine learning, you might consider using a package like Weka or Mahout, but these are two very different beasts, and this component is likely to be a significant part of your application. A library in this case could be a hindrance, because your needs might not fit the design parameters of the original authors, and if you attempt to modify it, you will need to worry about a much larger and more complex system than the minimum that you would build yourself.
There's a good article out there about sanitizing HTML: it was a big part of the app and something that would need to be heavily tuned, so using an outside library wasn't the best solution, despite the fact that there were many libraries out there that did exactly what seemed to be called for.
Another consideration is security.
If a black-hat hacker finds a bug in your code, they can create an exploit and sell it for money. The more popular the library, the more the exploit is worth. Think about OpenSSL or WordPress exploits. If you re-implement the code, chances are your code is not vulnerable in exactly the same way the popular library is. And if your lib is not popular, then a zero-day exploit against your code probably wouldn't be worth much, and there is a good chance your code is not targeted by bounty hunters.
Another consideration is language safety. C can be very fast, but from a security standpoint it's asking for trouble. If you reimplement the lib in some scripting language, the chances of arbitrary-code-execution exploits are low (as long as you know the possible attack vectors, like serialization or evals).

Standard methods of debugging

What's your standard way of debugging a problem? This might seem like a pretty broad question, with some of you replying 'it depends on the problem', but I think a lot of us debug by instinct and haven't actually tried putting our process into words. That's why we say 'it depends'.
I was sort of forced to put my process into words recently, because a few developers and I were working on the same problem and debugging it in totally different ways. I wanted them to understand what I was trying to do, and vice versa.
After some reflection I realized that my way of debugging is actually quite monotonous. First I try to reliably replicate the problem (especially on my local machine). Then, through a process of elimination (and this is where I think it's problem-dependent), I try to identify the cause.
The other guys were trying to do it in a totally different way.
So, just wondering what has been working for you guys out there? And what would you say your process is for debugging if you had to formalize it in words?
BTW, we still haven't found out our problem =)
My approach varies based on my familiarity with the system at hand. Typically I do something like:
Replicate the failure, if at all possible.
Examine the fail state to determine the immediate cause of the failure.
If I'm familiar with the system, I may have a good guess about the root cause. If not, I start to mechanically trace the data back through the software while challenging the basic assumptions made by the software.
If the problem seems to have a consistent trigger, I may manually walk forward through the code with a debugger while challenging implicit assumptions that the code makes.
Tracing the root cause is, of course, where things can get hairy. This is where having a dump (or better, a live, broken process) can be truly invaluable.
I think that the key point in my debugging process is challenging pre-conceptions and assumptions. The number of times I've found a bug in that component that I or a colleague would swear is working fine is massive.
I've been told by my more intuitive friends and colleagues that I'm quite pedantic when they watch me debug or ask me to help them figure something out. :)
Consider getting hold of the book "Debugging" by David J Agans. The subtitle is "The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems". His list of debugging rules — available in a poster form at the web site (and there's a link for the book, too) is:
Understand the system
Make it fail
Quit thinking and look
Divide and conquer
Change one thing at a time
Keep an audit trail
Check the plug
Get a fresh view
If you didn't fix it, it ain't fixed
The last point is particularly relevant in the software industry.
I picked these up from the web or from some book I can't recall (it may have been CodingHorror...)
Debugging 101:
Reproduce
Progressively Narrow Scope
Avoid Debuggers
Change Only One Thing At a Time
Psychological Methods:
Rubber-duck debugging
Don't Speculate
Don't be too Quick to Blame the Tools
Understand Both Problem and Solution
Take a Break
Consider Multiple Causes
Bug Prevention Methods:
Monitor Your Own Fault Injection Habits
Introduce Debugging Aids Early
Loose Coupling and Information Hiding
Write a Regression Test to Prevent Recurrence
Technical Methods:
Insert Trace Statements
Consult the Log Files of Third Party Products
Search the web for the Stack Trace
Introduce Design By Contract
Wipe the Slate Clean
Intermittent Bugs
Exploit Locality
Introduce Dummy Implementations and Subclasses
Recompile / Relink
Probe Boundary Conditions and Special Cases
Check Version Dependencies (third party)
Check Code that Has Changed Recently
Don't Trust the Error Message
Graphics Bugs
When I'm up against a bug that I can't seem to figure out, I like to make a model of the problem. Make a copy of the problem section of code, and start removing features from it, one at a time. Run a unit test against the code after every removal. Through this process you will either remove the feature containing the bug (and hence locate it), or you will have isolated the bug down to a core piece of code that contains the essence of the problem. And once you figure out the essence of the problem, it's a lot easier to fix.
I normally start by forming a hypothesis based on the information I have at hand. Once that is done, I work to prove it correct. If it proves to be wrong, I start over with a different hypothesis.
Most multithreaded synchronization issues get solved very easily with this approach.
You also need a good understanding of the debugger you are using and its features. I work on Windows applications and have found windbg to be extremely helpful in finding bugs.
Reducing the bug to its simplest form often leads to a greater understanding of the issue, and adds the benefit of being able to involve others if necessary.
Setting up a quick reproduction scenario, to allow for efficient use of your time when testing any hypothesis you choose.
Creating tools to dump the environment quickly for comparisons.
Creating and reproducing the bug with logging turned up to the maximum level.
Examining the system logs for anything alarming.
Looking at file dates and timestamps to get a feeling if the problem could be a recent introduction.
Looking through the source repository for recent activity in the relevant modules.
Apply deductive reasoning and Ockham's Razor.
Be willing to step back and take a break from the problem.
I'm also a big fan of using the process of elimination. Ruling out variables tremendously simplifies the debugging task. It's often the very first thing that should be done.
Another really effective technique is to roll back to your last working version, if possible, and try again. This can be extremely powerful because it gives you solid footing to proceed more carefully. A variation on this is to get the code to a point where it is working with less functionality, rather than not working with more functionality.
Of course, it's very important not to just try things at random. That only increases your despair, because it never works. I'd rather make 50 runs to gather information about the bug than take one wild swing and hope it works.
I find the best time to "debug" is while you're writing the code. In other words, be defensive. Check return values, liberally use assert, use some kind of reliable logging mechanism and log everything.
To answer the question more directly: the most efficient way for me to debug problems is to read code. Having a log helps you find the relevant code to read quickly. No logging? Spend the time putting it in. It may not seem like you're finding the bug, and you may not be; the logging might help you find another bug, though. Eventually, once you've gone through enough code, you'll find it -- faster than setting up debuggers, trying to reproduce the problem, single-stepping, and so on.
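A small Python sketch of that defensive style, with a made-up function, just to show the shape of it: check inputs, assert invariants, and log enough that reading the log later points you at the right code:

    # Defensive-by-default: validate, assert, and log everything.
    import logging

    logging.basicConfig(level=logging.DEBUG,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger(__name__)

    def average(values):
        log.debug("average() called with %d values", len(values))
        assert all(isinstance(v, (int, float)) for v in values), "non-numeric input"
        if not values:
            log.warning("average() called with empty input, returning 0.0")
            return 0.0
        result = sum(values) / len(values)
        log.debug("average() -> %r", result)
        return result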
While debugging I try to think of what the possible problems could be. I've come up with a fairly arbitrary classification system, but it works for me: all bugs fall into one of four categories. Keep in mind here that I'm talking about runtime problems, not compiler or linker errors. The four categories are:
dynamic memory allocation
stack overflow
uninitialized variable
logic bug
These categories have been most useful to me with C and C++, but I expect they apply pretty well elsewhere. The logic bug category is a big one (e.g. putting a < b when the correct thing was a <= b), and can include things like failing to synchronize access among threads.
Knowing what I'm looking for (one of these four things) helps a lot in finding it. Finding bugs always seems to be much harder than fixing them.
The actual mechanics for debugging are most often:
do I have an automated test that demonstrates the problem?
if not, add a test that fails
change the code so the test passes
make sure all the other tests still pass
check in the change
No automated testing in your environment? No time like the present to set it up. Too hard to organize things so you can test individual pieces of your program? Take the time to make it so. It may make this particular bug take "too long" to fix, but the sooner you start, the faster everything else will go. Again, you might not fix the particular bug you're looking for, but I bet you find and fix others along the way.
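In concrete terms, the loop above might start with a regression test like this Python sketch; the price parser and the bug it guards against are made up for illustration:

    # A regression test written for a (made-up) bug in a price parser,
    # kept alongside the code so the bug cannot quietly come back.
    import unittest

    def parse_price(text):
        # the fixed implementation; before the fix, "1,299.00" parsed as 1.0
        return float(text.replace(",", ""))

    class ParsePriceRegression(unittest.TestCase):
        def test_handles_thousands_separator(self):
            self.assertEqual(parse_price("1,299.00"), 1299.00)

        def test_plain_number_still_works(self):
            self.assertEqual(parse_price("5.50"), 5.50)

    if __name__ == "__main__":
        unittest.main()

The test is written first so it fails against the buggy code, then the fix makes it pass along with the rest of the suite.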
My method of debugging is different, probably because I am still a beginner.
When I encounter a logic bug, I end up adding more variables to see which values go where, and then I step through, line by line, the piece of code that is causing the problem.
Replicating the problem and generating a repeatable test data set is definitely the first and most important step to debugging.
If I can identify a repeatable bug, I'll typically try and isolate the components involved until I locate the problem. Frequently I'll spend a little time ruling out cases so I can state definitively: The problem is not in component X (or process Y, etc.).
First I try to replicate the error; without being able to replicate it, it is basically impossible in a non-trivial program to guess the problem.
Then, if possible, I break the code out into a separate standalone project. There are several reasons for this: first, if the original project is big, it is quite difficult to debug; second, it eliminates or highlights any assumptions about the code.
I normally have another copy of VS open, which I use for debugging parts in mini projects and for testing routines that I later add to the main project.
Once I have reproduced the error in the separate module, the battle is almost won.
Sometimes it is not easy to break out a piece of code, so in those cases I use different methods depending on how complex the issue is. In most cases, assumptions about data seem to come back and bite me, so I try to add lots of asserts in the code to make sure my assumptions are correct. I also disable code with #ifdef until the error disappears. Eliminating dependencies on other modules, etc. -- sort of slowly circling in on the bug like a vulture.
I don't think I really have a conscious way of doing it; it varies quite a lot, but the general principle is to eliminate the noise around the issue until it is quite obvious what it is. Hope I didn't sound too confusing :)

Resources