Tools for diffing Windows binaries?

Our QA team wants to focus their testing based on which EXEs and DLLs have actually changed between builds. We have a nice SVN change report, but the relationship between source and changed binaries isn't always obvious. The builds we're comparing are always full clean builds, so we can't use file system timestamps. I'm looking for tools to compare Windows (and Windows CE) PE binaries that will ignore the embedded timestamps and other cruft. Any recommendations for tools or other ways to generate a reliable 'what binaries have really changed' report? Thanks.
Clarification: Thanks for the answers so far, but we can't generate the report by doing straightforward byte-for-byte compares or comparing checksums: all the files appear different every time we build, even if the sources haven't changed, because of the timestamps the compiler inserts. The problem is how to ignore the false positives. The disassemble-and-compare idea is closest to what we need, I think...
Answered! BinDiff is just what I was looking for. Many thanks.

Have you had a look at BinDiff?

I ran into this problem before. My solution was to write a tool which set all the timestamps in an .EXE/.DLL to a known value. I would run it as a post-build step. Then binary diffs would work just fine.
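For reference, here is a minimal sketch of that kind of post-build normalization in Python, assuming the third-party pefile module; it fixes the COFF header timestamp and the image checksum (a tool like the one described above would also have handled the resource directory timestamps, which this sketch glosses over):

# normalize_pe.py - sketch of a post-build step that rewrites a PE's
# volatile header fields to fixed values so that byte-for-byte diffs
# become meaningful. Requires the third-party 'pefile' module.
import sys

import pefile

def normalize(path):
    with open(path, "rb") as f:
        pe = pefile.PE(data=f.read())
    pe.FILE_HEADER.TimeDateStamp = 0   # the "link time" stamp in the COFF header
    pe.OPTIONAL_HEADER.CheckSum = 0    # the image checksum, recomputed each build
    pe.write(filename=path)
    pe.close()

if __name__ == "__main__":
    for binary in sys.argv[1:]:   # EXE/DLL paths passed on the command line
        normalize(binary)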

You could perhaps disassemble the binary, and then do a diff on the assembly...
This sounds like your QA team is taking the wrong approach though... It shouldn't matter to them what the code looks like; just that it does what it's supposed to do.
Edit:
Oh! After reading it again, I realized that I misinterpreted your question. I thought they wanted to test the methods that had changed...
In that case, why not compute the MD5 hashes and compare those? The tiniest change will cause a totally different hash to be generated.

Not sure what kind of binaries you have (DLLs? PE/WinCE executables only? Other?). Is it possible to embed version information in the binaries, e.g. using a source control tag that updates the version in the source code on commits? Then when the new build is created, the binary would have its version string updated as well. Instead of having to diff a binary file, you could use the version string and check that for changes.
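If the version-resource route fits, pulling the version back out of a built binary is easy to script. Here is a small sketch using the third-party Python module pefile (note: in recent pefile versions VS_FIXEDFILEINFO is a list; "MyApp.exe" is a hypothetical binary):

# Sketch: read the fixed file-version numbers from a PE's version
# resource. Requires the third-party 'pefile' module.
import pefile

def file_version(path):
    pe = pefile.PE(path)
    try:
        info = pe.VS_FIXEDFILEINFO[0]  # AttributeError if no version resource
        ms, ls = info.FileVersionMS, info.FileVersionLS
        return (ms >> 16, ms & 0xFFFF, ls >> 16, ls & 0xFFFF)
    finally:
        pe.close()

print(file_version("MyApp.exe"))  # e.g. (1, 2, 3, 456)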

Look at NDepend.

When I was working on the "home grown" tool for installation verification at my company, we used Beyond Compare as a backend for comparison.
It has great file/folder comparison (binary as well) and scripting capabilities and can output XML reports.

Project dependency graph generator and Dependency-Grapher for C++-Projects both use GraphViz to visualize dependencies. I imagine that you could use either of them as a basis for your needs with special highlighting of the branches in the dependency graph where source files or other leaves have changed.
MD5 hashes or checksums (as suggested above), a simple diff ignoring whitespace and filtering out comment changes, or changelist information from your version control system can signal which files have changed.
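As a concrete illustration of the checksum idea, here is a minimal Python sketch that emits one "relative-path md5" line per binary under a build tree, so two such reports can be diffed. Per the clarification above, this is only meaningful once the timestamp noise has been normalized away; the build path is hypothetical:

# Sketch: one "relative-path  md5" line per EXE/DLL under a build tree;
# diff two such reports to see which binaries changed. Only useful if
# the volatile timestamp fields have been normalized first.
import hashlib
import os

def hash_report(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith((".exe", ".dll")):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.md5(f.read()).hexdigest()
                print(os.path.relpath(path, root), digest)

hash_report(r"C:\builds\latest")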

GNU binutils, specifically strings.

Related

How to use Subversion with HelpNDoc

I am writing a documentation for a project that involves multiple developers. We use Subversion (SVN) to work on our code base.
I wrote the first draft of the documentation using HelpNDoc, which I like for its nice tree view and ease of use; the problem is that everything lives in a single file, so I don't know how to use SVN to allow other developers to contribute to the documentation and update it.
Do you know if this is possible? If not, can you recommend a nice piece of software, easy to use, with a tree view of the documentation, that can be used with SVN or otherwise makes it possible for multiple users to update it? We use Windows.
HelpNDoc projects are binary files based on the SQLite open source database engine. The advantage is that the whole documentation is stored in a single file, so it can easily be copied, moved, shared, backed up...
However, one drawback is that it has to be checked in as binary content in any version control system, including Subversion: diff and merge are not possible on those files.
One possible solution would be to use external documents in HelpNDoc's library: each user works on her own document (which can be a Word document, an HTML web page...) and a master HelpNDoc project is created to include those documents at generation time. See "Include a file at generation time" in the following step-by-step guide: How to add an item to the library
The number of files doesn't matter; the actual format (text vs. binary) does. If SVN (or any VCS) could merge two HelpNDoc files with diverged histories (just try it by hand), you'd be happy.
I once used Helpinator for software documentation. It's pretty close to HelpNDoc, but its storage format is more suitable for version control.

How do you verify that 2 copies of a VB 6 executable came from the same code base?

I have a program under version control that has gone through multiple releases. A situation came up today where someone had somehow managed to point to an old copy of the program and thus was encountering bugs that have since been fixed. I'd like to go back and just delete all the old copies of the program (keeping them around is a company policy that dates from before version control was common and should no longer be necessary) but I need a way of verifying that I can generate the exact same executable that is better than saying "The old one came out of this commit so this one should be the same."
My initial thought was to simply MD5 hash the executable, store the hash file in source control, and be done with it but I've come up against a problem which I can't even parse.
It seems that every time the executable is generated (method: Open Project. File > Make X.exe) it hashes differently. I've noticed that Visual Basic messes with files in seemingly random ways every time the project is opened, but I didn't think that would make it into the executable, nor do I have any evidence that that is indeed what's happening. To guard against it, I tried generating the executable multiple times within the same IDE session and checking the hashes, but they continued to be different every time.
So that's:
1. Generate the executable.
2. Generate the MD5 checksum: md5sum X.exe > X.md5
3. Verify the MD5 for the current executable: md5sum -c X.md5
4. Generate a new executable.
5. Verify the MD5 for the new executable: md5sum -c X.md5
6. Verification fails because the computed checksum doesn't match.
I'm not understanding something about either MD5 or the way VB 6 is generating the executable but I'm also not married to the idea of using MD5. If there is a better way to verify that two executables are indeed the same then I'm all ears.
Thanks in advance for your help!
That's going to be nearly impossible. Read on for why.
The compiler will win this game, every time...
Compiling the same project twice in a row, even without making any changes to the source code or project settings, will always produce different executable files.
One of the reasons for this is that the PE (Portable Executable) format that Windows uses for EXE files includes a timestamp indicating the date and time the EXE was built, which is updated by the VB6 compiler whenever you build the project. Besides the "main" timestamp for the EXE as a whole, each resource directory in the EXE (where icons, bitmaps, strings, etc. are stored in the EXE) also has a timestamp, which the compiler also updates when it builds a new EXE. In addition to this, EXE files also have a checksum field that the compiler recalculates based on the EXE's raw binary content. Since the timestamps are updated to the current date/time, the checksum for the EXE will also change each time a project is recompiled.
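You can see these fields for yourself; here is a quick sketch using the third-party Python module pefile (the paths are hypothetical; the per-resource-directory timestamps are reachable too, via pefile's DIRECTORY_ENTRY_RESOURCE):

# Sketch: print the volatile PE header fields that differ between two
# builds of identical source. Requires the third-party 'pefile' module.
import pefile

for path in (r"build1\X.exe", r"build2\X.exe"):
    pe = pefile.PE(path)
    print(path,
          "TimeDateStamp=0x%08X" % pe.FILE_HEADER.TimeDateStamp,
          "CheckSum=0x%08X" % pe.OPTIONAL_HEADER.CheckSum)
    pe.close()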
But, but...I found this really cool EXE editing tool that can undo this compiler trickery!
There are EXE editing tools, such as PE Explorer, that claim to be able to adjust all the timestamps in an EXE file to a fixed time. At first glance you might think you could just set the timestamps in two copies of the EXE to the same date, and end up with equivalent files (assuming they were built from the same source code), but things are more complicated than that: the compiler is free to write out the resources (strings, icons, file version information, etc.) in a different order each time you compile the code, and you can't really prevent this from happening. Resources are stored as independent "chunks" of data that can be rearranged in the resulting EXE without affecting the run-time behavior of the program.
If that wasn't enough, the compiler might be building up the EXE file in an area of uninitialized memory, so certain parts of the EXE might contain bits and pieces of whatever was in memory at the time the compiler was running, creating even more differences.
As for MD5...
You are not misunderstanding MD5 hashing: MD5 will always produce the same hash given the same input. The problem is that the input in this case (the EXE files) keeps changing.
Conclusion: Source control is your friend
As for solving your current dilemma, I'll leave you with this: associating a particular EXE with a specific version of the source code is more a matter of policy, which has to be enforced somehow, than anything else. Trying to figure out what EXE came from what version without any context is just not going to be reliable. You need to track this with the help of other tools, for example by ensuring that each build produces a different version number for your EXEs, and that that version can be easily paired with a specific revision/branch/tag/whatever in your version control system. To that end, a "free-for-all" situation where some developers use source control and others use "that copy of the source code from 1997 that I'm keeping in my network folder because it's my code and source control is for sissies anyway" won't help make this any easier. I would get everyone drinking the source control Kool-Aid and adhering to a standard policy for creating builds right away.
Whenever we build projects, our build server (we use Hudson) ensures that the compiled EXE version is updated to include the current build number (we use the Version Number Plugin and a custom build script to do this), and when we release a build, we create a tag in Subversion using the version number as the tag name. The build server archives release builds, so we can always get the specific EXE (and setup program) that was given to a customer. For internal testing, we can choose to pull an archived EXE from the build server, or just tell the build server to rebuild the EXE from the tag we created in Subversion.
We also never, ever, ever release any binaries to QA or to customers from any machine other than the build server. This prevents "works on my machine" bugs, and ensures that we are always compiling from a "known" copy of the source code (it only pulls and builds code that is in our Subversion repository), and that we can always associate a given binary with the exact version of the code that it was created from.
I know it has been a while, but since there is a VB de-compiler app, you might consider bulk-decompiling the VB6 apps and then feeding the decompilation results to AI/statistical anomaly detection across the various code bases. Given that the problem you face doesn't have an exact solution, the results are unlikely to be 100% accurate, but as you feed in more data, the detection should become more and more accurate.

How to automatically detect and release DLLs that have really changed?

Whenever we recompile an EXE or a DLL, its binary image is different even if the source code is the same, due to various timestamps and checksums in the image.
But our quality system requires that each time a new DLL is published, the related validation tests must be performed again (often manually, and it takes a significant amount of time).
So, our goal is to avoid releasing DLLs that have not actually changed, i.e. to have an automatic procedure (script, tool, whatever...) that detects differing DLLs based only on the meaningful information they contain (code and data), ignoring timestamps and checksums.
Is there a good way to achieve this?
Base it off the version information, and only update the version information when you actually make changes.
Have your build tool build the DLL twice. Whatever differences exist between the two are guaranteed to be the result of timestamps or checksums. Now you can use that information to compare to your next build.
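A sketch of that idea in Python: record which byte offsets differ between the two identical-source builds, then ignore those offsets when comparing against the next build. The file names are hypothetical, and this assumes all the images are the same length; a production tool would compare PE sections individually, since layouts can shift:

# Sketch: derive a mask of "volatile" byte offsets from two builds of
# identical source, then compare a later build ignoring those offsets.
def read(path):
    with open(path, "rb") as f:
        return f.read()

base1 = read("build_a/X.dll")   # first build of unchanged source
base2 = read("build_b/X.dll")   # second build of the same source
mask = {i for i, (x, y) in enumerate(zip(base1, base2)) if x != y}

new = read("build_c/X.dll")     # the candidate build
changed = len(new) != len(base1) or any(
    x != y for i, (x, y) in enumerate(zip(base1, new)) if i not in mask)
print("materially changed" if changed else "only timestamps/checksums differ")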
If you have an automated build system that syncs source before starting a build, only proceed with building and publishing if there are actual changes in source control. You should be able to detect this easily from the output of your source control client.
We have the same problem with our build system. Unfortunately it is not trivial to detect whether there are any material code changes, since we have numerous static libraries, so a change to one may result in a DLL/EXE changing. A change to a file directly used by the DLL/EXE may just be fixing a bad comment, not changing the resulting object code.
I've looked previously for a tool to do what you desire and didn't find one. My thought was to compare the two files manually and skip the non-meaningful differences between the two versions. The Portable Executable format is well documented, so I don't expect this to be terribly difficult. Our requirements additionally demand that we ignore the version stamped into the DLL/EXE, since we uniquely stamp all our files, and also that we ignore the signature, as we sign all our executables.
I've yet to find time to do any of this, but I'd be interested in collaborating with you if you proceed with implementing a solution. If you do find a tool that does this, please do let us know.

Should a .sln be committed to source control?

Is it a best practice to commit a .sln file to source control? When is it appropriate or inappropriate to do so?
Update
There were several good points made in the answers. Thanks for the responses!
I think it's clear from the other answers that solution files are useful and should be committed, even if they're not used for official builds. They're handy to have for anyone using Visual Studio features like Go To Definition/Declaration.
By default, they don't contain absolute paths or any other machine-specific artifacts. (Unfortunately, some add-in tools don't properly maintain this property, for instance, AMD CodeAnalyst.) If you're careful to use relative paths in your project files (both C++ and C#), they'll be machine-independent too.
Probably the more useful question is: what files should you exclude? Here's the content of my .gitignore file for my VS 2008 projects:
*.suo
*.user
*.ncb
Debug/
Release/
CodeAnalyst/
(The last entry is just for the AMD CodeAnalyst profiler.)
For VS 2010, you should also exclude the following:
ipch/
*.sdf
*.opensdf
Yes -- I think it's always appropriate. User-specific settings are in other files.
Yes you should do this. A solution file contains only information about the overall structure of your solution. The information is global to the solution and is likely common to all developers in your project.
It doesn't contain any user specific settings.
You should definitely have it. Besides the reasons other people mentioned, it's needed to make a one-step build of all the projects possible.
I generally agree that solution files should be checked in. However, at the company I work for we have done something different. We have a fairly large repository and developers work on different parts of the system from time to time. To support the way we work, we would need either one big solution file or several smaller ones. Both of these have a few shortcomings and require manual work on the developers' part. To avoid this, we have made a plug-in that handles all that.
The plug-in lets each developer check out a subset of the source tree to work on simply by selecting the relevant projects from the repository. The plug-in then generates a solution file and modifies project files on the fly for the given solution. It also handles references. In other words, all the developer has to do is select the appropriate projects, and the necessary files are generated/modified. This also allows us to customize various other settings to ensure company standards.
Additionally we use the plug-in to support various check-in policies, which generally prevents users from submitting faulty/non-compliant code to the repository.
Yes, things you should commit are:
solution (*.sln),
project files,
all source files,
app config files
build scripts
Things you should not commit are:
solution user options (.suo) files,
build-generated files (e.g. produced by a build script) [Edit:] - only if all necessary build scripts and tools are available under version control (to ensure builds are authentic in VCS history)
Regarding other automatically generated files, there is a separate thread.
Yes, it should be part of the source control.
Whenever you add or remove projects from your application, the .sln gets updated, and it is good to have it under source control. It would allow you to pull your application code from two versions back and directly do a build (if required).
Yes, you always want to include the .sln file; it contains the links to all the projects that are in the solution.
Under most circumstances, it's a good idea to commit .sln files to source control.
If your .sln files are generated by another tool (such as CMake) then it's probably inappropriate to put them into source control.
We do, because it keeps everything in sync. All the necessary projects are located together, and no one has to worry about missing one. Our build server (AnthillPro) also uses the .sln to figure out which projects to build for a release.
We usually put all of our solutions files in a solutions directory. This way we separate the solution from the code a little bit, and it's easier to pick out the project I need to work on.
The only case where you would even consider not storing it in source control would be if you had a large solution with many projects in source control, and you wanted to create a small solution with some of the projects from the main solution for some private, transient requirement.
Yes - Everything used to generate your product should be in source control.
We keep our solution files in TFS Version Control. But since our main solution is really large, most developers have a personal solution containing only what they need. The main solution file is mostly used by the build server.
.slns are the only thing we haven't had problems with in TFS!

Comparing VB6.exes

We're going through a massive migration project at the minute and trying to validate that the code deployed to the live estate matches the code we have in source control.
Obviously the .NET code is easy to compare because we can disassemble it. I don't believe this is possible with VB6 EXEs because of the manner of compilation.
Does anyone have any ideas on how I could validate that the source code and the compiled executable match the file I have in live?
Thanks
Visual Basic had (has) two ways of compiling: one to the interpreter (called P-code) that would result in smaller binaries, and a second that generates a "regular" Windows .exe file (called native), which was introduced because it was supposed to be faster than P-code, although the compiled file size increased with this option.
If your compilation was using p-code, it is in theory possible to restore the sources.
Either way it is pretty difficult to do, but there are tools that claim they can partially do this; one that I know of (never tried it, but there is a trial version) is VB-decompiler:
http://www.vb-decompiler.org/
Unfortunately that's almost impossible. Bear in mind that VB6 code compiled on different machines will have different exe sizes and deployment requirements.
This is why the old VB'ers had a dedicated machine to compile their code.
This won't help you with already deployed items, but if you upped the revision number on every compile (there is a project setting to do this for you automatically) then you could easily compare version numbers.
My old company bought a copy of that VB-Decompiler, and as noted before, VB5/6 can generate P-code. The tool did produce some source code from it, and failing that, assembly code which could at least be "read".
If you have all the binaries you compiled, you could compare their CRCs to what is deployed in the field. If you don't have the original compiled code, your options depend on how you compiled it: if you used P-code rather than native code, you may be able to disassemble it, but the disassembly will look nothing like your source code. I doubt you would have shipped the PDBs with the EXEs, but if you did, you could certainly use those to compare with the source code in your repository.
Have a trusted computer that can check out the various libraries and EXEs you make and compile them automatically. Keep those in a read-only but accessible location. Then do a binary comparison between the deployed site and your comparison site.
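For that binary comparison, Python's standard filecmp module is enough for a first pass; shallow=False forces a byte-for-byte content comparison. A small sketch with hypothetical paths (top-level directory only; recurse via cmp.subdirs for a full tree):

# Sketch: byte-for-byte comparison of the deployed tree against the
# trusted machine's freshly compiled output.
import filecmp
import os

live, trusted = r"\\deploy\live", r"\\buildbox\output"
cmp = filecmp.dircmp(live, trusted)
differing = [name for name in cmp.common_files
             if not filecmp.cmp(os.path.join(live, name),
                                os.path.join(trusted, name), shallow=False)]
print("Differing files:", differing or "none")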
However, I am not sure of the logic behind disassembling the compiled units. My company, and most other places I know of, use a combination of a build computer and unit testing. In our company the EXE we make is a very thin shell over a bunch of libraries. For example, a button click will be passed to a UI ActiveX DLL that does the actual processing. What we do after a build is run a special EXE that performs our list of unit tests. If they all pass, we know our libraries, where 90% of our code is, are good. As for the actual EXE, we have a manual procedure that takes about two hours, and then we are good. It is rare for any errors to happen in the EXE.
