How to automatically detect and release DLLs that have really changed?

Whenever we recompile an exe or a DLL, its binary image is different even if the source code is the same, due to various timestamps and checksums in the image.
But, our quality system implies that each time a new DLL is published, related validation tests must be performed again (often manually, and it takes a significant amount of time.)
So, our goal is to avoid releasing DLLs that have not actually changed, i.e. to have an automatic procedure (script, tool, whatever...) that detects changed DLLs based only on the meaningful information they contain (code and data), ignoring timestamps and checksums.
Is there a good way to achieve this?

Base it off the version information, and only update the version information when you actually make changes.

Have your build tool build the DLL twice. Whatever differences exist between the two are guaranteed to be the result of timestamps or checksums. Now you can use that information to compare to your next build.
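
A minimal sketch of the build-twice idea, assuming both builds produce images of the same size and that the volatile fields stay at fixed offsets. The function names `volatile_offsets` and `really_changed` are made up for illustration, and the byte strings are toy stand-ins for real DLL images:

```python
def volatile_offsets(build_a: bytes, build_b: bytes) -> set[int]:
    """Offsets whose bytes differ between two builds of identical source."""
    if len(build_a) != len(build_b):
        raise ValueError("builds differ in size; a byte mask cannot be derived")
    return {i for i, (x, y) in enumerate(zip(build_a, build_b)) if x != y}


def really_changed(reference: bytes, candidate: bytes, mask: set[int]) -> bool:
    """True if candidate differs from reference outside the masked offsets."""
    if len(reference) != len(candidate):
        return True
    return any(
        x != y
        for i, (x, y) in enumerate(zip(reference, candidate))
        if i not in mask
    )


if __name__ == "__main__":
    # Toy stand-ins for DLL images: bytes 4..7 play the role of a timestamp.
    first = b"CODE\x01\x02\x03\x04DATA"
    second = b"CODE\x05\x06\x07\x08DATA"
    mask = volatile_offsets(first, second)

    same_source = b"CODE\x09\x0a\x0b\x0cDATA"  # only the "timestamp" differs
    new_source = b"CODE\x09\x0a\x0b\x0cDATB"   # payload differs too
    print(really_changed(first, same_source, mask))
    print(really_changed(first, new_source, mask))
```

One caveat with this approach: two builds made seconds apart may differ only in the low byte of a timestamp, so the derived mask can miss the high bytes. Building at clearly different times, or masking the whole known field, would be safer.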

If you have an automated build system that syncs source before starting a build, only proceed with building and publishing if there are any actual changes in source control. You should be able to detect this easily from the output of your source control client.

We have the same problem with our build system. Unfortunately it is not trivial to detect if there are any material code changes since we have numerous static libraries, so a change to one may result in a dll/exe changing. Changes to a file directly used by the dll/exe may just be fixing a bad comment, not changing the resulting object code.
I've looked previously for a tool to do what you desire and I did not see one. My thought was to compare the two files manually and skip the non-meaningful differences between the two versions. The Portable Executable (PE) file format is well documented, so I don't expect this to be terribly difficult. Our requirements additionally require that we ignore the version stamped into the dll/exe, since we uniquely stamp all our files, and that we ignore the signature, as we sign all our executables.
I've yet to find time to do any of this, but I'd be interested in collaborating with you if you proceed with implementing a solution. If you do find a tool that does this, please do let us know.

Related

Why might it be necessary to force rebuild a program?

I am following the book "Beginning STM32" by Warren Gay (excellent so far, btw) which goes over how to get started with the Blue Pill.
A part that has confused me is, while we are putting our first program on our Blue Pill, the book advises to force rebuild the program before flashing it to the device. I use:
make clobber
make
make flash
My question is: Why is this necessary? Why not just flash the program since it is already made? My guess is that it is just to learn how to make an unmade program... but I also wonder if rebuilding before flashing to the device is best practice? The book does not say why.

You'd have to ask the author, but I would suggest it is "just in case" no more. Lack of trust that the make file specifies all possible dependencies perhaps. If the makefile were hand-built without automatically generated dependencies, that is entirely possible. Also it is easier to simply advise to rebuild than it is to explain all the situations where it might be necessary or otherwise, which will not be exhaustive.
From the author's point of view, it eliminates a number of possible build consistency errors that are beyond his control so it ensures you don't end up thinking the book is wrong, when it might be something you have done that the author has no control over.
Even with automatically generated dependencies, a project may have dependencies that the makefile or dependency generator does not catch, resource files used for code generation using custom tools for example.
For large projects developed over a long time, some seldom modified modules could well have been compiled with an older version of the tool chain, a clean build ensures everything is compiled and linked with the current tool.
make builds dependencies based on file timestamp; if you have build variants controlled by command-line macros, the make will not determine which dependencies are dependent on such a macro, so when building a different variant (switching from a "debug" to "release" build for example), it is good idea to rebuild all to ensure each module is consistent and compatible.
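
To illustrate the point about timestamps, here is a simplified model of the staleness rule, not make's actual implementation. The function name `is_stale` and the numeric timestamps are made up for the example. Note that the compiler flags never enter the decision:

```python
from typing import Iterable, Optional


def is_stale(target_mtime: Optional[float], dep_mtimes: Iterable[float]) -> bool:
    """Timestamp rule: rebuild when the target is missing or older than a dependency."""
    if target_mtime is None:
        return True
    return any(dep > target_mtime for dep in dep_mtimes)


if __name__ == "__main__":
    # main.o was built at t=100 from main.c (t=90) with a hypothetical -DDEBUG flag.
    object_mtime = 100.0
    source_mtimes = [90.0]

    # Switching to a release build changes the command line, not any file
    # timestamp, so the rule still reports the object as up to date.
    print(is_stale(object_mtime, source_mtimes))

    # Editing the source, or deleting the object (a clean build), forces a rebuild.
    print(is_stale(object_mtime, [110.0]))
    print(is_stale(None, source_mtimes))
```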
I would suggest that during a build/debug cycle you use incremental builds for development speed as intended, and perform a rebuild for release or when changing some build configuration (such as enabling/disabling floating-point hardware, or switching to a different processor variant).
If during debug you get results that seem to make no sense, such as breakpoints and code stepping not aligning with the source code, or a crash or behaviour that seems unrelated to some small change you may have made (perhaps that code has not even been executed), sometimes it may be down to a build inconsistency (for a variety of reasons usually with complex builds) and in such cases it is wise to at least eliminate build consistency as a cause by performing a rebuild all.
Certainly if you are releasing code to a third party, such as a customer or for production of some product, you would probably want to perform a clean build just to ensure build consistency. You don't want users reporting bugs you cannot reproduce because the build they have been given is not reproducible.

Rebuilding the complete software is a good practice because it regenerates all dependencies and the symbol files, with paths that match your local machine.
If you need to debug the application with a debugger, you will probably need the symbol file and the paths where your source code is located. If you flash the application directly without rebuilding, you might not be able to debug certain parts, because you don't know at which path the application was compiled and you might be missing symbol information.

Should the STM32 HAL be included as a precompiled library

I have a Keil STM32 project for a STM32L0. I sometimes (more often than I want) have to change the include paths or global defines. This will trigger a complete recompile for all code because it needs to ‘check’ for changed behaviour because of these changes. The problem is: I didn’t necessarily change relevant parameters for the HAL and as such it isn’t needed (as far as I understand) that these files are completely recompiled. This recompilation takes up quite a bit of time because I included all the HAL drivers for my STM32L0.
Would a good course of action be to create a separate project which compiles the HAL as a single library and include that in my main project? (This would of course be done for every microcontroller separately as they have different HALs).
ps. the question is not necessarily only useful for this specific example but the example gives some scope to the question.
pps. for people who aren't familiar with the STM32 HAL. It is the standardized interface with which the program interfaces with the underlying hardware. It is supplied in .c and .h files instead of the precompiled form of the STD/STL.
update
Here is an example of the defines that need to be managed in my example project:
STM32L072xx,USE_B_BOARD,USE_HAL_DRIVER, REGION_EU868,DEBUG,TRACE
Only STM32L072xx, and DEBUG are useful for configuring the HAL library and thus there shouldn't be a need for me to recompile the HAL when I change TRACE from defined to undefined. Therefore it seems to me that the HAL could be managed separately.
edit
Seeing as a close vote has been cast: I've read the don't ask section and my question seeks to constructively add to the knowledge of building STM32 programs and find a best practise on how to more effectively use the HAL libraries. I haven't found any questions on SO about building the HAL as a static library and therefore this question at least qualifies as unique. This question is also meant to invite a rich answer which elaborates on the pros/cons of building the HAL as a separate static library.

The answer here is.. it depends. As already pointed out in the comments, it depends on how you're planning to manage your projects. To answer your question in an unbiased way:
Option #1 - having HAL sources directly in your project means rebuilding HAL every time anything in its (and underlying) headers changes, which you've already noticed. Downside of it is longer build times. Upside - you are sure that what you build is what you get.
Option #2 - having HAL as a precompiled static library. Upside - shorter build times, downside - you can no longer be absolutely certain that the HAL library you include actually works as you want it to. In particular, you'd need to make sure in some way that all the #defines are exactly the same as when the library has been built. This includes project-wide definitions (DEBUG, STM32L072xx etc.), as well as anything in HAL config files (stm32l0xx_hal_conf.h).
Seeing how you're a Keil user - maybe it's just a matter of enabling multi-core build? See this link: http://www.keil.com/support/man/docs/armcc/armcc_chr1359124201769.htm. HAL library isn't so large that build times should be a concern when it comes to rebuilding its source files.
If I was to express my opinion and experience - personally I wouldn't do it, as it may lead to lower reliability or side effects that will be very hard to diagnose and will only get worse as you add more source files and more libraries like this. Not to mention adding more people to work on the project and explaining to them how they "need to remember to rebuild X library when they change given set of header files or project-wide definitions".
In fact, we've run into the same dilemma for the code base I work on - it spans over 10k source and header files in total, some of which are configuration-specific and many of which are shared. It's highly modular, which allows us to quickly create something new (both hardware- and software-wise) just by configuring existing code, mainly through a set of header files. However, because this configuration is done through headers, making a change in them usually means rebuilding a large portion of the project. Even though build times get annoying sometimes, we opted against making static libraries for the reasons mentioned above. To me personally it's better to prioritize reliability, as in "I know what I build".
If I was to give any general tips that help to avoid rebuilds as your project gets large:
Avoid global headers holding all configuration. It's usually tempting to shove all configuration in one place, create pretty comments and sections for each software module in this one file. It's easier to manage this way (until this file becomes too big), but because this file is so common, it means that any change made to it will cause a full rebuild. Split such files to separate headers corresponding to each module in your project.
Include header files only where you need them. I sometimes see an approach where there are header files created that only "bundle" other header files and such header file is later included. In this case, making a change to any of those "smaller" headers will have an effect of having to recompile all source files including the larger file. If such file didn't exist, then only sources explicitly including that one small header would have to be recompiled. Obviously there's a line to be drawn here - including too "low level" headers may not be the greatest idea either, e.g. they may not be meant to be included as being internal library files which may change any time.
Prioritize including headers in source files over header files. If you have a pair of your own *.c (*.cpp) and *.h files - let's say temp_logger.c/.h and you need ADC - then unless you really need some ADC definition in your header (which you likely won't), then include the ADC header file in your temp_logger.c file. Later on, all files making use of the temp_logger functions won't have to be re-compiled in case HAL gets rebuilt again.

My opinion is yes, build the HAL into a library. The benefit of faster build time outweighs the risk of the library getting out of date. After some point early in the project it's unusual for me to change something that would affect the HAL. But the faster build time pays off many times.
I create a multi-project workspace with one project for the HAL library, another project for the bootloader, and a third project for the application. When I'm developing, I only rebuild the application project. When I make a release build, I select Project->Batch Build and rebuild all three projects. This way the release builds always use all the latest code and build settings.
Also, on the Options for Target dialog, Output tab, unchecking Browse Information will greatly reduce the build time.

How do you verify that 2 copies of a VB 6 executable came from the same code base?

I have a program under version control that has gone through multiple releases. A situation came up today where someone had somehow managed to point to an old copy of the program and thus was encountering bugs that have since been fixed. I'd like to go back and just delete all the old copies of the program (keeping them around is a company policy that dates from before version control was common and should no longer be necessary) but I need a way of verifying that I can generate the exact same executable that is better than saying "The old one came out of this commit so this one should be the same."
My initial thought was to simply MD5 hash the executable, store the hash file in source control, and be done with it but I've come up against a problem which I can't even parse.
It seems that every time the executable is generated (method: Open Project. File > Make X.exe) it hashes differently. I've noticed that Visual Basic messes with files every time the project is opened in seemingly random ways but I didn't think that would make it into the executable, nor do I have any evidence that that is indeed what's happening. To try to guard against that I tried generating the executable multiple times within the same IDE session and checking the hashes but they continued to be different every time.
So that's:
Generate Executable
Generate MD5 Checksum: md5sum X.exe > X.md5
Verify MD5 for current executable: md5sum -c X.md5
Generate New Executable
Verify MD5 for new executable: md5sum -c X.md5
Fail verification because computed checksum doesn't match.
I'm not understanding something about either MD5 or the way VB 6 is generating the executable but I'm also not married to the idea of using MD5. If there is a better way to verify that two executables are indeed the same then I'm all ears.
Thanks in advance for your help!

That's going to be nearly impossible. Read on for why.
The compiler will win this game, every time...
Compiling the same project twice in a row, even without making any changes to the source code or project settings, will always produce different executable files.
One of the reasons for this is that the PE (Portable Executable) format that Windows uses for EXE files includes a timestamp indicating the date and time the EXE was built, which is updated by the VB6 compiler whenever you build the project. Besides the "main" timestamp for the EXE as a whole, each resource directory in the EXE (where icons, bitmaps, strings, etc. are stored in the EXE) also has a timestamp, which the compiler also updates when it builds a new EXE. In addition to this, EXE files also have a checksum field that the compiler recalculates based on the EXE's raw binary content. Since the timestamps are updated to the current date/time, the checksum for the EXE will also change each time a project is recompiled.
But, but...I found this really cool EXE editing tool that can undo this compiler trickery!
There are EXE editing tools, such as PE Explorer, that claim to be able to adjust all the timestamps in an EXE file to a fixed time. At first glance you might think you could just set the timestamps in two copies of the EXE to the same date, and end up with equivalent files (assuming they were built from the same source code), but things are more complicated than that: the compiler is free to write out the resources (strings, icons, file version information, etc.) in a different order each time you compile the code, and you can't really prevent this from happening. Resources are stored as independent "chunks" of data that can be rearranged in the resulting EXE without affecting the run-time behavior of the program.
If that wasn't enough, the compiler might be building up the EXE file in an area of uninitialized memory, so certain parts of the EXE might contain bits and pieces of whatever was in memory at the time the compiler was running, creating even more differences.
As for MD5...
You are not misunderstanding MD5 hashing: MD5 will always produce the same hash given the same input. The problem here is that the input in this case (the EXE file) keeps changing.
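
A small illustration of that point, using made-up byte strings in place of real EXE files. The trailing four bytes stand in for an embedded build timestamp:

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    # Identical "code", but the embedded timestamp differs by one tick.
    build_1 = b"same compiled code" + (1000).to_bytes(4, "little")
    build_2 = b"same compiled code" + (1001).to_bytes(4, "little")

    print(md5_hex(build_1) == md5_hex(build_1))  # same input, same hash
    print(md5_hex(build_1) == md5_hex(build_2))  # one byte differs, hash differs
```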
Conclusion: Source control is your friend
As for solving your current dilemma, I'll leave you with this: associating a particular EXE with a specific version of the source code is more a matter of policy, which has to be enforced somehow, than anything else. Trying to figure out what EXE came from what version without any context is just not going to be reliable. You need to track this with the help of other tools. For example, ensuring that each build produces a different version number for your EXEs, and that that version can be easily paired with a specific revision/branch/tag/whatever in your version control system. To that end, a "free-for-all" situation where some developers use source control and others use "that copy of the source code from 1997 that I'm keeping in my network folder because it's my code and source control is for sissies anyway" won't help make this any easier. I would get everyone drinking the source control Kool-Aid and adhering to a standard policy for creating builds right away.
Whenever we build projects, our build server (we use Hudson) ensures that the compiled EXE version is updated to include the current build number (we use the Version Number Plugin and a custom build script to do this), and when we release a build, we create a tag in Subversion using the version number as the tag name. The build server archives release builds, so we can always get the specific EXE (and setup program) that was given to a customer. For internal testing, we can choose to pull an archived EXE from the build server, or just tell the build server to rebuild the EXE from the tag we created in Subversion.
We also never, ever, ever release any binaries to QA or to customers from any machine other than the build server. This prevents "works on my machine" bugs, and ensures that we are always compiling from a "known" copy of the source code (it only pulls and builds code that is in our Subversion repository), and that we can always associate a given binary with the exact version of the code that it was created from.

I know it has been a while, but since there is VB De-compiler app, you may consider bulk-decompiling vb6 apps, and then feeding decompilation results to an AI/statistical anomaly detection on the various code bases. Given the problem you face doesn't have an exact solution, it is unlikely the results will be 100% accurate, but as you feed more data, the detection should become more and more accurate

Comparing VB6.exes

We're going through a massive migration project at the minute and trying to validate the code that is deployed to the live estate matches the code we have in source control.
Obviously the .net code is easy to compare because we can disassemble. I don't believe this is possible in vb6 exes because of the manner of compilation.
Does anyone have any ideas on how I could validate that the source code and the compiled executable match the file I have in Live?
Thanks

Visual Basic had (has) two ways of compiling: one to code for the interpreter (called P-code), which results in smaller binaries, and a second one that generates a "regular" Windows .exe file (called native), which was introduced because it was supposed to be faster than P-code, although the compiled file size increased with this option.
If your compilation was using P-code, it is in theory possible to restore the sources.
Either way it is pretty difficult to do, but there are tools that claim they can partially do this. One that I know of (never tried it, but there is a trial version) is VB-decompiler:
http://www.vb-decompiler.org/

Unfortunately that's almost impossible. Bear in mind that VB6 code compiled on different machines will have different exe sizes and deployment requirements.
This is why the old VB'ers had a dedicated machine to compile their code.

This won't help you with already deployed items, but if you upped the revision number on every compile (there is a project setting to do this for you automatically) then you could easily compare version numbers.

My old company bought a copy of that VB-Decompiler. As noted before, VB5/6 can generate P-code; the tool did produce some code from it, and where it could not, it produced assembly code that could still be "read".

If you have all the code you compiled, you could compare the CRCs of that code to what is deployed in the field. But if you don't have the original compiled code, it depends on how you compiled it: if you used P-Code rather than Native Code you may be able to disassemble, but the disassembly will look nothing like your source code. I doubt you would have shipped the PDBs with the EXEs, but if you did, you could certainly use those to compare with the source code in your repository.

Have a trusted computer that can check out the various libraries and exes you make and compile them automatically. Keep those in a read-only but accessible location. Then do a binary comparison between the deployed site and your comparison site.
However, I am not sure of the logic of disassembling the compiled units. My company and most other places I know of use a combination of a build computer and unit testing. In our company the EXE we make is a very thin shell over a bunch of libraries. For example, a button click will be passed to a UI ActiveX DLL that does the actual processing. What we do after a build is run a special EXE that performs our list of unit tests. If they all pass, we know our libraries, where 90% of our code is, are good. As for the actual EXE, we have a manual procedure that takes about two hours, and then we are good. It is rare for any errors to happen in the EXE.
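
The binary comparison between the trusted build location and the deployed site could be sketched roughly as follows. This assumes the deployed files were copied from the trusted location (so identical files hash identically), and the names `tree_digests` and `compare_trees` are illustrative only:

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(trusted: Path, deployed: Path) -> dict[str, list[str]]:
    """Report files missing from, unexpected in, or different at the deployed site."""
    a, b = tree_digests(trusted), tree_digests(deployed)
    return {
        "missing": sorted(a.keys() - b.keys()),
        "unexpected": sorted(b.keys() - a.keys()),
        "different": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }


if __name__ == "__main__":
    import sys

    # Usage: python compare_trees.py <trusted_dir> <deployed_dir>
    if len(sys.argv) == 3:
        report = compare_trees(Path(sys.argv[1]), Path(sys.argv[2]))
        for kind, names in report.items():
            print(kind, names)
```

Because freshly recompiled VB6 binaries differ from build to build, this only verifies that the deployed files are the archived ones, not that they could be rebuilt byte-for-byte from source.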

tools for diffing windows binaries?

Our QA team wants to focus their testing based on what EXEs and DLLs have actually changed between builds. We have a nice svn change report, but the relationship between source and changed binaries isn't always obvious. The builds we're comparing are always full clean builds, so we can't use file system timestamps. I'm looking for tools to compare windows (and windows CE) PE binaries that will ignore the embedded timestamps and other cruft. Any recommendations for tools or other ways to generate a reliable 'what binaries have really changed' report? Thanks.
clarification: Thanks for the answers so far, but we can't generate the report by doing straightforward byte-for-byte compares or comparing checksums, because all the files appear different every time we build, even if the sources haven't changed, because of timestamps that the compiler inserts. The problem is how to ignore the false positives. The disassemble & compare idea is closest to what we need, I think...
answered! Bindiff is just what I was looking for. Many thanks.

Have you had a look at Bindiff?

I ran into this problem before. My solution was to write a tool which set all the timestamps in an .EXE/.DLL to a known value. I would run it as a post-build step. Then binary diffs would work just fine.

You could perhaps disassemble the binary, and then do a diff on the assembly...
This sounds like your QA team is taking the wrong approach though... It shouldn't matter to them what the code looks like; just that it does what it's supposed to do.
Edit:
Oh! After reading it again, I realized that I misinterpreted your question. I thought they wanted to test the methods that had changed...
In that case, why not get the MD5 hash and compare those? The tiniest change will cause a totally different hash to be generated.

Not sure what kind of binaries you mean (DLLs? PE/WinCE executables only? Other?). Is it possible to embed version information in the binaries, e.g. using a source control tag that updates the version in the source code on commits? Then when the new build is created, the binary would have its version string updated as well. Instead of having to diff a binary file, you could use the version string and check that for changes.

Look at NDepend.

When I was working on the "home grown" tool for installation verification at my company, we used Beyond Compare as a backend for comparison.
It has great file/folder comparison (binary as well) and scripting capabilities and can output XML reports.

Project dependency graph generator and Dependency-Grapher for C++-Projects both use GraphViz to visualize dependencies. I imagine that you could use either of them as a basis for your needs with special highlighting of the branches in the dependency graph where source files or other leaves have changed.
MD5 hashes or checksums (as suggested above), a simple diff ignoring whitespace and filtering out comment changes, or changelist information from your version control system can signal which files have changed.

GNU binutils, specifically strings.
