Algorithm to recursively search Git repo for a string

I am working on a project to automate the code review process for a team of engineers. Every time an engineer changes a file, before those changes are pushed to GitHub, they need to figure out which other files are impacted by the change and add the engineers in charge of those files as reviewers. Right now, the person who made the change does the following manually: check which function the change occurred in, use the text-search feature of an IDE (such as VS Code) to see where that function is used in the entire repo, go through all those search results and check which functions in other files call the original function, and then search for those functions in turn. They recursively search for functions until one of a group of designated files called "base files" appears in the search results. Separate engineers are in charge of separate base files, so once a base file appears in the search process, the person who made the change needs to add the engineer in charge of that base file to view and approve the change, because the functionality of that file is potentially impacted. We are trying to find a way to automate these manual steps.
I was wondering if there are any known algorithms that can be used to accomplish something like this. I am thinking of using graphs or trees, but I am not sure which specific graph or tree algorithms I should use.

Hmm, searching for raw strings is not good enough. Instead:
* mark all base files;
* build the call graph, a directed graph (which might not be acyclic);
* do a BFS from the changed function and log every base file you reach.
Doxygen can generate some call graphs, or maybe there is already a Clang/LLVM call-graph builder you can reuse.
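If a call graph is available, the traversal itself is a textbook BFS. Here is a minimal Java sketch, assuming the graph has already been extracted by such a tool and inverted so that edges point from a function to its callers; all names (ImpactAnalysis, functionToFile, and so on) are illustrative, not from any real tool:

import java.util.*;

public class ImpactAnalysis {

    // calleeToCallers.get(f) = the set of functions that call f.
    // functionToFile maps each function to the file that defines it.
    public static Set<String> findImpactedBaseFiles(
            Map<String, Set<String>> calleeToCallers,
            Map<String, String> functionToFile,
            Set<String> baseFiles,
            String changedFunction) {

        Set<String> impacted = new HashSet<>();
        Set<String> visited = new HashSet<>();   // guards against cycles in the graph
        Deque<String> queue = new ArrayDeque<>();
        visited.add(changedFunction);
        queue.add(changedFunction);

        while (!queue.isEmpty()) {
            String fn = queue.poll();
            String file = functionToFile.get(fn);
            if (file != null && baseFiles.contains(file)) {
                impacted.add(file);   // reached a base file: log it and, as in the
                continue;             // manual process, stop searching past it
            }
            for (String caller : calleeToCallers.getOrDefault(fn, Set.of())) {
                if (visited.add(caller)) {
                    queue.add(caller);
                }
            }
        }
        return impacted;
    }
}

The owners of every file in the returned set are the reviewers to add.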

Related

Getting data from .dat files

I'm hoping somebody out there can help me with this. I'm attempting to extract some barcode data from some .dat files. It's a B-Tree file structure with groups of three files: .dat, .ix, and .dia. The company that wrote the software (a long time ago) says that the program is written in Pascal. I have no experience in reverse engineering, but from what I've read it's most likely the only way to extract the data, as the structure of the database is contained in the code of the program. I'm looking for advice on where to start.
I suppose the first thing you need to do is to see if the exe you've got was written with Delphi. You can check with this: http://cc.embarcadero.com/Item/15250
Then, to see if the exe that creates those .dat files was made with 'TurboPower B-Tree Filer', I'd suggest you download and take a look at this: http://sourceforge.net/projects/tpbtreefiler/
At this step, you need to look at these sources to familiarize yourself with the class names used in 'TurboPower B-Tree Filer', to help determine whether any of those classes were used in your exe.
Then, using 'XN Resource Editor' [search the Internet for this] or, probably better, 'MiTeC Portable Executable Reader' [ http://www.mitec.cz/pe.html ], see if any class names are relevant.
If they are, then you're in luck, sort of. All you will need to do is write an app using 'TurboPower B-Tree Filer' to import the data in your .dat files and export or manipulate it as you wish.
At that point, you might find this link useful.
TurboPower B-Tree Filer and Delphi XE2 - Anyone done it?
If, OTOH, none of the above applies, I fear the only option is to reverse engineer the exe you have.

How can I resolve MSI paths in VBScript?

I am looking for a way to resolve the paths of files in an MSI, with or without installing it (whichever is faster), from outside the MSI, using VBScript.
I found a similar question using C# instead, and Christpher had provided a solution, linked below: How can I resolve MSI paths in C#?
I am going through the very same pain now, but is there any way to achieve this using the WindowsInstaller object in VBScript, rather than endless queries back and forth through the MSI's SQL tables? Any direction would be welcome, as I have tried and tested whatever I can with very limited success.
Yes, there is a solution without installing the MSI, using VBScript.
There is a very good example in the Windows Installer SDK called "WiFilVer.vbs".
Using that example, I've thrown together a quick example script that does exactly what you need.
Set installer = CreateObject("WindowsInstaller.Installer")

Const READONLY = 0   ' msiOpenDatabaseModeReadOnly
Set db = installer.OpenDatabase("<FULL PATH TO YOUR MSI>", READONLY)
Set session = installer.OpenPackage(db, READONLY)

' Run just the costing actions so the Directory table gets resolved,
' without running a whole install sequence.
session.DoAction "CostInitialize"
session.DoAction "CostFinalize"

' Join File and Component to learn which directory each file goes into.
Set view = db.OpenView("SELECT File, Directory_, FileName, Component_, Component " & _
                       "FROM File,Component WHERE Component=Component_ ORDER BY Directory_")
view.Execute

Set record = view.Fetch
Do Until record Is Nothing
    file = record.StringData(1)
    directoryName = record.StringData(2)
    fileName = record.StringData(3)
    ' FileName may be in "short|long" format; keep the long name.
    If InStr(fileName, "|") Then fileName = Split(fileName, "|")(1)
    WScript.Echo session.TargetPath(directoryName) & fileName
    Set record = view.Fetch
Loop
Just add the path to your MSI file.
Tell me if you need a more detailed answer; I will have some more time to answer this in detail this evening.
EDIT: the promised background (and why I need to call CostFinalize)
naveen, actually MSDN was the only resource that can give a definitive answer on this, but you need to know where and how to look, since Windows Installer is IMHO a pretty complex topic.
I really recommend a mix of the MSDN installer function reference, the database reference, and the examples from the Windows Installer SDK (sorry, couldn't find a download link; I think it's somewhere hidden in the roughly 3 GB Windows SDK).
First you need some general knowledge of MSIs: an MSI is actually a relational database. Everything is stored in tables that relate to each other. (Actually not everything, but I will try to keep it simple. ;)) This database is interpreted by the Windows Installer, which creates a 'Session'.
Some parts are also resolved dynamically, depending on the system you install the MSI on, such as 'special' folders, which are similar to environment variables. E.g. MSI has a "ProgramFilesFolder" where Windows generally has %ProgramFiles%. All the dynamic stuff exists only in the installer session, not in the database itself.
In your case there are three tables you need to look at, taking care to resolve the relations between them: the 'File' table contains all files, the 'Component' table tells you which file goes into which directory, and the 'Directory' table contains all information about the filesystem structure.
Using an SQL query, I could join the Component and File tables to find the directory name (or rather its primary key, in database jargon). But the Directory table has relations within itself; it's structured like a tree.
Take a look at this example Directory table (taken from the InstEd MSI). The columns are Directory, Directory_Parent and DefaultDir:

Directory                                                   Directory_Parent                                            DefaultDir
InstEdAllUseAppDat                                          InstEdAppData                                               InstEd
INSTALLDIR                                                  InstEdPF                                                    InstEd
CUBDIR                                                      INSTALLDIR                                                  hkyb3vcm|Validation
InstEdAppData                                               CommonAppDataFolder                                         instedit.com
CommonAppDataFolder                                         TARGETDIR                                                   .
TARGETDIR                                                                                                               SourceDir
InstEdPF                                                    ProgramFilesFolder                                          instedit.com
ProgramFilesFolder                                          TARGETDIR                                                   .
ProgramMenuFolder                                           TARGETDIR                                                   .
SendToFolder                                                TARGETDIR                                                   .
WindowsFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2   TARGETDIR                                                   Win
SystemFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2    WindowsFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2   System

(TARGETDIR has no Directory_Parent; it is the root.)
The Directory_Parent column links a row to its parent directory, and DefaultDir contains the actual name.
You could now resolve the tree by yourself and replace all the special folders (which in VBScript would be very tedious)...
...or let the Windows Installer handle that (just like when installing an MSI).
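To make the do-it-yourself option concrete, here is a rough Java sketch (every name is illustrative) that resolves a path from the three columns above. It keeps a [Key] placeholder for rows whose DefaultDir is ".", which in this example table are exactly the special folders that CostFinalize would substitute with real paths; a real resolver has more cases than this:

import java.util.*;

public class DirectoryTable {

    // One row of the Directory table, keyed elsewhere by the Directory column.
    record Row(String parent, String defaultDir) {}

    public static String resolve(Map<String, Row> table, String directory) {
        Row row = table.get(directory);
        if (row == null || row.parent() == null || row.parent().isEmpty()) {
            return "";                            // TARGETDIR: the install root
        }
        String name = row.defaultDir();
        if (name.contains("|")) {                 // "short|long" name pair: keep long
            name = name.substring(name.indexOf('|') + 1);
        }
        String parentPath = resolve(table, row.parent());
        if (name.equals(".")) {
            // Special folder in this table: keep a placeholder; the installer
            // would substitute the real path (e.g. C:\Program Files) here.
            return parentPath + "[" + directory + "]\\";
        }
        return parentPath + name + "\\";
    }
}

With the table above, resolving CUBDIR yields "[ProgramFilesFolder]\instedit.com\InstEd\Validation\", which shows why letting CostFinalize do the substitution is the more attractive option.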
Now I have to introduce a new thing: actions (and sequences).
When running (installing, removing, repairing) an MSI, a defined list of actions is performed. Some actions just collect information; some change the actual database. There are lists of actions (called sequences) for the various things an MSI can do, like one sequence for installing (called InstallExecuteSequence), one for collecting information from the user (the UI of the MSI: InstallUISequence), and one for admin-point installations (AdminExecuteSequence).
In our case we don't want to run a whole sequence (which could alter the system or just take too long). Luckily, the Windows Installer lets us run single actions without running a whole sequence. Reading the reference for the Directory table on MSDN (the remarks section), you can see which action you need:
Directory resolution is performed during the CostFinalize action
So, putting all this together, the script should be easier to read:
* open the msi file
* 'parse' it (providing the session)
* query component and file table
* run the CostFinalize action to resolve directory table (without running the whole MSI)
* get the resolved path with the targetPath function
BTW, I found the TargetPath function by browsing the Installer reference on MSDN.
Also, I just noticed that CostInitialize is not required; it's only needed if you want to get the source path of a file.
I hope this makes everything clearer. It's very hard to explain, since it took me about half a year to understand it myself. ;)
And regarding PhilmE's answer:
Yes, there are more influences on the resolution of the Directory table, like custom actions. Keeping that in mind, an administrative installation might also result in different directories (e.g. because a different sequence might hold different custom actions). Components have conditions, so maybe a file is not installed at all. I'm pretty sure InstEd doesn't take custom actions into account either.
So yes, there is no 100% solution. Maybe a mix of everything is necessary.
The script given by weberik (derived from the MS SDK VB code) is very good, because it makes it easy to analyse the Directory table without writing your own algorithm (a mid-size effort, whether done with a loop or with recursion). But it does not give a 100% perfect view for all files; see below.
The method of the script is semi-dynamic (it can be extended with other actions), but in effect it gives only the static directory structure, similar to a default administrative install or to advanced MSI viewers. Normally this is enough, and it is what we want.
But be aware that this is not a 100% solution (i.e. knowing in advance the exact path of each file). That means it will not always give you the correct paths, in cases such as these:
* You use command-line parameters which substitute predefined Directory table entries.
* The MSI uses custom actions which change paths.
* Especially, it is not guaranteed that every file is installed: there may be optional conditions and features, and this may depend on the install environment.
In effect, a 100% solution is very, very hard to achieve without really installing; it comes close to reprogramming nearly the whole Windows Installer engine. So simplifications are normally sufficient and accepted.
But you can extend the method to cover custom actions, e.g. by adding a line "session.DoAction(..)" for each additional action needed, or to include command-line parameters.
Sometimes it can be tricky: the simpler the structure of the MSI, the more likely you are to succeed without further effort.
Alternative to writing your own program:
The question is what you really want to find out, and whether it is really necessary to program it. If you don't want to write an automatic, every-day MSI analyzer, the following may be sufficient:
First tip: install the MSI with msiexec /a mysetup.msi TARGETDIR="c:\mytestpath" (similar restrictions as the script above by weberik). If the MSI does not use custom actions to change paths, including "forgetting" to add them to the admin sequence ("forgetting" should be taken as the normal case for 99% of existing setups :-), you get the file structure as if you had installed for real, with some special names for the Windows predefined folders which you will figure out easily.
If the administrative install lacks some folders, it is often a better idea to fix the custom action (adding it to the admin sequence) and use this scenario as your primary test case. The advantage is that you alone limit the dynamics used by the admin install. BTW, you can use the same command-line params or path-setting custom actions as in a real install.
Second tip: Google for the InstEd tool, go to the File or Component table, and you will see the resulting MSI paths in the same static way as with the mentioned VBScript after calling CostInitialize/CostFinalize. For human viewing, such an editor view may be better.
For automatic testing and improved accuracy, you need your own program, of course. For those of you in that camp, the snippet given above is a good starting point. :-)
The rest of you should have an easier life with one of the two given methods, without programming.

Testing File/Folder Navigation and Manipulation

I am working on a module that supplies methods for navigating directories and manipulating files. Basically it will be a combination of the Dir and File classes, with options specific to the needs of a project I'm working on.
Right now I have started writing tests for some of these methods and things are getting messy.
Example
One of the methods I have is a tree function that returns a hash of files and folders, where you can pass options like tree(only: 'folders', limit: 3). In order to test that it only goes down 3 levels, I would need at least 4 levels of subfolders with dummy files in them.
The Problem
Right now I'm testing on folders outside the project, since the subfolders are already there, but I want to move away from this, especially considering how implausible it is to test on system files once I start testing methods equivalent to rm -rf (not to mention the lack of portability).
I'm starting to think that I need to create a "lab rat" type folder that I do all my "experiments" on, but I have no clue how to approach creating it.
Do I create a function that creates the files?
Do I pull files and folders from another location?
Do I use some sort of "lorem ipsum" generator for file structures?
Do I make all these files and folders manually? (ugh)
Do I just mock and stub the hell out of everything and not actually create/delete the files and folders? (I don't see this happening)
So...
How would someone normally approach testing excessive amounts of file and folder manipulation?
I don't think you want to use mocks/stubs. The file system of your OS should be well tested and fast, so the benefit of mocks/stubs is minimal. Creating a mock/stub system increases the complexity without much benefit.
Here are my answers:
Do I create a function that creates the files?
Yes. You can create tests for these functions to make sure that they are correct. Instead of calling Dir and File, write helper functions that make the code simple and readable. Maybe you can share the helper functions between the source/test code...
Do I pull files and folders from another location?
Not sure what this is for...
Do I use some sort of "lorem ipsum" generator for file structures?
Yes, if you mean create functions that generate file structures.
Do I make all these files and folders manually(ugh)?
No.
Do I just mock and stub the hell out of everything and not actually create/delete the files and folders?(I don't see this happening)
No. One benefit of creating files/directories is that you can manually check what is going on and not be 100% dependent on the tests. This is actually a good approach, because without it there could be a bug where both the source code and the test code are not doing what you expect, but you wouldn't know because everything seems to be working.
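To illustrate the generator idea, here is a minimal sketch (in Java for illustration, though the same shape is easy in the question's Ruby); every name here is made up. Each test builds a disposable tree under a fresh temporary directory, so rm -rf-style methods only ever touch that sandbox:

import java.io.IOException;
import java.nio.file.*;

public class FixtureTree {

    // Creates a scratch tree: `depth` nested levels, `width` folders per level,
    // each folder holding one dummy file. Returns the root for the test to use.
    public static Path createTree(int depth, int width) throws IOException {
        Path root = Files.createTempDirectory("labrat");
        populate(root, depth, width);
        return root;
    }

    private static void populate(Path dir, int depth, int width) throws IOException {
        Files.writeString(dir.resolve("dummy.txt"), "lorem ipsum");
        if (depth == 0) {
            return;
        }
        for (int i = 0; i < width; i++) {
            Path sub = Files.createDirectory(dir.resolve("folder" + i));
            populate(sub, depth - 1, width);   // recurse to build the deeper levels
        }
    }
}

A test for tree(only: 'folders', limit: 3) would then call createTree(4, 2), assert that nothing from the fourth level shows up, and delete the temp root afterwards.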

What is a sensible data-structure for allowing efficient synchronisation between two root paths?

I am working on an application that involves maintaining consistency between two local directories. Specifically, the directories should be identical, with the exception that all files in one of the directories are modified in some particular way (this part is not important to my question).
While running, my application runs two processes that listen for changes occurring under each of the paths, and performs relevant operations to bring them back in sync when necessary.
In terms of my specific question: I'm looking for advice on the trickier situation of when one starts the application. At this point, each process needs to check all files/folders under the path that it is looking after, to see if anything has changed in any way whilst the application was not running. (Let us assume that the application cannot be notified by the OS of anything that happened while it was shut down, and thus will need to directly check every file/folder.)
Each process will have access to (and maintain) a persistent data-structure of all files/folder under its designated path. I was thinking that the following should be held within the data-structure for each of the files and folders:
File/folder name;
File hash (CRC32);
File/folder last modification date; and
File/folder size.
These pieces of information will obviously help to check for any changes to files/folder, but what is the best way to store them?
It seems to me that one sensible way to approach the situation of an application start is for each process to recursively scan through all files/folders under its designated path, and compare the metadata for each file scanned to the metadata stored in its data-structure. Then the processes should also iterate through the data-structures to look for things that have been removed from the paths. Some cases that may be encountered during this process are:
file modified (file name found in data-structure, but hash differs);
file added (no identical filename or hash found in data-structure);
file renamed (file with same hash exists in data-structure, but not with same filename);
folder added (no folder name in data-structure);
folder removed (folder name in data-structure, but not under path);
folder renamed (tricky one).
So, what's the best data-structure to use for this task? In my head I'm thinking of some form of sorted associative array, e.g. a red-black tree, which stores file and folder objects. Each file object contains name, hash and mod-date attributes, while each folder object contains name and children attributes, where children stores another associative array with everything underneath. Given the path to an arbitrary file, e.g. /foo/bar/file.txt, you begin at the root (foo), check for bar, and so on until you get to file.txt's parent object.
Another alternative I can think of is to store everything flatly, so that there is one red-black tree where each key is the full path to a file/folder and the value is the file/folder object. This would probably be quicker for retrieval, but it won't be possible to detect renamed files/folders without iterating through all the values anyway, which sounds expensive. In the first approach, identifying a rename may only involve checking a portion of the data-structure rather than all of it.
Sorry the above ideas aren't terribly well thought out. What's the state of the art in this area, and are there any well-trodden approaches to these types of problems?
You're modelling a filesystem, so it's quite natural to use a hierarchical data structure. After all, you don't need to compare the file at dir1\dir2\foo.txt to dir3\bar.txt, right? You didn't mention file moves between directories as something you're tracking.
So, the data structure could be:
interface IFSEntry
{
    string Name { get; }
    DateTime CreationDate { get; }
    bool Compare(IFSEntry other);
    void UpdateFrom(IFSEntry other);
    bool WasRenamed(Dictionary<string, IFSEntry> possibleOriginals, out string oldName);
    ...
}

class File : IFSEntry
{
    ...
}

class Directory : IFSEntry
{
    private Dictionary<string, IFSEntry> children;
    ...
}
The Directory implementations of UpdateFrom and Compare would recurse down their children.
File renames would be relatively easy to detect by comparing CRCs. You'd miss files that changed in both places and were also renamed. You could add a CRC dictionary to the Directory class if the time to run the comparisons proves to be a performance problem.
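For what it's worth, a sketch of that CRC-based rename check (Java here for brevity; the types are made up, and a real version would also have to consider CRC32 collisions and files renamed in both replicas):

import java.util.*;

public class RenameDetector {

    record FileMeta(String name, long crc32) {}

    // Returns a map from old name to new name for files whose content hash
    // still exists but whose name does not: the rename candidates.
    public static Map<String, String> findRenames(List<FileMeta> snapshot,
                                                  List<FileMeta> scanned) {
        Map<Long, String> oldNameByHash = new HashMap<>();
        Set<String> oldNames = new HashSet<>();
        for (FileMeta f : snapshot) {
            oldNameByHash.put(f.crc32(), f.name());
            oldNames.add(f.name());
        }

        Map<String, String> renames = new HashMap<>();
        for (FileMeta f : scanned) {
            String oldName = oldNameByHash.get(f.crc32());
            if (oldName != null && !oldNames.contains(f.name())) {
                renames.put(oldName, f.name());   // same content, different name
            }
        }
        return renames;
    }
}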
For directory moves, if the child files also changed, then you've got a fuzzy logic situation. It would be best to have a merge tool that the user would operate for that situation.
If a file changes in both places, you also need a user-facing merge strategy if conflicting changes occur. I'd argue that is always a good idea, just to let the user eyeball that the document didn't lose coherence.

How do you manage the String Translation Process?

I am working on a software project that needs to be translated into 30 languages. This means that changing any string incurs a relatively high cost. Additionally, translation does not happen overnight, because the translation package needs to be worked on by different translators, so this might take a while.
Adding new features is somewhat cumbersome. We can think up all the strings that will be needed before we actually code the UI, but sometimes we still need to add new strings because of bug fixes or oversights.
So the question is, how do you manage all this process? Any tips in how to ease the impact of translation in the software project? How to rule the strings, instead of having the strings rule you?
EDIT: We are using Java and all strings are internationalized using resource bundles, so the problem is not the internationalization per se, but the management of the strings.
I'm not sure which platform you're internationalizing in. I've written an answer before on the best way to i18n an application; see What do I need to know to globalize an asp.net application?
That said, managing the translations themselves is hard. The problem is that you'll be using the same piece of text across multiple pages. Your framework may not, however, support keeping that piece of text in just one file (resource files in asp.net, for instance, encourage you to have one resource file per language).
The way that we found to work with things was to have a central database repository of translations. We created a small .net application to import translations from resource files into that database and to export translations from that database to resource files. There is, thus, an additional step in the build process to build the resource files.
The other issue you're going to have is passing translations to your translation vendor and back. There are a couple of ways to do this: see if your translation vendor is willing to accept XML files and return properly formatted XML files. This is, really, one of the best ways, since it allows you to automate your import and export of translation files. Another alternative, if your vendor allows it, is to create a website to allow them to edit the translations.
In the end, your answer for translations will be the same for any other process that requires repetition and manual work. Automate, automate, automate. Automate every single thing that you can. Copy and paste is not your friend in this scenario.
Pootle is a web app that lets you manage the translation process over the web.
There are a number of major issues that need to be considered when internationalizing an application.
Not all strings are created equal. Depending upon the language, the length of a sentence can change significantly. In some languages, it can be half as long, and in others it can be triple the length. Make sure to design your GUI widgets with enough space to handle strings that are longer than your English strings.
Translators are typically not programmers. Do not expect the translators to be able to read and maintain the correct file formats for resource files. You should set up a mechanism to round-trip the translated data between your resource files and something like a spreadsheet. One possibility is to use XSL filters with Open Office, so that you can save to resource files directly from a spreadsheet application. Also, translators or translation service companies may already have their own databases, so it is good to ask what they use and write some tools to automate the exchange.
You will need to append data to strings; don't pretend that you will never have to, or that you will always be able to put the data at the end. Make sure that you have a string formatter set up for replacing placeholders in strings. Furthermore, make sure to document the typical values that will be substituted, for the translators. Remember, the order of the placeholders may change in different languages.
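In Java, for instance, java.text.MessageFormat does exactly this with positional placeholders, so a translation is free to reorder them without any code change (the two pattern strings below are invented for illustration):

import java.text.MessageFormat;

public class Placeholders {
    public static void main(String[] args) {
        // The same message in English and German; the German translation
        // uses the placeholders in a different order.
        String en = "{0} added {1} files";
        String de = "{1} Dateien wurden von {0} hinzugefügt";

        System.out.println(MessageFormat.format(en, "Alice", 3)); // Alice added 3 files
        System.out.println(MessageFormat.format(de, "Alice", 3)); // 3 Dateien wurden von Alice hinzugefügt
    }
}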
Name your i18n string variables something that reflects their meaning. Do you really want to be looking up numbers in a resource file to find out the contents of a given string? Developers depend on being able to read the string output in code for efficiency a lot more than they often realize.
Don't be afraid of code generation. In my current project, I have written a small Java program that is called by Ant; it parses all of the keys of the default-language (master) resource file and then maps each key to a constant defined in my localization class. See below. The lines between the //---- comments are auto-generated. I run the generator every time I add a string.
import java.util.ResourceBundle;

public final class l7d {
    // ...normal junk

    /**
     * Reference to the localized strings resource bundle.
     */
    public static final ResourceBundle l7dBundle =
            ResourceBundle.getBundle(BUNDLE_PATH);

    //---- start l7d fields ----\
    public static final String ERROR_AuthenticationException;
    public static final String ERROR_cannot_find_algorithm;
    public static final String ERROR_invalid_context;
    // ...many more
    //---- end l7d fields ----\

    static {
        //---- start setting l7d fields ----\
        ERROR_AuthenticationException = l7dBundle.getString("ERROR_AuthenticationException");
        ERROR_cannot_find_algorithm = l7dBundle.getString("ERROR_cannot_find_algorithm");
        ERROR_invalid_context = l7dBundle.getString("ERROR_invalid_context");
        // ...many more
        //---- end setting l7d fields ----\
    }
}
The approach above offers a few benefits.
Since your string key is now defined as a field, your IDE should support code completion for it. This will save you a lot of typing. It gets really frustrating looking up every key name and fixing typos every time you want to print a string.
Someone please correct me if I am wrong, but loading all of the strings into memory at static initialization (as in the example) results in a quicker load time at the cost of additional memory usage. I have found the additional amount of memory used to be negligible, and worth the trade-off.
The localised projects I've worked on had 'string freeze' dates. After this time, the only way strings were allowed to be changed was with permission from a very senior member of the project management team.
It isn't exactly a perfect solution, but it did enable us to put defects regarding strings on hold until the next release with a valid reason. Once the string freeze had occurred, you also had a valid reason to deny adding brand-new features to the project on 'spur of the moment' decisions. And having the permission come from high up meant that middle managers had no power to change specs on you :)
If available, use a database for this. Each string gets an id, and there is either a table for each language, or one table for all languages with the language in a column (depending on how the site is accessed, performance dictates which is better). This allows updates from translators without trying to manage code files and version-control details. Further, it's almost trivial to run reports on what isn't translated, and to keep track of what was an autotranslation (engine) vs. a real human translation.
If no database, then I stick each language in a separate file so version control issues are reduced. But the structure is basically the same - each string has an id.
-Adam
Not only did we use a database instead of the vaunted resource files (I have never understood why people use something that is such a pain to manage when we have such good tools for dealing with databases), but we also avoided the need to tag things in the application (forgetting to tag controls with numbers in VB6 forms was always a problem) by using reflection to identify the controls for translation. We then use an XML file which maps the controls to the phrase IDs from the dictionary database.
Although the mapping file had to be managed, it could still be managed independently of the build process, and the translation of the application could actually be done by end-users who had rights in the database.
The solution we have come up with so far is a small application in Excel that reads all the property files and then shows a matrix with all the translations (languages as headers, keys as rows). It is then quite evident what is missing. The sheet is sent to the translators, and when it comes back it can be processed to generate the same property bundles again. So far this has eased the pain somewhat, but I wonder what else is around.
This Google book on resource file management gives some good tips.
You can use resource file management software to keep track of strings that have changed and to control the workflow to get them translated; otherwise you end up in a mess of freezes and overbearing version control.
Some tools that do this sort of thing (no connection, and I haven't actually used them, just researching):
http://www.sisulizer.com/
http://www.translationzone.com/en/products/
I put in a makefile target that finds all the .properties files and puts them in a zip file to send off to the translators. I offered to send them just diffs, but for some reason they want the whole bundle of files each time. I think they have their own system for tracking just the differences, because they charge us based on how many strings have changed from one time to the next. When I get their delivery back, I manually diff all their files against the previous delivery to see if anything unexpected has changed. One time all the PT_BR (Brazilian Portuguese) strings changed, and it turned out they'd used a PT_PT (European Portuguese) translator for that batch in spite of the order being for PT_BR.
In Java, internationalization is accomplished by moving the strings to resource bundles... the translation process is still long and arduous, but at least it's separated from the process of producing the software, releasing service packs, etc. One thing that helps is to have a CI system that repackages everything any time changes are made. We can have a new version tested and out in a matter of minutes, whether it's a code change, a new language pack, or both.
For starters, I'd use default strings in case a translation is missing. For example, the English or Spanish value.
Secondly, you might want to consider a web app or something similar for your translators to use. This requires some resources up front, but at least you won't need to send files around, and it will be obvious to the translators which strings are new, etc.
