Compiling historical information (esp. SLOCs) about a project - continuous-integration

I am looking for a tool that will help me to compile a history of certain code metrics for a given project.
The project is stored inside a mercurial repository and has about a hundred revisions. I am looking for something that:
checks out each revision
computes the metrics and stores them somewhere with an identifier of the revision
does the same with the next revisions
For a start, counting SLOCs would be sufficient, but it would also be nice to analyze the number of tests, test coverage, etc.
I know such things are usually handled by a CI server, but I am solo on this project and thus haven't bothered to set one up (I'd like to use TeamCity, but I really didn't see the benefit of doing so in the beginning). If I set up my CI server now, could it handle that?

Following jitter's suggestion, I have written a small bash script (running inside Cygwin) that uses sloccount for counting the source lines. The output is simply dumped to a text file:
#!/bin/bash
COUNT=0        # start revision
STOPATREV=98   # stop revision
until [ $COUNT -gt $STOPATREV ]; do
    hg update -C -r $COUNT >> sloc.log    # update to the revision and log it
    echo "" >> sloc.log                   # echo a newline
    rm -r lib                             # don't count the lib folder (restored by the next hg update -C)
    sloccount /thisIsTheSourcePath | print_sum >> sloc.log   # append the count to the log (print_sum is not a standard tool, presumably a custom filter)
    let COUNT=COUNT+1
done

You could write e.g. a shell script which
checks out the first version
runs sloccount on it (saving the output)
checks out the next version
repeats steps 2 and 3 until the last revision
Or look into ohloh, which seems to have Mercurial support by now.
Otherwise I don't know of any SCM statistics tool which supports Mercurial. As Mercurial is relatively young (around since 2005), it might take some time until such "secondary use cases" are supported. (HINT: maybe provide an hgstat library yourself, as there are for svn and cvs)

If it were me writing software to do that kind of thing, I think I'd dump metrics results for the project into a single file, and revision control that. Then the "historical analysis" tool would have to pull out old versions of just that one file, rather than having to pull out every old copy of the entire repository and rerun all the tests every time.
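A minimal sketch of that idea, assuming Mercurial and sloccount are available; metrics.log is a made-up filename and /thisIsTheSourcePath is the same placeholder as above:
#!/bin/bash
# Record the metrics for the current working copy in one tracked file and
# commit it, so later analysis only needs old versions of metrics.log.
REV=$(hg id -i)                    # short changeset id of the working copy
{
    echo "revision: $REV"
    sloccount /thisIsTheSourcePath
} > metrics.log
hg add metrics.log                 # only needed on the first run
hg commit metrics.log -m "Record SLOC metrics for revision $REV"
The historical analysis then becomes hg cat -r REV metrics.log for whichever revisions you care about, instead of a full checkout per revision.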

Related

Use non-built-in bash commands without modifying .bashrc

I'm working on a cluster and using custom toolkits (more specifically, the SRA Toolkit). In order to use it, I first had to download (and unpack) it to a specific folder in my directory.
Then I had to modify .bashrc to include the following segment:
# User specific aliases and functions
export PATH="$PATH:/home/MYNAME/APPS/SRATOOLS/bin"
Now I can use the SRA Tools from the bash command line, e.g.
prefetch SR111111
My question is, can I use those tools without modifying my .bashrc?
The reason I want to do this is that I wrote a .sh script that takes a long time to run, and my cluster uses the Sun Grid Engine job management system. I submitted my script to it, only to see the process fail because an SRA Toolkit command I used was unrecognized.
EDIT (1):
I modified the location where my prefetch command is, and now it looks like:
/MYNAME/APPS/SRA_TOOLS/bin
which is different from how it is in .bashrc:
export PATH="$PATH:/home/MYNAME/APPS/SRATOOLS/bin"
Then I ran what @Darkman suggested (an IF/THEN/ELSE/FI block with the export under ELSE). The output shows that it didn't find the SRA Tools at first (because the path in .bashrc is different), but it found them via the ELSE branch and the script runs normally. Weird. It works on my job management system.
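For reference, the guard described above looks roughly like this (just a sketch; the path is the one from the question):
# Only extend PATH if the SRA Toolkit isn't already visible to this shell.
if command -v prefetch >/dev/null 2>&1; then
    echo "SRA Toolkit already on PATH"
else
    export PATH="$PATH:/home/MYNAME/APPS/SRATOOLS/bin"
fi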
Thanks everybody.

Adding other useful info to a git archive filename automagically

Stumbled across this gem: Export all commits into ZIP files or directories, whose initial answer met my needs for exporting commits from certain branches (like develop, for example) into separate zip files - all done via a simple, yet clever, one-liner:
git rev-list --all --reverse | while read hash; do git archive --format zip --output ../myproject-commit$((i=i+1))-$hash.zip $hash; done
In my version I replaced the --all with --first-parent develop.
What I would like to do now is make the filenames more useful by including the commit date and commit author. I've Googled around a bit and grokked the git archive documentation, but I don't seem to find any other readily available 'parameters' like $hash that I could use.
I'm guessing I will need to expand the loop, look up the relevant bits individually, save them into bash variables, and pass them on to the output option with something like ${author}. Does anyone know a cleaner, simpler way to do this, or can point me to documentation or other examples showing where to pull the needed info from other parts of git? Thanks in advance for any insights.
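Something along those lines could look like the sketch below; git show -s with --format looks up the author date and author name per commit, and the branch/project names are the ones from the question:
i=0
git rev-list --first-parent --reverse develop | while read -r hash; do
    i=$((i+1))
    cdate=$(git show -s --date=short --format=%ad "$hash")    # author date, e.g. 2019-11-13
    author=$(git show -s --format=%an "$hash" | tr ' ' '_')   # underscores keep the filename tidy
    git archive --format zip --output "../myproject-commit${i}-${cdate}-${author}-${hash}.zip" "$hash"
done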

Executing a script takes so long on Git Bash

I'm currently executing a script in Git Bash on a Windows 7 VM. The same script runs within 15-20 seconds on my Mac machine, but it takes almost an hour on the Windows VM.
The script itself uses packages that extract data from XML files, and does not call upon any APIs or anything of the sort.
I have no idea what's going on, and I've tried solving it with the following answers, but to no avail:
https://askubuntu.com/a/738493
https://github.com/git-for-windows/git/wiki/Diagnosing-performance-issues
I would like to have someone help me out in diagnosing or giving a few pointers on what I could do to either understand where the issue is, or how to resolve it altogether.
EDIT:
I am not able to share the entire script, but you can see the type of commands it uses through previous questions I have asked on Stack Overflow. Essentially, a mixture of XMLStarlet commands is used.
https://stackoverflow.com/a/58694678/3480297
https://stackoverflow.com/a/58693691/3480297
https://stackoverflow.com/a/58080702/3480297
EDIT2:
As a high level overview, the script essentially loops over a folder for XML files, and then retrieves certain data from each one of those files, before creating an HTML page and pasting that data in tables.
A breakdown of these steps in terms of the code can be seen below:
Searching folder for XML files and looping through each one
for file in "$directory"*
do
    if [[ "$file" == *".xml"* ]]; then
        filePath+=( "$file" )
    fi
done
for ((j=0; j < ${#filePath[@]}; j++)); do
    retrieveData "${filePath[j]}"
done
Retrieving data from the XML file in question
function retrieveData() {
    filePath=$1
    # Retrieve data from the relevant xml file
    dataRow=$(xml sel -t -v "//xsd:element[@name=\"$data\"]/@type" -n "$filePath")
    outputRow "$dataRow"
}
Outputting the data to an HTML table
function outputRow() {
rowValue=$1
cat >> "$HTMLFILE" << EOF
<td>
<div>$rowValue</div>
</td>
EOF
}
As previously mentioned, the actual xml commands used to retrieve the relevant data can differ; however, the links to my previous questions show the different types of commands used.
Your git-bash installation is out of date.
Execute git --version to confirm this. Are you using something from before 2.x?
Please install the latest version of git-bash, which is 2.24.0 as of 2019-11-13.
See the Release Notes for git for more information about performance improvements over time.

How to convert this script into a custom mercurial command?

I have the following script:
#!/bin/bash
if [ $# -ne 2 ]; then
    echo -n "$0 - a utility for applying uncommitted changes to a "
    echo "remote hg repository locally also"
    echo "Usage: $0 user@hostname path/to/repository"
    exit -1
fi
user_at_hostname="$1"
remote_path="$2"
ssh "$user_at_hostname" hg -R "$remote_path" diff | hg import --no-commit -
It's not the most glorious piece of code, and I would rather do something more "mercurial" than that, so to speak. Specifically, I was wondering whether I could achieve the same using a mercurial alias / custom command. Can I?
PS - I had also thought about maybe issuing some sort of shelve command on the remote repository instead of just getting a diff, but I don't want to make things too complicated.
If you just want to convert this script into an hg foo command without changing it, use a shell alias. Just copy the last line and replace the custom variables with $1 and $2.
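For example, a sketch of such a shell alias in your ~/.hgrc (the alias name is made up; the leading ! marks it as a shell alias, and $1/$2 are its positional arguments):
[alias]
# usage: hg rimport user@hostname path/to/repository
rimport = !ssh "$1" hg -R "$2" diff | hg import --no-commit -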
If you want to make this look more like a "normal" Mercurial workflow, you could start by committing the changes and then pulling them. I imagine that you are avoiding this workflow so that you can change your mind about these modifications without polluting your repository's history with "oops" commits. If that is the case, then you will likely be interested in the Evolve extension. The Evolve extension is intended to provide a safe and reasonably well-behaved system for sharing mutable history. In this way, you can commit a change; share it with another repository; amend, rebase, squash, or otherwise modify the change; and then share the modified changeset with the same repository or a different one. You can also prune changesets from history and share the fact that you pruned them. If this sharing causes problems, such as amending a commit which has descendants, Mercurial will detect those problems and offer a fix (e.g. rebase the descendants onto the new version of the commit) which you can execute automatically with hg evolve. While the extension is still experimental, it does basically work for most simple use cases.
If experimental software isn't of interest to you, you can still flag the repository as non-publishing. This will allow you to use more traditional history-editing machinery such as hg rebase, hg histedit, and hg strip even after you have pushed to the repository. However, revisions which are destroyed in one repository will not automatically vanish from other repositories without the evolve extension. You will have to strip them by hand.
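Making a repository non-publishing is a one-line change in its .hg/hgrc:
[phases]
publish = False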
Finally, note that hg push --force does not destroy revisions. It creates new anonymous branches, typically resulting in a messy history, but without actually losing any data. It is different from git in this fashion.

Speeding up file comparisons (with `cmp`) on Cygwin?

I've written a bash script on Cygwin which is rather like rsync, although different enough that I believe I can't actually use rsync for what I need. It iterates over about a thousand pairs of files in corresponding directories, comparing them with cmp.
Unfortunately, this seems to run abysmally slowly -- taking about ten (Edit: actually 25!) times as long as it takes to generate one of the sets of files using a Python program.
Am I right in thinking that this is surprisingly slow? Are there any simple alternatives that would go faster?
(To elaborate a bit on my use-case: I am autogenerating a bunch of .c files in a temporary directory, and when I re-generate them, I'd like to copy only the ones that have changed into the actual source directory, leaving the unchanged ones untouched (with their old creation times) so that make will know that it doesn't need to recompile them. Not all the generated files are .c files, though, so I need to do binary comparisons rather than text comparisons.)
Maybe you should use Python to do some - or even all - of the comparison work too?
One improvement would be to only bother running cmp if the file sizes are the same; if they're different, clearly the file has changed. Instead of running cmp, you could think about generating a hash for each file, using MD5 or SHA1 or SHA-256 or whatever takes your fancy (using Python modules or extensions, if that's the correct term). If you don't think you'll be dealing with malicious intent, then MD5 is probably sufficient to identify differences.
Even in a shell script, you could run an external hashing command, and give it the names of all the files in one directory, then give it the names of all the files in the other directory. Then you can read the two sets of hash values plus file names and decide which have changed.
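A rough shell sketch of that, assuming GNU md5sum and two placeholder directories old_dir and new_dir:
# Hash every file with one md5sum invocation per directory, then diff the
# two listings; any line that differs names a file whose content changed.
( cd old_dir && find . -type f -print0 | sort -z | xargs -0 md5sum ) > old.md5
( cd new_dir && find . -type f -print0 | sort -z | xargs -0 md5sum ) > new.md5
diff old.md5 new.md5
Keeping old.md5 around between runs also gives you the 'hashes of the current set of files' mentioned below, so only the freshly generated side needs rehashing.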
Yes, it does sound like it is taking too long. But the trouble includes having to launch 1000 copies of cmp, plus the other processing. Both the Python and the shell script suggestions above have in common that they avoid running a program 1000 times; they try to minimize the number of programs executed. This reduction in the number of processes executed will give you a pretty big bang for your buck, I expect.
If you can keep the hashes from 'the current set of files' around and simply generate new hashes for the new set of files, and then compare them, you will do well. Clearly, if the file containing the 'old hashes' (current set of files) is missing, you'll have to regenerate it from the existing files. This is slightly fleshing out information in the comments.
One other possibility: can you track changes in the data that you use to generate these files and use that to tell you which files will have changed (or, at least, limit the set of files that may have changed and that therefore need to be compared, as your comments indicate that most files are the same each time).
If you can reasonably do the comparison of a thousand odd files within one process rather than spawning and executing a thousand additional programs, that would probably be ideal.
The short answer: Add --silent to your cmp call, if it isn't there already.
You might be able to speed up the Python version by doing some file size checks before checking the data.
First, a quick-and-hacky bash(1) technique that might be far easier if you can change to a single build directory: use the bash -N test:
$ echo foo > file
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$ cat file
foo
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
older than last read
$ echo blort > file # regenerate the file here
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$
Of course, if some subset of the files depend upon some other subset of the generated files, this approach won't work at all. (This might be reason enough to avoid this technique; it's up to you.)
Within your Python program, you could also check the file sizes using os.stat() to determine whether or not you should call your comparison routine; if the files are different sizes, you don't really care which bytes changed, so you can skip reading both files. (This would be difficult to do in bash(1) -- I know of no mechanism to get the file size in bash(1) without executing another program, which defeats the whole point of this check.)
The cmp program will do the size comparison internally IFF you are using the --silent flag and both files are regular files and both files are positioned at the same place. (This is set via the --ignore-initial flag.) If you're not using --silent, add it and see what the difference is.
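Applied to the use-case above, the copy-only-if-changed step might look something like this sketch (tmp_dir and src_dir are placeholders):
# cmp --silent (-s) exits 0 when the files are identical; copy only the
# generated files that actually differ, so unchanged files keep their mtimes.
for f in tmp_dir/*; do
    target="src_dir/$(basename "$f")"
    if ! cmp --silent "$f" "$target" 2>/dev/null; then
        cp "$f" "$target"
    fi
done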
