I have two tables, pages and revisions. Revisions has a foreign key to a page. The content of a page is the latest entry in the revisions table for that page. The revisions are full copies of the contents, not deltas.
As an experiment, I would like to visualize the revision state of the current revision. If text is new in the current revision, don't do anything. If it is from a recent revision, give it a green background color. If it's very old, give it a red background color. In between, orange. A heat map diff of the age of the content, so to speak.
My question is: how can I extract this data from the revisions of a page? Pointers to literature would be just as useful as actual code solving this problem.
Not really relevant, but just in case: It's for a Ruby project, Ruby on Rails in fact. Here's the project, on github.
Update: here's an example test case, written in Ruby. http://pastie.org/631604
Update:
[ long and slightly off-topic answer on longest-common-subsequence deleted ]
I've integrated my Hunt-McIlroy algorithm subsequence finder with your test case, which now passes. I made various mods to your test case; see it here at pastie.org. Likewise, here is the rdiff module. Here is my svn log explaining why your test case was changed.
One quick way to do it is to get the successive versions of the page and run them through the diff utility to get deltas, so you know what to color and how. You could of course reinvent the code that takes two complete pages and finds which bits they have in common, but it's going to be faster to reuse existing code.
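Since the project is Ruby, here is a minimal sketch of that idea using the diff-lcs gem instead of shelling out to the diff utility (the variable names are made up, and the revision bodies are assumed to be plain strings):

    require 'diff/lcs'

    old_words = older_revision_body.split    # hypothetical: text of the previous revision
    new_words = current_revision_body.split  # hypothetical: text of the current revision

    # Each change says whether a word was added ('+') or removed ('-') in the newer
    # revision; added words are "new" (no background), unchanged words keep whatever
    # age they already had.
    Diff::LCS.diff(old_words, new_words).each do |hunk|
      hunk.each do |change|
        puts "#{change.action} #{change.element}"
      end
    end

Running this over each consecutive pair of revisions, oldest to newest, gives you the deltas to feed into the coloring.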
You can use the svn blame command to get similar results. Of course the revisions and pages need to be stored in svn. If migrating to svn is a roadblock, examining the svn sources to see how the blame command is written might help.
Edit (in reply to August):
In order to visualize this, I need something that doesn't care about lines. Don't I?
Well, you need blame for rows and a diff for the contents of a single row. The first is done by the VCS; the second you can do yourself or with third-party code. For every change, store the deltas of the blame output in the database (only changed rows need to be saved). Sample results for one row:
Rev. num. Value
23 Hello worl
36 Hello cruel world
45 Hello wonderful world
The result you want, I assume, is the following (for clarity I've skipped whitespace).
After the first diff:
(Hello)(23)(cruel)(36)(worl)(23)(d)(36)
After the second diff:
(Hello)(23)(wonderful)(45)(worl)(23)(d)(36)
A unified diff doesn't help in this case, so the diff needs to be done differently. You can write the diff algorithm yourself or find appropriate code in merge tools. Below is an example of how TortoiseMerge does it.
TortoiseMerge example: http://img169.imageshack.us/img169/7871/merge.png
The problem isn't a simple one, but I think these ideas might help a little or at least give you some clues.
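To make the attribution above concrete on the Ruby side, here is a rough sketch, again assuming the diff-lcs gem and an ordered array of revision bodies (oldest first). It works word by word for brevity; splitting into characters instead of words would approximate the sub-word attribution shown above. All names are illustrative:

    require 'diff/lcs'

    def attribute_words(revision_bodies)
      # Tag every word of the oldest revision with index 0 into revision_bodies.
      tagged = revision_bodies.first.split.map { |word| [word, 0] }

      revision_bodies.each_cons(2).with_index(1) do |(_old, new_body), rev_index|
        updated = []
        Diff::LCS.sdiff(tagged.map(&:first), new_body.split).each do |change|
          case change.action
          when '='        # unchanged word keeps the revision it was first seen in
            updated << tagged[change.old_position]
          when '+', '!'   # inserted or replaced word belongs to this revision
            updated << [change.new_element, rev_index]
          end             # '-' means the word disappeared, so it is dropped
        end
        tagged = updated
      end
      tagged
    end

    # For the three example rows this yields word/revision-index pairs such as
    # [["Hello", 0], ["wonderful", 2], ["world", 1]], which map straight onto
    # the green/orange/red age buckets.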
One thing. Heat implies activity or energy, so I would flip your colors around so that
the most recent are red (hot) and the older text is blue/green (cooled off).
You can use any DVCS to achieve that. I'd recommend git. It will work even better than using the database.
Related
I want to use SonarQube on my project. The project is quite big and scanning all the files takes a lot of time. Is it possible to scan only the files changed in the last commit, and produce a report based only on the changed lines of code?
I want to check whether added or modified lines make the project quality worse, and I don't care about the old code.
For example, if person A created a file with 9 bugs and then committed the changes, the report and quality gate should show 9 bugs. Then person B edited the same file, adding a few lines containing 2 additional bugs, and committed the changes; the report should now show those 2 new bugs and the quality gate should be executed on the last changes (so it should consider the last 2 bugs).
I was able to narrow the scan to only the files changed in the last commit, but the report is still generated based on whole files. I had an idea about cutting out only the changed lines of code, pasting them into a new file and running the sonar scan on that file, but I'm almost sure SonarQube needs the whole context of the file.
Is it possible to somehow achieve my use case?
No, it isn't possible. I've seen a lot of similar questions; here are answers to two of them:
New Code analysis only:
G Ann Campbell:
Analysis will always include all code. Why? Why take the time to
analyze all of it when only a file or two has been changed? Because
any given change can have far-reaching effects. I’ll give you two
examples:
I check in a change that deprecates a much-used method. Suddenly,
issues about the use of deprecated code should be raised all over the
project, but because I only analyzed that one file, no new issues were
raised.
I modify a much-used method to return null in some cases. Suddenly all
the methods that dereference the returned value without first
null-checking it are at risk of NullPointerExceptions. But only the
one file that I changed was analyzed, so none of those “Possible NPE”
issues are raised. Worse, they won’t be raised until after each
individual file happens to be touched.
And that’s why all files are included in each analysis.
I want sonar analysis on newly checkin code:
G Ann Campbell:
First, the SonarQube interface and default Quality Gate are designed to help you focus
on the New Code Period. You can’t keep analysis from picking up those
old issues, but you can decide to only pay attention to issues raised
on newly-changed code. That means you would essentially ignore the
issues on the left side of the project homepage with a white
background and focus instead on the New Code values over the yellow
background on the right. We call this Fixing the Leak, or
alternately Clean as You Code.
Second, if you have a commercial edition, then branch and PR analysis
are available to you. With Short-Lived Branch (SLB) and PR analysis, analysis
still covers all files, but all that's reported in the UI is what's
changed in the PR / SLB.
Ideally, you’ll combine both of these things to make sure your new
code stays clean.
This position has not changed over the last few years, so don't expect it to change.
I'm a novice developer, working alone. I'm using Xcode and git version control. I'm probably not properly organised and doing things wrong, but I usually decide to commit just to make a safe point before I spoil everything. At that moment I find it difficult to properly describe what I have already done, but I know exactly what I'm going to try next. So by the time I create the next reference point, the previous one is already named.
So my question is: is there some version control methodology where reference points are described by plans, not facts? Why would this be a bad idea?
The problem with describing a commit based on what you "plan" to do is that you lose accurate accounting of what has been done. Let's say you plan on doing something, but that doesn't work. So you roll back and try something else, and that works. You commit that, but now what you "planned" to do isn't what was actually done.
At that point, you'll need to go back and edit the comments on the previous commit to describe what you actually did or risk losing a record of the change over time. Also, if you are working in a group, you pretty much need to make your comments based on what you actually did so other members of the team can see it and either check what you did or improve on it.
Unless you plan on never working on a team project, your best bet is to just bite the bullet and figure out how to keep track of what you've done since the last commit. I keep a pen and notepad by my side so I can keep track of changes. I also do frequent commits to keep from forgetting what I've done over a long period of time.
ABC, always be committing. While you may be working on projects for yourself and no one is accountable but yourself, it is generally a good idea to commit what has been done rather than what you plan to do.
Branching is designed to save yourself from what you plan to do. Create a branch called 'addnewscreen' or whatever you plan to do. This way you can keep committing all the small changes on your new stuff without polluting your main branch. Once you are happy, merge it back in and make a new branch for what's next.
If you get stuck, the Pro-Git Book has helped me so many times I've lost count. Hopefully this will help you too. Good luck.
What is the quickest way to isolate the source of an error amongst an ordered list of potential sources? For example, given a list of column mappings, and one of those column mappings is incorrect, what debugging technique would lead you to most quickly identify which mapping is invalid? (By most quickly, I mean, which approach would require the fewest compilation, load, and run cycles?)
Assume that whatever error message the database or database driver generates does not identify the name of the errant column. Sound familiar?
Hint:
The technique is similar to that which you might use to answer the question, "What number am I thinking of between 1 and 1000?", but with the fewest guesses.
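For concreteness, the kind of bisection the hint points at, sketched in Ruby. This assumes you can run a load with only a prefix of the mappings enabled and that exactly one mapping is bad; insert_succeeds? and the mapping list are hypothetical names standing in for one compile/load/run cycle:

    # Bisect an ordered list of column mappings to find the one that breaks the load.
    # insert_succeeds?(mappings) is assumed to run the load with only those mappings
    # enabled and return true if it completes without error.
    def find_bad_mapping(mappings)
      low, high = 0, mappings.size - 1
      while low < high
        mid = (low + high) / 2
        if insert_succeeds?(mappings[0..mid])
          low = mid + 1   # the failure is caused by a mapping after mid
        else
          high = mid      # the failure is caused by a mapping at or before mid
        end
      end
      mappings[low]
    end

With 1000 mappings this takes about ten runs rather than a thousand, which is exactly the point of the 1-to-1000 guessing game.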
You can use interpolation in some cases. I've used this successfully to isolate a bad record.
Sounds familiar, but I hate to be the one to tell you that there is no "quick" way of isolating the sources of errors. I know from my own experience that you want to be absolutely sure you've found the correct source of error before you go about resolving it, and this requires plenty of testing and tracing.
I keep adding more and more diagnostic information until I either isolate the issue or can't add any more. If it's my code vs. external code, I will go crazy with trace statements until I isolate the critical bit of code, if I otherwise don't know where the issue is. On Windows, the SysInternals suite is my friend... especially the debug viewer. That will show trace statements from anything running on the system that emits trace output.
If I truly cannot get more specific information from the error source, then I will go into experimental mode... testing one small change at a time. This works best if you know you have a case that succeeds and a case that does not.
Trivial example: If I have row X that won't be inserted into the database, but I know row Y will, I will take row Y and change one field at a time, inserting after each change, until row Y's values equal row X's values.
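A sketch of that field-by-field substitution in Ruby, assuming the rows are hashes and try_insert is a hypothetical method that attempts the insert and returns true on success:

    # Start from the known-good row and pull in row X's values one field at a time;
    # the first substitution that makes the insert fail points at the bad column.
    candidate = row_y.dup
    row_x.each_key do |field|
      candidate[field] = row_x[field]
      unless try_insert(candidate)
        puts "Insert fails once #{field} takes row X's value: #{row_x[field].inspect}"
        break
      end
    end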
If you really are stumped at where the issue is coming from, time to dust off your Google-fu skills. Someone has probably run into the same problem and posted a question to a forum somewhere. Of course, that's what SO is for too.
You're the human... be more stubborn than the computer!
Here is my use case:
I start on a project XYZ, for which I create a work item, and I make frequent check-ins, easily 10-20 in total. ALL of the code changes will be code-read and code-reviewed.
The change sets are not consecutive - other people check-in in-between my changes, although they are very unlikely to touch the exact same files.
So ... at the end of the project I am interested in a "total diff" - as if there was a single check-in by me that completed the entire project. In theory this is computable. From the list of change sets associated with the work item, you get the list of all files that were affected. Then, the algorithm can aggregate the individual diffs over each file and combine them into one. It is possible that a pure total diff is uncomputable because someone else renamed files, or changed things very close to, or in, the same functions as me. In that case ... I suppose the total diff could include those changes by others as well, and warn me about that fact.
I would find this very useful, but I do not know how to do it in practice. Can Visual Studio 2008/2010 (and/or TFS server) do it? Are there other source control systems capable of doing this?
Thanks.
You can certainly compute the 'total diff' yourself - make a branch of the project from the revision just prior to your first commit, then merge all your changesets into it.
I don't think this is really computable in the general case - only contiguous changesets can be merged automatically like this. Saying it's 'unlikely' for others to have touched the files you're working on in the intervening commits doesn't cut it; you need guarantees to be able to automate this sort of thing.
You should be working on a branch of your own if you want to be able to do this easily.
The ability to generate diff information for display or for merge purposes is functionality provided by your version control system, as Mahesh Velaga commented on another answer. If you were able to compute the diff by cherry-picking non-contiguous changesets, then logically you would also be able to merge those changes in a single operation. But this is not supported by TFS. So I strongly suspect that the construction of the cherry-picked diff information is also not supported by TFS. Other version control systems (git, mercurial, darcs come to mind) might have more support for something like this; I don't know for sure.
From my reading of their answers on the TFS version control forums, I think that their recommendation for this would be to create a branch of your own for doing this work in the first place: then the changesets would be contiguous on that branch and creating the "total diff" would be trivial. Since it sounds like you are working on an independent feature anyway (otherwise a diff of only your changes would be meaningless), you should consider having an independent branch for it regardless of whether your version control system is TFS or something else.
The alternative is to construct what such a branch would have looked like after the fact, which is essentially what Jim T's answer proposes. You might prefer that approach if your team is very keen on everyone working in the same kitchen, as it were. But as you are already aware, things can get messy that way.
Create two workspaces. In each one, do a Get Specific Version of the files, specifying the date or the changeset at each end of the range. Then compare the two folders using a compare tool; Araxis Merge is a good one.
Sounds like you need a tool that supports changesets (changes over multiple files, committed all at once) instead of committing each file on its own.
Take a look at this comparison between SourceSafe and Mercurial (Mercurial is free, and you can find tools to integrate it with Visual Studio).
I'm trying to dig up resources on how version control algorithms operate on data, and I'm especially interested in the way git's mechanism operates. I realize git does many different things, but in particular I'm interested in how history is saved and restored. I'd appreciate any links or article references anyone can point me to. thanks :)
If you know how to use git and what it does, but you're curious how, then dig into the gitcore-tutorial for a start; it shows what objects are stored inside a git repository, how successive revisions are stored, what a revision is and how to create one manually, how revisions are connected, and so on.
This presentation is also helpful in showing how it all works. It was created by the maintainer of the git-scm page and one of the GitHub staff, so he knows what he's talking about.
The Pro Git book has a chapter on internals that might be helpful.
http://progit.org/book/ch9-0.html
It doesn't actually go into details on packfile structure, but it pretty comprehensively covers everything else. If you want to know about packfile and pack index structures, I covered it here in some detail.
The only thing that page doesn't cover is the actual delta algorithms, but afaik that isn't actually covered anywhere. If you're curious I can explain it, though.
The history of a project in Git is quite simple. Git is snapshot-based at the conceptual level, which means that in the simplest case of linear history, the history of a project is a string of successive versions of the project.
A single version of a project is represented by a commit object, which contains the state (snapshot) of the whole project at that version (revision), version metadata such as the date the commit was created and author info, and pointers to zero or more previous versions the given one is based on. The versions a given commit is based on are called its parent commits. So for linear history it would be a list of commits (representing versions / revisions), each of them but the last (sometimes called the root commit) pointing to its previous / parent commit. There is also a branch tip pointer which references the latest commit (the latest version on a given branch), and HEAD, which says which branch is the current branch.
In more complicated situations the history is a DAG (Directed Acyclic Graph) of versions, where each version is represented by a commit object with zero or more parents pointing to other commit objects (other versions).
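As an illustration of that structure only (a toy model in plain Ruby, not Git's actual object format):

    # Each commit records a snapshot of the project, some metadata, and pointers
    # to its parent commits (none for the root commit, one for linear history,
    # two or more for merges).
    Commit = Struct.new(:snapshot, :author, :date, :message, :parents)

    root = Commit.new({ "README" => "v1" }, "alice", Time.now, "initial commit", [])
    tip  = Commit.new({ "README" => "v2" }, "alice", Time.now, "edit README",    [root])
    # A branch is just a pointer to its tip commit; HEAD names the current branch.

    # Walking linear history means following first-parent pointers back to the root.
    commit = tip
    until commit.nil?
      puts commit.message
      commit = commit.parents.first
    end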
Besides already recommended articles I'd like to point to two more:
The Git Parable blog post by Tom Preston-Werner, which describes how Git could have been developed and explains the design of Git quite well.
Git from the bottom up by John Wiegley.
There are many resources on the Web, such as this article.
If you're interested in Mercurial, the Mercurial book is a great resource. The original paper by Matt Mackall at OLS is good too.