Eliminating code duplication in a single file - refactoring

Sadly, a project that I have been working on lately has a large amount of copy-and-paste code, even within single files. Are there any tools or techniques that can detect duplication or near-duplication within a single file? I have Beyond Compare 3 and it works well for comparing separate files, but I am at a loss for comparing single files.
This project is an ASP.NET/C# project, but I work with a variety of languages including Java; I'm interested in what tools are best (for any language) to remove duplication.

Check out Atomiq. It finds duplicated code that is a prime candidate for extracting to one location.

If you're using Eclipse, you can use the Copy/Paste Detector (CPD): https://olex.openlogic.com/packages/cpd

For Python there is CloneDigger. It also supports Java, but I have not tried that. It can find code duplication both within a single file and between files, and gives you the result as a diff-like report in HTML.

See SD CloneDR, a tool for detecting copy-paste-edit code within and across multiple files. It detects exact copies, copies that have been reformatted, and near-miss copies with different identifiers, literals, and even different sequences of statements.
The CloneDR handles many languages, including Java (1.4, 1.5, 1.6) and C# (up to C# 4.0). You can see sample clone detection reports at the website, including one for C#.

ReSharper does this automagically: it suggests when it thinks code should be extracted into a method, and will do the extraction for you.

Check out PMD. Once you have configured it (which is fairly simple), you can run its copy-paste detector to find duplicate code.

Someone with some Office skills can run through the following sequence in about a minute:
run an ordinary formatter to unify the code style, preferably without line wrapping
paste the code text into Microsoft Excel as a single column
search and replace all double spaces with single ones, and do any other normalizing replacements
sort the column
At this point duplicated lines already sit next to each other and are easy to spot. To go further:
add a comparison formula to the 2nd column and a counter to the 3rd
copy and paste as values, sort again, and look at the most repetitive lines
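The same normalize, sort and count idea can also be scripted. The following is only a rough sketch in Python, not a replacement for a real clone detector: the function name `repeated_lines` and the `min_length` threshold are assumptions made for illustration, and it only finds repeated single lines, not duplicated blocks.

```python
import re
from collections import Counter


def repeated_lines(text, min_count=2, min_length=10):
    """Return (count, line) pairs for normalized lines that occur repeatedly."""
    counts = Counter()
    for raw in text.splitlines():
        # Collapse runs of whitespace, mirroring the search-and-replace step.
        line = re.sub(r"\s+", " ", raw).strip()
        # Skip blank and trivial lines such as a lone "}" (assumed threshold).
        if len(line) >= min_length:
            counts[line] += 1
    repeated = [(n, line) for line, n in counts.items() if n >= min_count]
    # Most repetitive first; ties are broken alphabetically for stable output.
    return sorted(repeated, key=lambda pair: (-pair[0], pair[1]))
```

Reading a source file and passing its text to `repeated_lines` gives a list that starts with the most frequently repeated line, which corresponds to the final sorted Excel column.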

There is an analysis tool, called Simian, which I haven't yet tried. Supposedly it can be run on any kind of text and point out duplicated items. It can be used via a command line interface.

Another option similar to those above, but with a different tool chain: https://www.npmjs.com/package/jscpd


Extracting strings for translation from VB6 code

I have a legacy VB application that still has some life in it, and I want to translate it to another language.
I plan to write a Ruby script, possibly utilising a parser, to extract all strings from the three million lines of source, replace them with constants, and move them to a string resource file that can be used to provide translations.
Is anyone aware of a script/library that could be used to intelligently extract the strings?
I'm not aware of any existing off-the-shelf tool that you could use. We created a tool like this at my work and it worked well. The FRM file format is quite simple (although only briefly documented). We wrote a tool that (1) extracted all strings from control definitions and (2) generated the code to reload them at runtime during Form_Load.
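As a starting point for the extraction step, here is a minimal sketch of a string-literal scanner. It is written in Python rather than the Ruby the question mentions, and it is not the tool described in the answer above. It assumes VB6's usual lexical rules: literals are delimited by double quotes, a doubled quote inside a literal stands for one quote, and an apostrophe outside a literal starts a comment. `Rem` comments and line continuations are not handled, and the name `extract_vb_strings` is made up for this example.

```python
def extract_vb_strings(source):
    """Return (line_number, literal_text) for each string literal found."""
    results = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        i = 0
        while i < len(line):
            ch = line[i]
            if ch == "'":
                # An apostrophe outside a string starts a comment.
                break
            if ch == '"':
                i += 1
                chars = []
                while i < len(line):
                    if line[i] == '"':
                        if i + 1 < len(line) and line[i + 1] == '"':
                            chars.append('"')  # "" is an escaped quote
                            i += 2
                            continue
                        break  # closing quote
                    chars.append(line[i])
                    i += 1
                results.append((lineno, "".join(chars)))
            i += 1
    return results
```

Each result carries its line number, so a later pass could replace the literal with a generated constant and write the text to a resource file.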

How to find foreign language used in "C comments"

I have a large code base where most of the documentation and source code comments are in English. But one of the minor contributors wrote comments in a different language, spread across various places.
Is there a simple trick that will let me find them? I imagine first a way to extract all comments from the code and generate a single text file (possibly with source file and line number info), then pipe this through some language detection app.
If that matters, I'm on Linux and the current compiler on this project is Clang.
The only thing that comes to mind is to go through all of the code manually and check it yourself. If it's a similar language that doesn't contain foreign letters, consider using something with a spellchecker. That way, text that isn't recognized gets underlined and is easy to spot.
Other than that, I don't see an easy way to go through with this.
You could write a program that reads the files and prints only the comments to another output file, which you then spell check, but this would seem to be a waste of time, as you would easily be able to spot the comments yourself.
If you do make a program for that, however, keep in mind that there are three things to check for:
If comment starts with /*, make sure it stops reading when encountering */
If comment starts with //, only read one line - unless:
If line starting with // ends with \, read next line as well
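The three rules above can be sketched as a small scanner. This is an illustrative Python version under simplifying assumptions: it also skips string and character literals so that a `//` inside a string is not mistaken for a comment, it does not handle trigraphs or CRLF line endings, and the names `extract_comments` and `non_ascii_comments` are invented here. The second function only flags comments containing non-ASCII characters, which helps when the other language uses accented or non-Latin letters and does nothing otherwise.

```python
def extract_comments(source):
    """Return (start_line, comment_text) pairs found in C source text."""
    comments = []
    i, line, n = 0, 1, len(source)
    while i < n:
        ch = source[i]
        two = source[i:i + 2]
        if two == "/*":
            # Block comment: read until the closing */ (or end of input).
            end = source.find("*/", i + 2)
            end = n if end == -1 else end
            comments.append((line, source[i + 2:end].strip()))
            line += source.count("\n", i, end)
            i = end + 2
        elif two == "//":
            # Line comment: one line, unless it ends with a backslash.
            start_line = line
            j = i + 2
            while j < n:
                if source[j] == "\n":
                    if source[j - 1] == "\\":
                        line += 1
                        j += 1
                        continue
                    break
                j += 1
            comments.append((start_line, source[i + 2:j].strip()))
            i = j
        elif ch in "\"'":
            # Skip string and character literals, honoring backslash escapes.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            line += source.count("\n", i, j)
            i = j + 1
        else:
            if ch == "\n":
                line += 1
            i += 1
    return comments


def non_ascii_comments(comments):
    """Keep only comments that contain at least one non-ASCII character."""
    return [c for c in comments if not c[1].isascii()]
```

The extracted comment text could equally be written to a file and run through a spellchecker, as suggested earlier in this answer.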
While it is possible to detect a language from a string automatically, you need way more words than fit in a usual comment to do so.
Solution: Use your own eyes and your own brain...

Tips on writing a duplicate file finder in F#

I am new to programming and F# is my first .NET language as well as my first functional language. As a beginner's project, I would like to try implementing my own duplicate file finder, and I am looking for some tips on the F# tools that are relevant to my project. I apologise in advance if my question doesn't meet StackOverflow's standards: I will gladly make changes to it as required.
Here is the rough idea I have come up with: I will retrieve all files from a desired folder, read the file contents into byte arrays, and then use a hash table to store the byte arrays and remove duplicates. Can more experienced programmers tell me whether this is a good approach? What improvements can I make? Additionally, as asked earlier, what are the relevant F# tools to look at? MSDN has a huge list of libraries, namespaces, etc., and it is really overwhelming for a newbie like me.
Thank you warmly in advance for your help!
I'd recommend starting with a console application.
There are a couple of relevant .NET APIs:
Directory.GetFiles returns an easy-to-use array of all file paths but blocks until all files are found, whereas Directory.EnumerateFiles lets you enumerate the files one by one and give feedback to the user.
For performance when finding duplicates, the file length can be used to find potential duplicates before comparing the data. Here you could use the Length property of System.IO.FileInfo.
If you create a sequence of tuples of file name and file length, you could use Seq.groupBy to group potential matches. Finally, for groups of 2 or more, you can read the files and compare the bytes to find exact duplicates.
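To make the two-stage idea concrete, here is a small sketch of the same algorithm in Python rather than F#, shown only for its structure: group files by length first, then compare contents only within groups of two or more. The function name `find_duplicates` is an assumption, and reading whole files into memory is a simplification; for large files, comparing in chunks or hashing would be the more careful choice.

```python
import os
from collections import defaultdict


def find_duplicates(root):
    """Return sorted groups of paths under root with identical contents."""
    # Stage 1: group by file length, which is cheap to obtain.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Stage 2: only files sharing a length can be duplicates,
    # so read and compare bytes within those groups alone.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_content = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_content[f.read()].append(path)
        duplicates.extend(sorted(g) for g in by_content.values() if len(g) > 1)
    return sorted(duplicates)
```

In F#, the first stage corresponds to the Seq.groupBy call on the file length, and the second stage to the byte comparison within each group.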

How to split a large CSV file into multiple files in Go?

I am a novice Go programmer trying to learn the language's features. I want to split a large CSV file into multiple files in Go, with each file containing the header. How do I do this? I have searched everywhere but couldn't find the right solution. Any help in this regard will be greatly appreciated.
Depending on your shell fu, this problem might be better suited to common shell utilities, but you specifically mentioned Go.
Let's think through the problem.
How big is this CSV file? Are we talking 100 lines, or is it 5 GB?
If it's smallish I typically use this:
However, this package also exists:
Regardless - let's return to the abstraction of the problem. You have a header (which is the first line) and then the rest of the document.
So what we probably want to do (if ignoring csv for the moment) is to read in our file.
Then we want to split the file body by all the newlines in it.
You can use this to do so:
You didn't mention it, but do you know how many files you want to split into, or would you rather split by line count or byte count? What's the actual limitation here?
Generally it's not going to be file count but if we pretend it is we simply want to divide our line count by our expected file count to give lines/file.
Now we can take slices of the appropriate size and write the file back out via:
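The steps described so far (read the file, keep the header, divide the row count by the file count, write slices back out) can be sketched as follows. This is not the snippet the answer referred to, and it is in Python rather than Go, purely to show the shape of the logic. The function name `split_csv`, the `prefix` parameter and the output naming scheme are assumptions. It loads the whole file into memory, so it only suits the "smallish" case; a multi-gigabyte file would need to be streamed row by row instead.

```python
import csv


def split_csv(path, file_count, prefix="part"):
    """Split a CSV into up to file_count files, repeating the header in each."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Ceiling division: rows per file so that file_count files cover the body.
    rows_per_file = max(1, -(-len(body) // file_count))
    written = []
    for n, start in enumerate(range(0, len(body), rows_per_file), start=1):
        out_path = f"{prefix}_{n}.csv"
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)  # every part starts with the header
            writer.writerows(body[start:start + rows_per_file])
        written.append(out_path)
    return written
```

Using a CSV reader instead of splitting on newlines keeps quoted fields that contain line breaks intact, which a plain line split would cut in half.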
A trick I sometimes use to help me think through these things is to write down our mission statement.
"I want to split a large csv file into multiple files in go"
Then I start breaking that up into pieces but take the divide/conquer approach - don't try to solve the entire problem in one go - just break it up to where you can think about it.
Also - make gratuitous use of pseudo-code until you can comfortably write the real code itself. Sometimes it helps to just write a short comment inline with how you think the code should flow and then get it down to the smallest portion that you can code and work from there.
By the way - many of the golang.org packages have example links where you can literally run in your browser the example code and cut/paste that to your own local environment.
Also, I know I'll catch some haters with this - but as for books - imo - you are going to learn a lot faster just by trying to get things working rather than reading. Action trumps passivity always. Don't be afraid to fail.
Here is a package that might help. You can set the desired chunk size in bytes, and the file will be split into the appropriate number of chunks.

DUnit Compare Two Text Files and show Diff

Is there a way to compare two text files and show the diff if they are not identical in DUnit?
The easy start is to read them into a TStringList; however, the code for comparing two text files is much more complicated, and the GUI in DUnitGui is not sufficient for this.
Any ideas or suggestions?
There is a nice little unit called TDiff that comes with some examples. It is available from http://angusj.com/delphi/ and lets you compare two files and see the differences; it also allows for merging.
It is a very simple utility, and you can download the entire source.
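The general pattern, independent of Delphi, is to compute a textual diff and put it into the test's failure message, so the test runner only needs to display a string. The sketch below illustrates that pattern in Python with the standard `difflib` module. It is not DUnit or TDiff code, and the helper name `text_diff` is an assumption for this example.

```python
import difflib


def text_diff(expected_path, actual_path):
    """Return a unified diff of two text files, or "" if they are identical."""
    with open(expected_path, encoding="utf-8") as f:
        expected = f.read().splitlines()
    with open(actual_path, encoding="utf-8") as f:
        actual = f.read().splitlines()
    diff = difflib.unified_diff(
        expected, actual,
        fromfile=expected_path, tofile=actual_path,
        lineterm="",
    )
    return "\n".join(diff)
```

A test would then assert that the returned string is empty and pass the string itself as the failure message, so the differing lines show up in the test report.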
