Javascript source code analysis ( specifically duplication checking ) - refactoring

Partial duplicate of this
Notes:
I already use JSLint extensively via a tool I wrote that scans in intervals my current project directory for recently updated/created .js files. It's drastically improved productivity for me and I doubt there is anything as good as JSLint for the price (it's free).
That said, is there any analysis tool out there that can find repetitive or near-duplicate code blocks, the goal being to make it easier to find opportunities to consolidate large files or small/medium sized projects?

May not be exactly what your after, but Google's Javascript optimizer is worth a look.

Our CloneDR is a tool for finding exact and near-miss cloned code blocks for a variety of languages. It will find duplicates in the same file or across literally thousands of files if you have them. You don't have to provide it with any guidance; it can find the cloned code by itself. And it won't be fooled by different indentation or line breaks, or even consistent renaming of identifiers
It does support JavaScript, even if it isn't clear from the website.
You can see sample clone reports for a variety of langauges at the website.

You may want to have a look at jsinspect.
jsinspect ./src
It will print a list of code blocks that are either identical or structurally very similar.
And there's also jscpd.

Related

Coverity analysis: ignore 3rd party libraries

In a large C++ project Coverity analysis reports issues in files that we won't be fixing e.g. Boost libraries, STL headers, some 3rd party libraries etc.
Ideally there would be a mechanism to completely ignore these and not to increment the total count for such issues.
In Coverity Connect (v8.1) we've set up Components with file path regexp and that nicely filters the files in question when browsing but the total number of issues does not drop down. Two questions related to this:
is there a way to drop the number of total issues for files we don't care about? e.g. after such an issue has already been captured
if new code we introduce includes one of the offending boost/STL/etc headers, will this clock up the total issue counter? (clearly, that would be less than desirable)
Mandatory disclaimer first: Your customers won’t care that bugs in your code came from a third party. That said, the main answer at the link Yannis mentioned is generally the correct one: “use a component filter.” If it’s not working correctly for you, double-check your configuration. I found it quite robust, even with a negative look-ahead regex with over a hundred disjuncts.
Once such issue is found, you can mark it as false positive or ignore it all the way. You have to do this only once. In future analyze, when this issue is found again, it will keep this status. And no, if you use this include to other files too, the total issue counter won't go higher as long as the issue is in the same file.
Check this:
Can Synopsys Static Analysis (Coverity) automatically ignore issues in third-party or noncritical code?

Extracting the serialized data from unknown files

My dearest stackoverflowers,
I want to access the serialized data contained in files with strange, to me, extensions. The bulk of the data seems to be in a .st and an .idt file.
The program is meant to be run on Windows, and the unix file command gives me only false positives. Any ideas on either what these extensions mean or on how to investigate and extract their contents?
Below I provide the entirety of the extensions in a long list in hope somebody recognizes them. Googling also gives me false positives. For example: .st is commonly used for ATARI emulation files.
Thanks in advance!
.cix
.cmp
.cnt
.dam
.das
.drf
.idt
.irc
.lxp
.mp
.mbr
.str
.vlf
.rpf
.st
.st
Some general advice on how to approach this:
One way to approach this is to use a site like http://filext.com/ to try to figure out where the files came from. This can be tough, because it's not like there's a file extension standard anywhere - anyone can use any extension, so you're going to have a lot of conflicts/disambiguation issues to solve.
Sometimes you can get lucky, and if you open up the files in a plain text editor you can occasionally see plain string data that is readable, which can help identify the general sort of data contained in a file, and therefore help cut down on the possible number of sources for a file. For example, I have often helped people who received a file as an email attachment with no extension, figure out what file type it was using this technique, adding the file extension, and then opening it in the appropriate program.
There are also sites like http://www.oldversion.com/ that keep old versions of programs that you (typcially) can download for free. This is especially helpful if the data you're working with was created 5+ years ago, in a proprietary program, and that program is no longer available/purchasable from the vendor who created it.
Once you have a good idea of what files belong to what programs, then you're probably going to spend a lot of time trying to find online resources for what the structure of the files are. If that isn't available, you can get a copy of the original program, but either the program won't open the files you're interested in or you still want raw access to the data, then try generating some sample output files with data that you input, and go Rosetta Stone on it, comparing your known file to the original file.
From there, the additional knowledge you'll probably want, is to try to find out what language/compiler the software was written in, which can give you a lead on what code libraries were used to serialize the data in the first place. Once you know all that, then it's matter of reading through any available documentation on the serialization process, and then writing a deserializer.
The one thing this technique won't solve is, if you're dealing with corrupt/truncated data files, it may be very difficult to tell the difference between that and whether or not you have the file structure correct. The "Rosetta Stone" technique might be helpful in that case.
Depending on how many different pieces of source software you're talking about, sounds like a pretty big project. Good luck!

How can I quickly search my code using Windows?

I've got the same problem as in this question, except in Windows. Our product has a 100+ MB code base, and searching for stuff in there takes an awful amount of time (several minutes). It's nice when you can narrow your search to a specific subfolder, but that isn't always possible.
I was wondering if there is some tool that would make it faster, probably by indexing. Accuracy is paramount, if a substring exists somewhere, it must be found, even if the file is not indexed or the index is out of date. Also it would be ideal if .svn folders would be ignored when searching.
Failing that, I was wondering if I could make something like that myself. Is there maybe a ready made indexing engine available for such tasks? I was wondering about Windows Indexing Service (or whatever it is called these days), but so far my experience with it (the Windows standard file search facility) has been rather dismal, with it often missing files that were right in front of its nose.
Yes, I have seen Window Indexing service miss files too, but I haven't checked KBs or user forums for explanations. I'm glad to see it confirmed that it's not just me ;-)!
There look to be alot of file index programs available, I would be surprised if you can't find one that meets your needs (although, see later).
Here are some things to consider:
If your team is using an IDE, isn't there an index feature/plug-in? (none of the SVNs provide Indexing capabilites?). Also, add some tags to your question so this will be seen by other windows developers using the same dev enviorment that you are using.
The SO link you provided mentions several options: slocate, rlocate, and I found mlocate. The wikipedia page for slocate says
Locate32 for Windows Windows analog of GNU locate with GUI, released under GNU license
which seems to meet your main requirement. Looking at the screen shots with the multi-tab interface (one labeled advanced) would give me hope that you can exclude svn (at least from results, possibly from what is indexed).
Your requirement for
if a substring exists somewhere, it
must be found, even if the file is not
indexed or the index is out of date.
seems contradictory. For the substring requirement, I can see many indexing programs ignore c lang syntax elements ( {([])}, etc), and, for example, 'then' is either removed because it is considered a noise word, or that it gets stemmed-down to 'the' and THEN is removed because it is noise word.
To get to 'must be found', and really be sure, you would have to develop a test suite to see what the index program is doing for anything that is corner case. (For a 100 MB code base, not out of the question, especially since you are considering rolling your own).
Finally 'even if the file is not indexed ...'. Well, you either use an index or your don't (obviously). Unfortunately, for your requirement, while rlocate is looking for changes all the time, slocate (on Unix) doesn't seem to. Probably if you read/check on the docs or user forums for locate32 you'll get the answers you need.
Rlocate would give you what you need, but from an rlocate page 'rlocate will work only on Linux with version 2.6.'. mlocate doesn't seem to be have a Windows port either only.
Finally here is a link I found that is interesting about mlocate : mlocate vs rlocate. This is the google cache, because the redhat.com said 'not available'.

Locating OEPs in Packed EXE Files

Are there any general rules on how to realiably locate OEPs (Original Entry Points) for packed .exe files, please? What OEP clues are there to search for in debugged assembly language?
Say there is a Windows .exe file packed with PC-Guard 5.06.0400 and I wish to unpack it. Therefore, the key condition is finding the OEP within the freshly extracted block of code.
I would use the common debugger OllyDBG to do that.
In the general case - no way. It highly depends on packer. In the most common case packer may replace some code from OEP by some other code.
This depends solely on the packer and the algorithms its using pack and/or virtualize code. Seeing as you are using ollydbg, i'd suggest checking out tuts4you, woodmanns and openrce, they have many plugins (iirc there is one designed for finding oep's in obfuscated code, but i have no clue how well it performs) and olly scripts for dealing with unpacking various packers (from which you may be able to pick up hints for a certain type of packer), they also have quite a few papers/tutorials on the subject as well, which may or may not be of use.
PC Guard doesn't seem to get much attention, but the video link and info here should be of help (praise be to Google cache!)
It's hard to point out any simple strategy and claim that it will work in general, because the business of packer tools is to make OEP finding a very hard problem. Besides, with a good packer, finding the OEP is still not enough. That being said, I do have some suggestions.
I would suggest that you read this paper on the Justin unpacker, they use heuristics that were reasonably effective at the time, and that you might be able to get some mileage from. They will at least reduce the number of candidate entry points to a manageable number:
A study of the packer problem and its solutions (2008)
by Fanglu Guo , Peter Ferrie , Tzi-cker Chiueh
There are also some web-analysis pages that can tell you a lot about your packed program. For example, the malware analyzer at:
http://eureka.cyber-ta.org/
Here's another one that is currently down, but has done a reasonable job in the past, and I presume will be up again soon):
http://bitblaze.cs.berkeley.edu/renovo.html

the best or speedest way to understand uncommented and complex project [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
I have complex project without comments. The project is programed in Java but have more than one main class, use several .txt files like a template and use several .bat files. I don't know where to start and how to start discovering the project, because I need to make some changes in that project.
As with others I say this is a slow process.
However having done this in the past many times, this is my methodology:
Identify as many requirements that the code fulfils. This may give you the some reasons as to why certain things are the way they are when you look deeper. A common way of finding these is look for any tests that be available. The automated ones are best, but usually they're as missing as the comments.
Find the entry points to the code. These will give you places where you can poke the code to see how different inputs affect the flow. Common entry points are Code Loading 'Main' type functions, service interfaces, web page post backs etc..
Diagram the code. Look for tools that can build black/white box pictures of the code. For me this invaluable. I have on occasion printed out large listings and attacked then with markers and rulers. You're aim to create your own flow chart (mental or other wise) of the code flow.
Using the above (iteratively) build a set of outputs to the code that you think should occur, and add to these the outputs you may already know about such as logs, data files, database writes etc..
Finally if you have time, create some manual tests though preferably in automated test harnesses to verify the above. This where I start to involve the debugger to see detail in the code.
This methodology usually gives me confidence to make changes.
Note this is iterative process and can be done with portions of the code or overall as you see fit. I usually favour a top down approach to start with and then as I gain some insight I drill down till details become overwhelming and then I repeat. However this is just because my mind works in this way - you may be different. Good luck.
Find the main Main class. The starting point.
Start drawing a picture of the classes and the objects they own and the external entities they reference.
Follow all the branches until you can find a logical ending.
I've used UML reverse engineering tools in the past and while a visual picture is good, stepping through the code has always been the hardest and yet best methodology for me.
And, as you step through each piece you can add in your own comments..
I usually start with doxygen, turning on every extracting option (especially EXTRACT_ALL and EXTRACT_PRIVATE), and enable the SOURCE_BROWSER, HAVE_DOT, CALL_GRAPH and CALLER_GRAPH options (you also need to have dot installed). This gives good view of the software. For every function the calls are displayed and linked in a graph, also the sources are linked from there.
While doxygen is intended for C and C++, it also works with Java sources (set the OPTIMIZE_OUTPUT_JAVA option).
Auch. I'm afraid there is no speedy way to do this. Comment out a line (or two) -> test -> see what breaks. You could also put break statements here and there and run the debugger. That should give you some indication how you got there (ie. what the hierarchy between the classes is).
Hopefully the original developers used some patterns that you can recognize and make notes. Make lots of notes of everything. Start by trying to understand the high level structure and work down from there.
Be prepared to spend endless hours not understanding what the thing is doing.
Speak to the client and try to understand what the project is for, and what are all the things that it does. Someone somewhere had to put in some requirements for the stuff that's in there, if only in an email.
I would try to find the first entry point in the code closest to where you suspect you'll need to start making your changes, set a breakpoint, and start debugging. Check out the contents of local variables and work your way deeper as you get to become familiar with whats going on. Then, when you have a basic understanding of the area of code you're going to be working with, start fiddling with some small changes. Test your understanding of it. Try diagramming what you see happening. If you can do that confidently, you'll be able to decide if you need to go back and continue learning more about the code, or if you know enough to get done what you need to get done.
A start is to use an automated uml modeling tool (if you use eclipse you can use a plug-gin), and start creating UML diagrams of the various classes to see how they are related in a high level and visualize the code. This has helped many times
If there are log files being generated, have a look at it to understand the flow from the starting point (main class). Otherwise, put debug statements to understand the flow.
Ya, that sounds like a pretty bad spot to be in.
I would say that the best way is to just walk through the program line for line. Try to grasp the big picture in the code, and write alot of notes, both on paper and in comments in the code.
I would say, a good approach would be to generate documentation using javadoc or doxygen's class diagram feature, then as you run the code traverse through the class diagrams generated using doxygen and see who calls what. This works wonderfully for me everytime i am working on such a project.
I completely agree to most of the answers posted.
I can add to use a development tool that reverse engineering the code and create a class diagram, to have an overall picture of what is involved.
Then you need patience. But you will be a stronger and smarter developer when you'll get through...
Good luck!
One of the best and first things to do is to try to build and run the code. It might sound a bit simplistic but the problem when you take over undocumented code is that you can't even build it and run it. When have no clue were to start.

Resources