Clone detection algorithm

I'm writing an algorithm that detects clones in source code. E.g. if there is a block like:
for (int i = o; i < 5; i++) {
    doSomething(abc);
}
...and if this block is repeated somewhere else in the source code it will be detected as a clone. The method I am using at the moment is to create hashes for lines/blocks and compare them with hashes of other lines/blocks in the same source to see if there are any matches.
Now, if the same block as above were repeated somewhere with only the argument of doSomething changed, it would not be detected as a clone, even though it would look very much like a clone to you and me. My algorithm detects exact matches but doesn't detect matching blocks where only an argument differs.
Could anyone suggest any ways of getting around this issue? Thanks!

Here's a super-simple way, which might go too far in erasing information (i.e., might produce too many false positives): replace every identifier that isn't a keyword with some fixed name. So you'd get
for (int DUMMY = DUMMY; DUMMY < 5; DUMMY++) {
    DUMMY(DUMMY);
}
(assuming you really meant o rather than 0 in the initialization part of the for-loop).
If you get a huge number of false positives with this, you could then post-process them by, for instance, looking to see what fraction of the DUMMYs actually correspond to the same identifier in both halves of the match, or at least to identifiers that are consistent between the two.
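For illustration, here's a minimal Python sketch of that normalize-then-hash idea (the regex "tokenizer" and the tiny keyword list are simplifications I'm assuming, not a real lexer):

import re
import hashlib

KEYWORDS = {"for", "int", "if", "else", "while", "return"}  # abbreviated set

def normalize(line):
    # Replace every identifier that isn't a keyword with a fixed name.
    # A real implementation would also canonicalize whitespace and literals.
    return re.sub(r"[A-Za-z_]\w*",
                  lambda m: m.group(0) if m.group(0) in KEYWORDS else "DUMMY",
                  line)

def block_hash(lines):
    return hashlib.sha1("\n".join(normalize(l) for l in lines).encode()).hexdigest()

# Two blocks differing only in identifiers now hash identically:
a = ["for(int i = o; i <5; i++){", "doSomething(abc);", "}"]
b = ["for(int j = p; j <5; j++){", "doSomethingElse(xyz);", "}"]
print(block_hash(a) == block_hash(b))  # True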
To do much better you'll probably need to parse the code to some extent. That would be a lot more work.

Well, if you're going to do something else then you're going to have to parse the code at least a bit. For example, you could detect methods and then ignore the method arguments in your hash. In any case, I think it's always true that you need your program to understand the code better than 'just text blocks', and that can get awfully complicated.

Related

Getting the last element of a lazy Seq in Raku

I want to get the last element of a lazy but finite Seq in Raku, e.g.:
my $s = lazy gather for ^10 { take $_ };
The following don't work:
say $s[* - 1];
say $s.tail;
These ones work but don't seem too idiomatic:
say (for $s<> { $_ }).tail;
say (for $s<> { $_ })[* - 1];
What is the most idiomatic way of doing this while keeping the original Seq lazy?
What you're asking about ("gett[ing] the last element of a lazy but finite Seq … while keeping the original Seq lazy") isn't possible. I don't mean that it's not possible with Raku – I mean that, in principle, it's not possible for any language that defines "laziness" the way Raku does with, for example, the is-lazy method.
In particular, when a Seq is lazy in Raku, that "means that [the Seq's] values are computed on demand and stored for later use." Additionally, one of the defining features of a lazy iterable is that it cannot know its own length while remaining lazy – that's why calling .elems on a lazy iterable throws an error:
my $s = lazy gather for ^10 { take $_ };
say $s.is-lazy; # OUTPUT: «True»
$s.elems; # THROWS: «Cannot .elems a lazy list onto a Seq»
Now, at this point, you might reasonably be thinking "well, maybe Raku doesn't know how long $s is, but I can tell that it has exactly 10 elements in it." And you're not wrong – with that code, $s is indeed guaranteed to have 10 elements. This means that, if you want to get the tenth (last) element of $s, you can do so with $s[9]. And accessing $s's tenth element like that won't change the fact that $s.is-lazy.
But, importantly, you can only do so because you know something "extra" about $s, and that extra info undoes a good chunk of the reason you might want a list to be lazy in practice.
To see what I mean, consider a very similar Seq:
my $s2 = lazy gather for ^10 { last if rand > .95; take $_ };
say $s2.is-lazy; # OUTPUT: «True»
Now, $s2 probably has 10 elements, but it might not – the only way to know is to iterate through it and find out. In turn, this means $s2[9] does not jump to the tenth element the way $s[9] did; it iterates through $s2 just like you'd need to. And, as a result, if you run $s2[9], then $s2 will no longer be lazy (i.e., $s2.is-lazy will return False).
And this is, in effect, what you did in the code in your question:
my $s = lazy gather for ^10 { take $_ };
say $s.is-lazy; # OUTPUT: «True»
say (for $s<> { $_ }).tail; # OUTPUT: «9»
say $s.is-lazy; # OUTPUT: «False»
Because Raku cannot ever know that it has reached the tail of a lazy Seq, the only way it could tell you the .tail is to fully iterate $s. And that necessarily means that $s is no longer lazy.
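If it helps, here's a rough Python analogy (generators are the nearest everyday counterpart to a lazy Seq; this illustrates the principle, it is not Raku semantics):

gen = (n for n in range(10))  # values are produced on demand
# len(gen) raises TypeError -- like .elems on a lazy Seq, the length is unknowable
last = None
for last in gen:              # the only way to find the tail is to iterate...
    pass
print(last)                   # 9 -- but gen is now exhausted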
Two complications
It's worth mentioning two adjacent topics that aren't strictly relevant but are close enough that they trip some people up.
First, nothing I've said about lazy iterables not knowing their length precludes some non-lazy iterables from knowing their length. Indeed, a decent number of Raku types do both the Iterator role and the PredictiveIterator role – and the main point of a PredictiveIterator is that it does know how many elements it can produce without needing to produce/iterate them. But PredictiveIterators cannot be lazy.
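As a loose analogy (mine, not the Raku docs'), Python's range is in this second camp: it knows its length without producing a single element, and asking for its tail consumes nothing:

r = range(10**12)
print(len(r))   # instant: the length is computed, not counted out
print(r[-1])    # 999999999999 -- the "tail" without any iteration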
The second potentially confusing topic is closely related to the first: while no PredictiveIterator can be lazy (that is, none will ever have an .is-lazy method that returns True), some PredictiveIterators have behavior that is very similar to laziness – and, in fact, may even be colloquially referred to as "lazy".
I can't do a great job explaining this distinction because, quite honestly, I don't fully understand it myself. But I can give you an example: the .lines method on an IO::Handle. It's certainly the case that reading the lines of a huge file behaves a lot like dealing with a lazy iterable: most obviously, you can process each line without ever having the whole file in memory. And the docs even say that "lines are read lazily" with the .lines method.
On the other hand:
my $l = 'some-file-with-100_000-lines.txt'.IO.lines;
say $l.is-lazy; # OUTPUT: «False»
say $l.iterator ~~ PredictiveIterator; # OUTPUT: «True»
say $l.elems; # OUTPUT: «100000»
So I'm not quite sure whether it's fair to say that $l "is a lazy iterable", but if it is, it's "lazy" in a different way than $s was.
I realize that was a lot, but I hope it is helpful. If you have a more specific use case in mind for laziness (I bet it wasn't gathering the numbers from zero to nine!), I'd be happy to address that more specifically. And if anyone else can fill in some of the details with .lines and other lazy-not-lazy PredictiveIterators, I'd really appreciate it!
Drop the lazy
Lazy sequences in Raku are designed to work well as is. You don't need to emphasize they're lazy by adding an explicit lazy.
If you add an explicit lazy, Raku interprets that as a request to block operations such as .tail because they will almost certainly immediately render laziness moot, and, if called on an infinite sequence, or even just a sufficiently large one, hang or OOM the program.
So, either drop the lazy, or don't invoke operations like .tail that will be blocked if you do.
Expanded version of my original answer
As noted by @ugexe, the idiomatic solution is to drop the lazy.
Quoting my answer to the SO About Laziness:
if a gather is asked if it's lazy, it returns False.
Aiui, something like the following applies:
Some lazy sequence producers may be actually or effectively infinite. If so, calling .tail etc on them will hang the calling program. Conversely, other lazy sequences perform fine when all their values are consumed in one go. How should Raku distinguish between these two scenarios?
A decision was made in 2015 to let value producing datatypes emphasize or deemphasize their laziness via their response to an .is-lazy call.
Returning True signals that a sequence is not only lazy but wants to be known to be lazy by consuming code that calls .is-lazy. (Not so much end-user code but built-in consuming features such as @ sigilled variables handling an assignment and trying to determine whether or not to assign eagerly.) Built-in consuming features take a True as a signal that they ought to block calls like .tail. If a dev knows this is overly conservative, they can add an eager (or remove an unneeded lazy).
Conversely, a datatype, or even a particular object instance, may return False to signal that it does not want to be considered lazy. This may be because the actual behaviour of a particular datatype or instance is eager, but it might instead be that it is lazy technically, but doesn't want a consumer to block operations such as .tail because it knows they will not be harmful, or at least prefers to have that be the default presumption. If a dev knows better (because, say, it hangs the program), or at least does not want to block potentially problematic operations, they can add a lazy (or remove an unneeded eager).
I think this approach works well, but the doc and error messages mentioning "lazy" may not have caught up with the shift made in 2015. So:
If you've been confused by some doc about laziness, please search for doc issues with "lazy" in them, or "laziness", and add comments to existing issues, or file a new doc issue (perhaps linking to this SO answer).
If you've been confused by a Rakudo error message mentioning laziness, please search for Rakudo issues with "lazy" in them, and tagged [LTA] (which means "Less Than Awesome"), and add comments, or file a new Rakudo issue (with an [LTA] tag, and perhaps a link to this SO answer).
Further discussion
the docs ... say “If you want to force lazy evaluation use the lazy subroutine or method. Binding to a scalar or sigilless container will also force laziness.”
Yes. Aiui this is correct.
[which] sounds like it implies “my $x := lazy gather { ... } is the same as my $x := gather { ... }”.
No.
An explicit lazy statement prefix or method adds emphasis to laziness, and Raku interprets that to mean it ought block operations like .tail in case they hang the program.
In contrast, binding to a variable alters neither emphasis nor deemphasis of laziness, merely relaying onward whatever the bound producer datatype/instance has chosen to convey via .is-lazy.
not only in connection with gather but elsewhere as well
Yes. It's about the result of .is-lazy:
my $x = (1, { .say; $_ + 1 } ... 1000);
my $y = lazy (1, { .say; $_ + 1 } ... 1000);
both act lazily ... but $x.tail is possible while $y.tail is not.
Yes.
An explicit lazy statement prefix or method forces the answer to .is-lazy to be True. This signals to a consumer that cares about the dangers of laziness that it should become cautious (eg rejecting .tail etc.).
(Conversely, an eager statement prefix or method can be used to force the answer to .is-lazy to be False, making timid consumers accept .tail etc calls.)
I take from this that there are two kinds of laziness in Raku, and one has to be careful to see which one is being used where.
It's two kinds of what I'll call consumption guidance:
Don't-tail-me: if an object returns True from an .is-lazy call, then it is treated as if it might be infinite. Thus operations like .tail are blocked.
You-can-tail-me: if an object returns False from an .is-lazy call, then operations like .tail are accepted.
It's not so much that there's a need to be careful about which of these two kinds is in play, but if one wants to call operations like tail, then one may need to enable that by inserting an eager or removing a lazy, and one must take responsibility for the consequences:
If the program hangs due to use of .tail, well, DIHWIDT.
If you suddenly consume all of a lazy sequence and haven't cached it, well, maybe you should cache it.
Etc.
What I would say is that the error messages and/or doc may well need to be improved.

Is there a better way to assign a Ruby hash while avoiding RuboCop's ABC Size warnings?

I have a method that builds a laptop's attributes, but only if the attributes are present within a row that is given to the method:
def build_laptop_attributes desk_id, row, laptop
  attributes = {}
  attributes[:desk_number] = room_id if laptop && desk_id
  attributes[:status] = row[:state].downcase if row[:state]
  attributes[:ip_address] = row[:ip_address] if row[:ip_address]
  attributes[:model] = row[:model] if row[:model]
  attributes
end
Currently, RuboCop is saying that Metrics/AbcSize is too high, and I was wondering if there is an obvious and clean way to assign these attributes.
Style Guides Provide "Best Practices"; Evaluate and Tune When Needed
First of all, RuboCop is advisory. Just because RuboCop complains about something doesn't mean it's wrong in some absolute sense; it just means you ought to expend a little more skull sweat (as you're doing) to see if what you're doing makes sense.
Secondly, you haven't provided a self-contained, executable example. That makes it impossible for SO readers to reliably refactor your code, since it can't be tested without the sample inputs and expected outputs missing from your original post. You'll need those things to evaluate and refactor your own code, too.
Finally, the ABC metric looks at assignments, branches, and conditionals. You have five assignments, four conditionals, and what looks like a method call. Is that a lot? If you haven't tuned RuboCop, the answer is "RuboCop thinks so." Whether or not you agree is up to you and your team.
If you want to try feeding RuboCop, you can do a couple of things that might help reduce the metric:
Refactor the volume and complexity of your assignments. Some possible examples include:
Replace your postfix if-statements with the safe navigation operator (&.) to guard against calling methods on nil.
Extract some of your branching logic and conditionals to methods that "do the right thing", potentially reducing your current method to a single assignment with four method calls. For example:
attributes = { desk_number: location, status: laptop_status, ... }
Replace all your multiple assignments with a deconstructing assignment (although Rubocop often complains about those, too).
Revisit whether you have the right data structure in the first place. Maybe you really just want an OpenStruct, or some other data object.
Your current code seems readable, so is the juice really worth the squeeze? If you decide that RuboCop is misguided in this particular case, and your code works and passes muster in your internal code reviews, then you can tune the metric's sensitivity in your project's .rubocop.yml or disable that particular metric for just that section of your source code.
After reading @Todd A. Jacobs's answer, you may (or may not) want to write something like this:
def build_laptop_attributes desk_id, row, laptop
  desk_number = room_id if laptop && desk_id
  {
    desk_number: desk_number,
    status: row[:state]&.downcase,
    ip_address: row[:ip_address],
    model: row[:model]
  }.compact
end
This has the advantage of reducing the number of calls to []=, as well as folding many ifs into a single compact.
In my opinion, it is more readable because it is more concise and because the emphasis is completely on the correspondence between your keys and values.
Alternative version to reduce the amount of conditionals (assuming you are checking for nil / initialized values):
def build_laptop_attributes desk_id, row, laptop
  attributes = {}
  attributes[:desk_number] = room_id if laptop && desk_id
  attributes[:status] = row[:state]&.downcase
  attributes[:ip_address] = row[:ip_address]
  attributes[:model] = row[:model]
  attributes.compact
end
The additional .compact is the cost of removing the assignment guards.

Unicode - the right thing to do

I'm working on something which processes UTF-8 encoding, and I found myself asking the question:
What should I do when I encounter a byte which can never occur inside a
UTF-8 encoded string?
i.e. 0b1111111X (0xFE or 0xFF)
For example, I'm writing a small snippet of code which looks at the current place in the stream of bytes, and tells you how many bytes are used to represent the code point at that place in the stream.
0b0XXXXXXX – just 1
0b10XXXXXX – oops, we're in a continuation byte; search back upstream to find the leading byte
0b11XXXXXX – count the number of leading 1s; that's the answer
0b1111111X – err, this is not possible in UTF-8!!! What to do!?!?
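A minimal Python sketch of that classification (the function name is mine, and note that modern UTF-8 additionally caps sequences at 4 bytes, which this sketch doesn't enforce):

def lead_byte_length(b: int) -> int:
    if b & 0x80 == 0x00:   # 0b0XXXXXXX: single ASCII byte
        return 1
    if b & 0xC0 == 0x80:   # 0b10XXXXXX: continuation byte, not a lead byte
        raise ValueError("continuation byte: search upstream for the lead byte")
    if b >= 0xFE:          # 0b1111111X: can never occur in UTF-8
        raise ValueError("0xFE/0xFF are not valid anywhere in UTF-8")
    n = 0                  # count the leading 1s: that's the answer
    while b & (0x80 >> n):
        n += 1
    return n

print(lead_byte_length(0x41))  # 1
print(lead_byte_length(0xE2))  # 3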
I'm thinking of returning an error value, but wondering if I should, as a side effect, replace it with some more predictable error glyph (I mean the code point representing said glyph, e.g. the replacement character U+FFFD). And later, when I do something more complicated, like jumping through the string, and find that a leading byte does not have the correct number of continuation bytes after it... I'm thinking I should "fix" that up too.
Is it standard practice to leave wrongly encoded strings broken, or to change them and make them be wrong but at least play nice?
The most common way is to just throw a meaningful error if the input is not correct and stop.
There are a lot of good reasons to do so:
speed: trying to fix errors often makes your function slower, even on correct inputs
simplicity: your code can become really complicated if you try to fix every error
maintainability and correctness: it's just easier to ensure the function works correctly when you stop whenever the input does not match the specification you are working with, since you then only have to check input against the specification
purpose: any time you get to a point like this, you have to ask: what is the purpose of my function? Why did I come up with the idea to write it?
Also: a function fixcode which fixes the UTF-8 could be used elsewhere as well, so it makes total sense to keep the fixing separate (the purpose, simplicity, and maintainability-and-correctness arguments again).
Even if you expect an error, I would prefer to separate encode and fixcode, since you can reuse fixcode in other contexts.
If you are really thinking about fixing the utf8 code while encoding I would use a pattern like this:
try {
    q = encode(s);
} catch (encodingerror) {
    log(encodingerror);
    t = fixcode(s);
    q = encode(t);
}
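For what it's worth, Python's standard library already embodies this encode/fix separation, which makes for a concrete sketch of the same pattern (the byte string is just an illustrative input):

data = b"\xfeAB"  # 0xFE can never appear in UTF-8
try:
    text = data.decode("utf-8")             # strict mode raises on bad input
except UnicodeDecodeError as err:
    print("log:", err)                      # report the meaningful error first
    text = data.decode("utf-8", "replace")  # then "fix": bad bytes become U+FFFD
print(text)  # '\ufffdAB'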

Is there a case where parameter validation may be considered redundant?

The first thing I do in a public method is to validate every single parameter before it gets any chance to be used, passed around, or referenced, and then throw an exception if any of them violates the contract. I've found this to be a very good practice, as it lets you catch the offender the moment the infraction is committed. But then, quite often, I write a very simple getter/indexer such as this:
private List<Item> m_items = ...;

public Item GetItemByIdx( int idx )
{
    if( (idx < 0) || (idx >= m_items.Count) )
    {
        throw new ArgumentOutOfRangeException( "idx", "Invalid index" );
    }
    return m_items[ idx ];
}
In this case the index parameter directly relates to the indexes in the list, and I know for a fact (from the documentation) that the list itself will do exactly the same check and throw the same exception. Should I remove this verification, or should I leave it alone?
I wanted to know what you guys think, as I'm now in the middle of refactoring a big project and I've found many cases like the above.
Thanks in advance.
It's not just a matter of taste. Consider:
if (!File.Exists(fileName)) throw new ArgumentException("...");
var s = File.OpenText(fileName);
This looks similar to your example but there are several reasons (concurrency, access rights) why the OpenText() method could still fail, even with a FileNotFound error. So the Exists-check is just giving a false feeling of security and control.
It is a mind-set thing, when you are writing the GetItemByIdx method it probably looks quite sensible. But if you look around in a random piece of code there are usually lots of assumptions you could check before proceeding. It's just not practical to check them all, over and over. We have to be selective.
So in a simple pass-along method like GetItemByIdx I would argue against redundant checks. But as soon as the function adds more functionality or if there is a very explicit specification that says something about idx that argument turns around.
As a rule of thumb an exception should be thrown when a well defined condition is broken and that condition is relevant at the current level. If the condition belongs to a lower level, then let that level handle it.
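A Python parallel to the question's indexer (my analogy, not the asker's C#): the container's own bounds check already raises a precise error at the appropriate level, so a duplicate pre-check adds nothing:

def get_item_by_idx(items, idx):
    # list indexing raises IndexError("list index out of range") on its own;
    # a manual range check here would merely duplicate that contract
    # (caveat: unlike the C# indexer, Python accepts negative indices)
    return items[idx]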
I would only do parameter verification where it would lead to some improvement in code behavior. Since you know, in this case, that the check will be performed by the List itself, then your own check is redundant and provides no extra value, so I wouldn't bother.
It's true that you've possibly duplicated work that's already done in the API, but it's there now. If your error handling framework works and is solid, and isn't causing performance issues (profile if in doubt), then I reckon leave it, and gradually phase it out if you have time. It doesn't sound like a top priority!

What are the pros and cons of putting as much logic as possible in a minimal (one-liner) piece of code?

Is it cool?
IMO, one-liners reduce readability and make debugging/understanding more difficult.
Maximize understandability of the code.
Sometimes that means putting (simple, easily understood) expressions on one line in order to get more code in a given amount of screen real-estate (i.e. the source code editor).
Other times that means taking small steps to make it obvious what the code means.
One-liners should be a side-effect, not a goal (nor something to be avoided).
If there is a simple way of expressing something in a single line of code, that's great. If it's just a case of stuffing in lots of expressions into a single line, that's not so good.
To explain what I mean - LINQ allows you to express quite complicated transformations in relative simplicity. That's great - but I wouldn't try to fit a huge LINQ expression onto a single line. For instance:
var query = from person in employees
            where person.Salary > 10000m
            orderby person.Name
            select new { person.Name, person.Department };
is more readable than:
var query = from person in employees where person.Salary > 10000m orderby person.Name select new { person.Name, person.Department };
It's also more readable than doing all the filtering, ordering and projection manually. It's a nice sweet spot.
Trying to be "clever" is rarely a good idea - but if you can express something simply and concisely, that's good.
One-liners, when used properly, transmit your intent clearly and make the structure of your code easier to grasp.
A python example is list comprehensions:
new_lst = [i for i in lst if some_condition]
instead of:
new_lst = []
for i in lst:
    if some_condition:
        new_lst.append(i)
This is a commonly used idiom that makes your code much more readable and compact. So, the best of both worlds can be achieved in certain cases.
This is by definition subjective, and due to the vagueness of the question, you'll likely get answers all over the map. Are you referring to a single physical line or logical line? EG, are you talking about:
int x = BigHonkinClassName.GetInstance().MyObjectProperty.PropertyX.IntValue.This.That.TheOther;
or
int x = BigHonkinClassName.GetInstance().
        MyObjectProperty.PropertyX.IntValue.
        This.That.TheOther;
One-liners, to me, are a matter of "what feels right." In the case above, I'd probably break that into both physical and logical lines, getting the instance of BigHonkinClassName, then pulling the full path to .TheOther. But that's just me. Other people will disagree. (And there's room for that. Like I said, subjective.)
Regarding readability, bear in mind that, for many languages, even "one-liners" can be broken out into multiple lines. If you have a long set of conditions for the conditional ternary operator (? :), for example, it might behoove you to break it into multiple physical lines for readability:
int x = (/* some long condition */) ?
/* some long method/property name returning an int */ :
/* some long method/property name returning an int */ ;
At the end of the day, the answer is always: "It depends." Some frameworks (such as many DAL generators, e.g. SubSonic) almost require obscenely long one-liners to get any real work done. Other times, breaking that into multiple lines is quite preferable.
Given concrete examples, the community can provide better, more practical advice.
In general, I definitely don't think you should ever "squeeze" a bunch of code onto a single physical line. That doesn't just hurt legibility, it smacks of someone who has outright disdain for the maintenance programmer. As I used to teach my students: always code for the maintenance programmer, because it will often be you.
:)
One-liners can be useful in some situations:
int value = condition ? 1 : 0;
But for the most part they make the code harder to follow. I think you should only put things on one line when it is easy to follow, the intent is clear, and it won't affect debugging.
One-liners should be treated on a case-by-case basis. Sometimes it can really hurt readability and a more verbose (read: easy-to-follow) version should be used.
There are times, however when a one-liner seems more natural. Take the following:
int Total = (Something ? 1 : 2)
+ (SomethingElse ? (AnotherThing ? x : y) : z);
Or the equivalent (slightly less readable?):
int Total = Something ? 1 : 2;
Total += SomethingElse ? (AnotherThing ? x : y) : z;
IMHO, I would prefer either of the above to the following:
int Total;
if (Something)
    Total = 1;
else
    Total = 2;
if (SomethingElse)
    if (AnotherThing)
        Total += x;
    else
        Total += y;
else
    Total += z;
With the nested if-statements, I have a harder time figuring out the final result without tracing through it. The one-liner feels more like the math formula it was intended to be, and consequently easier to follow.
As far as the cool factor, there is a certain feeling of accomplishment / show-off factor in "Look Ma, I wrote a whole program in one line!". But I wouldn't use it in any context other than playing around; I certainly wouldn't want to have to go back and debug it!
Ultimately, with real (production) projects, whatever makes it easiest to understand is best. Because there will come a time that you or someone else will be looking at the code again. What they say is true: time is precious.
That's true in most cases, but in cases where one-liners are common idioms, it's acceptable. ?: might be an example. Closures might be another.
No, it is annoying.
One liners can be more readable and they can be less readable. You'll have to judge from case to case.
And, of course, on the prompt one-liners rule.
VASTLY more important is developing and sticking to a consistent style.
You'll find bugs MUCH faster, be better able to share code with others, and even code faster if you merely develop and stick to a pattern.
One aspect of this is to make a decision on one-liners. Here's one example from my shop (I run a small coding department) - how we handle IFs:
Ifs shall never be all on one line if they overflow the visible line length, including any indentation.
Thou shalt never have else clauses on the same line as the if even if it comports with the line-length rule.
Develop your own style and STICK WITH IT (or, refactor all code in the same project if you change style).
The main drawback of "one liners" in my opinion is that it makes it hard to break on the code and debug. For example, pretend you have the following code:
a().b().c(d() + e())
If this isn't working, it's hard to inspect the intermediate values. However, it's trivial to break with gdb (or whatever other tool you may be using) in the following, and check each individual variable to see precisely what is failing:
A = a();
B = A.b();
D = d();
E = e(); // here I can query A, B, D and E
B.c(D + E);
One rule of thumb is whether you can express the concept of the one-liner in plain language in a very short sentence: "If it's true, set it to this, otherwise set it to that."
For a code construct whose entire objective is to decide what value to assign to a single variable, it is almost always clearer, with appropriate formatting, to put multiple conditionals into a single statement. With multiple nested if/else blocks, the overall objective, to set the variable...
" variableName = "
...must be repeated in every nested clause, and the eye must read all of them to see this. With a single statement it is much clearer, and with appropriate formatting the complexity is more easily managed as well:
decimal cost =
    usePriority ? PriorityRate * weight :
    useAirFreight ? AirRate * weight :
    crossMultRegions ? MultRegionRate :
    SingleRegionRate;
The pro: an easily understood one-liner that works.
The con: a concatenation of obfuscated gibberish on one line.
Generally, I'd call it a bad idea (although I do it myself on occasion) – it strikes me as something done more to show off how clever someone is than to make good code. "Clever tricks" of that sort are generally very bad.
That said, I personally aim to have one "idea" per line of code; if this burst of logic is easily encapsulated in a single thought, then go ahead. If you have to stop and puzzle it out a bit, best to break it up.
