Perl, how to make this scripter smaller, faster, better, stronger? [closed] - performance

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I'm a newbie in perl, how I can make this better and faster?
(SPOJ task)
until($i>99){
$a=<>;
if($a%5==0){
if($a%3==0){
print"SPOKOKOKO\n";
}else{
print"SPOKO\n";
}
}elsif($a%3==0){
print"KOKO\n";
}else{
print"$a";
}
$i++;
}

Something which is divisible by 5 and 3 is divisible by 15, so your logic is more straight forward as...
Divisible by 15
Divisible by 5
Divisible by 3
Not divisible by 5 nor 3
I'd also take advantage of say.
I'd also replace the awkward until with a for and not bother with $i.
Avoid using $a and $b because these are special variables used by sort.
Turn on strict and warnings.
Strip the newline off $a. Technically not necessary if you're going to use it as a number, but if you don't printing it will have an extra newline.
A non-number % anything is 0. Unless we validate the input, if you input nothing or "bananas" you'll get SPOKOKOKO. We can check if the input looks like something Perl considers a number with Scalar::Util::looks_like_number. If it isn't a number, throw an exception; it isn't the function's job to figure out what to do with a non-number. Perl doesn't really have exceptions, but we can get close with a die and eval.
And put the logic inside a subroutine. This separates the details of the input and output from the logic. The logic can now be named, documented, tested, and reused. This also avoids the redundant say.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use v5.10;
sub whatever_this_is_doing {
my $num = shift;
# Validate input early to keep the rest of the logic simple.
if( !looks_like_number($num) ) {
die "'$num' is not a number";
}
if($num % 15 == 0) {
"SPOKOKOKO";
}
elsif($num % 5 == 0) {
"SPOKO";
}
elsif($num % 3 == 0) {
"KOKO";
}
else {
$num; # Note there's no need to quote a single variable.
}
}
for(1..99) {
my $num = <>;
chomp($num);
eval {
say whatever_this_is_doing($num);
} or say "'$num' is not a number.";
}
I'm sure there's some very clever math thing one can do to reduce the number of calls to %, but % is extremely unlikely to be a bottleneck, especially in Perl, so I'd leave it there.

100 rows? That will be so fast that I don't understand why you are asking for optimization. Nevertheless...
The biggest two costs are reading 100 rows and writing 100 rows.
Then come the ifs and modulo checks -- each of these is insignificant.
One thing to consider is to turn it inside out. I'm suggesting 1 print with a messy expression inside the loop
print ( ($a%5 == 0) ?
( ($a%3 == 0) ? "SPOKOKOKO\n" : "SPOKO\n" ) :
( ($a%3 == 0) ? "KOKO\n" : $a );
A lot fewer lines (3 vs 10); probably easier to read (especially if columns are aligned); perhaps faster.
Easier to digest. (I had not noticed that the %3 was s the same.) Oh, and now it is obvious that \n is missing from the last case; maybe a bug?
Perhaps some of the parentheses can be removed, but I don't trust Perl to 'do the right thing with nested uses of ?:.

Related

How can I find both identical and similar strings in a particular field in a text file in Linux?

My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters with 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because they are not similar.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this:
20?7G?N?RAL
These differences could be arbitrarily in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
you can use this script
$ more hamming.awk
function hamming(x,y,xs,ys,min,max,h) {
if(x==y) return 0;
else {
nx=split(x,xs,"");
mx=split(y,ys,"");
min=nx<mx?nx:mx;
max=nx<mx?mx:nx;
for(i=1;i<=min;i++) if(xs[i]!=ys[i]) h++;
return h+(max-min);
}
}
BEGIN {FS=OFS="\t"}
NR==FNR {
if($3 in a) nrs[NR];
for(k in a)
if(hamming(k,$3)<4) {
nrs[NR];
nrs[a[k]];
}
a[$3]=NR;
next
}
FNR in nrs
usage
$ awk -f hamming.awk file{,}
it's a double scan algorithm, finds the hamming distance (the one you described) between keys. Notice the it's O(n^2) algorithm, so may not suitable for very large data sets. However, not sure any other algorithm can do better.
NB Additional note based on the comment which I missed from the post. This algorithm compares the keys character by character, so displacements won't be identified. For example 123 and 23 will give a distance of 3.
Levenshtein distance aka "edit distance" suits your task best. Perl script below requires installing a module Text::Levenshtein (for debian/ubuntu do: sudo apt install libtext-levenshtein-perl).
use Text::Levenshtein qw(distance);
$maxdist = shift;
#ll = (<>);
#k = map {
$k = (split /\t/, $_)[2];
# $k =~ s/O/0/g;
} #ll;
for ($i = 0; $i < #ll; ++$i) {
for ($j = 0; $j < #ll; ++$j) {
if ($i != $j and distance($k[$i], $k[$j]) < $maxdist) {
print $ll[$i];
last;
}
}
}
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in #karakfa's post, but matching is more flexible.
Also note the commented line # $k =~ s/O/0/g;. If you uncomment it, then all O's in key will become 0's, which will fix keys damaged by O->0 transformation. When working with damaged data I always use small rules like this to fix data gradually, refining rules from run to run, to the point where data is almost perfect and fuzzy match is no longer needed.

Should I expect to see the counter in `for` loop changed inside its body? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
I'm reading someone else's code and they separately increment their for loop counter inside the loop, as well as including the usual afterthought. For example:
for( int y = 4; y < 12; y++ ) {
// blah
if( var < othervar ) {
y++;
}
// blah
}
Based on the majority of code others have written and read, should I be expecting to see this?
The practice of manipulating the loop counter within a for loop is not exactly widespread. It would surprise many of the people reading that code. And surprising your readers is rarely a good idea.
The additional manipulation of your loop counter adds a ton of complexity to your code because you have to keep in mind what it means and how it affects the overall behavior of the loop. As Arkady mentioned, it makes your code much harder to maintain.
To put it simply, avoid this pattern. When you follow "clean code" principles, especially the single layer of abstraction (SLA) principle, there is no such thing as
for(something)
if (somethingElse)
y++
Following the principle requires you to move that if block into its own method, making it awkward to manipulate some outer counter within that method.
But beyond that, there might be situations where "something" like your example makes; but for those cases - why not use a while loop then?
In other words: the thing that makes your example complicated and confusing is the fact that two different parts of the code change your loop counter. So another approach could look like:
while (y < whatever) {
...
y = determineY(y, who, knows);
}
That new method could then be the central place to figure how to update the loop variable.
I beg to differ with the acclaimed answer above. There is nothing wrong with manipulating loop control variable inside the loop body. For example, here is the classical example of cleaning up the map:
for (auto it = map.begin(), e = map.end(); it != e; ) {
if (it->second == 10)
it = map.erase(it);
else
++it;
}
Since I have been rightfully pointed out to the fact that iterators are not the same as numeric control variable, let's consider an example of parsing the string. Let's assume the string consists of a series of characters, where characters prefixed with '\' are considered to be special and need to be skipped:
for (size_t i = 0; i < s_len; ++i) {
if (s[i] == '\\') {
++i;
continue;
}
process_symbol(s[i]);
}
Use a while loop instead.
While you can do this with a for loop, you should not. Remember that a program is like any other piece of communication, and must be done with your audience in mind. For a program, the audience includes the compiler and the next person to do maintenance on the code (likely you in about 6 months).
To the compiler, the code is taken very literally -- set up a index variable, run the loop body, execute the increment, then check the condition to see if you are looping again. The compiler doesn't care if you monkey with the loop index.
To a person however, a for loop has a specific implied meaning: Run this loop a fixed number of times. If you monkey with the loop index, then this violates the implication. It's dishonest in a sense, and it matters because the next person to read the code will either have to spend extra effort to understand the loop, or will fail to do so and will therefore fail to understand.
If you want to monkey with the loop index, use a while loop. Especially in C/C++/related languages, a for loop is exactly as powerful as a while loop, so you never lose any power or expressiveness. Any for loop can be converted to a while loop and vice versa. However, the next person who reads it won't depend on the implication that you don't monkey with the loop index. Making it a while loop instead of a for loop is a warning that this kind of loop may be more complicated, and in your case, it is in fact more complicated.
If you increment inside the loop, make sure to comment it. A canonical example (based on a Scott Meyers Effective C++ item) is given in the Q&A How to remove from a map while iterating it? (verbatim code copy)
for (auto it = m.cbegin(); it != m.cend() /* not hoisted */; /* no increment */)
{
if (must_delete)
{
m.erase(it++); // or "it = m.erase(it)" since C++11
}
else
{
++it;
}
}
Here, both the non-constant nature of the end() iterator and the increment inside the loop are surprising, so they need to be documented. Note: the loop hoisting here is after all possible so probably should be done for code clarity.
For what it's worth, here is what the C++ Core Guidelines has to say on the subject:
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-loop-counter
ES.86: Avoid modifying loop control variables inside the body of raw
for-loops
Reason The loop control up front should enable correct
reasoning about what is happening inside the loop. Modifying loop
counters in both the iteration-expression and inside the body of the
loop is a perennial source of surprises and bugs.
Also note that in the other answers here that discuss the case with std::map, the increment of the control variable is still only done once per iteration, where in your example, it can be done more than once per iteration.
So after the some confusion, i.e. close, reopen, question body update, title update, I think the question is finally clear. And also no longer opinion based.
As I understand it the question is:
When I look at code written by others, should I be expecting to see "loop condition variable" being changed in the loop body ?
The answer to this is a clear:
yes
When you work with others code - regardless of whether you do a review, fix a bug, add a new feature - you shall expect the worst.
Everything that are valid within the language is to be expected.
Don't make any assumptions about the code being in acordance with any good practice.
It's really better to write as a while loop
y = 4;
while(y < 12)
{
/* body */
if(condition)
y++;
y++;
}
You can sometimes separate out the loop logic from the body
while(y < 12)
{
/* body */
y += condition ? 2 : 1;
}
I would allow the for() method if and only if you rarely "skip" an item,
like escapes in a quoted string.

String matching alternate approach [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
Trying to write my own fast pattern matching algo. Dont want to use language specific solution. I am focussing on writing the algo. This is because I was reading about different techniques to do string matching. Some are complicate yet very interesting like Rabin karp, etc.
I came up with this method which is fast and linear. It works well with the different inputs I have tried with. So I was thinking is there any reason I shouldnt be using this approach over the very well know approaches. Basically I am taking a char of text and comparing with the corresponding character of the pattern - one at a time.
Also, if someone could point out my mistake in this one - it will be great. Thank you for your replies and comments in advance :)
public static boolean patternMatch(String pattern, String text)
{
if(pattern == null)
return true;
if(text == null)
return false;
char[] patternArray = pattern.toCharArray();
char[] textArray = text.toCharArray();
int length = pattern.length();
int j = 0;
for(char t : textArray)
{
if(t == patternArray[j])
{
j++;
if(j == length)
return true;
}
else {
j = 0;
if(t == patternArray[j]) j++;
}
}
return false;
}
Two reasons for using a standard approach:
It's easy to write a method that simply does the wrong thing. Your method is like that, because it will fail to match, for instance, the pattern "ab" against the string "aab". (It matches the first "a"s of the pattern and the string, then fails to match "b" to the second "a" of the string, then goes on to see if it can find a match starting at the third character of the string.)
Standard approaches are fast. Your algorithm is linear, which is pretty good (if only it were also correct!). However, many string matching algorithms will work in sublinear time. That is, the time it takes to match a string grows more slowly than linearly in the size of the input problem. Perhaps hard to believe, but true. (Read the literature for substantiations of this claim.)

Why the "for loop" is created although we already have "while loop" or vice vera? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I cannot figure it out why the for loop is created in spite of having while loop.
In my opinion, they are the same even through someone says that the while loop is used when we do not know an iteration number. In that case, we can also use for loop, can't we?
For example,
For loop:
for (;true;)
{
document.write("Hello World <br>");
if(somecondition){
break;
}
}
While loop:
while (true)
{
document.write("Hello World <br>");
if(somecondition){
break;
}
}
Moreover, the complexity of the two kinds of loops are the equal. So, why the loops do not be deleted and rest only one kind?
Short answer: They are both syntactic sugar that make interpreting the code easier for a programmer, not a machine.
Neither approach you given is no more or less valid than the other, and its true that every for loop can be written as a while loop. The difference is that it makes it more readable to another programmer.
Lets look at Python for an example:
for line in file.readline():
I don't need to see anything else in this block of code to know its intent. Its reading a file, where as in this example:
readline = file.readline()
while readline != "":
# some code
readline = file.readline()
It is much less clear what is happening, but the intent is there. Worst still is this:
while true:
readline = file.readline()
if readline == "":
break
I now need to read further into the loop to gain the same insight. As stated above, while loops and for loops offer programmers ways to make more readable and thus more maintainable code.
Indeed, anything you can do using a for loop can also be accomplished with a while loop. The following two are equivalent:
for (initialCondition; test; nextStep) {
doSomething;
}
and
initialCondition;
while (test) {
doSomething;
nextStep;
}
So why do we have the for loop at all? There are two reasons.
First, the pattern in the second code snippet is very common (initialize something, then loop while updating on each step until a condition fails). Because it's so common, many language designers considered it worthwhile to have a special syntax for it.
Second, moving iteration variables (e.g. i or j) out of the loop body and inside the for makes it easier for compilers and interpreters to detect certain classes of error (or perhaps even forbid them). For example, if I write (in MATLAB because I happen to have it open)
for i = 1:10
i = i + 5; // The body of
disp(i); // the loop.
end
then MATLAB presents me with a warning -- "Line 2: Loop index 'i' is changing inside a FOR loop". If I instead wrote it as a while loop:
i = 1;
while i <= 10
i = i + 5; // The body of
disp(i); // the loop.
i = i + 1;
end
Uh-oh! Not only do I no longer get a warning from MATLAB, now my code is incorrect! Modifying the iteration variable in the body of the loop is dangerous, and it's rarely the case that you mean to do it. Writing "initialize, test, update" loops as for loops rather than while loops helps protect you from this class of error.
You are correct, for-loops can be converted to while-loops and vice-versa. You can even convert them to do-while loops.
It is the same with if and switch. Why having a switch when the same can be done with an if as well?
I guess it mainly exists for readability and maintainance reasons. Consider the following code:
int i=0;
while(i < array.length){
doStuff(array[i]);
i++;
}
Now, another developer adds functionality around this loop:
int i=0;
do();
some();
i = other(); // oops, didn't see that i was used for the loop
stuff();
while(i < array.length){
doStuff(array[i]);
i++;
doMoreStuff(array[i]); // oops, did't see i was already incremented
}
With
for(int i=0; i<array.length; i++){
doStuff(array[i]);
doMoreStuff(array[i]);
}
It is easier to read (you know when the loop starts and when it ends)
It is more error robust (i is scoped to the for loop)
Therefore, it is easier to maintain
Yes, I think you don't need neither of them:
Label1:
document.write("Hello World <br>");
if(somecondition) goto Label2;
goto Label1;
Label2:
They're semantically the equivalent. for(x;y;z) { foo; } is equivalent to x; while (y) { foo; z; }. They're not exactly equivalent in further versions of the standard, in the example of for (int x = 0; y; z), the scope of x is the for block and is out of scope after the loop ends, whereas with int x; while (y) x it's still in scope after the loop ends.
Another difference is that for interprets missing y as TRUE, whereas while must be supplied with an expression. for (;;) { foo; } is fine, but while() { foo; } isn not.
Source
Author

Should I test if equal to 1 or not equal to 0?

I was coding here the other day, writing a couple of if statements with integers that are always either 0 or 1 (practically acting as bools). I asked myself:
When testing for positive result, which is better; testing for int == 1 or int != 0?
For example, given an int n, if I want to test if it's true, should I use n == 1 or n != 0?
Is there any difference at all in regards to speed, processing power, etc?
Please ignore the fact that the int may being more/less than 1/0, it is irrelevant and does not occur.
Human's brain better process statements that don't contain negations, which makes "int == 1" better way.
It really depends. If you're using a language that supports booleans, you should use the boolean, not an integer, ie:
if (value == false)
or
if (value == true)
That being said, with real boolean types, it's perfectly valid (and typically nicer) to just write:
if (!value)
or
if (value)
There is really very little reason in most modern languages to ever use an integer for a boolean operation.
That being said, if you're using a language which does not support booleans directly, the best option here really depends on how you're defining true and false. Often, false is 0, and true is anything other than 0. In that situation, using if (i == 0) (for false check) and if (i != 0) for true checking.
If you're guaranteed that 0 and 1 are the only two values, I'd probably use if (i == 1) since a negation is more complex, and more likely to lead to maintenance bugs.
If you're working with values that can only be 1 or 0, then I suggest you use boolean values to begin with and then just do if (bool) or if (!bool).
In language where int that are not 0 represents the boolean value 'true', and 0 'false', like C, I will tend to use if (int != 0) because it represents the same meaning as if (int) whereas int == 1 represents more the integer value being equal to 1 rather than the boolean true. It may be just me though. In languages that support the boolean type, always use it rather than ints.
A Daft question really. If you're testing for 1, test for 1, if you're testing for zero, test for zero.
The addition of an else statement can make the choice can seem arbitrary. I'd choose which makes the most sense, or has more contextual significance, default or 'natural' behaviour suggested by expected frequency of occurrence for example.
This choice between int == 0 and int != 1 may very well boil down to subjective evaluations which probably aren't worth worrying about.
Two points:
1) As noted above, being more explicit is a win. If you add something to an empty list you not only want its size to be not zero, but you also want it to be explicitly 1.
2) You may want to do
(1 == int)
That way if you forget an = you'll end up with a compile error rather than a debugging session.
To be honest if the value of int is just 1 or 0 you could even say:
if (int)
and that would be the same as saying
if (int != 0)
but you probably would want to use
if (int == 1)
because not zero would potentially let the answer be something other than 1 even though you said not to worry about it.
If only two values are possible, then I would use the first:
if(int == 1)
because it is more explicit. If there were no constraint on the values, I would think otherwise.
IF INT IS 1
NEXT SENTENCE
ELSE MOVE "INT IS NOT ONE" TO MESSAGE.
As others have said, using == is frequently easier to read than using !=.
That said, most processors have a specific compare-to-zero operation. It depends on the specific compiler, processor, et cetera, but there may be an almost immeasurably small speed benefit to using != 0 over == 1 as a result.
Most languages will let you use if (int) and if (!int), though, which is both more readable and get you that minuscule speed bonus.
I'm paranoid. If a value is either 0 or 1 then it might be 2. May be not today, may be not tomorrow, but some maintenance programmer is going to do something weird in a subclass. Sometimes I make mistakes myself [shh, don't tell my employer]. So, make the code say tell me that the value is either 0 or 1, otherwise it cries to mummy.
if (i == 0) {
... 0 stuff ...
} else if (i == 1) {
... 1 stuff ...
} else {
throw new Error();
}
(You might prefer switch - I find its syntax in curly brace language too heavy.)
When using integers as booleans, I prefer to interpret them as follows: false = 0, true = non-zero.
I would write the condition statements as int == 0 and int != 0.
I would say it depends on the semantics, if you condition means
while ( ! abort ) negation is ok.
if ( quit ) break; would be also ok.
if( is_numeric( $int ) ) { its a number }
elseif( !$int ) { $int is not set or false }
else { its set but its not a number }
end of discussion :P
I agree with what most people have said in this post. It's much more efficient to use boolean values if you have one of two distinct possibilities. It also makes the code a lot easier to read and interpret.
if(bool) { ... }
I was from the c world. At first I don't understand much about objective-c. After some while, I prefer something like:
if (int == YES)
or
if (int == NO)
in c, i.e.:
if (int == true)
if (int == false)
these days, I use varchar instead of integer as table keys too, e.g.
name marital_status
------ --------------
john single
joe married
is a lot better than:
name marital_status
------ --------------
john S
joe M
or
name marital_status
------ --------------
john 1
joe 2
(Assuming your ints can only be 1 or 0) The two statements are logically equivalent. I'd recommend using the == syntax though because I think it's clearer to most people when you don't introduce unnecessary negations.

Resources