Unicode - the right thing to do - utf-8

I'm working on something which processes UTF-8 encoding, and I found myself asking the question:
What should I do when I encounter a byte which can never occur inside a UTF-8 encoded string, i.e. 0b1111111X?
For example, I'm writing a small snippet of code which looks at the current place in the stream of bytes, and tells you how many bytes are used to represent the code point at that place in the stream.
0b0XXXXXXX: just 1
0b10XXXXXX: oops, we are in a continuation byte; search back upstream to find the leading byte
0b11XXXXXX: count the number of leading 1s, that's the answer
0b1111111X: err, this is not possible in UTF-8! What to do?
I'm thinking of returning an error value, but wondering if I should, as a side effect, replace it with some more predictable error glyph (I mean the code point representing said glyph). And later, when I do something more complicated, like jumping through the string, and find that a leading byte does not have the correct number of continuation bytes after it, I'm thinking I should "fix" that up too.
Is it standard practice to leave wrongly encoded strings broken, or to change them and make them be wrong but at least play nice?
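For what it's worth, the byte classification described above could be sketched in Python roughly as follows. The name sequence_length and the choice to raise ValueError are assumptions made for illustration. The sketch rejects lead bytes that can never occur in UTF-8 (0xC0, 0xC1 and 0xF5 to 0xFF), but it does not check that the right number of continuation bytes actually follows the lead byte.

```python
def sequence_length(data: bytes, pos: int) -> int:
    """Return how many bytes encode the code point that covers data[pos].

    Hypothetical helper: raises ValueError when the byte at (or before) pos
    cannot belong to a UTF-8 sequence.
    """
    # Walk back over continuation bytes (10xxxxxx). A sequence has at most
    # three of them, so never step back further than that.
    start = pos
    while start > 0 and (data[start] & 0xC0) == 0x80 and pos - start < 3:
        start -= 1

    lead = data[start]
    if lead < 0x80:                # 0xxxxxxx: single byte
        length = 1
    elif 0xC2 <= lead <= 0xDF:     # 110xxxxx: two bytes
        length = 2
    elif 0xE0 <= lead <= 0xEF:     # 1110xxxx: three bytes
        length = 3
    elif 0xF0 <= lead <= 0xF4:     # 11110xxx: four bytes
        length = 4
    else:
        # Unmatched continuation byte, or a byte that never occurs in UTF-8.
        raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {start}")

    # The sequence found must actually reach the position we started from.
    if start + length <= pos:
        raise ValueError(f"stray continuation byte at offset {pos}")
    return length
```

Returning an error (here, raising) keeps the function read-only; any replacement of bad bytes can then live in a separate step.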

The most common approach is to throw a meaningful error if the input is not correct, and stop.
There are a lot of good reasons to do so:
speed: trying to fix errors often makes your function slower even on correct inputs
simplicity: your code can become really complicated if you try to fix every error
maintainability and correctness: it is easier to ensure the function works correctly when it stops whenever the input does not match the specification you are working with, since you only have to check the input against that specification
purpose: any time you get to a point like this, you have to ask: what is the purpose of my function? Why did I come up with the idea to write it?
Also, a function fixcode which fixes the UTF-8 could be used in other places too, so it makes total sense to separate out the fixing (the purpose, simplicity, maintainability and correctness arguments again).
Even if you expect errors, I would prefer to keep encode and fixcode separate, since you can reuse fixcode in other contexts.
If you are really thinking about fixing the UTF-8 while encoding, I would use a pattern like this:
try {
    q = encode(s);
} catch(encodingerror) {
    log(encodingerror);
    t = fixcode(s);
    q = encode(t);
}
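In Python, for instance, the same try-then-fix pattern might look like the sketch below, applied to decoding bytes. The name decode_utf8 is hypothetical, and the "fix" is assumed to be Python's built-in errors="replace" handler, which substitutes U+FFFD (the replacement character) for bytes that cannot be decoded.

```python
import logging


def decode_utf8(raw: bytes) -> str:
    """Decode strictly first; fall back to a lossy decode only on error."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Report where the input went wrong before repairing it.
        logging.warning("invalid UTF-8 at byte %d: %s", err.start, err.reason)
        # Assumed repair strategy: replace undecodable bytes with U+FFFD.
        return raw.decode("utf-8", errors="replace")
```

Valid input goes through the strict path only, so the fallback costs nothing when the data is well formed.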

Having trouble with OCaml

I have some problems with OCaml. I wrote this:
let visibility_graph observation memory =
  Graph.add_node memory.graph observation.position
  Graph.add_node memory.graph observation.spaseship;
but it's not working. However, this is working:
let visibility_graph observation memory =
  Graph.add_node memory.graph observation.position
You don't give enough information for a full answer. However, the code you show is completely consistent with your error reports. The first example appears to consist of two expressions (function calls) with no separator between them. To execute two expressions sequentially, you need a semicolon (;) between them.
The semicolon at the end of the first example appears to be misplaced. Things might work (depending on what the rest of your code looks like) if you just move this semicolon to the end of the previous line.
The second example looks like a legitimate function definition. Of course, it's difficult to tell without knowing the definitions of all the identifiers used.

String include weird behavior

I was doing a code golf (use the minimum number of characters) and I had the following working Python solution. I was trying to shorten my code by rewriting it in Ruby, but my Ruby code would always print false.
The code had to read two strings, ignore case, and tell whether it was possible to obtain one string by rotating the other string. The output had to be either true or false. Do you have any idea what I did wrong in Ruby?
Python 3 (64 characters) - Works
a=input().lower()
b=input().lower()
print(str(a in 2*b).lower())
Ruby (47 characters) - Always prints "false"
a=gets.upcase
b=gets.upcase
p (b*2).include? a
With the examples I can think of, the Ruby code works correctly, but for some reason, it didn't work on the code golf site (codingame.com, the problem was proposed by user "10100111001").
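In Ruby, gets includes the \n at the end of the line. With the newlines left in, a can only match where its own \n lines up with the end of a copy of b, so genuine rotations come out false. You have to .chomp the newline away before doing anything else.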
a=gets.chomp.upcase
b=gets.chomp.upcase
p (b*2).include? a
By the way, this is not the right way to "tell whether it was possible to obtain one string by rotating the other string": it only partially solves the problem. I hope you know that.
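One gap the substring test leaves open, presumably the one hinted at here, is that a shorter string which merely occurs inside the doubled string is also reported as a rotation. A minimal Python sketch that adds a length check (is_rotation is a hypothetical name, not part of either golfed solution):

```python
def is_rotation(a: str, b: str) -> bool:
    """Case-insensitive check that a can be obtained by rotating b."""
    a, b = a.lower(), b.lower()
    # Without the length check, "ab" would count as a rotation of "abab".
    return len(a) == len(b) and a in b + b
```

This costs characters, of course, so whether it matters for a golf entry depends on the test cases the site actually uses.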

Refactoring Business Rule, Function Naming, Width, Height, Position X & Y

I am refactoring some business rule functions to provide a more generic version of the function.
The functions I am refactoring are:
DetermineWindowWidth
DetermineWindowHeight
DetermineWindowPositionX
DetermineWindowPositionY
All of them do string parsing, as it is a string parsing business rules engine.
My question is what would be a good name for the newly refactored function?
Obviously I want to shy away from a function name like:
DetermineWindowWidthHeightPositionXPositionY
I mean that would work, but it seems unnecessarily long when it could be something like:
DetermineWindowMoniker or something to that effect.
Function objective: Parse an input string like 1280x1024 or 200,100 and return either the first or second number. The use case is for data-driving test automation of a web browser window, but this should be irrelevant to the answer.
Question objective: I have the code to do this, so my question is not about code, but just the function name. Any ideas?
There are too few details; you should have specified at least the parameters and return values of the functions.
Have I understood correctly that you use strings of the format NxN for sizes and N,N for positions?
And that this generic function will have to parse both (and nothing else), and will return either the first or second part depending on a parameter of the function?
And that you'll then keep the various DetermineWindow* functions but make them all call this generic function?
If so:
Without knowing what parameters the generic function has it's even harder to help, but it's most likely impossible to give it a simple name.
Not all batches of code can be described by a simple name.
You'll most likely need to use a different construction if you want to have clear names. Here's an idea, in pseudo code:
ParseSize(string, outWidth, outHeight) {
    ParsePair(string, "x", outWidth, outHeight)
}
ParsePosition(string, outX, outY) {
    ParsePair(string, ",", outX, outY)
}
ParsePair(string, separator, outFirstItem, outSecondItem) {
    ...
}
And the various DetermineWindow* functions would call ParseSize or ParsePosition.
You could also use just ParsePair directly, but I think it's cleaner to have the two other functions in the middle.
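To make the layering concrete, here is a rough Python rendering of the same idea. It is only a sketch: the snake_case names mirror the pseudo code, tuples stand in for the out-parameters, and the inputs are assumed to be well-formed strings like 1280x1024 or 200,100.

```python
from typing import Tuple


def parse_pair(text: str, separator: str) -> Tuple[int, int]:
    """Split 'AsepB' into two integers; raises ValueError on malformed input."""
    first, second = text.split(separator)
    return int(first), int(second)


def parse_size(text: str) -> Tuple[int, int]:
    """Parse 'WIDTHxHEIGHT', e.g. '1280x1024'."""
    return parse_pair(text, "x")


def parse_position(text: str) -> Tuple[int, int]:
    """Parse 'X,Y', e.g. '200,100'."""
    return parse_pair(text, ",")
```

A function such as DetermineWindowWidth would then reduce to taking the first element of parse_size(...).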
Objects
Note that you'd probably get cleaner code by using objects rather than strings (a Size and a Position one, and probably a Pair one too).
The ParsePair code (adapted appropriately) would be included in a constructor or factory method that gives you a Pair out of a string.
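As an illustration of the object-based variant, a Size type with a factory method might look like this in Python. The class and method names are made up for the example, and a Position type would follow the same shape with "," as the separator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: int
    height: int

    @classmethod
    def from_string(cls, text: str) -> "Size":
        """Build a Size from 'WIDTHxHEIGHT', e.g. '1280x1024'."""
        width, height = text.split("x")
        return cls(int(width), int(height))
```

Callers then read size.width or size.height instead of asking a parser for "the first or second number".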
---
Of course you can give other names to the various functions, objects and parameters, here I used the first that came to my mind.
It seems this question-answer provides a good starting point to answer this question:
Appropriate name for container of position, size, angle
A search on www.thesaurus.com for "Property" gives some interesting possible answers that provide enough meaningful context to the usage:
Aspect
Character
Characteristic
Trait
Virtue
Property
Quality
Attribute
Differentia
Frame
Constituent
I think ConstituentProperty is probably the most apt.

Erlang pattern matching bitstrings

I'm writing code to decode messages from a binary protocol. Each message type is assigned a 1 byte type identifier and each message carries this type id. Messages all start with a common header consisting of 5 fields. My API is simple:
decoder:decode(Bin :: binary()) -> my_message_type() | {error, binary()}
My first instinct is to lean heavily on pattern matching by writing one decode function for each message type and to decode that message type completely in the fun argument
decode(<<Hdr1:8, ?MESSAGE_TYPE_ID_X:8, Hdr3:8, Hdr4:8, Hdr5:32,
         TypeXField1:32, TypeXFld2:32, TypeXFld3:32>>) ->
    #message_x{hdr1=Hdr1, hdr3=Hdr3 ... fld4=TypeXFld3};
decode(<<Hdr1:8, ?MESSAGE_TYPE_ID_Y:8, Hdr3:8, Hdr4:8, Hdr5:32,
         TypeYField1:32, TypeYFld2:16, TypeYFld3:4, TypeYFld4:32,
         TypeYFld5:64>>) ->
    #message_y{hdr1=Hdr1, hdr3=Hdr3 ... fld5=TypeYFld5}.
Note that while the first 5 fields of the messages are structurally identical, the fields after that vary for each message type.
I have roughly 20 message types and thus 20 functions similar to the above. Am I decoding the full message multiple times with this structure? Is it idiomatic? Would I be better off just decoding the message type field in the function header and then decoding the full message in the body of the function?
Just to agree that your style is very idiomatic Erlang. Don't split the decoding into separate parts unless you feel it makes your code clearer. Sometimes it can be more logical to do that type of grouping.
The compiler is smart and compiles pattern matching in such a way that it will not decode the message more than once. It will first decode the first two fields (bytes) and then use the value of the second field, the message type, to determine how it is going to handle the rest of the message. This works irrespective of how long the common part of the binary is.
So there is no need to try to "help" the compiler by splitting the decoding into separate parts; it will not make it more efficient. Again, only do it if it makes your code clearer.
Your current approach is idiomatic Erlang, so keep going in this direction. Don't worry about performance; the Erlang compiler does good work here. If your messages really have exactly the same format you can write a macro for it, but it should generate the same code under the hood, and using a macro usually leads to worse maintainability anyway. Just out of curiosity, why are you generating different record types when they all have exactly the same fields? An alternative approach is to translate the message type from a constant to an Erlang atom and store it in one record type.

Clone detection algorithm

I'm writing an algorithm that detects clones in source code. E.g. if there is a block like:
for(int i = o; i <5; i++){
    doSomething(abc);
}
...and if this block is repeated somewhere else in the source code it will be detected as a clone. The method I am using at the moment is to create hashes for lines/blocks and compare them with hashes of other lines/blocks in the same source to see if there are any matches.
Now, if the same block as above was to be repeated somewhere with only the argument of doSomething different, it would not be detected as a clone even though it would appear very much like a clone to you and me. My algorithm detects exact matches but doesn't detect matching blocks where only the argument is different.
Could anyone suggest any ways of getting around this issue? Thanks!
Here's a super-simple way, which might go too far in erasing information (i.e., might produce too many false positives): replace every identifier that isn't a keyword with some fixed name. So you'd get
for (int DUMMY = DUMMY; DUMMY<5; DUMMY++) {
    DUMMY(DUMMY);
}
(assuming you really meant o rather than 0 in the initialization part of the for-loop).
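A rough Python sketch of this normalise-then-hash idea follows. The keyword set is a deliberately tiny assumption (a real tool would use the full keyword list of the language being scanned), the regular expression is a crude stand-in for a proper tokenizer, and SHA-256 is just one possible hash.

```python
import hashlib
import re

# Assumed, incomplete keyword list; extend it for the language being scanned.
KEYWORDS = {"for", "while", "if", "else", "return", "int", "void"}

# Crude identifier pattern: a word that does not start with a digit.
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def normalize(block: str) -> str:
    """Replace every non-keyword identifier with DUMMY and drop whitespace."""
    def swap(match):
        word = match.group(0)
        return word if word in KEYWORDS else "DUMMY"

    return "".join(IDENTIFIER.sub(swap, block).split())


def block_hash(block: str) -> str:
    """Hash of the normalised block; equal hashes mark candidate clones."""
    return hashlib.sha256(normalize(block).encode("utf-8")).hexdigest()
```

Two blocks that differ only in identifier names and spacing now hash to the same value, while a change in a literal or in the structure still produces a different hash.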
If you get a huge number of false positives with this, you could then post-process them by, for instance, looking to see what fraction of the DUMMYs actually correspond to the same identifier in both halves of the match, or at least to identifiers that are consistent between the two.
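One way to read the "consistent identifiers" check is: number each identifier by its first appearance and compare the resulting sequences, so two blocks match only if one can be turned into the other by a consistent renaming. The Python sketch below takes that interpretation; the names and the small keyword set are assumptions for illustration.

```python
import re

# Assumed, incomplete keyword list, as in any quick prototype.
KEYWORDS = {"for", "while", "if", "else", "return", "int", "void"}
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def rename_pattern(block: str) -> list:
    """Number each non-keyword identifier by order of first appearance."""
    seen = {}
    pattern = []
    for name in IDENTIFIER.findall(block):
        if name in KEYWORDS:
            continue
        pattern.append(seen.setdefault(name, len(seen)))
    return pattern


def consistently_renamed(a: str, b: str) -> bool:
    """True if the identifiers of a map one-to-one onto those of b."""
    return rename_pattern(a) == rename_pattern(b)
```

This only needs to run on pairs that already matched after identifier replacement, so it can be kept as a cheap second pass that filters out false positives.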
To do much better you'll probably need to parse the code to some extent. That would be a lot more work.
Well, if you're going to do something else, then you're going to have to parse the code at least a bit. For example, you could detect methods and then ignore the method arguments in your hash. Anyway, I think it's always true that you need your program to understand the code better than 'just text blocks', and that might get awfully complicated.
