I would like to create a function in OCaml that returns the char lambda (UTF8 0x03bb) but I can't use Char.chr because it's not in the ASCII chart. Is there a way to do so? I am new to OCaml...
First note that you are mixing up scalar values (an integer in the ranges 0..0xD7FF and 0xE000..0x10FFFF) and their encoding (the byte serialization of such an integer). Don't say "UTF-8 0x03bb", as it doesn't make any sense; what you are talking about is the scalar value U+03BB, the integer that represents small lambda in Unicode.
Now, as you noticed, the OCaml char type can't represent such integers, as it is limited to 256 values. What you can do, however, is represent their UTF-8 encoding in OCaml strings, which are (or, more precisely, became) sequences of arbitrary bytes. For U+03BB the UTF-8 serialization is the byte sequence 0xCE 0xBB, so you can write:
let lambda = "\xCE\xBB"
If you prefer to deal with scalar values directly you can use a UTF-8 encoder like Uutf (disclaimer: I'm the author) and do, for example:
let lambda = 0x03BB
let lambda_utf_8 =
  let b = Buffer.create 5 in
  Uutf.Buffer.add_utf_8 b lambda;
  Buffer.contents b
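Printing the result on a UTF-8 capable terminal should then show the character itself:

let () = print_endline lambda_utf_8 (* prints λ *)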
For a short refresher on Unicode and a few biased tips on how to deal with Unicode in OCaml you can consult this minimal Unicode introduction.
UPDATE
Since OCaml 4.06, Unicode escapes are supported in string literals. The following UTF-8 encodes the lambda character into the lambda string:
let lambda = "\u{03BB}"
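As a quick sanity check (a sketch), the escape produces exactly the two-byte sequence given above:

let () = assert ("\u{03BB}" = "\xCE\xBB")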
Just tagging along after the excellent answer by Daniel,
If you just want that 0x03bb, you can do
#require "camomile"
open CamomileLibrary.UChar
let () =
  let c = of_int 955 in
  (* Just to do something with it *)
  print_endline (CamomileLibrary.UPervasives.escaped_uchar c)
Here's a way to actually "see" it on the terminal.
#require "camomile, zed"
open CamomileLibrary.UChar
let () =
  Zed_utf8.make 1 (of_int 955) |> print_endline
and I got the decimal value 955 from: http://www.fileformat.info/info/unicode/char/03bb/index.htm
When you are going from std::u16string to, let's say, std::u32string, std::wstring_convert doesn't work, as it expects chars. So how does one use std::wstring_convert to convert between UTF-16 and UTF-32 using std::u16string as input?
For example :
inline std::u32string utf16_to_utf32(const std::u16string& s) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(s); // cannot do this, expects 'char'
}
Is it ok to reinterpret_cast to char, as I've seen in a few examples?
If you do need to reinterpret_cast, I've seen some examples using the string size as opposed to the total byte size for the pointers. Is that an error or a requirement?
I know codecvt is deprecated, but until the standard offers an alternative, it has to do.
If you do not want to reinterpret_cast, the only way I've found is to first convert to utf-8, then reconvert to utf-32.
For example:
// Convert UTF-16 to UTF-8.
std::u16string s;
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
std::string utf8_str = conv16.to_bytes(s);
// Convert UTF-8 to UTF-32.
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
std::u32string utf32_str = conv32.from_bytes(utf8_str);
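Wrapped into a single helper, the two-step route might look like this (a sketch, still relying on the deprecated codecvt facilities):

inline std::u32string utf16_to_utf32(const std::u16string& s) {
    // UTF-16 -> UTF-8
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> to_utf8;
    const std::string utf8 = to_utf8.to_bytes(s);
    // UTF-8 -> UTF-32
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> to_utf32;
    return to_utf32.from_bytes(utf8);
}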
Yes, this is sad and likely contributes to codecvt's deprecation.
I'm getting into Rust programming to write a small program, and I'm a little bit lost in string conversions.
In my program, I have a vector as follows:
let mut name: Vec<winnt::WCHAR> = Vec::new();
WCHAR is the same as a u16 on my Windows machine.
I hand the Vec<u16> over to a C function (as a pointer), which fills it with data. I then need to convert the string contained in the vector into a &str. However, no matter what I try, I cannot get this conversion working.
The only thing I managed to get working is to convert it to a WideString:
let widestr = unsafe { WideCString::from_ptr_str(name.as_ptr()) };
But this seems to be a step into the wrong direction.
What is the best way to convert the Vec<u16> to an &str under the assumption that the vector holds a valid and null-terminated string?
I then need to convert the string contained in the vector into a &str. However, no matter what I try, I cannot get this conversion working.
There's no way of making this a "free" conversion.
A &str is a Unicode string encoded with UTF-8. This is a byte-oriented encoding. If you have UTF-16 (or the different but common UCS-2 encoding), there's no way to read one as the other. That's equivalent to trying to read a JPEG image as a PDF. Both chunks of data might be a string, but the encoding is important.
The first question is "do you really need to do that?". Many times, you can take data from one function and shovel it back into another function, never looking at it. If you can get away with that, that might be the best answer.
If you do need to transform it, then you have to deal with the errors that can occur. An arbitrary array of 16-bit integers may not be valid UTF-16 or UCS-2. These encodings have edge cases that can easily produce invalid strings. Null-termination is another aspect - Unicode actually allows for embedded NUL characters, so a null-terminated string can't hold all possible Unicode characters!
Once you've ensured that the encoding is valid [1] and figured out how many entries in the input vector comprise the string, then you have to decode the input format and re-encode to the output format. This is likely to require some kind of new allocation, so you are most likely to end up with a String, which can then be used most anywhere a &str can be used.
There is a built-in method to convert UTF-16 data to a String: String::from_utf16. Note that it returns a Result to allow for these error cases. There's also String::from_utf16_lossy, which replaces invalid encoded parts with the Unicode replacement character.
let name = [0x68, 0x65, 0x6c, 0x6c, 0x6f];
let a = String::from_utf16(&name);
let b = String::from_utf16_lossy(&name);
println!("{:?}", a);
println!("{:?}", b);
If you are starting from a pointer to a u16 or WCHAR, you will need to convert to a slice first by using slice::from_raw_parts. If you have a null-terminated string, you need to find the NUL yourself and slice the input appropriately.
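For example, a sketch of that pointer-to-String route (utf16_ptr_to_string and max_len are illustrative names, not from any particular crate):

// Assumes `ptr` points to a valid buffer of at least `max_len` u16 values.
unsafe fn utf16_ptr_to_string(ptr: *const u16, max_len: usize) -> Option<String> {
    let slice = std::slice::from_raw_parts(ptr, max_len);
    // Stop at the first NUL, or take the whole buffer if there is none.
    let len = slice.iter().position(|&c| c == 0).unwrap_or(max_len);
    String::from_utf16(&slice[..len]).ok()
}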
[1]: This is actually a great way of using types; a &str is guaranteed to be UTF-8 encoded, so no further check needs to be made. Similarly, the WideCString is likely to perform a check once upon construction and then can skip the check on later uses.
This is my simple hack for this case. It probably has bugs; adjust it for your own case:
let mut v = vec![0u16; MAX_PATH as usize];
// imaginary win32 function
win32_function(v.as_mut_ptr());
let mut path = String::new();
for val in v.iter() {
    let c: u8 = (*val & 0xFF) as u8;
    if c == 0 {
        break;
    } else {
        path.push(c as char);
    }
}
I would like to define a grammar in Haskell that matches a string in the format "XY12XY" (some alpha characters followed by some numerics), e.g. variable names in programming languages.
customer123 is a valid variable name, but '123customer' is not a valid variable name.
I am at a loss how to define the grammar and write a validator function that would validate whether a given string is valid variable name. I have been trying to understand and adapt the parser example at: https://wiki.haskell.org/GADT but I just can't get my head around how to tweak it to make it work for my need.
If any kind fellow Haskell gurus would help me define this please:
validate :: ValidFormat -> String -> Bool
validate f [] = False
validate f s = ...
I would like to define the ValidFormat grammar as:
varNameFormat = Concat Alpha $ Concat Alpha Numeric
I'd start with a simple parser and see if that satisfies your needs, unless you can explain why this is not enough for your use case. Parsers are pretty straightforward. I'll give a very simple (and maybe incomplete) example with attoparsec:
import Control.Applicative
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as B
validateVar :: B.ByteString -> Bool
validateVar bstr = case parseOnly variableP bstr of
  Right _ -> True
  Left _  -> False
variableP :: Parser String
variableP =
  (++)
    <$> many1 letter_ascii            -- must start with one or more letters
    <*> many (digit <|> letter_ascii) -- then can have any combination of letters/digits
    <*  endOfInput                    -- make sure we don't ignore invalid trailing chars
variableP combines parsers via <*> and requires you to handle both results: the one from many1 letter_ascii and the one from many (digit <|> letter_ascii). In this case we just concatenate both results via (++); check the types of many1, many, letter_ascii and digit. The <* says "parse this, but discard the result of the right-hand parser" (otherwise you'd have to handle 3 results).
That means if you run the parser on "abc123" you'll get back "abc123". If you parse "1abc" the parser will fail.
Check the type of parseOnly:
parseOnly :: Parser a -> ByteString -> Either String a
We pass it our parser and the bytestring it should parse. If the parser fails we'll get Left <something went wrong>. If the parser succeeds, we'll get Right <our string>. The cool thing is... instead of just giving a string on success, we could do pretty much anything with the results in variableP, as in: use something different than (++), convert the types and whatnot (mind that the Parser type might also have to change then).
Since we only care if the parser succeeded in validateVar, we can just ignore the result in either case.
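A quick sanity check of both entry points might look like this (B.pack is just used to build the ByteString values):

checkExamples :: IO ()
checkExamples = do
  print (parseOnly variableP (B.pack "customer123")) -- Right "customer123"
  print (parseOnly variableP (B.pack "123customer")) -- Left <parse error>
  print (validateVar (B.pack "customer123"))         -- True
  print (validateVar (B.pack "123customer"))         -- False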
So instead of defining GADTs for your grammar, you just define Parsers.
You might also find this link useful for a tutorial: http://www.seas.upenn.edu/~cis194/fall14/spring13/lectures.html (week 10 and 11, including the assignments where you basically write your own little parser library)
I've taken this from the examples of regex-applicative:
import Text.Regex.Applicative
import Data.Char
import Data.Maybe
varNameFormat :: RE Char String
varNameFormat = (:) <$> psym isAlpha <*> many (psym isAlphaNum)
validate :: RE Char String -> String -> Bool
validate re str = isJust $ str =~ re
You will have
*Main> validate varNameFormat "a123"
True
*Main> validate varNameFormat "1a23"
False
Say I have some input word like "føøbær" and I want a hash table of letter frequencies s.t. f→1, ø→2 – how do I do this in OCaml?
The http://pleac.sourceforge.net/pleac_ocaml/strings.html examples only work on ASCII and https://ocaml-batteries-team.github.io/batteries-included/hdoc2/BatUTF8.html doesn't say how to actually create a BatUTF8.t from a string.
The BatUTF8 module you refer to defines its type t as string, thus there is no conversion needed: a BatUTF8.t is a string. Apparently, the module encourages you to validate your string before using other functions. I guess that a proper way of operating would be something like:
let s = "føøbær"
let () = BatUTF8.validate s
let () = BatUTF8.iter add_to_table s
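Here add_to_table is whatever updates your frequency table; a minimal sketch, assuming a Hashtbl keyed by the BatUChar.t values that BatUTF8.iter hands you (table is an illustrative name):

let table : (BatUChar.t, int) Hashtbl.t = Hashtbl.create 16
let add_to_table c =
  let n = try Hashtbl.find table c with Not_found -> 0 in
  Hashtbl.replace table c (n + 1)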
Looking at the code of Batteries, I found this of_string_unsafe, so perhaps this is the way:
open Batteries
BatUTF8.iter (fun c -> …Hashtbl.add table c …) (BatUTF8.of_string_unsafe "føøbær")
although, since it's termed "unsafe" (the docs don't say why), maybe this is equivalent:
BatUTF8.iter (fun c -> …Hashtbl.add table c …) "føøbær"
At least it works for the example word here.
Camomile also seems to iterate through it correctly:
module C = CamomileLibraryDefault.Camomile
C.iter (fun c -> …Hashtbl.add table c …) "føøbær"
I don't know of the tradeoffs between Camomile and BatUTF8 here, though they end up storing different types (BatUChar vs C.Pervasives.UChar).
As a personal challenge I'm trying to implement the SIMON block cipher in Ruby. I'm running into some issues finding the best way to work with the data. The full code related to this question is located at: https://github.com/Rami114/Personal/blob/master/Simon/Simon.rb
SIMON requires XOR, shift and circular shift operations, the last of which is forcing me to work with Bignums so I can perform the left circular shift with math rather than a more complex/slower double loop on byte arrays.
Is there a better way to convert a string to a Bignum and back again?
String -> BigNum (where N is 64 and pt is a string of plaintext)
pt = pt.chars.each_slice(N/8).map {|x| x.join.unpack('b*')[0].to_i(2)}.to_a
So I break the string into individual characters, slice into N-sized arrays (the word size in SIMON) and unpack each set into a BigNum. That appears to work fine and I can convert it back.
Now my SIMON code is currently broken, but that's more the math I think/hope and not the code. The conversion back is (where ct is an array of bignums representing the ciphertext):
ct.map { |x| [x.to_s(2).rjust(128,'0')].pack('b*') }.join
I seem to have to right-justify/pad the string, as bignums are of undefined width, so I have no leading 0s. Unfortunately pack requires a defined width to produce sensible output.
Is this a valid method of conversion? Is there a better way? I'm not sure on either count and hoping someone here can help out.
Edit: For @torimus, the circular shift implementation I'm using (from the link above):
def self.lcs(bytes, block_size, shift)
  ((bytes << shift) | (bytes >> (block_size - shift))) & ((1 << block_size) - 1)
end
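A quick check of the rotation arithmetic on a small word size (not part of the cipher itself):

bytes, block_size, shift = 0b1001, 4, 1
((bytes << shift) | (bytes >> (block_size - shift))) & ((1 << block_size) - 1)
# => 3 (0b0011): the high bit wraps around to the low end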
If you would be equally happy with unpack('B*') with MSB-first binary numbers (which you could well be if all your processing is circular), then you could also use .unpack('Q>') instead of .unpack('B*')[0].to_i(2) for generating pt:
pt = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890!#"
# Your version (with 'B' == msb first) for comparison:
pt_nums = pt.chars.each_slice(N/8).map {|x| x.join.unpack('B*')[0].to_i(2)}.to_a
=> [8176115190769218921, 8030025283835160424, 7668342063789995618, 7957105551900562521,
6145530372635706438, 5136437062280042563, 6215616529169527604, 3834312847369707840]
# unpack to 64-bit unsigned integers directly
pt_nums = pt.unpack('Q>8')
=> [8176115190769218921, 8030025283835160424, 7668342063789995618, 7957105551900562521,
6145530372635706438, 5136437062280042563, 6215616529169527604, 3834312847369707840]
There are no native 128-bit pack/unpacks to return in the other direction, but you can use Fixnum to solve this too:
split128 = 1 << 64
ct = pt # Just to show round-trip
ct.map { |x| [ x / split128, x % split128 ].pack('Q>2') }.join
=> "\x00\x00\x00\x00\x00\x00\x00\x00qwertyui . . . " # truncated
This avoids a lot of the temporary stages in your code, but at the expense of using a different bit ordering - I don't know enough about SIMON to say whether this is adaptable to your needs.