Rust println! prints weird characters under certain circumstances

Rust println! prints weird characters under certain circumstances - utf-8

I'm trying to write a short program (short enough that it has a simple main function). First, I should list the dependency in the cargo.toml file:
[dependencies]
passwords = {version = "3.1.3", features = ["crypto"]}
Then when I use the crate in main.rs:
extern crate passwords;
use passwords::hasher;
fn main() {
let args: Vec<String> = std::env::args().collect();
if args.len() < 2
{
println!("Error! Needed second argument to demonstrate BCrypt Hash!");
return;
}
let password = args.get(1).expect("Expected second argument to exist!").trim();
let hash_res = hasher::bcrypt(10, "This_is_salt", password);
match hash_res
{
Err(_) => {println!("Failed to generate a hash!");},
Ok(hash) => {
let str_hash = String::from_utf8_lossy(&hash);
println!("Hash generated from password {} is {}", password, str_hash);
}
}
}
The issue arises when I run the following command:
$ target/debug/extern_crate.exe trooper1
And this becomes the output:
?sC�M����k��ed from password trooper1 is ���Ka .+:�
However, this input:
$ target/debug/extern_crate.exe trooper3
produces this:
Hash generated from password trooper3 is ��;��l�ʙ�Y1�>R��G�Ѡd
I'm pretty content with the second output, but is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten? And is there code I could use to prevent this?
Note: Code was developed in Visual Studio Code in Windows 10, and was compiled and run using an embedded Git Bash Terminal.
P.S.: I looked at similar questions such as Rust println! problem - weird behavior inside the println macro and Why does my string not match when reading user input from stdin? but those issues seem to be issues with new-line and I don't think that's the problem here.

To complement the previous, the answer to your question of "is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten?" is:
let str_hash = String::from_utf8_lossy(&hash);
The reason's in the name: from_utf8_lossy is lossy. UTF8 is a pretty prescriptive format. You can use this function to "decode" stuff which isn't actually UTF8 (for whatever reason), but the way it will do this decoding is:
replace any invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER, which looks like this: �
And so that is what the odd replacement you get is: byte sequences which can not be decoded as UTF8, and are replaced by the "replacement character".
And this is because hash functions generally return random-looking binary data, meaning bytes across the full range (0 to 255) and with no structure. UTF8 is structured and absolutely does not allow such arbitrary data so while it's possible that a hash will be valid UTF8 (though that's not very useful) the odds are very very low.
That's why hashes (and binary data in general) are usually displayed in alternative representations e.g. hex, base32 or base64.

You could convert the hash to hex before printing it to prevent this

Neither of the other answers so far have covered what caused the Hash generated part of the answer to get overwritten.
Presumably you were running your program in a terminal. Terminals support various "terminal control codes" that give the terminal information such as which formatting they should use to output the text they're showing, and where the text should be output on the screen. These codes are made out of characters, just like strings are, and Unicode and UTF-8 are capable of representing the characters in question – the only difference from "regular" text is that the codes start with a "control character" rather than a more normal sort of character, but control characters have UTF-8 encodings of their own. So if you try to print some randomly generated UTF-8, there's a chance that you'll print something that causes the terminal to do something weird.
There's more than one terminal control code that could produce this particular output, but the most likely possibility is that the hash contained the byte b'\x0D', which UTF-8 decodes as the Unicode character U+000D. This is the terminal control code "CR", which means "print subsequent output at the start of the current line, overwriting anything currently there". (I use this one fairly frequently for printing progress bars, getting the new version of the progress bar to overwrite the old version of the progress bar.) The output that you posted is consistent with accidentally outputting CR, because some random Unicode full of replacement characters ended up overwriting the start of the line you were outputting – and because the code in question is only one byte long (most terminal control codes are much longer), the odds that it might appear in randomly generated UTF-8 are fairly high.
The easiest way to prevent this sort of thing happening when outputting arbitrary UTF-8 in Rust is to use the Debug implementation for str/String rather than the Display implementation – it will output control codes in escaped form rather than outputting them literally. (As the other answers say, though, in the case of hashes, it's usual to print them as hex rather than trying to interpret them as UTF-8, as they're likely to contain many byte sequences that aren't valid UTF-8.)

Related

gcc ncurses printing extend characters (glyphs) such as char 223

I am writing a terminal program for the Raspberry Pi using ncurses. I want to add a shadow around a box. I want to use mvaddch() to print extended characters such as char 233 (upper half box character). What would be the syntax for the mvaddch() command? Or is there another way to accomplish this?

You're probably referring to something like code page 866. ncurses will assume your terminal shows characters consistent with the locale encoding, which probably is UTF-8. So (unless you want to convert the characters in your program) the way to go is using Unicode values.
The Unicode organization has tables which you can use to lookup a particular code, e.g., ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT. For your example, the relevant row is
0xdf 0x2580 #UPPER HALF BLOCK
(because 0xdf is 223). You would use the Unicode 0x2580 in a call to the function mvaddwstr, e.g.
wchar_t mydata[] = { 0x2580, 0 };
mvaddwstr(0,0, mydata);
(the similarly-named wadd_wch uses a data structure which is more complicated).
You would have to link with the ncursesw library, and of course initialize your program's locale using setlocale as mentioned in the ncurses manual page.

GS1-128 barcode with ZPL does not put the AI in ()

i was expecting this command
^FO15,240^BY3,2:1^BCN,100,Y,N,Y,^FD>:>842011118888^FS
to generate a
(420) 11118888
interpretation line, instead it generates
~n42011118888
anyone have idea how to generate the expected output?
TIA!
Joey

If the firmware is up to date, D mode can be used.
^BCo,h,f,g,e,m
^XA
^FO15,240
^BY3,2:1
^BCN,100,Y,N,Y,D
^FD(420)11118888^FS
^XZ
D = UCC/EAN Mode (x.11.x and newer firmware)
This allows dealing with UCC/EAN with and without chained
application identifiers. The code starts in the appropriate subset
followed by FNC1 to indicate a UCC/EAN 128 bar code. The printer
automatically strips out parentheses and spaces for encoding, but
prints them in the human-readable section. The printer automatically
determines if a check digit is required, calculate it, and print it.
Automatically sizes the human readable.

The ^BC command's "interpretation line" feature does not support auto-insertion of the parentheses. (I think it's safe to assume this is partly because it has no way of determining what your data identifier is by just looking at the data provided - it could be 420, could be 4, could be any other portion of the data starting from the first character.)
My recommendation is that you create a separate text field which handles the logic for the parentheses, and place it just above or below the barcode itself. This is the way I've always approached these in the past - I prefer this method because I have direct control over the font, font size, and formatting of the interpretation line.

Is there any character that is illegal in file paths on every OS?

Is there any character that is guaranteed not to appear in any file path on Windows or Unix/Linux/OS X?
I need this because I want to join together a few file paths into a single string, and then split them apart again later.

In the comments, Harry Johnston writes:
The generic solution to this class of problem is to encode the file paths before joining them. For example, if you're dealing with single-byte strings, you could convert them to hex strings; so "hello" becomes "68656c6c6f". (Obviously that isn't the most efficient solution!)
That is absolutely correct. Please don't try to do anything "tricky" with filenames and reserved characters, because it will eventually break in some weird corner case and your successor will have a heck of a time trying to repair the damage.
In fact, if you're trying to be portable, I strongly recommend that you never attempt to create any filenames including any characters other than [a-z0-9_]. (Consider that common filesystems on both Windows and OS X can operate in case-insensitive mode, where FooBar.txt and FOOBAR.TXT are the same identifier.)
A decently compact encoding scheme for practical use would be to make a "whitelisted set" such as [a-z0-9_], and encode any character ch outside your "whitelisted set" as printf("_%2x", ch). So hello.txt becomes hello_2etxt, and hello_world.txt becomes hello_5fworld_2etxt.
Since every _ is escaped, you can use double-_ as a separator: the encoded string hello_2etxt__goodbye___2e_2e uniquely identifies the list of filenames ['hello.txt', 'goodbye', '..'].

You can use a newline character, or specifically CR (decimal code 13) or LF (decimal code 10) if you like. Whether this is suitable or not depends on what requirements you have with regard to displaying the concatenated string to the user - with this approach, it will print its parts on separate lines - which may be very good or very bad for the purpose (or you may not care...).
If you need the concatenated string to print on a single line, edit your question to specify this additional requirement; and we can go from there then.

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?

In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"

To use a unicode character in Ruby use the "\uXXXX" escape; where XXXX is the UTF-16 codepoint. see http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/

If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.

There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
begin
c.encode("ASCII") # try to encode it to ASCII
rescue Encoding::UndefinedConversionError # but if that fails
encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443". ̆
encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.

You can directly use unicode characters if you just add #Encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code.

try this gem. It converts Unicode or non-ASCII punctuation and symbols to nearest ASCII punctuation and symbols
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.

I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. The roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.

Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!

When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.

Probably | or ^ or ~ you could also combine two characters

You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.

How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.

Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.

For fast escaping I use stuff like this:
say you want to concatinate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the replace is important
its unbreakable and easy to implement

Pipe for the win! |

We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.

Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.

I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.

This can be good or bad (usually bad) depending on the situation and language, but keep mind mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply seperate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.

Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.

I've used double pipe and double caret before. The idea of a non printable char works if your not hand creating or modifying the file. For quick random access file storage and retrieval field width is used. You don't even have to read the file.. your literally pulling from the file by reference. This is how databases do some storage.. but they also manage the spaces between records and such. And it introduced the problem of max data element width. (Index attach a header which is used to define the width of each element and it's data type in the original old days.. later they introduced compression with remapping chars. This allows for a text file to get about 1/8 the size in transmission.. variable length char encoding for the win

make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio