NSTask string encoding problem

NSTask string encoding problem - cocoa

In my program, I'm grep-ing via NSTask. For some reason, sometimes I would get no results (even though the code was apparently the same as the command run from the CLI which worked just fine), so I checked through my code and found, in Apple's documentation, that when adding arguments to an NSTask object, "the NSTask object converts both path and the strings in arguments to appropriate C-style strings (using fileSystemRepresentation) before passing them to the task via argv[]" (snip).
The problem is that I might grep terms like "Río Gallegos". Sadly (as I checked with fileSystemRepresentation), that undergoes the conversion and turns out to be "RiÃÅo Gallegos".
How can I solve this?
-- Ry

The problem is that I might grep terms like "Río Gallegos". Sadly (as I checked with fileSystemRepresentation), that undergoes the conversion and turns out to be "RiÃÅo Gallegos".
That's one possible interpretation. What you mean is that “Río Gallegos” gets converted to “Ri\xcc\x81o Gallegos”—the UTF-8 bytes to represent the decomposed i + combining acute accent.
Your problem is that grep is not interpreting these bytes as UTF-8. grep is using some other encoding—apparently, MacRoman.
The solution is to tell grep to use UTF-8. That requires setting the LC_ALL variable in your grep task's environment.
The quick and dirty value to use would be “en_US.UTF-8”; a more proper way would be to get the language code for the user's primary preferred language, replace the hyphen, if any, with an underscore, and stick “.UTF-8” on the end of that.

Related

Rust println! prints weird characters under certain circumstances

I'm trying to write a short program (short enough that it has a simple main function). First, I should list the dependency in the cargo.toml file:
[dependencies]
passwords = {version = "3.1.3", features = ["crypto"]}
Then when I use the crate in main.rs:
extern crate passwords;
use passwords::hasher;
fn main() {
let args: Vec<String> = std::env::args().collect();
if args.len() < 2
{
println!("Error! Needed second argument to demonstrate BCrypt Hash!");
return;
}
let password = args.get(1).expect("Expected second argument to exist!").trim();
let hash_res = hasher::bcrypt(10, "This_is_salt", password);
match hash_res
{
Err(_) => {println!("Failed to generate a hash!");},
Ok(hash) => {
let str_hash = String::from_utf8_lossy(&hash);
println!("Hash generated from password {} is {}", password, str_hash);
}
}
}
The issue arises when I run the following command:
$ target/debug/extern_crate.exe trooper1
And this becomes the output:
?sC�M����k��ed from password trooper1 is ���Ka .+:�
However, this input:
$ target/debug/extern_crate.exe trooper3
produces this:
Hash generated from password trooper3 is ��;��l�ʙ�Y1�>R��G�Ѡd
I'm pretty content with the second output, but is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten? And is there code I could use to prevent this?
Note: Code was developed in Visual Studio Code in Windows 10, and was compiled and run using an embedded Git Bash Terminal.
P.S.: I looked at similar questions such as Rust println! problem - weird behavior inside the println macro and Why does my string not match when reading user input from stdin? but those issues seem to be issues with new-line and I don't think that's the problem here.

To complement the previous, the answer to your question of "is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten?" is:
let str_hash = String::from_utf8_lossy(&hash);
The reason's in the name: from_utf8_lossy is lossy. UTF8 is a pretty prescriptive format. You can use this function to "decode" stuff which isn't actually UTF8 (for whatever reason), but the way it will do this decoding is:
replace any invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER, which looks like this: �
And so that is what the odd replacement you get is: byte sequences which can not be decoded as UTF8, and are replaced by the "replacement character".
And this is because hash functions generally return random-looking binary data, meaning bytes across the full range (0 to 255) and with no structure. UTF8 is structured and absolutely does not allow such arbitrary data so while it's possible that a hash will be valid UTF8 (though that's not very useful) the odds are very very low.
That's why hashes (and binary data in general) are usually displayed in alternative representations e.g. hex, base32 or base64.

You could convert the hash to hex before printing it to prevent this

Neither of the other answers so far have covered what caused the Hash generated part of the answer to get overwritten.
Presumably you were running your program in a terminal. Terminals support various "terminal control codes" that give the terminal information such as which formatting they should use to output the text they're showing, and where the text should be output on the screen. These codes are made out of characters, just like strings are, and Unicode and UTF-8 are capable of representing the characters in question – the only difference from "regular" text is that the codes start with a "control character" rather than a more normal sort of character, but control characters have UTF-8 encodings of their own. So if you try to print some randomly generated UTF-8, there's a chance that you'll print something that causes the terminal to do something weird.
There's more than one terminal control code that could produce this particular output, but the most likely possibility is that the hash contained the byte b'\x0D', which UTF-8 decodes as the Unicode character U+000D. This is the terminal control code "CR", which means "print subsequent output at the start of the current line, overwriting anything currently there". (I use this one fairly frequently for printing progress bars, getting the new version of the progress bar to overwrite the old version of the progress bar.) The output that you posted is consistent with accidentally outputting CR, because some random Unicode full of replacement characters ended up overwriting the start of the line you were outputting – and because the code in question is only one byte long (most terminal control codes are much longer), the odds that it might appear in randomly generated UTF-8 are fairly high.
The easiest way to prevent this sort of thing happening when outputting arbitrary UTF-8 in Rust is to use the Debug implementation for str/String rather than the Display implementation – it will output control codes in escaped form rather than outputting them literally. (As the other answers say, though, in the case of hashes, it's usual to print them as hex rather than trying to interpret them as UTF-8, as they're likely to contain many byte sequences that aren't valid UTF-8.)

What does the "%" mean in tcl?

In a situation like this for example:
[% $create_port %]
or [list [% $RTL_LIST %]]
I realized it had to do with the brackets, but what confuses me is that sometimes it is used with the brackets and variable followed, and sometimes you have brackets with variables inside without the %.
So i'm not sure what it is used for.
Any help is appreciated.

% is not a metacharacter in the Tcl language core, but it still has a few meanings in Tcl. In particular, it's the modulus operator in expr and a substitution field specifier in format, scan, clock format and clock scan. (It's also the default prompt character, and I have a trivial pass-through % command in my ~/.tclshrc to make cut-n-pasting code easier, but nobody else in the world needs to follow my lead there!)
But the code you have written does not appear to be any of those (because it would be a syntax error in all of the commands I've mentioned). It looks like it is some sort of directive processing scheme (with the special sequences being [% and %], with the brackets) though not one I recognise such as doctools or rivet. Because a program that embeds a Tcl interpreter could do an arbitrary transformation to scripts before executing them, it's extremely difficult to guess what it might really be.

Can I treat all domain names as being IDNs without any ill effects?

From testing, it seems like trying to convert both IDNs and regular domain names 'just works' - eg, if the input doesn't need to be changed punycode will just return the input.
punycode.toASCII('lancôme.com');
returns:
'xn--lancme-lxa.com'
And
punycode.toASCII('apple.com');
returns:
'apple.com'
This looks great, but is it specified anywhere? Can I safely convert everything to punycode?

That is correct. If you look at how the procedure for converting unicode strings to ascii punycode, the process only alters any non-ascii character. Since regular domains cannot contain non-ascii characters, if your conversor is correctly implemented, it will never transform any pure-ascii string.
You can read more about how unicode is converted to punycode here: https://en.wikipedia.org/wiki/Punycode
Punycode is specified in RFC 3492: https://www.ietf.org/rfc/rfc3492.txt, and it clearly says:
"Basic code point segregation" is a very simple and
efficient encoding for basic code points occurring in the extended
string: they are simply copied all at once.
Therefore, if your extended string is made of basic code points, it will just be copied without change.

Windows SED command - simple search and replace without regex

How should I use 'sed' command to find and replace the given word/words/sentence without considering any of them as the special character?
In other words hot to treat find and replace parameters as the plain text.
In following example I want to replace 'sagar' with '+sagar' then I have to give following command
sed "s/sagar/\\+sagar#g"
I know that \ should be escaped with another \ ,but I can't do this manipulation.
As there are so many special characters and theie combinations.
I am going to take find and replace parameters as input from user screen.
I want to execute the sed from c# code.
Simply, I do not want regular expression of sed to use. I want my command text to be treated as plain text?
Is this possible?
If so how can I do it?

While there may be sed versions that have an option like --noregex_matching, most of them don't have that option. Because you're getting the search and replace input by prompting a user, you're best bet is to scan the user input strings for reg-exp special characters and escape them as appropriate.
Also, will your users expect for example, their all caps search input to correctly match and replace a lower or mixed case string? In that case, recall that you could rewrite their target string as [Ss][Aa][Gg][Aa][Rr], and replace with +Sagar.
Note that there are far fewer regex characters used on the replacement side, with '&' meaning "complete string that was matched", and then the numbered replacment groups, like \1,\2,.... Given users that have no knowledge or expectation that they can use such characters, the likelyhood of them using is \1 in their required substitution is pretty low. More likely they may have a valid use for &, so you'll have to scan (at least) for that and replace with \&. In a basic sed, that's about it. (There may be others in the latest gnu seds, or some of the seds that have the genesis as PC tools).
For a replacement string, you shouldn't have to escape the + char at all. Probably yes for \. Again, you can scan your user's "naive" input, and add escape chars as need.
Finally if you're doing this for a "package" that will be distributed, and you'll be relying on the users' version of sed, beware that there are many versions of sed floating around, some that have their roots in Unix/Linux, and others, particularly of super-sed, that (I'm pretty sure) got started as PC-standalones and has a very different feature set.
IHTH.

Best way to read output of shell command

In Vim, What is the best (portable and fast) way to read output of a shell command? This output may be binary and thus contain nulls and (not) have trailing newline which matters. Current solutions I see:
Use system(). Problems: does not work with NULLs.
Use :read !. Problems: won’t save trailing newline, tries to be smart detecting output format (dos/unix/mac).
Use ! with redirection to temporary file, then readfile(, "b") to read it. Problems: two calls for fs, shellredir option also redirects stderr by default and it should be less portable ('shellredir' is mentioned here because it is likely to be set to a valid value).
Use system() and filter outputs through xxd. Problems: very slow, least portable (no equivalent of 'shellredir' for pipes).
Any other ideas?

You are using a text editor. If you care about NULs, trailing EOLs and (possibly) conflicting encodings, you need to use a hex editor anyway?
If I need this amount of control of my operations, I use the xxd route indeed, with
:se binary
One nice option you seem to miss is insert mode expression register insertion:
C-r=system('ls -l')Enter
This may or may not be smarter/less intrusive about character encoding business, but you could try it if it is important enough for you.
Or you could use Perl or Python support to effectively use popen
Rough idea:
:perl open(F, "ls /tmp/ |"); my #lines = (<F>); $curbuf->Append(0, #lines)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio