Converting escaped characters to UTF-8 in bash - utf-8

I have a large text file containing sequences such as
\u02BBUtthay\u0101n h\u01E3ng Ch\u0101t Khao Yai
However, they render exactly as above. How do I convert this so people just see UTF-8? I would prefer to process the files at the command line if possible.

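Use the printf command. Bash's builtin printf (version 4.2 or later) expands \uXXXX escapes in its format string, and it emits UTF-8 when the current locale is a UTF-8 one.
http://manpages.ubuntu.com/manpages/intrepid/man3/printf.3.html
You can wrap it in $() to capture the result in a variable if needed, too.
For example,
echo "$(printf '\u02BBUtthay\u0101n h\u01E3ng Ch\u0101t Khao Yai')"
This outputs: ʻUtthayān hǣng Chāt Khao Yai
Hope that helps.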
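Since the question is about a whole file rather than a literal string, here is a minimal sketch of applying the same idea line by line. It assumes bash 4.2 or later and a UTF-8 locale; the function name decode_unicode_escapes is made up for illustration. Be aware that %b expands every backslash escape it recognizes (\n, \t and so on), not only \u, so it only suits files where that is acceptable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand \uXXXX escapes read from stdin, write to stdout.
# Assumes bash 4.2+ (builtin printf handles \u in %b arguments) and a UTF-8 locale.
decode_unicode_escapes() {
  local line
  # IFS= and -r keep leading whitespace and backslashes intact while reading;
  # the || clause handles a final line that has no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # %b expands backslash escapes in the argument (including \n, \t, not just \u).
    printf '%b\n' "$line"
  done
}
```

It could be used as decode_unicode_escapes <input.txt >output.txt, where both file names are placeholders.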

Related

Using space-separated arguments from a field in a tab-separated file

I'm writing a shell script intended to edit audio files using the sox command. I've been running into a strange problem I never encountered in bash scripting before: When defining space separated effects in sox, the command will work when that effect is written directly, but not when it's stored in a variable. This means the following works fine and without any issues:
sox ./test.in.wav ./test.out.wav delay 5
Yet for some reason the following will not work:
IFS=$'\t' # set IFS to only have a tab character because file is tab-separated
while read -r file effects text; do
sox $file.in.wav $file.out.wav $effects
done <in.txt
...when its in.txt is created with:
printf '%s\t%s\t%s\n' "test" "delay 5" "other text here" >in.txt
The error indicates this is causing it to see the output file as another input.
sox FAIL formats: can't open input file `./output.wav': No such file or directory
I tried everything I could think of: Using quotation marks (sox "$file.in.wav" "$file.out.wav" "$effects"), echoing the variable in-line (sox $file.in.wav $file.out.wav $(echo $effects)), even escaping the space inside the variable (effects="delay\ 5"). Nothing seems to work, everything produces the error. Why does one command work but not the other, what am I missing and how do I solve it?
IFS does not only change the behavior of read; it also changes the behavior of unquoted expansions.
In particular, the content of an unquoted expansion is split on the characters found in IFS, and each word resulting from that split is then expanded as a glob.
Thus, if you want the space between delay and 5 to be used for word splitting, you need to have a regular space, not just a tab, in IFS. If you move your IFS assignment to be part of the same simple command as the read, as in while IFS=$'\t' read -r file effects text; do, that will stop it from changing behavior in the rest of the script.
However, it's not good practice to use unquoted expansions for word-splitting at all. Use an array instead. You can split your effects string into an array with:
IFS=' ' read -r -a effects_arr <<<"$effects"
...and then run sox "$file.in.wav" "$file.out.wav" "${effects_arr[@]}" to expand each item in the array as a separate word.
By contrast, if you need quotes/escapes/etc to be allowed in effects, see Reading quoted/escaped arguments correctly from a string
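Putting the two pieces together, a minimal sketch might look like the following. The function name build_sox_args is hypothetical, and it prints the arguments one per line instead of calling sox, so the splitting can be checked without sox installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one tab-separated line from stdin and print the
# argument list that would be passed to sox, one argument per line.
build_sox_args() {
  local file effects text
  local -a effects_arr
  # IFS is set only for this read, so the rest of the script keeps the default.
  IFS=$'\t' read -r file effects text
  # Split the effects field on spaces into an array.
  IFS=' ' read -r -a effects_arr <<<"$effects"
  # Swap printf '%s\n' for sox to run the real command.
  printf '%s\n' "$file.in.wav" "$file.out.wav" "${effects_arr[@]}"
}
```

Fed the sample line from the question, this prints test.in.wav, test.out.wav, delay and 5 as four separate arguments.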

Echoing an environment variable, keeping newlines intact? [duplicate]

I want to create some scripts for filling some templates and inserting them into my project folder. I want to use a shell script for this, and the templates are very small so I want to embed them in the shell script. The problem is that echo seems to ignore the line breaks in my string. Either that, or the string doesn't contain line breaks to begin with. Here is an example:
MY_STRING="
Hello, world! This
Is
A
Multi lined
String."
echo -e $MY_STRING
This outputs:
Hello, world! This Is A Multi lined String.
I'm assuming echo is the culprit here. How can I get it to acknowledge the line breaks?
You need double quotes around the variable interpolation.
echo -e "$MY_STRING"
This is an all-too common error. You should get into the habit of always quoting strings, unless you specifically need to split into whitespace-separated tokens or have wildcards expanded.
So to be explicit, the shell will normalize whitespace when it parses your command line. You can see this if you write a simple C program which prints out its arguments (argv[0] holds the program name, so the arguments start at argv[1]) and pass it the unquoted variable.
argv[1]='Hello,'
argv[2]='world!'
argv[3]='This'
argv[4]='Is'
argv[5]='A'
argv[6]='Multi'
argv[7]='lined'
argv[8]='String.'
By contrast, with quoting, the whole string is in argv[1], newlines and all.
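The same effect can be observed without writing any C. This sketch uses a made-up shell function, count_args, as a stand-in that reports how many arguments it received, and it assumes IFS still has its default value.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the C program: print the number of arguments received.
count_args() {
  printf '%s\n' "$#"
}

MY_STRING="
Hello, world! This
Is
A
Multi lined
String."

count_args $MY_STRING     # unquoted: split on whitespace into 8 words
count_args "$MY_STRING"   # quoted: a single argument, newlines preserved
```

The first call prints 8 and the second prints 1, which is why echo only sees the line breaks when the variable is quoted.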
For what it's worth, also consider here documents (with cat, not echo):
cat <<"HERE"
foo
Bar
HERE
You can also interpolate a variable in a here document.
cat <<HERE
$MY_STRING
HERE
... although in this particular case, it's hardly what you want.
echo is so nineties. The new (POSIX) kid on the block is printf.
printf '%s\n' "$MY_STRING"
No -e or SYSV vs BSD echo madness and full control over what gets printed where and how wide, escape sequences like in C. Everybody please start using printf now and never look back.
Try this:
echo "$MY_STRING"

How to remove the decorate colors characters in bash output?

A console program (translate-shell) produces colored output and uses special escape sequences for the decoration: ^[[22m, ^[[24m, ^[[1m... and so on.
I'd like to remove them to get plain text.
I tried tr -d "^[[22m" and sed 's/[\^[[22m]//g', but only the number is removed, not the special character ^[
Thanks.
You have multiple options:
https://unix.stackexchange.com/questions/14684/removing-control-chars-including-console-codes-colours-from-script-output
http://www.commandlinefu.com/commands/view/3584/remove-color-codes-special-characters-with-sed
and the -no-ansi option, as pointed out by Jens in the other answer
EDIT
The solution from commandlinefu does the job pretty well:
sed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mK]//g"
The solution from unix.stackexchange might be better, but it is much longer, so you would want to put it in a separate script file rather than use it as a shell one-liner.
I found this in the manual about the use of ANSI escape codes:
-no-ansi
Do not use ANSI escape codes.
So you should add this option when starting the program.
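If the program's output has to be cleaned after the fact, a filter along these lines may be enough. This is a sketch, not the commandlinefu one-liner: the function name strip_ansi is made up, it only targets sequences of the form ESC [ digits-and-semicolons followed by m or K, and it uses bash's $'...' quoting to embed the ESC byte so that it does not depend on GNU sed's \x1B escape.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove color/erase escape sequences (ESC [ ... m or K)
# from stdin. Other kinds of terminal control sequences are left untouched.
strip_ansi() {
  # $'\x1b' is expanded by bash to the literal ESC byte before sed sees it;
  # \\[ becomes \[ so sed matches a literal opening bracket.
  sed $'s/\x1b\\[[0-9;]*[mK]//g'
}
```

For instance, printf '\033[1mbold\033[22m plain\n' | strip_ansi prints bold plain without the escape sequences.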

How to escape two bash variables when echoing them

I want to echo a text like this:
"I'm going to bed at "$'\cc3'"$var"$'\cc'
Sometimes it happens that the $var variable begins with a number and Bash is simply concatenating it or whatever. How could I escape the $var so it is separated but without a space between them?
The ANSI-C Quoting mechanism in Bash uses \cx to generate Control-X. Your use of $'\cc3' generates a Control-C (aka \003 or \x03) character followed by a digit 3.
Superficially, then, you want:
var=01:15
echo "I'm going to bed at "$'\cc'"$var"$'\cc'
which surrounds the time with Control-C characters (though quite why you want that, I'm not clear). If you're after the Unicode character U+0CC3 (KANNADA VOWEL SIGN VOCALIC R, ೃ, if you've got good Unicode support), then you need Bash 4.2 or later and $'\ucc3'.
If you're after something else, you need to explain what you're trying to echo with the ANSI-C Quoting.
You could try sending the control-c using the \nnn format instead of \c:
echo $'I\'m going to bed at \003'"$var"$'\003'
(I changed the quoting slightly just to reduce the number of context switches used to build the string.)
Or, save the control-c character in a variable:
cc=$'\cc'
echo "I'm going to bed at $cc$var$cc"
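As a small sketch of the variable approach, the function below (wrap_ctrl_c is a made-up name) keeps the control character in a variable and uses braces around each expansion, so the value is never mistaken for part of the escape even when it starts with a digit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: surround the first argument with Control-C (0x03) bytes.
wrap_ctrl_c() {
  local cc=$'\003'   # the same byte that $'\cc' produces
  # Braces mark where each variable name ends, so a leading digit in $1 is safe.
  printf '%s\n' "I'm going to bed at ${cc}${1}${cc}"
}
```

Calling wrap_ctrl_c 01:15 yields the sentence with the time enclosed in two Control-C bytes and no added spaces.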
