I am trying to get my SMS gateway to send SMS using AT commands. I am connecting via SSH in a command prompt. I am a beginner at this, so please bear with me.
The following lines work:
DEV=/dev/ttyACM1
DESTNUM="PHONENUMBER"
SMS="Test SMS from ø"
echo -e "ATZ\r" >$DEV
echo -e "AT+CMGF=1\r" >$DEV
echo -e "AT+CMGS=\"$DESTNUM\"\r" >$DEV
echo -e "$SMS\x1A" >$DEV
So the above sends the SMS correctly, but it does not include the Danish letter "ø". Usually this is included in UTF-8.
How do I get my code working with the danish characters? Any ideas?
By default, short messages are sent using the GSM alphabet (03.38), defined by 3GPP TS 23.038. This setting can be changed by issuing the AT+CSMP command (Set Text Mode Parameters - I'll not examine it in depth in this answer).
More specifically, a 7-bit alphabet is used, so that the device can encode 8 characters in 7 bytes (8×7 = 56 bits), saving "precious space" to send some more information.
This alphabet is a clever derivation of 7-bit ASCII. A set of codes used in ASCII for "useless" characters (especially control characters) is instead used to add specific characters from alphabets such as Danish:
Danish character   GSM 03.38 code
Ø                  \x0B
ø                  \x0C
Å                  \x0E
å                  \x0F
Æ                  \x1C
æ                  \x1D
For this reason, all you need is a small routine that converts these specific Danish characters into the corresponding GSM 03.38 codes before sending.
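A minimal bash sketch of such a routine, assuming the mapping in the table above (the function name to_gsm and the sample message are illustrative only):

```shell
#!/bin/bash
# Sketch: replace Danish characters with their GSM 03.38 byte values
# before writing the message body to the modem device.
to_gsm() {
    local s=$1
    s=${s//Ø/$'\x0b'}
    s=${s//ø/$'\x0c'}
    s=${s//Å/$'\x0e'}
    s=${s//å/$'\x0f'}
    s=${s//Æ/$'\x1c'}
    s=${s//æ/$'\x1d'}
    printf '%s' "$s"
}

SMS="Test SMS from ø"
GSM_SMS=$(to_gsm "$SMS")
# The last line of the original sequence then becomes:
# echo -e "${GSM_SMS}\x1A" >$DEV
```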
Note: This self-answered question describes a problem that is specific to using Eclipse Mosquitto on Windows; there, however, it affects both Windows PowerShell and the cross-platform PowerShell (Core) edition.
I use something like the following mosquitto_pub command to publish a message:
mosquitto_pub -h test.mosquitto.org -t tofol/test -m '{ \"label\": \"eé\" }'
Note: The extra \-escaping of the " characters, still required as of PowerShell 7.1, shouldn't be necessary, but that is a separate problem - see this answer.
Receiving that message via mosquitto_sub unexpectedly mangles the non-ASCII character é and prints Θ instead:
PS> $msg = mosquitto_sub -h test.mosquitto.org -t tofol/test; $msg
{ "label": "eΘ" } # !! Note the 'Θ' instead of 'é'
Why does this happen?
How do I fix the problem?
Problem:
While the mosquitto_sub man page makes no mention of character encoding as of this writing, it seems that on Windows mosquitto_sub exhibits nonstandard behavior in that it uses the system's active ANSI code page to encode its string output rather than the OEM code page that console applications are expected to use.[1]
There also appears to be no option that would allow you to specify what encoding to use.
PowerShell decodes output from external applications into .NET strings, based on the encoding stored in [Console]::OutputEncoding, which defaults to the OEM code page. Therefore, when it sees the ANSI byte representation of character é, 0xe9, in the output, it interprets it as the OEM representation, where it represents character Θ (the assumption is that the active ANSI code page is Windows-1252, and the active OEM code page IBM437, as is the case in US-English systems, for instance).
You can verify this as follows:
# 0xe9 is "é" in the (Windows-1252) ANSI code page, and coincides with *Unicode* code point
# U+00E9; in the (IBM437) OEM code page, 0xe9 represents "Θ".
PS> $oemEnc = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage OEMCP));
$oemEnc.GetString([byte[]] 0xe9)
Θ # Greek capital letter theta
Note that the decoding to .NET strings (System.String) that invariably happens means that the characters are stored as UTF-16 code units in memory, essentially as [uint16] values underlying the System.Char instances that make up a .NET string. Such a code unit encodes a Unicode character either in full, or - for characters outside the so-called BMP (Basic Multilingual Plane) - half of a Unicode character, as part of a so-called surrogate pair.
In the case at hand this means that the Θ character is stored as a different code point than the original é (U+00E9): namely, Θ (Greek capital letter theta, U+0398).
Solution:
Note: A simple way to solve the problem is to activate system-wide support for UTF-8 (available in Windows 10), which sets both the ANSI and the OEM code page to 65001, i.e. UTF-8. However, this feature is (a) still in beta as of this writing and (b) has far-reaching consequences - see this answer for details.
That said, it amounts to the most fundamental solution, as it also makes cross-platform Mosquitto use work properly (on Unix-like platforms, Mosquitto uses UTF-8).
PowerShell must be instructed what character encoding to use in this case, which can be done as follows:
PS> $msg = & {
# Save the original console output encoding...
$prevEnc = [Console]::OutputEncoding
# ... and (temporarily) set it to the active ANSI code page.
# Note: In *Windows PowerShell* - only - [System.Text.Encoding]::Default works as the RHS too.
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))
# Now PowerShell will decode mosquitto_sub's output correctly.
mosquitto_sub -h test.mosquitto.org -t tofol/test
# Restore the original encoding.
[Console]::OutputEncoding = $prevEnc
}; $msg
{ "label": "eé" } # OK
Note: The Get-ItemPropertyValue cmdlet requires PowerShell version 5 or higher; in earlier versions, either use [Console]::OutputEncoding = [System.Text.Encoding]::Default or, if the code must also run in PowerShell (Core), [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP).ACP)
Helper function Invoke-WithEncoding can encapsulate this process for you. You can install it directly from a Gist as follows (I can assure you that doing so is safe, but you should always check the source yourself):
# Download and define advanced function Invoke-WithEncoding in the current session.
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex
The workaround then simplifies to:
PS> Invoke-WithEncoding -Encoding Ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test }
{ "label": "eé" } # OK
A similar function focused on diagnostic output is Debug-NativeInOutput, discussed in this answer.
As an aside:
While PowerShell isn't the problem here, it too can exhibit problematic character-encoding behavior.
GitHub issue #7233 proposes making PowerShell (Core) windows default to UTF-8 to minimize encoding problems with most modern command-line programs (it wouldn't help with mosquitto_sub, however), and this comment fleshes out the proposal.
[1] Note that Python too exhibits this nonstandard behavior, but it offers UTF-8 encoding as an opt-in, either by setting environment variable PYTHONUTF8 to 1, or via the v3.7+ CLI option -X utf8 (must be specified case-exactly!).
I'm putting together a script and need to take a file's content as input for setting a variable. I'm using Out-File to produce a text file:
$string | Out-File -FilePath C:\Full\Path\To\file.txt -NoNewLine
Then I am using that file to set a variable in batch:
set /P variablename=<C:\Full\Path\To\file.txt
The content of that file is a unique id string that looks practically like this:
1i32l54bl5b2hlthtl098
When I echo this variable, I get this:
echo %variablename%
■1
When I have tried a different string in the input file, I see that what is being echoed is the ■ character and then the first character in the string. So, if my string was "apfvuu244ty0vh" then it would echo "■a" instead.
Why isn't the variable being set to the content of the file? I'm using the method from this Stack Overflow post, where the chosen answer says to use this syntax with the set command. Am I doing something wrong? Is there perhaps a problem with using a full path as input to a set variable?
tl;dr:
Use Out-File -Encoding oem to produce files that cmd.exe reads correctly.
This effectively limits you to the 256 characters available in the legacy "ANSI" / OEM code pages, except NUL (0x0). See bottom section if you need full Unicode support.
In Windows PowerShell (but not PowerShell Core), Out-File and its effective alias > default to UTF-16LE character encoding, where most characters are represented as 2-byte sequences; for characters in the ASCII range, the 2nd byte of each sequence is NUL (0x0); additionally, such files start with a BOM that indicates the type of encoding.
By contrast, cmd.exe expects input to use the legacy single-byte OEM encoding (note that starting cmd.exe with /U only controls the encoding of its output).
When cmd.exe (unbeknownst to it) encounters UTF-16LE input:
It interprets the bytes individually as characters (even though a character in UTF-16LE is typically composed of 2 bytes or, in rare cases, of 4, as a pair of 2-byte units).
It interprets the 2 bytes that make up the BOM (0xff, 0xfe) as part of the string. With OEM code page 437 (US-English) in effect, 0xff renders like a space, whereas 0xfe renders as ■.
Reading stops once the first NUL (0x0 byte) is encountered, which happens with the 1st character from the ASCII range, which in your sample string is 1.
Therefore, string 1i32l54bl5b2hlthtl098 encoded as UTF-16LE is read as ■1, as you state.
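On a Unix-like system (or in Git Bash) you can reproduce the byte layout described above with iconv and od; this is merely an illustration of the encoding, not part of the fix:

```shell
# Each ASCII character becomes a 2-byte little-endian unit whose second
# byte is NUL (0x0) - exactly what trips up cmd.exe's set /P.
printf '1i32' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
# Output: 31 00 69 00 33 00 32 00
# (Out-File additionally prepends the BOM bytes 0xff 0xfe.)
```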
If you need full Unicode support, use UTF-8 encoding:
Use Out-File -Encoding utf8 in PowerShell.
Before reading the file in cmd.exe (in a batch file), run chcp 65001 in order to switch to the UTF-8 code page.
Caveats:
Not all Unicode characters may render correctly, depending on the font used in the console window.
Legacy applications may malfunction with code page 65001 in effect, especially on older Windows versions.
A possible strategy to avoid problems is to temporarily switch to code page 65001, as needed, and then switch back.
Note that the above only covers communication via files, and only in one direction (PowerShell -> cmd.exe).
To also control the character encoding used for the standard streams (stdin, stdout, stderr), both when sending strings to cmd.exe / external programs and when interpreting strings received from them, see this answer of mine.
I have a decompiled stardict dictionary in the form of a tab file
κακός <tab> bad
where <tab> signifies a tabulation.
Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.
Thus I'd like to convert the whole file so that the keyword has the diacritic removed. So the line would become
κακος <tab> <h3>κακός</h3> <br/> bad
I know I could read the file line by line in bash, as described here [1]
while read line
do
command
done <file
But is there any way to automate the conversion of each line? I heard about iconv [2] but didn't manage to achieve the desired conversion using it. I'd prefer a bash script.
Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?
/edit: Maybe we could use the Unicode codes? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.
[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?
You can remove diacritics from a string relatively easily using Perl:
$_=NFKD($_);s/\p{InDiacriticals}//g;
for example:
$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω
This works as follows:
The -CS enables UTF8 for Perl's stdin/stdout
The -MUnicode::Normalize loads a library for Unicode normalisation
-e executes the script from the command line; -n automatically loops over lines in the input; -p prints the output automatically
NFKD() translates the line into one of the Unicode normalisation forms; this means that accents and diacritics are decomposed into separate characters, which makes it easier to remove them in the next step
s/\p{InDiacriticals}//g removes all characters that Unicode denotes as diacritical marks
This should in fact work for removing diacritics etc for all scripts/languages that have good Unicode support, not just Greek.
I'm not as familiar with Ancient Greek as I am with Modern Greek (which only really uses two diacritics). However, I went through the vowels and found out which ones combine with diacritics. This gave me the following list:
ἆἂᾶὰάἀἄ
ἒὲέἐἔ
ἦἢῆὴήἠἤ
ἶἲῖὶίἰἴ
ὂὸόὀὄ
ὖὒῦὺύὐὔ
ὦὢῶὼώὠὤ
I saved this list as a file and passed it to this sed
cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'
Credit to hungnv
It's a simple sed command. It takes each of the accented variants and replaces it with the unmarked character. The result of the above command is:
ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
Regarding transliterating the Greek: the image from your post is intended to help the user type Greek on the site you took it from using similar glyphs, not always similar sounds. Those are poor transliterations; e.g., β is most often transliterated as v, ψ as ps, φ as ph, etc.
When I use a Unicode 6.0 character (for example, 'beer mug') in Bash (4.3.11), it doesn't display correctly.
Just copying and pasting the character is okay, but if I use its UTF-16 hex code like
$ echo -e '\ud83c\udf7a'
the output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets unicode escapes in the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'
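If all you have is the surrogate pair, the code point that bash's \U escape expects can be computed with shell arithmetic (the values below are the beer-mug pair from the question):

```shell
# Combine a UTF-16 surrogate pair into a Unicode code point:
# cp = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
hi=0xd83c lo=0xdf7a
cp=$(( (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000 ))
printf 'U+%08X\n' "$cp"
# U+0001F37A  -> usable as: echo -e '\U0001F37A'
```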
Long story short:
+ I'm using ffmpeg to check the artist name of an MP3 file.
+ If the artist has Asian characters in its name, the output is UTF-8.
+ If it just has ASCII characters, the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is that if the artist has, for example, an "ä" in the name, the output is ASCII - just not US-ASCII - so "ä" is not valid UTF-8 and is skipped.
How can I tell whether the output text file from ffmpeg is UTF-8 or not? The application does not have any switches, and I just think it's plain dumb not to always go with UTF-8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
If anyone knows of a Windows version?
Thanks a lot beforehand, guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without BOM (Byte Order Mask) and choose the best Encoding ...
You say "ä" is not valid UTF-8... This is not correct.
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system for encoding Unicode code points. The issue of validity is not in the character itself; it is a question of how it has been encoded.
There are many systems which can encode Unicode code points; UTF-8 is one and UTF-16 is another. "ä" is quite legal in the UTF-8 system. Actually, all characters are valid, so long as each character has a Unicode code point.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode code point system. Unicode itself is nothing more than a big look-up table. What does the work is the encoding system, e.g. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values in a single byte, just as ASCII does, the data in an ASCII file is identical to a file with the same data which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (i.e., the first 128 characters).
You can check a file for 7-bit ASCII compliance:
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance:
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"
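As an aside, on systems that lack isutf8 but have iconv (including Windows via MSYS2, Cygwin, or Git Bash), iconv's exit status can serve the same purpose, since it fails on invalid input. The file name and sample bytes below are illustrative only:

```shell
# Create a sample file; \xe4 is Latin-1 "ä", which is NOT valid UTF-8.
printf 'h\xe4llo\n' > sample.txt

# iconv exits non-zero if the input is not valid UTF-8:
if iconv -f UTF-8 -t UTF-8 sample.txt >/dev/null 2>&1; then
    echo "sample.txt is valid UTF-8"
else
    echo "sample.txt is NOT valid UTF-8"
fi
# prints: sample.txt is NOT valid UTF-8
```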