iconv can convert special characters like ö (odiaeresis) to ASCII characters like o when used with //TRANSLIT. Is there a table of characters somewhere that lists how those transformations work? I already poked around the source code but am not familiar enough with C to find what I'm looking for.
The table is defined in the file translit.def, which you can find inside the libiconv sources. The library can be downloaded from https://ftp.gnu.org/gnu/libiconv/. I've extracted the first few lines of the table, which are shown below:
# Definition of transliteration from Unicode to poorer character sets.
#
# This covers all of Markus Kuhn's TARGET1.
#
# The second column gives the transliteration. It is enclosed between tabs!
#
00A0 # NO-BREAK SPACE
00A1 ! # INVERTED EXCLAMATION MARK
00A2 c # CENT SIGN
00A3 lb # POUND SIGN
00A4 # CURRENCY SIGN
00A5 yen # YEN SIGN
00A6 | # BROKEN BAR
00A7 SS # SECTION SIGN
00A8 " # DIAERESIS
.
.
.
You can see that ¡ (0x00A1) is transliterated into !, ¢ (0x00A2) into c, £ (0x00A3) into lb, and so on.
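As a rough illustration of how such a table could be read programmatically, here is a minimal Python sketch. It assumes the layout described in the file's own header comment (code point, tab, transliteration, tab, comment); the names `SAMPLE` and `parse_translit` are made up for this example, and the real translit.def may contain entries this sketch does not handle.

```python
# A few entries copied from the excerpt above, with the tabs written explicitly.
SAMPLE = (
    "# Definition of transliteration from Unicode to poorer character sets.\n"
    "00A1\t!\t# INVERTED EXCLAMATION MARK\n"
    "00A2\tc\t# CENT SIGN\n"
    "00A3\tlb\t# POUND SIGN\n"
    "00A5\tyen\t# YEN SIGN\n"
)


def parse_translit(text):
    """Map each source character to its transliteration string."""
    table = {}
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not in the assumed "code<TAB>translit<TAB>comment" form
        table[chr(int(fields[0], 16))] = fields[1]
    return table


table = parse_translit(SAMPLE)
print(table["\u00a3"])  # the entry for POUND SIGN
```

Splitting on tabs rather than on whitespace matters here, because a transliteration may itself be a space or contain one.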
I have some text files and am getting the first line(s) into a variable using var=$(head -n 1 "$#"), however the variable contains special characters that I want removed (ASCII 1-31).
Is there a quick way to strip ASCII codes 1-31 from the end of a variable? I've used ${var//[^[:ascii:]]/} and var="${var//[$'\t\r\n']}" already, however I need ASCII 1-31 removed from the end in a simple way (not just CR/LF/Tab/FF/etc.).
There's a character class for control characters – quote from the grep manual:
[:cntrl:]
Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
So, to remove every control character from the variable (not only the trailing ones), you could do
var=${var//[[:cntrl:]]}
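If only the control characters at the end of the string should go, the same idea can be anchored to the end. The following is a minimal Python sketch rather than a bash solution; it assumes the set of control characters from the grep manual quote above (codes 0 through 31 plus DEL), and the name `strip_trailing_control` is made up for this example.

```python
import re

# One or more control characters (0x00-0x1F and 0x7F) at the very end of the string.
TRAILING_CONTROL = re.compile(r"[\x00-\x1f\x7f]+\Z")


def strip_trailing_control(s):
    """Remove control characters from the end of s, leaving inner ones alone."""
    return TRAILING_CONTROL.sub("", s)


print(repr(strip_trailing_control("first line\x01\x02\r\n")))
print(repr(strip_trailing_control("a\tb\x1f")))
```

The tab inside the second string survives because the pattern only matches a run of control characters that reaches the end of the string.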
I often have trouble finding esoterica in the Ruby docs, and this is a case in point. Where in the official docs can I read up on using a backslash character \ to indicate line and/or string continuation? Thanks!
https://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/syntax.html
Ruby programs are sequence of expressions. Each expression are delimited by semicolons(;) or newlines. Backslashes at the end of line does not terminate expression.
https://ruby-doc.org/docs/ruby-doc-bundle/ProgrammingRuby/book/language.html
You can also put a backslash at the end of a line to continue it onto the next.
https://ruby-doc.org/docs/ruby-doc-bundle/UsersGuide/rg/misc.html
If a line ends with a backslash (\), the linefeed following it is ignored; this allows you to have a single logical line that spans several lines.
https://ruby-doc.org/docs/ruby-doc-bundle/Tutorial/part_02/loops.html
You can make lines "wrap around" by putting a backslash - \ - at the very end of the line.
I guess the "most official" place this is documented is in Section 8.4 Whitespace of the ISO Ruby Language Specification:
whitespace ::
0x09 | 0x0b | 0x0c | 0x0d | 0x20 | line-terminator-escape-sequence
line-terminator-escape-sequence ::
\ line-terminator
Where line-terminator in turn is defined in Section 8.3 Line terminators as follows:
line-terminator ::
0x0d? 0x0a
[Note: the ? is supposed to be a superscript, indicating optionality, but that is hard to write in a code block.]
So, put the two together, and it says that a backslash followed by LF or CRLF is considered whitespace.
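The two grammar rules can be written down as a single regular expression. This is only a Python sketch of the rule as quoted above, not part of Ruby or its specification; the names `LINE_TERMINATOR_ESCAPE` and `is_line_continuation` are made up for this example.

```python
import re

# line-terminator                 ::= 0x0d? 0x0a      (LF or CRLF)
# line-terminator-escape-sequence ::= \ line-terminator
LINE_TERMINATOR_ESCAPE = re.compile(r"\\\r?\n")


def is_line_continuation(s):
    """True if s is exactly a backslash followed by LF or CRLF."""
    return LINE_TERMINATOR_ESCAPE.fullmatch(s) is not None


print(is_line_continuation("\\\n"))    # backslash + LF
print(is_line_continuation("\\\r\n"))  # backslash + CRLF
print(is_line_continuation("\\\r"))    # backslash + lone CR is not covered by the rule
```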
I have a decompiled stardict dictionary in the form of a tab file
κακός <tab> bad
where <tab> signifies a tabulation.
Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.
Thus I'd like to convert the whole file so that the keyword has the diacritic removed. So the line would become
κακος <tab> <h3>κακός</h3> <br/> bad
I know I could read the file line by line in bash, as described here [1]
while read line
do
command
done <file
But is there any way to automate the conversion of each line? I heard about iconv [2] but didn't manage to achieve the desired conversion using it. Ideally I'd like to use a bash script.
Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?
/edit: Maybe we could use the Unicode codes? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.
[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?
You can remove diacritics from a string relatively easily using Perl:
$_=NFKD($_);s/\p{InDiacriticals}//g;
for example:
$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω
This works as follows:
The -CS enables UTF-8 for Perl's stdin/stdout
The -MUnicode::Normalize loads a library for Unicode normalisation
-e executes the script from the command line; -n automatically loops over lines in the input; -p prints the output automatically
NFKD() translates the line into one of the Unicode normalisation forms; this means that accents and diacritics are decomposed into separate characters, which makes it easier to remove them in the next step
s/\p{InDiacriticals}//g removes all characters that Unicode denotes as diacritical marks
This should in fact work for removing diacritics etc for all scripts/languages that have good Unicode support, not just Greek.
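The same decompose-then-strip idea, applied to the tab-separated line format from the question, might look like this in Python. It is a minimal sketch under a few assumptions: each line has exactly one keyword before the first tab, every combining mark is dropped (slightly broader than the \p{InDiacriticals} block used in the Perl version), and the function names are made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFKD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def convert_line(line):
    """Turn 'keyword<TAB>definition' into the searchable form from the question."""
    keyword, definition = line.rstrip("\n").split("\t", 1)
    return f"{strip_diacritics(keyword)}\t<h3>{keyword}</h3> <br/> {definition}"


# Example: the keyword loses its accent, the original spelling moves into the <h3>.
converted = convert_line("κακός\tbad")
```

Keeping the original keyword inside the definition means the accented form is still displayed, while lookups only need the bare letters.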
I'm not as familiar with Ancient Greek as I am with Modern Greek (which only really uses two diacritics).
However, I went through the vowels and found out which ones combine with diacritics. This gave me the following list:
ἆἂᾶὰάἀἄ
ἒὲέἐἔ
ἦἢῆὴήἠἤ
ἶἲῖὶίἰἴ
ὂὸόὀὄ
ὖὒῦὺύὐὔ
ὦὢῶὼώὠὤ
I saved this list as a file and passed it to this sed
cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'
Credit to hungnv
It's a simple sed. It takes each of the options and replaces it with the unmarked character. The result of the above command is:
ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
Regarding transliterating the Greek: the image from your post is intended to help the user type in Greek on the site you took it from using similar glyphs, not always similar sounds. Those are poor transliterations. e.g. β is most often transliterated as v. ψ is ps. φ is ph, etc.
I am running Windows 7 and (have to) use Turbo Grep (Borland something) to search in a file.
I have two versions of this file, one encoded in UTF-8 and one in ANSI.
If I run the following grep on the ANSI file, I get the expected results, but I get no results with the same statement on the UTF-8 file:
grep -ni "[äöü]" myfile.txt
[-n for line numbers, -i for ignoring cases]
The Turbo Grep Version is :
Turbo GREP 5.6 Copyright (c) 1992-2010 Embarcadero Technologies, Inc.
Syntax: GREP [-rlcnvidzewoqhu] searchstring file[s] or #filelist
GREP ? for help
Help for this command lists:
Options are one or more option characters preceded by "-", and optionally
followed by "+" (turn option on), or "-" (turn it off). The default is "+".
-r+ Regular expression search -l- File names only
-c- match Count only -n- Line numbers
-v- Non-matching lines only -i- Ignore case
-d- Search subdirectories -z- Verbose
-e Next argument is searchstring -w- Word search
-o- UNIX output format Default set: [0-9A-Z_]
-q- Quiet: supress normal output
-h- Supress display of filename
-u xxx Create a copy of grep named 'xxx' with current options set as default
A regular expression is one or more occurrences of: One or more characters
optionally enclosed in quotes. The following symbols are treated specially:
^ start of line $ end of line
. any character \ quote next character
* match zero or more + match one or more
[aeiou0-9] match a, e, i, o, u, and 0 thru 9 ;
[^aeiou0-9] match anything but a, e, i, o, u, and 0 thru 9
Is there a problem with the encoding of these characters in UTF-8? Might there be a problem with Turbo Grep and UTF-8?
Thanks in advance
Yes, there is a difference. Windows 7 uses UTF-16 little endian internally, not UTF-8; UTF-8 is the usual encoding on Unix, Linux and Plan 9, to name a few operating systems.
Jon Skeet explains:
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system" which is obtained via Encoding.Default, and is often Windows-1252
UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
In an ANSI code page such as Windows-1252, each of ä, ö and ü is a single byte, so a tool that compares bytes can match them directly. In UTF-8 each of them takes two bytes, which is the likely reason the same search finds nothing in the UTF-8 file.
If you use only ASCII, both encodings are usable, but with special characters such as ä, ö and ü the search tool has to understand the encoding of the file.
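The byte-level difference is easy to check. The following Python sketch assumes that "ANSI" means Windows-1252 here and that the search tool compares raw bytes; neither is confirmed for Turbo Grep, and the sample word is made up for this example.

```python
def byte_forms(ch):
    """Return the Windows-1252 and UTF-8 encodings of a single character."""
    return ch.encode("cp1252"), ch.encode("utf-8")


# The search pattern as a byte-oriented tool would see it when typed in Windows-1252.
pattern = "ä".encode("cp1252")

line = "Käse"
found_in_ansi = pattern in line.encode("cp1252")
found_in_utf8 = pattern in line.encode("utf-8")

print(found_in_ansi)  # True: the single byte 0xE4 is present
print(found_in_utf8)  # False: UTF-8 stores the letter as the two bytes 0xC3 0xA4
```

Converting the UTF-8 file to the code page the tool expects before searching would sidestep the mismatch.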
I am trying to write a regular expression that allows strings made of lowercase and uppercase letters and digits (a-zA-Z0-9), plus the characters .-_
How do I make such a regex?
The following regex should be what you are looking for (explanation below):
\A[-\w.]*\z
The following character class should match only the characters that you want to allow:
[-a-zA-Z0-9_.]
You could shorten this to the following since \w is equivalent to [a-zA-Z0-9_]:
[-\w.]
Note that to include a literal - in your character class, it needs to be the first or last character, because otherwise it will be interpreted as a range (for example [a-d] is equivalent to [abcd]). The other option is to escape it with a backslash.
Normally . means any character except newlines, and you would need to escape it to match a literal period, but this isn't necessary inside of character classes.
The \A and \z are anchors to the beginning and end of the string, otherwise you would match strings that contain any of the allowed characters, instead of strings that contain only the allowed characters.
The * means zero or more characters, if you want it to require one or more characters change the * to a +.
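For comparison, here is a minimal sketch of the same check in Python. It uses `re.fullmatch` in place of the \A and \z anchors, and the `re.ASCII` flag so that \w stays limited to [a-zA-Z0-9_]; the name `is_allowed` is made up for this example.

```python
import re

# Hyphen first in the class so it is literal; the dot needs no escape inside [].
ALLOWED = re.compile(r"[-\w.]*", re.ASCII)


def is_allowed(s):
    """True if s consists only of letters, digits, '_', '-' and '.'."""
    return ALLOWED.fullmatch(s) is not None


print(is_allowed("file-name_1.txt"))  # True
print(is_allowed("a b"))              # False: space is not allowed
print(is_allowed("a^b"))              # False: ^ is not allowed
```

As with the * in the pattern above, the empty string is accepted; change * to + to require at least one character.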
/\A[\w\-\.]+\z/
\w means alphanumeric (case-insensitive) and "_"
\- means dash
\. means period
\A means beginning (even "stronger" than ^)
\z means end (even "stronger" than $)
for example:
>> 'a-zA-z0-9._' =~ /\A[\w\-\.]+\z/
=> 0 # this means a match
UPDATED thanks phrogz for improvement