Suppose we have a name written in a non-Latin script, in a language like Arabic, Hebrew, Chinese, or Japanese.
How could a search engine match the original name with the English spelling of the same name, and vice versa?
Something like the name 拓海 in Japanese and the English spelling Takumi.
What is the algorithm/technique used to do this?
Good day.
You have to do the following:
Classify each language in the world by its own set of symbols.
All languages:
English [26 letters] a b c d e f g ...
Russian [33 letters] а б в г д е ....
Chinese [x letters] ....
Ukrainian [x letters] а б в г д ..... і
Japanese [x letters] ...
.................
Finally, you will have spelling rules between the symbols of any pair of languages.
Some languages, for instance Hindi, Chinese, etc., will not have such rules; for those you should create your own rules (based on the transcription of those languages).
The algorithm:
[w][e][п] = wep
 e  e  r
e - English
r - Russian
transcription[п] = p
That is, each character is tagged with its source language; the Latin characters pass through unchanged, and the Cyrillic п is transcribed to p (see the sketch below).
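As a rough illustration of that per-character scheme, here is a minimal Python sketch; the rule table is a made-up fragment, not a complete transliteration standard:

# Minimal sketch: Latin characters pass through unchanged, anything else is
# looked up in the language's transcription table (a made-up fragment here).
TRANSCRIPTION = {
    'rus': {'п': 'p', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd'},
}

def transliterate(text, lang):
    out = []
    for ch in text:
        if ch.isascii():                   # English letters keep their spelling
            out.append(ch)
        else:                              # apply transcription[ch] for this language
            out.append(TRANSCRIPTION[lang].get(ch, '?'))
    return ''.join(out)

print(transliterate('weп', 'rus'))         # -> wep, as in the example above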
Search engines (like Google) probably have a huge amount of data sets (corpora), each corpus in a different language.
When you want to translate a word from one language into another, it can be done by searching for the word in the corpus of the first language and returning the corresponding word in the corpus of the second language (the same technique works for names).
That's the basic idea.
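A minimal sketch of that lookup, assuming name pairs have already been aligned across corpora (the pairs below are illustrative, not real corpus data):

# Hypothetical aligned name pairs mined from parallel corpora (illustrative only).
ALIGNED_NAMES = [
    ('拓海', 'Takumi'),
    ('太郎', 'Taro'),
]

def match_name(name):
    # Look the name up on either side and return its counterpart.
    for original, english in ALIGNED_NAMES:
        if name == original:
            return english
        if name == english:
            return original
    return None

print(match_name('拓海'))    # -> Takumi
print(match_name('Takumi'))  # -> 拓海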
For some background, read about the NLP field here:
http://en.wikipedia.org/wiki/Natural_language_processing
I was wondering why the 𝐇𝟐𝐎.𝐚𝐢 in the question and the regular H2O.ai are different. Is it some sort of character map? I have seen it in many Instagram user descriptions. How do you generate it, and what is the purpose behind it? Any info will be appreciated.
Those characters in the "𝐇𝟐𝐎.𝐚𝐢", "H2O.ai" and "Ｈ２Ｏ.ａｉ" strings come from different Unicode subranges (blocks), except the full stops.
You can check their codepoints using the following Python code snippet; the dots (full stops) are removed from the sample test string:
# -*- coding: utf-8 -*-
import unicodedata

string = '𝐇𝟐𝐎𝐚𝐢 H2Oai Ｈ２Ｏａｉ'  # ℍ𝟚𝕆𝕒𝕚 Ⓗ②Ⓞⓐⓘ
print("\n" + string + "\n")

for letter in string:
    print(letter,                                 # character itself
          '{:02x}'.format(ord(letter)).rjust(5),  # codepoint (in hex)
          unicodedata.name(letter, '???')         # name of the character
          )
Output: .\SO\63984352.py
𝐇𝟐𝐎𝐚𝐢 H2Oai Ｈ２Ｏａｉ
𝐇 1d407 MATHEMATICAL BOLD CAPITAL H
𝟐 1d7d0 MATHEMATICAL BOLD DIGIT TWO
𝐎 1d40e MATHEMATICAL BOLD CAPITAL O
𝐚 1d41a MATHEMATICAL BOLD SMALL A
𝐢 1d422 MATHEMATICAL BOLD SMALL I
20 SPACE
H 48 LATIN CAPITAL LETTER H
2 32 DIGIT TWO
O 4f LATIN CAPITAL LETTER O
a 61 LATIN SMALL LETTER A
i 69 LATIN SMALL LETTER I
20 SPACE
Ｈ ff28 FULLWIDTH LATIN CAPITAL LETTER H
２ ff12 FULLWIDTH DIGIT TWO
Ｏ ff2f FULLWIDTH LATIN CAPITAL LETTER O
ａ ff41 FULLWIDTH LATIN SMALL LETTER A
ｉ ff49 FULLWIDTH LATIN SMALL LETTER I
You can use the printed codepoints in HTML entities like &#x48; &#x1d407; &#xff28;. Those will render as H 𝐇 Ｈ.
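To answer the "how to generate it" part: a minimal Python sketch (assuming plain ASCII input) that shifts letters and digits into the Mathematical Bold block, using the offsets implied by the codepoints printed above (A → U+1D400, a → U+1D41A, 0 → U+1D7CE):

def to_math_bold(text):
    out = []
    for ch in text:
        if 'A' <= ch <= 'Z':
            out.append(chr(0x1D400 + ord(ch) - ord('A')))   # bold capitals
        elif 'a' <= ch <= 'z':
            out.append(chr(0x1D41A + ord(ch) - ord('a')))   # bold smalls
        elif '0' <= ch <= '9':
            out.append(chr(0x1D7CE + ord(ch) - ord('0')))   # bold digits
        else:
            out.append(ch)                                  # e.g. the full stop
    return ''.join(out)

print(to_math_bold('H2O.ai'))   # -> 𝐇𝟐𝐎.𝐚𝐢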
My application is developed in C++11 and uses Qt5. In this application, I need to store UTF-8 text as a Windows-1250 encoded file.
I tried the two following ways, and both work except for the Romanian 'ș' and 'ț' characters :(
1.
auto data = QStringList() << ... <some texts here>;
QTextStream outStream(&destFile);
outStream.setCodec(QTextCodec::codecForName("Windows-1250"));
foreach (auto qstr, data)
{
    outStream << qstr << EOL_CODE;
}
2.
auto data = QStringList() << ... <some texts here>;
auto *codec = QTextCodec::codecForName("Windows-1250");
foreach (auto qstr, data)
{
    const QByteArray encodedString = codec->fromUnicode(qstr);
    destFile.write(encodedString);
}
In the case of the 'ț' character (UTF-8 bytes 0xC8 0x9B), instead of the expected 0xFE value, the character is encoded and stored as 0x3F ('?'), which is unexpected.
So I am looking for any help, experience, or examples regarding text recoding.
Best regards,
Do not confuse ț with ţ. The former is what is in your post, the latter is what's actually supported by Windows-1250.
The character ț from your post is T-comma, U+021B, LATIN SMALL LETTER T WITH COMMA BELOW, however:
This letter was not part of the early Unicode versions, which is why Ţ (T-cedilla, available from version 1.1.0, June 1993) is often used in digital texts in Romanian.
The character referred to is ţ, U+0163, LATIN SMALL LETTER T WITH CEDILLA (emphasis mine):
In early versions of Unicode, the Romanian letter Ț (T-comma) was considered a glyph variant of Ţ, and therefore was not present in the Unicode Standard. It is also not present in the Windows-1250 (Central Europe) code page.
The story of ş and ș (S-cedilla and S-comma) is analogous.
If you must encode to this archaic Windows-1250 code page, I'd suggest replacing the comma variants with the cedilla variants (both lowercase and uppercase) before encoding, as sketched below. I think Romanians will understand :)
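A minimal sketch of that replacement in Python (the question uses Qt/C++, but the mapping is the same four characters; in Qt an equivalent QString::replace per character before fromUnicode would do):

# Map the comma-below letters (absent from Windows-1250) to the cedilla
# letters the code page does contain.
COMMA_TO_CEDILLA = str.maketrans({
    '\u021B': '\u0163',   # ț -> ţ
    '\u021A': '\u0162',   # Ț -> Ţ
    '\u0219': '\u015F',   # ș -> ş
    '\u0218': '\u015E',   # Ș -> Ş
})

text = '\u021Bara'        # "țara", with T-comma
encoded = text.translate(COMMA_TO_CEDILLA).encode('cp1250')
print(encoded.hex())      # fe617261 -- 0xFE, the expected Windows-1250 ţ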
Short version
Given: 1/16/2006 2∶30∶11 ᴘᴍ
How to get: 1/16/2006 2:30:11 PM
rather than: ?1/?16/?2006 ??2:30:11 ??
Background
I have an example Unicode (UTF-16) encoded string:
U+200e U+0031 U+002f U+200e U+0031 U+0036 U+002f U+200e U+0032 U+0030 U+0030 U+0036 U+0020 U+200f U+200e U+0032 U+2236 U+0033 U+0030 U+2236 U+0031 U+0031 U+0020 U+1d18 U+1d0d
[LTR] 1 / [LTR] 1 6 / [LTR] 2 0 0 6 [RTL] [LTR] 2 ∶ 3 0 ∶ 1 1 ᴘ ᴍ
In a slightly easier-to-read form, it is:
LTR1/LTR16/LTR2006 RTLLTR2∶30∶11 ᴘᴍ
The actual final text, as you're supposed to see it, is: 1/16/2006 2∶30∶11 ᴘᴍ
I currently use the Windows function WideCharToMultiByte to convert the UTF-16 to the local code-page:
WideCharToMultiByte(CP_ACP, 0, text, length, NULL, 0, NULL, NULL);
and when I do, the text comes out as:
?1/?16/?2006 ??2:30:11 ??
I don't control the presence of the Unicode text direction markers; it's a security thing. But obviously, when I'm converting the Unicode to (for example) ISO-8859-1, those characters are irrelevant, make no sense, and I would hope can be dropped.
Is there a Windows function (e.g. FoldString, WideCharToMultiByte) that can be instructed to drop these non-mappable, non-printable characters?
1/16/2006 2∶30∶11 ᴘᴍ
That gets us close
If a function did that, i.e. dropped the non-printing characters that have no representation in the target code page, we would get:
1/16/2006 2∶30∶11 ᴘᴍ
When converted to ISO-8859-1, it becomes:
1/16/2006 2?30?11 ??
That's because some of those characters don't map exactly into ISO-8859-1:
1/16/2006 2[U+2236]30[U+2236]11 [U+1D18][U+1D0D]
1/16/2006 2[RATIO]30[RATIO]11 [SMALL CAPITAL P][SMALL CAPITAL M]
But when you see them, it doesn't seem unreasonable that they could be best-fit mapped into:
Original: 1/16/2006 2∶30∶11 ᴘᴍ
Mapped: 1/16/2006 2:30:11 PM
Is there a function that can do that?
I'm happy to suffer with:
1/16/2006 2?30?11 ??
But I really need to fix:
?1/?16/?2006 ??2:30:11 ??
Unicode has the notion
Unicode already has the notion of what "fancy" character you can replace with what "normal" character.
U+00BA º → o (MASCULINE ORDINAL INDICATOR → small letter o)
U+FF0F ／ → / (FULLWIDTH SOLIDUS → solidus)
U+00BC ¼ → 1/4 (VULGAR FRACTION ONE QUARTER)
U+2033 ″ → ′′ (DOUBLE PRIME → two primes)
U+FE64 ﹤ → < (SMALL LESS-THAN SIGN → less-than sign)
I know these are technically for a different purpose (compatibility decompositions). But there is also the general notion of a mapping list (which, again, is for a different purpose).
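For what it's worth, those particular examples are Unicode compatibility mappings, which Python exposes through unicodedata.normalize (shown here only to illustrate the mapping idea, not as a Win32 answer):

import unicodedata

for ch in '\u00ba\uff0f\u00bc\u2033\ufe64':
    print(ch, '->', unicodedata.normalize('NFKD', ch))
# º -> o, ／ -> /, ¼ -> 1⁄4 (with FRACTION SLASH), ″ -> ′′, ﹤ -> <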
Microsoft SQL Server, when asked to insert a Unicode string into a non-Unicode varchar column, does an even better job of best-fit mapping.
Is there a mapping list for the purpose of Unicode best-fit?
Because the reality is that it just makes a mess for users.
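No single Win32 call is offered here, but as a sketch of the two steps being asked for, in Python and with a hand-made table rather than any standard best-fit list: drop the format characters (LRM/RLM are general category Cf), then map the leftovers before encoding.

import unicodedata

# Hand-picked best-fit substitutions; not an official mapping table.
BEST_FIT = str.maketrans({
    '\u2236': ':',    # RATIO -> colon
    '\u1D18': 'P',    # LATIN LETTER SMALL CAPITAL P -> P
    '\u1D0D': 'M',    # LATIN LETTER SMALL CAPITAL M -> M
})

def to_latin1_best_fit(text):
    # Step 1: drop non-printing format characters such as U+200E/U+200F.
    stripped = ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')
    # Step 2: best-fit map, then encode; '?' remains only for true leftovers.
    return stripped.translate(BEST_FIT).encode('iso-8859-1', errors='replace')

s = '\u200e1/\u200e16/\u200e2006 \u200f\u200e2\u223630\u223611 \u1d18\u1d0d'
print(to_latin1_best_fit(s).decode('iso-8859-1'))   # 1/16/2006 2:30:11 PM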
I am new to using GNU Prolog.
Given the following facts:
theme(cafe).
role(manager).
role(boss).
role(coworker).
numberOfCharacters(theme(cafe), 3).
charactersRole(numberCharacters(theme(cafe), 3), role('boss'), role('manager'), role('çoworker')).
When I query:
charactersRole(numberCharacters(theme('cafe'), 3), role(X), role(Y), role(Z)).
It returns some of the values correctly, while one value contains 'Ã§' in place of the normal character 'c'.
X = boss
Y = manager
Z = 'Ã§oworker'
Thanks! :)
role('çoworker')
You have a cedilla right here, and it gets misrepresented as two characters, usually by something that is not Unicode-aware. This is not a Prolog issue.
Ã§ are U+00C3 U+00A7 in Unicode
And ç is
U+00E7 LATIN SMALL LETTER C WITH CEDILLA
UTF-8: 0xC3 0xA7
That's what you get when outputting a UTF-8 two-byte character to a non-UTF-8-aware Latin-1 terminal.
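A quick Python demonstration of that mechanism:

# Encode c-cedilla as UTF-8, then (mis)read the bytes as Latin-1, which is
# what a non-UTF-8-aware terminal effectively does.
s = '\u00e7'                  # ç LATIN SMALL LETTER C WITH CEDILLA
raw = s.encode('utf-8')       # b'\xc3\xa7' -- the two bytes from above
print(raw.decode('latin-1'))  # Ã§ (U+00C3 followed by U+00A7)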
I would like to replace the first (or the first and second) consonant in conjunct consonants with an underlined consonant, in Roman-transliterated text of an Indic language.
For example:
kiss bulk hind pyaar kyaa kranti mukhya inglish tren drive patni buddha >>>>>
kis̠s bul̠k hin̠d p̠yaar k̠ranti muk̠ya in̠g̠lish
t̠ren d̠rive pat̠ni bud̠dha
Please correct my guessed regular expression.
Replace
([bcdfghjklmnprstvwxyz]h?)([bcdfghjklmnprstvwxyz]h?)
with
$1$2
I have typed macrons below the letters here. They may or may not be Unicode macrons.
b̠c̠d̠f̠g̠h̠j̠k̠l̠m̠n̠p̠r̠s̠t̠v̠w̠x̠y̠z̠
http://www.lexilogos.com/keyboard/phonetic.htm
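Here is a corrected sketch of that substitution in Python, assuming the underline is COMBINING MACRON BELOW (U+0331); since the marks in the question may not be real Unicode macrons, swap in another combining character if needed. A lookahead leaves the next consonant unconsumed, so a three-consonant cluster gets two marks (as in in̠g̠lish):

import re

CONS = '[bcdfghjklmnprstvwxyz]'   # a consonant, per the question's class
NEXT = '[bcdfgjklmnprstvwxyz]'    # same class minus 'h' (assuming a bare 'h'
                                  # after a consonant is aspiration, not a cluster)
# First consonant (optionally with its 'h' digraph) followed by another
# consonant; the lookahead keeps that consonant available for the next match.
pattern = re.compile('(' + CONS + 'h?)(?=' + NEXT + ')')

def underline_clusters(text):
    # \u0331 is COMBINING MACRON BELOW; it attaches to the end of group 1.
    return pattern.sub('\\1\u0331', text)

print(underline_clusters('kiss bulk hind pyaar kranti inglish tren drive patni buddha'))
# kis̠s bul̠k hin̠d p̠yaar k̠ran̠ti in̠g̠lish t̠ren d̠rive pat̠ni bud̠dha
# (note: the "nt" in kranti also matches the rule as stated)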