Ruby print text, characters in Hindi, Sanskrit etc languages - ruby

I want to print text/characters in Hindi/Sanskrit language.Can anybody guide me using Ruby language how can I achieve this? Are there any libraries/gems available to achieve this? I tried searching for this but could not find desired resources.Basically various webistes displays its contents in Hindi, Gujarati, Sanskrit etc languages but I guess that rendering task is done by browser using font files.Using a programming language how we can achieve the same?
Thanks.

In ruby we have [ ].pack('U*') method which converts the given array's value(as Unicode values) to corresponding characters and "XYZ".unpack('U*') which returns me the Unicode values of each characters.
After executing,
[2309].pack('U*')
I got अ
Let's understand it with an example,
I want मिरो में आपके लिए एक नया सिलाई आदेश
Then execute
"मिरो में आपके लिए एक नया सिलाई आदेश".unpack('U*')
This will return me Unicode values of each character of above string.
[32, 2350, 2375, 2306, 32, 2310, 2346, 2325, 2375, 32, 2354, 2367, 2319, 32, 2319, 2325, 32, 2344, 2351, 2366, 32, 2360, 2367, 2354, 2366, 2312, 32, 2310, 2342, 2375, 2358]
Then to regenerate the above Hindi string,
[2350, 2367, 2352, 2379, 32, 2350, 2375, 2306, 32, 2310, 2346, 2325, 2375, 32, 2354, 2367, 2319, 32, 2319, 2325, 32, 2344, 2351, 2366, 32, 2360, 2367, 2354, 2366, 2312, 32, 2310, 2342, 2375, 2358].pack('U*')
Note: .unpack('U*') return's unicode since we asked for it by mentioning 'U*', This unpack/pack methods are also works for binary and many other forms.
Read more about pack/unpack

It's nout about the server-side language - it's about using the proper character encoding for your data. UTF-8 is the most common format used to support international languages. Most of India is covered. The browser does all the work and no font additions are required (unless you're getting really fancy with the typography).
उदाहरण के लिये
See: http://www.unicode.org/standard/supported.html

You need to use the right encoding, which for interoperability, is UTF-8. To use UTF-8 in your program you need to add the following comment at the top of your file:
#encoding: utf-8
Then, you need to use your text editors way of inserting symbols in in Hindi or other languages into strings, and Ruby will happily print them.

Related

Do huggingface translation models support separate vocabulary for source and target?

Every example I've looked at so far seems to use a shared vocabulary between source and target languages, and I'm wondering if that is a hard-coded constraint of the Huggingface models, or my misunderstanding, or I've just not looked in the right place yet?
To take a random example, when I look at the files here, https://huggingface.co/Helsinki-NLP/opus-mt-en-zls/tree/main, I see separate "spm" (sentience piece model) files for source and target languages, and they are of different sizes (792kb vs. 850kb). But there is only a single "vocab.json" file. And the config.json file only mentions a single "vocab_size": 57680.
I've also been experimenting, e.g. tokenizer(inputs, text_target=inputs, return_tensors="pt"). If source and target used different vocabulary I would expect the returned input_ids and labels to use different numbers. But every model I've tried so far the numbers are identical (NO, my mistake - see update below).
Can a Huggingface tokenizer even support two vocabularies? If not then a model would need two tokenizers, which seems to clash with the way AutoTokenizer works.
UPDATE
Here is a test script to show the above model is actually using two spm vocabs with AutoTokenizer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = 'Helsinki-NLP/opus-mt-en-zls'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
inputs = ['Filter all items from same host']
targets = ['Filtriraj sve stavke s istog hosta']
x=tokenizer(inputs, text_target=targets, return_tensors="pt")
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print("\nGiving inputs on both sides")
x=tokenizer(inputs, text_target=inputs, return_tensors="pt")
print(x) ## Expecting to see different numbers if they use different vocabs
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print("\nGiving targets on both sides")
x=tokenizer(targets, text_target=targets, return_tensors="pt") ## Expecting to see different numbers if they use different vocabs
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print(model)
The output is:
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
0]])}
▁Filter all▁items from same host</s>
Filtriraj sve stavke s istog hosta</s>
Giving inputs on both sides
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 911, 90, 3188, 7, 98, 605, 6276, 0]])}
▁Filter all▁items from same host</s>
Filter all items from same host</s>
Giving targets on both sides
{'input_ids': tensor([[11638, 1392, 7636, 95, 120, 914, 465, 478, 95, 29,
25, 897, 6276, 27, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
0]])}
Filtriraj sve stavke s istog hosta</s>
Filtriraj sve stavke s istog hosta</s>
When I choose identical strings in English or Croatian it gives slightly different numbers, showing that different tokenizers are involved. You can then see that the different ids sometimes map back to an identical string, sometimes not.
But when I print out the model we see it is actually a shared vocabulary, which makes the two spm models a bit pointless.
(encoder): MarianEncoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
(decoder): MarianDecoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
(lm_head): Linear(in_features=512, out_features=57680, bias=False)
I haven't got as far as finding out if a non-shared vocabulary is possible, but still yet to see evidence of one.
For Marian-based models, HuggingFace now supports separate vocabularies for source and target, but some models may not, especially older models.
(As you know, OPUS-MT models are based on MarianMT. The MarianMT framework supports it.)
Before https://github.com/huggingface/transformers/pull/15831, HuggingFace used a shared vocabulary file for Marian.
This PR updates the Marian model:
To allow not sharing embeddings between encoder and decoder.
Allow tying only decoder embeddings with lm_head.
Separate two vocabs in tokenizer for src and tgt language
...
share_encoder_decoder_embeddings: to indicate if emb should be shared or not
So models trained with earlier versions of the framework, or that parameter set to false, only have one shared vocabulary file for source and target.

K-fold cross validation - save folds for different models

I am trying to train my models and validate them using sklearn's cross validation. What I want to do is use the same folds across all of my models (which will be running from different python scripts).
How can I do this? Should I save them to a file? or should I save the kfold model? or should I use the same seed?
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
Well the easiest way I found to save the folds was to simply get them from the stratified k fold split method by looping over it. Then storing it to a json file:
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
folds = {}
count = 1
for train, test in kfold.split(np.zeros(len(y)), y.argmax(1)):
folds['fold_{}'.format(count)] = {}
folds['fold_{}'.format(count)]['train'] = train.tolist()
folds['fold_{}'.format(count)]['test'] = test.tolist()
count += 1
print(len(folds) == n_splits)#assert we have the same number of splits
#dump folds to json
import json
with open('folds.json', 'w') as fp:
json.dump(folds, fp)
Note 1: Argmax here is used because my y values are one hot variables so we need to get the class that is predicted/ground truth.
Now to load it from any other script:
#load to dict to be used
with open('folds.json') as f:
kfolds = json.load(f)
From here we can easily just loop over the elements in the dict:
for key, val in kfolds.items():
print(key)
train = val['train']
test = val['test']
Our json file looks like so:
{"fold_1": {"train": [193, 2405, 2895, 565, 1215, 274, 2839, 1735, 2536, 1196, 40, 2541, 980,...SNIP...830, 1032], "test": [1, 5, 6, 7, 10, 15, 20, 26, 37, 45, 52, 54, 55, 59, 60, 64, 65, 68, 74, 76, 78, 90, 100, 106, 107, 113, 122, 124, 132, 135, 141, 146,...SNIP...]}

Same string but different bytes codes

I have two strings:
a = 'hà nội'
b = 'hà nội'
When I compare them with a == b, it returns false.
I checked the byte codes:
a.bytes = [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]
b.bytes = [104, 195, 160, 32, 110, 225, 187, 153, 105]
What is the cause? How can I fix it so that a == b returns true?
This is an issue with Unicode equivalence.
In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters.
a.unicode_normalize == b.unicode_normalize
unicode_normalize(form=:nfc) [link]
Returns a normalized form of str, using Unicode normalizations NFC,
NFD, NFKC, or NFKD. The normalization form used is determined by form,
which is any of the four values :nfc, :nfd, :nfkc, or :nfkd. The
default is :nfc.
If the string is not in a Unicode Encoding, then an Exception is
raised. In this context, 'Unicode Encoding' means any of UTF-8,
UTF-16BE/LE, and UTF-32BE/LE, as well as GB18030, UCS_2BE, and
UCS_4BE. Anything else than UTF-8 is implemented by converting to
UTF-8, which makes it slower than UTF-8.

sed how to delete text and symbols between

I have sql file with this strings :
(17, 14, '2015-01-20 10:38:40', 211, 'Just text\n\nFrom: Support <support#domain.com>\n Send: 20 Jan 2015 year. 10:33\n To: Admin\n Theme: [TST #0000014] Just text \n\nJust text: Text\n Test text test text\n\nJust text:\n Text\n\n-- \n Test\n Text.\n Many text words 0.84.2', 0, 2);
I want remove all text between symbols \n\ and ', 0, 2);
I want get this result:
(17, 14, '2015-01-20 10:38:40', 211, 'Just text', 0, 2);
How I can do it via sed?
I try use this example - cat file | sed 's/<b>.*</b>//g'. I changed <b> to \n\ and </b> to ', 0, 2); But it dont work, I get error in console
Thanks in advance!
You can try this command
sed 's/\\n\\.*\('\'', 0, 2);\)/\1/g' FileName
Output :
(17, 14, '2015-01-20 10:38:40', 211, 'Just text', 0, 2);
You have to escape the single quotes like '\'' as well as back slash \\
If you can find it, you can replace it with nothing.
So, depending on what you mean by \n and what you need to escape, you want something like sed 's/\\n\\.*'//g'.
Obviously, take care that this is really what you want to replace on every line. It might be worth searching for the target \\n\\.*' first, to make sure it doesn't accidentally grab too much on an unexpected line.

Read entire text file into a Prolog variable

I am having the hardest time figuring this out, even though it is one of the simplest things to do in other languages: Is there a simple way to read the entire contents of a text file into a Prolog variable?
Simply state with a DCG what you want to describe, and use library(pio) to parse from a file:
:- use_module(library(pio)).
all([]) --> [].
all([L|Ls]) --> [L], all(Ls).
Example:
?- once(phrase_from_file(all(Ls), 'all.pl')).
Ls = [10, 58, 45, 32, 117, 115, 101, 95, 109|...].
in library(readutil) there are some built ins: see read_file_to_codes or read_file_to_terms

Resources