Text direction and pluralization dataset - ruby

I am working on a Ruby on Rails project, in which I need to get the text direction and the plural form for different languages. Something like:
en: { plural_keys: [:one, :other], text_dir: :left_to_right },
sk: { plural_keys: [:one, :few, :other], text_dir: :left_to_right },
...
Is there any free dataset I could extract this information from?

I have found CLDR (the Unicode Common Locale Data Repository), which contains all the data I need. It is also available as the twitter_cldr gem.
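To make the target shape concrete, here is a minimal sketch of what the lookup boils down to for whole numbers. The two rule sets and the short right-to-left list are hand-written for illustration only (they are assumptions, not read from CLDR or the gem), and the method names plural_key and text_dir are hypothetical. A real project would load the data from CLDR instead of hard-coding it.

```ruby
# Pick the plural category for a whole number.
# Rules are hand-copied for two locales as an illustration only.
def plural_key(locale, count)
  rules = {
    en: ->(n) { n == 1 ? :one : :other },
    sk: ->(n) {
      if n == 1 then :one
      elsif (2..4).cover?(n) then :few
      else :other
      end
    }
  }
  rules.fetch(locale).call(count)
end

# Text direction from a small, deliberately incomplete list of
# right-to-left locales (assumption for the sketch).
def text_dir(locale)
  %i[ar he fa ur].include?(locale) ? :right_to_left : :left_to_right
end

puts plural_key(:sk, 3)
puts text_dir(:en)
```

The lambdas only cover integers; CLDR's full rules also deal with fractions and many more locales.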

Related

Ajax call to wiktionary.org to retrieve information on one specific language

To get information about a word in Wiktionary, I can make an Ajax call to the following URL at wiktionary.org: https://en.wiktionary.org/w/api.php?action=parse&format=json&prop=text|revid&callback=?&page=слово, where слово is the word of interest.
This will return a JSON object with the format...
{ "parse":
{ "title": "\u0442\u044b"
, "pageid":96216
, "revid":38162039
, "text":{
"*": <HTML string>
}
}
}
... where <HTML string> contains information on the given word, in all the languages that Wiktionary associates with the word. In the case of the word слово, this means
Bulgarian
Macedonian
Old Church Slavonic
Russian
Serbo-Croatian
Ukrainian
How can I change the URL for the Ajax call so that it only returns data for a single language (for example: Russian)?
It's not possible, because in Wiktionary each word is a page of wiki-formatted text. It is not structured as JSON or any other machine-readable format. There is, however, a convention that every page is supposed to follow: each page should have level 2 headings naming the word's language. So you can parse the wikitext and extract only the data for the language you want by looking for that heading and taking only the content below it, up to the next level 2 heading.
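The heading-based extraction could be sketched as follows, assuming you already have the raw wikitext of the page as a string (the parse API can return it with prop=wikitext) and that language headings are written as ==Language== on their own line. The method name language_section is hypothetical.

```ruby
# Return the wikitext below the level 2 heading for +language+,
# up to the next level 2 heading, or nil if the heading is absent.
def language_section(wikitext, language)
  # Matches "==Russian==" but not "===Noun===" (level 3 and deeper).
  heading = /^==[ \t]*([^=\s][^\n]*?)[ \t]*==[ \t]*$/
  # split keeps the captured heading names between the section bodies.
  parts = wikitext.split(heading)
  parts.drop(1).each_slice(2) do |name, body|
    return body.to_s.strip if name == language
  end
  nil
end

sample = "==Bulgarian==\n===Noun===\nbg entry\n==Russian==\n===Noun===\nru entry\n"
puts language_section(sample, "Russian")
```

Pages that break the heading convention will not be split correctly, so treat a nil result as "not found or not parseable".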

Rails mongoid regex on an Integer field

I have some IDs 214001, 214002, 215001, etc...
From a searchbar, I want autocompletion with the ID
"214" should trigger autocompletion for IDs 214001, 214002
Apparently, I can't just do a
scope :by_number, ->(number){
where(:number => /#{number.to_i}/i)
}
with Mongoid. Does anyone know a working way of matching a Mongoid Integer field with a regex?
A related question had some clues, but how can I do this inside Rails?
EDIT : The context is to be able to find a project by its integer ID or its short description :
scope :by_intitule, ->(regex){
where(:intitule => /#{Regexp.escape(regex)}/i)
}
# TODO : Not working !!!!
scope :by_number, ->(numero){
where(:number => /#{numero.to_i}/i)
}
scope :by_name, ->(regex){
any_of([by_number(regex).selector, by_intitule(regex).selector])
}
The MongoDB solution from the linked question would be:
db.models.find({ $where: '/^124/.test(this.number)' })
Things that you hand to find map pretty much one-to-one to Mongoid:
where(:$where => "/^#{numero.to_i}/.test(this.number)")
The to_i call should make string interpolation okay for this limited case.
Keep in mind that this is a pretty horrific thing to do to your database: it can't use indexes, it will scan every single document in the collection, ...
You might be better off using a string field so that you can do normal regex matching. I'm pretty sure MongoDB will be able to use an index if you anchor your regex at the beginning too. If you really need it to be a number inside the database then you could always store it as both an Integer and a String field:
field :number, :type => Integer
field :number_s, :type => String
and then have some hooks to keep :number_s up to date as :number changes. If you did this, your pattern matching scope would look at :number_s. Precomputing and duplicating data like this is pretty common with MongoDB so you shouldn't feel bad about it.
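As a rough sketch of the string-field idea, the helper below builds an anchored prefix pattern from the search input and is exercised here against plain integers converted to strings. The helper name number_prefix_pattern is hypothetical, and the Mongoid wiring in the trailing comments is an assumption about how it might be plugged in, not tested code.

```ruby
# Build an anchored prefix pattern from raw search-bar input.
# Returns nil when the input contains no digits.
def number_prefix_pattern(input)
  digits = input.to_s[/\d+/]
  return nil if digits.nil?
  /\A#{digits}/
end

ids = [214001, 214002, 215001]
pattern = number_prefix_pattern("214")
matches = ids.select { |id| id.to_s.match?(pattern) }
puts matches.inspect

# Hypothetical wiring inside the Mongoid model (assumption):
#   before_save { self.number_s = number.to_s }
#   scope :by_number, ->(input) { where(:number_s => number_prefix_pattern(input)) }
```

Keeping only the digits of the input means nothing user-supplied ends up as regex syntax.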
The way to do a $where in Mongoid is to use Criteria#for_js, something like this:
Model.for_js("new RegExp(number).test(this.int_field)", number: 763)

Ruby RDF query - extracting simple data from Seq and Bag items

I am receiving XML-serialised RDF (as part of XMP media descriptions, in case that is relevant) and processing it in Ruby. I am trying to work with the rdf gem, although I am happy to look at other solutions.
I have managed to load and query the most basic data, but am stuck when trying to build a query for items which contain sequences and bags.
Example XML RDF:
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:date>
<rdf:Seq>
<rdf:li>2013-04-08</rdf:li>
</rdf:Seq>
</dc:date>
</rdf:Description>
</rdf:RDF>
My best attempt at putting together a query:
require 'rdf'
require 'rdf/rdfxml'
require 'rdf/vocab/dc11'
graph = RDF::Graph.load( 'test.rdf' )
date_query = RDF::Query.new( :subject => { RDF::DC11.date => :date } )
results = date_query.execute(graph)
results.map { |result| { result.subject.to_s => result.date.inspect } }
=> [{"test.rdf"=>"#<RDF::Node:0x3fc186b3eef8(_:g70100421177080)>"}]
I get the impression that my results at this stage ("query solutions"?) are a reference to the rdf:Seq container. But I am lost as to how to progress. For the example above, I'd expect to end up, eventually, with an array ["2013-04-08"].
When there is incoming data without the rdf:Seq and rdf:li containers, I am able to extract the strings I want using RDF::Query, following examples at http://rdf.rubyforge.org/RDF/Query.html - unfortunately I cannot find any examples of more complex queries or RDF structures processed in Ruby.
Edit: In addition, when I try to find appropriate methods to use with the RDF::Node object, I cannot see any way to explore any further relations it may have:
results[0].date.methods - Object.methods
=> [:original, :original=, :id, :id=, :node?, :anonymous?, :unlabeled?, :labeled?, :to_sym, :resource?, :constant?, :variable?, :between?, :graph?, :literal?, :statement?, :iri?, :uri?, :valid?, :invalid?, :validate!, :validate, :to_rdf, :inspect!, :type_error, :to_ntriples]
# None of the above leads AFAICS to more data in the graph
I know how to get the same data in xpath (well, at least provided we always get the same paths in the serialisation), but feel it is not the best query language to use in this case (it's my backup plan, however, if it turns out too complex to implement an RDF-query solution)
I think you're correct when saying that your results at this stage (the "query solutions") are a reference to the rdf:Seq container. RDF/XML is a really horrible serialisation format; think of the data as a graph instead. Here is a picture of an rdf:Bag. An rdf:Seq works the same way, and the #students in the example is analogous to the #date in your case.
So to get to the date literal, you need to hop one node further in the graph. I'm not familiar with the syntax of this Ruby library, but something like:
require 'rdf'
require 'rdf/rdfxml'
require 'rdf/vocab/dc11'
graph = RDF::Graph.load( 'test.rdf' )
date_query = RDF::Query.new({
:yourThing => {
RDF::DC11.date => :dateSeq
},
:dateSeq => {
RDF.type => RDF.Seq,
RDF._1 => :dateLiteral
}
})
date_query.execute(graph).each do |solution|
puts "date=#{solution.dateLiteral}"
end
Of course, if you expect the Seq to actually contain multiple dates (otherwise it wouldn't make sense to have a Seq), you will have to match them with RDF._1 => :dateLiteral1, RDF._2 => :dateLiteral2, RDF._3 => :dateLiteral3, etc.
Or for a more generic solution, match all the properties and objects on the dateSeq with:
:dateSeq => {
:property => :dateLiteral
}
and then filter out the solution where :property ends up being rdf:type, in which case :dateLiteral isn't actually the date but rdf:Seq. Maybe the library also has a special method to get all of a Seq's contents.
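The filtering step can be sketched independently of any RDF library. The version below assumes the query results have been flattened into plain [subject, predicate, object] string triples, which is a simplification made for this sketch; the method name container_members is hypothetical. It keeps only the rdf:_1, rdf:_2, ... membership properties, which drops rdf:type automatically, and returns the members in sequence order.

```ruby
# Collect the members of an rdf:Seq / rdf:Bag node from plain triples.
def container_members(triples, container)
  ns = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  member = /\A#{Regexp.escape(ns)}_(\d+)\z/ # rdf:_1, rdf:_2, ...
  triples
    .select { |s, p, _o| s == container && p.match?(member) }
    .sort_by { |_s, p, _o| p[member, 1].to_i }
    .map { |_s, _p, o| o }
end

ns = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
triples = [
  ["_:seq", "#{ns}type", "#{ns}Seq"],
  ["_:seq", "#{ns}_2", "2013-04-09"],
  ["_:seq", "#{ns}_1", "2013-04-08"]
]
puts container_members(triples, "_:seq").inspect
```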

Can I avoid transposing an array in Ruby on Rails?

I have a Rails app that has a COUNTRIES list with full country names and abbreviations created inside the Company model. The array for the COUNTRIES list is used for a select tag on the input form to store abbreviations in the DB. See below. VALID_COUNTRIES is used for validations of abbreviations in the DB. FULL_COUNTRIES is used to display the full country name from the abbreviation.
class Company < ActiveRecord::Base
COUNTRIES = [["Afghanistan","AF"],["Aland Islands","AX"],["Albania","AL"],...]
COUNTRIES_TRANSPOSE = COUNTRIES.transpose
VALID_COUNTRIES = COUNTRIES_TRANSPOSE[1]
FULL_COUNTRIES = COUNTRIES_TRANSPOSE[0]
validates :country, inclusion: { in: VALID_COUNTRIES, message: "enter a valid country" }
...
end
On the form:
<%= select_tag(:country, options_for_select(Company::COUNTRIES, 'US')) %>
And to convert back to the full country name:
full_country = FULL_COUNTRIES[VALID_COUNTRIES.index(country)]
This seems like an excellent application for a hash, except the key/value order is wrong. For the select I need:
COUNTRIES = {"Afghanistan" => "AF", "Aland Islands" => "AX", "Albania" => "AL",...}
While to take the abbreviation from the DB and display the full country name I need:
COUNTRIES = {"AF" => "Afghanistan", "AX" => "Aland Islands", "AL" => "Albania",...}
Which is a shame, because COUNTRIES.keys or COUNTRIES.values would give me the validation list (depending on which hash layout is used).
I'm relatively new to Ruby/Rails and am looking for the more Ruby-like way to solve the problem. Here are the questions:
Does the transpose occur only once, and if so, when is it executed?
Is there a way to specify the FULL_ and VALID_ lists that do not require the transpose?
Is there a better or reasonable alternate way to do this? For instance, VALID_COUNTRIES is COUNTRIES[x][1] and FULL_COUNTRIES is COUNTRIES[x][0], but VALID_ must work with the validation.
Is there a way to make this work with just one hash, rather than one for the select_tag and one for converting the abbreviations in the DB back to full names for display?
1) Does the transpose occur only once, and if so, when is it executed?
Yes. It runs once, when the class body is evaluated (that is, when the file defining Company is loaded), because the result is assigned to a constant. If you want it to be evaluated every time, use a lambda:
FULL_COUNTRIES = lambda { COUNTRIES_TRANSPOSE[0] }
2) Is there a way to specify the FULL_ and VALID_ lists that do not require the transpose?
Yes, use map (or collect; they are the same thing). The abbreviation is the last element of each pair and the full name is the first:
VALID_COUNTRIES = COUNTRIES.map(&:last)
FULL_COUNTRIES = COUNTRIES.map(&:first)
3) Is there a better or reasonable alternate way to do this? For instance, VALID_COUNTRIES is COUNTRIES[x][1] and FULL_COUNTRIES is COUNTRIES[x][0], but VALID_ must work with the validation.
See Above
4) Is there a way to make the hash work?
Yes. I am not sure why a hash isn't working: the Rails docs say options_for_select will use hash.to_a.map(&:first) for the option text and hash.to_a.map(&:last) for the option value, so the first hash you give should work. If you can clarify why it does not, I can help you more.
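For what it's worth, a single hash can serve all three uses if the reverse lookup is derived with Hash#invert. This is only a sketch with three countries; COUNTRY_NAMES and full_country_name are hypothetical names introduced for illustration.

```ruby
# Name => abbreviation, the orientation options_for_select expects.
COUNTRIES = {
  "Afghanistan" => "AF",
  "Aland Islands" => "AX",
  "Albania" => "AL"
}.freeze

# Abbreviations for the inclusion validation.
VALID_COUNTRIES = COUNTRIES.values.freeze

# Abbreviation => name, derived once when this code is loaded.
COUNTRY_NAMES = COUNTRIES.invert.freeze

def full_country_name(code)
  COUNTRY_NAMES[code]
end

puts full_country_name("AX")
```

Like the transpose, the invert runs once at load time, and the lookup by abbreviation becomes a hash access instead of an index search.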

How to make client side I18n with mustache.js

I have some static HTML files and want to change the static text inside them through client-side modification with mustache.js.
It seems that this was possible with Twitter's mustache extension on GitHub: https://github.com/bcherry/mustache.js
But lately the specific I18n extension has been removed or changed.
I imagine a solution where http://server/static.html?lang=en loads mustache.js and a language JSON file chosen by the lang param, e.g. data_en.json.
Then mustache replaces the {{tags}} with the data sent.
Can someone give me an example how to do this?
You can use lambdas along with some library like i18next or something else.
{{#i18n}}greeting{{/i18n}} {{name}}
And the data passed:
{
name: 'Mike',
i18n: function() {
return function(text, render) {
return render(i18n.t(text));
};
}
}
This solved the problem for me
I don't think Silent's answer really solves or explains the problem.
The real issue is that you need to run Mustache twice (or use something else and then Mustache).
That is, most i18n works as a two-step process like the following:
Render the i18n text with the given variables.
Render the HTML with the already-rendered i18n text.
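The two steps above can be sketched language-independently. The Ruby snippet below uses a toy placeholder renderer as a stand-in for Mustache (it handles plain {{name}} variables only), and the message text and model are made up for illustration.

```ruby
# Toy renderer: replace {{name}} placeholders with values from a hash.
def render(template, vars)
  template.gsub(/\{\{(\w+)\}\}/) { vars.fetch(Regexp.last_match(1).to_sym).to_s }
end

messages = { title: "You have {{amount}} item(s) left" } # hypothetical translations
model = { amount: 10 }

# Step 1: expand the translated message with the model's variables.
title = render(messages[:title], model)

# Step 2: render the HTML with the already-expanded message.
html = render("<p>{{title}}</p>", { title: title })
puts html
```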
Option 1: Use Mustache partials
<p>{{> i18n.title}}</p>
{{#somelist}}{{> i18n.item}}{{/somelist}}
The data given to this mustache template might be:
{
"amount" : 10,
"somelist" : [ { "description" : "poop" } ]
}
Then you would store all your i18n templates/messages as a massive JSON object of mustache templates on the server:
Below is the "en" translations:
{
"title" : "You have {{amount}} fart(s) left",
"item" : "Smells like {{description}}"
}
There is a rather big problem with this approach: Mustache has no logic, so handling things like pluralization gets messy.
The other issue is that performance might suffer from doing so many partial loads (or maybe not).
Option 2: Let the Server's i18n do the work.
Another option is to let the server do the first pass of expansion (step 1).
Java has lots of options for i18n expansion, and I assume other languages do as well.
What's rather annoying about this solution is that you will have to load your model twice: once with the regular model and a second time with the expanded i18n templates. You will have to know exactly which i18n templates to expand and put in the model (otherwise you would have to expand all of them). In other words, you're going to get some nice violations of DRY.
One way around the previous problem is pre-processing the mustache templates.
My answer is based on developingo's. His answer is great; I'll just add the possibility of using mustache tags inside the message key. This is needed if you want to look up messages according to the current mustache state, or inside loops.
It's based on a simple double rendering:
info.i18n = function(){
return function(text, render){
var code = render(text); //Render first to get all variable name codes set
var value = i18n.t(code)
return render(value); //then render the messages
}
}
Performance isn't hurt, because Mustache is operating on a very small string.
Here is a little example:
Json data :
array :
[
{ name : "banana"},
{ name : "cucumber" }
]
Mustache template :
{{#array}}
{{#i18n}}description_{{name}}{{/i18n}}
{{/array}}
Messages
description_banana = "{{name}} is yellow"
description_cucumber = "{{name}} is green"
The result is :
banana is yellow
cucumber is green
Plurals
[Edit]: As asked in the comments, here is a pseudo-code example of plural handling for English and French. It's a very simple, untested example, but it gives you a hint.
description_banana = "{{#plurable}}a {{name}} is{{/plurable}} green" (Adjectives not getting "s" in plurals)
description_banana = "{{#plurable}}Une {{name}} est verte{{/plurable}}" (Adjectives getting an "s" in plural, so englobing the adjective as well)
info.plurable = function()
{
//Check if needs plural
//Parse each word with a space separation
//Add an s at the end of each word except ones from a map of common exceptions such as "a"=>"/*nothing*/", "is"=>"are" and for french "est"=>"sont", "une" => "des"
//This map/function is specific to each language and should be expanded at need.
}
This is quite simple and pretty straightforward.
First, you will need to add code to read the lang parameter from the query string. For this, I use a snippet taken from the answer here.
function getParameterByName(name) {
var match = RegExp('[?&]' + name + '=([^&]*)')
.exec(window.location.search);
return match && decodeURIComponent(match[1].replace(/\+/g, ' '));
}
And then, I use jQuery to handle ajax and onReady state processing:
$(document).ready(function(){
var possibleLang = ['en', 'id'];
var currentLang = getParameterByName("lang");
console.log("parameter lang: " + currentLang);
console.log("possible lang: " + (jQuery.inArray(currentLang, possibleLang)));
if(jQuery.inArray(currentLang, possibleLang) > -1){
console.log("fetching AJAX");
var request = jQuery.ajax({
processData: false,
cache: false,
url: "data_" + currentLang + ".json"
});
console.log("done AJAX");
request.done(function(data){
console.log("got data: " + data);
var output = Mustache.render("<h1>{{title}}</h1><div id='content'>{{content}}</div>", data);
console.log("output: " + output);
$("#output").append(output);
});
request.fail(function(xhr, textStatus){
console.log("error: " + textStatus);
});
}
});
For this answer, I try to use simple JSON data:
{"title": "this is title", "content": "this is english content"}
Get this GIST for complete HTML answer.
Make sure to remember that other languages are significantly different from EN.
In FR and ES, adjectives come after the noun. "green beans" becomes "haricots verts" (beans green) in FR, so if you're plugging in variables, your translated templates must be able to put the variables in reverse order. A plain positional printf, for instance, won't work because the arguments can't change order. This is why you use named variables as in Option 1 above, and translate templates as whole sentences and paragraphs rather than concatenating phrases.
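The point about named variables is easy to demonstrate in Ruby, where format accepts %{name} references. The templates and word lists below are made up for illustration; each translation decides for itself where the placeholders go.

```ruby
# Each language's template places the named placeholders in its own order.
templates = {
  en: "%{adjective} %{noun}",
  fr: "%{noun} %{adjective}"
}

words = {
  en: { adjective: "green", noun: "beans" },
  fr: { adjective: "verts", noun: "haricots" }
}

phrases = templates.to_h { |lang, template| [lang, format(template, words[lang])] }
puts phrases[:en]
puts phrases[:fr]
```

The calling code passes the same named values for every language and never needs to know the word order.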
Your data also needs to be translated, so the word 'poop', which came from the data, somehow has to be translated too. Different languages do plurals differently, and English has its own irregular forms, as in tooth/teeth, foot/feet, etc. EN also has glasses and pants, which are always plural. Other languages similarly have exceptions and strange idioms. In the UK, IBM 'are' at the trade show, whereas in the US, IBM 'is' at the trade show. Russian picks among several plural forms depending on the number itself, and some languages (Japanese, for example) use different counting words depending on whether the things counted are people, animals, long narrow objects, etc. In other countries, thousands separators are spaces, dots, or apostrophes, and in some cases digits aren't grouped by 3: by 4 in Japan, inconsistently in India.
Be content with mediocre language support; it's just too much work.
And don't confuse changing language with changing country - Switzerland, Belgium and Canada also have FR speakers, not to mention Tahiti, Haiti and Chad. Austria speaks DE, Aruba speaks NL, and Macao speaks PT.
