Portuguese makes use of five diacritics: the cedilla (ç), acute accent (á, é, í, ó, ú), circumflex accent (â, ê, ô), tilde (ã, õ), and grave accent (à, and rarely è, ì, ò, and ù). The cedilla indicates that ç is pronounced /s/ (from a historic palatalization).
When searching, I'd like to remove the diacritics and convert to uppercase, e.g. á, é, í, ó, ú -> A E I O U.
Is there any Nunjucks filter for this job?
{{ content_with_diacritics | filter_for_search }}
Thank you.
It looks like you're looking for the replace filter. Call it multiple times with the required replacements.
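If chaining replace filters gets unwieldy, the usual approach is Unicode decomposition: split each accented character into its base letter plus a combining mark, drop the marks, then uppercase. Here is the idea sketched in Python (a custom Nunjucks filter could do the same with String.prototype.normalize); the function name is mine, not a built-in filter:

```python
import unicodedata

def fold_for_search(text):
    # Decompose each character (á -> a + combining acute accent),
    # drop the combining marks (Unicode category Mn), then uppercase.
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(c for c in decomposed
                       if unicodedata.category(c) != 'Mn')
    return stripped.upper()

print(fold_for_search('á, é, í, ó, ú'))  # A, E, I, O, U
print(fold_for_search('ação'))           # ACAO (ç and ã fold too)
```

This also handles the cedilla, since ç decomposes to c plus a combining cedilla mark.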
Sorry, the Stack Overflow answer checker was not allowing the native format I'm trying to post; below are the two images, however.
I was wondering why "H2O.ai" in the question and regular "H2O.ai" are different. Is it some sort of character map? I have seen this in many Instagram user descriptions. How do you generate it, and what is the purpose behind it? Any info will be appreciated.
Those characters in "𝐇𝟐𝐎.𝐚𝐢", "H2O.ai" or "H2O.ai" strings come from different Unicode subranges (blocks), except the full stops.
You can check their codepoints using the following Python code snippet; dots (full stops) are removed from sample test string:
# -*- coding: utf-8 -*-
import unicodedata

string = '𝐇𝟐𝐎𝐚𝐢 H2Oai H2Oai'  # ℍ𝟚𝕆𝕒𝕚 Ⓗ②Ⓞⓐⓘ
print("\n" + string + "\n")
for letter in string:
    print(letter,                                 # the character itself
          '{:02x}'.format(ord(letter)).rjust(5),  # codepoint (in hex)
          unicodedata.name(letter, '???'))        # name of the character
Output:
𝐇𝟐𝐎𝐚𝐢 H2Oai H2Oai
𝐇 1d407 MATHEMATICAL BOLD CAPITAL H
𝟐 1d7d0 MATHEMATICAL BOLD DIGIT TWO
𝐎 1d40e MATHEMATICAL BOLD CAPITAL O
𝐚 1d41a MATHEMATICAL BOLD SMALL A
𝐢 1d422 MATHEMATICAL BOLD SMALL I
20 SPACE
H 48 LATIN CAPITAL LETTER H
2 32 DIGIT TWO
O 4f LATIN CAPITAL LETTER O
a 61 LATIN SMALL LETTER A
i 69 LATIN SMALL LETTER I
20 SPACE
H ff28 FULLWIDTH LATIN CAPITAL LETTER H
2 ff12 FULLWIDTH DIGIT TWO
O ff2f FULLWIDTH LATIN CAPITAL LETTER O
a ff41 FULLWIDTH LATIN SMALL LETTER A
i ff49 FULLWIDTH LATIN SMALL LETTER I
You can use the printed codepoints in HTML entities like &#x48; &#x1D407; &#xFF28;. Those will render as H 𝐇 H.
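To answer the "how to generate it" part: such strings are produced by offsetting ordinary ASCII characters into the Mathematical Alphanumeric Symbols block, and the purpose is purely cosmetic, since these codepoints render as bold (or double-struck, circled, etc.) in places that strip formatting, like Instagram bios. A minimal sketch in Python, using the block offsets visible in the output above:

```python
def to_math_bold(s):
    # Map ASCII letters and digits onto the Mathematical Bold ranges
    # (capitals start at U+1D400, lowercase at U+1D41A, digits at
    # U+1D7CE); anything else (like '.') passes through unchanged.
    out = []
    for c in s:
        if 'A' <= c <= 'Z':
            out.append(chr(0x1D400 + ord(c) - ord('A')))
        elif 'a' <= c <= 'z':
            out.append(chr(0x1D41A + ord(c) - ord('a')))
        elif '0' <= c <= '9':
            out.append(chr(0x1D7CE + ord(c) - ord('0')))
        else:
            out.append(c)
    return ''.join(out)

print(to_math_bold('H2O.ai'))  # 𝐇𝟐𝐎.𝐚𝐢
```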
I have a string, and I would like to replace all special characters with underscores.
In other words, I just want the 26 English letters (lower and upper case), 0-9, and the "_" character.
Also note that there are non-English characters, and they need to be replaced with "_" as well.
What is the most elegant way to do this in Ruby?
It sounds like you want to replace all non-word characters with underscores. Therefore,
result = subject.gsub(/[^\w]/, '_')
But are you okay that this would also replace newlines and other whitespace characters?
If not, change it to
result = subject.gsub(/[^\w\s]/, '_')
Regex explanation:
[^\w\s]  # any character except: word characters
         # (a-z, A-Z, 0-9, _) and whitespace
         # (\n, \r, \t, \f, and " ")
Note
As Cary Swoveland mentions, [^\w] can also be written with the shorthand \W.
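For comparison, the same substitution in Python. One caveat: Python 3's \w is Unicode-aware by default, so re.ASCII is needed to mirror Ruby's ASCII-only \w (the sample string here is made up):

```python
import re

subject = 'Hello, wörld!'

# Replace every non-word character (ASCII sense) with "_":
print(re.sub(r'[^\w]', '_', subject, flags=re.ASCII))    # Hello__w_rld_

# Same, but leave whitespace alone:
print(re.sub(r'[^\w\s]', '_', subject, flags=re.ASCII))  # Hello_ w_rld_
```

Note that ö is replaced in both cases, matching the asker's requirement that non-English characters also become underscores.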
I would like to remove all non-alphanumeric characters from a string, except spaces, "-", and some German characters.
Example
regexp = "mönchengladbach."
regexp.gsub(/[^0-9a-z \-]/i, '')
=> mnchengladbach
I need this:
=> mönchengladbach
It should also not replace other German characters such as:
ä ö ü ß
Thanks!
Edit:
It was just me not testing properly; my IRB session did not accept the special characters. This works for me:
regexp.gsub(/[^0-9a-z \-äüöß]/i, '')
To remove everything that is not a letter or a space, you can use this:
str.gsub(/[^\p{L}\s]+/, '')
I use a negated character class here: [^\p{L}\s] matches anything that is not a letter (in any language) or a whitespace character (space, tab, newline).
\p{L} is a Unicode character class for letters.
You can easily add other characters you want to preserve like -:
str.gsub(/[^\p{L}\s-]+/, '')
example script:
# encoding: UTF-8
str = "mönchengladbach."
str = str.gsub(/[^\p{L}\s]+/, '#')
puts str
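As a cross-check in Python: the built-in re module does not support \p{L}, but the Unicode-aware str.isalpha() gives the same "keep letters and spaces" behaviour (a rough equivalent for illustration, not a drop-in replacement for the Ruby one-liner):

```python
def keep_letters_and_spaces(s):
    # str.isalpha() is Unicode-aware, so ä, ö, ü, ß all count as
    # letters; '-' is kept explicitly, like the [^\p{L}\s-] variant.
    return ''.join(c for c in s
                   if c.isalpha() or c.isspace() or c == '-')

print(keep_letters_and_spaces('mönchengladbach.'))  # mönchengladbach
```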
I think you want:
/[^[:alnum:] -]/
Note that the /i flag is not necessary, and there is no need to escape - when it appears at the end of a character class.
Suppose we have a name written in a language with non-Latin letters, like Arabic, Hebrew, Chinese, Japanese, etc.
How could a search engine match the original name with the English spelling of the same name, and vice versa?
Something like the name 拓海 in Japanese and the English spelling Takumi.
What is the algorithm/technique used to do this?
Good day.
You have to do the following:
Classify each language in the world by the symbols it uses:
all languages:
English [26 letters] a b c d e f g ...
Russian [33 letters] а б в г д е ...
Chinese [x letters] ...
Ukrainian [x letters] а б в г д ... і
Japanese [x letters] ...
.................
Finally, you will have rules mapping the spelling of symbols between any pair of languages.
Some languages, for instance Hindi, Chinese, etc., will not have such rules; for those you should create your own rules (based on the transcription of those languages).
Algorithm:
[w][e][п] = wep
 e   e   r
e - English
r - Russian
transcription[п] = p
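The per-symbol transcription idea above can be sketched as a lookup table. The mapping below is a toy example invented for illustration, not a complete rule set:

```python
# Toy Russian-to-English transcription table (illustrative only):
TRANSCRIPTION = {'п': 'p', 'т': 't', 'к': 'k', 'с': 's'}

def transliterate(word):
    # Letters already in the target alphabet pass through unchanged;
    # known foreign letters are mapped via the transcription table.
    return ''.join(TRANSCRIPTION.get(c, c) for c in word)

print(transliterate('weп'))  # wep
```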
Search engines (like Google) probably have huge data sets (corpora), one per language.
When you want to translate a word from one language to another, you can search for the word in the corpus of the first language and return the matching word in the corpus of the second language (the same technique works for names).
That's the basic idea.
For some background, you'd better read about the field of NLP here:
http://en.wikipedia.org/wiki/Natural_language_processing
I have this regex:
var characterReg = /^\s*[a-zA-Z0-9,\s]+\s*$/;
How do I include the letters: Å, Ø, Æ, å, ø, æ ?
Use the Unicode escapes for those codepoints. In JavaScript you can write them as \u00C5 (Å), \u00D8 (Ø), \u00C6 (Æ), \u00E5 (å), \u00F8 (ø), and \u00E6 (æ), so:
var characterReg = /^\s*[a-zA-Z0-9,\s\u00C5\u00D8\u00C6\u00E5\u00F8\u00E6]+\s*$/;
In Perl, the equivalent escape is \x{C5}; appending a quantifier such as {2} would match the character twice.
There is much more information on this here:
http://www.regular-expressions.info/unicode.html
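For comparison, a character class written with \uXXXX Unicode escapes works the same way in Python's re (the test strings below are made up):

```python
import re

# Å Ø Æ å ø æ as \uXXXX escapes inside the original class:
pattern = re.compile(
    r'^\s*[a-zA-Z0-9,\s\u00C5\u00D8\u00C6\u00E5\u00F8\u00E6]+\s*$')

print(bool(pattern.match('Ålborg, Næstved 9')))  # True
print(bool(pattern.match('façade')))             # False, ç is not in the class
```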
In Ruby, it looks like Unicode support is half-baked:
http://www.ruby-forum.com/topic/133538