I have thousands of names in a MySQL database that contain extended ASCII characters. I want to convert them to the plain English alphabet. Here is an example:
Indāpur Jejūri -> Indapur Jejuri
How can I do it? I know Java, Groovy, and a bunch of other scripting languages, but I haven't had much luck. Any suggestions?
I found the answer after going through many posts on Stack Overflow: Converting Symbols, Accent Letters to English Alphabet
import java.text.Normalizer;
import java.util.regex.Pattern;
public String deAccent(String str) {
    // Decompose into base characters plus combining marks (NFD),
    // then strip all combining diacritical marks.
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
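For example, wrapping the method in a runnable class (the class name `DeAccent` is mine) and applying it to the name from the question:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class DeAccent {
    // Decompose into base characters plus combining marks (NFD),
    // then strip all combining diacritical marks.
    public static String deAccent(String str) {
        String nfd = Normalizer.normalize(str, Normalizer.Form.NFD);
        return Pattern.compile("\\p{InCombiningDiacriticalMarks}+")
                      .matcher(nfd).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(deAccent("Indāpur Jejūri")); // Indapur Jejuri
    }
}
```

Note this only strips combining marks; characters with no decomposition (such as ø or ß) pass through unchanged.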
Related
I have a large *.txt file of real numbers. I want to import it and apply the RootMeanSquare function to it, but the output of that function is not a real number.
a.txt:
0.00005589924852471949
0.000036651199287161235
0.000016275882123536572
-4.955137498989977*^-6
-0.00002680629351951319
-0.000048814313574683916
ah=Import["a.txt", "List"];
RootMeanSquare[ah]
This returns an unevaluated expression:
Sqrt[7.83436*10^-9 + ("-4.955137498989977*^-6")^2]/Sqrt[6]
In my opinion, the problem is in the number -4.955137498989977*^-6.
Please help me. Thank you.
The import yields a string, so split it and convert to numeric expressions.
ah = ToExpression @ StringSplit @ Import["a.txt"]
StringSplit splits a string at whitespace by default.
ToExpression is listable, so it operates on the string list in one step.
The Standard Analyzer removes special characters, but not all of them (e.g. '-'). I want to index my string with only alphanumeric characters, while still referring to the original document.
Example: 'doc-size type' should be indexed as 'docsize' and 'type', and both should point to the original document: 'doc-size type'
It depends what you mean by "special characters", and what other requirements you may have. But the following may give you what you need, or point you in the right direction.
The following examples all assume Lucene version 8.4.1.
Basic Example
Starting with the very specific example you gave, where doc-size type should be indexed as docsize and type, here is a custom analyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
import java.util.regex.Pattern;
public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace first.
        final Tokenizer source = new WhitespaceTokenizer();
        TokenStream tokenStream = source;

        // Then strip every hyphen from each token.
        Pattern p = Pattern.compile("\\-");
        boolean replaceAll = Boolean.TRUE;
        tokenStream = new PatternReplaceFilter(tokenStream, p, "", replaceAll);
        return new TokenStreamComponents(source, tokenStream);
    }
}
This splits on whitespace, and then removes hyphens, using a PatternReplaceFilter. It works as shown below (I use 「 and 」 as delimiters to show where whitespaces may be part of the inputs/outputs):
Input text:
「doc-size type」
Output tokens:
「docsize」
「type」
NOTE - this will remove all standard keyboard hyphens, but not em-dashes, en-dashes, and so on. It will remove these standard hyphens regardless of where they appear in the text (word starts, word ends, on their own, etc.).
A Set of Punctuation Marks
You can change the pattern to cover more punctuation, as needed - for example:
Pattern p = Pattern.compile("[$^-]");
This does the following:
Input text:
「doc-size type $foo^bar」
Output tokens:
「docsize」
「type」
「foobar」
Everything Which is Not a Letter or Digit
You can use the following to remove everything which is not a letter or digit:
Pattern p = Pattern.compile("[^A-Za-z0-9]");
This does the following:
Input text:
「doc-size 123 %^&*{} type $foo^bar」
Output tokens:
「docsize」
「123」
「」
「type」
「foobar」
Note that this leaves one empty token in the output (from %^&*{}).
WARNING: Whether the above works for you depends very much on your specific, detailed requirements. For example, you may need extra transformations to handle upper/lowercase differences - i.e. the usual things which typically need to be considered when indexing text.
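Outside of Lucene, the same whitespace-split-then-strip transformation can be sketched in plain Java (the class name `TokenDemo` is mine; this is a rough stand-in for the analyzer, not its actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenDemo {
    // Mimic the custom analyzer: split on whitespace, then remove
    // every character that is not a letter or digit from each token.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            out.add(t.replaceAll("[^A-Za-z0-9]", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        // Note the empty token produced by %^&*{}
        System.out.println(tokens("doc-size 123 %^&*{} type $foo^bar"));
    }
}
```

This is handy for experimenting with the patterns above before wiring them into an analyzer.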
Note on the Standard Analyzer
The StandardAnalyzer actually does remove hyphens in words (with some obscure exceptions). In your question you mentioned that it does not remove them. The standard analyzer uses the standard tokenizer, and the standard tokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm (UAX #29). There's a section in it discussing how hyphens in words are handled.
So, the Standard analyzer will do this:
Input text:
「doc-size type」
Output tokens:
「doc」
「size」
「type」
That should work with searches for doc as well as doctype - it's just a question of whether it works well enough for your needs.
I understand that may not be what you want. But if you can avoid needing to build a custom analyzer, life will probably be much simpler.
I have a specific question for my project.
input = "3d6"
I want to convert some parts of this string to integers. For instance, I want to use input[0] as an integer.
How can I do this?
There are two problems here:
How to convert a string to an integer
The most straightforward method is the Atoi (ASCII to integer) function in the strconv package, which will take a string of numeric characters and coerce them into an integer for you.
How to extract meaningful components of a known string pattern
In order to use strconv.Atoi, we need the numeric characters of the input by themselves. There are lots of ways to slice and dice a string.
You can grab the first and last characters directly - input[:1] and input[2:] are the ticket.
You could split the string into two strings on the character "d"; look at the Split function in the strings package.
For more complex problems in this space, regular expressions are used. They're a way to define a pattern the computer can look for. For example, the regular expression ^x(\d+)$ will match on any string that starts with the character x and is followed by one or more numeric characters. It will provide direct access to the numeric characters it found by themselves.
Go has first class support for regular expressions via its regexp package.
For example,
package main
import (
"fmt"
)
func main() {
	input := "3d6"

	// Subtracting '0' converts an ASCII digit byte to its numeric value.
	i := int(input[0] - '0')
	fmt.Println(i)
}
Playground: https://play.golang.org/p/061miKcXdIF
Output:
3
I have been searching a lot online for a solution to this, but here is my question.
Basically I need to reverse a string of 4 characters: ABCD becomes DCBA.
Here is the start of the program:
import javax.swing.JOptionPane;
String input;
input = JOptionPane.showInputDialog("Enter a string with 4 characters (e.g. ABCD): " );
Thanks
The same way you do it in Java.
String output = new StringBuilder(input).reverse().toString();
You could also use a for loop to loop over the characters and build the String yourself.
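A minimal sketch of that loop approach (the class name `ReverseString` is mine, and console output stands in for the JOptionPane flow):

```java
public class ReverseString {
    // Build the reversed string by walking the input from the
    // last character to the first.
    public static String reverse(String input) {
        String output = "";
        for (int i = input.length() - 1; i >= 0; i--) {
            output += input.charAt(i);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(reverse("ABCD")); // DCBA
    }
}
```

For long strings, prefer the StringBuilder version above; repeated String concatenation in a loop is quadratic.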
I have filenames which contain %uXXXX substrings, where XXXX is a hexadecimal number, for example %u0151. I got these filenames by applying URI.unescape, which replaced the %XX substrings with the corresponding characters but left the %uXXXX substrings untouched. I would like to replace them with the corresponding Unicode codepoints by applying String#gsub. I tried the following, but with no success:
"rep%u00fcl%u0151".gsub(/%u([0-9a-fA-F]{4,4})/,'\u\1')
I get this:
"rep\\u00fcl\\u0151"
Instead of this:
"repülő"
Try this code:
string.gsub(/%u([0-9A-F]{4})/i){[$1.hex].pack("U")}
In the comments, cremno has a better faster solution:
string.gsub(/%u([0-9A-F]{4})/i){$1.hex.chr(Encoding::UTF_8)}
In the comments, bobince adds important restrictions, worth reading in full.
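Applied to the string from the question, the pack-based version works like this:

```ruby
string = "rep%u00fcl%u0151"

# Replace each %uXXXX escape with the character for codepoint XXXX;
# pack("U") encodes the codepoint as UTF-8.
result = string.gsub(/%u([0-9A-F]{4})/i) { [$1.hex].pack("U") }

puts result # repülő
```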
Per commenter @cremno's idea, try also this code:
gsub(/%u([0-9A-F]{4})/i) { $1.hex.chr(Encoding::UTF_8) }
For example:
s = "rep%u00fcl%u0151"
s.gsub(/%u([0-9A-F]{4})/i) { $1.hex.chr(Encoding::UTF_8) }
# => "repülő"