regexp find for Chinese unicode character [duplicate] - go

I have code like this:
re, err := regexp.Compile(`\p{Han}*`)
if err != nil {
	fmt.Println(err)
	return
}
s := "foo中文哦woqu"
fmt.Println(re.FindString(s))
but it prints an empty string.
When I change \p{Han}* to \p{Han}+, it prints 中文哦.
When I change \p{Han}* to \p{Han}?, it prints an empty string again.
I found this in the documentation:
x* zero or more x, prefer more
x+ one or more x, prefer more
x? zero or one x, prefer one
so I expected the output to be:
\p{Han}* prints 中文哦
\p{Han}+ prints 中文哦
\p{Han}? prints 中
Could someone tell me what is happening?

As the docs say (emphasis added):
FindString returns a string holding the text of the leftmost match in s of the regular expression. If there is no match, the return value is an empty string, but it will also be empty if the regular expression successfully matches an empty string. Use FindStringIndex or FindStringSubmatch if it is necessary to distinguish these cases.
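\p{Han}* matches an empty string, and the leftmost match in s is the empty match at position 0, before the "f", so that is what FindString returns. You can also see that by using FindAllString: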
fmt.Printf("%q", re.FindAllString(s, -1))
// Prints ["" "" "" "中文哦" "" "" "" ""]
You can use \p{Han}+ which doesn't match an empty string.
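To tell "no match" apart from "matched the empty string", the docs quoted above point to FindStringIndex. The following is a minimal sketch of that idea; describeMatch is a hypothetical helper name introduced only for illustration, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// describeMatch (hypothetical helper) reports whether pattern matched s at
// all and, if it did, whether the leftmost match was empty.
func describeMatch(pattern, s string) string {
	re := regexp.MustCompile(pattern)
	loc := re.FindStringIndex(s) // nil means there was no match at all
	switch {
	case loc == nil:
		return "no match"
	case loc[0] == loc[1]:
		return fmt.Sprintf("empty match at byte %d", loc[0])
	default:
		return fmt.Sprintf("matched %q", s[loc[0]:loc[1]])
	}
}

func main() {
	s := "foo中文哦woqu"
	fmt.Println(describeMatch(`\p{Han}*`, s))     // empty match at byte 0
	fmt.Println(describeMatch(`\p{Han}+`, s))     // matched "中文哦"
	fmt.Println(describeMatch(`\p{Han}+`, "foo")) // no match
}
```

With \p{Han}* the leftmost match is the zero-length one before "f", which is why FindString returns an empty string even though the regexp did match.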

Related

Unable to Match numbers only using Regular Expression in golang using Regex

matched, err := regexp.MatchString(`[0-9]`, `a.31`)
fmt.Println(matched)
The above expression returns true. Isn't it supposed to be false?
I want to match only numbers, so why does "a.31" return true? I've noticed that any string containing at least one digit returns true.
How can I make it return true only for numbers?
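For context, regexp.MatchString reports whether the string contains any match of the pattern, so an unanchored [0-9] is satisfied by the 3 in "a.31". A minimal sketch of one possible fix follows, assuming "only numbers" means one or more ASCII digits and nothing else; isNumber and digitsOnly are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// digitsOnly is anchored at both ends, so the whole string must be digits.
var digitsOnly = regexp.MustCompile(`^[0-9]+$`)

// isNumber (hypothetical helper) reports whether s consists solely of one or
// more ASCII digits.
func isNumber(s string) bool {
	return digitsOnly.MatchString(s)
}

func main() {
	fmt.Println(isNumber("a.31")) // false
	fmt.Println(isNumber("31"))   // true

	// To extract the digit runs instead of validating the whole string:
	fmt.Printf("%q\n", regexp.MustCompile(`[0-9]+`).FindAllString("a.31", -1)) // ["31"]
}
```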

How to compare a character with the next one in the same string

I am struggling a bit with how to work with strings in Go. Apparently there are no "while" loops, only "for" loops, and I can't see how to achieve what I want with them.
Basically, given the string "helloujjkk", I want to compare each character with the next one and check whether they match.
For example, for "helloujjkk" I want to return "l", "j", and "k", because those characters are followed by the same character.
This is how I did it in Python:
hello = "helloujjkk"
i = 0
while i < len(hello) - 1:
    if hello[i] == hello[i+1]:
        print(hello[i])
    i += 1
So far, this is how I am iterating over the string:
word := "helloujjkk"
for _, character := range word {
	fmt.Println(string(character))
}
but I haven't found a way to get the "next" character in the string.
You can do the same thing you did in Python:
word := "helloujjkk"
for i := 0; i < len(word)-1; i++ {
	if word[i] == word[i+1] {
		fmt.Println(string(word[i]))
	}
}
However, this will break if your word contains multibyte characters. String indexing in Go treats the string as an array of bytes, so word[i] is the i'th byte of the string. This is not necessarily the i'th character.
A better solution would be to keep the last character read from the string:
var last rune
for i, c := range word {
	if i > 0 && c == last {
		fmt.Println(string(c))
	}
	last = c
}
A range over a string will iterate the runes of the string, not the bytes. So this version is correct even if the string contains multibyte characters.
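As a self-contained sketch of the rune-based approach, the loop can be wrapped in a function and tried on a string with multibyte characters. The name repeatedRunes and the second test string are assumptions made for this example.

```go
package main

import "fmt"

// repeatedRunes (hypothetical helper) returns every rune in word that is
// immediately preceded by the same rune, i.e. one entry per adjacent pair.
func repeatedRunes(word string) []rune {
	var out []rune
	var last rune
	for i, c := range word { // i is a byte offset, c is a whole rune
		if i > 0 && c == last {
			out = append(out, c)
		}
		last = c
	}
	return out
}

func main() {
	fmt.Println(string(repeatedRunes("helloujjkk"))) // ljk
	fmt.Println(string(repeatedRunes("日日本語語")))      // 日語
}
```

Because range decodes one rune per iteration, the comparison works on characters rather than on the individual bytes of a multibyte encoding.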

Classic ASP InStr() Evaluates True on Empty Comparison String

I ran into an issue with the Classic ASP VBScript InStr() function. As shown below, the second call to InStr() returns 1 when searching for an empty string in a non-empty string. I'm curious why this is happening.
' InStr Test
Dim someText : someText = "So say we all"
Dim emptyString : emptyString = ""
'' I expect this to be true
If InStr(1, someText, "so", 1) > 0 Then
    Response.Write("I found ""so""<br />")
End If
'' I expect this to be false
If InStr(1, someText, emptyString, 1) > 0 Then
    Response.Write("I found an empty string<br />")
End If
EDIT:
Some additional clarification: the question came up while I was debugging legacy code and ran into a situation like this:
Function Go(value)
    If InStr(1, "Option1|Option2|Option3", value, 1) > 0 Then
        ' Do some stuff
    End If
End Function
In some cases function Go() can get called with an empty string. The original developer's intent was not to check whether value was empty, but rather whether value was equal to one of the pipe-delimited values (Option1, Option2, etc.).
Thinking about this further it makes sense that every string is created from an empty string, and I can understand why a programming language would assume a string with all characters removed still contains the empty string.
What doesn't make sense to me is why programming languages are implementing this. Consider these 2 statements:
InStr("so say we all", "s") '' evaluates to 1
InStr("so say we all", "") '' evaluates to 1
The InStr() function will return the position of the first occurrence of one string within another. In both of the above cases, the result is 1. However, position 1 always contains the character "s", not an empty string. Furthermore, using another string function like Len() or LenB() on an empty string alone will result in 0, indicating a character length of 0.
It seems that there is some inconsistency here. The empty string contained in all strings is not actually a character, but the InStr() function treats it as one when other string functions do not. I find this unintuitive and unnecessary.
The Empty String is the Identity Element for Strings:
The identity element I (also denoted E, e, or 1) of a group or related
mathematical structure S is the unique element such that Ia=aI=a for
every element a in S. The symbol "E" derives from the German word for
unity, "Einheit." An identity element is also called a unit element.
If you add 0 to a number n the result is n; if you add/concatenate "" to a string s the result is s:
>> WScript.Echo CStr(1 = 1 + 0)
>> WScript.Echo CStr("a" = "a" & "")
>>
True
True
So every String and SubString contains at least one "":
>> s = "abc"
>> For p = 1 To Len(s)
>> WScript.Echo InStr(p, s, "")
>> Next
>>
1
2
3
and InStr() reports that faithfully. The docs even state:
InStr([start, ]string1, string2[, compare])
...
The InStr function returns the following values:
...
If string2 is zero-length, InStr returns start
WRT your
However, position 1 always contains the character "s", not an empty
string.
==>
Position 1 always contains the character "s", and therefore an empty
string too.
I'm puzzled why you think this behavior is incorrect. To the extent that asking Does 'abc' contain ''? even makes sense, the answer has to be "yes": All strings contain the empty string as a trivial case. So the answer to your "why is this happening" question is because it's the only sane thing to do.
It is correct, IMHO. At least, I would expect the empty string to be part of any other string. But maybe this is a philosophical question. ASP behaves this way, so live with it. Practically speaking, if you need different behavior, write your own function, InStrNotEmpty or something, which returns false for an empty search string.

Splitting with empty space in Ruby [duplicate]

In both Ruby and JavaScript I can write the expression " x ".split(/[ ]+/). In JavaScript I get a fairly reasonable result, ["", "x", ""], but in Ruby (2.0.0) I get ["", "x"], which I find quite counterintuitive. I have trouble understanding how regular expressions work in Ruby. Why don't I get the same result as in JavaScript, or just ["x"]?
From the String#split documentation, emphasis my own:
split(pattern=$;, [limit])
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters. If pattern contains groups, the respective matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is the default), str is split on whitespace as if ` ' were specified.
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
So if you were to use " x ".split(/[ ]+/, -1) you would get your expected result of ["", "x", ""]
I found this in the C code for String#split, almost right at the end:
if (NIL_P(limit) && lim == 0) {
    long len;
    while ((len = RARRAY_LEN(result)) > 0 &&
           (tmp = RARRAY_AREF(result, len-1), RSTRING_LEN(tmp) == 0))
        rb_ary_pop(result);
}
So it actually pops empty strings off the end of the result array before returning! It looks like the creators of Ruby didn't want String#split to return a bunch of empty strings.
Notice the check for NIL_P(limit) -- this accords exactly with what the documentation says, as @dax pointed out.

Ruby: String Comparison Issues

I'm currently learning Ruby, and am enjoying almost everything except a small string comparison issue.
answer = gets()
if (answer == "M")
  print("Please enter how many numbers you'd like to multiply: ")
elsif (answer == "A")
  print("Please enter how many numbers you'd like to sum: ")
else
  print("Invalid answer.")
  print("\n")
  return 0
end
What I'm doing is using gets() to ask whether the user wants to multiply or add their input (I've tested both functions; they work). I read that input later with some more input functions and float conversions (which also work).
What happens is that I enter A and I get "Invalid answer." The same happens with M.
What is happening here? (I've also used .eql? (sp), and that returns bupkis as well.)
gets returns the entire line entered, including the newline, so when they type "M" and press Enter the string you get back is "M\n". To get rid of the trailing newline, use String#chomp, i.e. replace your first line with answer = gets.chomp.
The issue is that Ruby is including the trailing newline in the value.
Change your first line to:
answer = gets().strip
And your script will run as expected.
Also, you should use puts instead of two print statements, as puts automatically adds the newline character.
Your answer is being returned with a newline appended, so the input is never equal to "A" but to "A\n".
You can see this if you change your reject line to print("Invalid answer.[#{answer}]"). You could also change your comparison to if (answer.chomp == ..)
I've never used gets, but I think that if you hit Enter your variable answer will contain the "\n". Try calling .chomp to remove it.
Add a newline when you check your answer...
answer == "M\n"
answer == "A\n"
Or chomp your string first: answer = gets.chomp
