RethinkDB: matching a substring in a list of strings - rethinkdb

Thanks to the answer here, I managed to get all the rows that contain a given string as a substring of a specific field's value with:
r.db('my_db').table('my_table').filter(lambda row: row['some_key'].match(".*some_given_string.*"))
What if I want to have a similar result, but this time, "some_key" is a list of strings instead of a single string? Say for the following table:
[{"name": "row1", "some_key": ["str1", "str2"]}, {"name": "row2", "some_key": ["str3", "blah"]}, {"name": "row3", "some_key": ["blah", "blahblah"]}]
I want to look for ".*tr.*" and get only the first two rows, because the list under "some_key" in the last row doesn't contain "tr" in any of its strings.
How can I do that with rethinkdb?

On a stream or array you can use contains, which behaves like an "any" operator when given a function.
r.db('my_db').table('my_table').filter(lambda row:
    row["some_key"].contains(lambda key:
        key.match(".*some_given_string.*")
    )
)

Short answer:
def has_match(row, regex):
    # Parentheses let the method chain span several lines in Python.
    return (row['some_key']
            .map(lambda x: x.match(regex))
            .reduce(lambda x, y: x | y))
my_table.filter(lambda row: has_match(row, ".*tr.*"))
Longer answer:
match is a method that you can call on a string. In general, in ReQL, when you have an array of X and a function you want to apply to each element of the array, you want to use the map command. For example, if you run:
r.expr(["foo", "boo", "bar"]).map(lambda x: x.match(".*oo"))
you'll get back:
[True, True, False]
I'm a bit unclear from your question, but I think what you want here is to get all the documents in which ANY of these strings matches the regex. To see if any of them match, you need to reduce the booleans together using or, so it would be:
list_of_bools.reduce(lambda x,y: x | y)
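For readers who want to check the "any element matches" logic without a database, here is a minimal plain-Ruby sketch of the same idea. It is only an analog of the ReQL queries above, not RethinkDB code: ROWS and rows_with_match are hypothetical names, and the data is the sample table from the question.

```ruby
# Plain-Ruby analog of the ReQL filter above (no RethinkDB involved).
# ROWS and rows_with_match are illustrative names, not part of any library.
ROWS = [
  { "name" => "row1", "some_key" => ["str1", "str2"] },
  { "name" => "row2", "some_key" => ["str3", "blah"] },
  { "name" => "row3", "some_key" => ["blah", "blahblah"] }
].freeze

# Keep the rows where at least one string under "some_key" matches the pattern.
def rows_with_match(rows, pattern)
  rows.select { |row| row["some_key"].any? { |s| s.match?(pattern) } }
end

puts rows_with_match(ROWS, /tr/).map { |row| row["name"] }.inspect
# prints ["row1", "row2"]
```

Note that any? returns false for an empty list, whereas reducing an empty list of booleans has no defined result, which is one reason an "any"-style operator is usually the more convenient choice.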

Related

Ruby regex: union 2K values in one regex

I am writing a process that reads a bunch of text files and captures a file's name if any of 2000 literals occurs in it (once or many times). I'm thinking of combining all of those values into one regex. Do you think that's doable? I tested with 100 values and it looks OK. Thanks, all.
The code below depicts my flow with sample code, just without the looping.
# 1. read regex value list as file [alpha,fox, delta] # 2000 values
# 2. read file into s #5000 files
# 3. find if any of #1 values exists in each #2 file. *with regex tweaks to match format dbname.dob.table
s = '1 dbName.dbo.ALPHA 2 DBNAME.bcd.ALPHA 3 dbName..ALPHA 4 ALPHA 5x dbName.alphA 6x alpha.XX 7x ###dbName.###a.alpha --alpha
dbName..FOX dbName.dbo.DELTA clarity.aba..fox '
value1 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:alpha)(?=\s|$)'
value2 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:fox)(?=\s|$)'
##...
value2000 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:delta)(?=\s|$)'
regex = /#{value1}|#{value2}|#{value2000}/i ## can I union 2000 regex's ???
puts 'reg1: ' + regex.to_s
puts 'result: ' + s.scan(regex).to_s
if s.scan(regex) then puts '...Match!!!d' end
Declaring 2000 variables is highly unnecessary; you should define all values in a single array, then somehow loop through them.
Also, the regular expression is highly repetitive - e.g. the use of (?:dbName\.[a-z]*\.) 2000 times. This can be simplified by grouping all of your values within the non-capture group as follows:
values = %w(alpha fox delta)
# .source embeds the bare alternation, so the /i flag also applies to the values
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{Regexp.union(values).source})(?=\s|$)/i
This is the result:
/(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:alpha|fox|delta)(?=\s|$)/i
If you extend that values array to contain 2000 strings, the other code does not need to change.
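As a rough check of this approach, the sketch below builds the pattern from a list and scans a shortened version of the sample string. The helper name build_name_regex is hypothetical. It interpolates Regexp.union(values).source rather than the Regexp object itself, on the assumption that matching should be case-insensitive as in the question: interpolating the Regexp object embeds its own (case-sensitive) flags for that group.

```ruby
# Build one case-insensitive pattern from a list of literal names.
# build_name_regex is an illustrative helper, not a library method.
def build_name_regex(values)
  # Regexp.union escapes each literal; .source gives the bare "alpha|fox|delta".
  alternation = Regexp.union(values).source
  /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{alternation})(?=\s|$)/i
end

regex = build_name_regex(%w[alpha fox delta])
sample = "dbName.dbo.ALPHA x dbName.alphA fox alpha.XX"
puts sample.scan(regex).inspect
# prints ["dbName.dbo.ALPHA", "fox"]
```

Because the pattern has no capture groups, scan returns the matched substrings directly, and an empty array means no value was found.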
Provided two conditions are met, I would do it as follows. I think this would be far more efficient than using a gigantic regular expression, which by its nature requires a linear search of the "bad words" for each word in the string, until a match is found or it is determined that there are no matches.
We are given a file whose path is contained in a variable fname and an array of bad words:
arr = ["alpha", "fox", "delta", "charlie", "mabel"]
The first condition that I spoke of above is that, by way of example, "ALPHA" and "Alpha" match "alpha", but "aLPha" does not (or some variant of that).
The second condition is that there is a regular expression with a capture group that would capture a bad word if a bad word were present at the given location in a match. For example:
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(\p{Alpha}+)(?=\s|$)/
Wherever there is a match, the capture group (\p{Alpha}+) captures a string of one or more alphabetic characters (letters), whose value is assigned to the global variable $1. We will then check to see if the value of $1 is a bad word. (The regular expression might have other capture groups as well, in which case we might be looking for $2 or $3, say, or a named capture group.)
If there were more than one such regular expression to check for, the code below could be executed for each of them until a match is found or it is determined that there are no more matches.
The first step is to convert the array of bad words to a set:
require 'set'
bad_words = arr.flat_map { |w| [w, w.capitalize, w.upcase] }.to_set
#=> #<Set: {"alpha", "Alpha", "ALPHA", "fox", "Fox", "FOX",
# "delta", "Delta", "DELTA", "charlie", "Charlie", "CHARLIE",
# "mabel", "Mabel", "MABEL"}>
This allows very fast word lookups--much faster than stepping through an array. We may then search the file as follows.
rv = IO.foreach(fname).any? do |line|
  line.gsub(regex).any? { bad_words.include?($1) }
end
IO::foreach without a block returns an enumerator. We can chain that to any? to determine whether there is a line that contains a match of the regular expression whose capture group value is in the set bad_words. If such a line is found, the search terminates and true is returned; otherwise, false is returned.
String#gsub without a block also returns an enumerator, which here I've chained to any?. This form of gsub has nothing to do with string replacement; it just generates matches. Those matches are passed to the block, but we are only interested in the contents of the capture group, which are held by $1. Hence the expression bad_words.include?($1).
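The set-lookup idea can be tried without a file. The following is a minimal sketch under a few assumptions: it takes an array of lines instead of reading from fname, it uses String#scan plus flatten in place of the gsub enumerator (a swapped-in variation that collects every capture on a line before checking), and the names NAME_REGEX, build_bad_words and any_bad_word? are hypothetical.

```ruby
require 'set'

# Same shape as the regex above: one capture group holding the candidate word.
NAME_REGEX = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(\p{Alpha}+)(?=\s|$)/

# Expand each word into the three accepted spellings and store them in a Set.
def build_bad_words(words)
  words.flat_map { |w| [w, w.capitalize, w.upcase] }.to_set
end

# True as soon as some line has a captured word that is in the set.
def any_bad_word?(lines, bad_words)
  lines.any? do |line|
    line.scan(NAME_REGEX).flatten.any? { |word| bad_words.include?(word) }
  end
end

bad_words = build_bad_words(%w[alpha fox delta])
puts any_bad_word?(["1 dbName.dbo.ALPHA 2", "nothing here"], bad_words) # prints true
puts any_bad_word?(["dbName.dbo.aLPha"], bad_words)                     # prints false
```

The second call returns false because "aLPha" is not one of the three spellings stored in the set, which mirrors the first condition stated above.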

Replace with multiple patterns mutually exclusively

I have the following text:
a phrase whith length one, which is "uno"
Using the following dictionary,
1) phrase --- frase
2) a phrase --- una frase
3) one --- uno
4) uno --- one
I'm trying to replace the occurrences of the dictionary items in the text. The desired output is:
[a phrase|una frase] whith length [one|uno], which is "[uno|one]"
I've done this:
text = %(a phrase whith length one, which is "uno")
dictionary.each do |original, translation|
text.gsub! original, "[#{original}|#{translation}]"
end
This snippet outputs the following for each dictionary word:
1) a [phrase|frase] whith length one, which is "uno"
2) a [phrase|frase] whith length one, which is "uno"
3) a [phrase|frase] whith length [one|uno], which is "uno"
4) a [phrase|frase] whith length [one|[uno|one]], which is "[uno|one]"
I see two problems here:
1) The word phrase is being replaced instead of a phrase. I think that this can be fixed by sorting the dictionary by length, giving priority to longer terms.
2) The already replaced words are being re-replaced, like uno in [one|uno]. I thought of using some sort of regular expression list (with Regexp::union), but I don't know how efficient and clean it'll be.
Any ideas?
To solve your second problem, you have to replace in a single pass.
Convert the dictionary into a hash with the key-value pairs in the order you mention (sorted by length, perhaps).
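dictionary = {
  "a phrase" => "[a phrase|una frase]",
  "phrase" => "[phrase|frase]",
  "one" => "[one|uno]",
  "uno" => "[uno|one]",
}
Then replace all in a single pass. Note that the word boundaries must be written in regex literals: in a double-quoted string, "\b" is a backspace character, and Regexp.union escapes plain strings.
text.gsub(Regexp.union(*dictionary.keys.map{|w| /\b#{w}\b/}), dictionary)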
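To see the single-pass idea end to end, here is a small sketch that starts from the plain original-to-translation dictionary of the question and sorts the terms longest first. The helper name annotate is hypothetical, and the terms are assumed to begin and end with word characters so that \b makes sense around them.

```ruby
# Hypothetical helper: wrap every dictionary term in "[original|translation]".
def annotate(text, dictionary)
  # Longest terms first, so a longer term wins over a shorter one
  # that starts at the same position.
  terms = dictionary.keys.sort_by { |term| -term.length }
  # Regexp.union escapes the terms; .source yields the bare alternation.
  pattern = /\b(?:#{Regexp.union(terms).source})\b/
  # One pass over the text: replaced output is never scanned again.
  text.gsub(pattern) { |match| "[#{match}|#{dictionary[match]}]" }
end

dictionary = {
  "phrase" => "frase",
  "a phrase" => "una frase",
  "one" => "uno",
  "uno" => "one"
}
puts annotate(%(a phrase whith length one, which is "uno"), dictionary)
# prints [a phrase|una frase] whith length [one|uno], which is "[uno|one]"
```

Because gsub walks the text once from left to right, the "uno" inside "[one|uno]" is part of the output rather than the input and is therefore not replaced again.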

Lua: Inverse of string.char()?

I'm wondering if there is a function that does the exact opposite of string.char(). It would be convenient to get a number value from letters in order to sort things alphabetically.
string.byte() is probably what you're looking for.
To get the numeric value of the first byte of a string, you can use either string.byte(str) or str:byte(), where str is your string in question.
However, if you're sorting a table, or doing a sort in general, Lua actually has you covered! You can compare two strings as if they were numbers! "A" < "B" returns true and "B" < "A" returns false. This also works for multiple letters in a string. "Ba" > "Aa" and "Ab" > "Aa" and so on. So you can do table.sort(t) or if you're sorting by a sub value, table.sort(t,function(a,b) return a.text < b.text end). Hope this helps!

Use ruby to remove a part of a string on each entry in an array where it exists

I have a list of file paths, for example
[
'Useful',
'../Some.Root.Directory/Path/Interesting',
'../Some.Root.Directory/Path/Also/Interesting'
]
(I mention that they're file paths in case that makes the task easier, but they can be treated simply as a set of strings, some of which may start with a particular string.)
and I need to make this into a set of pairs so that I have the original list but also
[
'Useful',
'Interesting',
'Also/Interesting'
]
I expected I'd be able to do this
'../Some.Root.Directory/Path/Interesting'.gsub!('../Some.Root.Directory/Path/', '')
or
'../Some.Root.Directory/Path/Interesting'.gsub!('\.\.\/Some\.Root\.Directory\/Path\/', '')
but neither of those replaces the provided string/pattern with an empty string...
So in irb
puts '../Some.Root.Directory/Path/Interesting'.gsub('\.\.\/Some\.Root\.Directory\/Path\/', '')
outputs
../Some.Root.Directory/Path/Interesting
and the desired output is
Interesting
How can I do this?
NB the path will be passed in so really I have
file_path.gsub!(removal_path, '')
If you are positive that strings start with removal_path you can do:
string[removal_path.size..-1]
to get the remaining part.
If you want to get pairs of the original paths and the shortened ones, you can use sub in combination with map:
a = [
'../Some.Root.Directory/Path/Interesting',
'../Some.Root.Directory/Path/Also/Interesting'
]
b = a.map do |v|
  # include the trailing slash so the result is 'Interesting', not '/Interesting'
  [v, v.sub('../Some.Root.Directory/Path/', '')]
end
puts b
This returns an array of arrays; each sub-array contains the original path plus the shortened one. As noted by @sawa, you can simply use sub instead of gsub, since you want to replace only a single occurrence.
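Since the path to remove is passed in as a plain string, another option is String#delete_prefix, which is available from Ruby 2.5 onward and involves no regular expression at all. The sketch below assumes that version; pair_with_short is a hypothetical helper name.

```ruby
# Pair each path with a copy that has the leading removal_path stripped.
# Strings that do not start with removal_path are returned unchanged.
def pair_with_short(paths, removal_path)
  paths.map { |path| [path, path.delete_prefix(removal_path)] }
end

paths = [
  'Useful',
  '../Some.Root.Directory/Path/Interesting',
  '../Some.Root.Directory/Path/Also/Interesting'
]

pair_with_short(paths, '../Some.Root.Directory/Path/').each do |original, short|
  puts "#{original} => #{short}"
end
```

Unlike slicing with string[removal_path.size..-1], this does not require being sure in advance that every string starts with the prefix.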

Overlapping string matching using regular expressions

Imagine we have some sequence of letters in the form of a string, call it
str = "gcggcataa"
The regular expression
r = /(...)/
matches any three characters, and when I execute the code
str.scan(r)
I get the following output:
["gcg", "gca", "taa"]
However, what if, instead of the distinct, non-overlapping strings above, I wanted to scan through and get every overlapping three-character substring, like this:
["gcg", "cgg", "ggc", "gca", "cat", "ata", "taa"]
What regular expression would allow this?
I know I could do this with a loop, but I don't want to do that.
str = "gcggcataa"
str.chars.each_cons(3).map(&:join) # => ["gcg", "cgg", "ggc", "gca", "cat", "ata", "taa"]
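If a regular expression is specifically wanted, one common technique is to put the capture group inside a zero-width lookahead, so that each match consumes no characters and the scan advances one position at a time. A minimal sketch, with overlapping_triples as a hypothetical helper name:

```ruby
# Capture three characters inside a lookahead; the match itself is empty,
# so scan moves forward by one character after each match.
def overlapping_triples(str)
  # scan returns one array per match when the pattern has a capture group,
  # hence the flatten.
  str.scan(/(?=(...))/).flatten
end

puts overlapping_triples("gcggcataa").inspect
# prints ["gcg", "cgg", "ggc", "gca", "cat", "ata", "taa"]
```

For this input the result is the same as the each_cons version above.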
