Searchkick substring matches lookup - elasticsearch

I'm thinking this might be a question for the wider Elasticsearch community, but since we're using Searchkick, I thought I'd start here...
We have an index containing records with multiple string fields, say:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Sting", "", "Bass"
"Ringo", "Starr", "Drums"
"Paul", "McCartney", "Bass"
I want to pass searchkick/elasticsearch a long string, say:
"It is known that Jimi liked to set light to his guitar and smash up all the drums while on stage."
and I want to get back the records that have any matches, preferably ordered with the most matches first:
"Jimi", "Hendrix", "Guitar"
"Phil", "Collins", "Drums"
"Ringo", "Starr", "Drums"
How do I go about setting up the query?
Thanks!

Related

Is there a way to use grok to break up a message using the character numbers?

So for instance the log I need to break apart is something like this
"01234567895467894ACCP 844"
Where
0123456789 is phone number,
5467894 mandate number,
ACCP is the type of mandate; the type can be up to 6 characters long, so a 4-character type like ACCP gets 2 spaces afterward. 844 is some other number. What I need to do is separate the line based on character position, and the positions will always be constant.
So something like %{CHAR 0-10:Phonenumber}%{CHAR 11-18:Mandate}%{CHAR 19-24:Type}. Is there some way to do this using grok? I tried looking but did not find anything like it.
The following regular expression based grok expression allows you to capture what you expect:
(?<Phonenumber>\d{10})(?<Mandate>\d{7})(?<Type>[A-Z\s]{4,})(?<Other>\d{3,})
You'd get this:
{
  "Phonenumber": "0123456789",
  "Mandate": "5467894",
  "Type": "ACCP ",
  "Other": "844"
}
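Since the question says the column widths never change, a strictly fixed-width variant may be worth trying. The sketch below checks the idea in plain Ruby, which supports the same (?<name>...) named-group syntax; it is not a tested grok configuration. The sample line (with the type padded to 6 characters) and the constant name FIXED_WIDTH are assumptions made for illustration, and whether the pattern behaves identically inside grok should be verified separately.

```ruby
# Assumed sample: the type field is padded to 6 characters ("ACCP" + 2 spaces).
line = "01234567895467894ACCP  844"

# Fixed-width named groups: 10 digits, 7 digits, any 6 characters, then the
# remaining digits. FIXED_WIDTH is a hypothetical name for this sketch.
FIXED_WIDTH = /\A(?<Phonenumber>\d{10})(?<Mandate>\d{7})(?<Type>.{6})(?<Other>\d+)\z/

match = FIXED_WIDTH.match(line)

# Strip the padding so "ACCP  " becomes "ACCP".
fields = match.named_captures.transform_values(&:strip)

puts fields.inspect
```

Using .{6} instead of [A-Z\s]{4,} ties the Type field to its width rather than to its contents, so the padding is trimmed afterwards instead of being guessed at by the pattern.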

mutate() and str_replace() function

I'm trying to remove any strings that contain the word "Seattle" in the column named "County" in the table named Population:
Population %>%
mutate( str_replace(County, "Seattle", ""))
It gives me an error message.
I suspect you are getting an error because in your mutate() call you aren't naming the column that should receive the result...
Also, I think you will have better success with if_else(), detecting the string pattern "Seattle" with grepl() and then replacing the contents. Below is the code I've used for something similar.
Population %>%
mutate(County = if_else(grepl("Seattle", County),"",County))
The grepl will detect the string pattern in the County field and provide a TRUE/FALSE return. From there, you just define what to do if it is found to be true, i.e. replace it with nothing (""), or keep the value as is (County).

Elasticsearch term query with colons

I have a string field "title"(not analyzed) in elasticsearch. A document has title "Garfield 2: A Tail Of Two Kitties (2006)".
When I use the following json to query, no result returns.
{"query":{"term":{"title":"Garfield 2: A Tail Of Two Kitties (2006)"}}}
I tried to escape the colon character and the braces, like:
{"query":{"term":{"title":"Garfield 2\\: A Tail Of Two Kitties \\(2006\\)"}}}
Still not working.
A term query won't tokenize or apply analyzers to the search text. Instead it looks for an exact match against the indexed tokens, which won't work when the string field has been analyzed/tokenized, as string fields are by default.
To explain this in more detail:
Let's say there is a string value "I am in summer:camp".
When it is indexed, it is broken into tokens as below:
"I am in summer:camp" => [ I , am , in , summer , camp ]
Hence even if you do a term search for "I am in summer:camp", it still won't work, because the single token "I am in summer:camp" is not present in the index.
Something like a phrase query (match_phrase) might work better here.
Or you can set the field's "index" option to "not_analyzed" in the mapping to make sure that the string is not tokenized.
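The mismatch described above can be illustrated with a toy model in Ruby. This is only a sketch: toy_analyze is a made-up stand-in that lowercases and splits on non-word characters, not Elasticsearch's real analyzer, and the two arrays merely imitate what an analyzed and a not_analyzed field would hold.

```ruby
# Toy model only: NOT Elasticsearch's analyzer, just a rough stand-in
# that lowercases the text and splits it on non-word characters.
def toy_analyze(text)
  text.downcase.scan(/\w+/)
end

title = "Garfield 2: A Tail Of Two Kitties (2006)"

analyzed_tokens     = toy_analyze(title) # roughly what an analyzed field stores
not_analyzed_tokens = [title]            # a not_analyzed field stores one token

# A term query looks up the search text as one unmodified token.
term_hit_analyzed     = analyzed_tokens.include?(title)
term_hit_not_analyzed = not_analyzed_tokens.include?(title)

puts analyzed_tokens.inspect
puts "term hit on analyzed field:     #{term_hit_analyzed}"
puts "term hit on not_analyzed field: #{term_hit_not_analyzed}"
```

In this model the full title never appears among the analyzed tokens, so the exact lookup can only succeed against the field that kept the string whole. Escaping the colon or parentheses makes no difference, because the lookup is not parsed as query syntax.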

How to match undefined number of arguments or how to match known keywords in a regular expression

Some questions about regex, simple for you but not for me.
a) I want to match a string using a regular expression.
keyword term1,term2,term3,.....termN
The number of terms is undefined. I know how to begin but after I am lost ;-)
\(\w+)(\s+) but after ?\i
b) A little bit more complicated:
capitale france paris,england london,germany berlin, ...
I want to separate the couples ai bi in order to analyse them.
c) how to check if one among several keywords are present or not ?
direction LEFT,RIGHT,UP,DOWN
This isn't a good task for a regular expression, at least not the way you want to use it. In addition, you're asking several questions that have to be addressed in several steps; determining duplicates isn't part of a regex's skill set.
Regular expressions assume there is a repeating pattern, and if you're trying to parse an entire line with an indeterminate number of elements at once, it will take a very complex pattern.
I'd recommend you use a simple split(',') to break the line on commas:
'keyword term1,term2,term3,.....termN'.split(',')
# => ["keyword term1", "term2", "term3", ".....termN"]
'capitale france paris,england london,germany berlin, ...'.split(',')
# => ["capitale france paris", "england london", "germany berlin", " ..."]
Once you have the line split, if you want to break apart complex entries on white-space, use a bare split:
'capitale france paris,england london,germany berlin, ...'.split(',').map(&:split)
# => [["capitale", "france", "paris"],
#     ["england", "london"],
#     ["germany", "berlin"],
#     ["..."]]
This will all fall apart if there are embedded commas in a field. The data you're working with looks like CSV (comma-separated values), and that spec allows for them. If you're working with true CSV data, then use the CSV library that comes with Ruby. It will save your sanity and keep you from trying to reinvent the wheel.
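To make the embedded-comma problem concrete, here is a small sketch comparing a bare split(',') with CSV.parse_line from Ruby's csv library. The sample line is invented for illustration and assumes the data quotes fields the way CSV does.

```ruby
require 'csv'

# Invented sample: the second field contains a comma and is quoted, as CSV allows.
line = 'capitale,"paris, france",london'

# A plain split cuts the quoted field in two and keeps the quote characters.
naive = line.split(',')

# The CSV parser honours the quotes and returns the field intact.
parsed = CSV.parse_line(line)

puts naive.inspect
puts parsed.inspect
```

The naive version yields four pieces, while the parser returns the three fields that were actually written.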
To count keywords you can do something like:
entries = 'capitale france paris,england london,germany berlin, ...'.split(',').map(&:split)
# => [["capitale", "france", "paris"],
#     ["england", "london"],
#     ["germany", "berlin"],
#     ["..."]]
keywords = Hash.new { |h, k| h[k] = 0 }
entries.each do |entry|
  entry.each do |e|
    keywords[e] += 1 if e[/\b(?:france|england|germany)\b/i]
  end
end
keywords # => {"france"=>1, "england"=>1, "germany"=>1}
There are other ways to do this using various methods in Enumerable and Array, but this demonstrates the technique. I used a pattern to locate the keyword hits because it's fast and can find the keyword within a string. You could do a lookup using index or find or any? but they'll slow your code as the list of keywords grows.
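For part (c) of the question, the same split-first approach can be sketched as follows. The names ALLOWED and parse_directions are hypothetical, and the sample input (including the unknown value SIDEWAYS) is made up; with a short, fixed keyword list like this one, plain array operations should be fast enough.

```ruby
# Hypothetical list of accepted keywords, taken from the question's example.
ALLOWED = %w[LEFT RIGHT UP DOWN].freeze

# Splits "direction LEFT,UP,..." into the leading keyword and its values,
# then sorts the values into known and unknown ones.
def parse_directions(line)
  keyword, list = line.split(' ', 2)
  values = list.to_s.split(',').map(&:strip)
  { keyword: keyword, known: values & ALLOWED, unknown: values - ALLOWED }
end

result = parse_directions('direction LEFT,UP,SIDEWAYS')
puts result.inspect
```

Here values & ALLOWED keeps only the recognised keywords (in input order) and values - ALLOWED collects anything else, so a non-empty :unknown list signals invalid input.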

Split a string with multiple delimiters in Ruby

Take for instance, I have a string like this:
options = "Cake or pie, ice cream, or pudding"
I want to be able to split the string on "or", ",", and ", or".
I have been able to do it, but only by splitting on "," and ", or" first, then splitting each array item on "or", and flattening the resulting array afterwards, like this:
options = options.split(/(?:\s?or\s)*([^,]+)(?:,\s*)*/).reject(&:empty?);
options.each_index {|index| options[index] = options[index].sub("?","").split(" or "); }
The resultant array is as such: ["Cake", "pie", "ice cream", "pudding"]
Is there a more efficient (or easier) way to split my string on those three delimiters?
What about the following:
options.gsub(/ or /i, ",").split(",").map(&:strip).reject(&:empty?)
This replaces every delimiter other than "," with a comma,
splits the string at each comma,
strips each item, since an item like " ice cream" might be left with a leading space,
and removes all blank strings.
First of all, your method could be simplified a bit with Array#flatten:
>> options.split(',').map{|x|x.split 'or'}.flatten.map(&:strip).reject(&:empty?)
=> ["Cake", "pie", "ice cream", "pudding"]
I would prefer using a single regex:
>> options.split /\s*, or\s+|\s*,\s*|\s+or\s+/
=> ["Cake", "pie", "ice cream", "pudding"]
You can use | in a regex to give alternatives, and putting , or first guarantees that it won’t produce an empty item. Matching the surrounding whitespace in the regex itself is probably best for efficiency, since you don’t have to go over the array again to strip it.
As Zabba points out, you may still want to reject empty items, prompting this solution:
>> options.split(/,|\sor\s/).map(&:strip).reject(&:empty?)
=> ["Cake", "pie", "ice cream", "pudding"]
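As "or" and "," do the same thing, the best approach is to tell the regex that a run of several delimiters should be treated the same as a single one:
options = "Cake or pie, ice cream, or pudding"
# \b keeps "or" from matching inside words such as "orange"
regex = /(?:\s*(?:,|\bor\b)\s*)+/
options.split(regex)
# => ["Cake", "pie", "ice cream", "pudding"]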
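One caveat applies to any of these patterns: an "or" alternative without word boundaries also matches inside words. The sketch below compares a pattern with a bare "or" against one that uses \bor\b; the constant names and the input string are made up for illustration.

```ruby
# Made-up input containing a word that starts with "or".
options = "orange or apple, or pear"

# Bare "or": also matches the first two letters of "orange".
BARE_OR  = /(?:\s*(?:,|or)\s*)+/
# Word boundaries restrict the match to the standalone word "or".
WHOLE_OR = /(?:\s*(?:,|\bor\b)\s*)+/

bare  = options.split(BARE_OR)
whole = options.split(WHOLE_OR)

puts bare.inspect   # => ["", "ange", "apple", "pear"]
puts whole.inspect  # => ["orange", "apple", "pear"]
```

The word-boundary version costs nothing on the original example and avoids mangling items such as "orange" or "corn".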
