SPHINX field search operator issue - full-text-search

I am using Sphinx 2.0.4-release with the SPH_MATCH_EXTENDED2 query syntax. When I have an "empty value" in my query, i.e.:
blah & ''
Sphinx ignores it and searches for just "blah". It still works the same way when I use the field search operator and the empty value comes last:
@field1 blah @field2 ''
But this query:
@field1 '' @field2 blah
causes an error: syntax error, unexpected TOK_FIELDLIMIT near ' '' @field2 blah'. Of course I can trim empty values, but this behaviour seems illogical to me. Am I doing something wrong, or is it actually a bug?

Sphinx uses an inverted index. It breaks the text up into words and stores (hashes of) them.
As such it doesn't index 'nothing' (it's not a word), so you can't search for an empty string.
All of those queries are strictly syntax errors, and nonsense. But in some cases Sphinx silently disposes of invalid syntax: it falls back to treating the quotes as word characters, which are not in charset_table and so get dropped. In doing so it comes up with a 'valid' query (just not the one you intended).
The solution is simply to turn an empty field into a 'word' at indexing time; then you can search for the empty string!
E.g.
sql_query = SELECT id, title, IF(field1 = '','EMPTY_STRING',field1) AS field1, ....
Then you can just query as
@field1 EMPTY_STRING @field2 blah
What you use as 'EMPTY_STRING' is completely arbitrary.
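On the query side, the same placeholder has to be substituted whenever a supplied value is empty. A minimal Ruby sketch of that step, assuming the fields arrive as a hash and that the field operator is written @field as in Sphinx's extended syntax; build_field_query and the EMPTY_STRING constant are hypothetical names, not part of any Sphinx client API:

```ruby
# Placeholder word; it must match whatever sql_query writes into empty fields.
EMPTY_STRING = "EMPTY_STRING"

# Hypothetical helper: builds an extended-syntax query string from
# { field_name => search_text }, replacing blank values with the placeholder.
def build_field_query(fields)
  fields.map do |field, text|
    value = text.to_s.strip
    value = EMPTY_STRING if value.empty?
    "@#{field} #{value}"
  end.join(" ")
end

puts build_field_query("field1" => "", "field2" => "blah")
```

This keeps the query syntactically valid no matter which field happens to be empty.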

Related

MariaDB fulltext search with special chars and "word starts with"

I can do a MariaDB fulltext query which searches for words beginning with a given prefix, like this:
select * from mytable
where match(mycol) against ('+test*' in boolean mode)>0.0;
This finds words like "test", "tester", "testing".
If my search string contains special characters, I can put the search string in quotes:
select * from mytable
where match(mycol) against ('+"test-server"' in boolean mode)>0.0;
This will find all rows which contain the string test-server.
But it seems I cannot combine both:
select * from mytable
where match(mycol) against ('+"test-serv"*' in boolean mode)>0.0;
This results in an error:
Error: (conn:7) syntax error, unexpected $end, expecting FTS_TERM or FTS_NUMB or '*'
SQLState: 42000
ErrorCode: 1064
Placing the '*' inside the quoted string returns no results (as expected):
select * from mytable
where match(mycol) against ('+"test-serv*"' in boolean mode)>0.0;
Does anybody know whether this is a limitation of MariaDB? Or a bug?
My MariaDB version is 10.0.31.

WHERE MATCH(mycol) AGAINST('+test +serv*' IN BOOLEAN MODE)
AND mycol LIKE '%test_serv%'
The MATCH will find the desired rows plus some that are not desired. Then the LIKE will filter out the duds. Since the LIKE is being applied to only some rows, its slowness is masked.
(Granted, this does not work in all cases. And it requires some manual manipulation.)
For a phrase such as d'Artagnan, use
WHERE MATCH(mycol) AGAINST("+Arta*" IN BOOLEAN MODE)
AND mycol LIKE '%d\'Artagnan%'
Note that I used the suitable escaping for getting the apostrophe into the LIKE string.
So, the algorithm for your code goes something like:
Break the string into "words" the same way FULLTEXT would.
Toss any strings that are too short.
If no words are left, then you cannot use FULLTEXT and are stuck with a slow LIKE.
Stick * after the last word (or each word?).
Build the AGAINST with those word(s).
Add on AND LIKE '%...%' with the original phrase, suitably escaped.
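The steps above can be sketched in Ruby roughly as follows. This is only an approximation under stated assumptions: fulltext_prefix_search is a hypothetical helper name, the regex split merely imitates how FULLTEXT breaks text into words, and the default minimum word length of 4 is an assumption (the real limit depends on server settings such as ft_min_word_len or innodb_ft_min_token_size). The helper returns two values meant to be passed as bind parameters rather than interpolated into the SQL.

```ruby
# Hypothetical helper; the tokenization and min_word_len default are assumptions.
def fulltext_prefix_search(phrase, min_word_len: 4)
  # Break the phrase into "words" and toss the ones that are too short.
  words = phrase.split(/[^[:alnum:]_]+/).reject { |w| w.length < min_word_len }

  # Escape LIKE wildcards and backslashes so the phrase is matched literally.
  like = "%" + phrase.gsub(/[\\%_]/) { |c| "\\#{c}" } + "%"

  # No usable words left: the caller is stuck with the slow LIKE alone.
  against = words.empty? ? nil : words.map { |w| "+#{w}" }.join(" ") + "*"

  { against: against, like: like }
end

p fulltext_prefix_search("test-serv")
p fulltext_prefix_search("d'Artagnan")
```

The resulting values would be used as `WHERE MATCH(mycol) AGAINST(? IN BOOLEAN MODE) AND mycol LIKE ?`, dropping the MATCH part when `against` is nil. Using bind parameters leaves the quoting of the apostrophe to the database driver.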

Elasticsearch term query with colons

I have a string field "title" (not analyzed) in Elasticsearch. A document has the title "Garfield 2: A Tail Of Two Kitties (2006)".
When I use the following JSON query, no results are returned.
{"query":{"term":{"title":"Garfield 2: A Tail Of Two Kitties (2006)"}}}
I tried to escape the colon character and the braces, like:
{"query":{"term":{"title":"Garfield 2\\: A Tail Of Two Kitties \\(2006\\)"}}}
Still not working.

A term query won't tokenize or apply analyzers to the search text. Instead it looks for an exact match, which won't work here because string fields are analyzed/tokenized by default.
To explain this better:
Let's say there is a string value "I am in summer:camp".
When indexing, it is broken into tokens as below:
"I am in summer:camp" => [ I , am , in , summer , camp ]
Hence even if you do a term search for "I am in summer:camp", it still won't work, as the token "I am in summer:camp" is not present in the index.
Something like a phrase query might work better here.
Or you can set the field's "index" option to "not_analyzed" to make sure the string is not tokenized.
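As an illustration of the phrase-query suggestion, here is a small Ruby sketch that builds a match_phrase request body for the title from the question. It only generates the JSON; how the body is sent to Elasticsearch is left out, and whether it matches depends on how the field is actually mapped.

```ruby
require 'json'

title = "Garfield 2: A Tail Of Two Kitties (2006)"

# match_phrase runs the search text through the field's analyzer,
# so the colon and parentheses need no manual escaping.
query = { "query" => { "match_phrase" => { "title" => title } } }

puts JSON.generate(query)
```

Generating the body with a JSON library also takes care of any JSON-level escaping of the title.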

Space characters inside Oracle's Contains() function

I need to use Oracle 11g's Contains() function to search a field for exact text typed by the user. I was asked not to use the 'like' operator.
According to the Oracle documentation, for everything to work you need to:
Double } characters
Put the whole input between {}
This works in most cases, with a few exceptions. Below is a test case:
create table theme
(name varchar2(300 char) not null);
insert into theme (name)
values ('a');
insert into theme (name)
values ('b');
insert into theme (name)
values ('a or b');
insert into theme (name)
values ('Pdz344_1_b');
create index name_index on theme(name) indextype is ctxsys.context;
If the 'or' operator were interpreted, I would get all four rows, which fortunately is not the case. Now if I run the following, I would expect it to find only 'a or b'.
select * from theme
where contains(name, '{a or b}')>0;
However, I also get 'Pdz344_1_b'. But it contains no 'a', no 'o' and no 'r', so I find it very surprising that this text is matched. Is there something I don't get about contains()'s syntax?

CONTAINS is not like the LIKE operator at all, since it uses the Oracle Text search engine (something like a Google search), not plain string matching.
{} is an escape marker: everything you put inside is treated as text to escape.
Therefore your query looks for the literal text "a or b", not for "a" or "b".
So your query matches Pdz344_1_b because it contains the b token.
The row with only the a character isn't matched because a is in the default stop list.
Why isn't the row with just b matched? Because your match sequence actually looks like a\ or\ b.
So we have 3 tokens: a, _or and _b (underscores represent spaces). a is in the stop list, and there is no string _b in the b row, because it contains only a single character. But we do have this combination in the Pdz344_1_b row, because non-alphabetic characters are treated as whitespace. If you remove the {} or query for {b or a}, then you'll get matches against b as well.
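For reference, the escaping rule described in the question (double every } and wrap the whole input in {}) could be applied on the application side along these lines. This is a Ruby sketch of the rule as the question states it; escape_for_contains is a hypothetical name, and as the answer explains, the escaped text is still tokenized by Oracle Text, so this does not turn CONTAINS into an exact string match.

```ruby
# Hypothetical helper implementing the rule as stated in the question:
# double each closing brace, then wrap the whole input in braces.
def escape_for_contains(input)
  "{" + input.gsub("}", "}}") + "}"
end

puts escape_for_contains("a or b")
puts escape_for_contains("x}y")
```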

ruby extract string between two string

I have a string as below:
str1='"{\"#Network\":{\"command\":\"Connect\",\"data\":
{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'
I want to extract the somename string from the above string. The values of xx:xx:xx:xx:xx:xx, somename and 123456789 can change, but the syntax will remain the same as above.
I saw similar posts on this site but don't know how to use regex in the above case.
Any ideas on how to extract it?

Parse the string as JSON and get the values that way.
require 'json'
str = "{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
json = JSON.parse(str.strip)
name = json["#Network"]["data"]["Name"]
pwd = json["#Network"]["data"]["Pwd"]

Since you don't know regex, let's leave it out for now and try manual parsing, which is a bit easier to understand.
Your original input, without the outer apostrophes and the variable name, is:
"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"
You say that you need to get the 'somename' value and that the 'grammar will not change'. Cool!
First, look at what delimits that value: it is in quotes, with a colon to the left and a comma to the right. However, looking at other parts, the same layout is also used near the command and near the pwd. So colon-quote-data-quote-comma is not enough. Looking further to the sides, there's a \"Name\". It never occurs anywhere else in the input data. This is just great! It means that we can quickly find the whereabouts of the data just by searching for the \"Name\" text:
inputdata = .....
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition
Now we know:
where the part starts,
that after the "Name" text there's always a colon, a quote, and then the interesting data,
and that there's always a quote after the interesting data.
Let's find all of them:
colonquote = inputdata.index(':\"', estposition)
datastart = colonquote+3
lastquote = inputdata.index('\"', datastart)
dataend = lastquote-1
The index method returns the start position of the match, so it returns the position of the : and the position of the \. Since we want the text between them, we must add or subtract a few positions to move past the :\" at the beginning and to step back from the \" at the end.
Then, fetch the data from between them:
value = inputdata[datastart..dataend]
And that's it.
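Put together as one runnable script, the marker search might look like the sketch below. It assumes inputdata holds the text exactly as shown above (a single-quoted Ruby literal, so the backslashes are real characters in the string); the literal is written on one line here for brevity.

```ruby
# The input as a single-quoted literal: every \" is two characters, \ and ".
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# Find the well-known marker first.
estposition = inputdata.index('\"Name\"')
raise "well-known marker was not found in the input" unless estposition

# The value starts right after the :\" that follows the marker (3 characters).
colonquote = inputdata.index(':\"', estposition)
datastart  = colonquote + 3

# The value ends just before the next \" sequence.
lastquote = inputdata.index('\"', datastart)
dataend   = lastquote - 1

value = inputdata[datastart..dataend]
puts value
```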
Now, step back and look at the input data once again. You say that grammar is always the same. The various bits are obviously separated by colons and commas. Let's try using it directly:
parts = inputdata.split(/[:,]/)
=> ["\"{\\\"#Network\\\"",
"{\\\"command\\\"",
"\\\"Connect\\\"",
"\\\"data\\\"",
"\n{\\\"Id\\\"",
"\\\"xx",
"xx",
"xx",
"xx",
"xx",
"xx\\\"",
"\\\"Name\\\"",
"\\\"somename\\\"",
"\\\"Pwd\\\"",
"\\\"123456789\\\"}}}\\0\""]
Please ignore the regex for now. Just assume it says "a colon or a comma". Now, in parts you will get all the, well, parts that were produced by cutting the inputdata into pieces at every colon or comma.
If the layout never changes and is always the same, then your interesting data will always be the 13th element:
almostvalue = parts[12]
=> "\\\"somename\\\""
Now, just strip the spurious characters. Since the grammar is constant, there's 2 chars to be cut from both sides:
value = almostvalue[2..-3]
OK, another way. Since regex has already shown up, let's try using it. We know:
the data is prefixed with \"Name\", then a colon and a backslash-quote
the data consists of some text without quotes inside (well, at least I guess so)
the data ends with a backslash-quote
The parts in regex syntax would be, respectively:
\\"Name\\":\\"
[^\"]*
\\"
together:
inputdata =~ /\\"Name\\":\\"([^\"]*)\\"/
value = $1
Note that I surrounded the interesting part with (), so after a successful match that part is available in the $1 special variable.
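A self-contained variant of the regex approach, as a sketch: it uses String#[] with a capture-group index instead of =~ and $1, and a non-greedy (.*?) in place of the negated character class. Both are stylistic substitutions, not the only way to write it. The input is again assumed to be the single-quoted literal with real backslashes.

```ruby
inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

# In the regex, \\ matches one literal backslash, so \\" matches \".
# String#[] with a regex and 1 returns the first capture group, or nil.
value = inputdata[/\\"Name\\":\\"(.*?)\\"/, 1]

puts value
```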
Yet another way:
If you look at the grammar carefully, it really resembles a set of embedded hashes:
"
{ \"#Network\" :
{ \"command\" : \"Connect\",
\"data\" :
{ \"Id\" : \"xx:xx:xx:xx:xx:xx\",
\"Name\" : \"somename\",
\"Pwd\" : \"123456789\"
}
}
}
\0"
If we'd write something similar as Ruby hashes:
{ "#Network" =>
{ "command" => "Connect",
"data" =>
{ "Id" => "xx:xx:xx:xx:xx:xx",
"Name" => "somename",
"Pwd" => "123456789"
}
}
}
What's the difference? The colons were replaced with =>, and the backslashes before the quotes are gone. Oh, and the opening and closing " are gone, and so is that \0 at the end. Let's play:
tmp = inputdata[1..-4] # remove the opening " and the closing \0"
tmp.gsub!('\"', '"') # replace every \" with just "
Now, what about colons.. We cannot just replace : with =>, because it would damage the internal colons of the xx:xx:xx:xx:xx:xx part.. But, look: all the other colons have always a quote before them!
tmp.gsub!('":', '"=>') # replace every quote-colon with quote-arrow
Now our tmp is:
{"#Network"=>{"command"=>"Connect","data"=>{"Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789"}}}
formatted a little:
{ "#Network"=>
{ "command"=>"Connect",
"data"=>
{ "Id"=>"xx:xx:xx:xx:xx:xx","Name"=>"somename","Pwd"=>"123456789" }
}
}
So, it looks just like a Ruby hash. Let's try 'destringizing' it:
packeddata = eval(tmp) # eval runs arbitrary Ruby code, so only use it on input you trust
value = packeddata['#Network']['data']['Name']
Done.
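If eval feels too risky, note that once the wrapper is removed and the backslashes are stripped, the text is already valid JSON, so the colon-to-arrow step can be skipped and the result handed to a JSON parser instead. A sketch under the same assumptions about the input (one leading " and a trailing \0" around the payload):

```ruby
require 'json'

inputdata = '"{\"#Network\":{\"command\":\"Connect\",\"data\":{\"Id\":\"xx:xx:xx:xx:xx:xx\",\"Name\":\"somename\",\"Pwd\":\"123456789\"}}}\0"'

tmp = inputdata[1..-4]       # drop the opening " and the closing \0"
tmp = tmp.gsub('\"', '"')    # replace every \" with just "

packeddata = JSON.parse(tmp) # parses data only, never executes code
value = packeddata['#Network']['data']['Name']
puts value
```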
Well, this has grown a bit and Jonas was obviously faster, so I'll leave the JSON part to him since he wrote it already ;) The data looked so similar to a Ruby hash because it is obviously formatted as JSON, which is a hash-like structure too. Using the proper format-reading tools is usually the best idea, but mind that a JSON library, when asked to read the data, will read all of it, and only then can you ask it "what was at the key xx/yy/zz", just like I showed you with the read-it-as-a-Hash attempt. Sometimes, when your program is very tight on time, you cannot afford to read it all. Then scanning with a regex or scanning manually for "known markers" may (not must) be much faster and thus preferable. But, still, much less convenient. Have fun.

ruby regex not working to remove class name from sql

I have:
BEFORE Gsub sql ::::
SELECT record_type.* FROM record_type WHERE (name = 'Registrars')
sql = sql.gsub(/SELECT\s+[^\(][A-Z]+\./mi,"SELECT ")
AFTER GSUB SQL ::::
SELECT record_type.* FROM record_type WHERE (name = 'Registrars')
The desired result is to remove the "record_type." from the statement:
So it should be :
SELECT * FROM record_type WHERE (name = 'Registrars')
After the regex is run.
I didn't write this; it's in the asf-soap-adaptor gem. Can someone tell me why it doesn't work, and how to fix it?
I suppose it should be written like this...
sql = sql.gsub(/SELECT\s+[^\(][A-Z_]+\./mi,"SELECT ")
... as the code in the question won't match if the table name contains an _ (underscore) character. I suppose that's why this code is in the gem: it works under some conditions (i.e., with underscore-free names).
Still, I admit I don't understand exactly why this replacement should be done, and whether it should include a 0-9 check as well: a name like 'record_id1' still won't be matched (and replaced) by the character class in the regular expression. You may have to either expand it, like [0-9A-Z_], or just replace it completely with \w.
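To make the difference concrete, here is a short Ruby comparison of the gem's pattern and the \w variant on the statement from the question. It is only a demonstration of the two regular expressions, not a patch for the gem.

```ruby
sql = "SELECT record_type.* FROM record_type WHERE (name = 'Registrars')"

# Pattern from the question: [A-Z]+ stops at the underscore, so nothing matches.
original = sql.gsub(/SELECT\s+[^\(][A-Z]+\./mi, "SELECT ")

# \w covers letters, digits and the underscore, so the prefix is removed.
widened = sql.gsub(/SELECT\s+[^\(]\w+\./mi, "SELECT ")

puts original
puts widened
```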
So your before and after gsubs are the same? I can't tell you why it doesn't work if you don't tell me your expected result. Also, for help with interpreting Ruby regular expressions, check out rubular.com.
