Spring JPA find entity where property contains List

For example:
This is the name property of an object I want to search for:
Lorem ipsum dolor sit amet, consectetur adipiscing elit
When I fill in "lo adip tetur" (lo = Lorem, adip = adipiscing, tetur = consectetur), I want to be able to find this object.
I tried splitting the search string on spaces and passing the result to a JPA query method, but I did not get any results.
List<Obj> findAllByNameIgnoreCaseContaining(String[] splittedName);
What would be the correct way to solve this problem? Thanks!

A regex query will allow you to specify this type of complex text search criteria.
Regular expressions are supported by many databases, and can be supplied when using a native query.
An example for Postgres could look like this:
@Query(nativeQuery = true, value =
    "SELECT * FROM my_entity WHERE text_column ~ :contentRegex")
List<MyEntity> findByContentRegex(@Param("contentRegex") String contentRegex);
To match the first two characters of the first three words, you could for example pass a regex like this one:
var result = repository.findByContentRegex("^Lo\\S*\\sip\\S*\\sdo.*");
(a string starting with Lo, followed by an arbitrary number of non-whitespace characters, a whitespace character, ip, an arbitrary number of non-whitespace characters, a whitespace character, do, and an arbitrary number of arbitrary characters)
Of course you can dynamically assemble the regex, e.g. by concatenating user-supplied search term fragments:
List<String> searchTerms = List.of("Lo", "ip", "do"); // e.g. from http request url params
String regex = "^" + String.join("\\S*\\s", searchTerms) + ".*";
var result = repository.findByContentRegex(regex);
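Since those fragments come straight from the request, it may be worth whitelisting characters before splicing them into the pattern, so user input cannot inject regex metacharacters into the native query. A minimal sketch (buildPrefixRegex is a hypothetical helper, not part of the answer above):

import java.util.List;
import java.util.stream.Collectors;

static String buildPrefixRegex(List<String> searchTerms) {
    return "^" + searchTerms.stream()
        .map(t -> t.replaceAll("[^\\p{Alnum}]", "")) // keep letters/digits only
        .collect(Collectors.joining("\\S*\\s")) + ".*";
}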
See e.g. https://regex101.com/ for an interactive playground.
Note that complex expressions may cause the query to become expensive, so at some point you may want to consider more advanced approaches, e.g. full text search, which can make use of special indexing: https://www.postgresql.org/docs/current/textsearch-intro.html
Also note that setting a query timeout is recommended when the parameters come from a potentially hostile source.
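One way to bound execution time in Spring Data JPA is the standard JPA timeout hint; a sketch (the 500 ms value is illustrative, and pre-Jakarta setups use the older javax.persistence.query.timeout name):

import jakarta.persistence.QueryHint;
import org.springframework.data.jpa.repository.QueryHints;

@QueryHints(@QueryHint(name = "jakarta.persistence.query.timeout", value = "500")) // milliseconds
@Query(nativeQuery = true, value =
    "SELECT * FROM my_entity WHERE text_column ~ :contentRegex")
List<MyEntity> findByContentRegex(@Param("contentRegex") String contentRegex);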
Apart from that, there are also more specialized search servers e.g. like apache-solr.

Related

Is there a way to remove ALL special characters using Lucene filters?

Standard Analyzer removes special characters, but not all of them (e.g. '-'). I want to index my string with only alphanumeric characters, but still refer back to the original document.
Example: 'doc-size type' should be indexed as 'docsize' and 'type', and both should point to the original document: 'doc-size type'
It depends what you mean by "special characters", and what other requirements you may have. But the following may give you what you need, or point you in the right direction.
The following examples all assume Lucene version 8.4.1.
Basic Example
Starting with the very specific example you gave, where doc-size type should be indexed as docsize and type, here is a custom analyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

import java.util.regex.Pattern;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new WhitespaceTokenizer();
        TokenStream tokenStream = source;
        Pattern p = Pattern.compile("\\-");
        boolean replaceAll = true;
        tokenStream = new PatternReplaceFilter(tokenStream, p, "", replaceAll);
        return new TokenStreamComponents(source, tokenStream);
    }
}
This splits on whitespace, and then removes hyphens, using a PatternReplaceFilter. It works as shown below (I use 「 and 」 as delimiters to show where whitespaces may be part of the inputs/outputs):
Input text:
「doc-size type」
Output tokens:
「docsize」
「type」
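If you want to verify the tokens yourself, here is a minimal way to run the analyzer and print what it emits (a sketch, assuming the MyAnalyzer class above; the field name "myField" is arbitrary):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

try (TokenStream ts = new MyAnalyzer().tokenStream("myField", "doc-size type")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset(); // required before the first incrementToken()
    while (ts.incrementToken()) {
        System.out.println("「" + term + "」");
    }
    ts.end();
}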
NOTE - this will remove all hyphens which are standard keyboard hyphens - but not things such as em-dashes, en-dashes, and so on. It will remove these standard hyphens regardless of where they appear in the text (word starts, word ends, on their own, etc).
A Set of Punctuation Marks
You can change the pattern to cover more punctuation, as needed - for example:
Pattern p = Pattern.compile("[$^-]");
This does the following:
Input text:
「doc-size type $foo^bar」
Output tokens:
「docsize」
「type」
「foobar」
Everything Which Is Not a Letter or Digit
You can use the following to remove everything which is not a letter or digit:
Pattern p = Pattern.compile("[^A-Za-z0-9]");
This does the following:
Input text:
「doc-size 123 %^&*{} type $foo^bar」
Output tokens:
「docsize」
「123」
「」
「type」
「foobar」
Note that this leaves one empty string among the resulting tokens, because %^&*{} was stripped entirely.
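If the empty tokens are unwanted, one option (my suggestion, not part of the question) is to append Lucene's LengthFilter to the chain inside createComponents:

import org.apache.lucene.analysis.miscellaneous.LengthFilter;

// Keep only tokens of at least 1 character, dropping the empty strings
// left behind when a token consisted entirely of stripped characters:
tokenStream = new LengthFilter(tokenStream, 1, Integer.MAX_VALUE);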
WARNING: Whether the above will work for you depends very much on your specific, detailed requirements. For example, you may need to perform extra transformations to handle upper/lowercase differences - i.e. the usual things which typically need to be considered when indexing text.
Note on the Standard Analyzer
The StandardAnalyzer actually does remove hyphens in words (with some obscure exceptions). In your question you mentioned that it does not remove them. The standard analyzer uses the standard tokenizer, and the standard tokenizer implements the word-break rules from the Unicode Text Segmentation algorithm (UAX #29), which has a section discussing how hyphens in words are handled.
So, the Standard analyzer will do this:
Input text:
「doc-size type」
Output tokens:
「doc」
「size」
「type」
That should work with searches for doc as well as doc-size - it's just a question of whether it works well enough for your needs.
I understand that may not be what you want. But if you can avoid needing to build a custom analyzer, life will probably be much simpler.

How to Generate Random String using Laravel Faker?

Is there any way or method to generate a fake string using Laravel Faker?
Like in Laravel we can generate a random string of up to 20 chars:
str_random(20);
Faker offers a couple of methods that let you replace placeholders in a given string with random characters:
lexify - takes a given string and replaces ? with random letters
asciify - takes a given string and replaces * with random ASCII characters
numerify - takes a given string and replaces # with random digits
bothify - combines lexify and numerify
You could try one of them, depending on the requirements you have for the random string you need. asciify uses the largest set of characters as replacement, so using that one makes the most sense.
The following will give you a random string of 20 ascii characters:
$faker->asciify('********************')
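For completeness, the other placeholder methods work the same way; a quick sketch (the outputs shown are illustrative, since the results are random):

$faker = Faker\Factory::create();

echo $faker->lexify('????');        // e.g. "dkwq" - four random letters
echo $faker->numerify('user-####'); // e.g. "user-4928" - four random digits
echo $faker->bothify('?#?#');       // e.g. "a1b2" - mixed letters and digits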
An alternative for generating a string without special characters:
$faker->regexify('[A-Za-z0-9]{20}')
$faker->text($maxNbChars = 50);
// up to 50 chars: "Aut quo omnis placeat eos omnis eos."
$faker->text(10);
// up to 10 chars: "Labore."
All texts seem to be one or more Latin pseudo-sentences, with a dot at the end of each sentence.
Use Faker\Provider\en_US\Text:
$faker->realText($maxNbChars = 200, $indexSize = 2);
// "And yet I wish you could manage it?) 'And what are they made of?' Alice asked in a shrill, passionate voice. 'Would YOU like cats if you were never even spoke to Time!' 'Perhaps not,' Alice replied."

Regexp hangs when input string contains brackets

I have:
vv = /added:\s{0,}\d{1,2}\/\d{1,2}\/\d{4}|terminated:\s{0,}\d{1,2}\/\d{1,2}\/\d{4}|(?-mix:\((\w+([\p{P}\s]{,3}\w*)*)\))/i
Below is my experiment (this call hangs):
detail = "(value containts lorem ipsum lorum ipsum"
detail =~ vv
When I try without the bracket at the start of the input string, it returns immediately (with no match):
detail = "value containts lorem ipsum lorum ipsum"
detail =~ vv
# => nil
The problem you experience is catastrophic backtracking. Your \w+([\p{P}\s]{,3}\w*)* causes the issue, as the ([\p{P}\s]{,3}\w*)* contains a nested zero-or-more quantifier *. The problem arises because the parts inside are both optional (= can match empty strings) and quantified. See your regex demo: try adding one more symbol and watch the step count grow - adding a space after (value containt doubles the number of steps from 65,742 to 102,610, and adding one more symbol crashes the demo.
Replacing it with \w+(?:[\p{P}\s]{1,3}\w+)*, or even \w+(?:\W{1,3}\w+)* should fix the issue as the subpatterns inside the grouping (...) construct will no longer be matching empty strings (but the whole group will be optional, zero or more repetitions). [\p{P}\s]{1,3} requires at least 1 punctuation or whitespace and \w+ requires one or more word characters.
Also note that you do not need the (?-mix:...) group; I removed it from my suggested pattern: there is no . inside (no need for m), no literal letters whose case matters beyond the outer /i (no need for the inner i), and no whitespace to ignore in the pattern (no need for x). Also, the {0,} quantifier is equal to *, so I replaced its two occurrences at the beginning.
Use
vv = /added:\s*\d{1,2}\/\d{1,2}\/\d{4}|terminated:\s*\d{1,2}\/\d{1,2}\/\d{4}|\((\w+(?:[\p{P}\s]{1,3}\w+)*)\)/i
detail = "(value containts lorem ipsum lorum ipsum"
detail =~ vv
See Ruby demo
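As a separate safety net (not a substitute for fixing the pattern), Ruby 3.2+ also lets you cap regex matching time globally:

# A runaway match now raises Regexp::TimeoutError instead of hanging.
Regexp.timeout = 1.0 # seconds

detail = "(value containts lorem ipsum lorum ipsum"
detail =~ vv # raises Regexp::TimeoutError if matching exceeds the cap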

How to read a CSV using LINQ when some columns contain commas

I have a CSV like the row below. "India,Inc" is a company name, a single value which contains a comma in it.
How do I get the values using LINQ?
12321,32432,423423,Kevin O'Brien,"India,Inc",234235,23523452,235235
Assuming that you will always have the columns that you specify, and that the only variable part is that the company name can have commas inside it, this UGLY code can help you achieve your goal.
var file = File.ReadLines("test.csv");
var value = from p in file
            let parts = p.Split(',')
            select new string[]
            {
                parts[0],
                parts[1],
                parts[2],
                parts[3],
                // 8 columns in total: any parts beyond 8 belong to the
                // company name, so stitch them back together
                parts.Length == 8 ? parts[4] :
                    (parts.Length > 8
                        ? String.Join(",", parts.Skip(4).Take(parts.Length - 7).ToArray())
                        : ""),
                parts[parts.Length - 3],
                parts[parts.Length - 2],
                parts[parts.Length - 1]
            };
A regular expression would also work - a bit nasty due to the recursive nature of the problem, but it does achieve your goal.
List<string> matches = new List<string>();
string subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,23523452,235235";
Regex regexObj = new Regex(@"(?<="")\b[0-9a-z,']+\b(?="")|[0-9a-z']+", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
    matches.Add(matchResults.Value);
    // matched text:  matchResults.Value
    // match start:   matchResults.Index
    // match length:  matchResults.Length
    matchResults = matchResults.NextMatch();
}
This should suffice in most cases. It handles quoted strings, strings with double quotes within them, and embedded commas.
var subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,\"Test End\"\"\",\"\"\"Test Start\",\"Test\"\"Middle\",23523452,235235";
var result = Regex.Split(subjectString, @",(?=(?:[^""]*""[^""]*"")*[^""]*$)")
                  .Select(x => x.StartsWith("\"") && x.EndsWith("\"") ? x.Substring(1, x.Length - 2) : x)
                  .Select(x => x.Replace("\"\"", "\""));
It does, however, break if you have a field with a single double quote inside it and the string itself is not enclosed in double quotes - this is invalid in most definitions of a CSV file, where any field that contains CR, LF, a comma, or a double quote must be enclosed in double quotes.
You should be able to reuse the same regex to break small CSV files into lines as well. For larger files you would want a better implementation: replace the quotes with LF and remove the matching (unquoted) LFs, then apply the regular expression again replacing the quotes with CR, and split on the matches.
Another option is to use CsvHelper and not try to reinvent the wheel:
var csv = new CsvHelper.CsvReader(new StreamReader("test.csv"));
while (csv.Read())
{
    Console.WriteLine(csv.GetField<int>(0));
    Console.WriteLine(csv.GetField<string>(1));
    Console.WriteLine(csv.GetField<string>(2));
    Console.WriteLine(csv.GetField<string>(3));
    Console.WriteLine(csv.GetField<string>(4));
}
I would recommend LINQ to CSV, because it is powerful enough to handle special characters including commas, quotes, and decimals. They have really worked a lot of these issues out for you.
It only takes a few minutes to set up and it is really worth the time because you won't run into these types of issues down the road like you would with custom code. Here are the basic steps, but definitely follow the instructions in the link above.
Install the Nuget package
Create a class to represent a line item (name the fields the way they're named in the csv)
Use CsvContext.Read() to read into an IEnumerable which you can easily manipulate with LINQ
Use CsvContext.Write() to write a List or IEnumerable to a CSV
This is very easy to set up, has very little code, and is much more scalable than doing it yourself.
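A sketch of what those steps could look like for the sample row (the class and property names are my own; the file is assumed to have no header row):

using LINQtoCSV;
using System.Linq;

// Hypothetical record type for the 8-column sample row. Without a header
// row, fields are read in FieldIndex order, so every column gets an entry.
class CompanyRow
{
    [CsvColumn(FieldIndex = 1)] public long Id { get; set; }
    [CsvColumn(FieldIndex = 2)] public long Code1 { get; set; }
    [CsvColumn(FieldIndex = 3)] public long Code2 { get; set; }
    [CsvColumn(FieldIndex = 4)] public string ContactName { get; set; }
    [CsvColumn(FieldIndex = 5)] public string CompanyName { get; set; }
    [CsvColumn(FieldIndex = 6)] public long Code3 { get; set; }
    [CsvColumn(FieldIndex = 7)] public long Code4 { get; set; }
    [CsvColumn(FieldIndex = 8)] public long Code5 { get; set; }
}

var description = new CsvFileDescription
{
    FirstLineHasColumnNames = false, // the sample data has no header
    SeparatorChar = ','
};

// Quoted fields such as "India,Inc" are parsed as a single value.
var rows = new CsvContext().Read<CompanyRow>("test.csv", description);
var companies = rows.Select(r => r.CompanyName).ToList();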
Because you're only reading values delimited by commas, the spaces shouldn't cause an issue if you just treat them like any other character.
var values = File.ReadLines(path)
    .SelectMany(line => line.Split(','));

What is the most efficient way to search a blob of text for an array of regular expressions?

I'm looking for the most efficient way to search a blob of text (± 1/2KB) for many regular expressions stored in an array.
Example code:
patterns = [/patternA/i,/patternB/i,/patternC/m,...,/patternN/i]
content = "Lorem ipsum dolor sit amet, consectetur... officiam id est laborum."
r = patterns.collect { |pattern|
  pattern unless (content =~ pattern).blank?
}.compact
Where r now contains patterns that matched the content string.
If you are only interested in whether any of the patterns match the text, then consider combining all patterns into a single big regex, using the regex 'or' operator, and compiling that giant regex once.
For instance, if your patterns are: A, B, C, create a single regex of the form A|B|C
Sorry, I don't know Ruby, but hopefully you can turn that into code (:
Side Note: This is how Mercurial's .hgignore files are handled last I looked. In that case there are 1000s of filenames that get thrown at the one big regex, which is more efficient than those filenames getting thrown at each of hundreds of smaller regexes.
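In Ruby, the combined pattern could be built with Regexp.union; a sketch (match? needs Ruby 2.4+, and union preserves each pattern's own flags):

# Combine all patterns into one alternation, compiled once.
big_regex = Regexp.union(patterns)

# A single pass over the text answers "does anything match?".
any_hit = content.match?(big_regex)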
Solution 1
Do this:
r = patterns.select{|pattern| content =~ pattern}
Since the string is huge, it is better to implement this method on String rather than on something else, because passing a large argument seems to be slow.
class String
  def filter_patterns(patterns)
    patterns.select { |pattern| self =~ pattern }
  end
end
and use it like:
content.filter_patterns(patterns)
Solution 2
It has the restriction that the individual regexes must not include named/numbered captures.
combined_regex = Regexp.new(patterns.map { |r| "(?=[\\s\\S]*(#{r.source}))?" }.join)
content =~ combined_regex
The following part will have a problem if a regex inside patterns includes a named/numbered capture. If there were a way to know how many potential captures each regex has, that would solve the problem.
r = patterns.select.with_index(1) { |pattern, i| Regexp.last_match[i] }
Addition
Given:
dogs = {
'saluki' => 'Hounds',
'russian wolfhound' => 'Hounds',
'italian greyhound' => 'Hounds',
..
}
content = "Running in the fields at great speeds, the sleek saluki dog comes from..."
you can do this:
combined_regex =
  Regexp.new(dogs.keys.map { |w| "(?=[\\s\\S]*(#{w}))?" }.join, Regexp::IGNORECASE)
content =~ combined_regex
r = dogs.keys.select.with_index(1) { |key, i| Regexp.last_match[i] }
"This article talks about #{r.collect { |x| dogs[x] }.to_sentence}."
=> "This article talks about Hounds."
To avoid outputs like "This article talks about Hounds, Hounds and Hounds.", you might want to add uniq:
"This article talks about #{r.uniq.collect{|x| dogs[x]}.to_sentence}."
How about:
text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor magna'
targets = [ /(am?et)/, /(ips.m)/, /(elit)/, /(magna)/, /([Ll]or[eu]m)/ ]
regex = Regexp.union(targets)
hits = []
text.scan(regex) { |a| hits += a.each_with_index.to_a }
r = hits.select { |w, i| w }.map { |w, i| targets[i] } # => [/([Ll]or[eu]m)/, /(ips.m)/, /(am?et)/, /(elit)/, /(magna)/]
This works to return the matched patterns in the order that the words were found in the text.
There's probably a way to do it using named-captures too.
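One way that could look (a sketch; the p0, p1, ... group names are my own invention, and in Ruby plain (...) groups stop capturing once a regex contains named groups, so the captures inside targets don't interfere):

# Wrap each pattern in a uniquely named group and scan once.
union = Regexp.new(
  targets.each_with_index.map { |t, i| "(?<p#{i}>#{t.source})" }.join('|')
)

matched = []
text.scan(union) do
  m = Regexp.last_match
  targets.each_index { |i| matched << targets[i] if m["p#{i}"] }
end
matched.uniq # patterns ordered by where their first hit appears in the text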
What you want is exactly what a lexer has been designed to do: pick out a set of regular expressions from an input stream with only a single pass over the input.
Unfortunately I haven't been able to find a good lexer gem for Ruby which lets you define your own lexer. I'll update the answer if I find anything.
