How to make full text search with mongodb spring data? - spring

Before asking this question, I searched a lot about my problem. I need to do full text search against MongoDB from the Spring framework. Up to now I have only tried something with regex, but it does not cover my requirement. For example, given the search string 'increased world population', my search algorithm should return documents that match the whole search string well, as well as documents that include at least one word from it. I know Lucene does full text search, but I don't know how to implement it with my MongoDB Spring Data setup, and I don't know whether Spring Data already offers full text search. I need a tutorial that explains this.
What I have done up to now:
Criteria textCriteri = Criteria.where("title").regex(searchStr.trim().replaceAll(" +", " "), "i");
Query query = new Query(locationCriteria).addCriteria(textCriteri).limit(Consts.MONGO_QUERY_LIMIT);
List<MyObject> advs = mongoTemplate.find(query, MyObject.class);

You can create a 'text' index in MongoDB and search through that, see http://docs.mongodb.org/manual/core/index-text/
Spring Data MongoDB can query such an index through its TextCriteria and TextQuery classes.
Depending on your search queries, you may want to use a more powerful search engine such as Elasticsearch (which is built on Lucene, which you mentioned).

Related

Changing the token length of Spring Boot Elasticsearch pattern matching size

I am using Spring Boot and Elasticsearch and I am trying to use three character searches but the searches only match on five characters or more.
If I have a user name of 'Bob Smith' I can find the match searching for 'Smith' but searching for 'Bob' does not find a match.
I suspect this is something that needs to be changed in my class SearchMappingConfig, which implements HibernateOrmSearchMappingConfigurer, but I can't find any information about changing the size of the tokens needed to successfully match a result.
My @Entity classes have @FullTextField(analyzer = "english") annotations on the fields I want included in the token searches.
How do I change the length of the search match?
Ideally I would like any three letters to form a match, so a search for 'Ron' would match 'Ronald' and 'Laronda'
Elasticsearch 7.14
Spring Boot 2.7.6
I have been reading Spring Boot and Elasticsearch documentation but cannot find any information about changing the match length.
Hibernate Search can use either an Elasticsearch or a Lucene backend. Our existing project uses Lucene and it would have been a large undertaking to replace that.
The recommended solution is to create new analyzers so that incoming data is split into smaller tokens, but I didn't want to change analyzers on my existing database.
A lot of the documentation I was able to find pointed to using the Elasticsearch query builder or the Hibernate Search 5 way of doing a wildcard query.
I tested our Elasticsearch and found that the wildcard solution would work.
I ended up using the Hibernate Search 6 wildcard predicate and it works well.
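SearchResult<DataClass> result = searchSession.search(DataClass.class)
        .where(f -> f.wildcard()
                .fields(
                        "firstname",
                        "lastname",
                        "username",
                        "currentLegalName")
                // Lowercase the input because the analyzer lowercases the indexed tokens.
                // The leading and trailing * match the text anywhere inside a token.
                .matching("*" + searchText.toLowerCase() + "*"))
        .fetch(10); // return at most 10 hits
long totalHitCount = result.total().hitCount();
logger.debug("Search results size {}", totalHitCount);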

Multilingual search in elastic search

I want to search for "Salt" in my app (using elastic search). I want to search in my native language. So when I search "namak", I should get the result of all products related to "salt".
The easiest way I can think of would be to add a field to your Elasticsearch documents that contains the same words as the English ones, but in your native language. The simplest way to do that is probably to transform the data before it is loaded into Elasticsearch.
For example, using Python and pandas, and assuming you have a dictionary named dic that maps English words to your native language's words (e.g. {"Salt": "namak"}), you'd simply do:
df["other_language"] = df["english"].replace(dic)
Then load df into Elasticsearch as you normally would. Note that replace with a dictionary substitutes whole cell values; if the English word appears inside longer text, pass regex=True so that substrings are replaced as well.
In Elasticsearch, every document that has Salt in the english field now also has namak in the field other_language, so searching for namak returns exactly the same documents as searching for Salt would.

faster search for a substring through large document

I have a CSV file of more than 1M records written in English plus another language. I have to make a UI that takes a keyword, searches through the document, and returns the records where that keyword appears. I look for the keyword in two columns only.
Here is how I implemented it:
First, I created a Postgres database for the data stored in the CSV file. Then I made a classic website where the user can enter a keyword. This is the SQL query that I use (in Spring Boot):
SELECT * FROM table WHERE col1 LIKE %:keyword% OR col2 LIKE %:keyword%;
Right now, it is working perfectly fine, but I was wondering how to make the search faster. Was using SQL instead of a classic document search the better choice?
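If the document is only searched once and then thrown away, loading it into a database is unnecessary overhead. Instead, you can search the file directly with Files.lines from java.nio.file and a parallel stream, which uses multiple threads to search the file concurrently:
List<Record> result;
// Files.lines takes a Path (not a String) and throws IOException;
// try-with-resources closes the underlying file when the search is done.
try (Stream<String> lines = Files.lines(Paths.get("some/path"))) {
    result = lines
            .parallel()
            .unordered()
            .map(l -> lineToRecord(l))
            .filter(r -> r.getCol1().contains(keyword) || r.getCol2().contains(keyword))
            .collect(Collectors.toList());
}
NOTE: you need to provide the lineToRecord() method and the Record class, and import Files and Paths from java.nio.file, Stream and Collectors from java.util.stream, and List from java.util.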
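The Record class and lineToRecord() method mentioned in the note are not shown in the answer. A minimal sketch of what they could look like follows, assuming a comma-separated file without quoted fields and with the two searched columns in positions 0 and 1. The class name CsvSearch, the search() helper and the column positions are assumptions made for illustration, not part of the original answer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CsvSearch {

    /** Holds the two searched columns of one CSV line. */
    static final class Record {
        private final String col1;
        private final String col2;

        Record(String col1, String col2) {
            this.col1 = col1;
            this.col2 = col2;
        }

        String getCol1() { return col1; }
        String getCol2() { return col2; }
    }

    /** Naive parser (assumption): splits on commas and does not handle quoted fields. */
    static Record lineToRecord(String line) {
        String[] parts = line.split(",", -1);
        String col1 = parts.length > 0 ? parts[0] : "";
        String col2 = parts.length > 1 ? parts[1] : "";
        return new Record(col1, col2);
    }

    /** Returns every record whose col1 or col2 contains the keyword (case-sensitive). */
    static List<Record> search(Path file, String keyword) throws IOException {
        // try-with-resources closes the file once the stream has been consumed
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                    .parallel()
                    .unordered()
                    .map(CsvSearch::lineToRecord)
                    .filter(r -> r.getCol1().contains(keyword) || r.getCol2().contains(keyword))
                    .collect(Collectors.toList());
        }
    }
}
```

A real CSV with quoted or escaped commas would need a proper CSV parser instead of split(). Because the stream is unordered and parallel, the order of the returned records is not guaranteed.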
If the document is going to be searched over and over again, then you can think about indexing the document. This means pre-processing the document to suit the search requirements, which in this case are keyword lookups on col1 and col2. An index is like a map in Java, e.g.:
Map<String, Record> col1Index
But since you have "LIKE" semantics, this is not so easy: splitting the string on whitespace is not enough, because the keyword could match a substring of a word. So in this case it might be best to look for a tool to help. Typically this would be something like Solr/Lucene.
Databases can also provide similar functionality eg: https://www.postgresql.org/docs/current/pgtrgm.html
For LIKE queries, you should look at a GIN index from the pg_trgm extension, using the gin_trgm_ops operator class. You shouldn't need to change the query at all; just build an index on each column, or possibly one multi-column index.
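To illustrate why a trigram index can speed up a LIKE '%keyword%' search, here is a small in-memory sketch of the idea in Java. It is not how pg_trgm is implemented (pg_trgm also pads and normalizes words, which this sketch does not), and the class and method names are hypothetical. Every row is indexed under each of its three-character substrings. A search intersects the row sets of the keyword's trigrams and then rechecks each candidate with a real substring test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TrigramIndex {
    private final List<String> rows = new ArrayList<>();
    // trigram -> ids of the rows that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** All distinct three-character substrings of s (empty if s is shorter than 3). */
    static Set<String> trigrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    void add(String text) {
        int id = rows.size();
        rows.add(text);
        for (String gram : trigrams(text)) {
            postings.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    /** Case-sensitive substring search, equivalent to LIKE '%keyword%'. */
    List<String> search(String keyword) {
        Set<Integer> candidates = null;
        for (String gram : trigrams(keyword)) {
            Set<Integer> ids = postings.getOrDefault(gram, Collections.emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(ids);
            } else {
                candidates.retainAll(ids); // a match must contain every trigram
            }
        }
        List<String> hits = new ArrayList<>();
        if (candidates == null) {
            // Keyword shorter than 3 characters: the index cannot help, scan all rows.
            for (String row : rows) {
                if (row.contains(keyword)) hits.add(row);
            }
            return hits;
        }
        for (int id : candidates) {
            String row = rows.get(id);
            if (row.contains(keyword)) hits.add(row); // recheck: trigrams may be non-adjacent
        }
        return hits;
    }
}
```

The fallback branch shows the main limitation of the approach: a keyword with fewer than three characters has no trigrams, so the search has to scan every row.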

ElasticSearch: how to perform search across multiple fields using Java High Level REST Client?

I am looking at this tutorial and it describes how ES search could be executed against an index, but the search is done only using one field of each document:
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(QueryBuilders.termQuery("user", "kimchy"));
I would like to perform my search against multiple fields: like user name, display name, email etc.
Should I use Multi-Search API to achieve it?
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-multi-search.html
MultiMatchQueryBuilder makes this possible; thanks to Abhay for helping me find the proper solution.
https://snapshots.elastic.co/javadoc/org/elasticsearch/elasticsearch/6.4.0-SNAPSHOT/org/elasticsearch/index/query/MultiMatchQueryBuilder.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html

How to treat each keyword as a prefix in Elastic Search

I am used to FTS techs doing this for me but imagine I input mongo fac into Elastic Search.
I expect it to be able to find mongo factory or mongodb something equally, however, it does not.
Assume I have a single field called title. I have three documents with the titles:
Mongo Factory
Mongodb something
cheese
I have a single boolean should clause with:
array('prefix' => array(
'title' => 'mongo fac'
)),
Using default analyzers, no special configuration, mongo factory will be found but not mongodb something.
What I want is for mongodb something to appear in the results as well; basically, I want Elastic Search to tokenize the keywords, so that as well as searching for mongo fac it also searches for mongo and for fac.
Apart from tokenizing the input myself, what else can I do to get Elastic Search to work the way I want, preferably using its own tokenizer to tokenize my keywords?
For reference to others who come across this question: I didn't find a valid solution in the end, so I just wrote a function to tokenise the words myself and form separate prefix queries, and it works as it should.
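The function itself is not shown. A minimal sketch of the approach in Java follows, assuming whitespace tokenisation and lowercasing of the input (prefix queries are not analysed, while the default analyzer lowercases the indexed tokens). The class and method names are hypothetical. It builds the query body as nested maps, which a JSON library could then serialise and send to the search endpoint.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PrefixQueries {

    /** Splits the raw input on whitespace and lowercases each token. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String t : input.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds {"bool": {"should": [{"prefix": {field: token}}, ...]}} with one clause per token. */
    static Map<String, Object> shouldPrefixQuery(String field, String input) {
        List<Object> should = new ArrayList<>();
        for (String token : tokenize(input)) {
            Map<String, Object> fieldClause = new LinkedHashMap<>();
            fieldClause.put(field, token);
            Map<String, Object> prefix = new LinkedHashMap<>();
            prefix.put("prefix", fieldClause);
            should.add(prefix);
        }
        Map<String, Object> bool = new LinkedHashMap<>();
        bool.put("should", should);
        Map<String, Object> query = new LinkedHashMap<>();
        query.put("bool", bool);
        return query;
    }
}
```

For the input mongo fac on the field title, this yields two should clauses, one with the prefix mongo and one with the prefix fac. Mongodb something then matches through the first clause, and Mongo Factory matches through both.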
