How to perform a Join query on Elastic Search via springboot/java? - spring-boot

I have a Spring Boot application that interacts with Elasticsearch (or, as it is known now, OpenSearch). It can perform basic operations such as search and index. I used this as my base (although I replaced the high-level client, since it is deprecated), and to perform queries I mostly use the @Query annotation (as described in section 2.2 here), although I have also used QueryBuilders.
Now, I have an interesting use case: I would like to perform 2 queries at the same time. The first query would find a file in Elasticsearch that contains 3 ids. These 3 ids are the ids of other files in the same Elasticsearch index. The second query would look up these 3 files and return them to me. I can easily do this in 2 steps:
1. Have a query to find the file containing the 3 ids and return it.
2. Have a second query (a multisearch query can do bulk search, as I understand) to search for the 3 files using the info from the first query.
However, I need both steps to happen within the same query: a single query that searches for the file containing the 3 ids and then searches for these 3 files.
So currently my files in elastic search look like so:
{
  "docId": "docId57",
  "relatedDocs": [
    {
      "relatedId": "docId1",
      "type": "apple"
    },
    {
      "relatedId": "docId2",
      "type": "orange"
    },
    {
      "relatedId": "docId3",
      "type": "banana"
    }
  ]
}
and my goal is to have a query that accepts docId57 as an argument (so a method such as findFilesViaJoin(docId57)) and returns a list of 3 files: the file for docId1, the file for docId2 and the file for docId3.
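For reference, the two-step version that already works can be sketched as below. This is only an illustration under stated assumptions: the class TwoStepJoinSketch, its Doc type and the in-memory map are hypothetical stand-ins for the real index and repository calls, and only the method name findFilesViaJoin comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStepJoinSketch {

    // Minimal document shape: an id plus the relatedDocs[].relatedId values.
    static final class Doc {
        final String docId;
        final List<String> relatedIds;

        Doc(String docId, String... relatedIds) {
            this.docId = docId;
            this.relatedIds = Arrays.asList(relatedIds);
        }
    }

    // Hypothetical stand-in for the index; a real app would call its repository or client.
    private final Map<String, Doc> index = new HashMap<>();

    void save(Doc doc) {
        index.put(doc.docId, doc);
    }

    List<Doc> findFilesViaJoin(String docId) {
        List<Doc> result = new ArrayList<>();
        Doc parent = index.get(docId);          // step 1: fetch the referencing document
        if (parent == null) {
            return result;
        }
        for (String relatedId : parent.relatedIds) {
            Doc related = index.get(relatedId); // step 2: fetch each referenced document
            if (related != null) {
                result.add(related);
            }
        }
        return result;
    }
}
```

The sketch makes the cost of the two-step approach visible: step 2 cannot start until step 1 has returned, which is the round trip a single join-style query would avoid.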
I know it is possible via nested queries, child/parent queries, or good old SQL queries (via JPA/Hibernate).
I attempted all of these and was unsuccessful, for the reasons described below.
Child/parent queries
For child/parent queries, I attempted to use the query DSL with @Query, but couldn't quite get it working since I don't have solid documentation to refer to (the kind that actually helps with Java rather than curl). After some time I found this and this article. I can maybe figure out how to make it work with child/parent, but neither explains how to do the mapping. If this approach can do what I want, my question is: how do I set up and map parent/child in Spring Boot?
Using SQL queries
For this one, I need to change my setup to use Hibernate. I used this as my base. It works; the only problem is that my SQL queries get ignored. Instead, the search is derived from the method's name, not from the content of @Query, as if the annotation were not there at all. Using the structure mentioned above, the following method in my app:
@Query("select t from MyModel t where t.docId = ?1")
List<MyModel> findByRelatedDocsRelatedId(String id);
returns files that have a relatedId matching the id passed as the method argument (as opposed to using the query from @Query, which tells the method to search all docs by docId). Now, I don't mind using the method name as the query. But then what would I use @Query for? (Not to mention, how do I create a method name that does a join?) It is possible that my Hibernate is set up wrong (I had never used it before this week). So the question here is: does anybody have a nice, complete example of Hibernate being used with Elasticsearch that does a join query?
Nested queries
For these queries, I assume that I just need to figure out what to put inside @Query, but due to the limited documentation on how to compose a nested query I didn't manage to get it even remotely working. Any concrete documentation on how to create a DSL nested query would be appreciated.
Any of the ways I described will work for me. Child/parent seems like the best choice (since it was more or less created for this purpose), but any will do.

Related

Custom query with Deepstackai haystack

I am exploring deepset Haystack and found it very interesting for multiple use cases such as a chatbot, search engine, document search, etc.
But I have not found any reference showing how to create multiple indexes for different documents and search based on those indexes. I thought of using meta tags for conditional search (on a particular area) by tagging the documents first and then using the params parameter of the query API, but that doesn't seem to work and throws an error (I used its vanilla docker-compose based setup).
You can use multiple indices in the same document store if you want to support multiple use cases, indeed. The write_documents method of the document store has a parameter index so that you can store documents for your different use cases in different indices. In the same way, you can pass an index parameter to the query method.
As you expected, there is an alternative solution that uses the meta field of documents. However, the format needs to be slightly different. Your query needs to have the following format:
{"query": "What's the capital town?", "params": {"filters": {"name": "75_Algeria75.txt"}}}
and your documents need to have the following format:
{'text': 'Algeria is...', 'meta':{'name': "75_Algeria75.txt"}}

faster search for a substring through large document

I have a CSV file of more than 1M records written in English plus another language. I have to make a UI that takes a keyword, searches through the document, and returns the records where that keyword appears. I look for the keyword in two columns only.
Here is how I implemented it:
First, I made a Postgres database for the data stored in the CSV file. Then I made a classic website where the user can enter a keyword. This is the SQL query that I use (in Spring Boot):
SELECT * FROM table WHERE col1 LIKE %:keyword% OR col2 LIKE %:keyword%;
Right now it is working perfectly fine, but I was wondering how to make the search faster. Was using SQL instead of a classic document search the better choice?
If the document is only searched once and then thrown away, loading it into a database is overhead. Instead, you can search the file directly with Files.lines from java.nio and a parallel stream, which uses multiple threads to search the file concurrently:
List<Record> result = Files.lines(Paths.get("some/path"))
    .parallel()
    .unordered()
    .map(l -> lineToRecord(l))
    .filter(r -> r.getCol1().contains(keyword) || r.getCol2().contains(keyword))
    .collect(Collectors.toList());
NOTE: you need to provide the lineToRecord() method and the Record class. Files.lines takes a Path and throws IOException, and the returned stream should be closed after use (for example with try-with-resources).
If the document is going to be searched over and over again, then think about indexing it. This means pre-processing the document to suit the search requirements; in this case, the keywords of col1 and col2. An index is like a map in Java, e.g.:
Map<String, Record> col1Index
But since you have "LIKE" semantics, this is not so easy: splitting the string on white space is not enough, because the keyword could match a substring. So in this case it might be best to look for a tool to help. Typically this would be something like Solr/Lucene.
Databases can also provide similar functionality, e.g.: https://www.postgresql.org/docs/current/pgtrgm.html
For LIKE queries, you should look at a GIN index using the gin_trgm_ops operator class from the pg_trgm extension. You shouldn't need to change the query at all; just build the index on each column. Or maybe one multi-column index.
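To illustrate why a trigram index helps with substring matching, here is a minimal in-memory sketch in Java. It is not how pg_trgm is implemented (pg_trgm also normalizes and pads its input); the class TrigramIndexSketch and its methods are hypothetical names used only for this illustration. The idea shown is: index every 3-character substring of each row, use the keyword's trigrams to narrow the candidates, then recheck the candidates with a real substring test.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TrigramIndexSketch {

    private final List<String> rows = new ArrayList<>();
    // trigram -> ids (positions in rows) of the rows that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(String text) {
        int id = rows.size();
        rows.add(text);
        for (String gram : trigrams(text)) {
            postings.computeIfAbsent(gram, g -> new HashSet<>()).add(id);
        }
    }

    // Rough equivalent of: WHERE col LIKE '%keyword%' (case-sensitive)
    List<String> search(String keyword) {
        List<String> hits = new ArrayList<>();
        Set<String> grams = trigrams(keyword);
        if (grams.isEmpty()) {
            // Keywords shorter than 3 characters cannot use the index: scan everything.
            for (String row : rows) {
                if (row.contains(keyword)) hits.add(row);
            }
            return hits;
        }
        // Candidates must contain every trigram of the keyword.
        Set<Integer> candidates = null;
        for (String gram : grams) {
            Set<Integer> ids = postings.getOrDefault(gram, new HashSet<>());
            if (candidates == null) {
                candidates = new HashSet<>(ids);
            } else {
                candidates.retainAll(ids);
            }
        }
        // Recheck in insertion order: shared trigrams do not guarantee a contiguous match.
        for (int id : new TreeSet<>(candidates)) {
            if (rows.get(id).contains(keyword)) hits.add(rows.get(id));
        }
        return hits;
    }

    private static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }
}
```

The recheck step is what keeps the results identical to a plain scan; the index only reduces how many rows have to be scanned.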

Indices with nested property in both Kibana visualization and index queries

I have the following problem, which I have been trying to solve for the last two days. I have a Python script which parses logs and inserts data into Elasticsearch, dynamically creating indices via the bulk function.
The problem is that my mapping has one "type": "nested" property, something like a "users" field. When I only add "type": "nested" to this property, I can't query the objects from Kibana or create any visualization (because nested objects are separate documents, if I'm not mistaken). The first thing I tried was adding an additional "include_in_parent": true parameter to the users field, but as a result I got "wrong" queries: running something like +users.name: 'test' +users.age: 30 would return ANY document which has those two values, not only documents where ONE user object has both. The visualization was obviously wrong too.
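The difference between the two behaviours can be sketched in plain Java. This is only an illustration of the matching semantics described above, not Elasticsearch code; the class NestedMatchSketch and its User type are hypothetical.

```java
import java.util.List;

class NestedMatchSketch {

    static final class User {
        final String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Flattened view (what querying the parent with include_in_parent amounts to):
    // users.name and users.age are independent value lists, so the two
    // conditions may be satisfied by different users.
    static boolean matchesFlattened(List<User> users, String name, int age) {
        boolean anyName = users.stream().anyMatch(u -> u.name.equals(name));
        boolean anyAge = users.stream().anyMatch(u -> u.age == age);
        return anyName && anyAge;
    }

    // Nested view: both conditions must hold for the same user object.
    static boolean matchesNested(List<User> users, String name, int age) {
        return users.stream().anyMatch(u -> u.name.equals(name) && u.age == age);
    }
}
```

For a document whose users are ("test", 25) and ("other", 30), the flattened check matches name "test" with age 30 even though no single user has both values, while the nested check does not.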
The second solution I found was adding a parent-child relationship. But this could potentially be a waste of time, as I don't know whether it will result in correct queries. So I'm asking: would this be a reasonable solution to my problem?
Found out that Kibana doesn't support nested objects.
But ppadovani made this fork which supports this feature.
https://github.com/homeaway/kibana/tree/nestedSupport-4.5.4

Updating filtered documents in elasticsearch

I want to know if there is a way to update elasticsearch documents after filtering them out.
Let's say I have a user collection with following documents:
[
{ "name":"u1","age":23},
{ "name":"u2","age":31},
{ "name":"u3","age":27},
{ "name":"u4","age":33}
]
Now what I need to do is update the names of all the users who have ages above 30.
After looking at a lot of documentation and searching for hours on Google, including the following document
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_updating_documents.html
I couldn't find a way to do it. In the docs, the update is given the id of the document, so it doesn't suit my need. Is there a way to do this sort of thing in Elasticsearch?
From the link you provided:
Note that as of this writing, updates can only be performed on a
single document at a time. In the future, Elasticsearch will provide
the ability to update multiple documents given a query condition (like
an SQL UPDATE-WHERE statement).
So, this is not supported at the moment. But you can consider taking a look at this plugin: https://github.com/yakaz/elasticsearch-action-updatebyquery/.
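Without such a plugin, the usual fallback given the single-document limitation quoted above is to search for the matching documents first and then update each hit by id. The Java sketch below only illustrates that control flow; FilteredUpdateSketch, its User type and the in-memory map are hypothetical stand-ins for the index and for the per-document update call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class FilteredUpdateSketch {

    static final class User {
        String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Hypothetical stand-in for the index: document id -> document.
    final Map<String, User> usersById = new LinkedHashMap<>();

    // Emulates "UPDATE ... WHERE age > minAge" as a filter followed by
    // one single-document update per hit. Returns the number of updates.
    int renameUsersOlderThan(int minAge, UnaryOperator<String> rename) {
        int updated = 0;
        for (Map.Entry<String, User> hit : usersById.entrySet()) {
            User user = hit.getValue();
            if (user.age > minAge) {
                user.name = rename.apply(user.name); // one update call per id in practice
                updated++;
            }
        }
        return updated;
    }
}
```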

How to retrieve all document ids matching a search, in elastic search?

I'm working on a simple side project, and have a tech stack that involves both a SQL database and ElasticSearch. I only have ElasticSearch because I assumed that as my project grows, my full text searching would be most efficiently performed by ES. My ES schema is very simple - documents that I insert into ES have 2 fields, one being the id and the other being the field with the body of text to search. The id being inserted into ES corresponds to that document's primary key id from the SQL database.
insert record into SQL -> insert record into ES using PK from SQL
Searching would be the reverse of that. Query ES and grab all the matching ids, and then turn around and use those ids to get records from SQL.
search ES and get all PK ids -> use those ids to get documents from SQL
The problem that I am facing though, is that ES can only return documents in a paginated manner. This is a problem because I also have a WHERE clause on my SQL query, beyond just the ids. My SQL query might look like this ...
SELECT * FROM foo WHERE id IN (1,2,3,4,5) AND bar != 'baz'
Well, with ES paginating the results, my WHERE clause will always only be querying a subset of the full results from ES. Even if I utilize ES' skip and take, I'm still only querying SQL using a subset of document ids.
Is there a way to get Elasticsearch to return the entire list of matching document ids? I realize the pagination is there to keep me from shooting myself in the foot, because doing this across all shards and many, many documents is not efficient. Is there no way, though?
After putting in some hours on this project, I've only now realized that I've poorly engineered this, unless I can get all of these ids from ES. Some alternative implementations that I've thought of would be to store the things that I'm filtering on, in SQL, in ES as well. A problem there is that I'd have to update the ES document every time I update the document in SQL. This would require a pretty big rewrite to some of my data access code. I could scrap ElasticSearch all together and just perform searching in Postgres, for now, until I can think of a better way to structure this.
Elasticsearch does not return every matching document for a query in a single response, because that would overload the system. Instead, use the scroll API in Elasticsearch. It is like the cursor concept in databases.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
For more examples refer the Github repo. https://github.com/sidharthancr/elasticsearch-java-client
Hope it helps..
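The cursor-style loop behind scrolling can be sketched as follows. This is a self-contained illustration, not Elasticsearch client code: ScrollSketch and its Cursor class are hypothetical stand-ins for a scroll context, where each call returns the next batch of matching ids and an empty batch signals the end.

```java
import java.util.ArrayList;
import java.util.List;

class ScrollSketch {

    // Hypothetical cursor over a result set, standing in for a scroll context.
    static final class Cursor {
        private final List<Integer> matchingIds;
        private final int batchSize; // must be greater than zero
        private int position = 0;

        Cursor(List<Integer> matchingIds, int batchSize) {
            this.matchingIds = matchingIds;
            this.batchSize = batchSize;
        }

        // Returns the next batch; an empty batch means the results are exhausted.
        List<Integer> nextBatch() {
            int end = Math.min(position + batchSize, matchingIds.size());
            List<Integer> batch = new ArrayList<>(matchingIds.subList(position, end));
            position = end;
            return batch;
        }
    }

    // Drains the cursor batch by batch and collects every matching id.
    static List<Integer> collectAllIds(Cursor cursor) {
        List<Integer> all = new ArrayList<>();
        List<Integer> batch = cursor.nextBatch();
        while (!batch.isEmpty()) {
            all.addAll(batch);
            batch = cursor.nextBatch();
        }
        return all;
    }
}
```

Once all ids have been collected this way, they can be passed to the SQL query in one go, so the WHERE clause is applied to the full set of matches rather than to a single page.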
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-fields.html
Please have a look at the Elasticsearch documentation linked above, which shows how to specify that only particular fields are returned from the matching documents.
Hope this resolves your problem.
{
  "fields" : ["user", "postDate"],
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
