With spring-data-elasticsearch and searching for similar documents, how to get similarity score?

I am using the latest version of elasticsearch (in docker) and a spring boot (latest version) app where I attempt to search for similar documents. My document class has a String field:
@Field(
name = "description",
type = FieldType.Text,
fielddata = true,
analyzer = "icu_analyzer",
termVector = TermVector.with_positions_offsets,
similarity = Similarity.BM25)
private String description;
I get plenty of results for my query when I use the built-in searchSimilar method:
public Page<BookInfo> findSimilarDocuments(final long id) {
return bookInfoRepository.findById(id)
.map(bookInfo -> bookInfoRepository.searchSimilar(bookInfo, new String[]{"description"}, pageable))
.orElse(Page.empty());
}
However, I have no idea how similar the documents are, because it is just a page of my Document object. It would be great to be able to see the similarity score, or to set a similarity threshold when performing the query. Is there something different that I should be doing?

I just had a look: the existing method Page<T> searchSimilar(T entity, @Nullable String[] fields, Pageable pageable) was added to the ElasticsearchRepository interface back in 2013. It just returns a Page<T>, which does not contain any score information.
Since Spring Data Elasticsearch version 4.0 the score information is available, and when you look at the implementation you see that it is stripped from the return value of the function in order to adhere to the method signature from the interface:
public Page<T> searchSimilar(T entity, @Nullable String[] fields, Pageable pageable) {
Assert.notNull(entity, "Cannot search similar records for 'null'.");
Assert.notNull(pageable, "'pageable' cannot be 'null'");
MoreLikeThisQuery query = new MoreLikeThisQuery();
query.setId(stringIdRepresentation(extractIdFromBean(entity)));
query.setPageable(pageable);
if (fields != null) {
query.addFields(fields);
}
SearchHits<T> searchHits = execute(operations -> operations.search(query, entityClass, getIndexCoordinates()));
SearchPage<T> searchPage = SearchHitSupport.searchPageFor(searchHits, pageable);
return (Page<T>) SearchHitSupport.unwrapSearchHits(searchPage);
}
You could implement a custom repository fragment (see https://docs.spring.io/spring-data/elasticsearch/docs/4.2.6/reference/html/#repositories.custom-implementations) that provides its own implementation of the method and returns a SearchPage<T>:
public SearchPage<T> searchSimilar(T entity, @Nullable String[] fields, Pageable pageable) {
Assert.notNull(entity, "Cannot search similar records for 'null'.");
Assert.notNull(pageable, "'pageable' cannot be 'null'");
MoreLikeThisQuery query = new MoreLikeThisQuery();
query.setId(stringIdRepresentation(extractIdFromBean(entity)));
query.setPageable(pageable);
if (fields != null) {
query.addFields(fields);
}
SearchHits<T> searchHits = execute(operations -> operations.search(query, entityClass, getIndexCoordinates()));
SearchPage<T> searchPage = SearchHitSupport.searchPageFor(searchHits, pageable);
return searchPage;
}
A SearchPage<T> is a page containing SearchHit<T> instances; these contain the entity and the additional information like the score.
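Once the scores are available, a similarity threshold can be applied by filtering the hits after the search returns. The following is a minimal, library-free sketch of that idea: the ScoredHit record is a hypothetical stand-in for Spring Data Elasticsearch's SearchHit<T> (which exposes getContent() and getScore()), and the class name, method name, and threshold are made up for illustration. Keep in mind that BM25 scores are not normalized, so a sensible cut-off depends on your data.

```java
import java.util.List;
import java.util.stream.Collectors;

class ScoreThresholdSketch {

    // Hypothetical stand-in for SearchHit<T>: the entity plus its relevance score.
    record ScoredHit<T>(T content, float score) {}

    // Keeps only the hits whose score is at least minScore, preserving their order.
    static <T> List<ScoredHit<T>> atLeast(List<ScoredHit<T>> hits, float minScore) {
        return hits.stream()
                .filter(hit -> hit.score() >= minScore)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ScoredHit<String>> hits = List.of(
                new ScoredHit<>("Book A", 7.4f),
                new ScoredHit<>("Book B", 2.1f),
                new ScoredHit<>("Book C", 5.0f));
        // Print the hits that pass an arbitrary threshold of 5.0.
        for (ScoredHit<String> hit : atLeast(hits, 5.0f)) {
            System.out.println(hit.content() + " scored " + hit.score());
        }
    }
}
```

Filtering after the fact changes the number of elements on a page, so the page metadata returned by the search no longer matches the filtered list exactly.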

Related

Spring Mongo perform pagination/sorting with multiple collections

I am using org.springframework.data.mongodb.core.query.Query with Pageable for all the regular search and pagination features, and it is working fine. Now I want to join multiple collections and paginate the result. Is there any provision for this?
Thanks!
UPDATE
Here is the method I am using to achieve this:
public <E> Page<E> searchEntity(final SearchCriteria searchCriteria, Pageable pageable, Class<E> entityClass) {
Query query = prepareSearch(searchCriteria); // method which frames the required search criteria based on the requests
long count = mongoOps.count(query, entityClass);
if (count == 0) {
return new PageImpl<>(new ArrayList<>(), pageable, 0);
}
query.with(pageable);
List<E> list = mongoOps.find(query, entityClass);
return new PageImpl<>(list, pageable, count);
}

Hibernate Criteria FetchMode.JOIN is doing lazy loading

I have a paginated endpoint which internally uses Hibernate Criteria to fetch certain objects and relations. The FetchMode is set as FetchMode.JOIN.
When I hit the endpoint, the request works fine for a few pages but then fails with:
could not initialize proxy - no Session
The method is as below:
@Override
public Page<Person> findAllNotDeleted(final Pageable pageable)
{
final var criteria = createCriteria();
criteria.add(Restrictions.or(Restrictions.isNull(DELETED), Restrictions.eq(DELETED, false)));
criteria.setFetchMode(PERSON_RELATION, FetchMode.JOIN);
criteria.setFetchMode(DEPARTMENT_RELATION, FetchMode.JOIN);
criteria.setFirstResult((int) pageable.getOffset());
criteria.setMaxResults(pageable.getPageSize());
criteria.addOrder(asc("id"));
final var totalResult = getTotalResult();
return new PageImpl<>(criteria.list(), pageable, totalResult);
}
private int getTotalResult()
{
final Criteria countCriteria = createCriteria();
countCriteria.add(Restrictions.or(Restrictions.isNull(DELETED), Restrictions.eq(DELETED, false)));
return ((Number) countCriteria.setProjection(Projections.rowCount()).uniqueResult()).intValue();
}
Also, the call to findAllNotDeleted is made from a method annotated with @Transactional.
Not sure what is going wrong.
Any help would be highly appreciated.
EDIT
I read that FetchMode.JOIN does not work with Restrictions, so I tried implementing it using CriteriaBuilder, but I am stuck again.
@Override
public Page<Driver> findAllNotDeleted(final Pageable pageable)
{
final var session = getCurrentSession();
final var builder = session.getCriteriaBuilder();
final var query = builder.createQuery(Person.class);
final var root = query.from(Driver.class);
root.join(PERSON_RELATION, JoinType.INNER)
.join(DEPARTMENT_RELATION,JoinType.INNER);
//flow does not reach here.....
var restrictions_1 = builder.isNull(root.get(DELETED));
var restrictions_2 = builder.equal(root.get(DELETED), false);
query.select(root).where(builder.or(restrictions_1,restrictions_2));
final var result = session.createQuery(query).getResultList();
return new PageImpl<>(result, pageable, result.size());
}
The flow does not seem to reach after root.join.
EDIT-2
The relations are as follows:
String PERSON_RELATIONSHIP = "person.address"
String DEPARTMENT_RELATION = "person.department"
and person, address, and department are all classes which extend Entity
I guess the associations you try to fetch i.e. PERSON_RELATION or DEPARTMENT_RELATION are collections? In such a case, it is not possible to directly do pagination on the entity level with Hibernate. You would have to fetch the ids first and then do a second query to fetch just the entities with the matching ids.
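To make the two-query idea concrete, here is a minimal in-memory sketch. It assumes the joined result repeats each parent once per child row, which is why limiting joined rows directly gives the wrong page. The in-memory list stands in for the database, and all names (IdFirstPaginationSketch, JoinedRow, pageOfParentIds, rowsFor) are hypothetical; in a real application each step would be a separate query.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class IdFirstPaginationSketch {

    // One row of a joined result: the parent id is repeated once per child.
    record JoinedRow(long parentId, String child) {}

    // Step 1: page over distinct parent ids instead of over joined rows.
    static List<Long> pageOfParentIds(List<JoinedRow> rows, int offset, int pageSize) {
        Set<Long> distinctIds = new LinkedHashSet<>();
        for (JoinedRow row : rows) {
            distinctIds.add(row.parentId());
        }
        return distinctIds.stream().skip(offset).limit(pageSize).toList();
    }

    // Step 2: load every joined row that belongs to the selected parents.
    static List<JoinedRow> rowsFor(List<JoinedRow> rows, List<Long> parentIds) {
        List<JoinedRow> result = new ArrayList<>();
        for (JoinedRow row : rows) {
            if (parentIds.contains(row.parentId())) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<JoinedRow> rows = List.of(
                new JoinedRow(1, "a"), new JoinedRow(1, "b"), new JoinedRow(1, "c"),
                new JoinedRow(2, "d"), new JoinedRow(3, "e"));
        // A page of two parents; limiting the joined rows to two would cover only parent 1.
        List<Long> ids = pageOfParentIds(rows, 0, 2);
        System.out.println(ids + " -> " + rowsFor(rows, ids));
    }
}
```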
You could use Blaze-Persistence on top of Hibernate though which has a special pagination API that does these tricks for you behind the scenes. Here is the documentation about the pagination: https://persistence.blazebit.com/documentation/core/manual/en_US/index.html#pagination
There is also a Spring Data integration, so you could also use the Spring Data pagination convention along with Blaze-Persistence Entity-Views which are like Spring Data Projections on steroids. You'd use Page<DriverView> findByDeletedFalseOrDeletedNull(Pageable p) with
@EntityView(Driver.class)
interface DriverView {
Long getId();
String getName();
PersonView getPersonRelation();
DepartmentView getDepartmentRelation();
}
@EntityView(Person.class)
interface PersonView {
Long getId();
String getName();
}
@EntityView(Department.class)
interface DepartmentView {
Long getId();
String getName();
}
Using entity views will only fetch what you declare, nothing else. You could also use entity graphs though:
@EntityGraph(attributePaths = {"personRelation", "departmentRelation"})
Page<Driver> findByDeletedFalseOrDeletedNull(Pageable p);

custom sql queries in JPA Criteria Predicate, StringPath, Querydsl

I have a Spring boot project with hibernate 5.4.12, Java 11 and Postgres.
I am trying to build a custom Sort/Filter mechanism using JPA and Querydsl, here is one blog for reference.
We have a gin index column which is used for full text search feature by postgres. In jpa repository, I can query the column easily as below
@Query(value = "select * from products where query_token @@ plainto_tsquery(:query)", nativeQuery = true)
Page<Product> findAllByTextSearch(@Param("query") String query, Pageable pageable);
I am aware that fts queries are not yet supported by the JPA Criteria or Querydsl APIs (I may be wrong). Since the normal filtering logic goes through the Criteria API, how do I add fts capabilities to it? Is there a way to add a custom native query as a predicate, a StringPath, or any other Q-type path?
UPDATE
My SearchPredicate class
public class SearchPredicate<E extends Enum<E>> {
private SearchCriteria<E> searchCriteria;
public <T> BooleanExpression getPredicate(Class<T> entityClass, String entityName) {
PathBuilder<T> entityPath = new PathBuilder<>(entityClass, entityName);
switch (searchCriteria.getPathType()) {
case String:
StringPath stringPath = entityPath.getString(searchCriteria.getKey());
return stringPath.eq(searchCriteria.getStringValue());
case Enum:
return entityPath.getEnum(searchCriteria.getKey(), searchCriteria.getEnumClass())
.eq(Enum.valueOf(searchCriteria.getEnumClass(), searchCriteria.getStringValue()));
case Float:
NumberPath<Float> floatPath = entityPath.getNumber(searchCriteria.getKey(), Float.class);
Float floatValue = Float.parseFloat(searchCriteria.getStringValue());
return floatPath.eq(floatValue);
case Integer:
NumberPath<Integer> integerPath = entityPath.getNumber(searchCriteria.getKey(), Integer.class);
Integer integerValue = Integer.parseInt(searchCriteria.getStringValue());
return integerPath.eq(integerValue);
}
return null;
}
}
My SearchCriteria class
public class SearchCriteria<E extends Enum<E>> {
private String key;
private Object value;
private PathType pathType;
private Class<E> enumClass;
public String getStringValue() {
return value.toString();
}
}
And My PathType Enum
public enum PathType {
String, Enum, Integer, Float;
}
On these same lines, I am assuming/expecting something for text search as well e.g.
case Search:
FtsPath ftsPath = entityPath.getFtsPath("query_token");
return ftsPath.search("some search string")
You should first make the @@ operator available by registering a custom function for your ORM. Then you can do plainto_tsquery(query_token, :query) in your JPQL query. How to register a custom function depends on the ORM you use. Assuming you use Hibernate, you're probably best off using the MetadataContributor SPI, because functions registered through the Dialect have less flexibility with regard to the underlying SQL rendering, AFAIK.
Then, if you want to use this in QueryDSL, you'd have to create a custom Operator and register a Template for that Operator in a subclass of JPQLTemplates. Alternatively, you can bypass the Operation expressions using a simple TemplateExpression: Expressions.booleanTemplate("plainto_tsquery({0}, {1})", QProduct.product.queryToken, query), which returns a predicate.

Paging results of aggregation pipeline with spring data mongodb

I am having a bit of trouble with paging the results of an aggregation pipeline. After looking at In spring data mongodb how to achieve pagination for aggregation I came up with what feels like a hacky solution. I first performed the match query, then grouped by the field that I searched for, and counted the results, mapping the value to a private class:
private long getCount(String propertyName, String propertyValue) {
MatchOperation matchOperation = match(
Criteria.where(propertyName).is(propertyValue)
);
GroupOperation groupOperation = group(propertyName).count().as("count");
Aggregation aggregation = newAggregation(matchOperation, groupOperation);
return mongoTemplate.aggregate(aggregation, Athlete.class, NumberOfResults.class)
.getMappedResults().get(0).getCount();
}
private class NumberOfResults {
private int count;
public int getCount() {
return count;
}
public void setCount(int count) {
this.count = count;
}
}
This way, I was able to provide a "total" value for the page object I was returning:
public Page<Athlete> findAllByName(String name, Pageable pageable) {
long total = getCount("team.name", name);
Aggregation aggregation = getAggregation("team.name", name, pageable);
List<Athlete> aggregationResults = mongoTemplate.aggregate(
aggregation, Athlete.class, Athlete.class
).getMappedResults();
return new PageImpl<>(aggregationResults, pageable, total);
}
You can see that the aggregation to get the total count of results is not too different from the actual aggregation that I want to perform:
MatchOperation matchOperation = match(Criteria.where(propertyName).is(propertyValue));
SkipOperation skipOperation = skip((long) (pageable.getPageNumber() * pageable.getPageSize()));
LimitOperation limitOperation = limit(pageable.getPageSize());
SortOperation sortOperation = sort(pageable.getSort());
return newAggregation(matchOperation, skipOperation, limitOperation, sortOperation);
This definitely worked, but, as I was saying, it feels hacky. Is there a way to get the count for the PageImpl instance without essentially having to run the query twice?
Your question helped me get around the same problem of paging with aggregation, so I did a little digging and came up with a solution to your problem. I know it's a bit late, but someone might get some use out of this answer. I am in no way a Mongo expert, so if what I am doing is bad practice or not very performant, please don't hesitate to let me know.
Using group, we can add the root documents to a set and also count.
group().addToSet(Aggregation.ROOT).as("documents")
.count().as("count")
Here is my solution for almost the exact same problem you were facing.
private Page<Customer> searchWithFilter(final String filterString, final Pageable pageable, final Sort sort) {
final CustomerAggregationResult aggregationResult = new CustomerAggregationExecutor()
.withAggregations(match(new Criteria()
.orOperator(
where("firstName").regex(filterString),
where("lastName").regex(filterString))),
skip((long) (pageable.getPageNumber() * pageable.getPageSize())),
limit(pageable.getPageSize()),
sort(sort),
group()
.addToSet(Aggregation.ROOT).as("documents")
.count().as("count"))
.executeAndGetResult(operations);
return new PageImpl<>(aggregationResult.getDocuments(), pageable, aggregationResult.getCount());
}
CustomerAggregationResult.java
@Data
public class CustomerAggregationResult {
private int count;
private List<Customer> documents;
public static class CustomerAggregationExecutor {
private Aggregation aggregation;
public CustomerAggregationExecutor withAggregations(final AggregationOperation... operations) {
this.aggregation = newAggregation(operations);
return this;
}
@SuppressWarnings("unchecked")
public CustomerAggregationResult executeAndGetResult(final MongoOperations operations) {
return operations.aggregate(aggregation, Customer.class, CustomerAggregationResult.class)
.getUniqueMappedResult();
}
}
}
Really hope this helps.
EDIT: I had initially created a generic PageableAggregationResult<T> with a List<T>, but this throws an IllegalArgumentException because I pass PageableAggregationResult.class with no type for T. If I find a solution for this I will edit this answer, as I want to be able to aggregate multiple collections eventually.

Spring data + Mongodb + query single value?

How can I query a single field instead of a whole object? I am trying to do something like this; is it possible?
public BigInteger findUserIDWithRegisteredEmail(String email){
Query query = Query.query(Criteria.where("primaryEmail").is (email));
query.fields().include("_id");
return (BigInteger) mongoTemplate.find(query, BigInteger.class);
}
In method
find(Query query, Class<YourCollection> entityClass)
entityClass should be the corresponding collection, not the type of id.
If you are just trying to get id use
Query query = Query.query(Criteria.where("primaryEmail").is (email));
query.fields().include("_id");
mongoTemplate.findOne(query, YourCollection.class).getId(); // findOne returns a single entity; find returns a List
If you only include _id, all the other fields will be null in your result.
If you want to avoid mapping the result to an entity, this is one way you could handle it:
final List<String> ids = new ArrayList<String>();
mongoTemplate.executeQuery(query, "collectionName", new DocumentCallbackHandler() {
@Override
public void processDocument(DBObject dbObject) throws MongoException, DataAccessException {
ids.add(dbObject.get("_id").toString());
}
});
