Batch indexing Spring Data JPA entries to Elastic through Spring Data ElasticSearch - spring

Our current setup is MySQL as main data source through Spring Data JPA, with Hibernate Search to index and search data. We now decided to go to Elastic Search for searching to better align with other features, besides we need to have multiple servers sharing the indexing and searching.
I'm able to setup Elastic using Spring Data ElasticSearch for data indexing and searching easily, through ElasticsearchRepository. But the challenge now is how to index all the existing MySQL records into Elastic Search. Hibernate Search provides an API to do this org.hibernate.search.jpa.FullTextEntityManager#createIndexer which we use all the time. But I cannot find a handy solution within Spring Data ElasticSearch. Hope somebody can help me out here or provide some pointers.
There is a similar question here, however the solution proposed there doesn't fit my needs very well as I'd prefer to be able to index a whole object, which fields are mapped to multiple DB tables.

So far I haven't found a better solution than writing my own code to index all JPA entries to ES inside my application, and this one worked out for me fine
Pageable page = new PageRequest(0, 100);
Page<Instance> curPage = instanceManager.listInstancesByPage(page); //Get data by page from JPA repo.
long count = curPage.getTotalElements();
while (!curPage.isLast()) {
List<Instance> allInstances = curPage.getContent();
for (Instance instance : allInstances) {
instanceElasticSearchRepository.index(instance); //Index one by one to ES repo.
}
page = curPage.nextPageable();
curPage = instanceManager.listInstancesByPage(page);
}
The logic is very straightforward, just depending on the quantity of the data it might take a while, so breaking down to batches and adding some messages can be helpful.

Related

Fetching Index data from elasticsearch DB using Spring Data ElasticSearch

I have a java code which connects to Elasticsearch DB using Spring-data-elasticsearch and fetches all the index data by connecting to the repository and executing the findAll() method. The data received from ES is being processed by a seperate application. When new data is inserted into elastic search, I have the below queries
1. How can I fetch only the newly inserted data Programatically ?
2. Apart from using the DSL queries, Is there a way to Asyncronously get the new records as and when new data is inserted into elasticsearch DB.
I dont want to execute the findAll() method again. Because it returns the entire data ( including the previously processed records as well) .
Any help on this is much appreciated.
You will need to add a field (I call it createdAt here) to your entities that contains the timestamp when your application inserts into Elasticsearch. One possibility would be to use the auditing support of Spring Data Elasticsearch to have the value set automatically, or you set the value in your application. If the data is inserted by some other application you need to make sure that it contains a timestamp in a format that maps the field type definition of this field in your application.
Then you'd need to define a method in your repository like
SearchHits<T> findByCreatedAtAfter(Timestamp referenceValue);
As for getting a notification in some form when new data is inserted: I'm not aware that Elasticsearch offers something like that. You will probably need to regularly call the method that retrieves the data.

How about including JSON doc version? Is it possible for elastic search, to include different versions of JSON docs, to save and to search?

We are using ElasticSearch to save and manage information on complex transactions. We might need to add more information for every transaction, on the near future.
How about including JSON doc version?
Is it possible for elastic search, to include different versions of JSON docs, to save and to search?
How does this affects performance on ElasticSearch?
It's completely possible, By default elastic uses the dynamic mappings for every new documents such as your JSON documents to index them. For each field in your documents elastic creates a table called inverted_index and the search queries executed against them so regardless of your field variation as long as you know which field you want to execute query the data throughput and performance will not be affected.

Is there any tool out there for generating elasticsearch mapping

Mostly what I do is to assemble the mapping by hand. Choosing the correct types myself.
Is there any tool which facilitates this?
For example which will read a class (c#,java..etc) and choosing the closest ES types accordingly.
I've never seen such a tool, however I know that ElasticSearch has a REST API over HTTP.
So you can create a simple HTTP query with JSON body that will depict your object with your fields: field names + types (Strings, numbers, booleans) - pretty much like a Java/C# class that you've described in the question.
Then you can ask the ES to store the data in the non-existing index (to "index" your document in ES terms). It will index the document, but it will also create an index, and the most importantly for your question, will create a mapping for you "dynamically", so that later you will be able to query the mapping structure (again via REST).
Here is the link to the relevant chapter about dynamically created mappings in the ES documentation
And Here you can find the API for querying the mapping structure
At the end of the day you'd still want to retain some control over how your mapping is generated. I'd recommend:
syncing some sample documents w/o a mapping
investigating what mapping was auto generated and
dropping the index & using dynamic_templates to pseudo-auto-generate / update the mapping as new documents come in.
This GUI could help too.
Currently, there is no such tool available to generate the mapping for elastic.
It is a kind of similar thing as we have to design a database in MySQL.
But if we want such kind of thing then we use Mongo DB which requires no predefined schema.
But Elastic comes with its very dynamic feature, which allows us to play around it. One of the most important features of Elasticsearch is that it tries to get out of your way and let you start exploring your data as quickly as possible like the mongo schema which can be manipulated dynamically.
To index a document, you don’t need to first define a mapping or schema and define your fields along with their data type .
You can just index a document and the index, type, and fields will be created automatically.
For further details you can go through the below documentation:
Elastic Dynamic Mapping

Guidance on how to index Elasticsearch documents using Spring Data

My application uses both Spring Data JPA and Spring Data Elasticsearch.
I plan to first persist the JPA entities, then map them to a slightly different java class (the Elasticsearch document) and finally index that document into the Elasticsearch index.
However, I have a few questions as how, where and when to index the documents.
Is indexing a time consuming process that should be asynchronous?
What design pattern could help me avoid having problematic code such as the following?
saveAdvertisement method from AdvertisementService:
public void saveAdvertisement(Advertisement jpaAdvertisement) {
jpaAdvertisementRepository.save(jpaAdvertisement);
//somehow map the jpa entity to the es document
elasticSearchTemplate.index(esAdvertisement);
}
whereby I have to have two concerns in the same method:
JPA persist
Elasticsearch indexing

Is lucene calling datasource for every request?

I am new to Apache Lucene. Please someone guide me how apache lucene works.
For every request, will it invoke datasource(documents, database. etc) from lucene index?
or it will look at the index alone?
Once documents are indexed, Lucene will only look at the index and nowhere else.
You also need to understand the difference between indexing and storing data in the index. Former allows document to be found while latter allows the data to be read when relevant document is found.
Why is this necessary? Sometimes you can index all fields but only store the ID and retrieve the actual data from external source (e.g. database) using that ID. Or you can store data in the index and load it from there instead of going to another data source.

Resources