What options do I have regarding indexing PDFs while running on Elasticsearch 1.x and Spring Data 1.x, especially if I want to upgrade? - elasticsearch

We have a new requirement on our Elasticsearch - to index PDFs. We are still running on Elasticsearch version 1.x (and Spring Data 1.3.4).
I look at the documentation for Elasticsearch 5 and they have new ways of supporting PDFs in 5 (and I would like to upgrade).
So given all this the way I see it I have the following options:
Sit tight and wait for Spring Data to support Elasticsearch 5. This is viable if it is not too far away (please let us know, Spring Data and Elasticsearch dev) although given the business urgency on this feature I don't think I have much leeway
Move off Spring data altogether - this is not as crazy as it sounds as given the complexity of my queries I don't use the Spring Data repositories a great deal. I do however use them for inserting data. I would have to provide my own implementations of the current repository interfaces. It would be work but I wouldn't need to wait for any one and would not need to use any outdated plugins etc
Somehow run on Elasticsearch 5 with Spring Data 2.x/3.x. Will this work at all? Chances are it probably won't even startup.
Upgrade my Elasticsearch/Spring Data to 2.x and use the "old" way of indexing PDFs.
Which option is the best way to go?

Related

Are there any good alternatives to Apache Ignite as an In-Memory Data Grid used together with Spring as a distributed cache?

We have a solution which uses the Apache Ignite-provided In-Memory Data Grid as a distributed cache. For newer projects, we ended up using Spring, and as such we wished homogenize our software ecosystem and using Spring for the first solution as well. In addition, we do not use all the features of Ignite to excuse its use (discovery, caching).
Since we currently only use a limited subset of features from Ignite, we are basically looking for a self-managed application-level distributed cache solution (similar to what Ignite provides). This means that dedicated caching infrastructure like Redis, Memcached, etc. is not what we want.
I've researched the topic somewhat and found that there are some possible alternatives like:
Tayzgrid - Last update seems to be quite some time ago, not sure if still actively maintained
Druid - Still incubating, and I have also read that new releases being somewhat broken was not that uncommon
Hazelcast - Seems like the best choice given its maturity and the existence of Spring Data Hazelcast, though I am unsure what the level of support is here.
Has anyone has experience with integrating one of the above IMDGs (aside from Ignite) with Spring Cache? Any pointers in the right direction would be greatly appreciated.
You can use Redisson - Redis Java client with features of
In-Memory Data Grid. It also implements Spring Data support. Here is the documentation.
Hazelcast has official support for Spring Data Hazelcast and also this module has many users as now. I can also suggest you to have a look at the resources below:
Using Hazelcast with Spring Data
Getting Started with Microservices Using Hazelcast IMDG and Spring Boot

In-memory elastic search

I have a scenario where I want to query database once and after that want to cache the whole data in memory.
I got the suggestion for in-memory elastic search, I have googled it understand what it is and how can I implement it in my spring boot application, but I didn't find any appropriate solution.
Any suggestion on this like how can I implement this in my spring boot app and what would be the approach.
There used to be an in-memory storage type in Elasticsearch in 1.x, but it has been removed in 2.x and later versions. If your working set is small enough it might be mapped to memory in full, but you cannot really control that other than having enough memory.
If you want to run an embedded / in-process Elasticsearch with your Spring Boot application that feature was removed in 5.x and this blog post explains why.

What is the best way to use Spring and ElasticSearch?

I have to implement some application by using springframework.
All i have to do is just select from repository (no RDBMS, maybe lucene or elastic search core) and Display some view pages for customers. that is not save or update but read.
What is the best way to select for repositories in spring framework ?
You can use spring-data-elasticsearch which is the Spring Data implementation for ElasticSearch.
In order to get started, you may like to refer to https://www.mkyong.com/spring-boot/spring-boot-spring-data-elasticsearch-example/ which explains the integration with an example. Although it is a bit old but provide you with enough information to get it working.

How to import data into Apache Solr index from LDAP

I have an LDAP-Server that contains a large set of user data and would like to import this into an Apache Solr index. The question is not about whether this is a good idea or not (as discussed here). I need this kind of architecture as one of our production systems depends on a Solr index of our ldap data.
I'm considering different options to do so, but I'm not sure which one should be preferred:
Option 1: Use the Apache Solr DataImportHandler:
This seems to be the most straight forward Solr way of doing so. Unfortunately there does not seem to be DataSource available that would work with LDAP.
I tried to combine the JdbcDataSource with the JDBC-LDAP-Bridge. In theory that might probably work but the driver looks quite dated (latest Version from 2007).
Another Option might be to write a custom LdapDataSource using some of the LDAP-Libraries for Java (probably Spring LDAP, directly via JNDI or something similar?).
Option 2: Build a custom Feeder:
Another option might be to write a standalone service/script that bridges between the two services. However that feels a bit like reinventing the wheel.
Option 3: Something I haven't thought of yet:
Maybe there are additional options here that I simply haven't discovered yet.
Solved it by writing a custom LDAP DataSource for the Solr DataImportHandler.
It's not as hard as it sounds. The JdbcDataSource can be used as a template for writing your custom DataSource, so basically you just have to rewrite that one Java-Class for the LDAP protocol.
For accessing the LDAP-Client there are numerous options, such as plain JNDI, UnboundID LDAP SDK, Apache LDAP API, OpenDJ LDAP SDK or OpenLDAP JLDAP (there are probably more but I only had a look at those).
I went for UnboundID LDAP due to its well documented API and full support for LDAPv3.
Afterwards it is just a matter of referencing the datasource from the data-config.xml.
A nice side-effect of this setup is, that you can use all the goodies that the Solr DataImportHandler provides while indexing the LDAP server (Entity Processors and Transformers). This makes it easy to map the data structure between LDAP and the Solr Index.

How to combine neo4j and elasticsearch

I am developing a Question answering application and for that I need to use neo4j and elasticsearch in the same maven project. I am using elasticsearch to make my application more robust.
As we know that neo4j and elasticsearch works on different version of lucene, so whichever version I include in dependency, it gives an error.
Here is what I am doing:
First elasticsearch will index the data and the data and relationships will be stored as graphdatabase using neo4j. Then the user will input as a query, through which the data will be retrieved with the help of indexes. This data will be trigerred in graphdatabasev using trigger score which will be then propagated along the graphdatabase to find relevant results according to the user query.
Is there any way that I can integrate neo4j and elasticsearch in same maven project, or is there any other way through which these two modules can interact seperately.
Thanks
Please check out our integration page:
http://neo4j.com/developer/elastic-search/
Which has some discussion and also an example project to get you started.
http://github.com/neo4j-contrib/neo4j-elasticsearch

Resources