Avoid data replication when using Elasticsearch + MySQL backend?

I'm working on a project where we have some legacy data in MySQL and now we want to deploy ES for better full text search.
We still want to use MySQL as the backend data storage because the current system is closely coupled with that.
It seems that most of the available solutions suggest syncing the data between the two, but this would result in storing every document twice, once in ES and once in MySQL. Since some of the documents can be rather large, I'm wondering if there's a way to keep only a single copy of each document?
Thanks!

Impossible. This is analogous to asking the following: if you have legacy data in an Excel spreadsheet, can you use a MySQL database to query that data without also storing it in MySQL?
Elasticsearch is not just an application layer that interprets userland queries and turns them into database queries, it is itself a database system (in fact, it can be used as your primary data store, though it's not recommended due to various drawbacks). Its search functionality fundamentally depends on how its own backing storage is organized. Elasticsearch cannot query other databases.
You should consider which portions of your data actually need to be stored in Elasticsearch, i.e. which fields need full-text search or completion. You will need to build a component that syncs that view of the data between Elasticsearch and your MySQL database.
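A minimal sketch of such a sync component, assuming the Python `elasticsearch` (8.x API) and `mysql-connector-python` packages; the `articles` table, its columns, the index name, and the connection details are all illustrative:

```python
# Sketch: index only the searchable fields plus the MySQL primary key into
# Elasticsearch; keep the full document in MySQL and fetch it by id after search.
# The `articles` table and its columns (id, title, body, large_payload) are
# hypothetical placeholders.
import mysql.connector
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
db = mysql.connector.connect(host="localhost", user="app", password="secret",
                             database="legacy")

def sync_to_es():
    """Copy only the fields needed for full-text search into Elasticsearch."""
    cur = db.cursor(dictionary=True)
    cur.execute("SELECT id, title, body FROM articles")
    for row in cur:
        # Large, non-searchable columns stay only in MySQL.
        es.index(index="articles", id=row["id"],
                 document={"title": row["title"], "body": row["body"]})
    cur.close()

def search(query):
    """Search in Elasticsearch, then fetch the full rows from MySQL by id."""
    hits = es.search(index="articles",
                     query={"match": {"body": query}})["hits"]["hits"]
    ids = [hit["_id"] for hit in hits]
    if not ids:
        return []
    cur = db.cursor(dictionary=True)
    placeholders = ",".join(["%s"] * len(ids))
    cur.execute(f"SELECT * FROM articles WHERE id IN ({placeholders})", ids)
    rows = cur.fetchall()
    cur.close()
    return rows
```

Note that the searchable text fields are still stored twice (in MySQL and in the Elasticsearch index); what this avoids is duplicating the large, non-searchable columns, which remain only in MySQL.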

Related

Using ElasticSearch as a permanent storage

Recently I have been working on a project which produces a huge amount of data every day. The project has two functionalities: one is storing data into HBase for future analysis, and the second is pushing data into Elasticsearch for monitoring.
As the data is huge, we would have to store it on two platforms (HBase and Elasticsearch)!
I have no experience with either of them. I want to know whether it is possible to use Elasticsearch instead of HBase as persistent storage for future analytics?
I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is primarily a search engine. Whether it can serve as your only store depends on how critical your data is and whether you can accept losing some of it, as with non-critical logs.
If you don't want to use an additional database for such a large volume of data, you could store it as files in something like HDFS.
You should also check out Phoenix (https://phoenix.apache.org/), which may provide the monitoring features you are looking for.

How to retrieve the data from database without using apache jackrabbit datastore?

I have integrated Jackrabbit with an Oracle database and I am storing the data using Jackrabbit. If I don't want to retrieve the data using Jackrabbit, in what way can I get the data? In the database the data is stored as a blob type.
The way Jackrabbit stores the data in the DB is an implementation detail, and it does not magically map this into a "nice" DB schema, if that's what you mean. (The hierarchical nature and all the JCR features make this impossible.) It's a bit like having a Unix file system and then asking how to read the low-level inodes etc. directly from the file system implementation: you really should not.
Last but not least, note that while Jackrabbit is running, nothing else (except for a Jackrabbit cluster setup) must write to the DB (the tables used by Jackrabbit), as this will easily lead to data corruption.
As @TedTrippin already mentioned above, an ORM framework would make things much easier. But if you really want to do it manually in Oracle, the approach would be:
Study the code of the OCM (http://jackrabbit.apache.org/jcr/object-content-mapping.html), then get the content according to the logic of its associations and relations from Oracle, probably not in one but in multiple queries per document; possibly with user-defined functions, which are supported in Oracle and might make things easier.
It would be interesting to know the background of your question. You tagged it with "Spring" and "CMS". I don't see any reason why you would want to access the data directly from Oracle; it's tedious. In case you want to provide an API for the content to an external system, or in case you have lost a CMS that was once in front of the repository and was just using it as a content store, you could still use such an ORM/OCM framework standalone to make it easier to access the data.

Accessing BizTalk tracking data

I want to take all BizTalk tracking data and move it into Elasticsearch.
Is there a way to access the data before it's put into the tracking database?
Or do I have to extract it from the database and then into Elasticsearch?
Can I use the BAM API for this?
You should not change any BizTalk Stored Procedures.
How current do you need the data?
Here are two thoughts:
Take the Archives and load them into another database for Elasticsearch.
Do option 1 and point Elasticsearch to the DTA database as well. Querying DTA separately is acceptable. Just be sure to not block BizTalk's write operations.
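As a rough illustration of the second thought, a small script can read tracked events from the DTA database (or an archive copy of it) and bulk-index them into Elasticsearch. This is only a sketch, assuming the Python `pyodbc` and `elasticsearch` packages; the view name `vw_TrackedMessages` and its columns are hypothetical placeholders for whatever read-only query fits your tracking schema.

```python
# Sketch: pull tracked events from a read-only view over the DTA / archive database
# and bulk-index them into Elasticsearch. The view `vw_TrackedMessages` and its
# columns are hypothetical; use whatever read-only query matches your tracking data,
# and avoid anything that blocks BizTalk's own write operations.
import pyodbc
from elasticsearch import Elasticsearch, helpers

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;"
    "DATABASE=BizTalkDTADb;Trusted_Connection=yes;"
)
es = Elasticsearch("http://localhost:9200")

def fetch_tracked_events(since):
    """Yield tracked events newer than `since` as dictionaries."""
    cur = conn.cursor()
    cur.execute(
        "SELECT MessageInstanceId, ServiceName, EventTimestamp "
        "FROM vw_TrackedMessages WHERE EventTimestamp > ?", since)
    columns = [c[0] for c in cur.description]
    for row in cur:
        yield dict(zip(columns, row))

def index_events(since):
    """Bulk-index the events into a tracking index in Elasticsearch."""
    actions = (
        {"_index": "biztalk-tracking",
         "_id": str(evt["MessageInstanceId"]),
         "_source": {**evt, "EventTimestamp": evt["EventTimestamp"].isoformat()}}
        for evt in fetch_tracked_events(since)
    )
    helpers.bulk(es, actions)
```

Running this against an archive copy, or during quiet periods, keeps the load away from BizTalk's own write path.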

How to query the session in ASP.NET MVC with a dynamic query

I want to store some user data in memory, like some in-memory noSQL database.
But later on I want to query that data with a dynamic query constructed from the user. That query is stored in a classic DB as a string, so when I need to query the data stored in memory I would like to parse that string and construct the desired query (by some known rules).
I looked at Redis and found out it isn't maintained for Windows anymore. I have also looked at RavenDB, but its main query language is LINQ, even though dynamic Lucene queries can be created.
Can you suggest another in-memory DB that works with ASP.NET and can be queried with a dynamically created query? Maybe I haven't seen all the options.
I prefer a name-value or JSON-based NoSQL store so its schema can be easily modified without the constraints of relational DBs.
I would suggest simply using SQLite. It can easily be used as an in-memory database (just open the database using ":memory:" instead of a file name).
You can use a simple two-column table with a primary key to emulate a key/value store.
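To illustrate the idea in a language-agnostic way, here is a minimal sketch using Python's built-in sqlite3 module (the same pattern applies from ASP.NET through an SQLite ADO.NET provider); the table and key names are illustrative:

```python
# Sketch: an in-memory SQLite database used as a simple key/value store.
# Values are stored as JSON strings so the "schema" of the value can change freely.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, gone when the connection closes
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)")

def put(key, obj):
    conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                 (key, json.dumps(obj)))

def get(key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None

put("user:42", {"name": "Ana", "roles": ["admin"]})
print(get("user:42"))  # {'name': 'Ana', 'roles': ['admin']}
```

Keep in mind that an in-memory SQLite database lives only as long as its connection, so in a web application the connection has to be kept open and shared for the data to survive across requests.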
Here are a few links you might find helpful:
http://www.sqlite.org/inmemorydb.html
How to create asp.net web application using sqlite

Incremental data import in Solr from MSSQL normalized table with complex joins

I am working on incremental data import into Solr from an existing normalized MSSQL database. I'm unable to decide on the strategy I need to implement, and I don't know whether there are existing tools for this, so that I don't need to reinvent the wheel.
I need to import documents into Solr 3.6 to build the Solr data, which is saved in MSSQL in a heavily normalized fashion. Retrieving the data for a single document requires many joins, which is killing performance. I have approx. 1 million such documents in the db, so a full import into Solr is not an option for me.
While deciding on the approach, I have two issues to consider:
Incremental data import, so that SQL Server isn't under heavy load while data is being fetched from the db.
Updating data that has changed in SQL Server into Solr once a day.
I am looking for your help in deciding on the strategy and tool for incremental data import into Solr. I think I have the following options:
Custom-develop an application to fetch data from MSSQL and pass it to Solr. I would need to keep track of which records have been inserted into Solr and which are pending. Also, about 2% of the records in MSSQL are updated on a daily basis, so I need to track what data has changed since then and update it in Solr again at some point in time.
Use an existing tool or utility in Solr, like DIH. I'm not sure how this addresses both issues, i.e. incremental data retrieval and tracking what data has changed in SQL Server. Again, I'm not sure how DIH will handle the complex joins required to fetch data from the db.
Or use something like LuSql with DIH, but I'm still not sure how it addresses both issues. LuSql does give the ability to do complex joins in the db, so I hope it might fit my purpose.
I'm in favor of using LuSql with DIH in Solr, if it fits the purpose, but I'm still not sure how it keeps track of what data has changed. Or do I have to manage that part manually, by maintaining the ids of the documents that changed and then supplying them to LuSql to fetch the data from SQL and import it into Solr?
I am also looking forward to your suggestions beyond these options for handling this kind of situation.
I will share with you the way I do this.
Mainly I have the same requirements, and until this week I used Solr data import with delta imports. I have a program that regularly updates a status for the new items from 0 to 1 and then calls Solr data import to get all the documents with status 1. The data import uses a stored procedure to join and get the documents with status 1 from the db. If the import finishes successfully, I then update the status to 2 and I know that these documents are in Solr. If a document gets changed, I simply change its status from 2 back to 0, and the next import updates the document in Solr.
Everything works fine for me using this process. I always get the new documents into Solr without having to fetch all the data from the database.
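A rough sketch of that status-flag workflow, assuming the Python `pyodbc` and `requests` packages; the table, column, core, and connection names are illustrative, and the DIH configuration is assumed to select rows with status 1 (shown here with a full-import and clean=false, but a delta-import command works the same way):

```python
# Sketch of the status-flag workflow described above:
#   0 = new/changed, 1 = being imported, 2 = indexed in Solr.
# The `documents` table, the `mycore` core, and the connection string are
# hypothetical; the DIH config is assumed to select rows WHERE status = 1.
import time
import pyodbc
import requests

SOLR_DIH = "http://localhost:8983/solr/mycore/dataimport"
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=sqlhost;DATABASE=appdb;Trusted_Connection=yes;")

def run_incremental_import():
    cur = conn.cursor()
    # Mark new/changed documents (status 0) as "being imported" (status 1).
    cur.execute("UPDATE documents SET status = 1 WHERE status = 0")
    conn.commit()

    # Trigger the DataImportHandler; its query picks up the status = 1 rows.
    requests.get(SOLR_DIH, params={"command": "full-import", "clean": "false",
                                   "commit": "true", "wt": "json"})

    # Poll DIH until it is no longer busy.
    while True:
        status = requests.get(SOLR_DIH,
                              params={"command": "status", "wt": "json"}).json()
        if status.get("status") != "busy":
            break
        time.sleep(5)

    # If the import succeeded, mark the rows as indexed (status 2).
    cur.execute("UPDATE documents SET status = 2 WHERE status = 1")
    conn.commit()
```

A real implementation should also inspect the DIH statusMessages for failures before flipping the rows to status 2.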
Now my requirements have changed, because we decided to keep only an archived copy of the data in the database, as we only need it in Solr. So I need a program that deserializes the data and then sends it to Solr.
My approach now is to add all the new/updated documents via the update handler, commit them once they have all been added, and, if the commit is successful, update the status in the database. I have no experience with this approach yet, so I don't know whether it will work, but I will just try it and see what happens.
In the past I researched a better way to do this but couldn't find anything, so if you find a better solution please share it with me.
Good luck :)
We had to index from a heavily normalized schema with 25+ tables, half of which contain over 5M records each; the largest has ~20M.
We use Informatica to load these records from Oracle to Solr. ETL tools like Informatica provide ways to join tables and query results outside the relational database. It has a sorter transformation to sort outside the database, an aggregator transformation to group records outside the db, and there is also a lookup transformation.
Essentially, our data is de-normalized in stages, and the loading/indexing process is distributed.
There are open-source ETL tools as well, of course, and Microsoft also offers an ETL tool.
Indexing into Solr happens through the update handler. Delta indexing is very similar to full indexing, with additional logic for change data capture. The ETL activity is scheduled.
