Elasticsearch to index RDBMS data - elasticsearch

These are three simple questions which was surprisingly hard to find definite answers.
Does ElasticSearch support indexing data in RDBMS tables ( Oracle/SQLServer/Informix) out of the box?
If yes, can you please point me to documentation on how to do it
If not, what are alternate ways (plugins like Rivers - deprecated) with good reputation

I'm surprised there isn't any solid answer as yet for this. So here's the solution. Logstash directly gives us the ability to push data from a RDBMS into Elasticsearch.
Here's a link to a tutorial which tell you how to go about it. Briefly(all details in link 1), you simply need a JDBC driver for the relational database you'll be using (Postgres, MySQL etc) and make a config file specifying your input as the Relational Database and your output as Elasticsearch. You can also specify a cron which would allow you to keep updating one regular intervals.
Here's the article which mentions the configuration and gets you started (See Example 2): https://www.elastic.co/blog/logstash-jdbc-input-plugin
Here's the article which tells you how to configure the Cronjob as such: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_scheduling

Related

Using ElasticSearch as a permanent storage

Recently I am working on a project which is producing a huge amount of data every day, in this project, there are two functionalities, one is storing data into Hbase for future analysis, and second one is pushing data into ElasticSearch for monitoring.
As the data is huge, we should store data into two platforms(Hbase,Elasticsearch)!
I have no experience in both of them. I want no know is it possible to use elasticsearch instead of hbase as a persistence storage for future analytics?
I recommend you reading this old but still valid article : https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind, Elasticsearch is only a search engine. But it depends if your data are critical or if you can accept to lose some of them like non critical logs.
If you don't want to use an additionnal database with huge large data, you probably can store them into files in something like HDFS.
You should also check Phoenix https://phoenix.apache.org/ which may provide the monitoring features that you are looking for

Apache Nifi - Federated Search

My team’s been thrown into the deep end and have been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individuals (and no matching identifiers) and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database which is then queried for use in an Elasticsearch instance for the applications use.
So roughly speaking something like this:-
For examples sake the following data then exists in the result database from the first flow :-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

Second flow would then query the result database and feed this result into Elasticsearch instance for use by the applications API for querying which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run on the merged content was pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

How to retrieve the data from database without using apache jackrabbit datastore?

I have integrated the jack rabbit with Oracle database and I am storing the
Data using Jackrabbit, if I don't want to retrieve the data using the
Jackrabbit, in what way I can get the data. In database data is storing in
blob type.
The way Jackrabbit stores the data in the DB is an implementation detail, and it does not magically map this into a "nice" DB schema if that's what you mean. (The hierarchical nature and all the JCR features make this impossible). It's a bit like having a Unix file system and then asking how can I read the low level inodes etc. from the file system implementation - you really should not.
Last but not least note that while it is running nothing else (except for a Jackrabbit cluster setup) must write to the DB (the tables used by Jackrabbit) as this will easily lead to data corruption.
As #TedTrippin already mentioned above, an ORM framework would make things much easier. But if you really want to do it manually in Oracle, the approach would be:
Study the code of the OCM http://jackrabbit.apache.org/jcr/object-content-mapping.html, then get the content according to the logic of associations and relations from Oracle, probably not in one but multiple queries per document; eventually with user-defined functions, which are supported in Oracle and might make things easier.
Would be interesting to know the background of your questions. You tagged it with "Spring" and "CMS". I don't see any reason why you would want to access the data directly from Oracle, it's tedious. In case you want to provide an API for the content to an external system, or in case you have lost a CMS that was once in front of and just using the Jackrabbit repo as a content store, you could still use such ORM / OCM framework standalone to make it easier to access the data.

Suggestion technology/design on this BIGDATA usecase

I am new to big data technologies and design so looking for help from java world.
I have concept of tags and tagcombinations.
For example U.S.A and Pen are two tags AND if they come together in some definition then register a tagcombination(U.S.A-Pen) for that..
tags (U.S.A, Pen, Pencil, India, Shampoo)
tagcombinations(U.S.A-Pen, India-pencil, U.S.A-Pencil, India-Pen, India-Pen-Shampoo)
millions of tags
billions of tagcombinations
one tagcombination generally have 2-8 tags....
Every day we get lakhs of new tagcombinations to write
daily crores of queries to find matching combination by set of tags
Query need to support :
one tag or set of tags appears in how many tagcombinationids ????
If i query for Pen,India then it should return two tagcombinaions (India-Pen, India-Pen-Shampoo))..Query will be fired by application in realtime.
Please suggest a solution which is distributed with java client and can
handle scale of data i am looking for..
Already tried on cassandra but not able toconclude that as right match for my problem..
Thanks
Naresh
I suggest you look into Apache Lucene project:
http://lucene.apache.org/
You won't be able to use Cassandra directly for this but if you store your data inside Cassandra, you can use Solr to add extra indexes on top of your data. DataStax has a bundle solution called DataStax Enterprise that has Cassandra/Solr together:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise

Hbase / Hadoop Query Help

I'm working on a project with a friend that will utilize Hbase to store it's data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResult's when, in SQL land, I could write a simple query. Am I missing something? Or is Hbase missing something?
I think you, like many of us, are making the mistake of treating bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means storing, ideally, many-to-one relationships within a single row, for example. Your queries should return very few rows but contain (potentially) many datapoints.
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and Hbase and as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered JDBC compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC which seem more like what I was was after. (Personally, I haven't got farther with either of these other than reading the documentation so I can't tell which of them is any good, if any.)
I'd recommend taking a look at Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) which implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like a RDBMS. So often in fact that I've had to re-write code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their db api, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled with via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out cloudera's pig tutorial and hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
Regards,
David

Resources