Scalable Elasticsearch module with Spring Data Elasticsearch possible?

I am working on designing a scalable Spring Boot service that indexes data into Elasticsearch.
Use case:
My application uses 6 MySQL databases with the same schema; each database caters to a specific region. I have a microservice that connects to all of these databases and indexes data from specific tables into an Elasticsearch server (v6.8.8) in a similar fashion, with 6 Elasticsearch indexes, one per database.
Quartz jobs and the RestHighLevelClient are used for this purpose. There are also delta jobs running every second that look for changes using audit data and index them.
Current problem:
The current design is not scalable: one service does all the work (data loading, mapping, bulk upserts). Because indexing is done through Quartz jobs, scaling the service (running multiple instances) would run the same job multiple times.
No failover: I am looking at distributed Elasticsearch nodes and indexing data to both nodes. How can this be done efficiently?
I am considering Spring Data Elasticsearch to index data at the same time it is persisted to the database (a rough sketch of what I have in mind is below).
Does it offer all the features? I use:
Elasticsearch administration, from installing templates to creating/deleting indexes and aliases.
Blue/green deployment: indexing to the non-active nodes and switching the aliases.
Bulk upserts, querying, aggregations, etc.
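Roughly, the dual write I have in mind with Spring Data Elasticsearch is sketched below (the entity, repository and index names are made up, the JPA Order entity and OrderRepository are assumed to already exist, and the classes would live in separate files):
import java.math.BigDecimal;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
@Document(indexName = "region1-orders", type = "_doc")
class OrderDocument {
    @Id
    private String id;
    private String customerName;
    private BigDecimal amount;
    // getters/setters omitted
}
interface OrderSearchRepository extends ElasticsearchRepository<OrderDocument, String> {
}
@Service
class OrderService {
    private final OrderRepository orderRepository;              // existing JPA repository (MySQL)
    private final OrderSearchRepository orderSearchRepository;  // Elasticsearch repository
    OrderService(OrderRepository orderRepository, OrderSearchRepository orderSearchRepository) {
        this.orderRepository = orderRepository;
        this.orderSearchRepository = orderSearchRepository;
    }
    @Transactional
    public Order save(Order order) {
        Order saved = orderRepository.save(order);       // persist to MySQL
        orderSearchRepository.save(toDocument(saved));    // index into Elasticsearch at the same time
        return saved;
    }
    private OrderDocument toDocument(Order order) {
        // map the JPA entity to the search document (omitted)
        return new OrderDocument();
    }
}
Whether templates, aliases and the blue/green switch can be managed just as conveniently through Spring Data Elasticsearch is the part I am unsure about.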
Any other solutions are welcome. Thanks for your time.

One of your use cases is to move data from the database (MySQL) to ES in a scalable manner. This is basically a CDC (change data capture) pipeline.
You can use the Kafka Connect framework for this.
The flow would be:
Read the MySQL transaction logs => publish the data to Kafka (this can be accomplished with the Debezium source connector).
Consume the data from Kafka => push it to Elasticsearch (this can be accomplished with the Elasticsearch sink connector; see the sketch at the end of this answer).
Why use the framework?
With the Connect framework, data can be read directly from the MySQL transaction logs without writing code.
The Connect framework is a distributed and scalable system.
It reduces the load on your database, as you no longer need to query it to detect changes.
It is easy to set up.
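As a rough illustration (the hostnames, credentials, database and table names below are all made up), a Debezium MySQL source connector is registered by POSTing its JSON config to the Kafka Connect REST API; the Elasticsearch sink connector is registered the same way with its own config:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
public class RegisterDebeziumSource {
    public static void main(String[] args) throws Exception {
        // Connector config: read the binlog of one regional MySQL database and publish changes to Kafka
        String sourceConnector = """
            {
              "name": "mysql-region1-source",
              "config": {
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname": "mysql-region1",
                "database.port": "3306",
                "database.user": "cdc_user",
                "database.password": "cdc_password",
                "database.server.id": "184054",
                "database.server.name": "region1",
                "table.include.list": "appdb.orders,appdb.customers",
                "database.history.kafka.bootstrap.servers": "kafka:9092",
                "database.history.kafka.topic": "schema-changes.region1"
              }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors"))   // Kafka Connect REST endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(sourceConnector))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
Because Connect distributes connector tasks across its workers, scaling becomes a matter of adding workers rather than running more copies of a custom indexing service.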

Related

Elasticsearch read data from Apache Hadoop

We are trying to introduce Elasticsearch into our big data environment. Currently we are running Apache Hadoop 2.7, including Hive and Spark. Data is stored in Parquet format in Hadoop.
When we implement ELK in our environment, can we keep the data only in Hadoop HDFS? Or must we extract the data from Hadoop and import it into Elasticsearch so it can be indexed, which means we would have a duplicate dataset in the system (Hadoop HDFS and Elasticsearch)?
Thank you.
Sorry if this is a bit long, but I hope it helps!
What Elasticsearch is
Elasticsearch is a search engine. Period. A search solution. Or rather, a type of database or datastore that organises your data in a way that helps you perform activities like data discovery or build search applications for your organisation.
Although you can also run a lot of analytical queries and build analytical solutions around it, there are certain limitations.
The nature of Elasticsearch and the data structures used in it are so different that you need to push (ingest) the data into it in order to perform search/data-discovery/analytical activities. It has its own file system and data structures that manage/store the data specifically for efficient searching.
So yes, there will be duplication of data.
What Elasticsearch is not
It is not to be used as an analytical solution; although it does come with a lot of aggregation queries, it is not as expressive as a processing engine like Apache Spark or data virtualisation tools like Denodo or Presto.
It is not a storage solution like HDFS or S3, to be used as a data lake for the organisation.
It is not to be used as a transactional database or a replacement for RDBMS solutions.
Most of the time, organisations ingest data into ES from various sources such as OLAP systems, RDBMSs, NoSQL databases, CMSs and messaging queues so that they can search the data more efficiently.
In other words, most of the time ES is not the primary datasource.
How organisations use it
Build search solutions: e.g. if you wish to build an e-commerce solution, you can have its search implementation managed by Elasticsearch.
Enterprise search solutions (internal and external) that make IT staff more productive and let customers find the required documentation, knowledge base articles, downloadable PDFs etc. for their products, e.g. installation docs, configuration docs, release docs, new-product docs. All the content is assembled from various sources in a company and pushed into ES so that it becomes searchable.
Ingest data, e.g. logs from application servers and messaging queues, in order to perform logging and monitoring activities (alerts, fraud analysis).
So the two most common usages of ES are search and logging/monitoring activities. Basically real-time activities.
How it differs from Hadoop
Organisations are increasingly leveraging Hadoop for its file system, i.e. HDFS, as a data store, while utilising Spark or Hive for data processing, mostly to build heavy analytical solutions for which ES has limitations.
Hadoop can store all file formats (of course, you need to make use of Parquet or other formats for efficient storage), whereas Elasticsearch only works with JSON. This makes Hadoop, along with S3 and other file systems, a default industry standard for data-lake or data-archival storage.
If you are storing data in Hadoop, you would probably use other frameworks such as Spark, Giraph or Hive to transform the data and do the complex analytical processing for which ES has limitations. ES at its core is a full-text retrieval/search engine.
Hadoop for search
You would need to run MapReduce or Spark jobs and write tons of pattern-matching code to find the documents or folders containing any keyword you want to search for. Every search would result in one such job, which would not be practical.
Even if you transform and organise the data so you can leverage Hive, it would still not be as efficient as Elasticsearch for text processing.
Here is a link that can help you understand the core data structure used in Elasticsearch and why text search is faster/different/more efficient.
How can we make use of Hadoop and Elasticsearch?
Perhaps the diagram mentioned in this link could be useful.
Basically, you can set up ingestion pipelines that process raw data from Hadoop, transform it and index it into Elasticsearch so that you can make use of its search capabilities.
Take a look at this link to understand how you can use Spark with Elasticsearch and achieve two-way communication between them.
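For example (the index name, HDFS path and filter below are made up, and the elasticsearch-hadoop / elasticsearch-spark dependency is assumed to be on the classpath), pushing transformed Parquet data from HDFS into an Elasticsearch index with Spark could look roughly like this:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
public class HdfsToElasticsearch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hdfs-to-es")
                .config("es.nodes", "es-node1:9200")   // where Elasticsearch is reachable
                .getOrCreate();
        // Read the raw Parquet data from HDFS and transform/filter it as needed
        Dataset<Row> purchases = spark.read().parquet("hdfs:///data/events")
                .filter("event_type = 'purchase'");
        // Index the result into Elasticsearch through the ES-Hadoop Spark SQL data source
        purchases.write()
                .format("org.elasticsearch.spark.sql")
                .mode(SaveMode.Append)
                .save("purchases/_doc");   // index/type to write into
        spark.stop();
    }
}
Reading works the other way around with the same data source, which is the two-way communication mentioned above.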
Hope that helps!

NiFi GetMongo fetches data forever

I have millions of records in MongoDB and I want to use NiFi to move data. Here is the scenario I want to run:
1) I will set up NiFi
2) NiFi will automatically fetch records in batches of 100 records.
3) Once that is done, it will only fetch new entries as they are added.
I tried this scenario with a small MongoDB collection (fetch from Mongo and store as a file) and saw that NiFi repeats the process forever and duplicates the records.
Here is the flow I created on NiFi:
Are there any suggestions to solve this problem?
Unfortunately, GetMongo doesn't have state-tracking capabilities. There are similar questions where I have explained this. You can find them here:
Apache NIFI Jon is not terminating automatically
Apache Niffi getMongo Processor

Cassandra for data warehouse

Is Cassandra a good alternative to Hadoop as a data warehouse, where data is append-only and updates in the source databases should not overwrite existing rows in the data warehouse but get appended? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (for final data storage). It depends more on what you want to do with the data.
You may even need both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide some complex aggregation report to the user.
First, you need to save the data as fast as possible (as new portions arrive very often), so you use Cassandra here. As Cassandra is limited in aggregation features, you load the data into HDFS and do some processing via HQL scripts (assuming you're not very good at coding but great at complicated SQL). Then you move the report results from HDFS to Cassandra, into a dedicated reports table partitioned by user id.
So when the user wants an aggregation report about his activity over the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (as it is a simple key-value lookup).
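As a small sketch of that last step (the keyspace, table and column names are invented, and the reports keyspace is assumed to exist), the reports table and the per-user lookup could look like this with the DataStax Java driver:
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
public class MonthlyReportLookup {
    public static void main(String[] args) {
        // Connects to localhost:9042 by default
        try (CqlSession session = CqlSession.builder().build()) {
            // One partition per user: fetching a user's monthly report is a simple key lookup
            session.execute(
                "CREATE TABLE IF NOT EXISTS reports.monthly_activity ("
              + " user_id uuid, month text,"
              + " total_events bigint, total_duration_min bigint,"
              + " PRIMARY KEY (user_id, month))");
            Row row = session.execute(
                "SELECT total_events, total_duration_min FROM reports.monthly_activity"
              + " WHERE user_id = 123e4567-e89b-12d3-a456-426614174000 AND month = '2020-04'").one();
            if (row != null) {
                System.out.println(row.getLong("total_events") + " events, "
                        + row.getLong("total_duration_min") + " minutes");
            }
        }
    }
}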
So, to your question: yes, it could be an alternative, but the choice depends on the data types and your application's business cases.
You can read more information about the usage of Cassandra here.

Joining Oracle Table Data with MongoDB Collection

I have a reporting framework to build and generate reports (tabular-format reports). Until now I would write a SQL query and it would fetch data from Oracle. Now I have an interesting challenge where half of the data will come from Oracle and the remaining data comes from MongoDB, based on the output of the Oracle data. The tabular data fetched from Oracle will have one additional column containing the key used to fetch data from MongoDB. With this I will have two datasets in tabular format, one from Oracle and one from MongoDB. Based on one common column, I need to merge both tables and produce one dataset for the report.
I can write the logic in Java code to merge the two tables (say, as data in a 2D-array format). But instead of doing this on my own, I am thinking of utilizing some RDBMS in-memory concept, for example an H2 database, where I can create the two tables in memory on the fly and run H2 queries to merge them. Or, I believe, there could be something in Oracle too, like a global temporary table. Could someone please suggest a better approach to join Oracle table data with a MongoDB collection?
I think you can try using Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can create a Kafka broker and a topic. Then change the existing services that save to Oracle and MongoDB: create 2 Kafka producers (one for Oracle and another for Mongo) to write the data as streams to Kafka topics. Then create a consumer group to receive the streams from Kafka. You can then aggregate the real-time streams using a Spark cluster (see the Spark Streaming API for Kafka) and save the results back to MongoDB (using the MongoDB Spark Connector) or any other distributed database. Then you can do data visualization/reporting on those results stored in MongoDB.
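A rough sketch of the streaming join with Spark Structured Streaming (the topic names, JSON fields and join key are invented; writing the joined result to MongoDB with the MongoDB Spark Connector would replace the console sink used here):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.get_json_object;
public class ReportJoinStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("report-join").getOrCreate();
        // One stream per Kafka topic: rows produced from Oracle and documents produced from MongoDB
        Dataset<Row> oracleRows = spark.readStream().format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")
                .option("subscribe", "oracle-report-rows").load()
                .selectExpr("CAST(value AS STRING) AS json");
        Dataset<Row> mongoRows = spark.readStream().format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")
                .option("subscribe", "mongo-report-details").load()
                .selectExpr("CAST(value AS STRING) AS json");
        // Pull the shared key and the needed fields out of the JSON payloads
        Dataset<Row> oracleParsed = oracleRows.select(
                get_json_object(col("json"), "$.detail_key").alias("detail_key"),
                get_json_object(col("json"), "$.amount").alias("amount"));
        Dataset<Row> mongoParsed = mongoRows.select(
                get_json_object(col("json"), "$.detail_key").alias("detail_key"),
                get_json_object(col("json"), "$.details").alias("details"));
        // Inner stream-stream join on the common column
        Dataset<Row> joined = oracleParsed.join(mongoParsed, "detail_key");
        joined.writeStream().format("console").outputMode("append").start().awaitTermination();
    }
}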
Another suggestion would be to use Apache Drill: https://drill.apache.org
You can use the Mongo and JDBC storage plugins, and then join Oracle tables and Mongo collections together.
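A hedged sketch of that join through Drill's JDBC driver (the storage plugin names, schemas, tables and columns are invented and would have to match your Drill configuration):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class DrillCrossSourceJoin {
    public static void main(String[] args) throws Exception {
        // "oracle" is a JDBC storage plugin pointing at Oracle, "mongo" is the MongoDB plugin
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT o.report_id, o.amount, m.details"
               + " FROM oracle.report_schema.report_rows o"
               + " JOIN mongo.reporting.report_details m"
               + " ON o.detail_key = m.detail_key")) {
            while (rs.next()) {
                System.out.println(rs.getString("report_id") + " " + rs.getString("details"));
            }
        }
    }
}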

Cassandra aggregate to Map

I am new to Cassandra; I've mainly been using Hive for the past several months. Recently I started a project where I need to do some of the things I did in Hive with Cassandra instead.
Essentially, I am trying to find a way to aggregate multiple rows into a single map at query time.
In Hive, I simply do a group by with a "map" aggregate. Does a way exist in Cassandra to do something similar?
Here is an example of a working Hive query that does what I am looking to do:
select
    map(
        "quantity", count(caseid),
        "title", casesubcat,
        "id", casesubcatid,
        "category", named_struct("id", casecatid, 'title', casecat)
    ) as casedata
from caselist
group by named_struct("id", casecatid, 'title', casecat), casesubcat, casesubcatid
Mapping query results to a Map (or some other type/structure/class of your choice) is the responsibility of the client application and is usually a trivial task (but you didn't specify in what context this map is going to be used).
The actual question here is about GROUP BY in Cassandra. This is not supported out of the box. You can check Cassandra's standard aggregate functions or try creating a user-defined function, but the Cassandra way is knowing your queries in advance, designing your schema accordingly, doing the heavy lifting in the write phase and keeping the querying simple. Thus grouping/aggregation can often be achieved by using dedicated counter tables.
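For instance, a dedicated counter table for the case counts in the question might look like the sketch below (keyspace, table and column names are illustrative; DataStax Java driver). Each write increments the right bucket, so the grouping work is already done by the time you query:
import com.datastax.oss.driver.api.core.CqlSession;
public class CaseCounters {
    public static void main(String[] args) {
        // Connects to localhost:9042 by default; the cases keyspace is assumed to exist
        try (CqlSession session = CqlSession.builder().build()) {
            // In a counter table every non-key column must be a counter
            session.execute(
                "CREATE TABLE IF NOT EXISTS cases.case_counts ("
              + " casecatid int, casesubcatid int, quantity counter,"
              + " PRIMARY KEY (casecatid, casesubcatid))");
            // Done on every write instead of aggregating at query time
            session.execute(
                "UPDATE cases.case_counts SET quantity = quantity + 1"
              + " WHERE casecatid = 3 AND casesubcatid = 17");
            // Reading all sub-category counts for one category is a single-partition query
            session.execute("SELECT casesubcatid, quantity FROM cases.case_counts WHERE casecatid = 3")
                   .forEach(row -> System.out.println(row.getInt("casesubcatid") + " -> " + row.getLong("quantity")));
        }
    }
}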
Another option is to do the data processing in an additional layer (Apache Spark, for example). Have you considered using Hive on top of Cassandra?
