Best ways to import data (initial load) from Oracle to Elastic Search - elasticsearch

I am working on a project where I have two big tables (parent and child) in Oracle. One is having 65 Million and the other 80 Million records. In total, data from 10 columns are required from these tables and saved as one document into Elastic search. The load for two tables can be done separately also. What are two comparable options to move data (one time load) from these tables into Elastic search and out of the two which one would you recommend? The requirement is that it should be Fast and simple so that it can not only be used for one time data load but also be used in case there is a failure and the elastic search index needs to be created again from scratch.

As already suggested one option may be logstash: the advantage of logstash is simplicity, but it can be complicated to monitor and it can be difficult to configure if you have to transform some field during the ingestion.
One alternative can be nifi: it offers jdbc and elasticsearch plugin, and you can monitor, start and stop the ingestion directly with the web interface. It is possible with nifi to build a more complex and robust pipeline: handling exceptions, translating data types and performing data enrichment.

Related

Filter result in memory to search in elasticsearch from multiple indexes

I have 2 indexes and they both have one common field (basically relationship).
Now as elastic search is not giving filters from multiple indexes, should we store them in memory in variable and filter them in node.js (which basically means that my application itself is working as a database server now).
We previously were using MongoDB which is also a NoSQL DB but we were able to manage it through aggregate queries but seems the elastic search is not providing that.
So even if we use both databases combined, we have to store results of them somewhere to further filter data from them as we are giving users advanced search functionality where they are able to filter data from multiple collections.
So should we store results in memory to filter data further? We are currently giving advanced search in 100 million records to customers but that was not having the advanced text search that elastic search provides, now we are planning to provide elastic search text search to customers.
What do you suggest should we use the approach here to make MongoDB and elastic search together? We are using node.js to serve data.
Or which option to choose from
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all the simplest way to solve this problem is to simply make one combined index. Its very trivial to do this. Just insert all the documents from index 2 into index 1. Prefix all fields coming from index-2 by some prefix like "idx2". That way you won't overwrite any similar fields. You can use an ingestion pipeline to do this, or just do it client side. You only will ever do this once.
After that you can perform aggregations on the single index, since you have all the data in one-index.
If you are using somehting other than ES as your primary data-store you need to reconfigure the indexing operation to redirect everything that was earlier going into index-2 to go into index-1 as well(with the prefixed terms).
100 million records is trivial for something like ELasticsearch. Doing anykind of "joins" client side is NOT RECOMMENDED, as this will obviate the entire value of using ES.
If you need any further help on executing this, feel free to contact me. I have 11 years exp in ES. And I have seen people struggle with "joins" for 99% of the time. :)
The first thing to do when coming from MySQL/PostGres or even Mongodb is to restructure the indices to suit the needs of data-querying. Never try to work with multiple indices, ES is not built for that.
HTH.

Is is more efficient to query multiple ElasticSearch indices at once or one big index

I have an ElasticSearch cluster and my system handles events coming from an API.
Each event is a document stored in an index and I create a new index per source (the company calling the API). Sources come and go, so I have new sources every week and most sources become inactive after a few weeks. Each source send between 100k and 10M new events every day.
Right now my indices are named api-events-sourcename
The documents contain a datetime field and most of my queries look like "fetch the data for that source between those dates.
I frequently use Kibana and I have configured a filter that matches all my indices (api-events-*) at once, and I then add terms to filter a specific source and specific days.
My requests can be slow at times and they tend to slow down the ingestion of new data.
Given that workflow, should I see any performance benefits to create an index per source and per day, instead of the index per source only that I use today ?
Are there other easy tricks to avoid putting to much strain on the cluster ?
Thanks!

Is there a way to import data (csv data) to the winlogbeat kibana dashboard?

I had just started learning about ElasticSearch and Kibana. I created a Winlogbeat dashboard where the logs are working fine. I want to import additional data (CSV data) which I created using Python. I tried uploading the CSV file but I am only allowed to create a separate index and not merge it with the Winlogbeat data. Does anyone know how to do this?
Thanks in advance
In many use cases, you don't need to actually combine into a single index. Here's a few ways you can show combined data, in approximate order of complexity:
Straightforward methods, using separate indices:
Use multiple charts on a dashboard
Use multiple indices in a single chart
More complex methods that combine data into a single index:
Pivot indices using Data Transforms
Combine at ingest-time
Roll your own
Use multiple charts on a dashboard
This is the simplest way: ingest your data into separate indices, make separate visualizations for them, then add those visualisations to one dashboard. If the events are time-based, this simple approach could be all you need.
Use multiple indices in a single chart
Lens, TSVB and Timelion can all use multiple data sources. (Vega can too, but that's playing on hard mode)
Here's an official Elastic video about how to do it in Lens: youtube
Create pivot indices using Data Transforms
You can use Elasticsearch's Data Transforms functionality to fetch, combine and aggregate your disparate data sources into a combined data structure which is then available for querying with Kibana. The official tutorial on Transforming the eCommerce sample data is a good place to learn more.
Combine at ingest-time
If you have (or can add) Logstash in the mix, you have several options for combining datasets during the filter phase of your pipelines:
Using a file-based lookup table and the translate filter plugin
By waiting for related documents to come in then outputting a combined document to Elasticsearch with the aggregate filter plugin
Using external lookups with filter plugins like elasticsearch or http
Executing arbitrary ruby code using the ruby filter plugin
Roll your own
If you're generating the CSV file with a Python program, you might want to think about incorporating the python Elasticsearch DSL lib to run queries on the winlogbeat data, then ingest it in its combined state (whether via a CSV or other means).
Basically, Winlogbeat is a data shipper to Elasticsearch. Which ships windows specific data to an index named winlogbeat with a specific schema and document structure.
You can't merge another document with a different schema into winlogbeat index.
If your goal is to correlate different data points. Please use Time-series visual builder to overlay two different datasets to visualize.

ElasticSearch vs Relational Database

I'm creating a microservice to handle the contacts that are created in the software. I'll need to create contacts and also search if a contact exists based on some information (name, last name, email, phone number). The idea is the following:
A customer calls, if it doesn't exist we create the contact asking all his personal information. The second time he calls, we will search coincidences by name, last name, email, to detect that the contact already exists in our DB.
What I thought is to use a MongoDB as primary storage and use ElasticSearch to perform the query, but I don't know if there is really a big difference between this and querying in a common relational database.
EDIT: Imagine a call center that is getting calls all the time from mostly different people, and we want to search fast (by name, email, last name) if that person it's in our DB, wouldn't ElasticSearch be good for this?
A relational database can store data and also index it.
A search engine can index data but also store it.
Relational databases are better in read-what-was-just-written performance. Search engines are better at really quick search with additional tricks like all kinds of normalization: lowercase, รค->a or ae, prefix matches, ngram matches (if indexed respectively). Whether its 1 million or 10 million entries in the store is not the big deal nowadays, but what is your query load? Well, there are only this many service center workers, so your query load is likely far less than 1qps. No problem for a relational DB at all. The search engine would start to make sense if you want some normalization, as described above, or you start indexing free text comments, descriptions of customers.
If you don't have a problem with performance, then keep it simple and use 1 single datastore (maybe with some caching in your application).
Elasticsearch is not meant to be a primary datastore so my advice is to use a simple relational database like Postgres and use simple SQL queries / a ORM mapper. If the dataset is not really large it should be fast enough.
When you have performance issues on searches you can use a combination of relation db and Elasticsearch. You can use Elasticsearch feeders to update ES with your data in you relational db.
Indexed RDBMS works well for search
If your data is structured i.e. columns are clearly defined, searching 1 million records will also not be a problem in RDBMS.
When to use Elastic
Text Search: Searching words across multiple properties (e.g. description, name etc.)
JSON Store and search: If data being stored is in json format and later needs to be searched
Auto Suggestions: Elastic is better at providing autocomplete suggestions
Elastic as an application data provider
Elastic should not be seen as data store, even if you storing data in it. It is about how you perceive elastic. Elastic should be used to store and setup data for the application. It is the application which decides how and when to use elastic (search and suggestions). Elastic is not a nosql storage alternative if compared to RDBMS, you should use a nosql database instead.
This perception puts elastic in line with redis and kafka. These tools are key components of an application design and they are used to serve as events stores, search engines and cache etc. to the applications.
Database with Elastic
Your design should use both. For storing the contacts use the database, index the contacts for querying. Also make the data available in elastic for searching, autocomplete and related matches.
As always, it depends on your specific use case. You briefly described it, but how are you acually going to use the data?
If it's just something simple like checking if a customer exists and then creating a new customer, then use the RDMS option. Moreover, if you don't expect a large dataset, so that scaling isn't an issue (hence the designation that Elasticsearch is for BigData), but you have transactions and data integrity is important, then a RDMS will be the right fit. Some examples could be for tax, leasing, or financial reporting systems.
However, if you have a large dataset, you need a wide range of query capabilities, such as a fuzzy search or searches where the user
can select multiple filters on the data or you want to do some predictive analysis on the data, then Elasticsearch is the clear choice.
For example, I worked on an web based app with a large customer base: 11 million, with 200+ hits per second at peak time for a find a doctor application. The customer could check some checkboxes to determine, specialty, spoken languages, ratings, hospitals, etc. all sorted by the distance from the users location with a 2 second or less response time. It would be very difficult for a RDMS to match that.

MongoDB Vs Oracle for Real time search

I am building an application where i am tracking user activity changes and showing the activity logs to the users. Here are a few points :
Insert 100 million records per day.
These records to be indexed and available in search results immediately(within a few seconds).
Users can filter records on any of the 10 fields that are exposed.
I think both Mongo and Oracle will not accomplish what you need. I would recommend offloading the search component from your primary data store, maybe something like ElasticSearch:
http://www.elasticsearch.org/
My recommendation is ElasticSearch as your primary use-case is "filter" (Facets in ElasticSearch) and search. Is it written to scale-up (otherwise Lucene is also good) and keeping big data in mind.
100 million records a day sounds like you would need a rapidly growing server farm to store the data. I am not familiar with how Oracle would distribute these data, but with MongoDB, you would need to shard your data based on the fields that your search queries are using (including the 10 fields for filtering). If you search only by shard key, MongoDB is intelligent enough to only hit the machines that contain the correct shard, so it would be like querying a small database on one machine to get what you need back. In addition, if the shard keys can fit into the memory of each machine in your cluster, and are indexed with MongoDB's btree indexing, then your queries would be pretty instant.

Resources