Elasticsearch Best Practices Flow - elasticsearch

I am using Elasticsearch for product filtering. We have complex product-availability logic, and I can see two options:
Use Elasticsearch to store only product-specific data and keep the availability logic in the web server: we first filter data from Elasticsearch, then check the availability conditions on that result set.
Or we can flatten the data and store it in Elasticsearch, though in that case there will be duplicated data.
My concern is whether it is good practice to call the Elasticsearch endpoint from the browser, as it has no auth system by default and every query and response is visible in the network log. I believe the call should be made from the web server to Elasticsearch, with the front end unaware that Elasticsearch even exists.
Any best-practice insight would be helpful.

Simply create an authenticated endpoint in your backend and send the queries to that endpoint. Do make sure there are some enforced limits (a sketch enforcing both follows the list), such as
size -- you don't want to let anybody download your whole index, and
aggregation depth -- you don't want anyone to perform summaries over your whole index/indices to gain a competitive advantage.
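A minimal sketch of such an endpoint, assuming Express and the official v8 JavaScript client (the route, index name, and limit values are illustrative, and `requireAuth` stands in for whatever auth middleware you use):

```ts
import express from 'express'
import { Client } from '@elastic/elasticsearch'

const app = express()
app.use(express.json())
const es = new Client({ node: 'http://localhost:9200' })

const MAX_SIZE = 50        // nobody pages through the whole index
const MAX_AGG_DEPTH = 2    // no deep summaries over the whole index

// Depth of nested "aggs"/"aggregations" keys in a request body.
function aggDepth(node: any): number {
  if (node === null || typeof node !== 'object') return 0
  let depth = 0
  for (const [key, value] of Object.entries(node)) {
    const here = key === 'aggs' || key === 'aggregations' ? 1 : 0
    depth = Math.max(depth, here + aggDepth(value))
  }
  return depth
}

// `requireAuth` is a placeholder for your auth middleware.
app.post('/api/products/search', /* requireAuth, */ async (req, res) => {
  const size = Math.min(Number(req.body.size) || 10, MAX_SIZE)
  if (aggDepth(req.body) > MAX_AGG_DEPTH) {
    return res.status(400).json({ error: 'aggregation too deep' })
  }
  const result = await es.search({ index: 'products', query: req.body.query, size })
  res.json(result.hits)
})

app.listen(3000)
```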
Regarding the duplicates: I wouldn't worry too much about the storage aspect (many NoSQL approaches have some duplication to facilitate fast queries), but keep in mind that aggregations might yield "wrong" counts and sums. You'd typically run those aggregations to get, say, the totals per product category, and you want to make sure they are representative of your warehouse state.
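For instance, a category-totals aggregation might look like the sketch below (official JavaScript client; index and field names are assumptions). With flattened, duplicated documents, `doc_count` counts documents rather than products, so a `cardinality` sub-aggregation on a unique product id is one way to keep the numbers representative:

```ts
import { Client } from '@elastic/elasticsearch'

const es = new Client({ node: 'http://localhost:9200' })

async function categoryTotals() {
  const res = await es.search({
    index: 'products',   // hypothetical index
    size: 0,
    aggs: {
      per_category: {
        terms: { field: 'category' },
        aggs: {
          // approximate distinct count of products, ignoring duplicates
          distinct_products: { cardinality: { field: 'product_id' } }
        }
      }
    }
  })
  return res.aggregations
}
```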
More cannot really be said right now based on the limited information you've provided.


Using elastic search for a UI dashboard behind a proxy

I am working on a search dashboard with full-text search capabilities, backed by ES. The search would initially be consumed by a UI dashboard. I am planning to have an application web service (WS) API layer between the UI dashboard and ES which will route the business search to ES.
There can be multiple clients to the WS going forward, each with its own business use cases and complex data requirements (basically response fields). There are many entities and a huge number of fields across them. Each client would need to specify which entities it wants returned and with which fields.
To support this dynamically changing requirement, one approach could be to have the WS be a pass-through to ES (with pre-validations like access control and post-transformations of the response from ES). The WS APIs would look exactly like the ES APIs; the UI would build ES queries through the JS client and send them to the WS, which after access control will get the data from ES.
I am new to ES and skeptical of this approach. Can there be any particular challenges with it? One of my colleagues has worked with ES before, but always with a backend Java client, so he's not too sure either.
I looked up an ES JS client and there's an official one here.
Some context here:
We have around 4 different entities (this can increase in future) with both full-text and keyword-type fields. A typical search could have multiple filters and search terms and would want to specify the result fields. Also, some searches would be across entities and some against individual ones. We are maintaining a separate index for each entity.
What I understand from your post is that, at a high level, you want to achieve the following:
"There can be multiple clients to WS going forward, each with its own business use cases, and complex data requirements (basically response fields)"
And as you are not sure how to do this, you are thinking of building the Elasticsearch queries from JavaScript in your front end only. I am not a big fan of this approach, as it exposes how you are building your queries; if a hacker learns crucial information like the below, they can bring your entire ES cluster to its knees:
What types of wildcard queries you run (expensive wildcards can overwhelm a cluster).
Your index names and ES cluster details (you may have access control, but you are still exposing crucial info).
How you are building your search queries.
These are just a few examples; the list could go on.
Right approach
As you already have a backend where you check access, build the Elasticsearch queries there; you also have the advantage of teammates who know how to do it.
For building a complex set of response fields, you can use source filtering, with which you can specify in your search request which fields you want returned in your search results.
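As a hedged sketch with the official JavaScript client (the index name, field list, and searched fields are placeholders), the WS could accept a field list from each client and map it onto `_source` filtering:

```ts
import { Client } from '@elastic/elasticsearch'

const es = new Client({ node: 'http://localhost:9200' })

// Each WS client passes the fields it needs; ES returns only those fields.
async function searchEntities(fields: string[], term: string) {
  return es.search({
    index: 'entities',          // hypothetical index name
    _source: fields,            // e.g. ['name', 'status', 'created_at']
    query: {
      multi_match: { query: term, fields: ['name', 'description'] }
    }
  })
}
```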

ElasticSearch vs Relational Database

I'm creating a microservice to handle the contacts that are created in the software. I'll need to create contacts and also search whether a contact exists based on some information (name, last name, email, phone number). The idea is the following:
A customer calls; if they don't exist yet, we create the contact, asking for all their personal information. The second time they call, we search for matches by name, last name, and email to detect that the contact already exists in our DB.
What I thought of is using MongoDB as primary storage and Elasticsearch to perform the queries, but I don't know if there is really a big difference between this and querying a common relational database.
EDIT: Imagine a call center that is getting calls all the time from mostly different people, and we want to search fast (by name, email, last name) whether that person is in our DB. Wouldn't Elasticsearch be good for this?
A relational database can store data and also index it.
A search engine can index data but also store it.
Relational databases are better at read-what-was-just-written performance. Search engines are better at really quick search, with additional tricks like all kinds of normalization: lowercasing, ä -> a or ae, prefix matches, ngram matches (if indexed accordingly). Whether it's 1 million or 10 million entries in the store is not a big deal nowadays; the question is your query load. Well, there are only so many service-center workers, so your query load is likely far less than 1 qps: no problem for a relational DB at all. The search engine would start to make sense if you want some of the normalization described above, or if you start indexing free-text comments and descriptions of customers.
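If you do go the search-engine route, that normalization is configured at index-creation time. A minimal sketch with the official JavaScript client (the index name, field names, and ngram sizes are illustrative assumptions, not a prescription):

```ts
import { Client } from '@elastic/elasticsearch'

const es = new Client({ node: 'http://localhost:9200' })

// Lowercasing, ascii folding (ä -> a), and edge-ngram prefix matching.
async function createContactsIndex() {
  await es.indices.create({
    index: 'contacts',
    settings: {
      analysis: {
        filter: {
          edge_prefix: { type: 'edge_ngram', min_gram: 2, max_gram: 15 }
        },
        analyzer: {
          folded: {
            type: 'custom',
            tokenizer: 'standard',
            filter: ['lowercase', 'asciifolding']
          },
          folded_prefix: {
            type: 'custom',
            tokenizer: 'standard',
            filter: ['lowercase', 'asciifolding', 'edge_prefix']
          }
        }
      }
    },
    mappings: {
      properties: {
        // index prefixes, but analyze the search input without ngrams
        name: { type: 'text', analyzer: 'folded_prefix', search_analyzer: 'folded' },
        last_name: { type: 'text', analyzer: 'folded_prefix', search_analyzer: 'folded' },
        email: { type: 'keyword' }
      }
    }
  })
}
```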
If you don't have a problem with performance, then keep it simple and use 1 single datastore (maybe with some caching in your application).
Elasticsearch is not meant to be a primary datastore, so my advice is to use a simple relational database like Postgres and simple SQL queries / an ORM mapper. If the dataset is not really large, it should be fast enough.
When you have performance issues with searches, you can use a combination of a relational DB and Elasticsearch. You can use Elasticsearch feeders to update ES with the data in your relational DB.
Indexed RDBMS works well for search
If your data is structured, i.e. the columns are clearly defined, searching 1 million records will not be a problem in an RDBMS either.
When to use Elastic
Text Search: searching words across multiple properties (e.g. description, name, etc.)
JSON Store and search: if the data being stored is in JSON format and later needs to be searched
Auto Suggestions: Elastic is better at providing autocomplete suggestions (see the sketch after this list)
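As an illustration of the autocomplete point, a completion-suggester query might look like this (assuming the v8 JavaScript client and a `name_suggest` field mapped as type `completion`; both names are hypothetical):

```ts
import { Client } from '@elastic/elasticsearch'

const es = new Client({ node: 'http://localhost:9200' })

// Return up to five name completions for what the agent has typed so far.
async function suggestNames(prefix: string) {
  const res = await es.search({
    index: 'contacts',
    suggest: {
      names: {
        prefix,
        completion: { field: 'name_suggest', size: 5 }
      }
    }
  })
  return res.suggest
}
```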
Elastic as an application data provider
Elastic should not be seen as a data store, even if you are storing data in it. It is about how you perceive Elastic: it should be used to store and set up data for the application, and it is the application that decides how and when to use Elastic (search and suggestions). Elastic is not a NoSQL storage alternative to an RDBMS; you should use a NoSQL database for that instead.
This perception puts Elastic in line with Redis and Kafka. These tools are key components of an application design, serving applications as event stores, search engines, caches, etc.
Database with Elastic
Your design should use both: store the contacts in the database and index them for querying, and also make the data available in Elastic for searching, autocomplete, and related matches.
As always, it depends on your specific use case. You briefly described it, but how are you actually going to use the data?
If it's just something simple like checking whether a customer exists and then creating a new customer, then use the RDBMS option. Moreover, if you don't expect a large dataset, so that scaling isn't an issue (hence the designation that Elasticsearch is for big data), but you have transactions and data integrity is important, then an RDBMS will be the right fit. Some examples could be tax, leasing, or financial reporting systems.
However, if you have a large dataset and you need a wide range of query capabilities, such as fuzzy search or searches where the user can select multiple filters on the data, or you want to do some predictive analysis on the data, then Elasticsearch is the clear choice.
For example, I worked on a web-based app with a large customer base: 11 million users, with 200+ hits per second at peak time, for a find-a-doctor application. The customer could check some checkboxes to filter by specialty, spoken languages, ratings, hospitals, etc., all sorted by distance from the user's location with a 2-second-or-less response time. It would be very difficult for an RDBMS to match that.
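That kind of search maps naturally onto a filtered, geo-sorted query. A hedged sketch with the official JavaScript client (the index name and fields such as `specialty`, `languages`, and the `location` geo_point are assumptions, not details from the original app):

```ts
import { Client } from '@elastic/elasticsearch'

const es = new Client({ node: 'http://localhost:9200' })

// Checkbox filters plus distance sort from the user's location.
async function findDoctors(userLat: number, userLon: number) {
  return es.search({
    index: 'doctors',
    query: {
      bool: {
        filter: [
          { term: { specialty: 'cardiology' } },
          { terms: { languages: ['en', 'es'] } }
        ]
      }
    },
    sort: [
      {
        _geo_distance: {
          location: { lat: userLat, lon: userLon },
          order: 'asc',
          unit: 'km'
        }
      }
    ]
  })
}
```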

separating data access with elasticsearch

I'm just getting to know Elasticsearch and I'm wondering whether it suits my case at all:
Consider a system where companies (with multiple employees) can register, administer their clients, and send documents to their clients.
Now I want to enable companies to search their documents, but ONLY theirs, not the documents of other companies. In other words: how do I separate the data of those companies for searches? How can this be implemented with Elasticsearch?
Is this separation to be handled by Elasticsearch itself? I.e., is there some mapping between the companies in my system and a related user for Elasticsearch?
Or is this to be handled by the backend of my system? I.e., the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results of Elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general, not about a pattern to solve this separation.
Thanks in advance!
Elasticsearch on its own does not support authorization and authentication; you need to add this via plugins, of which there are two that I know of. Shield is the official solution, which is part of X-Pack, and you need to pay Elastic if you want to use it. Search Guard is an open-source alternative with enterprise upgrades that you can buy.
Both of these enable you to define fine-grained access rights for different users. What you'd probably want to do is give every company an index of its own for its documents and then restrict its user to be able to read/write only that index. Or, if you absolutely want all documents in one index, you can add document-level restrictions as well, so that everybody queries the same index but only gets results returned for their own company; a backend-enforced version of this pattern is sketched below. Depending on how many companies you expect to service, the single index might make more sense in order not to have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
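Without the plugins, the document-level variant can be approximated in the backend by wrapping every user query in a company filter so results can never leak across tenants. A sketch, assuming the v8 JavaScript client and a hypothetical `company_id` field on each document:

```ts
import { Client } from '@elastic/elasticsearch'

const es = new Client({ node: 'http://localhost:9200' })

// The backend, not the client, injects the company restriction.
async function searchCompanyDocs(companyId: string, userQuery: object) {
  return es.search({
    index: 'documents',
    query: {
      bool: {
        must: [userQuery],
        // hypothetical tenant field; a filter clause is cached and not scored
        filter: [{ term: { company_id: companyId } }]
      }
    }
  })
}
```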
Without these plugins you would need to resort to something on the HTTP layer, for example an nginx reverse proxy that filters requests based on the index names contained in the URLs, but I'd strongly advise against this: lots of pain lies that way!

What is elastic search

I just want to know what exactly Elasticsearch is.
It is said to help with searching data, but when I watch some webinars it feels like I have to replicate my data into a kind of Elastic datastore, which does not seem very optimized to me. That way, every modification made on the left hand has to be mirrored on the right hand, and the data returned by Elasticsearch may not be in the right format.
Can Elasticsearch directly search in my database?
It's to be used with a Neo4j graph database. Has somebody already done something like that? Does it simply replace the Cypher queries?
Thanks for any advice helping me realize what Elasticsearch can really bring to our project.
Elasticsearch is a database, however it's not a relational database like you may be used to. It is a NoSQL database.
You insert JSON documents into an index. You query that index to find documents that match a particular criterion.
It is also sharded and node distributed, which gives it resilience and scalability, and also - if you set it up right - performance.
This means it's really good at 'search engine' style database queries, but because it's not relational, it cannot do the equivalent of a SQL JOIN operation very easily.
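To make that concrete, here is a minimal insert-then-query cycle using the official JavaScript client (the index name `posts` and the field names are made up for illustration; the v8 client API is assumed):

```ts
import { Client } from '@elastic/elasticsearch'

const es = new Client({ node: 'http://localhost:9200' })

async function demo() {
  // Insert a JSON document into an index...
  await es.index({
    index: 'posts',
    document: { title: 'Hello world', body: 'First post', tags: ['intro'] },
    refresh: true   // make it searchable immediately (demo only)
  })

  // ...then query that index for documents matching a criterion.
  const res = await es.search({
    index: 'posts',
    query: { match: { body: 'first' } }
  })
  console.log(res.hits.hits)
}

demo()
```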
One example use case is logstash and kibana - known as the ELK stack - where system event logs (syslog, httpd logs, that kind of thing) are processed by logstash to parse metadata - like log source, referrer, URL, session ID, etc. - and then inserted into elasticsearch.
As each event is a self contained piece of information, this is what elasticsearch does particularly well.
You can then use Kibana as a visualisation engine to display your logs, but also perform analysis - most hit pages, geographic distribution of requests, incoming referrers, time based distribution of requests, etc.
But it also collates these logs, so if you run a really large, geographically distributed website with multiple webserver nodes - or maybe you just have a lot of servers in your computer room and want to summarise the system logs - you can feed the whole lot into elastic search.
Its design is such that it's good at handling near-real-time data insertion and analysis. It also works quite well for 'forum-style' data models, as essentially all you're doing is querying a list of posts with a particular forum name and finding replies to a particular parent node, but they're standalone 'documents'.
So yes, you probably could use it to search an existing database, but you'll have to think about your data model: you can't just translate a conventional relational model, you would have to flatten it. Denormalisation is something of a sin in RDBMS terms, but it's actually quite good for search engines, because you can execute queries in parallel more efficiently.
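As a tiny illustration of that flattening (the table and field names are invented), a relational orders/customers/products join becomes one self-contained document:

```ts
// Relational model: orders JOIN customers JOIN order_items JOIN products.
// Flattened for search: one document carries everything it needs.
const orderDocument = {
  order_id: 1042,
  placed_at: '2021-03-01T10:15:00Z',
  customer_name: 'Jane Doe',     // copied from the customers table
  customer_city: 'Berlin',
  items: [
    { sku: 'A-100', product_name: 'Widget', price: 9.99 }  // copied from products
  ]
}
```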
There exist ways to combine both approaches. Have a look at this blog post:
http://graphaware.com/neo4j/2015/09/30/recommendations-with-neo4j-and-graph-aided-search.html
Databases cannot be optimized for all use cases, but luckily there are many databases available so we can choose the best one for each task.
Elasticsearch is optimized for:
Filtering of documents (exact match)
Search ranking of documents (relevance of search terms)
Aggregation of results (sums, distinct counts, percentiles, ...)
Neo4j is optimized for:
Graph traversal (naturally)
High performance when operated on a "local" graph neighborhood (context)
Actually both databases use the same underlying library Lucene to "index" data to be searched later.
ES is an open source, distributed, RESTful, JSON-based search engine. It is easy to use, scalable and flexible. The indexing feature helps in fast retrieval of search queries.

What caching strategy for search queries

We are developing a search engine web application that will enable users to search the content of about 200 portals.
Our business partner is taking care of maintaining and feeding a solr/lucene instance that is doing the workhorse job of indexing the data.
Our application queries Solr and presents the results in a human-friendly way. However, we are wondering how we could limit the number of queries, perhaps using some form of caching. The results could be cached for a few hours.
What we are wondering is: what would be a good strategy for caching the query results? Obviously we expect the method invocations to vary a lot... Does it make sense at all to do caching?
Is there some caching system that is particularly suitable in this use case? We are using Spring 3 for the development.
I would keep in mind that Solr already has a lot of caching built into it in order to speed up common queries. I'd advise you to look into the inherent capabilities in Solr/Lucene before you go off and reinvent the wheel with your own query cache.
Here is a good place to start.
The simplest solution is to reform your query before it hits Solr.
I created my own QueryBuilder method, which I pass my query string through before hitting Solr.
All this does is explode all of the arguments and then sort them into a predefined group set.
For example, in order to normalize your queries so that they are cacheable, you can sort the keys alphabetically, reform the query string, and then use that to query Solr. (The actual query result will be unchanged.)
Before you actually run the query, you could then create a hash of the Solr query string and check it against an in-memory set of all keys that have been cached so far. If you find yourself approaching millions of query keys, which is quite likely, you might want to look at a Bloom filter to reduce the keyspace while still maintaining some degree of accuracy on cache hits.
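A minimal sketch of that normalize-then-hash idea, assuming URL-style query strings (the SHA-1 choice and the in-memory Map are illustrative, not what the original QueryBuilder used):

```ts
import { createHash } from 'crypto'

// In-memory cache keyed by a hash of the normalized query string.
const cache = new Map<string, string>()

// Sort parameters so logically identical queries produce the same key.
function cacheKey(rawQuery: string): string {
  const params = [...new URLSearchParams(rawQuery).entries()]
    .sort(([a], [b]) => a.localeCompare(b))
  const canonical = new URLSearchParams(params).toString()
  return createHash('sha1').update(canonical).digest('hex')
}

function getCached(rawQuery: string): string | undefined {
  return cache.get(cacheKey(rawQuery))
}
```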
Alternatively, you might want to look at putting a reverse-proxy cache between you and Solr. For example, if you were to query Solr like Spring -> Varnish -> Solr, Varnish could do the caching, using the query string as the hash key. You would then be able to set a 2-hour Expires header in order to have the results automatically flushed/invalidated.
Hopefully this helps.
I have found that caching the results or the rendered content outside Lucene works best: an API search service that points to a caching tier holding the results from a Lucene index.
If you separate the caching tier out, you can then plug in whatever caching you want: distributed caching (Redis, Azure AppFabric, other cloud caches, etc.). You can also cache partial renderings of the web page (i.e. output caching in ASP.NET) or cache the API calls themselves using RESTful conventions. Things like cache warming or proactive caching (based on usage) are then easy to do with services.
Your application/index cache can then be "re-used" across more tiers of your app instead of caching only at the index level. This all depends on whether your index updates are real-time, whether the queries are data-level secure for each client/user id, etc. As mentioned above, Solr already does "some" of this for you.
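For the separated caching tier, a look-aside cache is the usual shape. A sketch with Redis (node-redis v4; the two-hour TTL comes from the question above, everything else is illustrative):

```ts
import { createClient } from 'redis'

const TTL_SECONDS = 2 * 60 * 60  // "the results could be cached for a few hours"

// Look-aside cache: return the cached payload if present,
// otherwise run the real search and store its result.
async function cachedSearch(key: string, runQuery: () => Promise<string>): Promise<string> {
  const redis = createClient()   // connects to localhost:6379 by default
  await redis.connect()          // a real service would reuse one connection
  try {
    const hit = await redis.get(key)
    if (hit !== null) return hit
    const result = await runQuery()
    await redis.set(key, result, { EX: TTL_SECONDS })
    return result
  } finally {
    await redis.disconnect()
  }
}
```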
