Is using ElasticSearch and Azure Search as regular data stores combined with search appropriate?

We are still deciding between ElasticSearch on an Azure VM and the Azure Search service to act as our search repository. However, for user accounts, etc., is there any need to create a separate database (in SQL Azure or even another NoSQL db)?

No, there is no need to create a separate database in order to use Azure Search (or ElasticSearch on an Azure VM). Azure Search is a REST-API-based service: you push your data to be "indexed", at which point it becomes searchable, also through this REST API. The only time you might need a SQL database that I can think of is to use our new Indexer, which will automatically ingest data (and data changes) into Azure Search from your Azure SQL or SQL Server on Azure VM database.
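To make the push model concrete, here is a minimal sketch using Python's requests against the Azure Search REST API (the service URL, index name, fields and keys below are placeholders, not from the question):

```python
import requests

# Placeholder service/index names and admin key -- replace with your own.
SERVICE = "https://my-service.search.windows.net"
INDEX = "users"
HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
PARAMS = {"api-version": "2020-06-30"}

# Push (upsert) a document into the index; it becomes searchable shortly after.
batch = {
    "value": [
        {"@search.action": "mergeOrUpload", "id": "42", "name": "Jane Doe"}
    ]
}
r = requests.post(
    f"{SERVICE}/indexes/{INDEX}/docs/index",
    params=PARAMS, headers=HEADERS, json=batch,
)
r.raise_for_status()

# Query it back through the same REST API.
q = requests.get(
    f"{SERVICE}/indexes/{INDEX}/docs",
    params={**PARAMS, "search": "Jane"},
    headers=HEADERS,
)
print(q.json()["value"])
```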

I think what you are asking is whether you can use Elasticsearch/Azure Search as your primary store for everything in an app, not just searchable data.
You can certainly do it. There are a few aspects you need to keep in mind (I'm sure there are others besides these):
Durability: when a search index is just a secondary index, it's sometimes fine to run with no replicas or just one replica. If you want strong durability, you probably want at least 3 total copies of the index to ensure availability and resilience to index corruption and the like.
Consistency: Elasticsearch has a weak consistency model, which also surfaces in Azure Search. You need to write your application with this in mind, which can make some scenarios tricky. Other stores, such as SQL and DocumentDB, offer strict consistency, which is easier to work with for a primary store.
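To illustrate the consistency point, a small sketch against a local Elasticsearch node (index name and documents are invented): a freshly indexed document is not guaranteed to be visible to search until a refresh occurs, so a read-your-own-writes pattern needs something like refresh=wait_for, which a strictly consistent primary store would give you for free.

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch node

# Index a document. By default it only becomes visible to search after the
# next refresh (typically up to ~1s later), so an immediate search may miss it.
requests.put(f"{ES}/accounts/_doc/1", json={"user": "alice", "balance": 100})

# Blocking until the change is searchable narrows the window, at the cost
# of slower writes.
requests.put(
    f"{ES}/accounts/_doc/2",
    params={"refresh": "wait_for"},
    json={"user": "bob", "balance": 50},
)

hits = requests.post(
    f"{ES}/accounts/_search", json={"query": {"match": {"user": "bob"}}}
).json()["hits"]["hits"]
print(hits)
```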

Related

Is using Elastic Search as authoritative datastore for applications advisable?

I'm new to using Elasticsearch, and I'm trying to find a datastore for our application to which we can also add a front end for analytics, in this case Kibana. I'm planning to use it as a datastore for debit/credit (dr/cr) transactions on our billing system.
Most use cases I read about are oriented towards data analytics and search. I don't see a use case where it is used as a regular datastore for an application, so I'm worried I might be applying it to the wrong use case.
I was hoping someone could add their insights on this, i.e. why or why not to use Elasticsearch as the authoritative/primary datastore for an application.
You should read the official Elasticsearch blog post on resiliency, where they clearly state that a database must be robust and should not stop working unless you tell it to.
From the robustness section of that post:
A database should be robust, especially if it is your authoritative system of record. Ideally, a costly query should be possible to cancel, and you certainly don't want the database to stop working unless you tell it to.
Unfortunately, Elasticsearch (and the components it's made of) does not currently handle OutOfMemory-errors very well. We cover this in more depth in Elasticsearch in Production, OutOfMemory-Caused Crashes. It is very important to provide Elasticsearch with enough memory and be careful before running searches with unknown memory requirements on a production cluster.
In short, you shouldn't use Elasticsearch as a primary data store where you can't afford to lose the data.

File Share Content Connector in Azure for Search

I have a large number of documents (Word/Excel/PDF etc.) stored across a number of Windows file shares; I'm not sure of the total size, but it will be at minimum a few TBs of files. I need an interface to search these documents (including their contents) and preview/download documents that match the search. It's also important that ACLs are respected, only returning search results for files the logged-in user has access to.
The initial idea was to use a tool like Apache Tika to get the file contents/metadata and dump it all into Elastic or something similar. The biggest challenge with this idea is respecting the ACLs and filtering search results.
Is there an obvious Office 365/Azure solution to this? I'm a newbie with Azure and it's a bit of a minefield, but I have seen that I can use an on-premises gateway to connect file shares to Power Apps and other Azure tools. So I'm hoping there's functionality available that will allow me to create a front end to search through these file shares.
There are two separate questions in here. You can use Azure Search, which has indexers capable of extracting and indexing the content of your files with zero lines of code. However, given the large amount of data, the billing will not be cheap, and you'll need several partitions, which also increases the cost.
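As a rough sketch of that zero-code setup, assuming the files have first been copied or synced into Azure Blob Storage (all names, keys and the schedule below are placeholders):

```python
import requests

SVC = "https://my-service.search.windows.net"
HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
PARAMS = {"api-version": "2020-06-30"}

# 1. Register the blob container holding the Word/Excel/PDF files as a data source.
requests.post(f"{SVC}/datasources", params=PARAMS, headers=HEADERS, json={
    "name": "docs-ds",
    "type": "azureblob",
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "documents"},
}).raise_for_status()

# 2. Create an indexer that cracks the documents and fills a pre-created
#    index -- no extraction code of your own required.
requests.post(f"{SVC}/indexers", params=PARAMS, headers=HEADERS, json={
    "name": "docs-indexer",
    "dataSourceName": "docs-ds",
    "targetIndexName": "documents-index",
    "schedule": {"interval": "PT1H"},  # re-crawl hourly
}).raise_for_status()
```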
As for authentication/authorization, you'll need a front end to display the results, so you'd better implement authentication/authorization there and leave Azure Search to handle only the query part. You can grant permission just to your front end.
PS: You can use Azure AD for the authentication part, but there's no ready-to-use functionality to decide which information each user can see. You'll need to implement that part yourself.
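One common way to implement that part is security trimming: store the IDs of the groups allowed to see each document in a filterable field, and have the front end add a filter built from the signed-in user's groups. A minimal sketch, where the group_ids field, index name and keys are assumptions of mine:

```python
import requests

SVC = "https://my-service.search.windows.net"
HEADERS = {"api-key": "<query-key>", "Content-Type": "application/json"}

def search_as_user(text, user_group_ids):
    """Query Azure Search, trimming results to documents whose (hypothetical)
    'group_ids' field intersects the groups the signed-in user belongs to."""
    flt = "group_ids/any(g: search.in(g, '{}'))".format(", ".join(user_group_ids))
    r = requests.post(
        f"{SVC}/indexes/documents-index/docs/search",
        params={"api-version": "2020-06-30"},
        headers=HEADERS,
        json={"search": text, "filter": flt},
    )
    r.raise_for_status()
    return r.json()["value"]

# The front end authenticates the user (e.g. via Azure AD), resolves their
# group memberships, and passes them in -- Azure Search itself never sees
# the user's identity.
results = search_as_user("quarterly report", ["grp-finance", "grp-managers"])
```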

Should I use elasticsearch for audit logs?

I am building an application with a microservices architecture, so my different business models run on different microservices.
The microservices use graph and document databases.
What I have to do is keep audit logs for objects whenever they change. There are a couple of ways to do this; two I thought of:
Store audit logs in each database whenever something changes on an object.
Instead of keeping them localized, send them to a central repository where we can see the audits for the whole application; behind the scenes the application is served by microservices, but at the front it is just one app, both for the users and for us. Could Elasticsearch be used for this kind of long-term storage, or are there other solutions?
Which other approaches are best practice? My objective in the end is to know what was changed in an object, when, and by whom.
Cheers!
The general recommendation is not to use ES as your authoritative data store. If you want 99.99% reliability for the audit data, store it somewhere else and index it in ES when you need its search capabilities.
In my experience ES is quite resilient; still, I keep in mind that its storage is not as polished as that of well-known relational DBs or Cassandra/HDFS, and I would not store important data there.
Also keep in mind that an ES index is not very flexible: if you want to heavily rescale your cluster or change a field mapping, you may have to reindex everything. Newer versions of ES offer a Reindex API, but it's still a weak point.
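A minimal sketch of that recommendation: treat the durable store as authoritative and index into ES on a best-effort basis (the db handle and its insert method are hypothetical placeholders for whatever authoritative store you choose):

```python
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

def record_audit_event(db, event):
    """Write the audit event to the durable store first; only then index a
    copy into Elasticsearch for searching."""
    db.insert("audit_log", event)  # hypothetical durable write (SQL, Cassandra, ...)
    # Best-effort index into ES; if this fails, the event can be re-indexed
    # later from the authoritative store, so nothing is lost.
    try:
        requests.post(f"{ES}/audit/_doc", json=event, timeout=2)
    except requests.RequestException:
        pass  # e.g. enqueue for retry instead of failing the business operation
```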

Azure Technology Choice for Project

There is a lot of information out there about the various Azure data storage flavors however I'd like to ask for some advice for my particular scenario.
I'm putting together a pet project to become more familiar with Azure technology, in particular, Service Bus/Event Hubs and data storage platforms. The system I want to create is fairly simple: accept a moderate load of events (not IoT scale), persist them, and make aggregated data available such as 'User A had N events of type X in the past day/week/month/etc.' as reports.
Given that the data will be quite structured (e.g. users, user groups, events, etc.), and I will need aggregation capabilities, it suggests that relational storage may be the best fit, although more expensive.
Another alternative I've considered is to maintain aggregated data at near real-time using something like stream analytics but not sure if this is overkill compared to a more data warehouse-esque solution.
Any suggestions/help would be greatly appreciated.
John
John,
Azure SQL would be a decent choice, or if that proves too expensive, regular SQL Server hosted on a VM. You can create an Azure Service Bus queue to hold the incoming requests, and then create competing consumers on one or more worker roles to monitor and process the messages. Each consumer can run the SQL and persist the data in a new table that is created and "pre-aggregated" for the caller, or you could persist the information to Azure Blob storage in a structured format that matches your reporting tool (e.g. JSON). Blob storage of the aggregated information will be the most cost effective, and it relieves strain on SQL.
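A minimal sketch of the competing-consumer part using the azure-servicebus package (the queue name, connection string and event shape are assumptions; each worker-role instance would run a loop like this, and adding instances scales out processing):

```python
import json
from collections import Counter
from azure.servicebus import ServiceBusClient

CONN_STR = "<service-bus-connection-string>"  # placeholder

# One of possibly many competing consumers: each instance pulls events off
# the same queue and folds them into a pre-aggregated count per (user, type).
counts = Counter()

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_receiver(queue_name="events") as receiver:
        for msg in receiver:
            event = json.loads(str(msg))  # e.g. {"user": "A", "type": "X"}
            counts[(event["user"], event["type"])] += 1
            receiver.complete_message(msg)  # remove from the queue once processed
            # Periodically, the aggregates would be persisted to SQL or written
            # out as JSON to blob storage for the reporting layer.
```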
An alternative would be HDInsight which can aggregate the information in batch processing mode as well. I guess the choice between SQL/HDInsight depends on the native format of the base (non-aggregated) information.
I agree with Daniel. SQL Azure may be the way to go for your relational data needs. Another option to investigate for larger workloads for streaming and analytics is Azure Data Lake (https://azure.microsoft.com/en-us/solutions/data-lake/)

Lucene.NET + Azure + AzureDirectory or something else?

Good morning.
I am currently working on a project which was originally going to be hosted on a physical server with SQL 2008 R2, but it looks like we are moving towards the cloud and Azure... Since SQL Azure does not currently support full-text indexing, I have been looking at Lucene.NET with the AzureDirectory project for back-end storage. The way this will work is that updates will come in and be queued. Once processed, they will be placed in a ToIndex queue, which will kick off Lucene.NET indexing. I am just wondering if there is a better way of doing this? We don't need to use Azure for this project, so if there is a better solution somewhere, please tell us... The main requirement for hosting is that it is in Europe (Azure and Amazon data centers in Dublin are handy; RackSpace in the US is not so handy).
Thanks.
I haven't used that project, but it looks promising. From what I understand, the basic issue is that Lucene requires a file system. I see two other possible solutions (basically just doing what the library does):
Use Azure Drive Storage and a worker role
Use Drive storage, but use a VM (if there are config issues with using a worker role)
http://go.microsoft.com/?linkid=9710117
SQLite also has full-text search available, but it has the same basic issue: it requires a filesystem:
http://www.sqlite.org/fts3.html
I have another solution for you, but it's a bit more radical and a bit more conceptual.
You could create your own inverted index using Azure Table Storage. Create partitions based on each word in your documents; since tables are indexed on the PartitionKey, per-word search should be fast, and you can do in-memory joins for multi-word searches.
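A minimal sketch of that idea with the azure-data-tables package (the connection string, table name and naive whitespace tokenizer are all assumptions):

```python
from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

CONN = "<storage-connection-string>"  # placeholder
table = TableClient.from_connection_string(CONN, table_name="InvertedIndex")
try:
    table.create_table()
except ResourceExistsError:
    pass  # table already exists

def index_document(doc_id, text):
    # One entity per (word, document): PartitionKey = word, RowKey = doc id,
    # so every word's posting list lives in its own fast-to-scan partition.
    for word in set(text.lower().split()):
        table.upsert_entity({"PartitionKey": word, "RowKey": doc_id})

def search(*words):
    # Fetch each word's partition, then intersect the posting lists in memory.
    postings = [
        {e["RowKey"] for e in table.query_entities(f"PartitionKey eq '{w.lower()}'")}
        for w in words
    ]
    return set.intersection(*postings) if postings else set()

index_document("1", "the quick brown fox")
index_document("2", "the lazy brown dog")
print(search("brown", "the"))  # -> {'1', '2'}
```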
You could host it as an Azure Website as long as your Lucene index is less than 1 GB.
I did this recently when I rewrote Ask Jon Skeet to be hosted as a self-contained Azure Website. It uses WebBackgrounder to poll the Stack Overflow API for changes before updating the Lucene index.
