File Share Content Connector in Azure for Search - Windows

I have a large number of documents (Word/Excel/PDF etc.) stored across a number of Windows file shares; I'm not sure of the total size, but it will be at least a few TB of files. I need an interface to search these documents (including their contents) and preview/download the documents that match the search. It's also important that ACLs are respected, only returning search results for files the logged-in user has access to.
The initial idea was to use a tool like Apache Tika to extract the file contents/metadata and dump it all into Elasticsearch or something similar. The biggest challenge with this idea is respecting the ACLs and filtering search results accordingly.
Is there an obvious Office 365/Azure solution to this? I'm a newbie with Azure and it's a bit of a minefield, but I have seen that I can use an on-premises gateway to connect file shares to Power Apps and other Azure tools. So I'm hoping there's functionality available that will let me create a front end to search through these file shares.

There are two separate questions in here. You can use Azure Search, which has indexers capable of extracting and indexing the content of your files with zero lines of code. However, due to the large amount of data the billing will not be cheap, and you'll need several partitions, which also increases the cost.
As for authentication/authorization, you'll need a front end to display the results, so you'd better implement authentication/authorization there and leave Azure Search to handle only the query part. You can grant permission just to your front end.
PS: You can use Azure AD for the authentication part, but there's no ready-to-use functionality to decide which information each user can see. You'll need to implement that part yourself.
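One common way to implement that missing piece is security trimming at query time: store the set of allowed group IDs on each indexed document and filter every query by the signed-in user's groups. Below is a minimal sketch using the Azure.Search.Documents SDK; the group_ids field, index name, and keys are illustrative assumptions, not anything Azure Search provides out of the box.

```csharp
using System;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Hypothetical sketch: trim results to the signed-in user's groups.
// Assumes the index declares a filterable Collection(Edm.String) field "group_ids".
var client = new SearchClient(
    new Uri("https://<service>.search.windows.net"),
    "fileshare-index",                         // illustrative index name
    new AzureKeyCredential("<query-key>"));

string[] userGroups = { "g1", "g2" };          // resolved from Azure AD at sign-in

var options = new SearchOptions
{
    // Only documents whose group_ids overlaps the user's groups come back.
    // Real code must escape single quotes in group values.
    Filter = $"group_ids/any(g: search.in(g, '{string.Join(",", userGroups)}'))"
};

var response = client.Search<SearchDocument>("quarterly report", options);
foreach (SearchResult<SearchDocument> result in response.Value.GetResults())
    Console.WriteLine(result.Document["metadata_storage_path"]);
```

The filter keeps the trimming in your front end's query code, so the search service itself never needs to know about individual users.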

Related

When using ASP.NET Core MVC + EF Core + the ability to encrypt files, what are the differences between Blob, FileStream & file system for managing files?

I am working on an ASP.NET Core MVC web application; the application is a document management workflow where, inside each of the workflow steps, users can upload documents, as follows:
Users can upload documents with the following restrictions: a file cannot exceed 5 MB, and all the documents inside a workflow cannot exceed 50 MB unless an admin approves it. They can upload as many documents as they want.
We will have a lot of views which show a step and all the documents attached to it, and users can choose to download those documents.
We can have an unlimited number of workflows; the more users register with our application, the more workflows will be created.
Certain files can be marked as confidential, so they should be encrypted when stored, whether inside the database or in the file system.
We are planning to use EF Core as the data access layer for our web application + SQL Server 2016 or 2017.
Now my question is how we should manage our files. I have found these 3 approaches:
Blob.
FileStream.
File system.
Now the first approach will allow us to encrypt the files inside the database and will work with EF, but it has a huge drawback on performance, since opening a file or querying files from the database means they get loaded into the hosting server's memory. Since we are seeking an extensible approach, I think this approach will not work for us, as it is less scalable.
The second approach will have better performance compared to the first (Blob), but FILESTREAM is not supported by EF and does not allow encryption, so we have to exclude this as well.
The third approach, storing the files inside a folder named with the workflow ID and storing the link to the file/folder inside the DB, will allow us to encrypt the files, will work with EF, and has better performance compared to Blob (not sure if this holds against FILESTREAM). The only drawback is that we cannot achieve atomicity between the files and their related records inside the database, but with some extra code we can handle this ourselves: for example, deleting a database record deletes all its documents inside the folder, and we can add background jobs to make sure all the documents have database records, deleting any documents that don't.
So based on the above, I find that the third approach is the best fit for our needs. Can anyone advise on this, please? Are my assumptions correct? And is there a fourth approach, or a hybrid approach, that would be a better fit for us?
Although modern RDBMSs have been optimised for data storage, with the perks of integrity and atomicity, databases should be considered the last resort for storing files (Stack Overflow posts like this and this corroborate the point), and therefore the third option mentioned, or an improvement thereof, gets my vote.
For instance, a potential improvement would be to store the files renamed to a hash of their content and keep the hash in the database, which eliminates all OS restrictions on subdirectories/files, filenames, and paths. Moreover, with a well-structured directory layout, duplicates can be filtered out.
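A minimal sketch of that content-addressed layout, assuming .NET 5+ for Convert.ToHexString; the paths and method name are illustrative:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Store a file under a name derived from its content hash and return the hash,
// which is the value the database row would reference (e.g. via EF Core).
string StoreByContentHash(string sourcePath, string storageRoot)
{
    using var sha = SHA256.Create();
    using var stream = File.OpenRead(sourcePath);
    string hash = Convert.ToHexString(sha.ComputeHash(stream)); // .NET 5+

    // Fan out by hash prefix so no single directory grows too large.
    string dir = Path.Combine(storageRoot, hash.Substring(0, 2));
    Directory.CreateDirectory(dir);

    string target = Path.Combine(dir, hash);
    if (!File.Exists(target))      // identical content is stored exactly once
        File.Copy(sourcePath, target);
    return hash;
}

Console.WriteLine(StoreByContentHash("upload.pdf", @"C:\files"));
```

Because the name is a pure function of the content, two users uploading the same document produce one file on disk, and OS limits on filenames and paths stop mattering.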
User-defined database functions can aid in achieving atomicity, which would remove the need for background jobs. An excellent guide on UDFs, particularly for accessing the filesystem and invoking an executable, can be found here.

Storing user profiles

I would like to store user profile information. After researching a bit online, I am confused between the following options:
Use an LDAP server (example: OpenDJ) - I can write Java clients which interact with the LDAP server using LDAP APIs.
Store the user profile in a database as a JSON document (as in Elasticsearch, for example) - the NoSQL database can then index the documents to improve lookup time.
What are the factors that I should keep in mind before selecting one of the approaches?
For a start, if you are storing passwords, then using LDAP is a no-brainer IMO. See http://smart421.com/smart-identity-and-fraud/why-bother-with-an-ldap-anyway/ .
Otherwise I would recommend you do a PoC with each solution (do not forget to add indexes for OpenDJ, and you may also use Rest2LDAP) and see how they fit your needs. Both products are open source, so it's easy to get started.
If your user population is a known group that may already have accounts in an existing LDAP repository, or where user account information needs to be shared between systems, then it makes sense to use and add on to the existing LDAP repository.
If you are starting out from scratch and have mainly external, unknown users who have no other interaction with your infrastructure but this one application, then LDAP is not a good choice IMO, because of the overhead of creating and managing the server. A lightweight JSON approach then seems better suited (even though the L in LDAP stands for "lightweight").
The number of expected users is less of a consideration - you need to tread carefully with very large populations in either scenario.
See this question as well for additional insights: Reasons to store users' data in LDAP instead of RDBMS

Will it be safe to access Elasticsearch directly from the JavaScript API?

I am learning Elasticsearch. I wanted to know how safe it is (in terms of access control and validating user access) to access the ES server directly from the JavaScript API rather than going through your backend. Will it be safe to access ES directly from the JavaScript API?
Depends on what you mean by "safe".
If you mean "safe to expose to the internet", then no, definitely not, as there isn't any access control and anyone will be able to insert data or even drop all the indexes.
This discussion gives a good overview of the issue. Relevant section:
Just as you would not expose a database directly to the Internet and let users send arbitrary SQL, you should not expose Elasticsearch to the world of untrusted users without sanitizing the input. Specifically, these are the problems we want to prevent:
Exposing private data. This entails limiting the searches to certain indexes, and/or applying filters to the searches.
Restricting who can update what.
Preventing expensive requests that can overwhelm or crash nodes and/or the entire cluster.
Preventing arbitrary code execution through dynamic scripts.
It's most certainly possible, and it can be "safe" if, say, you're using it as an internal tool behind some kind of authentication. In general, no, it's not secure: the Elasticsearch API can create, delete, update, and search, and giving a client direct access means they could essentially do any or all of these things.
You could, in theory, create role-based auth with Elasticsearch Shield, but it's far from standard practice. It's not really any more difficult to implement search on your backend and then just have a simple call that returns the search results.
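To illustrate that backend-only pattern, here is a minimal ASP.NET Core sketch (the index name "documents", the field "content", and the cluster address are placeholders) where the query DSL is built server-side, so clients never talk to the cluster directly:

```csharp
using System;
using System.Net.Http.Json;

// Thin search proxy: the browser calls /search?q=..., never Elasticsearch itself.
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHttpClient("es", c => c.BaseAddress = new Uri("http://localhost:9200"));
var app = builder.Build();

app.MapGet("/search", async (string q, IHttpClientFactory factory) =>
{
    // Only the user's query string crosses the boundary; the request body is
    // constructed here, so arbitrary DSL (deletes, scripts) can't be injected.
    var body = new { query = new { match = new { content = q } }, size = 20 };
    var response = await factory.CreateClient("es")
                                .PostAsJsonAsync("/documents/_search", body);
    return Results.Content(await response.Content.ReadAsStringAsync(),
                           "application/json");
});

app.Run();
```

Whatever authentication the app already uses (cookies, JWT) then gates /search like any other endpoint.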

Is using ElasticSearch and Azure Search as regular data stores combined with search appropriate?

We are still deciding between Elasticsearch on an Azure VM and the Azure Search service to act as our search repository. However, for user accounts, etc., is there any need to create a separate DB (in SQL Azure or even another NoSQL DB)?
No, there is no need to create a separate DB account in order to use Azure Search (or Elasticsearch on an Azure VM). Azure Search is a REST-API-based service where you push your data to be "indexed", at which point it becomes searchable, also through this REST API. The only time you might need a SQL account that I can think of is to use our new indexer, which will automatically ingest data (and data changes) into Azure Search from your Azure SQL or SQL Server on Azure VM database.
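For a feel of that push model, here is a hypothetical sketch with the current Azure.Search.Documents SDK (the index and field names are made up); the same operation can be done with raw REST calls:

```csharp
using System;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Push a couple of documents into an existing index; they become searchable
// once the service has indexed the batch.
var client = new SearchClient(
    new Uri("https://<service>.search.windows.net"),
    "accounts",                                  // illustrative index name
    new AzureKeyCredential("<admin-key>"));

var batch = new[]
{
    new SearchDocument { ["id"] = "1", ["name"] = "Alice" },
    new SearchDocument { ["id"] = "2", ["name"] = "Bob" },
};
await client.UploadDocumentsAsync(batch);
```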
I think what you are asking is whether you can use Elasticsearch/Azure Search as your primary store for everything in an app, not just searchable data.
You can certainly do it. There are a few aspects you need to keep in mind (I'm sure there are others besides these):
Durability: when search indexes are just an index, it is sometimes fine to run with no replicas or just one replica. If you want strong durability, you probably want at least 3 total copies of the index to ensure availability and resilience to index corruption and the like.
Consistency: Elasticsearch has a weak consistency model, which also surfaces in Azure Search. You need to write your application taking this fact into account, which can make some scenarios tricky. Other stores such as SQL and DocumentDB offer the option of strict consistency, which is easier to work with for a primary store.

Lucene.NET + Azure + AzureDirectory or something else?

Good morning.
I am currently working on a project which was originally going to be hosted on a physical server with SQL Server 2008 R2, but it looks like we are moving towards the cloud and Azure. Since SQL Azure does not currently support full-text indexing, I have been looking at Lucene.NET with the AzureDirectory project for back-end storage. The way this will work is that updates will come in and be queued; once processed, they will be placed in a ToIndex queue, which will kick off Lucene.NET indexing. I am just wondering if there is a better way of doing this? We don't need to use Azure for this project, so if there is a better solution somewhere, please tell us. The main requirement for hosting is that it is in Europe (the Azure and Amazon data centers in Dublin are handy; Rackspace in the US is not so handy).
Thanks.
I haven't used that project, but it looks promising. From what I understand, the basic issue is that Lucene requires a file system. I see two other possible solutions (basically just doing what the library does):
Use Azure Drive Storage and a worker role
Use Drive storage, but use a VM (if there are config issues with using a worker role)
http://go.microsoft.com/?linkid=9710117
SQLite also has full-text search available, but it has the same basic issue - it requires a filesystem:
http://www.sqlite.org/fts3.html
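For completeness, a small sketch of SQLite's full-text search from C# using Microsoft.Data.Sqlite; modern bundled SQLite builds ship FTS5, while the page linked above describes the older FTS3 module:

```csharp
using System;
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=search.db");
conn.Open();

// An FTS5 virtual table indexes its columns for full-text queries.
var cmd = conn.CreateCommand();
cmd.CommandText = "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(title, body)";
cmd.ExecuteNonQuery();

cmd.CommandText = "INSERT INTO docs (title, body) VALUES ($t, $b)";
cmd.Parameters.AddWithValue("$t", "notes");
cmd.Parameters.AddWithValue("$b", "full text search without a server");
cmd.ExecuteNonQuery();

var query = conn.CreateCommand();
query.CommandText = "SELECT title FROM docs WHERE docs MATCH $q";
query.Parameters.AddWithValue("$q", "search");
using var reader = query.ExecuteReader();
while (reader.Read())
    Console.WriteLine(reader.GetString(0));
```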
I have another solution for you, but it's a bit more radical, and a bit more of a conceptual one.
You could create your own indexes using Azure Table storage. Create partitions based on each word in your documents; since all tables are indexed on the PartitionKey, a per-word search should be fast, and you can do in-memory joins for multiple-word searches.
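A toy version of that idea, sketched with the Azure.Data.Tables SDK (the table and method names are mine, and PartitionKey values would need sanitising, since characters such as '/' and '#' are not allowed):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Azure.Data.Tables;

var table = new TableClient("<connection-string>", "WordIndex");
table.CreateIfNotExists();

// Indexing: one partition per word, one row per (word, document) pair.
void IndexDocument(string docId, IEnumerable<string> words)
{
    foreach (var word in words.Distinct())
        table.UpsertEntity(new TableEntity(word.ToLowerInvariant(), docId));
}

// Searching: read each word's partition (a fast point scan), intersect in memory.
IEnumerable<string> Search(params string[] words)
{
    var sets = words.Select(w =>
        table.Query<TableEntity>(e => e.PartitionKey == w.ToLowerInvariant())
             .Select(e => e.RowKey)
             .ToHashSet());
    return sets.Aggregate((a, b) => { a.IntersectWith(b); return a; });
}

IndexDocument("doc-1", new[] { "azure", "table", "storage" });
foreach (var id in Search("azure", "table"))
    Console.WriteLine(id);
```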
You could host it as an Azure Website as long as your Lucene index is less than 1 GB.
I did this recently when I rewrote Ask Jon Skeet to be hosted as a self-contained Azure Website. It uses WebBackgrounder to poll the Stack Overflow API for changes before updating the Lucene index.
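For anyone new to Lucene.NET, here is a minimal index-and-query sketch against the 4.8 packages (paths and field names are illustrative):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion Ver = LuceneVersion.LUCENE_48;
using var dir = FSDirectory.Open("index");      // on-disk index directory
var analyzer = new StandardAnalyzer(Ver);

// Write one document into the index.
using (var writer = new IndexWriter(dir, new IndexWriterConfig(Ver, analyzer)))
{
    var doc = new Document
    {
        new StringField("id", "42", Field.Store.YES),
        new TextField("body", "hosting a lucene index on an azure website", Field.Store.YES),
    };
    writer.AddDocument(doc);
}

// Query it back.
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var query = new QueryParser(Ver, "body", analyzer).Parse("azure");
foreach (var hit in searcher.Search(query, 10).ScoreDocs)
    Console.WriteLine(searcher.Doc(hit.Doc).Get("id"));
```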
