Context:
We are moving from ES 5.X to ES 7.X
Earlier we were using JEST Client, now we are planning to use ES High-Level Client
Our search queries are complex and we are planning to use SearchTemplate API
We will store template files locally & cache them to reduce the overhead of I/O
What I have tried so far:
I've read the documentation of EHLC and I can't find a mechanism to load & cache script files directly from the file system
I can see that we can store the script in ES, which we don't want to do since, presumably, we wouldn't have changelogs there.
Question:
Is there a built-in mechanism to use a locally stored file as a script in EHLC? Or shall we use inline scripts and load & cache the script files using custom code?
Based on the comments I'd suggest the following:
Keep track of the templates with git.
Monitor the changes and trigger a pub/sub message whenever applicable (PR merges etc.).
Configure your pub/sub handler to update the stored search template in ES.
Otherwise, when we talk about local loading + caching, the machines with slightly older EHLC processes wouldn't get notified about the most recent changes in git and would continue using stale scripts.
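As far as I know, there is no built-in EHLC mechanism to load search templates from the file system; you either store them in the cluster or send them inline and do the file loading/caching yourself. If you do go the inline route described in the question, a rough sketch could look like the following (class, index, path and parameter names are illustrative, not part of any official API):

// A minimal sketch: read a Mustache template from the local file system once,
// cache it in memory, and send it as an INLINE search template through the
// 7.x High-Level REST Client. Names below are illustrative assumptions.
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.script.ScriptType;
import org.elasticsearch.script.mustache.SearchTemplateRequest;
import org.elasticsearch.script.mustache.SearchTemplateResponse;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LocalTemplateSearcher {

    private final RestHighLevelClient client;
    // simple in-memory cache: each template file is read from disk only once
    private final Map<String, String> templateCache = new ConcurrentHashMap<>();

    public LocalTemplateSearcher(RestHighLevelClient client) {
        this.client = client;
    }

    private String loadTemplate(String path) {
        return templateCache.computeIfAbsent(path, p -> {
            try {
                return new String(Files.readAllBytes(Paths.get(p)), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public SearchTemplateResponse search(String index, String templatePath,
                                         Map<String, Object> params) throws IOException {
        SearchTemplateRequest request = new SearchTemplateRequest();
        request.setRequest(new SearchRequest(index));
        request.setScriptType(ScriptType.INLINE);       // inline template, nothing stored in ES
        request.setScript(loadTemplate(templatePath));  // cached Mustache template body
        request.setScriptParams(params);
        return client.searchTemplate(request, RequestOptions.DEFAULT);
    }
}

If you later decide to follow the stored-template route suggested above, the same cached template string can be pushed to the cluster from your pub/sub handler via the client's stored-script APIs instead.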
I have an Azure blob container with data which I have not uploaded myself. The data is not stored locally on my computer.
Is it possible to use dvc to download the data to my computer when I haven’t uploaded the data with dvc? Is it possible with dvc import-url?
I have tried using dvc pull, but I can only get it to work if I already have the data locally on the computer and have used dvc add and dvc push.
And if I do it that way, then the folders on Azure are not human-readable. Is it possible to upload them in a human-readable format?
If it is not possible is there then another way to download data automatically from azure?
I'll build on @Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".
Assumptions
Let's assume the following:
We're using a DVC pipeline, specified in an existing dvc.yaml file. The first stage in the current pipeline is called prepare.
Our data is stored on some Azure blob storage container, in a folder named dataset/. This folder follows a structure of sub-folders that we'd like to keep intact.
The Azure blob storage container has been configured in our DVC environment as a DVC 'data remote', with the name myazure (more info about DVC 'data remotes' here).
High-level idea
One possibility is to start the DVC pipeline by synchronizing a local dataset/ folder with the dataset/ folder on the remote container.
This can be achieved with a command-line tool called azcopy, which is available for Windows, Linux and macOS.
As recommended here, it is a good idea to add azcopy to your account or system path, so that you can call this application from any directory on your system.
The high-level idea is:
Add an initial update_dataset stage to the DVC pipeline that checks if changes have been made in the remote dataset/ directory (i.e., file additions, modifications or removals).
If changes are detected, the update_dataset stage shall use the azcopy sync [src] [dst] command to apply the changes on the Azure blob storage container (the [src]) to the local dataset/ folder (the [dst]).
Add a dependency between update_dataset and the subsequent DVC pipeline stage prepare, using a 'dummy' file. This file should be added to (a) the outputs of the update_dataset stage; and (b) the dependencies of the prepare stage.
Implementation
This procedure has been tested on Windows 10.
Add a simple update_dataset stage to the DVC pipeline by running:
$ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"
Notice how we specify the 'dummy' file .dataset_updated as an output of the stage.
Edit the dvc.yaml file directly to modify the command of the update_dataset stage. After the modifications, the command shall (a) create the .dataset_updated file after the azcopy command - touch .dataset_updated - and (b) pass the current date and time to the .dataset_updated file to guarantee uniqueness between different update events - echo %date%-%time% > .dataset_updated.
stages:
  update_dataset:
    cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
    deps:
      - remote://myazure/dataset/
    outs:
      - .dataset_updated
  ...
I recommend editing the dvc.yaml file directly to modify the command, as I wasn't able to come up with a complete dvc stage add command that took care of everything in one go.
This is due to the use of multiple commands chained by &&, special characters in the Azure connection string, and the echo expression that needs to be evaluated dynamically.
To make the prepare stage depend on the .dataset_updated file, edit the dvc.yaml file directly to add the new dependency, e.g.:
stages:
  prepare:
    cmd: <some command>
    deps:
      - .dataset_updated # add new dependency here
      - ... # all other dependencies
  ...
Finally, you can test different scenarios on your remote side - e.g., adding, modifying or deleting files - and check what happens when you run the DVC pipeline up till the prepare stage:
$ dvc repro prepare
Notes
The solution presented above is very similar to the example given in DVC's external dependencies documentation.
Instead of the az copy command, it uses azcopy sync.
The advantage of azcopy sync is that it only applies the differences between your local and remote folders, instead of 'blindly' downloading everything from the remote side when differences are detected.
This example relies on a full connection string with an SAS token, but you can probably do without it if you configure azcopy with your credentials or fetch the appropriate values from environment variables (see the sketch after these notes).
When defining the DVC pipeline stage, I've intentionally left out an output dependency with the local dataset/ folder - i.e. the -o dataset part - as it was causing the azcopy command to fail. I think this is because DVC automatically clears the folders specified as output dependencies when you reproduce a stage.
When defining the azcopy command, I've included the --delete-destination="true" option. This allows synchronization of deleted files, i.e. files are deleted on your local dataset folder if deleted on the Azure container.
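As an illustration of the credentials note above, on Windows you could keep the SAS token out of dvc.yaml by reading it from an environment variable (AZURE_SAS_TOKEN is a made-up name here; the rest of the command is unchanged):

$ azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?%AZURE_SAS_TOKEN%" "dataset/" --delete-destination="true"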
Please bear with me, since you have a lot of questions; the answer needs a bit of structure and background to be useful. Or skip to the very end to find some new ways of handling "Is it possible to upload them in a human-readable format?" :). Anyway, please let me know if that solves your problem, and in general it would be great to have a better (high-level) description of what you are trying to accomplish.
You are right that by default DVC structures its remote in a content-addressable way (which makes it non-human-readable). There are pros and cons to this: it's easy to deduplicate data, it's easy to enforce immutability and make sure that no one can touch it directly and remove something, directory names in projects stay connected to the actual project and their meaning, etc.
Some materials on this: Versioning Data and Models, my answer on how DVC structures its data, and the upcoming Data Management User Guide section (still WIP).
That said, it's clear there are downsides to this approach, especially when it comes to managing a lot of objects in the cloud (e.g. millions of images). To name a few concerns that I see a lot as a pattern:
Data has been created (and is being updated) by someone else. There is some ETL, a third-party tool, etc. We need to keep that format.
A third-party tool expects the data in a "human-readable" layout. It doesn't integrate with DVC, so it can't access the data indirectly via Git (one example: Label Studio needs direct links to S3).
It's not practical to move all of the data into DVC, and it doesn't make sense to instantiate all the files at once as one directory. Users need slices, usually based on some annotations (metadata), etc.
So, DVC has multiple features to deal with data in its own original layout:
dvc import-url - it'll download the objects, cache them, and by default push them (dvc push) to the remote again to guarantee reproducibility (this can be changed). This command creates a special .dvc file that is used to detect changes in the cloud and decide whether DVC needs to download something again. It should cover the case of "downloading data automatically from Azure" (see the sketch right after this list).
dvc get-url - this is more or less wget, rclone, or aws s3 cp with multi-cloud support. It just downloads objects.
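For illustration, reusing the myazure remote and dataset/ folder from the other answer (names are assumptions carried over from there), a dvc import-url workflow could look roughly like this:

$ dvc import-url remote://myazure/dataset/ dataset/   # creates dataset.dvc tracking the remote folder
$ dvc update dataset.dvc                              # later: re-check the remote and re-download only if it changed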
A slightly more advanced option (if you use DVC pipelines):
Similar to import-url but for DVC pipelines - external dependencies
The third (new) option is in beta. It's called "cloud versioning", and essentially it tries to keep the storage human-readable while still letting you benefit from .dvc files in Git if you need them to reference an exact version of the data.
Cloud Versioning with DVC (it's WIP as I write this; if the PR is merged, you can find it in the docs).
The document summarizes well the approach:
DVC supports the use of cloud object versioning for cases where users prefer to retain their original filenames and directory hierarchy in remote storage, in exchange for losing the de-duplication and performance benefits of content-addressable storage. When cloud versioning is enabled, DVC will store files in the remote according to their original directory location and filenames. Different versions of a file will then be stored as separate versions of the corresponding object in cloud storage.
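If I remember correctly, enabling it amounts to flipping a flag on the data remote, though since this is still in beta the option name may change - please double-check against the cloud versioning docs linked above:

$ dvc remote modify myazure version_aware true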
Is it possible to save a bunch of queries into a single JSON file to import in Kibana Console?
I know there's an option to save a single query[2] and the Kibana console is based on local storage, but I would like to load the queries based on parameters, such that changing the params (e.g. load_from=filename.json) loads up a different set of queries.
For example, when I open http://localhost:5601/app/kibana#/dev_tools/console?load_from=filename.json, it should open the Kibana console with ES queries from the file.
EDIT: As a workaround, it's possible to do this with Postman API Client or similar API clients.
Solution:
EDIT 2 on 22/02/2022: Kibana Spaces is the answer. It lets you organize dashboards and other saved objects into meaningful categories[3]. Whenever you load http://localhost:5601/ it lets you choose the space you want to work with. Having multiple browser tabs with different saved spaces should work for most cases.
[2] https://www.elastic.co/guide/en/kibana/master/save-load-delete-query.html
[3] https://www.elastic.co/guide/en/kibana/master/xpack-spaces.html
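If you'd rather script the creation of those spaces than click through the UI, the Kibana Spaces HTTP API can create them; a rough sketch (the space id and name are made up, and you may need to add authentication):

$ curl -X POST "http://localhost:5601/api/spaces/space" -H "kbn-xsrf: true" -H "Content-Type: application/json" -d '{"id": "team-queries", "name": "Team queries"}'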
Unfortunately, that's not possible yet.
Elastic is (supposedly) working on a new Kibana feature (tabbed console panes #10095) that will provide support for better organizing the code in the Dev Tools application. The issue has been opened for a while and not much seems to be happening, so we'll see.
The release date of that feature is not known yet.
Hope you are doing well!
We have already developed an ETL pipeline using Apache NiFi, which gets triggered only when a client uploads a source data file from the portal. After that, the data in the source file goes through various layers, gets transformed, and is stored back into the warehouse (i.e. Hive).
Goal: to identify sensitive information and mask it so that the end user won't see the actual data.
Identifying sensitive data & masking strategy: we will make use of open source tools to achieve this goal, as follows.
Data Steward Studio: this tool allows me to identify sensitive information and tag it properly.
Apache Atlas: once the data steward user has confirmed the tag, that tag will be pushed into Apache Atlas.
Apache Ranger: finally, we can define a tag-based masking policy using Apache Ranger which will allow or deny access to specific users.
For more details on the above solution, please see this link:
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem: in order to feed the data to the DSS tool, it should first be loaded into a Hive table. That is fine. But we cannot stop the existing ETL flow in between and then start the identification process for sensitive information. The above solution requires some manual process, which I want to get rid of and make automated; that is, it should be plugged in somewhere within the NiFi pipeline. But so far, as per my understanding, DSS does not allow us to do something like that.
Manual process:
Create Asset collection
Accept/Reject suggested tags within DSS.
If we cannot plug the identification process into the pipeline, then the client's sensitive data will be exposed and visible to everyone in the team. I want something where we can de-identify sensitive data before it actually gets loaded into HDFS or Hive tables.
Please share your response on this problem if you have already worked in this particular area.
I did not test it, but here are my thoughts on this challenge.
Set up the system such that data is NOT visible to everyone (or anyone) by default
Load the data into Hive
Let the profilers run and accept their suggestions
Open up the data to those who should have access (except for the things found by the profiler)
There are still some implementation details to work out (e.g. how to automate steps 3 and 4, and whether you can solve this with tags alone or whether the data needs to sit in a staging area first). But I hope this steers you in a good direction.
One idea might be to use NiFi's EncryptContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). Then the values loaded into Hive are encrypted in the first place and would not be visible to the stewards. Once the tagging has been done, then in the subsequent part of the pipeline (where I'm assuming you're using NiFi as well) you can decrypt the content as required.
I want to disable the Database Replication from the replica cluster in MarkLogic 8 using ML-Gradle. After updating the configurations, I also want to re-enable it.
There are tasks for enabling and disabling flexrep in ml-gradle, but I couldn't find any such thing for Database Replication. How can this be done?
ml-gradle uses the Management API to handle configuration changes. Database Replication is controlled by sending a PUT command to /manage/v2/databases/[id-or-name]/properties. Update your ml-config/databases/content-database.json file (example that does not include that property) to include database-replication, including replication-enabled: true.
To see what that object should look like, you can send a GET request to the properties endpoint.
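For example, assuming the Management API on its default port 8002, an admin user, and a database named my-content-database (all illustrative values):

$ curl --anyauth -u admin:admin -H "Accept: application/json" "http://localhost:8002/manage/v2/databases/my-content-database/properties"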
You can create your own command to set replication-enabled - see https://github.com/rjrudin/ml-gradle/wiki/Writing-your-own-management-task
I'll also add a ticket for making official commands - e.g. mlEnableReplication and mlDisableReplication, with those defaulting to the content database, and allowing for any database to be specified.
I'm working on an app that uses Jena for storage (with the TDB backend). I'm looking for something like the equivalent of Squirrel, that lets me see what's being stored, run queries etc. This seems like an obvious thing to need, but my (perhaps badly phrased) google queries aren't turning up anything promising.
Any suggestions, please? I'm on XP. Even a command line tool would be helpful.
Take a look at my Store Manager tool which is part of the dotNetRDF Toolkit which I develop as part of the wider dotNetRDF project I maintain.
It provides a fairly basic GUI through which you can connect to various Triple Stores including TDB provided that you expose your dataset via Joseki/Fuseki. You need to have .Net 3.5 installed to run the apps in the toolkit.
If you don't already expose your TDB dataset via HTTP, try using Fuseki: it is ridiculously easy to use and can be run just on your local machine when necessary to make your TDB store available via HTTP for use with my tool, e.g.
java -jar fuseki-0.1.0-server.jar --update --loc data /dataset
Please see the Fuseki wiki for more information on running Fuseki and the various options. In the above example Fuseki is run with SPARQL Update enabled (the --update flag), using the TDB dataset located in the directory data (the --loc data argument) and with a base URI of /dataset for the data.
Once running you can use my tool to connect to a Fuseki server by going to File > New Generic Store Manager, selecting the "Fuseki" tab from the dialog that appears, entering the URI http://localhost:3030/dataset/data and then clicking "Connect to Fuseki".
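For a quick sanity check without any GUI at all, you can also query the running Fuseki server directly from the command line; a rough example, assuming the /dataset base URI from above:

$ curl "http://localhost:3030/dataset/query" --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"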
Twinkle is a handy SPARQL client: http://www.ldodds.com/projects/twinkle/
As it happens I'm working on something similar myself, but it still needs a lot of work (check back in a month :) http://hyperdata.org/wiki/Scute
First, download Jena Fuseki from
https://jena.apache.org/download/index.cgi
Unzip the file and copy the "jena-fuseki-1.0.1" folder to the C drive.
Open cmd and type the following to change into the folder:
"cd C:\jena-fuseki-1.0.1"
Then type:
"java -jar fuseki-server.jar --update --loc data /dataset"
Finally, open a browser and go to:
"localhost:3030/"
Remember that you must first declare the environment variable (located in System Properties, Advanced tab): edit the variable called "Path" under "System variables" to include
"C:\jena-fuseki-1.0.1"
I also develop a SPARQL client, Open Source in Java Swing: EulerGUI.
In fact it does a lot more, see the manual:
http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html
For the SPARQL feature, it's better to take the EulerGUI minimal build:
http://sourceforge.net/projects/eulergui/files/eulergui/1.11/