How to set reference time for SUTime via StanfordNLP server - stanford-nlp

I am trying to get SUTime annotations using the Stanford CoreNLP server, and it seems that the reference time cannot be set using any properties on the server.
Is there a way to do this?
For example, given the text "I need a desk for tomorrow from 2pm to 3pm", I need to provide the reference date as datetime.now() in my Python client for the server in order for SUTime to resolve the word 'tomorrow' to the correct date.

There is, at least in Stanford CoreNLP 3.9.1. Send your text to the following URL:
[stanford_server_url]/?properties={"ner.providedDocDate": "yyyy-mm-dd"}
I'd wondered about this myself for quite a while. Had to read the source code to find it, though.
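For reference, here is a minimal, untested Python sketch of that call, assuming a CoreNLP 3.9.x server on localhost:9000, the requests library, and the property name and date format exactly as given above:

import json
from datetime import datetime
import requests

CORENLP_URL = "http://localhost:9000"  # adjust to your server
text = "I need a desk for tomorrow from 2pm to 3pm"

properties = {
    "annotators": "tokenize,ssplit,pos,lemma,ner",
    "ner.providedDocDate": datetime.now().strftime("%Y-%m-%d"),
    "outputFormat": "json",
}

response = requests.post(
    CORENLP_URL,
    params={"properties": json.dumps(properties)},
    data=text.encode("utf-8"),
)
response.raise_for_status()

# Print each temporal entity mention with its resolved TIMEX3 value;
# 'tomorrow' should now resolve relative to the provided document date.
for sentence in response.json()["sentences"]:
    for mention in sentence.get("entitymentions", []):
        if "timex" in mention:
            print(mention["text"], "->", mention["timex"].get("value"))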

Related

How to download a CSV from a HTTPS URL to file using Pentaho Data Integration - Spoon (Kettle)?

Googling suggests this question has been asked, and partially (and poorly) answered, a number of times, mostly for older versions.
Question: How can I download a CSV to a local file, with the below constraints? I'm designing in Spoon.
URL: Will always be the same: https://example.com/data/my.csv. The website prepares the CSV and provides it back to the web client as a file download after about 4-5 seconds. In a browser this means it is downloaded as a .csv file rather than displayed.
Authentication: The website does not require authentication for access. The data isn't sensitive.
Local file path: The downloaded CSV will overwrite the existing CSV, e.g. d:\data\my.csv. That is, I can set this on a timer and have it download the newest CSV every hour or so.
Proxy: It is quite likely I will need to traverse a network proxy, e.g. badproxy.mynetwork.internal:8080, and that proxy requires a username and password. It would be far better if I could set this password in a single location so anything created in future can reference it. I'm not really sure how to approach this either.
The rest of my process focuses on handling the content of the CSV, and already works fine.
The processes I've found on Google show using the HTTP client step, though it's not particularly straightforward how this translates into a file being saved locally to a known location.
Thanks for any pointers.
PDI v9.0.0.0-423
The HTTP client step needs to be triggered. Use a Row generator step generating e.g. 1 empty row and link that with a hop to the HTTP client step.
For your solution, try this:
Data Grid --> HTTP Client --> CSV File Input --> Text file output (with a .csv extension)
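On the proxy and single-location-password part of the question, one untested option is to define Kettle variables once in kettle.properties (in your .kettle directory) and reference them as ${VAR_NAME} from step fields; whether the HTTP client step in PDI 9.0 supports an authenticated proxy is something you would need to verify. A hypothetical sketch:

# $HOME/.kettle/kettle.properties -- variable names are made up
PROXY_HOST=badproxy.mynetwork.internal
PROXY_PORT=8080
PROXY_USER=svc_download
PROXY_PASSWORD=change_me
# Reference these in step fields as ${PROXY_HOST}, ${PROXY_PORT}, etc.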

How to plug a process for identifying sensitive information into an ETL pipeline?

Hope you are doing well!
We have already developed an ETL pipeline using Apache NiFi, which is triggered only when a client uploads a source data file from the portal. After that, the data in the source file goes through various layers, gets transformed, and is stored back to the warehouse (i.e. Hive).
Goal: Identify sensitive information and mask it so that end users won't see the actual data.
Identifying sensitive data & masking strategy: We will make use of open-source tools to achieve this goal, as follows.
Data Steward Studio (DSS): This tool allows me to identify sensitive information and tag it properly.
Apache Atlas: Once the data steward has confirmed a tag, that tag is pushed into Apache Atlas.
Apache Ranger: Finally, we can define a tag-based masking policy using Apache Ranger, which allows or denies access for specific users.
For more details on the above solution, please see:
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem: In order to feed the data to the DSS tool, it must first be loaded into a Hive table. That is fine. But we cannot stop the existing ETL flow in the middle and then start the process of identifying sensitive information. The above solution requires some manual steps, which I want to get rid of and automate; that is, it should be plugged in somewhere within the NiFi pipeline. But so far, as I understand it, DSS does not allow us to do something like that.
Manual process:
Create an asset collection
Accept/reject suggested tags within DSS
If we cannot plug the identification process into the pipeline, then the client's sensitive data will be exposed and visible to everyone on the team. I want something where we can de-identify sensitive data before it actually gets loaded into HDFS or Hive tables.
Please respond if anyone has already worked in this particular area.
  
I did not test it, but here are my thoughts on this challenge.
1. Set up the system such that data is NOT visible to everyone (or anyone) by default
2. Load the data into Hive
3. Let the profilers run and accept their suggestions
4. Open up the data to those who should have access (except for the things found by the profiler)
There are still some implementation details to work out (e.g. how to automate steps 3 and 4, and whether you can solve this with tags alone or whether the data needs to sit in a staging area first), but I hope this steers you in a good direction.
One idea might be to use NiFi's EncryptContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). Then the values loaded into Hive are encrypted in the first place and are not visible to the stewards. Once the tagging has been done, then in the subsequent part of the pipeline (where I'm assuming you're using NiFi as well) you can decrypt the content as required.
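If the de-identification has to happen inside the NiFi flow itself, before anything lands in HDFS or Hive, another option is an ExecuteScript processor that masks the sensitive fields. The following is only a rough, untested Jython sketch, assuming one JSON record per flowfile and hypothetical field names ssn and email:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

# Hypothetical field names to de-identify; adjust to your schema.
SENSITIVE_FIELDS = ("ssn", "email")

class MaskSensitiveFields(StreamCallback):
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        record = json.loads(text)
        for field in SENSITIVE_FIELDS:
            if field in record:
                record[field] = "***MASKED***"
        outputStream.write(bytearray(json.dumps(record).encode("utf-8")))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, MaskSensitiveFields())
    session.transfer(flowFile, REL_SUCCESS)

(session and REL_SUCCESS are bindings that ExecuteScript injects into the script.)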

Is the GBMV3 object from the H2O server different from the GBMV3 class in the H2O library?

We are working with H2O version 3.22.0.1. We created a process in Java 10 that communicates with the REST API using Jersey 2.27 and Gson 2.3.1. The process invokes ImportFiles, followed by ParseSetup and Parse. Everything works well up to that point. Then the process invokes 3/ModelBuilders/gbm/parameters. From examining the log, it appears that the H2O server responds as expected. However, Gson throws a JsonSyntaxException caused by the following:
java.lang.IllegalStateException: Expected BEGIN_OBJECT but was BEGIN_ARRAY at line 1 column 4115 path $.parameters
Upon further analysis, it appears that the H2O server returns a GBMV3 object containing an array of ModelParameterSchemaV3 objects, while the GBMV3 class as defined in the library our client uses extends SharedTreeV3, which extends ModelBuilderSchema, which has a single instance of ModelParametersSchemaV3. So there is an apparent discrepancy between how the GBMV3 object provided by the H2O server is composed and how the class is defined in the H2O library: one has an array of ModelParameterSchemaV3 objects, while the other has a single ModelParametersSchemaV3 instance. Is that the case? If so, could you please help us understand what we may be doing wrong and how to correct it?
See the files located at: https://1drv.ms/f/s!AsSlPHvlhJI1hIpB2M5X49J5L-h1qw
Run the H2O server. Import the CSV file in H2O Flow. SetupParse and Parse the data. Run the test procedure. Thank you for your kind assistance.
Thanks for the detailed description. To better understand your problem - would you be able to provide a simplified example of how you are calling H2O-3 using the Java bindings?
You might be hitting a bug, so if you are able to give us a reproducer we could expedite a fix for this issue.
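While putting a reproducer together, a quick way to confirm what the server actually sends (and whether parameters really is serialized as an array) is to hit the same endpoint outside of Gson and inspect the raw JSON. A small Python sketch, assuming an H2O-3 instance on the default localhost:54321:

import requests

# Assumes a local H2O-3 instance on the default port; adjust as needed.
H2O_URL = "http://localhost:54321"

resp = requests.get(H2O_URL + "/3/ModelBuilders/gbm/parameters")
resp.raise_for_status()
payload = resp.json()

def report_parameters(node, path="$"):
    # Recursively print the JSON type of every field named 'parameters'.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "parameters":
                print(path + ".parameters", "->", type(value).__name__)
            report_parameters(value, path + "." + key)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            report_parameters(item, path + "[" + str(i) + "]")

report_parameters(payload)

If that prints list for $.parameters, the server-side schema and the client library class you are deserializing into really do disagree, which matches the BEGIN_ARRAY error.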

What to take into consideration when adding a new Kibana 5 data source?

I'm attempting to add Solr as a new data source for Kibana 5 to read from. Is adding a new plugin to the source code all it takes, or are there other areas I should take into consideration?
You make it sound very simple: "only add a new plugin". I think this will be very hard, since Elasticsearch and its query DSL are baked into Kibana very deeply.
Lucidworks tried to fork Kibana twice:
https://github.com/lucidworks/silk — no commits since February 2016
https://github.com/lucidworks/banana — no commits since January 2017
You can probably take a look at their commits to get an idea, but this will be a lot of work.

ParseConfig vs ParseQuery - Which is faster/better?

So I'm essentially using a column in my Parse Server DB as a configuration for my app. I now realize Parse has a Config feature specifically for this.
My question is: which is faster/better to use, and why? Hitting a query every single time I want to check, or checking the Parse Config? Isn't checking the config variable just another query?
If anyone can provide proper/better (or any) documentation, that would be appreciated.
ParseConfig is faster because it stores the config in the cache, while ParseQuery retrieves it from the DB. But you cannot use ParseConfig for every query operation, only to get things that are relevant to all of your users, like a "Message of the day" and so on.
Another very cool thing about ParseConfig is that if you don't have network connectivity, it will automatically fall back to the client-side cache.
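For a sense of how lightweight a Config fetch is: over the REST API it is a single GET that returns every param in one response, and the client SDKs then keep that result in a local cache, which is where the speed and the offline fallback come from. A minimal Python sketch, with a made-up server URL, keys, and param name:

import requests

# Hypothetical Parse Server URL and keys; replace with your own.
PARSE_SERVER_URL = "https://example.com/parse"
HEADERS = {
    "X-Parse-Application-Id": "YOUR_APP_ID",
    "X-Parse-REST-API-Key": "YOUR_REST_API_KEY",
}

resp = requests.get(PARSE_SERVER_URL + "/config", headers=HEADERS)
resp.raise_for_status()

# The response is {"params": {...}} with every config value at once.
params = resp.json()["params"]
print(params.get("messageOfTheDay"))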
