Can I cluster documents from a local file? - carrot2

I have already used Carrot2 for my clustering project. I integrate Carrot2 with my PHP code, so I use the DCS.
My question is: can I cluster documents from a local file? I ask because there is a 'From XML File' option for the 'Document Source' parameter on the welcome screen of Carrot2 (Quick start - Document Clustering Server - Carrot2).
If it is possible, how is it done? I mean, what would example code for clustering from a local file look like? (Assume the file is XML and uses the XML format specified by Carrot2.)
'example.php' contains example code for clustering from an external data source (eTools), but I am confused about how to change this code to cluster from an XML file.
Thanks.

When you choose 'From XML File' in Carrot2 DCS, it is the browser that uploads the local file to the DCS for clustering. To get the same functionality in your PHP code, your PHP code would have to accept file uploads and then send the content to the DCS for clustering.
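The "read the file, then send its content to the DCS" step could be sketched as follows. This is a minimal sketch in Python rather than PHP, and it rests on assumptions that are not in the answer above: the endpoint URL http://localhost:8080/dcs/rest and the form field names dcs.c2stream and dcs.output.format are taken from memory of the Carrot2 3.x DCS and should be checked against your installation. The helper names build_multipart and cluster_local_file are made up for illustration.

```python
import uuid
import urllib.request

# Assumed DCS REST endpoint; adjust host, port, and path to your installation.
DCS_URL = "http://localhost:8080/dcs/rest"


def build_multipart(fields, boundary=None):
    """Encode a dict of text fields as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")
        lines.append(value)
    lines.append("--" + boundary + "--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, "multipart/form-data; boundary=" + boundary


def cluster_local_file(path, url=DCS_URL):
    """Read a local Carrot2 XML file and post its content to the DCS."""
    with open(path, encoding="utf-8") as handle:
        xml = handle.read()
    # Field names are assumptions (Carrot2 3.x DCS); verify before relying on them.
    body, content_type = build_multipart({
        "dcs.c2stream": xml,
        "dcs.output.format": "XML",
    })
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A PHP version would follow the same shape: read the uploaded or local file into a string and post it as a form field in a multipart request to the DCS.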

How to download a CSV from a HTTPS URL to file using Pentaho Data Integration - Spoon (Kettle)?

When googling this question, it seems to have been asked, and partially (and poorly) answered a number of times, mostly for older versions.
Question: How can I download a CSV to a local file, with the below constraints? I'm designing in Spoon.
URL: Will always be the same. https://example.com/data/my.csv . The website prepares the csv and provides it back to the web client as a file download after about 4-5 seconds. In a browser this means it is downloaded as a .csv, and not displayed.
Authentication: The website does not require authentication for access. The data isn't sensitive.
Local file path: The downloaded CSV will overwrite the existing CSV, e.g. d:\data\my.csv. That is, I can set this on a timer and have it download the newest CSV every hour or so.
Proxy: It is quite likely I will need to traverse a network proxy. eg badproxy.mynetwork.internal:8080 and that proxy requires a username and password. It's far better if I can set this password in a single location so any future things created can reference it. Not really sure on how to approach this either.
The rest of my process focuses on addressing the content of the csv, and already works fine.
The processes I've found on google show using the Http Client component, though it's not particularly straightforward how this translates into a file being saved locally into a known location.
Thanks for any pointers.
PDI v9.0.0.0-423
The HTTP client step needs to be triggered. Use a Row generator step generating e.g. 1 empty row and link that with a hop to the HTTP client step.
For your solution, try this:
Data Grid --> HTTP Client --> CSV File Input --> Text file output (with a .csv extension)

How to plug in a process of identifying sensitive information somewhere in ETL pipeline?

Hope you are doing well !
We have already developed an ETL pipeline using Apache NiFi. It is triggered only when a client uploads a source data file from the portal. After that, the data in the source file goes through various layers, gets transformed, and is stored in the warehouse (i.e. Hive).
Goal: To identify sensitive information and mask it so that end users won't see the actual data.
Identifying sensitive data and masking strategy: We will use the following open source tools to achieve this goal.
Data Steward Studio: This tool allows me to identify sensitive information and tag it properly.
Apache Atlas: Once the data steward user has confirmed a tag, that tag is pushed into Apache Atlas.
Apache Ranger: Finally, we can define a tag-based masking policy in Apache Ranger that allows or denies access for specific users.
For more details on above solution , please visit link.
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem: To feed data to the DSS tool, it must first be loaded into a Hive table. That is fine, but we cannot stop the existing ETL flow midway and then start the process of identifying sensitive information. The above solution requires some manual steps, which I want to eliminate and automate; that is, the identification should be plugged in somewhere within the NiFi pipeline. So far, as I understand it, DSS does not allow anything like that.
Manual Process :
Create Asset collection
Accept/Reject suggested tags within DSS.
If we cannot plug the identification process into the pipeline, the client's sensitive data will be exposed and visible to everyone on the team. I want a way to de-identify sensitive data before it is actually loaded into HDFS or Hive tables.
If anyone has already worked in this area, please share your response to this problem.
  
I did not test it, but here are my thoughts on this challenge.
1. Set up the system such that data is NOT visible to everyone (or anyone) by default
2. Load the data into Hive
3. Let the profilers run and accept their suggestions
4. Open up the data to those who should have access (except for the things found by the profiler)
There are still some implementation details to work out (e.g. How to automate step 3/4 and whether you can just solve this with tags or whether the data needs to sit in a staging area first). But I hope this steers you in a good direction.
One idea might be to use NiFi's EncryptContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). The values loaded into Hive would then be encrypted from the start and would not be visible to the stewards. Once the tagging has been done, the subsequent part of the pipeline (where I'm assuming you're using NiFi as well) can decrypt the content as required.

How to send my most recent dataset by FTP?

I am using IBM Mainframe TSO to view files from a dataset. I was recently told to start FTPing the latest generation dataset every day to a folder on my desktop. The problem is that my FTP script only lets me FTP a file with the exact name I specify, and the dataset name changes every day.
How can I write a script that will FTP the latest generation? Please see the example below of how the dataset name changes:
Dataset
8/30/18 - KIBI.AL242422.REPORT.G6441V00
8/31/18 - KIBI.AL242422.REPORT.G6442V00
9/1/18 - KIBI.AL242422.REPORT.G6443V00
9/4/18 - KIBI.AL242422.REPORT.G6444V00
9/5/18 - KIBI.AL242422.REPORT.G6445V00
command.bat
ftp -i -s:Command.txt
quit
command.txt
open sc01.sample.com
USER NAME
PASSWORD
get 'KIBI.AL242422.REPORT.G6441V00'
What you're referring to are Generation Data Groups (GDGs). You can refer to the files in relative form, where (0) is the most current generation, (-1) is the previous generation, and so on. In your case you want to access the dataset by relative reference. In your FTP client, do the following:
cd KIBI
get AL242422.REPORT(0)
The system will determine which of the datasets is the one you want. It's a nice feature.
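If the daily transfer is scripted rather than typed by hand, the same relative reference can be passed in the RETR command. Below is a minimal sketch using Python's standard ftplib; the function names gdg_reference and fetch_latest are hypothetical, and the sketch assumes the server treats a name in single quotes as fully qualified (as in the quoted get in the question's command.txt) and that the report is text, so an ASCII-mode transfer is appropriate.

```python
from ftplib import FTP


def gdg_reference(base, generation=0):
    """Return a quoted relative GDG name, e.g. 'A.B(0)' or 'A.B(-1)'."""
    return "'%s(%d)'" % (base, generation)


def fetch_latest(host, user, password, base, local_path):
    """Download the current generation of a GDG as text lines."""
    ftp = FTP(host)
    ftp.login(user, password)
    try:
        with open(local_path, "w", newline="") as out:
            # retrlines transfers in ASCII mode and calls the callback per line.
            ftp.retrlines(
                "RETR " + gdg_reference(base),
                lambda line: out.write(line + "\n"),
            )
    finally:
        ftp.quit()
```

With the names from the question, the call would look like fetch_latest("sc01.sample.com", "NAME", "PASSWORD", "KIBI.AL242422.REPORT", "report.txt"), where the local file name is a placeholder.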

Desktop SPARQL client for Jena (TDB)?

I'm working on an app that uses Jena for storage (with the TDB backend). I'm looking for something like the equivalent of Squirrel, that lets me see what's being stored, run queries etc. This seems like an obvious thing to need, but my (perhaps badly phrased) google queries aren't turning up anything promising.
Any suggestions, please? I'm on XP. Even a command line tool would be helpful.
Take a look at my Store Manager tool which is part of the dotNetRDF Toolkit which I develop as part of the wider dotNetRDF project I maintain.
It provides a fairly basic GUI through which you can connect to various Triple Stores including TDB provided that you expose your dataset via Joseki/Fuseki. You need to have .Net 3.5 installed to run the apps in the toolkit.
If you don't already expose your TDB dataset via HTTP try using Fuseki as it is ridiculously easy to use and can be run just on your local machine when necessary to make your TDB store available via HTTP for use with my tool e.g.
java -jar fuseki-0.1.0-server.jar --update --loc data /dataset
Please see the Fuseki wiki for more information on running Fuseki and the various options. In the above example Fuseki is run with SPARQL Update enabled (the --update flag), using the TDB dataset located in the directory data (the --loc data argument) and with a base URI of /dataset for the data.
Once running you can use my tool to connect to a Fuseki server by going to File > New Generic Store Manager, selecting the "Fuseki" tab from the dialog that appears, entering the URI http://localhost:3030/dataset/data and then clicking "Connect to Fuseki".
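Once Fuseki is serving the dataset over HTTP, it can also be queried from a short script, which may be enough if a command line tool is acceptable. The following is a minimal sketch using only the Python standard library. It assumes the query service is exposed at http://localhost:3030/dataset/query, which follows the usual Fuseki layout but is not stated above, so check the server's own pages for the actual service URLs. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed SPARQL query service URL for a dataset mounted at /dataset.
QUERY_ENDPOINT = "http://localhost:3030/dataset/query"


def build_query_url(sparql, endpoint=QUERY_ENDPOINT):
    """Build a SPARQL protocol GET URL carrying the query text."""
    return endpoint + "?" + urllib.parse.urlencode({"query": sparql})


def run_select(sparql, endpoint=QUERY_ENDPOINT):
    """Run a SELECT query and return the list of result bindings."""
    request = urllib.request.Request(
        build_query_url(sparql, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["results"]["bindings"]
```

For example, run_select("SELECT * WHERE { ?s ?p ?o } LIMIT 10") would return up to ten triples as dictionaries keyed by variable name.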
Twinkle is a handy SPARQL client : http://www.ldodds.com/projects/twinkle/
As it happens I'm working on something similar myself, but it still needs a lot of work (check back in a month :) http://hyperdata.org/wiki/Scute
First, download Jena Fuseki from
https://jena.apache.org/download/index.cgi
Unzip the file and copy the "jena-fuseki-1.0.1" folder to the C drive.
Open cmd.
To change to the folder, type
"cd C:\jena-fuseki-1.0.1"
Then type
"java -jar fuseki-server.jar --update --loc data /dataset"
Finally, open a browser and go to
"localhost:3030/"
Remember that you must first set the environment variable (found in System Properties, on the Advanced tab):
edit the variable named "Path" under "System variables" so that it includes
"C:\jena-fuseki-1.0.1"
I also develop a SPARQL client, Open Source in Java Swing: EulerGUI.
In fact it does a lot more, see the manual:
http://eulergui.svn.sourceforge.net/viewvc/eulergui/trunk/eulergui/html/documentation.html
For the SPARQL feature, better take the EulerGUI minimal build:
http://sourceforge.net/projects/eulergui/files/eulergui/1.11/

Out of the box approach to upload XML file to the BIRT Server for Processing

I have the BIRT Report Server configured in Tomcat and it works fine when running reports that require an XML data source, but that XML file has to be available on the network in order for the server to find it and run. Is there an out-of-the-box configuration in the BIRT server that will prompt the user to upload the XML file directly to the server when they try to run a given report that requires an XML data source? This would be handy for users who have the XML data source stored locally on their C drive and do not want to move it to a network server in order for it to be read by BIRT. Thanks in advance.
Paul
There is not an OOTB solution that does what you describe.
Without the OOTB option, the best way to handle this would be using Actuate's IDAPI. This will give you all the tools to get the file uploaded and added to the iServer. You can expose the IDAPI interface in any number of ways including on the BIRT report itself or on a custom parameter request page.
