UnStructured Text Classification issues Using OpenNLP - opennlp

I was trying to classify documents using OpenNLP tool , created a set of training data with key and whitespace separated value and using Java API made it worked .
The issues is when more than one KEY (training set data key ) appears in the same document , the result seems to be inconsistent and not correct , is their any other tool which we may need to look into to get more precise output or any thing more that we need to do in OpenNLP to make it perfect.
Sample Training data
Assign Doc Type ANM
Realty certificate realty
Lease Doc Type LEA
Lease limited warranty lease

Related

Error when running NiFi ExecuteSQL processor. Datetime column can not be converted to Avro format

everyone. I'm learning about some NiFi processors.
I want to obtain all the data of several tables automatically.
So I used a ListDatabaseTable processor with the aim of getting the tables names that are in a specific catalog.
After that, I used other processors to generate the queries like GenerateTableFetch and
RemplaceText. Everything works perfectly since here.
Finally, ExecuteSQL processor plays a role, and here and error is displayed. It says that a datetime column can not be converted to Avro format.
The problem is that there are several tables so specify those columns would be complicated to cast them.
Is a possible solution to fix the error?
The connection is with Microsoft SQL Server.
Here is the image of my flow :

Elastic Architecture Full Text Searching on 1 million file's content

Summary
I am trying to design an elastic index(s) that will provide a solid foundation for indexing 1,000,000+ Files and full text searching on the contents. New files will be continuously added after the initial digitization process.
Use Case
Various File Types (Pdf, outlook email, mp3, txt, jpeg of handwritten things, ..etc) need to be searchable by their contents and meta-data. Users want to manually tag relationships between documents. ex Document A -> contains information about -> Document B. Users want to be able to see related/similar texts. Users want Named Entity Recognition on the text contents. The physical files are already stored on an external computer just waiting to be processed.
Implementation
File Content extraction pipeline using Apache Tika
NER using Spacy
Upload File Contents + NER Tags to Elastic
Eventually we would run our own search models to gain better search insights + data science.
How do I best store my extracted contents to fit the needs of the user and have a scalable foundation? Is it better to run our trained Named Entity Recognition on initial index or after text extraction has been uploaded to elastic?
Or does it make more sense to use an existing solution from below to not reinvent the wheel?
https://github.com/dadoonet/fscrawler
https://github.com/deepset-ai/haystack
https://github.com/RD17/ambar
Instead of reinventing the wheel, I'd recommend to use existing solutions such as Jina, there's a working example of pdf search using Jina. You can also search across different modalities(text, image, pdf, etc.) using this.

How to do data transformation using Apache NiFi standrad processor?

I have to do data transfomration using Apache NiFi standard processor for below mentioned input data. I have to add two new fields class and year and drop extra price fields.
Below are my input data and transformed data.
Input data
Expected output
Disclaimer: I am assuming that your input headers are not dynamic, which means that you can maintain a predictable input schema. If that is true, you can do this with the standard processors as of 1.12.0, but it will require a little work.
Here's a blog post of mine about how to use ScriptedTransformRecord to take input from one schema, build a new data structure and mix it with another schema. It's a bit involved.
I've used that methodology recently to convert a much larger set of data into summary records, so I know it works. The summary of what's involved is this:
Create two schemas, one that matches input and one for output.
Set up ScriptedTransformRecord to use a writer that explicitly sets which schema to use since ScriptedTransformRecord doesn't support the ability to change the schema configuration internally.
Create a fat jar with Maven or Gradle that compiles your Avro schema into an object that can be used with the NiFi API to expose a static RecordSchema (NiFi API) to your script.
Write a Groovy script that generates a new MapRecord.

How to store streaming data in cassandra

I am new in Cassandra, I am very confused.I know that cassandra write speed is very fast.I want to store twitter data coming from storm.I googled, Every time I got make sstable and load into cluster. If every time I have to make sstable then how it possible to store twitter data streaming in cassandra.
please help me.
How I can store log data, which is generated at 1000log per second.
please correct me if I am wrong
I think Cassandra single node can handle 1000 logs per second without bulk loading if your schema is good. Also depends on the size of each log.
Or you could use Cassandra's Copy From CSV command.
For this you need to create a table first.
Here's an example from datastax website :
CREATE TABLE airplanes (
name text PRIMARY KEY,
manufacturer text,
year int,
mach float
);
COPY airplanes (name, manufacturer, year, mach) FROM 'temp.csv';
You need to specify the names of the columns based on the order in which they will be stored in the CSV. And for values with comma(,) you could enclose them in double quotes (") or use a different delimiter.
For more details refer http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/copy_r.html

Merge multiple document categorizer models in OpenNLP

I am trying to write a map-reduce implementation of Document Categorizer using OpenNLP.
During the training phase, I am planning to read a large amount of files and create a model file as result of the map-reduce computation(may be a chain of jobs). I will distribute the files to different mappers, I would create a number of model files as result of this step. Now, I wish to reduce these model files to a single model file to be used for classification.
I understand that this is not the most intuitive of use cases, but I am ready to get my hands dirty and extend/modify the OpenNLP source code, assuming it is possible to tweak the maxent algorithm to work this way.
In case this seems too far fetched, I request for suggestions to do this by generating document samples corresponding to the input files as output of map-reduce step and reducing them to model files by feeding them to document categorizer trainer.
Thanks!
I've done this before, and my approach was to not have each reducer produce the model, but rather only produce the properly formatted data.
Rather than use a category as a key, which separates all the categories Just use a single key and make the value the proper format (cat sample newline) then in the single reducer you can read in that data as (a string) a bytearrayinputstream and train the model. Of course this is not the only way. You wouldn't have to modify opennlp at all to do this.
Simply put, my recommendation is to use a single job that behaves like this:
Map: read in your data, create category label and sample pair. Use a key called 'ALL' and context.write each pair with that key .
Reduce: use a stringbuilder to concat all the cat: sample pairs into the proper training format. Convert the string into a bytearrayinputstream and feed the training API . Write the model somewhere.
Problem may occur that your samples data is too huge to send to one node. If so, you can write the values to A nosql db and read then in from a beefier training node. Or you can use randomization in your mapper to produce many keys and build many models, then at classification time write z wrapper that tests data across them all and Getz The best from each one..... Lots of options.
HTH

Resources