How to add our trained data to existing Stanford NER classifier? - stanford-nlp

I have to add my trained data to the existing CRF classifier [english.all.7class.distsim.crf.ser].
Is there any API to extract the existing models or de-serialise them?
Thanks

Unfortunately, no: there is no way to recover the training data from the serialized model files. Nor is there a way to [easily] continue training from a serialized model. Part of this is by design: Stanford is not allowed to redistribute the data that was used to train the classifier. If you have access to the data, you can of course train with it alongside the new data, as per http://nlp.stanford.edu/software/crf-faq.shtml#a
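If you do have the original annotated data plus your own, the training step from that FAQ can also be driven programmatically. A minimal sketch (the file names and property values below are placeholders I chose, not anything shipped with Stanford NER):

    import java.util.Properties;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class TrainCombinedNer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Comma-separated CoNLL-style TSV files: the original training data
            // (if you are licensed to use it) plus your new annotations.
            props.setProperty("trainFileList", "original-data.tsv,my-new-data.tsv");
            // Column 0 holds the token, column 1 the gold NER label.
            props.setProperty("map", "word=0,answer=1");

            CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
            crf.train();  // trains a single model over both files
            crf.serializeClassifier("combined-ner-model.ser.gz");
        }
    }

The same thing can be done from the command line with a properties file, as shown in the FAQ.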

Related

How do I use huggingface models offline? Also, how can I retrain those models with my own datasets for a specific task?

I want to use a huggingface model offline. I also want to use models for a particular task, trained on a specific dataset.
I expect to retrain my model with specific datasets and then be able to use it offline.

How to export models trained in AutoML to saved model or frozen graph?

I have trained a model on Google AutoML. I wish to use it along with other custom models I have built, hence I need the models trained in AutoML in frozen-graph or SavedModel format. Any direction would be much appreciated.
You cannot do that in Google AutoML. You can, however, train the model on your local machine, export it to a platform such as AWS SageMaker, and deploy it there.

How to implement Data Lineage on Hadoop?

We are implementing a few business flows in the financial area. The requirement from the regulator (unfortunately, not very specific) is to have data lineage for auditing purposes.
The flow contains two parts: synchronous and asynchronous. The synchronous part is a payment attempt containing a bunch of information about the point of sale, the customer and the goods. The asynchronous part is a batch process that feeds the credit-assessment data model with a newly calculated portion of variables on an hourly basis. The variables might include aggregations such as balances and links to historical transactions.
For calculating the asynchronous part we ingest the data from multiple relational DBs and store it in HDFS in a raw format (rows from tables in CSV format).
Once the data has been stored on HDFS, a Spring XD-based job is triggered that calculates some aggregations and produces the data for the synchronous part.
We have relational data, raw data on HDFS and MapReduce jobs relying on POJOs that describe the relevant semantics and the transformations implemented in SpringXD.
So, the question is how to handle the auditing in the scenario described above?
We need at any point in time to be able to explain why a specific decision was made and also be able to explain how each and every variable used in the policy (synchronous or near-real-time flow) was calculated.
I looked at the existing Hadoop stack and it looks like no tool currently provides good, enterprise-ready auditing capabilities.
My thinking is to start with a custom implementation that includes:
A business glossary with all the business terms
Operational and technical metadata - logging transformation execution for each entry into a separate store (a minimal sketch of such a record is shown after this list).
Logging changes to the business logic (using data from the version control system where the business rules and transformations are kept).
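For the operational/technical metadata point, the per-entry record I have in mind would be something like the sketch below (purely illustrative; the class and field names are mine, not from any existing tool):

    import java.time.Instant;

    // One lineage/audit entry, written to a separate store every time a
    // transformation produces a variable for the credit-assessment model.
    public class LineageRecord {
        String recordId;              // business key of the input row
        String sourceSystem;          // relational DB / HDFS file it came from
        String transformationId;      // identifier of the SpringXD/MapReduce step
        String transformationVersion; // commit hash of the business rule in version control
        String outputVariable;        // e.g. "rolling_balance_30d" (made-up name)
        String outputValue;           // value fed into the policy
        Instant executedAt;           // when the transformation ran

        @Override
        public String toString() {
            return String.join("|", recordId, sourceSystem, transformationId,
                    transformationVersion, outputVariable, outputValue,
                    String.valueOf(executedAt));
        }
    }

With one such record per variable per run, explaining a decision becomes a lookup by record id and timestamp, and the version field ties it back to the change log of the business rules.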
Any advice or sharing your experience would be greatly appreciated!
Currently Cloudera sets the industry standard for Data Lineage/Data Governance in the big data space.
A business glossary, metadata, and the history of (versions of) queries that have been run can all be facilitated.
I do realize some of this may not have been in place when you asked the question, but it certainly is now.
Disclaimer: I am an employee of Cloudera

How is social media data unstructured data?

I recently began reading up on big data and how there are tools like Hadoop or BigInsights that can manage both structured and unstructured data.
Social Media Analytics is something that can be done on BigInsights, and it takes unstructured data and analyzes/structures it accordingly.
This got me wondering: how is social media data unstructured? For example, the information about tweets can be retrieved using the Twitter REST API and is returned to you in a structured JSON format.
So isn't Social Media data already structured? If so why do you need a platform that manages mainly unstructured data?
Some make the distinction "semi-structured", too.
But the point is the ability to query the data. Yes, Tweets etc. usually have some structure. But it's not helpful for analysis.
Given an ugly SQL schema, you could indeed run a query like
SELECT AVG(TweetID) FROM Twitter;
but that functionality is useless in practice. And that is probably why the data is best considered unstructured: you do not benefit from squeezing it into a relational schema.
Beware of buzzword bingo with big data, though. More often than not, "supports unstructured data" actually means "does not benefit from structure in your data (by using indexes) but rereads data every time".
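To make the "semi-structured" point concrete: a tweet payload has a few stable scalar fields, but much of it is nested, optional, or variable-length, which is exactly what does not fit a fixed relational schema. A rough sketch using Jackson (the JSON below is a simplified, made-up payload, not the real Twitter schema):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class TweetShapeDemo {
        public static void main(String[] args) throws Exception {
            // Simplified, invented payload; real tweets carry many more
            // optional and nested fields than this.
            String json = "{\"id\": 123, \"text\": \"big data!\","
                    + " \"user\": {\"name\": \"alice\", \"followers\": 42},"
                    + " \"entities\": {\"hashtags\": [\"bigdata\"], \"urls\": []}}";

            JsonNode tweet = new ObjectMapper().readTree(json);

            // The flat fields would map easily to table columns...
            long id = tweet.path("id").asLong();
            String text = tweet.path("text").asText();

            // ...but nested objects and variable-length arrays do not,
            // and the text itself still needs NLP before it is useful.
            String author = tweet.path("user").path("name").asText();
            int hashtagCount = tweet.path("entities").path("hashtags").size();

            System.out.printf("%d \"%s\" by %s, %d hashtag(s)%n",
                    id, text, author, hashtagCount);
        }
    }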
It's not only about getting the tweets. The real value of the data is in knowing what is being tweeted about. Consider Facebook, where we can comment on any picture or video. We need a platform to know how many comments are positive about the video, how many are criticizing it, how many are genuine feedback about it, and how many offer suggestions for making it better. You also need to know how many times the video was shared and liked, and who those people are: who likes it and who dislikes it. So many varieties of data can be collected like this, and hence it is all called unstructured data.

How should business rules be implemented in ETL?

I work on a product that imports data from a mainframe using SSIS via flat file. The SSIS packages use a staging database to transform the flat-file data and then call stored procedures in the ODS to load the transformed data. There is a potential plan to route all ETL data through a .NET service layer (instead of directly to the ODS via stored procedures) to centralize business rules/activity, etc. I'm looking for input on this approach and for dissenting opinions.
Sounds fine; you're turning basic ETL into ETVL, adding a "validate" step. Normally this is considered part of the "transform" stage, but I prefer to keep that stage purer when I conceptualize an architecture like this; transform is turning the raw fields which were pulled out and chopped up in the extract stage into objects of my domain model. Verifying that those objects are in a valid state for the system is validation.
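The stack in the question is SSIS/.NET, but the validate step itself is language-agnostic; as a rough illustration (all names invented), the service-layer check might look like:

    import java.math.BigDecimal;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative domain object produced by the "transform" stage.
    class Payment {
        String accountNumber;
        BigDecimal amount;
        String currency;
    }

    // The "V" in ETVL: confirm the transformed object is in a valid state
    // before it is loaded into the ODS.
    class PaymentValidator {
        List<String> validate(Payment p) {
            List<String> errors = new ArrayList<>();
            if (p.accountNumber == null || p.accountNumber.isEmpty()) {
                errors.add("missing account number");
            }
            if (p.amount == null || p.amount.signum() <= 0) {
                errors.add("amount must be positive");
            }
            if (p.currency == null || p.currency.length() != 3) {
                errors.add("currency must be a three-letter ISO code");
            }
            return errors; // an empty list means the record may be loaded
        }
    }

Records that fail validation can be routed to an error table instead of the ODS, which also keeps the transform stage free of business-rule checks.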
