I need to detect company names in a given text. I have trained a CRFClassifier with both my training data and a gazette file. When I run the trained classifier on test data, it does not detect companies properly: a company name that appears in the training data is recognized, but a company name that appears only in the gazette file is not. How should I proceed so that these entities are recognized?
The properties file I am using looks like this:
trainFile=training-data.tsv
serializeTo=custom-classification-model.ser.gz
map=word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
useGazettes=true
gazette=gazette.txt
cleanGazette=true
Sample Training Data file
Warburg COMPANY
Pincus COMPANY
has O
agreed O
to O
acquire O
North O
Carolina O
O
based O
Service O
Gazette file data
ACON COMPANY
Investments COMPANY
LLS COMPANY
Post COMPANY
Oak COMPANY
Energy COMPANY
Capital COMPANY
Merrill COMPANY
Lynch COMPANY
International COMPANY
Aion COMPANY
Direct COMPANY
Singapore COMPANY
Your gazette file is not properly formatted.
Each entry should start with the class, followed by the phrase, for example:
CLASS1 this is an example
There is a more detailed answer on the NER FAQ page:
https://nlp.stanford.edu/software/crf-faq.html#gazette
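As a rough sketch of that format, the snippet below writes one "CLASS phrase" line per company name. The helper name `to_gazette_lines` is made up for illustration, and the way the question's tokens are grouped into full company names is a guess, since the original file lists one token per line.

```python
def to_gazette_lines(names, label="COMPANY"):
    """Return one 'LABEL phrase' gazette line per non-empty entity name."""
    lines = []
    for name in names:
        # Collapse repeated whitespace so each entry is a clean phrase.
        phrase = " ".join(name.split())
        if phrase:
            lines.append(f"{label} {phrase}")
    return lines


# Assumed groupings of the tokens from the question's gazette file.
company_names = [
    "Post Oak Energy Capital",
    "Merrill Lynch International",
    "Aion Direct Singapore",
]

gazette_lines = to_gazette_lines(company_names)
print("\n".join(gazette_lines))
```

The printed lines can be saved as gazette.txt, the file named by the gazette property in the question.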
I'm trying to understand the finance interactions a bit. If I have a doctor who takes 2 or 3 insurance plans, which FHIR resources do I need to use to model that?
My current guess is that one needs
An Organization to represent the insurance company, with type = ins
Another Organization to represent the healthcare provider, with type = prov
Somehow attach an InsurancePlan to the network.
The Finance Overview does not cover this use case.
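The guess above could be written out roughly as the following resources, built here as plain Python dictionaries. This is only a sketch under assumptions: the element names `ownedBy` and `network` are taken from the FHIR R4 InsurancePlan resource as I understand it, pointing `network` directly at the provider Organization is a simplification (a real model may use a separate Organization for the network), and all ids and names are invented.

```python
import json

# Code system that, to my knowledge, defines the "ins" and "prov" codes.
ORG_TYPE_SYSTEM = "http://terminology.hl7.org/CodeSystem/organization-type"


def organization(res_id, name, type_code):
    """Build a minimal Organization resource with a single type coding."""
    return {
        "resourceType": "Organization",
        "id": res_id,
        "name": name,
        "type": [{"coding": [{"system": ORG_TYPE_SYSTEM, "code": type_code}]}],
    }


def insurance_plan(res_id, name, owner_id, network_ids):
    """Build a minimal InsurancePlan owned by one Organization."""
    return {
        "resourceType": "InsurancePlan",
        "id": res_id,
        "status": "active",
        "name": name,
        "ownedBy": {"reference": f"Organization/{owner_id}"},
        # Assumption: the provider Organization stands in for the network.
        "network": [{"reference": f"Organization/{nid}"} for nid in network_ids],
    }


insurer = organization("insurer-1", "Example Insurance Co", "ins")
provider = organization("provider-1", "Example Clinic", "prov")

# One InsurancePlan per plan the provider accepts (two hypothetical plans).
plans = [
    insurance_plan(f"plan-{i}", f"Example Plan {i}", insurer["id"], [provider["id"]])
    for i in (1, 2)
]

print(json.dumps(plans[0], indent=2))
```

Whether this is the intended way to link a provider to the plans it accepts is exactly the open question, so treat the structure as a starting point to check against the specification.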
The Scopus Serial API lets you retrieve titles for a particular classification category via the subj parameter. For instance, when I specify subj=COMP&content=journal, I get all the journals in the category "Computer Science (all)", abbrev=COMP, code=1700.
However, this list contains only journals with code=1700; journals from the Computer Science sub-categories are missing. How do I get the journals for a sub-category such as "Computer Science (Software)", which has code=1712 and the same abbrev=COMP?
This seems to be a bug in the API. According to the Scopus Source Title list (https://www.elsevier.com/?a=91122), there are over 2,000 titles in Computer Science; and according to the Scopus Subject Classification API (https://dev.elsevier.com/documentation/SubjectClassificationsAPI.wadl), they all should have the 'COMP' abbreviation (even if they have different sub-classifications, i.e. 17xx codes). But when calling the Scopus Serial API as in your example (https://api.elsevier.com/content/serial/title?subj=COMP), there seem to be at most 330 or so journal records that can be retrieved. We'll report this to our development group.
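For reference, the request discussed above can be built as in this minimal sketch. It only constructs the URL from the subj and content parameters mentioned in the question. The function name `serial_title_url` is made up, and authentication (an API key) is deliberately left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.elsevier.com/content/serial/title"


def serial_title_url(subj, content="journal"):
    """Build the Serial Title request URL for a subject abbreviation."""
    params = {"subj": subj, "content": content}
    return f"{BASE_URL}?{urlencode(params)}"


url = serial_title_url("COMP")
print(url)
```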
I followed the question "Entities on my gazette are not recognized".
Even with the minimal training data example ("Damiano") and the gazette entries, I am not able to get John or Andrea recognized as PERSON.
I also tried this with a large training set and a large gazette, but still no gazette entity gets tagged. Why?
I have two entities, Person and Company, and there are multiple relationships between them:
Person - Company:
The person can be the employee of the company
The person can be the shareholder of the company
The person can be the legal person of the company
Company - Company:
The company can be the legal person of the company
The company can be the shareholder of the company
So how do I model this in Spring Data Neo4j?
What I tried was to define 3 relationship types, EMPLOY, INVEST, and LEGAL, each with the Company as the StartNode and the Person as the EndNode. In Company and Person I then keep these relationships with the "UNDIRECTED" direction, just as in the diagram, but I always get a StackOverflow error when saving and searching.
Yes, the solution is now on GitHub: all the classes are housed in the sample.spring.data.neo4j package, and the corresponding test is sample.spring.data.neo4j.repositories.CompanyRepositoryTest.
The big issue at the beginning was that it always threw a StackOverflow exception, which was due to the Lombok annotations. After removing all the Lombok annotations and using plain getters/setters, everything works.
I'm using Stanford NER and I get some results with the entity "MISC" in the 4-class model (Location, Person, Organization, Misc), but I don't know what this entity really represents. Does anyone know what it is?
Thanks
MISC is a category from the CoNLL 2003 evaluation data which is typically used to develop NER models. Honestly I don't think there is any definition of MISC beyond "is a named entity" and "isn't PERSON, ORG, or LOC".
I found this description on spaCy, for models recognizing PER, LOC, ORG, and MISC:
"MISC: Miscellaneous entities, e.g., events, nationalities, products, or works of art."