Why am I getting multiple entities tagged as one? - stanford-nlp

I'm using a custom, CRF-based NER model for tagging. The problem is that whenever multiple entities in the test data are separated by punctuation or a stopword, the model tags the whole span as a single entity.
For example, for "India, china" it produces
(u'India', u'B-LOC'),(u',', u'I-LOC'),(u'china', u'I-LOC')
and for "india and australia" it produces
(u'india', u'B-LOC'),(u'and', u'I-LOC'),(u'australia', u'I-LOC')
I have not removed any stopwords or punctuation from my training dataset, and they are labeled 'O'. So why are the punctuation marks and stopwords that occur between two entities tagged as part of a single entity?
Here are the property file and dataset that I used to train the model.
Property file (ner.prop):
trainFile=Clean_Data.tsv
serializeTO=ner-model_cleanGazette_full.ser.gz
map = word=0,answer=1,tag=2
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
qnSize=10
entitySubclassification=IOB2
retainEntitySubclassification=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
useGazettes=true
gazette=gazetta.txt
sloppyGazette=true
Kaggle dataset used (Clean_Data.tsv):
Word ner pos
Thousands O NNS
of O IN
demonstrators O NNS
have O VBP
marched O VBN
through O IN
London B-LOC NNP
to O TO
protest O VB
What else can I add or remove to overcome this problem?
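While the cause in the model is being investigated, one possible stopgap is to repair the tagger output after the fact. The sketch below is only a workaround under stated assumptions, not a fix for the CRF itself: it assumes the output is a list of (token, IOB2 tag) pairs like the ones shown above, and the names `SEPARATORS`, `is_separator`, and `split_merged_entities` are hypothetical helpers, not part of Stanford NER.

```python
import string

# Hypothetical stopword list; extend it to match your own data.
SEPARATORS = {"and", "or", "&"}


def is_separator(token):
    """Return True for a listed stopword or a token made only of punctuation."""
    if token.lower() in SEPARATORS:
        return True
    return token != "" and all(ch in string.punctuation for ch in token)


def split_merged_entities(tagged):
    """Relabel separators inside an entity as 'O' and restart the entity after them."""
    fixed = []
    restart = False  # True when the next I- tag must be turned into a B- tag
    for token, tag in tagged:
        if tag != "O" and is_separator(token):
            fixed.append((token, "O"))
            restart = True
            continue
        if restart and tag.startswith("I-"):
            tag = "B-" + tag[2:]
        restart = False
        fixed.append((token, tag))
    return fixed


merged = [("India", "B-LOC"), (",", "I-LOC"), ("china", "I-LOC")]
repaired = split_merged_entities(merged)
```

Note that this rule would also split genuine multi-word names that contain a separator (for example "Trinidad and Tobago"), so it only makes sense if such names are rare in the data.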

Related

The most efficient way to run inference (zero-shot classification, HuggingFace) on CPU

I have a fairly large dataset (200k records) with two columns:
Text
Labels for prediction
I want to apply a pretrained RoBERTa model for zero-shot classification. Here is how I did it:
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# convert pandas to dataset:
dataset = Dataset.from_pandas(data)
# loading model:
model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer, framework='pt')
hypothesis_template = "Im Text geht es um {}"

# define prediction function and apply it to the dataset
def prediction(record, classifier):
    output = classifier(record['text'], record['label'], hypothesis_template=hypothesis_template)
    record['prediction'] = output['labels'][0]
    record['scores'] = output['scores'][0]
    return record

# map returns a new dataset, so keep the result
dataset = dataset.map(lambda x: prediction(x, classifier=classifier))
But I am not sure whether this is the most efficient way to run inference.
The official page (https://huggingface.co/docs/transformers/main_classes/pipelines) says that I should avoid batching if I am using a CPU. Still, my questions are:
Is the pipeline wrapper fast enough, or should I stick to something more low-level (like native PyTorch)?
Is inference through .map considered good practice? If not, what should be used instead?
With relatively short texts (at most 5-6 words), should batching be used instead of processing one record at a time?
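When estimating the cost, it may help to remember that zero-shot classification is run as natural language inference: each candidate label is turned into a hypothesis with the template and scored against the text, so the work grows with the number of labels per record. The sketch below only illustrates that pairing and a simple chunking helper. `build_nli_pairs` and `chunked` are hypothetical names, the labels are made up, and no model is called.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


template = "Im Text geht es um {}"
pairs = build_nli_pairs("Das Spiel endete 2:1", ["Sport", "Politik"], template)
batches = list(chunked(pairs, 1))
```

With 200k records and several labels each, the number of pairs (and therefore forward passes) is the quantity to measure when comparing one-at-a-time against batched inference.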

Pig latin join by field

I have a Pig Latin problem.
I have the data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
I would like to join these two datasets: for each value in dataset A, look up the corresponding value in dataset B and place it beside the value from A. The expected output is:
FITKA 0.124411, FINVA 0.454535 and so on ..
(It could also look like: FITKA, 0.124411, FINVA, 0.454535 and so on ..)
I could then multiply the values (0.124411 x 0.454535 .. and so on), because they are on the same row, which is what I want.
Of course I can join column by column, but then the values appear at the end of the row and I have to clean that up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be appreciated.

A sentence can be represented as a tuple and contains a bag of tuples (word, count).
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
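To make the intended result concrete, here is a small Python sketch of the lookup-and-multiply logic, using the sample rows from the question. It is not Pig Latin and says nothing about how Pig would execute the join. `attach_values` and `product_of_known` are hypothetical helper names, and skipping words that have no entry in B is an assumption.

```python
from math import prod

# Sample data copied from the question.
row = ("FITKA", "FINVA", "FINVU", "FEEVA", "FETKA", "FINVA")
values = {"FINVA": 0.454535, "FITKA": 0.124411, "FEEVA": 0.123133}


def attach_values(words, lookup):
    """Pair each word with its value from lookup, or None if it has none."""
    return [(word, lookup.get(word)) for word in words]


def product_of_known(pairs):
    """Multiply the values that were found, ignoring words without a value."""
    return prod(value for _, value in pairs if value is not None)


pairs = attach_values(row, values)
result = product_of_known(pairs)
```

Storing the sentence as a bag of (word, value) tuples, as suggested above, gives the same shape as `pairs`: one lookup per word instead of one join per column.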

Linq datatable to get unique rows and their count

I have a data table like this:
country
China
India
Thailand
India
china
china
Thailand
Hong kong
India
How can I get the output shown below using LINQ?
Country Count
India 3
China 2
Thailand 2
Hong kong 1

As Ben Allred pointed out, what you're likely looking for is the LINQ GroupBy method.
Using query syntax, it may look something like this:
var query = from tuple in table
            group tuple by tuple.Country into g
            select new { Country = g.Key, Count = g.Count() };
query now contains an IEnumerable collection of anonymous objects which have as members the string Country and the integer Count representing the number of occurrences of that country in the table.
You can now of course iterate over these objects as such:
foreach (var item in query)
{
    Console.WriteLine("Country : {0} - Count : {1}", item.Country, item.Count);
}
For more examples, I strongly suggest the 101 LINQ Samples.
It's also worth pointing out if you haven't used LINQ before that the processing is deferred, meaning that the iteration over the query object doesn't occur until you try to access any of its items, for example, in the foreach statement. If the collection or reading from table is expensive and you intend to use the results of the query more than once, you can call ToList() on query to return a more tangible, concrete collection.

Relational algebra for one-to-many relations

Suppose I have the following relations:
Academic(academicID (PK), forename, surname, room)
Contact (contactID (PK), forename, surname, phone, academicNO (FK))
I am using Java & I want to understand the use of the notation.
Π( relation, attr1, ... attrn ) means project the n attributes out of the relation.
σ( relation, condition) means select the rows which match the condition.
⊗(relation1,attr1,relation2,attr2) means join the two relations on the named attributes.
relation1 – relation2 is the difference between two relations.
relation1 ÷ relation2 divides one relation by another.
Examples I have seen use three tables. I want to know the logic when only two tables are involved (academic and contact) as opposed to three (academic, contact, owns).
I am using this structure:
LessNumVac = Π( σ( job, vacancies < 2 ), type )
AllTypes = Π( job, type )
AllTypes – LessNumVac
How do I construct the algebra for:
List the names of all contacts owned by academic "John"

List the names of all contacts who are owned by academic "John".
For that, you would join the Academic and Contact relations, filter for John, and project the name attributes. For efficiency, select John before joining:
π_{forename, surname} (Contact ⋈_{academicNO = academicID} (π_{academicID} (σ_{forename = "John"} (Academic))))
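The expression can be traced step by step with a small Python sketch. The functions `select`, `project`, and `join` are minimal stand-ins for σ, Π, and ⋈ over lists of dictionaries, and the sample rows are invented for illustration. Projecting Academic down to academicID before the join also avoids a clash between the forename and surname attributes of the two relations.

```python
def select(relation, predicate):
    """σ: keep the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]


def project(relation, *attrs):
    """Π: keep only the named attributes and drop duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result


def join(left, left_attr, right, right_attr):
    """⋈: combine rows of left and right whose named attributes are equal."""
    return [{**r, **l} for l in left for r in right if l[left_attr] == r[right_attr]]


# Invented sample data for illustration only.
academic = [
    {"academicID": 1, "forename": "John", "surname": "Smith", "room": "A1"},
    {"academicID": 2, "forename": "Mary", "surname": "Jones", "room": "B2"},
]
contact = [
    {"contactID": 10, "forename": "Ann", "surname": "Lee", "phone": "111", "academicNO": 1},
    {"contactID": 11, "forename": "Bob", "surname": "Ray", "phone": "222", "academicNO": 2},
    {"contactID": 12, "forename": "Cat", "surname": "Fox", "phone": "333", "academicNO": 1},
]

johns = project(select(academic, lambda r: r["forename"] == "John"), "academicID")
owned = join(contact, "academicNO", johns, "academicID")
names = project(owned, "forename", "surname")
```

In the question's own prefix notation the same query would read Π( ⊗( Contact, academicNO, Π( σ( Academic, forename = "John" ), academicID ), academicID ), forename, surname ).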

You have to extend your set of operations with natural join ⋈, left outer join ⟕ and/or right outer join ⟖ to express joins.
There is a great Wikipedia article about Relational Algebra. You should definitely read that one!

SQL to Relational Algebra

How do I go about writing the relational algebra for this SQL query?
SELECT patient.name,
       patient.ward,
       medicine.name,
       prescription.quantity,
       prescription.frequency
FROM patient, medicine, prescription
WHERE prescription.frequency = "3perday"
  AND prescription.end-date = "08-06-2010"
  AND canceled = "Y"
Relations...
prescription
    prescription-ref
    patient-ref
    medicine-ref
    quantity
    frequency
    end-date
    cancelled (Y/N)
medicine
    medicine-ref
    name
patient
    patient-ref
    name
    ward

I will just point out the operators you should use:
Projection (π)
π(a1,...,an): The result is defined as the set that is obtained when all tuples in R are restricted to the set {a1,...,an}.
For example π(name) on your patient table would be the same as SELECT name FROM patient
Selection (σ)
σ(condition): Selects all those tuples in R for which condition holds.
For example σ(frequency = "1perweek") on your prescription table would be the same as SELECT * FROM prescription WHERE frequency = "1perweek"
Cross product(X)
R X S: The result is the cross product between R and S.
For example patient X prescription would be SELECT * FROM patient,prescription
You can combine these operators to solve your exercise. Try posting your attempt if you have any issues.
Note: I did not include the natural join as there are no joins. The cross product should be enough for this exercise.
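Combining the three operators as described, a literal translation of the query as written (cross product, no join conditions) might look like the expression below. It assumes the unqualified canceled in the query refers to the cancelled attribute of prescription from the relation listing.

```latex
\pi_{\text{patient.name},\ \text{patient.ward},\ \text{medicine.name},\ \text{prescription.quantity},\ \text{prescription.frequency}}
\Bigl(
  \sigma_{\text{frequency} = \text{"3perday"} \,\land\, \text{end-date} = \text{"08-06-2010"} \,\land\, \text{cancelled} = \text{"Y"}}
  \bigl(\text{patient} \times \text{medicine} \times \text{prescription}\bigr)
\Bigr)
```

Read inside out: build the cross product of the three relations, keep the tuples that satisfy all three conditions, then project the five requested attributes.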

An example would be something like the following. This is only if you accidentally left out the joins between patient, medicine, and prescription. If not, you will be looking for cross product (which seems like a bad idea in this case...) as mentioned by Lombo. I gave example joins that may fit your tables marked as "???". If you could include the layout of your tables that would be helpful.
I also assume that canceled comes from prescription since it is not prefixed.
Edit: If you need it in standard RA form, it's pretty easy to get from a diagram.
[Diagram: http://img532.imageshack.us/img532/8589/diagram1b.jpg]
