How to make Vertex AI multi-label classification AutoML not ignore texts with no labels?

I prepared a training dataset for multi-label classification in JSON Lines format, as described in the docs.
My upload file looks like this:
{
  "textContent": "This text corresponds to 2 labels",
  "classificationAnnotations": [
    {"displayName": "LABEL_1"},
    {"displayName": "LABEL_2"}
  ]
}
{
  "textContent": "This text doesn't correspond to any labels",
  "classificationAnnotations": []
}
// ... and 5,853 other lines
Only 1,037 texts have a non-empty list of labels. The other texts are considered "Unlabeled", and AutoML ignores unlabeled texts.
As a workaround, I added an extra label to every text:
{
  "textContent": "This text corresponds to 2 labels",
  "classificationAnnotations": [
    {"displayName": "LABEL_1"},
    {"displayName": "LABEL_2"},
    {"displayName": "EXTRA_LABEL"}
  ]
}
{
  "textContent": "This text doesn't correspond to any labels",
  "classificationAnnotations": [
    {"displayName": "EXTRA_LABEL"}
  ]
}
// ... and 5,853 other texts
Is there a way to make AutoML use "Unlabeled" texts as texts with 0 labels?

We often map unlabeled texts to an all-zero label vector for training. I don't think this can be done in AutoML for now.
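If you do use the extra-label workaround, the relabelling can be scripted rather than done by hand. A minimal Ruby sketch (the file names are hypothetical, and it assumes each record occupies a single line, as the JSON Lines format requires):

require 'json'

# Append EXTRA_LABEL to every record so AutoML no longer treats
# label-less texts as "Unlabeled" (file names are placeholders).
File.open('train_with_extra_label.jsonl', 'w') do |out|
  File.foreach('train.jsonl') do |line|
    record = JSON.parse(line)
    record['classificationAnnotations'] ||= []
    record['classificationAnnotations'] << { 'displayName' => 'EXTRA_LABEL' }
    out.puts JSON.generate(record)
  end
end

At prediction time you would then simply ignore EXTRA_LABEL in the model's output.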

Related

What is this data structure in YAML?

I am trying to make the cleanest YAML data structure possible (by clean I mean the least amount of extraneous markup).
What I am trying to create is a list of sections, where each section is itself a list of different types of objects.
In JSON this would look like
{
  "sections": [
    [
      {"p": "Words"}
    ],
    [
      {
        "ul": "more words",
        "p": "other"
      }
    ]
  ]
}
This is what I have in YAML so far:
sections:
  -
    - p: 'Test words.'
    - ul:
      - "Words"
      - "More words"
  -
    - p: "Other words"
I am confused by the dashes.
In the example above, do both p and ul need dashes (to be part of the same object), or is just the first dash necessary?
I.e., is this functionally the same?
sections:
  -
    - p: 'Test words.'
      ul:
      - "Words"
      - "More words"
Further, what do dashes with no content after them (like the first dash under sections) denote?
sections:
  -              # this dash
    - p: 'Test words.'
Your JSON example can be written like this in YAML:
---
sections:
- - p: Words
- - p: other
    ul: more words
There are two hyphens (dashes), because it is a nested sequence (aka list, array).
Every sequence item starts with a hyphen in YAML.
The JSON example is also valid YAML because JSON is a subset of YAML.
YAML was invented as an alternative to XML, around the same time that JSON was invented independently. When the YAML authors became aware of JSON, they modified the syntax in version 1.2 to be compatible with it, though YAML already supported a similar compact style before that.
The example could also be written like this:
---
sections:
- [
    {p: Words}
  ]
- [
    {
      ul: more words,
      p: other,
    }
  ]
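As for whether dropping the second dash is functionally the same: it is not, because every leading hyphen starts a new sequence item. A quick way to see this is to parse both variants and compare the structures; a small Ruby sketch using the standard yaml library:

require 'yaml'

# One dash per mapping: the inner sequence has TWO items.
two_items = YAML.load(<<~YAML)
  sections:
  - - p: 'Test words.'
    - ul: ['Words', 'More words']
YAML

# Only the first dash: p and ul are keys of ONE mapping.
one_item = YAML.load(<<~YAML)
  sections:
  - - p: 'Test words.'
      ul: ['Words', 'More words']
YAML

p two_items['sections'][0].length  # => 2
p one_item['sections'][0].length   # => 1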
More about nested sequences and mappings

How to extract the words behind and above a keywords range in TokenRegex?

I am using TokensRegex for my word-extraction project and have run into the following problem.
E.g., sentence: Someone guesses to go there and have a lunch today
If I want to extract "to go there and have a lunch", how do I set the pattern so that it extracts only the words between "guesses" and "today"?
I have tried
{
  ruleType: "tokens",
  pattern: ( [ !{ word:/guesses/ } ] []{1,} [ { word:/today/ } ] ),
  result: "Weight"
}
It highlights the words "to go there and have a lunch today", not "to go there and have a lunch".
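One thing worth trying (an untested sketch, not a verified rule) is to wrap the middle tokens in a capture group, so that "guesses" and "today" anchor the pattern but only the group is returned; it assumes $$1.text retrieves the text of the first capture group:

{
  ruleType: "tokens",
  pattern: ( [ { word:/guesses/ } ] ( []{1,} ) [ { word:/today/ } ] ),
  result: $$1.text
}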

In Logstash how to extract substring in a bigger string?

I'm having difficulty writing grok patterns. Please help.
A service method called GetIndicatorsByAnalysisProcessIDServlet is invoked, and from this I want to extract only GetIndicatorsByAnalysisProcess; the GetIndicatorsByAnalysisProcess part will not always be the same.
The challenging part, I felt, is truncating the string from the end.
I tried:
grok {
  match => ["destinationid", "(?<fieldname>discard.{7})"]
}
but it matches by counting characters from the start of the string.
If I understand you correctly, you need to have the first word in a variable.
This is achievable via
(?<fieldname>[^\s]*)\s*
with sample output from it
{
  "fieldname": [
    [
      "GetIndicatorsByAnalysisProcessIDServlet"
    ]
  ]
}
In case you have various beginnings with optional spaces but an exactly same ending of the sentence, the effective regexp will be different.
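For example, if the beginnings vary but the value always ends in the literal IDServlet (as in your sample), you can anchor on that suffix and capture everything before it. A sketch, assuming the suffix really is constant:

grok {
  match => ["destinationid", "(?<fieldname>.*?)IDServlet$"]
}

For destinationid = GetIndicatorsByAnalysisProcessIDServlet this leaves GetIndicatorsByAnalysisProcess in fieldname.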

Stanford TokensRegex: how to set normalized annotation using normalized output of NER annotation?

I am creating a TokensRegex annotator to extract the number of floors a building has (just an example to illustrate my question). I have a simple pattern that will recognize both "4 floors" and "four floors" as instances of my custom entity "FLOORS".
I would also like to add a NormalizedNER annotation, using the normalized value of the number entity used in the expression, but I can't get it to work the way I want to:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
ENV.defaults["ruleType"] = "tokens"
{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) )
}
The rules above only set the NormalizedNER fields in the output to the text value of the number, "4" and "four" for the above examples respectively. Is there a way to use the NUMBER entity's normalized value ("4.0" both for "4" and "four") as the normalized value for my "FLOORS" entity?
Thanks in advance.
Try changing
action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) )
to
action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.normalized) )
Annotate takes three arguments
arg1 = object to annotate (typically the matched tokens indicated by $0)
arg2 = annotation field
arg3 = value (in this case you want the NormalizedNER field instead of the text field)
With $$1.normalized as you suggested, running on the input "The building has seven floors" yields the following error message:
Annotating file test.txt { Error extracting annotation from seven floors }
It might be because the NamedEntityTagAnnotation key is not already present for the token represented by $$1. I suppose, before running TokensRegex, you'd want to make sure that your numeric tokens - either "four" or "4" in this case - have the corresponding normalized value - "4.0" in this case - set to their NamedEntityTagAnnotation key.
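Concretely, that means running the ner annotator (which performs the numeric normalization) before the TokensRegex rules are applied. A sketch of the assumed CoreNLP properties, using the tokensregex annotator name and its rules property:

# ner must run before the TokensRegex rules so that NUMBER tokens
# already carry NormalizedNamedEntityTagAnnotation
annotators = tokenize, ssplit, pos, lemma, ner, tokensregex
tokensregex.rules = floors.rules

Here floors.rules is a placeholder for the rules file shown above.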
Also, could you please direct me to where I can find more information on the possible 3rd arguments of Annotate()? Your Javadoc page for TokensRegex expressions doesn't list $$n.normalized (perhaps it needs updating?)
I believe, that what $$n.normalized would do, would be to retrieve the value which, in Java code, would be the equivalent of coreLabel.get(edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation.class) where coreLabel is of type CoreLabel and corresponds with $$n in TokensRegex.
This is because of the following line in your TokensRegex: normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
The correct answer is based on @AngelChang's answer and comment; I'm just posting it here for the sake of orderliness.
The rule has to be modified so the 2nd Annotate() action's 3rd parameter is $1[0].normalized:
{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $1[0].normalized) )
}
According to @Angel's comment:
$1[0].normalized is the "normalized" field of the 0th token of the 1st capture group (as a CoreLabel). The $$1 gives you back the MatchedGroupInfo, which has the "text" field but not the normalized field (since that is on the actual token).

How do I create JSON from parsed HTML table using Nokogiri?

I want to create a JSON object from each tr from this site.
For now I'm able to get the whole table and each tr, but that is not enough... That's why I'm confused.
Here is an example of the JSON I want in return. This one is for chapter 07:
{
  "chapter": "07",
  "title": "LIFTING AND SHORING",
  "description": "This chapter shall...",
  "section": [
    {
      "number": "00",
      "title": "GENERAL",
      "description": ""
    },
    {
      "number": "10",
      "title": "JACKING",
      "description": "Provides information relative..."
    },
    {
      "number": "20",
      "title": "SHORING",
      "description": "Those instructions necessary..."
    }
  ]
}
What I need is to get this whole thing at once, but here is what I've managed so far:
parsed_html.css("table")[1].css("tr")
I'm using Nokogiri for parsing.
This is really not a good problem to choose for learning Ruby. The data is awkward and unreliable, and it would be a useful programming challenge in a language you knew relatively well. When you are learning a language you need tasks that are relatively simple but will test your knowledge of the language itself.
I have written this, which complies at least with your example for chapter 07.
It works by choosing a (the only) table from the page that has more than one row. Then it iterates through those rows, extracting an array of fields, converting non-breakable spaces to ordinary spaces and stripping leading and trailing spaces. All empty fields are discarded, and the whole row is ignored if it contains no data.
A row whose first column starts with decimal digits indicates the first line of a chapter; if the digits are preceded by a hyphen, the row holds section information for the current chapter.
If a field was absent from the source (the description fields and the section titles) I would normally choose to omit it from the intermediate data. However I have defaulted these fields to an empty string to comply with your example of expected JSON output. (There is a difference between a non-existent hash element and one that has a value of nil.)
I hope this helps.
require 'open-uri'
require 'nokogiri'
require 'json'

# Note: on Ruby 3.0+ use URI.open instead of Kernel#open for open-uri.
open('http://www.s-techent.com/ATA100.htm') do |f|
  doc = Nokogiri::HTML(f)
  # The data table is the only one on the page with more than one row.
  table = doc.at_xpath('//table[count(tr) > 1]')
  chapters = []
  chapter = nil
  table.xpath('tr').each do |tr|
    td = tr.xpath('td')
    # Convert non-breakable spaces to ordinary spaces and strip each field.
    td = td.map { |cell| cell.content.gsub("\u00A0", ' ').strip }
    td = td.select { |txt| not txt.empty? }
    next if td.empty?
    if td[0] =~ /^\d+/
      # A leading number marks the first line of a new chapter.
      chapters << chapter if chapter
      chapter = {
        'chapter' => td[0],
        'title' => td[1],
        'description' => td[2] || ''
      }
    elsif td[0] =~ /^-(\d+)/
      # A leading hyphen marks a section row for the current chapter.
      section = {
        'number' => $1,
        'title' => td[1] || '',
        'description' => td[2] || ''
      }
      chapter['section'] ||= []
      chapter['section'] << section
    end
  end
  chapters << chapter if chapter
  puts JSON.pretty_generate(chapters)
end
(partial) output
{
  "chapter": "07",
  "title": "LIFTING AND SHORING",
  "description": "This chapter shall include the necessary procedures to lift and shore aircraft in any of the conditions to which it may be subjected. Includes lifting and shoring procedures that may be employed during aircraft maintenance and repair.",
  "section": [
    {
      "number": "00",
      "title": "GENERAL",
      "description": ""
    },
    {
      "number": "10",
      "title": "JACKING",
      "description": "Provides information relative to jack points, adapters, tail supports, balance weights, jacks and jacking procedures utilized during aircraft maintenance and repair."
    },
    {
      "number": "20",
      "title": "SHORING",
      "description": "Those instructions necessary to support the aircraft during maintenance and repair. Includes information on shoring materials and equipment, contour dimensions, shoring locations, etc."
    }
  ]
},
This problem is very difficult in general because the markup has been done manually and very badly, and there is no reliable way of extracting the data across updates.
For instance
There are two Chapter 01s: INTRODUCTION and OPERATIONS INFORMATION
The chapter number is sometimes just numeric, like 05, and sometimes a mixture, like 72(R)
Chapters up to 23 have the title in the second column with a colspan="2" attribute on the td element, but chapters after that have a blank second column and the title in the third column
There is erratic and spurious use of non-breakable space U+00A0, which the String class doesn't recognize as whitespace
There are blank rows with a grey background (bgcolor="#CCCCCC") that could be used to separate the chapter information, but again we would be relying on the accuracy of manual input
Is anything to be done with the GROUP DEFINITION lines in the table?
This would be reasonably straightforward if the program doesn't have to extract data from other similar pages, or from a (manually) modified version of the same page. Otherwise you have to accept that manually-entered data can't be parsed reliably, and give up.
