Extending Stanford NER terms with new terms - stanford-nlp

We need to add terms to named entity extraction tables/model in Stanford and can't figure out how. Use case - we need to build up a set of IED terms over time and want to the Stanford pipeline to extract the terms when found in text files.
Looking to see if this is something someone has done before

Please take a look at http://nlp.stanford.edu/software/regexner/ to see how to use it. It allows you to specify a file of mappings of phrases to entity types. When you want to update the mappings, you update the file and rerun the Stanford pipeline.
If you are interested in how to actually learn patterns for the terms over time, you can take a look at our pattern learning system: http://nlp.stanford.edu/software/patternslearning.shtml

Could you specify the tags you want to apply?
To use the RegexNER all you have to do is build a file with 1 entry per line of the form:
TEXT_PATTERN\tTAG
You would put all of the things you want in your custom dictionary into a file, say custom_dictionary.txt
I am assuming by IED you mean
https://en.wikipedia.org/wiki/Improvised_explosive_device ??
So your file might look like:
VBIED\tIED_TERM
sticky bombs\tIED_TERM
RCIED\tIED_TERM
New Country\tLOCATION
New Person\tPERSON
(Note Stack Overflow has some strange formatting, there should not be blank lines between each entry, it should be 1 entry per line!!)
If you then run this command:
java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner,ner' -file sample_input.txt -regexner.mapping custom_dictionary.txt
you will tag sample_input.txt
Updating is merely a matter of updating custom_dictionary.txt
One thing to be on the lookout for, it matters if you put "ner" first or "regexner" first in your list of annotators.
If your highest priority is tagging with your specialized terms (for instance IED_TERM), I would run the regexner first in the pipeline, since there are some tricky issues with how the taggers overwrite each other.

Related

How can I auto generate inputs.tf and outputs.tf variables when working with Terraform?

Note: Please see the #### UPDATE ### section below. I've heavily modified the question for clarity on what I'm trying to achieve, but added it as an addendum rather than rewrite the question.
As my infrastructure grows, adding input variables in my variables.tf files and then syncing those values to output variables in my outputs.tf file is now impossible to do manually. Not only is it taking up a lot of unnecessary time, probably more time is spent going back and fixing the ones that terraform validate told me that I missed by human error. This is especially true when building / using modules whose arguments add an additional layer to manage.
There has to be a better way? Here is what I want to achieve.
Let's say I'm creating an Azure AKS Kubernetes cluster. The Terraform resource is azurerm_kubernetes_cluster.
Only 8 arguments are required to create a base install, but there are almost 250 additional ones. They all have default values. Per the documentation page, they also already have fantastic descriptions. (I'm tired of copying and pasting into my variables { description = "this"} block.)
The information is there in the documentation. terraform plan also has knowledge of every single additional one because it of course comes up in the pre-apply plan. (known after apply) means its optional, but will have a default value.
In my dream world, I'd run this hypothetical command sequence:
terraform plan
terraform document <- Here it auto generates every argument as a variables block and inserts it into variables.tf. It also auto generates every possible output "out_putable" {} block and inserts it into outputs.tf.
terraform apply -update-inputs -update-outputs <- Here everything that was optional (known after apply) is now known and it should auto update variables.tf and outputs.tf accordingly. Adding a -update-modules flag lets it take care of that additional layer introduced by using modules.
This feels like a problem that has been addressed before. Before I write a custom tool that parses Terraform web docs and the output of terraform show, is there already a way to do this? Terraform-docs is the closest I've come to finding a solution for README.md. If it can do what I need, I haven't figure it out yet.
How can I automate all this?
############
UPDATE
############
This article and video is spot-on when it comes to Terraform's evolution in an organization. My organization is somewhere between late-stage pattern 3 and early 5. As we decompose our "Terralith" we have inconsistencies among teams (patterns, naming conventions, variable and argument choices etc). These are starting to cause errors in CI/CD forcing a ticket-review process that is slowing things down.
All resources have required and optional arguments. But in my organization, we have, for example, additional optional arguments that are required for us.
Scenario: Dev A in Japan creates a resource, forgets an optional variable or two or names them something obscure, etc. Dev B in America is blocked until they can convene and discuss. Given time zones, language differences, ticket review, this one issue is now a week or more delayed.
I need to automate this and create exact consistency so that Dev A starts out with exactly what Dev B would start with or is expecting; and, what CI/CD tests are expecting - templating the initial process, if you will. In other words, I need to remove the human element of manually creating main.tf, variables.tf, outputs.tf, etc.
Here are thoughts on how to achieve this:
Use Golang to autogenerate the files by querying the API
How can I query the API to get a list of all required arguments for a specific resource?
I found that I can query for provider information, but I can't find info to retrieve resource information. My thinking is when a developer wants to create a new resource, He'll run a go or typescript to generate the manifest files along with expected naming conventions, and populate main.tf, variables.tf, outputs.tf, etc, with exactly what data that everyone is expecting. I'm looking form something like curl registry.terraform.io/providers/hashicorp/azurerm/v2.99/resource_group?required=yes This should show me all required arguments along with descriptions and other info I can use straight from the API.
Use CDKTF to generate an HCL manifest.tf file from JSON
How can I use CDKTF to generate an HCL .tf file?
CDKTF is EXACTLY what I'm looking for - except in reverse. HCL is seamlessly compatible with JSON. Running cdktf synth creates ./out/cdk.tf.out I'm so close! How do I turn that file into main.tf?!?
The goal here is to have a master file from which all future manifest files are derived. Whether we use azurerm_kubernetes_cluster 1 time or 1000 times, I know for certain that every argument, every variable name, every desired output is exactly the same. If a chance is needed in our desired structure, it will be updated at the JSON level, and CI/CD can ensure those changes are propagated across instances of its use.
I know that I can use the cdk.out.tf file as a drop in replacement for a module, but I don't want my team members to have to learn typescript or how to read json. If I can create a templatized JSON file containing exactly what I'm expecting users to start with, and if they can run some command like cdktf convert cdk.tf.out --HCL output-file.tf then I've accomplished my goal.
If cdktf synth can create an HCL JSON file, and cdktf convert can take a manifest.tf file and turn it into HCL JSON, can't it do the exact opposite? Turn the HCL JSON file into the human-readable, declarative, manifest.tf file?
Perhaps think of it this way. Terraform has a required file structure for a module if it's to be allowed into the module registry. I'm trying to create a similar required structure for each of the resources our organization uses regardless of when and where it's used.
If your goal is to derive input variables and output values from resource type schemas then Terraform can provide you with the information to do so.
In the working directory of a configuration that already uses the provider whose resource type you want to use, run the following command:
terraform providers schema -json
The result contains a JSON description of all of the resource types available in the providers for the current configuration, and for each one the metadata about its attributes, including the type constraint information and descriptions for each one.
From that you can generate whatever other files you need based on that information.
Note that if you are intending to build modules which export the entire surface area (all inputs and all outputs) of a particular resource type the Terraform documentation explicitly recommends against this, suggesting to just use the resource type directly instead since such a module would often not offer sufficient benefit to outweigh the additional complexity and maintenance overhead it implies:
In principle any combination of resources and other constructs can be factored out into a module, but over-using modules can make your overall Terraform configuration harder to understand and maintain, so we recommend moderation.
A good module should raise the level of abstraction by describing a new concept in your architecture that is constructed from resource types offered by providers.
For example, aws_instance and aws_elb are both resource types belonging to the AWS provider. You might use a module to represent the higher-level concept "HashiCorp Consul cluster running in AWS" which happens to be constructed from these and other AWS provider resources.
We do not recommend writing modules that are just thin wrappers around single other resource types. If you have trouble finding a name for your module that isn't the same as the main resource type inside it, that may be a sign that your module is not creating any new abstraction and so the module is adding unnecessary complexity. Just use the resource type directly in the calling module instead.
I've got the same question and develop a small bash script to create output definitions based on module code
This code required the hcledit tool to extract blocks from hcl code
#!/usr/bin/env bash
set -o pipefail
_hcledit=$(which hcledit)
for tf_file in $(ls *.tf); do
cat $tf_file | $_hcledit block list | while read line; do
block_type="${line%%.*}"
line="${line#*.}"
case $block_type in
locals|output|variable|data) continue; break ;;
module)
output_name=$line
output_description="Module '$output_name' attributes"
output_value="$block_type.$output_name"
;;
resource)
label_kind="${line%.*}"
label_name="${line#*.}"
output_name="${label_kind}_${label_name//[\-]/_}"
output_description="Resource '$label_kind.$label_name' attributes"
output_value="$label_kind.$label_name"
;;
esac
cat <<-EOT
output "$output_name" {
description = "$output_description"
value = $output_value
}
EOT
done
done

Tiny Elasticsearch like search library in ruby?

Is there an alternative to Elasticsearch? I am looking for something lightweight and tiny. Some sort of Ruby library would be ideal. For example, if I don't need to run an application to do the job but instead using that library I can search YAML or JSON files for matching text etc. And based on match score returns the sorted list.
Example here would make it clear.
Let's say I have a single YAML file with array of strings E.g
values:
- “some nice text here”
- “this is another sentence”
- “some nice”
- “nice text here”
So now if I search for “some nice text here” the first line should come on top but also there is a slight similarity with the last two lines, so the search results should return as follows:
some nice text here
nice text here
some nice
Thanks
Sphinx, Solr, come to mind but they're not small. Hell even using PostgreSQL until you really need to scale is fine (and you'll know when you need to scale) and it's probably already baked into your stack.
If your needs are to search YAML files, you could easily write your own naive approac
There's also https://github.com/mezis/fuzzily but it might require some updating.

ElasticSearch What analyzer to use for searching code

I'm writing a search tool for searching code but I'm having a hard time finding the right analyzer to use. I've tried doing a whitespace analyzer but you end up with issues where you might have dbo.My_Procedure and searching "my_procedure" should work as well as searching ".My_Procedure". My idea is to split on special characters but store them into their own tokens as well. But then if you write my_procedure as a search it will just look for my, _ and procedure anywhere in the file unless you wrap it in quotes (even though to the user it looks like it's just one word). What approach have people taken for analyzing code?
If your code is in Java, according to Java naming conventions your methods and classes should be camel-case so you should not run into names like my_search but rather mySearch.
If that is the case - you can use the (default) standard analyzer which uses word boundaries as delimiters for split.
That said, if there's no way around it and you have to consider names like my_search in the tokenizing part, you can implement your own custom analyzer.
This answer shows an example of setting a custom-analyzer.

Ruby Text/Sentiment Analysis

I have two strings -
"I like running around the track.
I like swimming in the pool, but only in the morning.
I need to pull out what people "like" from the above two comments (running around the track and swimming in the pool.
Does anyone have a recommendation for a text analytics gem or other method of pulling in that kind of information? I don't necessarily need word counts or n-grams, I just want to know what words are seen in relation to the word "like".
For a quick-and-dirty fix, you could use a Regex to search for all the forms of "like" and pull out all the text between there and the punctuation mark or Newline character.
You could use a dependency parser such as The Stanford Parser
to parse your text and find the keys words in your sentiment dictionary, and probably put some constraints on the type of dependencies for disambiguation. For example, the dependency needs to be of type "dobj" (direct object). Then follow the dependency structures to the end of phrase or sentence depending on your needs.

Structured debug log

I am writing a complex application (a compiler analysis). To debug it I need to examine the application's execution trace to determine how its values and data structures evolve during its execution. It is quite common for me to generate megabytes of text output for a single run and sifting my way through all that is very labor-intensive. To help me manage these logs I've written my own library that formats them in HTML and makes it easy to color text from different code regions and indent code in called functions. An example of the output is here.
My question is: is there any better solution than my own home-spun library? I need some way to emit debug logs that may include arbitrary text and images and visually structure them and if possible, index them so that I can easily find the region of the output I'm most interested. Is there anything like this out there?
Regardless you didn't mentioned a language applied, I'd like to propose apache Log4XXX family: http://logging.apache.org/
It offers customizable details level as well as tag-driven loggers. GUI tool (chainsaw) can be combined with "old good" GREP approach (so you see only what you're interested in at the moment).
Colorizing, search and filtering using an expression syntax is available in the latest developer snapshot of Chainsaw. The expression syntax also supports regular expressions (using the 'like' keyword).
Chainsaw can parse any regular text log file, not just log files generated by log4j.
The latest developer snapshot of Chainsaw is available here:
http://people.apache.org/~sdeboy
The File, load Chainsaw configuration menu item is where you define the 'format' and location of the log file you want to process, and the expression syntax can be found in the tutorial, available from the help menu.
Feel free to email the log4j users list if you have additional questions.
I created a framework that might help you, https://github.com/pablito900/VisualLogs

Resources