OpenNLP training data for organisation - opennlp

I am training my data for the OpenNLP organisation entity finder from the command line, but it's showing a null pointer exception.
I have used:
opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data en-ner-person.train -encoding UTF-8

I think your problem is that you are using training data for the "person" type.
So first, you should create training data for the "organization" type:
$ bin/opennlp TokenNameFinderConverter conll03 -data eng.train -lang en -types org > corpus_train_org.txt
Then train your model:
$ bin/opennlp TokenNameFinderTrainer -lang en -encoding utf8 -iterations 500 -data corpus_train_org.txt -model en_ner_organization.bin
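If you don't have the CoNLL03 corpus to convert, you can also annotate your own training file in OpenNLP's name-finder format: one whitespace-tokenized sentence per line, with each entity wrapped in <START:organization> ... <END> tags. A minimal sketch (the organisation names here are invented; a usable model needs several thousand such sentences):
<START:organization> Acme Corp <END> announced a partnership with <START:organization> Globex <END> today .
The agreement was signed at the headquarters of <START:organization> Acme Corp <END> .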

Select-String Doesn't Show All Matches With Get-AppxPackage

I get all the packages installed on my PC using Get-AppxPackage and I'm trying to find all the matches in that with N lines before and after using Select-String.
However, Select-String only shows each match as a single line, and it doesn't show all of the matches either. This only happens when I pipe the output from Get-AppxPackage, not if I write it to a file first and then do cat <filename> | select-string ....
The example below shows the two results, from the pipe and from cat. I'm interested in results like those from cat, i.e. detailed info about the app.
So what am I doing wrong here? Why is the output different?
Example (everyone should have MS Edge, so I'll use that):
PS > Get-AppxPackage | Select-String -pattern 'edge' -context 3, 3 -allmatches
Microsoft.Windows.StartMenuExperienceHost_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
Microsoft.Windows.Cortana_1.13.0.18362_neutral_neutral_cw5n1h2txyewy
Microsoft.AAD.BrokerPlugin_1000.18362.329.0_neutral_neutral_cw5n1h2txyewy
> Microsoft.MicrosoftEdge_44.18362.329.0_neutral__8wekyb3d8bbwe
Microsoft.Windows.CloudExperienceHost_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
Microsoft.Windows.ContentDeliveryManager_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
Windows.CBSPreview_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
Microsoft.Windows.Apprep.ChxApp_1000.18362.329.0_neutral_neutral_cw5n1h2txyewy
Microsoft.Win32WebViewHost_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
Microsoft.PPIProjection_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
> Microsoft.MicrosoftEdgeDevToolsClient_1000.18362.329.0_neutral_neutral_8wekyb3d8bbwe
Microsoft.LockApp_10.0.18362.329_neutral__cw5n1h2txyewy
> Microsoft.EdgeDevtoolsPlugin_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
Microsoft.ECApp_10.0.18362.329_neutral__8wekyb3d8bbwe
Microsoft.CredDialogHost_10.0.18362.329_neutral__cw5n1h2txyewy
Microsoft.BioEnrollment_10.0.18362.329_neutral__cw5n1h2txyewy
PS > cat .\appx-packages.txt | select-string -pattern 'edge' -context 3, 3 -allmatches
SignatureKind : System
Status : Ok
> Name : Microsoft.MicrosoftEdge
Publisher : CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
Architecture : Neutral
ResourceId :
Version : 44.18362.329.0
> PackageFullName : Microsoft.MicrosoftEdge_44.18362.329.0_neutral__8wekyb3d8bbwe
> InstallLocation : C:\Windows\SystemApps\Microsoft.MicrosoftEdge_8wekyb3d8bbwe
IsFramework : False
> PackageFamilyName : Microsoft.MicrosoftEdge_8wekyb3d8bbwe
PublisherId : 8wekyb3d8bbwe
IsResourcePackage : False
IsBundle : False
SignatureKind : System
Status : Ok
> Name : Microsoft.MicrosoftEdgeDevToolsClient
Publisher : CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
Architecture : Neutral
ResourceId : neutral
Version : 1000.18362.329.0
> PackageFullName : Microsoft.MicrosoftEdgeDevToolsClient_1000.18362.329.0_neutral_neutral_8wekyb3d8bbwe
> InstallLocation : C:\Windows\SystemApps\Microsoft.MicrosoftEdgeDevToolsClient_8wekyb3d8bbwe
IsFramework : False
> PackageFamilyName : Microsoft.MicrosoftEdgeDevToolsClient_8wekyb3d8bbwe
PublisherId : 8wekyb3d8bbwe
IsResourcePackage : False
IsBundle : False
SignatureKind : System
Status : Ok
> Name : Microsoft.EdgeDevtoolsPlugin
Publisher : CN=Microsoft Windows, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
Architecture : Neutral
ResourceId : neutral
Version : 10.0.18362.329
> PackageFullName : Microsoft.EdgeDevtoolsPlugin_10.0.18362.329_neutral_neutral_cw5n1h2txyewy
> InstallLocation : C:\Windows\SystemApps\Microsoft.EdgeDevtoolsPlugin_cw5n1h2txyewy
IsFramework : False
> PackageFamilyName : Microsoft.EdgeDevtoolsPlugin_cw5n1h2txyewy
PublisherId : cw5n1h2txyewy
IsResourcePackage : False
IsBundle : False
tl;dr
As of PowerShell v7.3, to make Select-String search non-string input by the same rich string representation you'd see in the console (terminal), you must pipe the input to oss first:
Get-AppxPackage | oss | Select-String -Pattern 'edge' -Context 3, 3
Note:
This intermediate step should not be necessary, as discussed in GitHub issue #10726.
In fact, it already isn't necessary when you're piping non-string input to an external program instead, say findstr.exe: PowerShell then implicitly uses the rich string representation, because it has to send the input as (meaningful) strings.[1]
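For instance, the following sketch relies on that implicit formatting (findstr.exe ships with Windows):
# No oss needed: the external program receives the for-display representation,
# so this matches lines such as 'Name : Microsoft.MicrosoftEdge'
Get-AppxPackage | findstr /i "edge"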
As an aside: Select-String is convenient for searching through lines of text and - with oss - for searching the formatted string representations of objects, without having to know or care about their specific structure.
However, if you do know the structure, using OO techniques via different cmdlets, such as Where-Object and Select-Object, is more robust. For instance, the following filters packages by their .Name property and then selects only the properties of interest:
# Note: Get-AppXPackage *edge* would obviate the need for Where-Object
Get-AppXPackage |
  Where-Object Name -like *edge* |
  Select-Object Name, Version
Background information:
Select-String, when given input other than strings, uses simple .ToString() stringification[2] on each input object before looking for the given pattern.
In your case, the [Microsoft.Windows.Appx.PackageManager.Commands.AppxPackage] instances output by Get-AppXPackage stringify to the full package names (e.g., Microsoft.MicrosoftEdge_44.18362.387.0_neutral__8wekyb3d8bbwe), which explains your output.
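You can see that stringification directly; a quick check (assuming Edge is installed):
# Each package stringifies to its full package name
(Get-AppxPackage Microsoft.MicrosoftEdge).ToString()
# -> e.g. Microsoft.MicrosoftEdge_44.18362.329.0_neutral__8wekyb3d8bbwe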
In order to make Select-String search the for-display string representations of objects - as they would print to the console and as they would appear in a file saved with > / Out-File (cat is Get-Content's built-in alias on Windows) - you must, surprisingly, use Out-String -Stream as an intermediate pipeline segment; since PowerShell v5, if memory serves, you can use the built-in wrapper function oss for brevity:
# oss is a built-in wrapper function for Out-String -Stream
Get-AppxPackage | oss | Select-String -Pattern 'edge' -Context 3, 3
Out-String uses PowerShell's formatting system to produce human-friendly display representations of the input objects, the same way that default console output, the Format-* cmdlets, and > / Out-File do.
-Stream causes the output lines to be sent through the pipeline one by one, so that Select-String can match individual lines.
Given that the solution is both non-obvious and cumbersome, it would be nice if Select-String directly supported this behavior, ideally by default, but at least on an opt-in basis via a switch parameter - see feature request #10726 on GitHub - up-vote the proposal there if you agree.
[1] As of v7.3, PowerShell only "speaks text" when communicating with external programs, so it has to create a string representation of non-string objects when passing data to them: see this answer. While it makes sense to default to the for-display string representation of such objects, note that this representation isn't meant for programmatic processing; for the latter, it's best to explicitly output a structured text format, such as JSON via ConvertTo-Json.
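For example, such a structured hand-off might look like this (property names as seen in the cat output above):
# Emit structured JSON rather than for-display text
Get-AppxPackage | Select-Object Name, Version, PackageFullName | ConvertTo-Json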
[2] More accurately, .psobject.ToString() is called, either as-is, or - if the object's ToString method supports an IFormatProvider-typed argument - as .psobject.ToString([cultureinfo]::InvariantCulture) so as to obtain a culture-invariant representation - see this answer for more information.

Insert Special characters 'Mongolian tögrög' and symbol '₮' in Oracle Database

I need to insert the currency Mongolian tögrög and its symbol ₮ into an Oracle database.
The insert query is:
INSERT INTO CURRENCY (CUR_ISO_ID, CUR_ISO_CODE, CUR_DESC, CUR_DECIMAL_PLACE, CUR_SYMBOL)
VALUES (496,'MNT','Mongolian tögrög',2,'₮');
results in:
CUR_ISO_ID | CUR | CUR_DESC         | CUR_DECIMAL_PLACE | CUR_SYMBOL
-----------+-----+------------------+-------------------+-----------
       496 | MNT | Mongolian t?gr?g |                 2 | .
How can I get the special characters inserted into the database as is, i.e. the symbol as ₮ rather than ., and the description as Mongolian tögrög rather than Mongolian t?gr?g?
Before you launch SQL*Plus, enter these commands:
chcp 65001
set NLS_LANG=.AL32UTF8
The first command sets the code page of cmd.exe to UTF-8.
The second command tells the database: "I am using UTF-8".
Then your SQL should work. I don't think any of the 8-bit Windows code pages 125x supports the Mongolian tögrög.
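You can also verify that the database itself was created with a Unicode character set (for ₮ it needs to be, e.g., AL32UTF8):
-- The database character set must be Unicode-capable to hold ₮
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter = 'NLS_CHARACTERSET';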
See also this post for some more information: NLS_LANG and others.
Check also this discussion of how to use SQL*Plus with UTF-8 on the Windows command line; there is a known issue when you use UTF-8 there.

About the Stanford CoreNLP Chinese model

How do I use the Chinese model? I downloaded "stanford-corenlp-3.5.2-models-chinese.jar" into my classpath, and I copied
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.5.2</version>
    <classifier>models-chinese</classifier>
</dependency>
into my pom.xml file. In addition, my input.txt is:
因出席中國大陸閱兵引發爭議的國民黨前主席連戰今晚金婚宴,立法院長王金平說,已向連戰恭喜,等一下回南部。
連戰夫婦今晚的50週年金婚紀念宴,正值連戰赴陸出席閱兵引發爭議之際,社會關注會否受到影響。
包括國民黨主席朱立倫、副主席郝龍斌等人已分別對外表示另有行程,無法出席。
Then I run the pipeline with the command:
java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt
The result is as follows, but the output is garbled. How do I solve this problem?
C:\stanford-corenlp-full-2015-04-20>java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt
Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
Adding annotator segment
Loading Segmentation Model ... Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... Loading Chinese dictionaries from 1 file:
edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
Done. Unique words in ChineseDictionary is: 423200.
done [22.9 sec].
Ready to process: 1 files, skipped 0, total 1
Processing file C:\stanford-corenlp-full-2015-04-20\input.txt ... writing to C:\stanford-corenlp-full-2015-04-20\input.txt.xml {
Annotating file C:\stanford-corenlp-full-2015-04-20\input.txt Adding Segmentation annotation ... INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
?]?X?u????j???\?L??o??????????e?D?u?s???????B?b?A??k?|???????????A?w?V?s?????A???#?U?^?n???C
?s?????????50?g?~???B?????b?A????s??u???X?u?\?L??o???????A???|???`?|?_????v?T?C
?]?A?????D?u?????B??D?u?q?s?y???H?w???O??~???t????{?A?L?k?X?u?C
--->
[?, ], ?, X, ?u????j???, \, ?L??o??????????e?D?u?s???????B?b?A??k?|???????????A?w?V?s?????A???#?U?^?n???C, , , , ?s?????????, 50, ?, g?, ~, ???B?????b?A????s??u???X?u?, \, ?L??o???????A???, |, ???, `, ?, |, ?_????v?T?C, , , , ?, ], ?, A?????D?u???, ??, B??D?u?q, ?, s?y???H?w???O??, ~, ???t????, {, ?, A?L?k?X?u?C]
}
Processed 1 documents
Skipped 0 documents, error annotating 0 documents
Annotation pipeline timing information:
ChineseSegmenterAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 34 tokens at 485.7 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.1 sec.
I edited your question to change the command to the one that you actually used to produce the output shown. It looks like you worked out that the former command:
java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt input.xml
ran the English analysis pipeline, and that didn't work very well for Chinese text....
The CoreNLP support of Chinese in v3.5.2 is still a little rough, and will hopefully be a bit smoother in the next release. But from here you need to:
Specify a properties file for Chinese, giving appropriate models. (If no properties file is specified, CoreNLP defaults to English): -props StanfordCoreNLP-chinese.properties
At present, word segmentation of Chinese is not the annotator tokenize, but segment, specified as a custom annotator in StanfordCoreNLP-chinese.properties. (Maybe we'll unify the two in a future release...)
The current dcoref annotator only works for English. There is Chinese coreference, but it is not fully integrated into the pipeline. If you want to use it, you currently have to write some code, as explained here. So let's delete it. (Again, this should be better integrated in the future).
At that point, things run, but the ugly stderr output you show is that by default the segmenter has VERBOSE turned on, but your output character encoding is not right for our Chinese output. We should have VERBOSE off by default, but you can turn it off with: -segment.verbose false
We have no Chinese lemmatizer, so may as well delete that annotator.
Also, CoreNLP needs more than 1GB of RAM. Try 2GB.
At this point, all should be good! With the command:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit,pos,ner,parse -segment.verbose false -file input.txt
you get the output in input.txt.xml. (I'm not posting it, since it's a couple of thousand lines long....)
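If you'd rather drive the pipeline from Java than from the command line, the same setup can be expressed in code. A minimal, untested sketch (ChinesePipelineDemo is a made-up class name; it loads the Chinese defaults that ship in the models jar):
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ChinesePipelineDemo {
    public static void main(String[] args) throws Exception {
        // Load the Chinese defaults from the models jar, then trim the annotator list
        Properties props = new Properties();
        props.load(ChinesePipelineDemo.class.getClassLoader()
                .getResourceAsStream("StanfordCoreNLP-chinese.properties"));
        props.setProperty("annotators", "segment,ssplit,pos,ner,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("包括國民黨主席朱立倫、副主席郝龍斌等人已分別對外表示另有行程,無法出席。");
        pipeline.annotate(doc);
        pipeline.prettyPrint(doc, System.out); // same content as the XML file, as text
    }
}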
Update for CoreNLP v3.8.0: If using the (current in 2017) CoreNLP v3.8.0, then there are some changes/progress: (i) we now use the annotator tokenize for all languages, and it doesn't require loading a custom annotator for Chinese; (ii) verbose segmentation is correctly turned off by default; (iii) [negative progress] the requirements now demand the lemma annotator prior to ner, even though it is a no-op for Chinese; and (iv) coreference is now available for Chinese, invoked as coref, which requires the prior annotator mention, and its statistical models require considerable memory. Put that all together, and you're now good with this command:
java -cp "*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -file input.txt

BLAST Database error: No alias or index file found for nucleotide database

I am trying to run blastn, and then also SIFT standalone. However, I am having database configuration issues, as I am getting the following:
arron@arron-Ideapad-Z570 ~/Phd/programs/sift4.0.3b $ blastn -query test/lacI.fasta -db db/swissprot/
BLAST Database error: No alias or index file found for nucleotide database [db/swissprot/] in search path [/home/arron/Phd/programs/sift4.0.3b:::]
After some advice from other threads, I downloaded a protein database, for example swissprot:
wget ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/uniprotkb_swissprot.gz
zcat uniprotkb_swissprot.gz | awk '{if (/^>/) { print ">" $2} else { print $_}}' > swissprot.fa
and then used makeblastdb to create a blast database:
arron@arron-Ideapad-Z570 ~/Phd/programs/sift4.0.3b/db/swissprot $ makeblastdb -in swissprot.fa -dbtype prot
Building a new DB, current time: 10/27/2014 13:18:57
New DB name: swissprot.fa
New DB title: swissprot.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 546439 sequences in 19.0039 seconds.
yet I am still getting the same problem. What am I doing wrong?
You've pointed at the folder the database files are in, not at the database itself. Try:
blastn -query test/lacI.fasta -db db/swissprot/swissprot.fa
(Of course, that won't work either, because you're trying to use a protein database with blastn. You'd need to use blastx.)
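Putting the two fixes together, something like this should work (paths as in the question; the -out name is arbitrary):
$ makeblastdb -in swissprot.fa -dbtype prot -out swissprot   # run inside db/swissprot
$ blastx -query test/lacI.fasta -db db/swissprot/swissprot   # nucleotide query vs. protein DB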
You could try running protein BLAST instead: swissprot is a protein database, and blastn is for nucleotide sequences.

DB2: How to set encoding for db2clp under Windows?

I have a DB2 database that was created with its codeset set to UTF-8:
db2 create database mydb using codeset UTF-8
My data insert scripts are also stored in UTF-8.
The problem is that the command line processor seems to use a different encoding, since the Windows installation doesn't use UTF-8:
C:\Users\Administrator>chcp
Active code page: 850
As a result, my data (which contains special characters) is not stored correctly in the database.
Under Linux/AIX I could change the command line encoding by setting
export LC_ALL=en_US.UTF-8
How do I achieve this under Windows? I already tried
chcp 65001
UPDATE:
But that doesn't have any effect. It seems the db2clp can't deal with the UTF-8-encoded file, because it prints junk:
D:\Program Files\ibm_db2\SQLLIB\BIN>chcp 65001
Active code page: 65001
D:\Program Files\ibm_db2\SQLLIB\BIN>type d:\tmp\encoding.sql
INSERT INTO MY_TABLE (ID, TXT) VALUES (99, 'äöü');
D:\Program Files\ibm_db2\SQLLIB\BIN>db2 connect to mydb
Database Connection Information
Database server      = DB2/NT64 9.5.0
SQL authorization ID = MYUSER
Local database alias = MYDB
D:\Program Files\ibm_db2\SQLLIB\BIN>db2 -tvf d:\tmp\encoding.sql
INSERT INTO MY_TABLE (ID, TXT) VALUES (99, 'äöü')
DB20000I The SQL command completed successfully.
You need to set both:
CHCP 65001
SET DB2CODEPAGE=1208
on the db2cmd command line, before running db2 -tvf. This works for databases that have CODESET set to UTF-8. To check the CODESET setting for a database, run:
db2 get db cfg for <your database>
and look for "Database code page" and "Database code set"; they should be 1208 and UTF-8, respectively.
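Putting the steps together in a db2cmd session (mydb and the script path are taken from the question; db2 terminate, mentioned in another answer below, makes sure the CLP back-end picks up the new code page):
chcp 65001
set DB2CODEPAGE=1208
db2 terminate
db2 connect to mydb
db2 -tvf d:\tmp\encoding.sql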
When dealing with encodings, you have to take a careful look at your environments and at where you currently are. So in your case:
the server stores its data in encoding A (like UTF-8)
the client resides in an environment which has encoding B (like windows-1252)
In your client, you have to use the encoding of your client (or tell the client that you are intentionally using another encoding on the client side, like a UTF-8-encoded file inside a windows-1252 environment). The connection between the client and the server then does the work of converting encoding B into encoding A when storing the data into the database.
Setting db2codepage worked for me, thanks to Mr. Zoran Regvart.
By the way, after setting it you need to execute "db2 terminate" to reset the client, and then reconnect.
