I am using Stanford CoreNLP for extraction. Below is the sentence from which I am trying to extract the currency amount along with the currency symbol:
5 March 2015 Kering Issue of €500,000,000 0.875 per cent
The data that I need to extract is €500,000,000 0.875
By default, CoreNLP is giving the sentence as
5 March 2015 Kering Issue of **$**500,000,000 0.875 per cent
So I wrote:
public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
"normalizeCurrency=false");
DocumentPreprocessor docPre = new DocumentPreprocessor(new java.io.StringReader(textChunk));
docPre.setTokenizerFactory(TokenizerFactory);
Now the sentence comes through correctly as
5 March 2015 Kering Issue of €500,000,000 0.875 per cent
But when I do
props.put("annotators", "tokenize, cleanxml, ssplit, pos, lemma, ner, regexner");
props.setProperty("ner.useSUTime", "0");
_pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
_pipeline.annotate(document);
where text = 5 March 2015 Kering Issue of €500,000,000 0.875 per cent
I am getting this output:
<token id="9">
<word>$</word>
<lemma></lemma>
<CharacterOffsetBegin>48</CharacterOffsetBegin>
<CharacterOffsetEnd>49</CharacterOffsetEnd>
<POS>CD</POS>
<NER>MONEY</NER>
<NormalizedNER>$5.000000000875E9</NormalizedNER>
</token>
So I added the line props.put("tokenize.options", "normalizeCurrency=false");
But the output is still the same, with $5.000000000875E9.
Can anybody please help me? Thank you.
When I ran this code it didn't change the currency symbol to "$":
package edu.stanford.nlp.examples;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import java.util.*;
public class TokenizeOptionsExample {
public static void main(String[] args) {
Annotation document = new Annotation("5 March 2015 Kering Issue of €500,000,000 0.875 per cent");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit");
props.setProperty("tokenize.options", "normalizeCurrency=false");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
System.out.println(token);
}
}
}
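For what it's worth, the tokenizer option and the NER normalization are separate steps, and the output in the question suggests the $-prefixed value comes from the NormalizedNER annotation that the ner annotator fills in, not from the tokenizer itself. Below is a minimal sketch (assuming only the standard CoreNLP annotation keys) that prints each token's word, NER tag and NormalizedNER value, so you can confirm which stage introduces the "$":
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import java.util.*;
public class CurrencyInspectExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    // keep the raw euro sign at the token level
    props.setProperty("tokenize.options", "normalizeCurrency=false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation document = new Annotation("5 March 2015 Kering Issue of €500,000,000 0.875 per cent");
    pipeline.annotate(document);
    for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
      // word stays "€" with normalizeCurrency=false; NormalizedNER is set during NER
      System.out.println(token.word() + "\t"
          + token.get(CoreAnnotations.NamedEntityTagAnnotation.class) + "\t"
          + token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class));
    }
  }
}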
I am using iText 7 in Java to create a PDF/A-1B file using this simple code:
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.pdf.PdfAConformanceLevel;
import com.itextpdf.kernel.pdf.PdfDocumentInfo;
import com.itextpdf.kernel.pdf.PdfOutputIntent;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.PdfViewerPreferences;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.xmp.PdfConst;
import com.itextpdf.kernel.xmp.XMPConst;
import com.itextpdf.kernel.xmp.XMPException;
import com.itextpdf.kernel.xmp.XMPMeta;
import com.itextpdf.kernel.xmp.XMPMetaFactory;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
import com.itextpdf.pdfa.PdfADocument;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.text.SimpleDateFormat;
import gestionepdf.PdfDocument_Configurazione_Text;
/**
*
* @author UC9001309
*/
public class TestCreatePDFA {
public static final String courier = "C:\\Windows\\fonts\\couri.ttf";
public static final String times = "C:\\Windows\\fonts\\times.ttf";
public static final String helvetica = "C:\\Windows\\fonts\\helvetica.ttf";
public static void main(String[] args) throws FileNotFoundException, IOException, XMPException {
// TODO code application logic here
PdfWriter pdfWriter = new PdfWriter("C:\\Temp\\" + new SimpleDateFormat("yyyyMMddHHmmss").format(new java.util.Date()) + ".pdf");
PdfADocument pdfA = new PdfADocument( pdfWriter,PdfAConformanceLevel.PDF_A_1B,new PdfOutputIntent("Custom", "", "https://www.color.org",
"sRGB2014", new FileInputStream("C:\\Users\\UC9001309\\Documents\\NetBeansProjects\\GestionePdf\\sRGB2014.icc")));
Document document = new Document(pdfA, PageSize.A4, false);
XMPMeta meta = XMPMetaFactory.create();
PdfDocumentInfo info = pdfA.getDocumentInfo();
pdfA.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
pdfA.getCatalog().setLang(new PdfString("it-IT"));
info.addCreationDate();
meta.setProperty(XMPConst.NS_XMP, PdfConst.CreateDate, info.getMoreInfo("CreationDate"));
info.addModDate();
meta.setProperty(XMPConst.NS_XMP, PdfConst.ModifyDate, info.getMoreInfo("ModDate"));
info.setAuthor("MyAuthor");
info.setCreator("MyCreator");
info.setProducer("Producer");
info.setTitle("TEST PdfA " );
PdfFont font_h = PdfFontFactory.createFont(helvetica);
PdfFont font_c = PdfFontFactory.createFont(courier);
PdfFont font_t = PdfFontFactory.createFont(times);
Paragraph p = new Paragraph();
p.setFont(font_c);
p.setItalic();
p.add("Prova pdfa");
p.getAccessibilityProperties().setRole("P");
Paragraph p1 = new Paragraph();
p1.setFont(font_h);
p1.add("Prova pdfa");
p1.getAccessibilityProperties().setRole("P");
Paragraph p2 = new Paragraph();
p2.setFont(font_t);
p2.add("Prova pdfa");
p2.getAccessibilityProperties().setRole("P");
document.add(p);
document.add(p1);
document.add(p2);
pdfA.setXmpMetadata(meta);
document.close();
}
}
But when I try to validate it using online sites, some are fine with it and others are not. In particular, one validator gives me this error:
<?xml version="1.0" encoding="UTF-8"?>
<ValidationReport>
<VersionInformation Version="14.1.182" ID="GdPicture.NET.14"/>
<ValidationProfile Level="B" Part="1" Conformance="PDF/A"/>
<FileInfo FileSize="57849 bytes" FileName="20220726115840.pdf"/>
<ValidationResult Statement="PDF file is not compliant with validation profile requirements." sCompliant="False"/>
<Details>
<FailedChecks Count="4">
<Check ID="InconsistentPDFInfo" OccurenceCount="4">
<Occurence Statement="The PDF Information Dictionary entry CreationDate is not consistent with PDF XMP metadata information." ObjReference="None" Context="Document"/>
<Occurence Statement="The PDF Information Dictionary entry Creator is not consistent with PDF XMP metadata information." ObjReference="None" Context="Document"/>
<Occurence Statement="The PDF Information Dictionary entry ModDate is not consistent with PDF XMP metadata information." ObjReference="None" Context="Document"/>
<Occurence Statement="The PDF Information Dictionary entry Producer is not consistent with PDF XMP metadata information." ObjReference="None" Context="Document"/>
</Check>
</FailedChecks>
</Details>
</ValidationReport>
This is the XMP of my PDF/A:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
xmp:CreateDate="2022-07-26T16:13:07+02:00"
xmp:ModifyDate="2022-07-26T16:13:07+02:00"
xmp:CreatorTool="MyCreator"
pdf:Producer="Producer; modified using iText® Core 7.2.2 (AGPL version) ©2000-2022 iText Group NV"
pdfaid:part="1"
pdfaid:conformance="B">
<dc:creator>
<rdf:Seq>
<rdf:li>MyAuthor</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">TEST PdfA </rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
This is the Information Dictionary:
Author - MyAuthor
CreationDate - D:20220727094440+02'00'
Creator - MyCreator
ModDate - D:20220727094440+02'00'
Producer - Producer
Title - TEST PdfA
I don't understand whether the mistake is mine or the validator's.
Thanks in advance to anyone who wants to help!
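Not an authoritative fix, but all four failed checks say the same thing: an Info dictionary entry (CreationDate, Creator, ModDate, Producer) does not match the XMP packet, and the hand-built XMPMeta in the code above only carries CreateDate/ModifyDate while iText writes its own Producer note into the XMP. One thing worth trying, as a sketch under the assumption that iText derives a consistent XMP packet from the Info dictionary when you do not install your own, is to set only the Info values and drop the manual XMPMeta handling:
PdfWriter pdfWriter = new PdfWriter("C:\\Temp\\test-pdfa.pdf");
PdfADocument pdfA = new PdfADocument(pdfWriter, PdfAConformanceLevel.PDF_A_1B,
        new PdfOutputIntent("Custom", "", "https://www.color.org",
                "sRGB2014", new FileInputStream("sRGB2014.icc")));
Document document = new Document(pdfA, PageSize.A4, false);
pdfA.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
pdfA.getCatalog().setLang(new PdfString("it-IT"));
PdfDocumentInfo info = pdfA.getDocumentInfo();
info.addCreationDate();
info.addModDate();
info.setAuthor("MyAuthor");
info.setCreator("MyCreator");
info.setProducer("Producer");
info.setTitle("TEST PdfA");
document.add(new Paragraph("Prova pdfa"));
// No XMPMetaFactory.create() / pdfA.setXmpMetadata(meta) here: the assumption is
// that the hand-written packet was replacing the XMP that iText keeps in line
// with the Info dictionary, which is exactly what the validator complains about.
document.close();
Whether the "; modified using iText..." Producer note added by the AGPL version still trips that particular validator is a separate question.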
I am trying to train custom relations in Stanford CoreNLP using the birthplace model.
I have gone through this documentation, which explains how to make a properties file (similar to roth.properties) as follows:
#Below are some basic options. See edu.stanford.nlp.ie.machinereading.MachineReadingProperties class for more options.
# Pipeline options
annotators = pos, lemma, parse
parse.maxlen = 100
# MachineReading properties. You need one class to read the dataset into correct format. See edu.stanford.nlp.ie.machinereading.domains.ace.AceReader for another example.
datasetReaderClass = edu.stanford.nlp.ie.machinereading.domains.roth.RothCONLL04Reader
#Data directory for training. The datasetReaderClass reads data from this path and makes corresponding sentences and annotations.
trainPath = "D:\\stanford-corenlp-full-2017-06-09\\birthplace.corp"
#Whether to crossValidate, that is evaluate, or just train.
crossValidate = false
kfold = 10
#Change this to true if you want to use CoreNLP pipeline generated NER tags. The default model generated with the relation extractor release uses the CoreNLP pipeline provided tags (option set to true).
trainUsePipelineNER=false
# where to save training sentences. uses the file if it exists, otherwise creates it.
serializedTrainingSentencesPath = "D:\\stanford-corenlp-full-2017-06-09\\rel\\sentences.ser"
serializedEntityExtractorPath = "D:\\stanford-corenlp-full-2017-06-09\\rel\\entity_model.ser"
# where to store the output of the extractor (sentence objects with relations generated by the model). This is what you will use as the model when using 'relation' annotator in the CoreNLP pipeline.
serializedRelationExtractorPath = "D:\\stanford-corenlp-full-2017-06-09\\rel\\roth_relation_model_pipeline.ser"
# uncomment to load a serialized model instead of retraining
# loadModel = true
#relationResultsPrinters = edu.stanford.nlp.ie.machinereading.RelationExtractorResultsPrinter,edu.stanford.nlp.ie.machinereading.domains.roth.RothResultsByRelation. For printing output of the model.
relationResultsPrinters = edu.stanford.nlp.ie.machinereading.RelationExtractorResultsPrinter
#In this domain, this is trivial since all the entities are given (or set using CoreNLP NER tagger).
entityClassifier = edu.stanford.nlp.ie.machinereading.domains.roth.RothEntityExtractor
extractRelations = true
extractEvents = false
#We are setting the entities beforehand so the model does not learn how to extract entities etc.
extractEntities = false
#Opposite of crossValidate.
trainOnly=true
# The set chosen by feature selection using RothCONLL04:
relationFeatures = arg_words,arg_type,dependency_path_lowlevel,dependency_path_words,surface_path_POS,entities_between_args,full_tree_path
# The above features plus the features used in Bjorne BioNLP09:
# relationFeatures = arg_words,arg_type,dependency_path_lowlevel,dependency_path_words,surface_path_POS,entities_between_args,full_tree_path,dependency_path_POS_unigrams,dependency_path_word_n_grams,dependency_path_POS_n_grams,dependency_path_edge_lowlevel_n_grams,dependency_path_edge-node-edge-grams_lowlevel,dependency_path_node-edge-node-grams_lowlevel,dependency_path_directed_bigrams,dependency_path_edge_unigrams,same_head,entity_counts
I am executing this command in my directory D:\stanford-corenlp-full-2017-06-09:
D:\stanford-corenlp-full-2017-06-09\stanford-corenlp-3.8.0\edu\stanford\nlp>java -cp classpath edu.stanford.nlp.ie.machinereading.MachineReading --arguments roth.properties
and I am getting this error:
Error: Could not find or load main class edu.stanford.nlp.ie.machinereading.MachineReading
Caused by: java.lang.ClassNotFoundException: edu.stanford.nlp.ie.machinereading.MachineReading
I have also tried to train the custom relation model programmatically with the C# code below:
using java.util;
using System.Collections.Generic;
namespace StanfordRelationDemo
{
class Program
{
static void Main(string[] args)
{
string jarRoot = @"D:\Stanford English Model\stanford-english-corenlp-2018-10-05-models\";
string modelsDirectory = jarRoot + @"edu\stanford\nlp\models";
string sutimeRules = modelsDirectory + @"\sutime\defs.sutime.txt,"
//+ modelsDirectory + @"\sutime\english.holidays.sutime.txt,"
+ modelsDirectory + @"\sutime\english.sutime.txt";
Properties props = new Properties();
props.setProperty("annotators", "pos, lemma, parse");
props.setProperty("parse.maxlen", "100");
props.setProperty("datasetReaderClass", "edu.stanford.nlp.ie.machinereading.domains.roth.RothCONLL04Reader");
props.setProperty("trainPath", "D://Stanford English Model//stanford-english-corenlp-2018-10-05-models//edu//stanford//nlp//models//birthplace.corp");
props.setProperty("crossValidate", "false");
props.setProperty("kfold", "10");
props.setProperty("trainOnly", "true");
props.setProperty("trainUsePipelineNER", "true");
props.setProperty("serializedTrainingSentencesPath", "D://Stanford English Model//stanford-english-corenlp-2018-10-05-models//edu//stanford//nlp//models//rel//sentences.ser");
props.setProperty("serializedEntityExtractorPath", "D://Stanford English Model//stanford-english-corenlp-2018-10-05-models//edu//stanford//nlp//models//rel//entity_model.ser");
props.setProperty("serializedRelationExtractorPath", "D://Stanford English Model//stanford-english-corenlp-2018-10-05-models//edu//stanford//nlp//models//rel//roth_relation_model_pipeline.ser");
props.setProperty("relationResultsPrinters", "edu.stanford.nlp.ie.machinereading.RelationExtractorResultsPrinter");
props.setProperty("entityClassifier", "edu.stanford.nlp.ie.machinereading.domains.roth.RothEntityExtractor");
props.setProperty("extractRelations", "true");
props.setProperty("extractEvents", "false");
props.setProperty("extractEntities", "false");
props.setProperty("trainOnly", "true");
props.setProperty("relationFeatures", "arg_words,arg_type,dependency_path_lowlevel,dependency_path_words,surface_path_POS,entities_between_args,full_tree_path");
var propertyKeys = props.keys();
var propertyStringArray = new List<string>();
while (propertyKeys.hasMoreElements())
{
var key = propertyKeys.nextElement();
propertyStringArray.Add($"-{key}");
propertyStringArray.Add(props.getProperty(key.ToString(), string.Empty));
}
var machineReader = edu.stanford.nlp.ie.machinereading.MachineReading.makeMachineReading(propertyStringArray.ToArray());
var utestResultList = machineReader.run();
}
}
}
I am getting this exception:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Unhandled Exception: edu.stanford.nlp.io.RuntimeIOException: Error while loading a tagger model (probably missing model file) --->
java.io.IOException: Unable to open
"edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger"
as class path, filename or URL
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(String
textFileOrUrl)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties
config, String modelFileOrUrl, Boolean printLoading)
--- End of inner exception stack trace ---
at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties
config, String modelFileOrUrl, Boolean printLoading)
at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile, Properties config, Boolean printLoading)
at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(String ,
Boolean )
at edu.stanford.nlp.pipeline.POSTaggerAnnotator..ctor(String annotatorName, Properties props)
at edu.stanford.nlp.pipeline.AnnotatorImplementations.posTagger(Properties
properties)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$42(Properties
, AnnotatorImplementations )
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<>Anon4.apply(Object ,
Object )
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getDefaultAnnotatorPool$65(Entry
, Properties , AnnotatorImplementations )
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<>Anon27.get()
at edu.stanford.nlp.util.Lazy.3.compute()
at edu.stanford.nlp.util.Lazy.get()
at edu.stanford.nlp.pipeline.AnnotatorPool.get(String name)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(Properties ,
Boolean , AnnotatorImplementations , AnnotatorPool )
at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements, AnnotatorPool annotatorPool)
at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements)
at edu.stanford.nlp.ie.machinereading.MachineReading.makeMachineReading(String[]
args)
at StanfordRelationDemo.Program.Main(String[] args) in C:\Users\m1039332\Documents\Visual Studio
2017\Projects\StanfordRelationDemo\StanfordRelationDemo\Program.cs:line
46
I am thus simply unable to train the custom relation model using CoreNLP. If there are any obvious mistakes I am making, I would appreciate it if anybody would point them out.
I don't think the machine reading code is distributed with the standard distribution.
You should build a jar from the full GitHub repository:
https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/ie/machinereading
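Once you have a jar that actually contains the machinereading classes, the -cp argument needs to point at real jars rather than the literal word classpath. For example (the jar names below are assumptions; use whatever your build produces, plus the models jar so the POS tagger and parser models resolve):
java -cp "stanford-corenlp.jar;stanford-corenlp-models.jar" edu.stanford.nlp.ie.machinereading.MachineReading --arguments roth.properties
(';' is the classpath separator on Windows; use ':' on Linux/macOS.)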
Hi, I am using the pipeline object to parse forum posts. For each one I do the following:
Annotation document = new Annotation(post);
mPipeline.annotate(document); // annoatiate the post text
I would like each call to annotate to time out after a few seconds.
I have followed the example at line 65: https://github.com/stanfordnlp/CoreNLP/blob/master/itest/src/edu/stanford/nlp/pipeline/ParserAnnotatorITest.java
So I am creating the pipeline object as follows:
Properties props = new Properties();
props.setProperty("parse.maxtime", "30");
props.setProperty("dcoref.maxtime", "30");
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
mPipeline = new StanfordCoreNLP(props);
However, when I add the maxtime properties I get the following exception:
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:64)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:412)
Without the maxtime options there is no exception.
How can I set maxtime properly?
Thanks
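One workaround that sidesteps the maxtime properties entirely is to enforce the timeout from the calling side with a plain java.util.concurrent executor. This is only a sketch (mPipeline and post are the names from the question, and the annotator thread may keep running until it notices the interrupt):
import edu.stanford.nlp.pipeline.Annotation;
import java.util.concurrent.*;
// reuse one executor for all posts; shut it down when you are done
ExecutorService executor = Executors.newSingleThreadExecutor();
Annotation document = new Annotation(post);
Future<?> job = executor.submit(() -> mPipeline.annotate(document));
try {
    job.get(5, TimeUnit.SECONDS);   // give the annotation at most 5 seconds
} catch (TimeoutException e) {
    job.cancel(true);               // interrupt the worker and skip this post
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}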
Which properties do I need to use in code to get the same result as the online parser (http://nlp.stanford.edu:8080/parser/index.jsp)?
Sentence: Wonderful Doctor. I cant say enough good things about him.
Online parse:
compound(Doctor-2, Wonderful-1)
root(ROOT-0, Doctor-2)
nsubj(say-3, I-1)
advmod(say-3, cant-2)
root(ROOT-0, say-3)
advmod(good-5, enough-4)
amod(things-6, good-5)
dobj(say-3, things-6)
case(him-8, about-7)
nmod(things-6, him-8)
Coding:
I am using Stanford CoreNLP 3.5.2. I tried several sets of properties, as below, but I am getting different results.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
//props.put("ner.model", "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz");
// props.put("annotators", "tokenize, ssplit, pos, lemma, ner, depparse", "parse","sentiment, dcoref");
//props.put("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
//props.put("parse.model","edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
//props.put("tokenize.options", "ptb3Escaping=false");
//props.put("parse.maxlen", "10000");
//props.put("ner.applyNumericClassifiers", "true");
//props.put("depparse.model", "edu/stanford/nlp/models/parser/nndep/english_SD.gz");
//props.put("depparse.extradependencies", "ref_only_uncollapsed");
props.put("depparse.extradependencies", "MAXIMAL");
//props.put("depparse.originalDependencies", false);
props.put("parse.originalDependencies", false);
I ran a clustering test on crawled pages (more than 25K docs; a personal data set).
I've done a clusterdump:
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt
The output after running the cluster dumper shows 25 elements of the form "VL-xxxxx {}":
VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ... ] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ... ]}
...
VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] ] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}
How to interpret this output?
In short : I am looking for document ids which belong to a particular cluster.
What is the meaning of:
VL-x ?
n=y c=[z:z', ...]
r=[z'':z''', ...]
Does 0:0.017 mean that "0" is the document id which belongs to this cluster?
I have already read on the Mahout wiki pages what CL, n, c and r mean. But can someone please explain them to me better, or point me to a resource where they are explained in a bit more detail?
Sorry if I am asking some stupid questions, but I am a newbie with Apache Mahout and am using it as part of my course assignment on clustering.
By default, k-means clustering uses WeightedVector, which does not include the data point name. So you would want to make the sequence file yourself using NamedVector. There is a one-to-one correspondence between the number of seq files and the map tasks, so if your mapping capacity is 12, you want to chop your data into 12 pieces when making the seq files.
NamedVector:
vector = new NamedVector(new SequentialAccessSparseVector(Cardinality),arrField[0]);
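As a sketch of how that fits together (the path, key type and cardinality here are made up for illustration), each document becomes one NamedVector appended to a sequence file of VectorWritable values, which is what the clustering job then reads:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.VectorWritable;
// sketch: write one NamedVector per document so clusteredPoints keep the document id
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("vectors/part-00000");   // hypothetical output path
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
try {
    NamedVector vec = new NamedVector(new SequentialAccessSparseVector(10000), "doc-0001");
    // ... fill in the term weights of the document here ...
    writer.append(new Text(vec.getName()), new VectorWritable(vec));
} finally {
    writer.close();
}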
Basically you need to download the clusteredPoints from your HDFS system and write your own code to output the results. Here is the code that I wrote to output the cluster point membership.
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.NamedVector;
public class ClusterOutput {
/**
* @param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
BufferedWriter bw;
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
File pointsFolder = new File(args[0]);
File files[] = pointsFolder.listFiles();
bw = new BufferedWriter(new FileWriter(new File(args[1])));
HashMap<String, Integer> clusterIds;
clusterIds = new HashMap<String, Integer>(5000);
for(File file:files){
if(file.getName().indexOf("part-m")<0)
continue;
SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
IntWritable key = new IntWritable();
WeightedVectorWritable value = new WeightedVectorWritable();
while (reader.next(key, value)) {
NamedVector vector = (NamedVector) value.getVector();
String vectorName = vector.getName();
bw.write(vectorName + "\t" + key.toString()+"\n");
if(clusterIds.containsKey(key.toString())){
clusterIds.put(key.toString(), clusterIds.get(key.toString())+1);
}
else
clusterIds.put(key.toString(), 1);
}
bw.flush();
reader.close();
}
bw.flush();
bw.close();
bw = new BufferedWriter(new FileWriter(new File(args[2])));
Set<String> keys=clusterIds.keySet();
for(String key:keys){
bw.write(key+" "+clusterIds.get(key)+"\n");
}
bw.flush();
bw.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
To complete the answer:
VL-x: is the identifier of the cluster
n=y: is the number of elements in the cluster
c=[z, ...]: is the centroid of the cluster, with the z's being the weights of the different dimensions
r=[z, ...]: is the radius of the cluster.
More info here:
https://mahout.apache.org/users/clustering/cluster-dumper.html
I think you need to read the source code -- download from http://mahout.apache.org. VL-24130 is just a cluster identifier for a converged cluster.
You can use mahout clusterdump
https://cwiki.apache.org/MAHOUT/cluster-dumper.html
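If the vectors were built as NamedVectors, clusterdump can also list the points (and hence the document ids) for each cluster when you pass it the clusteredPoints directory; an example invocation along the lines of the one in the question (option name as documented on the page above):
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --pointsDir output/clusteredPoints/ --output clusteranalyze.txt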