OpenNLP Tokenizer - Incompatible Types Error?

I am trying to create a Tokenizer using the Apache OpenNLP API. I have extracted the code from their site, but I get an 'incompatible types' error for the following line of code in the Tokenize class:
Tokenize tokenizer = new TokenizerME(model);
Does anyone know the reason for this error as it appears that they shouldn't be incompatible?
This is the main class:
public class OpenNLP {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            Tokenize T = new Tokenize();
            T.Tokenize();
        } catch (Exception e) {
        }
    }
}
This is the Tokenize class with the error:
public class Tokenize {

    public void Tokenize() throws InvalidFormatException, IOException {
        InputStream is = new FileInputStream("en-token.bin");
        TokenizerModel model = new TokenizerModel(is);
        Tokenize tokenizer = new TokenizerME(model);
        String tokens[] = tokenizer.tokenize("Hi. How are you? This is Mike.");
        for (String a : tokens)
            System.out.println(a);
        is.close();
    }
}

I have sorted this now. The following line:
Tokenize tokenizer = new TokenizerME(model);
should have been:
Tokenizer tokenizer = new TokenizerME(model);
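For reference, here is the corrected class with its imports, declaring the variable as OpenNLP's Tokenizer interface (a minimal sketch; it assumes en-token.bin sits in the working directory):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;

public class Tokenize {

    public void Tokenize() throws InvalidFormatException, IOException {
        // Load the pre-trained English token model (assumed to be in the working directory).
        InputStream is = new FileInputStream("en-token.bin");
        TokenizerModel model = new TokenizerModel(is);

        // TokenizerME implements the Tokenizer interface, so this assignment compiles.
        Tokenizer tokenizer = new TokenizerME(model);

        String[] tokens = tokenizer.tokenize("Hi. How are you? This is Mike.");
        for (String token : tokens) {
            System.out.println(token);
        }
        is.close();
    }
}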

Related

Spring Hibernate Criteria API Builder pass list parameter to function

I am trying to implement MySQL full-text search with the Criteria API builder, and I am stuck on passing a list of multiple columns to a custom function.
Custom MySQL dialect to enable MATCH AGAINST function:
import org.hibernate.dialect.MySQL5Dialect;
import org.hibernate.dialect.function.SQLFunctionTemplate;
import org.hibernate.type.StandardBasicTypes;

public class CustomMySQL5Dialect extends MySQL5Dialect {

    public CustomMySQL5Dialect() {
        super();
        registerFunction("match", new SQLFunctionTemplate(StandardBasicTypes.DOUBLE, "match(?1) against (?2 in boolean mode)"));
    }
}
The Customer service part for building the query:
Specification.where((root, query, cb) -> {
    Expression<Double> match = cb.function(
            "match",
            Double.class,
            root.get(Customer_.FIRST_NAME),
            cb.literal("mySearchTerm")
    );
    return cb.greaterThan(match, 0.);
});
But now I would like to extend the full-text search to search against multiple columns. The final SQL should look like:
SELECT * FROM customer WHERE MATCH (first_name,last_name) AGAINST ('mysearchterm' IN BOOLEAN MODE) > 0.0
So, does anyone know how to pass a list of column names as the first parameter?
import java.util.List;

import org.hibernate.QueryException;
import org.hibernate.dialect.function.SQLFunction;
import org.hibernate.engine.spi.Mapping;
import org.hibernate.engine.spi.SessionFactoryImplementor;
import org.hibernate.type.StandardBasicTypes;
import org.hibernate.type.Type;

public class MariaDB10Dialect extends org.hibernate.dialect.MariaDB10Dialect {

    public MariaDB10Dialect() {
        registerFunction("match", new SQLFunction() {

            @Override
            public boolean hasArguments() {
                return true;
            }

            @Override
            public boolean hasParenthesesIfNoArguments() {
                return false;
            }

            @Override
            public Type getReturnType(Type firstArgumentType, Mapping mapping) throws QueryException {
                return StandardBasicTypes.DOUBLE;
            }

            @Override
            public String render(Type firstArgumentType, List arguments, SessionFactoryImplementor factory) throws QueryException {
                StringBuilder sb = new StringBuilder("match(");
                int i = 0;
                for (i = 0; i < arguments.size() - 1; i++) {
                    if (i > 0)
                        sb.append(", ");
                    sb.append(arguments.get(i));
                }
                sb.append(") against (").append(arguments.get(i)).append(")");
                return sb.toString();
            }
        });
    }
}
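On the query side, the dialect above renders every argument except the last inside MATCH(...) and the last one inside AGAINST(...), so additional columns can simply be passed as extra expressions to cb.function. A sketch, reusing the Customer_ metamodel and search term from the question:
Specification.where((root, query, cb) -> {
    Expression<Double> match = cb.function(
            "match",
            Double.class,
            root.get(Customer_.FIRST_NAME),   // rendered inside match(...)
            root.get(Customer_.LAST_NAME),    // rendered inside match(...)
            cb.literal("mySearchTerm")        // rendered inside against(...)
    );
    return cb.greaterThan(match, 0.);
});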

How to set the Mapping Type in Liferay DXP while Indexing a document?

We have a custom entity in Liferay and we are able to index it in Elasticsearch. By default, Liferay sets the mapping type of all indexed documents to "LiferayDocumentType" in Elasticsearch; we need to change it to "PublicationType". (The screenshot of our document in Elasticsearch, which highlighted the field we need to change, is not reproduced here.)
The following class is our indexer class for publication.
@Component(
    immediate = true,
    property = { "indexer.class.name=com.ctc.myct.bo.model.Publication" },
    service = Indexer.class)
public class StanlyPublicationIndexer extends BaseIndexer<Publication> {

    private static final String CLASS_NAME = Publication.class.getName();
    private static final String PORTLET_ID = "Corporate Portlet";
    private static final Log _log = LogFactoryUtil.getLog(StanlyPublicationIndexer.class);

    String example = "This is an example";
    byte[] bytes = example.getBytes();

    public StanlyPublicationIndexer() {
        setFilterSearch(true);
        setPermissionAware(true);
    }

    @Override
    protected void doDelete(Publication object) throws Exception {
        Document doc = getBaseModelDocument(PORTLET_ID, object);
        IndexWriterHelperUtil.deleteDocument(this.getSearchEngineId(), object.getCompanyId(), object.getUuid(), true);
    }

    @Override
    protected Document doGetDocument(Publication object) throws Exception {
        Document doc = getBaseModelDocument(PORTLET_ID, object);
        User user = UserLocalServiceUtil.getUser(PrincipalThreadLocal.getUserId());
        long userid = user.getUserId();
        String username = user.getScreenName();
        object.setUserId(userid);
        object.setUserName(username);
        doc.addKeyword(Field.USER_ID, userid);
        doc.addText(Field.USER_NAME, username);
        doc.addText("title", object.getTitle());
        doc.addText("firstName", object.getFirstName());
        doc.addText("lastName", object.getLastName());
        doc.addText("additional_Information", object.getAdditionalInformation());
        doc.addKeyword("roleId", object.getRoleId());
        doc.addNumber("publicationId", object.getPublicationId());
        doc.addNumber("articleId", object.getJournalArticleId());
        Field field = new Field("_type");
        _log.info("The document with title" + " " + object.getTitle() + " " + "and firstName" + " "
            + object.getFirstName() + " " + "has been created successfully ");
        return doc;
    }

    @Override
    protected Summary doGetSummary(Document document, Locale locale, String snippet, PortletRequest portletRequest,
        PortletResponse portletResponse) throws Exception {
        return null;
    }

    @Override
    protected void doReindex(Publication object) throws Exception {
        _log.info("doReindex2 is Executing");
        Document doc = this.doGetDocument(object);
        try {
            IndexWriterHelperUtil.addDocument(this.getSearchEngineId(), object.getCompanyId(), doc, true);
            IndexWriterHelperUtil.commit(this.getSearchEngineId(), object.getCompanyId());
        } catch (Exception e) {
            e.printStackTrace();
        }
        _log.info("Publication document with publicationId " + " " + object.getPublicationId() + " "
            + " has been successfully reIndexed into the elastic search");
    }

    @Override
    protected void doReindex(String[] arg0) throws Exception {
        // TODO Auto-generated method stub
    }

    @Override
    protected void doReindex(String arg0, long arg1) throws Exception {
        // TODO Auto-generated method stub
    }

    @Override
    public Hits search(SearchContext searchContext) throws SearchException {
        return super.search(searchContext);
    }

    @Override
    public String getClassName() {
        return CLASS_NAME;
    }
}
It's possible to change the type using the Java API, but we are looking for a solution based on Liferay's Elasticsearch APIs. I really appreciate your time and any information about this.
It isn't possible to change the Elasticsearch document type without modifying the Liferay search module code; see:
com/liferay/portal/search/elasticsearch/internal/util/DocumentTypes.java
com/liferay/portal/search/elasticsearch/internal/index/LiferayTypeMappingsConstants.java
We can see the document type is a constant used in the Elasticsearch API calls. If you change it, the change will apply to all indexes. I don't think it is possible to change it for only one entity without rewriting a lot of code.

Map multiple words to single word in Lucene SynonymGraphFilter

I'm using Lucene 6.4.0. When I map dns to domain name system, I get the correct query. But when I try to map domain name system to dns, I can't get the correct query. I have set parser.setSplitOnWhitespace(false).
public class SynonymAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        SynonymMap synonymMap = null;
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        try {
            addTo(builder, new String[]{"dns"}, new String[]{"domain\u0000name\u0000system"});
            addTo(builder, new String[]{"domain\u0000name\u0000system"}, new String[]{"dns"});
            synonymMap = builder.build();
        } catch (Exception e) {
            e.printStackTrace();
        }
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream filter = new SynonymGraphFilter(tokenizer, synonymMap, true);
        return new TokenStreamComponents(tokenizer, filter);
    }

    private void addTo(SynonymMap.Builder builder, String[] from, String[] to) {
        for (String input : from) {
            for (String output : to) {
                builder.add(new CharsRef(input), new CharsRef(output), false);
            }
        }
    }
}
public static void main(String[] args) throws Exception {
    String queryStr = "domain name system";
    QueryParser parser = new QueryParser("n", new SynonymAnalyzer());
    parser.setDefaultOperator(QueryParser.Operator.AND);
    parser.setSplitOnWhitespace(false);
    Query query = parser.parse(queryStr);
}

Hadoop Mapreduce: Custom Input Format

I have a file containing text with "^" characters in between:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should be:
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
I have written a custom input format which extends FileInputFormat and also a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with this code; I'm having trouble with the nextKeyValue() method in the WHILE loop part. How should I read the data from a split and generate my custom key-value? I am using the new mapreduce package throughout instead of the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text> {

    long start, current, end;
    Text value;
    LongWritable key;
    LineReader reader;
    FileSplit split;
    Path path;
    FileSystem fs;
    FSDataInputStream in;
    Configuration conf;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException {
        conf = cont.getConfiguration();
        split = (FileSplit) inputSplit;
        path = split.getPath();
        fs = path.getFileSystem(conf);
        in = fs.open(path);
        reader = new LineReader(in, conf);
        start = split.getStart();
        current = start;
        end = split.getLength() + start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (key == null)
            key = new LongWritable();
        key.set(current);
        if (value == null)
            value = new Text();
        long readSize = 0;
        while (current < end) {
            Text tmpText = new Text();
            readSize = read //here how should i read data from the split, and generate key-value?
            if (readSize == 0)
                break;
            current += readSize;
        }
        if (readSize == 0) {
            key = null;
            value = null;
            return false;
        }
        return true;
    }

    @Override
    public float getProgress() throws IOException {
    }

    @Override
    public LongWritable getCurrentKey() throws IOException {
    }

    @Override
    public Text getCurrentValue() throws IOException {
    }

    @Override
    public void close() throws IOException {
    }
}
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
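For completeness, a minimal driver sketch showing how that property plugs into a job with the stock TextInputFormat (the class name and argument layout here are hypothetical, assuming the Hadoop 2.x mapreduce API):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CaretDelimitedDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Make TextInputFormat's record reader split records on '^' instead of '\n'.
        conf.set("textinputformat.record.delimiter", "^");

        Job job = Job.getInstance(conf, "caret-delimited");
        job.setJarByClass(CaretDelimitedDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        // job.setMapperClass(...);  // the mapper receives byte-offset keys and '^'-delimited Text values
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}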

Distributed Cache Hadoop not retrieving the file content

I am getting a garbage-like value instead of the data from the file I want to use as a distributed cache.
The Job Configuration is as follows:
Configuration config5 = new Configuration();
JobConf conf5 = new JobConf(config5, Job5.class);
conf5.setJobName("Job5");
conf5.setOutputKeyClass(Text.class);
conf5.setOutputValueClass(Text.class);
conf5.setMapperClass(MapThree4c.class);
conf5.setReducerClass(ReduceThree5.class);
conf5.setInputFormat(TextInputFormat.class);
conf5.setOutputFormat(TextOutputFormat.class);
DistributedCache.addCacheFile(new URI("/home/users/mlakshm/ap1228"), conf5);
FileInputFormat.setInputPaths(conf5, new Path(other_args.get(5)));
FileOutputFormat.setOutputPath(conf5, new Path(other_args.get(6)));
JobClient.runJob(conf5);
In the Mapper, I have the following code:
public class MapThree4c extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private Set<String> prefixCandidates = new HashSet<String>();
    Text a = new Text();

    public void configure(JobConf conf5) {
        Path[] dates = new Path[0];
        try {
            dates = DistributedCache.getLocalCacheFiles(conf5);
            System.out.println("candidates: " + candidates);
            String astr = dates.toString();
            a = new Text(astr);
        } catch (IOException ioe) {
            System.err.println("Caught exception while getting cached files: " +
                StringUtils.stringifyException(ioe));
        }
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line);
        st.nextToken();
        String t = st.nextToken();
        String uidi = st.nextToken();
        String uidj = st.nextToken();
        String check = null;
        output.collect(new Text(line), a);
    }
}
The output value I am getting from this mapper is [Lorg.apache.hadoop.fs.Path;@786c1a82 instead of the value from the distributed cache file.
That looks like what you get when you call toString() on an array, and if you look at the javadocs for DistributedCache.getLocalCacheFiles(), an array is indeed what it returns. If you need to actually read the contents of the files in the cache, you can open/read them with the standard Java APIs.
From your code:
Path[] dates = DistributedCache.getLocalCacheFiles(conf5);
Implies that:
String astr = dates.toString(); // this is the default toString() of the array above (i.e. dates), which is what you see in the output as [Lorg.apache.hadoop.fs.Path;@786c1a82.
You need to do the following to see the actual paths:
for (Path cacheFile : dates) {
    output.collect(new Text(line), new Text(cacheFile.getName()));
}
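And if you need the file's contents rather than its path, something along these lines inside configure() would do it (a sketch reusing the question's a field; it needs java.io.BufferedReader and java.io.FileReader imports):
Path[] dates = DistributedCache.getLocalCacheFiles(conf5);
if (dates != null && dates.length > 0) {
    // The cached files are localized onto the task node, so plain java.io works here.
    BufferedReader br = new BufferedReader(new FileReader(dates[0].toString()));
    StringBuilder contents = new StringBuilder();
    String cacheLine;
    while ((cacheLine = br.readLine()) != null) {
        contents.append(cacheLine).append('\n');
    }
    br.close();
    a = new Text(contents.toString());
}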
