Creating a whitespace character filter - elasticsearch

I want to use a custom analyzer with a pattern tokenizer and a custom token filter. But before that step, I want to split the tokens on each whitespace. I know I can use the whitespace analyzer, but I also want to use my own custom analyzer.
Basically, I want to generate a token on each special character and whitespace in a string.
For example, I have the string "Google's url is https://www.google.com/."
My tokens should be like "Google", "Google'", "Google's", "url", "is", "https", "https:", "https:/", "://", "//www", "/www."... and so on.
Basically, I want my tokens to be like those of an n-gram, but limited, as below, breaking only on special characters.
My TokenizerFactory file looks like this:
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

import java.util.regex.Pattern;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    private final Pattern pattern;
    private final int group;

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
        String sPattern = settings.get("pattern", "[^\\p{L}\\p{N}]");
        if (sPattern == null) {
            throw new IllegalArgumentException("pattern is missing for [" + name + "] tokenizer of type 'pattern'");
        }
        this.pattern = Regex.compile(sPattern, settings.get("flags"));
        this.group = settings.getAsInt("group", -1);
    }

    @Override
    public Tokenizer create() {
        return new PatternTokenizer(pattern, group);
    }
}
My TokenFilterfactory file is currently empty.

You can simply use the whitespace tokenizer in your custom analyzer definition. Below is an example of a custom analyzer, my_custom_analyzer, that uses it as its tokenizer and is applied to the title field:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
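To verify the analyzer before indexing anything, the _analyze API can be run against an index created with the settings above (my_index is a placeholder index name). Note that the whitespace tokenizer splits only on whitespace; it does not break on other special characters:

```json
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Google's url is https://www.google.com/."
}
```

With this definition the expected tokens are google's, url, is, and https://www.google.com/. (lowercased by the filter); producing the progressive "Google", "Google'", "Google's" tokens from the question would additionally require something like an edge_ngram filter.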

Related

Liferay Elastic Search Query: Search for DLFileEntries that have a Custom Document Type

I work with Liferay 7.2 and I need an Elasticsearch query that finds all DLFileEntries that have the Document Type "XY". Currently I run it in Postman.
I am already able to find all DLFileEntries:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "entryClassName": "com.liferay.document.library.kernel.model.DLFileEntry"
          }
        }
      ]
    }
  }
}
But I need to find only those DLFileEntries that have Document Type "XY".
How can I do this?
You can simply add another match on the field fileEntryTypeId, whose value must equal the id of the created Document Type. You can find this id in the dlfileentrytype table, in the fileentrytypeid column. Assuming the id equals 37105, the query would look like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "entryClassName": "com.liferay.document.library.kernel.model.DLFileEntry"
          }
        },
        {
          "match": {
            "fileEntryTypeId": "37105"
          }
        }
      ]
    }
  }
}
Edit: Responding to your comment about how to search for a DLFileEntry by its DLFileEntryType name: there is no direct way to do this, as the DLFileEntryType is not indexed in Elasticsearch by default. It would probably also require subqueries, and Elasticsearch doesn't support subqueries.
With that in mind, the easiest approach I can think of is to customize the way DLFileEntry is indexed in Elasticsearch, adding the field fileEntryTypeName. For that, you only need to implement a ModelDocumentContributor for DLFileEntry and add the fileEntryTypeName field to the document.
Basically, you just need to create a class like this:
package com.test.liferay.override;

import com.liferay.document.library.kernel.model.DLFileEntry;
import com.liferay.portal.kernel.exception.PortalException;
import com.liferay.portal.kernel.search.Document;
import com.liferay.portal.search.spi.model.index.contributor.ModelDocumentContributor;
import org.osgi.service.component.annotations.Component;

@Component(
    immediate = true,
    property = "indexer.class.name=com.liferay.document.library.kernel.model.DLFileEntry",
    service = ModelDocumentContributor.class
)
public class DLFileEntryModelDocumentContributor
    implements ModelDocumentContributor<DLFileEntry> {

    @Override
    public void contribute(Document document, DLFileEntry dlFileEntry) {
        try {
            document.addText(
                "fileEntryTypeName", dlFileEntry.getDLFileEntryType().getName());
        } catch (PortalException e) {
            // handle error
        }
    }
}
As the DLFileEntryType name is localized, you should probably index it as a localized value:
package com.test.liferay.override;

import com.liferay.document.library.kernel.model.DLFileEntry;
import com.liferay.portal.kernel.exception.PortalException;
import com.liferay.portal.kernel.search.Document;
import com.liferay.portal.kernel.search.Field;
import com.liferay.portal.kernel.util.LocaleUtil;
import com.liferay.portal.search.spi.model.index.contributor.ModelDocumentContributor;
import org.osgi.service.component.annotations.Component;

import java.util.Locale;

@Component(
    immediate = true,
    property = "indexer.class.name=com.liferay.document.library.kernel.model.DLFileEntry",
    service = ModelDocumentContributor.class
)
public class DLFileEntryModelDocumentContributor
    implements ModelDocumentContributor<DLFileEntry> {

    @Override
    public void contribute(Document document, DLFileEntry dlFileEntry) {
        try {
            Locale siteDefaultLocale = LocaleUtil.getSiteDefault();
            String localizedName = dlFileEntry
                .getDLFileEntryType().getName(siteDefaultLocale);
            String localizedField = Field.getLocalizedName(
                siteDefaultLocale, "fileEntryTypeName");
            document.addText(localizedField, localizedName);
        } catch (PortalException e) {
            // handle error
        }
    }
}
Now your query will be something like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "entryClassName": "com.liferay.document.library.kernel.model.DLFileEntry"
          }
        },
        {
          "match": {
            "fileEntryTypeName_en_US": "XY"
          }
        }
      ]
    }
  }
}
The field name fileEntryTypeName_en_US depends on your site's default locale. For example, if it were pt_BR, the name would be fileEntryTypeName_pt_BR.
Note: the fileEntryType name field is not unique, as it is localized, so you might find files with the same fileEntryType name but a different fileEntryType.

Spring Data Elasticsearch setting annotation did not take effect

I'm trying out Spring Data Elasticsearch and have a class defined like this:
@Data
@NoArgsConstructor
@AllArgsConstructor
@Document(indexName = "master", type = "master", shards = 1, replicas = 0)
@Setting(settingPath = "/settings/setting.json")
public class Master {

    @Id
    private String id;

    @MultiField(mainField = @Field(type = FieldType.String, store = true),
        otherFields = {
            @InnerField(suffix = "autocomplete", type = FieldType.String, indexAnalyzer = "autocomplete", searchAnalyzer = "standard")
        }
    )
    private String firstName;

    private String lastName;
}
The setting file is under /src/main/settings/setting.json, which looks like this
{
  "index": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
I ran my test class by first deleting the index and then recreating it like this:
elasticsearchTemplate.deleteIndex(Master.class);
elasticsearchTemplate.createIndex(Master.class);
elasticsearchTemplate.putMapping(Master.class);
elasticsearchTemplate.refresh(Master.class);
But when I try to save something into the index, I get this MapperParsingException:
2017-10-04 18:56:31.806 ERROR 2942 --- [ main] .d.e.r.s.AbstractElasticsearchRepository : failed to load elasticsearch nodes : org.elasticsearch.index.mapper.MapperParsingException: analyzer [autocomplete] not found for field [autocomplete]
I spent four hours trying to figure this out, looked at the debug log, nothing.
I tried to break the JSON format by deleting a comma; it broke, so the JSON is being read.
I used the REST API to query the master index, but the settings don't seem to contain the autocomplete analyzer, or any analyzer at all.
The weird thing is that my document can be saved and queried even with this error. But I do want this analyzer.
BTW, this is the parent class in a parent-child relationship, if that's relevant.
Finally got it figured out!
I had to put the same setting on all domain classes using the same index (both parent and child), then delete the index and restart the server, and it worked!
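When debugging this kind of issue, it can help to check what actually reached the cluster. Assuming the index name master from the question, the applied settings and the analyzer can be inspected directly over REST:

```json
GET master/_settings

POST master/_analyze
{
  "analyzer": "autocomplete",
  "text": "master"
}
```

If the @Setting file was applied, the _analyze call should return the edge n-grams m, ma, mas, mast, maste, master; if the analyzer never made it into the index settings, it fails with an "analyzer not found" style error instead.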

Elasticsearch. Can not find custom analyzer

I have a model like this:
@Getter
@Setter
@Document(indexName = "indexName", type = "typeName")
@Setting(settingPath = "/elastic/elastic-setting.json")
public class Model extends BaseModel {

    @Field(type = FieldType.String, index = FieldIndex.analyzed, analyzer = "customAnalyzer")
    private String name;
}
And I have elastic-setting.json at ../resources/elastic/elastic-setting.json:
{
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "0",
    "analysis": {
      "analyzer": {
        "customAnalyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email"
        }
      }
    }
  }
}
I clean my Elasticsearch DB, and when I start my application I get this exception:
MapperParsingException[analyzer [customAnalyzer] not found for field [name]]
What's wrong with my code?
Help me, please!
EDIT
Val, I thought @Setting was an addition to @Document, but it looks like they are used interchangeably.
In my case I also have another model, with:
@Document(indexName = "indexName", type = "anotherTypeName")
So first the index named "indexName" is created for the other model; then, when Elasticsearch prepares Model, it sees that an index named "indexName" already exists, and it does not apply @Setting.
Now I have another question:
How do I add a custom analyzer to an already created index in Java code, for example in an InitializingBean? Something like: is my analyzer created? No: create it. Yes: do nothing.
Modify your elastic-setting.json file like this:
{
  "index": {
    "number_of_shards": "1",
    "number_of_replicas": "0"
  },
  "analysis": {
    "analyzer": {
      "customAnalyzer": {
        "type": "custom",
        "tokenizer": "uax_url_email"
      }
    }
  }
}
Note that you need to delete your index first and recreate it.
UPDATE
You can certainly add a custom analyzer via Java code; however, you won't be able to change your existing mapping to use that analyzer, so you're really better off wiping your index and recreating it from scratch with a proper elastic-setting.json file.
For Val:
Yeah, I use something like this.
Previously, I had added @Setting to one of my entity classes, but when I started the app the index with the same name had already been created, before Spring Data analyzed the entity with @Setting, so the index was not modified.
Now I add the annotation @Setting(path = "elastic-setting.json") on the abstract baseModel; the class at the top of the hierarchy is scanned first, and the analyzer is created as well.
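Regarding the follow-up question about adding an analyzer to an existing index: Elasticsearch does allow it, but analysis settings are static, so the index must be closed first, updated, and reopened. A sketch of the REST calls (indexName is the index from the question; fields already mapped before the change will not automatically pick up the new analyzer):

```json
POST indexName/_close

PUT indexName/_settings
{
  "analysis": {
    "analyzer": {
      "customAnalyzer": {
        "type": "custom",
        "tokenizer": "uax_url_email"
      }
    }
  }
}

POST indexName/_open
```

The same sequence can be issued from Java via the indices admin client; the check "does my analyzer already exist?" amounts to fetching the index settings and looking for the analysis.analyzer.customAnalyzer key before closing the index.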

elasticsearch with NativeSearchQueryBuilder space and uppercase

I'm using the following code to filter via the Elasticsearch Java API. It works fine and returns results if I use a simple one-word query string, but if the text contains spaces or uppercase letters it returns no data.
If I use
String query = {"bool":{"should":[{"term":{"name":"test"}}]}}
it returns data, but if I use
String query = {"bool":{"should":[{"term":{"name":"test airportone"}}]}}
or
String query = {"bool":{"should":[{"term":{"name":"TEST"}}]}}
it returns no data.
String query = {"bool":{"should":[{"term":{"name":"test airport one"}}]}}

BoolQueryBuilder bool = new BoolQueryBuilder();
bool.must(new WrapperQueryBuilder(query));
SearchQuery searchQuery = new NativeSearchQueryBuilder()
    .withQuery(bool)
    .build();
Page<Asset> asset = elasticsearchTemplate.queryForPage(searchQuery, Asset.class);
return asset.getContent();
You have two options, depending on your use case.
First option: use match instead of term to take advantage of Elasticsearch's full-text search capabilities:
{
  "bool": {
    "should": [{
      "match": {
        "name": "test airportone"
      }
    }]
  }
}
Second option: specify that the name field is not analyzed when mapping your index, so Elasticsearch will store it as-is and you will always get an exact match:
"mappings": {
  "user": {
    "properties": {
      "name": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}
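The behavior in the question follows from how term queries work: a term query is not analyzed, while the indexed text was. With the default standard analyzer this can be checked directly via the _analyze API (my_index is a placeholder; this assumes the default mapping):

```json
POST my_index/_analyze
{
  "analyzer": "standard",
  "text": "TEST airportone"
}
```

This returns the lowercased tokens test and airportone, so a term query for "TEST" or "test airportone" matches nothing: no single indexed token is equal to either value.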

Spring data elasticsearch query products with multiple fields

ES newbie here, sorry for the dumb question.
I have been trying to create an Elasticsearch query for a products index. I'm able to query it, but it never returns what I expect.
I'm probably using the query builder in the wrong way; I've tried all sorts of query builders and never got it to work as I expected.
My Product class (simplified for the sake of the question):
public class Product {
    private String sku;
    private Boolean soldOut;
    private Boolean freeShipping;
    private Store store;
    private Category category;
    private Set<ProductUrl> urls;
    private Set<ProductName> names;
}
Category has a name and an id, which I use for aggregations.
The boolean fields are used for filters.
ProductName and ProductUrl both have a String locale and a String name or String url, respectively.
I am currently building my query with the following logic:
private SearchQuery buildSearchQuery(String searchTerm, List<Long> categories, Pageable pageable) {
    NativeSearchQueryBuilder builder = new NativeSearchQueryBuilder();
    if (searchTerm != null) {
        builder.withQuery(
            new MultiMatchQueryBuilder(searchTerm, "names.name", "urls.url", "descriptions.description", "sku")
                .operator(Operator.AND)
                .type(MultiMatchQueryBuilder.Type.MOST_FIELDS)
        );
    }
    builder.withPageable(pageable);
    return builder.build();
}
The problem is that lots of products are not being matched. For example, the query "andro" does not return "android" products.
What am I missing? Is this way of building the query right?
UPDATE
Adding the names part of my product mapping:
{
  "mappings": {
    "product": {
      "properties": {
        "names": {
          "properties": {
            "id": {
              "type": "long"
            },
            "name": {
              "type": "string"
            },
            "locale": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
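A likely cause, given the mapping above: a multi_match query only matches whole indexed tokens, and the standard analyzer indexes "android" as the single token android, so the prefix "andro" matches nothing. A common fix, sketched here under the assumption that prefix matching is the goal (this is not a confirmed solution from the thread), is an edge_ngram analyzer on names.name, similar to the autocomplete setup in the Spring Data question above:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "names": {
          "properties": {
            "name": {
              "type": "string",
              "analyzer": "autocomplete",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
```

With this, "android" is indexed as a, an, and, ..., andro, android, so a multi_match for "andro" finds it, while search_analyzer: standard keeps the query side from being n-grammed too.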