Terms Aggregation excluding last vowels using Spanish analyzer - Elasticsearch 6.4 - elasticsearch

I am trying to get keywords from a bunch of tweets in the Spanish language. The thing is that when I get the results, the last vowel of most words in the response is removed. Any idea why this is happening?
The data are clean tweets extracted from Twitter in Spanish.
Here is the query:
{
  "query": {
    "bool": {
      "must": {
        "terms": {
          "full_text_sentiment": "positive"
        }
      },
      "filter": {
        "range": {
          "created_at": {
            "gte": greaterThanTime,
            "lte": lessThanTime
          }
        }
      }
    }
  },
  "aggs": {
    "keywords": {
      "terms": { "field": "full_text_clean", "size": 10 }
    }
  }
}
The mapping is the following for the field:
"full_text_clean": {
"type": "text",
"analyzer": "spanish",
"fielddata": true,
"fielddata_frequency_filter": {
"min": 0.1,
"max": 1.0,
"min_segment_size": 10
},
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 512
}
}
}
And these are the buckets in the response:
[ { key: 'aquí', doc_count: 3 },
{ key: 'deport', doc_count: 3 },
{ key: 'informacion', doc_count: 3 },
{ key: '23', doc_count: 2 },
{ key: 'corazon', doc_count: 2 },
{ key: 'dios', doc_count: 2 },
{ key: 'mexic', doc_count: 2 },
{ key: 'mujer', doc_count: 2 },
{ key: 'quier', doc_count: 2 },
{ key: 'siempr', doc_count: 2 }]
where "deport", should be "deporte", "mexic" should be "mexico", "quier" should be "quiero" etc.
Any idea of what is happening?
Thank you!

Hello, the spanish analyzer (reference here) contains a stemming token filter. It is this stemmer that reduces words to their root, and it generally removes a few characters from the end of words.
You can find more information about stemming here.
To avoid this behavior, you will need to create a custom analyzer without stemming.
You can use the example from the documentation and just remove the spanish_stemmer filter.
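For instance, a sketch based on the analyzer rebuild in the documentation, with the spanish_stemmer filter left out (the index and analyzer names here are placeholders):

```json
PUT /tweets_no_stem
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        }
      },
      "analyzer": {
        "spanish_no_stem": {
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_stop"]
        }
      }
    }
  }
}
```

You would then set "analyzer": "spanish_no_stem" on the full_text_clean field instead of "spanish", and reindex.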

Related

OpenSearch / ElasticSearch index mappings

I have a system that ingests multiple scores for events, and we use OpenSearch (previously Elasticsearch) to compute the averages.
For example, an input would be similar to:
// event 1
{
  id: "foo1",
  timestamp: "some-iso8601-timestamp",
  scores: [
    { name: "arbitrary-name-1", value: 80 },
    { name: "arbitrary-name-2", value: 55 },
    { name: "arbitrary-name-3", value: 30 }
  ]
}
// event 2
{
  id: "foo2",
  timestamp: "some-iso8601-timestamp",
  scores: [
    { name: "arbitrary-name-1", value: 90 },
    { name: "arbitrary-name-2", value: 65 },
    { name: "arbitrary-name-3", value: 40 }
  ]
}
The score names are arbitrary and subject to change from time to time.
We ultimately would like to query the data to get the average score values:
[
  { name: "arbitrary-name-1", value: 85 },
  { name: "arbitrary-name-2", value: 60 },
  { name: "arbitrary-name-3", value: 35 }
]
However, the only way we have been able to achieve this so far has been to insert multiple documents, one for each score name/value pair in each event. This seems wasteful. The search in place currently is to group the documents by score name and timestamp intervals, then to perform a weighted average of the scores in each bucket.
Is there a way the data can be inserted to allow this query pattern to take place by only adding one document into opensearch per event/record (rather than one document per score per event/record)? How might that look?
Thanks!
Is this what you were trying to do? I got a bit confused. ^^
DELETE /71397606
PUT /71397606
{
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "scores": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "keyword"
          },
          "value": {
            "type": "long"
          }
        }
      },
      "timestamp": {
        "type": "text"
      }
    }
  }
}
POST /_bulk
{"index":{"_index":"71397606"}}
{"id":"foo1","timestamp":"some-iso8601-timestamp","scores":[{"name":"arbitrary-name-1","value":80},{"name":"arbitrary-name-2","value":55},{"name":"arbitrary-name-3","value":30}]}
{"index":{"_index":"71397606"}}
{"id":"foo2","timestamp":"some-iso8601-timestamp","scores":[{"name":"arbitrary-name-1","value":90},{"name":"arbitrary-name-2","value":65},{"name":"arbitrary-name-3","value":40}]}
{"index":{"_index":"71397606"}}
{"id":"foo2","timestamp":"some-iso8601-timestamp","scores":[{"name":"arbitrary-name-1","value":85},{"name":"arbitrary-name-x","value":65},{"name":"arbitrary-name-y","value":40}]}
GET /71397606/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "nested": {
      "nested": {
        "path": "scores"
      },
      "aggs": {
        "pername": {
          "terms": {
            "field": "scores.name",
            "size": 10
          },
          "aggs": {
            "avg": {
              "avg": {
                "field": "scores.value"
              }
            }
          }
        }
      }
    }
  }
}
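With the three sample documents above, the aggregations part of the response should look roughly like this (abridged; buckets are ordered by doc count, so the exact order of the single-document buckets may vary):

```json
{
  "aggregations": {
    "nested": {
      "doc_count": 9,
      "pername": {
        "buckets": [
          { "key": "arbitrary-name-1", "doc_count": 3, "avg": { "value": 85.0 } },
          { "key": "arbitrary-name-2", "doc_count": 2, "avg": { "value": 60.0 } },
          { "key": "arbitrary-name-3", "doc_count": 2, "avg": { "value": 35.0 } },
          { "key": "arbitrary-name-x", "doc_count": 1, "avg": { "value": 65.0 } },
          { "key": "arbitrary-name-y", "doc_count": 1, "avg": { "value": 40.0 } }
        ]
      }
    }
  }
}
```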
PS:
If not, could you give an example?

ElasticSearch aggregation query with List in documents

I have following records of car sales of different brands in different cities.
Document 1
{
  "city": "Delhi",
  "cars": [
    { "name": "Toyota", "purchase": 100, "sold": 80 },
    { "name": "Honda", "purchase": 200, "sold": 150 }
  ]
}
Document 2
{
  "city": "Delhi",
  "cars": [
    { "name": "Toyota", "purchase": 50, "sold": 40 },
    { "name": "Honda", "purchase": 150, "sold": 120 }
  ]
}
I am trying to come up with a query to aggregate car statistics for a given city, but I am not getting the right query.
Required result:
{
  "city": "Delhi",
  "cars": [
    { "name": "Toyota", "purchase": 150, "sold": 120 },
    { "name": "Honda", "purchase": 350, "sold": 270 }
  ]
}
First you need to map your array as a nested field (a script would be complicated and not performant). Nested fields are indexed, so the aggregation will be pretty fast.
Remove your index or create a new one. Please note I use test as the type.
{
  "mappings": {
    "test": {
      "properties": {
        "city": {
          "type": "keyword"
        },
        "cars": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "keyword"
            },
            "purchase": {
              "type": "integer"
            },
            "sold": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}
Index your document (same way you did)
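For instance, a bulk indexing sketch using the two documents from the question (the index name cars_index is a placeholder; test is the type from the mapping above):

```json
POST /_bulk
{"index":{"_index":"cars_index","_type":"test"}}
{"city":"Delhi","cars":[{"name":"Toyota","purchase":100,"sold":80},{"name":"Honda","purchase":200,"sold":150}]}
{"index":{"_index":"cars_index","_type":"test"}}
{"city":"Delhi","cars":[{"name":"Toyota","purchase":50,"sold":40},{"name":"Honda","purchase":150,"sold":120}]}
```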
For the aggregation:
{
  "size": 0,
  "aggs": {
    "avg_grade": {
      "terms": {
        "field": "city"
      },
      "aggs": {
        "resellers": {
          "nested": {
            "path": "cars"
          },
          "aggs": {
            "agg_name": {
              "terms": {
                "field": "cars.name"
              },
              "aggs": {
                "avg_pur": {
                  "sum": {
                    "field": "cars.purchase"
                  }
                },
                "avg_sold": {
                  "sum": {
                    "field": "cars.sold"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
result:
buckets": [
{
"key": "Honda",
"doc_count": 2,
"avg_pur": {
"value": 350
},
"avg_sold": {
"value": 270
}
}
,
{
"key": "Toyota",
"doc_count": 2,
"avg_pur": {
"value": 150
},
"avg_sold": {
"value": 120
}
}
]
If you have indexed the name / city field as text (ask yourself first whether this is necessary), use .keyword in the terms aggregation ("cars.name.keyword").

Aggregation on filtered, nested inner_hits query in ElasticSearch

I'm only a few days new to ElasticSearch, and as a learning exercise have implemented a rudimentary job scraper that aggregates jobs from a few job listing sites and populates an index with some data for me to play with.
My index contains a document for each website that lists jobs. A property of each of these documents is a 'jobs' array, which contains an object for each job that exists on that site. I am considering indexing each job as its own document (especially since the ElasticSearch documentation says that inner_hits is an experimental feature) but for now, I am trying to see if I can accomplish what I want to do using the inner_hits and nested features of ElasticSearch.
I am able to query, filter, and return back only matching jobs. However, I am not sure how to apply the same inner_hits constraints to an aggregation.
This is my mapping:
{
  "jobsitesIdx": {
    "mappings": {
      "sites": {
        "properties": {
          "createdAt": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "jobs": {
            "type": "nested",
            "properties": {
              "company": {
                "type": "string"
              },
              "engagement": {
                "type": "string"
              },
              "link": {
                "type": "string",
                "index": "not_analyzed"
              },
              "location": {
                "type": "string",
                "fields": {
                  "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                  }
                }
              },
              "title": {
                "type": "string"
              }
            }
          },
          "jobscount": {
            "type": "long"
          },
          "sitename": {
            "type": "string"
          },
          "url": {
            "type": "string"
          }
        }
      }
    }
  }
}
This is a query and aggregate that I am trying (from Node.js):
client.search({
  "index": 'jobsitesIdx',
  "type": 'sites',
  "body": {
    "aggs": {
      "jobs": {
        "nested": {
          "path": "jobs"
        },
        "aggs": {
          "location": { "terms": { "field": "jobs.location.raw", "size": 25 } },
          "company": { "terms": { "field": "jobs.company.raw", "size": 25 } }
        }
      }
    },
    "query": {
      "filtered": {
        "query": { "match_all": {} },
        "filter": {
          "nested": {
            "inner_hits": { "size": 1000 },
            "path": "jobs",
            "query": {
              "filtered": {
                "query": { "match_all": {} },
                "filter": {
                  "and": [
                    { "term": { "jobs.location": "york" } },
                    { "term": { "jobs.location": "new" } }
                  ]
                }
              }
            }
          }
        }
      }
    }
  }
}, function (error, response) {
  response.hits.hits.forEach(function (jobsite) {
    var jobs = jobsite.inner_hits.jobs.hits.hits;
    jobs.forEach(function (job) {
      console.log(job);
    });
  });
  console.log(response.aggregations.jobs.location.buckets);
});
This gives me back all inner_hits of jobs in New York, but the aggregate is showing me counts for every location and company, not just the ones matching the inner_hits.
Any suggestions on how to get the aggregate on only the data contained in the matching inner_hits?
Edit:
I am updating this to include an export of the mapping and index data, as requested. I exported this using Taskrabbit's elasticdump tool, found here:
https://github.com/taskrabbit/elasticsearch-dump
The index: http://pastebin.com/WaZwBwn4
The mapping: http://pastebin.com/ZkGnYN94
The above linked data differs from the sample code in my original question in that the index is named jobsites6 in the data instead of jobsitesIdx as referred to in the question. Also, the type in the data is 'job' whereas in the code above it is 'sites'.
I've filled in the callback in the code above to display the response data. I am seeing only jobs in New York from the foreach loop of the inner_hits, as expected, however I am seeing this aggregation for location:
[ { key: 'New York, NY', doc_count: 243 },
{ key: 'San Francisco, CA', doc_count: 92 },
{ key: 'Chicago, IL', doc_count: 43 },
{ key: 'Boston, MA', doc_count: 39 },
{ key: 'Berlin, Germany', doc_count: 22 },
{ key: 'Seattle, WA', doc_count: 22 },
{ key: 'Los Angeles, CA', doc_count: 20 },
{ key: 'Austin, TX', doc_count: 18 },
{ key: 'Anywhere', doc_count: 16 },
{ key: 'Cupertino, CA', doc_count: 15 },
{ key: 'Washington D.C.', doc_count: 14 },
{ key: 'United States', doc_count: 11 },
{ key: 'Atlanta, GA', doc_count: 10 },
{ key: 'London, UK', doc_count: 10 },
{ key: 'Ulm, Deutschland', doc_count: 10 },
{ key: 'Riverton, UT', doc_count: 9 },
{ key: 'San Diego, CA', doc_count: 9 },
{ key: 'Charlotte, NC', doc_count: 8 },
{ key: 'Irvine, CA', doc_count: 8 },
{ key: 'London', doc_count: 8 },
{ key: 'San Mateo, CA', doc_count: 8 },
{ key: 'Boulder, CO', doc_count: 7 },
{ key: 'Houston, TX', doc_count: 7 },
{ key: 'Palo Alto, CA', doc_count: 7 },
{ key: 'Sydney, Australia', doc_count: 7 } ]
Since my inner_hits are limited to those in New York, I can see that the aggregation is not on my inner_hits because it is giving me counts for all locations.
You can achieve this by adding the same filter in your aggregation to only include New York jobs.
Also note that your second aggregation uses company.raw, but in your mapping the jobs.company field has no not_analyzed subfield named raw, so you will probably need to add one if you want to aggregate on the non-analyzed company name.
{
  "_source": [
    "sitename"
  ],
  "query": {
    "filtered": {
      "filter": {
        "nested": {
          "inner_hits": {
            "size": 1000
          },
          "path": "jobs",
          "query": {
            "filtered": {
              "filter": {
                "terms": {
                  "jobs.location": [
                    "new",
                    "york"
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "aggs": {
    "jobs": {
      "nested": {
        "path": "jobs"
      },
      "aggs": {
        "only_loc": {
          "filter": {          <----- add this filter
            "terms": {
              "jobs.location": [
                "new",
                "york"
              ]
            }
          },
          "aggs": {
            "location": {
              "terms": {
                "field": "jobs.location.raw",
                "size": 25
              }
            },
            "company": {
              "terms": {
                "field": "jobs.company",
                "size": 25
              }
            }
          }
        }
      }
    }
  }
}
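If you do want to aggregate on the exact company name, the raw subfield mentioned above could be added to the mapping the same way it already exists on jobs.location (a sketch in the same pre-2.x syntax as the mapping in the question; a reindex is needed for it to take effect):

```json
"company": {
  "type": "string",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
```

With that in place, the second sub-aggregation can use "field": "jobs.company.raw".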

Elasticsearch : get most similar result

I'm relatively new to ElasticSearch. I need to write an API which returns the result whose name is most similar to a given name.
Example: I want to find the phone whose name is most similar to 'samsung s6'. This is my JSON query:
{
  "query": {
    "match": {
      "title": {
        "query": "samsung s6",
        "operator": "and"
      }
    }
  },
  "track_scores": true
}
and I got (formatted on my own):
6 - Samsung S6 Edge - 5.9510574
5 - Samsung S6 - 7.512151
where first field is just an Id, second is Name field on which ElasticSearch performed it's searching, and third is score.
I tried to sort by _score:
{
  "query": {
    "match": {
      "title": {
        "query": "samsung s6",
        "operator": "and"
      }
    }
  },
  "track_scores": true,
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}
It seems to work fine. But when I try another name, e.g. 'iphone 6':
3 - Apple iPhone 6 - 5.569293
1 - Apple iPhone 6 Plus - 5.8405986
How can I get the most similar result matching the name?
UPDATE:
This is the mapping:
"device_group": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "string",
}
}
}

Elasticsearch search for a value across multiple fields

My purpose is to search for a value across multiple fields and return the count of these values as well as the distinct values themselves.
To do this, I realized that I have to use facets.
This is the database schema:
index:
    analysis:
        analyzer:
            custom_search_analyzer:
                type: custom
                tokenizer: standard
                filter: [standard, snowball, lowercase, asciifolding]
            custom_index_analyzer:
                type: custom
                tokenizer: standard
                filter: [standard, snowball, lowercase, asciifolding, custom_filter]
        filter:
            custom_filter:
                type: edgeNGram
                side: front
                min_gram: 1
                max_gram: 20
{
  "structure": {
    "properties": {
      "name": { "type": "string", "search_analyzer": "custom_search_analyzer", "index_analyzer": "custom_index_analyzer" },
      "locality": { "type": "string", "search_analyzer": "custom_search_analyzer", "index_analyzer": "custom_index_analyzer" },
      "province": { "type": "string", "search_analyzer": "custom_search_analyzer", "index_analyzer": "custom_index_analyzer" },
      "region": { "type": "string", "search_analyzer": "custom_search_analyzer", "index_analyzer": "custom_index_analyzer" }
    }
  }
}
and this is the query that I tried to use:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "locality": "bolo" } },
        { "match": { "region": "bolo" } },
        { "match": { "name": "bolo" } }
      ]
    }
  },
  "facets": {
    "region": {
      "query": {
        "term": { "region": "bolo" }
      }
    },
    "locality": {
      "query": {
        "term": { "locality": "bolo" }
      }
    },
    "name": {
      "query": {
        "term": { "name": "bolo" }
      }
    }
  }
}
Of all the tests I've done, this is the query that comes closest to my desired result; however, it does not give me the count per distinct value, only the total count.
For example, the above query returns the following result:
facets: {
  region: {
    _type: query
    count: 0
  }
  locality: {
    _type: query
    count: 2
  }
  name: {
    _type: query
    count: 0
  }
}
I would like to have a result like this (what I've written is probably not syntactically correct, but it shows what I need):
facets: {
  ....
  locality: {
    _type: query
    "terms": [
      { "term": "Bologna", "count": 1 },
      { "term": "Bolognano", "count": 1 }
    ]
  }
}
How can I do this?
I have already tried using "terms" instead of "query" in the facets and setting "index: not_analyzed" on the search fields, but then results are returned only when I search for the exact term, not part of it!
This can be done using aggregations.
The value count aggregation provides you the number of values for a field, while the terms aggregation gives you each unique term together with its document count.
I believe you are looking for the value count aggregation - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-valuecount-aggregation.html
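For the per-term counts shown in the desired output, a terms aggregation sketch might look like this (field name taken from the mapping above; a not_analyzed variant of the field may be needed to get whole values such as "Bologna" back as single terms):

```json
{
  "query": {
    "match": { "locality": "bolo" }
  },
  "aggs": {
    "locality_terms": {
      "terms": { "field": "locality" }
    }
  }
}
```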
