How to use the snowball analyzer with NEST and ElasticSearch?

I am unclear: does the snowball analyzer have to be used when making the index?
var Client = new ElasticClient(Settings);
Client.CreateIndex("pictures", i => i
.Settings(st => st
.Analysis(a => a
.Analyzers(ad => ad
.Snowball("snowball", s => s.Language(SnowballLanguage.English))
)
)
)
);
or when doing the search?
var queryResults = Client.Search<PictureIndex>(s => s
.From(0).Size(10)
.Query(q=>q
.QueryString(qs=>qs
.Analyzer("snowball")
.Query("my test string")
)
)
);
This code doesn't return the expected results.
For example, if I have:
tomato, and
tomatoes
in my index, I'm expecting to find 2 results if I search for tomato, but it's not the case.
I'm trying to make an English only, case insensitive, stemmed search and add fuzziness to accommodate misspellings.
(As a bonus, I'd like to be able to submit a list of synonyms)
Edit:
This is the test code I have, but I think I misunderstand how to enter synonyms. The code returns no matches.
public static class Program
{
public class IndexData
{
public int Id { get; set; }
public string Text { get; set; }
}
public static void Main()
{
var Settings = new ConnectionSettings(new Uri("http://elasticsearch:9200")).DefaultIndex("testindex");
var A = new List<IndexData>
{
new IndexData { Id = 11, Text = "I like red, green and blue. But also cookies and candies" },
new IndexData { Id = 12, Text = "There is a red cookie on the shelf" },
new IndexData { Id = 13, Text = "Blue candies are my favorite" }
};
var Client = new ElasticClient(Settings);
var D = Client.DeleteIndexAsync("testindex").Result;
var U = Client.CreateIndex("testindex", i => i
.Settings(s => s
.Analysis(a => a
.CharFilters(cf => cf
.Mapping("my_char_filter", m => m
.Mappings("Blue => blue", "Red => red", "Green => green")
)
)
.TokenFilters(tf => tf
.Synonym("my_synonym", sf => sf
.Synonyms("red, blue")
.Synonyms("green, blue")
)
)
.Analyzers(an => an
.Custom("my_analyzer", ca => ca
.Tokenizer("standard")
.CharFilters("my_char_filter")
.Filters("lowercase", "stop", "my_synonym")
)
)
)
)
);
var R = Client.IndexDocument(A[0]);
R = Client.IndexDocument(A[1]);
R = Client.IndexDocument(A[2]);
var Articles = Client.Search<IndexData>(s => s
.From(0)
.Size(1000)
.Analyzer("my_analyzer")
.Query(q => q.Fuzzy(fz => fz.Field("text").Value("blue").MaxExpansions(2)))
);
var Documents = Articles.Documents;
}
}
What I am trying to achieve is a text search where:
I can have some minor misspellings
Handle plurals: tomato = tomatoes
I can define synonyms (for example here I'm expecting search for 'blue' to return also the 'red' and the 'green')
Get a sorted list of matches with best hits first. By best, I mean hits that cover more of the search terms.
I have to admit that, despite going over the docs, I am extremely confused by the terminology and the flow of the whole system. Another issue is that half of the samples on the web just don't compile because it looks like the API has changed at some point.
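One possible reading of the test code above, offered as a sketch rather than a verified fix. It assumes NEST 6.x and a node at http://localhost:9200 (both assumptions). Three things differ from the question's code: the custom analyzer is attached to the Text field through a mapping, since defining an analyzer in the index settings does not by itself apply it to any field; the search uses a match query with fuzziness, since a fuzzy query is a term-level query and its value is not analyzed; and the index is refreshed before searching, since newly indexed documents are not searchable until a refresh happens. The built-in "porter_stem" token filter stands in for snowball stemming, and all synonyms sit in one rule on the assumption that a second .Synonyms(...) call replaces the first instead of adding to it.

```csharp
using System;
using Nest;

public class IndexData
{
    public int Id { get; set; }
    public string Text { get; set; }
}

public static class AnalyzerSketch
{
    public static void Main()
    {
        // Assumption: local single node; adjust the URI for your cluster.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DefaultIndex("testindex");
        var client = new ElasticClient(settings);

        client.CreateIndex("testindex", i => i
            .Settings(s => s
                .Analysis(a => a
                    .TokenFilters(tf => tf
                        // One rule that treats the three colours as equivalent.
                        .Synonym("my_synonym", sf => sf
                            .Synonyms("blue, red, green")))
                    .Analyzers(an => an
                        .Custom("my_analyzer", ca => ca
                            .Tokenizer("standard")
                            // lowercase gives case insensitivity, porter_stem handles plurals.
                            .Filters("lowercase", "stop", "my_synonym", "porter_stem")))))
            // Attach the analyzer to the field; it is then used at index time
            // and, by default, at search time for queries on that field.
            .Mappings(m => m
                .Map<IndexData>(mm => mm
                    .Properties(p => p
                        .Text(t => t
                            .Name(n => n.Text)
                            .Analyzer("my_analyzer"))))));

        client.IndexDocument(new IndexData { Id = 12, Text = "There is a red cookie on the shelf" });
        client.IndexDocument(new IndexData { Id = 13, Text = "Blue candies are my favorite" });

        // Make the documents visible to search before querying.
        client.Refresh("testindex");

        var response = client.Search<IndexData>(s => s
            .From(0)
            .Size(10)
            .Query(q => q
                .Match(mq => mq
                    .Field(f => f.Text)
                    .Query("blue")
                    .Fuzziness(Fuzziness.Auto))));

        foreach (var hit in response.Hits)
            Console.WriteLine($"{hit.Score}: {hit.Source.Text}");
    }
}
```

Hits come back ordered by relevance score by default, which is close to the "best hits first" requirement.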

Related

EF Core query with GroupBy and Count not working as expected

I have a .NET Core 3.1 project with EF Core 3.1.8. Let's say I have two entities:
public class Card
{
public int CardId { get; set; }
public int Stage { get; set; }
public int SectionId { get; set; }
public Section Section { get; set; }
}
public class Section
{
public int SectionId { get; set; }
public string Title { get; set; }
public List<Card> Cards { get; set; }
}
Now I want a query that gives me the sections and for each section the information of how many Cards with Stage=1, Stage=2, Stage=3 etc. are in there.
I tried this:
var q = _dbContext.Sections
.Include(s => s.Cards)
.Select(s => new
{
s.SectionId,
cards = s.Cards
.Select(c => c.Stage)
.GroupBy(c => c)
.Select(c => new { c.Key, count = c.Count() })
})
.ToList();
But the result always contains only one section with only one card. How can I do this?
I made a slight tweak to the GroupBy:
var q = _dbContext.Sections
.Include(s => s.Cards)
.GroupBy(s => s.SectionId)
.Select(s => new
{
s.Key,
cards = s.SelectMany(t => t.Cards)
.GroupBy(c => c.Stage)
.Select(c => new { c.Key, count = c.Count() })
})
.ToList();
When I run into issues where EntityFramework isn't quite behaving as I would expect, I tend to fall back to thinking about how I would do this in SQL directly. Mimicking that usually makes EF work.
//Create a query to group and count the cards
//In SQL:
// SELECT SectionId, Stage, COUNT(CardId)
// FROM Cards
// GROUP BY SectionId, Stage
//In EF (note, not executing just building up the query):
var cardCountQuery = context.Cards
.Select(c => new
{
c.SectionId,
c.Stage
})
.GroupBy(c => c)
.Select(c => new
{
SectionAndStage = c.Key,
Count = c.Count()
});
//Now use that as a subquery and join to sections
//In SQL
// SELECT s.SectionId, s.Title, c.Stage, c.CardCount
// FROM Sections s
// INNER JOIN (
// SELECT SectionId, Stage, COUNT(CardId) AS CardCount
// FROM Cards
// GROUP BY SectionId, Stage
// ) c ON c.SectionId = s.SectionId
//In EF:
var sectionsWithCardCountByStage = context.Sections
.Join(cardCountQuery,
s => s.SectionId,
c => c.SectionAndStage.SectionId,
(s, g) => new
{
s.SectionId,
g.SectionAndStage.Stage,
CardCount = g.Count
})
.ToList();
Edit: Reshaping the data per comment
From what is above we can then reshape the data to what you are looking for.
//If you don't mind bringing back the Section data multiple times (this will increase the result set size), you can alter the above to bring back the entire Section in the query and then reshape it in memory.
//NOTE: This will only bring back Sections that have cards
var sectionsWithCardCountByStage = context.Sections
.Join(cardCountQuery,
s => s.SectionId,
c => c.SectionAndStage.SectionId,
(s, g) => new
{
Section = s,
g.SectionAndStage.Stage,
CardCount = g.Count
})
.ToList()
.GroupBy(g => g.Section.SectionId)
.Select(g => new
{
g.First().Section,
Cards = g.ToDictionary(c => c.Stage, c => c.CardCount)
})
.ToList();
//Or you can bring back Sections only once to reduce result set size. This extends the query from the first response section above.
var sections = context.Sections
.Where(s => s.Cards.Count > 0) //Only bring back sections with cards. Remove it to bring back all sections and have an empty dictionary of card counts.
.ToList()
.Select(s => new
{
Section = s,
Cards = sectionsWithCardCountByStage
.Where(c => c.SectionId == s.SectionId)
.ToDictionary(c => c.Stage, c => c.CardCount)
})
.ToList();
EDIT: I try to minimize my queries and bring back only the data necessary to do the job. But if you aren't dealing with a lot of data, this might offer a more compact single-query option, at the expense of possibly bringing back more data than you need and thus a larger result set.
var sections = context.Sections
.Include(s => s.Cards)
.ToList()
.Select(s => new
{
Section = s,
CardCount = s.Cards.GroupBy(c => c.Stage)
.Select(g => new { Stage = g.Key, CardCount = g.Count() })
.ToDictionary(c => c.Stage, c => c.CardCount)
})
.ToList();

LINQ DistinctBy choosing what object to keep

Say I have a list of objects and I don't want to allow duplicates of a certain attribute of the objects. My understanding is that I can use DistinctBy() to remove the duplicates. My question is: how do I choose which of the objects with the same attribute value to keep?
Example:
How would I go about removing any objects with a duplicate value of "year" in the list tm and keep the object with the highest value of someValue?
class TestModel{
public int year{ get; set; }
public int someValue { get; set; }
}
List<TestModel> tm = new List<TestModel>();
//populate list
//I was thinking something like this
tm.DistinctBy(x => x.year).Select(x => max(X=>someValue))
You can use GroupBy and Aggregate (LINQ had no built-in MaxBy method before .NET 6):
tm
.GroupBy(x => x.year)
.Select(g => g.Aggregate((acc, next) => acc.someValue > next.someValue ? acc : next))
Use the GroupBy followed by the SelectMany/Take(1) pattern with an OrderBy:
IEnumerable<TestModel> result =
tm
.GroupBy(x => x.year)
.SelectMany(xs =>
xs
.OrderByDescending(x => x.someValue)
.Take(1));
Here's an example:
List<TestModel> tm = new List<TestModel>()
{
new TestModel() { year = 2020, someValue = 5 },
new TestModel() { year = 2020, someValue = 15 },
new TestModel() { year = 2019, someValue = 6 },
};
That gives me two items: year 2020 with someValue 15, and year 2019 with someValue 6.
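For reference, a minimal self-contained sketch of the "group, then keep the best item per group" idea from the answers above. The class name KeepMaxPerYear and the method name Run are made up for illustration; the model and sample data are the ones from the question and the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class TestModel
{
    public int year { get; set; }
    public int someValue { get; set; }
}

public static class KeepMaxPerYear
{
    // For each distinct year, keeps the item with the highest someValue.
    public static List<TestModel> Run(IEnumerable<TestModel> items) =>
        items
            .GroupBy(x => x.year)
            .Select(g => g.OrderByDescending(x => x.someValue).First())
            .ToList();

    public static void Main()
    {
        var tm = new List<TestModel>
        {
            new TestModel { year = 2020, someValue = 5 },
            new TestModel { year = 2020, someValue = 15 },
            new TestModel { year = 2019, someValue = 6 },
        };

        foreach (var item in Run(tm))
            Console.WriteLine($"{item.year}: {item.someValue}");
        // prints "2020: 15" then "2019: 6"
    }
}
```

Sorting each group is simple to read; for large groups the Aggregate version avoids the sort and does a single pass.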

ElasticSearch NEST 5.6.1 Query for unit test

I wrote a bunch of queries to Elasticsearch and I wanted to write unit tests for them. Using this post, moq an elastic connection, I was able to perform general mocking. But when I tried to view the JSON generated from my query, I couldn't find any way to get it.
I tried to follow this post, elsatic query moq, but it is relevant only to older versions of NEST, because the ConnectionStatus and RequestInformation members are no longer available on an ISearchResponse object.
My test looks as follows:
[TestMethod]
public void VerifyElasticFuncJson()
{
//Arrange
var elasticService = new Mock<IElasticService>();
var elasticClient = new Mock<IElasticClient>();
var clinet = new ElasticClient();
var searchResponse = new Mock<ISearchResponse<ElasticLog>>();
elasticService.Setup(es => es.GetConnection())
.Returns(elasticClient.Object);
elasticClient.Setup(ec => ec.Search(It.IsAny<Func<SearchDescriptor<ElasticLog>,
ISearchRequest>>())).
Returns(searchResponse.Object);
//Act
var service = new ElasticCusipInfoQuery(elasticService.Object);
var FindFunc = service.MatchCusip("CusipA", HostName.GSMSIMPAPPR01,
LogType.Serilog);
var con = GetConnection();
var search = con.Search<ElasticLog>(sd => sd
.Type(LogType.Serilog)
.Index("logstash-*")
.Query(q => q
.Bool(b => b
.Must(FindFunc)
)
)
);
// HERE I want to get the JSON and assert that it looks as expected
}
Is there any other way to achieve what I ask?
The best way to do this would be to use the InMemoryConnection to capture the request bytes and compare this to the expected JSON. This is what the unit tests for NEST do. Something like
private static void Main()
{
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var connectionSettings = new ConnectionSettings(pool, new InMemoryConnection())
.DefaultIndex("default")
.DisableDirectStreaming();
var client = new ElasticClient(connectionSettings);
// Act
var searchResponse = client.Search<Question>(s => s
.Query(q => (q
.Match(m => m
.Field(f => f.Title)
.Query("Kibana")
) || q
.Match(m => m
.Field(f => f.Title)
.Query("Elasticsearch")
.Boost(2)
)) && +q
.Range(t => t
.Field(f => f.Score)
.GreaterThan(0)
)
)
);
var actual = searchResponse.RequestJson();
var expected = new
{
query = new {
@bool = new {
must = new object[] {
new {
@bool = new {
should = new object[] {
new {
match = new {
title = new {
query = "Kibana"
}
}
},
new {
match = new {
title = new {
query = "Elasticsearch",
boost = 2d
}
}
}
},
}
},
new {
@bool = new {
filter = new [] {
new {
range = new {
score = new {
gt = 0d
}
}
}
}
}
}
}
}
}
};
// Assert
Console.WriteLine(JObject.DeepEquals(JToken.FromObject(expected), JToken.Parse(actual)));
}
public static class Extensions
{
public static string RequestJson(this IResponse response) =>
Encoding.UTF8.GetString(response.ApiCall.RequestBodyInBytes);
}
I've used an anonymous type for the expected JSON as it's easier to work with than an escaped JSON string.
One thing to note is that Json.NET's JObject.DeepEquals(...) will return true even when there are repeated object keys in a JSON object (so long as the last key/value matches). It's not likely something you'll encounter if you're only serializing NEST searches though, but something to be aware of.
If you're going to have many tests checking serialization, you'll want to create a single instance of ConnectionSettings and share with all, so that you can take advantage of the internal caches within it and your tests will run quicker than instantiating a new instance in each test.
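A minimal sketch of that sharing, reusing only the calls shown in the example above. The class name SharedTestClient is hypothetical, and the URI is never contacted because InMemoryConnection short-circuits the request.

```csharp
using System;
using Elasticsearch.Net;
using Nest;

public static class SharedTestClient
{
    // Built once, on first use, and reused by every test so that the caches
    // inside ConnectionSettings are not rebuilt for each test.
    private static readonly Lazy<IElasticClient> LazyClient =
        new Lazy<IElasticClient>(() =>
        {
            var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
            var settings = new ConnectionSettings(pool, new InMemoryConnection())
                .DefaultIndex("default")
                .DisableDirectStreaming(); // keeps the request bytes available for assertions
            return new ElasticClient(settings);
        });

    public static IElasticClient Instance => LazyClient.Value;
}
```

Each test would then call SharedTestClient.Instance.Search<...>(...) and compare RequestJson() on the response with the expected JSON.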

elasticsearch nest stopword filter does not work

I am trying to use the Elasticsearch NEST client. I am indexing documents and SQL data and can search them perfectly, but I am not able to apply stopwords to these records. Below is the code. Please note that I put "abc" in my stopword list.
public IndexSettings GetIndexSettings()
{
var stopTokenFilter = new StopTokenFilter();
string stopwordsfilePath = Convert.ToString(ConfigurationManager.AppSettings["Stopwords"]);
string[] stopwordsLines = System.IO.File.ReadAllLines(stopwordsfilePath);
List<string> words = new List<string>();
foreach (string line in stopwordsLines)
{
words.Add(line);
}
stopTokenFilter.Stopwords = words;
var settings = new IndexSettings { NumberOfReplicas = 0, NumberOfShards = 5 };
settings.Settings.Add("merge.policy.merge_factor", "10");
settings.Settings.Add("search.slowlog.threshold.fetch.warn", "1s");
settings.Analysis.Analyzers.Add("xyz", new StandardAnalyzer { StopWords = words });
settings.Analysis.Tokenizers.Add("keyword", new KeywordTokenizer());
settings.Analysis.Tokenizers.Add("standard", new StandardTokenizer());
settings.Analysis.TokenFilters.Add("standard", new StandardTokenFilter());
settings.Analysis.TokenFilters.Add("lowercase", new LowercaseTokenFilter());
settings.Analysis.TokenFilters.Add("stop", stopTokenFilter);
settings.Analysis.TokenFilters.Add("asciifolding", new AsciiFoldingTokenFilter());
settings.Analysis.TokenFilters.Add("word_delimiter", new WordDelimiterTokenFilter());
return settings;
}
public void CreateDocumentIndex(string indexName = null)
{
IndexSettings settings = GetIndexSettings();
if (!this.client.IndexExists(indexName).Exists)
{
this.client.CreateIndex(indexName, c => c
.InitializeUsing(settings)
.AddMapping<Document>
(m => m.Properties(ps => ps.Attachment
(a => a.Name(o => o.Documents)
.TitleField(t => t.Name(x => x.Name)
.TermVector(TermVectorOption.WithPositionsOffsets))))));
}
var r = this.client.GetIndexSettings(i => i.Index(indexName));
}
Indexing Data
var documents = GetDocuments();
documents.ForEach((document) =>
{
indexRepository.IndexData<Document>(document, DOCindexName, DOCtypeName);
});
public bool IndexData<T>(T data, string indexName = null, string mappingType = null)
where T : class, new()
{
if (data == null)
{
throw new ArgumentNullException("data");
}
var result = this.client.Index<T>(data, c => c.Index(indexName).Type(mappingType));
return result.IsValid;
}
In one of my documents I have put a single line, "abc", and I do not expect it to be returned because "abc" is in my stopword list. But when searching, that document is returned as well. Below is the search query.
public IEnumerable<dynamic> GetAll(string queryTerm)
{
var queryResult = this.client.Search<dynamic>(d => d
.Analyzer("xyz")
.AllIndices()
.AllTypes()
.QueryString(queryTerm)).Documents;
return queryResult;
}
Please suggest where I am going wrong.

How would you address this aggregation/reporting scenario based on RavenDB document data?

We're using RavenDB (2261) as the back end for a queue-based video upload system, and we've been asked to provide a 'live' SLA report on various metrics to do with the upload system.
The document format looks like this:
{
"ClipGuid": "01234567-1234-abcd-efef-123412341234",
"CustomerId": "ABC123",
"Title": "Shakespeare in Love",
"DurationInSeconds": 82,
"StateChanges": [
{
"OldState": "DoesNotExist",
"NewState": "ReceivedFromUpload",
"ChangedAt": "2013-03-15T15:38:38.7050002Z"
},
{
"OldState": "ReceivedFromUpload",
"NewState": "Validating",
"ChangedAt": "2013-03-15T15:38:38.8453975Z"
},
{
"OldState": "Validating",
"NewState": "AwaitingSubmission",
"ChangedAt": "2013-03-15T15:38:39.9529762Z"
},
{
"OldState": "AwaitingSubmission",
"NewState": "Submitted",
"ChangedAt": "2013-03-15T15:38:43.4785084Z"
},
{
"OldState": "Submitted",
"NewState": "Playable",
"ChangedAt": "2013-03-15T15:41:39.5523223Z"
}
],
}
Within each ClipInfo record, there's a collection of StateChanges that are added each time the clip is passed from one part of the processing chain to another. What we need to do is reduce these StateChanges to two specific timespans: we need to know how long a clip took to change from DoesNotExist to AwaitingSubmission, and how long it took from DoesNotExist to Playable. We then need to group these durations by date/time, so we can draw a simple SLA report that looks like this:
The necessary predicates can be expressed as LINQ statements, but when I try specifying this sort of complex logic within a Raven query I just seem to get back empty results (or lots of DateTime.MinValue results).
I realise document databases like Raven aren't ideal for reporting - and we're happy to explore replication into SQL or some other sort of caching mechanism - but at the moment I just can't see any way of extracting the data other than doing multiple queries to retrieve the entire contents of the store and then performing the calculations in .NET.
Any recommendations?
Thanks,
Dylan
I have made some assumptions which you may need to adjust for:
You operate strictly in the UTC time zone - your "day" is midnight to midnight UTC.
Your week is Sunday through Saturday
The date you want to group by is the first status date reported (the one marked with "DoesNotExist" as its old state.)
You will need a separate map/reduce index per date bracket that you are grouping on - Daily, Weekly, Monthly.
They are almost identical, except for how the starting date is defined. If you want to get creative, you might be able to come up with a way to make these into a generic index definition - but they will always end up being three separate indexes in RavenDB.
// This is the resulting class that all of these indexes will return
public class ClipStats
{
public int CountClips { get; set; }
public int NumPassedWithinTwentyPct { get; set; }
public int NumPlayableWithinOneHour { get; set; }
public DateTime Starting { get; set; }
}
public class ClipStats_ByDay : AbstractIndexCreationTask<ClipInfo, ClipStats>
{
public ClipStats_ByDay()
{
Map = clips => from clip in clips
let state1 = clip.StateChanges.FirstOrDefault(x => x.OldState == "DoesNotExist")
let state2 = clip.StateChanges.FirstOrDefault(x => x.NewState == "AwaitingSubmission")
let state3 = clip.StateChanges.FirstOrDefault(x => x.NewState == "Playable")
let time1 = state2.ChangedAt - state1.ChangedAt
let time2 = state3.ChangedAt - state1.ChangedAt
select new
{
CountClips = 1,
NumPassedWithinTwentyPct = time1.TotalSeconds < clip.DurationInSeconds * 0.2 ? 1 : 0,
NumPlayableWithinOneHour = time2.TotalHours < 1 ? 1 : 0,
Starting = state1.ChangedAt.Date
};
Reduce = results => from result in results
group result by result.Starting
into g
select new
{
CountClips = g.Sum(x => x.CountClips),
NumPassedWithinTwentyPct = g.Sum(x => x.NumPassedWithinTwentyPct),
NumPlayableWithinOneHour = g.Sum(x => x.NumPlayableWithinOneHour),
Starting = g.Key
};
}
}
public class ClipStats_ByWeek : AbstractIndexCreationTask<ClipInfo, ClipStats>
{
public ClipStats_ByWeek()
{
Map = clips => from clip in clips
let state1 = clip.StateChanges.FirstOrDefault(x => x.OldState == "DoesNotExist")
let state2 = clip.StateChanges.FirstOrDefault(x => x.NewState == "AwaitingSubmission")
let state3 = clip.StateChanges.FirstOrDefault(x => x.NewState == "Playable")
let time1 = state2.ChangedAt - state1.ChangedAt
let time2 = state3.ChangedAt - state1.ChangedAt
select new
{
CountClips = 1,
NumPassedWithinTwentyPct = time1.TotalSeconds < clip.DurationInSeconds * 0.2 ? 1 : 0,
NumPlayableWithinOneHour = time2.TotalHours < 1 ? 1 : 0,
Starting = state1.ChangedAt.Date.AddDays(0 - (int) state1.ChangedAt.Date.DayOfWeek)
};
Reduce = results => from result in results
group result by result.Starting
into g
select new
{
CountClips = g.Sum(x => x.CountClips),
NumPassedWithinTwentyPct = g.Sum(x => x.NumPassedWithinTwentyPct),
NumPlayableWithinOneHour = g.Sum(x => x.NumPlayableWithinOneHour),
Starting = g.Key
};
}
}
public class ClipStats_ByMonth : AbstractIndexCreationTask<ClipInfo, ClipStats>
{
public ClipStats_ByMonth()
{
Map = clips => from clip in clips
let state1 = clip.StateChanges.FirstOrDefault(x => x.OldState == "DoesNotExist")
let state2 = clip.StateChanges.FirstOrDefault(x => x.NewState == "AwaitingSubmission")
let state3 = clip.StateChanges.FirstOrDefault(x => x.NewState == "Playable")
let time1 = state2.ChangedAt - state1.ChangedAt
let time2 = state3.ChangedAt - state1.ChangedAt
select new
{
CountClips = 1,
NumPassedWithinTwentyPct = time1.TotalSeconds < clip.DurationInSeconds * 0.2 ? 1 : 0,
NumPlayableWithinOneHour = time2.TotalHours < 1 ? 1 : 0,
Starting = state1.ChangedAt.Date.AddDays(1 - state1.ChangedAt.Date.Day)
};
Reduce = results => from result in results
group result by result.Starting
into g
select new
{
CountClips = g.Sum(x => x.CountClips),
NumPassedWithinTwentyPct = g.Sum(x => x.NumPassedWithinTwentyPct),
NumPlayableWithinOneHour = g.Sum(x => x.NumPlayableWithinOneHour),
Starting = g.Key
};
}
}
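The three indexes above differ only in how Starting is computed. As a quick check of that date arithmetic outside RavenDB, here is a small standalone sketch; the class and method names are made up for illustration, and the sample timestamp is the first ChangedAt value from the document in the question, truncated to whole seconds.

```csharp
using System;
using System.Globalization;

public static class DateBuckets
{
    // Midnight UTC of the same day.
    public static DateTime StartOfDay(DateTime utc) => utc.Date;

    // The most recent Sunday (DayOfWeek.Sunday == 0), matching the Sunday-to-Saturday week.
    public static DateTime StartOfWeek(DateTime utc) =>
        utc.Date.AddDays(0 - (int)utc.Date.DayOfWeek);

    // The first day of the same month.
    public static DateTime StartOfMonth(DateTime utc) =>
        utc.Date.AddDays(1 - utc.Date.Day);

    public static void Main()
    {
        var changedAt = new DateTime(2013, 3, 15, 15, 38, 38, DateTimeKind.Utc);
        var inv = CultureInfo.InvariantCulture;

        Console.WriteLine(StartOfDay(changedAt).ToString("yyyy-MM-dd", inv));   // 2013-03-15
        Console.WriteLine(StartOfWeek(changedAt).ToString("yyyy-MM-dd", inv));  // 2013-03-10
        Console.WriteLine(StartOfMonth(changedAt).ToString("yyyy-MM-dd", inv)); // 2013-03-01
    }
}
```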
Then when you want to query...
var now = DateTime.UtcNow;
var today = now.Date;
var dailyStats = session.Query<ClipStats, ClipStats_ByDay>()
.FirstOrDefault(x => x.Starting == today);
var startOfWeek = today.AddDays(0 - (int) today.DayOfWeek);
var weeklyStats = session.Query<ClipStats, ClipStats_ByWeek>()
.FirstOrDefault(x => x.Starting == startOfWeek);
var startOfMonth = today.AddDays(1 - today.Day);
var monthlyStats = session.Query<ClipStats, ClipStats_ByMonth>()
.FirstOrDefault(x => x.Starting == startOfMonth);
In the results, you will have totals. So if you want percent averages for your SLA, simply divide the statistic by the count, which is also returned.
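As a rough illustration of that division, assuming the ClipStats class from above: the helper name Percent and the numbers are made up. Note that FirstOrDefault returns null when there is no entry for the period, so a real caller would check for that first.

```csharp
using System;

public class ClipStats
{
    public int CountClips { get; set; }
    public int NumPassedWithinTwentyPct { get; set; }
    public int NumPlayableWithinOneHour { get; set; }
    public DateTime Starting { get; set; }
}

public static class SlaPercentages
{
    // Percentage of clips that met a target; 0 when there were no clips at all.
    public static double Percent(int part, int total) =>
        total == 0 ? 0.0 : 100.0 * part / total;

    public static void Main()
    {
        // Hypothetical totals, standing in for a result such as dailyStats.
        var stats = new ClipStats
        {
            CountClips = 40,
            NumPassedWithinTwentyPct = 30,
            NumPlayableWithinOneHour = 38,
            Starting = new DateTime(2013, 3, 15)
        };

        Console.WriteLine(Percent(stats.NumPassedWithinTwentyPct, stats.CountClips)); // 75
        Console.WriteLine(Percent(stats.NumPlayableWithinOneHour, stats.CountClips)); // 95
    }
}
```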
