Uniqueness check in Elasticsearch without constantly refreshing index

I'm indexing a lot of data in Elasticsearch (through NEST) from multiple processes each running multiple threads. Part of indexing a document is finding out if we have seen a similar document before. This feature is implemented by generating a hash of a set of fields on the document and checking if we have documents in Elasticsearch with the same hash. Before indexing a document, I make the following query:
var result = elasticClient
    .Count<MyDocument>(c => c
        .Index(indexName)
        .Query(q => q
            .ConstantScore(qs => qs
                .Filter(f => f
                    .Term(field => field.Hash, hash)))))
...
This returns the count of existing documents with the specified hash. So far so good; things are working. But if a process indexes two documents with the same hash within the same second, the count check fails, because the first document isn't searchable yet. I'm running with the default refresh interval (1 second). For now I have added a refresh call after indexing each document:
var refreshResponse = client.Refresh(indexName);
This also works but it doesn't scale when indexing large amounts of documents (indexing becomes slow as already pointed out here: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).
Any ideas for how to avoid having to call Refresh but still be able to perform a uniqueness check? I'm thinking some sort of local cache shared between all threads, holding the hashes of documents indexed since the last refresh. I know that this won't work across processes, but that is acceptable for now.

I ended up implementing a write-through cache as suggested by Val. This makes it possible to remove the call to Refresh while still performing the count check for each document. It is implemented using a singleton MemoryCache shared between all threads:
var cache = new MemoryCache("hashes");
When checking for uniqueness, I consult the cache whenever no matching documents are found in Elasticsearch:
var result = elasticClient
    .Count<MyDocument>(c => c
        .Index(indexName)
        .Query(q => q
            .ConstantScore(qs => qs
                .Filter(f => f
                    .Term(field => field.Hash, hash)))));

bool isUnique = false;
if (result.Count == 0)
{
    isUnique = !cache.Contains(hash);
}
So the cache only gets the final say when Elasticsearch reports a count of 0. When a document has been successfully indexed, I store its hash in the cache with an expiration:
var policy = new CacheItemPolicy();
policy.AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(5);
cache.AddOrGetExisting(hash, string.Empty, policy);
The TTL could probably be as low as 1 second, since that is the refresh interval I currently have configured on the index.
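For reference, a minimal sketch of how the pieces fit together; the method names and field access are mine, not from the original code:

private bool IsUnique(string hash)
{
    var result = elasticClient
        .Count<MyDocument>(c => c
            .Index(indexName)
            .Query(q => q
                .ConstantScore(qs => qs
                    .Filter(f => f
                        .Term(field => field.Hash, hash)))));

    // Unique only if neither Elasticsearch nor the local cache has seen the hash.
    return result.Count == 0 && !cache.Contains(hash);
}

private void RememberHash(string hash)
{
    // Keep the hash slightly longer than the refresh interval, covering the
    // window in which an indexed document is not yet visible to search.
    var policy = new CacheItemPolicy
    {
        AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(5)
    };
    cache.AddOrGetExisting(hash, string.Empty, policy);
}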


How can I use X.PagedList with ElasticSearch Nest?

Background
I'm using ElasticSearch as the search engine for a new ASP.Net Core 2.1 website I'm working on, with the Nest API to integrate with it. I want to use X.PagedList to handle the paging for me.
I've used this in other ASP.Net Core projects and it's worked well querying data in MS SQL Server.
Code
ISearchResponse<Foo> searchResponse =
    _elasticSearchClient.Search<Foo>(s => s
        .Query(q => q
            .Bool(b => b.Filter(distanceFilters)))
        .Source(src => src
            .Includes(i => i
                .Fields(
                    f => f.Field1,
                    f => f.Field2,
                    f => f.Field3)))
        .From(options.From)
        .Size(options.Size));
var hitsMD = searchResponse.HitsMetadata;

var results = hitsMD?.Hits.Select(s => new Hit()
{
    Index = s.Index,
    Id = s.Id,
    Score = s.Score,
    Job = s.Source
}).ToPagedList(PageNumber, PageSize);
Issue
When I call .ToPagedList on the search results returned by ElasticSearch, it only shows one page of results.
The issue is that ElasticSearch has its own paging mechanism so it's only returning one page of hits.
I had the idea that, because ElasticSearch passes back the total number of hits, I could tell the PagedList how many items are in the list by setting the PagedList.TotalItemCount property. However, I can't do this, as it has a private setter.
I've tried removing the from and size, but this returns 10 hits, which is ElasticSearch's default size, presumably put in place for performance reasons.
Question
How can I make use of the X.PagedList package whilst integrating into ElasticSearch using the Nest API?
You've basically got all the pieces here already. All you're missing is StaticPagedList<T>. Since paging is already being handled by Elasticsearch, you simply need to define a static paging setup, i.e.:
var pagedResults = new StaticPagedList<Hit>(results, PageNumber, PageSize, total);
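Here total can come straight from the response; NEST exposes the overall hit count as searchResponse.Total. A minimal sketch tying it to the question's projection (note the ToPagedList call is dropped in favour of the wrapper):

var results = searchResponse.HitsMetadata?.Hits.Select(s => new Hit()
{
    Index = s.Index,
    Id = s.Id,
    Score = s.Score,
    Job = s.Source
});

// Elasticsearch already applied From/Size, so wrap the single page together
// with the known total instead of letting X.PagedList re-page it in memory.
var pagedResults = new StaticPagedList<Hit>(
    results, PageNumber, PageSize, (int)searchResponse.Total);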

Document is not deleted if delete request was sent immediately after insert request

I have a service that indexes documents.
The service receives two consecutive requests: the first inserts a document and the second deletes it.
When there is some time between them it works fine, but when they are sent one right after the other, the document is not deleted.
The response I get from NEST looks successful.
My function is kind of long, so I will only show the insert and delete parts. If more info is needed I will add it (for example, in case of insertion it also deletes the document from all other available indices and inserts some mapping if needed).
Insert code:
IBulkResponse res = await _client.IndexManyAsync(entities, index, type);
Delete Code:
var termFilter = new List<Func<QueryContainerDescriptor<JObject>, QueryContainer>>
{
    c => c.Terms(t => t.Field(ID_FIELD).Terms(ids))
};

await _client.DeleteByQueryAsync<JObject>(indices, types,
    d => d.Query(q => q.Bool(b => b.Must(termFilter))));
For example, this integration test doesn't work:
var indices = new[] { "some_index_1", "some_index_2" };
var entity = new Entity { Action = ReplicationAction.INSERT, ... };
await elasticDal.Insert(new List<Entity> { entity }, "some_index_1", "666", indices);

entity.Action = ReplicationAction.DELETE;
await elasticDal.Insert(new List<Entity> { entity }, "some_index_1", "666", indices);
Versions: ElasticSearch 2.3.5, .Net 4.6, Nest 2.4.6
When you index a document, the following steps happen:
1. The document is added to the in-memory buffer and appended to the translog.
2. Refresh: the docs in the in-memory buffer are written to a new segment, without an fsync.
   a. The segment is opened to make it visible to search.
   b. The in-memory buffer is cleared.
3. Every so often, such as when the translog is getting too big, the index is flushed; a new translog is created, and a full commit is performed:
   a. Any docs in the in-memory buffer are written to a new segment.
   b. The buffer is cleared.
   c. A commit point is written to disk.
   d. The filesystem cache is flushed with an fsync.
   e. The old translog is deleted.
Elasticsearch doesn't remove the document right away. It marks the document as deleted, and the deleted documents are only purged when index segments are merged.
So my guess is that you are missing a refresh between the insert and the delete: a delete-by-query can only match documents that are already visible to search. If your delete requests are not frequent, you can call the refresh API before the delete.
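A minimal sketch of that ordering, reusing _client, indices, types, and termFilter from the question (assuming NEST 2.x, where RefreshAsync accepts the same indices):

// Make the just-indexed documents visible to search before deleting by query.
await _client.RefreshAsync(indices);

await _client.DeleteByQueryAsync<JObject>(indices, types,
    d => d.Query(q => q.Bool(b => b.Must(termFilter))));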
If you want to learn more about how indexing happens behind the scenes, you can refer to this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/translog.html

Slow query over large collection

I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough, but as the amount of logged data has increased, the search page became unusable (it times out before returning with default settings, regardless of the query used). Right now we have about 45 million sessions in the collection that gets queried, but steady state is expected to be around 150 million documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope someone can give me some ideas on what would be the most productive areas to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
    Map = sessions => from session in sessions
                      select new Result
                      {
                          ApplicationName = session.ApplicationName,
                          SessionId = session.SessionId,
                          StartedUtc = session.StartedUtc,
                          User_Cpr = session.User.Cpr,
                          User_CprPersonId = session.User.CprPersonId,
                          User_ApplicationUserId = session.User.ApplicationUserId
                      };

    Store(r => r.ApplicationName, FieldStorage.Yes);
    Store(r => r.StartedUtc, FieldStorage.Yes);
    Store(r => r.User_Cpr, FieldStorage.Yes);
    Store(r => r.User_CprPersonId, FieldStorage.Yes);
    Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essence of the query is this bit:
// Query input parameters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();

sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtc >= fromDateUtc &&
        s.StartedUtc <= toDateUtc
    );

var totalItems = Count(sessionQuery);

var sessionData =
    sessionQuery
        .OrderByDescending(s => s.StartedUtc)
        .Skip((page - 1) * PageSize)
        .Take(PageSize)
        .ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
        .Select(s => new
        {
            s.SessionId,
            s.SessionGroupId,
            s.ApplicationName,
            s.StartedUtc,
            s.Type,
            s.ResourceUri,
            s.User,
            s.ImpersonatingUser
        })
        .ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
    RavenQueryStatistics stats;
    results.Statistics(out stats).Take(0).ToArray();
    return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
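One option worth sketching here (assuming RavenDB's query statistics API, which the Count helper above already uses): let the paging query report its own total, so the separate count round-trip disappears.

// Sketch: the paging query reports its own total via query statistics,
// so no separate Count() query is needed.
RavenQueryStatistics stats;
var sessionData = sessionQuery
    .Statistics(out stats)
    .OrderByDescending(s => s.StartedUtc)
    .Skip((page - 1) * PageSize)
    .Take(PageSize)
    .ToList();

var totalItems = stats.TotalResults; // same value the Count() helper produced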
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid based ids. I'm not sure if I can easily change IDs of the existing values (with this much data) and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better behaving b-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on the StartedUtc.
I'm assuming that you are using exact numbers, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to index at second or minute granularity (which is usually what you are querying on), and then using Ticks, which allows a numeric range query.
StartedUtcTicks = new DateTime(
    session.StartedUtc.Year, session.StartedUtc.Month, session.StartedUtc.Day,
    session.StartedUtc.Hour, session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
And then query by the date ticks.
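A sketch of the corresponding query side; the StartedUtcTicks field name comes from the snippet above, and the rest mirrors the question's query:

// Compare numeric ticks instead of DateTime values, so the range query
// runs over far fewer distinct terms (second granularity).
sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtcTicks >= fromDateUtc.Ticks &&
        s.StartedUtcTicks <= toDateUtc.Ticks
    );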

SharePoint Library Query running very slow

We have a SharePoint query running on a library of maybe 500 documents to pull back the highest version of the published documents that are flagged as active (have active=true in the Active column).
This is taking way too long to run (about 3-5 seconds), which is frustrating the users.
What can be done to the query below to speed it up? (I would hope for this to be virtually instantaneous!)
using (var site = new SPSite(Helpers.ConfigurationFile.SharePointUrl().ToString()))
{
    using (var web = site.RootWeb)
    {
        return web.Folders["Templates"].Files.OfType<SPFile>()
            .Where(file => file.Item.HasPublishedVersion)
            .Where(file => file.Item.Properties["Active"].ToString() == "true")
            .Where(file => file.Item.Versions.OfType<SPListItemVersion>()
                .Any(x => x.Level == SPFileLevel.Published))
            .Select(file => new LibraryItem(
                file.Item.UniqueId,
                file.Item.ID,
                file.Item.Title,
                Helpers.ConfigurationFile.SharePointUrl()
                    .ToString().AddPathSegment(file.Url),
                true,
                float.Parse(file.MajorVersion.ToString()
                    + "." + file.MinorVersion.ToString())))
            .ToList();
    }
}
A few points:
Aren't the HasPublishedVersion and file.Item.Versions checks doing the same thing - checking whether a published version of the SPFile exists?
The Any query on the SPListItemVersions will search them in order from Versions[0] to Versions[count - 1] which translates to 'newest to oldest' (since Versions[0] is the latest version in SP). If you think it's likely that your docs were published more frequently soon after creation than they have been in recent times, it'll possibly be faster to perform the same check by looping backwards through the collection (from oldest version to newest).
You could index on the Active column.
You can optimise it by performing part of the query using CAML (it's horrible, I know). I would try to delegate the SPFile fetching and Active checking to a CAML query, then do the rest of it using the object model; a rough sketch follows this list. See this link
If you do use CAML, PortalSiteMapProvider.GetCachedListItemsByQuery() might prove to be faster than SPList.GetItems().
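A rough sketch of the CAML idea; the library name and the Active field's internal name are assumed from the question, and the code is untested:

// Hypothetical: filter on the indexed Active column server-side via CAML,
// then run the published-version checks over the much smaller result set.
SPList library = web.Lists["Templates"];
var query = new SPQuery
{
    Query = @"<Where>
                <Eq>
                    <FieldRef Name='Active' />
                    <Value Type='Boolean'>1</Value>
                </Eq>
              </Where>"
};

foreach (SPListItem item in library.GetItems(query))
{
    // Remaining checks (published version, projection to LibraryItem)
    // via the object model as before.
}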

EF Takes Forever to Generate this Query

I have a parent-child table relationship. In a repository, I'm doing this:
return (from p in _ctx.Parents.Include("Children")
        select p).AsQueryable<Parent>();
Then in a filter, I want to filter the parent by a list of child ids:
IQueryable<Parent> qry; // from above
List<int> ids;          // huge list (8500)

var filtered =
    from p in qry.Where(p => p.Children.Any(c => ids.Contains(c.ChildId)))
    select p;
My list of ids is huge. This generates a simple SQL statement that does have a huge list of ids "in (1,2,3...)", but it takes no appreciable time to run by itself. EF, however, takes about a full minute just to generate the statement. I proved this by setting a breakpoint and calling:
((ObjectQuery<Parent>)filtered).ToTraceString();
This takes all the time. Is the problem in my last LINQ statement? I don't know any other way to do the equivalent of Child.ChildId in (ids). And even if my LINQ statement is bad, why in the world does this take so long?
Unfortunately, building queries in Linq to Entities is a pretty heavy hit, but I've found it usually saves time due to the ability to build queries from their component pieces before actually hitting the database.
It is likely that the way they implement the Contains method uses an algorithm that assumes that Contains is generally used for a relatively small set of data. According to my tests, the amount of time it takes per ID in the list begins to skyrocket at around 8000.
So it might help to break your query into pieces. Group them into groups of 1000 or less, and concatenate a bunch of Where expressions.
// Bucket the ids: i / 1000 puts each id into a bucket spanning at most 1000 values.
var idGroups = ids.GroupBy(i => i / 1000);
var q = Parents.Include("Children").AsQueryable();

// Build one filtered query per bucket, then concatenate them into a single query.
var newQ = idGroups
    .Select(g => q.Where(w => w.Children.Any(wi => g.Contains(wi.ChildId))))
    .Aggregate((current, next) => current.Concat(next));
This speeds things up significantly, but it might not be significantly enough for your purposes, in which case you'll have to resort to a stored procedure. Unfortunately, this particular use case just doesn't fit into "the box" of expected Entity Framework behavior. If your list of ids could begin as a query from the same Entity Context, Entity Framework would have worked fine.
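For illustration, a hedged sketch of that last point; the _ctx.Children source and its filter are hypothetical:

// Hypothetical: when the ids are themselves a query on the same context,
// EF composes a single SQL statement with a subquery instead of an
// 8500-literal IN list, and query generation stays fast.
IQueryable<int> idQuery = _ctx.Children
    .Where(c => c.SomeFlag)           // whatever produced the id list originally
    .Select(c => c.ChildId);

var filtered = qry.Where(p => p.Children.Any(c => idQuery.Contains(c.ChildId)));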
Re-write your query in Lambda syntax and it will cut the time by as much as 3 seconds (or at least it did for my EF project).
return _ctx.Parents.Include( "Children" ).AsQueryable<Parent>();
and
IQueryable<Parent> qry; // from above
List<int> ids; // huge list (8500)
var filtered = qry.Where( p => p.Children.Any(c => ids.Contains(c.ChildId)) );
