Document is not deleted if delete request was sent immediately after insert request - elasticsearch

I have a service that indexes documents.
The service receives two consecutive requests: the first inserts a document and the second deletes it.
When there is some time between them it works fine, but when they are sent one right after the other the document is not deleted.
The response I get from NEST looks successful.
My function is kind of long, so I will only show the insert and delete parts. If more info is needed I will add it (for example, on insertion it also deletes the document from all other available indices and adds some mapping if needed).
Insert code:
IBulkResponse res = await _client.IndexManyAsync(entities, index, type);
Delete Code:
var termFilter = new List<Func<QueryContainerDescriptor<JObject>, QueryContainer>>
{
    c => c.Terms(t => t.Field(ID_FIELD).Terms(ids))
};
await _client.DeleteByQueryAsync<JObject>(indices, types, d => d.Query(q => q.Bool(b => b.Must(termFilter))));
For example, this integration test doesn't work:
var indices = new[] { "some_index_1", "some_index_2" };
var entity = new Entity { Action = ReplicationAction.INSERT, ... };
await elasticDal.Insert(new List<Entity> { entity }, "some_index_1", "666", indices);
entity.Action = ReplicationAction.DELETE;
await elasticDal.Insert(new List<Entity> { entity }, "some_index_1", "666", indices);
Versions: ElasticSearch 2.3.5, .Net 4.6, Nest 2.4.6

When you index a document, the following steps happen:
1. The document is added to the in-memory buffer and appended to the translog.
2. Refresh: the docs in the in-memory buffer are written to a new segment, without an fsync.
a. The segment is opened to make it visible to search.
b. The in-memory buffer is cleared.
3. Every so often, such as when the translog is getting too big, the index is flushed; a new translog is created, and a full commit is performed:
a. Any docs in the in-memory buffer are written to a new segment.
b. The buffer is cleared.
c. A commit point is written to disk.
d. The filesystem cache is flushed with an fsync.
e. The old translog is deleted.
Elasticsearch doesn't physically delete the document right away. It marks the document as deleted, and deleted documents are only dropped later, when the index segments are merged.
So my guess is that you are missing a refresh after deleting.
If your DELETE calls are not too frequent, you can refresh Elasticsearch after calling the delete by calling the refresh API.
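For example, with NEST 2.x a minimal sketch would look like this, reusing _client, indices, types and termFilter from the question (it assumes the same indices value is accepted by RefreshAsync as it is by DeleteByQueryAsync):
// Delete the documents...
await _client.DeleteByQueryAsync<JObject>(indices, types, d => d
    .Query(q => q.Bool(b => b.Must(termFilter))));

// ...then force a refresh so the deletions become visible to search
// immediately, instead of waiting for the next scheduled refresh
// (1 second by default). Refreshing on every request is expensive,
// so only do this if deletes are infrequent.
await _client.RefreshAsync(indices);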
If you want to learn more about how indexing happens behind the scenes, you can refer to this link (https://www.elastic.co/guide/en/elasticsearch/guide/current/translog.html).

Related

How does Apollo paginated "read" and "merge" work?

I was reading through the docs to learn pagination approaches for Apollo. This is the simple example where they explain the paginated read function:
https://www.apollographql.com/docs/react/pagination/core-api#paginated-read-functions
Here is the relevant code snippet:
const cache = new InMemoryCache({
  typePolicies: {
    Query: {
      fields: {
        feed: {
          read(existing, { args: { offset, limit }}) {
            // A read function should always return undefined if existing is
            // undefined. Returning undefined signals that the field is
            // missing from the cache, which instructs Apollo Client to
            // fetch its value from your GraphQL server.
            return existing && existing.slice(offset, offset + limit);
          },
          // The keyArgs list and merge function are the same as above.
          keyArgs: [],
          merge(existing, incoming, { args: { offset = 0 }}) {
            const merged = existing ? existing.slice(0) : [];
            for (let i = 0; i < incoming.length; ++i) {
              merged[offset + i] = incoming[i];
            }
            return merged;
          },
        },
      },
    },
  },
});
I have one major question about this snippet (and other snippets from the docs that have the same "flaw" in my eyes), but I feel like I'm missing some piece.
Suppose I run a first query with offset=0 and limit=10. The server will return 10 results for this query, and they will be stored in the cache after passing through the merge function.
Afterwards, I run the query with offset=5 and limit=10. Based on the approach described in the docs and the above code snippet, my understanding is that I will only get items 5 through 10 instead of items 5 through 15, because Apollo will see that the existing variable is present in read (with existing holding the initial 10 items) and will slice the 5 available items for me.
My question is - what am I missing? How will Apollo know to fetch new data from the server? How will new data arrive in the cache after the initial query? Keep in mind keyArgs is set to [] so the results will always be merged into a single item in the cache.
Apollo will not slice anything automatically. You have to define a merge function that keeps the data in the correct order in the cache. One approach would be to have an array with empty slots for data not yet fetched, and place incoming data in their respective index. For instance if you fetch items 30-40 out of a total of 100 your array would have 30 empty slots then your items then 60 empty slots. If you subsequently fetch items 70-80 those will be placed in their respective indexes and so on.
Your read function is where the decision on whether a network request is necessary or not will be made. If you find all the data in existing you will return them and no request to the server will be made. If any items are missing then you need to return undefined which will trigger a network request, then your merge function will be triggered once data is fetched, and finally your read function will run again only this time the data will be in the cache and it will be able to return them.
This approach is for the cache-first caching policy which is the default.
The logic for returning undefined from your read function has to be implemented by you; there is no Apollo magic under the hood.
If you use the cache-and-network policy, then your read doesn't need to return undefined when data is missing, since a network request will be made regardless.

Uniqueness check in Elasticsearch without constantly refreshing index

I'm indexing a lot of data in Elasticsearch (through NEST) from multiple processes each running multiple threads. Part of indexing a document is finding out if we have seen a similar document before. This feature is implemented by generating a hash of a set of fields on the document and checking if we have documents in Elasticsearch with the same hash. Before indexing a document, I make the following query:
var result = elasticClient
    .Count<MyDocument>(c => c
        .Index(indexName)
        .Query(q => q
            .ConstantScore(qs => qs
                .Filter(f => f
                    .Term(field => field.Hash, hash))))
    ...
This returns a count of existing documents with the specified hash. So far so good. Things are working. If a process is indexing two documents with the same hash within the same second, the count check doesn't work, since the first document isn't available for search yet. I'm running with the default refresh interval (1 second). For now I have added a refresh call after indexing each document:
var refreshResponse = client.Refresh(indexName);
This also works but it doesn't scale when indexing large amounts of documents (indexing becomes slow as already pointed out here: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).
Any ideas for how to avoid having to call Refresh but still be able to perform a uniqueness check? I'm thinking of some sort of local cache shared between all threads, holding hashes of documents indexed since the last refresh. I know that this won't work across processes, but that is acceptable for now.
I ended up implementing a write-through cache as suggested by Val. This makes it possible to remove the call to Refresh but still make the count on each iteration. This is implemented using a singleton MemoryCache shared between all threads:
var cache = new MemoryCache("hashes");
When checking for uniqueness I check the cache in case no similar documents are found in Elasticsearch:
var result = elasticClient
    .Count<MyDocument>(c => c
        .Index(indexName)
        .Query(q => q
            .ConstantScore(qs => qs
                .Filter(f => f
                    .Term(field => field.Hash, hash)))));

bool isUnique = false;
if (result.Count == 0)
{
    isUnique = !cache.Contains(hash);
}
If the count for the hash returns 0, I check the cache for that hash.
When a document has been successfully indexed, I store the hash in the cache with an expiration:
var policy = new CacheItemPolicy();
policy.AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(5);
cache.AddOrGetExisting(hash, string.Empty, policy);
TTL could probably be 1 second as well since that is the refresh interval I currently have configured on the index.
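For reference, here is a minimal sketch that ties these pieces together. The class and method names (HashUniquenessChecker, IsUnique, MarkIndexed) and the stub MyDocument type are illustrative, not from the original code; the NEST and MemoryCache calls are the same ones shown above:
using System;
using System.Runtime.Caching;
using Nest;

public class MyDocument
{
    public string Hash { get; set; }
}

public class HashUniquenessChecker
{
    // Singleton cache shared between all threads in the process.
    private static readonly MemoryCache Cache = new MemoryCache("hashes");

    private readonly IElasticClient _client;
    private readonly string _indexName;

    public HashUniquenessChecker(IElasticClient client, string indexName)
    {
        _client = client;
        _indexName = indexName;
    }

    // True if no document with this hash is known, either in Elasticsearch
    // or among the hashes indexed since the last refresh.
    public bool IsUnique(string hash)
    {
        var result = _client.Count<MyDocument>(c => c
            .Index(_indexName)
            .Query(q => q
                .ConstantScore(qs => qs
                    .Filter(f => f
                        .Term(field => field.Hash, hash)))));

        return result.Count == 0 && !Cache.Contains(hash);
    }

    // Call after a successful index. The TTL only needs to cover the
    // refresh interval (1 second by default); 5 seconds adds some slack.
    public void MarkIndexed(string hash)
    {
        var policy = new CacheItemPolicy
        {
            AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(5)
        };
        Cache.AddOrGetExisting(hash, string.Empty, policy);
    }
}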

How to append to dexie entry using a rolling buffer (to store large entries without allocating GBs of memory)

I was redirected here after emailing the author of Dexie (David Fahlander). This is my question:
Is there a way to append to an existing Dexie entry? I need to store things that are large in dexie, but I'd like to be able to fill large entries with a rolling buffer rather than allocating one huge buffer and then doing a store.
For example, I have a 2GB file I want to store in Dexie. I want to store that file by writing 32KB at a time into the same store, without having to allocate 2GB of memory in the browser. Is there a way to do that? The put method seems to only overwrite entries.
Thanks for putting your question here at stackoverflow :) This helps me build up an open knowledge base for everyone to access.
There's no way in IndexedDB to update an entry without also instantiating the whole entry. Dexie adds the update() and modify() methods, but they only emulate a way to alter certain properties. In the background, the entire document will always be loaded into memory temporarily.
IndexedDB also has Blob support, but when a Blob is stored in IndexedDB, its entire content is cloned/copied into the database by specification.
So the best way to deal with this would be to dedicate a table for dynamic large content and add new entries to it.
For example, let's say you have the tables "files" and "fileChunks". You need to incrementally grow the "file", and each time you do that, you don't want to instantiate the entire file in memory. You could then add the file chunks as separate entries into the fileChunks table.
let db = new Dexie('filedb');
db.version(1).stores({
files: '++id, name',
fileChunks: '++id, fileId'
});
/** Returns a Promise with ID of the created file */
function createFile (name) {
return db.files.add({name});
}
/** Appends contents to the file */
function appendFileContent (fileId, contentToAppend) {
return db.fileChunks.add ({fileId, chunk: contentToAppend});
}
/** Read entire file */
function readEntireFile (fileId) {
return db.fileChunks.where('fileId').equals(fileId).toArray()
.then(entries => {
return entries.map(entry=>entry.chunk)
.join(''); // join = Assume chunks are strings
});
}
Easy enough. If you want appendFileContent to be a rolling buffer (with a max size and erase old content), you could add truncate methods:
function deleteOldChunks (fileId, maxAllowedChunks) {
    return db.fileChunks.where('fileId').equals(fileId)
        .reverse() // Important, so that it is the oldest chunks we delete
        .offset(maxAllowedChunks) // offset = skip
        .delete(); // Deletes all records except the N most recent ones
}
You'd get other benefits as well, such as the ability to tail a stored file without loading its entire content into memory:
/** Tail a file. This function only shows an example on how
* dynamic the data is stored and that file tailing would be
* simple to do. */
function tailFile (fileId, maxLines) {
let result = [], numNewlines = 0;
return db.fileChunks.where('fileId').equals(fileId)
.reverse()
.until(() => numNewlines >= maxLines)
.each(entry => {
result.unshift(entry.chunk);
numNewlines += (entry.chunk.match(/\n/g) || []).length;
})
.then (()=> {
let lines = result.join('').split('\n')
.slice(1); // First line may be cut off
let overflowLines = lines.length - maxLines;
return (overflowLines > 0 ?
lines.slice(overflowLines) :
lines).join('\n');
});
}
The reason I know that the chunks will come back in the correct order in readEntireFile() and tailFile() is that IndexedDB queries are always retrieved primarily in the order of the queried column and secondarily in the order of the primary keys, which here are auto-incremented numbers.
This pattern could be used for other cases, like logging etc. In case the file is not string based, you would have to alter this sample a little. Specifically, don't use string.join() or array.split().

elasticsearch 2.4: retrieve all records which fulfill all search criteria using scroll

I am using Elasticsearch for the first time and, based on my requirements, I have some doubts and questions about scroll.
I need to retrieve all data which fulfills all the search criteria.
1) I am trying to use scroll, but while reading about it I found this:
https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_21_search_changes.html
It says the scan search type is deprecated, but NEST still supports it, so should I use "search type scan" or "sort by _doc"? (I am using Elasticsearch 2.4.)
2) Can I sort on any field when using scrolling?
3) While doing clear scroll:
var test2 = client.ClearScroll(x=>x.ScrollId(results.ScrollId));
I get the error below:
Invalid NEST response built from a unsuccessful low level call on DELETE: /_search/scroll
Audit trail of this API call:
[1] BadResponse: Node: http://mydomain#localhost:9200/ Took: 00:00:00.0160110
OriginalException: System.Net.WebException: The remote server returned an error: (404) Not Found.
at System.Net.HttpWebRequest.GetResponse()
at Elasticsearch.Net.HttpConnection.Request[TReturn](RequestData requestData) in C:\Users\russ\source\elasticsearch-net-2.x\src\Elasticsearch.Net\Connection\HttpConnection.cs:line 141
Request:
{"scroll_id":["c2NhbjswOzE7dG90YWxfaGl0czoxMjs="]}
Response:
{}
So is this the correct way of clearing the scroll or not?
Update: below is my code:
List<Object> indexedList = new List<Object>();
ISearchResponse<ListingSearch> listingResult =
    client.Search<ListingSearch>(search => search
        .Index(Constant.ES_INDEX)
        .Type(Constant.ES_TYPE)
        .From(listingSearch.StartIndex)
        .Size(10)
        .Source(s => s.Include(i => i.Fields(outpputFields)))
        .Query(query => query
            .Bool(boolean => boolean
                .Must(must => must.Term(t => t.Field("is_deleted").Value(false)))))
        .Sort(x => x.Field("_doc", SortOrder.Ascending))
        .Scroll("60s"));
var results = client.Scroll<ListingSearch>("60s", listingResult.ScrollId);
while (results.Documents.Any())
{
foreach (var doc in results.Hits)
{
indexedList.Add(doc);
}
results = client.Scroll<ListingSearch>("60s", results.ScrollId);
}
var test2 = client.ClearScroll(x=>x.ScrollId(results.ScrollId));
//Clear Scroll
With the above code I am getting data, but if I change the size from 10 to 1000, I get no records.
I am not sure whether the issue is the amount of data, because my ES database only has 12-15 documents.
NEST 2.x versions have SearchType.Scan because NEST 2.x versions are compatible with all Elasticsearch 2.x versions, so the search type needs to exist when using NEST 2.x against Elasticsearch 2.0. Sending the search type through in later versions won't have any effect.
The most efficient way of retrieving documents with scroll is sorting by _doc but you can specify any sort parameters when scrolling.
When using the scroll API, you should use the scroll_id from the previous request in the next scroll call to fetch the next set of results. Once you have finished with a scroll, it is a good idea to clear it by calling ClearScroll() as you are doing. Your call looks correct; perhaps the scroll_id has already expired at the point you make the clear call?
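As a rough sketch (reusing client, Constant, ListingSearch and the query from the question; the batch size of 1000 is just an example), a NEST 2.x scroll loop sorted by _doc could look like this. Note that the initial search response already contains the first batch of hits, so it has to be consumed as well:
// Initial search: opens the scroll context and returns the first batch of hits.
var response = client.Search<ListingSearch>(s => s
    .Index(Constant.ES_INDEX)
    .Type(Constant.ES_TYPE)
    .Size(1000)
    .Query(q => q.Bool(b => b.Must(m => m.Term(t => t.Field("is_deleted").Value(false)))))
    .Sort(x => x.Field("_doc", SortOrder.Ascending)) // most efficient sort for scrolling
    .Scroll("60s"));

var allDocs = new List<ListingSearch>();

// Consume the hits of the current response, then ask for the next batch
// using the scroll id returned by the previous call.
while (response.Documents.Any())
{
    allDocs.AddRange(response.Documents);
    response = client.Scroll<ListingSearch>("60s", response.ScrollId);
}

// Free the scroll context on the server once we are done.
client.ClearScroll(c => c.ScrollId(response.ScrollId));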

MemoryCacheClient works differently than others - reference retained

I have a service that pulls statistics for a sales region. The service computes the stats for ALL regions and then caches that collection, then returns only the region requested.
public object Any(RegionTotals request)
{
    string cacheKey = "urn:RegionTotals";
    //make sure master list is in the cache...
    base.Request.ToOptimizedResultUsingCache<RegionTotals>(
        base.Cache, cacheKey, CacheExpiryTime.DailyLoad(), () =>
        {
            return RegionTotalsFactory.GetObject();
        });
    //then retrieve them. This is all teams
    RegionTotals tots = base.Cache.Get<RegionTotals>(cacheKey);
    //remove all except requested
    tots.Records.RemoveAll(o => o.RegionID != request.RegionID);
    return tots;
}
What I'm finding is that when I use a MemoryCacheClient (as part of a StaticAppHost that I use for Unit Tests), the line tots.Records.RemoveAll(...) actually affects the object in the cache. This means that I get the cached object, delete rows, and then the cache no longer contains all regions. Therefore, subsequent calls to this service for any other region return no records. If I use my normal Cache, of course the Cache.Get() makes a new copy of the object in the cache, and removing records from that object doesn't affect the cache.
This is because an in-memory cache doesn't add any serialization overhead and just stores your object instances in memory. When you use any of the other caching providers, your values are serialized first, sent to the remote caching provider, and deserialized when retrieved, so the same object instances are never reused.
If you plan on mutating cached values you'll need to clone the instances before mutating them, if you don't want to manually implement ICloneable you can serialize and deserialize them with:
var clone = TypeSerializer.Clone(obj);
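For example, applied to the service above, a minimal sketch would be to clone the cached collection before mutating it (assuming ServiceStack.Text is referenced for TypeSerializer; the rest of the method is unchanged from the question):
// requires: using ServiceStack.Text;

//then retrieve them. This is all teams
RegionTotals cached = base.Cache.Get<RegionTotals>(cacheKey);

// Work on a clone so the instance held by MemoryCacheClient is untouched.
RegionTotals tots = TypeSerializer.Clone(cached);

//remove all except requested
tots.Records.RemoveAll(o => o.RegionID != request.RegionID);
return tots;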
