Elastic Search, from-start doesn't work correctly - elasticsearch

I want to fetch the records between two dates using an Elasticsearch query.
First, I check whether the number of records between the two dates is greater than 10000. If it is, I try to fetch them 10000 at a time.
//get count
var result_count = client.Count<TelegramMessageStructure>(s => s
    .AllTypes()
    .AllIndices()
    .Query(q => q
        .DateRange(r => r
            .Field(f => f.messageDate)
            .GreaterThanOrEquals("2018-06-03 00:00:00.000")
            .LessThan("2018-06-03 00:59:00.000")
        )
    )
);
long count = result_count.Count; //count = 27000
It returns 27000, so I want to fetch them 10000 at a time. I use this query to do that:
int MaxMessageCountPerQuery=10000;
for (int i = 0; i < count; i += MaxMessageCountPerQuery)
{
    client = new ElasticClient(connectionSettings);
    // No change whether the client is renewed or not
    var result = client.Search<TelegramMessageStructure>(s => s
        .AllTypes()
        .AllIndices()
        .MatchAll()
        .From(i)
        .Size(MaxMessageCountPerQuery)
        .Sort(ss => ss.Ascending(p => p.id))
        .Query(q => q
            .DateRange(r => r
                .Field(f => f.messageDate)
                .GreaterThanOrEquals("2018-06-03 00:00:00.000")
                .LessThan("2018-06-03 00:59:00.000")
            )
        )
    );
    //when i=0, result.documents contains 10000 records otherwise it has 0
}
In the first round, when i=0, result.Documents contains 10000 records; in every later round it contains 0 records.
What is wrong with this?

Based on this link:
scroll in elastic net-api
Your code should follow these steps:
1- Search with all the parameters you need plus .Scroll("5m"). (I assume Size(10000) is set too, with From left at its default of 0; save the response in the result variable.)
2- Now you have the first 10000 records (in result.Documents).
3- To receive more records, pass the ScrollId of the latest response to the scroll API. Each call of the code below gives you the next 10000 records:
var result_new = client.Scroll<TelegramMessageStructure>("10m", result.ScrollId);
In fact, your code should look like this:
int MaxMessageCountPerQuery = 10000;
client = new ElasticClient(connectionSettings);
var result = client.Search<TelegramMessageStructure>(s => s
    .AllTypes()
    .AllIndices()
    .Size(MaxMessageCountPerQuery) // batch size per scroll request; From is not used with scroll
    .Sort(ss => ss.Ascending(p => p.id))
    .Query(q => q
        .DateRange(r => r
            .Field(f => f.messageDate)
            .GreaterThanOrEquals("2018-06-03 00:00:00.000")
            .LessThan("2018-06-03 00:59:00.000")
        )
    )
    .Scroll("5m") // Add this parameter
);
// An empty batch means the scroll is exhausted.
while (result.IsValid && result.Documents.Count > 0)
{
    // TODO some code:
    // save and use result.Documents
    // Fetch the next 10000 records, always using the scroll id of the latest response.
    result = client.Scroll<TelegramMessageStructure>("5m", result.ScrollId);
}
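The shape of that loop can be tried without a cluster. The following is only an in-memory sketch: NextBatch is a hypothetical stand-in for the scroll call, and its integer cursor plays the role of the opaque scroll id that Elasticsearch returns. It assumes the 27000 matching records from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the 27000 matching documents held by the server.
var allIds = Enumerable.Range(0, 27000).ToList();
const int batchSize = 10000;

// Hypothetical stand-in for a scroll request: returns the next batch and the
// cursor to pass to the following call.
(List<int> Docs, int NextCursor) NextBatch(int offset)
{
    var docs = allIds.Skip(offset).Take(batchSize).ToList();
    return (docs, offset + docs.Count);
}

var batchSizes = new List<int>();
var (batch, cursor) = NextBatch(0);       // plays the role of the initial search with Scroll("5m")
while (batch.Count > 0)                   // an empty batch means everything has been read
{
    batchSizes.Add(batch.Count);          // process the current batch here
    (batch, cursor) = NextBatch(cursor);  // continue from the latest cursor, never the first one
}

Console.WriteLine(string.Join(", ", batchSizes)); // 10000, 10000, 7000
```

Looping until an empty batch arrives avoids having to compute the number of iterations from the total count.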

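Elasticsearch has a default index.max_result_window = 10000, so a search whose from + size exceeds 10000 is rejected instead of returning documents. The reason for the limit is well explained at
https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html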
To understand why deep paging is problematic, let’s imagine that we
are searching within a single index with five primary shards. When we
request the first page of results (results 1 to 10), each shard
produces its own top 10 results and returns them to the coordinating
node, which then sorts all 50 results in order to select the overall
top 10.
Now imagine that we ask for page 1,000—results 10,001 to 10,010.
Everything works in the same way except that each shard has to produce
its top 10,010 results. The coordinating node then sorts through all
50,050 results and discards 50,040 of them!
You can see that, in a distributed system, the cost of sorting results
grows exponentially the deeper we page. There is a good reason that
web search engines don’t return more than 1,000 results for any query.
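The arithmetic in the quoted example can be reproduced with a small sketch. The function name is made up for illustration, and the last call assumes the same five shards for the asker's third request (from = 20000, size = 10000), which is not stated in the question.

```csharp
using System;

// Hits the coordinating node has to sort for one page: each shard returns
// its own top (offset + size) results, as in the quoted example.
long DocsSortedByCoordinator(int shards, int offset, int size) => (long)shards * (offset + size);

Console.WriteLine(DocsSortedByCoordinator(5, 0, 10));         // 50
Console.WriteLine(DocsSortedByCoordinator(5, 10000, 10));     // 50050
Console.WriteLine(DocsSortedByCoordinator(5, 20000, 10000));  // 150000
```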

Related

How can I find the total hits for an Elastic NEST query?

In my application I have a query which limits the number of hits returned to 50, as follows:
var response = await client.SearchAsync<Episode>(s => s
    .Source(sf => sf
        .Includes(i => i
            .Fields(
                f => f.Title,
                f => f.PublishDate,
                f => f.PodcastTitle
            )
        )
        .Excludes(e => e
            .Fields(f => f.Description)
        )
    )
    .From(request.Skip)
    .Size(50)
    .Query(q => q
        .Term(t => t.Title, request.Search) || q
        .Match(mq => mq.Field(f => f.Description).Query(request.Search))));
I am interested in the total number of hits for the query (i.e. not limited to the size), so that I can deal with pagination on the front-end. Does anyone know how I can do this?
You are looking for the Total property on the search response object.
So in your particular case that will be response.Total.
For those who are working with indices of more than 10000 documents: by default, Elasticsearch only counts total hits accurately up to 10000. To get around that, include .TrackTotalHits(true) in your query:
var resp = client.Search<yourmodel>(s => s
    .Index(yourindexname)
    .TrackTotalHits(true)
    .Query(q => q.MatchAll()));

How to increase the speed of this MongoDB query?

MongoDB 2.0.7 & PHP 5
I'm trying to count the length of each array. Every document has one array. I want to get the number of elements in each array and the ID of the document. There are no indexes except the one on _id.
Here's my code:
$map = new MongoCode("function() {
    emit(this._id, {
        '_id': this._id, 'cd': this.cd, 'msgCount': this.cs[0].msgs.length
    });
}");
$reduce = new MongoCode("function(k, vals) {
    return vals[0];
}");
$cmmd = smongo::$db->command(array(
    "mapreduce" => "sessions",
    "map" => $map,
    "reduce" => $reduce,
    "out" => "result"));
These are the timings. As you can see, the query is very slow:
Array
(
    [result] => result
    [timeMillis] => 29452
    [counts] => Array
        (
            [input] => 106026
            [emit] => 106026
            [reduce] => 0
            [output] => 106026
        )
    [ok] => 1
)
How can I reduce the timings?
If you are going to frequently need the counts for your arrays, a better approach would be to include a count field in your actual documents. Otherwise you are going to be scanning all documents to do the count (as per your Map/Reduce example).
You can use an Atomic Operation such as $inc to increment/decrement this count at the same time as you are updating the arrays.

Groupby and where clause in Linq

I am a newbie to LINQ. I am trying to write a LINQ query to get a min value from a set of records. I need to use GroupBy, Where, Select and Min in the same query, but I am having issues when using the group by clause. Here is the query I wrote:
var data =newTrips.groupby (x => x.TripPath.TripPathLink.Link.Road.Name)
.Where(x => x.TripPath.PathNumber == pathnum)
.Select(x => x.TripPath.TripPathLink.Link.Speed).Min();
I am not able to use group by and where together; it keeps giving an error.
My query should:
1- Select all the values.
2- Filter them through the where clause (pathnum).
3- Group by the road name.
4- Finally, get the min value.
Can someone tell me what I am doing wrong and how to achieve the desired result?
Thanks,
Pawan
It's a little tricky not knowing the relationships between the data, but I think (without trying it) that this should give you what you want: the minimum speed per road by name. Note that it will result in a collection of anonymous objects with Name and Speed properties.
var data = newTrips.Where(x => x.TripPath.PathNumber == pathnum)
.Select(x => x.TripPath.TripPathLink.Link)
.GroupBy(x => x.Road.Name)
.Select(g => new { Name = g.Key, Speed = g.Min(l => l.Speed) } );
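The original query fails because after GroupBy each element is a group, not a trip, so x.TripPath is no longer available in the Where that follows. Filtering before grouping avoids that. Here is a minimal runnable sketch of the same ordering, using hypothetical flattened sample data (tuples of road name, path number and speed) in place of the real Trip objects, whose structure is not shown in the question:

```csharp
using System;
using System.Linq;

// Hypothetical sample data: (road name, path number, link speed).
var trips = new[]
{
    (Road: "A1", PathNumber: 1, Speed: 60),
    (Road: "A1", PathNumber: 1, Speed: 45),
    (Road: "B2", PathNumber: 1, Speed: 30),
    (Road: "B2", PathNumber: 2, Speed: 10),
};
int pathnum = 1;

var minSpeedPerRoad = trips
    .Where(t => t.PathNumber == pathnum)   // filter first, while the items are still trips
    .GroupBy(t => t.Road)                  // then group by road name
    .Select(g => new { Name = g.Key, Speed = g.Min(t => t.Speed) })
    .ToList();

foreach (var r in minSpeedPerRoad)
    Console.WriteLine($"{r.Name}: {r.Speed}");   // A1: 45, then B2: 30
```

The B2 entry with speed 10 is excluded because it belongs to path 2.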
Since I think you want the Trip which has the minimum speed, rather than the speed, and I'm assuming a different data structure, I'll add to tvanfosson's answer:
var pathnum = 1;
var trips = from trip in newTrips
            where trip.TripPath.PathNumber == pathnum
            group trip by trip.TripPath.TripPathLink.Link.Road.Name into g
            let minSpeed = g.Min(t => t.TripPath.TripPathLink.Link.Speed)
            select new {
                Name = g.Key,
                // First rather than Single, so that two trips tied on the minimum speed do not throw
                Trip = g.First(t => t.TripPath.TripPathLink.Link.Speed == minSpeed) };
foreach (var t in trips)
{
    Console.WriteLine("Name = {0}, TripId = {1}", t.Name, t.Trip.TripId);
}

Why is this LINQ so slow?

Can anyone please explain why the third query below is orders of magnitude slower than the others when it oughtn't to take any longer than doing the first two in sequence?
var data = Enumerable.Range(0, 10000).Select(x => new { Index = x, Value = x + " is the magic number"}).ToList();
var test1 = data.Select(x => new { Original = x, Match = data.Single(y => y.Value == x.Value) }).Take(1).Dump();
var test2 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == x.Index) }).Take(1).Dump();
var test3 = data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1).Dump();
EDIT: I've added a .ToList() to the original data generation because I don't want any repeated generation of the data clouding the issue.
I'm just trying to understand why this code is so slow by the way, not looking for faster alternative, unless it sheds some light on the matter. I would have thought that if Linq is lazily evaluated and I'm only looking for the first item (Take(1)) then test3's:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == data.Single(y => y.Value == x.Value).Index) }).Take(1);
could reduce to:
data.Select(x => new { Original = x, Match = data.Single(z => z.Index == 1) }).Take(1)
in O(N) as the first item in data is successfully matched after one full scan of the data by the inner Single(), leaving one more sweep of the data by the remaining Single(). So still all O(N).
It's evidently being processed in a more long winded way but I don't really understand how or why.
Test3 takes a couple of seconds to run by the way, so I think we can safely assume that if your answer features the number 10^16 you've made a mistake somewhere along the line.
The first two "tests" are identical in cost. The third adds another entire level of slowness.
The first two LINQ statements here are quadratic in nature if fully enumerated. Single() has to scan the entire "data" sequence for every element, because it must confirm that there is exactly one match, so projecting all 10000 elements means 10000 scans of 10000 items: an O(N^2) operation. With Take(1), only the first element is projected, so each of these tests costs a single O(N) scan.
The "test3" operation takes this to an entirely new level of pain, since the inner data.Single(...) sits inside the predicate of the outer data.Single(...). The predicate runs once for every item the outer scan visits, and each run is itself a full O(N) scan, so producing a single element already costs O(N^2).
Fully enumerated, the third test is therefore O(N^3). Even with Take(1) it performs about 10^8 comparisons where the first two perform about 10^4, which is why it is orders of magnitude slower.
Fixed.
var data = Enumerable.Range(0, 10000)
.Select(x => new { Index = x, Value = x + " is the magic number"})
.ToList();
var forward = data.ToLookup(x => x.Index);
var backward = data.ToLookup(x => x.Value);
var test1 = data.Select(x => new { Original = x,
Match = backward[x.Value].Single()
} ).Take(1).Dump();
var test2 = data.Select(x => new { Original = x,
Match = forward[x.Index].Single()
} ).Take(1).Dump();
var test3 = data.Select(x => new { Original = x,
Match = forward[backward[x.Value].Single().Index].Single()
} ).Take(1).Dump();
In the original code, if each query is fully enumerated with ToList():
data.ToList() generates 10,000 instances (10^4).
data.Select( data.Single() ).ToList() evaluates its predicate 100,000,000 times (10^8).
data.Select( data.Single( data.Single() ) ).ToList() evaluates the innermost predicate 1,000,000,000,000 times (10^12).
Single and First are different. Single throws if multiple instances are encountered. Single must fully enumerate its source to check for multiple instances.
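One way to check where the time goes is to count how often the innermost predicate runs. The sketch below keeps the shape of test3 but shrinks the data to 100 items (an assumption made to keep it fast) and adds a counter; it uses ToList() in place of LINQPad's Dump().

```csharp
using System;
using System.Linq;

var data = Enumerable.Range(0, 100)
    .Select(x => new { Index = x, Value = x + " is the magic number" })
    .ToList();

long innerCalls = 0; // number of times the innermost predicate is evaluated

var first = data
    .Select(x => new
    {
        Original = x,
        // The outer Single visits every z, and its predicate runs a full inner Single each time.
        Match = data.Single(z => z.Index == data.Single(y =>
        {
            innerCalls++;
            return y.Value == x.Value;
        }).Index)
    })
    .Take(1)   // only the first element is ever projected
    .ToList();

Console.WriteLine(innerCalls); // 10000, i.e. 100 * 100 for a single projected element
```

With 10000 items the same count would be 10^8 for one element, which is consistent with test3 taking a couple of seconds.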

Find the max value in a grouped list using Linq

I have a LINQ expression that returns transactions in groups. Each transaction has a numerical value, and I now need to know the highest value among all the transactions returned. This value is held in a field called TransactionId.
Here is the expression I am using to get the grouped list.
var transactions = ctx.MyTransactions
.Where (x => x.AdapterId == Id)
.GroupBy(x => x.DeviceTypeId);
I now need to write an expression that works on the “transactions” grouped list to find the “max” of the TransactionId field. I’ve tried different ideas but none seem to work with the grouped results. I’m new to linq so I’m not sure how to do this.
Have you tried finding the maximum in each group and then finding the maximum of that over all groups?
int max = transactions.Max(g => g.Max(t => t.TransactionId));
Or you could just query the database again:
int max = ctx.MyTransactions
.Where(x => x.AdapterId == Id)
.Max(t => t.TransactionId);
This will give you the max in each group:
var transactionIds = ctx.MyTransactions
    .Where (x => x.AdapterId == Id)
    .GroupBy(x => x.DeviceTypeId,
        (key, g) => new {
            DeviceTypeId = key,
            MaxTransaction = g.Max(t => t.TransactionId)
        });
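Both approaches can be tried against in-memory data. This is a sketch only: the tuple array is a hypothetical stand-in for ctx.MyTransactions, and the values are invented for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory stand-in for ctx.MyTransactions.
var myTransactions = new[]
{
    (AdapterId: 7, DeviceTypeId: 1, TransactionId: 101),
    (AdapterId: 7, DeviceTypeId: 1, TransactionId: 250),
    (AdapterId: 7, DeviceTypeId: 2, TransactionId: 180),
    (AdapterId: 9, DeviceTypeId: 2, TransactionId: 999),
};
int id = 7;

var transactions = myTransactions
    .Where(x => x.AdapterId == id)
    .GroupBy(x => x.DeviceTypeId);

// Max within each group, then max across the groups.
int overallMax = transactions.Max(g => g.Max(t => t.TransactionId));

// Per-group maxima, using the GroupBy overload that takes a (key, group) result selector.
var maxPerGroup = myTransactions
    .Where(x => x.AdapterId == id)
    .GroupBy(x => x.DeviceTypeId,
             (key, g) => new { DeviceTypeId = key, MaxTransaction = g.Max(t => t.TransactionId) })
    .ToList();

Console.WriteLine(overallMax); // 250
foreach (var m in maxPerGroup)
    Console.WriteLine($"{m.DeviceTypeId}: {m.MaxTransaction}"); // 1: 250, then 2: 180
```

The transaction with id 999 is ignored because it belongs to a different adapter.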
