MongoDB 2.0.7 & PHP 5
I'm trying to count the length of an array in each document. Every document has one such array; I want the number of elements in each array along with the document's ID. There are no indexes except the default one on _id.
Here's my code:
$map = new MongoCode("function() {
    emit(this._id, {
        '_id': this._id,
        'cd': this.cd,
        'msgCount': this.cs[0].msgs.length
    });
}");
$reduce = new MongoCode("function(k, vals) {
    return vals[0];
}");
$cmmd = smongo::$db->command(array(
    "mapreduce" => "sessions",
    "map"       => $map,
    "reduce"    => $reduce,
    "out"       => "result"
));
These are the timings. As you can see, the query is very slow:
Array
(
[result] => result
[timeMillis] => 29452
[counts] => Array
(
[input] => 106026
[emit] => 106026
[reduce] => 0
[output] => 106026
)
[ok] => 1
)
How can I reduce the time this takes?
If you are going to frequently need the counts for your arrays, a better approach would be to include a count field in your actual documents. Otherwise you are going to be scanning all documents to do the count (as per your Map/Reduce example).
You can use an Atomic Operation such as $inc to increment/decrement this count at the same time as you are updating the arrays.
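For example, with the legacy Mongo PHP driver the update could look like this (a sketch; $sessionId and $newMsg are hypothetical, and the collection/field names are assumed from your question):
// Sketch: push a message and keep msgCount in sync in one atomic update.
smongo::$db->sessions->update(
    array('_id' => $sessionId),
    array(
        '$push' => array('cs.0.msgs' => $newMsg), // append the message
        '$inc'  => array('msgCount' => 1)         // increment the counter
    )
);
Reading the counts then becomes a plain find over _id and msgCount, with no server-side JavaScript at all.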
Related
I want to fetch some records between two dates using an Elasticsearch query.
First, I check the number of records between the two dates to see whether it is greater than 10000. If it is, I fetch them in batches of 10000.
//get count
var result_count = client.Count<TelegramMessageStructure>(s => s
.AllTypes()
.AllIndices()
.Query(q => q
.DateRange(r => r
.Field(f => f.messageDate)
.GreaterThanOrEquals("2018-06-03 00:00:00.000")
.LessThan("2018-06-03 00:59:00.000")
)
)
);
long count = result_count.Count; //count = 27000
It returns 27000, so I want to fetch the records in batches of 10000. I use this query to do that:
int MaxMessageCountPerQuery=10000;
for (int i = 0; i < count; i += MaxMessageCountPerQuery)
{
client = new ElasticClient(connectionSettings);
// No change whether the client is renewed or not
var result = client.Search<TelegramMessageStructure>(s => s
.AllTypes()
.AllIndices()
.MatchAll()
.From(i)
.Size(MaxMessageCountPerQuery)
.Sort(ss => ss.Ascending(p => p.id))
.Query(q => q
.DateRange(r => r
.Field(f => f.messageDate)
.GreaterThanOrEquals("2018-06-03 00:00:00.000")
.LessThan("2018-06-03 00:59:00.000")
)
)
);
//when i=0, result.documents contains 10000 records otherwise it has 0
}
In the first round, when i = 0, result.Documents contains 10000 records; in every later round it contains 0 records.
What is wrong with this?
Based on this link:
scroll in elastic net-api
Your code should contain the following steps:
1- Search with all the parameters you need, plus .Scroll("5m") (I assume From(0) and Size(10000) are set too), and save the response in a result variable.
2- Now you have the first 10000 records (in result.Documents).
3- To receive more records, use the ScrollId to get the next results. (Each call of the code below gives you the next 10000 records.)
var result_new = client.Scroll<TelegramMessageStructure>("10m", result.ScrollId);
In fact, your code should look like this:
int MaxMessageCountPerQuery = 10000;
client = new ElasticClient(connectionSettings);
var result = client.Search<TelegramMessageStructure>(s => s
    .AllTypes()
    .AllIndices()
    .From(0)  // with scroll, only the first batch is requested here
    .Size(MaxMessageCountPerQuery)
    .Sort(ss => ss.Ascending(p => p.id))
    .Query(q => q
        .DateRange(r => r
            .Field(f => f.messageDate)
            .GreaterThanOrEquals("2018-06-03 00:00:00.000")
            .LessThan("2018-06-03 00:59:00.000")
        )
    )
    .Scroll("5m") // Add this parameter
);
// TODO some code:
// save and use result.Documents (the first 10000 records)
for (long i = MaxMessageCountPerQuery; i < result.Total; i += MaxMessageCountPerQuery)
{
    // Each call gives you the next 10000 records.
    var result_new = client.Scroll<TelegramMessageStructure>("10m", result.ScrollId);
    // TODO some code:
    // save and use result_new.Documents
}
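When you have consumed all the batches, it is also worth releasing the scroll context instead of letting it sit until the timeout. NEST exposes this as ClearScroll (a sketch; the exact descriptor syntax may vary by NEST version):
// Release the server-side scroll context once you're done.
client.ClearScroll(c => c.ScrollId(result.ScrollId));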
Elasticsearch has a default index.max_result_window = 10000 and it's well explained at
https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
To understand why deep paging is problematic, let’s imagine that we
are searching within a single index with five primary shards. When we
request the first page of results (results 1 to 10), each shard
produces its own top 10 results and returns them to the coordinating
node, which then sorts all 50 results in order to select the overall
top 10.
Now imagine that we ask for page 1,000—results 10,001 to 10,010.
Everything works in the same way except that each shard has to produce
its top 10,010 results. The coordinating node then sorts through all
50,050 results and discards 50,040 of them!
You can see that, in a distributed system, the cost of sorting results
grows exponentially the deeper we page. There is a good reason that
web search engines don’t return more than 1,000 results for any query.
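That said, index.max_result_window is a dynamic index setting, so if you only need to page slightly past 10000 you can raise it and accept the cost described above. A sketch in NEST (the index name is hypothetical and the descriptor syntax is assumed for NEST 6.x):
// Raise the from + size ceiling on one index; prefer scroll for deep reads.
client.UpdateIndexSettings("telegram-messages", u => u
    .IndexSettings(i => i
        .Setting("index.max_result_window", 50000)));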
Below is the NEST query and aggregations:
Func<QueryContainerDescriptor<ConferenceWrapper>, QueryContainer> query =
q =>
q.Term(p => p.type, "conference")
// && q.Term(p => p.conference.isWaitingAreaCall, true)
&& q.Range(d => d.Field("conference.lengthSeconds").GreaterThanOrEquals(minSeconds))
&& q.DateRange(qd => qd.Field("conference.firstCallerStart").GreaterThanOrEquals(from.ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ss.fffZ")))
&& q.DateRange(qd => qd.Field("conference.firstCallerStart").LessThanOrEquals(to.ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ss.fffZ")));
Func<AggregationContainerDescriptor<ConferenceWrapper>, IAggregationContainer> waitingArea =
a => a
.Terms("both", t => t
.Field(p => p.conference.orgNetworkId) // this field seems to be ignored
.Field(p => p.conference.isWaitingAreaCall)
// .Field(new Field( p => p.conference.orgNetworkId + "-ggg-" + p.conference.networkId)
.Size(300)
.Aggregations(a2 => a2.Sum("sum-length", d2 => d2.Field("conference.lengthSeconds"))));
I have called .Field(p => p.conference.orgNetworkId) followed by .Field(p => p.conference.isWaitingAreaCall), but the NEST client seems to ignore the first field expression.
Is it possible to have the terms aggregation group by multiple fields?
Elasticsearch doesn't support a terms aggregation on multiple fields directly; the calls to .Field(...) within NEST are assignative rather than additive, so the last call will overwrite any previously set values.
In order to aggregate on multiple fields, you can either
Create a composite field at index time that incorporates the values that you wish to aggregate on
or
Use a Script to generate the terms on which to aggregate at query time, by combining the two field values.
The performance of the first option will be better than the second.
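For the second option, a sketch of what the script-based terms aggregation could look like (syntax assumed for NEST 6.x; the '-' separator and the script itself are illustrative):
// Combine the two field values into a single term at query time.
Func<AggregationContainerDescriptor<ConferenceWrapper>, IAggregationContainer> waitingArea =
    a => a
        .Terms("both", t => t
            .Script(sc => sc.Source(
                "doc['conference.orgNetworkId'].value + '-' + doc['conference.isWaitingAreaCall'].value"))
            .Size(300)
            .Aggregations(a2 => a2
                .Sum("sum-length", d2 => d2.Field("conference.lengthSeconds"))));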
I am using NEST. The number of buckets returned from an Elasticsearch aggregation is always 10 (the default), even though the size is set to 10000.
You need to set the size inside the Terms aggregation and not outside of it. Try this:
.Aggregations(a => a
    .Terms("category_agg", st => st
        .Field(o => o.categories.Select(x => x.id))
        .Size(10000)
    )
)
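With the size in the right place, the buckets can then be read back from the response along these lines (a sketch; the response variable and syntax are assumed for NEST 6.x):
// Read the buckets out of the terms aggregation.
var categoryAgg = response.Aggregations.Terms("category_agg");
foreach (var bucket in categoryAgg.Buckets)
    Console.WriteLine($"{bucket.Key}: {bucket.DocCount}");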
I am following a Nettuts+ tutorial for pagination and for storing POST inputs as query strings in the db. So far everything works fine until I get an array as POST input: I am unable to loop through it, get all the array values, and store them into query_array (i.e., store an array within an array).
The snippets below:
$query_array = array(
'gender' => $this->input->post('gender'),
'minage' => $this->input->post('minage'),
'maxage' => $this->input->post('maxage'),
'Citizenship' => $this->input->post('citizenship'), // checkboxes with name citizenship[]
);
This returns only the last stored value in Citizenship.
The output array:
Array ( [gender] => 1 [minage] => 18 [maxage] => 24 [Citizenship] => 2 )
makes the query string as:
&gender=1&minage=18&maxage=24&Citizenship=2
But my requirement is to get all the values of the 'Citizenship' array instead of just the last stored value.
The output required to make query string:
Array ( [gender] => 1 [minage] => 18 [maxage] => 24 [Citizenship] => 2 [Citizenship] => 4 [Citizenship] => 6 )
The query string :
&gender=1&minage=18&maxage=24&Citizenship[]=2&Citizenship[]=4&Citizenship[]=6
Any help appreciated.
Thanks.
It doesn't look like CodeIgniter supports unnamed multidimensional arrays as input without a bit of hacking.
If you can access raw $_POST data try replacing
$this->input->post('citizenship')
with
array_map('intval',$_POST['citizenship'])
Alternatively, add keys to your POST data:
&gender=1&minage=18&maxage=24&Citizenship[0]=2&Citizenship[1]=4&Citizenship[2]=6
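Putting the two together, a sketch that folds the checkbox array into the query string with PHP's http_build_query (the intval cast assumes the values are numeric IDs):
// Read the raw checkbox array and fold it into the query string.
$query_array['Citizenship'] = isset($_POST['citizenship'])
    ? array_map('intval', (array) $_POST['citizenship'])
    : array();
echo http_build_query($query_array);
// => gender=1&minage=18&maxage=24&Citizenship%5B0%5D=2&Citizenship%5B1%5D=4...
Note that http_build_query URL-encodes the brackets; PHP will still parse them back into an array on the next request.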
I fixed it myself. I just looped through the POST array and got the individual key/value pairs.
foreach ($_POST['Citizenship'] as $k => $v) {
    $Citizenship[$v] = $v; // keyed by value, so duplicate selections collapse
}
Hope this helps someone who faces a similar problem.
I have a bunch of posts which have category tags in them.
I am trying to find out how many times each category has been used.
I'm using Rails with MongoDB, but I don't think I need to get the category counts from the db, so the Mongo part shouldn't matter.
This is what I have so far
@recent_posts = current_user.recent_posts # returns the 10 most recent posts
@categories_hash = {'tech' => 0, 'world' => 0, 'entertainment' => 0, 'sports' => 0}
@recent_posts.each do |cat|
  cat.categories.each do |addCat|
    @categories_hash.increment(addCat) # obviously this is where I'm having problems
  end
end
the structure of the post is
{"_id" : ObjectId("idnumber"), "created_at" : "Tue Aug 03...", "categories" :["world", "sports"], "message" : "the text of the post", "poster_id" : ObjectId("idOfUserPoster"), "voters" : []}
I'm open to suggestions on how else to get the count of categories, but I will eventually want the count of voters too, so it seems to me the best way is to increment categories_hash and then add voters.length. One thing at a time, though; right now I'm just trying to figure out how to increment values in the hash.
If you aren't familiar with map/reduce and don't care about scaling up, this isn't as elegant as map/reduce, but it should be sufficient for small sites:
@categories_hash = Hash.new(0)
current_user.recent_posts.each do |post|
  post.categories.each do |category|
    @categories_hash[category] += 1
  end
end
If you're using mongodb, an elegant way to aggregate tag usage would be, to use a map/reduce operation. Mongodb supports map/reduce operations using JavaScript code. Map/reduce runs on the db server(s), i.e. your application does not have to retrieve and analyze every document (which wouldn't scale well for large collections).
As an example, here are the map and reduce functions I use in my blog on the articles collection to aggregate the usage of tags (which is used to build the tag cloud in the sidebar). Documents in the articles collection have a key named 'tags' which holds an array of strings (the tags).
The map function simply emits 1 on every used tag to count it:
function () {
  if (this.tags) {
    this.tags.forEach(function (tag) {
      emit(tag, 1);
    });
  }
}
The reduce function sums up the counts:
function (key, values) {
  var total = 0;
  values.forEach(function (v) {
    total += v;
  });
  return total;
}
As a result, the database returns a hash that has a key for every tag and its usage count as a value. E.g.:
{ 'rails' => 5, 'ruby' => 12, 'linux' => 3 }
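To tie this back to Ruby, here is a sketch of how the map/reduce might be invoked with the legacy mongo Ruby driver (the 'articles' collection and the inline output option are assumptions about your setup):
# Run the map/reduce on the server and collect the counts into a hash.
map = <<-JS
  function () {
    if (this.tags) {
      this.tags.forEach(function (tag) { emit(tag, 1); });
    }
  }
JS
reduce = <<-JS
  function (key, values) {
    var total = 0;
    values.forEach(function (v) { total += v; });
    return total;
  }
JS
raw = db.collection('articles').map_reduce(map, reduce, :out => { :inline => 1 }, :raw => true)
counts = raw['results'].inject({}) { |h, r| h[r['_id']] = r['value']; h }
# => { 'rails' => 5, 'ruby' => 12, 'linux' => 3 }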