How to get keySet() and size() for the entire GridGain cluster?

GridCache.keySet(), .primarySize(), and .size() only return information for the local node.
How do I get this information for the whole cluster?
Scanning the entire cluster "works", but all I need is the keys or the count, not the values.
The problem is that an SQL query works if I want to search on an indexed field, but I can't search on the grid cache entry key itself.
My workaround, which works but is far from elegant or performant, is:
Set<String> ruleIds = FluentIterable.from(cache.queries().createSqlFieldsQuery("SELECT property FROM YagoRule").execute().get())
        .<String>transform((it) -> (String) it.iterator().next()).toSet();
This requires that the key be identical to one of the fields, and that field needs to be indexed for performance reasons.

The next release of GridGain (6.2.0) will have globalSize() and globalPrimarySize() methods, which will ask the cluster for the sizes.
For now you can use the following code:
// Only grab nodes on which cache "mycache" is started.
GridCompute compute = grid.forCache("mycache").compute();

Collection<Integer> res = compute.broadcast(
    // This code will execute on every caching node.
    new GridCallable<Integer>() {
        @Override public Integer call() {
            return grid.cache("mycache").size();
        }
    }
).get();

int sum = 0;

for (Integer i : res)
    sum += i;
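Once 6.2.0 is available, the same count should reduce to a single call per metric; a minimal sketch, assuming the new methods land on GridCache as described above:
// Sketch only: globalSize()/globalPrimarySize() are the upcoming 6.2.0 methods
// mentioned above and are not available in the current release.
int total = grid.cache("mycache").globalSize();
int totalPrimary = grid.cache("mycache").globalPrimarySize();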

Related

Enrich each existing value in a cache with the data from another cache in an Ignite cluster

What is the best way to update a field of each existing value in an Ignite cache with data from another cache in the same cluster, in the most performant way (tens of millions of records, about a kilobyte each)?
Pseudo code:
try (mappings = getCache("mappings")) {
    try (entities = getCache("entities")) {
        entities.foreach((key, entity) -> entity.setInternalId(mappings.getValue(entity.getExternalId())));
    }
}
I would advise using compute and sending a closure to all the nodes in the cache topology. Then, on each node, you would iterate through the local primary set and do the updates. Even with this approach you would still be better off batching up the updates and issuing them with a putAll call (or maybe using IgniteDataStreamer; a sketch of that follows the example below).
NOTE: for the example below, it is important that keys in the "mappings" and "entities" caches are either identical or collocated. More information on affinity collocation is here:
https://apacheignite.readme.io/docs/affinity-collocation
The pseudo code would look something like this:
ClusterGroup cacheNodes = ignite.cluster().forCacheNodes("mappings");
IgniteCompute compute = ignite.compute(cacheNodes);

compute.broadcast(() -> {
    IgniteCache<K, V1> mappings = ignite.cache("mappings");
    IgniteCache<K, V2> entities = ignite.cache("entities");

    // Iterate over local primary entries only.
    entities.localEntries(CachePeekMode.PRIMARY).forEach((entry) -> {
        V1 mappingVal = mappings.get(entry.getKey());
        V2 entityVal = entry.getValue();
        V2 newEntityVal = ...; // do enrichment

        // It would be better to create a batch and then call putAll(...);
        // a simple put is used here for simplicity.
        entities.put(entry.getKey(), newEntityVal);
    });
});
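For the batched alternative mentioned above, here is a rough sketch using IgniteDataStreamer (same hypothetical K/V1/V2 types, with the mappings and entities caches obtained inside the broadcast closure exactly as above; allowOverwrite must be enabled so existing entries are actually updated):
try (IgniteDataStreamer<K, V2> streamer = ignite.dataStreamer("entities")) {
    // Without this the streamer skips keys that already exist in the cache.
    streamer.allowOverwrite(true);

    entities.localEntries(CachePeekMode.PRIMARY).forEach((entry) -> {
        V1 mappingVal = mappings.get(entry.getKey());
        V2 newEntityVal = ...; // do enrichment
        streamer.addData(entry.getKey(), newEntityVal);
    });
} // close() flushes any remaining buffered updates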

Getting the objects with similar secondary index in Riak?

Is there a way to get all the objects, in key/value form, that share a given secondary index value? I know we can get the list of keys for one secondary index value (bucket/{{bucketName}}/index/{{index_name}}/{{index_val}}), but my requirements are such that I need the objects themselves too. I don't want to perform a separate query for each key to fetch each object's details if there is a way around it.
I am completely new to Riak and I am totally a front-end guy, so please bear with me if something I ask is of novice level.
In Riak, it's sometimes the case that the better approach is to do separate lookups for each key. Coming from other databases this seems strange, and likely inefficient; however, you may find that an index query plus a bunch of single-object gets is faster than a map/reduce that returns all the objects in one go.
Try both approaches and see which turns out fastest for your dataset. Variables that affect this include: the size of the data being queried, the size of each document, the power of your cluster, and the load the cluster is under.
Python code demonstrating the index and separate gets (if the data you're getting is large, this method can be made memory-efficient on the client, as you don't need to store all the objects in memory):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        return [v.key];
    }"""
)
results = query.run()

bucket = riak_client.bucket("bucket_name")

for key in results:
    obj = bucket.get(key)
    # .. do something with the object
Python code demonstrating a map/reduce for all objects (returns a list of {key:document} objects):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        var obj = Riak.mapValuesJson(v)[0];
        return [ {
            'key': v.key,
            'data': obj,
        } ];
    }"""
)
results = query.run()

Sitecore - Load-balance Lucene queries

Sitecore.NET 6.6.0 (rev. 130404)
Our production website is very search-heavy and our Lucene indexes are queried heavily throughout the day. This amounts to a considerable amount of CPU power being spent on Lucene query processing. Are there industry practices for offloading Lucene indexes and queries to a different machine? Or are there hardware mechanisms that can be used to boost Lucene query performance?
(Our most used Lucene index contains less than 10,000 entries)
Update (more info):
Although our index contains fewer than 10,000 entries, can the CPU usage be caused by a high number of Lucene queries executed in parallel? We have a very complex faceted search. Initially, when users tried out various search criteria, we displayed result-count breakdowns alongside all the search options (resulting in 50-60 count queries with each search request). This caused CPU usage to reach 90-95% during high traffic. When we removed the counts, the CPU stabilized around 20-30%.
Here are the two methods we use for querying:
public static Document[] GetLuceneDocuments(ACIndex acIndex, Query query, Sort sort = null, int maxResults = 999, bool trackScores = false, bool fillFields = true)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));

    if (sort == null)
    {
        sort = new Sort(new SortField(null, SortField.SCORE));
    }

    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopFieldCollector collector = TopFieldCollector.create(sort, maxResults, fillFields, trackScores, false, false);
        searcher.Search(query, collector);
        TopDocs topdocs = collector.TopDocs();
        Document[] documents = new Document[topdocs.ScoreDocs.Length];

        for (int i = 0; i < topdocs.ScoreDocs.Length; i++)
        {
            documents[i] = searcher.Doc(topdocs.ScoreDocs[i].doc);
        }

        return documents;
    }
}

public static int GetSearchResultCount(ACIndex acIndex, Query query)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));

    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
        searcher.Search(query, collector);
        return collector.GetTotalHits();
    }
}
You should look into implementing Solr for your searches. While I'm not an expert on the subject, Solr is Lucene-based (making the transition easier) and runs off a central server or servers, handling all your search requirements.
Solr isn't officially supported out of the box in versions prior to Sitecore 7, but I have worked on a number of Sitecore 6 solutions that did use Solr.
This article should give you a head start: How to implement Solr into Sitecore
As far as industry practice goes, with Sitecore, Solr is the solution to this particular problem. Depending on your solution implementation, however, it could take some doing to get up and running.
You might look at www.alpha-solutions.dk/sitecore-search-solution for a Solr on Sitecore 6 approach.
Note: I am affiliated with Alpha Solutions
Your index is small. I know there are recommendations that you rearchitect the whole solution; however, I recommend something I have done in the past that has worked well for me and does not require provisioning another server or installing another indexing tool like Elasticsearch or Solr.
First, store the fields in the index that you facet on, like below (either in configuration or using a custom crawler):
_group
_path
_creator
Manufacturer
Size
Year
... [other fields]
Create a class that represents a result
public class MyThing
{
    public string Manufacturer { get; set; }
    public string Size { get; set; }
    public int Year { get; set; }

    public MyThing(Document doc)
    {
        Manufacturer = doc.GetField("Manufacturer").Value;
        Size = doc.GetField("Size").Value;
        Year = int.Parse(doc.GetField("Year").Value);
    }
}
Then take your main search result hits, instantiate your lightweight POCOs, and do the counts off of those. Voila, one query!
int countForSomething = results.Count(result=>result.Size == "XL");
NOTE: I kind of wrote this code off the top of my head, but you get the idea. I have used this process on indexes in Lucene up to 700K+ results in Sitecore without much issue. Good luck sir!
Ah! Just tackled the issue of faceted search and CPU usage myself. This is some border-line black-magic coding and some really creative caching.
We found a way to implement Solr's faceted querying into Lucene, and boy oh boy are the results stunningly fast.
Short version:
Build a static class that holds onto a dictionary. Key: unique representation of an individual filter, Value: the BitArray produced by a Lucene QueryFilter object.
var queryFilter = new QueryFilter(filterBooleanQuery);
var bits = queryFilter.Bits(indexReader);
result[filter.ID.ToString()] = bits;
Build this dictionary periodically, asynchronously, in the background. My index of about 80k documents only took about 15 seconds to build, but that's enough to make a lot of users angry, so doing it in a non-blocking manner is crucial.
Query this dictionary using bitwise logic to find the resulting BitArray representing the hits you're looking for.
var combo = facetDictionary[thisFilter.ID.ToString()]
    .And(facetDictionary[selectedFilter.ID.ToString()]);
Long Version:
http://www.devatwork.nl/articles/lucenenet/faceted-search-and-drill-down-lucenenet/
Now, our implementation was only to get the cardinality of these result sets, but theoretically you could use these bit arrays to get actual documents out of the index as well.
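For the cardinality step, a minimal sketch (assuming the classic System.Collections.BitArray that QueryFilter.Bits returns in the snippets above; the helper name is hypothetical):
// Counts the set bits in a combined filter BitArray; that count is the facet count.
private static int Cardinality(System.Collections.BitArray bits)
{
    int count = 0;

    for (int i = 0; i < bits.Length; i++)
    {
        if (bits[i])
            count++;
    }

    return count;
}

// Usage: int facetCount = Cardinality(combo);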
Good luck!
Upgrading to Sitecore 7 would give you facets out of the box, abstracted in a nice LINQ API that lets you switch between Lucene and Solr (others, like Elasticsearch, are coming)...

List.Find, HashSet, or LINQ: which is better for searching a list?

I have a list of strings where I want to find a particular value and return it.
If I just wanted to search, I could use a HashSet instead of a list:
HashSet<string> data = new HashSet<string>();
bool contains = data.Contains("lokendra");
But for the list I am using Find, because I also want to return the value from the list.
I found this method to be time consuming. The method this code resides in is hit more than 1000 times, and the size of the list is roughly 20,000 to 25,000, so it takes time. Is there any other way I can make the search faster?
List<Employee> employeeData = new List<Employee>();
var result = employeeData.Find(element => element.name == "lokendra");
Do we have any LINQ or other approach that makes retrieval of the data from the search faster?
Please help.
public struct Employee
{
    public string role;
    public string id;
    public int salary;
    public string name;
    public string address;
}
I have a list of this structure, and if the name property matches the value "lokendra" I want to return the whole object. Consider the list to be the employee data.
I know HashSet gives a faster search; is there any way, other than Find, to search for and return the data faster?
It sounds like what you actually want is a Dictionary<string, Employee>. Build that once, and you can query it efficiently many times. You can build it from a list of employees easily:
var employeesByName = employees.ToDictionary(e => e.Name);
...
Employee employee;
if (employeesByName.TryGetValue(name, out employee))
{
// Yay, found the employee
}
else
{
// Nope, no employee with that name
}
EDIT: Now I've seen your edit... please don't create struct types like this. You almost certainly want a class instead, and one with properties rather than public fields...
You can try employeeData.FirstOrDefault(e => e.name == "lokendra"), but it still needs to iterate over the collection, so it will perform like the List.Find method.
If your list content is set only once and you then search it again and again, you should consider implementing your own solution (see the sketch below):
sort the list once before the first search
use binary search (which is O(log n) instead of O(n) for the standard Find and Where)
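A minimal sketch of that idea, assuming .NET 4.5+ (for Comparer<T>.Create) and the Employee struct from the question:
// Sort once by name (only needs to be redone when the list changes).
employeeData.Sort((a, b) => string.Compare(a.name, b.name, StringComparison.Ordinal));

// Comparer that only looks at the name, so we can probe with a stub value.
var byName = Comparer<Employee>.Create(
    (a, b) => string.Compare(a.name, b.name, StringComparison.Ordinal));

int index = employeeData.BinarySearch(new Employee { name = "lokendra" }, byName);

if (index >= 0)
{
    Employee result = employeeData[index]; // O(log n) lookup
}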

Design a Data Structure for web server to store history of visited pages

The server must maintain data for the last n days. It must show the most visited pages of the current day first, and then the most visited pages of the next day, and so on.
I'm thinking along the lines of a hash map of hash maps. Any suggestions?
Outer hash map with key of type date and value of type hash map.
Inner hash map with key of type string containing the url and value of type int containing the visit count.
Example in C#:
// Outer hash map
var visitsByDay = new Dictionary<DateTime, VisitsByUrl>
{
    { currentDate, new VisitsByUrl() }
};
...

// Inner hash map
public class VisitsByUrl
{
    public Dictionary<string, int> Urls { get; set; }

    public VisitsByUrl()
    {
        Urls = new Dictionary<string, int>();
    }

    public void Add(string url)
    {
        if (Urls.ContainsKey(url))
            Urls[url] += 1;
        else
            Urls.Add(url, 1);
    }
}
You can keep a hash for each day, mapping URL to hit count, and a queue of length n that holds these per-day hashes. You can also keep a separate totalStats hash that sums the hits across all n days.
class Stats {
    queue< hash<url, hits> > completeStats; // one hash per day
    hash<url, hits> totalStats;             // aggregated over all n days

public:
    int getNoOfTodayHits(url) {
        return completeStats[n - 1][url];
    }

    int getTotalStats(url) {
        return totalStats[url];
    }

    void addAnotherDay() {
        // Before popping, check whether the queue already holds n days :)
        hash<url, hits> lastStats = completeStats.pop();
        hash<url, hits> todayStats;
        completeStats.push_back(todayStats);
        // Traverse lastStats and subtract its values from totalStats.
    }

    // etc.
};
We can have a combination of a stack and a hash map.
We can create an object holding the URL and a timestamp, then push it onto the stack.
The most recently visited URL will be on top.
We can use the timestamp combined with the URL to create a key, which maps to the count of visits to that URL.
In order to display the most visited pages in chronological order, we can pop the stack, build the key, and fetch the count associated with the URL. Sort them while displaying.
Time complexity: O(n) + Sort time (depends on the number of pages visited)
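A minimal sketch of that structure (hypothetical type and member names, using the timestamp-plus-URL key described above):
public class PageVisit
{
    public string Url;
    public DateTime Timestamp;
}

public class VisitHistory
{
    private readonly Stack<PageVisit> visits = new Stack<PageVisit>();                // most recent visit on top
    private readonly Dictionary<string, int> counts = new Dictionary<string, int>();  // "yyyy-MM-dd|url" -> visit count

    public void RecordVisit(string url, DateTime timestamp)
    {
        visits.Push(new PageVisit { Url = url, Timestamp = timestamp });

        // Per-day key so counts can be reported day by day.
        string key = timestamp.ToString("yyyy-MM-dd") + "|" + url;

        int c;
        counts[key] = counts.TryGetValue(key, out c) ? c + 1 : 1;
    }
}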
This depends on what you want. For example, do you want to store the actual data for the pages in the history, or just the URLs? If somebody has visited a page twice, should it show up twice in the history?
A hash map would be suitable if you wanted to store the data for a page, and wanted each page to show up only once.
If, as I'd consider more likely, you want to store only the URLs, but want each stored multiple times if it was visited more than once, an array/vector would probably make more sense. If you expect to see a lot of duplication of (relatively) long URLs, you could create a set of URLs, and for each visit store some sort of pointer/index/reference to the URL in question. Note, however, that maintaining this can become somewhat non-trivial.
