Design a data structure for a web server to store the history of visited pages - algorithm

The server must maintain data for the last n days. It must show the most visited pages of the current day first, then the most visited pages of the next day, and so on.
I'm thinking along the lines of a hash map of hash maps. Any suggestions?

The outer hash map has a key of type date and a value of type hash map.
The inner hash map has a key of type string containing the URL and a value of type int containing the visit count.
Example in C#:
// Outer hash map
var visitsByDay = new Dictionary<DateTime, VisitsByUrl>();
visitsByDay[currentDate] = new VisitsByUrl();   // currentDate could be, e.g., DateTime.Today
...
// Inner hash map
public class VisitsByUrl
{
    public Dictionary<string, int> Urls { get; set; }

    public VisitsByUrl()
    {
        Urls = new Dictionary<string, int>();
    }

    public void Add(string url)
    {
        if (Urls.ContainsKey(url))
            Urls[url] += 1;
        else
            Urls.Add(url, 1);
    }
}
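A minimal usage sketch, assuming the visitsByDay dictionary and VisitsByUrl class above (RecordVisit is an illustrative helper, not part of the original answer):
// Record one visit for today's date (hypothetical helper).
void RecordVisit(Dictionary<DateTime, VisitsByUrl> visitsByDay, string url)
{
    var today = DateTime.Today;
    if (!visitsByDay.ContainsKey(today))
        visitsByDay[today] = new VisitsByUrl();
    visitsByDay[today].Add(url);
}
To build the report, you would then walk the last n dates in order and sort each day's Urls dictionary by count, descending.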

You can keep a hash of type hash<url, hits> for each day, and a queue of length n which holds these per-day hashes. You also keep a separate hash, totalStats, which sums the hits across all n days:
#include <deque>
#include <string>
#include <unordered_map>

class Stats {
    std::deque<std::unordered_map<std::string, int>> completeStats; // one hash per day, newest day at the back
    std::unordered_map<std::string, int> totalStats;                // sums hits across all n days
    std::size_t n;                                                  // window size in days
public:
    explicit Stats(std::size_t days) : n(days) { completeStats.emplace_back(); }

    void recordHit(const std::string& url) {
        ++completeStats.back()[url];
        ++totalStats[url];
    }
    int getNoOfTodayHits(const std::string& url) {
        return completeStats.back()[url];
    }
    int getTotalStats(const std::string& url) {
        return totalStats[url];
    }
    void addAnotherDay() {
        // Before popping, check whether the window already holds n days.
        if (completeStats.size() == n) {
            // Traverse the oldest day's stats and subtract its hits from the totals.
            for (const auto& entry : completeStats.front())
                totalStats[entry.first] -= entry.second;
            completeStats.pop_front();
        }
        completeStats.emplace_back(); // today's (empty) stats
    }
    // etc.
};

We can use a combination of a Stack and a Hash Map (a sketch follows below).
We can create an object holding the URL and a timestamp, then push it onto the stack; the most recently visited URL will be on top.
We can use the timestamp combined with the URL to create a key, which is mapped to the visit count for that URL.
In order to display the most visited pages in chronological order, we can pop the stack, recreate the key, and fetch the count associated with the URL, sorting the results while displaying them.
Time complexity: O(n) to walk the stack, plus the sort time (which depends on the number of pages visited).
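A rough C# sketch of that idea; Visit, visitStack, visitCounts and RecordVisit are illustrative names, not from the original answer:
public class Visit
{
    public string Url;
    public DateTime Timestamp;
}

var visitStack = new Stack<Visit>();              // most recently visited URL on top
var visitCounts = new Dictionary<string, int>();  // key: date + URL, value: visit count

void RecordVisit(string url)
{
    var now = DateTime.Now;
    visitStack.Push(new Visit { Url = url, Timestamp = now });

    var key = now.Date.ToString("yyyy-MM-dd") + "|" + url;  // timestamp (day) combined with the URL
    visitCounts[key] = visitCounts.TryGetValue(key, out var c) ? c + 1 : 1;
}
Displaying the history then means popping the stack, rebuilding each key, and sorting the looked-up counts, as described above.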

This depends on what you want. For example, do you want to store the actual data for the pages in the history, or just the URLs? If somebody has visited a page twice, should it show up twice in the history?
A hash map would be suitable if you wanted to store the data for a page and wanted each page to show up only once.
If, as I'd consider more likely, you want to store only the URLs, but want each one stored multiple times if it was visited more than once, an array/vector would probably make more sense. If you expect a lot of duplication of (relatively) long URLs, you could keep a set of URLs, and for each visit store some sort of pointer/index/reference to the URL in question (sketched below). Note, however, that maintaining this can become somewhat non-trivial.
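A hedged C# sketch of that de-duplication idea (urlTable, urlIndex and history are illustrative names):
var urlTable = new List<string>();             // each distinct URL stored once
var urlIndex = new Dictionary<string, int>();  // URL -> position in urlTable
var history = new List<int>();                 // one index per visit, in visit order

void Visit(string url)
{
    if (!urlIndex.TryGetValue(url, out var i))
    {
        i = urlTable.Count;
        urlTable.Add(url);
        urlIndex[url] = i;
    }
    history.Add(i);  // a repeat visit only costs another int
}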

Related

How to get keySet() and size() for entire GridGain cluster?

GridCache.keySet(), .primarySize(), and .size() only return information for the local node.
How do I get this information for the whole cluster?
Scanning the entire cluster "works", but all I need is the keys or the count, not the values.
The problem is that an SQL query works if I want to search based on an indexed field, but I can't search based on the grid cache entry key itself.
My workaround works, but it is far from elegant or performant:
Set<String> ruleIds = FluentIterable.from(cache.queries().createSqlFieldsQuery("SELECT property FROM YagoRule").execute().get())
    .<String>transform((it) -> (String) it.iterator().next()).toSet();
This requires that the key be the same as one of the fields, and that field needs to be indexed for performance reasons.
The next release of GridGain (6.2.0) will have globalSize() and globalPrimarySize() methods which will ask the cluster for the sizes.
For now you can use the following code:
// Only grab nodes on which cache "mycache" is started.
GridCompute compute = grid.forCache("mycache").compute();

Collection<Integer> res = compute.broadcast(
    // This code will execute on every caching node.
    new GridCallable<Integer>() {
        @Override public Integer call() {
            return grid.cache("mycache").size();
        }
    }
).get();

int sum = 0;
for (Integer i : res)
    sum += i;

Linq compared to IComparer

I have seen this class that looks like this:
/// <summary>
/// Provides an internal structure to sort the query parameter
/// </summary>
protected class QueryParameter
{
    public QueryParameter(string name, string value)
    {
        Name = name;
        Value = value;
    }

    public string Name { get; private set; }
    public string Value { get; private set; }
}

/// <summary>
/// Comparer class used to perform the sorting of the query parameters
/// </summary>
protected class QueryParameterComparer : IComparer<QueryParameter>
{
    public int Compare(QueryParameter x, QueryParameter y)
    {
        return x.Name == y.Name
            ? string.Compare(x.Value, y.Value)
            : string.Compare(x.Name, y.Name);
    }
}
Then there is a call later in the code that does the sort:
parameters.Sort(new QueryParameterComparer());
which all works fine.
I decided that it was a waste of time creating a QueryParameter class that only had a name and a value, and that it would probably be better to use a Dictionary. With the dictionary, rather than use Sort(new QueryParameterComparer()), I figured I could just do this:
parameters.ToList().Sort((x, y) => x.Key == y.Key ? string.Compare(x.Value, y.Value) : string.Compare(x.Key, y.Key));
The code compiles fine, but I am unsure whether it is working because the list just seems to output in the same order it was put in.
So, can anyone tell me if I am doing this correctly or if I am missing something simple?
Cheers
/r3plica
The List<T>.Sort method is not part of LINQ.
You can use OrderBy/ThenBy extension methods before calling ToList():
parameters = parameters.OrderBy(x => x.Key).ThenBy(x => x.Value).ToList();
From your code, I surmise that parameters is your dictionary, and you're calling
parameters.ToList().Sort(...);
and then carrying on using parameters.
ToList() creates a new list; you are then sorting this list and discarding it. You're not sorting parameters at all, and in fact you can't sort it because it's a dictionary.
What you need is something along the lines of
var parametersList = parameters.ToList();
parametersList.Sort(...);
where ... is the same sort as before.
You could also do
var parametersList = parameters.OrderBy(...).ToList();
which is a more LINQ-y way of doing things.
It may even be appropriate to just do e.g.
foreach(var kvp in parameters.OrderBy(...))
(or however you plan on using the sorted sequence) if you're changing the dictionary more often than you're using the sorted sequence (i.e. there's no point caching a sorted version because the original data changes a lot).
Another point to note: a dictionary can't contain duplicate keys, so there's no point checking x.Key == y.Key any more; you just need to sort via (x, y) => string.Compare(x.Key, y.Key).
I'd be careful here, though: by the look of it, the original code did support duplicate keys, so by switching to a dictionary you might be breaking something.
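Putting those points together, a hedged sketch of the corrected sort, assuming parameters is the dictionary from the question:
var parametersList = parameters.ToList();
parametersList.Sort((x, y) => string.Compare(x.Key, y.Key, StringComparison.Ordinal));

// or the LINQ-style equivalent:
var sorted = parameters.OrderBy(x => x.Key).ThenBy(x => x.Value).ToList();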
A Dictionary is essentially a hash map, and it lets you access any element (given its key) in constant time O(1), because it does a lookup in a hash table.
So if you were going to order the elements only in order to do a binary search later, you don't need to: just use the dictionary directly (and if you need to query by value as well as by key, you could keep a pair of dictionaries with the same elements but with the key/value pairs swapped).
As somebody wrote before me, if your question is how to order a list with LINQ, you should work with LINQ's OrderBy and ThenBy.

List.Find, HashSet or LINQ - which one is better on a list?

I have a list of strings where I want to find a particular value and return it.
If I just wanted to search, I could use a HashSet instead of a list:
HashSet<string> data = new HashSet<string>();
bool contains = data.Contains("lokendra");
But for the list I am using Find, because I also want to return the value from the list.
I found this method is time consuming. The method where this code resides is hit more than 1000 times, and the size of the list is approximately 20000 to 25000, so it takes time. Is there any other way I can make the search faster?
List<Employee> employeeData = new List<Employee>();
var result = employeeData.Find(element => element.name == "lokendra");
Is there any LINQ or other approach which makes retrieving the data from the search faster?
Please help.
public struct Employee
{
    public string role;
    public string id;
    public int salary;
    public string name;
    public string address;
}
I have a list of this structure, and if the name property matches the value "lokendra" then I want to return the whole object. Consider the list as the employee data.
Just as HashSet gives a faster lookup, is there any way to search the data and return it quickly, other than Find?
It sounds like what you actually want is a Dictionary<string, Employee>. Build that once, and you can query it efficiently many times. You can build it from a list of employees easily:
var employeesByName = employees.ToDictionary(e => e.Name);
...
Employee employee;
if (employeesByName.TryGetValue(name, out employee))
{
    // Yay, found the employee
}
else
{
    // Nope, no employee with that name
}
EDIT: Now I've seen your edit... please don't create struct types like this. You almost certainly want a class instead, and one with properties rather than public fields...
You can try employeeData.FirstOrDefault(e => e.name == "lokendra"), but it still needs to iterate over the collection, so it will have the same performance as the list's Find method.
If your list content is set only once and then you're searching it again and again, you should consider implementing your own solution (sketched below):
sort the list before the first search
use binary search (which is O(log n) instead of O(n) for the standard Find and Where)
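A hedged C# sketch of that sort-then-binary-search approach; NameComparer is an illustrative helper, not from the original answer:
class NameComparer : IComparer<Employee>
{
    public int Compare(Employee x, Employee y) =>
        string.Compare(x.name, y.name, StringComparison.Ordinal);
}

var comparer = new NameComparer();
employeeData.Sort(comparer);                             // O(n log n), done once after the list is filled

var probe = new Employee { name = "lokendra" };
int index = employeeData.BinarySearch(probe, comparer);  // O(log n) per lookup
if (index >= 0)
{
    Employee found = employeeData[index];
    // the whole Employee object is available here
}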

Sorting a NotesDocumentCollection based on a date field in SSJS

Using server-side JavaScript, I need to sort a NotesDocumentCollection based on a field in the collection containing the date when the document was created, or on any built-in field holding the creation date.
It would be nice if the function could take a sort-option parameter so I could specify whether I want the result back in ascending or descending order.
The reason I need this is that I use database.getModifiedDocuments(), which returns an unsorted NotesDocumentCollection, and I need to return the documents in descending order.
The following code is a modified snippet from openNTF which returns the collection in ascending order.
function sortColByDateItem(dc:NotesDocumentCollection, iName:String) {
    try {
        var rl:java.util.Vector = new java.util.Vector();
        var tm:java.util.TreeMap = new java.util.TreeMap();
        var doc:NotesDocument = dc.getFirstDocument();
        while (doc != null) {
            tm.put(doc.getItemValueDateTimeArray(iName)[0].toJavaDate(), doc);
            doc = dc.getNextDocument(doc);
        }
        var tCol:java.util.Collection = tm.values();
        var tIt:java.util.Iterator = tCol.iterator();
        while (tIt.hasNext()) {
            rl.add(tIt.next());
        }
        return rl;
    } catch (e) {
        // errors are swallowed here; the function returns nothing in that case
    }
}
When you construct the TreeMap, pass a Comparator to the constructor. This allows you to define custom sorting instead of "natural" sorting, which by default sorts ascending. Alternatively, you can call descendingMap on the TreeMap to get a reverse-order view of it.
This is a very expensive approach if you are dealing with a large number of documents. I mostly use a NotesViewEntryCollection (always sorted according to the source view) or a view navigator.
For large databases, you may use a view sorted by the modified date and navigate through its entries up to the most recent date your code was executed (which you have to save somewhere).
For smaller operations, Tim's method is great!

Does an HBase scan return sorted columns?

I am working on an HBase map-reduce job and need to understand whether the columns in a single column family are returned sorted by their names (keys). If so, I wouldn't need to sort them in the shuffle/sort stage.
Thanks
I have a very similar data model to yours. Upon insertion, however, I set my own values for the timestamps on the Put object. I did so in a way that took a "seed" of the current time and appended an incrementing counter for each event I persisted in the batch.
When I pulled the results out from the Scan, I wrote a comparator:
public class KVTimestampComparator implements Comparator<KeyValue> {
    @Override
    public int compare(KeyValue kv1, KeyValue kv2) {
        Long kv1Timestamp = kv1.getTimestamp();
        Long kv2Timestamp = kv2.getTimestamp();
        return kv1Timestamp.compareTo(kv2Timestamp);
    }
}
Then sorted the raw row:
List<KeyValue> row = Arrays.asList(result.raw());
Collections.sort(row, new KVTimestampComparator());
Got this idea from the person who answered this: Sorted results from hbase scanner
No, columns are not sorted.
They are stored internally as key-value pairs in a long byte array. But you should clarify your question about what you actually need this for.
