Getting the top 100 URLs from a log file - algorithm

One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?
We have a fairly large log file, about 5 GB. Each line of the log file contains a URL that a user has visited on our site. We want to figure out the 100 most popular URLs visited by our users. How would you do it?

If we have enough RAM (say more than 10 GB), just do it straightforwardly with a hash map from URL to count.
Otherwise, partition the log into several files using a hash function on the URL, process each file, and take its top 100. With the top 100 of each file it is easy to get the overall top 100.
Another solution is to sort the file with any external sorting method and then scan it once, counting the occurrences of each URL. During the scan you only need to maintain a running top 100; anything that does not make it into that list can safely be thrown away. A sketch of this single pass is shown below.
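A minimal C# sketch of that single pass, assuming the log file has already been externally sorted by URL. The file path is a placeholder and PriorityQueue requires .NET 6 or later; treat this as an illustration, not a drop-in implementation.
using System;
using System.Collections.Generic;
using System.IO;

static class TopUrls
{
    // Scan a file whose lines are already sorted by URL (e.g. after an external sort)
    // and keep only the 100 most frequent URLs in a size-bounded min-heap.
    public static List<(string Url, long Count)> Top100(string path)
    {
        // Priority = occurrence count, so the least frequent kept URL sits on top.
        var heap = new PriorityQueue<(string Url, long Count), long>();
        string currentUrl = null;
        long currentCount = 0;

        void Push(string url, long count)
        {
            heap.Enqueue((url, count), count);
            if (heap.Count > 100) heap.Dequeue(); // drop the least frequent candidate
        }

        foreach (var line in File.ReadLines(path)) // streamed, never fully in memory
        {
            if (line == currentUrl) { currentCount++; continue; }
            if (currentUrl != null) Push(currentUrl, currentCount); // end of a run of equal URLs
            currentUrl = line;
            currentCount = 1;
        }
        if (currentUrl != null) Push(currentUrl, currentCount); // flush the last run

        var result = new List<(string Url, long Count)>();
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse(); // most frequent first
        return result;
    }
}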

Just sort the log file according to the URLs (this needs only constant extra space if you choose an in-place algorithm such as heapsort; for a file that doesn't fit in memory, an external sort works as well) and then count, for each URL, how many times it appears (easy, since lines with the same URL end up next to each other).
Overall complexity is O(n·log n).
Why splitting into many files and keeping only the top 3 (or top 5, or top N) of each file can be wrong:
        File1  File2  File3
url1      5      0      5
url2      0      5      5
url3      5      5      0
url4      5      0      0
url5      0      5      0
url6      0      0      5
url7      4      4      4
url7 never makes it into the top 3 of any individual file, yet with 12 hits overall it beats every other URL (each of which totals 10). Note that this objection applies to arbitrary splits (for example, by line ranges); if you partition by a hash of the URL, every occurrence of a given URL lands in the same file, and merging the per-file top-N lists is correct.

Because the log file is fairly large, you should read it with a stream reader rather than loading it all into memory.
It is reasonable to expect that the distinct links, with their counts, fit in memory while we work through the log file.
// Pseudo
Hashmap map<url, count>
while (log file has next line) {
    url = next line in log file
    add url to map and update count
}
List list
foreach (m in map) {
    add m to list
}
sort the list by count value
take top n from the list
The runtime is O(n) + O(m·log m), where n is the number of lines in the log file and m is the number of distinct links found.
Here's a C# implementation of the pseudo-code. An actual file reader and log file are not provided; a simple emulation of reading a log file from an in-memory list is used instead.
The algorithm uses a hashmap to store the links found, and a sort then picks the top 100 links. A simple data-container structure is used for the sorting step.
The memory complexity depends on the number of distinct links expected: the hashmap must be able to hold all of the distinct links found, otherwise this algorithm won't work.
// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit");
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }

        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;

        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com", rand.Next(1000).ToString("x")));
        }

        // Hashmap memory must be allocated
        // to contain distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            if (lookup.ContainsKey(url))
            {
                int i = lookup[url];
                lookup[url] = i + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is number of distinct urls
        var list = lookup.Select(i => new LinkData { Url = i.Key, Count = i.Value }).ToList();

        // O(m*log(m))
        list.Sort();

        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1) -- at most 100 entries
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }
        Console.WriteLine(string.Format("Time elapsed msec: {0}", stopwatch.ElapsedMilliseconds));
    }
}
Edit: This answer has been updated based on the comments:
- added a running time and memory complexity analysis
- added pseudo-code
- added an explanation of how we handle a fairly large log file

Related

Fetch data from the middle of a big result set using searchAfter (jump to a specific page)

I have a large data set, around 25 million records.
I am using searchAfter with a point in time (PIT) to walk through the data.
My question: is there a way to skip past the 10,000-record limit (index.max_result_window) and start picking records from, for example, 100,000 up to 105,000?
Right now I am sending multiple requests to Elasticsearch until I reach the desired point, but it is not efficient and it consumes a lot of time.
Here is how I did it:
I calculated how many pages I need for the pagination.
The user then sends a request with a page number, e.g. page 3. Only when I reach the desired page do I enable source fetching.
This is the best I managed to do to improve performance and reduce the response size for the pages that are not required.
int numberOfPages = Pagination.GetTotalPages(totalCount, _size);
var pitResponse = await _esClient.OpenPointInTimeAsync(content._index, p => p.KeepAlive("2m"));
if (pitResponse.IsValid)
{
    IEnumerable<object> lastHit = null;
    for (int round = 0; round < numberOfPages; round++)
    {
        bool fetchSource = round == requiredPage;
        var response = await _esClient.SearchAsync<ProductionDataItem>(s => s
            .Index(content._index)
            .Size(10000)
            .Source(fetchSource)
            .Query(query)
            .PointInTime(pitResponse.Id)
            .Sort(srt =>
            {
                if (content.Sort == 1) { srt.Ascending(sortBy); }
                else { srt.Descending(sortBy); }
                return srt;
            })
            .SearchAfter(lastHit)
        );

        if (fetchSource)
        {
            itemsList.AddRange(response.Documents.ToList());
            break;
        }
        lastHit = response.Hits.Last().Sorts;
    }
}

// Closing the PIT
await _esClient.ClosePointInTimeAsync(p => p.Id(pitResponse.Id));
Check here: Elasticsearch Pagination Techniques
I think the best way to do it is how I did it: keep scrolling via the point in time and only load the result (via .Source(bool)) once the desired page is reached.

The probability distribution of two words in a file using Java 8

I need the number of lines that contain two given words. For this purpose I have written the following code.
The input file contains 1,000 lines and about 4,000 words, and it takes about 4 hours.
Is there a library in Java that can do it faster?
Can I implement this code using Apache Lucene or Stanford CoreNLP to achieve a lower run time?
ArrayList<String> reviews = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
Map<String, Double> pij = new HashMap<String, Double>();

BufferedReader br = null;
FileReader fr = null;
try
{
    fr = new FileReader("src/reviews-preprocessing.txt");
    br = new BufferedReader(fr);
    String line;
    while ((line = br.readLine()) != null)
    {
        for (String term : line.split(" "))
        {
            if (!terms.contains(term))
                terms.add(term);
        }
        reviews.add(line);
    }
}
catch (IOException e) { e.printStackTrace(); }
finally
{
    try
    {
        if (br != null)
            br.close();
        if (fr != null)
            fr.close();
    }
    catch (IOException ex) { ex.printStackTrace(); }
}

long Count = reviews.size();
for (String term_i : terms)
{
    for (String term_j : terms)
    {
        if (!term_i.equals(term_j))
        {
            double p = (double) reviews.parallelStream()
                .filter(s -> s.contains(term_i) && s.contains(term_j)).count();
            String key = String.format("%s_%s", term_i, term_j);
            pij.put(key, p / Count);
        }
    }
}
Your first loop, which collects the distinct words, relies on ArrayList.contains, which has linear time complexity, instead of using a Set. So if we assume d distinct words, that phase alone already has a time complexity of "number of lines" × d.
Then, you are creating d × d word combinations and probing all 1,000 lines for the presence of each combination. In other words, if we assume only 100 distinct words, you are performing 1,000×100 + 100×100×1,000 = 10,100,000 operations; if we assume 500 distinct words, we're talking about 250,500,000 already.
Instead, you should just create the combinations that actually exist in a line and collect them into the map. This processes only the combinations that actually occur, and you can improve it further by counting only one of each "a_b"/"b_a" pair, since the probability of both is identical. Then you are only performing "number of lines" × "words per line" × "words per line" operations, in other words, roughly 16,000 operations in your case.
The following method builds all word combinations of a line, keeping only one of each "a_b"/"b_a" pair, and eliminates duplicates so that each combination is counted at most once per line.
static Stream<String> allCombinations(String line) {
    String[] words = line.split(" ");
    return Arrays.stream(words)
        .flatMap(word1 -> Arrays.stream(words)
            .filter(word2 -> word1.compareTo(word2) < 0)
            .map(word2 -> word1 + '_' + word2))
        .distinct();
}
This method can be used like
List<String> lines = Files.readAllLines(Paths.get("src/reviews-preprocessing.txt"));
double ratio = 1.0 / lines.size();
Map<String, Double> pij = lines.stream()
    .flatMap(line -> allCombinations(line))
    .collect(Collectors.groupingBy(Function.identity(),
                                   Collectors.summingDouble(x -> ratio)));
It ran through my copy of "War and Peace" within a few seconds, without needing any attempt at parallel processing. Not very surprisingly, "and_the" was the combination with the highest probability.
You may consider changing the line
String[] words = line.split(" ");
to
String[] words = line.toLowerCase().split("\\W+");
to generalize the code to different input, handling multiple spaces or other punctuation characters and ignoring case.

How to sort IEnumerable with limited result count? (another implementation of .OrderBy.Take)

I have a binary file which contains more than 100 million objects. I read the file using BinaryReader and return (yield) the objects one at a time (the file reader and IEnumerable implementation are here: Performance comparison of IEnumerable and raising event for each item in source?).
One of the object's properties indicates the object's rank (like A5). Assume that I want to get the top n objects sorted by that property.
I looked at the code of the OrderBy method: it uses the quicksort algorithm. I tried sorting the IEnumerable result with OrderBy and Take(n) together, but I got an OutOfMemory exception, because OrderBy creates an array the size of the total object count in order to run quicksort.
Actually, the total memory I need is only n, so there is no need to create such a big array. For instance, with Take(1000) only 1000 objects are returned, regardless of the total number of objects.
How can I get the result of OrderBy combined with Take? In other words, I need a bounded sorted list whose capacity is defined by the end user.
If you want the top N of an ordered source with the default LINQ operators, then the only option is loading all items into memory, sorting them, and selecting the first N results:
items.OrderBy(keySelector).Take(N) // Out of memory on a huge source
If you only want to sort the first N items, then simply take the items first and sort them afterwards:
items.Take(N).OrderBy(keySelector)
UPDATE: you can use a buffer to keep the N largest items in order:
public static IEnumerable<T> TakeOrdered<T, TKey>(
    this IEnumerable<T> source, int count, Func<T, TKey> keySelector)
{
    Comparer<T, TKey> comparer = new Comparer<T, TKey>(keySelector);
    List<T> buffer = new List<T>();
    using (var iterator = source.GetEnumerator())
    {
        while (iterator.MoveNext())
        {
            T current = iterator.Current;
            if (buffer.Count == count)
            {
                // check if the current item is less than the minimal buffered item
                if (comparer.Compare(current, buffer[0]) <= 0)
                    continue;

                buffer.Remove(buffer[0]); // remove the minimal item
            }
            // find the insertion index of the current item
            int index = buffer.BinarySearch(current, comparer);
            buffer.Insert(index >= 0 ? index : ~index, current);
        }
    }
    return buffer;
}
This solution also uses a custom comparer, to compare items by their keys:
public class Comparer<T, TKey> : IComparer<T>
{
    private readonly Func<T, TKey> _keySelector;
    private readonly Comparer<TKey> _comparer = Comparer<TKey>.Default;

    public Comparer(Func<T, TKey> keySelector)
    {
        _keySelector = keySelector;
    }

    public int Compare(T x, T y)
    {
        return _comparer.Compare(_keySelector(x), _keySelector(y));
    }
}
Sample usage:
string[] items = { "b", "ab", "a", "abcd", "abc", "bcde", "b", "abc", "d" };
var top5byLength = items.TakeOrdered(5, s => s.Length);
var top3byValue = items.TakeOrdered(3, s => s);
LINQ does not have a built-in operator that lets you take the top n elements without loading the whole collection into memory, but you can definitely build it yourself.
One simple approach is to use a SortedDictionary of lists: keep adding elements to it until you hit the limit of n. After that, compare each element you are about to add with the smallest element found so far (i.e. dict.Keys.First()). If the new element is smaller, discard it; otherwise remove the smallest element and add the new one. A rough sketch of this idea is shown below.
At the end of the loop the sorted dictionary holds at most n elements, and they are sorted according to the comparer you set on the dictionary.
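A rough C# sketch of that SortedDictionary-of-lists idea; the TakeTop name and helper methods are illustrative, not an existing API.
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopNExtensions
{
    public static IEnumerable<T> TakeTop<T, TKey>(
        this IEnumerable<T> source, int n, Func<T, TKey> keySelector)
    {
        // Keys are kept sorted ascending, so dict.Keys.First() is the current minimum.
        var dict = new SortedDictionary<TKey, List<T>>();
        int count = 0;

        foreach (var item in source)
        {
            TKey key = keySelector(item);
            if (count < n)
            {
                Add(dict, key, item);
                count++;
                continue;
            }

            TKey smallest = dict.Keys.First();
            if (Comparer<TKey>.Default.Compare(key, smallest) <= 0)
                continue; // not better than the smallest kept item, discard

            RemoveOneSmallest(dict, smallest); // evict one minimum, keep exactly n items
            Add(dict, key, item);
        }

        // Emit the kept items, largest keys first.
        return dict.Reverse().SelectMany(kvp => kvp.Value);
    }

    static void Add<T, TKey>(SortedDictionary<TKey, List<T>> dict, TKey key, T item)
    {
        if (!dict.TryGetValue(key, out var list))
            dict[key] = list = new List<T>();
        list.Add(item);
    }

    static void RemoveOneSmallest<T, TKey>(SortedDictionary<TKey, List<T>> dict, TKey smallest)
    {
        var list = dict[smallest];
        list.RemoveAt(list.Count - 1);
        if (list.Count == 0)
            dict.Remove(smallest);
    }
}
Used, for example, as hugeSequence.TakeTop(1000, o => o.Rank), it buffers at most n items at a time, much like the TakeOrdered method above.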

algorithm - equally fill different-size containers based on two criteria

I am trying to wrap my head around an algorithm. I've never coded against a problem like this before and I'm not sure how to go about it. Here is the gist of it:
I can have n containers (hosts); each container has two numbers that are important to me: the amount of memory (x) and the number of logical processors (y), and each container can have different values.
Each virtual machine has an amount of memory (x) and a number of logical processors (y). I am trying to create an algorithm that balances the memory (x) and logical-processor (y) load across all hosts in the cluster equally. It will not be truly equal among all hosts, but all hosts should be within +/- 10% of each other.
How would I go about this problem, mathematically speaking?
If I understood your problem correctly, you want to minimize the spread of load between hosts, so that each one carries a load that deviates no more than 10% from the others. In other words, we want to minimize the "relative load" difference between hosts.
To do so, you could use some form of combinatorial optimization to reach an acceptable or optimal solution. A classic metaheuristic like simulated annealing or tabu search would do the job.
Example generic steps for your problem (a rough code sketch follows below):
- define an initial state by randomly assigning each VM to a host
- find new states by iteratively moving or swapping VMs between hosts until:
  - an acceptable solution is found, or
  - the number of iterations is exhausted, or
  - some other condition is met (like simulated annealing's "temperature" schedule)
- define a compare function to decide when to switch to the new state (solution) in each iteration
In your case, you would measure the relative load of all hosts and only accept a new state when its relative-load spread is lower than that of the current state.
This of course assumes you run the algorithm on a logical representation of the cluster and not the actual VMs. Once you have found a solution that simulates your real conditions, you would apply it physically to your VM/host configuration.
Hope this helps!
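For illustration, here is a rough greedy local-search sketch of that move-and-compare loop in C#. The Vm/HostNode shapes and the spread-based cost function are assumptions; a full simulated-annealing version would also accept some worse moves according to a temperature schedule, and a real implementation would reject moves that exceed a host's memory or CPU capacity.
using System;
using System.Linq;

class Vm       { public int Memory; public int Cpus; }
class HostNode { public int Memory; public int Cpus; }

static class Balancer
{
    // assignment[i] = index of the host that VM i is currently placed on
    static double LoadSpread(int[] assignment, Vm[] vms, HostNode[] hosts)
    {
        var memLoad = new double[hosts.Length];
        var cpuLoad = new double[hosts.Length];
        for (int i = 0; i < vms.Length; i++)
        {
            memLoad[assignment[i]] += (double)vms[i].Memory / hosts[assignment[i]].Memory;
            cpuLoad[assignment[i]] += (double)vms[i].Cpus   / hosts[assignment[i]].Cpus;
        }
        // The smaller the max-min spread of the relative loads, the better balanced we are.
        return (memLoad.Max() - memLoad.Min()) + (cpuLoad.Max() - cpuLoad.Min());
    }

    public static int[] Balance(Vm[] vms, HostNode[] hosts, int iterations = 100000)
    {
        var rand = new Random();
        // Initial state: assign every VM to a random host.
        var current = vms.Select(_ => rand.Next(hosts.Length)).ToArray();
        double currentCost = LoadSpread(current, vms, hosts);

        for (int it = 0; it < iterations; it++)
        {
            // Neighbour state: move one random VM to another random host.
            int vm = rand.Next(vms.Length);
            int oldHost = current[vm];
            int newHost = rand.Next(hosts.Length);
            if (newHost == oldHost) continue;

            current[vm] = newHost;
            double newCost = LoadSpread(current, vms, hosts);
            if (newCost < currentCost)
                currentCost = newCost; // keep the better-balanced state
            else
                current[vm] = oldHost; // revert the move
        }
        return current;
    }
}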
You've probably moved on by now, but if you ever come back to this issue, this answer may be useful. If any part is confusing, let me know and I'll try to clarify.
Your problem is a case of 2D variable size bin packing without rotation. Your dimensions are Memory and CPU, rather than length and width (hence the lack of rotation).
I would use a simple offline packing algorithm. (offline means that your VMs and hosts are all known beforehand)
The simple packing procedure I use is:
1. sort your unassigned VMs by required memory
2. sort your set of hosts by available memory
3. find the host with the most available memory that the VM fits on and assign the VM to that host (be sure to check CPU capacity too: the host with the most available RAM may not have enough CPU resources)
4. remove the VM from the list
5. reduce the host's available memory and CPU capacity
6. if you still have VMs, go to step 2
Here is how I defined VMs and Hosts:
[DebuggerDisplay("{Name}: {MemoryUsage} | {ProcessorUsage}")]
class VirtualMachine
{
    public int MemoryUsage;
    public string Name;
    public int ProcessorUsage;

    public VirtualMachine(string name, int memoryUsage, int processorUsage)
    {
        MemoryUsage = memoryUsage;
        ProcessorUsage = processorUsage;
        Name = name;
    }
}

[DebuggerDisplay("{Name}: {Memory} | {Processor}")]
class Host
{
    public readonly string Name;
    public int Memory;
    public Host Parent;
    public int Processor;

    public Host(string name, int memory, int processor, Host parent = null)
    {
        Name = name;
        Memory = memory;
        Processor = processor;
        Parent = parent;
    }

    public bool Fits(VirtualMachine vm) { return vm.MemoryUsage <= Memory && vm.ProcessorUsage <= Processor; }

    public Host Assign(VirtualMachine vm) { return new Host(Name + "_", Memory - vm.MemoryUsage, Processor - vm.ProcessorUsage, this); }
}
The Host Fits and Assign methods are important for checking if a VM can fit, and reducing the Host available memory/CPU. I create a "Host-Prime" to represent the host with reduced resources, removing the original Host and inserting Host-Prime into the Host list.
Here is the bin pack solving algorithm. If you are running against a large data set, there should be plenty of opportunities for speeding up execution, but this is good enough for small data sets.
class Allocator
{
    readonly List<Host> Bins;
    readonly List<VirtualMachine> Items;

    public Allocator(List<Host> bins, List<VirtualMachine> items)
    {
        Bins = bins;
        Items = items;
    }

    public Dictionary<Host, List<VirtualMachine>> Solve()
    {
        var bins = new HashSet<Host>(Bins);
        var items = Items.OrderByDescending(item => item.MemoryUsage).ToList();
        var result = new Dictionary<Host, List<VirtualMachine>>();

        while (items.Count > 0)
        {
            var item = items.First();
            items.RemoveAt(0);

            var suitableBin = bins.OrderByDescending(b => b.Memory).FirstOrDefault(b => b.Fits(item));
            if (suitableBin == null)
                return null;

            bins.Remove(suitableBin);
            var remainder = suitableBin.Assign(item);
            bins.Add(remainder);

            var rootBin = suitableBin;
            while (rootBin.Parent != null)
                rootBin = rootBin.Parent;

            if (!result.ContainsKey(rootBin))
                result[rootBin] = new List<VirtualMachine>();
            result[rootBin].Add(item);
        }
        return result;
    }
}
So you now have a packing algorithm, but you still don't have a load-balancing solution. Since this algorithm packs the VMs onto hosts without concern for balancing memory usage, we need another level of solving. To achieve a rough memory balance, I take a brute-force approach: reduce the initial memory on each host to represent a target usage goal, then solve to see whether your VMs fit into the reduced memory available. If no solution is found, relax the memory constraint and repeat, until a solution is found or none is possible (with the given algorithm). This gives a rough approximation of an optimally balanced memory load.
class Program
{
    static void Main(string[] args)
    {
        // available hosts, probably loaded from a file or database
        var hosts = new List<Host> { new Host("A", 4096, 4), new Host("B", 8192, 8), new Host("C", 3072, 8), new Host("D", 3072, 8) };
        var hostLookup = hosts.ToDictionary(h => h.Name);

        // VMs required to run, probably loaded from a file or database
        var vms = new List<VirtualMachine>
        {
            new VirtualMachine("1", 512, 1),
            new VirtualMachine("2", 1024, 2),
            new VirtualMachine("3", 1536, 5),
            new VirtualMachine("4", 1024, 8),
            new VirtualMachine("5", 1024, 1),
            new VirtualMachine("6", 2048, 1),
            new VirtualMachine("7", 2048, 2)
        };

        var solution = FindMinumumApproximateSolution(hosts, vms);
        if (solution == null)
            Console.WriteLine("No solution found.");
        else
            foreach (var hostAssigment in solution)
            {
                var host = hostLookup[hostAssigment.Key.Name];
                var vmsOnHost = hostAssigment.Value;
                var xUsage = vmsOnHost.Sum(itm => itm.MemoryUsage);
                var yUsage = vmsOnHost.Sum(itm => itm.ProcessorUsage);
                var pctUsage = (xUsage / (double)host.Memory);
                Console.WriteLine("{0} used {1} of {2} MB {5:P2} | {3} of {4} CPU", host.Name, xUsage, host.Memory, yUsage, host.Processor, pctUsage);
                Console.WriteLine("\t VMs: " + String.Join(" ", vmsOnHost.Select(vm => vm.Name)));
            }

        Console.ReadKey();
    }

    static Dictionary<Host, List<VirtualMachine>> FindMinumumApproximateSolution(List<Host> hosts, List<VirtualMachine> vms)
    {
        for (var targetLoad = 0; targetLoad <= 100; targetLoad += 1)
        {
            var solution = GetTargetLoadSolution(hosts, vms, targetLoad / 100.0);
            if (solution == null)
                continue;
            return solution;
        }
        return null;
    }

    static Dictionary<Host, List<VirtualMachine>> GetTargetLoadSolution(List<Host> hosts, List<VirtualMachine> vms, double targetMemoryLoad)
    {
        // create an alternate host list that reduces memory availability to the desired target
        var hostsAtTargetLoad = hosts.Select(h => new Host(h.Name, (int)(h.Memory * targetMemoryLoad), h.Processor)).ToList();
        var allocator = new Allocator(hostsAtTargetLoad, vms);
        return allocator.Solve();
    }
}

Paging a collection with LINQ

How do you page through a collection in LINQ given that you have a startIndex and a count?
It is very simple with the Skip and Take extension methods.
var query = from i in ideas
            select i;

var pagedCollection = query.Skip(startIndex).Take(count);
A few months back I wrote a blog post about Fluent Interfaces and LINQ which used an Extension Method on IQueryable<T> and another class to provide the following natural way of paginating a LINQ collection.
var query = from i in ideas
select i;
var pagedCollection = query.InPagesOf(10);
var pageOfIdeas = pagedCollection.Page(2);
You can get the code from the MSDN Code Gallery Page: Pipelines, Filters, Fluent API and LINQ to SQL.
I solved this a bit differently from the others because I had to make my own paginator, with a repeater. So I first made a collection of page numbers for the collection of items that I have:
// assumes that the item collection is "myItems"
int pageCount = (myItems.Count + PageSize - 1) / PageSize;
IEnumerable<int> pageRange = Enumerable.Range(1, pageCount);
// pageRange contains [1, 2, ... , pageCount]
Using this I could easily partition the item collection into a collection of "pages". A page in this case is just a collection of items (IEnumerable<Item>). This is how you can do it with Skip and Take, using the index of each entry in the pageRange created above:
IEnumerable<IEnumerable<Item>> pages = pageRange
    .Select((page, index) =>
        myItems
            .Skip(index * PageSize)
            .Take(PageSize));
Of course you have to handle each page as a collection in its own right, but if you're nesting repeaters, for example, this is actually easy to handle.
The one-liner TLDR version would be this:
var pages = Enumerable
    .Range(0, pageCount)
    .Select(index => myItems.Skip(index * PageSize).Take(PageSize));
Which can be used like this:
foreach (IEnumerable<Item> page in pages)
{
    // handle page
    foreach (Item item in page)
    {
        // handle item in page
    }
}
This question is somewhat old, but I wanted to post my paging algorithm that shows the whole procedure (including user interaction).
const int pageSize = 10;
const int count = 100;
const int startIndex = 20;

int took = 0;
bool getNextPage = true;
var page = ideas.Skip(startIndex);

do
{
    Console.WriteLine("Page {0}:", (took / pageSize) + 1);
    foreach (var idea in page.Take(pageSize))
    {
        Console.WriteLine(idea);
    }

    took += pageSize;
    if (took < count)
    {
        Console.WriteLine("Next page (y/n)?");
        char answer = Console.ReadLine().FirstOrDefault();
        getNextPage = default(char) != answer && 'y' == char.ToLowerInvariant(answer);
        if (getNextPage)
        {
            page = page.Skip(pageSize);
        }
    }
}
while (getNextPage && took < count);
However, if you are after performance (and in production code we're all after performance), you shouldn't page with LINQ as shown above, but rather use the underlying IEnumerator to implement paging yourself. As a matter of fact, it is as simple as the LINQ algorithm shown above, but more performant:
const int pageSize = 10;
const int count = 100;
const int startIndex = 20;

int took = 0;
bool getNextPage = true;

using (var page = ideas.Skip(startIndex).GetEnumerator())
{
    do
    {
        Console.WriteLine("Page {0}:", (took / pageSize) + 1);
        int currentPageItemNo = 0;
        while (currentPageItemNo++ < pageSize && page.MoveNext())
        {
            var idea = page.Current;
            Console.WriteLine(idea);
        }

        took += pageSize;
        if (took < count)
        {
            Console.WriteLine("Next page (y/n)?");
            char answer = Console.ReadLine().FirstOrDefault();
            getNextPage = default(char) != answer && 'y' == char.ToLowerInvariant(answer);
        }
    }
    while (getNextPage && took < count);
}
Explanation: The downside of using Skip() multiple times in a "cascading manner" is that it does not store a "pointer" to where the iteration was last skipped to. Instead, the original sequence is front-loaded with skip calls, which leads to "consuming" the already-consumed pages over and over again. You can prove this yourself by creating the sequence ideas so that it yields side effects: even if you have already skipped 10-20 and 20-30 and want to process 40+, you'll see all the side effects of items 10-30 being executed again before you start iterating 40+. A small demonstration follows below.
The variant using the IEnumerator interface directly instead remembers the position at the end of the last logical page, so no explicit skipping is needed and side effects are not repeated.
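For illustration, here is a minimal C# sketch that makes those repeated side effects visible; the Ideas() generator stands in for the ideas sequence above.
using System;
using System.Collections.Generic;
using System.Linq;

class SkipSideEffectsDemo
{
    static IEnumerable<int> Ideas()
    {
        for (int i = 0; i < 30; i++)
        {
            Console.WriteLine("yielding {0}", i); // visible side effect per produced item
            yield return i;
        }
    }

    static void Main()
    {
        var page = Ideas().Skip(10);

        Console.WriteLine("-- page 1 --");
        foreach (var idea in page.Take(10)) { } // generator prints 0..19, we consume items 10..19

        page = page.Skip(10); // cascaded Skip
        Console.WriteLine("-- page 2 --");
        foreach (var idea in page.Take(10)) { } // generator restarts and prints 0..29, we consume items 20..29
    }
}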
