Most Frequent 3 page sequence in a weblog - algorithm

Given a web log whose entries have the fields 'User' and 'Page URL', we have to find the most frequent 3-page sequence that users take.
Each entry has a timestamp, but a single user's accesses are not guaranteed to be logged consecutively. The log could look like: user1 Page1, user2 Pagex, user1 Page2, user10 Pagex, user1 Page3. Here user1's page sequence is Page1 -> Page2 -> Page3.

Assuming your log is stored in timestamp order, here's an algorithm to do what you need:
Create a hashtable 'user_visits' mapping each user ID to the last two pages you observed them visit.
Create a hashtable 'visit_counts' mapping 3-tuples of pages to frequency counts.
For each entry (user, URL) in the log:
    If 'user' exists in user_visits with two entries, increment the entry in visit_counts for the 3-tuple of URLs (the two stored pages followed by the new URL) by one.
    Append 'URL' to the relevant entry in user_visits, removing the oldest entry if necessary.
Sort the visit_counts hashtable by value. This is your list of most popular sequences of URLs.
Here's an implementation in Python, assuming your fields are space-separated:
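user_visits = {}   # user -> tuple of the last (up to two) URLs seen
visit_counts = {}  # (url1, url2, url3) -> number of occurrences
with open('log.txt', 'r') as fh:
    for row in fh:
        user, url = row.split()  # split() also strips the trailing newline
        prev_visits = user_visits.get(user, ())
        if len(prev_visits) == 2:
            visit_tuple = prev_visits + (url,)
            visit_counts[visit_tuple] = visit_counts.get(visit_tuple, 0) + 1
        # keep only the most recent previous URL plus the current one
        user_visits[user] = prev_visits[-1:] + (url,)
# sort the sequences by their counts, most frequent first
popular_sequences = sorted(visit_counts, key=visit_counts.get, reverse=True)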

Quick and dirty:
Build a list of url/timestamps per user
sort each list by timestamp
iterate over each list
for each 3 URL sequence, create or increment a counter
find the highest count in the URL sequence count list
foreach(entry in parsedLog)
{
    users[entry.user].urls.add(entry.time, entry.url)
}
foreach(user in users)
{
    user.urls.sort()    // order this user's visits by timestamp
    for(i = 0; i < user.urls.length - 2; i++)
    {
        key = createKey(user.urls[i], user.urls[i+1], user.urls[i+2])
        sequenceCounts.incrementOrCreate(key);
    }
}
sequenceCounts.sortDesc()
largestCountKey = sequenceCounts[0]
topUrlSequence = parseKey(largestCountKey)
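The pseudocode above could be sketched in Python roughly as follows. This assumes each parsed log entry is a (timestamp, user, url) tuple; the function name most_frequent_triples and the sample log are made up for illustration and are not part of the original answer.

```python
from collections import Counter, defaultdict


def most_frequent_triples(entries, top=1):
    """entries: iterable of (timestamp, user, url) tuples, in any order."""
    # Collect each user's visits separately
    visits_by_user = defaultdict(list)
    for timestamp, user, url in entries:
        visits_by_user[user].append((timestamp, url))

    counts = Counter()
    for visits in visits_by_user.values():
        visits.sort(key=lambda visit: visit[0])  # order by timestamp
        urls = [url for _, url in visits]
        # Zipping three staggered views yields every consecutive triple
        counts.update(zip(urls, urls[1:], urls[2:]))
    return counts.most_common(top)


# Hypothetical sample data: two users walk Page1 -> Page2 -> Page3
log = [
    (1, "user1", "Page1"), (2, "user2", "PageX"), (3, "user1", "Page2"),
    (4, "user10", "PageX"), (5, "user1", "Page3"), (6, "user2", "Page1"),
    (7, "user2", "Page2"), (8, "user2", "Page3"), (9, "user1", "Page4"),
]
print(most_frequent_triples(log))
```

Unlike the single-pass version, this holds every visit in memory, which is the price of not requiring the log to be in timestamp order.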

Here's a bit of SQL assuming you could get your log into a table such as
CREATE TABLE log (
ord int,
user VARCHAR(50) NOT NULL,
url VARCHAR(255) NOT NULL,
ts datetime
) ENGINE=InnoDB;
If the data is not sorted per user then (assuming that ord column is the number of the line from the log file)
SELECT t.url, t2.url, t3.url, count(*) c
FROM
    log t INNER JOIN
    log t2 ON t.user = t2.user INNER JOIN
    log t3 ON t2.user = t3.user
WHERE
    t2.ord IN (SELECT MIN(ord)
               FROM log i
               WHERE i.user = t.user AND i.ord > t.ord)
    AND
    t3.ord IN (SELECT MIN(ord)
               FROM log i
               WHERE i.user = t.user AND i.ord > t2.ord)
GROUP BY t.user, t.url, t2.url, t3.url
ORDER BY c DESC
LIMIT 10;
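This gives the ten most frequent 3-stop paths, counted per user because the GROUP BY includes t.user; remove t.user from the GROUP BY to count across all users. Alternatively, if you can get the data ordered by user and time, you can join on row numbers more easily.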

Source code in Mathematica
s= { {user},{page} } (* load List (log) here *)
sortedListbyUser=s[[Ordering[Transpose[{s[[All, 1]], Range[Length[s]]}]] ]]
Tally[Partition [sortedListbyUser,3,1]]

This problem is similar to
Find k most frequent words from a file
Here is how you can solve it :
Group each triplet (page1, page2, page3) into a word
Apply the algorithm mentioned here
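As a rough Python illustration of this reduction: each triple is joined into one "word" and the k most frequent words are picked with a heap. The "->" separator is an assumption (it must not occur inside a URL), and top_k_sequences is a hypothetical name, not taken from the linked answer.

```python
import heapq
from collections import Counter


def top_k_sequences(triples, k):
    """Return the k most frequent triples as (word, count) pairs."""
    # Join each (page1, page2, page3) triple into a single "word"
    words = ("->".join(triple) for triple in triples)
    counts = Counter(words)
    # nlargest keeps a heap of size k instead of sorting every entry
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


triples = [("a", "b", "c"), ("a", "b", "c"), ("b", "c", "d")]
print(top_k_sequences(triples, 1))
```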

1. Read the users' page accesses from the file line by line. The fields on each line are separated by a separator, e.g.:
u1,/
u1,main
u1,detail
Here the separator is a comma.
2. Store the visit count of each 3-page sequence in a map: pageVisitCounts.
3. Sort the visit count map by value in descending order.
public static Map<String, Integer> findThreeMaxPagesPathV1(String file, String separator, int depth) {
    Map<String, Integer> pageVisitCounts = new HashMap<String, Integer>();
    if (file == null || "".equals(file)) {
        return pageVisitCounts;
    }
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader bf = new BufferedReader(new FileReader(new File(file)))) {
        // user -> the last (depth - 1) pages that user visited
        Map<String, List<String>> userUrls = new HashMap<String, List<String>>();
        String currentLine;
        while ((currentLine = bf.readLine()) != null) {
            String[] lineArr = currentLine.split(separator);
            // every log line must have exactly two fields: user and page
            if (lineArr.length != 2) {
                continue;
            }
            String user = lineArr[0];
            String page = lineArr[1];
            List<String> urlLinkedList = userUrls.get(user);
            if (urlLinkedList == null) {
                urlLinkedList = new LinkedList<String>();
                userUrls.put(user, urlLinkedList);
            }
            if (urlLinkedList.size() == depth - 1) {
                // the stored pages plus the current one form a complete sequence
                StringBuilder pages = new StringBuilder();
                for (String previous : urlLinkedList) {
                    pages.append(previous.trim()).append(separator);
                }
                pages.append(page);
                Integer count = pageVisitCounts.get(pages.toString());
                pageVisitCounts.put(pages.toString(), count == null ? 1 : count + 1);
                urlLinkedList.remove(0); // drop the oldest page to keep the window size
            }
            urlLinkedList.add(page);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return pageVisitCounts;
}

public static void main(String[] args) {
    String file = "/home/ieee754/Desktop/test-access.log";
    String separator = ",";
    Map<String, Integer> pageVisitCounts = findThreeMaxPagesPathV1(file, separator, 3);
    System.out.println(pageVisitCounts.size());
    // MapUtil.sortByValueDescendOrder is the author's own helper (not shown here)
    Map<String, Integer> result = MapUtil.sortByValueDescendOrder(pageVisitCounts);
    System.out.println(result);
}

Related

Algo: replace chars of string to find the correct word

I have a string from OCR which contains some errors.
For example "2SQ41S" in place of "250415". I have a dictionary of the possible replacements:
O/Q can be replaced by 0,
S can be replaced by 5...
I can calculate a checksum to be sure that the correct word has been found.
Here is my recursive function, which doesn't work: it stops when startPosition >= 6, before the correct word has been found:
public void CombinaisonTest()
{
string date = "2SO41S";
Dictionary<char, String[]> replaceDictionary= new Dictionary<char, String[]>()
{
{'O', new []{"Q", "0"}},
{'S', new []{"8", "5", "B"}}
};
String result = "";
var r = combinations2(date, 0, replaceDictionary);
Console.WriteLine("Date: " + date);
Console.WriteLine("R: " + r);
}
public string combinations2(string date, int startPosition, Dictionary<char, String[]> dictionary)
{
Console.WriteLine("Call function " + date + ", " + startPosition);
if (string.Join("", date).Equals("250415")) //need to calculate checksum
{
Console.WriteLine("Found: " + date);
return date;
}
if (startPosition >= date.Length)
{
Console.WriteLine("Not Found: ");
return "";
}
for (int i = startPosition; i < date.Length; i++)
{
if (dictionary.ContainsKey(date.ToCharArray()[i]))
{
foreach (var value in dictionary[date.ToCharArray()[i]])
{
return combinations2(date.Remove(i, 1).Insert(i, value), startPosition + 1, dictionary);
}
}
else
{
return combinations2(date, i + 1, dictionary);
}
}
return combinations2(date, startPosition + 1, dictionary);
}
Do you have any ideas for the corrections, please?
Thank you.
There are a couple of issues with the code. The first is that when iterating through the values in the dictionary, it returns after checking the first one, so it will only ever try to substitute Q for O and 8 for S. The second is that you are attempting two methods of processing the characters in the string: iterative AND recursive. You don't need to iterate over the index i with a for loop and also use recursion.
Another issue (which isn't a problem in your use case but stops the algorithm from being more generic) is that the algorithm attempts a substitution every time an ambiguous character is encountered. As well as iterating over each value in the dictionary, you should also consider the case where the character is left unmodified.
The function can be changed to remove the outer for loop, iterate over the values in the dictionary, test each one (recursing over the remainder of the string) and return only if a match is found. A simple way to do this is to store the result in a string and return it only if it is not the empty string (since your function returns the empty string when no match is found). If all the values in the dictionary have been tried and no match has been found, it then tries recursing without modifying the string.
public string combinations2(string date, int startPosition, Dictionary<char, String[]> dictionary)
{
    Console.WriteLine("Call function " + date + ", " + startPosition);
    if (string.Join("", date).Equals("250415")) //need to calculate checksum
    {
        Console.WriteLine("Found: " + date);
        return date;
    }
    if (startPosition >= date.Length)
    {
        Console.WriteLine("Not Found: ");
        return "";
    }
    if (dictionary.ContainsKey(date.ToCharArray()[startPosition]))
    {
        foreach (var value in dictionary[date.ToCharArray()[startPosition]])
        {
            string result = combinations2(date.Remove(startPosition, 1).Insert(startPosition, value), startPosition + 1, dictionary);
            if (result != "")
                return result;
        }
    }
    return combinations2(date, startPosition + 1, dictionary);
}
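For comparison, here is a compact Python sketch of the same backtracking idea. It is a variant rather than a translation: it tests only complete strings, and it tries the unmodified character before the replacements. The names find_valid and is_valid are hypothetical, every replacement is assumed to be a single character, and the lambda stands in for the real checksum test, which the question does not show.

```python
def find_valid(text, replacements, is_valid, pos=0):
    """Return the first variant of text accepted by is_valid, or None."""
    if pos == len(text):
        return text if is_valid(text) else None
    # Try the character as it is, then each of its possible replacements
    for candidate in [text[pos]] + replacements.get(text[pos], []):
        variant = text[:pos] + candidate + text[pos + 1:]
        result = find_valid(variant, replacements, is_valid, pos + 1)
        if result is not None:
            return result  # stop at the first match
    return None  # every choice at this position failed, so backtrack


replacements = {"O": ["Q", "0"], "Q": ["0"], "S": ["8", "5", "B"]}
# Stand-in for the checksum: accept only the known correct value
print(find_valid("2SO41S", replacements, lambda s: s == "250415"))
```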

In pig, how do I count the number of lines that contained a specific string?

Suppose I have a group of target words:
a b c d
and an input file:
a d f s g e
12399
c a d i f
a 2
then I should return:
a 3
b 0
c 1
d 2
How can I do that in pig? Thank you!
First remove the duplicate words from each line then run word count.
Pig steps:
REGISTER 'udf-1.0-SNAPSHOT.jar';
define tuple_set com.ts.pig.UniqueRecords();
data = load '<file>' using PigStorage();
-- remove duplicate words from each line
unique = foreach data generate tuple_set($0) as line;
words = foreach unique generate flatten(TOKENIZE(line, ' ')) as word;
grouped = group words BY word;
count = foreach grouped GENERATE group, COUNT(words);
dump count;
Pig UDF sample code:
package com.ts.pig;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

/**
 * This udf removes duplicate words from line
 */
public class UniqueRecords extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0)
            return null;
        String[] splits = tuple.get(0).toString().split(" ");
        // a set keeps each word only once
        Set<String> elements = new HashSet<String>(Arrays.asList(splits));
        StringBuilder sb = new StringBuilder();
        for (String element : elements) {
            sb.append(element + " ");
        }
        return sb.toString();
    }
}
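To make the intended result concrete, here is a small Python sketch of the same logic outside Pig: deduplicate the words on each line, then count the lines per target word. It assumes the input lines and target words are already in memory; count_lines_containing is a hypothetical name. Unlike the Pig script above, it restricts the output to the target words and reports 0 for targets that never occur.

```python
def count_lines_containing(lines, targets):
    """Count, for each target word, the lines that contain it at least once."""
    counts = {word: 0 for word in targets}
    for line in lines:
        # The set removes duplicates, so a word counts once per line
        for word in counts.keys() & set(line.split()):
            counts[word] += 1
    return counts


lines = ["a d f s g e", "12399", "c a d i f", "a 2"]
print(count_lines_containing(lines, ["a", "b", "c", "d"]))
```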

Finding highest number in text file

I have a text file that contains 50 student names and scores for each student in the format.
foreName.Surname:Mark
I have figured out how to split up each line into a forename, surname and mark using this code.
string[] Lines = File.ReadAllLines(@"StudentExamMarks.txt");
int i = 0;
var items = from line in Lines
where i++ != 0
let words = line.Split(' ', '.', ':')
select new
{
foreName = words[0],
Surname = words[1],
Mark = words[2]
};
I am unsure how to incorporate a findMax algorithm to find the highest mark and display the pupil with the highest mark, as I have not used text files that often.
You can use any sorting algorithm; pseudocode for finding the maximum of a list or array is also widely available.
Try this code, which only needs to parse each line of the file once.
string[] lines = File.ReadAllLines(@"StudentExamMarks.txt");
string maxForeName = null;
string maxSurName = null;
var maxMark = 0;
for (int i = 0; i < lines.Length; i++)
{
    var tmp = lines[i].Split(new char[] { ' ', '.', ':' }, StringSplitOptions.RemoveEmptyEntries);
    if (tmp.Length == 3)
    {
        int value = int.Parse(tmp[2]);
        // maxForeName is still null until the first valid line has been read
        if (maxForeName == null || value > maxMark)
        {
            maxMark = value;
            maxForeName = tmp[0];
            maxSurName = tmp[1];
        }
    }
}
// display the pupil with the highest mark
Console.WriteLine(maxForeName + " " + maxSurName + ": " + maxMark);
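The same single pass can be sketched in Python, where max with a key function replaces the manual comparison. This assumes the lines have already been read into a list and that marks are non-negative integers; top_student and the sample names are made up for illustration.

```python
import re


def top_student(lines):
    """Return (forename, surname, mark) for the highest mark, or None."""
    records = []
    for line in lines:
        # Split on space, dot and colon, dropping empty pieces
        parts = [part for part in re.split(r"[ .:]", line) if part]
        if len(parts) == 3 and parts[2].isdigit():
            records.append((parts[0], parts[1], int(parts[2])))
    # max compares the records by their mark only
    return max(records, key=lambda record: record[2], default=None)


sample = ["Ann.Lee:71", "Bob.Ray:88", "not a record", "Cy.Poe:64"]
print(top_student(sample))
```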

Hbase scan with offset

Is there a way to scan a HBase table getting, for example, the first 100
results, then later get the next 100 and so on... Just like in SQL we do
with LIMIT and OFFSET?
My row keys are uuid
You can do it multiple ways. The easiest one is a page filter. Below is the code example from HBase: The Definitive Guide, page 150.
private static final byte[] POSTFIX = new byte[] { 0x00 };

Filter filter = new PageFilter(15);
int totalRows = 0;
byte[] lastRow = null;
while (true) {
    Scan scan = new Scan();
    scan.setFilter(filter);
    if (lastRow != null) {
        // start just after the last row of the previous page
        byte[] startRow = Bytes.add(lastRow, POSTFIX);
        System.out.println("start row: " + Bytes.toStringBinary(startRow));
        scan.setStartRow(startRow);
    }
    ResultScanner scanner = table.getScanner(scan);
    int localRows = 0;
    Result result;
    while ((result = scanner.next()) != null) {
        System.out.println(localRows++ + ": " + result);
        totalRows++;
        lastRow = result.getRow();
    }
    scanner.close();
    if (localRows == 0) break;
}
System.out.println("total rows: " + totalRows);
Or you can set caching on the scan to the limit you want, and for every subsequent fetch change the start row to the last row + 1 from the previous scan.
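The paging trick itself (restart each scan at the last row key plus a zero byte) does not depend on HBase. The Python sketch below imitates it on an in-memory sorted list of row keys; scan_page is a hypothetical stand-in for a scan with a page limit and uses no HBase API.

```python
import bisect


def scan_page(sorted_keys, page_size, last_row=None):
    """Return the next page of row keys that come after last_row."""
    # Appending a zero byte gives the smallest key strictly greater than last_row
    start = b"" if last_row is None else last_row + b"\x00"
    first = bisect.bisect_left(sorted_keys, start)
    return sorted_keys[first:first + page_size]


keys = sorted(b"row-%02d" % i for i in range(7))
last_row, pages = None, []
while True:
    page = scan_page(keys, 3, last_row)
    if not page:
        break  # an empty page means the table is exhausted
    pages.append(page)
    last_row = page[-1]
print([len(page) for page in pages])
```

Because only the last row key has to be remembered, this gives LIMIT-style paging without a numeric OFFSET, which suits row keys such as UUIDs that carry no position information.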

Algorithm to generate all variants of a word

I would like to explain my problem with the following example.
Assume the word: abc
a has variants: ä, à
b has no variants.
c has variants: ç
So the possible words are:
abc
äbc
àbc
abç
äbç
àbç
Now I am looking for an algorithm that prints all word variations for arbitrary words with arbitrary letter variants.
I would recommend you to solve this recursively. Here's some Java code for you to get started:
static Map<Character, char[]> variants = new HashMap<Character, char[]>() {{
    put('a', new char[] {'ä', 'à'});
    put('b', new char[] { });
    put('c', new char[] { 'ç' });
}};

public static Set<String> variation(String str) {
    Set<String> result = new HashSet<String>();
    if (str.isEmpty()) {
        result.add("");
        return result;
    }
    char c = str.charAt(0);
    for (String tailVariant : variation(str.substring(1))) {
        result.add(c + tailVariant);
        for (char variant : variants.get(c))
            result.add(variant + tailVariant);
    }
    return result;
}
Test:
public static void main(String[] args) {
for (String str : variation("abc"))
System.out.println(str);
}
Output:
abc
àbç
äbc
àbc
äbç
abç
A quickly hacked solution in Python:
def word_variants(variants):
    print_variants("", 1, variants)

def print_variants(word, i, variants):
    if i > len(variants):
        print(word)
    else:
        for variant in variants[i]:
            print_variants(word + variant, i + 1, variants)

variants = dict()
variants[1] = ['a0', 'a1', 'a2']
variants[2] = ['b0']
variants[3] = ['c0', 'c1']
word_variants(variants)
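The recursion can also be replaced by a Cartesian product over the choices at each position. The sketch below uses itertools.product from the standard library and the example from the question; the mapping from a letter to a list of its variants is an assumed input format.

```python
from itertools import product


def word_variants(word, variants):
    """Return every spelling of word, given a letter -> variants mapping."""
    # Each position offers the original letter followed by its variants
    choices = [[letter] + variants.get(letter, []) for letter in word]
    return ["".join(combo) for combo in product(*choices)]


variants = {"a": ["ä", "à"], "c": ["ç"]}
print(word_variants("abc", variants))
```

The number of results is the product of the number of choices per position (here 3 * 1 * 2 = 6), so the list grows exponentially with the word length.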
Common part:
string[] letterEquiv = { "aäà", "b", "cç", "d", "eèé" };
// Here we make a dictionary where the key is the "base" letter and the value is an array of alternatives
var lookup = letterEquiv
.Select(p => p.ToCharArray())
.SelectMany(p => p, (p, q) => new { key = q, values = p }).ToDictionary(p => p.key, p => p.values);
A recursive variation written in C#.
List<string> resultsRecursive = new List<string>();
// I'm using an anonymous method that "closes" around resultsRecursive and lookup. You could make it a standard method that accepts as a parameter the two.
// Recursive anonymous methods must be declared in this way in C#. Nothing to see.
Action<string, int, char[]> recursive = null;
recursive = (str, ix, str2) =>
{
// In the first loop str2 is null, so we create the place where the string will be built.
if (str2 == null)
{
str2 = new char[str.Length];
}
// The possible variations for the current character
var equivs = lookup[str[ix]];
// For each variation
foreach (var eq in equivs)
{
// We save the current variation for the current character
str2[ix] = eq;
// If we haven't reached the end of the string
if (ix < str.Length - 1)
{
// We recurse, increasing the index
recursive(str, ix + 1, str2);
}
else
{
// We save the string
resultsRecursive.Add(new string(str2));
}
}
};
// We launch our function
recursive("abcdeabcde", 0, null);
// The results are in resultsRecursive
A non-recursive version
List<string> resultsNonRecursive = new List<string>();
// I'm using an anonymous method that "closes" around resultsNonRecursive and lookup. You could make it a standard method that accepts as a parameter the two.
Action<string> nonRecursive = (str) =>
{
// We will have two arrays, of the same length of the string. One will contain
// the possible variations for that letter, the other will contain the "current"
// "chosen" variation of that letter
char[][] equivs = new char[str.Length][];
int[] ixes = new int[str.Length];
for (int i = 0; i < ixes.Length; i++)
{
// We start with index -1 so that the first increase will bring it to 0
equivs[i] = lookup[str[i]];
ixes[i] = -1;
}
// The current "working" index of the original string
int ix = 0;
// The place where the string will be built.
char[] str2 = new char[str.Length];
// The loop will break when we will have to increment the letter with index -1
while (ix >= 0)
{
// We select the next possible variation for the current character
ixes[ix]++;
// If we have exhausted the possible variations of the current character
if (ixes[ix] == equivs[ix].Length)
{
// Reset the current character to -1
ixes[ix] = -1;
// And loop back to the previous character
ix--;
continue;
}
// We save the current variation for the current character
str2[ix] = equivs[ix][ixes[ix]];
// If we are setting the last character of the string, then the string
// is complete
if (ix == str.Length - 1)
{
// And we save it
resultsNonRecursive.Add(new string(str2));
}
else
{
// Otherwise we have to do everything for the next character
ix++;
}
}
};
// We launch our function
nonRecursive("abcdeabcde");
// The results are in resultsNonRecursive
Both heavily commented.

Resources