Better ways to traverse a map of maps - algorithm

I'm doing some data analytics aggregations and here is my data structure:
{
  12300 {
    views {
      page-1 {
        link-2 40
        link-6 9
      }
      page-7 {
        link-3 9
        link-11 8
      }
    }
    buttons {
      page-1 {
        link-22 2
      }
    }
  }
  34000 ....
}
Where 12300 and 34000 are time values.
What I want to do is to traverse that data structure and insert entries into a database, something like this:
insert into views (page, link, hits, time) values (page-1, link-2, 40, 12300)
insert into views (page, link, hits, time) values (page-1, link-6, 9, 12300)
What would be an idiomatic way to code that? Am I complicating the data structure? Do you suggest any better way to collect the data?

Assuming you have a jdbc connection from clojure.java.jdbc, this should come close to what you want.
(jdbc/do-prepared "INSERT INTO views (page, link, hits, time) VALUES (?, ?, ?, ?)"
                  (for [[time data] m
                        [data-type page-data] data
                        [page links] page-data
                        [link hits] links]
                    [page link hits time]))
;; why aren't we using data-type, e.g. buttons?
Edit for clarified problem
(let [m '{12300 {views {page-1 {link-2 40
                                link-6 9}
                        page-7 {link-3 9
                                link-11 8}}
                 buttons {page-1 {link-22 2}}}
          34000 {views {page-2 {link-2 5}}}}]
  (doseq [[table rows] (group-by :table (for [[time table] m
                                              [table-name page-data] table
                                              [page links] page-data
                                              [link hits] links]
                                          {:table table-name, :row [page link hits time]}))]
    (jdbc/do-prepared (format "INSERT INTO %s (page, link, hits, time) VALUES (?, ?, ?, ?)" table)
                      (map :row rows))))

Simple solution: take advantage of the fact that you are using maps of maps and use the get-in and assoc-in functions to view/change data. See these for examples:
http://clojuredocs.org/clojure_core/clojure.core/get-in
http://clojuredocs.org/clojure_core/clojure.core/assoc-in
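For example, a minimal sketch assuming the symbol keys shown in the question's structure:
;; reduced version of the question's data
(def m '{12300 {views {page-1 {link-2 40 link-6 9}}}})

(get-in m '[12300 views page-1 link-2])       ;=> 40
(assoc-in m '[12300 views page-1 link-2] 41)  ;=> same structure with that count set to 41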
Advanced solution: use functional zippers. This allows you to traverse and change a tree-like structure in a functional manner.
An example here:
http://en.wikibooks.org/wiki/Clojure_Programming/Examples/API_Examples/Advanced_Data_Structures#zipper
If you've got special data structures, not maps of maps, you can create a zipper yourself by simply implementing the 3 required methods. After that, all zipper functions will work on your data structure, too.
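For the map of maps above, a read-only walk with clojure.zip might look like this minimal sketch; it treats every map as a branch whose children are its values, so keys are ignored in this simplified view:
(require '[clojure.zip :as zip])

(def m '{12300 {views {page-1 {link-2 40 link-6 9}}}})

(def z (zip/zipper map?                        ; branch? - any map can have children
                   vals                        ; children - a map's children are its values
                   (fn [node _children] node)  ; make-node - unused in a read-only walk
                   m))

;; depth-first walk collecting the leaf values (the hit counts)
(->> (iterate zip/next z)
     (take-while (complement zip/end?))
     (map zip/node)
     (remove map?))
;; => (40 9)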

Related

Google Sheets, Sort table based on unique values in a single column

I am currently trying to use Google Sheets to sort a table based on the unique values found in only two of the columns.
What I want to happen is that columns A (Brand) and B (Dimensions) are both checked for unique values, removing any duplicate data.
The problem is using these two columns to filter and show the rest of the table; I can't manage to achieve it.
Original data (screenshot)
What it should look like after being culled (screenshot)
You can use the QUERY function:
=query(A1:D8,"select A, B, min(C), min(D) group by A, B",1)
Example spreadsheet
try:
=ARRAYFORMULA(SORT(SPLIT(TRANSPOSE(QUERY(TRANSPOSE(SORTN(
IF(A2:C<>"", {"♦"&A2:A&"♦"&B2:B, "♦"&C2:D}, ), 99^99, 2, 1, 1))
,,99^99)), "♦"), 2, 1))

Performance-wise for Lua table selection

I'm a bit new to Lua. I have a game in which I need to capture entities and insert them into a table. The maximum number of entities that can exist at the same time is 14, so I read that an array-based solution is good.
But I saw that the table size keeps growing even after deleting values: for example, with 10 values in the table, deleting the value at index 9 does not automatically shrink the table when I want to insert value number 11.
Example:
local Table = {"hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello"}
-- Current Table size = 10
-- Perform delete at index 9
Table[9] = nil
-- Have new Entity to insert
Table[#Table + 1] = "New Value"
-- The table size will keep growing as the game goes on.
So for this type of situation, will an array-based table with nil values inside, which grows as new values are inserted, perform better, or should I move to a table with keys?
Or should I just stick with an array-based table and perform a full cleanup when the table isn't used?
If you set an element in a table to nil, then that just stays there as a "hole" in your array.
tab = {1, 2, 3, 4}
tab[2] = nil
-- tab == {1, nil, 3, 4}
-- #tab is actually undefined and could be either 1 or 4 (or something completely unexpected)!
What you need to do is set the field to nil, then shift all the following fields to fill that hole. Luckily, Lua has a function for that, which is table.remove(table, index).
tab = {1, 2, 3, 4}
table.remove(tab, 2)
-- tab == {1, 3, 4}
-- #tab == 3
Keep in mind that this can get very slow as there's lots of memory access involved, so don't go applying this solution when you have a few million elements some day :)
While table.remove(Table, 9) will do the job in your case (removing the field from the "array" table and shifting the remaining fields to fill the hole), you should first consider using a "set" table instead.
If you:
- often remove/add elements
- don't care about their order
- often check if the table contains a certain element
then the "set" table is your choice. Use it like so:
local tab = {
  ["John"] = true,
  ["Jane"] = true,
  ["Bob"] = true,
}
Your elements will be stored as indices in a table.
Remove an element with
tab["Jane"] = nil
Test if table contains an element with
if tab["John"] then
  -- tab contains "John"
end
Advantages compared to an array table:
- this eliminates the performance overhead of removing an element, because the other elements remain intact and no shifting is required
- checking whether an element exists in this table (which I assume is the main purpose of this table) is also faster than with an array table, because it no longer requires iterating over all the elements to find a match; a hash lookup is used instead (see the contrast sketch below)
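For contrast, a minimal sketch of the linear scan an array-style table would need for the same membership check:
-- Membership test on a plain array-style table: a linear scan over all elements.
local function contains(arr, value)
  for i = 1, #arr do
    if arr[i] == value then
      return true
    end
  end
  return false
end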
Note however that this approach doesn't let you have duplicate values as your elements, because tables can't contain duplicate keys. In that case you can use numbers as values to store the amount of times the element is duplicated in your set, e.g.
local tab = {
  ["John"] = 1,
  ["Jane"] = 2,
  ["Bob"] = 35,
}
Now you have 1 John, 2 Janes and 35 Bobs
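A minimal sketch of add/remove helpers for that count-based variant (the function names here are just illustrative):
-- Illustrative helpers (hypothetical names) for the count-based "set":
local function addEntity(set, name)
  set[name] = (set[name] or 0) + 1
end

local function removeEntity(set, name)
  local count = set[name]
  if count == nil then return end
  if count > 1 then
    set[name] = count - 1
  else
    set[name] = nil  -- last copy removed, drop the key entirely
  end
end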
https://www.lua.org/pil/11.5.html

How to design querying multiple tags on analytics database

I would like to store custom user-purchase tags on each transaction; for example, if a user bought shoes then the tags are "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags the seller is interested in querying back to understand the sales.
My idea is that whenever a new tag comes in, I create a new code (something like a hash code, but sequential) for that tag; codes start with the 26 letters "a-z", then continue "aa, ab, ac ... zz", and so on. I then keep all the tags given in one transaction in a single varchar column called tag, separated by "|".
Let us assume the mapping is (at the application level):
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So, storing the above purchase transaction, the tag column will look like tag="|a|z|ay|bc|cq|". The seller can then search for the number of SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. Now the problem is that I cannot use an index (sort key in Redshift) for a LIKE that starts with %. So how do I solve this issue, given that I might have 100 million records? I don't want a full table scan.
Is there any solution to fix this?
Update_1:
I have not followed the bridge table concept (cross-reference table), since I want to perform a GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but a bridge table would give me two rows, and then my sum() would be doubled.
I got a suggestion like the one below:
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
in the WHERE clause, once for each tag (note: assumes tr is an alias for the transaction table in the surrounding query).
I have not followed this, since I have to perform AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS"), or "SHOE" AND ("NIKE" OR "ADIDAS").
Update_2:
I have not followed the bit-field approach, since I don't know whether Redshift supports it. Also, I'm assuming my system will have a minimum of 3500 tags; allocating one bit for each results in 437 bytes per transaction, even though at most 5 tags can be given for a transaction. Any optimisation here?
Solution_1:
I have thought of adding a min value (SMALLINT) and a max value (SMALLINT) alongside the tags column, and applying an index on those.
So something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching shoe (ay=51) is
maxTag <= 51 AND tag LIKE %|ay|%
And the query for searching shoe (ay=51) AND SIZE_12 (cq=95) is
minTag >= 51 AND maxTag <= 95 AND tag LIKE %|ay|%|cq|%
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files get loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the object using the AWS s3api; example below:
aws s3api put-object-tagging --bucket --key --tagging "TagSet=[{Key=Addidas,Value=AY}]"
(capture the tags dynamically by sending them as parameters)
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1 at that tag's position and a 0 otherwise. For every row you will have a sequence of 0s and 1s with the same length as the number of tags you have. This sequence is sortable; however, you would still need to look into the middle of the string, but you will know at exactly which position to look, so you don't need LIKE, just SUBSTRING. For further optimization, you can convert this bit mask to integer values (it will be unique for each sequence) and do matching based on that, but AFAIK Redshift doesn't support that out of the box yet; you would have to define the rules yourself.
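For example, a sketch under assumed names (a transactions table with a fixed-length tag_mask column of '0'/'1' characters, with the SHOES tag assigned position 51; none of these names come from the question):
-- Sketch only: transactions, tag_mask and position 51 are assumptions, not the question's schema.
SELECT COUNT(*) AS shoe_transactions
FROM transactions
WHERE SUBSTRING(tag_mask, 51, 1) = '1';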
UPD: It looks like the best option here is to keep tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of (order_id, tag_id), distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this one with the order table. Then lookups for orders with a particular tag, and further aggregations over those orders, should be efficient. There is no silver bullet for optimizing this in a flat table, at least none I know of that would not bring a lot of unnecessary complexity versus the "relational" solution.
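A sketch of that layout under assumed names (order_tags, orders and amount are hypothetical); the EXISTS form keeps one row per order, so sums are not doubled, and OR conditions over tags can be expressed the same way:
-- Hypothetical narrow tag table, distributed by order and sorted by tag:
CREATE TABLE order_tags (
    order_id BIGINT,
    tag_id   SMALLINT
)
DISTKEY (order_id)
SORTKEY (tag_id);

-- Orders tagged with both SPORTS (1) and ADIDAS (27), each counted once:
SELECT COUNT(*)      AS matching_orders,
       SUM(o.amount) AS total_amount
FROM orders o
WHERE EXISTS (SELECT 1 FROM order_tags t
              WHERE t.order_id = o.order_id AND t.tag_id = 1)
  AND EXISTS (SELECT 1 FROM order_tags t
              WHERE t.order_id = o.order_id AND t.tag_id = 27);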

Relation Table data structure in Clojure

I am looking for a Clojure data structure that works like a relation table (as in relational databases).
A map (even a bidirectional one) id -> (val1, val2, val3, ...) does not do the job. If I, for example, want to find all rows with val2 = "something", it will take O(n).
But I want to perform a search on a column in O(log n)!
Searching for rows in a database with a column predicate and no index is O(n), as every row has to be checked against the predicate. If there is an index for a column that your predicate uses, then the index can be used to find all the rows for a specific value by looking up that value as the key in the index. That is usually a log(n) operation (it depends on the internal implementation of the index; e.g. for a B-tree it is log(n)).
I am not aware of an out-of-the-box Clojure data structure with such characteristics, as they are usually single-purpose (e.g. a map is an associative data structure for lookup by a single key, not by multiple keys as in a DB with multiple indexes). You would rather need a library providing some kind of in-memory database, for example (as mentioned by Thumbnail in his comment) DataScript, or even an in-memory SQL DB with a JDBC interface (e.g. H2, HSQLDB or Derby using their in-memory stores).
I am not sure of your specific requirements, but you could also implement some of the features yourself using basic API from Clojure. For example, you could use a Clojure set as your "table" and enhance it with some functions from clojure.set:
Your table:
(def my-table #{{:id 1 :name "John" :age 30 :gender :male}
                {:id 2 :name "Jane" :age 25 :gender :female}
                {:id 3 :name "Joe" :age 40 :gender :male}})
And specify your indices:
(def by-id (clojure.set/index my-table [:id]))
(def by-age (clojure.set/index my-table [:age]))
(def by-gender (clojure.set/index my-table [:gender]))
And then use your indices when querying/filtering your table:
(clojure.set/intersection
  (by-age {:age 30})
  (by-gender {:gender :male}))
;; => #{{:id 1, :name "John", :age 30, :gender :male}}

MongoDB ranged pagination

It's said that using skip() for pagination in a MongoDB collection with many records is slow and not recommended.
Ranged pagination (based on a >_id comparison) can be used instead:
db.items.find({_id: {$gt: ObjectId('4f4a3ba2751e88780b000000')}});
It's good for displaying prev. & next buttons - but it's not very easy to implement when you want to display actual page numbers 1 ... 5 6 7 ... 124 - you need to pre-calculate from which "_id" each page starts.
So I have two questions:
1) When should I start worrying about that? When are there "too many records", with a noticeable slowdown for skip()? 1 000? 1 000 000?
2) What is the best approach to show links with actual page numbers when using ranged pagination?
Good question!
"How many is too many?" - that, of course, depends on your data size and performance requirements. I, personally, feel uncomfortable when I skip more than 500-1000 records.
The actual answer depends on your requirements. Here's what modern sites do (or, at least, some of them).
First, navbar looks like this:
1 2 3 ... 457
They get the final page number from the total record count and the page size. Let's jump to page 3. That will involve some skipping from the first record. When the results arrive, you know the id of the first record on page 3.
1 2 3 4 5 ... 457
Let's skip some more and go to page 5.
1 ... 3 4 5 6 7 ... 457
You get the idea. At each point you see first, last and current pages, and also two pages forward and backward from the current page.
Queries
var current_id; // id of first record on current page.
// go to page current+N
db.collection.find({_id: {$gte: current_id}}).
  skip(N * page_size).
  limit(page_size).
  sort({_id: 1});
// go to page current-N
// note that due to the nature of skipping back,
// this query will get you records in reverse order
// (last records on the page being first in the resultset)
// You should reverse them in the app.
db.collection.find({_id: {$lt: current_id}}).
  skip((N-1)*page_size).
  limit(page_size).
  sort({_id: -1});
It's hard to give a general answer because it depends a lot on what query (or queries) you are using to construct the set of results that are being displayed. If the results can be found using only the index and are presented in index order then db.dataset.find().limit().skip() can perform well even with a large number of skips. This is likely the easiest approach to code up. But even in that case, if you can cache page numbers and tie them to index values you can make it faster for the second and third person that wants to view page 71, for example.
In a very dynamic dataset where documents will be added and removed while someone else is paging through data, such caching will become out-of-date quickly and the limit and skip method may be the only one reliable enough to give good results.
I recently encountered the same problem when trying to paginate a request while using a field that wasn't unique, for example "FirstName". The idea of this query is to be able to implement pagination on a non-unique field without using skip().
The main problem here is being able to query on a field that is not unique ("FirstName"), because the following will happen:
$gt: {"FirstName": "Carlos"} -> this will skip all the records where first name is "Carlos"
$gte: {"FirstName": "Carlos"} -> will always return the same set of data
Therefore the solution I came up with was making the $match portion of the query unique by combining the targeted search field with a secondary field in order to make it a unique search.
Ascending order:
db.customers.aggregate([
  {$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$gt: 'Carlos'}}]}},
  {$sort: {'FirstName': 1, '_id': 1}},
  {$limit: 10}
])
Descending order:
db.customers.aggregate([
  {$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$lt: 'Carlos'}}]}},
  {$sort: {'FirstName': -1, '_id': 1}},
  {$limit: 10}
])
The $match part of this query is basically behaving as an if statement:
if firstName is "Carlos" then it needs to also be greater than this id
if firstName is not equal to "Carlos" then it needs to be greater than "Carlos"
The only problem is that you cannot navigate to a specific page number (it can probably be done with some code manipulation), but other than that it solved my problem with pagination on non-unique fields without having to use skip(), which eats a lot of memory and processing power when you get toward the end of whatever dataset you are querying.
