Can someone advise on an HBase schema for click stream data - hadoop

I would like to create a click stream application using HBase. In SQL this would be a pretty simple task, but in HBase I haven't got the first clue. Can someone advise me on a schema design and the keys to use in HBase?
I have provided a rough data model and several questions that I would like to ask of the data.
Questions I would like to answer from the data:
What events led to a conversion?
What was the last page, and how many pages were viewed?
On which pages does a customer drop off?
What products does a male customer between 20 and 30 like to buy?
Is a customer who has bought product x also likely to buy product y?
What is the conversion amount from the first page?
{
PageViews: [
{
date: "19700101 00:00",
domain: "http://foobar.com",
path: "pageOne.html",
timeOnPage: "10",
pageViewNumber: 1,
events: [
{ name: "slideClicked", value: 0, time: "00:00"},
{ name: "conversion", value: 100, time: "00:05"}
],
pageData: {
category: "home",
pageTitle: "Home Page"
}
},
{
date: "19700101 00:01",
domain: "http://foobar.com",
path: "pageTwo.html",
timeOnPage: "20",
pageViewNumber: 2,
events: [
{ name: "addToCart", value: 50.00, time: "00:02"}
],
pageData: {
category: "product",
pageTitle: "Mans Shirt",
itemValue: 50.00
}
},
{
date: "19700101 00:03",
domain: "http://foobar.com",
path: "pageThree.html",
timeOnPage: "30",
pageViewNumber: 3,
events: [],
pageData: {
category: "basket",
pageTitle: "Checkout"
}
}
],
Customer: {
IPAddress: "127.0.0.1",
Browser: "Chrome",
FirstName: "John",
LastName: "Doe",
Email: "john.doe#email.com",
isMobile: 1,
returning: 1,
age: 25,
sex: "Male"
}
}

Well, your data is mainly in a one-to-many relationship: one customer and an array of page view entities. Since all your queries are customer-centric, it makes sense to store each customer as a row in HBase and have the customer ID (maybe the email in your case) as part of the row key.
If you decide to store one row per customer, the details of each page view would be stored as nested entities. The video link regarding HBase design will help you understand that. So for your example above, you get one row and three nested entities.
Another approach would be a denormalized form, which lets HBase perform good lookups. Here each row would be a page view, and the customer data gets appended to every row. So for your example above, you end up with three rows. Data would be duplicated. Again, the video covers that too (compression and so on).
You have more nested levels inside each page view: events and pageData. So denormalization will only get worse. As everything in HBase is a key-value pair, it is difficult to query and match these nested levels. I hope this helps you get started.
Good video link here
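To make the denormalized option more concrete, here is a minimal sketch in plain Python (no HBase client involved) that flattens one session into one row per page view. The row key layout `<customer_id>|<date>|<pageViewNumber>` and the column family names `pv`, `pd`, `ev` and `cust` are assumptions made for illustration; they are not taken from the answer above.

```python
from typing import Dict


def page_view_rows(session: dict, customer_id: str) -> Dict[str, Dict[str, str]]:
    """Flatten one session into one row per page view.

    Row key layout (an assumption for this sketch):
        <customer_id>|<date>|<zero-padded pageViewNumber>
    Column names follow the HBase "family:qualifier" convention.
    """
    rows: Dict[str, Dict[str, str]] = {}
    customer = session["Customer"]
    for view in session["PageViews"]:
        row_key = f'{customer_id}|{view["date"]}|{view["pageViewNumber"]:04d}'
        cells = {
            "pv:domain": view["domain"],
            "pv:path": view["path"],
            "pv:timeOnPage": str(view["timeOnPage"]),
        }
        # Nested page data and events are flattened into column qualifiers.
        for name, value in view["pageData"].items():
            cells[f"pd:{name}"] = str(value)
        for i, event in enumerate(view["events"]):
            cells[f"ev:{i:02d}_{event['name']}"] = str(event["value"])
        # Customer attributes are duplicated onto every page-view row.
        for name, value in customer.items():
            cells[f"cust:{name}"] = str(value)
        rows[row_key] = cells
    return rows


session = {
    "PageViews": [
        {
            "date": "19700101 00:00", "domain": "http://foobar.com",
            "path": "pageOne.html", "timeOnPage": "10", "pageViewNumber": 1,
            "events": [
                {"name": "slideClicked", "value": 0, "time": "00:00"},
                {"name": "conversion", "value": 100, "time": "00:05"},
            ],
            "pageData": {"category": "home", "pageTitle": "Home Page"},
        },
        {
            "date": "19700101 00:01", "domain": "http://foobar.com",
            "path": "pageTwo.html", "timeOnPage": "20", "pageViewNumber": 2,
            "events": [{"name": "addToCart", "value": 50.00, "time": "00:02"}],
            "pageData": {"category": "product", "pageTitle": "Mans Shirt"},
        },
    ],
    "Customer": {"Browser": "Chrome", "age": 25, "sex": "Male"},
}

rows = page_view_rows(session, "john.doe")
for key in sorted(rows):
    print(key, sorted(rows[key]))
```

Because every key starts with the customer ID, a prefix scan on that ID would return all of a customer's page views in date order, which fits the customer-centric questions listed in the question. The cost is the duplicated `cust:` cells on every row.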

Related

GraphQL named Object literal typeDef

I'm working through a personal project and I'm trying to figure out the best way to define an Object of Objects that looks like the following in GraphQL's type definitions.
{
"2020-12-29": {
open: true,
hours: 2,
appointments: {
"09:00-am": {
appointmentId: "5223a4ef-a3cf-4e2f-b761-3e06193e2e21",
userName: "Shaun Cartwright Glover",
email: "Maverick.Quitzon31#yahoo.com",
phoneNumber: "1-401-519-4771",
avatar: "https://s3.amazonaws.com/uifaces/faces/twitter/ffbel/128.jpg",
},
},
totalAppointments: 1,
},
}
As you can see, the key of the top-level schedule object is the date, and the same is done with the time for each individual appointment. I'm also using GraphQL with Prisma, if that helps.
To follow along with your example, let's break down your Object literal into the entities that it represents. At a high level, you're looking at:
An entity representing an Appointment
The User associated with said Appointment
A list of Appointments, associated with a certain AppointmentTime
A list of AppointmentTimes for a given Day
A list of Days, in the form of a Schedule
To me, this seems to closely (but not exactly) match the data in the example you have. Based on this, I've defined a schema below, with some intentional design decisions that deviate slightly from your proposal.
Let's define some types, and knit them together, starting with a User type:
type User {
id: ID!
name: String!
email: String!
# Depending on your requirements, a user may not have to provide a phone number.
phoneNumber: String
# Depending on your requirements, a user may not have an Avatar.
avatarUrl: String
}
type Appointment {
id: ID!
user: User!
}
type AppointmentTime {
time: String!
appointments: [Appointment!]!
}
type Day {
# The Day ID could be the actual day itself, i.e. 2020-12-29
id: ID!
open: Boolean!
hours: Int!
appointmentTimes: [AppointmentTime!]!
}
type Schedule {
days: [Day!]!
}
This would allow you to write a query (assuming you have a getSchedule query, or something to a similar effect) like this:
getSchedule {
days {
id
open
hours
appointmentTimes {
time
appointments {
id
user {
name
email
phoneNumber
avatarUrl
}
}
}
}
}
That query would return a result shaped like this:
{
days: [
{
id: "2020-12-29",
open: true,
hours: 2,
appointmentTimes: [
{
time: "09:00-am",
appointments: [
{
id: "5223a4ef-a3cf-4e2f-b761-3e06193e2e21",
user: {
name: "John Smith",
email: "john#smith.com",
phoneNumber: "123...",
avatarUrl: "...",
}
},
...
]
},
...
]
},
...
]
}
Note that this will end up producing a slightly different output than the one you posted. Why?
Well, I made the following design choices, and I'd encourage you to investigate them, too:
A user should be a separate field. In your example, the user and the appointment information are both under the same key (09:00-am). Here, we want to leverage GraphQL's type system to normalize the schema by defining a User type that we can attach to an appointment. This is better for introspection, too.
Your appointments key points to another object as its value, not a list. Since you're returning a list of appointments at the end of the day, you should model this as a GraphQL List.
I added an AppointmentTime type associated with a list of appointments. This allows you to have multiple appointments at the same time, which is more future-proof.
Each day has a list of AppointmentTime values. This means you are no longer dependent on the key (in your case, 09:00-am) to define the data associated with each appointment time, which is also more future-proof.
If you really did want the object literal to match the GraphQL output exactly, you could inline some of the fields I chose to extract into other types, but really, you should be leveraging lists for this kind of thing.
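If your data is already stored in the date-keyed shape from the question, a resolver would need to reshape it into lists before returning it. Here is a minimal sketch of that reshaping in plain Python. The function name `to_schedule` is hypothetical, and the field mapping (for example `avatar` to `avatarUrl`) simply follows the types defined above.

```python
def to_schedule(days_by_date: dict) -> dict:
    """Convert the date-keyed object literal into the list-based Schedule shape."""
    days = []
    for date, day in days_by_date.items():
        appointment_times = []
        for time, appt in day.get("appointments", {}).items():
            appointment_times.append({
                "time": time,
                # The source holds one appointment per time slot; wrapping it
                # in a list leaves room for several at the same time.
                "appointments": [{
                    "id": appt["appointmentId"],
                    "user": {
                        "name": appt["userName"],
                        "email": appt["email"],
                        "phoneNumber": appt.get("phoneNumber"),
                        "avatarUrl": appt.get("avatar"),
                    },
                }],
            })
        days.append({
            "id": date,  # the date doubles as the Day ID
            "open": day["open"],
            "hours": day["hours"],
            "appointmentTimes": appointment_times,
        })
    return {"days": days}


source = {
    "2020-12-29": {
        "open": True,
        "hours": 2,
        "appointments": {
            "09:00-am": {
                "appointmentId": "5223a4ef-a3cf-4e2f-b761-3e06193e2e21",
                "userName": "Shaun Cartwright Glover",
                "email": "user@example.com",
                "phoneNumber": "1-401-519-4771",
            },
        },
        "totalAppointments": 1,
    },
}

schedule = to_schedule(source)
print(schedule["days"][0]["appointmentTimes"][0]["time"])
```

Note that the sketch drops totalAppointments, since it can be derived from the lists. The user in this sample has no avatar, so avatarUrl comes back as None, which matches the nullable field in the User type.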

Linking 3 types of document for a view

I'm struggling with linked documents when creating a view.
A salesperson has multiple clients, and each client has multiple purchases.
I need to get a view containing:
salesperson ids for each client purchase.
In a relational database I would join:
purchase.clientid -> client._id
client.salesperson -> salesperson._id
Given:
{ _id: "1", type: "purchase", clientid: "2", items: [] }
{ _id: "2", type: "client", salespersonid: "3", name: "Chris the client" }
{ _id: "3", type: "salesperson", name: "Simon the salesperson" }
I've tried reading a lot of stuff, but nothing has clicked. How would I do this in a view?
{
_id: 'purchase-client-2-<unique-purchase-id>',
salespersonId: 'sales-3'
}
{
_id: 'sales-3',
name: 'Simon the salesperson'
}
{
_id: 'client-2',
name: 'Chris the client'
}
With the above documents you could query for all documents whose ID starts with 'purchase-client-2-' to get an array of purchase documents. Each purchase document then tells you who the salesperson was. Depending on the number of sales staff, you may already have everything you need right there, assuming your map of sales ID to name is already in memory.
If not, you could do a further lookup (and potentially cache that result). If neither the in-memory lookup nor the extra lookup works for you, you could also duplicate the salesperson's name in the purchase document. After all, NoSQL DBs don't follow the same rules as relational DBs, and it's OK to duplicate now and again. You just have to think about how you keep the duplicates in sync later.
If you can use and abuse the ID field and get away without views, then you may be better off. Views bring their own set of problems. Good luck!
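The ID-prefix lookup can be sketched without a database. The Python below keeps the documents in a dict and uses a sorted ID list to imitate a key-range query. In CouchDB the equivalent would typically be a range on `_all_docs` using `startkey` and `endkey`, with a high Unicode character such as `\ufff0` appended to the prefix as the upper bound. The purchase IDs and helper names here are made up for illustration.

```python
from bisect import bisect_left


def ids_with_prefix(sorted_ids, prefix):
    """Return all IDs that start with `prefix`, using a key range on sorted IDs."""
    start = bisect_left(sorted_ids, prefix)
    # "\ufff0" sorts after the characters used in these IDs, so it closes the range.
    end = bisect_left(sorted_ids, prefix + "\ufff0")
    return sorted_ids[start:end]


def salespeople_for_client(docs, client_number):
    """Map each purchase ID of one client to the salesperson's name."""
    sorted_ids = sorted(docs)
    # The trailing hyphen keeps client 2 from also matching client 20.
    prefix = f"purchase-client-{client_number}-"
    result = {}
    for purchase_id in ids_with_prefix(sorted_ids, prefix):
        sales_id = docs[purchase_id]["salespersonId"]
        result[purchase_id] = docs[sales_id]["name"]
    return result


docs = {
    "client-2": {"name": "Chris the client"},
    "sales-3": {"name": "Simon the salesperson"},
    "purchase-client-2-a1": {"salespersonId": "sales-3"},
    "purchase-client-2-b7": {"salespersonId": "sales-3"},
    "purchase-client-20-c3": {"salespersonId": "sales-3"},
}

print(salespeople_for_client(docs, 2))
```

The second lookup (`docs[sales_id]`) is the "further lookup" mentioned above. Caching it, or copying the name into the purchase document, would remove it.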

RethinkDB: JOIN with dynamic property/field names

The solution to this RethinkDB query has eluded me for days now. It doesn't help that I'm a T-SQL developer...
Given a document structure like this:
var messages = {
"Category A" : {
catName: 'Category A',
userInfo: {
userName: 'Hoot',
userId: 12345,
count: 77
}
},
"Category FOO" : {
catName: 'Category FOO',
userInfo: {
userName: 'Woot',
userId: 67890,
count: 42
}
}
}
where "Category A" and "Category FOO" could be any string entered by a user in the app, basically a dynamic property. This is great for iterating on client side JS using square bracket notation. (messages["Category A"]).
I need to update, using ReQL, the userInfo.count field by "joining" to another table UserMessages on userId field, which exists in both tables.
I've managed to add the result (e.g: "12345":77, "67890": 42 ) as a subdocument to "messages" object using r.forEach() and r.object().
But I must have the result added to the userInfo.
EDIT: To add clarity: I struggle to navigate to the userId if I don't know the top-level property, i.e.:
r.table('messages')('???????????')('userInfo')('userId')
Any suggestions would be greatly appreciated.
Thank you.
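To pin down the intended result, here is the update expressed on plain Python dicts rather than in ReQL. It is only a sketch of the logic: it iterates over the top-level keys without knowing them in advance and overwrites userInfo.count wherever the userId has a match. The `counts_by_user` mapping stands in for whatever the UserMessages table yields per userId, and that shape is an assumption. On the server side, ReQL's `keys` command lists an object's field names and might play the role of the loop here.

```python
def apply_counts(messages: dict, counts_by_user: dict) -> dict:
    """Return a copy of `messages` with userInfo.count replaced per userId."""
    updated = {}
    # The category names are dynamic, so iterate instead of naming them.
    for category, entry in messages.items():
        user_info = dict(entry["userInfo"])
        user_id = user_info["userId"]
        if user_id in counts_by_user:
            user_info["count"] = counts_by_user[user_id]
        updated[category] = {**entry, "userInfo": user_info}
    return updated


messages = {
    "Category A": {
        "catName": "Category A",
        "userInfo": {"userName": "Hoot", "userId": 12345, "count": 0},
    },
    "Category FOO": {
        "catName": "Category FOO",
        "userInfo": {"userName": "Woot", "userId": 67890, "count": 0},
    },
}

# Hypothetical per-user counts derived from the UserMessages table.
counts_by_user = {12345: 77, 67890: 42}

print(apply_counts(messages, counts_by_user))
```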

How to structure firebase data for multiple views

I'm new to Firebase and I'm building my first app on it so thought I'd ask if my current plans for the app's data structure make sense.
I've read the Firebase blog posts and several answers on SO which have helped me understand the concept of "optimise for the way the data will be read". However, my data will be read in a few different ways and it feels like I may be over complicating things.
Background
The app is like a directory for businesses in multiple towns (schemes) to promote their upcoming events and offers. I think of the data hierarchy like this:
Scheme: A town (the app has multiple schemes)
Category: A group of businesses around a theme (e.g. shoe shops)
Business: An administrative organisation (handles billing etc). Each business can have multiple locations (shops in different towns).
Location: A shop in a town.
Event: Each location can promote events. An event can be promoted at multiple locations but not necessarily all of a business's locations.
Offer: Similar to an event but a different type of object.
Viewing the data
The app user can view the offer & event data in 5 ways:
specific to a business (e.g. Joe's shoes' offers)
for a scheme (e.g. all offers in Smalltown)
for the whole app (e.g. all offers anywhere)
in a category in a scheme (e.g. all shoe offers in Smalltown)
in a category in the whole app (e.g. all shoe offers anywhere)
In addition, I need to make sure that an administrator from each business can see/edit all of their business's data via a CMS I'm also building.
My approach
This is the data structure I'm thinking of using:
root {
schemes{
scheme1{
name: "smalltown",
logo: "base64 data",
bgcolor: "#FF0000"
},
scheme2{...}
},
businesses{
business1{
name: "Joe's Shoes",
logo: "base64 data",
locations: {
location1: true,
location3: true,
location15: true
},
address_hq: {
street: "45 Acacia Avenue",
town: "Bigtown",
postcode: "BT1 1JS"
},
contact_hq: {
name: "Joe Simpson",
position: "Owner",
email: "joe#joesshoes.com",
tel: "07123 456789"
},
subscription: {
plan: "Standard",
date_start: "10/10/2015",
date_renewal: "10/10/2016"
},
owner: "james1"
},
business2{...}
},
locations{
location1{
name: "Joe's Shoes",
logo: "base64 data",
scheme: "scheme1",
events: {
event1: true,
event27: true
},
offers: {
offer1: true,
offer6: true
},
business: "business1",
owner: "james1"
},
location2{...}
},
events{
event1{
schemes: {
scheme1: true,
scheme4: true
},
locations: {
location1: true,
location21: true
},
categories: {
shoes: true,
footwear: true,
fashion: true
},
business: "business1",
date: "5/5/2016",
title: "The History of Shoes",
description: "A fascinating talk about the way shoes have...",
image: "base64 data",
venue: {
street: "Great Hotel",
town: "Bigtown",
postcode: "BT1 1JS"
},
price: "£10"
},
event2{...}
},
offers{
offer1{
schemes: {
scheme1: true,
scheme4: true
},
locations: {
location1: true,
location21: true
},
categories: {
shoes: true,
footwear: true,
fashion: true
},
business: "business1",
date_start: "5/5/2016",
date_end: "5/5/2016",
title: "All children's shoes Half Price",
description: "Get 50% off all children's shoes - just in time for the summer",
image: "base64 data",
},
offer2{...}
}
}
Here's a graphic of a similar data structure in case it's easier to read:
My question is whether I need to denormalise the data further (repeat more data in more places) or is there a better way to think about this altogether?
It feels like I'm getting potential complications from having to keep data in sync without the ability to simply read from a single place (e.g. I'll need to use queries and indexes (?) to combine location and event data for scheme-wide event listings).
Any advice on making this data structure more efficient would be great.
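One way to picture "denormalising further" is an extra index node keyed by scheme and then category, written alongside each offer. The Python sketch below derives such an index from the offers node shown above. The function name and the idea of a separate index node are illustrative assumptions, not something the Firebase SDK provides.

```python
def build_offer_index(offers: dict) -> dict:
    """Build scheme -> category -> {offer_id: True} from the offers node."""
    index = {}
    for offer_id, offer in offers.items():
        for scheme in offer.get("schemes", {}):
            for category in offer.get("categories", {}):
                # Same "id: true" convention as the structure in the question.
                index.setdefault(scheme, {}).setdefault(category, {})[offer_id] = True
    return index


offers = {
    "offer1": {
        "schemes": {"scheme1": True, "scheme4": True},
        "categories": {"shoes": True, "fashion": True},
        "title": "All children's shoes Half Price",
    },
    "offer2": {
        "schemes": {"scheme1": True},
        "categories": {"shoes": True},
        "title": "Hypothetical second offer",
    },
}

index = build_offer_index(offers)
print(sorted(index["scheme1"]["shoes"]))
```

With such a node, "all shoe offers in Smalltown" becomes a single read of one path. The trade-off is the one raised in the question: every write to an offer's schemes or categories must also update the index.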

Mapping relationships where the JSON representation can hold either a nested object or an object ID (in RestKit)

Is it possible to get RestKit to properly map relationships where the property might contain either a nested object or an object ID?
The API I'm currently working with often returns deep cyclical object graphs where multiple occurrences of the same object are replaced with the object's ID.
A simple example:
Nested
[
{
id: 1,
title: "Post Title",
category: {
id: 1,
name: "Category Title"
}
},
{
id: 2,
title: "Post Title",
category: 1
}
]
Since the replacement can happen at any level, manually fixing the JSON before mapping is not very practical.

Resources