Different sizes for the same view on CouchDB - performance

I have two databases containing similar data (organized differently), and I've created a view on each that returns the same result. I've noticed that the response time of the query differs even though the output is identical: roughly 3182 ms for one and 217 ms for the other, averaged over 5 queries.
I query both using:
curl -X GET ...db1/_design/query1/_view/q1?group=true
curl -X GET ...db2/_design/query1/_view/q1?group=true
I have checked the data sizes of the design documents using curl -X GET ...db1/_design/query1/_info. The reported data size is 146073878 bytes for the first design document and 3739596 bytes for the second.
I thought both should have the same size, because they return the same view and I haven't used any filters; both views are equivalent.
Can somebody explain why the same view, created on two different databases, ends up with different sizes?
My data is organized using two different roots, but it is the same data, with only the root changing:
Customer data in the root:
{
"c_customer_sk": 65836,
"c_first_name": "Frank",
"c_last_name": "White",
"store_sales": [
{
"ss_sales_price": 20.24,
"ss_ext_sales_price": 1012,
"ss_coupon_amt": 0,
"date": [
{
"d_month_seq": 1187,
"d_year": 1998
}
],
"item": [
{
"i_item_sk": 10454,
"i_item_id": "AAAAAAAAGNICAAAA",
"i_item_desc": "Results highlight as patterns; so right years show. Sometimes suitable lips move with the critics. English, old mothers ought to lift now perhaps future managers. Active, single ch",
"i_current_price": 2.88,
"i_class": "romance",
"i_category_id": 9,
"i_category": "Books"
}
]
},
{
"ss_sales_price": 225,
"ss_ext_sales_price": 1023,
"ss_coupon_amt": 0,...
View function for customer in the root:
function(doc) {
  for each (var store_sales in doc.store_sales) {
    var s = store_sales.ss_ext_sales_price;
    if (s == null) { s = 0; }
    var item_id, item_desc, category, item_class, price;
    for each (var item in store_sales.item) {
      item_id = item.i_item_id;
      item_desc = item.i_item_desc;
      category = item.i_category;
      item_class = item.i_class; // "class" is a reserved word, so a different variable name is used
      price = item.i_current_price;
    }
    if (category == "Music" || category == "Home" || category == "Sports") {
      var g;
      for each (var date in store_sales.date) {
        g = date.d_month_seq;
      }
      if (g >= 1200 && g <= 1211) {
        emit({item_id: item_id, item_desc: item_desc, category: category, class: item_class, current_price: price}, s);
      }
    }
  }
}
reduce:_sum
Example of answer:
key:
{"item_id": "AAAAAAAAAAAEAAAA", "item_desc": "Rates expect probably necessary events. Circumstan", "category": "Sports", "class": "optics", "current_price": 3.99}
Value:
106079.49999999999
Item data in the root:
{
"i_item_sk": 10454,
"i_item_id": "AAAAAAAAGNICAAAA",
"i_item_desc": "Results highlight as patterns; so right years show. Sometimes suitable lips move with the critics. English, old mothers ought to lift now perhaps future managers. Active, single ch",
"i_current_price": 2.88,
"i_class": "romance",
"i_category_id": 9,
"i_category": "Books",
"store_sales": [
{
"ss_sales_price": 20.24,
"ss_ext_sales_price": 1012,
"ss_coupon_amt": 0,
"date": [
{
"d_month_seq": 1187,
"d_year": 1998
}
],
"customer": [
{
"c_customer_sk": 65836,
"c_first_name": "Frank",
"c_last_name": "White",
}
]
},
{
"ss_sales_price": 225,
"ss_ext_sales_price": 1023,
"ss_coupon_amt": 0,...
View function for item in the root:
function(doc) {
  var item_id = doc.i_item_id;
  var item_desc = doc.i_item_desc;
  var category = doc.i_category;
  var item_class = doc.i_class; // "class" is a reserved word, so a different variable name is used
  var price = doc.i_current_price;
  if (category == "Music" || category == "Home" || category == "Sports") {
    for each (var store_sales in doc.store_sales) {
      var s = store_sales.ss_ext_sales_price;
      if (s == null) { s = 0; }
      var g;
      for each (var date in store_sales.date) {
        g = date.d_month_seq;
      }
      if (g >= 1200 && g <= 1211) {
        emit({item_id: item_id, item_desc: item_desc, category: category, class: item_class, current_price: price}, s);
      }
    }
  }
}
reduce:_sum
Both views return the same answer.
I have run cleanup and compaction on both design documents; the database with the item data in the root responds much faster and its view data size is smaller too, but I don't know why.
Can someone explain?

Could it be a difference in database compaction? When you replicate an existing database to an empty one, only the last revision of each document is sent to the new one, which can make it much lighter. The same applies to views.
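For reference, a quick sketch of how to compact a view index and clean up stale index files in CouchDB (assuming a local instance on the default port; db1 and query1 follow the names used in the question):
curl -X POST http://localhost:5984/db1/_compact/query1 -H "Content-Type: application/json"
curl -X POST http://localhost:5984/db1/_view_cleanup -H "Content-Type: application/json"
Comparing the data size reported by _info before and after compaction shows how much of the difference is simply uncompacted index data.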

Related

Highstock dataGrouping not working with live data

I am currently working on a project for my company where I need to plot Highstock charts that show energy data for our main buildings.
Since it is live data, new data points arrive over a WebSocket every few seconds. However, the graph should only show one data point per hour. I wanted to solve this with Highstock's dataGrouping, but it does not really work. It groups the points, yes, but it still draws the connecting graph line between them, which makes the whole graph completely unreadable.
In another version of the project, the graph only shows the latest data point of each group (as specified by the "approximation" function in the chart options), but it also does not start a new group after the chosen interval has elapsed.
I've been sitting on this problem for about 3 days now and have not found a properly working solution yet.
Unfortunately, due to company policy and because of hooks and components that are only used internally, I'm not able to give you a jsfiddle or similar, even though I'd really love to. What I can do is give you the config; maybe you'll find something wrong there?
const options = {
plotOptions: {
series: {
dataGrouping: {
anchor: 'end',
approximation: function (groupData: unknown[]) {
return groupData[groupData.length - 1];
},
enabled: true,
forced: true,
units: [['second', [15]]],
},
marker: {
enabled: false,
radius: 2.5,
},
pointInterval: minutesToMilliseconds(30),
pointStart: currentWeekTraversed?.[0]?.[0],
},
},
}
This would be the plotOptions.
If you need any more information, let me know and I'll see what I can send you and how.
Thank you for helping. ^^
Here is an example of how dataGrouping works with live data;
try to recreate your case on top of it, or use another demo from the official Highcharts React wrapper page.
rangeSelector: {
allButtonsEnabled: true,
buttons: [{
type: 'minute',
count: 15,
text: '15S',
preserveDataGrouping: true,
dataGrouping: {
forced: true,
units: [
['second', [15]]
]
}
}, {
type: 'hour',
count: 1,
text: '1M',
preserveDataGrouping: true,
dataGrouping: {
forced: true,
units: [
['minute', [1]]
]
}
}]
},
Demo: https://jsfiddle.net/BlackLabel/sr3oLkvu/

Apollo mixes two different arrays of the same query seemingly at random

With a schema like
schema {
query: QueryRoot
}
scalar MyBigUint
type Order {
id: Int!
data: OrderCommons!
kind: OrderType!
}
type OrderBook {
bids(limit: Int): [Order!]!
asks(limit: Int): [Order!]!
}
type OrderCommons {
quantity: Int!
price: MyBigUint! # it doesn't matter whether it's MyBigUint or a plain Int - the issue occurs either way
}
enum OrderType {
BUY
SELL
}
type QueryRoot {
orderbook: OrderBook!
}
And a query: query { orderbook { bids { data { price } }, asks { data { price } } } }
In a graphql playground of my graphql API (and on the network level of my Apollo app too) I receive a result like
{
"data": {
"orderbook": {
"bids": [
{
"data": {
"price": "127"
}
},
{
"data": {
"price": "74"
}
},
...
],
"asks": [
{
"data": {
"price": "181"
}
},
{
"data": {
"price": "187"
}
},
...
]
}
}
}
where, for the purpose of this question, the bids are ordered in descending order by price like ["127", "74", "73", "72"], etc, and asks are ordered in ascending order, accordingly.
However, in Apollo, after a query is done, I notice that one of the arrays gets seemingly random data.
For the purpose of the question, useQuery react hook is used, but the same happens when I query imperatively from a freshly initialized ApolloClient.
const { data, subscribeToMore, ...rest } = useQuery<OrderbookResponse>(GET_ORDERBOOK_QUERY);
console.log(data?.orderbook?.bids?.map(r => r.data.price));
console.log(data?.orderbook?.asks?.map(r => r.data.price));
Here, corrupted bids data gets printed, e.g. ['304', '306', '298', '309', '277', '153', '117', '108', '87', '76'] (notice that, at the very least, the order is wrong), whereas the asks data looks just fine. Inspecting the network, I find that the bids are not only properly ordered there, but also have different (correct, straight from the DB) values!
Therefore, it seems something's getting corrupted on the way while Apollo delivers the data.
What could be the issue here I wonder, and where to start debugging such kind of an issue? There seem to be no warnings from Apollo either, it seems to just silently corrupt the data.
I'm clearly doing something wrong, but what?
The issue seems to stem from how Apollo caches data.
My bids and asks can have the same numeric IDs while sharing the same Order GraphQL type. By its default normalization rules, Apollo assumes a bid and an ask with the same ID are the same object, and the resulting data gets wrecked as a consequence.
An easy fix is to give Apollo a composite key for the Order type when initializing the cache:
cache: new InMemoryCache({
typePolicies: {
Order: {
keyFields: ['id', 'kind'],
}
}
})
This way it will understand that an Order from bids and an Order from asks with the same id are indeed different pieces of data.
Note that the kind field must also be added to the query strings accordingly.
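For illustration (not verbatim from the question), the query would then also select the key fields so the cache can compute the composite key:
query {
  orderbook {
    bids { id kind data { price } }
    asks { id kind data { price } }
  }
}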

Group user events into activity sessions

You’re in charge of implementing a new analytics “sessions” view. You’re given a set of data that consists of individual web page visits, along with a visitorId which is generated by a tracking cookie that uniquely identifies each visitor. From this data we need to generate a list of sessions for each visitor.
The data set looks like this:
"events": [
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754583000
},
{
"url": "/pages/a-small-dog",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754631000
},
{
"url": "/pages/a-big-talk",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709065294
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512711000000
},
{
"url": "/pages/a-big-river",
"visitorId": "d1177368-2310-11e8-9e2a-9b860a0d9039",
"timestamp": 1512754436000
},
{
"url": "/pages/a-sad-story",
"visitorId": "f877b96c-9969-4abc-bbe2-54b17d030f8b",
"timestamp": 1512709024000
}
]
}
Given this input data, we want to create a set of sessions from the incoming events. A session is defined as a group of events from a single visitor with no more than 10 minutes between consecutive events. A visitor can have multiple sessions. So, given the example input data above, we would expect output that looks like:
{
"sessionsByUser": {
"f877b96c-9969-4abc-bbe2-54b17d030f8b": [
{
"duration": 41294,
"pages": [
"/pages/a-sad-story",
"/pages/a-big-talk"
],
"startTime": 1512709024000
},
{
"duration": 0,
"pages": [
"/pages/a-sad-story"
],
"startTime": 1512711000000
}
],
"d1177368-2310-11e8-9e2a-9b860a0d9039": [
{
"duration": 195000,
"pages": [
"/pages/a-big-river",
"/pages/a-big-river",
"/pages/a-small-dog"
],
"startTime": 1512754436000
}
]
}
}
Notes
Timestamps are in milliseconds.
Events may not be given in chronological order.
The visitors in sessionsByUser can be in any order.
For each visitor, sessions should be in chronological order.
For each session, the URLs should be sorted in chronological order.
For a session with only one event, the duration should be zero.
Each event in a session (except the first event) must have occurred within 10 minutes of the preceding event in the session. This means that there can be more than 10 minutes between the first and the last event in the session.
Note: I'm not going to show you how to produce the exact output format you need, but I will show the general approach to solving this problem; hopefully you can adapt it to your requirements.
You'll want to start by grouping the events by user:
events_by_user = events.group_by { |event| event[:visitorId] }
Now, for each user, you need to sort their events by timestamp:
events_by_user.transform_values! do |events|
events.sort_by { |event| event[:timestamp] }
end
Now, you need to loop through each user's events and compare them in sequential order, putting them in groups based on timestamp similarity:
session_length = 10 * 60 * 1000 # 10 minutes, in milliseconds (the timestamps are in ms)
sessions = {}
events_by_user.each do |visitor_id, events|
sessions[visitor_id] = []
events.each do |event|
if sessions[visitor_id].empty?
sessions[visitor_id].push([event])
else
last_session = sessions[visitor_id].last
last_timestamp = last_session.last[:timestamp]
if (event[:timestamp] - last_timestamp) <= session_length
last_session.push(event)
else
sessions[visitor_id].push([event])
end
end
end
end
Now sessions will contain a hash like this:
{
<visitor_id>: [
[<list of events in session 1>],
[<list of events in session 2>]
],
etc.
}
You can then extract the start time, total duration, pages, and so on from each group.
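As a rough sketch of that last step (assuming the sessions hash built above and the output shape described in the question):
sessions_by_user = sessions.transform_values do |user_sessions|
  user_sessions.map do |session_events|
    {
      startTime: session_events.first[:timestamp],
      duration: session_events.last[:timestamp] - session_events.first[:timestamp],
      pages: session_events.map { |event| event[:url] }
    }
  end
end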
Group the "events" array by the property "visitorId" first. You can use in JavaScript the
Array.prototype.reduce(): The reduce() method executes a user-supplied
“reducer” callback function on each element of the array, in order,
passing in the return value from the calculation on the preceding
element. The final result of running the reducer across all elements
of the array is a single value. So, set {} as the initial value for the reducer function, at each pass use the visitorId as key to an array that will hold events of the visitor, push the current event to respective position in the array.
The a variable || [] statement is used to make an 'undefined'
value as [], empty array.
Now sort the events array that is built just now by timestamp in ascending order.
Loop through it and compare timestamps pairwise (previous and current array element), if difference is below given session length(eg 10 min), merge the two sessions and push it in an array with 'visitorId' as key. Use a variable to keep track of the index of the session to be merged together.
let data = require('d:\\dataset.json');
//Group by visitorId
let sessions = {
sessionsByUser: data.events.reduce(function (events, session) {
(events[session['visitorId']] = events[session['visitorId']] || []).push(session);
return events;
}, {})
};
//Sort events by timestamp ascending
for(let key in sessions.sessionsByUser){
let events = sessions.sessionsByUser[key];
events = events.sort((a, b) => {
return a.timestamp - b.timestamp;
});
}
let userSessions = {};
for(let key in sessions.sessionsByUser){
let events = sessions.sessionsByUser[key];
let lastIndex = 0;
for(let i = 0; i < events.length; i++)
{
if(i == 0) {
userSessions[key] = [{
duration: 0,
pages: [events[i].url],
startTime: events[i].timestamp
}]
}
else {
//Check difference (10 min)
if(events[i].timestamp - events[i-1].timestamp < 600000) {
let session = userSessions[key][lastIndex];
session.duration += (events[i].timestamp - events[i-1].timestamp);
session.pages.push(events[i].url);
}
else {
userSessions[key].push({
duration: 0,
pages: [events[i].url],
startTime: events[i].timestamp
});
lastIndex++;
}
}
}
}
let soln = {
sessionsByUser: userSessions
};
console.log(JSON.stringify(soln));
Run it from cmd with: node <filename>.js
Change the dataset path and navigate to the file's directory in cmd first. Node must be installed on the system.

How to model recursive data structures in GraphQL

I have a tree data structure that I would like to return via a GraphQL API.
The structure is not particularly large (small enough not to be a problem to return it in one call).
The maximum depth of the structure is not set.
I have modeled the structure as something like:
type Tag{
id: String!
children: [Tag]
}
The problem appears when one wants to get the tags to an arbitrary depth.
To get all the children to (for example) level 3 one would write a query like:
{
tags {
id
children {
id
children {
id
}
}
}
}
Is there a way to write a query that returns all the tags to an arbitrary depth?
If not, what is the recommended way to model a structure like the one above in a GraphQL API?
Some time ago I came up with another solution, which takes the same approach @WuDo suggested.
The idea is to flatten the tree at the data level, referencing nodes by ID (each child with its parent) and marking the roots of the tree, then rebuilding the tree recursively on the client side.
This way you don't have to worry about limiting the depth of your query as in @samcorcos's answer.
schema:
type Query {
tags: [Tag]
}
type Tag {
id: ID!
children: [ID]
root: Boolean
}
response:
{
"tags": [
{"id": "1", "children": ["2"], "root": true},
{"id": "2", "children": [], "root": false}
]
}
client tree buildup:
import find from 'lodash/find';
import isArray from 'lodash/isArray';

const rootTags = [...tags.map(obj => ({...obj})).filter(tag => tag.root === true)];

const mapChildren = childId => {
  const tag = find(tags, tag => tag.id === childId) || null;
  if (tag && isArray(tag.children) && tag.children.length > 0) {
    tag.children = tag.children.map(mapChildren).filter(tag => tag !== null);
  }
  return tag; // return the resolved node so the parent's children array gets populated
};

const tagTree = rootTags.map(tag => {
  tag.children = tag.children.map(mapChildren).filter(tag => tag !== null);
  return tag;
});
// Update 2022-08-16 Fixed typo
Another option, if you're willing to give up the type safety and subfield querying that GraphQL provides, along with the ability to cache and reference the objects by their IDs, is to encode the data as JSON. The graphql-type-json package provides resolvers to make this easy. These are also included, with permission, in graphql-scalars, which contains a lot of other handy scalars.
I'm doing this for the hierarchical data that defines the controls for a dynamic form. In this case, there aren't any IDs to lose, so it's an easy win.
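A minimal sketch of what that can look like, using makeExecutableSchema from @graphql-tools/schema as an example (the tagTree field and its hard-coded data are made up for illustration):
import { makeExecutableSchema } from '@graphql-tools/schema';
import GraphQLJSON from 'graphql-type-json';

const typeDefs = `
  scalar JSON

  type Query {
    tagTree: JSON
  }
`;

const resolvers = {
  JSON: GraphQLJSON, // wire the JSON scalar in the schema to the graphql-type-json implementation
  Query: {
    // The whole nested structure comes back as one opaque JSON value, at any depth.
    tagTree: () => ({ id: '1', children: [{ id: '2', children: [] }] }),
  },
};

export const schema = makeExecutableSchema({ typeDefs, resolvers });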

CouchDB performance issue reducing pair of values

Using CouchDB, I'm getting very poor performance trying to compute a "last pass" and "last fail" time over a set of automated test results.
I have a DB of ~5000 records of the form:
{
"completionTime": "2013-06-06T17:28:09.384Z",
"environment": "ENV1",
"passed": true,
"duration": 59142,
"summary": "",
"origin": {
"rowId": "1",
"worksheet": "Sheet1",
"workbook": "book.xlsm"
}
}
I have a view defined with map:
function(run) {
if (run.environment && run.origin && run.origin.rowId && run.origin.worksheet && run.origin.workbook && run.completionTime) {
var key = [run.environment, run.origin.rowId, run.origin.worksheet, run.origin.workbook]
var completionTime = Date.parse(run.completionTime)
if (run.passed)
emit(key, [completionTime, null] );
else
emit(key, [null, completionTime] );
}
}
And reduce:
function (key, values, rereduce) {
var latestPass = null;
var latestFail = null;
for (var i = 0; i < values.length; i++) {
latestPass = Math.max(values[i][0], latestPass);
latestFail = Math.max(values[i][1], latestFail);
}
return [latestPass, latestFail];
}
When querying this view for all results (about 750), it takes anywhere from 10-50 seconds, which is significantly slower than I'd expect.
Am I doing something obviously wrong?
From my limited experience with CouchDB view tuning, I found that writing the view in Erlang significantly improved performance.
Start with this: http://wiki.apache.org/couchdb/EnableErlangViews
Then write your view in Erlang (some examples): Emit Tuples From Erlang Views In CouchDB
It's a bit tricky to get the syntax of the Erlang views right, but it's fun to try, and I saw a little over a 50% performance increase compared to JavaScript views.
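For orientation, a rough sketch of what the map function from the question might look like in Erlang, assuming the native query server is enabled (the key is simplified to just the environment, and completionTime is emitted as-is rather than parsed to a number):
fun({Doc}) ->
    Env = proplists:get_value(<<"environment">>, Doc, null),
    Passed = proplists:get_value(<<"passed">>, Doc, false),
    Completion = proplists:get_value(<<"completionTime">>, Doc, null),
    if
        Env =:= null; Completion =:= null -> ok;
        Passed =:= true -> Emit(Env, [Completion, null]);
        true -> Emit(Env, [null, Completion])
    end
end.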
I switched to MongoDB, and the same queries ran in hundreds of milliseconds, rather than tens of seconds.
