Let me break down the problem; it will take some time.
Consider that you have entities A, B, and C in your system:
A is the parent of everything.
B is a child of A.
C can be a child of either A or B. Note that there are more entities like D, E, and F that behave the same as C, but let's consider only C for the time being.
So it is basically a tree-like structure:
```
      A
     / \
    /   \
   B     C   (there are similar elements like D, E, F)
   |
   C
```
We are using Elasticsearch as a secondary DB to store this. In the primary database the structure is completely different: since A, B, and C have dynamic fields, they live in separate tables that we join to fetch the data. From a business perspective, though, this is the design.
Now, when we flatten it and store it in ES, we end up with a denormalized row set.
Say we have an entity A1 with two children, C1 and B1, and B1 has a further child, C2:
```
    A    B     C
1   A1   null  null
2   A1   null  C1
3   A1   B1    null
4   A1   B1    C2
```
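For illustration, rows 2 and 4 would be stored in ES as flat documents along these lines (field names here are assumptions):
```
{ "A": "A1", "B": null, "C": "C1" }
{ "A": "A1", "B": "B1", "C": "C2" }
```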
Now to what the user can query.
The user says he wants all columns of A, B, and C where the value of column A is A1. With some null-removal rules we can give him rows 2, 3, and 4.
Now the problem: the user says he wants all As where the value of A is A1. So basically we return rows 1, 2, 3, 4 (or 2, 3, 4), and he will see values like
```
A
A1
A1
A1
```
but logically he should see only one value, A1, since that is the only unique value, and ES does not have the ability to GROUP BY.
Here is how we solved it.
We solved this problem by creating multiple indices: several flat indices plus one nested index.
When we need group-by behavior we go to the nested index; the other indices work as flat indices.
So we have different indices: for example, an index for A and B, and another for A (or B) and C. Because we have more entity types, this led to the creation of five indices.
As the data grows, it is becoming difficult to maintain five indices, and indexing them from scratch takes too much time.
To solve this we started looking at other options and are currently testing CrateDB. But first we are still trying to figure out whether there is any way to do this in ES, since we need many ES features such as percolation, Watcher, etc. Any clues on that?
Please also note that we need to apply pagination as well; that is why a single nested index will not work.
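For reference, the nested index mentioned above uses a mapping along these lines (a sketch; the index and field names are assumptions):
```
PUT /a_nested
{
  "mappings": {
    "properties": {
      "A": { "type": "keyword" },
      "children": {
        "type": "nested",
        "properties": {
          "B": { "type": "keyword" },
          "C": { "type": "keyword" }
        }
      }
    }
  }
}
```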
I am using RedisGraph with a custom implementation of ioredis.
The query takes 3 to 6 seconds on a database with millions of nodes. It basically filters (b:brand) by different relationship counts, adding the following MATCH and WHERE clauses multiple times on different nodes:
(:brand) - 1mil nodes
(:w) - 20mil nodes
(:e) - 10mil nodes
```
// matching b before this code block
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
```
The full query would look like this.
```
MATCH (b:brand)
WHERE b.deleted IS NULL
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
MATCH (c)-[:r3]->(d:d)<-[:r4]-(e:e)
WHERE e.deleted IS NULL
WITH count(DISTINCT e) as count, b
WHERE count >= 0 AND count <= 10
WITH b ORDER BY b.name ASC
WITH count(b) as totalCount, collect({id: b.id})[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount
```
How can I optimize this query as it's really slow?
A few thoughts:
- Property lookups are expensive; is there a way you can get around all the .deleted checks?
- If possible, can you avoid naming r1, r2, etc.? It's faster when it doesn't have to check the relationship type.
- You're essentially traversing the entire graph several times. If the paths b-->p<--w and c-->d<--e don't overlap, you can include them both in the MATCH statement, separated by a comma, and aggregate both counts at once (see the sketch after this list).
- I don't know if it will help much, but you don't need to name p and d, since you never refer to them.
- This is a very small improvement, but I don't see a reason to check count >= 0.
- Also, I'm sure you have your reasons, but why does the c-->d<--e path matter? It would make more sense to me if it were b-->d<--e, mirroring the first portion.
EDIT/UPDATE: A few things I said need clarification:
First bullet:
The fastest lookup is on a node label; up to 4 labels are essentially O(1). (Well, for anchor nodes; it's slower for downstream nodes.)
The second-fastest lookup is on an INDEXED property. My comment above assumed UNINDEXED lookups.
Second bullet: I think I was just wrong here. Relationships are stored as doubly-linked lists grouped by relationship type. Therefore, always specify the relationship type for better performance. Similarly, always specify the direction.
Third bullet: What I said is generally correct; HOWEVER, beware of Cartesian products when you have two MATCH patterns separated by a comma. In general, you would only use that structure when the paths share a common element, as when you want directors, actors, and cinematographers all connected to a movie. Still, there is no overlap between these paths.
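Putting those suggestions together, the query might be restructured roughly like this (a sketch, assuming the second path should anchor on b as suggested in the last bullet, and that the two paths don't otherwise overlap):
```
MATCH (b:brand)
WHERE b.deleted IS NULL
MATCH (b)-[:r1]->(:p)<-[:r2]-(w:w), (b)-[:r3]->(:d)<-[:r4]-(e:e)
WHERE w.deleted IS NULL AND e.deleted IS NULL
// aggregate both counts in a single pass
WITH b, count(DISTINCT w) AS wCount, count(DISTINCT e) AS eCount
WHERE wCount <= 10 AND eCount <= 10
WITH b ORDER BY b.name ASC
WITH count(b) AS totalCount, collect({id: b.id})[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount
```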
I would like to store custom tags for each user purchase transaction. For example, if a user bought shoes, the tags might be "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", and so on.
These are the tags the seller is interested in querying later to understand sales.
My idea: whenever a new tag comes in, create a new code for it (something like a hash code, but sequential). Codes start with the 26 letters "a" to "z", then continue "aa", "ab", "ac" ... "zz", and so on. Then keep all the tags given in one transaction in a single varchar column called tag, separated by "|".
Let us assume the mapping (at the application level) is:
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So, storing the above purchase transaction, the tag column will look like tag="|a|z|ay|bc|cq|". The seller can then count the number of SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. The problem is that I cannot use an index (a sort key in Redshift) for a LIKE pattern that starts with %. So how do I solve this issue, given that I might have 100 million records? I don't want a full table scan.
Any solution to fix this?
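For concreteness, the kind of query the seller would run looks something like this (a sketch; the transactions table name is an assumption):
```
-- Count SHOES ("ay") sold; the leading % defeats the sort key,
-- so this scans the whole table.
SELECT COUNT(*)
FROM transactions
WHERE tag LIKE '%|ay|%';
```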
Update_1:
I have not followed the bridge-table concept (a cross-reference table), since I want to perform a GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but a bridge table would give me two rows, and then my SUM() would be doubled.
I got a suggestion like the one below:
```
EXISTS (SELECT 1 FROM transaction_tag
        WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
```
in the WHERE clause, once for each tag (note: this assumes tr is an alias for the transaction table in the surrounding query).
I have not followed this either, since I have to apply AND and OR conditions across the tags, for example ("SPORTS" AND "ADIDAS"), or "SHOES" AND ("NIKE" OR "ADIDAS").
Update_2:
I have not followed the bit-field idea, since I don't know whether Redshift supports it. Also, my system will have at least 3500 tags, and allocating one bit for each results in about 437 bytes per transaction, even though a transaction can carry at most 5 tags. Any optimisation here?
Solution_1:
I have thought of adding a min value (SMALLINT) and a max value (SMALLINT) alongside the tags column and applying an index (sort key) to those.
So, something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are:
`tag="|a|z|ay|bc|cq|"` // sorted?
`minTag=1`
`maxTag=95` // for cq
And the query for searching SHOES (ay = 51) is:
`minTag <= 51 AND maxTag >= 51 AND tag LIKE '%|ay|%'`
And the query for searching SHOES (ay = 51) AND SIZE_12 (cq = 95) is:
`minTag <= 51 AND maxTag >= 95 AND tag LIKE '%|ay|%|cq|%'`
Will this give any benefit? Kindly suggest any alternatives.
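Written out in full, the second search would look something like this (a sketch; the transactions table name is an assumption):
```
SELECT *
FROM transactions
WHERE minTag <= 51             -- range prune via the sort key
  AND maxTag >= 95
  AND tag LIKE '%|ay|%|cq|%';  -- codes are sorted, so 'ay' appears before 'cq'
```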
You can implement auto-tagging while the files are loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the objects using the AWS s3api. Example below; capture the tags dynamically by sending them as parameters:
```
aws s3api put-object-tagging --bucket --key --tagging "TagSet=[{Key=Adidas,Value=AY}]"
```
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1 at that tag's position and a 0 otherwise. Every row then has a sequence of 0s and 1s of the same length as the number of tags. This sequence is sortable; you would still need to look into the middle of the string, but you know exactly which position to check, so you don't need LIKE, just SUBSTRING. For further optimization, you could convert this bit mask to integer values (unique for each sequence) and match on those, but AFAIK Redshift doesn't support that out of the box yet; you would have to define the rules yourself.
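A minimal sketch of the lookup, assuming a transactions table with a fixed-length tag_mask column and SHOES at (1-based) position 51 (names and positions are illustrative):
```
-- '1' at position 51 means the transaction carries the SHOES tag.
SELECT COUNT(*)
FROM transactions
WHERE SUBSTRING(tag_mask, 51, 1) = '1';
```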
UPD: It looks like the best option here is to keep tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of (order_id, tag_id), distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Lookups for orders with a particular tag, and further aggregations over those orders, should then be efficient. There is no silver bullet for optimizing this in a flat table; at least I don't know of one that would not bring a lot of unnecessary complexity compared with the "relational" solution.
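To address the AND/OR concern from Update_1: matching tag rows can be collapsed to one row per order before aggregating, so SUM() is not doubled. A sketch, assuming tables orders(order_id, amount) and order_tags(order_id, tag_id) (names are illustrative):
```
-- "SHOES" AND ("NIKE" OR "ADIDAS"), one row per matching order
SELECT SUM(o.amount)
FROM orders o
JOIN (
    SELECT order_id
    FROM order_tags
    GROUP BY order_id
    HAVING SUM(CASE WHEN tag_id = 'SHOES' THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN tag_id IN ('NIKE', 'ADIDAS') THEN 1 ELSE 0 END) > 0
) m ON m.order_id = o.order_id;
```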
I have a Pig Latin related problem.
I have this data below (in one row):
```
A = LOAD 'records' AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray, f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
```
Now I have another dataset:
```
B = LOAD 'values' AS (f1:chararray, f2:chararray);
DUMP B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
```
I would like to join these two datasets: take the corresponding value from dataset B and place it beside each value from dataset A. The expected output is below:
FITKA 0.124411, FINVA 0.454535, and so on ...
(It can also look like: FITKA, 0.124411, FINVA, 0.454535, and so on ...)
Then I would be able to multiply the values (0.124411 x 0.454535 ... and so on), because they are on the same row now, and this is what I want.
Of course I can join column by column, but then the values appear at the end of the row and I have to clean them up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple containing a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
```
sentence:tuple(words:bag{wordcount:tuple(word, count)})
```
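One way to get there from the current six-column layout is to unpivot the fields into one word per row and then join against B (a sketch; alias names are illustrative):
```
-- Unpivot the six columns into one word per row.
A2 = FOREACH A GENERATE FLATTEN(TOBAG(f1, f2, f3, f4, f5, f6)) AS word;
-- Attach each word's value from B.
J  = JOIN A2 BY word, B BY f1;
-- Keep (word, value) pairs.
R  = FOREACH J GENERATE A2::word AS word, B::f2 AS value;
DUMP R;
```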
I am very new to using Pig and Hadoop, so please forgive me if this is very basic. I have a relation with a list of users and followers (like Twitter), in the format (userA,userB), meaning that userB follows userA. My assignment (yes, this is homework) is to find people who follow each other. I have done this, but I have twice as many tuples as I need, since I have both (userA,userB) and (userB,userA) in the relation. It doesn't matter which of the two tuples I end up with; I just need to eliminate one of them. The DISTINCT keyword won't do me any good, since the order is reversed.
Without seeing your code it seems you could try sorting the fields of the tuple before de-duplicating, like this:
```
X = FOREACH A GENERATE (f1 < f2 ? f1 : f2), (f1 < f2 ? f2 : f1);
Y = DISTINCT X;
```
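For example, (alice,bob) and (bob,alice) both become (alice,bob) after the field sort, so DISTINCT keeps exactly one tuple per mutual pair (the user names here are illustrative).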
I have the following data set:
```
f1,f2,f3,f4,f5,f6
```
I am looking for the count of f6 along with the rest of the fields:
```
f1,f2,f3,f4,f5,5
f1,f2,f3,f4,f5,3
```
and so on.
I tried this code, but it takes too long to execute:
```
A = LOAD 'file a';
B = GROUP A BY f6;
C = FOREACH B GENERATE FLATTEN(group) AS f6, FLATTEN(A.f1), FLATTEN(A.f2), FLATTEN(A.f3), FLATTEN(A.f4), FLATTEN(A.f5), COUNT(A.f6);
```
Is there any better way to achieve what I am looking for?
If I simply get the count without FLATTEN, the fields end up in a bag, but I want the final output as a tuple.
So trying this gives me the output as a bag:
```
C = FOREACH B GENERATE FLATTEN(group) AS f6, A.f1, A.f2, A.f3, A.f4, A.f5, COUNT(A.f6);
```
All inputs are appreciated.
Cheers
It is also possible to flatten the projection that was grouped:
```
A = LOAD 'file a';
B = GROUP A BY f6;
C = FOREACH B GENERATE FLATTEN(A), COUNT(A) AS f6_count;
```
EDIT 1:
The key is using FLATTEN(A) instead of FLATTEN(group).
FLATTEN(A) produces a tuple with all of the columns from the original relation, including those that were not used in the GROUP BY statement (f1, f2, f3, f4, f5, f6), and gets rid of the bag.
FLATTEN(group) returns only the columns used in the GROUP BY, in this case f6.
The advantage of this approach is that it is very efficient and requires a single MapReduce job to execute. Any solution that involves a JOIN operation adds an extra MapReduce job.
As a general rule of thumb in Pig, Hive, and MapReduce, GROUP BY and JOIN operations are usually executed as separate MR jobs, and reducing the number of MR jobs leads to improved performance.
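If you want to verify the job count, Pig's EXPLAIN prints the compiled plans for an alias (a quick check, using the alias from the answer above):
```
-- The MapReduce-plan section at the end lists one operator per MR job.
EXPLAIN C;
```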