How to flatten recursive hierarchy using Hive/Pig/MapReduce - hadoop

I have unbalanced tree data stored in tabular format like:
The depth of the tree is unknown.
How can I flatten this hierarchy so that each row contains the entire path from a leaf node to the root node, as:
leaf node, root node, intermediate nodes
Any suggestions for solving the above problem using Hive, Pig or MapReduce? Thanks in advance.

I tried to solve it using Pig. Here is the sample code:
Join function:
-- Join parent and child
define join_hierarchy (leftA, source, result) returns output {
joined = join $leftA by parent left, $source by child;
-- rows with no matching parent have reached the root, so their path is complete
tmp_filtered = filter joined by $source::parent is null;
part = foreach tmp_filtered generate $leftA::child as child, $leftA::path as path;
$result = union part, $result;
-- the remaining rows still have a parent: prepend it to the path
part_remaining = filter joined by $source::parent is not null;
$output = foreach part_remaining generate $leftA::child as child, $source::parent as parent, CONCAT(CONCAT($source::parent, ':'), $leftA::path) as path;
};
Load dataset:
--My dataset field delimiter is ','.
source = load '*****' using PigStorage(',') as (parent:chararray, child:chararray);
--create additional column for path
leftA = foreach source generate child, parent, CONCAT(parent, ':') as path;
--initially result table will be blank.
result= limit leftA 1;
result= foreach result generate '' as child , '' as parent;
--Flatten hierarchy to 4 levels. Add below lines equivalent to hierarchy depth.
leftA= join_hierarchy(leftA, source, result);
leftA= join_hierarchy(leftA, source, result);
leftA= join_hierarchy(leftA, source, result);
leftA= join_hierarchy(leftA, source, result);
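For comparison, the walk from each leaf up to the root can be sketched in memory. This is only an illustrative Python sketch, not a Hive/Pig/MapReduce solution: it assumes the whole (parent, child) table fits in memory, that every node has at most one parent, and that there are no cycles. The function name flatten_paths and the sample rows are hypothetical.

```python
def flatten_paths(rows):
    """Return one (leaf, root, intermediates) tuple per leaf node.

    rows is an iterable of (parent, child) pairs; a root may appear with
    parent None or simply never appear as a child.
    """
    parent_of = {child: parent for parent, child in rows}
    parents = {parent for parent, _ in rows if parent is not None}
    # A leaf is a child that is never anyone's parent.
    leaves = [child for _, child in rows if child not in parents]

    result = []
    for leaf in leaves:
        ancestors = []  # nearest ancestor first, root last
        node = parent_of.get(leaf)
        while node is not None:
            ancestors.append(node)
            node = parent_of.get(node)
        root = ancestors[-1] if ancestors else leaf
        result.append((leaf, root, ancestors[:-1]))
    return result


# Hypothetical unbalanced tree: A -> B -> D -> E and A -> C
sample = [("A", "B"), ("A", "C"), ("B", "D"), ("D", "E")]
flattened = flatten_paths(sample)
```

Because the loop runs until no parent is found, the depth of the tree does not need to be known in advance, which is the part the fixed number of macro calls cannot express.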


In Linq to XML, how do you get ALL descendants of root?

In Linq to XML, how do you get ALL descendants of the root?
I can't get my statement to even compile, as it always says:
A query body must end with a select clause or group clause
So given this XML (SVG in this case):
I would like to enumerate all the shape nodes (rect/ellipse/path etc):
var xml = XDocument.Load(@"C:\diagram.svg");
var query = from c in xml.Root.Descendants().SelectMany(); // <- doesn't compile
It turns out I just needed to end the query expression with a select c clause instead of calling the .Select() or .SelectMany() function:
var xml = XDocument.Load(@"C:\diagram.svg");
var query = from c in xml.Root.Elements()
            select c;
foreach (var element in query)
{
    // do work
}
Also, in my case I need .Elements() of the root rather than .Descendants(), as the latter flattens the tree structure of the SVG, which is probably not what most people want when parsing SVG. Elements() gives you only the immediate children of the root element.
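The difference between immediate children and all descendants can be illustrated outside LINQ as well. The following is a rough Python analogue using the standard xml.etree.ElementTree module, not LINQ to XML itself; the inline SVG string and the local_name helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical SVG document with one nested group.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect width="10" height="10"/>
  <g>
    <ellipse rx="5" ry="3"/>
    <path d="M0 0 L1 1"/>
  </g>
</svg>"""


def local_name(element):
    # ElementTree stores tags as "{namespace}name"; keep only the name.
    return element.tag.split("}", 1)[-1]


root = ET.fromstring(SVG)

# Iterating an element yields only its immediate children
# (comparable to Elements()).
children = [local_name(e) for e in root]

# iter() walks the whole subtree in document order
# (comparable to Descendants(), once the root itself is skipped).
descendants = [local_name(e) for e in root.iter() if e is not root]
```

Here the children are the rect and the g group, while the descendants additionally include the ellipse and path nested inside the group.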

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
You can also filter the bag inside a nested FOREACH and project a boolean that says whether 2 is present in the bag, then filter the rows on that boolean:
-- input and output are reserved words in Pig, so different aliases are used
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{});
data_filt = FOREACH data {
    bag_filter = FILTER event_list BY ((chararray) $0 matches '2');
    GENERATE id, event_list, (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER data_filt BY is_2_present;
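The intent of both answers, keeping a row only if its inner bag contains 2, can be stated compactly in memory. This Python sketch is only an illustration of the expected result, not Pig code; the rows list mirrors the sample data from the question and the function name rows_containing is hypothetical.

```python
# Sample data from the question: (id, event_list), all values as strings.
rows = [
    ("22678", ["112", "110", "2"]),
    ("656565", ["110", "109"]),
    ("6676", ["2", "112"]),
]


def rows_containing(rows, wanted):
    """Keep the rows whose event list contains the wanted value."""
    return [(row_id, events) for row_id, events in rows if wanted in events]


matching = rows_containing(rows, "2")
```

The original bag is kept intact for every matching row, which is what the join back to the original relation achieves in the first Pig answer.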

Linq2Entities Equivalent Query for Parent/Child Relationship, With All Parents and Children, Filtering/Ordering Children

So the question is ridiculously long, so let's go to the code. What's the linq2entities equivalent of the following Sql, given entities (tables) that look like:
The sql:
select p.*, c.*
from parent p
inner join child c on
p.parent_id = c.parent_id
where c.child_field1 = some_appropriate_value
order by
L2E lets you do .Include(), and that seems like the appropriate place to put the ordering and filtering for the child, but the Include method doesn't accept an expression (why not!?). So I'm guessing this can't be done right now, because that's what a lot of articles say, but they're old, and I'm wondering if it's possible with EF6.
Also, I don't have access to the context, so I need the lambda-syntax version.
I am looking for a resultant object hierarchy that looks like:
+-- ChildrenOfParent1
+-- ChildrenOfParent2
and so forth. The list would end up being an IEnumerable. If one iterated over that list, they could get the .Children property of each parent in that list.
Ideally (and I'm dreaming here, I think), the overall size of the result list could be limited. For example, if there are three parents, each with 10 children, for a total of 33 entities (30 children + 3 parents), I could limit the total list to some arbitrary value, say 13. In this case that would limit the result set to the first parent with all its children, and the second parent with only one of its children (13 total entities). I'm guessing all of this would have to be done manually in code, which is disappointing because it can be done quite easily in SQL.
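To make the size limit concrete, here is a small in-memory Python sketch of that truncation rule, assuming each parent counts as one entity and children are kept in order until the budget runs out. It is not Entity Framework code; limit_entities and the sample data are hypothetical.

```python
def limit_entities(parents, max_entities):
    """Keep parents, and as many of their children as fit, within a total budget."""
    result = []
    remaining = max_entities
    for parent in parents:
        if remaining <= 0:
            break
        remaining -= 1  # the parent itself counts as one entity
        kept = parent["children"][:remaining]
        remaining -= len(kept)
        result.append({"name": parent["name"], "children": kept})
    return result


# Three parents with ten children each: 33 entities in total.
parents = [
    {"name": f"parent{i}", "children": [f"child{i}_{j}" for j in range(10)]}
    for i in range(1, 4)
]
limited = limit_entities(parents, 13)
```

With a budget of 13, the first parent keeps all ten children, the second keeps one, and the third is dropped.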
When you use Entity Framework to fetch parents, the parents' fields are fetched in a single query. You then have a result set like this:
var parentsQuery = db.Parents.ToList();
Then, if you have a foreign key on the parent, Entity Framework creates a navigation property on the parent for accessing the corresponding entities (for example, the Child table).
In this case, when you use this navigation property on parent entities that have already been fetched, Entity Framework makes another round trip to SQL Server per parent to get the children.
For example, if parentsQuery contains 15 parents, the following query makes Entity Framework issue 15 additional queries, one per parent:
var Childs = parentsQuery.SelectMany(u => u.NavigationProperty_Childs).ToList();
In these cases you can use Include to avoid the extra round trips and fetch all children together with their parents in a single query, like this:
var ParentIncludeChildsQuery = db.Parents.Include("Childs").ToList();
Then, with the following query, Entity Framework does not go back to the database at all:
var Childs = ParentIncludeChildsQuery.SelectMany(u => u.NavigationProperty_Childs).ToList();
However, you can't apply any condition or constraint inside Include. You can only apply conditions after the Include, using Where, Join, Contains and so forth, like this:
var Childs = ParentIncludeChildsQuery.SelectMany(u => u.NavigationProperty_Childs
.Where(t => t.child_field1 == some_appropriate_value)).ToList();
But with this query, all children have already been fetched from the database before the filter is applied.
A better way to achieve the equivalent of the SQL query is:
var query = parent.Join(child,
p => p.ID,
c => c.ParentID,
(p, c) => new { Parent = p, Child = c })
.Where(u => u.Child.child_field1 == some_appropriate_value)
.OrderBy(u => u.Parent.parent_field1)
.ThenBy(u => u.Child.child_field2);
According to your comment, this is what you want:
var query = parent.Join(child,
p => p.ID,
c => c.ParentID,
(p, c) => new { Parent = p, Child = c })
.Where(u => u.Child.child_field1 == some_appropriate_value)
.GroupBy(u => u.Parent)
.Select(u => new {
Parent = u.Key,
Childs = u.OrderBy(t => t.Child.child_field2).AsEnumerable()
})
.OrderBy(u => u.Parent.parent_field1);
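The shape of the result (parents ordered by one field, each carrying only its matching children ordered by another) can be sketched in plain Python. This is only an in-memory illustration of the join, filter, group and order steps, not Entity Framework code; the dictionaries and the function name parents_with_children are hypothetical, while the field names follow the question.

```python
def parents_with_children(parents, children, wanted):
    """Inner-join parents to their matching children, grouped per parent."""
    # Filter the children and bucket them by parent id.
    by_parent = {}
    for child in children:
        if child["child_field1"] == wanted:
            by_parent.setdefault(child["parent_id"], []).append(child)

    result = []
    for parent in sorted(parents, key=lambda p: p["parent_field1"]):
        matched = by_parent.get(parent["id"])
        if matched:  # inner join: parents without matching children are dropped
            ordered = sorted(matched, key=lambda c: c["child_field2"])
            result.append({"parent": parent, "children": ordered})
    return result


parents = [
    {"id": 1, "parent_field1": "b"},
    {"id": 2, "parent_field1": "a"},
    {"id": 3, "parent_field1": "c"},
]
children = [
    {"parent_id": 1, "child_field1": "x", "child_field2": 2},
    {"parent_id": 1, "child_field1": "x", "child_field2": 1},
    {"parent_id": 1, "child_field1": "y", "child_field2": 0},
    {"parent_id": 2, "child_field1": "x", "child_field2": 5},
    {"parent_id": 3, "child_field1": "y", "child_field2": 1},
]
grouped = parents_with_children(parents, children, "x")
```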

Groovy tree-like sort

I have a problem with sorting rows from db having tree-like hierarchy. Each row contains three columns meaningful to this problem: id, parent, lp. Id is String, parent is another row and lp is a number used to sort rows having no parent-child relationship. Each row can have any number of children and only one parent (null on top level)
There are three situations I see:
when the first row is the parent of the other: -1 is returned
when the first row is a child of a parent with a lower lp than the other row: -1 is returned
when none of those relations exist (also when the rows have the same parent and are on the same level): the lps of the rows are compared
I managed to write this code, which I think should solve the problem, but it doesn't work for rows that are deep in the hierarchy and it messes up the order:
dane = dane.sort { it1, it2 ->
    it1 == it2.parent ? -1 :
    it1.parent && it1.parent.lp < it2.lp ? -1 :
    it1.lp - it2.key.lp
}
I'd appreciate any suggestions. Thx in advance!
Your comparison should be consistent regardless of the order of the arguments: compare(a, b) should be the negation of compare(b, a). It doesn't look like that's the case here. For example, when it1 == it2.parent the closure returns -1, but in the mirrored case where it1.parent == it2 there is no branch that returns 1.
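One way to get a consistent ordering is to avoid pairwise rules altogether and sort by a key derived from each row's ancestor chain. The sketch below is in Python rather than Groovy and swaps in that different technique; it assumes each row has an id, a parent id (None at the top level) and an lp, and that there are no cycles. The names path_key and tree_sort are hypothetical.

```python
def path_key(row, by_id):
    """Build the (lp, id) chain from the top-level ancestor down to the row."""
    chain = []
    node = row
    while node is not None:
        chain.append((node["lp"], node["id"]))
        node = by_id.get(node["parent"])
    return tuple(reversed(chain))


def tree_sort(rows):
    by_id = {row["id"]: row for row in rows}
    # Tuples compare element by element, and a shorter prefix sorts first,
    # so every parent lands directly before its own subtree.
    return sorted(rows, key=lambda row: path_key(row, by_id))


rows = [
    {"id": "a", "parent": None, "lp": 2},
    {"id": "b", "parent": None, "lp": 1},
    {"id": "c", "parent": "a", "lp": 1},
    {"id": "d", "parent": "b", "lp": 5},
    {"id": "e", "parent": "c", "lp": 3},
]
ordered_ids = [row["id"] for row in tree_sort(rows)]
```

Because every row gets a single, self-contained key, the ordering is symmetric by construction and works at any depth.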

Recursive Linq Grouping

I have database table that stores the hierarchy of another table's many-to-many relationship. An item can have multiple children and can also have more than one parent.
ItemID (key)
MemberID (key)
ParentItemID (fk)
ChildItemID (fk)
Sample hierarchy:
Level1 Level2 Level3
X A A1
B B1
I would like to group all of the child nodes by each parent node in the hierarchy.
Parent Child
X A1
A A1
B B1
Notice how there are no leaf nodes in the Parent column, and how the Child column only contains leaf nodes.
Ideally, I would like the results to be in the form of IEnumerable<IGrouping<Item, Item>> where the key is a Parent and the group items are all Children.
Ideally, I would like a solution that the entity provider can translate into T-SQL, but if that is not possible then I need to keep round trips to a minimum.
I intend to Sum values that exist in another table joined on the leaf nodes.
Since you are always going to be returning ALL of the items in the table, why not just make a recursive method that gets all children for a parent and then use that on the in-memory Items:
partial class Items
public IEnumerable<Item> GetAllChildren()
//recursively or otherwise get all the children (using the Hierarchy navigation property?)
var items =
from item in Items.ToList()
group new
} by item.itemID;
Sorry for any syntax errors...
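The in-memory recursion suggested above can be sketched as follows. This is a Python illustration rather than LINQ, assuming the hierarchy has already been loaded as (parent, child) pairs and contains no cycles; leaves_by_parent and the sample edges are hypothetical.

```python
from collections import defaultdict


def leaves_by_parent(edges):
    """Map every non-leaf node to the leaf nodes reachable below it."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    cache = {}

    def leaves(node):
        if node not in children:  # no children: the node is itself a leaf
            return [node]
        if node not in cache:
            found = []
            for child in children[node]:
                for leaf in leaves(child):
                    if leaf not in found:  # a node may be reachable via several parents
                        found.append(leaf)
            cache[node] = found
        return cache[node]

    return {parent: leaves(parent) for parent in list(children)}


# Hypothetical hierarchy: X -> A -> A1 and X -> B -> B1
edges = [("X", "A"), ("A", "A1"), ("X", "B"), ("B", "B1")]
groups = leaves_by_parent(edges)
```

Only nodes that have children become keys, and only leaves appear in the values, which matches the Parent/Child shape described in the question. The cache keeps shared subtrees from being walked more than once when an item has several parents.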
Well, if the hierarchy is strictly 2 levels you can always union them and let LINQ sort out the SQL (it ends up being a single trip though it needs to be seen how fast it will run on your volume of data):
var hlist = from h in Hierarchies
select new {h.Parent, h.Child};
var slist = from h in Hierarchies
join h2 in hlist on h.Parent equals h2.Child
select new {h2.Parent, h.Child};
hlist = hlist.Union(slist);
This gives you a flat IEnumerable<{Item, Item}> list, so if you want to group them you just follow on:
var glist = from pc in hlist.AsEnumerable()
group pc.Child by pc.Parent into g
select new { Parent = g.Key, Children = g };
I used AsEnumerable() here because we reached the limit of the LINQ to SQL provider when attempting to group a Union. If you try it against IQueryable, it will run a basic Union for eligible parents and then do a round trip for every parent (which is what you want to avoid). Whether or not it's OK for you to use regular LINQ for the grouping is up to you; the same volume of data would have to come through the pipe either way.
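The union-then-group idea for a strictly two-level hierarchy can be mirrored in memory like this. It is a Python sketch of the same steps, not LINQ to SQL; the hierarchy list is made-up sample data, and the variable names only loosely follow hlist, slist and glist from the answer.

```python
# Hypothetical two-level hierarchy as (parent, child) pairs.
hierarchy = [("X", "A"), ("A", "A1"), ("X", "B"), ("B", "B1")]

# hlist: the direct parent/child pairs.
direct = set(hierarchy)

# slist: join each row to the pairs whose child is that row's parent,
# producing (grandparent, grandchild) pairs.
second_level = {
    (upper_parent, child)
    for parent, child in hierarchy
    for upper_parent, upper_child in direct
    if parent == upper_child
}

# Union of both levels, then grouping of children by parent.
combined = direct | second_level
grouped = {}
for parent, child in sorted(combined):
    grouped.setdefault(parent, []).append(child)
```

Note that, like the LINQ version, this keeps intermediate nodes such as A and B among the children of X; restricting the result to leaf nodes would need an extra filter.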
EDIT: Alternatively you could build a view linking parent to all its children and use that view as a basis for tying Items. In theory this should allow you/L2S to group over it with a single trip.
