Suppose I have 3 boxes labeled A, B, C and I have 2 balls, B1 and B2. I want to get all possible combinations of these balls in the boxes. Please note, it is important to know which ball is in each box, meaning B1 and B2 are not the same.
A        B        C
B1, B2   -        -
B1       B2       -
B1       -        B2
B2       B1       -
B2       -        B1
-        B1, B2   -
-        B1       B2
-        B2       B1
-        -        B1, B2
If there is a known algorithm for this problem, please tell me its name.
Let N be the number of buckets (3 in the example) and M the number of balls (2). Consider the numbers in the range [0..N**M), i.e. [0..9) in the example, written in radix N. For the example in the question these are ternary (base-3) numbers.
Now we can easily interpret these numbers: the rightmost digit shows the first ball's position, the digit to its left shows the second ball's position.
    |--- Second Ball position [0..2]
    ||-- First Ball position [0..2]
    ||
0 = 00 - both balls are in the bucket #0 (`A`)
1 = 01 - first ball is in the bucket #1 (`B`), second is in the bucket #0 (`A`)
2 = 02 - first ball is in the bucket #2 (`C`), second is in the bucket #0 (`A`)
3 = 10 - first ball is in the bucket #0 (`A`), second is in the bucket #1 (`B`)
4 = 11 - both balls are in the bucket #1 (`B`)
5 = 12 ...
6 = 20
7 = 21 ...
8 = 22 - both balls are in the bucket #2 (`C`)
The general algorithm is:
For each number in the range [0 .. N**M):
  the i-th ball (i = 0..M-1) is in bucket # (number / N**i) % N (here / stands for integer division, % for remainder)
If you just want the total count, the answer is simply N ** M; in the example above, 3 ** 2 == 9.
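As a rough Java sketch of the digit formula above (the class and method names `BallRadix`, `decode` and `all` are made up for illustration, and `long` is assumed to be large enough for N ** M):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BallRadix {
    // Decodes one number into ball positions: entry i is the bucket of ball i,
    // i.e. (number / N^i) % N, computed digit by digit.
    static int[] decode(long number, int buckets, int balls) {
        int[] position = new int[balls];
        for (int i = 0; i < balls; i++) {
            position[i] = (int) (number % buckets); // lowest remaining digit
            number /= buckets;                      // move on to the next digit
        }
        return position;
    }

    // Enumerates all N^M placements. Assumption: N^M fits into a long.
    static List<int[]> all(int buckets, int balls) {
        long count = 1;
        for (int i = 0; i < balls; i++) {
            count *= buckets;
        }
        List<int[]> result = new ArrayList<>();
        for (long n = 0; n < count; n++) {
            result.add(decode(n, buckets, balls));
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 buckets, 2 balls: prints 9 arrays, one per placement
        for (int[] placement : all(3, 2)) {
            System.out.println(Arrays.toString(placement));
        }
    }
}
```

Here `decode(5, 3, 2)` yields `[2, 1]`: 5 is `12` in base 3, so the first ball is in bucket #2 and the second in bucket #1.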
C# Code
The algorithm itself is easy to implement:
// Requires: using System.Collections.Generic; using System.Numerics;
static IEnumerable<int[]> BallsLocations(int boxCount, int ballCount) {
  BigInteger count = BigInteger.Pow(boxCount, ballCount);

  for (BigInteger i = 0; i < count; ++i) {
    // balls[k] is the box index of ball k: the k-th base-boxCount digit of i
    int[] balls = new int[ballCount];
    int index = 0;

    for (BigInteger value = i; value > 0; value /= boxCount)
      balls[index++] = (int)(value % boxCount);

    yield return balls;
  }
}
It is the representation of the answer that is more involved:
// Requires: using System; using System.Collections.Generic; using System.Linq;
static IEnumerable<string> BallsSolutions(int boxCount, int ballCount) {
  foreach (int[] balls in BallsLocations(boxCount, ballCount)) {
    // One list of ball numbers per box
    List<int>[] boxes = Enumerable
      .Range(0, boxCount)
      .Select(_ => new List<int>())
      .ToArray();

    for (int j = 0; j < balls.Length; ++j)
      boxes[balls[j]].Add(j + 1);

    yield return string.Join(Environment.NewLine, boxes
      .Select((item, index) => $"Box {index + 1} : {string.Join(", ", item.Select(b => $"B{b}"))}"));
  }
}
Demo:
int balls = 3;
int boxes = 2;
string report = string.Join(
  Environment.NewLine + "------------------" + Environment.NewLine,
  BallsSolutions(boxes, balls));

Console.Write(report);
Outcome:
Box 1 : B1, B2, B3
Box 2 :
------------------
Box 1 : B2, B3
Box 2 : B1
------------------
Box 1 : B1, B3
Box 2 : B2
------------------
Box 1 : B3
Box 2 : B1, B2
------------------
Box 1 : B1, B2
Box 2 : B3
------------------
Box 1 : B2
Box 2 : B1, B3
------------------
Box 1 : B1
Box 2 : B2, B3
------------------
Box 1 :
Box 2 : B1, B2, B3
There's a very simple recursive implementation that at each level adds the current ball to each box. The recursion ends when all balls have been processed.
Here's some Java code to illustrate. We use a Stack to represent each box so we can simply pop the last-added ball after each level of recursion.
void boxBalls(List<Stack<String>> boxes, String[] balls, int i)
{
    if(i == balls.length)
    {
        System.out.println(boxes);
        return;
    }
    for(Stack<String> box : boxes)
    {
        box.push(balls[i]);
        boxBalls(boxes, balls, i+1);
        box.pop();
    }
}
Test:
String[] balls = {"B1", "B2"};
List<Stack<String>> boxes = new ArrayList<>();
for(int i=0; i<3; i++) boxes.add(new Stack<>());
boxBalls(boxes, balls, 0);
Output:
[[B1, B2], [], []]
[[B1], [B2], []]
[[B1], [], [B2]]
[[B2], [B1], []]
[[], [B1, B2], []]
[[], [B1], [B2]]
[[B2], [], [B1]]
[[], [B2], [B1]]
[[], [], [B1, B2]]
I have the following dot file contents:
digraph G {
    start -> {a0, b0} -> end;
    start -> c0 -> c1 -> c2 -> end;
    start -> d0 -> d1 -> d2 -> end;
    start -> {e0, f0} -> end;
    subgraph cluster_test {
        {
            rank = same;
            a0; b0; c0; d0; e0; f0;
        }
        {
            rank = same;
            c1; d1;
        }
        {
            rank = same;
            c2; d2;
        }
    }
}
The resulting graph is as follows:
What I want is for the ordering of the level 0 nodes to be maintained, i.e., I want a0, b0 to come before c0, d0 in the horizontal direction.
How do I achieve this?
Empty nodes, weighted edges and an explicit ordering of the top row in the cluster help. See the annotated code below:
digraph so
{
    // we define all nodes in the beginning, before edges and clusters
    // may not be essential but I think it's good practice
    start
    a0 b0 c0 d0 e0 f0
    c1 d1
    c2 d2
    end
    // we define "empty" nodes that can be used to route the edges
    node[ shape = point, height = 0 ];
    ax bx ex fx
    subgraph cluster_test
    {
        // we need to keep explicit order of the top nodes in the cluster
        { rank = same; a0 -> b0 -> c0 -> d0 -> e0 -> f0[ style = invis ] }
        // the original layout in the cluster, empty nodes added at the bottom
        { rank = same; c1 d1 }
        { rank = same; ax bx c2 d2 ex fx }
        c0 -> c1 -> c2;
        d0 -> d1 -> d2;
        // routing through invisible nodes keeps the position of all other nodes
        // edges with no arrowheads, strong weight to keep it vertical
        edge[ dir = none, weight = 10 ]
        a0 -> ax;
        b0 -> bx;
        e0 -> ex;
        f0 -> fx;
    }
    // connecting to the start and end node, normal edges again
    edge[ weight = 1, dir = forward ];
    start -> { a0 b0 c0 d0 e0 f0 }
    { ax bx c2 d2 ex fx } -> end;
}
which gives you
Now I want to transform a list to a map, e.g.
TaskStat t1 = new TaskStat("foo1", "bar1", 1);
TaskStat t2 = new TaskStat("foo1", "bar2", 2);
ArrayList<TaskStat> list = newArrayList(t1, t2);
Map<String, List<TaskStat>> map = list.stream().collect(groupingBy(e -> e.getA() + "_" + e.getB()));
assertEquals(1,map.get("foo1_bar1").get(0).getCount());
The TaskStat list comes from a GROUP BY query:
select a, b, count(*) from t group by a, b
So every combination of a and b has only one record.
How can I transform the list into a Map<String, TaskStat> rather than a Map<String, List<TaskStat>>?
Use Collectors.toMap instead of Collectors.groupingBy:
Map<String, TaskStat> result =
    list.stream().collect(Collectors.toMap(ts -> ts.getA() + "_" + ts.getB(),
                                           Function.identity()));
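One caveat that may be worth illustrating: the two-argument `Collectors.toMap` throws an `IllegalStateException` if two elements produce the same key, whereas the three-argument overload takes a merge function that resolves the clash. The sketch below assumes a minimal stand-in for `TaskStat` (the real class is not shown in the question), and the names `ToMapSketch`, `strict` and `keepFirst` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class ToMapSketch {
    // Hypothetical stand-in for the question's TaskStat class.
    static class TaskStat {
        private final String a;
        private final String b;
        private final int count;

        TaskStat(String a, String b, int count) {
            this.a = a;
            this.b = b;
            this.count = count;
        }

        String getA() { return a; }
        String getB() { return b; }
        int getCount() { return count; }
    }

    // Same composite key as in the question: a + "_" + b.
    static String key(TaskStat ts) {
        return ts.getA() + "_" + ts.getB();
    }

    // Two-argument toMap: throws IllegalStateException on a duplicate key.
    static Map<String, TaskStat> strict(List<TaskStat> list) {
        return list.stream()
                .collect(Collectors.toMap(ToMapSketch::key, Function.identity()));
    }

    // Three-argument toMap: the merge function decides which element wins.
    static Map<String, TaskStat> keepFirst(List<TaskStat> list) {
        return list.stream()
                .collect(Collectors.toMap(ToMapSketch::key, Function.identity(),
                        (first, second) -> first));
    }

    public static void main(String[] args) {
        List<TaskStat> list = Arrays.asList(
                new TaskStat("foo1", "bar1", 1),
                new TaskStat("foo1", "bar2", 2));
        System.out.println(strict(list).get("foo1_bar1").getCount());
    }
}
```

Since the GROUP BY guarantees unique (a, b) pairs, the strict variant is probably the better fit here: a duplicate would then surface as an exception instead of being silently dropped.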
Below is the data
c1 p1 q1 d1
c2 p2 q2 d2
I need output like this: if a customer has purchased p1, the flag should be 1; otherwise the flag should be 0. There are millions of customers and millions of products. Below is the required output. Any help on this is highly appreciated.
c1 p1 q1 d1 1
c1 p2 q1 d1 0
c2 p2 q2 d2 1
c2 p1 q2 d2 0
You can achieve this with map-side logic alone. Sample code for your reference:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    String[] tokens = line.split("<your delim>");
    // Compare string contents with equals(); == only compares references
    if ("p1".equals(tokens[1])) {
        line = line + "<your delim>" + "1";
    } else if ("p2".equals(tokens[1])) {
        line = line + "<your delim>" + "0";
    }
    // Emit the flagged line as the key, with an empty value
    context.write(new Text(line), NullWritable.get());
}
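To try the per-line flagging logic outside Hadoop, it can be pulled into a plain static method. This is only a sketch under a few assumptions: the names `FlagSketch` and `appendFlag` are made up, the delimiter is treated as a literal string rather than a regex, and every product other than the target gets flag 0. Note that tokens produced by `split` are new `String` objects, so they have to be compared with `equals`, not `==`.

```java
import java.util.regex.Pattern;

class FlagSketch {
    // Appends delim + "1" if the second field equals target, delim + "0" otherwise.
    // Assumption: delim is a literal delimiter, hence Pattern.quote.
    static String appendFlag(String line, String delim, String target) {
        String[] tokens = line.split(Pattern.quote(delim));
        boolean match = tokens.length > 1 && target.equals(tokens[1]);
        return line + delim + (match ? "1" : "0");
    }

    public static void main(String[] args) {
        System.out.println(appendFlag("c1 p1 q1 d1", " ", "p1"));
        System.out.println(appendFlag("c2 p2 q2 d2", " ", "p1"));
    }
}
```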
Instead of counting words, I need to count letters.
But I have problems implementing this using Apache Pig version 0.8.1-cdh3u1.
Given the following input:
989;850;abcccc
29;395;aabbcc
The output should be:
989;850;a;1
989;850;b;1
989;850;c;4
29;395;a;2
29;395;b;2
29;395;c;2
Here is what I tried:
A = LOAD 'input' using PigStorage(';') as (x:int, y:int, content:chararray);
B = foreach A generate x, y, FLATTEN(STRSPLIT(content, '(?<=.)(?=.)', 6)) as letters;
C = foreach B generate x, y, FLATTEN(TOBAG(*)) as letters;
D = foreach C generate x, y, letters.letters as letter;
E = GROUP D BY (x,y,letter);
F = foreach E generate group.x as x, group.y as y, group.letter as letter, COUNT(D.letter) as count;
A, B and C can be dumped, but "dump D" results in "ERROR 2997: Unable to recreate exception from backed error: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple"
dump C displays the following (despite the third value being a weird tuple):
(989,850,a)
(989,850,b)
(989,850,c)
(989,850,c)
(989,850,c)
(989,850,c)
(29,395,a)
(29,395,a)
(29,395,b)
(29,395,b)
(29,395,c)
(29,395,c)
Here are the schemas:
grunt> describe A; describe B; describe C; describe D; describe E; describe F;
A: {x: int,y: int,content: chararray}
B: {x: int,y: int,letters: bytearray}
C: {x: int,y: int,letters: (x: int,y: int,letters: bytearray)}
D: {x: int,y: int,letter: bytearray}
E: {group: (x: int,y: int,letter: bytearray),D: {x: int,y: int,letter: bytearray}}
F: {x: int,y: int,letter: bytearray,count: long}
This Pig version doesn't seem to support TOBAG($2..$8), hence the TOBAG(*), which also includes x and y, but that could be sorted out syntactically later...
I'd like to avoid writing a UDF, otherwise I'd simply use the Java API directly.
But I don't really get the cast error. Can someone please explain it?
I'd propose writing a custom UDF instead. A quick, raw implementation would look like this:
package com.example;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
public class CharacterCount extends EvalFunc<DataBag> {
    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            Map<Character, Integer> charMap = new HashMap<Character, Integer>();
            DataBag result = bagFactory.newDefaultBag();
            int x = (Integer) input.get(0);
            int y = (Integer) input.get(1);
            String content = (String) input.get(2);
            // Count the occurrences of each character
            for (int i = 0; i < content.length(); i++){
                char c = content.charAt(i);
                Integer count = charMap.get(c);
                count = (count == null) ? 1 : count + 1;
                charMap.put(c, count);
            }
            // Emit one (x, y, letter, count) tuple per distinct character
            for (Map.Entry<Character, Integer> entry : charMap.entrySet()) {
                Tuple res = tupleFactory.newTuple(4);
                res.set(0, x);
                res.set(1, y);
                res.set(2, String.valueOf(entry.getKey()));
                res.set(3, entry.getValue());
                result.add(res);
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException("CharacterCount error", e);
        }
    }
}
Pack it in a jar and execute it:
register '/home/user/test/myjar.jar';
A = LOAD '/user/hadoop/store/sample/charcount.txt' using PigStorage(';')
as (x:int, y:int, content:chararray);
B = foreach A generate flatten(com.example.CharacterCount(x,y,content))
as (x:int, y:int, letter:chararray, count:int);
dump B;
(989,850,b,1)
(989,850,c,4)
(989,850,a,1)
(29,395,b,2)
(29,395,c,2)
(29,395,a,2)
I do not have version 0.8 at hand, but could you try this:
A = LOAD 'input' using PigStorage(';') as (x:int, y:int, content:chararray);
B = foreach A generate x, y, FLATTEN(STRSPLIT(content, '(?<=.)(?=.)', 6));
C = foreach B generate $0 as x, $1 as y, FLATTEN(TOBAG(*)) as letter;
E = GROUP C BY (x,y,letter);
F = foreach E generate group.x as x, group.y as y, group.letter as letter, COUNT(C.letter) as count;
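A note on the split pattern: `(?<=.)(?=.)` matches the empty position between any two characters, and the last argument (6) caps the number of resulting pieces. Assuming STRSPLIT follows the semantics of Java's `String.split(regex, limit)`, which is my understanding but not something I have verified on 0.8, the behaviour can be checked in plain Java. The class and method names below are made up for the sketch.

```java
import java.util.Arrays;

class SplitSketch {
    // Splits content between every pair of adjacent characters,
    // producing at most `limit` pieces when limit is positive.
    static String[] splitLetters(String content, int limit) {
        return content.split("(?<=.)(?=.)", limit);
    }

    public static void main(String[] args) {
        // Six characters, limit 6: one piece per letter
        System.out.println(Arrays.toString(splitLetters("abcccc", 6)));
        // Seven characters, limit 6: the last piece keeps the remainder
        System.out.println(Arrays.toString(splitLetters("abcdefg", 6)));
    }
}
```

With a limit of 6, a seven-letter value such as `abcdefg` ends up with `fg` as its last piece, so under that assumption the limit would have to be at least the length of the longest content value for every letter to be counted separately.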
You can try this:
grunt> a = load 'inputfile.txt' using PigStorage(';') as (c1:chararray, c2:chararray, c3:chararray);
grunt> b = foreach a generate c1,c2,FLATTEN(TOKENIZE(REPLACE(c3,'','^'),'^')) as split_char;
grunt> c = group b by (c1,c2,split_char);
grunt> d = foreach c generate group, COUNT(b);
grunt> dump d;
Output looks like this:
((29,395,a),2)
((29,395,b),2)
((29,395,c),2)
((989,850,a),1)
((989,850,b),1)
((989,850,c),4)