How can I convert a Set<Result> into a Map<Item, Set<String>> or SetMultimap<Item, String> using Java 8 streams or Multimaps, where Result is:
class Result {
    String name;
    Set<Item> items;
}
For example, I start with:
result1:
name: name1
items:
- item1
- item2
result2:
name: name2
items:
- item2
- item3
and end with:
item1:
- name1
item2:
- name1
- name2
item3:
- name2
The two important methods in the code snippet below are Stream.flatMap and Collectors.mapping:
import java.util.Map.Entry;
import java.util.AbstractMap.SimpleEntry;
import static java.util.stream.Collectors.*;
Map<Item, Set<String>> map = results.stream()
    // map Stream<Result> to Stream<Entry<Item, String>>
    .flatMap(it -> it.items.stream().map(item -> new SimpleEntry<>(item, it.name)))
    // group Result.name by Item
    .collect(groupingBy(Entry::getKey, mapping(Entry::getValue, toSet())));
Once Collectors.flatMapping arrives in JDK 9, it could look a little different. Also, your Item instances need to be comparable in some way (equals/hashCode for a plain HashMap, or a Comparator if you supply a sorted map to groupingBy, say by a field itemName):
Map<Item, Set<String>> map = results.stream()
    .collect(Collectors.flatMapping(
        (Result result) -> result.getItems()
            .stream()
            .map((Item item) -> new AbstractMap.SimpleEntry<>(item, result.getName())),
        Collectors.groupingBy(Entry::getKey,
            () -> new TreeMap<>(Comparator.comparing(Item::getItemName)),
            Collectors.mapping(Entry::getValue, Collectors.toSet()))));
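If you would rather end up with the Guava SetMultimap<Item, String> mentioned in the question, a minimal sketch (assuming Guava 21+ is on the classpath, which provides the Multimaps.toMultimap collector) could be:
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimaps;
import com.google.common.collect.SetMultimap;

// flatMap to (item, name) entries as above, then collect them into a HashMultimap
SetMultimap<Item, String> multimap = results.stream()
    .flatMap(r -> r.items.stream().map(item -> new SimpleEntry<>(item, r.name)))
    .collect(Multimaps.toMultimap(Entry::getKey, Entry::getValue, HashMultimap::create));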
systems:
- name: Fred
country: DE
- name: Wilma
country: US
- name: Pebbles
country: DE
- name: Dino
country: US
---
# Systems
Countries: {{ page.systems | join(",", "country") }}
I am trying to create a GitBook page with a list of items containing no duplicates, i.e. I want to apply a 'unique' or 'distinct' filter in my Nunjucks template for the page. The template needs to process the page variables (YAML). The above template generates the output:
Countries: DE,US,DE,US
I would like it to produce the output
Countries: DE,US
How could I achieve that (given that a 'unique' filter is not built into Nunjucks)?
You can extend Nunjucks with a custom filter:
const nunjucks = require('nunjucks');
const env = new nunjucks.Environment(/* loaders etc... */);
// keep only the first occurrence of each element; pass non-arrays through unchanged
env.addFilter('unique', arr => arr instanceof Array && arr.filter((e, i, a) => a.indexOf(e) === i) || arr);
let out = env.renderString(`{{[1, 2, 3, 2] | unique }}`);
console.log(out);
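For the GitBook template above, the same idea can be made attribute-aware so the country values are deduplicated before joining. The filter name uniqueAttr below is just an illustration, and how you register custom filters depends on how your GitBook build exposes the Nunjucks environment:
// Hypothetical attribute-aware filter: keep one entry per distinct attribute value.
env.addFilter('uniqueAttr', (arr, attr) =>
  Array.isArray(arr)
    ? arr.map(o => o[attr]).filter((v, i, a) => a.indexOf(v) === i)
    : arr);
With that in place, the template line could read:
Countries: {{ page.systems | uniqueAttr("country") | join(",") }}
which would render Countries: DE,US.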
I'm using Spark 2.1 on a YARN cluster. I have an RDD that contains data I would like to complete based on other RDDs (which correspond to different Mongo databases that I get through https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage, but I don't think that matters; I just mention it in case).
My problem is that the RDD I have to use to complete the data depends on the data itself, because each record contains the database to use. Here is a simplified example of what I have to do:
/*
* The RDD which needs information from databases
*/
val RDDtoDevelop = sc.parallelize(Array(
Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
.cache()
/*
* Artificial databases for the example. In reality, mongo-hadoop is used: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
* This means that generating these RDDs is COSTLY, so we don't want to generate all possible RDDs, only the needed ones
*/
val A = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1"),
Map("id" -> "id8", "data" -> "data8")
))
val B = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1bis"),
Map("id" -> "id5", "data" -> "data5")
))
val C = sc.parallelize(Array(
Map("id" -> "id2", "data" -> "data2"),
Map("id" -> "id6", "data" -> "data6")
))
val generateRDDfromdbName = Map("A" -> A, "B" -> B, "C" -> C)
and the wanted output is:
Map(dbName -> A, id -> id8, other data -> some other other data, new data -> data8)
Map(dbName -> A, id -> id1, other data -> some data, new data -> data1)
Map(dbName -> C, id -> id6, other data -> some other data, new data -> data6)
Since nested RDDs are not possible, I would like to find the best way to make as much use as I can of Spark parallelism. I thought about 2 solutions.
The first is to create a collection with the contents of the needed dbs, then convert it to an RDD to benefit from RDD scalability (if the collection doesn't fit into driver memory, I could do it in several passes). Finally, do a join and filter the content on id.
The second is to get the RDDs from all needed databases, key them by (dbName, id) and then do the join.
Here is the Scala code:
Solution 1
// Get all needed DB
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Fill a list with key value pairs as (dbName,db content)
var dbContents = List[(String,Array[Map[String,String]])]()
dbList.foreach(dbName => dbContents = (dbName,generateRDDfromdbName(dbName).collect()) :: dbContents)
// Generate an RDD from this list to benefit from the advantages of RDDs
val RDDdbs = sc.parallelize(dbContents)
// Key the initial RDD by dbName and join with the contents of dbs
val joinedRDD = RDDtoDevelop.keyBy(map => map("dbName")).join(RDDdbs)
// Check for matched ids between RDD data to develop and dbContents
val result = joinedRDD.map({ case (s, (maptoDevelop, content)) => maptoDevelop + ("new data" -> content.find(mapContent => mapContent("id") == maptoDevelop("id")).get("data")) })
Solution 2
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Create the list of the database RDDs keyed by (dbName, id)
var dbRDDList = List[RDD[((String,String),Map[String,String])]]()
dbList.foreach(dbName => dbRDDList = generateRDDfromdbName(dbName).keyBy(map => (dbName,map("id"))) :: dbRDDList)
// Create a RDD containing all dbRDD
val RDDdbs = sc.union(dbRDDList)
// Join the initial RDD based on the key with the dbRDDs
val joinedRDD = RDDtoDevelop.keyBy(map => (map("dbName"), map("id"))).join(RDDdbs)
// Reformat the result
val result = joinedRDD.map({ case ((dbName,id),(maptoDevelop,dbmap)) => maptoDevelop + ("new data" -> dbmap("data"))})
Both of them give the wanted output. To my mind, the second one seems better since the matching of the db and of the id uses the parallelism of Spark, but I'm not sure of that. Could you please help me choose the best one, or even better, give me clues for a better solution than mine?
Any other comment is appreciated (it's my first question on the site ;) ).
Thanks in advance,
Matt
I would suggest you convert your RDDs to DataFrames; then joins, distinct and the other functions you want to apply to the data will be very easy.
DataFrames are distributed, and in addition to the DataFrame APIs, SQL queries can be used. More information can be found in the Spark SQL, DataFrames and Datasets Guide and in Introducing DataFrames in Apache Spark for Large Scale Data Science. Moreover, the foreach and collect calls that make your code run slow won't be needed.
An example of converting RDDtoDevelop to a DataFrame is below:
val RDDtoDevelop = sc.parallelize(Array(
Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
.cache()
Converting the above RDD to a DataFrame (outside the spark-shell, toDF needs the implicits import from your SQLContext or SparkSession; note that value.values.toList relies on the insertion order of the small Maps):
val developColumns = RDDtoDevelop.take(1).flatMap(map => map.keys)
val developDF = RDDtoDevelop.map { value =>
  val list = value.values.toList
  (list(0), list(1), list(2))
}.toDF(developColumns: _*)
And the DataFrame looks as below:
+------+---+---------------------+
|dbName|id |other data |
+------+---+---------------------+
|A |id1|some data |
|C |id6|some other data |
|A |id8|some other other data|
+------+---+---------------------+
Converting your A RDD to a DataFrame is as below.
Source code for A:
val A = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1"),
Map("id" -> "id8", "data" -> "data8")
))
DataFrame code for A:
val aColumns = A.take(1).flatMap(map => map.keys)
val aDF = A.map { value =>
  val list = value.values.toList
  (list(0), list(1))
}.toDF(aColumns: _*).withColumn("name", lit("A"))
A new name column is added with the database name so that the final join with developDF matches on the right database.
Output for DataFrame A:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id1|data1|A |
|id8|data8|A |
+---+-----+----+
You can convert B and C in similar ways.
Source for B:
val B = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1bis"),
Map("id" -> "id5", "data" -> "data5")
))
DataFrame code for B:
val bColumns = B.take(1).flatMap(map => map.keys)
val bDF = B.map { value =>
  val list = value.values.toList
  (list(0), list(1))
}.toDF(bColumns: _*).withColumn("name", lit("B"))
Output for B:
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1bis|B |
|id5|data5 |B |
+---+--------+----+
Source for C:
val C = sc.parallelize(Array(
Map("id" -> "id2", "data" -> "data2"),
Map("id" -> "id6", "data" -> "data6")
))
DataFrame code for C:
val cColumns = C.take(1).flatMap(map => map.keys)
val cDF = C.map { value =>
  val list = value.values.toList
  (list(0), list(1))
}.toDF(cColumns: _*).withColumn("name", lit("C"))
Output for C:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id2|data2|C |
|id6|data6|C |
+---+-----+----+
After the conversion, A, B and C can be merged using union
var unionDF = aDF.union(bDF).union(cDF)
Which would be
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1 |A |
|id8|data8 |A |
|id1|data1bis|B |
|id5|data5 |B |
|id2|data2 |C |
|id6|data6 |C |
+---+--------+----+
Then it's just a matter of joining developDF and unionDF, after renaming the id column of unionDF so it can be dropped later on.
unionDF = unionDF.withColumnRenamed("id", "id1")
unionDF = developDF.join(unionDF, developDF("id") === unionDF("id1") && developDF("dbName") === unionDF("name"), "left").drop("id1", "name")
Finally we have
+------+---+---------------------+-----+
|dbName|id |other data |data |
+------+---+---------------------+-----+
|A |id1|some data |data1|
|C |id6|some other data |data6|
|A |id8|some other other data|data8|
+------+---+---------------------+-----+
You can do whatever processing you need after that.
Note: the lit function requires the following import:
import org.apache.spark.sql.functions._
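If you want to get back to the RDD-of-Map shape from the question, one minimal sketch could be the following (column positions follow the final DataFrame above; with a left join, data can be null for unmatched rows):
// Sketch: map each joined row back into a Map[String, String] like the original RDD.
val resultRDD = unionDF.rdd.map { row =>
  Map(
    "dbName"     -> row.getString(0),
    "id"         -> row.getString(1),
    "other data" -> row.getString(2),
    "new data"   -> row.getString(3)) // null when the left join found no match
}
resultRDD.collect().foreach(println)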
I have started playing with Bacon.js, and I have come across a problem I can't find any example for.
I have a set of nodes, which may reference other nodes.
It looks like this in imperative code:
alice = { name: "Alice" }
bob = { name: "Bob" }
carol = { name: "Carol" }
alice.likes = bob
bob.likes = carol
alice.likes.name //=> "Bob"
bob.name = "Bobby"
alice.likes.name //=> "Bobby"
alice.likes = carol
alice.likes.name //=> "Carol"
Now, I would like to have a Property for alice.likes.name, which has to change
whenever alice.likes points to a different object, or the name property of the
current object in alice.likes changes.
I have come up with the code below (LiveScript syntax), which correctly logs 3 messages: Bob, Bobby, Carol.
I'm using Bus for testing purposes.
mkBus = (initialValue) ->
  bus = new Bacon.Bus()
  property = bus.toProperty(initialValue)
  property.push = (newValue) -> bus.push(newValue)
  property
alice = { pName: mkBus("Alice"), pLikes: mkBus(null) }
bob = { pName: mkBus("Bob"), pLikes: mkBus(null) }
carol = { pName: mkBus("Carol"), pLikes: mkBus(null) }
alice.pLikes.onValue (person) ->
  if person
    person.pName.onValue (vName) ->
      console.log vName
alice.pLikes.push(bob)
bob.pLikes.push(carol)
# Change name
bob.pName.push("Bobby")
# Change reference
alice.pLikes.push(carol)
Question: How can I make a Property that represents alice.likes.name?
I.e.:
nameOfAlicesFavourite.onValue (vName) ->
  console.log vName
I'm new to FRP, so let me know if I'm doing something horribly wrong.
Thanks @Bergi for pointing me to flatMap.
flatMap creates a new stream for every event in the source. I used flatMapLatest, so that only the name changes from the latest liked person are transmitted on the output stream.
Here's the code:
nameOfAlicesFavourite = alice.pLikes.flatMapLatest (person) ->
  if person
    person.pName
  else
    Bacon.never!
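For reference, a rough plain-JavaScript equivalent of the same idea (assuming Bacon.js is available as Bacon and the pLikes/pName properties are built as above):
// Switch to the latest liked person's name property; emit nothing while nobody is liked.
var nameOfAlicesFavourite = alice.pLikes.flatMapLatest(function (person) {
  return person ? person.pName : Bacon.never();
});

nameOfAlicesFavourite.onValue(function (vName) {
  console.log(vName);
});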
I am trying to read this JSON file into a Hive table; the top-level keys, i.e. 1, 2, ..., are not consistent.
{
  "1":"{\"time\":1421169633384,\"reading1\":130.875969,\"reading2\":227.138275}",
  "2":"{\"time\":1421169646476,\"reading1\":131.240628,\"reading2\":226.810211}",
  "position": 0
}
I only need the time and readings 1 and 2 as columns in my Hive table; position can be ignored.
I can also do a combination of a Hive query and Spark map-reduce code.
Thank you for the help.
Update: here is what I am trying:
val hqlContext = new HiveContext(sc)
val rdd = sc.textFile(data_loc)
val json_rdd = hqlContext.jsonRDD(rdd)
json_rdd.registerTempTable("table123")
println(json_rdd.printSchema())
hqlContext.sql("SELECT json_val from table123 lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val ").foreach(println)
It throws the following error:
Exception in thread "main" org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: SELECT json_val from temp_hum_table lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:239)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
This would work if you rename "1" and "2" (the key names) to "x1" and "x2" (inside the JSON file or in the RDD):
val resultrdd = sqlContext.sql("SELECT x1.time, x1.reading1, x1.reading2, x2.time, x2.reading1, x2.reading2 from table123")
resultrdd.flatMap(row => (Array( (row(0),row(1),row(2)), (row(3),row(4),row(5)) )))
This would give you an RDD of tuples with time, reading1 and reading2. If you need a SchemaRDD, you would map it to a case class inside the flatMap transformation, like this:
case class Record(time: Long, reading1: Double, reading2: Double)
val records = resultrdd.flatMap(row => Array(Record(row.getLong(0), row.getDouble(1), row.getDouble(2)),
                                             Record(row.getLong(3), row.getDouble(4), row.getDouble(5))))
val schrdd = sqlContext.createSchemaRDD(records)
Update:
In the case of many nested keys, you can parse the row like this:
val allrdd = sqlContext.sql("SELECT * from table123")
allrdd.flatMap(row => {
  var recs = Array[Record]()
  for (col <- 0 to row.length - 1) {
    row(col) match {
      case r: Row => recs = recs :+ Record(r.getLong(2), r.getDouble(0), r.getDouble(1))
      case _ => ;
    }
  }
  recs
})
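As for renaming the keys "1" and "2" to "x1" and "x2" in the RDD rather than in the file, one possible sketch is a plain string replace on the raw text before parsing (this assumes the keys only ever appear as "1": and "2": in the input):
// Rename the top-level keys on the raw JSON text, then parse and register as before.
val renamed = rdd.map(_.replace("\"1\":", "\"x1\":").replace("\"2\":", "\"x2\":"))
val json_rdd = hqlContext.jsonRDD(renamed)
json_rdd.registerTempTable("table123")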
I'm a new Pig user.
I have an existing schema which I want to modify. My source data is as follows with 6 columns:
Name    Type    Date        Region    Op    Value
-------------------------------------------------
john    ab      20130106    D         X     20
john    ab      20130106    D         C     19
john    ab      20130106    D         T     8
john    ab      20130106    E         C     854
john    ab      20130106    E         T     67
john    ab      20130106    E         X     98
and so on. Each Op value is always C, T or X.
I basically want to split my data in the following way into 7 columns:
Name    Type    Date        Region    OpX    OpC    OpT
--------------------------------------------------------
john    ab      20130106    D         20     19     8
john    ab      20130106    E         98     854    67
Basically, split the Op column into 3 columns, one for each Op value. Each of these columns should contain the appropriate value from the Value column.
How can I do this in Pig?
One way to achieve the desired result:
IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
     date:int, region:chararray, op:chararray, value:int);
A = order IN by op asc;
B = group A by (name, type, date, region);
C = foreach B {
  bs = STRSPLIT(BagToString(A.value, ','), ',', 3);
  generate flatten(group) as (name, type, date, region),
    bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}
describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX:
chararray,OpC: chararray,OpT: chararray}
dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
Update:
If you want to skip the order by, which adds an additional reduce phase to the computation, you can prefix each value with its corresponding op in the tuple v, and then sort the tuple fields with a custom UDF to get the desired OpX, OpC, OpT order:
register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
    date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
  v = foreach A generate CONCAT(op, (chararray)value);
  bs = STRSPLIT(BagToString(v, ','), ',', 3);
  generate flatten(group) as (name, type, date, region),
    flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
where TupleArrange in myjar.jar is something like this:
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            Tuple inputTuple = (Tuple) input.get(0);
            String[] tupleArr = new String[] {
                (String) inputTuple.get(0),
                (String) inputTuple.get(1),
                (String) inputTuple.get(2)
            };
            Arrays.sort(tupleArr); // ascending: Cxxx, Txxx, Xxxx
            result.set(0, tupleArr[2].substring(1)); // OpX
            result.set(1, tupleArr[0].substring(1)); // OpC
            result.set(2, tupleArr[1].substring(1)); // OpT
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input;
    }
}