I have a Map with the following structure, and I want to flip the keys and values.
Map<String, List<String>> dataMap
Sample data:
acct01: [aa, ab, ad],
acct02: [ac, ad],
acct03: [ax, ab]
I want this data to be converted to:
aa: [acct01],
ab: [acct01, acct03],
ac: [acct02],
ad: [acct01, acct02],
ax: [acct03]
I want to know if there is a Java 8 Stream way to transform this Map.
My current implementation (without Streams):
Map<String, List<String>> originalData = new HashMap<String, List<String>>();
originalData.put("Acct01", Arrays.asList("aa", "ab", "ad"));
originalData.put("Acct02", Arrays.asList("ac", "ad"));
originalData.put("Acct03", Arrays.asList("ax", "ab"));
System.out.println(originalData);
Map<String, List<String>> newData = new HashMap<String, List<String>>();
originalData.entrySet().forEach(entry -> {
entry.getValue().forEach(v -> {
if(newData.get(v) == null) {
List<String> t = new ArrayList<String>();
t.add(entry.getKey());
newData.put(v, t);
} else {
newData.get(v).add(entry.getKey());
}
});
});
System.out.println(newData);
Input and output:
{Acct01=[aa, ab, ad], Acct02=[ac, ad], Acct03=[ax, ab]}
{aa=[Acct01], ab=[Acct01, Acct03], ac=[Acct02], ad=[Acct01, Acct02], ax=[Acct03]}
Looking for a way to implement this using Streams.
Get the stream for the entry set, flatten it out into one entry per key-value pair, group by value, collect associated keys into a list.
import static java.util.Arrays.asList;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.mapping;
import static java.util.stream.Collectors.toList;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
<K, V> Map<V, List<K>> invert(Map<K, List<V>> map) {
return map.entrySet().stream().flatMap(
entry -> entry.getValue().stream().map(
value -> new SimpleImmutableEntry<>(
entry.getKey(),
value
)
)
).collect(
groupingBy(
Entry::getValue,
mapping(
Entry::getKey,
toList()
)
)
);
}
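For example, calling it on the originalData map from the question (a minimal usage sketch; entry and list ordering depend on HashMap, so the printed order may differ):
Map<String, List<String>> newData = invert(originalData);
System.out.println(newData);
// e.g. {aa=[Acct01], ab=[Acct01, Acct03], ac=[Acct02], ad=[Acct01, Acct02], ax=[Acct03]}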
Here is a solution using StreamEx, a library that extends the Java 8 Stream API:
newData = EntryStream.of(originalData).invert().flatMapKeys(k -> k.stream()).grouping();
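Expanded into a self-contained form (a sketch, assuming the StreamEx library is on the classpath; EntryStream lives in the one.util.streamex package):
import one.util.streamex.EntryStream;
import java.util.List;
import java.util.Map;

Map<String, List<String>> newData = EntryStream.of(originalData)
        .invert()                   // EntryStream<List<String>, String>
        .flatMapKeys(List::stream)  // EntryStream<String, String>
        .grouping();                // Map<String, List<String>>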
If you are open to using a third-party library like Eclipse Collections, you can use a ListMultimap (each key can have a List of values). Multimap has flip(), so this will work:
MutableListMultimap<String, String> originalData = Multimaps.mutable.list.empty();
originalData.putAll("Acct01", Arrays.asList("aa", "ab", "ad"));
originalData.putAll("Acct02", Arrays.asList("ac", "ad"));
originalData.putAll("Acct03", Arrays.asList("ax", "ab"));
System.out.println(originalData);
MutableBagMultimap<String, String> newData = originalData.flip();
System.out.println(newData);
input: {Acct03=[ax, ab], Acct02=[ac, ad], Acct01=[aa, ab, ad]}
output: {ac=[Acct02], ad=[Acct02, Acct01], aa=[Acct01], ab=[Acct03, Acct01], ax=[Acct03]}
Please note that flip() returns a BagMultimap, where each key has a Bag of values. A Bag is a data structure that is unordered and allows duplicates.
Note: I am a committer for Eclipse Collections.
Your current implementation already relies on features of Java 8. The forEach method was added to a number of data structures in Java 8, and while you could use streams, there would be little point: much of the advantage of streams comes from being able to execute filters, sorting, and other operations lazily, which does not apply to remapping keys.
If you really wanted to, you could sprinkle in a couple of streams by changing every .forEach in your example to .stream().forEach.
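For illustration, that mechanical change would look roughly like this (same logic as the question's loop, with computeIfAbsent standing in for the explicit null check):
Map<String, List<String>> newData = new HashMap<>();
originalData.entrySet().stream().forEach(entry ->
        entry.getValue().stream().forEach(v ->
                newData.computeIfAbsent(v, k -> new ArrayList<>()).add(entry.getKey())));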
I am not able to understand the use of Tuple.getStringByField("ABC") in Apache Storm.
The following is the code:
public void execute(Tuple input) {
    try {
        if (input.getSourceStreamId().equals("signals")) {
            str = input.getStringByField("action");
            if ("refresh".equals(str)) {
                // ....
            }
        }
    } // ...
What exactly is input.getStringByField("action") doing here?
Thank you.
In Storm, both spouts and bolts emit tuples. The question is what each tuple contains. Each spout and bolt can use the method below to define the tuple schema:
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer)
{
    // tell Storm the schema of the output tuple:
    // the tuple consists of fields called 'mycolumn1' and 'mycolumn2'
    outputFieldsDeclarer.declare(new Fields("mycolumn1", "mycolumn2"));
}
A downstream bolt can then use getStringByField("mycolumn1") to retrieve the value by field name.
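For example, an upstream component emits values in the same order as the declared fields, and a downstream bolt can then read them by name (a sketch; collector stands for the component's output collector):
collector.emit(new Values("someValue1", "someValue2"));   // matches ("mycolumn1", "mycolumn2")

// inside the downstream bolt's execute(Tuple input):
String col1 = input.getStringByField("mycolumn1");        // "someValue1"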
getStringByField() is like getString(), except it looks up the field by its name instead of its position.
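In other words, given the schema declared above, these two calls return the same value:
String byPosition = input.getString(0);                   // field at index 0
String byName = input.getStringByField("mycolumn1");      // same field, looked up by name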
I would like to have YAML files with an include, similar to this question, but with Snakeyaml:
How can I include an YAML file inside another?
For example:
%YAML 1.2
---
!include "load.yml"
!include "load2.yml"
I am having a lot of trouble with it. I have the Constructor defined, and I can make it import one document, but not two. The error I get is:
Exception in thread "main" expected '<document start>', but found Tag
in 'reader', line 5, column 1:
!include "load2.yml"
^
With one include, SnakeYAML is happy: it finds an EOF and processes the import. With two, it is not happy (see above).
My java source is:
package yaml;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.AbstractConstruct;
import org.yaml.snakeyaml.constructor.Constructor;
import org.yaml.snakeyaml.nodes.Node;
import org.yaml.snakeyaml.nodes.ScalarNode;
import org.yaml.snakeyaml.nodes.Tag;
public class Main {
final static Constructor constructor = new MyConstructor();
private static class ImportConstruct extends AbstractConstruct {
@Override
public Object construct(Node node) {
if (!(node instanceof ScalarNode)) {
throw new IllegalArgumentException("Non-scalar !import: " + node.toString());
}
final ScalarNode scalarNode = (ScalarNode)node;
final String value = scalarNode.getValue();
File file = new File("src/imports/" + value);
if (!file.exists()) {
return null;
}
try {
final InputStream input = new FileInputStream(new File("src/imports/" + value));
final Yaml yaml = new Yaml(constructor);
return yaml.loadAll(input);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
return null;
}
}
private static class MyConstructor extends Constructor {
public MyConstructor() {
yamlConstructors.put(new Tag("!include"), new ImportConstruct());
}
}
public static void main(String[] args) {
try {
final InputStream input = new FileInputStream(new File("src/imports/example.yml"));
final Yaml yaml = new Yaml(constructor);
Object object = yaml.load(input);
System.out.println("Loaded");
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
finally {
}
}
}
The question is: has anybody done a similar thing with SnakeYAML? Any thoughts as to what I might be doing wrong?
I see two issues:
final InputStream input = new FileInputStream(new File("src/imports/" + value));
final Yaml yaml = new Yaml(constructor);
return yaml.loadAll(input);
You should be using yaml.load(input), not yaml.loadAll(input). The loadAll() method returns multiple objects, but the construct() method expects to return a single object.
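Concretely, the tail of construct() would become something like this (a sketch of just that change, using try-with-resources so the stream is also closed; catching IOException covers both opening and closing):
try (InputStream input = new FileInputStream(new File("src/imports/" + value))) {
    final Yaml yaml = new Yaml(constructor);
    return yaml.load(input);   // load() returns the single object that construct() must return
} catch (IOException ex) {
    ex.printStackTrace();
}
return null;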
The other issue is that you may have some inconsistent expectations with the way that the YAML processing pipeline works:
If you think that your !include works like in C where the preprocessor sticks in the contents of the included file, the way to implement it would be to handle it in the Presentation stage (parsing) or Serialization stage (composing). But you have implemented it in the Representation stage (constructing), so !include returns an object, and the structure of your YAML file must be consistent with this.
Let's say that you have the following files:
test1a.yaml
activity: "herding cats"
test1b.yaml
33
test1.yaml
favorites: !include test1a.yaml
age: !include test1b.yaml
This would work ok, and would be equivalent to
favorites:
  activity: "herding cats"
age: 33
But the following file would not work:
!include test1a.yaml
!include test1b.yaml
because there is nothing to say how to combine the two values in a larger hierarchy. You'd need to do this, if you want an array:
- !include test1a.yaml
- !include test1b.yaml
or, again, handle this custom logic in an earlier stage such as parsing or composing.
Alternatively, you need to tell the YAML library that you are starting a 2nd document (which is what the error is complaining about: expected '<document start>') since YAML supports multiple "documents" (top-level values) in a single .yaml file.
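If you go that route, example.yml would separate the two includes with a "---" document marker, and main() would iterate over yaml.loadAll() instead of calling yaml.load() (a sketch based on the asker's code):
// example.yml now contains two documents, separated by "---":
//   %YAML 1.2
//   ---
//   !include "load.yml"
//   ---
//   !include "load2.yml"
final Yaml yaml = new Yaml(constructor);
try (InputStream input = new FileInputStream(new File("src/imports/example.yml"))) {
    for (Object document : yaml.loadAll(input)) {
        System.out.println(document);   // one constructed object per document
    }
} catch (IOException ex) {
    ex.printStackTrace();
}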
I am trying to play with MultipleOutputFormat.generateFileNameForKeyValue().
The idea is to create a directory for each of my keys.
This is the code:
static class MyMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        String[] arr = key.toString().split("_");
        return arr[0] + "/" + name;
    }
}
This code works only if the emitted records are few. If I run the code against my real input, it just hangs in the reducer at around 70%.
What might be the problem here: why does it work for a small number of keys but not for many?
My requirement is to write a custom partitioner. I have N keys coming from the mapper, for example 'jsa', 'msa', 'jbac'. The length is not fixed; it can in fact be any word. My requirement is to write a custom partitioner in such a way that it collects all the data for the same key into the same file. The number of keys is not fixed. Thank you in advance.
Thanks,
Sathish.
So you have multiple keys which the mapper is outputting, and you want a different reducer for each key and a separate file for each key.
First, writing a Partitioner is one way to achieve that. By default, Hadoop has its own internal logic that it applies to keys and, depending on that, it calls reducers. So if you want to write a custom partitioner, you have to override that default behaviour with your own logic/algorithm. Unless you know exactly how your keys will vary, this logic won't be generic; you have to work out the logic based on those variations.
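For reference, the default behaviour mentioned above is Hadoop's HashPartitioner, whose getPartition() does roughly the following:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // hash the key, mask off the sign bit, then take it modulo the number of reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}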
I am providing a sample example here that you can refer to, but it is not generic.
public class CustomPartitioner extends Partitioner<Text, Text>
{
@Override
public int getPartition(Text key, Text value, int numReduceTasks)
{
if(key.toString().contains("Key1"))
{
return 1;
}else if(key.toString().contains("Key2"))
{
return 2;
}else if(key.toString().contains("Key3"))
{
return 3;
}else if(key.toString().contains("Key4"))
{
return 4;
}else if(key.toString().contains("Key5"))
{
return 5;
}else
{
return 7;
}
}
}
This should solve your problem. Just replace Key1, Key2, etc. with your key names, and make sure you configure at least as many reducers as the largest partition number the method can return.
In case you don't know the key names, you can write your own logic along the following lines:
public class CustomPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        return (key.toString().charAt(0)) % numReduceTasks;
    }
}
The partitioner above is just to illustrate how you can write your own logic: if you take the first character of the key and apply the % operation with the number of reducers, you get a number between 0 and numReduceTasks - 1, so different reducers get called and the output ends up in different files. With this approach, however, you have to make sure that no two distinct keys map to the same partition.
That was the customized-partitioner approach.
Another solution is to override the MultipleOutputFormat class methods, which lets you do the job in a generic way. With this approach you can also generate customized file names for the reducer output files in HDFS.
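For example, something along these lines (a sketch using the older org.apache.hadoop.mapred API; the class name KeyBasedOutputFormat and the directory layout are just illustrative):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // write each key's records under a directory named after the key,
        // keeping the default part file name (e.g. myKey/part-00000)
        return key.toString() + "/" + name;
    }
}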
NOTE: Make sure you use consistent libraries. Don't mix the mapred and mapreduce APIs: org.apache.hadoop.mapred is the older API and org.apache.hadoop.mapreduce is the newer one.
Hope this will help.
I imagine the best way to do this, since it will give a more even split, would be:
public class CustomPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        // mask off the sign bit so a negative hashCode cannot produce a negative partition number
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
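Whichever partitioner you choose, remember to register it on the job (a sketch, assuming the newer org.apache.hadoop.mapreduce API; the job name is arbitrary):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "partition-by-key");
job.setPartitionerClass(CustomPartitioner.class);
job.setNumReduceTasks(10);   // enough reducers that each partition gets its own output file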