Combining Multiple Maps together in Pig - hadoop

I am using pig for the first time. I've gotten to the point where I have exactly the answer I want, but in a weirdly nested format:
{(price,49),(manages,"1d74426f-2b0a-4777-ac1b-042268cab09c")}
I'd like the output to be a single map, without any wrapping:
[price#49, manages#"1d74426f-2b0a-4777-ac1b-042268cab09c"]
I've managed to use TOMAP to get this far, but I can't figure out how to merge and flatten it away.
{([price_specification#{"amount":49,"currency":"USD"}]),([manages#"newest-nodes/1d74426f-2b0a-4777-ac1b-042268cab09c"])}
How should I be going about this?

Unfortunately, there are no built-in functions to do this for you. You'll have to write your own UDF. Fortunately, this is a simple one.
The exec method would just go something like:
public Map<String, Object> exec(Tuple input) throws IOException {
    Map<String, Object> m = new HashMap<String, Object>();
    for (int i = 0; i < input.size(); i++) {
        // Each field of the input tuple is expected to be a map.
        m.putAll((Map<String, Object>) input.get(i));
    }
    return m;
}
The UDF could take any number of maps as arguments.
Note that if two or more maps share a key, the last one encountered wins: its value overwrites the values from the earlier maps.
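For completeness, a self-contained version of such a UDF might look like the following. This is a minimal sketch; the class name MergeMaps is made up for illustration, and you would REGISTER the jar and invoke it like any other EvalFunc:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical class name, for illustration only.
public class MergeMaps extends EvalFunc<Map<String, Object>> {
    @Override
    @SuppressWarnings("unchecked")
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null) {
            return null;
        }
        Map<String, Object> merged = new HashMap<String, Object>();
        for (int i = 0; i < input.size(); i++) {
            // Later maps win when keys collide, as noted above.
            merged.putAll((Map<String, Object>) input.get(i));
        }
        return merged;
    }
}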

Related

Creating a stream of booleans from a boolean array? [duplicate]

Is there no nice way to convert a given boolean[] foo array into a stream in Java 8 in one statement, or am I missing something?
(I will not ask why, but it is really incomprehensible: why not add stream support for all primitive types?)
Hint: Arrays.stream(foo) will not work; there is no such method for the boolean[] type.
Given boolean[] foo, use
Stream<Boolean> stream = IntStream.range(0, foo.length)
                                  .mapToObj(idx -> foo[idx]);
Note that every boolean value will be boxed, but that's usually not a big problem, as boxing a boolean does not allocate additional memory (it just reuses one of the predefined instances, Boolean.TRUE or Boolean.FALSE).
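You can check that caching yourself if you like; autoboxing goes through Boolean.valueOf, which returns the shared constants:
public class BoxingCheck {
    public static void main(String[] args) {
        Boolean b = true;  // autoboxing compiles to Boolean.valueOf(true)
        // Reference equality holds because the same cached instance is reused.
        System.out.println(b == Boolean.TRUE);  // prints: true
    }
}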
You can use Guava's Booleans class:
Stream<Boolean> stream = Booleans.asList(foo).stream();
This is a pretty efficient way because Booleans.asList returns a wrapper for the array and does not make any copies.
Of course, you could create a stream directly:
Stream.Builder<Boolean> builder = Stream.builder();
for (boolean b : foo)
    builder.add(b);
Stream<Boolean> stream = builder.build();
…or by wrapping an AbstractList around foo
Stream<Boolean> stream = new AbstractList<Boolean>() {
    @Override public Boolean get(int index) { return foo[index]; }
    @Override public int size() { return foo.length; }
}.stream();
Skimming through the early-access JavaDoc (i.e. the java.base module) of the newest Java 15, there is still no neat way to make a primitive boolean array work well with the Stream API. Nothing new has been added for primitive boolean arrays since Java 8.
Note that IntStream, DoubleStream and LongStream exist, but there is nothing like a BooleanStream representing a sequence of primitive booleans. Likewise, Stream offers the overloaded methods Stream::mapToInt, Stream::mapToDouble and Stream::mapToLong, but no Stream::mapToBoolean returning such a hypothetical BooleanStream.
Oracle seems to keep following this pattern, which can also be found in Collectors. There is no such support for float primitives either (double is covered instead). In my opinion, unlike float, boolean support would make sense to implement.
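In the meantime, one common workaround (a sketch, not any standard BooleanStream API) is to project the flags onto an IntStream of 0/1 values and use the primitive stream operations from there:
import java.util.stream.IntStream;

public class BooleanStreamWorkaround {
    public static void main(String[] args) {
        boolean[] foo = {true, false, true, true};
        // Map each flag to 0 or 1 and count the set ones without boxing.
        int trueCount = IntStream.range(0, foo.length)
                .map(i -> foo[i] ? 1 : 0)
                .sum();
        System.out.println(trueCount);  // prints: 3
    }
}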
Back to the code... if you have a boxed boolean array (i.e. a Boolean[] array), things get easier:
Boolean[] array = ...
Stream<Boolean> streamOfBoxedBoolean1 = Arrays.stream(array);
Stream<Boolean> streamOfBoxedBoolean2 = Stream.of(array);
Otherwise, you have to use more than one statement, as said in this or this answer.
However, you asked (emphasis mine):
way to convert given boolean[] foo array into stream in Java-8 in one statement.
...there is actually a way to achieve this in one statement, using a Spliterator made from an Iterator. It is definitely not nice, but:
boolean[] array = ...
Stream<Boolean> stream = StreamSupport.stream(
        Spliterators.spliteratorUnknownSize(
                new Iterator<Boolean>() {
                    int index = 0;
                    @Override public boolean hasNext() { return index < array.length; }
                    @Override public Boolean next() { return array[index++]; }
                }, 0), false);

Quickly testing freemarker expressions

Today I was faced with the prospect of writing quite a few FreeMarker expressions. While the overall difficulty is not high, some of them will contain quite a few built-in calls (e.g. parsing a formatted string to a number, increasing it, and formatting it again).
That got me thinking: how can I test these during development to minimize the time spent? I know there are IDE tools that help with the syntax, but what about testing the functionality I wrote on sample strings? Something that would let me evaluate ${" b lah"?trim} and check whether the output is what I expect? Running my app is obviously a possibility, but in my case it takes way too long to get to the part where FreeMarker is used.
I would use a simple JUnit test to test that.
Example:
public class FreemarkerSandbox {
    @Test
    public void testFreemarker() throws TemplateException, IOException {
        Configuration cfg = new Configuration();
        // Point cfg at wherever trim.ftl lives, e.g.
        // cfg.setDirectoryForTemplateLoading(new File("src/test/resources"));
        Template template = cfg.getTemplate("trim.ftl", CharEncoding.UTF_8);
        HashMap<String, Object> model = new HashMap<String, Object>();
        Writer out = new StringWriter();
        template.process(model, out);
        assertEquals("b lah", out.toString());
    }
}
With trim.ftl just containing the expression to test.
${" b lah"?trim}
It can be improved by passing the test cases in the model.
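If you would rather not keep a template file around at all, you can also build the Template from an in-memory string. A minimal sketch, assuming the FreeMarker 2.3.x API (the class name InlineExpressionTest is made up for illustration):
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import freemarker.template.Configuration;
import freemarker.template.Template;

public class InlineExpressionTest {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration();
        // No template directory needed: the source is passed in directly.
        Template template = new Template("inline",
                new StringReader("${\" b lah\"?trim}"), cfg);
        StringWriter out = new StringWriter();
        template.process(new HashMap<String, Object>(), out);
        System.out.println(out);  // prints: b lah
    }
}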

Two arrays or one in Map structure?

I'm trying to create a map whose data will be static and will not change after the program starts (it is actually loaded from a server).
Is it better to have two arrays, e.g. in Java:
String keys[] = new String[10];
String values[] = new String[10];
where keys[i] corresponds to values[i]?
or to keep them in a single array, e.g.
String[][] map = new String[10][2];
where map[i][0] is the key and map[i][1] is the value?
Personally, the first makes more sense to me, but the second makes more sense to my partner. Is either better performance-wise? Easier to understand?
Update: I'm looking to do this in JavaScript, where Map and KeyValuePairs don't exist.
Using a Map implementation (in Java) would make this easier to understand as the association is clearer:
static final Map<String, String> my_map;
static
{
    my_map = new HashMap<String, String>();
    // Populate.
}
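Since the data never changes after startup, you could also wrap the map so that read-only access is enforced; a small sketch using the standard Collections API:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class StaticLookup {
    static final Map<String, String> MY_MAP;
    static {
        Map<String, String> m = new HashMap<String, String>();
        m.put("key1", "value1");
        m.put("key2", "value2");
        // The wrapper throws UnsupportedOperationException on any mutation.
        MY_MAP = Collections.unmodifiableMap(m);
    }
}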
A Hashtable (or better, a HashMap) looks like what you need. It hashes the keys in such a way that lookup happens in O(1) on average.
So, you're looking to do this in JavaScript. Every object in JS is a map, so you could just do
var mymap = {'key1':'value1','key2':'value2'};

Map Support in Shell Scripting

I am new to shell scripting, but I am familiar with Java maps. I just wanted to know how I can use a map in shell scripting. Below is the kind of thing I need to do, in Java:
HashMap<String, ArrayList<String>> users = new HashMap<String, ArrayList<String>>();

String username = "test_user1";
String address = "test_user1_address";
String emailId = "test_user1_emailId";

ArrayList<String> values = new ArrayList<String>();
values.add(address);
values.add(emailId);

users.put(username, values);

String anotherUser = "test_user2";
if (users.containsKey(anotherUser)) {
    System.out.println("Do some stuff here");
}
In short, I want a map with a String key and either a Vector or an ArrayList as the value (otherwise I can live with plain arrays instead of ArrayList and take care of the indexes manually), a put method to insert, and a method to check whether a key is present in the map.
The above code is just a sample.
Thank you in advance.
bash does not support nested structures like this. bash 4 does have associative arrays (declare -A), but their values are plain strings, not arrays, so you would have to encode the value list yourself (for example, by joining the values with a delimiter). Either use separate variables for each array, or use something more capable such as Python.

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is map a shard of the database (please don't ask why I need to do that, etc.) to a mapper, do certain operations on this data, output the results to reducers, and use that output again to do a second-phase map/reduce job on the same data using the same shard format.
Hadoop does not provide any input method that sends a shard of the database. You can only send data line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't help in this case either. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with FileSplits and all the problems associated with splitting the files.
All I need to do now is override the nextKeyValue() method, which I don't exactly know how to do.
for (int i = 0; i < shard_size; i++) {
    if (lineRecordReader.nextKeyValue()) {
        lineValue.append(lineRecordReader.getCurrentValue().getBytes(), 0,
                lineRecordReader.getCurrentValue().getLength());
    }
}
The above code snippet is the one I wrote, but somehow it doesn't work well.
I would suggest putting connection strings, and other indications of where to find the shard, into your input files.
The mapper will take this information, connect to the database and do its job. I would not suggest converting result sets to Hadoop's Writable classes - it would hinder performance.
The problem I see to be addressed is having enough splits of this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build small splits. The second way is more flexible.
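If you go with shard references in the input files, one way to get those small splits is to let NLineInputFormat hand each mapper a fixed number of reference lines. A sketch, assuming the Hadoop 2.x org.apache.hadoop.mapreduce API; the helper class ShardJobSetup is made up for illustration:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Hypothetical helper, for illustration only.
public class ShardJobSetup {
    static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf);
        // Each input line holds a shard reference; give one line per mapper.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        return job;
    }
}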
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to the mappers as input.
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    // Concatenate five input lines into one value for the mapper.
    for (int i = 0; i < 5; i++) {
        if (!lineRecordReader.nextKeyValue()) {
            // Note: a final batch of fewer than five lines is dropped here.
            return false;
        }
        lineKey = lineRecordReader.getCurrentKey();
        lineValue = lineRecordReader.getCurrentValue();
        sb.append(lineValue.toString());
        sb.append(eol);
    }
    lineValue.set(sb.toString());
    return true;
}
