I'm trying to rewrite a well-known Spark text classification example (http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/) in Java 8.
I have a problem: in this code I'm preparing the data needed to compute the IDFs of all words in all files:
termDocsRdd.collect().stream().flatMap(doc -> doc.getTerms().stream()
.map(term -> new ImmutableMap.Builder<String, String>()
.put(doc.getName(),term)
.build())).distinct()
And I'm stuck on the groupBy operation. (I need to group this by term, so each term must be a key and the value must be a sequence of documents).
In Scala this operation looks very simple - .groupBy(_._2).
But how can I do this in Java?
I tried to write something like:
.groupingBy(term -> term, mapping((Document) d -> d.getDocNameContainsTerm(term), toList()));
but it's incorrect...
Does anybody know how to write this in Java?
Thank you very much.
If I understand you correctly, you want to do something like this:
(import static java.util.stream.Collectors.*;)
Map<Term, Set<Document>> collect = termDocsRdd.collect().stream().flatMap(
doc -> doc.getTerms().stream().map(term -> new AbstractMap.SimpleEntry<>(doc, term)))
.collect(groupingBy(Map.Entry::getValue, mapping(Map.Entry::getKey, toSet())));
The use of Map.Entry/AbstractMap.SimpleEntry is due to the absence of a standard Pair<K,V> class in Java 8. Map.Entry implementations can fulfill this role, but at the cost of type and method names that are unintuitive and verbose for something that only serves as a Pair implementation.
If you are using the current Eclipse version (I tested with LunaSR1 20140925) with its limited type inference, you have to help the compiler a little bit:
Map<Term, Set<Document>> collect = termDocsRdd.collect().stream().flatMap(
doc -> doc.getTerms().stream().<Map.Entry<Document,Term>>map(term -> new AbstractMap.SimpleEntry<>(doc, term)))
.collect(groupingBy(Map.Entry::getValue, mapping(Map.Entry::getKey, toSet())));
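For readers who want to try the grouping in isolation, here is a minimal, self-contained sketch of the same idea. It assumes a hypothetical Doc class (a name plus a list of String terms) in place of the question's document type, and it runs on a plain List rather than an RDD; none of these names come from the original code.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class TermGroupingSketch {

    // Hypothetical stand-in for the question's document type: a name plus its terms.
    static final class Doc {
        final String name;
        final List<String> terms;

        Doc(String name, String... terms) {
            this.name = name;
            this.terms = Arrays.asList(terms);
        }
    }

    // Builds a map from each term to the names of the documents that contain it.
    static Map<String, Set<String>> docsByTerm(List<Doc> docs) {
        return docs.stream()
                // Emit one (document name, term) pair per term occurrence.
                .flatMap(doc -> doc.terms.stream()
                        .<Map.Entry<String, String>>map(
                                term -> new AbstractMap.SimpleEntry<>(doc.name, term)))
                // Group by the term (entry value), collecting the document names (entry key).
                .collect(Collectors.groupingBy(
                        Map.Entry::getValue,
                        Collectors.mapping(Map.Entry::getKey, Collectors.toSet())));
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("a.txt", "spark", "java"),
                new Doc("b.txt", "java", "scala"));
        System.out.println(docsByTerm(docs));
    }
}
```

Collecting into a Set also removes duplicate (document, term) pairs, so the separate distinct() step from the question should not be needed in this sketch.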
Without thinking too hard about it I created a column of type [UUID] and successfully stored "strings" (as noted in the documentation, and generally referred to as a separate type altogether) returned from DB::generateKey in it.
Feels like I've done something I shouldn't have.
Can anyone shed some light on this? Thanks in advance.
Mostly they return different types.
For clarity, DB::generateKey is equivalent to Uuid::generate |> toString
According to the standard library docs, the difference is the return type:
Uuid::generate() -> UUID
Generate a new UUID v4 according to RFC 4122
DB::generateKey() -> Str
Returns a random key suitable for use as a DB key
I believe the UUID type is a bitstring representation, that is, a specific sequence of bits in memory.
Whereas the Str type is a string representation of a UUID.
I am looking to arrange my headers based on their code points, from low to high. Below is my attempt, and I was wondering if someone could advise me on whether I have done this correctly. I basically looked up the ASCII chart (ASCII Chart) to do this manually.
Action -> X-Amz-Algorithm -> X-Amz-Credential -> X-Amz-Date -> X-Amz-SignedHeaders -> X-Amz-Signature
You need to sort the headers. Depending on the programming language, this looks different, but the key word is always "sort".
Java example:
List<String> headers = Arrays.asList("Action", "X-Amz-Algorithm", "...");
headers.sort(Comparator.naturalOrder());
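As a fuller sketch of the same approach, the snippet below sorts the six names from the question. The class and method names are hypothetical. It relies on String's natural order, which compares UTF-16 code units; for ASCII-only names like these, that matches code point order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class HeaderSortSketch {

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sortByCodePoint(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "X-Amz-SignedHeaders", "X-Amz-Date", "Action",
                "X-Amz-Signature", "X-Amz-Credential", "X-Amz-Algorithm");
        // Print one name per line, lowest code points first.
        sortByCodePoint(names).forEach(System.out::println);
    }
}
```

With this input, X-Amz-Signature sorts before X-Amz-SignedHeaders, because the first differing characters are 'a' (0x61) and 'e' (0x65). Sorting programmatically avoids that kind of slip when reading the ASCII chart by hand.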
I have tried this code:
final List<ScheduleContainer> scheduleContainers = new ArrayList<>();
scheduleResponseContent.getSchedules().parallelStream().forEach(s -> scheduleContainers.addAll(s));
With parallelStream I get either an ArrayIndexOutOfBoundsException or a NullPointerException because some entries in scheduleContainers are null.
With ... .stream()... everything works fine.
My question now is whether there is a possibility to fix this, or did I misuse parallelStream?
Yes, you are misusing parallelStream. First of all, as was already said twice in your previous question, you should use stream(), and not parallelStream(), by default. Going parallel has an intrinsic cost that usually makes things less efficient than a simple sequential stream, unless you have a massive amount of data to process and the processing of each element takes time. You should have a performance problem, and measure whether a parallel stream solves it, before using one. There's also a much bigger chance of screwing up with a parallel stream, as your post shows.
Read Should I always use a parallel stream when possible? for more arguments.
Second, this code is not thread-safe at all, since it uses several concurrent threads to add to a thread-unsafe ArrayList. It can be safe if you use collect() to create the final list for you instead of forEach() and add things to the list by yourself.
The code should be
List<ScheduleContainer> scheduleContainers =
scheduleResponseContent.getSchedules()
.stream()
.flatMap(s -> s.stream())
.collect(Collectors.toList());
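To see why collect() is the safe choice, here is a small runnable sketch. It uses hypothetical names and plain String elements in place of ScheduleContainer, and it assumes the nested structure is a List of Lists. No lambda in the pipeline writes to a shared list; collect() builds and merges the partial results itself, so the same code works for sequential and parallel streams.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlattenSketch {

    // Flattens nested lists without mutating any shared state from inside the stream.
    static <T> List<T> flatten(List<List<T>> nested, boolean parallel) {
        return (parallel ? nested.parallelStream() : nested.stream())
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> nested = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e"));
        // Both calls yield the same elements in encounter order.
        System.out.println(flatten(nested, false));
        System.out.println(flatten(nested, true));
    }
}
```

Because the source lists are ordered, the collected result keeps the encounter order even when the stream runs in parallel.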
Not sure about the cause of the error, but there are better ways to use the Stream API to create a List from multiple input Lists.
final List<ScheduleContainer> scheduleContainers =
scheduleResponseContent.getSchedules()
.parallelStream()
.flatMap(s->s.stream()) // assuming getSchedules() returns some
// Collection<ScheduleContainer>, based
// on your use of addAll(s)
.collect(Collectors.toList());
For example, key 1 will have values "A","B","C" but key 2 will have value "D". If I use
Map<String, List<String>>
I need to populate the List<String> even when I have only single String value.
What data structure should be used in this case?
Map<String,List<String>> would be the standard way to do it (using a size-1 list when there is only a single item).
You could also have something like Map<String, Object> (which should work in either Java or presumably C#, to name two), where the value is either List<String> or String, but this would be fairly bad practice: there are readability issues (you can't tell what Object represents just from seeing the type), and casting happens at runtime, which isn't ideal, among other things.
It does, however, depend on what type of queries you plan to run. Map<String,Set<String>> might be a good idea if you plan on doing existence checks in the List and it can be large. Set<StringPair> (where StringPair is a class with 2 String members) is another consideration if there are plenty of keys with only 1 mapped value. There are plenty of solutions which would be more appropriate under various circumstances; it basically comes down to looking at the type of queries you want to perform and picking an appropriate structure according to that.
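If the main objection to Map<String, List<String>> is having to create the list by hand for single values, a thin wrapper can hide that. The sketch below is one possible shape, assuming Java 8 or later; the class and method names are hypothetical. Map.computeIfAbsent creates the list the first time a key is seen.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiValueMapSketch {

    private final Map<String, List<String>> values = new HashMap<>();

    // Creates the list on first use, so callers never build size-1 lists themselves.
    void put(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns the values stored for a key, or an empty list if the key is unknown.
    List<String> get(String key) {
        return values.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        MultiValueMapSketch map = new MultiValueMapSketch();
        map.put("1", "A");
        map.put("1", "B");
        map.put("1", "C");
        map.put("2", "D");
        System.out.println(map.get("1"));
        System.out.println(map.get("2"));
    }
}
```

Swapping ArrayList for a HashSet would give the Map<String,Set<String>> variant mentioned above, at the cost of losing insertion order and duplicates.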
I'm developing a Windows Phone app.
How can I get the language code from CultureInfo.CurrentCulture?
I'm using CultureInfo.CurrentCulture.Name and I'm getting 'en-US'. I only need en.
Have you tried using the TwoLetterISOLanguageName property?
I'm not sure exactly what you are trying to achieve. If all you want is to remove the region while retaining the script distinction (for example, if you are interested in zh-Hans and not just zh), then you will want to use the Parent property. This can return a legacy name (zh-CHS), though, so you would want to use the IetfLanguageTag property to resolve that:
CultureInfo.CurrentCulture.Parent.IetfLanguageTag
en-US -> en
zh-CN -> zh-Hans
zh-TW -> zh-Hant
Sometimes it still isn't going to give you the expected answer, since it will only return language tags that are supported (but this isn't any different from the TwoLetterISOLanguageName property):
az-Cyrl-AZ -> az
az-Latn-AZ -> az
And it seems like some of the chains were omitted:
sr-Cyrl-BA -> (Invariant)
You can check for invariant and then return the TwoLetterISOLanguageName property to work around that.