Implement a Hadoop map function with JavaPairRDD the Spark way - hadoop

I have an RDD:
JavaPairRDD<Long, ViewRecord> myRDD
which is created via the newAPIHadoopRDD method. I have an existing map function that I want to implement the Spark way:
LongWritable one = new LongWritable(1L);
protected void map(Long key, ViewRecord viewRecord, Context context)
throws IOException ,InterruptedException {
String url = viewRecord.getUrl();
long day = viewRecord.getDay();
tuple.getKey().set(url);
tuple.getValue().set(day);
context.write(tuple, one);
};
PS: tuple is derived from:
KeyValueWritable<Text, LongWritable>
and can be found here: TextLong.java

I don't know what tuple is, but if you just want to map each record to a tuple with key (url, day) and value 1L, you can do it like this:
JavaPairRDD<Tuple2<String, Long>, Long> result = myRDD
    .values()
    .mapToPair(viewRecord -> {
        String url = viewRecord.getUrl();
        long day = viewRecord.getDay();
        return new Tuple2<>(new Tuple2<>(url, day), 1L);
    });
// Java 7 style
JavaPairRDD<Pair, Long> result = myRDD
    .values()
    .mapToPair(new PairFunction<ViewRecord, Pair, Long>() {
        @Override
        public Tuple2<Pair, Long> call(ViewRecord record) throws Exception {
            String url = record.getUrl();
            Long day = record.getDay();
            return new Tuple2<>(new Pair(url, day), 1L);
        }
    });
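If the original MapReduce job then summed the ones per key in a reducer, the Spark equivalent is a reduceByKey on the result above (a sketch based on the Java 8 variant; the counts name is mine):
// sum the 1L values per (url, day) key, mirroring a summing reducer
JavaPairRDD<Tuple2<String, Long>, Long> counts = result.reduceByKey((a, b) -> a + b);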

Related

Kafka stream groupByKey not working for count()

I am trying to generate counts based on keys using the code below, which is based on the word count example. Strangely, if the mapValues function returns a String then the groupBy works as mentioned in the commented line, but it does not when I return a key pair with a String as key and a GenericRecord as value.
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
final Map<String, String> serdeConfig = Collections.singletonMap("schema.registry.url","http://localhost:8081");
stringSerde.configure(serdeConfig, true); // `true` for record keys
final Serde<GenericRecord> valueGenericAvroSerde = new GenericAvroSerde();
valueGenericAvroSerde.configure(serdeConfig, false); // `false` for record values
StreamsBuilder builder = new StreamsBuilder();
KStream<String, GenericRecord> textLines =
builder.stream("ora-query-in",Consumed.with(stringSerde, valueGenericAvroSerde));
final KTable<String, Long> wordCounts = textLines
.mapValues(new ValueMapperWithKey<String, GenericRecord, KeyValue<String, GenericRecord>>() {
@Override
public KeyValue<String, GenericRecord> apply(String arg0, GenericRecord arg1) {
return new KeyValue<String, GenericRecord>(arg1.get("KEY_FIELD").toString(),arg1);
}
})
// .groupBy((key, value) -> value) //THIS WORKS if value is STRING
// .groupBy((key, value) -> key) //DOES NOT WORK EITHER
.groupByKey() //THIS does nothing
.count();
wordCounts.toStream().to("test.topic.out",Produced.with(stringSerde, longSerde));
Am I missing something in the configuration?
streamsConfiguration.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
You haven't written what exactly is wrong, but it seems to be an issue with serialization.
You can use:
KStream::groupBy(final KeyValueMapper<? super K, ? super V, KR> selector, final Grouped<KR, V> grouped).
someStream.groupBy((key, value) -> value, Grouped.with(newKeySerdes, valueSerdes));
KGroupedStream::count(final Materialized<K, Long, KeyValueStore<Bytes, byte[]>> materialized)
someGroupedStream.count(Materialized.with(newKeySerdes, valueSerdes));
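Putting the two together on the topology from the question, a rough sketch (assuming the same serdes and topic names as above; selecting the key directly in groupBy replaces the mapValues step):
final KTable<String, Long> counts = textLines
    .groupBy(
        // pick the new key out of the Avro record
        (key, value) -> value.get("KEY_FIELD").toString(),
        // serdes for the repartition topic
        Grouped.with(stringSerde, valueGenericAvroSerde))
    // serdes for the count store: String key, Long value
    .count(Materialized.with(stringSerde, longSerde));
counts.toStream().to("test.topic.out", Produced.with(stringSerde, longSerde));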
It can be the same reason as in:
Kafka Streams 2.1.1 class cast while flushing timed aggregation to store
KafkaStreams: Getting Window Final Results

Update list items in PagingLibrary w/o using Room (Network only)

I'm using the Paging Library to load data from the network using ItemKeyedDataSource. After fetching items the user can edit them; these updates are done in an in-memory cache (no database like Room is used).
Now since the PagedList itself cannot be updated (discussed here) I have to recreate PagedList and pass it to the PagedListAdapter.
The update itself is no problem, but after updating the RecyclerView with the new PagedList, the list jumps back to the beginning, destroying the previous scroll position. Is there any way to update the PagedList while keeping the scroll position (like how it works with Room)?
DataSource is implemented this way:
public class MentionKeyedDataSource extends ItemKeyedDataSource<Long, Mention> {
private Repository repository;
...
private List<Mention> cachedItems;
public MentionKeyedDataSource(Repository repository, ..., List<Mention> cachedItems){
super();
this.repository = repository;
this.teamId = teamId;
this.inboxId = inboxId;
this.filter = filter;
this.cachedItems = new ArrayList<>(cachedItems);
}
@Override
public void loadInitial(@NonNull LoadInitialParams<Long> params, final @NonNull ItemKeyedDataSource.LoadInitialCallback<Mention> callback) {
Observable.just(cachedItems)
.filter(items -> items != null && !items.isEmpty())
.switchIfEmpty(repository.getItems(..., params.requestedLoadSize).map(...))
.subscribeOn(Schedulers.io())
.observeOn(AndroidSchedulers.mainThread())
.subscribe(response -> callback.onResult(response.data.list));
}
@Override
public void loadAfter(@NonNull LoadParams<Long> params, final @NonNull ItemKeyedDataSource.LoadCallback<Mention> callback) {
repository.getOlderItems(..., params.key, params.requestedLoadSize)
.subscribeOn(Schedulers.io())
.observeOn(AndroidSchedulers.mainThread())
.subscribe(response -> callback.onResult(response.data.list));
}
@Override
public void loadBefore(@NonNull LoadParams<Long> params, final @NonNull ItemKeyedDataSource.LoadCallback<Mention> callback) {
repository.getNewerItems(..., params.key, params.requestedLoadSize)
.subscribeOn(Schedulers.io())
.observeOn(AndroidSchedulers.mainThread())
.subscribe(response -> callback.onResult(response.data.list));
}
@NonNull
@Override
public Long getKey(#NonNull Mention item) {
return item.id;
}
}
The PagedList is created like this:
PagedList.Config config = new PagedList.Config.Builder()
.setPageSize(PAGE_SIZE)
.setInitialLoadSizeHint(preFetchedItems != null && !preFetchedItems.isEmpty()
? preFetchedItems.size()
: PAGE_SIZE * 2
).build();
pagedMentionsList = new PagedList.Builder<>(new MentionKeyedDataSource(mRepository, team.id, inbox.id, mCurrentFilter, preFetchedItems)
, config)
.setFetchExecutor(ApplicationThreadPool.getBackgroundThreadExecutor())
.setNotifyExecutor(ApplicationThreadPool.getUIThreadExecutor())
.build();
The PagedListAdapter is created like this:
public class ItemAdapter extends PagedListAdapter<Item, ItemAdapter.ItemHolder> { //Adapter from google guide, Nothing special here.. }
mAdapter = new ItemAdapter(new DiffUtil.ItemCallback<Item>() {
    @Override
    public boolean areItemsTheSame(Item oldItem, Item newItem) {
        return oldItem.id == newItem.id;
    }
    @Override
    public boolean areContentsTheSame(Item oldItem, Item newItem) {
        return oldItem.equals(newItem);
    }
});
and updated like this:
mAdapter.submitList(pagedList);
P.S: If there is a better way to update list items without using Room please share.
All you need to do is invalidate the current DataSource each time you update your data.
What I would do:
Move the networking logic from MentionKeyedDataSource to a new class that extends PagedList.BoundaryCallback and set it when building your PagedList.
Move all your data to some DataProvider that holds all your downloaded data and has a reference to the DataSource. Each time the data is updated in the DataProvider, invalidate the current DataSource.
Something like this, maybe:
val dataProvider = PagedDataProvider()
val dataSourceFactory = InMemoryDataSourceFactory(dataProvider = dataProvider)
where
class PagedDataProvider : DataProvider<Int, DataRow> {
private val dataCache = mutableMapOf<Int, List<DataRow>>()
override val sourceLiveData = MutableLiveData<DataSource<Int, DataRow>>()
//implement data add/remove/update here
//and after each modification call
//sourceLiveData.value?.invalidate()
}
and
class InMemoryDataSourceFactory<Key, Value>(
private val dataProvider: DataProvider<Key, Value>
) : DataSource.Factory<Key, Value>() {
override fun create(): DataSource<Key, Value> {
val source = InMemoryDataSource(dataProvider = dataProvider)
dataProvider.sourceLiveData.postValue(source)
return source
}
}
This approach is very similar to what Room does: every time a table row is updated, the current DataSource is invalidated and the DataSourceFactory creates a new data source.
You can modify items directly in your adapter through currentList, like this:
class ItemsAdapter() : PagedListAdapter<Item, ItemsAdapter.ViewHolder>(ITEMS_COMPARATOR) {
...
fun changeItem(position: Int,newData:String) {
currentList?.get(position)?.data = newData
notifyItemChanged(position)
}
}

Gson: How do I deserialize an inner JSON object to a map if the property name is not fixed?

My client retrieves JSON content as below:
{
"table": "tablename",
"update": 1495104575669,
"rows": [
{"column5": 11, "column6": "yyy"},
{"column3": 22, "column4": "zzz"}
]
}
In the rows array, the keys are not fixed. I want to retrieve the keys and values and save them into a Map using Gson 2.8.x.
How can I configure Gson to deserialize this simply?
Here is my idea:
public class Dataset {
private String table;
private long update;
private List<Rows> lists; <-- little confused here,
or private List<HashMap<String, Object>> lists
Setter/Getter
}
public class Rows {
private HashMap<String, Object> map;
....
}
Dataset k = gson.fromJson(jsonStr, Dataset.class);
log.info(k.getRows().size()); <-- I get two null objects
Thanks.
Gson does not support such a thing out of the box. It would be nice if you could make the property name fixed. If not, then you have a few options that would probably help you.
Just rename the Dataset.lists field to Dataset.rows, if the property name is in fact fixed to rows.
If the possible name set is known in advance, tell Gson to pick up alternative names using @SerializedName.
If the possible name set is really unknown and may change in the future, you might want to try to make it fully dynamic using a custom TypeAdapter (streaming mode; requires less memory, but harder to use) or a custom JsonDeserializer (object mode; requires more memory to store intermediate tree views, but it's easy to use) registered with GsonBuilder.
For option #2, you can simply add the alternative names:
@SerializedName(value = "lists", alternate = "rows")
final List<Map<String, Object>> lists;
For option #3, bind a downstream List<Map<String, Object>> type adapter and try to detect the name dynamically. Note that I omit the Rows class deserialization strategy for simplicity (and I believe you might want to remove the Rows class in favor of a simple Map<String, Object>). Another note: prefer Map and try not to specify collection implementations; hash maps are unordered, but telling Gson you're dealing with a Map lets it pick an ordered map like LinkedTreeMap (Gson internals) or LinkedHashMap, which may matter for datasets.
// Type tokens are immutable and can be declared constants
private static final TypeToken<String> stringTypeToken = new TypeToken<String>() {
};
private static final TypeToken<Long> longTypeToken = new TypeToken<Long>() {
};
private static final TypeToken<List<Map<String, Object>>> stringToObjectMapListTypeToken = new TypeToken<List<Map<String, Object>>>() {
};
private static final Gson gson = new GsonBuilder()
.registerTypeAdapterFactory(new TypeAdapterFactory() {
@Override
public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
if ( typeToken.getRawType() != Dataset.class ) {
return null;
}
// If the actual type token represents the Dataset class, then pick the bunch of downstream type adapters
final TypeAdapter<String> stringTypeAdapter = gson.getDelegateAdapter(this, stringTypeToken);
final TypeAdapter<Long> primitiveLongTypeAdapter = gson.getDelegateAdapter(this, longTypeToken);
final TypeAdapter<List<Map<String, Object>>> stringToObjectMapListTypeAdapter = gson.getDelegateAdapter(this, stringToObjectMapListTypeToken);
// And compose the bunch into a single dataset type adapter
final TypeAdapter<Dataset> datasetTypeAdapter = new TypeAdapter<Dataset>() {
@Override
public void write(final JsonWriter out, final Dataset dataset) {
// Omitted for brevity
throw new UnsupportedOperationException();
}
@Override
public Dataset read(final JsonReader in)
throws IOException {
in.beginObject();
String table = null;
long update = 0;
List<Map<String, Object>> lists = null;
while ( in.hasNext() ) {
final String name = in.nextName();
switch ( name ) {
case "table":
table = stringTypeAdapter.read(in);
break;
case "update":
update = primitiveLongTypeAdapter.read(in);
break;
default:
lists = stringToObjectMapListTypeAdapter.read(in);
break;
}
}
in.endObject();
return new Dataset(table, update, lists);
}
}.nullSafe(); // Making the type adapter null-safe
@SuppressWarnings("unchecked")
final TypeAdapter<T> typeAdapter = (TypeAdapter<T>) datasetTypeAdapter;
return typeAdapter;
}
})
.create();
final Dataset dataset = gson.fromJson(jsonReader, Dataset.class);
System.out.println(dataset.lists);
The code above would then print:
[{column5=11.0, column6=yyy}, {column3=22.0, column4=zzz}]
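For the snippet above to compile, the Dataset class needs a constructor matching new Dataset(table, update, lists); a minimal sketch (field names chosen to mirror the question) could look like this:
public final class Dataset {

    final String table;
    final long update;
    // the dynamically named property ("rows" here) ends up in this list
    final List<Map<String, Object>> lists;

    Dataset(final String table, final long update, final List<Map<String, Object>> lists) {
        this.table = table;
        this.update = update;
        this.lists = lists;
    }
}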

Beginner at Spark big data programming (Spark code)

I'm learning Spark for distributed systems. I ran this code and it worked.
I know that it counts words in the input files, but I have trouble understanding how the methods are written and what the use of JavaRDD is.
public class JavaWordCount {
public static void main(String[] args) throws Exception {
System.out.print("le programme commence");
//String inputFile = "/mapr/demo.mapr.com/TestMapr/Input/alice.txt";
String inputFile = args[0];
String outputFile = args[1];
// Create a Java Spark Context.
System.out.print("le programme cree un java spark contect");
SparkConf conf = new SparkConf().setAppName("JavaWordCount");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load our input data.
System.out.print("Context créeS");
JavaRDD<String> input = sc.textFile(inputFile);
// map/split each line to multiple words
System.out.print("le programme divise le document en multiple line");
JavaRDD<String> words = input.flatMap(
new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String x) {
return Arrays.asList(x.split(" "));
}
}
);
System.out.print("Turn the words into (word, 1) pairse");
// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> wordOnePairs = words.mapToPair(
new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String x) {
return new Tuple2(x, 1);
}
}
);
System.out.print(" // reduce add the pairs by key to produce counts");
// reduce add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts = wordOnePairs.reduceByKey(
new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer x, Integer y) {
return x + y;
}
}
);
System.out.print(" Save the word count back out to a text file, causing evaluation.");
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile(outputFile);
System.out.println(counts.collect());
sc.close();
}
}
As mentioned by PinoSan, this question is probably too generic, and you should be able to find your answer in any Spark getting-started guide or tutorial.
Let me point you to some interesting content:
Spark Quick Start Guide
Getting Started with Apache Spark, ebook
Introduction to Apache Spark with Examples and Use Cases
Disclaimer: I work for MapR, which is why I listed online Spark resources from the MapR site.
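As a side note, the same pipeline from the question reads much more compactly with Java 8 lambdas; a rough sketch (assuming the same Spark 1.x Java API as the question, where flatMap returns an Iterable):
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JavaWordCount"));
JavaRDD<String> input = sc.textFile(inputFile);                                 // one String per line
JavaRDD<String> words = input.flatMap(line -> Arrays.asList(line.split(" ")));  // split lines into words
JavaPairRDD<String, Integer> counts = words
    .mapToPair(word -> new Tuple2<>(word, 1))                                   // (word, 1) pairs
    .reduceByKey((x, y) -> x + y);                                              // sum counts per word
counts.saveAsTextFile(outputFile);
sc.close();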

Any suggestions for reading two different datasets into Hadoop at the same time?

Dear hadooper:
I'm new to Hadoop, and recently tried to implement an algorithm.
This algorithm needs to calculate a matrix representing the rating differences between every pair of songs. I already did this, and the output is a 600000*600000 sparse matrix which I stored in HDFS. Let's call this dataset A (size = 160 GB).
Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profiles first (5 GB in size), let's call this dataset B, and then do the calculation using dataset A.
But now I don't know how to read the two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed system, and I can't read dataset B into a single machine's memory.)
Any suggestions?
You can use two map functions; each map function can process one dataset if you want to implement different processing. You need to register each mapper with your job conf. For example:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text> output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input outpur file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf();
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(args[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}
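The FullOuterJoinReducer referenced in the driver is not shown above; a minimal sketch (hypothetical, it just separates the values back out by the file_tag prefixes the two mappers add and emits their combinations) could be:
public static class FullOuterJoinReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
    @Override
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException
    {
        List<String> persons = new ArrayList<String>();
        List<String> authors = new ArrayList<String>();
        while (values.hasNext())
        {
            String value = values.next().toString();
            if (value.startsWith("person_book#"))
                persons.add(value.substring("person_book#".length()));
            else if (value.startsWith("auth_book#"))
                authors.add(value.substring("auth_book#".length()));
        }
        // emit every combination of the two tagged sides for this book title
        for (String person : persons)
            for (String author : authors)
                output.collect(key, new Text(person + "," + author));
    }
}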
Hadoop allows you to use different input formats (and mappers) for different folders. So you can read from several data sources and then cast to the specific type in the map function, i.e. in one case you get (String, User), in the other (String, SongRating), and your map signature is (String, Object).
The second step is selecting the recommendation algorithm: join that data in some way so the aggregator has at least enough information to calculate the recommendation.
