In my code I have partitioned the data into three parts, but in the output I am only getting the output returned by the 0th partition, even though I set the number of reducers to 3.
My code:
public static class customPartitioner extends Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String country = value.toString();
        if (numReduceTasks == 0)
            return 0;
        if (key.equals(new Text("key1")) && !value.equals(new Text("valuemy")))
            return 1 % numReduceTasks;
        if (value.equals(new Text("valueother")) && key.equals(new Text("key1")))
            return 0;
        else
            return 2 % numReduceTasks;
    }
}
and I set the number of reducers as
job.setNumReduceTasks(3);
It is giving me only the output of the 0th partition, i.e. of the records for which getPartition returns 0.
I was making a very silly mistake. The partitioning was working fine in my code. I thought the whole output would be in the part-r-00000 file, and that the job splits the data only to reduce load and then shows the output combined. I was wrong: each partition is written to its own output file, so the different partitions have different outputs.
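For anyone who wants to verify the same thing, here is a minimal sketch (the output directory is taken as a command-line argument purely for illustration) that lists the per-reducer files through the HDFS FileSystem API; with job.setNumReduceTasks(3) you should see part-r-00000, part-r-00001 and part-r-00002:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPartitionOutputs {
    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the job's output directory
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            // one part-r-NNNNN file per reducer, plus the _SUCCESS marker
            System.out.println(status.getPath().getName() + " (" + status.getLen() + " bytes)");
        }
    }
}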
Related
I need to validate user input, and if it doesn't meet the conditions I need to replace it with correct input. So far I am stuck on two parts. I'm fairly new to Java 8 and not so familiar with all the libraries, so if you can give me advice on where to read up more on these I would appreciate it.
List<String> input = Arrays.asList(args);
List<String> validatedinput = input.stream()
.filter(p -> {
if (p.matches("[0-9, /,]+")) {
return true;
}
System.out.println("The value has to be positve number and not a character");
//Does the new input actually get saved here?
sc.nextLine();
return false;
}) //And here I am not really sure how to map the String object
.map(String::)
.validatedinput(Collectors.toList());
This type of logic shouldn't be done with streams; a while loop is a better fit for it.
First, let's partition the data into two lists, one list representing the valid inputs and the other representing invalid inputs:
Map<Boolean, List<String>> resultSet =
Arrays.stream(args)
.collect(Collectors.partitioningBy(s -> s.matches(yourRegex),
Collectors.toCollection(ArrayList::new)));
Then create the while loop to ask the user to correct all their invalid inputs:
int i = 0;
List<String> invalidInputs = resultSet.get(false);
final int size = invalidInputs.size();
while (i < size) {
    System.out.println("The value --> " + invalidInputs.get(i) +
            " has to be positive number and not a character");
    String temp = sc.nextLine();
    if (temp.matches(yourRegex)) {
        resultSet.get(true).add(temp);
        i++;
    }
}
Now, you can collect the list of all the valid inputs and do what you like with it:
List<String> result = resultSet.get(true);
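Putting the pieces together, a minimal runnable sketch might look like the following (the regex and the Scanner are stand-ins for whatever validation rule and input source you actually use):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.stream.Collectors;

public class InputValidation {
    public static void main(String[] args) {
        String yourRegex = "[0-9]+"; // stand-in: accept only non-negative integers
        Scanner sc = new Scanner(System.in);

        // split the arguments into valid and invalid inputs
        Map<Boolean, List<String>> resultSet = Arrays.stream(args)
                .collect(Collectors.partitioningBy(s -> s.matches(yourRegex),
                        Collectors.toCollection(ArrayList::new)));

        // ask the user to replace each invalid input until a valid value is given
        List<String> invalidInputs = resultSet.get(false);
        int i = 0;
        while (i < invalidInputs.size()) {
            System.out.println("The value --> " + invalidInputs.get(i)
                    + " has to be a positive number and not a character");
            String temp = sc.nextLine();
            if (temp.matches(yourRegex)) {
                resultSet.get(true).add(temp);
                i++;
            }
        }

        List<String> result = resultSet.get(true);
        System.out.println("Validated input: " + result);
    }
}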
I am trying to compare two tables and test the results using Gherkin, but I don't know how to make it declare two lists in the @When section instead of one, like it is shown below:
#When("^the integer table :$")
public void the_integer_table_(List<Integer> comp1, List<Integer> comp2) {
for(int i = 0; i < comp1.size(); i++) {
compare[i] = comp1.get(i);
}
for(int i = 0; i < comp2.size(); i++) {
compare2[i] = comp1.get(i);
}
comparer.comparer_tableau( compare, compare2);
}
Here is my .feature file:
Scenario: Compare the elements of two tables and return a boolean table as a result
Given I am starting a comparison operation
When these two integer table are entered :
|1|2|3|4|5|
|0|2|5|4|5|
Then I should see the correct answer is:
|false|true|false|true|true|
Here is what I get when I run it:
#When("^these two integer table two are entered :$") public void these_two_integer_table_two_are_entered_(DataTable arg1) { }
P.S: I did try and look for solutions but didn't find any.
You can change the step definition as given below, get the values of each row, and then compare them.
#When("^these two integer table are entered :$")
public void these_two_integer_table_are_entered(DataTable arg1) throws Throwable {
List<DataTableRow> lstRows=arg1.getGherkinRows();
Integer[] compare=new Integer[lstRows.get(0).getCells().size()];
System.out.println(compare.length);
Integer[] compare2=new Integer[lstRows.get(1).getCells().size()];
System.out.println(compare2.length);
//Get the first row values
for(int i = 0; i < compare.length; i++) {
compare[i] = Integer.valueOf(lstRows.get(0).getCells().get(i));
System.out.println(compare[i]);
}
//Get the second row values
for(int i = 0; i < compare2.length; i++) {
compare2[i] = Integer.valueOf(lstRows.get(1).getCells().get(i));
System.out.println(compare2[i]);
}
comparer.comparer_tableau( compare, compare2);
}
AFAIK, you can have only one table per step, but you can get each row's values inside your step definition.
Cucumber will interpret your When step as taking one DataTable:
When these two integer table are entered :
|1|2|3|4|5|
|0|2|5|4|5|
Instead of trying to enter two tables in one step, try entering them in two separate steps (one table per step) like this:
When the following integer table is entered :
|1|2|3|4|5|
And the following integer table is entered :
|0|2|5|4|5|
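A single step definition can then back both the When and the And line; here is a minimal sketch (the class and the enteredTables field are purely illustrative) that sticks with the getGherkinRows() API shown in the earlier answer:

import java.util.ArrayList;
import java.util.List;
import cucumber.api.DataTable;
import cucumber.api.java.en.When;

public class TableSteps {
    // illustrative holder: the first call stores table1, the second call (the And line) stores table2
    private final List<Integer[]> enteredTables = new ArrayList<>();

    @When("^the following integer table is entered :$")
    public void the_following_integer_table_is_entered(DataTable table) throws Throwable {
        List<String> cells = table.getGherkinRows().get(0).getCells();
        Integer[] values = new Integer[cells.size()];
        for (int i = 0; i < cells.size(); i++) {
            values[i] = Integer.valueOf(cells.get(i));
        }
        enteredTables.add(values);
    }
}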
I would use a scenario outline like this and handle the logic for retrieving the values of each column in the step definition (split the value string by commas and use the values accordingly); a sketch of such a step definition follows the example table below.
Scenario Outline: Compare the elements of two tables and return a boolean table as a result
Given I am starting a comparison operation
When integer '<table1>' values are entered
When integer '<table2>' values are entered
Then I should see the correct '<answer>'
Examples:
| table1 | table2 | answer |
| 1,2,3,4,5 | 0,2,5,4,5 | false,true,false,true,true |
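A minimal sketch of a matching step definition (the class name is illustrative, and the same splitting idea applies to the answer column):

import cucumber.api.java.en.When;

public class OutlineSteps {

    @When("^integer '(.+)' values are entered$")
    public void integer_values_are_entered(String csv) throws Throwable {
        // '<table1>' / '<table2>' arrive here as a comma-separated string, e.g. "1,2,3,4,5"
        String[] parts = csv.split(",");
        Integer[] values = new Integer[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Integer.valueOf(parts[i].trim());
        }
        // hand the parsed values to whatever performs the comparison,
        // e.g. store them until both tables have been entered
    }
}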
I have written a Spark job, which looks like below:
public class TestClass {
public static void main(String[] args){
String masterIp = args[0];
String appName = args[1];
String inputFile = args[2];
String output = args[3];
SparkConf conf = new SparkConf().setMaster(masterIp).setAppName(appName);
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> rdd = sparkContext.textFile(inputFile);
Integer[] keyColumns = new Integer[] {0,1,2};
Broadcast<Integer[]> broadcastJob = sparkContext.broadcast(keyColumns);
Function<Integer,Long> createCombiner = v1 -> Long.valueOf(v1);
Function2<Long, Integer, Long> mergeValue = (v1,v2) -> v1+v2;
Function2<Long, Long, Long> mergeCombiners = (v1,v2) -> v1+v2;
JavaPairRDD<String, Long> pairRDD = rdd.mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID = -6293440291696487370L;
@Override
public Tuple2<String, Integer> call(String t) throws Exception {
String[] record = t.split(",");
Integer[] keyColumns = broadcastJob.value();
StringBuilder key = new StringBuilder();
for (int index = 0; index < keyColumns.length; index++) {
key.append(record[keyColumns[index]]);
}
key.append("|id=1");
Integer value = new Integer(record[4]);
return new Tuple2<String, Integer>(key.toString(),value);
}}).combineByKey(createCombiner, mergeValue, mergeCombiners).reduceByKey((v1,v2) -> v1+v2);
pairRDD.saveAsTextFile(output);
}
}
The program calculates the sum of values for each key.
As per my understanding, the local combiner should run on each node and add up the values for the same keys, and then shuffling should occur with only a small amount of data.
But the Spark UI is showing a huge amount of shuffle read and shuffle write (almost 58 GB).
Am I doing anything wrong?
How do I know whether the local combiner is working?
Cluster details:
20-node cluster
Each node has an 80 GB hard disk, 8 GB RAM, 4 cores
Hadoop-2.7.2
Spark-2.0.2 (prebuilt-with-Hadoop-2.7.x distribution)
Input file details:
The input file is stored on HDFS
Input file size: 400 GB
Number of records: 16,129,999,990
Record columns: String(2 char), int, int, String(2 char), int, int, String(2 char), String(2 char), String(2 char)
Note:
The maximum number of distinct keys is 1,081,600.
In the Spark logs I see the tasks running with locality level NODE_LOCAL.
Let's decompose this problem and see what we get. To simplify computations, let's assume that:
Total number of records is 1.6e8
Number of unique keys is 1e6
Split size is 128 MB (this seems to be consistent with the number of tasks in your UI).
With these values the data will be split into ~3200 partitions (3125 in your case). This gives you around 51,200 records per split. Furthermore, if the distribution of the number of values per key is uniform, there should be ~160 records per key on average.
If the data is randomly distributed (it is not sorted by key, for example), you can expect that on average the number of records per key per partition will be close to one*. This is basically the worst-case scenario, where the map-side combine doesn't reduce the amount of data at all.
Furthermore, you have to remember that the size of a flat file will typically be significantly lower than the size of the serialized objects.
With real-life data you can usually expect some kind of order to emerge from the data collection process, so things should be better than what we calculated above, but the bottom line is that if the data is not already grouped by partition, the map-side combine may provide no improvement at all.
You could probably decrease the amount of shuffled data by using a somewhat larger split (256 MB would give you a bit over 100K records per partition), but it comes at the price of longer GC pauses and possibly other GC issues.
* You can either simulate this by taking samples with replacement:
import pandas as pd
import numpy as np
(pd
.DataFrame({"x": np.random.choice(np.arange(3200), size=160, replace=True)})
.groupby("x")
.x.count()
.mean())
or just think about the problem of randomly assigning 160 balls to 3200 buckets.
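If you want to try the larger-split suggestion above, one option is to raise the minimum split size on the Hadoop configuration before creating the RDD. This is a sketch, assuming the input goes through the usual Hadoop text input path (as with textFile); the 256 MB value and the argument handling are purely illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LargerSplits {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("larger-splits-example");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // Ask the Hadoop input format for ~256 MB splits instead of the default 128 MB
        sparkContext.hadoopConfiguration()
                .set("mapreduce.input.fileinputformat.split.minsize", String.valueOf(256L * 1024 * 1024));

        // args[0] is assumed to be the input path on HDFS
        JavaRDD<String> rdd = sparkContext.textFile(args[0]);
        System.out.println("number of partitions: " + rdd.getNumPartitions());
    }
}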
I've created a spark job that reads in a textfile everyday from my hdfs and extracts unique keys from each line in the text file. There are roughly 50000 keys in each text file. The same data is then filtered by the extracted key and saved to the hdfs.
I want to create a directory in my hdfs with the structure: hdfs://.../date/key that contains the filtered data. The problem is that writing to the hdfs takes a very very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile(""hdfs://...", 2)
val keys = extractKey(inputData) //keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) //cleaned data is an RDD of strings
keys.map(key => {
val filteredData = cleanedData.filter(line => line.contains(key))
filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)
import java.io.{BufferedWriter, OutputStreamWriter}
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def writeLines(iterator: Iterator[(String, String)]) = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val item = iterator.next()
      val key = item._1
      val line = item._2
      val writer = writers.get(key) match {
        case Some(w) => w
        case None =>
          // open one file per key; args(1) is assumed to hold the output directory
          val path = args(1) + key
          val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
          val newWriter = new BufferedWriter(new OutputStreamWriter(outputStream))
          writers(key) = newWriter
          newWriter
      }
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}
val inputData = sc.textFile(args(0)) // input path
val keyValue = inputData.map(line => (extractKey(line), line)) // extractKey stands for whatever derives the key from a line
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)
class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
I think the approach should be similar to "Write to multiple outputs by key Spark - one Spark job". The partition number has nothing to do with the number of directories. To implement it, you may need to override generateFileNameForKeyValue with your own version that saves to a different directory per key.
Regarding scalability, it is not an issue with Spark; it is HDFS instead. But no matter how you implement it, as long as the requirement does not change, it is unavoidable. I think HDFS is probably OK with 50,000 file handles.
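As a rough Java sketch of that idea (the class name and paths are illustrative):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Emits each record under <outputDir>/<key>/part-xxxxx instead of a single flat directory.
public class KeyBasedOutput extends MultipleTextOutputFormat<Object, Object> {
    @Override
    protected String generateFileNameForKeyValue(Object key, Object value, String name) {
        return key.toString() + "/" + name;
    }
}

It would then be plugged in with something like keyedData.saveAsHadoopFile("hdfs://.../date", String.class, String.class, KeyBasedOutput.class), where keyedData is a pair RDD of (key, line).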
You are specifying just 2 partitions for the input, and 1 partition for the output. One effect of this is severely limiting the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories, but that is really what is causing the bottleneck here. Is there perhaps another way to architect this that simply lets you read the (key, value) results?
One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?
We have a fairly large log file, about 5 GB. Each line of the log file contains a URL which a user has visited on our site. We want to figure out the 100 most popular URLs visited by our users. How would we do it?
In case we have more than 10 GB of RAM, just do it straightforwardly with a hashmap.
Otherwise, separate it into several files using a hash function. Then process each file and get a top 5. With the "top 5" of each file, it is easy to get an overall top 5.
Another solution is to sort it using any external sorting method, and then scan the file once to count each occurrence. In the process you don't have to keep all of the counts: you can safely throw away anything that doesn't make it into the top 5.
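A rough sketch of the hash-splitting step (the bucket count and file names are arbitrary); because a given URL always hashes to the same bucket file, the per-file top lists can be merged safely afterwards:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class HashSplit {
    public static void main(String[] args) throws IOException {
        int buckets = 16; // arbitrary; pick it so each bucket fits in memory
        BufferedWriter[] out = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            out[i] = new BufferedWriter(new FileWriter("bucket-" + i + ".log"));
        }
        // args[0] is assumed to be the path of the large log file
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String url;
            while ((url = in.readLine()) != null) {
                // same URL -> same bucket, so counts for a URL never span two files
                int b = Math.floorMod(url.hashCode(), buckets);
                out[b].write(url);
                out[b].newLine();
            }
        }
        for (BufferedWriter w : out) {
            w.close();
        }
    }
}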
Just sort the log file according to the URLs (this needs constant space if you choose an algorithm like heap sort or quick sort) and then count, for each URL, how many times it appears (this is easy, since lines with the same URL are next to each other).
The overall complexity is O(n log n).
Why splitting into many files and keeping only the top 3 (or top 5 or top N) of each file is wrong:
        File1   File2   File3
url1      5       0       5
url2      0       5       5
url3      5       5       0
url4      5       0       0
url5      0       5       0
url6      0       0       5
url7      4       4       4
url7 never makes it into the top 3 of the individual files, but it is the best overall.
Because the log file is fairly large, you should read it using a stream reader rather than reading it all into memory.
It is reasonable to expect that the set of distinct links fits in memory while we work through the log file.
// Pseudo
Hashmap map<url, count>
while (log file has next line) {
    url = next line in log file
    add url to map and update count
}
List list
foreach (m in map) {
    add m to list
}
sort the list by count value
take top n from the list
The runtime is O(n) + O(m*log(m)), where n is the number of lines in the log file and m is the number of distinct links found.
Here is a C# implementation of the pseudo-code. An actual file reader and a log file are not provided;
a simple emulation of reading a log file using an in-memory list is provided instead.
The algorithm uses a hashmap to store the found links, and a sort then finds the top 100 links afterwards. A simple data container is used for the sorting step.
The memory complexity depends on the number of distinct links expected: the hashmap must be able to hold all of the distinct links found, otherwise this algorithm won't work.
// Implementation
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
public static void Main(string[] args)
{
RunLinkCount();
Console.WriteLine("press a key to exit");
Console.ReadKey();
}
class LinkData : IComparable
{
public string Url { get; set; }
public int Count { get; set; }
public int CompareTo(object obj)
{
var other = obj as LinkData;
int i = other == null ? 0 : other.Count;
return i.CompareTo(this.Count);
}
}
static void RunLinkCount()
{
// Data setup
var urls = new List<string>();
var rand = new Random();
const int loglength = 500000;
// Emulate the log-file
for (int i = 0; i < loglength; i++)
{
urls.Add(string.Format("http://{0}.com", rand.Next(1000)
.ToString("x")));
}
// Hashmap memory must be allocated
// to contain distinct number of urls
var lookup = new Dictionary<string, int>();
var stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start();
// Algo-time
// O(n) where n is log line count
foreach (var url in urls) // Emulate stream reader, readline
{
if (lookup.ContainsKey(url))
{
int i = lookup[url];
lookup[url] = i + 1;
}
else
{
lookup.Add(url, 1);
}
}
// O(m) where m is number of distinct urls
var list = lookup.Select(i => new LinkData
{ Url = i.Key, Count = i.Value }).ToList();
// O(mlogm)
list.Sort();
// O(m)
var top = list.Take(100).ToList(); // top urls
stopwatch.Stop();
// End Algo-time
// Show result
// O(1)
foreach (var i in top)
{
Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
}
Console.WriteLine(string.Format("Time elapsed msec: {0}",
stopwatch.ElapsedMilliseconds));
}
}
Edit: this answer has been updated based on the comments:
added: running time and memory complexity analysis
added: pseudo-code
added: an explanation of how to handle a fairly large log file