How to handle skewed data while grouping in Pig - Hadoop

I am doing a GROUP BY operation in which one reduce task runs for a very long time. Below is a sample code snippet and a description of the issue:
inp = LOAD 'input' USING PigStorage('|') AS (f1,f2,f3,f4,f5);
grp_inp = GROUP inp BY (f1,f2) PARALLEL 300;
Since there is skew in the data, i.e. too many values for one key, one reducer runs for 4 hours while all the other reduce tasks complete in about a minute.
What can I do to fix this issue? Are there any alternative approaches? Any help would be greatly appreciated. Thanks!

You may have to check a few things:
1. Filter out records in which both f1 and f2 are NULL (if any).
2. Try to use the Hadoop combiner by implementing the Algebraic interface, if possible:
https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch10s02.html
3. Use a custom partitioner to distribute the data across reducers on a different key.
Here is the sample code I used to partition my skewed data after a join (the same can be used after a GROUP as well); a short Pig sketch showing how to plug it into the script follows the Java class:
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.NullableTuple;
import org.apache.pig.impl.io.PigNullableWritable;

public class KeyPartitioner extends Partitioner<PigNullableWritable, Writable> {

    /**
     * Here the key contains the value of the current key used for partitioning, and the
     * Writable value contains all fields of your tuple. I used the field at index 5 of
     * the tuple to do the partitioning, as I knew it has evenly distributed values.
     */
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        Tuple valueTuple = (Tuple) ((NullableTuple) value).getValueAsPigType();
        try {
            if (valueTuple.size() > 5) {
                Object hashObj = valueTuple.get(5);
                Integer keyHash = Integer.parseInt(hashObj.toString());
                return Math.abs(keyHash) % numPartitions;
            } else if (valueTuple.size() > 0) {
                // Fall back to hashing the first field of the tuple.
                return Math.abs(valueTuple.get(0).hashCode()) % numPartitions;
            }
        } catch (NumberFormatException | ExecException ex) {
            Logger.getLogger(KeyPartitioner.class.getName()).log(Level.SEVERE, null, ex);
        }
        return Math.abs(key.hashCode()) % numPartitions;
    }
}
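Putting points 1 and 3 together, here is a minimal Pig sketch (untested; the jar name is illustrative, and you would use the fully qualified class name if KeyPartitioner lives in a package):
REGISTER skew-partitioner.jar;   -- jar containing the compiled KeyPartitioner (name is illustrative)
inp      = LOAD 'input' USING PigStorage('|') AS (f1,f2,f3,f4,f5);
-- point 1: drop records where both grouping keys are NULL
filtered = FILTER inp BY (f1 IS NOT NULL) OR (f2 IS NOT NULL);
-- point 3: hand partitioning over to the custom partitioner
grp_inp  = GROUP filtered BY (f1,f2) PARTITION BY KeyPartitioner PARALLEL 300;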

Related

How to count partitions with FileSystem API?

I am using Hadoop version 2.7 and its FileSystem API. The question is "how to count partitions with the API?", but to frame it as a software problem, I am copying here a Spark shell script... The concrete question about the script is:
Is the variable parts below counting the number of table partitions, or something else?
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._

val warehouse = "/apps/hive/warehouse"  // the Hive default location for all databases
val db_regex  = """\.db$""".r           // filter for names like "*.db"
val tab_regex = """\.hive\-staging_""".r // signature of Hive files
val trStrange = "[\\s/]+|[^\\x00-\\x7F]+|[\\p{Cntrl}&&[^\r\n\t]]+|\\p{C}+".r // mark

def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) trStrange.replaceAllIn(thePath.replaceAll("^.+/", ""), "#") else thePath

val fs_get = FileSystem.get(sc.hadoopConfiguration)
fs_get.listStatus(new Path(warehouse)).foreach( lsb => {
  val b = lsb.getPath.toString
  if (db_regex.findFirstIn(b).isDefined)
    fs_get.listStatus(new Path(b)).foreach( lst => {
      val lstPath = lst.getPath
      val t = lstPath.toString
      var parts = -1
      var size = -1L
      if (!tab_regex.findFirstIn(t).isDefined) {
        try {
          val pp = fs_get.listStatus(lstPath)
          parts = pp.length // !HERE! partitions?
          pp.foreach( p => {
            try { // SUPPOSING that size is the number of bytes of table t
              size = size + fs_get.getContentSummary(p.getPath).getLength
            } catch { case _: Throwable => }
          })
        } catch { case _: Throwable => }
        println(s"${cutPath(b)},${cutPath(t)},$parts,$size")
      }
    })
}) // end of warehouse loop
System.exit(0) // get out of spark-shell
This is only an example to show the focus: the correct scan and semantic interpretation of the Hive default database filesystem structure, using the FileSystem API. The script sometimes needs some memory, but it works fine. Run it with sshell --driver-memory 12G --executor-memory 18G -i teste_v2.scala > output.csv
Note: the aim here is not to count partitions by any other method (e.g. HQL DESCRIBE or the Spark schema), but to use the API for it... For control and for data quality checks, the API is important as a kind of "lower-level measurement".
Hive structures its metadata as database > tables > partitions > files. This typically translates into the filesystem directory structure <hive.warehouse.dir>/database.db/table/partition/.../files, where /partition/.../ signifies that tables can be partitioned by multiple columns, thus creating nested levels of subdirectories. (By convention, a partition is a directory named .../partition_column=value.)
So it seems your script will print the number of files (parts) and their total length (size) for each single-column-partitioned table in each of your databases, if I'm not mistaken.
As an alternative, I'd suggest you look at the hdfs dfs -count command to see if it suits your needs, and maybe wrap it in a simple shell script to loop through the databases and tables, as sketched below.
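A rough, untested sketch of that idea; the awk field positions assume the standard hdfs dfs -ls and hdfs dfs -count output formats (DIR_COUNT, FILE_COUNT and CONTENT_SIZE are the first three columns of -count, and DIR_COUNT is recursive, so multi-level partitioning inflates it):
#!/usr/bin/env bash
# For every <db>.db directory under the warehouse, and every table directory inside it,
# print: database, table, DIR_COUNT, FILE_COUNT, CONTENT_SIZE as reported by `hdfs dfs -count`.
warehouse=/apps/hive/warehouse
for db in $(hdfs dfs -ls "$warehouse" | awk '/\.db$/ {print $NF}'); do
  for tab in $(hdfs dfs -ls "$db" | awk '/^d/ {print $NF}'); do
    hdfs dfs -count "$tab" | awk -v d="$db" -v t="$tab" '{print d "," t "," $1 "," $2 "," $3}'
  done
done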

Sort documents by effective date

Each document (InputStream is) has a field called effective date. I need all these individual documents to be combined into one document, sorted by effective date.
import java.util.Properties;
import java.io.InputStream;
for (int i = 0; i < dataContext.getDataCount(); i++) {
    InputStream is = dataContext.getStream(i);
    Properties props = dataContext.getProperties(i);
    dataContext.storeStream(is, props);
}
Thanks
Nag
Add your documents to an ArrayList, then use List.sort(Comparator) with a comparator that compares the dates. After that, iterate over the list with a for-each loop and add the documents to your output, for example:
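A minimal, untested sketch of that idea against the dataContext API from the question; the "effectiveDate" property key and the assumption that the dates sort correctly as strings (e.g. ISO yyyy-MM-dd) are placeholders you would need to adapt:
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;

// Collect every document together with its properties.
List<Object[]> docs = new ArrayList<>();
for (int i = 0; i < dataContext.getDataCount(); i++) {
    InputStream is = dataContext.getStream(i);
    Properties props = dataContext.getProperties(i);
    // "effectiveDate" is an assumed property key -- use whatever key your documents actually carry.
    docs.add(new Object[] { props.getProperty("effectiveDate"), is, props });
}

// Sort by effective date; assumes an ISO-style format so that string order matches date order.
docs.sort(Comparator.comparing(d -> (String) d[0]));

// Emit the documents in sorted order.
for (Object[] d : docs) {
    dataContext.storeStream((InputStream) d[1], (Properties) d[2]);
}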

Partitioning not working in Hadoop

In my code I have partitioned the data into three parts, but in the output I only get the output returned by the 0th partition, even though I set the number of reducers to 3.
My code:
public static class customPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String country = value.toString();
        if (numReduceTasks == 0)
            return 0;
        if (key.equals(new Text("key1")) && !value.equals(new Text("valuemy")))
            return 1 % numReduceTasks;
        if (value.equals(new Text("valueother")) && key.equals(new Text("key1")))
            return 0;
        else
            return 2 % numReduceTasks;
    }
}
and set the number of reducers as
job.setNumReduceTasks(3);
It gives me only the output of the 0th partition, i.e. return 0.
I was making a very silly mistake. The partitioning was working fine in my code, but I thought the output was only in the part-r-00000 file; I assumed the split into multiple files was just to reduce load and that the output would be shown combined. I was wrong: the different partitions have different output files, as the listing below illustrates.
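For illustration, with three reducers each partition writes its own part file in the job's output directory (a hypothetical listing, assuming the output path is output/):
hdfs dfs -ls output/
#   _SUCCESS
#   part-r-00000   <-- records sent to partition 0
#   part-r-00001   <-- records sent to partition 1
#   part-r-00002   <-- records sent to partition 2
hdfs dfs -cat output/part-r-00001   # inspect one partition's output on its own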

Getting values in string in Hive query

I have a table in Hive in which one of the columns is a string. The values in that column look like "x=1,y=2,z=3". I need to write a query that sums the value of x in this column over all rows. How do I extract the value of x and add the values up?
You would need a UDF for this transformation:
package com.example;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SplitColumn extends UDF {
    public Integer evaluate(Text input) {
        if (input == null) return null;
        // Take the first key=value pair (assumed to be x, as in "x=1,y=2,z=3") and return its numeric value.
        String val = input.toString().split(",")[0].split("=")[1];
        return Integer.parseInt(val);
    }
}
Now you can try this:
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION SplitColumn as 'com.example.SplitColumn';
hive> select sum(SplitColumn(mycolumnName)) from mytable;
P.S.: I have not tested this, but it should give you a direction in which to proceed.

How can I add row numbers for rows in PIG or HIVE?

I have a problem when adding row numbers using Apache Pig.
I have a STR_ID column and I want to add a ROW_NUM column that contains the row number of each STR_ID value.
For example, here is the input:
STR_ID
------------
3D64B18BC842
BAECEFA8EFB6
346B13E4E240
6D8A9D0249B4
9FD024AA52BA
How do I get the output like:
STR_ID | ROW_NUM
----------------------------
3D64B18BC842 | 1
BAECEFA8EFB6 | 2
346B13E4E240 | 3
6D8A9D0249B4 | 4
9FD024AA52BA | 5
Answers using Pig or Hive are acceptable. Thank you.
In Hive:
Query
select str_id,row_number() over() from tabledata;
Output
3D64B18BC842 1
BAECEFA8EFB6 2
346B13E4E240 3
6D8A9D0249B4 4
9FD024AA52BA 5
Facebook posted a number of Hive UDFs, including NumberRows. Depending on your Hive version (I believe 0.8), you may need to add an attribute to the class (stateful=true).
Pig 0.11 introduced a RANK operator that can be used for this purpose, for example:
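A minimal sketch (the load path and field name are illustrative):
str_ids = LOAD 'input' AS (str_id:chararray);
ranked  = RANK str_ids;   -- prepends a sequential rank column, starting at 1
DUMP ranked;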
For folks wondering about Pig, I found the best way (currently) is to write your own UDF.
I wanted to add row numbers for tuples in a bag. This is the code for that:
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class RowCounter extends EvalFunc<DataBag> {

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            DataBag bg = (DataBag) input.get(0);
            Iterator<Tuple> it = bg.iterator();
            int count = 1;
            while (it.hasNext()) {
                Tuple t = it.next();
                t.append(count);   // append the running row number to the tuple
                output.add(t);
                count = count + 1;
            }
            return output;
        } catch (ExecException ee) {
            // error handling goes here
            throw ee;
        }
    }

    public Schema outputSchema(Schema input) {
        try {
            Schema bagSchema = new Schema();
            bagSchema.add(new Schema.FieldSchema(null, DataType.BAG));
            return new Schema(new Schema.FieldSchema(
                    getSchemaName(this.getClass().getName().toLowerCase(), input),
                    bagSchema, DataType.BAG));
        } catch (Exception e) {
            return null;
        }
    }
}
This code is for reference only and might not be error-proof; a usage sketch follows.
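For illustration, a hypothetical way to invoke it from Pig (the jar name and load path are made up):
REGISTER rowcounter.jar;
str_ids  = LOAD 'input' AS (str_id:chararray);
grouped  = GROUP str_ids ALL;                                       -- collect all tuples into one bag
numbered = FOREACH grouped GENERATE FLATTEN(RowCounter(str_ids));   -- each tuple gets a row number appended
DUMP numbered;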
Here is another good approach, shown with my example.
Step 1. Define a row_sequence() function to produce an auto-incrementing ID:
add jar /Users/trongtran/research/hadoop/dev/hive-0.9.0-bin/lib/hive-contrib-0.9.0.jar;
drop temporary function row_sequence;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Step 2. Insert the unique ID and STR_ID:
INSERT OVERWRITE TABLE new_table
SELECT
row_sequence(),
STR_ID
FROM old_table;
From version 0.11, Hive supports analytic functions such as lead, lag, and also row_number:
https://issues.apache.org/jira/browse/HIVE-896
Hive solution:
select *
      ,row_number() over (order by rand()) as row_num
from table;
Or, if you want the row numbers to ascend by STR_ID:
select *
      ,row_number() over (order by STR_ID) as row_num
from table;
In Hive:
select
str_id, ROW_NUMBER() OVER() as row_num
from myTable;
