Splitting a tuple into multiple tuples in Pig - hadoop

I'd like to generate multiple tuples from a single tuple. What I mean is:
I have a file with the following data in it.
>> cat data
ID | ColumnName1:Value1 | ColumnName2:Value2
so I load it by the following command
grunt >> A = load '$data' using PigStorage('|');
grunt >> dump A;
(ID,ColumnName1:Value1,ColumnName2:Value2)
Now I want to split this tuple into two tuples.
(ID, ColumnName1, Value1)
(ID, ColumnName2, Value2)
Can I use a UDF along with foreach and generate? Something like the following?
grunt >> foreach A generate SOMEUDF(A)
EDIT:
input tuple: (id1, column1, column2)
output: two tuples (id1, column1) and (id1, column2), so should I return a List or a Bag?
public class SPLITTUPPLE extends EvalFunc<List<Tuple>>
{
    public List<Tuple> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            // not sure whether I can create tuples on my own. Looks like I should use TupleFactory.
            // return list of tuples.
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
Is this approach correct?

You could write a UDF or use a Pig script with built-in functions.
For example:
-- data should be chararray; PigStorage('|') returns bytearray, which will not work for this example
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
-- split by | and create a row so we can dereference it later
splt = foreach inpt generate FLATTEN(STRSPLIT($0, '\\|')) ;
-- first column is id, rest is converted into a bag and flatten it to make rows
id_vals = foreach splt generate $0 as id, FLATTEN(TOBAG(*)) as value;
-- there will be records with (id, id), but id should not have ':'
id_vals = foreach id_vals generate id, INDEXOF(value, ':') as p, STRSPLIT(value, ':', 2) as vals;
final = foreach (filter id_vals by p != -1) generate id, FLATTEN(vals) as (col, val);
dump final;
Test INPUT:
1|c1:11:33|c2:12
234|c1:21|c2:22
33|c1:31|c2:32
345|c1:41|c2:42
OUTPUT:
(1,c1,11:33)
(1,c2,12)
(234,c1,21)
(234,c2,22)
(33,c1,31)
(33,c2,32)
(345,c1,41)
(345,c2,42)
I hope it helps.
Cheers.

Here is the UDF version. I prefer to return a BAG:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
/**
 * Converts input chararray "ID|ColumnName1:Value1|ColumnName2:Value2|.." into a bag
 * {(ID, ColumnName1, Value1), (ID, ColumnName2, Value2), ...}
 *
 * Default rows separator is '|' and key value separator is ':'.
 * In this implementation white spaces around separator characters are not removed.
 * ID can be made of any character (including sequence of white spaces).
 * @author
 *
 */
public class TupleToBagColumnValuePairs extends EvalFunc<DataBag> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    // Row separator character. Default is '|'.
    private String rowsSeparator;
    // Column value separator character. Default is ':'.
    private String columnValueSeparator;

    public TupleToBagColumnValuePairs() {
        this.rowsSeparator = "\\|";
        this.columnValueSeparator = ":";
    }

    public TupleToBagColumnValuePairs(String rowsSeparator, String keyValueSeparator) {
        this.rowsSeparator = rowsSeparator;
        this.columnValueSeparator = keyValueSeparator;
    }

    /**
     * Creates a tuple with 3 fields (id:chararray, column:chararray, value:chararray)
     * @param outputBag Output tuples (id, column, value) are added to this bag
     * @param id
     * @param column
     * @param value
     * @throws ExecException
     */
    protected void addTuple(DataBag outputBag, String id, String column, String value) throws ExecException {
        Tuple outputTuple = tupleFactory.newTuple();
        outputTuple.append(id);
        outputTuple.append(column);
        outputTuple.append(value);
        outputBag.add(outputTuple);
    }

    /**
     * Takes each column{separator}value entry from splitInputLine, splits it into column and value,
     * and adds them to the outputBag as (id, column, value)
     * @param outputBag Output tuples (id, column, value) should be added to this bag
     * @param id
     * @param splitInputLine entries in the format column{separator}value, starting from index 1
     * @throws ExecException
     */
    protected void parseColumnValues(DataBag outputBag, String id,
            String[] splitInputLine) throws ExecException {
        for (int i = 1; i < splitInputLine.length; i++) {
            if (splitInputLine[i] != null) {
                int columnValueSplitIndex = splitInputLine[i].indexOf(this.columnValueSeparator);
                if (columnValueSplitIndex != -1) {
                    String column = splitInputLine[i].substring(0, columnValueSplitIndex);
                    String value = null;
                    if (columnValueSplitIndex + 1 < splitInputLine[i].length()) {
                        value = splitInputLine[i].substring(columnValueSplitIndex + 1);
                    }
                    this.addTuple(outputBag, id, column, value);
                } else {
                    String column = splitInputLine[i];
                    this.addTuple(outputBag, id, column, null);
                }
            }
        }
    }

    /**
     * input - contains only one field of type chararray, which will be split by '|'
     * All inputs that are null or of length 0 are ignored.
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1 || input.isNull(0)) {
            return null;
        }
        String inputLine = (String) input.get(0);
        String[] splitInputLine = inputLine.split(this.rowsSeparator, -1);
        // length > 0 so that a line containing only an id still produces a tuple
        if (splitInputLine.length > 0 && splitInputLine[0].length() > 0) {
            String id = splitInputLine[0];
            DataBag outputBag = bagFactory.newDefaultBag();
            if (splitInputLine.length == 1) { // there is just an id in the line
                this.addTuple(outputBag, id, null, null);
            } else {
                this.parseColumnValues(outputBag, id, splitInputLine);
            }
            return outputBag;
        }
        return null;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            if (input.size() != 1) {
                throw new RuntimeException("Expected input to have only one field");
            }
            Schema.FieldSchema inputFieldSchema = input.getField(0);
            if (inputFieldSchema.type != DataType.CHARARRAY) {
                throw new RuntimeException("Expected a CHARARRAY as input");
            }
            Schema tupleSchema = new Schema();
            tupleSchema.add(new Schema.FieldSchema("id", DataType.CHARARRAY));
            tupleSchema.add(new Schema.FieldSchema("column", DataType.CHARARRAY));
            tupleSchema.add(new Schema.FieldSchema("value", DataType.CHARARRAY));
            return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), tupleSchema, DataType.BAG));
        } catch (FrontendException exx) {
            throw new RuntimeException(exx);
        }
    }
}
Here is how it is used in Pig:
register 'path to the jar';
define IdColumnValue myPackage.TupleToBagColumnValuePairs();
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
result = foreach inpt generate FLATTEN(IdColumnValue($0)) as (id1, c2, v2);
dump result;
For good inspiration on writing UDFs with bags, see the DataFu source code by LinkedIn.

You could use TransposeTupleToBag (a UDF from the DataFu library) on the output of STRSPLIT to get the bag, and then FLATTEN the bag to create a separate row per original column.
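For illustration, here is a minimal sketch of that approach against the same sample file as above. It is a sketch only: the DataFu jar path, the package path of TransposeTupleToBag and its exact output schema are assumptions, so check the DataFu documentation before using it. Note that each emitted value still carries the original 'name:value' string, so a further STRSPLIT on ':' would be needed to separate column name from value.
-- sketch only; the jar path, package path and the (key, value) output schema of TransposeTupleToBag are assumptions
register 'path/to/datafu.jar';
define Transpose datafu.pig.util.TransposeTupleToBag();
inpt = load '/pig_fun/input/single_tuple_to_multiple.txt' as (line:chararray);
-- split each line into the id plus the remaining name:value columns
splt = foreach inpt generate FLATTEN(STRSPLIT(line, '\\|', 3)) as (id:chararray, col1:chararray, col2:chararray);
-- transpose the non-id columns into a bag of (key, value) pairs, then flatten into one row per column
rows = foreach splt generate id, FLATTEN(Transpose(col1, col2)) as (col:chararray, val:chararray);
dump rows;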

Related

How to infer parquet schema by hive table schema without inserting any records?

Given a Hive table with its schema, namely:
hive> show create table nba_player;
OK
CREATE TABLE `nba_player`(
`id` bigint,
`player_id` bigint,
`player_name` string,
`admission_time` timestamp,
`nationality` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://endpoint:8020/user/hive/warehouse/nba_player'
TBLPROPERTIES (
'transient_lastDdlTime'='1541140811')
Time taken: 0.022 seconds, Fetched: 16 row(s)
How to infer its parquet schema without inserting any records?
The parquet schema is like:
message_meta {
  optional int64 id;
  optional int64 player_id;
  optional binary player_name;
  optional timestamp admission_time;
  optional binary nationality;
}
The code is shown below:
/**
 * Generate MessageType by table properties using HiveSchemaConverter
 *
 * @param tableProperties {@link Properties}
 * @return MessageType
 */
public static MessageType getMessageTypeFromTable(final Properties tableProperties) {
    final String columnNameProperty = tableProperties.getProperty(IOConstants.COLUMNS);
    final String columnTypeProperty = tableProperties.getProperty(IOConstants.COLUMNS_TYPES);
    List<String> columnNames;
    List<TypeInfo> columnTypes;
    if (columnNameProperty.length() == 0) {
        columnNames = new ArrayList<String>();
    } else {
        columnNames = Arrays.asList(columnNameProperty.split(","));
    }
    if (columnTypeProperty.length() == 0) {
        columnTypes = new ArrayList<TypeInfo>();
    } else {
        columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
    }
    MessageType messageType = HiveSchemaConverter.convert(columnNames, columnTypes);
    logger.info("messageType is inferred to be: {}", messageType.toString());
    return messageType;
}
public class ParquetHelperTest {
    @Test
    public void testGenerateParquetSchemaFromTableProperties() {
        Properties tableProperties = new Properties();
        tableProperties.setProperty(IOConstants.COLUMNS, "id,player_id,player_name,admission_time,nationality");
        tableProperties.setProperty(IOConstants.COLUMNS_TYPES, "bigint,bigint,string,timestamp,string");

        MessageType messageType = ParquetHelper.getMessageTypeFromTable(tableProperties);

        String expectedMessageType = "message hive_schema {\n"
                + " optional int64 id;\n"
                + " optional int64 player_id;\n"
                + " optional binary player_name (UTF8);\n"
                + " optional int96 admission_time;\n"
                + " optional binary nationality (UTF8);\n"
                + "}";
        String calculatedMessageType = messageType.toString();
        calculatedMessageType = calculatedMessageType.replaceAll("\\s", "");
        expectedMessageType = expectedMessageType.replaceAll("\\s", "");
        Assert.assertTrue(calculatedMessageType.equalsIgnoreCase(expectedMessageType));
    }
}

Update column value for a specified time range in Hive table

A Hive table "Employee" contains a column "timerange", and the data is
timerange
1:10
1:13
1:17
1:21
1:26
If the last digit is between 0 and 4, the value must be updated to end in 0. If the last digit is between 5 and 9, it must be updated to end in 5.
Expected output is
timerange
1:10
1:10
1:15
1:20
1:25
How can I do this?
You can do this through built-in string manipulation:
SELECT CASE WHEN SUBSTRING(timerange, LENGTH(timerange)) < "5"
THEN CONCAT(SUBSTRING(timerange, 1, LENGTH(timerange) - 1), "0")
ELSE CONCAT(SUBSTRING(timerange, 1, LENGTH(timerange) - 1), "5")
END AS timerange
FROM Employee;
You can create a generic UDF (GenericUDF).
Here is a sample UDF:
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.IntObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
public class TimeRangeConverter extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("The function time_range_converter(time_range) requires 1 argument.");
        }
        ObjectInspector timeRangeVal = arguments[0];
        if (!(timeRangeVal instanceof StringObjectInspector)) {
            throw new UDFArgumentException("First argument must be of type String (time_range as String)");
        }
        // evaluate() returns a plain Java String, so declare the matching object inspector
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        String timeRangeVal = (String) ObjectInspectorUtils.copyToStandardJavaObject(arguments[0].get(),
                PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        char[] characters = timeRangeVal.toCharArray();
        // last digit 5-9 maps to '5', last digit 0-4 maps to '0'
        if (characters[characters.length - 1] >= '5') {
            characters[characters.length - 1] = '5';
        } else {
            characters[characters.length - 1] = '0';
        }
        return String.valueOf(characters);
    }

    @Override
    public String getDisplayString(String[] arguments) {
        assert (arguments.length == 1);
        return "time_range_converter(" + arguments[0] + ")";
    }
}
Call the Hive update statement like:
CREATE TEMPORARY FUNCTION time_range_converter AS 'TimeRangeConverter';

UPDATE Employee
SET timerange = time_range_converter(timerange);

Filtering Comma Separated Data

My site has a bunch of widgets and I'm trying to filter them based on the URL which is passed in. Say a Widget has the following structure:
public class Widget {
    public int Id { get; set; }
    public string Name { get; set; }
    public string Urls { get; set; }
}
Where Urls is a comma separated list for the urls where the widget should be displayed, e.g.:
/, /Blog/, /Blog/123, /News/*
The asterisk after News indicates the Widget will be selected whenever the passed-in URL starts with /News/.
How could I modify the following method to filter the widgets based on my conditions above?
public IList<Widget> GetWidgets(string url) {
    return _session
        .Where(w => w.Urls.Contains(url))
        .ToList();
}
Ideally I'd like to use a LINQ query and it must only hit the database once. I'd appreciate the help. Thanks.
I managed to solve this by adding my own wild card match generator. See http://sentinel101.wordpress.com/2010/12/30/extend-nhibernate-linq-for-regex-matching/ for an example of how to register the generator. Here's the generator in case anyone is interested:
public class WildCardMatchGenerator : BaseHqlGeneratorForMethod {
    public WildCardMatchGenerator() {
        var methodDefinition = ReflectionHelper.GetMethodDefinition(() => WildCardMatchExtensions.WildCardMatch(null, null, ','));
        SupportedMethods = new[] { methodDefinition };
    }

    public override HqlTreeNode BuildHql(MethodInfo method, Expression targetObject, ReadOnlyCollection<Expression> arguments, HqlTreeBuilder treeBuilder, IHqlExpressionVisitor visitor) {
        return treeBuilder.Equality(treeBuilder.MethodCall("[dbo].[WildCardMatch]", new[] {
            visitor.Visit(arguments[0]).AsExpression(),
            visitor.Visit(arguments[1]).AsExpression(),
            visitor.Visit(arguments[2]).AsExpression()
        }), treeBuilder.Constant(1));
    }
}
And here is the WildCardMatch UDF:
CREATE FUNCTION [dbo].[WildCardMatch] (
    @Pattern NVARCHAR(MAX),
    @Input NVARCHAR(MAX),
    @Separator NVARCHAR(5)
)
RETURNS BIT
AS
BEGIN
    SET @Pattern = REPLACE(@Pattern, '*', '%')

    DECLARE @RtnValue BIT
    SELECT @RtnValue = CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END FROM [dbo].[Split](@Pattern, @Separator) WHERE @Input LIKE [Data]

    RETURN @RtnValue
END
And the Split function it calls (from http://blogs.microsoft.co.il/blogs/itai/archive/2009/02/01/t-sql-split-function.aspx):
CREATE FUNCTION [dbo].[Split]
(
    @RowData NVARCHAR(MAX),
    @Separator NVARCHAR(MAX)
)
RETURNS @RtnValue TABLE
(
    [Id] INT IDENTITY(1,1),
    [Data] NVARCHAR(MAX)
)
AS
BEGIN
    DECLARE @Iterator INT
    SET @Iterator = 1

    DECLARE @FoundIndex INT
    SET @FoundIndex = CHARINDEX(@Separator, @RowData)

    WHILE (@FoundIndex > 0)
    BEGIN
        INSERT INTO @RtnValue ([Data])
        SELECT Data = LTRIM(RTRIM(SUBSTRING(@RowData, 1, @FoundIndex - 1)))

        SET @RowData = SUBSTRING(@RowData, @FoundIndex + DATALENGTH(@Separator) / 2, LEN(@RowData))
        SET @Iterator = @Iterator + 1
        SET @FoundIndex = CHARINDEX(@Separator, @RowData)
    END

    INSERT INTO @RtnValue ([Data])
    SELECT Data = LTRIM(RTRIM(@RowData))

    RETURN
END
Lastly you'll need the C# implementation of the above UDF:
public static class WildCardMatchExtensions {
    public static bool WildCardMatch(this string pattern, string input, char separator = ',') {
        foreach (var str in pattern.Split(new char[] { separator }, StringSplitOptions.RemoveEmptyEntries)) {
            if (Regex.IsMatch(input, Regex.Escape(str.Trim()).Replace("\\*", ".*")))
                return true;
        }
        return false;
    }
}
Hope this helps.
I don't think SQL Server allows you to use IN for comma-delimited field values. You need LIKE (SQL example below):
Urls LIKE '%your_url%'
The question "Linq, Expressions, NHibernate and Like comparison" may help you with translating LIKE to LINQ to NHibernate.

How to convert a string to decimal in a LINQ query

I have the following Linq statement:
var total = (from a in mobileChunk.Data select a.callCost).Sum();
callCost is a string. I need to convert it to a decimal. How is this done?
I would do something like this:
public static class Extenders
{
    public static decimal ToDecimal(this string str)
    {
        // you can throw an exception or return a default value here
        if (string.IsNullOrEmpty(str))
            return 0m; // substitute your own default value

        decimal d;
        // you could throw an exception or return a default value on failure
        if (!decimal.TryParse(str, out d))
            return 0m; // substitute your own default value

        return d;
    }
}
Now, in your LINQ:
var total = (from a in mobileChunk.Data select a.callCost.ToDecimal()).Sum();
Perhaps you can try this:
var total = (from a in mobileChunk.Data select decimal.Parse(a.callCost)).Sum();

How do I select records and summary in one query?

The scenario is like this:
name | val
'aa' | 10
'bb' | 20
'cc' | 30
*********
sum | 60
For now I just select all the records with a simple LINQ query and invoke the enumerator (ToList()).
Then I loop over the list and sum the val column.
Is there a better way? LINQ selects everything into a new typed object, so I don't know how to add the additional summary data.
Thanks.
An anonymous type doesn't allow values to be added or edited once it's created, so instead of returning an anonymous type you can use your own output class. Something like this:
public class ResClass
{
    public string name;
    public int value;
}

public class OutClass
{
    public int sum;
    public List<ResClass> lstData;
}

int sum = 0;
var outtt = objTT.Where(x => x.id == 1).Select(x =>
{
    sum += x.value;
    return new ResClass { name = x.name, value = x.value };
}).ToList();

OutClass outCls = new OutClass { sum = sum, lstData = outtt };
