How to write a UDF in Tajo [closed] - etl

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I am wondering if I can write a UDF in Tajo especially in Python. My use case is for ETL where I want to group log records by some ID (browser ID), then sort the records in the same group by timestamp and then finally use my UDF to go over the sorted records in each group.

I'm a Tajo PMC member. Thank you for your interest in Tajo.
Currently, Tajo does not support Python UDF yet. But, Tajo was designed to have multiple UDF function implementations for each single function signature. Python function UDF feature can be added to Tajo easily.
Now, you should use Java-based one for your own custom functions. There are two ways to implement your custom UDFs. Of course, both ways are very easy.
The first way is to use a legacy Tajo UDF interface. In this manner, each function implementation should be a class which inherits GeneralFunction class. eval() is a body of each UDF implementation. An example is as follow:
#Description(
functionName = "pow",
description = "x raised to the power of y",
example = "> SELECT pow(9.0, 3.0)\n"
+ "729",
returnType = FLOAT8,
paramTypes = {
#ParamTypes(paramTypes = {FLOAT8, FLOAT8})
}
)
public class Pow extends GeneralFunction {
public Pow() {
super(new Column[] {
new Column("x", FLOAT8),
new Column("y", FLOAT8)
});
}
#Override
public Datum eval(Tuple params) {
Datum value1Datum = params.get(0);
Datum value2Datum = params.get(1);
if(value1Datum instanceof NullDatum || value2Datum instanceof NullDatum) {
return NullDatum.get();
}
return DatumFactory.createFloat8(Math.pow(value1Datum.asFloat8(), value2Datum.asFloat8()));
}
}
Another way is to use new function interface. In this manner, you need to just implement a static method for each UDF. Each static method for a function can have one or more Java primitive type or Java object type parameters.
According to whether each parameter is primitive or object type, each parameter has an implicitly different effect. A primitive type argument does not allow NULL value. In this case, if the parameter takes NULL value, this function will automatically return NULL value without actually invoking the function. With SQL's three-value logic, NULL handling like this is very common in SQL function. So, basically, Tajo provides this feature.
An example of a function with primitive parameters:
#ScalarFunction(name = "pow", returnType = FLOAT8, paramTypes = {FLOAT8, FLOAT8})
public static double pow(double x, double y) {
return Math.pow(x, y);
}
In contrast, Object type parameter allows NULL value. In this case, each function must explicitly handle NULL value as follow:
#ScalarFunction(name = "pow", returnType = FLOAT8, paramTypes = {FLOAT8, FLOAT8})
public static Double pow(Double x, Double y) {
if (x == null || y == null) {
return null;
}
return Math.pow(x, y);
}
In order to add your own UDFs to a Tajo cluster, you need to make a jar file, including your own function classes. Currently, Tajo does not allow to users to add custom jars in runtime. So, you should copy the custom jars to ${TAJO_HOME}/lib and then restart Tajo cluster.
Also, Tajo will support CREATE FUNCTION feature to allow users to add functions in runtime soon.

Related

Enum values as parameter default values in Haxe

Is there a way to use enum default parameters in Haxe? I get this error:
Parameter default value should be constant
enum AnEnum {
A;
B;
C;
}
class Test {
static function main() {
Test.enumNotWorking();
}
static function enumNotWorking(e:AnEnum = AnEnum.A){}
}
Try Haxe link.
Update: this feature has been added in Haxe 4. The code example from the question now compiles as-is with a regular enum.
Previously, this was only possible if you're willing to use enum abstracts (enums at compile time, but a different type at runtime):
#:enum
abstract AnEnum(Int)
{
var A = 1;
var B = 2;
var C = 3;
}
class Test3
{
static function main()
{
nowItWorks();
}
static function nowItWorks(param = AnEnum.A)
{
trace(param);
}
}
There's nothing special about the values I chose, and you could choose another type (string, or a more complex type) if it better suits your use case. You can treat these just like regular enums (for switch statements, etc.) but note that when you trace it at runtime, you'll get "1", not "A".
More information: http://haxe.org/manual/types-abstract-enum.html
Sadly enums can't be used as default values, because in Haxe enums aren't always constant.
This piece of trivia was on the old website but apparently hasn't made it into the new manual yet:
http://old.haxe.org/ref/enums#using-enums-as-default-value-for-parameters
The workaround is to check for a null value at the start of your function:
static function enumNotWorking(?e:AnEnum){
if (e==null) e=AnEnum.A;
}
Alternatively, an Enum Abstract might work for your case.

Dynamic Repository With Dynamic Methods

Inspired by this post dynamic-repositories-in-lightspeed I am trying to build my own like this.
I have a abstract GenericRepository like this. I have omitted most of the code for simplicity (Its just normal Add/Update/Filtering methods).
public abstract class GenericRepository<TEntity, TContext> :
DynamicObject,
IDataRepository<TEntity>
where TEntity : class, new()
where TContext : DbContext, new()
{
protected TContext context;
protected DbSet<TEntity> DbSet;
}
As you can see, my abstract GenericRepository extends from DynamicObject to support dynamic repositories.
I also have a abstract UnitOfWork implementation which generated a repository for a given entity at runtime like this. Again, base classes and other details are irrelevant for the question, but I'm happy to provide them if you require.
public abstract class UnitOfWorkBase<TContext> : IUnitOfWork
where TContext : DbContext, new()
{
public abstract IDataRepository<T> Repository<T>()
where T : class, IIdentifiableEntity, new();
// Code
}
Following class implements abstract method of the above class.
public class MyUnitOfWorkBase : UnitOfWorkBase<MyDataContext>
{
public override IDataRepository<T> Repository<T>()
{
if (Repositories == null)
Repositories = new Hashtable();
var type = typeof(T).Name;
if (!Repositories.ContainsKey(type))
{
var repositoryType = typeof(GenericRepositoryImpl<,>);
var genericType = repositoryType.MakeGenericType(typeof(T), typeof(InTeleBillContext));
var repositoryInstance = Activator.CreateInstance(genericType);
Repositories.Add(type, repositoryInstance);
}
return (IDataRepository<T>)Repositories[type];
}
}
Now, whenever I want to create a dynamic repository for basic CRUD functions, I can do it like this.
var uow = new MyUnitOfWorkBase();
var settingsRepo = uow.Repository<Settings>();
var settingsList = settingsRepo.Get().ToList();
Now, What I want to do is something like this.
dynamic settingsRepo = uow.Repository<Settings>();
var result = settingsRepo.FindSettingsByCustomerNumber(774278L);
Here, FindSettingsByCustomerNumber() is a dynamic method. I resolve this method using this code.
public class GenericRepositoryImpl<TEntity, TContext> :
GenericRepository<TEntity, TContext>
where TEntity : class, IIdentifiableEntity, new()
where TContext : DbContext, new()
{
public override bool TryInvokeMember(InvokeMemberBinder binder,
object[] args, out object result)
{
// Crude parsing for simplicity
if (binder.Name.StartsWith("Find"))
{
int byIndex = binder.Name.IndexOf("By");
if (byIndex >= 0)
{
string collectionName = binder.Name.Substring(4, byIndex - 4);
string[] attributes = binder.Name.Substring(byIndex + 2)
.Split(new[] { "And" }, StringSplitOptions.None);
var items = DbSet.ToList();
Func<TEntity, bool> predicate = entity => entity.GetType().GetProperty(attributes[0]).GetValue(entity).Equals(args[0]);
result = items.Where(predicate).ToList();
return true;
}
}
return base.TryInvokeMember(binder, args, out result);
}
}
This is the problem I am having.
using this line var items = DbSet.ToList(); works well, but if I were to query a large table with 1000's of data, then performance issues occur.
If I directly try to use the IQueryble interface and call it like this
Func predicate = entity => entity.GetType().GetProperty(attributes[0]).GetValue(entity).Equals(args[0]);
result = DbSet.Where(predicate).ToList();
It gives me an error saying there is no method GetProperty() in LINQ to Entities.
Is it possible to make it work using LINQ to Entities?
You need to know that LINQ-to-Entities needs to convert your expression (given by the predicate) into a SQL query. entity is replaced by the database column. Additionally LINQ2Entities supports various expressions (e.g. EqualExpression, etc.). However it cannot support the whole .NET Framework. Especially: what should GetType() on a database column return?
Therefore you need to use the Expresson API to generate the predicate and use only expressions supported by LINQ2Entities. For example: Use a MemberAccess expression for accessing a property (LINQ2Entities is able to map that to an SQL query).
Hint: we are doing predicate generation for Entity Framework and had to overcome some additional problems which we could solve using the library LinqKit.
If you do not know about the .NET Expression API yet, you need to gather skills in that area before you can resume your dynamic repository idea.
BTW: I don't think that it is a very good idea to have this kind of automatic calls. They are not refactoring safe (i.e. what if you rename the DB column? All your method calls run into problems, and it is not detectable by the compiler).
I would use it only to generate predicates for Where() clauses from Filter-like DTO types.
Unusual pattern - dynamic methods on a repository patterns.But that is another topic.
Dynamic invocation of the repository you have.
So now you need to understand Linq to Entities a little more.
Linq to Entities language reference what you can do with linq to Entities.
Given the expression tree has to be converted in to DB instructions,
it isnt surprising there are restrictions.
In case you are interested The EF provider specs and links to samples
So given you want to Dynamic EF, you have a few options.
I concentrate on dynamic wheres, but you can apply to other EF methods.
Check out
Dynamic Linq on codeplex
which allows things like
public virtual IQueryable<TPoco> DynamicWhere(string predicate, params object[] values) {
return Context.Set<TPoco>().Where(predicate, values);
}
This Where is an IQueryable extension that accepts strings...
Samples of using this string based predicate parser
LinqKit or even PM> Install-Package LinqKit
Linqkit takes dynamic EF to the next level,
Offers amazing features like
public IQueryable<TPoco> AsExpandable() {
return Context.Set<TPoco>().AsExpandable();
}
which allows you build AND and ORs progressively.
Expression trees
Expression Building API is the most powerful tool to support you here .
Learning the API is hard. using the tool harder.
eg Dealing with concatenation very hard. BUT if you can understand the API and how expressions work.
It is possible.
Here is a SIMPLE example. (imagine something complex)
public static Expression<Func<TPoco, bool>> GetContainsPredicate<TPoco>(string propertyName,
string containsValue)
{
// (tpoco t) => t.propertyName.Contains(value ) is built
var parameterExp = Expression.Parameter(typeof(TPoco), #"t");
var propertyExp = Expression.Property(parameterExp, propertyName);
MethodInfo method = typeof(string).GetMethod(#"Contains", new[] { typeof(string) });
var someValue = Expression.Constant(containsValue, typeof(string));
var containsMethodExp = Expression.Call(propertyExp, method, someValue);
return Expression.Lambda<Func<TPoco, bool>>(containsMethodExp, parameterExp);
}

scala jdbc template and rowmapper

I am new to Scala and I am looking at a bit of Scala code which involves a Spring JDBC template and a RowMapper:
It is something like this:
val db = jdbcTemplate.queryForObject(QUERY, new RowMapper[SomeObject]() {
def mapRow(ResultSet rs, int rowNum) {
var s = new SomeObject()
s.setParam1 = rs.getDouble("columnName")
return s
}
})
db
I am writing this from memory so I have just used generic names.
I was wondering why db is written at the end. I can't think of what purpose it serves.
Also, If I had several JDBC templates and an object like s in the example where I wanted to populate it's data with output from the several JDBC templates. Is it possible to do this in one function? Is it possible to have a mapRow function which doesn't return anything so that I could maybe have an array of templates and loop through them?
Thanks
The db at the end means return db where return statement is skipped. This is a standard convention in Scala. Seems like your code is a body of the function which suppose to return db. The first statement simply assigns the result of the query to a db
RowMapper interface can be replaced with implicit conversion to a function of the following type (ResultSet, Int) => SomeObject, which means that it takes two parameters (ResultSet and Int) and returns the result of type SomeObject

IEqualityComparer exception

I am using Entity Framework 4.0 and trying to use the "Contains" function of one the object sets in my context object. to do so i coded a Comparer class:
public class RatingInfoComparer : IEqualityComparer<RatingInfo>
{
public bool Equals(RatingInfo x, RatingInfo y)
{
var a = new {x.PlugInID,x.RatingInfoUserIP};
var b = new {y.PlugInID,y.RatingInfoUserIP};
if(a.PlugInID == b.PlugInID && a.RatingInfoUserIP.Equals(b.RatingInfoUserIP))
return true;
else
return false;
}
public int GetHashCode(RatingInfo obj)
{
var a = new { obj.PlugInID, obj.RatingInfoUserIP };
if (Object.ReferenceEquals(obj, null))
return 0;
return a.GetHashCode();
}
}
when i try to use the comparer with this code:
public void SaveRatingInfo2(int plugInId, string userInfo)
{
RatingInfo ri = new RatingInfo()
{
PlugInID = plugInId,
RatingInfoUser = userInfo,
RatingInfoUserIP = "192.168.1.100"
};
//This is where i get the execption
if (!context.RatingInfoes.Contains<RatingInfo>(ri, new RatingInfoComparer()))
{
//my Entity Framework context object
context.RatingInfoes.AddObject(ri);
context.SaveChanges();
}
}
i get an execption:
"LINQ to Entities does not recognize the method 'Boolean Contains[RatingInfo](System.Linq.IQueryable1[OlafCMSLibrary.Models.RatingInfo], OlafCMSLibrary.Models.RatingInfo,
System.Collections.Generic.IEqualityComparer1[OlafCMSLibrary.Models.RatingInfo])' method, and his method cannot be translated into a store expression."
Since i am not proficient with linQ and Entity Framework i might be making a mistake with my use of the "var" either in the "GetHashCode" function or in general.
If my mistake is clear to you do tell me :) it does not stop my project! but it is essential for me to understand why a simple comparer doesnt work.
Thanks
Aaron
LINQ to Entities works by converting an expression tree into queries against an object model through the IQueryable interface. This means than you can only put things into the expression tree which LINQ to Entities understands.
It doesn't understand the Contains method you are using, so it throws the exception you see. Here is a list of methods which it understands.
Under the Set Methods section header, it lists Contains using an item as supported, but it lists Contains with an IEqualityComparer as not supported. This is presumably because it would have to be able to work out how to convert your IEqualityComparer into a query against the object model, which would be difficult. You might be able to do what you want using multiple Where clauses, see which ones are supported further up the document.

Using an IEqualityComparer with a LINQ to Entities Except clause

I have an entity that I'd like to compare with a subset and determine to select all except the subset.
So, my query looks like this:
Products.Except(ProductsToRemove(), new ProductComparer())
The ProductsToRemove() method returns a List<Product> after it performs a few tasks. So in it's simplest form it's the above.
The ProductComparer() class looks like this:
public class ProductComparer : IEqualityComparer<Product>
{
public bool Equals(Product a, Product b)
{
if (ReferenceEquals(a, b)) return true;
if (ReferenceEquals(a, null) || ReferenceEquals(b, null))
return false;
return a.Id == b.Id;
}
public int GetHashCode(Product product)
{
if (ReferenceEquals(product, null)) return 0;
var hashProductId = product.Id.GetHashCode();
return hashProductId;
}
}
However, I continually receive the following exception:
LINQ to Entities does not recognize
the method
'System.Linq.IQueryable1[UnitedOne.Data.Sql.Product]
Except[Product](System.Linq.IQueryable1[UnitedOne.Data.Sql.Product],
System.Collections.Generic.IEnumerable1[UnitedOne.Data.Sql.Product],
System.Collections.Generic.IEqualityComparer1[UnitedOne.Data.Sql.Product])'
method, and this method cannot be
translated into a store expression.
Linq to Entities isn't actually executing your query, it is interpreting your code, converting it to TSQL, then executing that on the server.
Under the covers, it is coded with the knowledge of how operators and common functions operate and how those relate to TSQL. The problem is that the developers of L2E have no idea how exactly you are implementing IEqualityComparer. Therefore they cannot figure out that when you say Class A == Class B you mean (for example) "Where Person.FirstName == FirstName AND Person.LastName == LastName".
So, when the L2E interpreter hits a method it doesn't recognize, it throws this exception.
There are two ways you can work around this. First, develop a Where() that satisfies your equality requirements but that doesn't rely on any custom method. In other words, test for equality of properties of the instance rather than an Equals method defined on the class.
Second, you can trigger the execution of the query and then do your comparisons in memory. For instance:
var notThisItem = new Item{Id = "HurrDurr"};
var items = Db.Items.ToArray(); // Sql query executed here
var except = items.Except(notThisItem); // performed in memory
Obviously this will bring much more data across the wire and be more memory intensive. The first option is usually the best.
You're trying to convert the Except call with your custom IEqualityComparer into Entity SQL.
Obviously, your class cannot be converted into SQL.
You need to write Products.AsEnumerable().Except(ProductsToRemove(), new ProductComparer()) to force it to execute on the client. Note that this will download all of the products from the server.
By the way, your ProductComparer class should be a singleton, like this:
public class ProductComparer : IEqualityComparer<Product> {
private ProductComparer() { }
public static ProductComparer Instance = new ProductComparer();
...
}
The IEqualityComparer<T> can only be executed locally, it can't be translated to a SQL command, hence the error

Resources