How to write a UDAF for Hive using Java that accepts multiple columns as arguments? - hadoop

I want to calculate currency_rate based on a few inputs such as date, var_currecy_code, fxd_crncy_code.
We have all the data in our Hive table; now we need to calculate currency_rate based on the maximum date and the other inputs mentioned above, using a Hive UDAF.

A Hive UDF can accept multiple columns as parameters, effectively a tuple of arguments.
Within the function you check how many arguments were passed and extract them in the order your logic expects.
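As a rough illustration, here is a minimal sketch of a GenericUDF (not the asker's actual rate logic; the class name CurrencyRateUDF is made up) that accepts a variable number of columns, validates how many were passed in initialize(), and pulls them out in order in evaluate():

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class CurrencyRateUDF extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Check how many columns the caller passed, e.g. date, var_currecy_code, fxd_crncy_code
        if (arguments.length < 3) {
            throw new UDFArgumentException("currency_rate expects at least 3 arguments");
        }
        // Return type of the function; a string here purely for illustration
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // Extract the arguments in the order the caller supplied them
        Object date = arguments[0].get();
        Object varCrncyCode = arguments[1].get();
        Object fxdCrncyCode = arguments[2].get();
        // Placeholder: the real rate lookup/calculation would go here
        return date + ":" + varCrncyCode + ":" + fxdCrncyCode;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "currency_rate(" + String.join(", ", children) + ")";
    }
}

Note that this is a plain UDF, so the "maximum date" part of the requirement would still be handled with a GROUP BY / max() in the query; a true GenericUDAF uses a different, more involved evaluator API.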

Related

nifi: how to merge multiple columns in csv file?

nifi version: 1.5
input file:
col1,col2,col3,col4,col5,col6
a,hr,nat,REF,6,2481
a,hr,nat,TDB,6,1845
b,IT,raj,NAV,6,2678
I want to merge the last three columns with : as the delimiter, separating the merged rows with /, grouped by col1.
expected output:
col1,col2,col3,col4
a,hr,nat,REF:6:2481/TDB:6:1845
b,IT,raj,NAV:6:2678
I am not able to find a solution because most of the answers I found were about merging two files.
Is there a better way to do it?
TIA.
I think you'll want a PartitionRecord processor first, with a partition field of col1; this will split the flow file into multiple flow files, where each distinct value of col1 ends up in its own flow file. If the first 3 columns are to be used for partitioning, you can add all three columns as user-defined properties for partitioning.
Then, whether you use a scripted solution or perhaps QueryRecord (if Calcite supports "group by" concatenation), the memory usage should be lower, since you are only dealing with one flow file at a time, whose rows are already associated with the specified group.

Pass parameter from spark to input format

We have files in a specific format in HDFS. We want to process the data extracted from these files within Spark. We have started to write an input format in order to create the RDD; this way we hope to be able to create an RDD from the whole file.
But each job only has to process a small subset of the data contained in the file, and I know how to extract this subset much more efficiently than by filtering a huge RDD.
How can I pass a query filter in the form of a String from my driver to my input format (the same way the Hive context does)?
Edit:
My file format is NetCDF, which stores huge matrices efficiently for multidimensional data, for example x, y, z and time. A first approach would be to extract all values from the matrix and produce an RDD row for each value. I'd like my input format to extract only a small subset of the matrix (maybe 0.01%) and build a small RDD to work with. The subset could be z = 0 and a small time period. I need to pass the time period to the input format, which will then retrieve only the values I'm interested in.
I guess Hive context does this when you pass an SQL query to the context. Only values matching the SQL query are present in the RDD, not all lines of the files.
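One common way to pass such a string down (a sketch under assumptions; it is not necessarily what the Hive context does internally) is through the Hadoop Configuration handed to newAPIHadoopFile: the driver sets a property, and the input format reads it back on the executor side. The FilteredInputFormat class and the custom.filter property name below are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical input format: the only point it illustrates is how the filter string
// set by the driver becomes visible inside the input format via the configuration.
public class FilteredInputFormat extends FileInputFormat<LongWritable, Text> {

    public static final String FILTER_KEY = "custom.filter"; // made-up property name

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // The string the driver put into the configuration is available here.
        String filter = context.getConfiguration().get(FILTER_KEY, "");
        System.out.println("Filter pushed down from the driver: " + filter);
        // A real NetCDF reader would use the filter to read only the matching subset
        // of the matrix; this sketch just falls back to plain line reading.
        return new LineRecordReader();
    }

    // Driver side: set the filter on a Configuration and pass it to newAPIHadoopFile.
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("netcdf-subset"));

        Configuration conf = new Configuration();
        conf.set(FILTER_KEY, "z = 0 AND time BETWEEN 100 AND 200");

        JavaPairRDD<LongWritable, Text> rdd = sc.newAPIHadoopFile(
                "hdfs:///data/file.nc", FilteredInputFormat.class,
                LongWritable.class, Text.class, conf);

        System.out.println(rdd.count());
        sc.stop();
    }
}

Because the filter travels in the job configuration, it is also visible in getSplits(), so splits that cannot contain matching values could be skipped entirely.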

Pig - HBASE Query - Dynamic Columns - Convert Map to Bag

I am attempting to use Pig to query an HBase table where the table columns are dynamic (the column titles are not known at query time), which makes referencing the map[] of key/values returned infeasible. I would like to convert each map returned by the query into a bag of key-value tuples.
How do I do this?
I have seen one example (in Python) that seems to rank high in search results and converts each map key-value pair into a bag of tuples. See below.
@outputSchema("values:bag{t:tuple(key, value)}")
def bag_of_tuples(map_dict):
    return map_dict.items()
However, in trying to follow this example, I do not know where the .items() function comes from. Is there a way to achieve this in pure Pig Latin with out-of-the-box UDFs?
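For what it's worth, .items() in that example is just the standard Python dict method that returns (key, value) pairs. If a custom Java UDF is an option instead (it is not out-of-the-box, so this is only a sketch, and the class name MapToBag is made up), the same conversion can be written as an EvalFunc:

import java.io.IOException;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MapToBag extends EvalFunc<DataBag> {

    private static final BagFactory BAG_FACTORY = BagFactory.getInstance();
    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> map = (Map<String, Object>) input.get(0);

        DataBag bag = BAG_FACTORY.newDefaultBag();
        for (Map.Entry<String, Object> entry : map.entrySet()) {
            Tuple t = TUPLE_FACTORY.newTuple(2);
            t.set(0, entry.getKey());     // column qualifier
            t.set(1, entry.getValue());   // cell value
            bag.add(t);
        }
        return bag;
    }
}

After REGISTERing the jar, it can be called from Pig by its fully qualified name (or a DEFINE alias) inside a FOREACH ... GENERATE.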

Increasing mapper in pig

I am using Pig to load data from Cassandra using CqlStorage. I have 4 data nodes, each of which can run 7 mappers, and there are ~30 million rows in Cassandra. When I run a plain load like
LOAD 'cql://keyspace/columnfamily' using CqlStorage
it takes 27 mappers to run.
But if I give a where clause in the load function, like
LOAD 'cql://keyspace/columnfamily?where_clause=id%3D100' using CqlStorage
it always takes one mapper.
Can anyone help me increase the number of mappers?
It looks from your WHERE clause like your map input will only be a single key, which would be the reason why you only get one mapper. Hadoop will allocate mappers based on the number of input keys. If you have only one input key, additional mappers will do nothing.
The bottom line is that if you specify your partition key in the where clause, you will get one mapper (since that's the way it gets distributed). Based on the comments I presume you are doing analysis for more than just one student, so there's no reason you'd be specifying the partition key. You also don't seem to have any columns that make sense for a secondary index. So I'm not sure why you even have a where clause.
It looks from your data model like you'll have to map over all your data to get aggregate marks for a combination of student and time range. It's possible you could change to a time-series data model and successfully filter in the where clause, but your current model doesn't support this.

Not able to store the data into HBase using Pig when I don't know the number of columns in a file

I have a text file with N columns (the number is not fixed; in the future I may have N+1).
Example:
1|A
2|B|C
3|D|E|F
I want to store the above data into HBase using Pig without writing a UDF. How can I store this kind of data without knowing the number of columns in the file?
Put it in a map, and then you can use cf1:*, where cf1 is your column family.
