Filter by UUID in Pig

I have a list of known UUIDs. I want to do a FILTER in Pig that filters out records whose id column does not contain a UUID from my list.
I have yet to find a way to specify bytearray literals such that I can write that filter statement.
How do I filter by UUID?
(In one attempt I tried using https://github.com/cevaris/pig-dse, per "How to FILTER Cassandra TimeUUID/UUID in Pig", thinking I could filter by a chararray literal of the UUID, but I got:
grunt> post_creators= LOAD 'cql://mykeyspace/mycf/' using AbstractCassandraStorage;
2014-10-09 14:56:05,597 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: could not instantiate 'AbstractCassandraStorage' with arguments 'null'
)

Use this Python UDF:
import array
import uuid

@outputSchema("uuid:bytearray")
def to_bytes(uuid_str):
    # convert the UUID string into its 16-byte form, which Pig treats as a bytearray
    return array.array('b', uuid.UUID(uuid_str).bytes)
Filter like this:
users = FILTER users by user_id == my_udf.to_bytes('dd2e03a7-7d3d-45b9-b902-2b39c5c541b5');
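For the FILTER above to resolve my_udf.to_bytes, the UDF script has to be registered first. A minimal sketch, assuming the UDF is saved as my_udf.py next to the Pig script (the file name is an assumption):
REGISTER 'my_udf.py' USING jython AS my_udf;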

Related

AWS Lambda Python boto3 reading from a DynamoDB table with multiple attributes in KeyConditionExpression

basicSongsTable has 'artist' as the partition key and 'song' as the sort key.
I am able to read using Query if I have one artist, but I want to read two artists with the following code. It gives a vague error: "errorMessage": "Syntax error in module 'lambda_function': positional argument follows keyword argument (lambda_function.py, line 17)".
import boto3
from pprint import pprint

dynamodbclient = boto3.client('dynamodb')

def lambda_handler(event, context):
    response = dynamodbclient.query(
        TableName='basicSongsTable',
        KeyConditionExpression='artist = :varartistname1', 'artist = :varartistname2',
        ExpressionAttributeValues={
            ':varartistname1': {'S': 'basam'},
            ':varartistname2': {'S': 'sree'}
        }
    )
    pprint(response['Items'])
If I give only one KeyConditionExpression, it works.
KeyConditionExpression='artist = :varartistname1',
ExpressionAttributeValues={
    ':varartistname1': {'S': 'basam'}
}
As per the documentation:
KeyConditionExpression (string) --
The condition that specifies the key values for items to be retrieved
by the Query action.
The condition must perform an equality test on a single partition key
value.
You are trying to perform an equality test on multiple partition key values, which doesn't work.
To get data for both artists, you will have to either run two queries or do a scan (which I do not recommend).
For other options, I would recommend taking a look at this answer and its pros and cons.
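A minimal sketch of the two-query approach, reusing the table and attribute names from the question:
import boto3
from pprint import pprint

dynamodbclient = boto3.client('dynamodb')

def lambda_handler(event, context):
    items = []
    # one Query per partition key value; each call is an equality test on a single artist
    for artist in ('basam', 'sree'):
        response = dynamodbclient.query(
            TableName='basicSongsTable',
            KeyConditionExpression='artist = :artist',
            ExpressionAttributeValues={':artist': {'S': artist}}
        )
        items.extend(response['Items'])
    pprint(items)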

How to load a JSON record into a JSON column in Postgres with Apache NiFi?

This is my flow file content:
{
"a":"b",
"c":"y",
"d":"z",
"e":"w",
"f":"u",
"g":"v",
"h":"o",
"x":"t"
}
The final result should look like this in Postgres:
| test                                                              |
|-------------------------------------------------------------------|
| {"a":"b","c":"y","d":"z","e":"w","f":"u","g":"v","h":"o","x":"t"} |
The table is json_test and the column name is test.
These steps show how I tried to solve the problem:
My method was to store the JSON record in an attribute as a string with ExtractText:
The attribute data takes only some of the key-value pairs from the JSON, not the entire record:
data = {"a":"b",
"c":"y",
"d":"z",
"e":"w",
"f":
So I have a problem with the regex expression.
Next I used PutSQL with the following SQL statement:
Unfortunately, the result isn't the desired one.
I need to know the exact expression that I should set in ExtractText to get the entire JSON record into an attribute as a string.
The SQL statement should be:
insert into schema.table_name(column_name) values(the_variable_where_the_flowfile_data_was_stored)
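A sketch of what that insert could look like with NiFi Expression Language, assuming the attribute holding the flow file content is named data (as in the ExtractText step above) and the target is the json_test table described earlier:
insert into json_test(test) values('${data}')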

Transform an org.apache.spark.rdd.RDD[String] into a parallelized collection

I have a CSV file in HDFS with a collection of products like:
[56]
[85,66,73]
[57]
[8,16]
[25,96,22,17]
[83,61]
I'm trying to apply the Association Rules algorithm in my code. For that I need to run this:
scala> val data = sc.textFile("/user/cloudera/data")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/data MapPartitionsRDD[294] at textFile at <console>:38
scala> val distData = sc.parallelize(data)
But when I submit this, I get this error:
<console>:40: error: type mismatch;
found : org.apache.spark.rdd.RDD[String]
required: Seq[?]
Error occurred in an application involving default arguments.
val distData = sc.parallelize(data)
How can I transform an RDD[String] into a Seq collection?
Many thanks!
What you are facing is simple, and the error shows it: sc.parallelize expects a Seq, but you are passing it an RDD[String].
The RDD is already parallelized; the textFile method distributes the file across your cluster, one element per line, so there is no need to call parallelize on it.
You can check the method description here:
https://spark.apache.org/docs/latest/programming-guide.html
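A minimal sketch of working with the RDD directly, assuming each line looks like the bracketed sample above and the goal is one array of items per transaction (for example, as input to MLlib's FPGrowth):
val data = sc.textFile("/user/cloudera/data")

// the RDD is already distributed, so just transform each line into an array of items
val transactions = data.map(line =>
  line.stripPrefix("[").stripSuffix("]").split(",").map(_.trim)
)
// transactions: org.apache.spark.rdd.RDD[Array[String]]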

Hadoop Pig: Show entries using STARTSWITH

I am having issues using the STARTSWITH string function. I want to display all records whose System_Period begins with 20040.
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
sysGroup = GROUP transactions BY System_Period;
sysFilter = FILTER sysGroup BY STARTSWITH(transactions.System_Period, 20040);
DUMP sysFilter;
The error I am receiving is
Could not infer the matching function for org.apache.pig.builtin.STARTSWITH as multiple or none of them fit. Please use an explicit cast.
STARTSWITH compares two chararrays and checks whether the first starts with the second. You cannot pass a relation or a bag to it, and it accepts only String (chararray) arguments, not an integer. Either load System_Period as a chararray, FILTER the records that begin with '20040' before the GROUP BY, and cast it back after the filter as per your need:
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysFilter = FILTER transactions BY STARTSWITH(System_Period, '20040');
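If System_Period is needed as an int again after the filter, it could be cast back, for example (the alias name is an assumption):
casted = FOREACH sysFilter GENERATE Branch_Number..Service_Date, (int)System_Period AS System_Period;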
Or, after the GROUP BY, FLATTEN the result and then filter:
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysGroup = GROUP transactions BY System_Period;
flatres = FOREACH sysGroup GENERATE group,FLATTEN(transactions);
sysFilter = FILTER flatres BY STARTSWITH(System_Period, '20040');

How to Store into HBase using Pig and HBaseStorage

In the HBase shell, I created my table via:
create 'pig_table','cf'
In Pig, here are the results of the alias I wish to store into pig_table:
DUMP B;
Produces tuples with 6 fields:
(D1|30|2014-01-01 13:00,D1,30,7.0,2014-01-01 13:00,DEF)
(D1|30|2014-01-01 22:00,D1,30,1.0,2014-01-01 22:00,JKL)
(D10|20|2014-01-01 11:00,D10,20,4.0,2014-01-01 11:00,PQR)
...
The first field is a concatenation of the second, third, and fifth fields, and will be used as the HBase rowkey.
But
STORE B INTO 'hbase://pig_table'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
'cf:device_id,cf:cost,cf:hours,cf:start_time,cf:code')
results in:
Failed to produce result in "hbase:pig_table"
The logs are giving me:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataByteArray
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.objToBytes(HBaseStorage.java:924)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.putNext(HBaseStorage.java:875)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:468)
... 11 more
What is wrong with my syntax?
It appears that HBaseStorage does not automatically convert the tuple fields into chararray, which is necessary before they can be stored in HBase. I simply cast them as follows:
C = FOREACH B GENERATE
    (chararray)$0,
    (chararray)$1,
    (chararray)$2,
    (chararray)$3,
    (chararray)$4,
    (chararray)$5;

STORE C INTO 'hbase://pig_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage (
    'cf:device_id,cf:cost,cf:hours,cf:start_time,cf:code');
