pass dataframe column as parameter in xpath

I am using xpath in PySpark to extract values from XML which is stored as a column in a table.
The following works fine:
entity_id = "D8"
dfquestionstep = df_source_xml.selectExpr("disclosure_entity_id",
    f'xpath(xml, "*/entities/entity[@type=\'TI\']/entity[@type=\'UNDERWRITING\']/entity[@type=\'DISCLOSURES\']/entity[@id=\'{entity_id}\']/entity[@type=\'DECISION_PATH\']/entity[@type=\'QUESTION_STEP\']/@id") QUESTION_STEP_ID'
)
PROBLEM
Now I want to pass disclosure_entity_id, which is a column in the dataframe holding values like D8, D9, etc., in place of entity_id, i.e. entity[@id=disclosure_entity_id].
But all I get is [] as the result when I execute it like this, i.e. xpath fails to find anything.
Is there a way to pass the DF column directly as an argument to xpath like above?

Some test data:
data = [
['a','<x><a>a1</a><b>b1</b><c>c1</c></x>'],
['b','<x><a>a2</a><b>b2</b><c>c2</c></x>'],
['c','<x><a>a3</a><b>b3</b><c>c3</c></x>'],
]
df = spark.createDataFrame(data, ['col','data'])
Attempt 1:
Creating a column with an XPath expression can be done:
from pyspark.sql import functions as f
df.withColumn('my_path', f.concat(f.lit('//'), f.col('col'))) \
    .selectExpr('xpath(data, my_path)').show()
But unfortunately the code above only yields the error message
AnalysisException: cannot resolve 'xpath(`data`, `my_path`)' due to data type mismatch:
path should be a string literal; line 1 pos 0;
The path parameter of the xpath function has to be a constant string. This string is parsed before Spark even looks at the data.
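If the set of values that can appear in col is small and known in advance, one workaround (a sketch that is not part of the original attempts; the known_tags list below is an assumption) is to spell out one literal path per value and pick the matching result with a CASE expression, so that each xpath() call still receives a constant string:
from pyspark.sql import functions as f

# Hypothetical: the tags that can appear in 'col' are known up front.
known_tags = ['a', 'b', 'c']
cases = " ".join(
    f"WHEN col = '{t}' THEN xpath(data, '//{t}/text()')" for t in known_tags
)
df.selectExpr('col', 'data', f"CASE {cases} END AS my_val").show(truncate=False)
This scales poorly when there are many distinct values, which is where the UDF approach below comes in.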
Attempt 2:
Another option is to use a UDF and process the XPath expression with standard Python functions inside the UDF:
import xml.etree.ElementTree as ET
from pyspark.sql import types as T
def find_val(col, data):
    result = ET.fromstring(data).find(f'.//{col}')
    if result is not None:
        return result.text

find_val_udf = f.udf(find_val, returnType=T.StringType())
df.select('col', 'data', find_val_udf('col', 'data')).show(truncate=False)
Output:
+---+----------------------------------+-------------------+
|col|data |find_val(col, data)|
+---+----------------------------------+-------------------+
|a |<x><a>a1</a><b>b1</b><c>c1</c></x>|a1 |
|b |<x><a>a2</a><b>b2</b><c>c2</c></x>|b2 |
|c |<x><a>a3</a><b>b3</b><c>c3</c></x>|c3 |
+---+----------------------------------+-------------------+
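Applied back to the original question, the same UDF idea would look roughly like the sketch below. It assumes the XML sits in a column named xml with the structure shown in the question; the function name find_question_step_ids and the ArrayType return are illustrative choices, not part of the original post.
import xml.etree.ElementTree as ET
from pyspark.sql import functions as f, types as T

def find_question_step_ids(disclosure_entity_id, xml_str):
    # ElementTree's limited XPath supports [@attr='value'] predicates,
    # so the id taken from the column can be substituted into the path.
    path = (".//entities/entity[@type='TI']/entity[@type='UNDERWRITING']"
            "/entity[@type='DISCLOSURES']"
            f"/entity[@id='{disclosure_entity_id}']"
            "/entity[@type='DECISION_PATH']/entity[@type='QUESTION_STEP']")
    return [e.get('id') for e in ET.fromstring(xml_str).findall(path)]

find_ids_udf = f.udf(find_question_step_ids, T.ArrayType(T.StringType()))

dfquestionstep = df_source_xml.select(
    'disclosure_entity_id',
    find_ids_udf('disclosure_entity_id', 'xml').alias('QUESTION_STEP_ID'))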

Related

How to append a delimiter between multiple values coming from a repeating field in xquery

I have an XML file which has a repeating element generating multiple values.
I would like all the values generated from that xpath to be separated by a delimiter such as ',', '|' or '_'.
I have tried the following, which did not work:
tokenize(/*:ShippedUnit/*:Containment/*:ContainerManifest/*:Consignments/*:Consignment/*:ConsignmentHeader/*:ConsignmentRef, '\s')
replace(/*:ShippedUnit/*:Containment/*:ContainerManifest/*:Consignments/*:Consignment/*:ConsignmentHeader/*:ConsignmentRef," ","_")
Example:
Now getting - CBR123 CBR678 CBR656
Expecting to get - CBR123|CBR678|CBR656
Note: In some transactions, there may be only one value present for that xpath, and therefore replace does not work here.
To achieve the expected result, assuming the sample source XML added to the comments in the original post, use the fn:string-join() function:
string-join(
//ConsignmentRef,
"|"
)
This will return:
CBR00464833N|CBR01264878K
For more on this function, see https://www.w3.org/TR/xpath-functions-31/#func-string-join.
Another option in XQuery 3.1 would be
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method 'text';
declare option output:item-separator '|';
//ConsignmentRef

How to use startsWith and endsWith function in SparkR 2.3.0?

I can't understand how to use the startsWith and endsWith functions in SparkR 2.3.0.
I thought that I could use them like the starts_with command of dplyr as below, but an error occurred.
I would appreciate it if you could kindly teach me.
> df <- read.df("/hadoop/tmp/iris.csv", "csv", header = "true")
> showDF(select(df, startsWith(columns(df), "Sepal")))
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'select' for signature '"SparkDataFrame", "logical"'
The startsWith and endsWith functions operate on columns, not on a dataframe.
To do the select you are attempting, you can use:
df <- as.DataFrame(iris)
df_sepal <- select(df, names(df)[grepl("Sepal", names(df))])
To use startsWith() you need to pass a column as an argument, as well as the string you are checking. For example,
df_v <- filter(df, startsWith(df$Species, "v") == TRUE)
will filter for only the rows where Species begins with 'v' (versicolor, virginica)
df_a <- filter(df, endsWith(df$Species, "a") == TRUE)
will filter for only the rows where Species ends with 'a' (setosa, virginica)

Simple integer comparison in HBase

I am trying out a very simple example in HBase. Following is how I create the table and put data:
create 'newdb3','data'
put 'newdb3','row1','data:name','Thexxx Beatles'
put 'newdb3','row2','data:name','The Beatles'
put 'newdb3','row3','data:name','Beatles'
put 'newdb3','row4','data:name','Thexxx'
put 'newdb3','row1','data:duration',400
put 'newdb3','row2','data:duration',300
put 'newdb3','row3','data:duration',200
put 'newdb3','row4','data:duration',100
scan 'newdb3', {COLUMNS => 'data:name', FILTER => "SingleColumnValueFilter('data','duration', > ,'binaryprefix:200')"}
But the result is always all 4 rows. I tried the number with and without quotes, and using hex values. I also tried 'binary' instead of 'binaryprefix'. How do I store and compare integers in HBase?
Does this produce the expected output?
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.util.Bytes

scan 'newdb3', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('data'), \
  Bytes.toBytes('duration'), \
  CompareFilter::CompareOp.valueOf('GREATER'), \
  BinaryComparator.new(Bytes.toBytes('200'))) }
NOTE: This does a lexicographic byte comparison, so for numbers stored as strings it will only work if they are zero-padded to a fixed width.

Pig: Unable to Load BAG

I have a record in this format:
{(Larry Page),23,M}
{(Suman Dey),22,M}
{(Palani Pratap),25,M}
I am trying to LOAD the record using this:
records = LOAD '~/Documents/PigBag.txt' AS (details:BAG{name:tuple(fullname:chararray),age:int,gender:chararray});
But I am getting this error:
2015-02-04 20:09:41,556 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 101> mismatched input ',' expecting RIGHT_CURLY
Please advise.
It's not a bag since it's not made up of tuples. Try
load ... as (name:tuple(fullname:chararray), age:int, gender:chararray)
For some reason Pig wraps the output of a line in curly braces, which makes it look like a bag, but it's not one. If you saved this data using PigStorage, you can save it with the '-schema' parameter, which tells PigStorage to create a schema file (.pigschema or something similar) that you can inspect to see what the saved schema is. It can also be used when loading with PigStorage to save you the AS clause.
Yes, LiMuBei's point is absolutely right. Your input is not in the right format. Pig always expects a bag to hold a collection of tuples, but in your case it is a collection of a tuple and fields. In this case Pig will retain the tuple and reject the fields (age and gender) during load.
But this problem can easily be solved with a different (somewhat hacky) approach:
1. Load each input line as a chararray.
2. Remove the curly braces and parentheses from the input.
3. Using the STRSPLIT function, segregate the input into (name,age,sex) fields.
PigScript:
A = LOAD 'input' USING PigStorage AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[}{)(]+','')) AS (newline:chararray);
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',3)) AS (fullname:chararray,age:int,sex:chararray);
DUMP C;
Output:
(Larry Page,23,M)
(Suman Dey,22,M)
(Palani Pratap,25,M)
Now you can access all the fields using fullname,age,sex.

using MultiStorage to store records in separate files

I'm trying to store a set of records like these:
2342514224232 | some text here whatever
2342514224234| some more text here whatever
....
into separate files in the output folder like this:
output/2342514224232
output/2342514224234
The value of idstr should be the file name, and the text should be inside the file. Here's my Pig code:
REGISTER /home/bytebiscuit/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
A = LOAD 'cleantweets.csv' using PigStorage(',') AS (idstr:chararray, createdat:chararray, text:chararray,followers:int,friends:int,language:chararray,city:chararray,country:chararray,lat:chararray,lon:chararray);
B = FOREACH A GENERATE idstr, text, language, country;
C = FILTER B BY (country == 'United States' OR country == 'United Kingdom') AND language == 'en';
texts = FOREACH C GENERATE idstr,text;
STORE texts INTO 'output/query_results_one' USING org.apache.pig.piggybank.storage.MultiStorage('output/query_results_one', '0');
Running this pig script gives me the following error:
<file pigquery1.pig, line 12, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.MultiStorage' with arguments '[output/query_results_one, idstr]'
Any help is really appreciated!
Try this option:
MultiStorage('output/query_results_one', '0', 'none', ',');
In case anybody stumbles across this post like I did, the problem for me was that my pig script looked like:
DEFINE MultiStorage org.apache.pig.piggybank.storage.MultiStorage();
...
STORE stuff INTO 's3:/...' USING MultiStorage('s3:/...','0','none',',');
The DEFINE statement was incorrect in not specifying the inputs/outputs. Forgoing the DEFINE statement and directly using the following fixed my problem.
STORE stuff INTO 's3:/...' USING org.apache.pig.piggybank.storage.MultiStorage('s3:/...','0','none',',');
