Hive : reflect function - hadoop

I'm trying to use the reflect function of Hive, which has this signature:
reflect(class, method[, arg1[, arg2..]])
I want to check whether a column c with value hello world ! contains world, so I wrote:
with a as
(select "hello world !" as c)
select reflect("java.lang.String", c, "contains", "world") from a
But it didn't work because it does not respect the signature, so I tried this:
with a as
(select "hello world !" as c)
select reflect(reflect("java.lang.Object", "toString", c), "contains", "world") from a
That didn't work either. How can I apply the reflect function to a given column?

reflect2 will help. See https://issues.apache.org/jira/browse/HIVE-20007
select reflect2("stackoverflow","length");
+------+--+
| _c0  |
+------+--+
| 13   |
+------+--+
But hashCode() won't work. See https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFReflect2.java#L86
select reflect2("stackoverflow","hashCode");
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:32 Argument type mismatch '"hashCode"': Use hash() UDF instead of this.
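To apply it to the column from the question: reflect2 takes the object itself as its first argument, so the column can be passed directly. An untested sketch (column name c taken from the question; note that String.contains expects a CharSequence, so method resolution may still need checking on your Hive version):
with a as
(select "hello world !" as c)
select reflect2(c, "contains", "world") from a;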

Related

how to get the complete source code location corresponding to the ASTnode in clang

I am using a clang plugin to develop a tool that inserts a printf statement after every statement, to trace the project. For example:
int main(){
    int a=1000;
    if(a==1000){
        a=999;
    }
}
After inserting the printf statements it becomes:
int main(){
    int a=1000;
    printf("int a=1000;");
    if(a==1000){
        a=999;
        printf("a=999;");
    }
}
I used the structure of the printFunctionName example from clang's examples to implement this.
I need to get the complete source code location corresponding to the AST node so that I can insert the printf statement in the right location. However, I found that I can't use Stmt->getSourceRange() to get the complete location range for some specific statement types, for example BinaryOperator and UnaryOperator.
a=1000;
| | | |-BinaryOperator 0x565356502960 <line:13:3, col:5> 'int' '='
| | | | |-DeclRefExpr 0x565356502918 <col:3> 'int' lvalue Var 0x565356502880 'a' 'int'
| | | | `-IntegerLiteral 0x565356502940 <col:5> 'int' 1000
In fact, these characters occupy more columns in the source code file, but only columns 3 through 5 are reported.
a++;
-UnaryOperator 0x5653565029b0 <line:14:9, col:10> 'int' postfix '++'
| | | | `-DeclRefExpr 0x565356502988 <col:9> 'int' lvalue Var 0x565356502880 'a' 'int'
These are three characters spanning 3 columns, but clang reports only 2 columns (9 through 10).
Do you have any ideas on how to solve this problem? And could you please give me some advice on how to better recognize the location of a complete statement ending with ';'? Thank you!
And this is my code:
SourceRange range=declStmt->getSourceRange();
TheRewriter.ReplaceText(declStmt->getSourceRange(),str+";"+"\nprintf(\""+str+"\\n\")");
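For context: the end location returned by getSourceRange() points at the beginning of the last token, not past it, which is why the BinaryOperator and UnaryOperator ranges look short. Below is a minimal sketch of a common workaround based on Lexer::getLocForEndOfToken, assuming a recent clang where Stmt::getEndLoc() is available; the function name printStmtWithEnd and the surrounding plumbing are hypothetical.
#include "clang/AST/ASTContext.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Lex/Lexer.h"
#include "clang/Rewrite/Core/Rewriter.h"

// Sketch: compute the full character range of a statement, including its last
// token and (if present) the trailing ';', then insert a printf after it.
void printStmtWithEnd(const clang::Stmt *S, clang::ASTContext &Context,
                      clang::Rewriter &TheRewriter) {
  const clang::SourceManager &SM = Context.getSourceManager();
  const clang::LangOptions &LangOpts = Context.getLangOpts();

  // getEndLoc() points at the *start* of the last token (e.g. the '1000' in
  // 'a=1000'), so extend it to the end of that token first.
  clang::SourceLocation EndOfLastToken =
      clang::Lexer::getLocForEndOfToken(S->getEndLoc(), /*Offset=*/0, SM, LangOpts);

  // If the next token is the terminating ';', step past it as well.
  clang::SourceLocation AfterSemi = clang::Lexer::findLocationAfterToken(
      S->getEndLoc(), clang::tok::semi, SM, LangOpts,
      /*SkipTrailingWhitespaceAndNewLine=*/false);
  clang::SourceLocation End = AfterSemi.isValid() ? AfterSemi : EndOfLastToken;

  // Full source text of the statement (quotes/format specifiers would need
  // escaping before being placed inside a printf in real code).
  std::string Text = clang::Lexer::getSourceText(
      clang::CharSourceRange::getCharRange(S->getBeginLoc(), End), SM, LangOpts).str();

  TheRewriter.InsertText(End, "\nprintf(\"" + Text + "\\n\");",
                         /*InsertAfter=*/true, /*indentNewLines=*/true);
}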

How to use a case expression, or should I go for another option?

I have a table that looks something like this, Report_Table:
Report_DEFINITION_NAME | Report_CORE_NAME | COMPLETION_STATUS | COMPLETION_DATE
-----------------------|------------------|-------------------|----------------
ReportAD               | AD               | Success           | 14-01-2019
ReportBB               | BB               | Error             | 24-06-2022
ReportAD               | AD               | Error             | 19-03-2020
ReportR5               | R5               | Success           | 04-06-2022
ReportG8               | G8               | Error             | 04-06-2022
ReportR5               | R5               | Success           | 18-11-2020
ReportLH               | LH               | Success           | 07-09-2019
ReportU6               | U6               | Error             | 12-05-2022
ReportAD               | AD               | Success           | 23-09-2021
I want to pull data from Report_Table. If COMPLETION_STATUS is Success it should give the latest Success COMPLETION_DATE, and if it is Error it should give the last Success COMPLETION_DATE as well as the Error date. Something like:
select Report_DEFINITION_NAME, Report_Core_name, COMPLETION_STATUS, COMPLETION_DATE,
CASE
  WHEN COMPLETION_STATUS='Success' THEN latest Success COMPLETION_DATE
  WHEN COMPLETION_STATUS='Error' THEN last Success COMPLETION_DATE
  WHEN COMPLETION_STATUS='Error' THEN latest Error COMPLETION_DATE
END
from Report_Table;
The output should be in a single query, identified by core name or definition name.
So, I used an IN clause to solve my problem, and no CASE expression.
Getting the max value in a subquery and referencing that subquery with an IN clause from my parent select statement solved my problem.
Something like:
SELECT
    a.*,
    a.completion_date,
    b.completion_date AS last_success_date
FROM report_table a, report_table b
WHERE a.report_definition_name = b.report_definition_name
  AND a.completion_date IN
      (
        SELECT MAX(su.completion_date)
        FROM report_table su
        WHERE su.completion_status = 'Success'
          AND b.report_definition_name = su.report_definition_name
      )
ORDER BY a.completion_date DESC;
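For comparison, here is a hedged sketch of the CASE-based route using a window function (untested, column names taken from the question): a conditional MAX over each report attaches the latest Success date to every row, so Error rows carry both their own date and the last Success date. Note that if completion_date is stored as a dd-mm-yyyy string rather than a date, it would need converting before taking the MAX.
SELECT
    report_definition_name,
    report_core_name,
    completion_status,
    completion_date,
    MAX(CASE WHEN completion_status = 'Success' THEN completion_date END)
        OVER (PARTITION BY report_definition_name) AS last_success_date
FROM report_table
ORDER BY completion_date DESC;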

regex pattern not working in pyspark after applying the logic

I have data as below:
>>> df1.show()
+-----------------+--------------------+
| corruptNames| standardNames|
+-----------------+--------------------+
|Sid is (Good boy)| Sid is Good Boy|
| New York Life| New York Life In...|
+-----------------+--------------------+
So, as per the above data, I need to apply a regex, create a new column, and get the data as in the second column, i.e. standardNames. I tried the code below:
spark.sql("select *, case when corruptNames rlike '[^a-zA-Z ()]+(?![^(]*))' or corruptNames rlike 'standardNames' then standardNames else 0 end as standard from temp1").show()
It throws below error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`standardNames`' given input columns: [temp1.corruptNames, temp1. standardNames];
Try this example without the SQL select. I am assuming you want to create a new column called standardNames based on corruptNames if the regex pattern matches, otherwise "do something else...".
Note: your pattern won't compile because you need to escape the second-to-last ) with \.
pattern = '[^a-zA-Z ()]+(?![^(]*))' #this won't compile
pattern = r'[^a-zA-Z ()]+(?![^(]*\))' #this will
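A quick way to see the difference (illustrative only; Spark's rlike uses Java regex, but Python's re rejects the unescaped pattern for the same reason):
import re

# re.error: unbalanced parenthesis
try:
    re.compile('[^a-zA-Z ()]+(?![^(]*))')
except re.error as e:
    print(e)

# compiles fine once the final ) is escaped
re.compile(r'[^a-zA-Z ()]+(?![^(]*\))')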
Code
import pyspark.sql.functions as F

df_text = spark.createDataFrame([('Sid is (Good boy)',), ('New York Life',)], ('corruptNames',))

pattern = r'[^a-zA-Z ()]+(?![^(]*\))'

df = df_text.withColumn('standardNames',
                        F.when(F.col('corruptNames').rlike(pattern), F.col('corruptNames'))
                         .otherwise('Do something else'))

df.show()
#+-----------------+---------------------+
#| corruptNames| standardNames|
#+-----------------+---------------------+
#|Sid is (Good boy)| Do something else|
#| New York Life| Do something else|
#+-----------------+---------------------+

Insert collect_set values into Elasticsearch with PIG

I have a Hive table which contains 3 columns: "id" (string), "booklist" (array of string), and "date" (string), with the following data:
----------------------------------------------------
id | booklist | date
----------------------------------------------------
1 | ["Book1" , "Book2"] | 2017-11-27T01:00:00.000Z
2 | ["Book3" , "Book4"] | 2017-11-27T01:00:00.000Z
When trying to insert into Elasticsearch with this Pig script:
-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
'es.nodes = elasticsearch.service.consul',
'es.port = 9200',
'es.write.operation = upsert',
'es.mapping.id = id',
'es.mapping.pig.tuple.use.field.names=true'
);
hivetable = LOAD 'default.reading' USING HCatLoader();
hivetable_flat = FOREACH hivetable
GENERATE
id AS id,
booklist as bookList,
date AS date;
STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------
When running the above, I got an error saying:
ERROR 2999:Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [bookList]; Bailing out..
Can anyone shed any light on how to parse an ARRAY of STRING into ES and get the above to work?
Thank you!
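One thing that may help narrow this down (a diagnostic sketch only, not a confirmed fix): HCatLoader exposes a Hive array<string> as a Pig bag of single-field tuples, and what es-hadoop serializes for that bag may not match the existing Elasticsearch mapping for bookList. Dumping the Pig schema and a couple of rows before the STORE shows exactly what is being sent:
hivetable = LOAD 'default.reading' USING HCatLoader();
DESCRIBE hivetable;          -- shows booklist's Pig type, e.g. {(chararray)}
preview = LIMIT hivetable 2;
DUMP preview;                -- shows the actual values es-hadoop will serialize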

Setting textinputformat.record.delimiter in sparksql

In Spark 2.0.1, Hadoop 2.6.0, I have many files delimited with '!#!\r' and not with the usual newline \n, for example:
=========================================
2001810086 rongq 2001 810!#!
2001810087 hauaa 2001 810!#!
2001820081 hello 2001 820!#!
2001820082 jaccy 2001 820!#!
2002810081 cindy 2002 810!#!
=========================================
I tried to extract the data according to Setting textinputformat.record.delimiter in Spark:
set textinputformat.record.delimiter='!#!\r'; or set textinputformat.record.delimiter='!#!\n'; but I still cannot extract the data.
In spark-sql, I do this:
=========================================
create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';
load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;
The result is 5, but when I try set textinputformat.record.delimiter='!#!\r'; and then select count(*) from ceshi;, the result is 1; the delimiter does not work.
I also checked the source of Hadoop 2.6.0, the RecordReader method in TextInputFormat.java. I noticed that the default textinputformat.record.delimiter is null, in which case LineReader.java uses the readDefaultLine method to read a line terminated by one of CR, LF, or CRLF (CR = '\r', LF = '\n').
You should use sparkContext's hadoopConfiguration API to set textinputformat.record.delimiter as
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\r")
Then if you read the text file using sparkContext as
sc.textFile("the input file path")
you should be fine.
Updated
I have noticed that a text file saved with the \r delimiter is changed to a \n delimiter.
So the following format should work for you, as it did for me:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")

// needed for .toDF outside spark-shell (spark-shell imports it automatically)
import spark.implicits._

val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF
A case class called ceshi is needed:
case class ceshi(id: Int, name: String, year: String, major :String)
which should give a dataframe as
+----------+-----+-----+-----+
|id |name |year |major|
+----------+-----+-----+-----+
|2001810086|rongq| 2001|810 |
|2001810087|hauaa| 2001|810 |
|2001820081|hello| 2001|820 |
|2001820082|jaccy| 2001|820 |
|2002810081|cindy| 2002|810 |
+----------+-----+-----+-----+
Now you can call the count function as
import org.apache.spark.sql.functions._
df.select(count("*")).show(false)
which would give output as
+--------+
|count(1)|
+--------+
|5 |
+--------+
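As an aside, if you really need the original '!#!\r' delimiter, a hedged alternative sketch is to pass the delimiter through a job-local Configuration via newAPIHadoopFile instead of mutating the shared sc.hadoopConfiguration (the path and value names are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// copy the existing Hadoop conf and override the record delimiter for this job only
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "!#!\r")

val data = sc
  .newAPIHadoopFile("the input file path", classOf[TextInputFormat],
                    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }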
