Hive : reflect function - hadoop

I'm trying to use the reflect function of Hive, which has this signature:
reflect(class, method[, arg1[, arg2..]])
I want to check whether a column c with value hello world ! contains world, so I wrote:
with a as
(select "hello world !" as c)
select reflect("java.lang.String", c, "contains", "world") from a
But it didn't work because it does not respect the signature, so I tried this:
with a as
(select "hello world !" as c)
select reflect(reflect("java.lang.Object", "toString", c), "contains", "world") from a
That didn't work either. How can I apply the reflect function to a given column?

reflect2 will help. See https://issues.apache.org/jira/browse/HIVE-20007
select reflect2("stackoverflow","length");
+------+--+
| _c0  |
+------+--+
| 13   |
+------+--+
But hashCode() won't work. See https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFReflect2.java#L86
select reflect2("stackoverflow","hashCode");
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:32 Argument type mismatch '"hashCode"': Use hash() UDF instead of this.
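To apply it to the column from the question: reflect2 takes the object itself as its first argument, so the column can be passed directly. An untested sketch (column name c taken from the question; note that String.contains expects a CharSequence, so method resolution may still need checking on your Hive version):
with a as
(select "hello world !" as c)
select reflect2(c, "contains", "world") from a;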

Related

how to get the complete source code location corresponding to the ASTnode in clang

I am using a clang plugin to develop a tool that inserts a printf statement after every statement, to trace the project. For example:
int main(){
    int a=1000;
    if(a==1000){
        a=999;
    }
}
After inserting the printf statements it becomes:
int main(){
    int a=1000;
    printf("int a=1000;");
    if(a==1000){
        a=999;
        printf("a=999;");
    }
}
I used the structure of the printFunctionName example from clang's examples to implement this.
I need to get the complete source code location corresponding to the AST node so that I can insert the printf statement in the right location. However, I found that I can't use Stmt->getSourceRange() to get the complete location range for some specific statement types, for example BinaryOperator and UnaryOperator.
a=1000;
| | | |-BinaryOperator 0x565356502960 <line:13:3, col:5> 'int' '='
| | | | |-DeclRefExpr 0x565356502918 <col:3> 'int' lvalue Var 0x565356502880 'a' 'int'
| | | | `-IntegerLiteral 0x565356502940 <col:5> 'int' 1000
In fact, these characters occupy more columns in the source code file, but only columns 3 through 5 are reported.
a++;
-UnaryOperator 0x5653565029b0 <line:14:9, col:10> 'int' postfix '++'
| | | | `-DeclRefExpr 0x565356502988 <col:9> 'int' lvalue Var 0x565356502880 'a' 'int'
These are three characters spanning 3 columns, but clang reports only 2 columns (9 through 10).
Do you have any ideas on how to solve this problem? And could you please give me some advice on how to better recognize the location of a complete statement ending with ';'? Thank you!
And this is my code:
SourceRange range=declStmt->getSourceRange();
TheRewriter.ReplaceText(declStmt->getSourceRange(),str+";"+"\nprintf(\""+str+"\\n\")");
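For context: the end location returned by getSourceRange() points at the beginning of the last token, not past it, which is why the BinaryOperator and UnaryOperator ranges look short. Below is a minimal sketch of a common workaround based on Lexer::getLocForEndOfToken, assuming a recent clang where Stmt::getEndLoc() is available; the function name printStmtWithEnd and the surrounding plumbing are hypothetical.
#include "clang/AST/ASTContext.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Lex/Lexer.h"
#include "clang/Rewrite/Core/Rewriter.h"

// Sketch: compute the full character range of a statement, including its last
// token and (if present) the trailing ';', then insert a printf after it.
void printStmtWithEnd(const clang::Stmt *S, clang::ASTContext &Context,
                      clang::Rewriter &TheRewriter) {
  const clang::SourceManager &SM = Context.getSourceManager();
  const clang::LangOptions &LangOpts = Context.getLangOpts();

  // getEndLoc() points at the *start* of the last token (e.g. the '1000' in
  // 'a=1000'), so extend it to the end of that token first.
  clang::SourceLocation EndOfLastToken =
      clang::Lexer::getLocForEndOfToken(S->getEndLoc(), /*Offset=*/0, SM, LangOpts);

  // If the next token is the terminating ';', step past it as well.
  clang::SourceLocation AfterSemi = clang::Lexer::findLocationAfterToken(
      S->getEndLoc(), clang::tok::semi, SM, LangOpts,
      /*SkipTrailingWhitespaceAndNewLine=*/false);
  clang::SourceLocation End = AfterSemi.isValid() ? AfterSemi : EndOfLastToken;

  // Full source text of the statement (quotes/format specifiers would need
  // escaping before being placed inside a printf in real code).
  std::string Text = clang::Lexer::getSourceText(
      clang::CharSourceRange::getCharRange(S->getBeginLoc(), End), SM, LangOpts).str();

  TheRewriter.InsertText(End, "\nprintf(\"" + Text + "\\n\");",
                         /*InsertAfter=*/true, /*indentNewLines=*/true);
}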

How to use a case expression, or should I go for another option?

I have a table that looks something like this, Report_Table:
Report_DEFINITION_NAME | Report_CORE_NAME | COMPLETION_STATUS | COMPLETION_DATE
-----------------------|------------------|-------------------|----------------
ReportAD               | AD               | Success           | 14-01-2019
ReportBB               | BB               | Error             | 24-06-2022
ReportAD               | AD               | Error             | 19-03-2020
ReportR5               | R5               | Success           | 04-06-2022
ReportG8               | G8               | Error             | 04-06-2022
ReportR5               | R5               | Success           | 18-11-2020
ReportLH               | LH               | Success           | 07-09-2019
ReportU6               | U6               | Error             | 12-05-2022
ReportAD               | AD               | Success           | 23-09-2021
I want to pull data from Report_Table. If COMPLETION_STATUS is Success it should give the latest Success COMPLETION_DATE, and if it is Error it should give the last Success COMPLETION_DATE as well as the Error date. Something like:
select Report_DEFINITION_NAME, Report_Core_name, COMPLETION_STATUS, COMPLETION_DATE,
CASE
  WHEN COMPLETION_STATUS='Success' THEN latest Success COMPLETION_DATE
  WHEN COMPLETION_STATUS='Error' THEN last Success COMPLETION_DATE
  WHEN COMPLETION_STATUS='Error' THEN latest Error COMPLETION_DATE
END
from Report_Table;
The output should be in a single query, identified by core name or definition name.
So, I used an IN clause to solve my problem, and no CASE expression.
Getting the max value in a subquery and referencing that subquery with an IN clause from my parent select statement solved my problem.
Something like:
SELECT
    a.*,
    a.completion_date,
    b.completion_date AS last_success_date
FROM report_table a, report_table b
WHERE a.report_definition_name = b.report_definition_name
  AND a.completion_date IN
      (
        SELECT MAX(su.completion_date)
        FROM report_table su
        WHERE su.completion_status = 'Success'
          AND b.report_definition_name = su.report_definition_name
      )
ORDER BY a.completion_date DESC;
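For comparison, here is a hedged sketch of the CASE-based route using a window function (untested, column names taken from the question): a conditional MAX over each report attaches the latest Success date to every row, so Error rows carry both their own date and the last Success date. Note that if completion_date is stored as a dd-mm-yyyy string rather than a date, it would need converting before taking the MAX.
SELECT
    report_definition_name,
    report_core_name,
    completion_status,
    completion_date,
    MAX(CASE WHEN completion_status = 'Success' THEN completion_date END)
        OVER (PARTITION BY report_definition_name) AS last_success_date
FROM report_table
ORDER BY completion_date DESC;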

regex pattern not working in pyspark after applying the logic

I have data as below:
>>> df1.show()
+-----------------+--------------------+
| corruptNames| standardNames|
+-----------------+--------------------+
|Sid is (Good boy)| Sid is Good Boy|
| New York Life| New York Life In...|
+-----------------+--------------------+
So, as per the above data, I need to apply a regex, create a new column, and get the data as in the second column, i.e. standardNames. I tried the code below:
spark.sql("select *, case when corruptNames rlike '[^a-zA-Z ()]+(?![^(]*))' or corruptNames rlike 'standardNames' then standardNames else 0 end as standard from temp1").show()
It throws below error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`standardNames`' given input columns: [temp1.corruptNames, temp1. standardNames];
Try this example without the SQL select. I am assuming you want to create a new column called standardNames based on corruptNames if the regex pattern matches, otherwise "do something else...".
Note: your pattern won't compile because you need to escape the second-to-last ) with \.
pattern = '[^a-zA-Z ()]+(?![^(]*))' #this won't compile
pattern = r'[^a-zA-Z ()]+(?![^(]*\))' #this will
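A quick way to see the difference (illustrative only; Spark's rlike uses Java regex, but Python's re rejects the unescaped pattern for the same reason):
import re

# re.error: unbalanced parenthesis
try:
    re.compile('[^a-zA-Z ()]+(?![^(]*))')
except re.error as e:
    print(e)

# compiles fine once the final ) is escaped
re.compile(r'[^a-zA-Z ()]+(?![^(]*\))')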
Code
import pyspark.sql.functions as F

df_text = spark.createDataFrame([('Sid is (Good boy)',), ('New York Life',)], ('corruptNames',))

pattern = r'[^a-zA-Z ()]+(?![^(]*\))'

df = df_text.withColumn('standardNames',
                        F.when(F.col('corruptNames').rlike(pattern), F.col('corruptNames'))
                         .otherwise('Do something else'))

df.show()
#+-----------------+---------------------+
#| corruptNames| standardNames|
#+-----------------+---------------------+
#|Sid is (Good boy)| Do something else|
#| New York Life| Do something else|
#+-----------------+---------------------+

Insert collect_set values into Elasticsearch with PIG

I have a Hive table which contains 3 columns: "id" (string), "booklist" (array of string), and "date" (string), with the following data:
----------------------------------------------------
id | booklist | date
----------------------------------------------------
1 | ["Book1" , "Book2"] | 2017-11-27T01:00:00.000Z
2 | ["Book3" , "Book4"] | 2017-11-27T01:00:00.000Z
When trying to insert into Elasticsearch with this Pig script:
-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
'es.nodes = elasticsearch.service.consul',
'es.port = 9200',
'es.write.operation = upsert',
'es.mapping.id = id',
'es.mapping.pig.tuple.use.field.names=true'
);
hivetable = LOAD 'default.reading' USING HCatLoader();
hivetable_flat = FOREACH hivetable
GENERATE
id AS id,
booklist as bookList,
date AS date;
STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------
When running the above, I got an error saying:
ERROR 2999:Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [bookList]; Bailing out..
Can anyone shed any light on how to parse an ARRAY of STRING into ES and get the above to work?
Thank you!
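One thing that may help narrow this down (a diagnostic sketch only, not a confirmed fix): HCatLoader exposes a Hive array<string> as a Pig bag of single-field tuples, and what es-hadoop serializes for that bag may not match the existing Elasticsearch mapping for bookList. Dumping the Pig schema and a couple of rows before the STORE shows exactly what is being sent:
hivetable = LOAD 'default.reading' USING HCatLoader();
DESCRIBE hivetable;          -- shows booklist's Pig type, e.g. {(chararray)}
preview = LIMIT hivetable 2;
DUMP preview;                -- shows the actual values es-hadoop will serialize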

Setting textinputformat.record.delimiter in sparksql

In Spark 2.0.1, Hadoop 2.6.0, I have many files delimited with '!#!\r' and not with the usual newline \n, for example:
=========================================
2001810086 rongq 2001 810!#!
2001810087 hauaa 2001 810!#!
2001820081 hello 2001 820!#!
2001820082 jaccy 2001 820!#!
2002810081 cindy 2002 810!#!
=========================================
I tried to extract the data according to Setting textinputformat.record.delimiter in Spark:
set textinputformat.record.delimiter='!#!\r'; or set textinputformat.record.delimiter='!#!\n'; but I still cannot extract the data.
In spark-sql, I do this:
=========================================
create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';
load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;
The result is 5, but when I try set textinputformat.record.delimiter='!#!\r'; and then select count(*) from ceshi;, the result is 1; the delimiter does not work.
I also checked the source of Hadoop 2.6.0, the RecordReader method in TextInputFormat.java. I noticed that the default textinputformat.record.delimiter is null, in which case LineReader.java uses the readDefaultLine method to read a line terminated by one of CR, LF, or CRLF (CR = '\r', LF = '\n').
You should use sparkContext's hadoopConfiguration API to set textinputformat.record.delimiter as
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\r")
Then if you read the text file using sparkContext as
sc.textFile("the input file path")
you should be fine.
Updated
I have noticed that a text file saved with the \r delimiter is changed to a \n delimiter.
So the following format should work for you, as it did for me:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")

// needed for .toDF outside spark-shell (spark-shell imports it automatically)
import spark.implicits._

val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF
A case class called ceshi is needed:
case class ceshi(id: Int, name: String, year: String, major :String)
which should give a dataframe as
+----------+-----+-----+-----+
|id |name |year |major|
+----------+-----+-----+-----+
|2001810086|rongq| 2001|810 |
|2001810087|hauaa| 2001|810 |
|2001820081|hello| 2001|820 |
|2001820082|jaccy| 2001|820 |
|2002810081|cindy| 2002|810 |
+----------+-----+-----+-----+
Now you can call the count function as
import org.apache.spark.sql.functions._
df.select(count("*")).show(false)
which would give output as
+--------+
|count(1)|
+--------+
|5 |
+--------+
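As an aside, if you really need the original '!#!\r' delimiter, a hedged alternative sketch is to pass the delimiter through a job-local Configuration via newAPIHadoopFile instead of mutating the shared sc.hadoopConfiguration (the path and value names are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// copy the existing Hadoop conf and override the record delimiter for this job only
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "!#!\r")

val data = sc
  .newAPIHadoopFile("the input file path", classOf[TextInputFormat],
                    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }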
