Insert collect_set values into Elasticsearch with PIG - hadoop

I have a Hive table with three columns: "id" (string), "booklist" (array of string), and "date" (string), containing the following data:
----------------------------------------------------
id | booklist             | date
----------------------------------------------------
1  | ["Book1", "Book2"]   | 2017-11-27T01:00:00.000Z
2  | ["Book3", "Book4"]   | 2017-11-27T01:00:00.000Z
When I try to insert into Elasticsearch with this Pig script:
-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = elasticsearch.service.consul',
    'es.port = 9200',
    'es.write.operation = upsert',
    'es.mapping.id = id',
    'es.mapping.pig.tuple.use.field.names=true'
);

hivetable = LOAD 'default.reading' USING HCatLoader();

hivetable_flat = FOREACH hivetable GENERATE
    id AS id,
    booklist AS bookList,
    date AS date;

STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------
When running the above, I get the following error:
ERROR 2999:Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [bookList]; Bailing out..
Can anyone shed some light on how to get an array of strings into Elasticsearch and make the above work?
Thank you!
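One avenue worth checking (a sketch, not a verified fix): HCatLoader exposes a Hive array<string> to Pig as a bag of single-field tuples, e.g. {(Book1),(Book2)}. With 'es.mapping.pig.tuple.use.field.names=true', ES-Hadoop writes each inner tuple using its field name, so bookList may arrive as an array of small objects rather than an array of strings, which a plain string/keyword mapping in Elasticsearch can reject with a 400 "failed to parse". Dropping that option lets tuples be serialized positionally, so the bag should come out as a JSON array of strings:

DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = elasticsearch.service.consul',
    'es.port = 9200',
    'es.write.operation = upsert',
    'es.mapping.id = id'
);

If the readings index already has a conflicting mapping for bookList, the field may also need to be remapped (or the index recreated) before retrying.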

Related

Setting textinputformat.record.delimiter in sparksql

In Spark 2.0.1 with Hadoop 2.6.0, I have many files delimited with '!#!\r' rather than the usual newline \n, for example:
=========================================
2001810086 rongq 2001 810!#!
2001810087 hauaa 2001 810!#!
2001820081 hello 2001 820!#!
2001820082 jaccy 2001 820!#!
2002810081 cindy 2002 810!#!
=========================================
I tried to extract the data following Setting textinputformat.record.delimiter in spark, using
set textinputformat.record.delimiter='!#!\r'; or set textinputformat.record.delimiter='!#!\n'; but still cannot extract the data.
In spark-sql I do this:
=========================================
create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';
load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;
The result is 5. But when I set textinputformat.record.delimiter='!#!\r'; and then run select count(*) from ceshi; the result is 1; the delimiter does not take effect.
I also checked the source of Hadoop 2.6.0, the RecordReader method in TextInputFormat.java, and noticed that the default textinputformat.record.delimiter is null; LineReader.java then uses the readDefaultLine method to read a line terminated by one of CR, LF, or CRLF (CR = '\r', LF = '\n').
You should use sparkContext's hadoopConfiguration API to set textinputformat.record.delimiter as
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\r")
Then if you read the text file using sparkContext as
sc.textFile("the input file path")
You should be fine.
Update
I have noticed that a text file with the \r delimiter is changed to a \n delimiter when saved.
So the following should work for you, as it did for me:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")
val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
.map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
.toDF
A case class called ceshi is needed:
case class ceshi(id: Int, name: String, year: String, major :String)
which should give a dataframe like:
+----------+-----+-----+-----+
|id        |name |year |major|
+----------+-----+-----+-----+
|2001810086|rongq|2001 |810  |
|2001810087|hauaa|2001 |810  |
|2001820081|hello|2001 |820  |
|2001820082|jaccy|2001 |820  |
|2002810081|cindy|2002 |810  |
+----------+-----+-----+-----+
Now you can call the count function:
import org.apache.spark.sql.functions._
df.select(count("*")).show(false)
which gives the output:
+--------+
|count(1)|
+--------+
|5       |
+--------+
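For reference, here is a self-contained version of the steps above (a sketch; the input path is a placeholder, and it assumes a Spark 2.x spark-shell or an application where both sc and spark are in scope):

import org.apache.spark.sql.functions._

// the record schema used above
case class ceshi(id: Int, name: String, year: String, major: String)

// make Hadoop's TextInputFormat split records on "!#!\n" instead of newlines
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")

import spark.implicits._   // needed for .toDF when not in the shell
val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF

df.select(count("*")).show(false)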

Hive : reflect function

I'm trying to use Hive's reflect function, which has this signature:
reflect(class, method[, arg1[, arg2..]])
I want to check if a column c with value hello world ! contains world, so I wrote:
with a as
(select "hello world !" as c)
select reflect("java.lang.String",c ,"contains", "world") from a
But it didn't work because it does not respect the signature, so I tried this:
with a as
(select "hello world !" as c)
select reflect(reflect("java.lang.Object","toString",c) ,"contains", "world") from a
That didn't work either! How can I apply the reflect function to a given column?
reflect2 will help. See https://issues.apache.org/jira/browse/HIVE-20007
select reflect2("stackoverflow","length");
+------+--+
| _c0 |
+------+--+
| 13 |
+------+--+
but hashCode() won't work. See https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFReflect2.java#L86
select reflect2("stackoverflow","hashCode");
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:32 Argument type mismatch '"hashCode"': Use hash() UDF instead of this.
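Applied to the original question, reflect2 takes the value itself as its first argument, so something along these lines should check the column (a sketch, assuming reflect2 is available in your Hive build and that the String.contains(CharSequence) overload is resolved correctly):

with a as
(select "hello world !" as c)
select reflect2(c, "contains", "world") from a;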

cloudera impala jdbc query doesn't see array<string> Hive column

I have a table in Hive that has the following structure:
> describe volatility2;
Query: describe volatility2
+------------------+---------------+---------+
| name             | type          | comment |
+------------------+---------------+---------+
| version          | int           |         |
| unmappedmkfindex | int           |         |
| mfvol            | array<string> |         |
+------------------+---------------+---------+
It was created by Spark HiveContext code using the DataFrame API like this:
val volDF = hc.createDataFrame(volRDD)
volDF.saveAsTable(volName)
which carried over the RDD structure that was defined in the schema:
def schemaVolatility: StructType = StructType(
  StructField("Version", IntegerType, false) ::
  StructField("UnMappedMKFIndex", IntegerType, false) ::
  StructField("MFVol", DataTypes.createArrayType(StringType), true) :: Nil)
However, when I try to select from this table using the latest Impala JDBC driver, the last column is not visible to it. My query is very simple, just printing the data to the console, exactly like the example code provided with the driver download:
String sqlStatement = "select * from default.volatility2";
Class.forName(jdbcDriverName);
con = DriverManager.getConnection(connectionUrl);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sqlStatement);

System.out.println("\n== Begin Query Results ======================");
ResultSetMetaData metadata = rs.getMetaData();
for (int i = 1; i <= metadata.getColumnCount(); i++) {
    System.out.println(metadata.getColumnName(i) + ":" + metadata.getColumnTypeName(i));
}
System.out.println("== End Query Results =======================\n\n");
The console output is this:
== Begin Query Results ======================
version:version
unmappedmkfindex:unmappedmkfindex
== End Query Results =======================
Is it a driver bug, or am I missing something?
I found the answer to my own question. Posting it here so it may help others and save time searching. Apparently Impala recently introduced so-called "complex types" support in its SQL, which includes array among others. The documentation is here:
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_using
According to this, what I had to do was change the query to look like this:
select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol
and I got the expected results back.
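For completeness, reading the rows back over JDBC then looks something like this (a sketch using the standard JDBC API; with the query above, the flattened array comes back one row per element, read here positionally):

String sql = "select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol";
try (Statement stmt = con.createStatement();
     ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
        // version | unmappedmkfindex | one element of the mfvol array
        System.out.println(rs.getInt(1) + " | " + rs.getInt(2) + " | " + rs.getString(3));
    }
}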

Null error for nonempty data using Generate in Pig/Hadoop

I am working on Task 2 in this link:
https://sites.google.com/site/hadoopbigdataoverview/certification-practice-exam
I used the code below
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',');
dump a
a_top = limit a 5
a_top shows the first 5 rows; there are non-null values for each Year.
Then I type
a_clean = filter a BY NOT ($4=='NA');
aa = foreach a_clean generate a_clean.Year;
But that gives the error
ERROR 1200: null
What is wrong with this?
EDIT: I also tried
a = load '/user/horton/flightdelays/flight_delays1.csv' using PigStorage(',') AS (Year:chararray,Month:chararray,DayofMonth:chararray,DayOfWeek:chararray,DepTime:chararray,CRSDepTime:chararray,ArrTime:chararray,CRSArrTime:chararray,UniqueCarrier:chararray,FlightNum:chararray,TailNum:chararray,ActualElapsedTime:chararray,CRSElapsedTime:chararray,AirTime:chararray,ArrDelay:chararray,DepDelay:chararray,Origin:chararray,Dest:chararray,Distance:chararray,TaxiIn:chararray,TaxiOut:chararray,Cancelled:chararray,CancellationCode:chararray,Diverted:chararray,CarrierDelay:chararray,WeatherDelay:chararray,NASDelay:chararray,SecurityDelay:chararray,LateAircraftDelay:chararray);
and
aa = foreach a_clean generate a_clean.Year
but the error was
ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay), 2nd :(2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA)
Since you have not specified the schema in the LOAD statement, you have to refer to the columns by the order in which they occur. Also, dereferencing the relation inside the FOREACH (a_clean.Year) treats it as a scalar, which is what triggers the "Scalar has more than one row" error. Year appears to be the first column, so try this:
a_clean = filter a BY ($4 != 'NA');
aa = foreach a_clean generate $0 AS Year;
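Alternatively, if the full AS (...) schema from the EDIT is kept on the LOAD statement, the fields can be referenced by name (a sketch; under that schema, DepTime is the fifth column, i.e. the same position as $4):

a_clean = filter a BY (DepTime != 'NA');
aa = FOREACH a_clean GENERATE Year;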

Datamapper unable to find table

I have an app running on my Raspberry Pi with Postgresql 9.1.
My first iteration was to add weather records into a table called "Weathers". That was successful.
My next iteration is to use psycopg2 to write records from Python into a different table called "weather". That is also successful.
What should also work is changing the Weather class in my app to use the new fields. But DataMapper returns a mapping error:
#records = Weather.all(:order => :timestamp.desc)
ArgumentError at /
+options [:order]+ entry :timestamp does not map to a property in Weather
Rereading the datamapper.org docs suggests it is something to do with my table name, so I migrated the older "weathers" table into another table called "older" and dropped the "weathers" table. But DataMapper still fails to find the table.
My Postgres environment with a truncated view of the target table is:
         List of relations
 Schema |   Name    | Type  | Owner
--------+-----------+-------+-------
 public | customers | table | pi
 public | older     | table | pi
 public | systemlog | table | pi
 public | weather   | table | pi
(4 rows)

                                        Table "public.weather"
      Column       |            Type             |                       Modifiers                        | Storage | Description
-------------------+-----------------------------+--------------------------------------------------------+---------+-------------
 id                | integer                     | not null default nextval('weather_id_seq'::regclass)   | plain   |
 timestamp         | timestamp without time zone | default now()                                           | plain   |
 currentwindspeed  | real                        |                                                         | plain   |
 bmp180temperature | integer                     |                                                         | plain   |
Indexes:
    "weather_pkey" PRIMARY KEY, btree (id)
Has OIDs: no
My DataMapper class is:
class Weather
  include DataMapper::Resource

  property :id,                   Serial
  property :bmp180temperature,    String
  property :insidehumidity,       String
  property :totalrain,            String
  property :currentwinddirection, String
  property :currentWindSpeed,     String
  property :timestamp,            DateTime
end

DataMapper.finalize
Weather.auto_upgrade!
Since this is Ruby, I fired up IRB, required the DataMapper gem, and got:
records = Weather.all
DataObjects::SyntaxError: ERROR: column "bmp180temperature" does not exist
LINE 1: SELECT "id", "bmp180temperature", "insidehumidity", "totalra...
^
(code: 50360452, sql state: 42703, query: SELECT "id", "bmp180temperature", "insidehumidity", "totalrain", "currentwinddirection", "current_wind_speed", "timestamp" FROM "weathers" ORDER BY "id", uri: postgres:pi#localhost/postgres?scheme=postgres&user=pi&password=pw&host=localhost&port=&path=/postgres&query=&fragment=&adapter=postgres)
from /var/lib/gems/1.9.1/gems/dm-do-adapter-1.2.0/lib/dm-do-adapter/adapter.rb:147:in `execute_reader'
from /var/lib/gems/1.9.1/gems/dm-do-adapter-1.2.0/lib/dm-do-adapter/adapter.rb:147:in `block in read'
from /var/lib/gems/1.9.1/gems/dm-do-adapter-1.2.0/lib/dm-do-adapter/adapter.rb:276:in `with_connection'
This makes me think it's not finding the right table. It seems to be OK with the id column, perhaps.
I see that something called DataMapper is used for PHP in the CodeIgniter framework, but I'm unsure whether it's the same DataMapper I'm using in Ruby.
What am I overlooking to get DataMapper to find this table?
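A minimal sketch of one thing to check (an assumption, not a confirmed diagnosis): the generated SQL in the error queries FROM "weathers" and a "current_wind_speed" column, which matches DataMapper's conventions of pluralizing the class name and underscoring camelCase property names, while the actual table is "weather" with a "currentwindspeed" column. The storage name and field names can be set explicitly:

class Weather
  include DataMapper::Resource

  # map the model to the existing, non-pluralized table
  storage_names[:default] = 'weather'

  property :id,                Serial
  property :bmp180temperature, Integer
  property :currentwindspeed,  Float,    :field => 'currentwindspeed'
  property :timestamp,         DateTime
end

The remaining properties would need matching :field names (or all-lowercase property names) and types that line up with the columns that actually exist in "public.weather".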
