Deletion of a folder on Amazon S3 while creating an external table - hadoop

We are seeing very unusual behaviour on our S3 bucket, and it is not consistent, so we have not been able to pinpoint the problem. Now, coming to the issue: I fire one query (creation of an external table), which leads to the deletion of the folder that the external table points to. This has happened 3-4 times to us, so could you please explain this behaviour? For your convenience I am attaching the external table query and the logs of the operations performed on the S3 bucket.
Query:
create table apr_2(date_local string, time_local string,s_computername string,c_ip string,s_ip string,s_port string,s_sitename string, referer string, localfile string, TimeTakenMS string, status string, w3status string, sc_substatus string, uri string, qs string, sc_bytes string, cs_bytes string, cs_username string, cs_User_Agent string, s_proxy string, c_protocol string, cs_version string, cs_method string, cs_Cookie string, cs_Host string, w3wpbytes string, RequestsPerSecond string, CPU_Utilization string, BeginRequest_UTC string, EndRequest_UTC string, time string, logdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' location 's3://logs/apr_2_com'
logs:
REST.DELETE.OBJECT logs/apr_2_com/000002.tar.gz
REST.DELETE.OBJECT logs/apr_2_com/000001.tar.gz

Try using this syntax instead. A table created without the EXTERNAL keyword is a managed table, so Hive takes ownership of whatever sits at the LOCATION and may delete it when the table is dropped or overwritten; an EXTERNAL table only points at the data and leaves the S3 objects in place:
create external table if not exists apr_2(date_local string, time_local string,s_computername string,c_ip string,s_ip string,s_port string,s_sitename string, referer string, localfile string, TimeTakenMS string, status string, w3status string, sc_substatus string, uri string, qs string, sc_bytes string, cs_bytes string, cs_username string, cs_User_Agent string, s_proxy string, c_protocol string, cs_version string, cs_method string, cs_Cookie string, cs_Host string, w3wpbytes string, RequestsPerSecond string, CPU_Utilization string, BeginRequest_UTC string, EndRequest_UTC string, time string, logdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' location 's3://logs/apr_2_com'
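Once the table is registered as EXTERNAL, a quick sanity check (sketched below with the table name from the question) is to confirm the table type; dropping the table then only removes metastore metadata and leaves the S3 folder untouched:
-- Check how Hive registered the table; an external table shows
-- "Table Type: EXTERNAL_TABLE" in the output.
DESCRIBE FORMATTED apr_2;
-- Dropping an EXTERNAL table removes only the metastore entry;
-- the objects under s3://logs/apr_2_com stay in place.
DROP TABLE apr_2;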

Related

How to configure a schema.graphqls input that can accept multiple types for the same field? (GraphQL, Spring Boot)

I have an input like this:
filter(name: String, value: "abc")
filter(name: String, value: 1)
filter(name: String, value: [1,2,3,4])
filter(name: String, value: ["a","b","c","d"])
How do I configure schema.graphqls so it can accept input like this:
input filter{
name: String
value: (can be multiple types depending on the input: Int, String, Float, Boolean)
}
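As a hedged sketch (the GraphQL spec has no polymorphic or union input types), a common workaround is to declare one nullable field per expected value shape and have the resolver check that exactly one of them is set; the field names below are illustrative, not part of the original schema:
# Illustrative workaround: one optional field per value shape.
input Filter {
  name: String
  stringValue: String
  intValue: Int
  floatValue: Float
  booleanValue: Boolean
  intValues: [Int]
  stringValues: [String]
}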

Error - redigo.Scan: Cannot convert from Redis bulk string to *string

I have a struct like this:
type User struct {
Nickname *string `json:"nickname"`
Phone *string `json:"phone"`
}
Values are placed in Redis with the HMSET command (values can be nil).
Now I'm trying to scan the values into a struct:
values, err := redis.Values(Cache.Do("HMGET", "key", "nickname", "phone"))
var usr User
_, err = redis.Scan(values, &usr.Nickname, &usr.Phone)
But I get an error
redigo.Scan: cannot assign to dest 0: cannot convert from Redis bulk
string to *string
Please tell me what I'm doing wrong.
The Scan documentation says:
The values pointed at by dest must be an integer, float, boolean, string, []byte, interface{} or slices of these types.
The application passes a pointer to a *string to the function. A *string is not one of the supported types.
There are two approaches for fixing the problem. The first is to allocate string values and pass pointers to the allocated string values to Scan:
usr := User{Nickname: new(string), Phone: new(string)}
_, err := redis.Scan(values, usr.Nickname, usr.Phone)
The second approach is to change the type of the struct fields to string:
type User struct {
Nickname string `json:"nickname"`
Phone string `json:"phone"`
}
...
var usr User
_, err := redis.Scan(values, &usr.Nickname, &usr.Phone)
The docs say that []byte is the type for a bulk string, not *string. You have two options here:
change the field type in question to []byte,
or use a temporary []byte variable in the scan and, once the data is retrieved, copy it into the struct field (see the sketch below).
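A minimal sketch of the temporary-variable approach, reusing the values slice from the HMGET call above and assuming the same redigo redis import:
// Scan into temporary []byte variables; redigo's Scan leaves a destination
// unmodified when the source value is nil, so a nil slice here means the
// hash field was absent or nil.
var nickname, phone []byte
if _, err := redis.Scan(values, &nickname, &phone); err != nil {
    // handle the error
}
var usr User
if nickname != nil {
    s := string(nickname)
    usr.Nickname = &s
}
if phone != nil {
    s := string(phone)
    usr.Phone = &s
}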

Spark repartitionAndSortWithinPartitions with tuples

I'm trying to follow this example to partition HBase rows: https://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/
However, my data is already stored as (String, String, String) tuples, where the first element is the row key, the second is the column name, and the third is the column value.
I tried writing an implicit ordering to trigger the OrderedRDD implicit conversion:
implicit val caseInsensitiveOrdering: Ordering[(String, String, String)] = new Ordering[(String, String, String)] {
override def compare(x: (String, String, String), y: (String, String, String)): Int = ???
}
but repartitionAndSortWithinPartitions is still not available. Is there a way I can use this method with this tuple?
The RDD must contain key/value pairs, not just values. For example:
val data = List((("5", "6", "1"), (1)))
val rdd : RDD[((String, String, String), Int)] = sparkContext.parallelize(data)
implicit val caseInsensitiveOrdering = new Ordering[(String, String, String)] {
override def compare(x: (String, String, String), y: (String, String, String)): Int = 1
}
rdd.repartitionAndSortWithinPartitions(..)
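A slightly fuller sketch for the HBase case (the key/value split, partition count, and sample data below are illustrative assumptions, not from the original post): key by (rowkey, column) and keep the cell value as the RDD value, so the pair-RDD implicits apply:
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// (rowkey, column, value) triples, re-keyed as ((rowkey, column), value).
val triples: RDD[(String, String, String)] = sparkContext.parallelize(
  List(("row1", "cf:a", "1"), ("row1", "cf:b", "2"), ("row2", "cf:a", "3")))
val keyed: RDD[((String, String), String)] =
  triples.map { case (rowKey, column, value) => ((rowKey, column), value) }

// Tuples of ordered types already have a default Ordering, so an explicit
// one is only needed for custom rules (e.g. case-insensitive comparison).
val sorted = keyed.repartitionAndSortWithinPartitions(new HashPartitioner(4))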

Only STRING columns are loaded in Hive, i.e. columns defined as INT and DOUBLE are NULL

Only the columns defined as STRING are loaded; the columns defined as INT and DOUBLE come out as NULL.
Create table command
create table A(
id STRING,
member_id STRING,
loan_amnt DOUBLE,
funded_amnt DOUBLE,
`funded_amnt_inv` DOUBLE,
`term` STRING,
`int_rate` STRING,
`installment` DOUBLE,
`grade` STRING,
`sub_grade` STRING,
`emp_title` STRING,
`emp_length` STRING,
`home_ownership` STRING,
`nnual_inc` INT,
`verification_status` STRING,
`issue_d` STRING,
`loan_status` STRING,
`pymnt_plan` STRING,
`url` STRING,
`desc` STRING,
`purpose` STRING,
`title` STRING,
`zip_code` STRING,
`addr_state` STRING,
`dti` DOUBLE,
`delinq_2yrs` INT,
`earliest_cr_line` STRING,
`inq_last_6mths` STRING,
`mths_since_last_delinq` STRING,
`mths_since_last_record` STRING,
`open_acc` INT,
`pub_rec` INT,
`revol_bal` INT,
`revol_util` STRING,
`total_acc` INT,
`initial_list_status` STRING,
`out_prncp` DOUBLE,
`out_prncp_inv` DOUBLE,
`total_pymnt` DOUBLE,
`total_pymnt_inv` DOUBLE,
`total_rec_prncp` DOUBLE,
`total_rec_int` DOUBLE,
`total_rec_late_fee` DOUBLE,
`recoveries` DOUBLE,
`collection_recovery_fee` DOUBLE,
`last_pymnt_d` STRING,
`last_pymnt_amnt` DOUBLE,
`next_pymnt_d` STRING,
`last_credit_pull_d` STRING,
`collections_12_mths_ex_med` INT,
`mths_since_last_major_derog` STRING,
`policy_code` STRING,
`application_type` STRING,
`annual_inc_joint` STRING,
`dti_joint` STRING,
`verification_status_joint` STRING,
`acc_now_delinq` STRING,
`tot_coll_amt` STRING,
`tot_cur_bal` STRING,
`open_acc_6m` STRING,
`open_il_6m` STRING,
`open_il_12m` STRING,
`open_il_24m` STRING,
`mths_since_rcnt_il` STRING,
`total_bal_il` STRING,
`il_util` STRING,
`open_rv_12m ` STRING,
`open_rv_24m` STRING,
`max_bal_bc` STRING,
`all_util` STRING,
`total_credit_rv` STRING,
`inq_fi` STRING,
`total_fi_tl` STRING,
`inq_last_12m` STRING
)
ROW FORMAT delimited
fields terminated by ','
STORED AS TEXTFILE;
Loading data into table A
load data local inpath '/home/cloudera/Desktop/Project-3/1/LoanStats3a.txt' into table A;
Select data
hive> SELECT * FROM A LIMIT 1;
Output
"1077501" "1296599" NULL NULL NULL " 36 months" "
10.65%" NULL "B" "B2" "" "10+ years" "RENT" NULL "Verified" "Dec-2011" "Fully
Paid" "n" "https://www.lendingclub.com/browse/loanDetail.action?loan_id=1077501" "
Borrower added on 12/22/11 > I need to upgrade my business
technologies." "credit_card" "Computer" "860xx" "AZ" NULL NULL "Jan-1985" "1" "" "" NULL NULL NULL "83.7%"NULL "f" NULL NULL NULL NULL NULL NULL NULL NULL NULL "Jan-2015" NULL "" "Dec-2015" NULL "" "1" "INDIVIDUAL"
"" "" "" "0" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
I found the solution:
create table stat2(id String, member_id INT, loan_amnt FLOAT, funded_amnt FLOAT, funded_amnt_inv FLOAT, term String, int_rate String, installment FLOAT, grade String, sub_grade String, emp_title String, emp_length String, home_ownership String, annual_inc FLOAT, verification_status String, issue_d date, loan_status String, pymnt_plan String, url String, descp String, purpose String, title String, zip_code String, addr_state String, dti FLOAT, delinq_2yrs FLOAT, earliest_cr_line String, inq_last_6mths FLOAT, mths_since_last_delinq FLOAT, mths_since_last_record FLOAT, open_acc FLOAT, pub_rec FLOAT, revol_bal FLOAT, revol_util String, total_acc FLOAT, initial_list_status String, out_prncp FLOAT, out_prncp_inv FLOAT, total_pymnt FLOAT, total_pymnt_inv FLOAT, total_rec_prncp FLOAT, total_rec_int FLOAT, total_rec_late_fee FLOAT, recoveries FLOAT, collection_recovery_fee FLOAT,
last_pymnt_d String, last_pymnt_amnt FLOAT, next_pymnt_d String, last_credit_pull_d String, collections_12_mths_ex_med FLOAT, mths_since_last_major_derog FLOAT, policy_code FLOAT, application_type String, annual_inc_joint FLOAT, dti_joint FLOAT, verification_status_joint String, acc_now_delinq FLOAT, tot_coll_amt FLOAT, tot_cur_bal FLOAT, open_acc_6m FLOAT, open_il_6m FLOAT, open_il_12m FLOAT, open_il_24m FLOAT, mths_since_rcnt_il FLOAT, total_bal_il FLOAT, il_util FLOAT, open_rv_12m FLOAT, open_rv_24m FLOAT, max_bal_bc FLOAT, all_util FLOAT, total_rev_hi_lim FLOAT, inq_fi FLOAT, total_cu_tl FLOAT, inq_last_12m FLOAT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE tblproperties ("skip.header.line.count"="2",
"skip.footer.line.count"="4");
It seems that your CSV contains quotes around the individual fields. The surrounding quotes are not handled by Hive's default delimited SerDe, so they become part of the field values: for string columns the quotes end up inside the string, and for numeric columns they make the value an invalid number, which Hive turns into NULL.
See csv-serde for a SerDe that supports quotes in CSV files.
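A quick way to see the effect (illustrative values, not from the original data): Hive returns NULL when casting a string that still carries the quote characters, while the bare digits cast cleanly.
-- With the default delimited SerDe the field arrives as the literal string
-- "1077501", quotes included, so a numeric cast yields NULL.
SELECT CAST('"1077501"' AS INT);   -- NULL
SELECT CAST('1077501' AS INT);     -- 1077501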

Kibana 4 doesn't recognize my timestamp; ES is mapping it as a string

I want to push data from my Hadoop cluster to Elasticsearch and then visualize the whole thing in Kibana 4.
This is what I've done:
1)
CREATE TABLE xx(traffic_type_id INT, caller INT, time STRING, tranche_horaire INT, called INT, call_duration INT, code_type_trafic STRING, code_destination_trafic STRING, location_number STRING, id_offre INT, id_service INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/user/hive/outt.csv' OVERWRITE INTO TABLE xx;
2)
CREATE EXTERNAL TABLE esxx (caller INT, time STRING, tranche INT, called_number INT, duration INT, code_type STRING, code_destination STRING, location STRING, offre INT, service INT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'xx/xx',
'es.nodes'='192.168.238.130:9200',
'es.mapping.names' = 'time:#timestamp');
INSERT OVERWRITE TABLE escdr SELECT s.caller, s.time, s.tranche_horaire, s.called, s.call_duration, s.code_type_trafic, s.code_destination_trafic, s.location_number, s.id_offre, s.id_service FROM xx s;
3)
CREATE EXTERNAL TABLE xx (
caller INT,
time TIMESTAMP,
tranche INT,
called_number INT,
duration INT,
code_type STRING,
code_destination STRING,
location STRING,
offre INT,
service INT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'xx/xx/',
'es.nodes'='192.168.238.130:9200',
'es.mapping.names' = 'time:#timestamp');
But Kibana doesn't seem to recognize my timestamp "time"; ES keeps mapping it as a string (the time field in my CSV file looks like this: 01AUG2014:19:02:11). What should I do or change so that ES does the appropriate mapping and recognizes my timestamp?
Best regards,
Omar,
If I were you I would convert this strange timestamp format to basic ISO8601 on the fly while importing, so that your timestamps look like 2014-08-01T19:02:11Z (or +HH:MM for whatever timezone you have your time in; I have no way to tell).
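A minimal sketch of doing that conversion inside the INSERT from step 2, writing into the esxx table defined above (Hive's unix_timestamp/from_unixtime take SimpleDateFormat-style patterns; keep the trailing 'Z' only if the source times really are UTC):
INSERT OVERWRITE TABLE esxx
SELECT s.caller,
       -- parse 01AUG2014:19:02:11 and re-emit it as ISO 8601
       from_unixtime(unix_timestamp(s.time, 'ddMMMyyyy:HH:mm:ss'),
                     "yyyy-MM-dd'T'HH:mm:ss'Z'") AS time,
       s.tranche_horaire, s.called, s.call_duration, s.code_type_trafic,
       s.code_destination_trafic, s.location_number, s.id_offre, s.id_service
FROM xx s;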
