Getting values in string in Hive query - hadoop

I have a table in Hive in which one of the columns is a string. The values in that column look like "x=1,y=2,z=3". I need to write a query that sums the value of x in this column over all rows. How do I extract the value of x and add the values up?

You would need a UDF for this transformation:
package com.example;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SplitColumn extends UDF {
    public Integer evaluate(Text input) {
        if (input == null) return null;
        // input looks like "x=1,y=2,z=3": find the x entry and return its value
        for (String pair : input.toString().split(",")) {
            String[] kv = pair.split("=");
            if (kv.length == 2 && kv[0].trim().equals("x")) {
                return Integer.parseInt(kv[1].trim());
            }
        }
        return null;
    }
}
Now you can try this:
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> CREATE TEMPORARY FUNCTION SplitColumn as 'com.example.SplitColumn';
hive> select sum(SplitColumn(mycolumnName)) from mytable;
P.S.: I have not tested this, but it should give you a direction to proceed in.
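If you would rather avoid a UDF altogether, Hive's built-in str_to_map function should also work here (a sketch, untested, assuming the column always follows the "x=1,y=2,z=3" format):
hive> SELECT SUM(CAST(str_to_map(mycolumnName, ',', '=')['x'] AS INT)) FROM mytable;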

Related

How to update a Room database, and how to tell whether an insert is still running or completed?

The first:
I get dataList from Retrofit and insert it into my Room database.
I want to change dataList (for example, insert an element). Inserts work because I used OnConflictStrategy.REPLACE, but when I delete some elements from dataList, the corresponding rows are not deleted from the Room database.
Dao:
@Insert(onConflict = OnConflictStrategy.REPLACE)
suspend fun insertData(dataList : List<Data>)
Entity:
@Entity
data class Data(
    @PrimaryKey val Id: Long,
    val Fl: String,
    val FlMc: String,
    val Dm: String,
    val Mc: String,
)
ViewModel:
fun insertData(dataList: List<Data>) = viewModelScope.launch {
    dataRepository.insertData(dataList)
}
//get data from server
fun getData(): LiveData<List<Data>>
Activity:
dataViewModel.getData().observe(this) {
    dataViewModel.insertData(it)
}
How can I resolve this situation, other than DELETE ALL THEN INSERT?
The second:
I want to use a ProgressBar to indicate that I am inserting dataList.
How can I tell whether the insert is still running or has completed?
If I understand correctly, your issue is that you cannot delete because you are building a DataList item but don't know the primary key value, as it is generated.
As you haven't shown the DataList entity, assume it is like:-
@Entity
data class DataList(
    @PrimaryKey(autoGenerate = true)
    val id: Long,
    val othercolumns: String
    ....
)
If you change suspend fun insertData(dataList : List<Data>) to suspend fun insertData(dataList : List<Data>): List<Long> (i.e. add List<Long> as the result type), then the result gives you the generated values of the id column, one per inserted row.
If the @PrimaryKey is not an integer type, e.g. a String, then the Long returned WILL NOT be the value of the primary key. It will be a special value known as the rowid.
In short, using an integer primary key makes the column an alias of the rowid; if it is not an integer primary key then the column is not an alias, BUT the rowid still exists.
You can still use the rowid to access a specific row, as the rowid MUST be a unique value. E.g. (again assuming the above) you could have an @Query such as:
@Query("SELECT * FROM the_datalist_table WHERE rowid=:rowid")
suspend fun getDataListById(rowid: Long): DataList
Only of use if you know the rowid though.
You could get rowids, say, by using:
@Query("SELECT rowid FROM the_datalist_table WHERE othercolumns LIKE :something")
suspend fun getRowidOfSomeDataLists(something: String): List<Long>
Still not of great use, though, as the same selection criteria could just as well return a list of the DataLists themselves.
Additional re the comment:-
How to use in viewModel or Activity?
As an example you could do something like :-
fun insertData(dataList: List<Data>) = viewModelScope.launch {
    val insertedDataList: ArrayList<Data> = ArrayList()
    val notInsertedDataList: ArrayList<Data> = ArrayList()
    val insertedIdList = dataRepository.insertData(dataList)
    for (i in insertedIdList.indices) { // .indices avoids the off-by-one of 0..size
        if (insertedIdList[i] > 0) {
            insertedDataList.add(
                Data(
                    insertedIdList[i], //<<<<< sets the id as per the returned list of ids
                    dataList[i].Fl,
                    dataList[i].FlMc,
                    dataList[i].Dm,
                    dataList[i].Mc
                )
            )
        } else {
            notInsertedDataList.add(
                Data(
                    insertedIdList[i], //<<<<< WILL BE -1 as the row was not inserted
                    dataList[i].Fl,
                    dataList[i].FlMc,
                    dataList[i].Dm,
                    dataList[i].Mc
                )
            )
        }
    }
    val notInsertedCount = notInsertedDataList.size
    val insertedCount = insertedDataList.size
}
So you have :-
insertedDataList: an ArrayList of the successfully inserted Data items (id was not -1), with the id set accordingly.
notInsertedDataList: an ArrayList of the Data items that were not inserted (id was -1); their id will be set to -1.
insertedCount: an Int with the number inserted successfully.
notInsertedCount: an Int with the number not inserted.
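Re the second question (showing a ProgressBar while the insert runs): one common pattern is to expose the insert state from the ViewModel and observe it in the Activity. A minimal sketch (isInserting and progressBar are assumed names, not from your code):
val isInserting = MutableLiveData(false)

fun insertData(dataList: List<Data>) = viewModelScope.launch {
    isInserting.value = true              // show the ProgressBar
    dataRepository.insertData(dataList)   // suspends until Room has finished
    isInserting.value = false             // insert completed, hide it again
}

// Activity:
dataViewModel.isInserting.observe(this) { running ->
    progressBar.visibility = if (running) View.VISIBLE else View.GONE
}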
DELETE ALL
To delete all rows: unless you first extract all rows, you can't use the convenience @Delete, as it works by being provided the object (Data) and selecting the row to delete according to the primary key (the id column).
The convenience methods @Delete, @Update and @Insert are written to generate the underlying SQL statement(s) based upon the object (Entity) passed.
E.g. @Delete(data: Data) would generate the SQL DELETE FROM data WHERE id=?, where ? would be the value of the id field when actually run.
The simpler way to delete all rows is to use the @Query annotation (which handles SQL statements other than SELECT statements). So you could have:
#Query("DELETE FROM data")
fun deleteAllData()
note that this does not the return the number of rows that have been deleted.
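If you do want the count, Room allows a DELETE @Query to return an Int holding the number of affected rows (a small sketch):
@Query("DELETE FROM data")
suspend fun deleteAllData(): Int // returns the number of rows deleted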

How to use a custom function in an Oracle statement on each value of one column using Scala

I have a Scala function. I need to read the records for which applying func to the last two characters of the ID column gives a particular result. So I must use func in the Oracle query, which is as follows:
def func(id: String): Int = {
  val two_char = id.takeRight(2).toInt
  val group_id = two_char % 4
  group_id
}
val query = """ SELECT * FROM table where """+func("ID") + """==="""+group_id
When Scala runs the query, I receive this error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "ID"
func is passed the name of the column, not its values. Could you please guide me on how to use the function in the Oracle query so that it operates on each value of the column?
Any help is really appreciated.
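Since Oracle cannot call a Scala function, one option is to express func's logic directly in SQL. A minimal sketch, assuming the last two characters of ID are always digits (SUBSTR, TO_NUMBER and MOD are standard Oracle functions; the table name is a placeholder):
val groupId = 2 // whichever group you want to read
val query = s"SELECT * FROM table WHERE MOD(TO_NUMBER(SUBSTR(ID, -2)), 4) = $groupId"
SUBSTR(ID, -2) takes the last two characters, TO_NUMBER converts them, and MOD(..., 4) reproduces two_char % 4, so the filter is applied to every row's value inside the database.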

How to handle skewed data while grouping in Pig

I am doing a GROUP BY operation in which one reduce task runs much longer than the rest. Below is a sample code snippet and a description of the issue:
inp = LOAD 'input' USING PigStorage('|') AS (f1,f2,f3,f4,f5);
grp_inp = GROUP inp BY (f1,f2) PARALLEL 300;
Since there is skew in the data, i.e. too many values for one key, one reducer runs for 4 hours while all the other reduce tasks complete in a minute or so.
What can I do to fix this issue? Are there any alternative approaches? Any help would be greatly appreciated. Thanks!
You may have to check a few things :-
1> Filter out records that have both f1 and f2 as NULL (if any).
2> Try to use the Hadoop combiner by implementing the Algebraic interface if possible:
https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch10s02.html
3> Use a custom partitioner to distribute data across reducers by another key.
Here is the sample code I used to partition my skewed data after a join (the same can be used after a group as well) :-
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.NullableTuple;
import org.apache.pig.impl.io.PigNullableWritable;
public class KeyPartitioner extends Partitioner<PigNullableWritable, Writable> {
    /**
     * Here key contains the value of the current key used for partitioning, and the Writable
     * value contains all fields from your tuple. I used the field at index 5 of my tuple to do
     * the partitioning, as I knew it has evenly distributed values.
     **/
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        Tuple valueTuple = (Tuple) ((NullableTuple) value).getValueAsPigType();
        try {
            if (valueTuple.size() > 5) {
                Object hashObj = valueTuple.get(5);
                Integer keyHash = Integer.parseInt(hashObj.toString());
                int partitionNo = Math.abs(keyHash) % numPartitions;
                return partitionNo;
            } else if (valueTuple.size() > 1) { // need at least two fields before reading index 1
                return (Math.abs(valueTuple.get(1).hashCode())) % numPartitions;
            }
        } catch (NumberFormatException | ExecException ex) {
            Logger.getLogger(KeyPartitioner.class.getName()).log(Level.SEVERE, null, ex);
        }
        return (Math.abs(key.hashCode())) % numPartitions;
    }
}
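To wire the partitioner into the GROUP, Pig's PARTITION BY clause accepts a custom Partitioner class. A sketch (the JAR name is an assumption; the class must be on Pig's classpath):
REGISTER my-partitioner.jar;
grp_inp = GROUP inp BY (f1,f2) PARTITION BY KeyPartitioner PARALLEL 300;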

How can I add row numbers for rows in PIG or HIVE?

I have a problem when adding row numbers using Apache Pig.
The problem is that I have a STR_ID column and I want to add a ROW_NUM column holding the row number of each STR_ID value.
For example, here is the input:
STR_ID
------------
3D64B18BC842
BAECEFA8EFB6
346B13E4E240
6D8A9D0249B4
9FD024AA52BA
How do I get the output like:
STR_ID | ROW_NUM
----------------------------
3D64B18BC842 | 1
BAECEFA8EFB6 | 2
346B13E4E240 | 3
6D8A9D0249B4 | 4
9FD024AA52BA | 5
Answers using Pig or Hive are acceptable. Thank you.
In Hive:
Query
select str_id, row_number() over () as row_num from tabledata;
Output
3D64B18BC842 1
BAECEFA8EFB6 2
346B13E4E240 3
6D8A9D0249B4 4
9FD024AA52BA 5
Facebook posted a number of Hive UDFs, including NumberRows. Depending on your Hive version (I believe 0.8) you may need to add an attribute to the class (stateful=true).
Pig 0.11 introduced a RANK operator that can be used for this purpose.
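For example, a minimal sketch (mydata is an assumed relation name):
ranked = RANK mydata;
-- each tuple in 'ranked' is now prefixed with a sequential 1-based rank field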
For folks wondering about Pig, I found the best way (currently) is to write your own UDF.
I wanted to add row numbers for tuples in a bag. This is the code for that:
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;
public class RowCounter extends EvalFunc<DataBag> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
public DataBag exec(Tuple input) throws IOException {
    try {
        DataBag output = mBagFactory.newDefaultBag();
        DataBag bg = (DataBag) input.get(0);
        Iterator<Tuple> it = bg.iterator();
        int count = 1;
        while (it.hasNext()) {
            Tuple t = it.next();
            t.append(count); // append the row number as a new last field
            output.add(t);
            count = count + 1;
        }
        return output;
    } catch (ExecException ee) {
        // error handling goes here
        throw ee;
    }
}
public Schema outputSchema(Schema input) {
    try {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema(null, DataType.BAG));
        return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                bagSchema, DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}
}
This code is for reference only and might not be error-proof.
Here is an approach that worked for me, by example:
Step 1. Define the row_sequence() function to generate an auto-incrementing ID:
add jar /Users/trongtran/research/hadoop/dev/hive-0.9.0-bin/lib/hive-contrib-0.9.0.jar;
drop temporary function row_sequence;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Step 2. Insert the unique ID together with STR_ID:
INSERT OVERWRITE TABLE new_table
SELECT
row_sequence(),
STR_ID
FROM old_table;
From version 0.11, Hive supports analytic functions such as lead, lag and also row_number:
https://issues.apache.org/jira/browse/HIVE-896
Hive solution -
select *
, row_number() over () as row_num
from table
Or, if you want the rows numbered ascending by STR_ID -
select *
, row_number() over (order by STR_ID) as row_num
from table
In Hive:
select
str_id, ROW_NUMBER() OVER() as row_num
from myTable;

LINQ Select with multiple tables fields being writeable

I'm new to LINQ and I've been doing pretty well until now, but I'm stuck with this.
I have a LINQ object bound to a DataGridView to let the user edit its contents.
For a simple one-table query this works fine, but how do I build a LINQ query over multiple tables so that the result is still read/write?
Here is an example of what I mean:
GMR.Data.GMR_Entities GMR = new GMR.Data.GMR_Entities();
var dt = from Msg in GMR.tblMessages
join lang in GMR.tblDomVals on 1 equals 1//on Msg.pLangueID equals lang.ID
select Msg;
// select new {lang.DescrFr, Msg.Message,Msg.pLangueID } ;
this.dataGridView1.DataSource = dt;
In this simple query, if I return only Msg with the select statement, the grid can be edited. But if I replace the select statement with select new {lang.DescrFr, Msg.Message, Msg.pLangueID };, the grid becomes read-only.
I can easily understand that this is because the query result is an anonymous type.
But is there a way to keep the table tblMessages writable?
Try creating your own class, for example:
public class MsgLangInfo
{
    public string langDescFr { get; set; }
    public string message { get; set; }
    public int pLangueID { get; set; }
}
And at the select statement create an object of this class with new, like below:
select new MsgLangInfo {
    langDescFr = lang.DescrFr,
    message = Msg.Message,
    pLangueID = Msg.pLangueID
};
This way you can avoid the anonymous type problem.
You need to select the original rows and explicitly set the grid columns.
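A sketch of that approach (AutoGenerateColumns and DataPropertyName are standard DataGridView members; the exact column setup is an assumption):
var dt = from Msg in GMR.tblMessages
         join lang in GMR.tblDomVals on Msg.pLangueID equals lang.ID
         select Msg; // selecting the entity itself keeps the result updatable
dataGridView1.AutoGenerateColumns = false;
dataGridView1.Columns.Add(new DataGridViewTextBoxColumn {
    DataPropertyName = "Message", HeaderText = "Message" // editable, backed by tblMessages
});
dataGridView1.DataSource = dt.ToList();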
