We have a Hive table that has around 500 million rows. Each row represents a "version" of the data, and I've been tasked with creating a table that contains only the final version of each row. Unfortunately, each version of the data only contains a link to its previous version. The actual computation for deriving the final version of a row is not trivial, but I believe the example below illustrates the issue.
For example:
id | xreference
----------------
1 | null -- original version of 1
2 | 1 -- update to id 1
3 | 2 -- update to id 2 (and transitively to 1)
4 | null -- original version of 4
5 | 4 -- update to version 4
6 | null -- original version of 6
When deriving the final versions from the above table, I would expect the rows with ids 3, 5, and 6 to be produced.
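For the simplified example above, the final versions are exactly the rows whose id never appears as another row's xreference, so a single anti-join would already pick them out (a sketch only, assuming the table is named version_table; the real derivation still has to walk each chain to compute the final row's content):
SELECT cur.id
FROM version_table cur
LEFT OUTER JOIN version_table nxt
  ON nxt.xreference = cur.id
WHERE nxt.id IS NULL;   -- returns ids 3, 5 and 6 for the data above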
I implemented a solution for this in Cascading which, although correct, has an O(n^2) runtime and takes half a day to complete.
I also implemented a solution using Giraph, which worked great on small data sets, but I kept running out of memory on larger sets. With this implementation I basically created a vertex for every id and an edge between each id/xreference pair.
We have now been looking into consolidating/simplifying our ETL process, and I've been asked to provide an implementation that can run as a Hive UDF. I know that Oracle provides functions for just this sort of thing, but I haven't found much in the way of Hive functions.
I'm looking for any suggestions for implementing this type of recursion, specifically with Hive, but I'd welcome any other suggestions.
Hadoop 1.3
Hive 0.11
Cascading 2.2
12 node development cluster
20 node production cluster
Related
I have developed an ETL with shell scripting.
After that, I found that there is an existing solution, Talend Open Studio.
I'm thinking of using it for my future tasks.
But my problem is that the files I want to integrate into the database must first be transformed in structure. This is the structure I have:
19-08-02 Name appel ok hope local merge (mk)
juin nov sept oct
00:00:t1 T1 299 0 24 8 3 64
F2 119 0 11 8 3 62
I1 25 0 2 9 4 64
F3 105 0 10 7 3 61
Regulated F2 0 0 0
FR T1 104 0 10 7 3 61
I must transform it into a flat file format.
Does Talend offer the possibility to do several transformations before integrating data from CSV files into the database, or not?
Edit
This is an example of the flat file I want to achieve before integrating the data into the database (only the first row above is concerned):
Timer,T1,F2,I1,F3,Regulated F2,FR T1
00:00:t1,299,119,25,105,0,104
00:00:t2,649,119,225,165,5,102
00:00:t5,800,111,250,105,0,100
We can split the task into three pieces: extract, transform, load.
Extract
First you have to find out how to connect to the source. With Talend it's possible to connect to different kinds of sources, like databases, XML files, flat files, CSV, etc. The components are called tFileInput or tMySQLInput, to name a few.
Transform
Then you have to tell Talend how to split the data into columns. In your example, the delimiter could be the whitespace, although the splitting might be tricky because the field Name is itself split by a white space.
Afterwards, since it is a row-to-column transposition, you have to write some Java code in a tJavaRow component, or alternatively use a tMap component with conditional mapping: (row.Name.equals("T1") ? row.value : 0)
Load
Then the transformation would be completed and your data could be stored in a database, target file, etc. Components here would be called tFileOutput or tOracleOutput for example.
Conclusion
Yes, it would be possible to build your ETL process in Talend. The transposition could be a little bit complicated if you are new to Talend. But if you keep in mind that Talend processes data row by row (as your script does, I assume) this is not that big of a problem.
I am about to do some signal analysis with Hadoop/Spark and I need help on how to structure the whole process.
Signals are currently stored in a database; we will read them with Sqoop and transform them into files on HDFS, with a schema similar to:
<Measure ID> <Source ID> <Measure timestamp> <Signal values>
where the signal values are just strings made of comma-separated floating point numbers.
000123 S001 2015/04/22T10:00:00.000Z 0.0,1.0,200.0,30.0 ... 100.0
000124 S001 2015/04/22T10:05:23.245Z 0.0,4.0,250.0,35.0 ... 10.0
...
000126 S003 2015/04/22T16:00:00.034Z 0.0,0.0,200.0,00.0 ... 600.0
We would like to write interactive/batch queries to:
apply aggregation functions over signal values
SELECT *
FROM SIGNALS
WHERE MAX(VALUES) > 1000.0
To select signals that had a peak over 1000.0.
apply aggregation over aggregation
SELECT SOURCEID, MAX(VALUES)
FROM SIGNALS
GROUP BY SOURCEID
HAVING MAX(MAX(VALUES)) > 1500.0
To select sources having at least a single signal that exceeded 1500.0.
apply user defined functions over samples
SELECT *
FROM SIGNALS
WHERE MAX(LOW_BAND_FILTER("5.0 KHz", VALUES)) > 100.0
to select signals that, after being filtered at 5.0 kHz, have at least one value over 100.0.
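For comparison, with a hypothetical one-row-per-sample layout, say a table signal_samples(measure_id, source_id, ts, sample_value), the first two queries would become plain SQL (a sketch only, all names assumed):
SELECT measure_id, source_id, MAX(sample_value) AS peak
FROM signal_samples
GROUP BY measure_id, source_id
HAVING MAX(sample_value) > 1000.0;

SELECT source_id, MAX(sample_value) AS max_peak
FROM signal_samples
GROUP BY source_id
HAVING MAX(sample_value) > 1500.0;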
We need some help in order to:
find the correct file format to write the signal data on HDFS. I thought of Apache Parquet. How would you structure the data?
understand the proper approach to data analysis: is it better to create different datasets (e.g. processing the data with Spark and persisting the results on HDFS), or to try to do everything at query time from the original dataset?
is Hive a good tool to make queries such as the ones I wrote? We are running on Cloudera Enterprise Hadoop, so we can also use Impala.
In case we produce different derived datasets from the original one, how can we keep track of the data lineage, i.e. know how the data was generated from the original version?
Thank you very much!
1) Parquet, as a columnar format, is good for OLAP. Spark's Parquet support is mature enough for production use. I suggest parsing the string representing the signal values into the following data structure (simplified):
case class Data(id: Long, signals: Array[Double])
val df = sqlContext.createDataFrame(Seq(Data(1L, Array(1.0, 1.0, 2.0)), Data(2L, Array(3.0, 5.0)), Data(2L, Array(1.5, 7.0, 8.0))))
Keeping an array of doubles allows you to define and use UDFs like this:
import scala.collection.mutable
def maxV(arr: mutable.WrappedArray[Double]) = arr.max
sqlContext.udf.register("maxVal", maxV _)
df.registerTempTable("table")
sqlContext.sql("select * from table where maxVal(signals) > 2.1").show()
+---+---------------+
| id| signals|
+---+---------------+
| 2| [3.0, 5.0]|
| 2|[1.5, 7.0, 8.0]|
+---+---------------+
sqlContext.sql("select id, max(maxVal(signals)) as maxSignal from table group by id having maxSignal > 1.5").show()
+---+---------+
| id|maxSignal|
+---+---------+
| 1| 2.0|
| 2| 8.0|
+---+---------+
Or, if you want some type safety, use the Scala DSL:
import org.apache.spark.sql.functions._
import sqlContext.implicits._ // needed for the $"colName" syntax
val maxVal = udf(maxV _)
df.select("*").where(maxVal($"signals") > 2.1).show()
df.select($"id", maxVal($"signals") as "maxSignal").groupBy($"id").agg(max($"maxSignal")).where(max($"maxSignal") > 2.1).show()
+---+--------------+
| id|max(maxSignal)|
+---+--------------+
| 2| 8.0|
+---+--------------+
2) It depends: if the size of your data allows you to do all the processing at query time with reasonable latency, go for it. You can start with this approach and build optimized structures for slow/popular queries later, for example as sketched below.
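For example, a popular "peak per source" query could be materialized once with a CTAS instead of being recomputed on every run (a sketch only; signal_peaks is a hypothetical table that already holds one pre-computed peak value per measure):
CREATE TABLE source_peaks STORED AS PARQUET AS
SELECT source_id, MAX(peak_value) AS max_peak
FROM signal_peaks
GROUP BY source_id;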
3) Hive is slow; it has been superseded by Impala and Spark SQL. The choice is not always easy, so we use a rule of thumb: Impala is good for queries without joins if all your data is stored in HDFS/Hive; Spark has higher latency, but its joins are reliable, it supports more data sources, and it has rich non-SQL processing capabilities (like MLlib and GraphX).
4) Keep it simple: store your raw data (the master dataset) de-duplicated and partitioned (we use time-based partitions). If new data arrives in a partition and you already have downstream datasets generated, restart your pipeline for that partition.
Hope this helps
First, I believe Vitaliy's approach is very good in every aspect (and I'm all for Spark).
I'd like to propose another approach, though. The reasons are:
We want to do Interactive queries (+ we have CDH)
Data is already structured
The need is to 'analyze', not so much to 'process', the data. Spark could be overkill because (a) with the data being structured, we can form SQL queries faster, and (b) we don't want to write a program every time we want to run a query.
Here are the steps I'd like to go with:
Ingestion using sqoop to HDFS: [optionally] use --as-parquetfile
Create an external Impala table or an internal table, as you wish. If you have not transferred the file as a Parquet file, you can do that during this step. Partition, preferably by Source ID, since our groupings are going to happen on that column.
So, basically, once we've got the data transferred, all we need to do is create an Impala table, preferably in Parquet format and partitioned by the column that we're going to use for grouping. Remember to run COMPUTE STATS after loading to help Impala run queries faster.
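A rough sketch of that table-creation step (all table and column names here are assumptions; signals_staging stands for whatever table or files the Sqoop import produced):
-- partitioned Parquet table, partitioned by the grouping column
CREATE TABLE signals_parquet (
  measure_id STRING,
  measure_ts TIMESTAMP,
  signal_values STRING
)
PARTITIONED BY (source_id STRING)
STORED AS PARQUET;

-- load from the staging data with dynamic partitioning on source_id
INSERT INTO signals_parquet PARTITION (source_id)
SELECT measure_id, measure_ts, signal_values, source_id
FROM signals_staging;

-- gather stats so Impala can plan queries better
COMPUTE STATS signals_parquet;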
Moving data:
- if we need to generate feed out of the results, create a separate file
- if another system is going to update the existing data, then move the data to a different location while creating->loading the table
- if it's only about queries and analysis and getting reports (i.e, external tables suffice), we don't need to move the data unnecessarily
- we can create an external Hive table on top of the same data. If we need to run long-running batch queries, we can use Hive; it's a no-no for interactive queries, though. If we create any derived tables out of queries and want to use them through Impala, remember to run 'invalidate metadata' before running Impala queries on the Hive-generated tables
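For example, assuming the Hive-generated table is called my_derived_table (a placeholder name), that refresh is a single Impala statement:
INVALIDATE METADATA my_derived_table;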
Lineage - I have not gone deeper into it, here's a link on Impala lineage using Cloudera Navigator
I receive an input file which has 200 MM records. The records are just keys.
For each record from this file (which I'll call SAMPLE_FILE), I need to retrieve all records from a database (which I'll call EVENT_DATABASE) that match the key. The EVENT_DATABASE can have billions of records.
For example:
SAMPLE_FILE
1234
2345
3456
EVENT_DATABASE
2345 - content C - 1
1234 - content A - 3
1234 - content B - 5
4567 - content D - 7
1234 - content K - 7
1234 - content J - 2
So the system will iterate through each record from SAMPLE_FILE and get all events that have the same key. For example, taking 1234 and querying the EVENT_DATABASE will retrieve:
1234 - content A - 3
1234 - content B - 5
1234 - content K - 7
1234 - content J - 2
Then I will execute some calculations using the result set, for example count, sum, and mean:
F1 = 4 (count)
F2 = 17 (sum(3+5+7+2))
My approach is to store the EVENT_DATABASE in HBase. Then I will run a MapReduce job, and in the map phase I will query HBase, get the events, and execute the calculations.
The process can run in batch; it does not need to be real time.
Does anyone suggest another architecture? Do I really need a MapReduce job? Can I use another approach?
I personally solved this kind of problem using MapReduce, HDFS & HBase for batch analysis. Your approach seems good for implementing your use case; I am guessing you are going to store the calculations back into HBase.
Storm could also be used to implement the same use case, but Storm really shines with streaming data and near-real-time processing rather than data at rest.
You don't really need to query HBase for every single record. In my opinion, this would be a better approach:
Create an external table in Hive over your input file.
Create an external table in Hive for your HBase table using the Hive HBase integration (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration)
Do a join on both tables and retrieve the results.
Your approach would have been fine if you were only querying for a subset of your input file, but since you are querying HBase for all records (200 MM of them), using a join would be more efficient, roughly along the lines sketched below.
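A minimal sketch of that join-based approach (the HDFS path, HBase table name, column family and qualifiers are all assumptions based on the example data):
-- external Hive table over the input file of keys
CREATE EXTERNAL TABLE sample_keys (event_key STRING)
LOCATION '/data/sample_file';

-- external Hive table over the HBase EVENT_DATABASE
-- (the event key is assumed to be stored in a column, since HBase row keys must be unique)
CREATE EXTERNAL TABLE events (row_key STRING, event_key STRING, content STRING, val INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:event_key,d:content,d:val")
TBLPROPERTIES ("hbase.table.name" = "event_database");

-- one scan-and-join pass instead of 200 MM point lookups
SELECT s.event_key, COUNT(*) AS f1, SUM(e.val) AS f2
FROM sample_keys s
JOIN events e ON s.event_key = e.event_key
GROUP BY s.event_key;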
Pardon me if this is a silly question.
I have Cloudera Manager installed on a single node.
I am trying to use HBase and Hadoop for logging requests and responses in my web application.
I am trying to list the latest user activity using the log.
Rows are added using the below table structure.
One column family, a row id, and 11 columns. I store every value as a string. Fairly simple and similar to a MySQL table.
RowId
entry:addedTime
entry:value
entry:ip
entry:accessToken
entry:identifier
entry:userId
entry:productId
entry:object
entry:requestHeader
entry:completeDate
entry:tag
Now, in order to get rows from HBase, I use:
SingleColumnValueFilter("entry", "userId", "=", binary:"25", true, true)
Now, I am struggling to order this by
entry:completeDate DESCENDING
and limit it to 25 rows for pagination or infinite scroll.
My questions:
Is HBase the only real-time query database available in the Hadoop ecosystem?
Am I using HBase for the wrong reasons? Is my table structure correct?
I work at a startup and these are our baby steps toward moving to Big Data. Though Big Data has created a lot of hype, Hadoop is poorly supported on the latest Linux distributions and looks too complicated.
Any help or suggestions would be appreciated.
Many thanks,
Karthik
Hi, I would like to know how to implement lookup logic in Hadoop Pig. I have a set of records, say for a weblog user, and I need to go back and fetch some fields from that user's first visit (not the current one).
This is doable in Java, but is there a way to implement this in Pig?
Example:
Suppose that when traversing the records for one particular user, identified by col1 and col2, we need to output the first value of lookup_col for that user, in this case '1'.
col1 col2 lookup_col
---- ---- ----------
326 8979 1
326 8979 4
326 8979 3
326 8979 0
You can implement this as a Pig UDF.
Alternatively, you can use simple SQL-like logic: aggregate the visits by user (not sure how you define a user or how you plan to look up visits by user, but that's another matter), take the first one, and then left-join the users with agg_visits, roughly as sketched below.
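In SQL terms, just to illustrate the logic (a sketch; a visits table with a visit_ts column is an assumption, since some ordering column is needed to define "first"):
-- 1) agg_visits: the first lookup_col per user (col1, col2)
CREATE VIEW agg_visits AS
SELECT v.col1, v.col2, v.lookup_col AS first_lookup
FROM visits v
JOIN (
  SELECT col1, col2, MIN(visit_ts) AS first_ts
  FROM visits
  GROUP BY col1, col2
) f ON v.col1 = f.col1 AND v.col2 = f.col2 AND v.visit_ts = f.first_ts;

-- 2) left-join every visit back to its user's first value
SELECT v.col1, v.col2, v.lookup_col, a.first_lookup
FROM visits v
LEFT OUTER JOIN agg_visits a ON v.col1 = a.col1 AND v.col2 = a.col2;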
A 'replicated join' in Pig is essentially a lookup in a set that is distributed amongst the nodes and loaded into memory. However, you can get more than a single result, because it's a JOIN operation and not a lookup; so if you aggregate the data beforehand, you make sure that you only have a single record per key.