I have two datasets named dataset1 and dataset2, and dataset1 looks like this:
empid empname
101 john
102 kevin
and dataset2 looks like this:
empid empmarks empaddress
101 75 LA
102 69 NY
dataset2 will be very large, and I need to perform some operations on these two datasets and get results that combine both of them.
As far as I know, I now have two options to process these datasets:
1. Store dataset1 (which is smaller) as a Hive lookup table and process both through Spark.
2. Use Spark broadcast variables to process these datasets.
Can anyone suggest which is the better option?
There is a better option than the two you mentioned.
Since you have a common key, you can do an inner join:
dataset2.join(dataset1, Seq("empid"), "inner").show()
You can also use the broadcast function/hint like this, which tells the framework that the small DataFrame, i.e. dataset1, should be broadcast to every executor:
import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
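To confirm that the broadcast actually happens, you can inspect the physical plan. The snippet below is only a sketch, assuming Spark 2.x with a SparkSession named spark; note that even without the hint, Spark automatically broadcasts relations smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default).
// Optionally raise the automatic broadcast threshold, e.g. to 50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

val joined = dataset2.join(broadcast(dataset1), Seq("empid"), "inner")
joined.explain()   // the physical plan should contain a BroadcastHashJoin node
joined.show()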
Also look at the following for more details:
DataFrame join optimization - Broadcast Hash Join, which explains how broadcast joins work.
What is the maximum size for a broadcast object in Spark?
It's hard for me to understand streaming tables in Flink. I can understand Hive, which maps a fixed, static data file to a "table", but how is a table embodied when it is built on streaming data?
For example, every second, 5 events with the same structure are sent to a Kafka stream:
{"num":1, "value": "a"}
{"num":2, "value": "b"}
....
What does the dynamic table built on them look like? Does Flink consume them all, store them somewhere (memory, a local file, HDFS, etc.) and then map them to a table? Once the "transformer" finishes processing these 5 events, does it clear the data and refill the "table" with 5 new events?
Any help is appreciated...
These dynamic tables don't necessarily exist anywhere -- they are simply an abstraction that may, or may not, be materialized, depending on the needs of the query being performed. For example, a query doing a simple projection, such as
SELECT a, b FROM events
can be executed by simply streaming each record through a stateless Flink pipeline.
Also, Flink doesn't operate on mini-batches -- it processes each event one at a time. So there's no physical "table", or partial table, anywhere.
But some queries do require some state, perhaps very little, such as
SELECT count(*) FROM events
which needs nothing more than a single counter, while something like
SELECT key, count(*) FROM events GROUP BY key
will use Flink's key-partitioned state (a sharded key-value store) to persist the current counter for each key. Different nodes in the cluster will be responsible for handling events for different keys.
Just as "normal" SQL takes one or more tables as input, and produces a table as output, stream SQL takes one or streams as input, and produces a stream as output. For example, the SELECT count(*) FROM events will produce the stream 1 2 3 4 5 ... as its result.
There are some good introductions to Flink SQL on YouTube: https://www.google.com/search?q=flink+sql+hueske+walther, and there are training materials on github with slides and exercises: https://github.com/ververica/sql-training.
I am new to HBase and still not sure which component of the Hadoop ecosystem I should use in my case, or how to analyse my data later, so I am just exploring the options.
I have an Excel sheet with a summary of all the customers, like this, but with ≈ 400 columns:
CustomerID Country Age E-mail
251648 Russia 27 boo#yahoo.com
487985 USA 30 foo#yahoo.com
478945 England 15 lala#yahoo.com
789456 USA 25 nana#yahoo.com
Also, I have .xls files created separately for each customer with information about that customer (one customer = one .xls file); the number and names of the columns are the same in each file. Each of these files is named with a CustomerID. One looks like this:
'customerID_251648.xls':
feature1 feature2 feature3 feature4
0 33,878 yes 789,598
1 48,457 yes 879,594
1 78,495 yes 487,457
0 94,589 no 787,475
I have converted all these files into .csv format and am now stuck on which component of the Hadoop ecosystem I should use for storing and querying such data.
My eventual goal is to query some CustomerID and get all the information about that customer from all the files.
I think that HBase fits perfectly for that because I can create such a schema:
row key | timestamp | Column Family 1      | Column Family 2
251648  | ...       | Country, Age, E-Mail | Feature1, Feature2, Feature3, Feature4
What is the best approach to upload and query such data in HBase? Should I first combine the information about a customer from the different sources and then upload it to HBase? Or can I keep a separate .csv file for each customer and, when uploading to HBase, somehow choose which .csv to use for forming the column families?
For querying the data stored in HBase I am going to write MapReduce tasks via a Python API.
Any help would be very appreciated!
Your schema design is correct. Also remember that HBase loads the whole column family during scans, so if you need all the data at once it may be better to place everything in one column family.
A simple way to load the data is to scan the first file with the customer summaries and fetch the data from the per-customer files on the fly. A bulk CSV load could be faster in execution time, but you'll spend more time writing code.
You may also need to think about the row key, because HBase stores data in lexicographic order. If you have a lot of data, it is better to create the table with explicit split keys rather than let HBase do the splits, because it can end up with unbalanced regions.
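For illustration, here is a minimal sketch of creating such a pre-split table with two column families through the HBase 2.x Java client API; the table name, family names, and split keys below are hypothetical, not taken from your data.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = connection.getAdmin

// Hypothetical table: one family for the summary sheet, one for the per-customer features
val descriptor = TableDescriptorBuilder.newBuilder(TableName.valueOf("customers"))
  .setColumnFamily(ColumnFamilyDescriptorBuilder.of("summary"))
  .setColumnFamily(ColumnFamilyDescriptorBuilder.of("features"))
  .build()

// Pre-split on the leading digit of the CustomerID so regions start out roughly balanced
val splitKeys = Array("3", "5", "7").map(s => Bytes.toBytes(s))
admin.createTable(descriptor, splitKeys)

admin.close()
connection.close()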
I am about to do some signal analysis with Hadoop/Spark and I need help on how to structure the whole process.
Signals are currently stored in a database; we will read them with Sqoop and transform them into files on HDFS, with a schema similar to:
<Measure ID> <Source ID> <Measure timestamp> <Signal values>
where the signal values are just strings made of comma-separated floating-point numbers.
000123 S001 2015/04/22T10:00:00.000Z 0.0,1.0,200.0,30.0 ... 100.0
000124 S001 2015/04/22T10:05:23.245Z 0.0,4.0,250.0,35.0 ... 10.0
...
000126 S003 2015/04/22T16:00:00.034Z 0.0,0.0,200.0,00.0 ... 600.0
We would like to write interactive/batch queries to:
apply aggregation functions over signal values
SELECT *
FROM SIGNALS
WHERE MAX(VALUES) > 1000.0
To select signals that had a peak over 1000.0.
apply aggregation over aggregation
SELECT SOURCEID, MAX(VALUES)
FROM SIGNALS
GROUP BY SOURCEID
HAVING MAX(MAX(VALUES)) > 1500.0
To select sources having at least a single signal that exceeded 1500.0.
apply user defined functions over samples
SELECT *
FROM SIGNALS
WHERE MAX(LOW_BAND_FILTER('5.0 KHz', VALUES)) > 100.0
to select signals that, after being filtered at 5.0 kHz, have at least one value over 100.0.
We need some help in order to:
find the correct file format to write the signal data to HDFS. I thought of Apache Parquet. How would you structure the data?
understand the proper approach to data analysis: is it better to create different datasets (e.g. processing data with Spark and persisting the results on HDFS), or to try to do everything at query time from the original dataset?
is Hive a good tool for queries such as the ones I wrote? We are running Cloudera Enterprise Hadoop, so we can also use Impala.
in case we produce different derived datasets from the original one, how can we keep track of the data lineage, i.e. know how the data was generated from the original version?
Thank you very much!
1) Parquet, as a columnar format, is good for OLAP. Spark's support for Parquet is mature enough for production use. I suggest parsing the string representing the signal values into the following data structure (simplified):
case class Data(id: Long, signals: Array[Double])
val df = sqlContext.createDataFrame(Seq(
  Data(1L, Array(1.0, 1.0, 2.0)),
  Data(2L, Array(3.0, 5.0)),
  Data(2L, Array(1.5, 7.0, 8.0))))
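To get from the raw comma-separated strings on HDFS to this structure, a small parsing step along these lines could be used (just a sketch; rawDf, measureId and rawValues are hypothetical names for the raw input):
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical raw input: rawDf(measureId, rawValues) where rawValues is "0.0,1.0,200.0,..."
val parseSignals = udf((s: String) => s.split(",").map(_.trim.toDouble))
val parsed = rawDf.select(col("measureId").as("id"), parseSignals(col("rawValues")).as("signals"))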
Keeping an array of doubles allows you to define and use UDFs like this:
import scala.collection.mutable
def maxV(arr: mutable.WrappedArray[Double]) = arr.max
sqlContext.udf.register("maxVal", maxV _)
df.registerTempTable("table")
sqlContext.sql("select * from table where maxVal(signals) > 2.1").show()
+---+---------------+
| id| signals|
+---+---------------+
| 2| [3.0, 5.0]|
| 2|[1.5, 7.0, 8.0]|
+---+---------------+
sqlContext.sql("select id, max(maxVal(signals)) as maxSignal from table group by id having maxSignal > 1.5").show()
+---+---------+
| id|maxSignal|
+---+---------+
| 1| 2.0|
| 2| 8.0|
+---+---------+
Or, if you want some type safety, you can use the Scala DSL:
import org.apache.spark.sql.functions._
val maxVal = udf(maxV _)
df.select("*").where(maxVal($"signals") > 2.1).show()
df.select($"id", maxVal($"signals") as "maxSignal").groupBy($"id").agg(max($"maxSignal")).where(max($"maxSignal") > 2.1).show()
+---+--------------+
| id|max(maxSignal)|
+---+--------------+
| 2| 8.0|
+---+--------------+
2) It depends: if the size of your data allows you to do all the processing at query time with reasonable latency, go for it. You can start with this approach and build optimized structures for slow/popular queries later.
3) Hive is slow; it has been superseded by Impala and Spark SQL. The choice is not always easy, so we use a rule of thumb: Impala is good for queries without joins if all your data is stored in HDFS/Hive; Spark has higher latency, but its joins are reliable, it supports more data sources, and it has rich non-SQL processing capabilities (such as MLlib and GraphX).
4) Keep it simple: store your raw data (the master dataset) de-duplicated and partitioned (we use time-based partitions). If new data arrives in a partition and you already have downstream datasets generated, restart your pipeline for that partition.
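As a rough illustration of that time-partitioned master dataset (only a sketch; masterDf, measureId, measureTs and the output path are made-up names):
import org.apache.spark.sql.functions.{col, to_date}

// De-duplicate the raw signals and write them partitioned by measurement date
masterDf.dropDuplicates("measureId")
  .withColumn("measure_date", to_date(col("measureTs")))
  .write
  .partitionBy("measure_date")
  .mode("append")
  .parquet("/data/signals/master")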
Hope this helps
First, I believe Vitaliy's approach is very good in every aspect (and I'm all for Spark).
I'd like to propose another approach, though. The reasons are:
We want to do Interactive queries (+ we have CDH)
Data is already structured
The need is to 'analyze' rather than to 'process' the data. Spark could be overkill because (a) with the data already structured, we can form SQL queries faster, and (b) we don't want to write a program every time we want to run a query.
Here are the steps I'd like to go with:
Ingestion using sqoop to HDFS: [optionally] use --as-parquetfile
Create an external Impala table, or an internal table if you wish. If you have not transferred the file as a Parquet file, you can do that during this step. Partition, preferably by Source ID, since our groupings are going to happen on that column.
So, basically, once we've got the data transferred, all we need to do is create an Impala table, preferably in Parquet format and partitioned by the column that we're going to use for grouping. Remember to run COMPUTE STATS after loading to help Impala run queries faster.
Moving data:
- if we need to generate feed out of the results, create a separate file
- if another system is going to update the existing data, then move the data to a different location while creating and loading the table
- if it's only about queries and analysis and getting reports (i.e, external tables suffice), we don't need to move the data unnecessarily
- we can create an external Hive table on top of the same data. If we need to run long-running batch queries, we can use Hive; it's a no-no for interactive queries, though. If we create any derived tables out of queries and want to use them through Impala, remember to run 'invalidate metadata' before running Impala queries on the Hive-generated tables.
Lineage - I have not gone deeper into it, but Impala lineage can be tracked using Cloudera Navigator.
Hi, I would like to know how to implement lookup logic in Hadoop Pig. I have a set of records, say for a weblog user, and need to go back and fetch some fields from that user's first visit (not the current one).
This is doable in Java, but is there a way to implement it in Hadoop Pig?
Example:
Suppose that, when traversing the records of one particular user, identified by col1 and col2, we want to output the first value of lookup_col for that user, in this case '1'.
col1 col2 lookup_col
---- ---- ----------
326 8979 1
326 8979 4
326 8979 3
326 8979 0
You can implement this as a Pig UDF.
Alternatively, you can use simple SQL-like logic: aggregate the visits by user (I'm not sure how you define the user or how you plan to look up visits by user, but that's another matter), take the first one, and then left-join the users with the aggregated visits.
A 'replicated join' in Pig is essentially a lookup in a set that is distributed among the nodes and loaded into memory. However, you can get more than a single result because it's a JOIN operation, not a lookup - so if you aggregate the data beforehand, you can make sure that you only have a single record per key.
I am trying to compare two different tables in HBase so that I can automate the validation of some ETL processes that we use to move data in HBase. What's the best way to compare two tables in HBase?
My use case is below:
What I am trying to do is create one table that will be my expected output. This table will contain all of the data that I am expecting to be created by executing the team's code against an input file. I will then take the diff between the actual output table and the expected output table to verify the integrity of the component under test.
I don't know of anything out of the box, but you can write a multi-table MapReduce job.
The mappers will just emit keys from each table (with the value being all of the HBase KeyValues plus a table name).
The reducer can make sure it has two records for each key and compare the KeyValues. When there is only one record for a key, it can see which table is out of sync.
I know this question is a little old, but how large are the tables? If they will both fit into memory you could load them into Pig using HBaseStorage, then use Pig's built-in DIFF function to compare the resulting bags.
This will work even with large tables that don't fit into memory, according to the docs, but it will be extremely slow.
dataset1 = LOAD '/path/to/dataset1' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset2 = LOAD '/path/to/dataset2' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset3 = COGROUP dataset1 BY (a, b, c, d), dataset2 BY (a, b, c, d);
dataset4 = FOREACH dataset3 GENERATE DIFF(dataset1,dataset2);