Pentaho Data in Excel Template - pentaho-data-integration

I'm not sure how to explain this. I'm new to Pentaho Data Integration, and I need to get data into an Excel file in a certain order. For example, a specific title goes in A1 and a date goes in A2, and B1 is where the data list starts. The last row has to have specific data in A1 and A2. Is there a way to do this in Pentaho? I tried using Execute SQL script -> Microsoft Excel Writer, but I'm completely lost.

Related

How do I read this mapping document?

I am new to ETL and am working with Talend. I was given this document and was told to make an "extraction job." How exactly do I read this document for the Talend job that I have to make?
Well, ETL basically means Extract-Transform-Load.
From your example, I can understand that you have to create a Target table which will pull data from the Source table based on certain conditions. These conditions are mentioned in your image.
You basically have to look at the Source File columns from your image. They clearly state:
1.) File (Table name): which table in the Source DB this attribute comes from.
2.) Attribute(s) (Field Name): the name of the column.
3.) Extract logic: the logic to apply while extracting this column from the Source; "Straight Move" means just copy the source value into the Target unchanged.
This is just to get you started, as nobody will actually create the whole ETL flow for you here.
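To make the "Straight Move" idea concrete, here is a minimal, hypothetical sketch in plain JDBC (the connection URLs, table and column names are all made up; in Talend you would normally draw this mapping in a tMap component rather than hand-coding it):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class StraightMoveSketch {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection("jdbc:source-db-url");   // placeholder URL
             Connection tgt = DriverManager.getConnection("jdbc:target-db-url");   // placeholder URL
             Statement stmt = src.createStatement();
             // File (Table name) = source_table, Attribute (Field Name) = customer_name
             ResultSet rs = stmt.executeQuery("SELECT customer_name FROM source_table");
             PreparedStatement ins = tgt.prepareStatement(
                     "INSERT INTO target_table (customer_name) VALUES (?)")) {
            while (rs.next()) {
                // Extract logic = Straight Move: the value is copied with no transformation
                ins.setString(1, rs.getString("customer_name"));
                ins.executeUpdate();
            }
        }
    }
}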

Approach to upload multiple interconnected csv files to HBase

I am new to HBase and still not sure which component of the Hadoop ecosystem I will use in my case, or how I will analyse my data later, so I am just exploring options.
I have an Excel sheet with a summary of all the customers, like this but with ≈ 400 columns:
CustomerID  Country  Age  E-mail
251648      Russia   27   boo@yahoo.com
487985      USA      30   foo@yahoo.com
478945      England  15   lala@yahoo.com
789456      USA      25   nana@yahoo.com
Also, I have .xls files created separately for each customer, with information about that customer (one customer = one .xls file); the number of columns and the column names are the same in each file. Each of these files is named with a CustomerID. One looks like this:
'customerID_251648.xls':
feature1 feature2 feature3 feature4
0 33,878 yes 789,598
1 48,457 yes 879,594
1 78,495 yes 487,457
0 94,589 no 787,475
I have converted all these files into .csv format and am now stuck on which component of the Hadoop ecosystem I should use for storing and querying such data.
My eventual goal is to query some CustomerID and get all the information about that customer from all the files.
I think that HBase fits perfectly for that, because I can create a schema like this:
row key   timestamp   Column Family 1         Column Family 2
251648                Country, Age, E-Mail    Feature1, Feature2, Feature3, Feature4
What is the best approach to upload and query such data in HBase? Should I first combine the information about a customer from the different sources and then upload it to HBase? Or can I keep a separate .csv file for each customer and, when uploading to HBase, somehow choose which .csv to use for forming the column families?
For querying data stored in HBase I am going to write MapReduce tasks via Python API.
Any help would be much appreciated!
You are correct with the schema design. Also remember that HBase loads the whole column family during scans, so if you need all the data at one time it may be better to place everything in one column family.
A simple way to load the data would be to scan the first file with the customer summaries and fetch the data from the matching per-customer file on the fly. A bulk CSV load could be faster in execution time, but you'll spend more time writing code.
You should also think about the row key, because HBase stores rows in sorted (lexicographical) order of the key. If you have a lot of data, you'd better create the table with pre-defined split keys rather than letting HBase do the splits, because otherwise you can end up with unbalanced regions.
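For illustration, here is a minimal sketch with the HBase 2.x Java client (the table name, column family names, split points and the one hard-coded customer are all assumptions; parsing the CSV files is omitted). It creates a pre-split table with one family for the summary sheet and one for the per-customer features, and writes one merged row per customer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Pre-split on the leading digit of CustomerID so regions start out balanced
            // (the split points below are made-up examples).
            byte[][] splitKeys = { Bytes.toBytes("3"), Bytes.toBytes("5"), Bytes.toBytes("7") };
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("customers"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("summary"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("features"))
                    .build(), splitKeys);

            // One merged row per customer: summary columns plus the per-customer feature file.
            try (Table table = conn.getTable(TableName.valueOf("customers"))) {
                Put put = new Put(Bytes.toBytes("251648"));
                put.addColumn(Bytes.toBytes("summary"), Bytes.toBytes("Country"), Bytes.toBytes("Russia"));
                put.addColumn(Bytes.toBytes("summary"), Bytes.toBytes("Age"), Bytes.toBytes("27"));
                put.addColumn(Bytes.toBytes("features"), Bytes.toBytes("feature2"), Bytes.toBytes("33,878"));
                table.put(put);
            }
        }
    }
}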

Bulk Loading Key-value pair data into HBASE

I am evaluating HBase for dealing with a very wide dataset with a variable number of columns per row. In its raw form, my data has a variable list of parameter names and values for each row. In its transformed form, it is available as key-value pairs.
I want to load this data into HBase. It is very easy to translate my key-value pair processed data into individual "put" statements to get the data in. However, I need to bulk load, as I have thousands of columns and millions of rows, leading to billions of individual key-value pairs and therefore billions of "put" statements. Also, the list of columns (a, b, c, d, ...) is not fully known ahead of time. I investigated the following options so far:
importtsv: cannot be used, because it requires the data to be pivoted from rows to columns ahead of time, with a fixed set of known columns to import.
Hive to generate HFiles: this option too requires the column names to be specified ahead of time, and each column in the Hive table must be mapped to a column in HBase.
My only option seems to be to parse a chunk of the data once, pivot it into a set of known columns, and bulk load that. This seems wasteful, as HBase is going to break it down into key-value pairs anyway. Is there a simpler, more efficient way of bulk loading the key-value pairs?
Raw data format:
rowkey1, {a:a1, b:b1}
rowkey2, {a:a2, c:c2}
rowkey3, {a:a3, b:b3, c:c3, d:d3}
Processed data format:
rowkey1, a, a1
rowkey1, b, b1
rowkey2, a, a2
rowkey2, c, c2
rowkey3, a, a3
rowkey3, b, b3
rowkey3, c, c3
rowkey3, d, d3
You almost assuredly want to use a custom M/R job + incremental loading (aka bulk loading).
The general process will be:
Submit an M/R job that has been configured using HFileOutputFormat.configureIncrementalLoad
Map over the raw data and write PUTs for HBase
Load the output of the job into the table using the following:
sudo -u hdfs hdfs dfs -chown -R hbase:hbase /path/to/job/output
sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /path/to/job/output table-name-here
There are ways to do the load from Java, but it means impersonating HBase. The tricky part here is making sure that the files are owned by HBase and that the user running the incremental load is also HBase. This Cloudera Blog Post talks a bit more about those details.
In general I recommend taking a peek at this GH Repo, which seems to cover the basics of the process.
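To make that process a little more concrete, here is a rough sketch of such a job (the class names, the single column family "d", the table name and the "rowkey,column,value" input line format are my assumptions; newer HBase versions expose the helper as HFileOutputFormat2). The shell commands above then chown and bulk-load the job output:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueBulkLoad {

    // Turns one "rowkey,column,value" line into a Put keyed by the row.
    public static class KvMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        private static final byte[] FAMILY = Bytes.toBytes("d"); // assumed single column family

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 3);
            byte[] row = Bytes.toBytes(parts[0].trim());
            Put put = new Put(row);
            put.addColumn(FAMILY, Bytes.toBytes(parts[1].trim()), Bytes.toBytes(parts[2].trim()));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("mytable"); // assumed table name
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            Job job = Job.getInstance(conf, "kv-bulk-load");
            job.setJarByClass(KeyValueBulkLoad.class);
            job.setMapperClass(KvMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Wires in the HFile output format plus the Put-sorting reducer for this table's regions.
            HFileOutputFormat2.configureIncrementalLoad(job, conn.getTable(tableName),
                    conn.getRegionLocator(tableName));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
}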

While creating a table, how to identify the data types in Hive

I am learning to use Hadoop for performing Big Data related operations.
I need to perform some queries on a collection of data sets split across 8 csv files. Each file (in its original .xls form) has multiple sheets, and the query concerns only one of the sheets (Sheet Name: Table4).
The dataset can be downloaded here : http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html
A sample data snapshot is attached for quick reference.
I have already converted the above xls files to csv.
I am not sure how to group the data while creating the table in Hive.
It will be really helpful if you can guide me here.
Note: I am a novice with Hadoop and Big Data, so if anyone could guide me on how to proceed further I'd be very grateful.
If you need information on the queries or anything else let me know.
Thanks!

Write Hive Table using Spark SQL and JDBC

I am new to Hadoop and I am using a single node cluster (for development) to pull some data from a relational database.
Specifically, I am using Spark (version 1.4.1), via the Java API, to pull data for a query and write it to Hive. I have run into various problems (and have read the manuals and tried searching online), but I think I might be misunderstanding some fundamental part of this.
First, I thought I'd be able to read data into Spark, optionally run some Spark methods to manipulate the data, and then write it to Hive through a HiveContext object. But there doesn't seem to be any way to write straight from Spark to Hive. Is that true?
So I need an intermediate step. I have tried a few different methods of storing the data before writing it to Hive and settled on writing an HDFS text file, since that seemed to work best for me. However, when writing the HDFS file, I get square brackets in the files, like this: [A,B,C]
So, when I load the data into Hive using the "LOAD DATA INPATH..." HiveQL statement, I get the square brackets in the Hive table!
What am I missing? Or more appropriately, can someone please help me understand the steps I need to do to:
1.) Run a SQL query against a SQL Server or Oracle DB
2.) Write the data out to a Hive table that can be accessed by a dashboard tool.
My code right now, looks something like this:
DataFrame df = sqlContext.read().format("jdbc").options(getSqlContextOptions(driver, dburl, query)).load(); // This step seems to work fine.
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile(getHdfsUri() + pathToFile); // This works, but writes the rows in square brackets, like: [1, AAA].
hiveContext.sql("CREATE TABLE BLAH (MY_ID INT, MY_DESC STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE");
hiveContext.sql("LOAD DATA INPATH '" + getHdfsUri() + hdfsFile + "' OVERWRITE INTO TABLE `BLAH`"); // Get's written like:
MY_ID MY_DESC
------ -------
AAA]
The INT column doesn't get written at all, because the leading [ means it is no longer a numeric value, and the last column shows the ] at the end of the row from the HDFS file.
Please help me understand why this isn't working or what a better way would be. Thanks!
I am not locked into any specific approach, so all options would be appreciated.
OK, I figured out what I was doing wrong. I needed to use the write function on the DataFrame created through the HiveContext, with the com.databricks.spark.csv format, to write the table into Hive. This does not require the intermediate step of saving a file in HDFS, which is great, and it writes to Hive successfully.
DataFrame df = hiveContext.createDataFrame(rdd, struct);
df.select(cols).write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("TABLENAME");
I did need to create a StructType object, though, to pass into the createDataFrame method for the proper mapping of the data types (something like what is shown in the middle of this page: Support for User Defined Types for Java in Spark). The cols variable is an array of Column objects, which is really just an array of column names (i.e. something like Column[] cols = { new Column("COL1"), new Column("COL2") };).
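For reference, a sketch of how struct and cols could be built for the two-column example above (the names are taken from the earlier CREATE TABLE, and the Spark 1.4 Java API is assumed); these then plug into the createDataFrame and select calls shown above:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Schema matching MY_ID INT, MY_DESC STRING from the CREATE TABLE statement.
StructType struct = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("MY_ID", DataTypes.IntegerType, true),
        DataTypes.createStructField("MY_DESC", DataTypes.StringType, true)
});
// Columns to select, in the order they should be written.
Column[] cols = { new Column("MY_ID"), new Column("MY_DESC") };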
I think "Insert" is not yet supported.
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
To get rid of the brackets in the text file, you should avoid saveAsTextFile. Instead, try writing the contents yourself using the HDFS API, i.e. FSDataOutputStream.
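A minimal sketch of that suggestion (the output path and delimiter are assumptions, and df is the DataFrame from the question). It collects the rows to the driver, so it only makes sense for small result sets; the point is simply that you control the formatting instead of relying on Row.toString():

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Row;

FileSystem fs = FileSystem.get(new Configuration());
try (FSDataOutputStream out = fs.create(new Path("/tmp/blah.csv"))) { // assumed output path
    for (Row row : df.collectAsList()) {                              // driver-side collect: small data only
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < row.length(); i++) {
            if (i > 0) line.append(',');   // comma-delimited, matching the Hive DDL above
            line.append(row.get(i));       // no surrounding brackets
        }
        line.append('\n');
        out.write(line.toString().getBytes(StandardCharsets.UTF_8));
    }
}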
