Kedro pipeline on partitioned data - kedro

I work on partitioned data (partitioned parquet or SQL table with a "partition" column). I want Kedro to load and save data from a partition I provide at runtime (e.g. kedro run --params partition:A). The number of partitions is large and dynamic.
I use Spark. Is there a way to load/save data the way I need with SparkDataSet or SparkJDBCDataSet?

A quick Google search suggests the Spark JDBC driver can use a timestamp column for partitioning. All Kedro does behind the scenes is pass the catalog load_args and save_args to the native driver, so this may work.
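As a rough sketch of that idea: Spark's JDBC source accepts a parenthesised subquery with an alias wherever a table name is expected, so the partition filter can be pushed into the "table" itself. The helper below only builds that string in plain Python. The function name and the example table and column names are made up for illustration, and wiring the result into a catalog entry is left as an assumption about your setup.

```python
def partition_subquery(table, column, value):
    """Build a subquery that restricts a JDBC read to one partition value."""
    # Hypothetical helper, not part of Kedro or Spark.
    # Double any single quotes so the value stays a valid SQL string literal.
    escaped = str(value).replace("'", "''")
    return f"(SELECT * FROM {table} WHERE {column} = '{escaped}') AS src"


# Example: the string that would stand in for the table name.
print(partition_subquery("events", "partition", "A"))
```

The table and column names are interpolated as-is, so they should come from trusted configuration rather than from user input.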
Another way is to use a lifecycle hook such as before_pipeline_run: inspect the run parameters and inject some custom logic there, since the --params run arguments are easy to inspect at that point.
A last thought: if you subclass the SQL dataset you want to use, you can easily extend it to partition the way you want. You won't easily be able to pass run --params, but it would be easy to retrieve env variables or custom catalog arguments.
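A minimal sketch of the lookup logic that a hook or a dataset subclass might share, written as plain functions so it does not depend on any Kedro version. It assumes the run_params dictionary given to before_pipeline_run carries the --params values under "extra_params" (true for the Kedro releases I have seen, but check yours). The environment variable name KEDRO_PARTITION and both function names are hypothetical.

```python
import os


def resolve_partition(run_params, env_var="KEDRO_PARTITION"):
    """Return the partition from --params, falling back to an environment variable."""
    extra = run_params.get("extra_params") or {}
    value = extra.get("partition") or os.environ.get(env_var)
    if value is None:
        raise ValueError("no partition supplied via --params or environment")
    return str(value)


def partition_path(base_path, column, value):
    """Path of one Hive-style partition directory, e.g. base/partition=A."""
    return f"{base_path.rstrip('/')}/{column}={value}"


params = {"extra_params": {"partition": "A"}}
print(partition_path("data/01_raw/events", "partition", resolve_partition(params)))
```

Note that when Spark reads a single partition directory like this, the partition column is not part of the resulting DataFrame unless the reader's basePath option points at the table root.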

Related

Import Sqoop column names issue

I have a question on Kylo and Nifi.
The version of Kylo used is 0.10.1
The version of Nifi used is 1.6.0
When we create a feed for database ingest (using database as source), in the Additional Options step there is no provision to enter the source table column names.
However, on the Nifi side, we use an Import Sqoop processor which has a mandatory field called Source Fields, and it requires that the columns be entered, separated by commas. If this is not done, we get an error:
ERROR tool.ImportTool: Imported Failed: We found column without column name. Please verify that you've entered all column names in your query if using free form query import (consider adding clause AS if you're using column transformation)
For our requirement, we want Import Sqoop to take all the columns from the table automatically into this property without manual intervention at Nifi level. Is there any option to include all columns of a database table in the background automatically? Or is there any other possibility of giving this value in UpdateAttribute processor?
As mentioned in the comments, ImportSqoop is not a normal Nifi processor. This does not have to be a problem, but it means it is probably not possible to troubleshoot the issue without involving the creator.
Also, though I am still debating whether Nifi on Sqoop is an antipattern, it is certainly not necessary.
Please look into the standard options first:
1. The standard way to get data into Nifi from tables is with standard processors such as ExecuteSQL
2. If that doesn't suffice, the standard way to use Sqoop (a batch tool) is with a batch scheduler, such as Oozie or Airflow
This thread may take away further doubts on point 1: http://apache-nifi.1125220.n5.nabble.com/Sqoop-Support-in-NIFI-td5653.html
Yes, Teradata Kylo Import Sqoop is not a standard NiFi processor, but it's there for us to use. Looking deeper at the processor's properties, we can see that SOURCE_TABLE_FIELDS is indeed required there. You then have the option to hard-code the list of columns manually or to set up a method that generates the list dynamically.
The typical solution is to provide the list of fields by querying the table's metadata. The particular solution depends on where the source and target tables are set up and how the mapping between source and target columns is defined. For example, one could use the databases' INFORMATION_SCHEMA tables and match columns by name. Because Sqoop's output should match the source, one has to find a way to generate the column list and provide it to the ImportSqoop processor. A better approach still could involve a separate metadata store that holds the source and target information along with mappings and possible transforms (many tools are available for that purpose, for example Wherescape).
More specifically, I would use LookupAttribute paired with database or scripted lookup service to retrieve the column list from some metadata provider.
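To make the metadata idea concrete, here is a small sketch in Python (not NiFi configuration) of turning INFORMATION_SCHEMA rows into the comma-separated value that the Source Fields property expects. The query text follows the ANSI INFORMATION_SCHEMA layout, the %s placeholder style depends on the database driver, and the function name and sample rows are invented for illustration.

```python
# Query against the ANSI INFORMATION_SCHEMA; the placeholder style (%s) varies by driver.
COLUMNS_QUERY = (
    "SELECT column_name, ordinal_position "
    "FROM information_schema.columns "
    "WHERE table_schema = %s AND table_name = %s"
)


def source_fields(rows):
    """Join (column_name, ordinal_position) rows into a comma-separated list in table order."""
    ordered = sorted(rows, key=lambda row: row[1])
    return ",".join(name for name, _ in ordered)


# Made-up metadata rows, as a driver might return them in arbitrary order.
rows = [("name", 2), ("id", 1), ("created_at", 3)]
print(source_fields(rows))
```

The same logic could live in a scripted lookup service, with the resulting string written to a flow file attribute that the processor property references.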

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from big data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with given schema (having around 200
column-header in it). This dataset is going to be stored in Hive
tables/HDFS. Before loading the data into hive table/hdfs I want to
perform a schema level verification/validation on the data supplied to
avoid any unwanted errors/exception while loading the data into hdfs.
For example, if somebody tries to pass a data file having fewer or more
columns, then the load should fail at the first level of
verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS and run MapReduce on top of that. There you get hold of each row, so you can verify the number of columns, their types, and perform any other validations.
As for JSON/XML, there is a slight overhead in making MapReduce identify the records in that format. However, with respect to validation, there is schema validation which you can enforce, and you can also restrict a field to specific values using the schema. So once the schema is ready, you parse the records (XML to Java) and then store them at another, final HDFS location for further use (like HBase). When you are sure that the data is validated, you can create Hive tables on top of it.
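The per-row check described above can be sketched as follows. This is plain Python over a CSV stream rather than a MapReduce job, purely to show the validation logic. The three-column schema and the function name are hypothetical stand-ins for the real 200-column schema.

```python
import csv
import io

# Hypothetical expected schema: column name -> converter that raises ValueError on bad input.
EXPECTED_SCHEMA = {"id": int, "name": str, "amount": float}


def validate_csv(stream, schema):
    """Return a list of error strings; an empty list means the file matches the schema."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header != list(schema):
        return [f"header mismatch: expected {list(schema)}, got {header}"]
    errors = []
    for row in reader:
        line_no = reader.line_num  # physical line number of the row just read
        if len(row) != len(schema):
            errors.append(f"line {line_no}: expected {len(schema)} columns, got {len(row)}")
            continue
        for (name, convert), value in zip(schema.items(), row):
            try:
                convert(value)
            except ValueError:
                errors.append(f"line {line_no}: column {name!r} has invalid value {value!r}")
    return errors


sample = io.StringIO("id,name,amount\n1,alice,9.5\n2,bob\nx,carol,1.0\n")
for error in validate_csv(sample, EXPECTED_SCHEMA):
    print(error)
```

In a real job the same checks would run inside the mapper, with invalid rows counted or routed to a reject location and the load aborted if any are found.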
Use the utility below to create temp tables each time, based on the schema you receive in CSV format in the staging directory, and then apply some conditions to check whether the columns are valid. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

Does any one know how to create dataframe in sparkR from hbase table?

I am trying to create a spark dataframe in sparkR using data stored in hbase.
Does anyone know how to specify the data source parameters in SQLContext, or any other way to get around this?
You might want to take a look at this package : http://spark-packages.org/package/nerdammer/spark-hbase-connector.
However, it seems that you can't use it with SparkR yet, and the two other packages providing a connection between Spark and HBase don't seem to be as advanced as the first one.
So I guess you won't be able to create a dataframe directly from HBase to SparkR.

Writing to multiple HCatalog schemas in single reducer?

I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
Andrew

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in Hive. I find myself currently working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is thrown out in 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a way to get it in a more direct manner?
Alternatively, if I could get pointed to documentation on manually using the CombineHiveInputFormat, that would simplify my code a lot more. But it seems that this InputFormat is used solely inside of Hive, with its custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.

Resources