Can we stream data from ADX to Databricks Spark cluster? - spark-streaming

Does ADX support sending data to Databricks Spark cluster in streaming fashion -- basically Spark will pull it from ADX instead of ADX exporting it? In other words, I am trying to understand if an ADX table can act as a source for spark streaming ? Is there an example link for this that I can go through?

ADX provides the means for querying a table for all the data that has been added to it since the last query through the means of Database Cursors. Note, however, that this requires the caller to parse the query's #ExtendedProperties set (which holds the database cursor) and maintain state between every two successive queries (so that the new cursor value could be passed to the next query).

Related

Nifi Fetching Data From Oracle Issue

I am having a requirement to fetch data from oracle and upload into google cloud storage.
I am using executeSql proecssor but it is failing for large table and even for table with 1million records of approx 45mb size it is taking 2hrs to pull.
The table name are getting passed using restapi to listenHttp which passes them to executeSql. I cant use QueryDatabase because the number of table are dynamic and calls to start the fetch is also dynamic using a UI and Nifi RestUi.
Please suggest any tuning parameter in ExecuteSql Processor.
I believe you are talking about having the capability to have smaller flow files and possibly sending them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836) and in an upcoming release (NiFi 1.8.0 via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file) and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you do handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.

Cassandra for datawarehouse

Is Cassandra a good alternative for Hadoop as a data warehouse where data is append only and all updates in source databases should not overwrite the existing rows in the data warehouse but get appended. Is Cassandra really ment to act as a data warehouse or just as a database to store the results of batch / stream queries?
Cassandra can be used both as a data warehouse(raw data storage) and as a database (for final data storage). It depends more on the cases you want to do with the data.
You even may need to have both Hadoop and Cassandra for different purposes.
Assume, you need to gather and process data from multiple mobile devices and provide some complex aggregation report to the user.
So at first, you need to save data as fast as possible (as new portions appear very often) so you use Cassandra here. As Cassandra is limited in aggregation features, you load data into HDFS and do some processing via HQL scripts (assume, you're not very good at coding but great in complicated SQLs). And then you move the report results from HDFS to Cassandra in a dedicated reports table partitioned by user id.
So when the user wants to have some aggregation report about his activity in the last month, the application takes the id of active user and returns the aggregated result from Cassandra (as it is simple key-value search).
So for your question, yes, it could be an alternative, but the selection strategy depends on the data types and your application business cases.
You can read more information about usage of Cassandra
here

What is the best way to ingest data from Terdata into Hadoop with Informatica?

What is the best ways to parallel ingest data from Teradata database into Hadoop with parallel data moving?
If we create a job which is simple opens one session to Teradata database it will take a lot of time to load huge table.
if we create a set of sessions to load data in parallel, and also make Select in each of the sessions, than it will make a set of Full table scans Teradata to produce a data
What is the recommended best practice to load data in parallelised streams and make unnecessary workload to Teradata?
If Tera data supports table partitioning like oracle, you could try reading the table based on partitioning points which will enable parallelism in read...
Other option you have is, split the table into multiple partitions like adding a where clause on indexed column. This will ensure index scan and you can avoid full table scan.
The most scalable way to ingest data into Hadoop form teradata, which i found is to use Teradata connector for hadoop. It is included in Cloudera & Hortonworks distributions. I will show example base on Cloudera documentation, but the same works with Hortonworks as well:
Informatica big Data edition is using standard Scoop invocation via command line and submitting set of parameters to it. So the main question is - which driver to use to make parallel connections between two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is the digest from this documentation (You could find that this connector support different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
If you use partition names in the select clause, Power Center will select only the rows within that partition so there won't be duplicate read (don't forget to choose Database partitioning in Informatica session level). However if you use key range partition you have to choose the range as you mentioned in settings. Usually we use NTILE oracle analytical function to split the table into multiple portions so that the read will be unique across the selects. Please let me know if you have any question. If you have range/auto generated/surrogate key column in the table use it in where clause - write a sub-query to divide the table into multiple portions.

Import small stream in Impala

We are currently on a Big Data project.
The Big Data platform Hadoop Cloudera.
Input of our system we have a small flow of data, we collect via Kafka (approximately 80Mo/h continuously).
Then the messages are stored in HDFS to be queried via Impala.
Our client does not want to separate the hot data with the cold data. After 5 mins, the data must be accessible in the history data (cold data). We chose to have a single database.
To insert the data, we use the JDBC connector provided by Impala API (eg INSERT INTO ...).
we are aware that this is not the recommended solution, each Impala insertion creates a file (<10kb) in HDFS.
We seek a solution to insert a small stream in a Imapala base which avoids getting many small files.
What solution we preconize?

how to efficiently move data from Kafka to an Impala table?

Here are the steps to the current process:
Flafka writes logs to a 'landing zone' on HDFS.
A job, scheduled by Oozie, copies complete files from the landing zone to a staging area.
The staging data is 'schema-ified' by a Hive table that uses the staging area as its location.
Records from the staging table are added to a permanent Hive table (e.g. insert into permanent_table select * from staging_table).
The data, from the Hive table, is available in Impala by executing refresh permanent_table in Impala.
I look at the process I've built and it "smells" bad: there are too many intermediate steps that impair the flow of data.
About 20 months ago, I saw a demo where data was being streamed from an Amazon Kinesis pipe and was queryable, in near real-time, by Impala. I don't suppose they did something quite so ugly/convoluted. Is there a more efficient way to stream data from Kafka to Impala (possibly a Kafka consumer that can serialize to Parquet)?
I imagine that "streaming data to low-latency SQL" must be a fairly common use case, and so I'm interested to know how other people have solved this problem.
If you need to dump your Kafka data as-is to HDFS the best option is using Kafka Connect and Confluent HDFS connector.
You can either dump the data to a parket file on HDFS you can load in Impala.
You'll need I think you'll want to use a TimeBasedPartitioner partitioner to make parquet files every X miliseconds (tuning the partition.duration.ms configuration parameter).
Addign something like this to your Kafka Connect configuration might do the trick:
# Don't flush less than 1000 messages to HDFS
flush.size = 1000
# Dump to parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class = TimebasedPartitioner
# One file every hour. If you change this, remember to change the filename format to reflect this change
partition.duration.ms = 3600000
# Filename format
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
Answering that question in year 2022, I would say that solution would be streaming messages from Kafka to Kudu and integrate Impala with Kudu, as it has already tight integration.
Here is example of Impala schema for Kudu:
CREATE EXTERNAL TABLE my_table
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'my_kudu_table'
);
Apache Kudu supports SQL inserts and it uses own file format under the hood. Alternatively you could use Apache Phoenix which supports inserts and upserts (if you need exactly once semantic) and uses HBase under the hood.
As long as the Impala is your final way of accessing the data, you shouldn't care about underlaying formats.

Resources