Data health check tool - validation

I want to run data health checks on a huge volume of data, which may live either in an RDBMS or in cloud file storage such as Amazon S3. Which tool would be appropriate for this? It should report things like the number of rows, the rows that do not match a given schema (data type validation), and the average volume for a given time period.
I do not want to use a big data platform such as Qubole or Databricks because of the extra cost involved. I found Drools, which can perform similar operations, but it would require reading the full data set into memory and mapping it to POJOs before validation. Any alternatives that do not require loading the full data set into memory would be appreciated.

You can avoid loading the full data set into memory by using Drools' StatelessKieSession. A StatelessKieSession works only on the current event; it does not maintain state for any event and does not keep objects in memory. Read more about StatelessKieSession here.
Alternatively, you can use a stateful KieSession and give events an expiry with the @expires declaration, which expires an event after the specified time. Read more about @expires here.
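
The one-record-at-a-time idea behind a stateless session can also be sketched without Drools. The following is a minimal Python illustration, not a Drools API: it streams over rows once and keeps only running counters, so memory use does not grow with the data. The schema, column names and sample rows are assumptions made up for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical schema: column name -> parser that raises ValueError on bad input.
SCHEMA = {
    "id": int,
    "amount": float,
    "created_at": datetime.fromisoformat,
}

def health_check(rows, schema=SCHEMA):
    """Stream over rows (dicts) once, keeping only running counters."""
    total = 0
    invalid = 0
    for row in rows:
        total += 1
        try:
            for column, parse in schema.items():
                parse(row[column])
        except (KeyError, ValueError, TypeError):
            # Missing column or a value that does not parse as the expected type.
            invalid += 1
    return {"rows": total, "invalid_rows": invalid}

# Hypothetical sample; in practice this would be a file or a database cursor.
sample = io.StringIO(
    "id,amount,created_at\n"
    "1,9.50,2020-01-01T10:00:00\n"
    "2,not-a-number,2020-01-01T10:05:00\n"
    "x,3.25,2020-01-01T10:07:00\n"
)
report = health_check(csv.DictReader(sample))
print(report)
```

Because `health_check` accepts any iterable of dicts, the same loop could in principle be fed from a database cursor or from a file streamed out of object storage.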

Related

Report queue size using prometheus client_ruby client

I'm trying to export the size of an internal queue. I don't want to maintain a gauge counter using incr/decr. What I need is to retrieve the actual queue size at scrape time. Is that possible using the Prometheus Ruby client?
The short answer to your vague question would be probably yes. A look at the documentation shows:
Overview
a multi-dimensional data model with time series data identified by metric name and key/value pairs
a flexible query language to leverage this dimensionality
no reliance on distributed storage; single server nodes are autonomous
time series collection happens via a pull model over HTTP
pushing time series is supported via an intermediary gateway
targets are discovered via service discovery or static configuration
multiple modes of graphing and dashboarding support
Data Model
Prometheus fundamentally stores all data as time series: streams of
timestamped values belonging to the same metric and the same set of
labeled dimensions. Besides stored time series, Prometheus may
generate temporary derived time series as the result of queries.
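
Because collection is pull-based, a value can in principle be computed on demand at the moment the server scrapes the target. The sketch below illustrates that idea in plain Python. It does not use client_ruby, and `CallbackGauge` is a hypothetical class invented for the example; whether the Ruby client offers an equivalent hook should be checked against its own documentation.

```python
from collections import deque

class CallbackGauge:
    """Hypothetical gauge whose value is read from a function at scrape time."""

    def __init__(self, name, help_text, read_value):
        self.name = name
        self.help_text = help_text
        self.read_value = read_value

    def render(self):
        # Called by the scrape handler; formats one sample as exposition text.
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} gauge\n"
            f"{self.name} {float(self.read_value())}\n"
        )

queue = deque()
gauge = CallbackGauge(
    "internal_queue_size",
    "Current number of queued items.",
    lambda: len(queue),
)

queue.extend(["a", "b", "c"])
first_scrape = gauge.render()   # reads the queue length now
queue.popleft()
second_scrape = gauge.render()  # reads it again, no incr/decr bookkeeping
print(second_scrape)
```

The queue is never instrumented directly; the gauge only looks at it when a scrape asks for the value.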

Using ElasticSearch as a permanent storage

I am working on a project that produces a huge amount of data every day. The project has two functions: one stores the data in HBase for future analysis, and the other pushes the data into Elasticsearch for monitoring.
Because the data is huge, we currently have to store it on two platforms (HBase and Elasticsearch).
I have no experience with either of them. I want to know whether it is possible to use Elasticsearch instead of HBase as persistent storage for future analytics.
I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is only a search engine. Whether it is suitable depends on whether your data is critical or whether you can accept losing some of it, as with non-critical logs.
If you don't want to run an additional database for such a large volume of data, you can probably store it as files in something like HDFS.
You should also check Phoenix (https://phoenix.apache.org/), which may provide the monitoring features you are looking for.

Storing and processing timeseries with Hadoop

I would like to store a large amount of time series data from devices. These time series also have to be validated, can be modified by an operator, and have to be exported to other systems. Holes in the time series must be found. The time series must be shown in the UI, filtered by serial number and date range.
We have thought about using Hadoop, HBase, OpenTSDB and Spark for this scenario.
What do you think about it? Can Spark connect to OpenTSDB easily?
Thanks
OpenTSDB is really great for storing large amounts of time series data. Internally, it is built on HBase, which means it had to find a way around HBase's limitations in order to perform well. As a result, the representation of time series is highly optimized and not easy to decode. AFAIK, there is no out-of-the-box connector for fetching data from OpenTSDB into Spark.
The following GitHub project might provide you with some guidance:
Achak1987's connector
If you are looking for libs that would help you with time series, have a look at spark-ts - it contains useful functions for missing data imputation as well.
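
For the requirement that holes in the time series must be found, the core check is small regardless of which library is used. Here is a minimal sketch in plain Python (not the spark-ts API), assuming a fixed expected sampling interval; the function name and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval):
    """Return (gap_start, gap_end) pairs where consecutive samples are too far apart."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval:
            gaps.append((previous, current))
    return gaps

start = datetime(2016, 1, 1, 0, 0)
# Samples at minutes 0, 1, 2, 5, 6: minutes 3 and 4 are missing.
samples = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]
gaps = find_gaps(samples, expected_interval=timedelta(minutes=1))
print(gaps)
```

Each returned pair holds the last sample before the hole and the first sample after it, which is usually what an operator needs in order to fill or flag the range.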
Warp 10 offers the WarpScript language which can be used from Spark/Pig/Flink to manipulate time series and access data stored in Warp 10 via a Warp10InputFormat.
Warp 10 is Open Source and available at www.warp10.io
Disclaimer: I'm CTO of Cityzen Data, maker of Warp 10.
Take a look at Axibase Time Series Database which has a rather unique versioning feature to maintain a history of value changes for the same timestamp. Once enabled with per-metric granularity, the database keeps track of source, status and times of value modifications for audit trail or data reconciliation.
We have customers streaming data from Spark apps using Network API, typically once data is enriched with additional metadata (aka series tags) for downstream reporting.
You can query data from ATSD with REST API or SQL.
Disclaimer: I work for Axibase.

Is it possible to write multiple blobs in a single request?

We're planning to use Azure blob storage to save processing log data for later analysis. Our systems are generating roughly 2000 events per minute, and each "event" is a json document. Looking at the pricing for blob storage, the sheer number of write operations would cost us tons of money if we take each event and simply write it to a blob.
My question is: Is it possible to create multiple blobs in a single write operation, or should I instead plan to create blobs containing multiple event data items (for example, one blob for each minute's worth of data)?
It is possible, but it isn't good practice: merging multipart files takes a long time. That is why we are trying to separate the upload action from the entity persist operation, by passing the entity id and updating the doc[image] name in another controller.
It also keeps your upload functionality clean. Best wishes.
It's impossible to create multiple blobs in a single write operation.
One feasible solution is to create blobs containing multiple event data items as you planned (which is hard to implement and query in my opinion); another solution is to store the event data into Azure Storage Table rather than Blob, and leverage EntityGroupTransaction to write table entities in one batch (which is billed as one transaction).
Please note that all table entities in one batch must have the same partition key, which should be considered when you're designing your table (see Azure Storage Table Design Guide for further information). If some of your events have large data size that exceeds the size limitation of Azure Storage Table (1MB per entity, 4MB per batch), you can save data of those events to Blob and store the blob links in Azure Storage Table.
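
The same-partition-key constraint means events have to be grouped before they are written. The sketch below shows only that grouping step in plain Python and makes no Azure SDK calls. It assumes a cap of 100 entities per batch and does not check the 4MB payload limit; the event shape and the choice of the minute as partition key are hypothetical and not a design recommendation.

```python
from collections import defaultdict

MAX_ENTITIES_PER_BATCH = 100  # assumed per-batch entity cap

def plan_batches(events, partition_key_of, max_size=MAX_ENTITIES_PER_BATCH):
    """Group events by partition key, then split each group into batches."""
    by_partition = defaultdict(list)
    for event in events:
        by_partition[partition_key_of(event)].append(event)

    batches = []
    for key, group in by_partition.items():
        for i in range(0, len(group), max_size):
            batches.append((key, group[i:i + max_size]))
    return batches

# Hypothetical events: the minute of the timestamp serves as partition key.
events = [{"minute": "2016-05-01T10:00", "seq": n} for n in range(250)]
events += [{"minute": "2016-05-01T10:01", "seq": n} for n in range(30)]

batches = plan_batches(events, partition_key_of=lambda e: e["minute"])
print([(key, len(items)) for key, items in batches])
```

With these 280 sample events the plan comes out as four batches instead of 280 individual writes; each batch would then be submitted as one transaction.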

Lambda Architecture - Why batch layer

I am going through the lambda architecture and understanding how it can be used to build fault tolerant big data systems.
I am wondering how the batch layer is useful when everything could be stored in the realtime view and the results generated from it. Is it because the realtime storage can't hold all of the data, since it would then no longer be realtime, as the time taken to retrieve data depends on how much space the stored data takes up?
Why batch layer
To save Time and Money!
It basically has two functionalities,
To manage the master dataset (assumed to be immutable)
To pre-compute the batch views for ad-hoc querying
"Everything can be stored in realtime view and generate the results out of it" - NOT TRUE
The above is certainly possible, but not feasible: the data could run to hundreds or thousands of petabytes, and generating results could take a lot of time.
The key here is to attain low-latency queries over a large dataset. The batch layer creates batch views (precomputed so that queries are served with low latency), and the realtime layer handles recent or updated data, which is usually small. Any ad-hoc query can then be answered by merging results from the batch views and the realtime views instead of computing over the whole master dataset.
Also, think of the same query running again and again over a huge dataset: a loss of time and money!
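
The merge of batch and realtime views can be pictured with a toy example. This is a minimal Python sketch under made-up assumptions: a page-view counter, a batch view precomputed up to the last batch run, and a realtime view holding only the events that arrived afterwards.

```python
from collections import Counter

# Hypothetical precomputed batch view: page -> views up to the last batch run.
batch_view = Counter({"/home": 10_000, "/pricing": 2_500})

# Hypothetical realtime view: only events that arrived after that batch run.
realtime_view = Counter({"/home": 42, "/signup": 7})

def query(page):
    """Answer a query by merging both views instead of scanning the master dataset."""
    return batch_view[page] + realtime_view[page]

print(query("/home"))
```

The query touches two small lookups rather than the full history, which is where the low latency comes from.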
Further to the answer provided by @karthik manchala, data processing can be handled in three ways: batch, interactive, and real-time/streaming.
I believe your reference to real-time is more about interactive response than streaming, as not all use cases are streaming-related.
Interactive responses are those where the response can be expected anywhere from sub-second to a few seconds to minutes, depending on the use case. The key here is that processing is done on data at rest, i.e. already stored on a storage medium. The user interacts with the system during processing and hence waits for the response. All the efforts behind Hive on Tez, Impala, Spark core, etc. aim to address this issue and make responses as fast as possible.
Streaming, on the other hand, is where data streams into the system in real time (for example Twitter feeds or click streams) and processing needs to be done as soon as the data is generated. Frameworks like Storm and Spark Streaming address this space.
The case for batch processing is to address scenarios where some heavy lifting needs to be done on a huge dataset beforehand, so that the user is led to believe that the responses he sees are real-time. For example, indexing a huge collection of documents into Apache Solr is a batch job, where indexing may run for minutes or possibly hours depending on the dataset. However, a user who queries the Solr index gets a response with sub-second latency. As you can see, indexing cannot be achieved in real time, as there may be huge amounts of data. The same is the case with Google search, where indexing is done in batch mode and the results are presented in interactive mode.
All three modes of data processing are likely involved in any organisation grappling with data challenges. Lambda Architecture addresses this challenge effectively by using the same data sources for multiple data processing requirements.
You can check out the Kappa Architecture, where there is no separate batch layer.
Everything is analyzed in the stream layer. You can use Kafka, in the right configuration, as master dataset storage and save computed data in a database as your view.
If you want to recompute, you can start a new stream processing job, recompute your view from Kafka into your database, and replace your old view.
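
The recompute-and-replace step can be sketched in a few lines. This Python example makes no Kafka calls: a plain list stands in for the retained log, dicts stand in for the database views, and the event fields and aggregations are hypothetical.

```python
# Hypothetical append-only log standing in for a Kafka topic.
log = [
    {"device": "a", "value": 3},
    {"device": "b", "value": 5},
    {"device": "a", "value": 4},
]

def build_view(events, aggregate):
    """Replay the whole log from the beginning to compute a fresh view."""
    view = {}
    for event in events:
        key = event["device"]
        view[key] = aggregate(view.get(key), event["value"])
    return view

# Version 1 of the job: sum per device.
view = build_view(log, lambda acc, v: v if acc is None else acc + v)

# The logic changes to "max per device": replay into a new view, then swap.
new_view = build_view(log, lambda acc, v: v if acc is None else max(acc, v))
old_view, view = view, new_view
print(old_view, view)
```

The old view keeps serving queries until the replay has finished, and only then is it replaced.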
It is possible to use only the realtime view as the main storage for ad-hoc queries, but as already mentioned in other answers, if you have a lot of data it is faster to keep batch processing and stream processing separate instead of running batch jobs as stream jobs. It depends on the size of your data.
Also, it is cheaper to use storage like HDFS instead of a database for batch computing.
And the last point: in many cases you have different algorithms for batch and stream processing, so you need to do them separately. But basically it is possible to use only the "realtime view" as both your batch and stream layer, even without using Kafka as the master dataset. It depends on your use case.
