Big data testing approach - Hadoop

I am working on a big data project.
The basic flow of the project is as follows:
- data comes from the mainframe and is stored in Cornerstone 3.0
- the data is then ingested into Hive by a scheduler
- it is then stored in MapR-DB as key-value pairs by a MapReduce job (which runs Hive queries to compute specific aggregated attributes) and surfaced in the application through a REST API.
I want to test this application from Hive through to the REST API, assuming the data in Hive is loaded correctly.
What would be the best approach to test this application
(objectives to be tested: Hive data, Hive queries, MapR-DB performance, MapR-DB data, REST API)? And what are the best tools and technologies to use?
Thank you in advance.

What can be tested? This is largely spelled out by the requirements/question itself:
- Data coming from the mainframe and stored in Cornerstone 3.0: validate that the data lands in Cornerstone as expected (based on the requirements).
- Data ingested into Hive by the scheduler: verify that the Hive tables have the expected data, that the HDFS file locations are correct, and so on (as per the requirements; if any transformation happens during the Hive table load, you will be validating that as well).
- Data stored in MapR-DB as key-value pairs by the MapReduce job (running Hive queries to get specific aggregated attributes) and exposed to the application via the REST API: here you are basically testing the MapReduce job that loads/transforms data into MapR-DB. Run the job first, verify it runs end to end with no errors or warnings (note the execution time to gauge the job's performance; a sketch of this check follows), then validate the MapR-DB data, and finally test the REST API application and verify the expected results against the requirements.
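A minimal sketch of that run-and-time step, assuming the job is launched with hadoop jar (the jar, class, and argument names below are placeholders):

```python
# Hypothetical check: launch the MapReduce job, fail on a non-zero exit code,
# and record the wall-clock time as a basic performance measurement.
import subprocess
import time

cmd = ["hadoop", "jar", "aggregation-job.jar", "com.example.AggregationJob",
       "--output-table", "/tables/customer_aggregates"]   # placeholders

start = time.monotonic()
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.monotonic() - start
print(f"MapReduce job finished in {elapsed:.0f}s")

if result.returncode != 0:
    raise SystemExit(f"MapReduce job failed:\n{result.stderr}")
if "WARN" in result.stderr:
    print("Job produced warnings - review the log output")
```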
What are the best tools and technologies to use?
For Hive/HDFS/data validation I would create a shell script (covering Hive queries, HDFS file locations, log file validation, running the MapReduce job, validating the job output, etc.) that tests/verifies each step described above. Start with manual CLI commands first to get a feel for the checks before automating them; an illustrative version is sketched below.
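For illustration, here is roughly what those checks could look like driven from a small Python wrapper rather than plain shell (the database, table, and HDFS path names are made up; the same hive -e and hdfs dfs -test commands work directly in a shell script):

```python
# Hypothetical validation wrapper around the Hive and HDFS command-line tools.
# Replace mydb.customer_txn and the HDFS path with your actual objects.
import subprocess

def run(cmd):
    out = subprocess.run(cmd, capture_output=True, text=True)
    if out.returncode != 0:
        raise SystemExit(f"{' '.join(cmd)} failed:\n{out.stderr}")
    return out.stdout.strip()

# 1. The Hive table has rows after the scheduled ingest.
row_count = int(run(["hive", "-S", "-e", "SELECT COUNT(*) FROM mydb.customer_txn"]))
assert row_count > 0, "Hive table is empty after ingestion"

# 2. The expected HDFS location exists (hdfs dfs -test -e exits non-zero if it is missing).
run(["hdfs", "dfs", "-test", "-e", "/data/warehouse/mydb.db/customer_txn"])

# 3. Spot-check an aggregate that the MapReduce job should later push into MapR-DB.
total = run(["hive", "-S", "-e",
             "SELECT SUM(amount) FROM mydb.customer_txn WHERE txn_date = '2020-01-01'"])
print("Aggregate to compare against MapR-DB / the REST API:", total)
```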
For testing the REST API there are many tools available, e.g. ReadyAPI or Postman. I would include this step in the shell script too (using curl); an equivalent check is sketched after this paragraph.
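An equivalent of that curl check, sketched with the requests library (the endpoint, field name, and expected value are placeholders; in a shell script this would be a curl call plus a grep/jq comparison):

```python
# Hypothetical REST API check against the application fed from MapR-DB.
import requests

BASE_URL = "http://app-host:8080/api/v1/aggregates"   # placeholder endpoint

resp = requests.get(f"{BASE_URL}/customer/C-1001", timeout=30)
assert resp.status_code == 200, f"Unexpected HTTP status: {resp.status_code}"

body = resp.json()
# Compare the value served by the API with the aggregate computed from Hive
# (e.g. the total printed by the validation script above).
expected_total = "12345.67"   # placeholder
assert str(body["total_amount"]) == expected_total, body
print("REST API returned the expected aggregate")
```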

Related

Can a DataStage job be viewed without access to a DataStage installation

I am tasked with replacing an ETL process that used to run in DataStage. I have used DataStage in the past and would be able to review the jobs for replication if I could view them.
I have the extracted jobs in version control; is there a way to view a job without access to DataStage? (If needed, I could request new extracts.)
You could ask for a job report - that is a picture of the job, with the logic for each stage printed out, in the form of an HTML page. This might be enough to rebuild the job. There is no free DataStage fat client you could get access to instead.

How to plug a process for identifying sensitive information into an ETL pipeline?

Hope you are doing well!
We have already developed an ETL pipeline using Apache NiFi, which is triggered only when a client uploads a source data file from the portal. After that, the data in the source file goes through various layers, gets transformed, and is stored back into the warehouse (i.e. Hive).
Goal: identify sensitive information and mask it so that end users won't see the actual data.
Identifying sensitive data & masking strategy: we will use open source tools to achieve this goal, as follows.
Data Steward Studio (DSS): this tool allows me to identify sensitive information and tag it properly.
Apache Atlas: once the data steward has confirmed a tag, it is pushed into Apache Atlas.
Apache Ranger: finally, we can define a tag-based masking policy in Apache Ranger that allows or denies access for specific users.
For more details on the above solution, please see:
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem: in order to feed the data to the DSS tool, it must first be loaded into a Hive table. That is fine. But we cannot stop the existing ETL flow in the middle and then start the identification of sensitive information. The above solution requires some manual steps, which I want to get rid of and automate; that is, it should be plugged in somewhere within the NiFi pipeline. But so far, as I understand it, DSS does not allow us to do something like that.
Manual Process :
Create Asset collection
Accept/Reject suggested tags within DSS.
If we cannot plug the identification process into the pipeline, then the client's sensitive data will be exposed and visible to everyone on the team. I want something that de-identifies sensitive data before it actually gets loaded into HDFS or Hive tables.
Please respond if anyone has already worked in this particular area.
  
I did not test this, but here are my thoughts on the challenge:
1. Set up the system such that the data is NOT visible to everyone (or anyone) by default.
2. Load the data into Hive.
3. Let the profilers run and accept their suggestions.
4. Open up the data to those who should have access (except for the things found by the profiler).
There are still some implementation details to work out (e.g. how to automate steps 3 and 4, and whether you can solve this with tags alone or whether the data needs to sit in a staging area first), but I hope this steers you in a good direction.
One idea might be to use NiFi's EncryptContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). Then the values loaded into Hive are encrypted in the first place and are not visible to the stewards. Once the tagging has been done, the subsequent part of the pipeline (where I'm assuming you're using NiFi as well) can decrypt the content as required.
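Purely to illustrate that flow (EncryptContent itself is configured in the NiFi UI rather than coded), the encrypt-before-load / decrypt-after-tagging idea looks roughly like this; the record and field names are invented:

```python
# Conceptual sketch of encrypting a sensitive field before it is loaded into Hive
# and decrypting it downstream once tagging/masking policies are in place.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, manage this key outside the pipeline
cipher = Fernet(key)

record = {"customer_id": "C-1001", "ssn": "123-45-6789"}   # invented example record

# Before the Hive load: encrypt the (potentially) sensitive field.
record["ssn"] = cipher.encrypt(record["ssn"].encode()).decode()

# ... profiling and tagging happen on the encrypted values ...

# After tagging, downstream consumers that hold the key can decrypt as required.
original_ssn = cipher.decrypt(record["ssn"].encode()).decode()
print(original_ssn)
```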

Datadog data validation between on-prem and cloud database

I am very new to Datadog and need some help. I have crafted two SQL queries (one for the on-prem database and one for the cloud database), and I would like to run those queries through Datadog, display the query results, and validate that the daily results fall within an expected variance between the two systems.
I have already set up Datadog in the cloud environment and believe I should use DogStatsD to create a custom metric, but I am pretty lost as to how I can incorporate my SQL queries in the code to create the metric for eventual display on a dashboard. Any help will be greatly appreciated!
You probably want to use the MySQL integration and configure the 'custom queries' option: https://docs.datadoghq.com/integrations/faq/how-to-collect-metrics-from-custom-mysql-queries
You can follow those instructions after you configure the base integration (https://docs.datadoghq.com/integrations/mysql/#pagetitle), which will give you a lot of useful metrics in addition to the custom queries you want to run.
As you mentioned, DogStatsD is a library you can import into whatever script or application in order to submit metrics. But it really isn't common practice to modify the underlying code of your database, so it makes more sense to run a query against the database externally, take the results, and send them to Datadog. You could certainly write a Python script to do this; however, the Datadog Agent already has this capability built in, so it's probably easier to just use that.
I am also assuming that SQL refers to MySQL; there are other integrations for SQL Server, PostgreSQL, and pretty much every other implementation of SQL. The same pattern applies: configure the integration, then add an extra line to the config file where you have the check run your queries. (A sketch of the script approach is below, if you do want to go that way.)
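If you do go the script route rather than the Agent integration, a rough sketch with pymysql and the datadog Python library (DogStatsD) might look like the following; the connection details, query, and metric name are placeholders, and it assumes a Datadog Agent with DogStatsD listening on localhost:8125:

```python
# Hypothetical sketch: run a SQL query externally and submit the result to Datadog.
import pymysql
from datadog import initialize, statsd

# DogStatsD runs inside the Datadog Agent, by default on UDP port 8125.
initialize(statsd_host="localhost", statsd_port=8125)

conn = pymysql.connect(host="onprem-db.example.com", user="datadog",
                       password="secret", database="sales")   # placeholders
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders WHERE created_at >= CURDATE()")
        daily_count = cur.fetchone()[0]
finally:
    conn.close()

# Submit as a gauge; run the same script against the cloud database with a different
# tag, then compare the two series (or alert on their difference) in a dashboard.
statsd.gauge("daily.order_count", daily_count, tags=["source:onprem"])
```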

Which approach should I choose for testing my Spring Batch Job?

Currently I'm working on some integration tests for a Spring Batch application. The application reads from a SQL table, writes to another table, and, at the end, generates a report as a .txt file.
Initially I thought of just keeping another file with the expected output, comparing it with the report file, and checking the table contents.
(For some context, I'm not very experienced with Spring.)
But after reading some articles on Baeldung, I'm having doubts about my initial methodology.
Should I manipulate the table content in my code to make sure I have the expected input? Should I use the Spring Test framework tools? Without them, am I able to run the job from my test?
The correct approach for batch job integration testing is to test the job as a black box. If the job reads data from a table and writes to another table or a file, you can proceed as follows:
Put some test data in the input table (Given)
Run your job (When)
Assert on the output table/file (Then)
You can find more details in the End-To-End Testing of Batch Jobs section of the reference documentation. Spring Batch provides some test utilities that might help in testing your jobs (like mocking batch domain objects, asserting on file content, etc). Please refer to the org.springframework.batch.test package.

How to locate web elements with dynamic IDs in a cluster of servers using JMeter?

I am using JMeter to test the performance of the following server infrastructure. The code base uses the ICEfaces framework and hence generates dynamic IDs each time there is a new build.
I record the scripts and run them for different load levels (10 users, 20 users, 30 users, and so on). Whenever a new code base is deployed, the IDs change, so I have to re-record the scripts before I perform test runs again.
As of now I am able to get my job done satisfactorily.
I wish to take my job to a whole new level by trying to test performance on the following server infrastructure.
My issues are the following:
Because there are two different nodes (Node1 and Node2), each node has a unique set of dynamic IDs associated with it. When I record a script in a particular login session, I cannot be sure which node my session is pinned to, and as a result the recorded script is tailor-made for a single node rather than for the cluster.
When the load balancer comes into action, I cannot be sure which node JMeter hits during the performance run, and for obvious reasons the run fails to produce results.
I want a clever way to record scripts that can successfully run against a multi-server configuration.
How can I perform performance testing on this configuration?
