We're using yarn audit extensively in our daily workflow, but we've run into an issue: in a monorepo that uses yarn workspaces, yarn audit seems to scan the entire repo. Our current workaround is to run the audit, cache the result, and filter it by the path to the actual service we're auditing, but this is slow and counterproductive. Is there a way to limit the audit to the service actually being audited?
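For context, the workaround currently looks roughly like this (yarn v1; the jq filtering shown here is simplified, the real one filters by the path of the service being audited):

```bash
# Run the audit once for the whole repo and cache the newline-delimited JSON output.
yarn audit --json > /tmp/audit.json

# Keep only the advisory entries; the per-service filtering by path happens after this step.
jq -c 'select(.type == "auditAdvisory") | .data.advisory | {module_name, severity, title}' /tmp/audit.json
```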
Our horizontal scaling is currently suffering because of Liquibase.
We want our deployments to always start one pod that runs Liquibase (-Dspring.liquibase.enabled=true), with all subsequent pods not running it (-Dspring.liquibase.enabled=false).
Is there anything that Kubernetes offers which could do this out of the box?
I'm unfamiliar with Liquibase and unclear on how non-first Pods use it, but you may be able to use a lock to control access: a Pod that acquires the lock sets the property to true, and a Pod that cannot acquire it sets the property to false.
One challenge will be ensuring that the lock is released if the first Pod terminates, and understanding the consequences for the other Pods: is an existing Pod promoted?
Even though Kubernetes leverages etcd for its own distributed-locking purposes, users are encouraged to run separate etcd instances if they need locks. Since you have to choose, you may as well choose what you prefer, e.g. Redis or ZooKeeper.
You could use an init Container or sidecar for the locking mechanism and a shared volume to record its state.
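A rough sketch of that approach, assuming an init container that tries to take a lock and records the result on a shared emptyDir volume, which the application container then reads to set the Spring property (the image names and the acquire-lock.sh command are placeholders for whatever lock client you choose):

```yaml
# Illustrative only: image names and acquire-lock.sh are placeholders for your own lock client.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      volumes:
        - name: liquibase-state
          emptyDir: {}          # shared scratch space between init and app containers
      initContainers:
        - name: acquire-migration-lock
          image: my-lock-client:latest
          command: ["/bin/sh", "-c"]
          # Try to take the lock and record the outcome where the app container can read it.
          args:
            - if acquire-lock.sh; then echo true > /state/run-liquibase;
              else echo false > /state/run-liquibase; fi
          volumeMounts:
            - name: liquibase-state
              mountPath: /state
      containers:
        - name: app
          image: my-spring-app:latest
          command: ["/bin/sh", "-c"]
          # Feed the recorded decision to Spring via the system property mentioned in the question.
          args:
            - exec java -Dspring.liquibase.enabled=$(cat /state/run-liquibase) -jar /app.jar
          volumeMounts:
            - name: liquibase-state
              mountPath: /state
```

Releasing the lock when the migration finishes (or when the Pod dies) still has to be handled by whatever lock backend you pick, which is the challenge mentioned above.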
It feels as though Liquibase should be a distinct Deployment exposed as a Service that all Pods access.
Have you contacted Liquibase to see what it recommends for Kubernetes deployments?
In Hadoop 2 there's the Timeline Service used for YARN and also the MRv2 History Server, so what's the difference between them? What's the main purpose of a stand-alone Timeline Service?
As Hadoop shifted from MapReduce to other execution engines (Spark, Tez, Flink), the Job History Server, which was designed with MapReduce in mind, no longer worked. The Job History Server also only provides stats on completed jobs and has/had scalability issues. The Application Timeline Server (ATS) addressed these problems and introduced the concept of a flow, which is a series of jobs treated as one entity rather than as a set of individual jobs (as in MapReduce).
Anyway, that was the original intent. You can read more in the official docs and the JIRAs have several design documents included.
References
The YARN Timeline Server
The YARN Timeline Service v.2
YARN-5355: YARN Timeline Service v.2: alpha 2
YARN-2928: YARN Timeline Service v.2: alpha 1
YARN-1530: [Umbrella] Store, manage and serve per-framework application-timeline data (this one includes the original application timeline design proposal)
Is there any way to restrict users from executing spark-submit with the Spark deploy mode set to local? If I permit users to execute jobs in local mode, my YARN cluster will become underutilized.
I have configured YARN as the cluster manager to schedule Spark jobs.
I have checked the Spark configs but did not find any parameter that allows only a specific deploy mode. Users can override the default deploy mode when submitting Spark jobs to the cluster.
You can incentivize and facilitate using YARN by setting the spark.master key to yarn in your conf/spark-defaults.conf file. If your configuration already points to the proper master, users will deploy their jobs on YARN by default.
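For example, in conf/spark-defaults.conf (the deploy-mode line is optional and only sets the default mode used within YARN):

```properties
# conf/spark-defaults.conf
# Make YARN the default master; users can still override it explicitly at submit time.
spark.master              yarn
spark.submit.deployMode   cluster
```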
I'm not aware of any way to completely bar your users from using a given master, especially if it's under their control (as is the case for local). What you can do, if you control the Spark installation, is modify the existing spark-shell/spark-submit launch script to detect whether a user is trying to explicitly use local as a master and prevent that from happening. Alternatively, you could have your own script that checks for and prevents any local session from being opened and then runs spark-shell/spark-submit normally.
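A minimal sketch of such a wrapper, assuming the real launcher has been renamed to spark-submit.orig in the same directory (the pattern matching is deliberately simple, and users who control their own Spark install or set the master in code can still bypass it):

```bash
#!/usr/bin/env bash
# Best-effort wrapper around spark-submit that rejects explicit local masters.
# Assumes the real launcher was renamed to spark-submit.orig in the same directory.

# Block tokens like "local", "local[4]", "local[*]" or "--master=local[...]".
if printf '%s\n' "$@" | grep -qE '^(--master=)?local(\[[^]]*\])?$'; then
  echo "Submitting with a local master is not allowed on this cluster; use --master yarn." >&2
  exit 1
fi

exec "$(dirname "$0")/spark-submit.orig" "$@"
```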
I need to track what is happening when I run a job or upload a file to HDFS. I do this using SQL Profiler in SQL Server. However, I miss such a tool for Hadoop, so I am assuming that I can get some information from the logs. I think all logs are stored at /var/logs/hadoop/, but I am confused about which file I need to look at and how to set up that file to capture detailed-level information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services, and each has its own log. The location of the logs is distribution-dependent; see File Locations for Hortonworks, or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application type in YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MRApplication (the M/R component), which is a YARN application with its own log as well.
Jobs consist of tasks, and the tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs resulting from the jobs they submit (which log as any job does).
The good news is that vendor-specific distributions (Cloudera, Hortonworks, etc.) provide some specific UIs to expose the most common logs for easy access. Usually they expose the logs collected by the JobHistory service through the UI that shows job status and job history.
I cannot point you to an SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions, and vendor-specific distributions involved. I recommend starting by reading about and learning how the Job History Server runs and how it can be accessed.
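For example, assuming log aggregation is enabled (yarn.log-aggregation-enable=true), the aggregated logs of a finished application, and therefore of an M/R job, can be pulled from the command line; the application id below is a placeholder:

```bash
# List applications known to the ResourceManager, then fetch the aggregated logs of one of them.
yarn application -list -appStates FINISHED
yarn logs -applicationId application_1428487296152_25597   # placeholder id
```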
Is there a way to figure out which user ran a ‘select’ query against a Hive table? What time it was run?
More generically, which user accessed a HDFS directory?
HDFS has an audit log that will tell you which operations were run by which users. This is an old doc that shows how to enable audit logging, but it should still be relevant (a rough log4j sketch is at the end of this answer). For audit logging at the Hive level, though, you'll have to look at some cutting-edge tech.
Hortonworks acquired XASecure to implement security-level features on top of their platform. Cloudera acquired Gazzang to do the same thing. They have some level of audit logging (and authorization) for other services like Hive and HBase. They're also adding a lot more security-related features, but I'm not sure of the roadmap.
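As for the HDFS side, audit logging is typically driven by log4j on the NameNode. A rough sketch of the relevant log4j.properties entries (the appender name and file location are just the conventional defaults, so adjust to your distribution):

```properties
# Enable HDFS audit logging and route it to its own rolling file.
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false

log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```

Each audit line records fields like ugi= (the user), cmd= (the operation), and src= (the path), which covers the "which user accessed which HDFS directory" part of the question.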