Bosun adding external collectors - go

What is the procedure to define new external collectors in Bosun using scollector?
Can we write Python or shell scripts to collect data?

The documentation around this is not quite up to date. You can do it as described in the existing documentation, but we also support JSON output, which is better.
Either way, you write an executable script or binary and put it in the external collectors directory, inside a frequency directory.
If the frequency directory is 0, then the script is expected to be continuously running, and you put a sleep inside the code (this is my preferred method for external collectors). The script writes the telnet format, or the undocumented JSON format, to stdout. Scollector picks it up and queues that information for sending.
I created an issue to get this documented not long ago. Until one of us gets around to that, here is the PR that added JSON.

Adding to what Kyle said, you can take a look at some existing external collectors to see what they output. Here is one written in Java that one of our colleagues wrote to monitor JVM stuff. It uses the text format, which is simply:
metricname timestamp value tag1=foo tag2=bar
If you want to use the JSON format, here is an example from one of our collectors:
And you can also send metadata:
{"Metric":"exceptional.exceptions.count","Name":"desc","Value":"The number of exceptions thrown per second by applications and machines. Data is queried from multiple sources. See status instances for details on exceptions."}
Or send error messages to stderr:
2015/08/05 15:32:00 lookup OR-SQL03: no such host
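
To answer the Python part of the question, here is a minimal sketch of an external collector, assuming it lives in a non-zero frequency directory so scollector runs it repeatedly. It prints one metadata line (using the Metric/Name/Value keys shown above) and one data point in the text format. The metric name `example.tmpdir.entries`, the `source` tag, and the helper names are hypothetical, chosen only for illustration.

```python
#!/usr/bin/env python3
"""Sketch of an scollector external collector that runs once per invocation."""
import json
import os
import sys
import tempfile
import time


def format_datapoint(metric, timestamp, value, tags):
    """Build one text-format line: metricname timestamp value tag1=foo tag2=bar."""
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {int(timestamp)} {value} {tag_text}".rstrip()


def format_metadata(metric, name, value):
    """Build one metadata line with the Metric/Name/Value keys."""
    return json.dumps({"Metric": metric, "Name": name, "Value": value})


def collect():
    """Return (metric, value, tags) tuples. The metric is a hypothetical example."""
    entries = len(os.listdir(tempfile.gettempdir()))
    return [("example.tmpdir.entries", entries, {"source": "python"})]


def main():
    now = time.time()
    try:
        samples = collect()
    except OSError as err:
        # Errors go to stderr; stdout is reserved for metric output.
        print(f"collect failed: {err}", file=sys.stderr)
        return 1
    print(format_metadata("example.tmpdir.entries", "desc",
                          "Number of entries in the temp directory."))
    for metric, value, tags in samples:
        print(format_datapoint(metric, now, value, tags))
    sys.stdout.flush()
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

For a frequency directory of 0, the same `main()` body would go inside a `while True:` loop with a `time.sleep(...)` between iterations, as described above.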


Is there a different way to create variables that don't terminate after the program ends?

Right now, I am creating files to hold variables that outlive the program, but I'm curious whether there's a simpler way to create variables that don't terminate.
I find Redis invaluable for persisting data like this. It is a quick and lightweight installation and allows you to store many types of data:
strings, including complete JSONs and binary data like JPEG/PNG/TIFF images - also with TTL (Time-to-Live) so data can be expired when no longer needed
numbers, including atomic integers, floats
hashes (like Python dictionaries)
sets, and sorted (ordered) sets
streams, bitfields, geospatial data and esoteric HyperLogLogs
PUB/SUB is also possible, where one or more machines/processes publish items and multiple consumers, who have subscribed to that topic, receive the published items.
It can also perform very fast operations on your data for you, like set intersections and unions, getting lengths of lists, moving items between lists, atomically adding/subtracting from numbers and so on.
You can also use it to pass data between processes, sub-processes, shell scripts, parent and child, child and parent (!) scripts and so on.
In addition to all that, it is networked, so you can set variables on one computer and read/alter them from another - very simply. For example, you can PUSH jobs to a queue, potentially from multiple machines, and run workers on multiple machines that wait for jobs on the queue, process them and return results to another list.
There is a discussion of the things you can store here.
Example: Store a string, then retrieve it:
redis-cli SET name fred
name=$(redis-cli GET name)
Example: Increment views of page 2 by 10, and then retrieve the value from a different machine on the network (REDIS_HOST stands for the address of the Redis server):
redis-cli INCRBY views:page:2 10
views=$(redis-cli -h "$REDIS_HOST" GET views:page:2)
Example: Push a value onto a list:
redis-cli LPUSH shoppingList bananas
Example: Blocking wait for the next item in the list (a timeout of 0 waits indefinitely); use RPOP for non-blocking:
item=$(redis-cli BRPOP shoppingList 0)
Also, there are bindings for Python, C/C++, Java, Ruby, PHP etc. So you can "inject" dummy/test data into, or extract debug data from a running Python program using the redis-cli tool even on a different computer.
Use environment variables to store your data.
ABC="abc"; export ABC
The other question is how to make environment variables persistent after a reboot.
Depending on your shell, you may have a different file to persist the variables.
If using bash, run this command with the variable's latest value before rebooting:
echo 'export ABD="hello"' >> $HOME/.bashrc
I think this is a good time to be using an SQL Database. It's more scalable and functional than having a fileful of "persistent variables".
It may require a little more setup, and I admit it isn't "simpler" per se, but it will probably be worth it in the long run. You will be able to do more with your variables, and that may make your future scripts simpler.
I recommend going to YouTube and finding a simple tutorial on how to set up a local MySQL or MSSQL instance. There is a guy, Mike Dane, who makes really beginner-friendly tutorials. Try searching "GiraffeAcademy SQL Beginner" and see if that helps you.
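
To give a feel for the database approach without any server setup, here is a minimal sketch that swaps in SQLite (bundled with Python as the `sqlite3` module) for MySQL or MSSQL. The table name `variables` and the helper functions are hypothetical; the point is only that a value written by one run can be read back by a later run.

```python
import sqlite3


def open_store(path):
    """Open (or create) a small SQLite file holding name/value pairs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variables (name TEXT PRIMARY KEY, value TEXT)"
    )
    conn.commit()
    return conn


def set_var(conn, name, value):
    """Insert the variable, or overwrite it if the name already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO variables (name, value) VALUES (?, ?)",
        (name, str(value)),
    )
    conn.commit()


def get_var(conn, name, default=None):
    """Return the stored value as a string, or the default if it is missing."""
    row = conn.execute(
        "SELECT value FROM variables WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else default
```

A script would call `open_store("variables.db")` once at startup, then `set_var(conn, "name", "fred")` and `get_var(conn, "name")` wherever it currently reads or writes its own files. Values come back as strings, so numbers need converting on read.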

Parameterise Parm File Name in Informatica

I want to know how to (or whether I can) parameterize the parm file name in Informatica.
A little bit of background: I am building a standard map in Informatica, which business users can call directly after selecting the standard filters they want to apply in the map using a GUI.
The parm file name will be given by the business user, and all the filters that he/she selected will be in the parm file. The file will be dropped in the parm folder on the Informatica server.
This works well when only one user is using it at a time.
Also, I want to find out what I should do when multiple users are working in the GUI, generating parm files, and invoking the Informatica map. How do I get multiple instances of the same map running at the same time?
I hope I am making sense here....
You can achieve this by using concurrent execution of the workflow. Read about it and understand how you can implement it.
Once you know how to implement it, have the backend script/code behind the GUI assign an instance name to each call made through the GUI. Each instance name can have its own parameter file. (I believe there would be a finite set of combinations of variable values in your case.) You can use the command below to call individual instances, either through your GUI or from any other backend code:
pmcmd startworkflow -f %informatica_folder_name%
-paramfile %paramfilepathandname% -rin %instance_name% %workflow_name%
It might sound a bit confusing, but once you understand how concurrent workflows work, you can build on it based on the above input.
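
As a rough sketch of the backend code behind the GUI, the following builds one pmcmd call per user request, deriving the instance name from the parm file name. It assumes the `startworkflow` subcommand with `-f` for the folder, plus the `-paramfile` and `-rin` options from the command above. Connection options (service, domain, credentials) are deliberately left to the caller via `connection_args`, and the function names are hypothetical.

```python
import subprocess
from pathlib import PurePosixPath


def instance_name_for(param_file):
    """Derive a run instance name from the parm file name, e.g. user1.parm -> user1."""
    return PurePosixPath(param_file).stem


def build_pmcmd_args(folder, workflow, param_file, connection_args=()):
    """Return the argument list for one concurrent run of the workflow."""
    return [
        "pmcmd", "startworkflow",
        *connection_args,              # site-specific connection options
        "-f", folder,
        "-paramfile", param_file,
        "-rin", instance_name_for(param_file),
        workflow,
    ]


def start_instance(folder, workflow, param_file, connection_args=()):
    """Run pmcmd and return its exit code; requires pmcmd on the PATH."""
    args = build_pmcmd_args(folder, workflow, param_file, connection_args)
    return subprocess.run(args, check=False).returncode
```

Because each parm file maps to its own instance name, two users submitting different files at the same time produce two separate runs of the same workflow, provided concurrent execution is enabled for it.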
It will only be possible if you call Informatica from an external tool, not the Client tools. One way is described by @Utsav; the other is to use Informatica WSH to call a workflow, where you can indicate the parameter file you want to use with the workflow, as well as the desired instance name.
I think this guide to concurrent workflows may be what you are looking for:

Does SQL*Loader have any functionality that allows for customizing the log file?

I have been asked to create a system for allowing third party companies to dump data into several of our tables. These third parties provide csv files on a periodic basis, and after doing some research it seemed like Oracle themselves had a standard tool for doing so, "sqlldr". I've since gotten it working to an acceptable degree, and we have a job scheduled to run that script once a day.
But one of the third parties supplies really dirty data, of the sort where I can't expect it to always load every row/record (looking like up to about 8% will fail). My boss asked me to forward "all output" from the first few tests to him, and like a moron I also sent the log file.
He has asked that this "report" be modified to include those exceptions that aren't unique constraints along with the line in the input file that caused the exception.
This means that I need data from the log file, but also from the (I believe) reject file in a single document. Rather than write a convoluted shell script to combine those two, does SQL*Loader itself allow any customization that might achieve the same thing? I've read through the Oracle documentation and haven't found anything that suggests this, but I've also learned not to trust it entirely either.
Is this possible? Ideally, the solution would allow me to add values to the reject file that don't exist in the original input file, but I'm also interested in any customization of the log file or reject file.
I was going to stop there, but you can define the name of the log file, which might help with the issue. Most automation with SQL*Loader involves wrapping it in shell scripts, a.k.a. "roll your own."
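
If you do end up rolling your own, the wrapper does not have to be convoluted. Below is a minimal sketch that pairs each rejection in the log with the offending input line and drops unique-constraint violations. It rests on several assumptions: that rejections appear in the log as a `Record N: Rejected - ...` line followed by an `ORA-` line, that record N corresponds to line N of the input file (one record per line, no skipped header), and that unique-constraint errors carry the code ORA-00001. Check these against your own log before relying on it.

```python
import re

# Assumed shape of a rejection entry in the SQL*Loader log.
REJECT_RE = re.compile(r"^Record (\d+): Rejected - (.*)$")
UNIQUE_VIOLATION = "ORA-00001"


def parse_rejections(log_text):
    """Return (record_number, error_text) pairs found in the log text."""
    lines = [line.strip() for line in log_text.splitlines()]
    rejections = []
    for index, line in enumerate(lines):
        match = REJECT_RE.match(line)
        if not match:
            continue
        error = match.group(2)
        # Append the following ORA- message when there is one.
        if index + 1 < len(lines) and lines[index + 1].startswith("ORA-"):
            error = f"{error} {lines[index + 1]}"
        rejections.append((int(match.group(1)), error))
    return rejections


def build_report(log_text, input_lines, skip_code=UNIQUE_VIOLATION):
    """Return report lines for every rejection that is not a skip_code error."""
    report = []
    for record_number, error in parse_rejections(log_text):
        if skip_code in error:
            continue
        if 1 <= record_number <= len(input_lines):
            source = input_lines[record_number - 1].rstrip("\r\n")
        else:
            source = "<line not found>"
        report.append(f"line {record_number}: {error} | {source}")
    return report
```

A wrapper script would read the log file and the CSV after sqlldr finishes, call `build_report`, and mail or save the result.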

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2GB from what I can see) I am no longer able to write the file to disk and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.
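
The naming scheme is language-independent, so here is a short sketch of it in Python rather than Ruby, assuming each record is a dictionary with a `name` field. The function names and the default prefix length of 4 are hypothetical; the resulting relative paths are what you would hand to `PStore.new` on the Ruby side.

```python
from pathlib import PurePosixPath


def bucket_path(name, prefix_len=4, suffix="-datafile.pstore"):
    """Map a record name to a bucket file, e.g. 'rojjer' -> r/rojj-datafile.pstore."""
    prefix = name[:prefix_len]
    if not prefix:
        raise ValueError("name must not be empty")
    # First character is the directory, the full prefix names the file.
    return PurePosixPath(prefix[0]) / f"{prefix}{suffix}"


def group_by_bucket(records, key="name"):
    """Group records by the bucket file they belong to."""
    buckets = {}
    for record in records:
        buckets.setdefault(str(bucket_path(record[key])), []).append(record)
    return buckets
```

Grouping first means each bucket file is opened once per batch instead of once per record, which offsets some of the extra cost of having many files.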

Hadoop/Yarn/Spark can I call command line?

Short version of my question: We need to call the command line from within a Spark job. Is this feasible? The cluster support group indicated this could cause problems with memory.
Long Version: I have a job that I need to run on a Hadoop/MapR cluster processing packet data captured with tshark/wireshark. The data is binary packet data, one file per minute of capture. We need to extract certain fields from this packet data, such as IP addresses. We have investigated options such as jNetPcap, but this library is a bit limited. So it looks like we need to call the tshark command line from within the job and process the response. We can't do this directly during capture, as we need the capture to be as efficient as possible to avoid dropping packets. Converting the binary data to text outside the cluster is possible, but this is 95% of the work, so we may as well run the entire job as a non-distributed job on a single server. This limits the number of cores we can use.
Command line to decode is:
tshark -V -r somefile.pcap
tshark -T pdml -r somefile.pcap
Well, it is not impossible. Spark provides a pipe method which can be used to pipe data to an external process and read the output. The general structure could be something like this:
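import org.apache.spark.rdd.RDD
val files: RDD[String] = ??? // RDD of input file paths
// pipe writes each element to the command's stdin, one per line, and returns its stdout lines
val processed: RDD[String] = files.pipe("some Unix pipe")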
Still, from your description it looks like GNU Parallel could be a much better choice than Spark here.
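
To illustrate the non-Spark route, here is a minimal sketch that decodes several capture files concurrently on one machine. It uses a Python thread pool in place of GNU Parallel, and the `tshark -T pdml -r` command from the question. The function names and the default of 4 workers are hypothetical, and the whole decoded output of each file is held in memory, which may not suit large captures.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tshark_command(pcap_path):
    """Command line from the question that decodes one capture file to PDML."""
    return ["tshark", "-T", "pdml", "-r", pcap_path]


def decode(pcap_path, build_command=tshark_command):
    """Run the external command for one file and return its stdout as text."""
    result = subprocess.run(
        build_command(pcap_path), capture_output=True, text=True, check=True
    )
    return result.stdout


def decode_all(pcap_paths, workers=4, build_command=tshark_command):
    """Decode files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda path: decode(path, build_command), pcap_paths))
```

Threads are enough here because the heavy work happens in the tshark child processes, not in Python. The `build_command` parameter exists so the pool logic can be exercised with a stand-in command on a machine without tshark.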
