Parsing kinesis data using bash, jq, sed - bash

I'm hoping to walk through some Kinesis data using bash, with a command like:
aws kinesis get-records --shard-iterator <long shard info> | jq '[.|.Records[].Data]' | grep \"ey | sed -e 's/^[ \t]*\"//;s/[ \t]*\",$//'
I can get the base64 data from the stream. What I'm having issues with is piping this through base64 so I can see the actual data.
If I send it through using a combination of head -n and tail I can see individual values but any attempt to pass through more than 2-3 lines fails. Errors are typically one set of JSON values followed by garbage data. The whole command is typically preceded by
Invalid character in input stream.
To see the json values I use <long bash command from above> | xargs base64 -D
-- Caveat: Using bash on OSX

This works (assuming you've copied the base64 data to a file):
while IFS= read -r line; do printf '%s' "$line" | base64 -D && printf "\n"; done < <infile>
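The same loop can be wrapped in a small function that reads from a pipe instead of a file. This is only a sketch: the function name decode_lines and the two sample records are made up for illustration, and it assumes a base64 that accepts the long option --decode (GNU coreutils does, and the macOS tool lists it as an alias for -D).

```shell
#!/usr/bin/env bash
# decode_lines (hypothetical helper): read one base64 string per line on
# stdin and print each decoded record on its own line.
decode_lines() {
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue            # skip blank lines
    printf '%s' "$line" | base64 --decode # decode exactly one record
    printf '\n'                           # keep records separated
  done
}

# Two made-up records standing in for the Data fields of a Kinesis response.
printf '%s\n' 'eyJhIjoxfQ==' 'eyJiIjoyfQ==' | decode_lines
```

Decoding one line at a time keeps each record separate, so a problem in one record does not garble the ones after it.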

I have developed Kines, a friendly CLI for Amazon Kinesis Data Streams. It can be useful for debugging.
You can install it using pip.
pip install kines
Then you can run the kines walk command on a stream and shard to view the decoded data.
kines walk <stream-name> <shard-id>

Related

base64 encode and decode of aws command to retrieve some fields

The below code is producing the expected results with username.
es_eh="$(aws cloudtrail --region us-east-1 lookup-events --lookup-attributes AttributeKey=EventSource,AttributeValue=route53.amazonaws.com --max-items 50 --start-time "${start_date}" --end-time "${end_date}" --output json)"
for row in $(echo "${es_eh}" | jq -r '.Events[] | @base64'); do
echo "${row}" | base64 --decode | jq -r '.Username'
done
I didn't understand the purpose of doing base64 encode and then doing decode of the same string inside loop to retrieve username.
This is not working when I remove base64 encode and decode.
for row in $(echo "${es_eh}" | jq -r '.Events[]'); do
echo "${row}"| jq -r '.Username'
done
Without the encoding, the output of the first jq spans more than one line. The loop iterates over the whitespace-separated pieces and fails, as none of them is valid JSON on its own. With | @base64, each subobject is encoded to a single line, which base64 --decode then inflates back to a full JSON object.
To see the rows, try outputting $row before processing it.
When you use $( ) without quotes around it, the result gets split into "words", but the shell's definition of a "word" is almost never what you want (and certainly has nothing to do with the json entries you want it split into). This sort of thing is why you should almost never use unquoted expansions.
Converting the output entries to base64 makes them wordlike enough that shell word splitting actually does the right thing. But note: some base64 encoders split their output into lines, which would make each line be treated as a separate "word". If jq's base64 encoding did this, this code would fail catastrophically on large events.
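The word splitting described above can be observed without calling AWS at all. The following is a minimal sketch; the helper count_args and the sample row are invented for the demonstration, and it assumes the default IFS.

```shell
#!/usr/bin/env bash
# count_args (hypothetical helper): report how many words the shell passed in.
count_args() { echo "$#"; }

# A made-up event containing spaces, similar in shape to a CloudTrail entry.
row='{"Username": "alice", "EventName": "Change Record"}'

unquoted=$(count_args $row)     # unquoted expansion: split on whitespace
quoted=$(count_args "$row")     # quoted expansion: always a single word

# base64 output contains no whitespace once line wrapping is removed,
# so even an unquoted expansion stays in one piece.
encoded=$(printf '%s' "$row" | base64 | tr -d '\n')
encoded_words=$(count_args $encoded)

echo "unquoted: $unquoted, quoted: $quoted, base64: $encoded_words"
```

The unquoted JSON arrives as five separate words, while the quoted and the base64-encoded forms each arrive as one.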
Transforming the for loop into a while loop should fix the problem:
while read -r row; do
echo "${row}" | jq -r '.Username'
done < <(echo "${es_eh}" | jq -c -r '.Events[]')
Note that in the outer jq, I used the -c option to put each object on a single line.
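If only one field is needed, the loop may not be necessary at all, since jq can iterate over the events itself. A small sketch, assuming jq is installed; the sample JSON is a made-up stand-in for the lookup-events output and the function name list_usernames is hypothetical.

```shell
#!/usr/bin/env bash
# Made-up stand-in for the output of `aws cloudtrail lookup-events`.
es_eh='{"Events":[{"EventId":"1","Username":"alice"},{"EventId":"2","Username":"bob smith"}]}'

# list_usernames (hypothetical helper): print the Username of every event,
# one per line, without any shell-level loop or base64 round trip.
list_usernames() {
  jq -r '.Events[].Username'
}

printf '%s\n' "$es_eh" | list_usernames
```

The while loop remains the better fit when each event needs further processing in the shell.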

Extract a node value from json response in shell

I have the following json response
{"data":{"serverPort":0,"runId":7008,"runAction":false,"runStatus":"started"},"total":1,"success":true}
I want to retrieve the value of runStatus.
How can I do this using the grep command?
Mangling JSON with grep may not be the right way to go. To work with JSON on the command line, I would recommend to use the jq utility (https://stedolan.github.io/jq/):
$ echo '{"data":{"serverPort":0,"runId":7008,"runAction":false,"runStatus":"started"},"total":1,"success":true}' | jq '.data.runStatus'
"started"
Other solutions have been discussed here: Parsing JSON with Unix tools. For example, you could use python if that is available to you:
$ echo '{"data":{"serverPort":0,"runId":7008,"runAction":false,"runStatus":"started"},"total":1,"success":true}' | python3 -c "import sys, json; print(json.load(sys.stdin)['data']['runStatus'])"
started
However, extracting a nested value from JSON using just UNIX tools might be tricky.
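For completeness, here is roughly what a grep-only attempt looks like. It is a fragile sketch that assumes the compact formatting shown in the question (no spaces around the colon, a string value without escaped quotes, and only one runStatus key); the function name get_run_status is made up.

```shell
#!/usr/bin/env bash
response='{"data":{"serverPort":0,"runId":7008,"runAction":false,"runStatus":"started"},"total":1,"success":true}'

# get_run_status (hypothetical helper): pull the key/value pair out with
# grep -o, then take the fourth double-quote-delimited field, the value.
get_run_status() {
  grep -o '"runStatus":"[^"]*"' | cut -d '"' -f 4
}

printf '%s\n' "$response" | get_run_status
```

Any change in whitespace or nesting breaks this, which is why a real JSON parser such as jq is the safer choice.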

Parse a nested variable from YAML file in bash

A complex .yaml file from this link needs to be fed into a bash script that runs as part of an automation program running on an EC2 instance of Amazon Linux 2. Note that the .yaml file in the link above contains many objects, and that I need to extract one of the environment variables defined inside one of the many objects that are defined in the file.
Specifically, how can I extract the 192.168.0.0/16 value of the CALICO_IPV4POOL_CIDR variable into a bash variable?
- name: CALICO_IPV4POOL_CIDR
  value: "192.168.0.0/16"
I have read a lot of other postings and blog entries about parsing flatter, simpler .yaml files, but none of those other examples show how to extract a nested value like the value of CALICO_IPV4POOL_CIDR in this question.
As others are commenting, it is recommended to make use of yq (along with jq) if available.
Then please try the following:
value=$(yq -r 'recurse | select(.name? == "CALICO_IPV4POOL_CIDR") | .value' "calico.yaml")
echo "$value"
Output:
192.168.0.0/16
If you're able to install new dependencies, and are planning on dealing with lots of yaml files, yq is a wrapper around jq that can handle yaml. It'd allow a safe (non-grep) way of accessing nested yaml values.
Usage would look something like MY_VALUE=$(yq '.myValue.nested.value' < config-file.yaml)
Alternatively, How can I parse a YAML file from a Linux shell script? has a bash-only parser that you could use to get your value.
The right way to do this is to use a scripting language and a YAML parsing library to extract the field you're interested in.
Here's an example of how to do it in Python. If you were doing this for real you'd probably split it out into multiple functions and have better error reporting. This is literally just to illustrate some of the difficulties caused by the format of calico.yaml, which is several YAML documents concatenated together, not just one. You also have to loop over some of the lists internal to the document in order to extract the field you're interested in.
#!/usr/bin/env python3
import yaml
def foo():
    with open('/tmp/calico.yaml', 'r') as fil:
        docs = yaml.safe_load_all(fil)
        doc = None
        for candidate in docs:
            if candidate["kind"] == "DaemonSet":
                doc = candidate
                break
        else:
            raise ValueError("no YAML document of kind DaemonSet")
        l1 = doc["spec"]
        l2 = l1["template"]
        l3 = l2["spec"]
        l4 = l3["containers"]
        for containers_item in l4:
            l5 = containers_item["env"]
            env = l5
            for entry in env:
                if entry["name"] == "CALICO_IPV4POOL_CIDR":
                    return entry["value"]
        raise ValueError("no CALICO_IPV4POOL_CIDR entry")
print(foo())
However, sometimes you need a solution right now and shell scripts are very good at that.
If you're hitting an API endpoint, then the YAML will usually be pretty-printed so you can get away with extracting text in ways that won't work on arbitrary YAML.
Something like the following should be fairly robust:
cat </tmp/calico.yaml | grep -A1 CALICO_IPV4POOL_CIDR | grep value: | cut -d: -f2 | tr -d ' "'
Although it's worth checking at the end with a regex that the extracted value really is valid IPv4 CIDR notation.
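One possible shape for that final check is sketched below. The function name is_ipv4_cidr is made up, and the regex is just one reasonable reading of "valid IPv4 CIDR notation" (four octets in 0-255 and a prefix length in 0-32).

```shell
#!/usr/bin/env bash
# is_ipv4_cidr (hypothetical helper): succeed only if $1 looks like a.b.c.d/n
# with each octet in 0-255 and the prefix length in 0-32.
is_ipv4_cidr() {
  local octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
  local re="^${octet}(\.${octet}){3}/(3[0-2]|[12]?[0-9])$"
  [[ $1 =~ $re ]]
}

value='192.168.0.0/16'
if is_ipv4_cidr "$value"; then
  echo "looks like a CIDR block: $value"
else
  echo "unexpected value: $value" >&2
fi
```

This only checks the syntax; it does not verify that the host bits are zero for the given prefix length.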
The key thing here is grep -A1 CALICO_IPV4POOL_CIDR .
The two-element dictionary you mentioned (shown below) will always appear as one chunk since it's a subtree of the YAML document.
- name: CALICO_IPV4POOL_CIDR
  value: "192.168.0.0/16"
The keys in calico.yaml are not sorted alphabetically in general, but in {"name": <something>, "value": <something else>} constructions, name does consistently appear before value.
MYVAR=$(\
curl https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml | \
grep -A 1 CALICO_IPV4POOL_CIDR | \
grep value | \
cut -d ':' -f2 | \
tr -d ' "')
Replace curl https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml with however you're sourcing the file. That gets piped to grep -A 1 CALICO_IPV4POOL_CIDR, which gives you 2 lines of text: the name line and the value line. That gets piped to grep value, which leaves just the line holding the value. That gets piped to cut -d ':' -f2, which uses the colon as a delimiter and gives us the second field. Finally, tr -d ' "' strips the remaining spaces and double quotes. $(...) executes the enclosed pipeline, and its output is assigned to MYVAR. After this script, echo "$MYVAR" should produce 192.168.0.0/16.
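The pipeline can be tried offline against a small fragment before pointing it at the real file. In this sketch the sample text is a made-up excerpt shaped like the env list in calico.yaml, and extract_cidr is a hypothetical wrapper around the same four commands.

```shell
#!/usr/bin/env bash
# Made-up excerpt shaped like the env section of calico.yaml.
sample='            - name: CALICO_IPV4POOL_CIDR
              value: "192.168.0.0/16"
            - name: FELIX_IPV6SUPPORT
              value: "false"'

# extract_cidr (hypothetical helper): the same stages as above, reading stdin.
extract_cidr() {
  grep -A 1 CALICO_IPV4POOL_CIDR |  # the name line plus the line after it
    grep value |                    # keep only the value line
    cut -d ':' -f2 |                # text after the first colon
    tr -d ' "'                      # drop spaces and double quotes
}

MYVAR=$(printf '%s\n' "$sample" | extract_cidr)
echo "$MYVAR"
```

Note that the cut stage relies on the value itself containing no colon, which holds for an IPv4 CIDR but would not for an IPv6 one.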
You have two problems there:
How to read a YAML document from a file with multiple documents
How to select the key you want from that YAML document
I have guessed that you need the YAML document of kind 'DaemonSet' from reading Gregory Nisbett's answer.
I will try to only use tools that are likely to be already installed on your system because you mentioned you want to do this in a Bash script. I assume you have JQ because it is hard to do much in Bash without it!
For the YAML library I tend to use Ruby for this because:
Most systems have a Ruby
Ruby's Psych library has been bundled since Ruby 1.9
The PyYAML library in Python is a bit inflexible and sometimes broken compared to Ruby's in my experience
The YAML library in Perl is often not installed by default
It was suggested to use yq, but that won't help so much in this case because you still need a tool that can extract the YAML document.
Having extracted the document I am going to again use Ruby to save the file as JSON. Then we can use jq.
Extracting the YAML document
To get the YAML document using Ruby and save it as JSON:
url=...
curl -s "$url" | \
ruby -ryaml -rjson -e \
"puts YAML.load_stream(ARGF.read)
.select{|doc| doc['kind']=='DaemonSet'}[0].to_json" \
| jq . > calico.json
Further explanation:
The YAML.load_stream reads the YAML documents and returns them all as an Array
ARGF.read reads from a file passed via STDIN
Ruby's select allows easy selection of the YAML document according to its kind key
Then we take the first element ([0]) of the selection and convert it to JSON.
I pass that response through jq . so that it's formatted for human readability but that step isn't really necessary. I could do the same in Ruby but I'm guessing you want Ruby code kept to a minimum.
Selecting the key you want
To select the key you want the following JQ query can be used:
jq -r \
'.spec.template.spec.containers[].env[] | select(.name=="CALICO_IPV4POOL_CIDR") | .value' \
calico.json
Further explanation:
The first part spec.template.spec.containers[].env[] iterates for all containers and for all envs inside them
Then we select the Hash where the name key equals CALICO_IPV4POOL_CIDR and return the value
The -r removes the quotes around the string
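The query can be exercised on a tiny document before running it on the real calico.json. This is a sketch that assumes jq is installed; the JSON below is a heavily trimmed, made-up DaemonSet, and get_env_value is a hypothetical wrapper that passes the variable name in with --arg rather than hard-coding it.

```shell
#!/usr/bin/env bash
# Made-up, heavily trimmed DaemonSet document.
daemonset='{"kind":"DaemonSet","spec":{"template":{"spec":{"containers":[{"name":"calico-node","env":[{"name":"DATASTORE_TYPE","value":"kubernetes"},{"name":"CALICO_IPV4POOL_CIDR","value":"192.168.0.0/16"}]}]}}}}'

# get_env_value (hypothetical helper): print the value of the env entry
# whose name equals $1, searching every container in the pod template.
get_env_value() {
  jq -r --arg name "$1" \
    '.spec.template.spec.containers[].env[] | select(.name == $name) | .value'
}

printf '%s\n' "$daemonset" | get_env_value CALICO_IPV4POOL_CIDR
```

As with the original query, a container without an env list would make .env[] fail; writing .env[]? instead would skip such containers.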
Putting it all together:
#!/usr/bin/env bash
url='https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml'
curl -s "$url" | \
ruby -ryaml -rjson -e \
"puts YAML.load_stream(ARGF.read)
.select{|doc| doc['kind']=='DaemonSet'}[0].to_json" \
| jq . > calico.json
jq -r \
'.spec.template.spec.containers[].env[] | select(.name=="CALICO_IPV4POOL_CIDR") | .value' \
calico.json
Testing:
▶ bash test.sh
192.168.0.0/16

Parse JQ output through external bash function?

I want to parse data out of a log file that consists of JSON strings, and I wonder if there's a way to use a bash function to perform custom parsing instead of overloading the jq command.
Command:
tail errors.log --follow | jq --raw-output '. | [.server_name, .server_port, .request_file] | @tsv'
Outputs:
8.8.8.8 80 /var/www/domain.com/www/public
I want to parse 3rd column to cut the string to exclude /var/www/domain.com part where /var/www/domain.com is the document root, and /var/www/domain.com/subdomain/public is the public html section of the site. Therefore I would like to leave my output as /subdomain/public (or from the example /www/public).
I wonder if I can somehow inject a bash function to parse .request_file column? Or how would I do that using jq?
I'm having issues piping out the output of any part of this command that would allow me to do any sort of string manipulation.
Use a BashFAQ #1 while read loop to iterate over the lines, and a BashFAQ #100 parameter expansion to perform the desired modifications:
tail -f -- errors.log \
| jq --raw-output --unbuffered \
'[.server_name, .server_port, .request_file] | @tsv' \
| while IFS=$'\t' read -r server_name server_port request_file; do
printf '%s\t%s\t%s\n' "$server_name" "$server_port" "/${request_file#/var/www/*/}"
done
Note the use of --unbuffered, to force jq to flush its output lines immediately rather than buffering them. This has a performance penalty (so it's not default), but it ensures that you get output immediately when reading from a potentially-slow input source.
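The parameter expansion used inside the loop can be tested on its own, without tail or jq. A minimal sketch; the function name strip_docroot is made up, and the paths come from the question.

```shell
#!/usr/bin/env bash
# strip_docroot (hypothetical helper): remove the shortest leading match of
# /var/www/*/ (the document root) and put the leading slash back.
strip_docroot() {
  local request_file=$1
  printf '/%s\n' "${request_file#/var/www/*/}"
}

strip_docroot /var/www/domain.com/www/public        # prints /www/public
strip_docroot /var/www/domain.com/subdomain/public  # prints /subdomain/public
```

A single # removes the shortest match, so only the /var/www/<domain>/ prefix is dropped; ## would remove the longest match and strip everything up to the last slash.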
That said, it's also easy to remove a prefix in jq, so there's no particular reason to do the above:
tail -f -- errors.log | jq -r '
def withoutPrefix: sub("^([/][^/]+){3}"; "");
[.server_name, .server_port, (.request_file | withoutPrefix)] | @tsv'

using curl to call data, and grep to scrub output

I am attempting to call an API for a series of ID's, and then leverage those ID's in a bash script using curl, to query a machine for some information, and then scrub the data for only a select few things before it outputs this.
#!/bin/bash
url="http://<myserver:myport>/ws/v1/history/mapreduce/jobs"
for a in $(cat jobs.txt); do
content="$(curl "$url/$a/counters" "| grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+'" )"
echo "$content" >> output.txt
done
This is for a MapR project I am currently working on to peel some fields out of the API.
In the example above, I only care about 6 fields, though the output that comes from the curl command gives me about 30 fields and their values, many of which are irrelevant.
If I use the curl command in a standard prompt, I get the fields I am looking for, but when I add it to the script I get nothing.
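Close the quotes right after $url/$a/counters and remove the quotes that wrap the pipe and the grep command. Otherwise the whole | grep ... string is passed to curl as an argument instead of being run as a pipeline. Like the following:
content="$(curl "$url/$a/counters" | grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+')"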
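The filtering step can be checked without a server by feeding grep a canned response. In this sketch printf stands in for curl, the sample JSON is invented (the real counters output has more fields), the function name extract_counters is hypothetical, and only two of the six patterns are shown. It assumes a grep built with -P support, such as GNU grep on most Linux distributions.

```shell
#!/usr/bin/env bash
# Invented response standing in for the output of curl "$url/$a/counters".
sample='{"counter":[{"name":"FILE_BYTES_READ","totalCounterValue":2483},{"name":"HDFS_BYTES_READ","totalCounterValue":99},{"name":"GC_TIME_MILLIS","totalCounterValue":87}]}'

# extract_counters (hypothetical helper): keep only the counters of interest.
# The pipe sits outside any quotes, so the shell runs grep as its own command.
extract_counters() {
  grep -oP 'FILE_BYTES_READ[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+'
}

content="$(printf '%s\n' "$sample" | extract_counters)"
echo "$content"
```

Each match is printed on its own line, and counters that are not named in the pattern, such as HDFS_BYTES_READ here, are dropped.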

Resources