Trying to iterate operations on a file with awk and sed - bash

I've got a line that counts the number of times the word severity appears after the word vulnerabilities in a file
please don't laugh too hard:
cat <file> | sed '1,/vulnerabilities/d' | grep -c '"severity": 4'
This comes back with a count of "severity": 4 matches in the file, but I can't seem to iterate it across other files.
I have 100 or so files of the form bleeblah-082017, where bleeblah can be words of different lengths. I'm having trouble working out how to run the command above on each file and get a result per file.
I would usually have used an awk line to iterate through the list, but I can't seem to find any examples to meld awk and sed.
Would anyone have any ideas on how to perform the task above over many files and return a results per file?
Thanks
Davey
I have a file that has a bunch of entries such as:
{
"count": 6,
"plugin_family": "Misc.",
"plugin_id": 7467253,
"plugin_name": "Blah",
"severity": 4,
"severity_index": 1,
"vuln_index": 13
I'd like to extract the number of times "severity": 4 appears after the word vulnerabilities in each file. The output for this file would be 10.
Some more of the input file.
"notes": null,
"remediations": {
"num_cves": 20,
"num_hosts": 6,
"num_impacted_hosts": 2,
"num_remediated_cves": 6,
"remediations": [
{
"hosts": 2,
"remediation": "Apache HTTP Server httpOnly Cookie Information Disclosure: Upgrade to Apache version 2.0.65 / 2.2.22 or later.",
"value": "f950f3ddf554d7ea2bda868d54e2b639",
"vulns": 4
},
{
"hosts": 2,
"remediation": "Oracle Application Express (Apex) CVE-2012-1708: Upgrade Application Express to at least version 4.1.1.",
"value": "2c07a93fee3b201a9c380e59fa102ccc",
"vulns": 2
}
]
},
"vulnerabilities": [
{
"count": 6,
"plugin_family": "Misc.",
"plugin_id": 71049,
"plugin_name": "SSH Weak MAC Algorithms Enabled",
"severity": 1,
"severity_index": 0,
"vuln_index": 15
},
{
"count": 6,
"plugin_family": "Misc.",
"plugin_id": 70658,
"plugin_name": "SSH Server CBC Mode Ciphers Enabled",
"severity": 1,
"severity_index": 1,
"vuln_index": 13
},
{
"count": 2,
"plugin_family": "Web Servers",
"plugin_id": 64713,
"plugin_name": "Oracle Application Express (Apex) CVE-2012-1708",
"severity": 2,
"severity_index": 2,
"vuln_index": 12
},
Each of these files is from a vulnerability scan extracted via my scanner API. Essentially, the word severity appears all over the place in different sections (hosts, vulns, etc.). I want to extract from each scan file the number of times the pattern appears after the word vulnerabilities (which only appears once in each file). I'm open to using Perl, Python, or whatever to achieve this; I was just more familiar with shell scripting for manipulating these text-type files in the past.

Parsing .json data with sed or awk is fraught with potential pitfalls. I recommend using a format-aware tool like jq to query the data you want. In this case, you can do something like
jq '{(input_filename): [.vulnerabilities[] | select(.severity == 4)] | length}' *.json
This should produce output something like
{
"bleeblah-201708.json": 4
}
{
"bleeblah-201709.json": 11
}

Use jq for parsing JSON on the command line; it is the standard tool. Parsing JSON with text-based tools like sed is very fragile, since it relies on the order of elements and the formatting of the JSON documents, neither of which is guaranteed or part of the JSON standard.
What you are looking for is the following command:
jq '[.vulnerabilities[]|select(.severity==4)]|length' file.json
If you want to run it for multiple files, use find:
find FOLDER -name 'PATTERN.json' -print \
-exec jq '[.vulnerabilities[]|select(.severity==4)]|length' {} +
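If you would rather have the filename and the count on a single line per file, a small variation along the same lines (a sketch, assuming the files parse as JSON and match the glob) is:
jq -r '"\(input_filename): \([.vulnerabilities[] | select(.severity == 4)] | length)"' *.json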

I have made the following two example files, assuming that they can represent what you have. Note the occurrences of the search text before "vulnerabilities" and after it, with a different number of occurrences after it in each file.
From your code I assume that the search string appears at most once per line, so it is the matching lines that get counted.
blableh-082017:
"severity" : 4
"severity" : 4
vulnerabilities
"severity" : 4
"severity" : 4
bleeblah-082017:
"severity" : 4
"severity" : 4
vulnerabilities
"severity" : 4
"severity" : 4
"severity" : 4
Here is my proposal, using find in addition to sed and grep, also using sh to achieve the desired piping inside -exec.
find . -iname "*-082017" -print -exec sh -c "sed 1,/vulnerabilities/d {} | grep -c '\"severity\" : 4'" \;
Output (hoping a name line and a count line are OK, otherwise another sed could reformat it for you):
./blableh-082017
2
./bleeblah-082017
3
Details:
use find to process multiple files and get each file name into the output,
in spite of sed's lack of support for that
use basically your code to do the cutting via sed and the counting via grep
pass the filename to sed as a parameter, instead of piping it in from cat
use sh within -exec to achieve the piping (a variant with safer quoting is sketched after this list)
(answer by devnull to How to use pipe within -exec in find)
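As an aside, a variant that avoids embedding {} directly inside the sh -c string (generally considered safer with unusual filenames) could look like the sketch below; it follows the same idea but passes the filename to sh as a positional parameter:
find . -iname "*-082017" -print -exec sh -c 'sed "1,/vulnerabilities/d" "$1" | grep -c "\"severity\" : 4"' sh {} \;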
Environment:
GNU sed version 4.2.1
GNU bash, version 3.1.23(1)-release (i686-pc-msys)
GNU grep 2.5.4
find (GNU findutils) 4.4.2

Related

How to extract a value by searching for two words in different lines and getting the value of the second one

How to search for a word and, once it's found, save a specific value from the next line in a variable.
The JSON below is only a small part of the file.
Because this file's JSON structure is inconsistent and subject to change over time, it needs to be done with a text search using something like grep, sed, or awk.
However, the parameters below will always be the same:
search for the word next
get the next line below it
extract everything after page_token=, not including the " delimiter
store it in a variable to be used later
test.txt:
"link": [
{
"relation": "search",
"url": "aaa/ww/rrrrrrrrr/aaaaaaaaa/ffffffff/ccccccc/dddd/?token=gggggggg3444"
},
{
"relation": "next",
"url": "aaa/ww/rrrrrrrrr/aaaaaaaaa/ffffffff/ccccccc/dddd/?&_page_token=121_%_#212absa23bababa121212121212121"
},
]
so the desired output in this case is:
PAGE_TOKEN="121_%_#212absa23bababa121212121212121"
my attempt:
PAGE_TOKEN=$(cat test.txt| grep "next" | sed 's/^.*: *//;q')
no luck...
This might work for you (GNU sed):
sed -En '/next/{n;s/.*(page_token=)([^"]*).*/\U\1\E"\2"/p}' file
This is essentially a filtering operation, hence the use of the -n option.
Find a line containing next, fetch the next line, format as required and print the result.
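If the end goal is a shell variable holding just the token value (rather than the whole PAGE_TOKEN="..." line), a minimal sketch along the same lines, still assuming GNU sed, would be:
PAGE_TOKEN=$(sed -En '/next/{n;s/.*page_token=([^"]*).*/\1/p}' test.txt)
echo "$PAGE_TOKEN"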
Presuming your input is valid json, one option is to use:
cat test.json
[{
"relation": "search",
"url": "aaa/ww/rrrrrrrrr/aaaaaaaaa/ffffffff/ccccccc/dddd/?token=gggggggg3444"
},
{
"relation": "next",
"url": "aaa/ww/rrrrrrrrr/aaaaaaaaa/ffffffff/ccccccc/dddd/?&_page_token=121_%_#212absa23bababa121212121212121"
}
]
PAGE_TOKEN=$(cat test.json | jq -r '.[] | select(.relation=="next") | .url | gsub(".*=";"")')
echo "$PAGE_TOKEN"
121_%_#212absa23bababa121212121212121
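One caveat with gsub(".*=";"") is that it strips everything up to the last =, so it would also eat part of a token that happened to contain an = itself. A slightly more targeted sketch (same assumed test.json) anchors on the parameter name instead:
PAGE_TOKEN=$(jq -r '.[] | select(.relation=="next") | .url | capture("page_token=(?<token>.*)").token' test.json)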

grep multiple results randomly in bash

I'm making a query to a REST API, and this is the result I got:
{ "meta": { "query_time": 0.004266858, "pagination": { "offset": 0, "limit": 00, "total": 4 }, "powered_by": "device-api", "trace_id": "foo" }, "resources": [ "foo/bar", "foo/bar/2", "foo/bar/3", "foo/bar/4" ], "errors": [] }
I want to take only the results from resources, like this:
"resources": [
"foo/bar",
"foo/bar/2",
"foo/bar/3",
"foo/bar/4"
],
Can we share some knowledge? thanks a lot!
PS: these results from resources are random
Don't use grep or other regular expression tools to parse JSON. JSON is structured data and should be processed by a tool designed to read JSON. On the command line jq is a great tool for this purpose. There are many powerful JSON libraries written in other languages if jq isn't what you need.
Once you've extracted the data you care about, you can use the shuf utility to select random lines, e.g. shuf -n 5 would sample five random lines from the input.
With the JSON you've provided this appears to do what I think you want:
jq --raw-output '.resources[]' | shuf -n 2
You may need to tweak the jq syntax slightly if the real JSON has a different structure.
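For example, if the API response were first saved to a file (response.json here is just a placeholder name), usage might look like:
jq --raw-output '.resources[]' response.json | shuf -n 2
which would print two of the four foo/bar entries, chosen at random.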

jq Compare two files and output the difference in text format

I have 2 files
file_one.json
{
"releases": [
{
"name": "bpm",
"version": "1.1.5"
},
{
"name": "haproxy",
"version": "9.8.0"
},
{
"name": "test",
"version": "10"
}
]
}
and file_two.json
{
"releases": [
{
"name": "bpm",
"version": "1.1.6"
},
{
"name": "haproxy",
"version": "9.8.1"
},
{
"name": "test",
"version": "10"
}
]
}
In file 2 the versions were changed, and I need to echo the new changes.
I have used the following command to see the changes:
diff -C 2 <(jq -S . file_one.json) <(jq -S . file_two.json)
But then I need to format the output to something like this.
I need to output text:
The new versions are:
bpm 1.1.6
haproxy 9.8.1
You may be able to use the following jq command:
jq --slurp -r 'map(.releases) | add
| group_by(.name)
| map(unique | select(length > 1) | max_by(.version))
| map("\(.name) : \(.version)") | join("\n")'
file_one.json file_two.json
It first merges the two releases arrays, groups the elements by name, removes duplicates within each resulting array, drops the arrays left with a single element (the versions that were identical between the two files), maps each remaining array to its greatest element (by version), and finally formats those for display.
You can try it here.
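With the two sample files above, the output should be along the lines of:
bpm : 1.1.6
haproxy : 9.8.1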
A few particularities that might make this solution incorrect for your use:
it doesn't only report version upgrades but also version downgrades; it always returns the greatest version, regardless of which file contains it.
the version comparison is alphabetic. It's okay with your sample, but it can fail for multi-digit versions (e.g. 1.1.5 is considered greater than 1.1.20 because 5 > 2). This could be fixed (see the sketch just below) but might not be a problem depending on your versioning scheme.
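If numeric ordering does matter, one possible tweak (a sketch, untested against non-numeric version parts such as 1.1.5-rc1, which tonumber would reject) is to compare the version as a list of numbers, i.e. replace max_by(.version) with:
max_by(.version | split(".") | map(tonumber))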
Edit following your updated request in the comments: the following jq command will output the versions changed between the first file and the second. It nicely handles downgrades and somewhat handles products that have appeared or disappeared in the second file (although it always shows the change as version --> null, whether the product appeared or disappeared).
jq --slurp -r 'map(.releases) | add
| group_by(.name)
| map(select(.[0].version != .[1].version))
| map ("\(.[0].name) : \(.[0].version) --> \(.[1].version)")
| join("\n")' file_one.json file_two.json
You can try it here.
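Again with the sample files, this second command should print something like:
bpm : 1.1.5 --> 1.1.6
haproxy : 9.8.0 --> 9.8.1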

Generate json file with formatting

I have a curl command which generates JSON output. I want to add a few characters to the generated file to be able to process it further.
Command:
curl -sN --negotiate -u foo:bar "http://hostname/db/tbl_name/" >> db.json
This runs inside a for loop, once per db and tbl_name combination. Hence it ends up generating a number of JSON outputs (one for each table) concatenated together without any delimiter.
Output looks like :
{"columns":[{"name":"tbl_id","type":"varchar(50)"},{"name":"cret_timestmp","type":"timestamp"},{"name":"updt_timestmp","type":"timestamp"},{"name":"frst_nm","type":"varchar(50)"},{"name":"last_nm","type":"varchar(50)"},{"name":"acct_num","type":"varchar(15)"},{"name":"r_num","type":"varchar(15)"},{"name":"pid","type":"decimal(15,0)"},{"name":"ami_id","type":"varchar(30)"},{"name":"ssn","type":"varchar(9)"},{"name":"client_id","type":"varchar(30)"},{"name":"client_nm","type":"varchar(100)"},{"name":"info","type":"timestamp"},{"name":"rmx","type":"varchar(10)"},{"name":"id","type":"decimal(12,0)"},{"name":"ingest_timestamp","type":"string"},{"name":"incr_ingest_timestamp","type":"string"}],"database":"db_i","table":"db_tbl"}{"columns":[{"name":"key","type":"varchar(15)"},{"name":"foo_cd","type":"varchar(10)"},{"name":"foo_nm","type":"varchar(56)"},{"name":"tmc_regn_cd","type":"varchar(10)"},{"name":"tmc_mrkt_cd","type":"varchar(20)"},{"name":"mrkt_grp","type":"varchar(30)"},{"name":"ingest_timestamp","type":"string"},{"name":"incr_ingest_timestamp","type":"string"}],"database":"db_i","table":"ss_mv"}{"columns":[{"name":"bar_src_name","type":"string"},{"name":"bar_ent_name","type":"string"},{"name":"from_src","type":"string"},{"name":"reload","type":"string"},{"name":"column_mismatch","type":"string"},{"name":"xx_src_name","type":"string"},{"name":"xx_ent_name","type":"string"}],"database":"db_i","table":"test_table"}
The desired output should start and end with [ and ]. I also want to insert a "," between the end of one object and the beginning of the next (where its column list starts).
So for example, if the curl command runs against 3 tables as shown above, then the three generated JSONs should be created like:
[{json1},{json2},{json3}]
The numbers 1, 2, 3, etc. correspond to the different tables the curl command is run against (in the for loop) for a particular db; their JSON should all be created in one file, but in the desired format,
instead of what I'm currently getting:
{json1}{json2}{json3}
In the output pasted above, JSON 1 is:
{"columns":[{"name":"tbl_id","type":"varchar(50)"},{"name":"cret_timestmp","type":"timestamp"},{"name":"updt_timestmp","type":"timestamp"},{"name":"frst_nm","type":"varchar(50)"},{"name":"last_nm","type":"varchar(50)"},{"name":"acct_num","type":"varchar(15)"},{"name":"r_num","type":"varchar(15)"},{"name":"pid","type":"decimal(15,0)"},{"name":"ami_id","type":"varchar(30)"},{"name":"ssn","type":"varchar(9)"},{"name":"client_id","type":"varchar(30)"},{"name":"client_nm","type":"varchar(100)"},{"name":"info","type":"timestamp"},{"name":"rmx","type":"varchar(10)"},{"name":"id","type":"decimal(12,0)"},{"name":"ingest_timestamp","type":"string"},
{"name":"incr_ingest_timestamp","type":"string"}],"database":"db_i","table":"db_tbl"}
JSON 2 is:
{"columns":[{"name":"key","type":"varchar(15)"},{"name":"foo_cd","type":"varchar(10)"},{"name":"foo_nm","type":"varchar(56)"},{"name":"tmc_regn_cd","type":"varchar(10)"},{"name":"tmc_mrkt_cd","type":"varchar(20)"},{"name":"mrkt_grp","type":"varchar(30)"},{"name":"ingest_timestamp","type":"string"},{"name":"incr_ingest_timestamp","type":"string"}],"database":"db_i","table":"ss_mv"}
JSON 3 is:
{"columns":[{"name":"bar_src_name","type":"string"},{"name":"bar_ent_name","type":"string"},{"name":"from_src","type":"string"},{"name":"reload","type":"string"},{"name":"column_mismatch","type":"string"},{"name":"xx_src_name","type":"string"},{"name":"xx_ent_name","type":"string"}],"database":"db_i","table":"test_table"}
I hope the requirement is clear; thanks in advance. I'm looking to achieve this via bash.
Use jq -s.
--slurp/-s: Instead of running the filter for each JSON object in the input, read the entire input stream into a large array
and run the filter just once.
Here's an example:
$ cat file.json
{ "key": "value1" }
{ "key": "value2" }
{ "key":
"value3"}{"key": "value4"}
$ jq -s . < file.json
[
{
"key": "value1"
},
{
"key": "value2"
},
{
"key": "value3"
},
{
"key": "value4"
}
]
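Applied to the file from the question, this would presumably look something like the following (db.json being the file the curl loop appends to; the output filename is just a placeholder):
jq -s . db.json > db_wrapped.json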
I'm not sure if I got it correctly, but I think you are looking for something like
echo "[$(cat *.json | paste -sd ',')]" > result.json
This works by creating a string that starts with [ and ends with ], and in the middle, there are the contents of the json files concatenated (cat) and separated by commas (with the help of paste). That string is echoed and written to a new file.
Presuming input in valid JSONL format (one JSON document per line of input), you can embed a Python script inside your bash script:
slurpjson_py='
import json, sys
json.dump([json.loads(line.strip()) for line in sys.stdin], sys.stdout, indent=4)
sys.stdout.write("\n")
'
slurpjson() { python -c "$slurpjson_py" "$@"; }
If called as:
slurpjson <<EOF
{ "first": "document", "starting": "here" }
{ "second": "document", "ending": "here" }
EOF
...output is correctly:
[
{
"starting": "here",
"first": "document"
},
{
"second": "document",
"ending": "here"
}
]
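To run it against the accumulated file from the question, you could feed it on standard input, assuming that file really is one JSON document per line (the output name below is just a placeholder):
slurpjson < db.json > db_combined.json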
I managed to achieve this by running the curl command and adding a "," at every line break using
sed 's/$/,/'
And then removed the last "," and added the opening and closing [] using:
for i in *; do cat $i | sed '$ s/.$//' | awk '{print "["$0"]"}' > $json_dir/$i; done

Bash/*NIX: split a file into multiple files on a substring

Variants of this question have been asked and answered before, but I find that my sed/grep/awk skills are far too rudimentary to work from those to a custom solution since I hardly ever work in shell scripts.
I have a rather large (100K+ lines) text file in which each line defines a GeoJSON object, each such object including a property called "county" (there are, all told, 100 different counties). Here's a snippet:
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 4, "vDEM": 0, "vREP": 2, "vUNA": 2, "vTOT": 4}, "geometry": {"type":"Polygon","coordinates":[[[-79.537429,35.843303],[-79.542428,35.843303],[-79.542428,35.848302],[-79.537429,35.848302],[-79.537429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"NEW HANOVER", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.532429,35.843303],[-79.537428,35.843303],[-79.537428,35.848302],[-79.532429,35.848302],[-79.532429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.527429,35.843303],[-79.532428,35.843303],[-79.532428,35.848302],[-79.527429,35.848302],[-79.527429,35.843303]]]}},
I need to split this into 100 separate files, each containing one county's GeoJSONs, and each named xxxx_bins_2016.json (where xxxx is the county's name). I'd also like the final character (comma) at the end of each such file to go away.
I'm doing this in Mac OSX, if that matters. I hope to learn a lot by studying any solutions you could suggest, so if you feel like taking the time to explain the 'why' as well as the 'what' that would be fantastic. Thanks!
EDITED to make clear that there are different county names, some of them two-word names.
jq can kind of do this; it can group the input and output one line of text per group. The shell then takes care of writing each line to an appropriately named file. jq itself doesn't really have the ability to open files for writing that would allow you to do this in a single process.
jq -Rn -c '[inputs[:-1]|fromjson] | group_by(.properties.county)[]' tmp.json |
while IFS= read -r line; do
county=$(jq -r '.[0].properties.county' <<< "$line")
jq -r '.[]' <<< "$line" > "$county.txt"
done
[inputs[:-1]|fromjson] reads each line of your file as a string, strips the trailing comma, then parses the line as JSON and wraps the lines into a single array. The resulting array is sorted and grouped by county name, then written to standard output, one group per line.
The shell loop reads each line, extracts the county name from the first element of the group with a call to jq, then uses jq again to write each element of the group to the appropriate file, again one element per line.
(A quick look at https://github.com/stedolan/jq/issues doesn't appear to show any requests yet for an output function that would let you open and write to a file from inside a jq filter. I'm thinking of something like
jq -Rn '... | group_by(.properties.county) | output("\(.properties.county).txt")' tmp.json
without the need for the shell loop.)
If using string parsing rather than proper JSON parsing to extract the county name is acceptable - brittle in general, but would work in this simple case - consider Sam Tolton's GNU awk answer, which has the potential to be by far the simplest and fastest solution.
To complement chepner's excellent answer with a variation that focuses on performance:
jq -Rrn '[inputs[:-1]|fromjson][] | .properties.county + "|" + (.|tostring)' file |
awk -F'|' '{ print $2 > ($1 "_bins_2016.json") }'
Shell loops are avoided altogether, which should speed up the operation.
The general idea is:
Use jq to trim the trailing , from each input line, interpret the trimmed string as JSON, extract the county name, then output the trimmed JSON strings prepended with the county name and a distinct separator, |.
Use an awk command to split each line into the prepended county name and the trimmed JSON string, which allows awk to easily construct the output filename and write the JSON string to it.
Note: The awk command keeps all output files open until the script has finished, which means that, in your case, 100 output files will be open simultaneously - a number that shouldn't be a problem, however.
In cases where it is a problem, you can use the following variation, in which jq first sorts the lines by county name, which then allows awk to close the previous output file as soon as the next county is reached in the input:
jq -Rrn '
[inputs[:-1]|fromjson] | sort_by(.properties.county)[] |
.properties.county + "|" + (.|tostring)
' file |
awk -F'|' '
prevCounty != $1 { if (outFile) close(outFile); outFile = $1 "_bins_2016.json" }
{ print $2 > outFile; prevCounty = $1 }
'
A simpler version of chepner's answer:
while IFS= read -r line
do
countyName=$(jq --raw-output '.properties.county' <<<"${line: : -1}")
jq . <<< "${line: : -1}" >> "$countyName"_bins_2016.json
done<file
The idea is to filter the county name using a jq filter after stripping the , from each line of your input file. Then the line is passed to jq as plain stream to produce a JSON file in prettified format.
If you are on a relatively older version of bash (< 4.2, which lacks negative length offsets), use "${line%?}" instead of "${line: : -1}".
For example, with the change above, one of your county files becomes:
cat ALAMANCE_bins_2016.json
{
"type": "Feature",
"properties": {
"county": "ALAMANCE",
"vBLA": 0,
"vWHI": 0,
"vDEM": 0,
"vREP": 0,
"vUNA": 0,
"vTOT": 0
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-79.527429,
35.843303
],
[
-79.532428,
35.843303
],
[
-79.532428,
35.848302
],
[
-79.527429,
35.848302
],
[
-79.527429,
35.843303
]
]
]
}
}
Note: The current solution could be performance-intensive, as reading the file line by line is an expensive operation, and so is invoking jq once for every line.
This will do what you want minus getting rid of the last comma:-
gawk 'match($0, /"county":"([^"]+)/, array){ print > (array[1] "_bins_2016.json") }' INPUT_FILE
This will output files in the current path with filenames in the format COUNTY NAME_bins_2016.json.
The script goes line by line and uses a regex to match the exact term "county":" followed by 1 or more characters that aren't a ". It captures the characters within the quotes and then uses them as part of the filename to append the current line to.
To remove the trailing comma from all .json files in the current path you could use:-
sed -i '$ s/,$//' *.json
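Note that the question mentions Mac OS X; BSD sed there requires an (empty) backup-suffix argument for in-place editing, so the equivalent would presumably be:
sed -i '' '$ s/,$//' *.json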
If you were certain that the last char was always a comma, a faster solution would be to use truncate:-
truncate -s-1 *.json
Last part taken from this answer: https://stackoverflow.com/a/40568723/1453798
Here is a quickie script that will do the job. It has the virtue of working on most systems without having to install any other tools.
IFS=$'\n'
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt ) )
unset IFS
for i in "${!counties[@]}"
do
county="${counties[$i]}"
filename="$county".out.txt
echo "'$filename'"
grep "\"$county\"" counties.txt > "$filename"
done
The setting of IFS to \n allows the array elements to contain spaces. The sed command strips off all the text up to the start of the county name and all the text after it. The for loop is the form that allows iterating over the array. Finally, the grep command needs to have double quotes around the search string so that counties that are substrings of other counties don't accidentally get put into the wrong file.
See this section of the GNU BASH Reference Manual for more info.
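One refinement worth considering (a sketch, not tested against the full data set): the sed output repeats each county name once per input line, so deduplicating the list keeps the loop to roughly 100 iterations instead of 100K+, while the rest of the script stays the same:
IFS=$'\n'
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt | sort -u ) )
unset IFS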
