Splitting out a large file - bash

I would like to process a 200 GB file with lines like the following:
...
{"captureTime": "1534303617.738","ua": "..."}
...
The objective is to split this file into multiple files grouped by hours.
Here is my basic script:
#!/bin/sh
echo "Splitting files"
echo "Total lines"
sed -n '$=' $1
echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while read p; do
date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
echo $p >> split.$date
done <$1
Some facts:
80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.
Could you help me to optimize this bash script?
Thank you

This awk solution might come to your rescue:
awk -F'"' '{file=strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1
It essentially replaces your while-loop: with " as the field separator, the epoch timestamp ends up in $4, and strftime turns it into the hourly file name.
Furthermore, you can replace the complete script with:
# Start AWK file
BEGIN{ FS="\"" }
(NR==1){tmin=tmax=$4}
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file="split."strftime("%Y%m%d%H",$4); print >> file; close(file) }
END {
    print "Total lines processed: ", NR
    print "First date: " strftime("%Y%m%d%H", tmin)
    print "Last date: " strftime("%Y%m%d%H", tmax)
}
You can then run it as:
awk -f <awk_file.awk> <jq-file>
Note: the use of strftime means you need GNU awk.

You can start optimizing by replacing this
sed 's/{"captureTime": "//' | sed 's/","ua":.*//'
with this
sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
-n suppress automatic printing of pattern space
-E use extended regular expressions in the script
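For example, applied to the sample line from the question, the optimized expression prints only the timestamp (a quick check; assumes your sed supports -E):
echo '{"captureTime": "1534303617.738","ua": "..."}' | sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
1534303617.738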

Related

How to grab fields in inverted commas

I have a text file which contains the following lines:
"user","password_last_changed","expires_in"
"jeffrey","2021-09-21 12:54:26","90 days"
"root","2021-09-21 11:06:57","0 days"
How can I grab the two fields jeffrey and 90 days from between the inverted commas and save them in variables?
If awk is an option, you could read the result into an array and then save the elements as individual variables.
$ IFS="\"" read -ra var <<< $(awk -F, '/jeffrey/{ print $1, $NF }' input_file)
$ var2="${var[3]}"
$ echo "$var2"
90 days
$ var1="${var[1]}"
$ echo "$var1"
jeffrey
while read -r line; do # read in line by line
name=$(echo $line | awk -F, ' { print $1} ' | sed 's/"//g') # grab first col and strip "
expire=$(echo $line | awk -F, ' { print $3} '| sed 's/"//g') # grab third col and strip "
echo "$name" "$expire" # do your business
done < yourfile.txt
IFS=","
arr=( $(cat txt | head -2 | tail -1 | cut -d, -f 1,3 | tr -d '"') )
echo "${arr[0]}"
echo "${arr[1]}"
The result is stored in an array; you can access the elements by index.
Maybe the method below, using the sed and awk commands, will help you:
#!/bin/sh
username=$(sed -n '/jeffrey/p' demo.txt | awk -F',' '{print $1}')
echo "$username"
expires_in=$(sed -n '/jeffrey/p' demo.txt | awk -F',' '{print $3}')
echo "$expires_in"
Output :
jeffrey
90 days
Note:
The above method will work only if the username is distinct; as far as I know, usernames are not duplicated. If the same username can appear on several lines, see the sketch below.
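A sketch that loops over every matching line instead (assuming bash and the same demo.txt layout as above):
while IFS=',' read -r user _ expires; do
    echo "${user//\"/} ${expires//\"/}"
done < <(grep '"jeffrey"' demo.txt)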

Convert substring through command

Basically, how do I make a string substitution in which the substituted string is transformed by an external command?
For example, given the line 5aaecdab287c90c50da70455de03fd1e ./2015/01/26/GOPR0083.MP4, how do I pipe the second part of the line (./2015/01/26/GOPR0083.MP4) to the command xargs stat -c %.6Y and then replace it with the result, so that we end up with 5aaecdab287c90c50da70455de03fd1e 1422296624.010000?
This can be done with a script, however a one-liner would be nice.
#!/bin/bash
hashtime()
{
    while read longhex fname; do
        echo "$longhex $(stat -c %.6Y "$fname")"
    done
}
if [ $# -ne 1 ]; then
    echo Usage: ${0##*/} infile 1>&2
    exit 1
fi
hashtime < $1
exit 0
# one liner
awk 'BEGIN { args="stat -c %.6Y " } { printf "%s ", $1; cmd=args $2; system(cmd); }' infile
A one-liner using GNU sed, which will process the whole file (the e flag on the s command is a GNU extension that executes the resulting pattern space as a shell command and replaces it with the command's output):
sed -E "s/([[:xdigit:]]+) +(.*)/stat -c '\1 %.6Y' '\2'/e" file
or, using plain bash
while read -r hash pathname; do stat -c "$hash %.6Y" "$pathname"; done < file
It's typical to use awk, sed, or cut to reformat input. For example:
line="5aaecdab287c90c50da70455de03fd1e ./2015/01/26/GOPR0083.MP4"
echo "$line" |
cut -d' ' -f2- |
xargs stat -c %.6Y

Cutting string into different types of variables

Full script:
snapshot_details=`az snapshot show -n $snapshot_name -g $resource_group --query \[diskSizeGb,location,tags\] -o json`
echo $snapshot_details
IFS='",][' read -r -a array <<< $snapshot_details
echo ${array[@]}
IFS=' ' read -r -a array1 <<< ${array[@]}
echo ${array1[0]} #size
echo ${array1[1]} #location
How can I break this into 3 different variables:
a=5
b=eastus2
c={ "name": "20190912123307" "namespace": "aj-ssd" "pvc": "poc-ssd" }
and is there any easier way to parse c so that I can easily traverse over all the keys and values?
Output of the above script is:
[ 5, "eastus2", { "name": "20190912123307", "namespace": "ajain-ssd", "pvc": "azure-poc-ssd" } ]
5 eastus2 { name : 20190912123307 namespace : ajain-ssd pvc : azure-poc-ssd }
5
eastus2
A JSON parser, such as jq, should always be used when splitting out items from a JSON array in bash. Line-oriented tools (such as awk) are unable to correctly escape JSON -- if you had a value with a tab, newline, or literal quote, it would be emitted incorrectly.
Consider the following code, runnable exactly as-is even by people not having your az command:
snapshot_details_json='[ 5, "eastus2", { "name": "20190912123307", "namespace": "ajain-ssd", "pvc": "azure-poc-ssd" } ]'
{ read -r diskSizeGb && read -r location && read -r tags; } < <(jq -cr '.[]' <<<"$snapshot_details_json")
# show that we really got the content
echo "diskSizeGb=$diskSizeGb"
echo "location=$location"
echo "tags=$tags"
...which emits as output:
diskSizeGb=5
location=eastus2
tags={"name":"20190912123307","namespace":"ajain-ssd","pvc":"azure-poc-ssd"}
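To answer the follow-up about traversing the keys and values of c: jq can iterate the object directly. A minimal sketch reusing the tags variable captured above (variable names chosen here for illustration):
while IFS='=' read -r key value; do
    echo "key=$key value=$value"
done < <(jq -r 'to_entries[] | "\(.key)=\(.value)"' <<<"$tags")
...which emits as output:
key=name value=20190912123307
key=namespace value=ajain-ssd
key=pvc value=azure-poc-ssd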
Bash can do this with the awk command:
To extract the 5 :
awk -F " " '{ print $1 }'
To extract eastus2 :
awk -F "\"" '{ print $2 }'
To extract the last string :
awk -F "{" '{ print "{" $2 }'
To explain quickly:
awk -F " " '{ print $1 }'
-F sets the field delimiter; here we set space as the delimiter.
Then we ask awk to print the first field, i.e. everything before the first delimiter.
The slightly more complex one:
awk -F "{" '{ print "{" $2 }'
Here we set { as the delimiter. Since we wouldn't have the bracket with only printing $2, we're also manually re-printing the bracket (print "{" $2)
It will not be pretty in Bash, but this should work if your input format does not vary (no {, }, or spaces inside the key/value pairs):
S='5 "eastus2" { "name": "20190912123307" "namespace": "aj-ssd" "pvc": "poc-ssd" }'
a=`echo "$S" | awk '{print $1}'`
b=`echo "$S" | awk '{print $2}' | sed -e 's/\"//g'`
c=`echo "$S" | awk '{$1=$2=""; print $0}'`
echo "$a"
echo "$b"
echo "$c"
elems=`echo "$c" | sed -e 's/{//' | sed -e 's/}//' | sed -e 's/: //g'`
echo $elems
for e in $elems
do
    kv=`echo "$e" | sed -e 's/\"\"/ /' | sed -e 's/\"//g'`
    key=`echo "$kv" | awk '{print $1}'`
    value=`echo "$kv" | awk '{print $2}'`
    echo "key:$key; value:$value"
done
The idea in the iteration over key/value pairs is to:
(1) remove the space (and colon) between keys and corresponding value so that each key/value pair appears as one item.
(2) inside the loop, change the delimiter between keys and values (which is now "") to space and remove the double quotes (variable 'kv').
(3) extract the key/value as the first/second item of kv.
EDIT:
Avoid file name wildcard expansions.

How to properly parse this scenario in a simple bash script?

I have a file where each key-value pair takes a new line. There is a possibility of having multiple values for each key. I want to return a list of all pairs that have a "special key", where "special" is defined as some function.
For example, if "special" is defined as a key that somewhere has the value 100:
A 100
B 400
A hello
B world
C 100
I would return
A 100
A hello
C 100
How to do this in bash?
#!/bin/bash
special=100
awk -v s="$special" '
{
    a[$1,$2]
    if ($2 ~ s)
        k[$1]
}
END {
    for (key in k)
        for (pair in a) {
            split(pair, b, SUBSEP)
            if (b[1] == key)
                print b[1], b[2]
        }
}' ./infile
Proof of Concept
$ special=100; echo -e "A 100\nB 400\nA hello\nB world\nC 100" | awk -v s=$special '{a[$1,$2];if($2 ~ s)k[$1]}END{for(key in k)for(pair in a){split(pair,b,SUBSEP); if(b[1] == key)print b[1],b[2]}}'
A hello
A 100
C 100
This would also work:
id=`grep "\<$special\>$" yourfile | sed -e "s/$special//"`
[ -z "$id" ] || grep "^$id" yourfile
Returns:
If special=100
A 100
A hello
C 100
If special="hello"
A 100
A hello
If special="A"
(nothing)
If special="ello"
(nothing)
Notes
drop the \<\> if you want partial match
add | uniq at the end if there is a possibility of multiple occurrences of the same pair (A 100, A 100, ...) that you don't want in your output (see the sketch below)
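For instance, the duplicate-suppressing variant from the second note might look like this (an untested sketch; like the original, it relies on GNU grep treating the newlines inside $id as separating multiple patterns):
id=`grep "\<$special\>$" yourfile | sed -e "s/$special//"`
[ -z "$id" ] || grep "^$id" yourfile | uniq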
***** script *****
#!/bin/bash
grep " $1" data.txt | cut -d ' ' -f1 | grep -f /dev/fd/0 data.txt
result:
./test.sh 100
A 100
A hello
C 100
***** inline *****
The first grep must contain the 'special' value preceded by a space ' '; grep -f /dev/fd/0 then reads the matched keys from the pipe and uses them as its patterns:
grep " 100" data.txt | cut -d ' ' -f1 | grep -f /dev/fd/0 data.txt
A 100
A hello
C 100
awk -v special="100" '$2==special{a[$1]}($1 in a)' file
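Note that this single-pass awk only prints lines that appear at or after the first special entry for a given key. If the ordering cannot be relied on, a two-pass sketch (reading the file twice) handles any order:
awk -v special="100" 'NR==FNR{if($2==special)a[$1];next}($1 in a)' file file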
Whew! My bash was incredibly rusty! Hope this helps:
FILE=$1
IFS=$'\n' # Internal Field Separator, so as to avoid splitting on whitespace
FIND="100"
KEEP=""
for line in `cat $FILE`; do
    key=`echo $line | cut -d ' ' -f1`
    value=`echo $line | cut -d ' ' -f2`
    echo "$key = $value"
    if [ "$value" == "$FIND" ]; then
        KEEP="$key $KEEP"
    fi
done
echo "Keys to keep: $KEEP"
# You can now do whatever you want with those keys.

Bash: "xargs cat", adding newlines after each file

I'm using a few commands to cat a few files, like this:
cat somefile | grep example | awk -F '"' '{ print $2 }' | xargs cat
It nearly works, but my issue is that I'd like to add a newline after each file.
Can this be done in a one liner?
(surely I can create a new script or a function that does cat and then echo -n but I was wondering if this could be solved in another way)
cat somefile | grep example | awk -F '"' '{ print $2 }' | while read file; do cat $file; echo ""; done
Using GNU Parallel http://www.gnu.org/software/parallel/ it may be even faster (depending on your system):
cat somefile | grep example | awk -F '"' '{ print $2 }' | parallel "cat {}; echo"
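If the output order matters, GNU Parallel's -k (--keep-order) option should keep the results in input order, e.g.:
cat somefile | grep example | awk -F '"' '{ print $2 }' | parallel -k "cat {}; echo"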
awk -F '"' '/example/{ system("cat " $2 };printf "\n"}' somefile
