How to properly parse this scenario in a simple bash script? - bash

I have a file where each key-value pair takes a new line. There is a possibility of having multiple values for each key. I want to return a list of all pairs that have a "special" key, where "special" is defined by some function.
For example, if "special" is defined as a key that somewhere has a value of 100:
A 100
B 400
A hello
B world
C 100
I would return
A 100
A hello
C 100
How to do this in bash?

#!/bin/bash
special=100
awk -v s="$special" '
{
  a[$1,$2]              # remember every key/value pair seen
  if ($2 ~ s)           # regex match; use $2 == s for an exact match
    k[$1]               # mark this key as special
}
END {
  for (key in k)
    for (pair in a) {
      split(pair, b, SUBSEP)
      if (b[1] == key)
        print b[1], b[2]
    }
}' ./infile
Proof of Concept
$ special=100; echo -e "A 100\nB 400\nA hello\nB world\nC 100" | awk -v s="$special" '{a[$1,$2];if($2 ~ s)k[$1]}END{for(key in k)for(pair in a){split(pair,b,SUBSEP); if(b[1] == key)print b[1],b[2]}}'
A hello
A 100
C 100

This would also work:
id=`grep "\<$special\>$" yourfile | sed -e "s/$special//"`
[ -z "$id" ] || grep "^$id" yourfile
Returns:
If special=100
A 100
A hello
C 100
If special="hello"
A 100
A hello
If special="A"
(nothing)
If special="ello"
(nothing)
Notes
drop the \< \> if you want partial matches
add | uniq at the end if the same pair may occur several times in the file (A 100, A 100, ...) but you don't want duplicates in your output.

***** script *****
#!/bin/bash
grep " $1" data.txt | cut -d ' ' -f1 | grep -f /dev/fd/0 data.txt
result:
./test.sh 100
A 100
A hello
C 100
***** inline *****
the first grep pattern must contain the 'special' value preceded by a space ' '; grep -f /dev/fd/0 then reads the extracted keys from stdin as its patterns:
grep " 100" data.txt | cut -d ' ' -f1 | grep -f /dev/fd/0 data.txt
A 100
A hello
C 100

awk -v special="100" '$2==special{a[$1]}($1 in a)' file

Whew! My bash was incredibly rusty! Hope this helps:
FILE=$1
IFS=$'\n' # Internal Field Separator: split on newlines only, not on whitespace
FIND="100"
KEEP=""
for line in `cat "$FILE"`; do
key=`echo "$line" | cut -d ' ' -f1`;
value=`echo "$line" | cut -d ' ' -f2`;
echo "$key = $value"
if [ "$value" == "$FIND" ]; then
KEEP="$key $KEEP"
fi
done
echo "Keys to keep: $KEEP"
# You can now do whatever you want with those keys.

Related

In a shell script, how to print a line if the previous and the next line are blank

I have a file like
abc
1234567890
0987654321

cde

fgh

ijk
1234567890
0987654321
I need to write a script that extracts the lines that have a blank line before and after them; in the example the output should be:
cde
fgh
I guess that awk or sed could do the work but I wasn't able to make them work. Any help?
Here is the solution.
#!/bin/bash
amt=$(sed -n '$=' path-to-your-file)        # total number of lines
i=0
while :
do
  ((i++))
  if [ "$i" -eq "$amt" ]; then              # the last line cannot have a line after it
    break
  fi
  if [ "$i" -ne 1 ]; then                   # the first line has no line before it
    j=$((i - 1))
    emp=$(sed "$j"'!d' path-to-your-file)   # previous line
    if [ -z "$emp" ]; then
      j=$((i + 1))
      emp=$(sed "$j"'!d' path-to-your-file) # next line
      if [ -z "$emp" ]; then
        sed "$i"'!d' path-to-your-file >> extracted  # both neighbours blank: keep it
      fi
    fi
  fi
done
With awk:
awk '
BEGIN{
RS=""    # paragraph mode: records are separated by blank lines
FS="\n"  # each line of a record becomes a field
}
NF==1' file
Prints:
cde
fgh
very simple solution:
cat "myfile.txt" | grep -A 1 '^$' | grep -B 1 '^$' | grep -v -e '^--$' | grep -v '^$'
assuming "--" is the default group separator.
You may get rid of the group separator by other means, like the
--group-separator="" or --no-group-separator options,
but that depends on the grep variant (BSD, GNU, macOS...).

Accept filename as argument and calculate repeated words along with count

I need to find the number of repeated characters in a text file and need to pass the filename as an argument.
Example:
test.txt data contains
Zoom
Output should be like:
z 1
o 2
m 1
I need a command that will accept a filename as an argument and then list the character counts for that file. In my example I have a test.txt that contains the word "Zoom", so the output shows how many times each letter is repeated.
My attempt:
vi test.sh
#!/bin/bash
FILE="$1"                  # to pass filename as argument
sort file1.txt | uniq -c   # to count the number of letters
Just a guess?
cat test.txt |
tr '[:upper:]' '[:lower:]' |
fold -w 1 |
sort |
uniq -c |
awk '{print $2, $1}'
m 1
o 2
z 1
Suggesting awk script that count all kinds of chars:
awk '
BEGIN{FS = ""}                 # make each char a field
{
  for (i = 1; i <= NF; i++) {  # iterate over all fields in the line
    ++charsArr[$i];            # count each character occurrence in the array
  }
}
END {
  for (char in charsArr) {     # iterate over the chars array
    printf("%3d %s\n", charsArr[char], char);  # print occurrence count and the char
  }
}' input.1.txt | sort -n
Or in one line:
awk '{for(i=1;i<=NF;i++)++arr[$i]}END{for(char in arr)printf("%3d %s\n",arr[char],char)}' FS="" input.1.txt|sort -n
#!/bin/bash
#get the argument for further processing
inputfile="$1"
#check if file exists
if [ -f "$inputfile" ]
then
#convert file to a usable format
#convert all characters to lowercase
#put each character on a new line
#output to temporary file
cat "$inputfile" | tr '[:upper:]' '[:lower:]' | sed -e 's/\(.\)/\1\n/g' > tmp.txt
#loop over every character from a-z
for char in {a..z}
do
#count how many times a character occurs
count=$(grep -c "$char" tmp.txt)
#print if count > 0
if [ "$count" -gt "0" ]
then
echo "$char $count"
fi
done
rm tmp.txt
else
echo "file not found!"
exit 1
fi

Cutting string into different types of variables

Full script:
snapshot_details=`az snapshot show -n $snapshot_name -g $resource_group --query \[diskSizeGb,location,tags\] -o json`
echo $snapshot_details
IFS='",][' read -r -a array <<< $snapshot_details
echo ${array[@]}
IFS=' ' read -r -a array1 <<< ${array[@]}
echo ${array1[0]} #size
echo ${array1[1]} #location
How can I break this into 3 different variables:
a=5
b=eastus2
c={ "name": "20190912123307" "namespace": "aj-ssd" "pvc": "poc-ssd" }
and is there any easier way to parse c so that I can easy traverse over all the keys and values?
o/p of the above script is:
[ 5, "eastus2", { "name": "20190912123307", "namespace": "ajain-ssd", "pvc": "azure-poc-ssd" } ]
5 eastus2 { name : 20190912123307 namespace : ajain-ssd pvc : azure-poc-ssd }
5
eastus2
A JSON parser, such as jq, should always be used when splitting out items from a JSON array in bash. Line-oriented tools (such as awk) are unable to correctly escape JSON -- if you had a value with a tab, newline, or literal quote, it would be emitted incorrectly.
Consider the following code, runnable exactly as-is even by people not having your az command:
snapshot_details_json='[ 5, "eastus2", { "name": "20190912123307", "namespace": "ajain-ssd", "pvc": "azure-poc-ssd" } ]'
{ read -r diskSizeGb && read -r location && read -r tags; } < <(jq -cr '.[]' <<<"$snapshot_details_json")
# show that we really got the content
echo "diskSizeGb=$diskSizeGb"
echo "location=$location"
echo "tags=$tags"
...which emits as output:
diskSizeGb=5
location=eastus2
tags={"name":"20190912123307","namespace":"ajain-ssd","pvc":"azure-poc-ssd"}
Bash can do this with the awk command:
To extract the 5 :
awk -F " " '{ print $1 }'
To extract eastus2 :
awk -F "\"" '{ print $2 }'
To extract the last string :
awk -F "{" '{ print "{" $2 }'
To explain quickly
awk -F " " '{ print $1 }'
-F sets a delimiter; here we set space as the delimiter.
Then we ask awk to print the first field, i.e. everything before the first delimiter.
The slightly more complex one:
awk -F "{" '{ print "{" $2 }'
Here we set { as the delimiter. Since we wouldn't have the bracket with only printing $2, we're also manually re-printing the bracket (print "{" $2)
It will not be nice in Bash, but this should work if your input format does not vary (including no {, } or spaces inside the key/value pairs):
S='5 "eastus2" { "name": "20190912123307" "namespace": "aj-ssd" "pvc": "poc-ssd" }'
a=`echo "$S" | awk '{print $1}'`
b=`echo "$S" | awk '{print $2}' | sed -e 's/\"//g'`
c=`echo "$S" | awk '{$1=$2=""; print $0}'`
echo "$a"
echo "$b"
echo "$c"
elems=`echo "$c" | sed -e 's/{//' | sed -e 's/}//' | sed -e 's/: //g'`
echo $elems
for e in $elems
do
kv=`echo "$e" | sed -e 's/\"\"/ /' | sed -e 's/\"//g'`
key=`echo "$kv" | awk '{print $1}'`
value=`echo "$kv" | awk '{print $2}'`
echo "key:$key; value:$value"
done
The idea in the iteration over key/value pairs is to:
(1) remove the space (and colon) between keys and corresponding value so that each key/value pair appears as one item.
(2) inside the loop, change the delimiter between keys and values (which is now "") to space and remove the double quotes (variable 'kv').
(3) extract the key/value as the first/second item of kv.
EDIT:
Beware of filename wildcard expansion: the unquoted $elems (and $e) in the loop above undergo globbing as well as word splitting.

Splitting out a large file

I would like to process a 200 GB file with lines like the following:
...
{"captureTime": "1534303617.738","ua": "..."}
...
The objective is to split this file into multiple files grouped by hours.
Here is my basic script:
#!/bin/sh
echo "Splitting files"
echo "Total lines"
sed -n '$=' $1
echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while read p; do
date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
echo $p >> split.$date
done <$1
Some facts:
80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.
Could you help me to optimize this bash script?
Thank you
This awk solution might come to your rescue:
awk -F'"' '{file="split."strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1
It essentially replaces your while-loop.
Furthermore, you can replace the complete script with:
# Start AWK file
BEGIN{ FS="\"" }
(NR==1){tmin=tmax=$4}
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file="split."strftime("%Y%m%d%H",$4); print >> file; close(file) }
END {
print "Total lines processed: ", NR
print "First date: "strftime("%Y%m%d%H",tmin)
print "Last date: "strftime("%Y%m%d%H",tmax)
}
Which you then can run as:
awk -f <awk_file.awk> <jq-file>
Note: the usage of strftime indicates that you need to use GNU awk.
you can start optimizing by changing this
sed 's/{"captureTime": "//' | sed 's/","ua":.*//'
with this
sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
-n suppress automatic printing of pattern space
-E use extended regular expressions in the script

Get just the integer from wc in bash

Is there a way to get the integer that wc returns in bash?
Basically I want to write the line numbers and word counts to the screen after the file name.
output: filename linecount wordcount
Here is what I have so far:
files=`ls`
for f in $files;
do
if [ ! -d $f ] #only print out information about files !directories
then
# some way of getting the wc integers into shell variables and then printing them
echo "$f $lines $words"
fi
done
Most simple answer ever:
wc < filename
Just:
wc -l < file_name
will do the job. But this output includes prefixed whitespace as wc right-aligns the number.
You can use the cut command to get just the first word of wc's output (which is the line or word count):
lines=`wc -l $f | cut -f1 -d' '`
words=`wc -w $f | cut -f1 -d' '`
wc $file | awk '{print $4, $2, $1}'
Adjust as necessary for your layout.
It's also nicer to use positive logic ("is a file") over negative ("not a directory")
[ -f $file ] && wc $file | awk '{print $4, $2, $1}'
Sometimes wc outputs in different formats on different platforms. For example:
In OS X:
$ echo aa | wc -l
       1
In CentOS:
$ echo aa | wc -l
1
So using cut alone may not retrieve the number. Instead, try tr to delete space characters:
$ echo aa | wc -l | tr -d ' '
The accepted/popular answers do not work on OSX.
Any of the following should be portable on BSD and Linux.
wc -l < "$f" | tr -d ' '
OR
wc -l "$f" | tr -s ' ' | cut -d ' ' -f 2
OR
wc -l "$f" | awk '{print $1}'
If you redirect the file into wc (instead of passing the filename as an argument), it omits the filename in its output.
Bash:
read lines words characters <<< $(wc < filename)
or
read lines words characters <<EOF
$(wc < filename)
EOF
Instead of using for to iterate over the output of ls, do this:
for f in *
which will work if there are filenames that include spaces.
If you can't use globbing, you should pipe into a while read loop:
find ... | while read -r f
or use process substitution
while read -r f
do
something
done < <(find ...)
If the file is small you can afford calling wc twice, and use something like the following, which avoids piping into an extra process:
lines=$((`wc -l < "$f"`))
words=$((`wc -w < "$f"`))
The $((...)) is the Arithmetic Expansion of bash. It removes any whitespace from the output of wc in this case.
This solution makes more sense if you need either the linecount or the wordcount.
How about with sed?
wc -l /path/to/file.ext | sed 's/ *\([0-9]* \).*/\1/'
typeset -i a=$(wc -l fileName.dat | xargs echo | cut -d' ' -f1)
Try this for numeric result:
nlines=$( wc -l < $myfile )
Something like this may help:
#!/bin/bash
printf '%-10s %-10s %-10s\n' 'File' 'Lines' 'Words'
for fname in file_name_pattern*; {
[[ -d $fname ]] && continue
lines=0
words=()
while read -r line; do
((lines++))
words+=($line)
done < "$fname"
printf '%-10s %-10s %-10s\n' "$fname" "$lines" "${#words[@]}"
}
To (1) run wc once, and (2) not assign any superfluous variables, use
read lines words <<< $(wc < $f | awk '{ print $1, $2 }')
Full code:
for f in *
do
if [ ! -d $f ]
then
read lines words <<< $(wc < $f | awk '{ print $1, $2 }')
echo "$f $lines $words"
fi
done
Example output:
$ find . -maxdepth 1 -type f -exec wc {} \; # without formatting
1 2 27 ./CNAME
21 169 1065 ./LICENSE
33 130 961 ./README.md
86 215 2997 ./404.html
71 168 2579 ./index.html
21 21 478 ./sitemap.xml
$ # the above code
404.html 86 215
CNAME 1 2
index.html 71 168
LICENSE 21 169
README.md 33 130
sitemap.xml 21 21
The solutions proposed in the other answers don't work on Darwin kernels.
Please consider the following solutions, which work on all UNIX systems:
print exactly the number of lines of a file:
wc -l < file.txt | xargs
print exactly the number of characters of a file:
wc -m < file.txt | xargs
print exactly the number of bytes of a file:
wc -c < file.txt | xargs
print exactly the number of words of a file:
wc -w < file.txt | xargs
There is a great solution with examples in another Stack Overflow question.
I will copy the simplest solution here:
FOO="bar"
echo -n "$FOO" | wc -c | bc # "3"
Try this:
wc `ls` | awk '{ LINE += $1; WC += $2 } END { print "lines: " LINE " words: " WC }'
It initializes a line count and a word count (LINE and WC), increments them with the values extracted from wc's output ($1 being the first column, $2 the second), and finally prints the results.
"Basically I want to write the line numbers and word counts to the screen after the file name."
answer=(`wc $f`)
echo -e "${answer[3]}
lines: ${answer[0]}
words: ${answer[1]}
bytes: ${answer[2]}"
Outputs:
myfile.txt
lines: 10
words: 20
bytes: 120
files=`ls`
echo "$files" | wc -l | perl -pe "s#^\s+##"
You have to use input redirection for wc:
number_of_lines=$(wc -l <myfile.txt)
respectively in your context
echo "$f $(wc -l <"$f") $(wc -w <"$f")"
