Count number of lines under each header in a text file using bash shell script - bash

I can do this easily in python or some other high level language. What I am interested in is doing this with bash.
Here is the file format:
head-xyz
item1
item2
item3
head-abc
item8
item5
item6
item9
What I would like to do is print the following output:
head-xyz: 3
head-abc: 4
The header will have a specific pattern similar to the example I gave above. The items also have specific patterns, as in the example. I am only interested in the count of items under each header.

You can use awk:
awk '/head/{h=$0}{c[h]++}END{for(i in c)print i, c[i]-1}' input.file
Breakdown:
/head/{h=$0}
For every line matching /head/, set variable h to record the header.
{c[h]++}
For every line in the file, update the array c, which stores a map from header string to line count.
END{for(i in c)print i, c[i]-1}
At the end, loop through the keys in array c and print the key (header) followed by the value (count). Subtract one to avoid counting the header itself.
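For example, run against the sample input above, this might print (print i, c[i]-1 separates with a space, and for (i in c) makes no guarantee about ordering):
$ awk '/head/{h=$0}{c[h]++}END{for(i in c)print i, c[i]-1}' input.file
head-abc 4
head-xyz 3
If you want the exact header: count format from the question, a small variant such as print i": "c[i]-1 in the END block should do it.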

Note: Bash version 4 only (uses associative arrays)
#!/usr/bin/env bash
FILENAME="$1"
declare -A CNT
while read -r LINE || [[ -n $LINE ]]
do
    if [[ $LINE =~ ^head ]]; then HEADLINE="$LINE"; fi
    if [ ${CNT[$HEADLINE]+_} ]
    then
        CNT[$HEADLINE]=$(( ${CNT[$HEADLINE]} + 1 ))
    else
        CNT[$HEADLINE]=0
    fi
done < "$FILENAME"
for i in "${!CNT[@]}"; do echo "$i: ${CNT[$i]}"; done
Output:
$ bash countitems.sh input
head-abc: 4
head-xyz: 3
Does this answer your question, @powerrox?

If you don't consider sed a high-level language, here's another approach:
for file in head-*; do
echo "$file: \c"
sed -n '/^head-/,${
/^head-/d
/^item[0-9]/!q
p
}
' <$file | wc -l
done
In English, the sed script does
Don't print by default
Within lines matching /^head-/ to end of file
Delete the "head line"
After that, quit if you find a non-item line
Otherwise, print the line
And wc -l to count lines.

Related

Detect double new lines with bash script

I am attempting to return the line number of lines that have a break. An input example:
2938
383

3938
3

383
33333
But my script is not working and I can't see why. My script:
input="./input.txt"
declare -i count=0
while IFS= read -r line;
do
((count++))
if [ "$line" == $'\n\n' ]; then
echo "$count"
fi
done < "$input"
So I would expect, 3, 6 as output.
I just receive a blank response in the terminal when I execute. So there isn't a syntax error, something else is wrong with the approach I am taking. Bit stumped and grateful for any pointers..
Also "just use awk" doesn't help me. I need this structure for additional conditions (this is just a preliminary test) and I don't know awk syntax.
The issue is that "$line" == $'\n\n' can never match: read strips the trailing delimiter, so after consuming an empty line $line is simply the empty string. Instead, you can match an empty line with the regex pattern ^$:
if [[ "$line" =~ ^$ ]]; then
Now it should work.
It's also much easier with an awk command:
$ awk '$0 == ""{ print NR }' test.txt
3
6
As Roman suggested, each line read by read is terminated by a delimiter that read strips off, so that delimiter will never show up in the line the way you're testing for.
If the pattern you are searching for looks like an empty line (which I infer is how a "double newline" always manifests), then you can just test for that:
while read -r; do
((count++))
if [[ -z "$REPLY" ]]; then
echo "$count"
fi
done < "$input"
Note that IFS is for field-splitting data on lines, and since we're only interested in empty lines, IFS is moot.
Or if the file is small enough to fit in memory and you want something faster:
mapfile -t -O1 foo < "$input"
declare -p foo
for n in "${!foo[@]}"; do
    if [[ -z "${foo[$n]}" ]]; then
        echo "$n"
    fi
done
Reading the file all at once (mapfile) and then stepping through an array may be easier on resources than stepping through the file line by line. The -O1 makes the array indices start at 1, so the index n printed for an empty element matches its line number in the file.
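A quick way to see what -O1 does to the indices (a throwaway example, not tied to the question's data) should print something like:
$ printf 'a\n\nb\n' | { mapfile -t -O1 arr; declare -p arr; }
declare -a arr=([1]="a" [2]="" [3]="b")
Index 2 is the empty line, i.e. line 2 of the input.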
You can also just use GNU awk:
gawk -v RS= -F '\n' '{ print (i += NF); i += length(RT) - 1 }' input.txt
Alternatively, by using FS = ".+" you ensure that only the line numbers of truly empty lines (i.e. $0 == "") get printed, while rows consisting entirely of [[:space:]] characters are skipped:
echo '2938
383
3938
3
383
33333' |
{m,g,n}awk -F'.+' '!NF && $!NF = NR'
3
6
This sed one-liner should do the job at once:
sed -n '/^$/=' input.txt
Simply writes the current line number (the = command) if the line read is empty (the /^$/ matches the empty line).
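For the sample input (blank lines at 3 and 6) it should print exactly the expected line numbers:
$ sed -n '/^$/=' input.txt
3
6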

Iterate over a csv and change the values of a column that meets a condition

I have to use bash to iterate over a CSV file and replace the values of a column that meets a condition. Finally, the results have to be stored in an output file.
I have written this code, which reads the file and stores the content in an array. While iterating over the file, if the value at column 13 is equal to "NULL", then that value has to be replaced by "0". Once the file has been processed, the output with the replaced values is stored in file_b.
#!/bin/bash
file="./2022_Accidentalidad.csv"
while IFS=; read -ra array
do
if [[ ${array[13]} == "NULL" ]]; then
echo "${array[13]}" | sed -n 's/NULL/0/g'
fi
done < $file > file_b.csv
The problem is that file_b is empty. Nothing is written there.
How could I do this?
I cannot use AWK, and have to use either a FOR or a WHILE loop to iterate over the file.
Sample input:
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
2022S000001;01/01/2022;1:30:00;AVDA. ALBUFERA, 19;19;13;PUENTE DE VALLECAS;Alcance;Despejado;Turismo;Conductor;De 18 a 30 años;Mujer;NULL;NULL;443359,226;4472082,272;N;NULL
Expected output
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
2022S000001;01/01/2022;1:30:00;AVDA. ALBUFERA, 19;19;13;PUENTE DE VALLECAS;Alcance;Despejado;Turismo;Conductor;De 18 a 30 años;Mujer;0;NULL;443359,226;4472082,272;N;NULL
Thanks a lot in advance.
You don't need sed. Just replace ${array[13]} with 0, then print the entire array with ; separators between the fields.
( # in a subshell
IFS=';' # set IFS, that affects `read` and `"${array[*]}"`
while read -ra array
do
if [[ ${array[13]} == "NULL" ]]; then
array[13]=0
fi
echo "${array[*]}"
done
) < $file > file_b.csv
"${array[*]}" expands to all the array elements joined by the first character of $IFS, so the echo reproduces the line with ; separators.
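A quick illustration of that joining behaviour, with a throwaway three-element array:
IFS=';'
arr=(a b c)
echo "${arr[*]}"   # prints a;b;c  (joined with the first character of IFS)
echo "${arr[@]}"   # prints a b c  (each element is a separate argument to echo)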
When awk is also possible:
awk 'BEGIN{FS=OFS=";"} NR>1 && $14=="NULL"{$14=0} {print}' "$file" > file_b.csv
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
One idea using a regex and the BASH_REMATCH array:
regex='(([^;]*;){13})(NULL)(;.*)'
while read -r line
do
[[ "${line}" =~ $regex ]] &&
line="${BASH_REMATCH[1]}0${BASH_REMATCH[4]}"
# uncomment following line to display contents of BASH_REMATCH[] array
# declare -p BASH_REMATCH
echo "${line}"
done < file.csv > file_b.csv
This generates:
$ cat file_b.csv
num_expediente;fecha;hora;localizacion;numero;cod_distrito;distrito;tipo_accidente;estado_meteorológico;tipo_vehiculo;tipo_persona;rango_edad;sexo;cod_lesividad;lesividad;coordenada_x_utm;coordenada_y_utm;positiva_alcohol;positiva_droga
2022S000001;01/01/2022;1:30:00;AVDA. ALBUFERA, 19;19;13;PUENTE DE VALLECAS;Alcance;Despejado;Turismo;Conductor;De 18 a 30 años;Mujer;0;NULL;443359,226;4472082,272;N;NULL

How to filter text data in bash more efficiently

I have a data file which I need to filter with a bash script; see the data example:
name=pencils
name=apples
value=10
name=rocks
value=3
name=tables
value=6
name=beds
name=cups
value=89
I need to group name/value pairs like so: apples=10. If the current line starts with name and the next line also starts with name, the first line should be omitted entirely. So the result file should look like this:
apples=10
rocks=3
tables=6
cups=89
I came up with this simple solution, which works but is very slow; it takes 5 minutes to complete for a file with 2000 lines.
VALUES=$(cat input.txt)
for x in $VALUES; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done
I'm aware that this kind of task is not very suitable for bash, but the script is already written and this is just a small part of it.
How can I optimize this task in bash?
Do not run any commands in subshells; it slows your script down a lot. You can do everything in the current shell.
#! /bin/bash
while IFS== read -r k v ; do
if [[ $k == name ]] ; then
name=$v
elif [[ $k == value ]] ; then
printf '%s=%s\n' "$name" "$v"
fi
done
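The script reads standard input and writes standard output, so assuming you save it as filter.sh (a name picked here just for illustration), you would run it as:
bash filter.sh < input.txt > output.txt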
There are three easy optimizations you can make that will greatly speed up the script without requiring a major rethink.
1. Replace for with while read
Loading input.txt into a string, and then looping over that string with for x in $VALUES is slow. It requires the whole file to be read into memory even though this task could be done in a streaming fashion, reading a line at a time.
A common replacement for for line in $(cat file) is while read line; do ... done < file. It turns out that loops are compound commands, and like the normal one-line commands we're used to, compound commands can have < and > redirections. Redirecting a file into a loop means that for the duration of the loop, stdin comes from the file. So if you call read line inside the loop then it will read one line each iteration.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done < input.txt
2. Redirect output outside loop
It's not just input that can be redirected. We can do the same thing for the >> output.txt redirection. Here's where you'll see the biggest speedup. When >> output.txt is inside the loop output.txt must be opened and closed every iteration, which is crazy slow. Moving it to the outside means it only needs to be opened once. Much, much faster.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}"
fi
done < input.txt > output.txt
3. Shell string processing
One final improvement is to use faster string processing. Calling grep requires forking a subprocess every time just to do a simple string split. It'd be a lot faster if we could do the string splitting using just shell constructs. Well, as it happens that's easy now that we've switched to read. read can do more than read whole lines; it can also split on a delimiter from the variable $IFS (inter-field separator).
while IFS='=' read -r key value; do
case "$key" in
name) name="$value";;
value) echo "$name=$value";;
esac
done < input.txt > output.txt
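Run against the sample input, this final version should leave the expected result in output.txt:
$ cat output.txt
apples=10
rocks=3
tables=6
cups=89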
Further reading
BashFAQ/001 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
This explains why I use IFS= read -r in the first two versions of the loop.
BashFAQ/024 - I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?
cmd | while read; do ... done is another popular use of while read, but it has unique pitfalls.
BashFAQ/100 - How do I do string manipulations in bash?
More in-shell string processing options.
If you have performance issues do not use bash at all. Use a text processing tool like, for instance, awk:
$ awk -F= '$1 == "value" {print name "=" $2} {name = $2}' data.txt
apples=10
rocks=3
tables=6
cups=89
Explanation: -F= defines the field separator as character =. The first block is executed only if the first field of a line ($1) is equal to string value. It prints variable name followed by character = and the second field ($2). The second block is executed on each line and it stores the second field ($2) in variable name.
Normally, if your input resembles what you show, this automatically skips the first line. Otherwise, we can exclude it explicitly with a test on the NR variable, whose value is the line number, starting at 1:
awk -F= 'NR != 1 && $1 == "value" {print name "=" $2}
NR != 1 {name = $2}' data.txt
All this works on inputs like the one you show but not on inputs where you would have other types of lines or several value=... consecutive lines. If you really want to test that the name/value pair is on two consecutive lines we need something more. For instance, test if the first field is name and use another variable n to store the line number of the last encountered name=... line. With all these tests we can now put the 2 blocks in a slightly more intuitive order (but the opposite would work the same):
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}' data.txt
With awk there might be a more elegant solution but you can have:
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' inputs.txt
RS="\n?name=" says that the record separator is name=, optionally preceded by a newline
FS="\nvalue=" says that the field separator within each record is a newline followed by value=
if($2) says to only run the printf if the second field exists
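Note that a regular-expression RS like this needs GNU awk. Against the sample input it should print the same grouped pairs:
$ gawk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' input.txt
apples=10
rocks=3
tables=6
cups=89
(the last record keeps the file's final newline in $2, so you may also see a trailing blank line).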

adding numbers without grep -c option

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command that counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
so this command should just output the number 4.
Any help would be welcome, thank you.
A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
For a native bash solution:
c=0
while IFS=":" read -ra col; do
[[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.
Methods:
1.) Use some scripting language for counting, like awk or perl. An awk solution is already posted, so here is a Perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
Both examples print
4
2.) The second method is based on shell-pipelining
filter out the right rows
extract the column with the count
sum the numbers
For example:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
the grep selects the right rows
the cut extracts the last column
the paste joins the numbers into a line like 2+2, which can then be evaluated by
the bc, which does the arithmetic (intermediate results shown below)
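For the sample file the intermediate results of that pipeline would look roughly like this:
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5
2
2
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+
2+2
$ grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
4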
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
the sed does both the filtering and the extracting
Another example, a different way of counting:
(echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
The numbers are summed by the RPN calculator dc; it works like 0 2 +, where the values come first and the operation last (see the illustration below).
the first echo pushes 0 onto the stack
the sed creates a stream of numbers like 2+ 2+
the last echo p prints the top of the stack (the final sum)
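To make this concrete, this is roughly the stream dc receives for the sample file, followed by the result it prints:
0
2+
2+
p

$ (echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
4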
There are many other possibilities for counting a stream of numbers, e.g. counting in bash:
sum=0
while read -r num
do
    sum=$(( sum + num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
and pure bash:
while IFS=: read -r maker model year color count
do
if [[ "$maker" == "Ford" && "$color" == "Red" ]]
then
(( sum += $count ))
fi
done < carfile
echo $sum

how to read file from line x to the end of a file in bash

I would like to know how I can read each line of a CSV file, from the second line to the end of the file, in a bash script.
I know how to read a file in bash:
while read line
do
echo -e "$line\n"
done < file.csv
But, I want to read the file starting from the second line to the end of the file. How can I achieve this?
tail -n +2 file.csv
From the man page:
-n, --lines=N
output the last N lines, instead of the last 10
...
If the first character of N (the number of bytes or lines) is a '+',
print beginning with the Nth item from the start of each file, other-
wise, print the last N items in the file.
In English this means that:
tail -n 100 prints the last 100 lines
tail -n +100 prints all lines starting from line 100
Simple solution with sed:
sed -n '2,$p' <thefile
where 2 is the number of the line you wish to start reading from.
Or else (pure bash)...
{ for ((i=1;i--;));do read;done;while read line;do echo $line;done } < file.csv
Better written:
linesToSkip=1
{
for ((i=$linesToSkip;i--;)) ;do
read
done
while read line ;do
echo $line
done
} < file.csv
This works even if linesToSkip == 0 or linesToSkip exceeds the number of lines in file.csv.
Edit:
Changed () to {} as gniourf_gniourf encouraged me to consider: the first syntax spawns a subshell, while {} doesn't.
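A minimal demonstration of that difference: variable assignments made inside ( ) die with the subshell, while { } runs in the current shell:
count=0; ( count=5 );  echo "$count"   # prints 0: the subshell's copy is discarded
count=0; { count=5; }; echo "$count"   # prints 5: the group ran in the current shell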
Of course, for skipping only one line (as in the original question's title), the loop for ((i=1;i--;));do read;done could simply be replaced by a single read:
{ read;while read line;do echo $line;done } < file.csv
There are many solutions to this. One of my favorites is:
(head -n 1 > /dev/null; whatever_you_want_to_do) < file.txt
You can also use tail to skip the lines you want:
tail -n +2 file.txt | whatever_you_want_to_do
Depending on what you want to do with your lines: if you want to store each selected line in an array, the best choice is definitely the builtin mapfile:
numberoflinestoskip=1
mapfile -s $numberoflinestoskip -t linesarray < file
will store each line of file file, starting from line 2, in the array linesarray.
help mapfile for more info.
If you don't want to store each line in an array, well, there are other very good answers.
As F. Hauri suggests in a comment, this is only appropriate if you actually need to store the whole file in memory.
Otherwise, your best bet is:
{
read; # Just a scratch read to get rid (pun!) of the first line
while read line; do
echo "$line"
done
} < file.csv
Notice: there's no subshell involved/needed.
This will work
i=1
while read line
do
test $i -eq 1 && ((i=i+1)) && continue
echo -e "$line\n"
done < file.csv
I would just use a counter variable:
#!/bin/bash
i=0
while read line
do
if [ $i != 0 ]; then
echo -e $line
fi
i=$((i+1))
done < "file.csv"
UPDATE: The above checks the $i variable on every line of the CSV. So if you have a very large CSV file with millions of lines, it will burn a significant number of CPU cycles, which is no good for Mother Nature.
The following one-liner uses sed to delete the very first line of the CSV file and then feeds the remaining lines to a while loop.
sed 1d file.csv | while read d; do echo $d; done

Resources