How to exclude lines above a target - bash

I have read multiple posts about how to exclude lines around a grep match, but none of them addresses it definitively; most find other ways to massage the data, which does not solve similar issues with different data.
I have a file with repetitive output, a command run over and over. I want to trim out the 0-result blocks, because the zero line is the only constant value; the number of result hits is unknown.
The only unique string I can search for needs to have the 4 lines above it excluded, no matter what those lines contain, and I have not found any post with information generic enough to fit.
This is a conceptual question and there has to be a simple solution, but in case an example is needed:
Path/Path/Path> search
[results]
[results]
2 entries found
Path/Path/Path> search
[result]
1 entry found
Path/Path/Path> search
0 entry found

Try this:
# Assumption: the data is in logfile.txt
i=5                          # start outside the skip window so trailing lines are kept
tac logfile.txt |
while read -r line; do
    if [[ "${line:0:7}" == "0 entry" ]]; then
        i=0                  # matched the zero-result line: drop it and the 4 lines above it
        continue
    else
        ((i++))
        [[ $i -le 4 ]] && continue
    fi
    echo "$line"
done | tac
output:
Path/Path/Path> search
[results]
[results]
2 entries found
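The same idea can be written more compactly by letting awk do the counting (a sketch, not part of the original answer; it assumes the zero-result line starts with "0 entr"):
# Reverse the file, drop the "0 entry" line plus the next 4 lines
# (which are the 4 lines *above* it in the original order), then reverse back.
tac logfile.txt \
    | awk '/^0 entr/ { skip = 4; next } skip > 0 { skip--; next } { print }' \
    | tac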

Related

Bash: checking substring increments with modular arithmetic

I have a list of files whose names contain a 6-digit substring representing HHMMSS (HH: 2-digit hour, MM: 2-digit minutes, SS: 2-digit seconds).
If the list of files is ordered, the increments should be in steps of 30 minutes, that is, the first substring should be 000000, followed by 003000, 010000, 013000, ..., 233000.
I want to check that no file is missing by iterating over the list of files and checking that none of these substrings is missing. My approach:
string_check=000000
for file in "${file_list[@]}"; do
    if [[ ${file:22:6} == $string_check ]]; then
        echo "Ok"
    else
        echo "Problem: an hour (file) is missing"
        exit 99
    fi
    string_check=$((string_check+3000)) # this is the key line
done
The second-to-last line is the key. The result should be formatted to 6 digits, which I know how to do, but I want to add time like a clock, in other words, modular arithmetic modulo 60. How can that be done?
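For reference, the clock-style addition asked about here can be done directly with shell arithmetic. A minimal sketch (the helper name next_time is invented for illustration; the answer below sidesteps the arithmetic entirely by generating the expected strings instead):
# Add 30 minutes to a zero-padded HHMMSS string (seconds assumed to be 00).
next_time() {
    local hh=${1:0:2} mm=${1:2:2}
    mm=$(( 10#$mm + 30 ))              # 10# forces base 10 so "08"/"09" don't parse as octal
    if (( mm >= 60 )); then
        mm=$(( mm - 60 ))
        hh=$(( (10#$hh + 1) % 24 ))    # carry into hours, wrapping past 23
    fi
    printf '%02d%02d00\n' "$hh" "$mm"
}

next_time 013000    # prints 020000
next_time 233000    # prints 000000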
Assumptions:
all 6-digit strings are of the format xx[03]000 (i.e., minutes must be exactly 00 or 30, and seconds 00)
if there are strings like xx1529 ... these will be ignored (see the 2nd half of the answer, the use of comm, which addresses OP's comment about these types of strings being an error)
Instead of trying to do a bunch of mod 60 math for the MM (minutes) portion of the string, we can use a sequence generator to generate all the desired strings:
$ for string_check in {00..23}{00,30}00; do echo $string_check; done
000000
003000
010000
013000
... snip ...
230000
233000
While OP should be able to add this to the current code, we might go one step further and pre-parse all of the filenames, pulling the 6-digit strings into an associative array (i.e., the 6-digit strings act as the indexes), e.g.:
unset myarray
declare -A myarray
for file in "${file_list[@]}"
do
    myarray[${file:22:6}]+=" ${file}"   # in case multiple files have the same 6-digit string
done
Using the sequence generator as the driver of our logic, we can pull this together as follows:
for string_check in {00..23}{00,30}00
do
    [[ -z "${myarray[${string_check}]}" ]] &&
        echo "Problem: (file) '${string_check}' is missing"
done
NOTE: OP can decide if the process should finish checking all strings or if it should exit on the first missing string (per OP's current code).
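For example, a stop-on-first-missing variant (a sketch reusing the myarray built above, mirroring the exit 99 in OP's code) could look like:
for string_check in {00..23}{00,30}00
do
    if [[ -z "${myarray[${string_check}]}" ]]; then
        echo "Problem: an hour (file) '${string_check}' is missing"
        exit 99
    fi
done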
One idea for using comm to compare the 2 lists of strings:
# display sequence-generated strings that do not exist in the array:
comm -23 <(printf "%s\n" {00..23}{00,30}00) <(printf "%s\n" "${!myarray[@]}" | sort)

# OP has commented that strings not of the form 'xx[03]000' should generate an error;
# display strings (extracted from file names) that do not exist in the sequence:
comm -13 <(printf "%s\n" {00..23}{00,30}00) <(printf "%s\n" "${!myarray[@]}" | sort)
Where:
comm -23 - display only the lines from the first 'file' that do not exist in the second 'file' (i.e., sequences of the format xx[03]000 that have no matching file)
comm -13 - display only the lines from the second 'file' that do not exist in the first 'file' (i.e., filenames with strings not of the format xx[03]000)
These lists could then be used as input to a loop, or passed to xargs, for additional processing as needed; keeping in mind the comm -13 output will display the indices of the array, while the associated contents of the array will contain the name of the original file(s) from which the 6-digit string was derived.
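Those lists can be consumed directly; for instance (a sketch, assuming the same myarray as above), the missing slots could drive a loop:
# Process each sequence string that has no corresponding file
while read -r missing; do
    echo "no file found for ${missing}"    # or log it, create a placeholder, etc.
done < <(comm -23 <(printf "%s\n" {00..23}{00,30}00) \
                  <(printf "%s\n" "${!myarray[@]}" | sort))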
Doing this is easy with POSIX shell, using only built-ins:
#!/usr/bin/env sh
# Print an x for each glob matched file, and store result in string_check
string_check=$(printf '%.0sx' ./*[0-2][0-9][03]000*)
# Now string_check length reflects the number of matches
if [ ${#string_check} -eq 48 ]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi
Alternatively:
#!/usr/bin/env sh
if [ "$(printf '%.0sx' ./*[0-2][0-9][03]000*)" \
= 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' ]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi

Having difficulty defining conditions to call certain functions and error messages

I'm writing a piece of code which will use data from a file that I've already made in order to work out the average value of the file, the minimum value, the maximum value, and then finally display all values at once.
I'm very new to Unix, so I'm trying to learn it, but I just can't seem to crack where I need to go with my code in order for it to gain functionality.
I've got the basics of the code, but I need to find a way to call the functions using the year, which is stored in a directory corresponding to that year. This makes me think I'm going to have problems reading from the file, as I'm using sed to take only line 4 of that file rather than the year.
I also need to figure out how to set error messages and an exit status for the script if the user has not given (Year) (one of the 4 commands), if the year doesn't correspond to one available in the tree, or if the keyword is invalid.
Any help or even pointers towards good material to learn these things would be great.
Here is my current code:
#!/bin/bash
#getalldata() {
#find . -name "ff_*" -exec sed -n '4p' {} \;
#}
#Defining where the population configuration file is which contains all the data
popconfile.txt=$HOME/testarea
#Function to find the average population of all of the files
averagePopulation()
{
    total=0
    list=$(cat popconfile.txt)
    for var in "${list[@]}"
    do
        total=$((total + var))
    done
    average=$((total/$(wc -l popconfile.txt)))
    echo "$average"
}
#Function to find the maximum population from all the files
maximumPopulation()
{
    max=1
    for in `cat popconfile.txt`
    do
        if [[ $1 > "$max" ]]; then
            max=$1
            echo "$max"
        fi
    done
}
#Function to find the minimum population from all the files
minimumPopulation()
{
    min=1000000
    for in `cat popconfile.txt`
    do
        if [[ $1 < "$min" ]]; then
            max=$1
            echo "$min"
        fi
    done
}
#Function to show all of the results in one place
function showAll()
{
    echo "$min"
    echo "$max"
    echo "$average"
}
Thanks!
Assuming your popconfile.txt format is
cat popconfile.txt
150
10
45
1000
34
87
You might be able to simplify your code with:
for i in $(cat popconfile.txt); do
    temp[$i]=$i
done
pop=(${temp[*]})
min=${pop[0]}
max=${pop[$((${#pop[*]}-1))]}
for ((j=0;j<${#pop[*]};j++)); do
    sum=$(($sum+${pop[$j]}))
done
average=$(($sum/${#pop[*]}))
echo "maximum="$max
echo "minimum="$min
echo "average="$average
Be aware, though, that the average here, as in your code, is calculated with integer arithmetic, so you're losing all decimals.
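If the decimals matter, one option (a sketch, not part of the original answer) is to let awk do the division in floating point:
# Sum the values and divide as floating point; with the sample data above this prints average=221.00
average=$(awk '{ sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }' popconfile.txt)
echo "average=$average"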

Merging rows in .csv in order

After analysing brain scans I ended up with around 1000 .csv files, one for each scan. I've merged them into one, in order (by subject ID and date). My problem is that some subjects had two or more consecutive scans and some had only one. The database now looks like this:
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530 //first scan of A
024_S_0985, 437.50, 204.80, 0.131074 //second scan of A
024_S_0985, 400.75, 198.80, 0.127420 //third scan of A
024_S_1063, 544.50, 214.34, 0.148939 //first and only scan of B
024_S_1171, 654.75, 240.33, 0.142453 //first scan of C
024_S_1171, 659.50, 242.21, 0.141269 //second scan of C
...
But I want it to look like this:
ID, CC_area, CC_perimeter, CC_circularity, CC_area2, CC_perimeter2, CC_circularity2, CC_area3, CC_perimeter3, CC_circularity3, ..., CC_circularity6
024_S_0985, 407.00, 192.15, 0.138530, 437.50, 204.80, 0.131074, 400.75, 198.80, 0.127420, ... ,
024_S_1063, 544.50, 214.34, 0.148939,,,,,, ...,
024_S_1171, 654.75, 240.33, 0.142453, 659.50, 242.21, 0.141269,,, ... ,
...
What is important is that the order of the data must not be changed, and the number of rows for one ID is not known (it varies from 1 to 6). (So first the columns of scan 1, then scan 2, etc.) Could you help me with, or provide, a solution for that using bash? I am not experienced in programming and I have lost hope that I could do it myself.
You can combine the lines with the same ID (the initial index) using a normal while read loop and then acting on 3 conditions: (1) whether it is the first line following the header; (2) whether the current index is equal to the last; and (3) whether the current index differs from the last. There are a number of ways to approach this, but a short bash script could look like the following:
#!/bin/bash

fn="${1:-/dev/stdin}"   ## accept filename or stdin
[ -r "$fn" ] || {       ## validate file is readable
    printf "error: file not found: '%s'\n" "$fn"
    exit 1
}

declare -i cnt=0        ## flag for 1st iteration
while read -r line; do  ## for each line in file
    ## read header, print & continue
    [ "${line//,*/}" = ID ] && printf "%s\n" "$line" && continue
    line="${line%% //*}"    ## strip trailing comment, e.g. " //first scan of A"
    idx=${line//,*/}        ## parse subject index from line
    line="${line#*, }"      ## strip index
    if [ "$cnt" -eq 0 ]; then        ## if first data line - print
        printf "%s, %s" "$idx" "$line"
        ((cnt++))
    elif [ "$idx" = "$lidx" ]; then  ## if indexes equal, append
        printf ", %s" "$line"
    else                             ## else, newline & print
        printf "\n%s, %s" "$idx" "$line"
    fi
    last="$line"    ## save last line
    lidx=$idx       ## save last index
done <"$fn"
printf "\n"
Input
$ cat dat/cmbcsv.dat
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530 //first scan of A
024_S_0985, 437.50, 204.80, 0.131074 //second scan of A
024_S_0985, 400.75, 198.80, 0.127420 //third scan of A
024_S_1063, 544.50, 214.34, 0.148939 //first and only scan of B
024_S_1171, 654.75, 240.33, 0.142453 //first scan of C
024_S_1171, 659.50, 242.21, 0.141269 //second scan of C
Output
$ bash cmbcsv.sh dat/cmbcsv.dat
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530, 437.50, 204.80, 0.131074, 400.75, 198.80, 0.127420
024_S_1063, 544.50, 214.34, 0.148939
024_S_1171, 654.75, 240.33, 0.142453, 659.50, 242.21, 0.141269
Note: I didn't know whether you needed all the additional commas or ellipses, or whether they were just there to show there could be more entries for the same index (e.g. ,,...,). You can easily add them if need be.
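If the trailing commas are needed so that every row has the full 6-scan width, one option (a sketch; merged.csv stands in for the script's output, and 19 = 1 ID column + 6 scans x 3 measurements) is to pad the rows with awk afterwards:
# Pad each data row out to 19 comma-separated fields so every line has the same width
awk -F',' -v OFS=',' 'NR == 1 { print; next }
                      { for (i = NF + 1; i <= 19; i++) $i = ""; print }' merged.csv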
Well, if you know which scan belongs to which person, you can add an extra column such as patient name or ID, but I guess that only works if you have the original information about how many scans there are per person.

search lines of file for email address - returning whole line, with bash

Suppose I have a file (sizes.txt)
daveclark#foo.com 0 23252 0
mikeclark#foo.com 0 45131 1
clark#foo.com 0 55235 0
joeclark#bar.net 33632 1
maryclark#bar.net 0 55523 0
clark#bar.net 0 99356 0
Now I have another file (users.txt)
clark#foo.com
clark#bar.net
What I want to do is find each line in sizes.txt for the specific email addresses in users.txt... using a loop, bash, or a one-liner on CentOS. Here's the key point: I need to find the lines that contain only clark#foo.com and then clark#bar.net, meaning there should be exactly one line for each.
The simplest way that comes to mind...
for i in `cat users.txt`; do grep $i sizes.txt; done
...but this does not work, because processing the first line of users.txt will return the lines containing daveclark#foo.com, mikeclark#foo.com and clark#foo.com. I explicitly want the line containing "clark#foo.com" (the third line of sizes.txt). Processing the second line of users.txt has the same problem (it will return the maryclark#bar.net and clark#bar.net lines). I know this has to be something totally simple that I'm overlooking.
What you are looking for is an exact match with grep. In your case that would be the -w option.
So
for i in $(cat users.txt); do
    grep -w "^$i" sizes.txt
done
should do the trick.
Cheers.
You can try something like this using only bash built-in functions and syntax:
while read -r user; do
    while read -r s_user s_column_2 s_column_3 s_column_4; do
        [ "${s_user}" = "${user}" ] && printf "%b\t%b\t%b\t%b\n" "${s_user}" "${s_column_2}" "${s_column_3}" "${s_column_4}"
    done < sizes.txt
done < users.txt
This nested while could be slow with big sizes.txt files. In those cases you could use it in combination with awk.
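For instance (a sketch of that combination, not from the original answer): read users.txt once into a lookup table, then print only the sizes.txt lines whose first field is an exact match:
# NR==FNR is true only while reading the first file (users.txt);
# for sizes.txt, a line is printed when its first field is a wanted address.
awk 'NR == FNR { want[$1]; next } $1 in want' users.txt sizes.txt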

Bash script that analyzes report files

I have the following bash script which I will use to analyze all report files in the current directory:
#!/bin/bash
# methods
analyzeStructuralErrors()
{
# do something with $1
}
# main
reportFiles=`find $PWD -name "*_report*.txt"`;
for f in $reportFiles
do
echo "Processing $f"
analyzeStructuralErrors $f
done
My report files are formatted as such:
Error Code for Issue X - Description Text - Number of errors.
col1_name,col2_name,col3_name,col4_name,col5_name,col6_name
1143-1-1411-247-1-72953-1
1143-2-1411-247-436-72953-1
2211-1-1888-204-442-22222-1
Error Code for Issue Y - Description Text - Number of errors.
col1_name,col2_name,col3_name,col4_name,col5_name,col6_name
Other data
.
.
.
I'm looking for a way to go through each file and aggregate the report data. In the above example, we have two unique issues of type X, which I would like to handle in analyzeStructuralErrors. Other types of issues can be ignored in this routine. Can anyone offer advice on how to do this? Basically, I want to read each line until I hit the next error, and put that data into some kind of data structure.
Below is a working awk implementation that uses its pseudo-multidimensional arrays. I've included sample output to show you how it looks. I took the liberty of adding a 'Count' column to denote how many times a certain "Issue" was hit for a given Error Code.
#!/bin/bash
awk '
    /Error Code for Issue/ {
        errCode[currCode=$5] = $5
    }
    # data rows like 1143-1-1411-247-1-72953-1 (leading whitespace optional)
    /^[[:space:]]*[0-9-]+$/ {
        split($0, tmpArr, "-")
        error[errCode[currCode], tmpArr[1]]++
    }
    END {
        for (code in errCode) {
            printf("Error Code: %s\n", code)
            for (item in error) {
                split(item, subscr, SUBSEP)
                if (subscr[1] == code) {
                    printf("\tIssue: %s\tCount: %s\n", subscr[2], error[item])
                }
            }
        }
    }
' *_report*.txt
Output
$ ./report.awk
Error Code: B
Issue: 1212 Count: 3
Error Code: X
Issue: 2211 Count: 1
Issue: 1143 Count: 2
Error Code: Y
Issue: 2961 Count: 1
Issue: 6666 Count: 1
Issue: 5555 Count: 2
Issue: 5911 Count: 1
Issue: 4949 Count: 1
Error Code: Z
Issue: 2222 Count: 1
Issue: 1111 Count: 1
Issue: 2323 Count: 2
Issue: 3333 Count: 1
Issue: 1212 Count: 1
As suggested by Dave Jarvis, awk:
handles this better than bash
is fairly easy to learn
is likely available wherever bash is available
I've never had to look farther than The AWK Manual.
It would make things easier if you used a consistent field separator for both the list of column names and the data. Perhaps you could do some pre-processing in a bash script using sed before feeding to awk. Anyway, take a look at multi-dimensional arrays and reading multiple lines in the manual.
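For instance, such a pre-processing step (a sketch, not from the original answer, assuming the data rows look exactly like the sample above) could transliterate the dashes in the data rows to commas so data and column names share a delimiter:
# On lines consisting only of digits and dashes, turn every '-' into ','
sed '/^[0-9][0-9-]*$/ y/-/,/' *_report*.txt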
Bash has one-dimensional arrays that are indexed by integers. Bash 4 adds associative arrays. That's it for data structures. AWK has one-dimensional associative arrays and fakes its way through two-dimensional arrays. If you need some kind of data structure more advanced than that, you'll need to use Python, for example, or some other language.
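As a small illustration (not part of the original answer, with made-up keys), a Bash 4 associative array keyed by strings looks like this:
declare -A count                   # requires Bash 4+
count["Issue X,1143"]=2            # string keys, e.g. "error code,issue"
count["Issue X,2211"]=1
for key in "${!count[@]}"; do
    echo "$key -> ${count[$key]}"
done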
That said, here's a rough outline of how you might parse the data you've shown.
#!/bin/bash
# methods
analyzeStructuralErrors()
{
    local f=$1
    local Xpat="Error Code for Issue X"
    local notXpat="Error Code for Issue [^X]"
    local flag=false        # track whether we are inside an "Issue X" section
    local issues=()
    while read -r line
    do
        if [[ $line =~ $Xpat ]]
        then
            flag=true
        elif [[ $line =~ $notXpat ]]
        then
            flag=false
        elif $flag && [[ $line =~ , ]]
        then
            # columns could be overwritten if there is more than one X section
            IFS=, read -ra columns <<< "$line"
        elif $flag && [[ $line =~ - ]]
        then
            issues+=("$line")
        else
            echo "unrecognized data line"
            echo "$line"
        fi
    done < "$f"
    for issue in "${issues[@]}"
    do
        IFS=- read -ra array <<< "$issue"
        # do something with ${array[0]}, ${array[1]}, etc.
        # or iterate
        for field in "${array[@]}"
        do
            # do something with $field
            :
        done
    done
}
# main
find . -name "*_report*.txt" | while read -r f
do
    echo "Processing $f"
    analyzeStructuralErrors "$f"
done
