Printing the same contiguous lines only once using shell/awk

I have an input as below:
Sep 9 09:22:11
Hello
Hello
Sep 9 10:23:11
Hello
Hello
Hello
Sep 10 11:23:11
I expect the output as below: (the same contiguous lines are replaced by only one line)
Sep 9 09:22:11
Hello
Sep 9 10:23:11
Hello
Sep 10 11:23:11
Could anyone help me solve this quickly using shell or awk?

Using awk you can do this:
awk '$0 != prev; {prev=$0}' file
Sep 9 09:22:11
Hello
Sep 9 10:23:11
Hello
Sep 10 11:23:11
Command Breakup:
$0 != prev; # if previous line is not same as current then print it
{prev=$0} # store current line in a variable called prev
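One edge case worth noting: prev starts out empty, so if the very first line of the input is itself empty, it compares equal to prev and is dropped. Here is a minimal sketch of a wrapper that guards against this with NR == 1. The function name squeeze_repeats is my own and not part of the answer above.

```shell
# squeeze_repeats: print each run of identical adjacent lines once.
# Reads the named files, or standard input when none are given.
squeeze_repeats() {
    # NR == 1 always prints the first line, even if it is empty;
    # after that a line is printed only when it differs from prev.
    awk 'NR == 1 || $0 != prev; { prev = $0 }' "$@"
}

# Demo on a shortened version of the sample input.
printf '%s\n' 'Sep 9 09:22:11' Hello Hello 'Sep 9 10:23:11' Hello Hello Hello |
    squeeze_repeats
```

It can also be called on a file directly, e.g. squeeze_repeats file.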

To remove repeats of lines, use uniq:
uniq File
With your sample input, for example:
$ uniq File
Sep 9 09:22:11
Hello
Sep 9 10:23:11
Hello
Sep 10 11:23:11
Although its name may imply that uniq concerns itself with unique lines, it does not: it looks for adjacent repeated lines and, by default, removes the repeats.

Just because you asked for shell too, though the given answers are all better solutions -
last=''
while IFS= read -r line
do if [[ "$line" == "$last" ]] # == compares strings; -eq would compare them as integers
then continue
else echo "$line"
last="$line"
fi
done < infile
This is simple, clear, and likely slower than either awk or uniq.

Related

Print line if column 2 is greater than column 2 on the next line

I have a file with multiple lines that all have a date in the second column. I'm looking for a command that prints the whole line if the date is greater than the date on the next line.
When this is no longer the case I want it to stop, don't print anything else.
I'm a rookie so if you could explain your answer that would be great.
I'm trying to use awk (answer can be any command)
awk '$2 > ?nextline?$2 {print}' file
I couldn't find how to check next line or how to stop after the first time the greater than command isn't true.
Input:
Jan 20 text1
Jan 15 text2
Jan 15 text3
Jan 3 text4
Jan 27 text5
Jan 17 text6
(more lines...)
Wanted output:
Jan 20 text1
Jan 15 text2
Jan 15 text3
Jan 3 text4

An awk version:
awk 'f && $2>f {exit} 1; {f=$2}' file
Jan 20 text1
Jan 15 text2
Jan 15 text3
Jan 3 text4
f && $2>f {exit} Test whether f is set and the second field is larger than f. If so, exit the program.
1; is always true, so print the line.
{f=$2} set f to second field.
It could even be shortened some more:
awk 'f&&$2>f{exit}f=$2' file
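f=$2 By using the assignment as the pattern, the line is printed (the result is true) at the same time as f is set to $2.
Unlike the first version, this one skips lines whose second field is empty or 0, such as blank lines between the data, because the assignment then evaluates to false.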

Generally, you have to reverse your requirement. sed, awk, etc. can't see the future, so instead of printing the current line if the next line does something, you stop printing when the current line does something compared to the previous line. awk and friends can see the past by storing it in a variable.
So a simple way, since you said language doesn't matter, is to do this:
perl -palE 'BEGIN {$h = 999} last if $F[1] > $h; $h = $F[1]' < file
What we are doing here is passing perl -p ("loop through the input and print the line at the end of each iteration"), -a ("auto-split the line on whitespace into the @F array"), -l ("auto-handle line endings"; not strictly required here, but a good habit most of the time), and -E ("execute the code from the next parameter with the features of the current version of perl enabled"; -e would suffice here, but, again, habit). The code we pass in starts by setting $h (the highest day allowed at this point) to something out of range; I'm assuming no number will be 999+, since you say they're days of a month. It then uses last to terminate the loop if the current day is higher than the highest allowed, and sets that high point to the current value if we get past the if. Perl then automatically prints the current line and loops to the next one.
The key point is that we only look at the current line and track in a variable the relevant history so that we don't need to look into the future.
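The same "remember the past" idea can be sketched in plain shell. This is only an illustration under a few assumptions: the second field is always an integer day, fields are separated by whitespace, and the function name print_until_increase is made up for this example.

```shell
# print_until_increase: print lines while the second field (the day)
# does not rise above the day seen on the previous line.
print_until_increase() {
    prev=''
    while read -r month day rest; do
        # Stop at the first line whose day is larger than the one before it.
        if [ -n "$prev" ] && [ "$day" -gt "$prev" ]; then
            break
        fi
        # Re-join the fields; ${rest:+ $rest} avoids a trailing space.
        printf '%s\n' "$month $day${rest:+ $rest}"
        prev=$day
    done
}
```

Usage would be print_until_increase < file. For large inputs the awk or perl versions are likely to be faster, since a shell read loop processes one line at a time.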

Performance issues with bash script

I have written a bash script that is responsible for 'collapsing' a log file. Given a log file of the format:
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line
message
that may continue
several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message
Collapse the log file to a single lined file with a separator character:
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line; message; that may continue; several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message
The following bash script achieves this goal, but at an excruciatingly slow pace. A 500 MB input log may take 30 minutes on an 8-core, 32 GB machine.
while read -r line; do
if [ -z "$line" ]; then
BUFFER+=$LINE_SEPERATOR
continue
fi
POSSIBLE_DATE=$(cut -c1-11 <<< "$line")
if [ "$PREV_DATE" == "$POSSIBLE_DATE" ]; then # Usually date won't change, big comparison saving.
if [ -n "$BUFFER" ]; then
echo $BUFFER
BUFFER=""
fi
BUFFER+="$line"
elif [[ "$POSSIBLE_DATE" =~ ^[0-3][0-9]\ [A-Za-z]{3}\ 2[0-9]{3} ]]; then # Valid date.
PREV_DATE="$POSSIBLE_DATE"
if [ -n "$BUFFER" ]; then
echo $BUFFER
BUFFER=""
fi
BUFFER+="$line"
else
BUFFER+="$line"
fi
done
Any ideas how I can optimize this script? It doesn't appear as though the regex is the bottleneck (my first optimization) as now that condition is rarely hit.
Most of the lines in the log file are single lines, so it's just a straight-up comparison of the first 11 chars. That doesn't seem like it should be so computationally expensive?
Thanks.

Using awk
It will be much faster, as it won't spawn multiple processes.
$ awk '/^[^0-9]/{ORS="; "} /^[0-9]/{$0=(FNR==1)?$0:RS $0; ORS=""} END{printf RS}1' file
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line message; that may continue ; several lines;
21 Oct 2017 12:38:07 [DEBUG] Single line message
/^[^0-9]/{ORS="; "} : If the line starts with a non-digit, set the Output Record Separator to "; " instead of the default \n.
/^[0-9]/{$0=(FNR==1)?$0:RS $0; ORS=""} : If it starts with a digit, set ORS="" and prepend RS (i.e. \n) to the record, except on the first line (FNR==1), where we don't want a newline at the start.
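For comparison, here is a sketch of a different awk approach that buffers each record and joins continuation lines with "; ", so no separator is left at the end of a record. It assumes every new record starts with a date like "21 Oct 2017 "; the function name collapse_log is hypothetical.

```shell
# collapse_log: join each log record and its continuation lines
# into a single line, separated by "; ".
collapse_log() {
    awk '
        # A line that starts with a date such as "21 Oct 2017 " opens a new record.
        /^[0-9][0-9]? [A-Za-z]+ [0-9][0-9][0-9][0-9] / {
            if (have) print buf      # flush the previous record
            buf = $0
            have = 1
            next
        }
        # Any other line is a continuation: append it after "; ".
        {
            buf = have ? buf "; " $0 : $0
            have = 1
        }
        # Flush the last record at end of input.
        END { if (have) print buf }
    ' "$@"
}
```

Called as collapse_log file, this still runs as a single awk process, which is where the speed-up over the bash loop comes from.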

You can use sed
sed ':B;/^[0-9][0-9]* /N;/\n[0-9][0-9]* /!{s/\n/; /;bB};h;s/\n.*//p;x;s/.*\n//;tB' infile
You can adjust the regex '[0-9][0-9]* ' to your need.

Make cat command to operate recursively looping through a directory

I have a large directory of data files which I am in the process of manipulating to get them in a desired format. They each begin and end 15 lines too soon, meaning I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence.
To begin, I have written the following code to separate the relevant data into easy chunks:
#!/bin/bash
destination='media/user/directory/'
for file1 in `ls $destination*.ascii`
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
done
This worked perfectly, so the next step is the world's simplest cat command:
cat $file3 $file2 > outfile
However, what I need to do is to stitch file2 to the previous file3. The files are all sequential over time:
*_20090412T235945_20090413T235944_* ### April 13
*_20090413T235945_20090414T235944_* ### April 14
So I need to take the 15 lines snipped off the April 14 example above and paste it to the end of the April 13 example.
This doesn't have to be part of the original code, in fact it would be probably best if it weren't. I was just hoping someone would be able to help me get this going.
Thanks in advance! If there is anything I have been unclear about and needs further explanation please let me know.

"I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence."
If I understand what you want correctly, it can be done with one line of code:
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
When this has run, the files file1.new, file2.new, and file3.new will be in the new form with the lines transferred. Of course, you are not limited to three files: you may specify as many as you like on the command line.
Example
To keep our example short, let's just strip the first 2 lines instead of 15. Consider these test files:
$ cat file1
1
2
3
$ cat file2
4
5
6
7
8
$ cat file3
9
10
11
12
13
14
15
Here is the result of running our command:
$ awk 'NR==1 || FNR==3{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
$ cat file1.new
1
2
3
4
5
$ cat file2.new
6
7
8
9
10
$ cat file3.new
11
12
13
14
15
As you can see, the first two lines of each file have been transferred to the preceding file.
How it works
awk implicitly reads each file line-by-line. The job of our code is to choose which new file a line should be written to based on its line number. The variable f will contain the name of the file that we are writing to.
NR==1 || FNR==16{f=FILENAME ".new"}
When we are reading the first line of the first file, NR==1, or when we are reading the 16th line of whatever file we are on, FNR==16, we update f to be the name of the current file with .new added to the end.
For the short example, which transferred 2 lines instead of 15, we used the same code but with FNR==16 replaced with FNR==3.
print>f
This prints the current line to file f.
(If this was a shell script, we would use >>. This is not a shell script. This is awk.)
Using a glob to specify the file names
destination='media/user/directory/'
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' "$destination"*.ascii
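Here is a sketch of the same idea wrapped in a function that takes the number of lines as a parameter. One caveat, as far as I can tell: in awk, closing a file and then reopening the same name with > truncates it, which would happen at line N+1 of the first file. The sketch therefore only switches output when the name actually changes. The function name shift_lines is my own.

```shell
# shift_lines N FILE...: write FILE.new for each FILE, moving the first N
# lines of every file after the first to the end of the preceding output.
shift_lines() {
    n=$1
    shift
    awk -v n="$n" '
        NR == 1 || FNR == n + 1 {
            new = FILENAME ".new"
            # Only switch when the target really changes, so the first
            # output file is never closed and reopened (and truncated).
            if (new != f) {
                if (f != "") close(f)
                f = new
            }
        }
        { print > f }
    ' "$@"
}
```

With the test files from the example, shift_lines 2 file1 file2 file3 should give the file1.new, file2.new and file3.new shown above. For the real data it would be shift_lines 15 "$destination"*.ascii.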

Your task is not that difficult at all. You want to gather a list of all _end files in the directory (using a for loop and globbing, NOT looping on the results of ls). Once you have all the end files, you simply parse the dates using parameter expansion with substring removal, say into d1 and d2 for date1 and date2 in:
stuff_20090413T235945_20090414T235944_end
      |  d1  |        |  d2  |
then you simply subtract 1 from d1 into say date0 or d0 and then construct a previous filename out of d0 and d1 using _snip instead of _end. Then just test for the existence of the previous _snip filename, and if it exists, paste your info from the current _end file to the previous _snip file. e.g.
#!/bin/bash
for i in *end; do ## find all _end files
d1="${i#*stuff_}" ## isolate first date in filename
d1="${d1%%T*}"
d2="${i%T*}" ## isolate second date
d2="${d2##*_}"
d0=$((d1 - 1)) ## subtract 1 from first, get snip d1
prev="${i/$d1/$d0}" ## create previous 'snip' filename
prev="${prev/$d2/$d1}"
prev="${prev%end}snip"
if [ -f "$prev" ] ## test that prev snip file exists
then
printf "paste to : %s\n" "$prev"
printf " from : %s\n\n" "$i"
fi
done
Test Input Files
$ ls -1
stuff_20090413T235945_20090414T235944_end
stuff_20090413T235945_20090414T235944_snip
stuff_20090414T235945_20090415T235944_end
stuff_20090414T235945_20090415T235944_snip
stuff_20090415T235945_20090416T235944_end
stuff_20090415T235945_20090416T235944_snip
stuff_20090416T235945_20090417T235944_end
stuff_20090416T235945_20090417T235944_snip
stuff_20090417T235945_20090418T235944_end
stuff_20090417T235945_20090418T235944_snip
stuff_20090418T235945_20090419T235944_end
stuff_20090418T235945_20090419T235944_snip
Example Use/Output
$ bash endsnip.sh
paste to : stuff_20090413T235945_20090414T235944_snip
from : stuff_20090414T235945_20090415T235944_end
paste to : stuff_20090414T235945_20090415T235944_snip
from : stuff_20090415T235945_20090416T235944_end
paste to : stuff_20090415T235945_20090416T235944_snip
from : stuff_20090416T235945_20090417T235944_end
paste to : stuff_20090416T235945_20090417T235944_snip
from : stuff_20090417T235945_20090418T235944_end
paste to : stuff_20090417T235945_20090418T235944_snip
from : stuff_20090418T235945_20090419T235944_end
(of course replace stuff_ with your actual prefix)
Let me know if you have questions.
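One thing to watch: d0=$((d1 - 1)) treats the date as a plain integer, so it goes wrong across a month boundary (20090501 - 1 gives 20090500). A possible workaround, assuming GNU date is available (its -d option is not portable to BSD/macOS), is to let date do the calendar arithmetic. The helper name prev_day is hypothetical.

```shell
# prev_day YYYYMMDD: print the previous calendar day in the same format.
# Assumes GNU date; TZ=UTC0 keeps daylight-saving shifts out of the picture.
prev_day() {
    TZ=UTC0 date -d "$1 -1 day" +%Y%m%d
}
```

In the script above, d0=$(prev_day "$d1") would then replace the arithmetic line, at the cost of one extra process per file.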

You could store the previous $file3 value in a variable (and do a check if it is not the first run with -z check):
#!/bin/bash
destination='media/user/directory/'
prev=""
for file1 in $destination*.ascii
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
if [ ! -z "$prev" ]; then
cat "$prev" "$file2" > outfile
fi
prev=$file3
done

blocking space for a string in shell

How can we reserve a fixed width for a string in the shell using the printf command?
For example, the result is
tom#x.com 10
john#x.com 11
andrew#x.com 12
thomas_sean#x.com 15
How can we align this result properly? The command used in my code is
printf $user$i $time
The desired result is
tom#x.com          10
john#x.com         11
andrew#x.com       12
thomas_sean#x.com  15
My code is as below:
echo $h | cut -f$a -d" "
`printf "\t${t[$a]}\t\t $hour:$min:$sec\n"`

Possible script
while read -r user time
do
printf "%-20s %s\n" "$user" "$time"
done <<'EOF'
tom#x.com 10
john#x.com 11
andrew#x.com 12
thomas_sean#x.com 15
EOF
Sample output:
tom#x.com            10
john#x.com           11
andrew#x.com         12
thomas_sean#x.com    15
The %-20s can be adjusted a little (%-17s or %-18s) if desired, but the basic idea is to reserve an appropriate number of spaces and left justify the string, followed by a blank and then the 'time'. The \n for the newline is necessary; printf does not add a newline unless you request it to do so.
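If you would rather not hard-code the width, it can be computed from the data first. Here is a minimal sketch, assuming two whitespace-separated fields per line and input small enough to hold in a variable; the function name align_columns is made up for this example.

```shell
# align_columns: left-justify the first field to the width of the
# longest one, then print the second field after a single space.
align_columns() {
    input=$(cat)                 # slurp stdin so it can be read twice
    width=0
    # First pass: find the longest first field.
    while read -r user value; do
        if [ "${#user}" -gt "$width" ]; then
            width=${#user}
        fi
    done <<EOF
$input
EOF
    # Second pass: print using the computed width in the format string.
    while read -r user value; do
        printf "%-${width}s %s\n" "$user" "$value"
    done <<EOF
$input
EOF
}
```

It reads from standard input, so something like your_command | align_columns should work.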

Pipe your output to column -t:
$ column -t << END
tom#x.com 10
john#x.com 11
andrew#x.com 12
thomas_sean#x.com 15
END
tom#x.com          10
john#x.com         11
andrew#x.com       12
thomas_sean#x.com  15

You can try something like this:
printf("%-10s%-50s", $1, $2)
I didn't test it, but I think it will work.
If not, take a look at this issue; someone had the same problem solved :)
Hope it helps!

incrementing a number in bash with leading 0

I am very new to bash scripting and am trying to write some code to parse and manipulate a file that I am working on.
I need to increment and decrement the minute of a time for a bunch of different times in a file. My problem happens when the time is for example 2:04 or 14:00.
File Example:
2:43
2:05
15:00
My current excerpt from my bash script is like this
for x in `cat $1`;
do minute_var=$(echo $x | cut -d: -f2);
incr_min=$(($minute_var + 1 | bc));
echo $incr_min;
done
Current Result:
44
6
1
Required Result:
44
06
01
Any suggestions?

Use printf:
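incr_min=$(printf %02d $((10#$minute_var + 1)))
Note that bc is not needed if only integers are involved. The 10# prefix forces base 10, so minutes such as 08 and 09 are not rejected as invalid octal numbers.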

is this ok for your requirement?
kent$ echo "2:43
2:05
15:00"|awk -F: '{$2++;printf "%02d\n", $2}'
44
06
01

while IFS=: read hour min; do
printf "%02d\n" $((10#$min + 1))
done <<END
2:43
2:05
15:00
8:08
0:59
END
44
06
01
09
60
For the minute wrapping to the next hour, use a language with time functions, like gawk or perl:
awk -F: '{
time = mktime("1970 01 01 " $1 " " $2 " 00")
time += 60
print strftime("%M", time)
}'
perl -MTime::Piece -MTime::Seconds -nle '
    $t = Time::Piece->strptime($_, "%H:%M");
    print +($t + ONE_MINUTE)->strftime("%M");
'

UPDATED #2
There are some problems with your script. First, instead of `cat file` you should use `<file`, or rather $(<file): this spares one fork and exec call, as bash simply opens the file. Calling cut and bc (and printf) is also unnecessary, as bash has the proper features built in, so you can spare some more forks and execs.
If the input file is large (greater than about 32 KiB), the for-loop line can become too large for bash to process, so I suggest using a while loop instead and reading the file line by line.
I could suggest something like this in pure bash (applied Atle's substr solution):
while IFS=: read hr min; do
incr_min=$((1$min+1)); #Octal problem solved
echo ${incr_min: -2}; #Mind the space before -2!
#or with glennjackman's suggestion to use decimal base
#incr_min=0$((10#$min+1))
#echo ${incr_min: -2};
#or choroba's solution improved to set variable directly
#printf -v incr_min %02d $((10#$min+1))
#echo $incr_min
done <file
Input file
$ cat file
2:43
2:05
15:00
12:07
12:08
12:09
Output:
44
06
01
08
09
10
Maybe printf -v is the simplest, as it puts the result into the variable in a single step.
Good question from tripleee: what should happen if the result is 60?
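One possible answer to that question is to carry into the hour. The sketch below assumes 24-hour H:MM or HH:MM input and wraps 23:59 to 0:00. It sticks to POSIX parameter expansion instead of 10# so it also runs in plain sh. The function name add_minute is hypothetical.

```shell
# add_minute H:MM: print the time one minute later, carrying into the hour.
add_minute() {
    hr=${1%%:*}                  # part before the colon
    min=${1#*:}                  # part after the colon
    # Strip one leading zero so 08 and 09 are not read as octal,
    # then fall back to 0 if nothing is left (e.g. an hour of "0").
    hr=${hr#0};   hr=${hr:-0}
    min=${min#0}; min=${min:-0}
    total=$(( (hr * 60 + min + 1) % 1440 ))   # 1440 minutes per day
    printf '%d:%02d\n' $((total / 60)) $((total % 60))
}
```

For example, add_minute 2:05 gives 2:06 and add_minute 15:59 gives 16:00.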

Use printf to reformat the output to be zero-padded, 2-wide:
incr_min=$(printf %02d $incr_min)

Here's a solution that:
wraps the minutes from 59 to 0
is fully POSIX compliant (no bashisms!)
doesn't need a single fork and is thus extremely fast
$ cat x
2:43
2:05
2:08
2:09
15:00
15:59
$ while IFS=: read hr min; do
printf '%02d\n' $(((${min#0}+1)%60))
done < x
44
06
09
10
01
00

Try this:
for x in $(<"$1"); do
# 10# forces base 10 so that 08 and 09 are not read as invalid octal
printf "%02d\n" $(( (10#${x#*:} + 1) % 60 ));
done

Padding with 0, and getting two last characters:
for x in `cat $1`;
do minute_var=$(echo $x | cut -d: -f2);
incr_min=0$((10#$minute_var + 1));
echo ${incr_min: -2:2};
done
