Bash script to remove redundant lines - bash

Good afternoon,
I'm trying to make a bash script that cleans out some data output files. The files look like this:
/path/
/path/to
/path/to/keep
/another/
/another/path/
/another/path/to
/another/path/to/keep
I'd like to end up with this:
/path/to/keep
/another/path/to/keep
I want to cycle through lines of the file, checking the next line to see if it contains the current line, and if so, delete the current line from the file. Here's my code:
for LINE in $(cat bbutters_data2.txt)
do
grep -A1 ${LINE} bbutters_data2.txt
if [ $? -eq 0 ]
then
sed -i '/${LINE}/d' ./bbutters_data2.txt
fi
done

Assuming that your input file is sorted in the way that you have shown:
$ awk 'NR>1 && substr($0,1,length(last))!=last {print last;} {last=$0;} END{print last}' file
/path/to/keep
/another/path/to/keep
How it works
awk reads through the input file line by line. Every time we read a new line, we compare it to the previous one, which is stored in last. If the new line does not begin with the previous line, then we print the previous line. In more detail:
NR>1 && substr($0,1,length(last))!=last {print last;}
If this is not the first line and the current line, $0, does not start with the previous line, called last, then print the previous line.
last=$0
Update the variable last to the current line.
END{print last}
After we finish reading the file, print the last line.
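To try the command on the sample data without creating a file, the one-liner can be wrapped in a small function that reads standard input. This is only a sketch: the function name prune_prefixes is made up here, the END block is guarded so that empty input prints nothing, and the same assumption applies as above (each parent path is listed directly before its children).

```bash
#!/bin/bash
# prune_prefixes: hypothetical helper name. Reads sorted paths on stdin and
# prints only the lines that are not a prefix of the line that follows them.
prune_prefixes() {
    awk 'NR>1 && substr($0,1,length(last))!=last {print last}
         {last=$0}
         END {if (NR) print last}'
}

# Sample data from the question
printf '%s\n' /path/ /path/to /path/to/keep \
              /another/ /another/path/ /another/path/to /another/path/to/keep |
    prune_prefixes
```

With the sample data this prints /path/to/keep and /another/path/to/keep.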

I like the awk solution, but bash itself can handle the task. Note: both solutions (awk and bash) require that the shorter, included paths be listed before the longer paths that contain them. Here is an alternative bash solution (bash only, because of the glob match operation):
#!/bin/bash
fn="${1:-/dev/stdin}"                 ## accept filename or stdin
[ -r "$fn" ] || {                     ## validate file is readable
    printf "error: file not found: '%s'\n" "$fn"
    exit 1
}
declare -i cnt=0                      ## flag for 1st iteration
while read -r line; do                ## for each line in file
    ## if 1st iteration, fill 'last', increment 'cnt', continue
    [ $cnt -eq 0 ] && { last="$line"; ((cnt++)); continue; }
    ## while 'line' is a child of 'last', continue, else print
    [[ $line = "${last%/}"/* ]] || printf "%s\n" "$last"
    last="$line"                      ## update last=$line
done <"$fn"
[ ${#line} -eq 0 ] &&                 ## print last line (updated for non POSIX line end)
    printf "%s\n" "$last" ||
    printf "%s\n" "$line"
exit 0
Output
$ bash path_uniql.sh < dat/incpaths.txt
/path/to/keep
/another/path/to/keep

Related

Detect double new lines with bash script

I am attempting to return the line number of lines that have a break. An input example:
2938
383

3938
3

383
33333
But my script is not working and I can't see why. My script:
input="./input.txt"
declare -i count=0
while IFS= read -r line;
do
((count++))
if [ "$line" == $'\n\n' ]; then
echo "$count"
fi
done < "$input"
So I would expect, 3, 6 as output.
I just receive a blank response in the terminal when I execute. So there isn't a syntax error, something else is wrong with the approach I am taking. Bit stumped and grateful for any pointers..
Also "just use awk" doesn't help me. I need this structure for additional conditions (this is just a preliminary test) and I don't know awk syntax.
The issue is that "$line" == $'\n\n' never matches: read consumes the newline that ends each line, so an empty input line arrives as an empty string, not as newline characters. Instead, you can match an empty line with the regex pattern ^$:
if [[ "$line" =~ ^$ ]]; then
Now it should work.
It's also much easier with the awk command:
$ awk '$0 == ""{ print NR }' test.txt
3
6
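For readers who want to keep the while-loop structure from the question, the fix above can be sketched as a complete function. The name blank_line_numbers is invented for this example, and the temporary file only stands in for input.txt; the sample assumes blank lines after 383 and after 3, which is what the expected output of 3 and 6 implies.

```bash
#!/bin/bash
# blank_line_numbers: hypothetical helper name. Prints the line number of
# every empty line in the file given as $1.
blank_line_numbers() {
    local line count=0
    # '|| [[ -n $line ]]' also handles a final line without a trailing newline
    while IFS= read -r line || [[ -n $line ]]; do
        count=$((count + 1))
        if [[ -z $line ]]; then      # read already stripped the newline
            echo "$count"
        fi
    done < "$1"
}

sample=$(mktemp)                     # stand-in for input.txt
printf '2938\n383\n\n3938\n3\n\n383\n33333\n' > "$sample"
blank_line_numbers "$sample"
rm -f "$sample"
```

Further conditions can be added inside the loop body as extra tests on "$line".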
As Roman suggested, each line that read returns is terminated by a delimiter, and that delimiter is not part of the line, so it cannot show up in the test you are using.
If the pattern you are searching for looks like an empty line (which I infer is how a "double newline" always manifests), then you can just test for that:
while read -r; do
((count++))
if [[ -z "$REPLY" ]]; then
echo "$count"
fi
done < "$input"
Note that IFS is for field-splitting data on lines, and since we're only interested in empty lines, IFS is moot.
Or if the file is small enough to fit in memory and you want something faster:
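mapfile -t -O1 foo < "$input"   # -O1 starts at index 1, so indices equal line numbers
declare -p foo                  # optional: dump the array for inspection
for n in "${!foo[@]}"; do
    if [[ -z "${foo[$n]}" ]]; then
        echo "$n"
    fi
done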
Reading the file all at once (mapfile) then stepping through an array may be easier on resources than stepping through a file line by line.
You can also just use GNU awk:
gawk -v RS= -F '\n' '{ print (i += NF); i += length(RT) - 1 }' input.txt
By using FS = ".+", it ensures only truly zero-length (i.e. $0 == "") line numbers get printed, while skipping rows consisting entirely of [[:space:]]'s
echo '2938
383

3938
3

383
33333' |
{m,g,n}awk -F'.+' '!NF && $!NF = NR'
3
6
This sed one-liner should do the job at once:
sed -n '/^$/=' input.txt
Simply writes the current line number (the = command) if the line read is empty (the /^$/ matches the empty line).

Add filename of each file as a separator row when merging into a single file Bash Script

I have the current script which combines all the CSV files in a folder into a single CSV file, and it works great. I need to add functionality to insert the filename of each original CSV as a header row for its data block, so I know which section is which.
Can someone assist, as this is not my strong point and I am in over my head?
#!/bin/bash
OutFileName="./Data/all/all.csv" # Fix the output name
i=0 # Reset a counter
for filename in ./Data/all/*.csv; do
if [ "$filename" != "$OutFileName" ] ; # Avoid recursion
then
if [[ $i -eq 0 ]] ; then
head -1 $filename > $OutFileName # Copy header if it is the first file
fi
tail -n +2 $filename >> $OutFileName # Append from the 2nd line each file
i=$(( $i + 1 )) # Increase the counter
fi
done
I will be automating this by running the shell script from Apple Automator.
Thank you for any help.
This is one of the imported files, together with an output example:
Example of current input file
Once combined, I need the filename to appear where the "headers" are.
When you want to generate something like ...
Header1,Header2,Header3
file1.csv
a,b,c
x,y,z
file2.csv
1,2,3
9,9,9
file3.csv
...
... then you just have to insert an echo "$filename" >> "$OutFileName" in front of the tail command. Here is an updated version of your script with some minor improvements.
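#!/bin/bash
out="./Data/all/all.csv"
i=0
rm -f "$out"
for file in ./Data/all/*.csv; do
    [[ $file == "$out" ]] && continue     # the redirection below recreates $out before the glob expands
    (( i++ == 0 )) && head -n 1 "$file"   # copy the header from the first file only
    echo "$file"                          # separator row: the source filename
    tail -n +2 "$file"                    # data rows, without the header
done > "$out"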
There is no concept of "header line" other than the first line of the CSV file. What you can do is add a new column.
I've switched to Awk because it simplifies the script considerably. Your original would be literally a one-liner.
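awk -F , 'BEGIN { OFS=FS }
NR==1 { $(NF+1) = "Filename"; print; next }   # header of the first file only
FNR>1 { $(NF+1) = FILENAME; print }           # data rows; later headers are skipped
' all/*.csv >all.csv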
Not saving the output in the same directory as the inputs removes the pesky corner case handling.

How to browse a line from a file?

I have a file that contains 10 lines with this sort of content:
aaaa,bbb,132,a.g.n.
I want to walk through every line, char by char, and put the data found before each "," into an output file.
if [ $# -eq 2 ] && [ -f $1 ]
then
echo "Read nr of fields to be saved or nr of commas."
read n
nrLines=$(wc -l < $1)
while $nrLines!="1" read -r line || [[ -n "$line" ]]; do
do
for (( i=1; i<=$n; ++i ))
do
while [ read -r -n1 temp ]
do
if [ temp != "," ]
then
echo $temp > $(result$i)
else
fi
done
paste -d"\n" $2 $(result$i)
done
nrLines=$($nrLines-1)
done
else
echo "File not found!"
fi
}
In parameter $2 I have an empty file in which I will store the data from file $1 after I extract it without the " , " and add a couple of comments.
Example:
My input_file contains:
a.b.c.d,aabb,comp,dddd
My output_file is empty.
I call my script: ./script.sh input_file output_file
After execution the output_file contains:
First line info: a.b.c.d
Second line info: aabb
Third line info: comp
(yes, without the 4th line info)
You can do what you want very simply with parameter-expansion and substring-removal using bash alone. For example, take an example file:
$ cat dat/10lines.txt
aaaa,bbb,132,a.g.n.
aaaa,bbb,133,a.g.n.
aaaa,bbb,134,a.g.n.
aaaa,bbb,135,a.g.n.
aaaa,bbb,136,a.g.n.
aaaa,bbb,137,a.g.n.
aaaa,bbb,138,a.g.n.
aaaa,bbb,139,a.g.n.
aaaa,bbb,140,a.g.n.
aaaa,bbb,141,a.g.n.
A simple one-liner using native bash string handling could simply be the following and give the following results:
$ while read -r line; do echo ${line%,*}; done <dat/10lines.txt
aaaa,bbb,132
aaaa,bbb,133
aaaa,bbb,134
aaaa,bbb,135
aaaa,bbb,136
aaaa,bbb,137
aaaa,bbb,138
aaaa,bbb,139
aaaa,bbb,140
aaaa,bbb,141
Parameter expansion with substring removal works as follows:
var=aaaa,bbb,132,a.g.n.
Beginning at the left and removing up to, and including, the first ',' is:
${var#*,} # bbb,132,a.g.n.
Beginning at the left and removing up to, and including, the last ',' is:
${var##*,} # a.g.n.
Beginning at the right and removing up to, and including, the first ',' is:
${var%,*} # aaaa,bbb,132
Beginning at the right and removing up to, and including, the last ',' (the one furthest from the right) is:
${var%%,*} # aaaa
Note: the pattern to remove above uses the wildcard '*', but a wildcard is not required; the pattern can be any allowable text. For example, to cut everything after ,136 on a line that contains it, you can do the following:
${var%,136*},136 # aaaa,bbb,136 (with var=aaaa,bbb,136,a.g.n.)
Apply this only to lines that contain ,136: on any other line the expansion removes nothing and the trailing ,136 is still appended.
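As a quick way to check the four expansions, and to show how they can be chained to pull out the first three fields the question asks for, here is a small sketch. The variable names first, second, third and rest are chosen only for this example.

```bash
#!/bin/bash
var='aaaa,bbb,132,a.g.n.'

echo "${var#*,}"    # shortest match removed from the front: bbb,132,a.g.n.
echo "${var##*,}"   # longest match removed from the front:  a.g.n.
echo "${var%,*}"    # shortest match removed from the back:  aaaa,bbb,132
echo "${var%%,*}"   # longest match removed from the back:   aaaa

# Peel off one field at a time: take the text before the first comma,
# then drop that field and its comma from the remainder.
first=${var%%,*};   rest=${var#*,}
second=${rest%%,*}; rest=${rest#*,}
third=${rest%%,*}
printf '%s\n' "$first" "$second" "$third"
```

The last command prints aaaa, bbb and 132 on separate lines, without starting any external process.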
To print the 2016th line of a file named file.txt, run a command like this:
sed -n '2016p' < file.txt
More examples:
sed -n '2p' < file.txt
prints the 2nd line,
sed -n '2011p' < file.txt
the 2011th line,
sed -n '10,33p' < file.txt
lines 10 through 33, and
sed -n '1p;3p' < file.txt
the 1st and 3rd lines, and so on.
For more detail, please have a look at this tutorial and this answer.
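One possible refinement for large files: with only the p command, sed keeps reading to the end of the file after it has printed the requested line. Adding q makes it stop right there. A minimal sketch, where print_line is a name made up for this example:

```bash
#!/bin/bash
# print_line: hypothetical helper name. Prints line N ($1) of FILE ($2) and
# quits immediately instead of scanning the rest of the file.
print_line() {
    local n=$1 file=$2
    sed -n "${n}{p;q;}" "$file"
}

sample=$(mktemp)
printf '%s\n' alpha beta gamma delta > "$sample"
print_line 3 "$sample"
rm -f "$sample"
```

The call above prints gamma. Nothing is printed if the file has fewer than N lines.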
In native bash the following should do what you want, assuming you replace the contents of your script.sh with the below:
#!/bin/bash
IN_FILE=${1}
OUT_FILE=${2}
IFS=\,
while read line; do
set -- ${line}
for ((i=1; i<=${#}; i++)); do
((${i}==4)) && continue
((n+=1))
printf '%s\n' "Line ${n} info: ${!i}"
done
done < ${IN_FILE} > ${OUT_FILE}
This prints every field of each input line except the 4th, each on a new line in the output file (I assume this is your requirement as per your comment?).
[wspace@wspace sandbox]$ awk -F"," 'BEGIN{OFS="\n"}{for(i=1; i<=NF-1; i++){print "line Info: "$i}}' data.txt
line Info: a.b.c.d
line Info: aabb
line Info: comp
This little snippet can ignore the last field.
updated:
#!/usr/bin/env bash
if [ ! -f "$1" -o $# -ne 2 ];then
echo "Usage: $(basename $0) input_file out_file"
exit 127
fi
input_file=$1
output_file=$2
: > $output_file
if [ "$(wc -l < $1)" -ne 0 ];then
while true
do
read -r -n1 char
if [ "$char" == "" ];then
break
elif [ "$char" != "," ];then
temp=$temp$char
else
echo "line info: $temp" >> $output_file
temp=""
fi
done < $input_file
else
echo "file $1 is empty"
fi
Maybe this is what you want
Did you try
sed "s|,|\n|g" $1 | head -n -1 > $2
I assume that only the last word would not have a comma on its right.
Try this (tested with your sample line):
#!/bin/bash
# script.sh
echo "Number of fields to save ?"
read -r nf
while IFS=',' read -r -a arr; do
    newarr=("${arr[@]:0:nf}")     # keep only the first nf fields of this line
    printf "%s\n" "${newarr[@]}"  # one field per output line
done < "$1" > "$2"
Execute script with :
$ ./script.sh inputfile outputfile
Number of fields to save ?
3
$ cat outputfile
a.b.c.d
aabb
comp
All comma-separated words on a line are stored in the array arr.
The temporary array newarr keeps only the first nf elements (nf is the value read from the user).
The elements of newarr are then printed, one per line, to $2, the output file.

How do I use Head and Tail to print specific lines of a file

I want to output, say, lines 5-10 of a file, with the file and the line numbers passed in as arguments.
How could I use head and tail to do this?
where firstline = $2 and lastline = $3 and filename = $1.
Running it should look like this:
./lines.sh filename firstline lastline
head -n XX # <-- print first XX lines
tail -n YY # <-- print last YY lines
If you want lines from 20 to 30 that means you want 11 lines starting from 20 and finishing at 30:
head -n 30 file | tail -n 11
#
# first 30 lines
# last 11 lines from those previous 30
That is, you first get the first 30 lines and then select the last 11 of those (that is, 30-20+1).
So in your code it would be:
head -n $3 $1 | tail -n $(( $3-$2 + 1 ))
Based on firstline = $2, lastline = $3, filename = $1
head -n $lastline $filename | tail -n $(( $lastline -$firstline + 1 ))
Aside from the answers given by fedorqui and Kent, you can also use a single sed command:
#!/bin/sh
filename=$1
firstline=$2
lastline=$3
# Basics of sed:
# 1. sed commands have a matching part and a command part.
# 2. The matching part matches lines, generally by number or regular expression.
# 3. The command part executes a command on that line, possibly changing its text.
#
# By default, sed will print everything in its buffer to standard output.
# The -n option turns this off, so it only prints what you tell it to.
#
# The -e option gives sed a command or set of commands (separated by semicolons).
# Below, we use two commands:
#
# ${firstline},${lastline}p
# This matches lines firstline to lastline, inclusive
# The command 'p' tells sed to print the line to standard output
#
# ${lastline}q
# This matches line ${lastline}. It tells sed to quit. This command
# is run after the print command, so sed quits after printing the last line.
#
sed -ne "${firstline},${lastline}p;${lastline}q" < ${filename}
Or, to avoid any external utilities, if you're using a recent version of bash (or zsh):
#!/bin/bash
filename=$1
firstline=$2
lastline=$3
i=0
exec <"${filename}"          # redirect file into our stdin
while IFS= read -r ; do      # read each line, unmodified, into the REPLY variable
    i=$(( i + 1 ))           # maintain line count
    if [ "$i" -ge "${firstline}" ] ; then
        if [ "$i" -gt "${lastline}" ] ; then
            break
        else
            printf '%s\n' "${REPLY}"
        fi
    fi
done
try this one-liner:
awk -vs="$begin" -ve="$end" 'NR>=s&&NR<=e' "$f"
in above line:
$begin is your $2
$end is your $3
$f is your $1
Save this as "script.sh":
#!/bin/sh
filename="$1"
firstline=$2
lastline=$3
linestoprint=$(($lastline-$firstline+1))
tail -n +$firstline "$filename" | head -n $linestoprint
There is NO ERROR HANDLING (for simplicity), so you have to call your script as follows:
./script.sh yourfile.txt firstline lastline
$ ./script.sh yourfile.txt 5 10
If you need only line "10" from yourfile.txt:
$ ./script.sh yourfile.txt 10 10
Please make sure that:
(firstline > 0) AND (lastline > 0) AND (firstline <= lastline)
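If you would rather have the script check those conditions itself, one way to sketch it is shown below. The function name print_range and the error messages are invented for this example; the tail/head pipeline is the same one used in script.sh.

```bash
#!/bin/bash
# print_range: hypothetical helper name. Prints lines FIRST..LAST of FILE
# after checking the conditions listed above.
print_range() {
    local file=$1 first=$2 last=$3
    [[ -r $file ]] ||
        { echo "error: cannot read '$file'" >&2; return 1; }
    # both bounds must be positive integers (firstline > 0, lastline > 0)
    [[ $first =~ ^[1-9][0-9]*$ && $last =~ ^[1-9][0-9]*$ ]] ||
        { echo "error: line numbers must be positive integers" >&2; return 1; }
    # firstline <= lastline
    (( first <= last )) ||
        { echo "error: firstline must not exceed lastline" >&2; return 1; }
    tail -n +"$first" "$file" | head -n "$(( last - first + 1 ))"
}

sample=$(mktemp)
seq 1 20 > "$sample"
print_range "$sample" 5 10     # prints the numbers 5 through 10
rm -f "$sample"
```

Invalid arguments produce a message on standard error and a non-zero return status instead of unexpected output.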

bash: only process line if not in second file

I have this block of code:
while IFS=$'\n' read -r line || [[ -n "$line" ]]; do
if [ "$line" != "" ]; then
echo -e "$lanIP\t$line" >> /tmp/ipList;
fi
done < "/tmp/includeList"
I know this must be really simple. But I have another list (/tmp/excludeList). I only want to echo the line within my while loop if the line isn't found in my excludeList. How do I do that? Is there some awk statement or something?
You can do this with grep alone:
$ cat file
blue
green
red
yellow
pink
$ cat exclude
green
pink
$ grep -vx -f exclude file
blue
red
yellow
The -v flag tells grep to output only the lines in file that do not match any line of exclude, and the -x flag forces whole-line matching.
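One caveat worth illustrating: without -F, each line of the exclude file is treated as a regular expression, so a '.' in it matches any character. The sketch below compares the two forms on made-up host names; filter_excluded is a hypothetical helper name, and the temporary files exist only for the demonstration.

```bash
#!/bin/bash
# filter_excluded: hypothetical helper name. Prints the lines of the include
# file ($1) that do not appear verbatim as a whole line in the exclude file ($2).
filter_excluded() {
    grep -vxF -f "$2" "$1"
}

demo_dir=$(mktemp -d)
printf '%s\n' db.local db-local web.local > "$demo_dir/include"
printf '%s\n' 'db.local'                  > "$demo_dir/exclude"

# Regex matching: the '.' in db.local also matches the '-' in db-local
grep -vx -f "$demo_dir/exclude" "$demo_dir/include"
echo ---
# Fixed-string matching: only the literal line db.local is dropped
filter_excluded "$demo_dir/include" "$demo_dir/exclude"
rm -rf "$demo_dir"
```

The first command prints only web.local, while the fixed-string version prints db-local and web.local.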
use grep
while IFS=$'\n' read -r line || [[ -n "$line" ]]; do
if [[ -n ${line} ]] \
&& ! grep -xF "$line" excludefile &>/dev/null; then
echo -e "$lanIP\t$line" >> /tmp/ipList;
fi
done < "/tmp/includeList"
-n ${line} is true if $line is not empty.
grep returns true if $line is found in the exclude file; the ! inverts that, so the condition is true when the line is not found.
-x matches whole lines only, so nothing else can appear on the line.
-F means fixed string, so any regex metacharacters that end up in $line are matched literally.
Hope this helps.
With awk:
awk -v ip="$lanIP" -v OFS="\t" '
NR==FNR {exclude[$0]=1; next}
/[^[:space:]]/ && !($0 in exclude) {print ip, $0}
' /tmp/excludeList /tmp/includeList > /tmp/ipList
This reads the exclude list into an array (as the array keys) -- the NR==FNR condition is true while awk is reading the first file from the arguments. Then, while reading the include file, if the current line contains a non-space character and it does not exist in the exclude array, print it.
The equivalent with grep:
grep -vxF -f /tmp/excludeList /tmp/includeList | while IFS= read -r line; do
    [[ -n "$line" ]] && printf "%s\t%s\n" "$lanIP" "$line"
done > /tmp/ipList
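If you prefer to stay inside the original while loop, the lookup-table idea from the awk answer can also be sketched with a bash associative array (bash 4 or later). The function name print_included and its argument order are assumptions made for this example, and unlike the awk version it skips only completely empty lines, not whitespace-only ones.

```bash
#!/bin/bash
# Requires bash 4+ for associative arrays.
# print_included: hypothetical helper name.
#   $1 = prefix to print before each line (for example the LAN IP)
#   $2 = include list, $3 = exclude list
print_included() {
    local prefix=$1 include=$2 exclude=$3 line
    local -A skip=()
    # First pass: remember every non-empty line of the exclude list as a key
    while IFS= read -r line || [[ -n $line ]]; do
        [[ -n $line ]] && skip[$line]=1
    done < "$exclude"
    # Second pass: print include lines that are not keys of 'skip'
    while IFS= read -r line || [[ -n $line ]]; do
        [[ -n $line && -z ${skip[$line]:-} ]] && printf '%s\t%s\n' "$prefix" "$line"
    done < "$include"
    return 0
}

# Usage with the paths from the question:
#   print_included "$lanIP" /tmp/includeList /tmp/excludeList >> /tmp/ipList
```

This reads each file once, so the exclude list is not searched again for every line of the include list.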

Resources