Split file by multiple line breaks - bash

Let's say you have the following input file
Some text. It may contain line
breaks.

Some other part of the text

Yet an other part of
the text
And you want to iterate over each text part (separated by two line breaks (\n\n)), so that
in the first iteration I would only get:
Some text. It may contain line
breaks.
In the second iteration I would get:
Some other part of the text
And in the last iteration I would get:
Yet an other part of
the text
I tried this, but it doesn't seem to work because IFS only supports one character?
cat $inputfile | while IFS=$'\n\n' read part; do
# do something with $part
done

This is the solution of anubhava in pure bash:
#!/bin/bash
COUNT=1; echo -n "$COUNT: "
while read LINE
do
[ "$LINE" ] && echo "$LINE" || { (( ++COUNT )); echo -n "$COUNT: " ;}
done

Use awk with null RS:
awk '{print NR ":", $0}' RS= file
1: Some Text. It may contains line
breaks.
2: Some Other Part of the Text
3: Yet an other Part of
the Text
You can clearly see that your input file has 3 records now (each record is printed with record # in output).
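For readers who want to stay in bash and run a loop body once per part, the same idea can be sketched with a plain read loop that collects lines until it reaches an empty one. This is only a minimal sketch: the names for_each_part and show_part are made up for illustration, and it assumes parts are separated by one or more completely empty lines.

```shell
#!/usr/bin/env bash

# Hypothetical helper: call "$1" once per blank-line-separated part of stdin.
for_each_part() {
    local callback=$1 part="" line nl=$'\n'
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ -z $line ]]; then
            # An empty line ends the current part (runs of empty lines are ignored).
            if [[ -n $part ]]; then
                "$callback" "$part"
                part=""
            fi
        else
            # Append the line, with a newline between lines of the same part.
            part+="${part:+$nl}$line"
        fi
    done
    # Emit the final part if the input did not end with an empty line.
    if [[ -n $part ]]; then
        "$callback" "$part"
    fi
}

# Hypothetical callback: print one part between square brackets.
show_part() {
    printf '[%s]\n' "$1"
}

for_each_part show_part <<'EOF'
Some text. It may contain line
breaks.

Some other part of the text

Yet an other part of
the text
EOF
```

The callback must not read from standard input itself, otherwise it would consume lines meant for the loop.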

Related

Detect double new lines with bash script

I am attempting to return the line number of lines that have a break. An input example:
2938
383

3938
3

383
33333
But my script is not working and I can't see why. My script:
input="./input.txt"
declare -i count=0
while IFS= read -r line;
do
((count++))
if [ "$line" == $'\n\n' ]; then
echo "$count"
fi
done < "$input"
So I would expect, 3, 6 as output.
I just receive a blank response in the terminal when I execute. So there isn't a syntax error, something else is wrong with the approach I am taking. Bit stumped and grateful for any pointers..
Also "just use awk" doesn't help me. I need this structure for additional conditions (this is just a preliminary test) and I don't know awk syntax.
The issue is that "$line" == $'\n\n' can never match: read consumes the newline, so an empty input line leaves $line empty rather than containing newlines. Instead, you can match an empty line with the regex pattern ^$:
if [[ "$line" =~ ^$ ]]; then
Now it should work.
It's also much easier with the awk command:
$ awk '$0 == ""{ print NR }' test.txt
3
6
As Roman suggested, line read by read terminates with a delimiter, and that delimiter would not show up in the line the way you're testing for.
If the pattern you are searching for looks like an empty line (which I infer is how a "double newline" always manifests), then you can just test for that:
while read -r; do
((count++))
if [[ -z "$REPLY" ]]; then
echo "$count"
fi
done < "$input"
Note that IFS is for field-splitting data on lines, and since we're only interested in empty lines, IFS is moot.
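The point about the delimiter can be checked directly: read hands back an empty string for an empty line, never a string containing newlines. A minimal sketch, assuming the input arrives on stdin (the function name empty_line_numbers is made up for illustration):

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the line number of every empty line on stdin.
empty_line_numbers() {
    local line count=0
    while IFS= read -r line; do
        count=$((count + 1))
        # read has already removed the trailing newline, so an empty
        # input line arrives here as the empty string.
        if [[ -z $line ]]; then
            printf '%d\n' "$count"
        fi
    done
}

# Same data as the question, with empty lines at positions 3 and 6.
printf '2938\n383\n\n3938\n3\n\n383\n33333\n' | empty_line_numbers
```

The sketch uses count=$((count + 1)) rather than ((count++)) because the latter returns a non-zero status when the old value is 0, which matters if the script runs under set -e.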
Or if the file is small enough to fit in memory and you want something faster:
mapfile -t -O1 foo < i
declare -p foo
for n in "${!foo[@]}"; do
if [[ -z "${foo[$n]}" ]]; then
echo "$n"
fi
done
Reading the file all at once (mapfile) then stepping through an array may be easier on resources than stepping through a file line by line.
You can also just use GNU awk:
gawk -v RS= -F '\n' '{ print (i += NF); i += length(RT) - 1 }' input.txt
By using FS = ".+", it ensures only truly zero-length (i.e. $0 == "") line numbers get printed, while skipping rows consisting entirely of [[:space:]]'s
echo '2938
383

3938
3

383
33333' |
{m,g,n}awk -F'.+' '!NF && $!NF = NR'
3
6
This sed one-liner should do the job at once:
sed -n '/^$/=' input.txt
Simply writes the current line number (the = command) if the line read is empty (the /^$/ matches the empty line).

bash print whole line after splitting line with ifs

When awk splits a line into fields using a delimiter, it maintains the original line in the $0 variable. Thus it can print the original line (assuming nothing else modifies $0) after performing operations on the individual fields. Can bash's read do something similar, where it has not only the individual elements but also the entire line?
E.g., with the following input.txt
foo,bar
baz,quz
The awk behavior that I'm trying to imitate in bash is:
awk -F, '($1 == "baz") {print $0}' input.txt
This will print baz,quz because $0 is the whole line that was read, even though the line was also split into two fields ($1 and $2).
Bash:
while IFS=, read -r first second; do
if [[ "$first" == lemur ]]; then
# echo the entire line
fi
done < input.txt
In this simple case it wouldn't be too difficult to recreate the original line by echoing the $first and $second variables with a comma between them. But in more complex scenarios where IFS may be more than one character and there may be many fields, it becomes much harder to accurately recreate the original line unless bash is maintaining it during the read operation.
Probably you'll have to do it with 2 different reads like
while IFS= read -r line; do
    # IFS= and -r keep the line exactly as it appears in the file
    IFS=, read -r first second <<<"$line"
    if [[ $first == lemur ]]; then
        printf '%s\n' "$line"
    fi
done < input.txt
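When the number of fields is not known in advance, the same two-step idea can be sketched with an array: keep the untouched line in one variable and split a copy of it with read -a. The function name print_matching and its argument are assumptions for illustration only.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every whole line of stdin whose first
# comma-separated field equals "$1".
print_matching() {
    local wanted=$1 line
    local -a fields
    # IFS= and -r keep the line exactly as it appears in the input.
    while IFS= read -r line; do
        # Split a copy of the line; "$line" itself stays untouched.
        IFS=, read -r -a fields <<< "$line"
        if [[ ${fields[0]} == "$wanted" ]]; then
            printf '%s\n' "$line"
        fi
    done
}

printf 'foo,bar\nbaz,quz\n' | print_matching baz
```

Here "$line" plays the role of awk's $0 and "${fields[0]}" the role of $1.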

How to separate a line into an array with white space in shell scripting

I can't figure out why my script is not displaying the string separated by white space.
This is my code:
While read -r row
do
line = ($row)
for word in $line
do
echo ${word[0]}
done
done < $1
Say the line is "add $s0 $s0 $t1".
I want the output to be "add".
While read -r row
This will try to run a command called While, you'll probably get an error for that. The shell keyword is while.
do
line = ($row)
This will not assign anything: the shell parses line as a command name (some systems ship a line utility that reads one line), and in bash the unquoted parenthesis that follows is a syntax error. Either way, it is probably not what you want. Assignments in the shell must not have whitespace around the equal sign.
If that assignment worked, it would make an array called line.
for word in $line
Referencing the array just by name expands to the first item of it, so the loop is useless here.
do
echo ${word[0]}
And here, indexing is not very useful since word is going to be a single value, not an array.
I suspect what you want is this:
while read -r row ; do
words=($row);
echo "${words[0]}"
done
Though if $row contains glob characters like *, they'll be expanded to matching filenames.
This would be better:
read -r -a words
echo "${words[0]}"
or simply
read -r line
echo "${line%% *}" # remove everything after the first space
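Put together as a loop over a file, the read -a variant might look like the sketch below. It assumes the file name arrives as the first argument, as in the question; the function name first_words is made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the first whitespace-separated word of
# every line in the file named by "$1".
first_words() {
    local -a words
    # read -a splits on IFS without pathname expansion, so a "*" in the
    # input stays a literal "*".
    while read -r -a words; do
        # Skip lines that contain no words at all.
        if (( ${#words[@]} > 0 )); then
            printf '%s\n' "${words[0]}"
        fi
    done < "$1"
}

# Demo with a temporary file; the quoted 'EOF' keeps $s0 and $t1 literal.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
add $s0 $s0 $t1
* is not expanded
EOF
first_words "$tmp"
rm -f "$tmp"
```

Unlike words=($row), this form does not expand glob characters against the files in the current directory.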
This works fine:
while read -r row
do
    echo "$row" | awk '{print $1}'
done
while read -r row reads a line from standard input (the terminal, unless input is redirected) and stores it in the row variable; awk '{print $1}' prints only the first word of that line.
Do you want each token on a separate line? Why not just use sed?
$ echo "1 2 3 hi" | sed -r 's/[ \t]+/\n/g'
1
2
3
hi
If you want the first word of each line, then:
$ echo "1 2 3 hi" | sed -r 's/^([^ \t]+).+/\1/'
1
If it's a file, then remove "echo ... | " and just give the filename as a parameter to sed:
$ sed -r 's/^([^ \t]+).+/\1/' file.txt

search lines in bash for specific character and display line

I am trying to search the lines of a file in bash and echo each line that contains the + character, followed by the text "is a special case". The code does run, but I get both lines in the input file displayed. Thank you :)
#!/bin/bash
printf "Please enter the variant the following are examples"
echo " c.274G>T or c.274-10G>A"
printf "variant(s), use a comma between multiple: "; IFS="," read -a variant
for ((i=0; i<${#variant[@]}; i++))
do printf "NM_000163.4:%s\n" ${variant[$i]} >> c:/Users/cmccabe/Desktop/Python27/input.txt
done
awk '{for(i=1;i<=NF;++i)if($i~/+/)print $i}' input.txt
echo "$i" "is a special case"
input.txt
NM_000163.4:c.138C>A
NM_000163.4:c.266+83G>T
desired output ( this line contains a + in it)
NM_000163.4:c.266+83G>T is a special case
edit:
looks like I need to escape the + and that is part of my problem
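You can change your awk script as below and get rid of echo. Escaping the + keeps the pattern portable, since not every awk implementation accepts a bare + as a regular expression.
$ awk '/\+/{print $0,"is a special case"}' file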
NM_000163.4:c.266+83G>T is a special case
As far as I understand your problem, you can do it with a single sed command:
sed -n '/+/ {s/$/ is a special case/ ; p}' input.txt
On lines containing +, it replaces the end ($) with your text, thus appending it. After that the line is printed.
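If the check should stay inside the bash script itself, a glob pattern in [[ ... ]] avoids regular expressions altogether, so the + needs no escaping. A minimal sketch, assuming the variants are read from stdin (the function name flag_special is hypothetical):

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the lines of stdin that contain a literal "+",
# followed by the text "is a special case".
flag_special() {
    local line
    while IFS= read -r line; do
        # In a [[ ... == pattern ]] glob match, "+" has no special meaning here.
        if [[ $line == *+* ]]; then
            printf '%s is a special case\n' "$line"
        fi
    done
}

printf '%s\n' 'NM_000163.4:c.138C>A' 'NM_000163.4:c.266+83G>T' | flag_special
```

With the two sample lines from the question, only the second one is printed.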

Bash script get item from array

I'm trying to read file line by line in bash.
Every line has format as follows text|number.
I want to produce file with format as follows text,text,text etc. so new file would have just text from previous file separated by comma.
Here is what I've tried and couldn't get it to work :
FILENAME=$1
OLD_IFS=$IFS
IFS=$'\n'
i=0
for line in $(cat "$FILENAME"); do
array=(`echo $line | sed -e 's/|/,/g'`)
echo ${array[0]}
i=i+1;
done
IFS=$OLD_IFS
But this prints both the text and the number, just in a different format: text number
here is sample input :
dsadadq-2321dsad-dasdas|4212
dsadadq-2321dsad-d22as|4322
here is sample output:
dsadadq-2321dsad-dasdas,dsadadq-2321dsad-d22as
What did I do wrong?
Not pure bash, but you could do this in awk:
awk -F'|' 'NR>1{printf(",")} {printf("%s",$1)}'
Alternately, in pure bash and without having to strip the final comma:
#!/bin/bash
# You can get your input from somewhere else if you like. Even stdin to the script.
input=$'dsadadq-2321dsad-dasdas|4212\ndsadadq-2321dsad-d22as|4322\n'
# Output should be reset to empty, for safety.
output=""
# Step through our input. (I don't know your column names.)
while IFS='|' read -r left right; do
# Only add a field if it exists. Salt to taste.
if [[ -n "$left" ]]; then
# Append data to output string
output="${output:+$output,}$left"
fi
done <<< "$input"
echo "$output"
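The ${output:+$output,} expansion is what avoids the trailing comma: it expands to nothing while output is empty, and to the old value plus a comma otherwise. A small sketch of the same idea as a reusable function, where the name join_first_fields is an assumption for illustration and the input is read from stdin:

```shell
#!/usr/bin/env bash

# Hypothetical helper: join the text before the first "|" of each stdin
# line with commas.
join_first_fields() {
    local left rest output=""
    while IFS='|' read -r left rest; do
        # Skip lines whose first field is empty (for example blank lines).
        if [[ -n $left ]]; then
            # ${output:+$output,} is empty on the first pass and
            # "<previous value>," on every later pass.
            output="${output:+$output,}$left"
        fi
    done
    printf '%s\n' "$output"
}

printf 'dsadadq-2321dsad-dasdas|4212\ndsadadq-2321dsad-d22as|4322\n' | join_first_fields
```

Because the comma is added before each new entry rather than after it, nothing has to be stripped at the end.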
No need for arrays and sed:
while IFS='' read -r line ; do
echo -n "${line%|*}",
done < "$FILENAME"
You just have to remove the last comma :-)
Using sed:
$ sed ':a;N;$!ba;s/|[0-9]*\n*/,/g;s/,$//' file
dsadadq-2321dsad-dasdas,dsadadq-2321dsad-d22as
Alternatively, here is a bit more readable sed with tr:
$ sed 's/|.*$/,/g' file | tr -d '\n' | sed 's/,$//'
dsadadq-2321dsad-dasdas,dsadadq-2321dsad-d22as
Choroba has the best answer (imho) except that it does not handle blank lines and it adds a trailing comma. Also, mucking with IFS is unnecessary.
This is a modification of his answer that solves those problems:
while read -r line ; do
if [ -n "$line" ]; then
if [ -n "$afterfirst" ]; then echo -n ,; fi
afterfirst=1
echo -n "${line%|*}"
fi
done < "$FILENAME"
The first if just filters out blank lines. The second if and the $afterfirst variable prevent the extra comma: a comma is echoed before every entry except the first one. ${line%|*} is a bash parameter expansion that deletes the end of a parameter's value if it matches a pattern. line is the parameter, % is the symbol that indicates the shortest matching trailing pattern should be deleted, and |* is the pattern to delete.
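The % operator removes the shortest matching suffix, while the related %% operator removes the longest one. The difference only shows on a value with more than one |, so the sample value below is made up for illustration:

```shell
#!/usr/bin/env bash

# Made-up sample value with two "|" characters.
line='text|12|34'

# % removes the shortest suffix matching "|*".
shortest=${line%|*}
# %% removes the longest suffix matching "|*".
longest=${line%%|*}

# Prints "text|12" and then "text".
printf '%s\n' "$shortest" "$longest"
```

For the text|number format in the question both forms give the same result, since each line contains only one |.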

Resources