Find a line with a single word and merge it with the next line - shell

I have an issue with grep that I can't sort out.
What I have.
A listing of firstnames and lastnames, like:
John Doe
Alice Smith
Bob Smith
My problem.
Sometimes, firstname and lastname are disjointed, like:
Alice
Smith
Bob Doolittle
Mark
Von Doe //sometimes, there are more than one word on the next line
What I'd like to achieve.
Concatenate the "orphan" name with the next line.
Alice Smith
Bob Doolittle
Mark Von Doe
What I already tried
grep -ozP "^\w+\n\w.+" file | tr '\n' ' '
So, here I ask grep to find a line with just one word and concatenate it with the following line, even if this next line has more than one word.
It works correctly, but only if the isolated word is at the very beginning of the file. If it appears below the first line, grep does not spot it. So a quick and dirty solution where I would loop through the file and remove a line after each pass doesn't work for me.

If awk is acceptable:
awk '
NF==1 {printf "%s ",$1; getline; print; next}
1' names.dat
Where:
NF==1 - if only one name/field in the current record ...
printf / getline / print / next - print field #1, read next line and print it, then skip to next line
1 - print all other lines as is
As a one-liner:
awk 'NF==1{printf "%s ",$1;getline;print;next}1' names.dat
This generates:
Alice Smith
Bob Doolittle
Mark Von Doe //sometimes, there are more than one word on the next line
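One caveat with the getline approach above: if a single-word line happens to be the last line of the file, getline fails, $0 is left unchanged, and the same word is printed twice. Below is a minimal sketch of a guarded variant. The function name merge_orphans and the variable names first and nextline are my own, not part of the answer.

```shell
# merge_orphans (hypothetical name): reads names on stdin and joins each
# single-word line with the line that follows it.
merge_orphans() {
  awk '
    NF == 1 {
      first = $1
      # getline returns 1 on success, 0 at end of input
      if ((getline nextline) > 0)
        print first, nextline
      else
        print first            # orphan on the last line: print it once
      next
    }
    { print }                  # every other line is printed unchanged
  '
}

printf '%s\n' 'Alice' 'Smith' 'Bob Doolittle' 'Mark' 'Von Doe' | merge_orphans
```

Reading the next line into a separate variable keeps $1 intact, so the two halves can be printed with a single print statement.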

You can use GNU sed like this:
sed -E -i '/^[^[:space:]]+$/{N;s/\n/ /}' file
See the sed demo:
s='Alice
Smith
Bob Doolittle
Mark
Von Doe //sometimes, there are more than one word on the next line'
sed -E '/^[^[:space:]]+$/{N;s/\n/ /}' <<< "$s"
Output:
Alice Smith
Bob Doolittle
Mark Von Doe //sometimes, there are more than one word on the next line
Details:
/^[^[:space:]]+$/ finds a line with no whitespace
{N;s/\n/ /} - reads in the next line, and appends a newline char with this new line to the current pattern space, and then s/\n/ / replaces this newline char with a space.

Use this Perl one-liner:
perl -lane 'BEGIN { $is_first_name = 1; } if ( @F == 1 && $is_first_name ) { @prev = @F; $is_first_name = 0; } else { print join " ", @prev, @F; $is_first_name = 1; @prev = (); }' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.

Using awk:
awk '
{f=$2 ? 1 : 0}
v==1{v=0; print; next}
f==0{v=1; printf "%s ", $1; next}
1
' file
Output
Alice Smith
Bob Doolittle
Mark Von Doe

This might work for you (GNU sed):
sed -E 'N;s/^(\S+)\n/\1 /;P;D' file
Append the next line.
If the first line in the pattern space contains one word only, replace the following newline with a space.
Print/delete the first line and repeat.
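The three steps above can be tried on the sample data with the sketch below. It swaps \S for the POSIX class [^[:space:]] so that it does not depend on the GNU-only escape. The function name merge_single_word_lines is hypothetical. Note one assumption: when N runs on the final line with nothing left to read, GNU sed prints the pattern space before exiting, while a strictly POSIX sed may discard it.

```shell
# merge_single_word_lines (hypothetical name): N;P;D sliding window over stdin.
merge_single_word_lines() {
  # N  append the next line to the pattern space
  # s  if the first line is a single word, turn its newline into a space
  # P  print up to the first newline
  # D  delete up to the first newline and restart without reading input
  sed -E 'N;s/^([^[:space:]]+)\n/\1 /;P;D'
}

printf '%s\n' 'Alice' 'Smith' 'Bob Doolittle' 'Mark' 'Von Doe' |
  merge_single_word_lines
```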

Related

Remove first two lines, last two lines and space from file and add quotes on each line and replace newline with commas in shell script

I have an input.txt file which needs to be formatted by a shell script with the following conditions:
Remove the first two lines and the last two lines.
Remove all spaces in each line (each line has two spaces at the beginning and one space at the end).
Each line should be within single quotes (' ').
At last, replace each newline ($) with a comma.
(original)
input.txt
sql
--------
Abce
Bca
Efr
-------
Row (3)
Desired output file
output.txt
'Abce','Bca','Efr'
I have tried using following commands
Sed -i 1,2d input.txt > input.txt
Sed "$(( $(wc -l <input.txt) -2+1)), $ d" Input.txt > input.txt
Sed ':a;N;$!ba;s/\n/, /g' input.txt > output.txt
But I get a blank output.txt.
Would you please try the following:
mapfile -t ary < <(tail -n +3 input.txt | head -n -2 | sed -E "s/^[[:blank:]]*/'/; s/[[:blank:]]*$/'/")
(IFS=,; echo "${ary[*]}")
tail -n +3 outputs the lines from the 3rd line onward.
head -n -2 outputs all lines except the last 2.
sed -E "s/^[[:blank:]]*/'/" removes leading whitespace and prepends a single quote.
Similarly, the sed command "s/[[:blank:]]*$/'/" removes trailing whitespace and appends a single quote.
The syntax <(command ..) is a process substitution; the output of the commands within the parentheses is fed to mapfile via the redirect.
mapfile -t ary reads lines from the standard input into the array variable named ary.
echo "${ary[*]}" expands to a single string with the contents of the array ary separated by the first character of IFS, which has just been set to a comma.
The assignment of IFS and the array expansion are enclosed in parentheses so that they run in a subshell. This prevents IFS from being modified in the current process.
With your shown samples, please try following awk program. Written and tested in GNU awk, should work with any version.
awk -v s1="'" -v lines="$(wc -l < Input_file)" '
BEGIN{ OFS="," }
FNR==(lines-1) {
print val
exit
}
FNR>2{
sub(/^[[:space:]]+/,"")
val=(val?val OFS:"") (s1 $0 s1)
}
' Input_file
Explanation: Adding detailed explanation for above code, this is only for explanation purposes.
awk -v s1="'" -v lines="$(wc -l < Input_file)" ' ##Starting awk program, setting s1 variable to ' and creating lines which has total number of lines in it, using wc -l command on Input_file file.
BEGIN{ OFS="," } ##Setting OFS to comma in BEGIN section of this program.
FNR==(lines-1) { ##Checking condition if its 2nd last line of Input_file.
print val ##Then printing val here.
exit ##exiting from program from here.
}
FNR>2{ ##Checking condition if FNR is greater than 2 then do following.
sub(/^[[:space:]]+/,"") ##Substituting initial spaces with NULL here.
val=(val?val OFS:"") (s1 $0 s1) ##Creating val which has ' current line ' in it and keep adding it in val.
}
' Input_file ##Mentioning Input_file name here.
If you know the input is small enough to fit in memory:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); out=out sep p2; sep="," }
{ p2=p1; p1=$0 }
END { print out }
' input.txt
'Abce','Bca','Efr'
Otherwise:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); printf "%s%s", sep, p2; sep="," }
{ p2=p1; p1=$0 }
END { print "" }
' input.txt
'Abce','Bca','Efr'
Either script will work using any awk in any shell on every Unix box.
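The p1/p2 variables in the scripts above act as a two-line delay buffer: a line is only handled once two newer lines have been read, so the last two lines are never emitted. A sketch of the same idea generalized to the last n lines is shown below. The function name drop_last is hypothetical, n is assumed to be at least 1, and the surrounding pipeline is just one possible way to combine it with tail, sed and paste.

```shell
# drop_last N (hypothetical name): print stdin without its last N lines,
# keeping only N lines in memory. Assumes N >= 1.
drop_last() {
  awk -v n="$1" '
    NR > n { print buf[NR % n] }   # the line read n records ago is now safe to print
    { buf[NR % n] = $0 }           # remember the current line in a circular buffer
  '
}

printf '%s\n' 'sql' '--------' '  Abce ' '  Bca ' '  Efr ' '-------' 'Row (3)' |
  tail -n +3 |
  drop_last 2 |
  sed "s/^[[:space:]]*/'/; s/[[:space:]]*\$/'/" |
  paste -sd, -
```

This avoids head -n -2, which is a GNU extension.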
This might work for you (GNU sed):
sed -E '1,2d;$!H;$!d;x;s/^\s*(.*)\s*$/'\''\1'\''/mg;s/\n[^\n]*$//;y/\n/,/' file
Delete the first two lines.
Append each line to the hold space, except for the last (this means the second from last line will still be present - see later).
Delete all lines except for the last.
Swap to the hold space.
Remove all spaces either side of the words on each line and surround those words by single quotes.
Remove the last line and its newline.
Replace all newlines by commas.
The first sed -i overwrites input.txt with an empty file. You can't write output back to the file you are reading, and sed -i does not produce any output anyway.
The minimal fix is to take out the -i and string together the commands into a pipeline; but of course, sed allows you to combine the commands into a single script.
len=$(wc -l <input.txt)
sed -e '1,2d' -e "$((len - 3))"',$d' \
-e ':a' \
-e 's/^ \(.*\) $/'"'\\1'/" \
-e N -e '$!ba' -e 's/\n/, /g' input.txt >output.txt
(Untested; if your sed does not allow multiple -e options, needs refactoring to use a single string with semicolons or newlines between the commands.)
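The truncation problem described at the start of this answer can be reproduced safely in a scratch directory. The sketch below uses made-up file contents and variable names; it first shows the unsafe redirect, then the usual pattern of writing to a temporary file and renaming it.

```shell
workdir=$(mktemp -d)
file=$workdir/input.txt

printf '%s\n' one two three > "$file"
# Unsafe: the shell truncates "$file" for the redirection before sed reads it.
sed '1d' "$file" > "$file"
if [ -s "$file" ]; then unsafe_result='kept data'; else unsafe_result='empty file'; fi
echo "unsafe redirect: $unsafe_result"

printf '%s\n' one two three > "$file"
# Safer: write to a temporary file, then replace the original on success.
sed '1d' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
safe_result=$(cat "$file")
echo "temp file + mv: $safe_result"

rm -rf "$workdir"
```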
This is hard to write and debug and brittle because of the ways you have to combine the quoting features of the shell with the requirements of sed and this particular script, but also more inherently because sed is a terse and obscure language.
A much more legible and maintainable solution is to switch to Awk, which allows you to express the logic in more human terms, and avoid having to pull in support from the shell for simple tasks like arithmetic and string formatting.
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("\047%s\047,", $0); }
END { for(j=1; j < i-1; ++j) printf "%s", a[j] }' input.txt >output.txt
This literally replaces all newlines with commas; perhaps you would in fact like to print a newline instead of the comma on the last line?
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("%s\047%s\047", sep, $0); sep="," }
END { for(j=1; j < i-1; ++j) printf "%s", a[j]; printf "\n" }' input.txt >output.txt
If the input file is really large, you might want to refactor this to not keep all the lines in memory. The array a collects the formatted output and we print all its elements except the last two in the END block.
sed -E '
/^-+$/,/^-+$/!d
//d
s/^[[:space:]]*|[[:space:]]*$/'\''/g
' input.txt |
paste -sd ,
This uses a trick that doesn't work on all sed implementations, to print the lines between two patterns (the dashes in this case), excluding those patterns.
On the plus side if the ---- pattern is at a different line number, it still works. Down side is it breaks, if that pattern (a line containing only dashes) occurs an odd number of times (ie. not in pairs, that wrap the lines you want).
Then sub line start and end (including white space) with single quotes.
Finally pipe to paste to sub the new lines with commas, excluding a trailing comma.
Using sed
$ sed "1,2d; /-/,$ d; s/\s\+//;s/.*/'&'/" input_file | sed -z 's/\n/,/g;s/,$/\n/'
'Abce','Bca','Efr'
I'll post a sed solution which is rather light.
sed '$d' input.txt | sed "\$d; 1,2d; s/^\s*\|\s*$/'/g" | paste -sd ',' > output.txt
$d Remove last line with first sed
\$d Remove the last line. $ escaped with backslash as we are within double-quotes.
1,2d Remove the first two lines.
s/^\s*\|\s*$/'/g Replace all leading and trailing whitespace with single quotes.
Use paste to concatenate to a single, comma delimited strings.
If we know that the relevant lines always start with two spaces, then it can even be simplified further.
sed -n "s/\s*$/'/; s/^ /'/p" input.txt | paste -sd ',' > output.txt
-n suppress printing lines unless told to
s/\s*$/'/ replace trailing whitespace with single quotes
s/^ /'/p replace two leading spaces and print lines that match
paste to concat
Then an awk solution:
awk -v i=1 -v q=\' 'FNR>2 {
gsub(/^[[:space:]]*|[[:space:]]*$/, q)
a[i++]=$0
} END {
for(i=1; i<=length(a)-3; i++)
printf "%s,", a[i]
print a[i++]
}' input.txt > output.txt
-v i=1 create an awk variable starting at one
-v q=\' create an awk variable for the single quote character
FNR>2 { ... tells it to only process line 3+
gsub(/^[[:space:]]*|[[:space:]]*$/, q) substitute leading and trailing whitespace with single quotes
a[i++]=$0 add line to array
END { ... Process the rest after reaching end of file
for(i=1; i<=length(a)-3; i++) take the length of the array but subtract three -- representing the last three lines
printf "%s,", a[i] print all but last three entries comma delimited
print a[i++] print next entry and complete the script (skipping the last two entries)
Not a one liner but works
sed "s/^ */\'/;s/\$/\',/;1,2d;N;\$!P;\$!D;\$d" | sed ' H;1h;$!d;x;s/\n//g;s/,$//'
Explanation:
s/^ */\'/;s/\$/\',/ ---> Adds single quotes and comma
N;$!P;$!D;$d ---> Deletes last two lines
H;1h;$!d;x;s/\n//g;s/,$// ---> Loads the entire file, merges all lines, and removes the last comma

If a line has a length less than a number, append to its previous line

I have a file that looks like this:
ABCDEFGH
ABCDEFGH
ABC
ABCDEFGH
ABCDEFGH
ABCD
ABCDEFGH
Most of the lines have a fixed length of 8. But there are some lines in between that have a length less than 8. I need a simple line of code that appends each of those short lines to its previous line.
I have tried the following code but it takes lots of memory when working with large files.
cat FILENAME | awk 'BEGIN{OFS=FS="\t"}{print length($1), $1}' | tr '\n' '\t' | sed 's/8/\n/g' | awk 'BEGIN{OFS="";FS="\t"}{print $2, $4}'
The output I expect:
ABCDEFGH
ABCDEFGHABC
ABCDEFGH
ABCDEFGHABCD
ABCDEFGH
If perl is your option, please try:
perl -0777 -pe 's/(\n)(.{1,7})$/\2/mg' filename
-0777 option tells perl to slurp all lines.
The pattern (\n)(.{1,7}) matches to a line with length less than 8, assigning \1 to a newline and \2 to the string.
The replacement \2 does not contain the preceding newline and is appended to the previous line.
sed <FILENAME 'N;/\n.\{8\}/!s/\n//;P;D'
N; - append next line to pattern space
/\n.\{8\}/ - does second line contain 8 characters?
!s/\n//; - no: join the two lines
P - print first line of pattern space
D - delete first line of pattern space, start next cycle
Print each line without a trailing newline by default, and print a newline first whenever the current line has length 8, so that shorter lines end up appended to the previous line.
The first and last line are special.
awk 'NR==1 {printf "%s", $0; next}
length($0)==8 {printf "\n"}
{printf("%s",$0)}
END { printf "\n" }' FILENAME
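The same streaming idea can be wrapped in a function with the full line width as a parameter. This is a sketch under my own assumptions: the name join_short_lines is hypothetical, and a line counts as "full" when it is at least width characters long, whereas the answer tests for exactly 8. Only the current line is held in memory, so large files should not be a problem.

```shell
# join_short_lines [WIDTH] (hypothetical name): append every line shorter
# than WIDTH (default 8) to the line before it.
join_short_lines() {
  awk -v width="${1:-8}" '
    NR > 1 && length($0) >= width { printf "\n" }   # a full line ends the previous one
    { printf "%s", $0 }                             # always print without a newline
    END { if (NR) printf "\n" }                     # terminate the final line
  '
}

printf '%s\n' ABCDEFGH ABCDEFGH ABC ABCDEFGH ABCDEFGH ABCD ABCDEFGH | join_short_lines 8
```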
If you have GNU sed 4.2 or later (which supports the -z option), you can try the following.
EDIT (see comments): this is the inferior solution:
sed -rz 's/\n(.{0,7})\n/\1\n/g' FILENAME
If you like old traditional tools, you can use ed, the standard text editor:
printf '%s\n' 'g/^.\{,7\}$/-,.j' wq | ed -s filename

print 1st string of a line if last 5 strings match input

I have a requirement to print the first string of a line if last 5 strings match specific input.
Example: Specified input is 2
India;1;2;3;4;5;6
Japan;1;2;2;2;2;2
China;2;2;2;2
England;2;2;2;2;2
Expected Output:
Japan
England
As you can see, China is excluded as it doesn't meet the requirement (last 5 digits have to be matched with the input).
grep ';2;2;2;2;2$' file | cut -d';' -f1
$ in a regex stands for "end of line", so grep will print all the lines that end in the given string
-d';' tells cut to delimit columns by semicolons
-f1 outputs the first column
You could use awk:
awk -F';' -v v="2" -v count=5 '
{
c=0;
for(i=2;i<=NF;i++){
if($i == v) c++
if(c>=count){print $1;next}
}
}' file
where
v is the value to match
count is the minimum number of matching values required before the first field is printed
the for loop walks through all ;-delimited fields after the first, looking for matches
Note that this script does not require the 5 matching values to be consecutive, or to be the last 5 fields.
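If the stricter reading of the question is wanted (the last count fields must all equal the value, and the name is not one of them), a variant along these lines might do. It is a sketch only: the function name first_field_if_tail_matches is made up, input is read from stdin, and fields are compared as strings so that 2 and 2.0 are treated as different.

```shell
# first_field_if_tail_matches VALUE COUNT (hypothetical name)
# Prints field 1 of each ;-delimited line whose last COUNT fields all equal VALUE.
first_field_if_tail_matches() {
  awk -F';' -v v="$1" -v n="$2" '
    NF > n {                                   # need n value fields plus the name
      for (i = NF - n + 1; i <= NF; i++)
        if (($i "") != (v "")) next            # string comparison; bail out on mismatch
      print $1
    }
  '
}

printf '%s\n' 'India;1;2;3;4;5;6' 'Japan;1;2;2;2;2;2' 'China;2;2;2;2' 'England;2;2;2;2;2' |
  first_field_if_tail_matches 2 5
```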
With sed:
sed -n 's/^\([^;]*\).*;2;2;2;2;2$/\1/p' file
It captures and outputs the leading non-; characters of lines ending with ;2;2;2;2;2
It can be shortened with GNU sed to:
sed -nE 's/^([^;]*).*(;2){5}$/\1/p' file
awk -F\; '/;2;2;2;2;2$/{print $1}' file
Japan
England

modify distribution of data inside a file

I need help with bash in order to modify a file.txt. I have names, each name in a line
for example
Peter
John
Markus
and I need them in the same row and with " before and at the end of each element of the vector.
"Peter" "John" "Markus"
Well, I can insert " when I have all elements in a row but I don't know how to modify the shape...all lines in a row.
array=( Peter John Markus )
number=${#array[@]}
for ((i=0;i<number;i++)); do
array[i]="\"${array[i]}"\"
echo "${array[i]}"
done
With awk
$ awk '{printf "\""$0"\" "} END{print""}' file
"Peter" "John" "Markus"
How it works:
printf "\""$0"\" "
With every new line of input, $0, this prints out a quote, the line itself, a quote and a space.
END{print""}
(optional) After we have read the last line of the file, this prints out a newline.
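One thing to watch with the command above: the input line is used as part of the printf format string, so a name containing a % character would be misinterpreted. A sketch of a variant that passes the line as an argument instead, and also avoids the trailing space, is shown below. The function name quote_lines and the variable sep are my own.

```shell
# quote_lines (hypothetical name): wrap each stdin line in double quotes
# and join them with single spaces on one output line.
quote_lines() {
  awk '
    { printf "%s\"%s\"", sep, $0; sep = " " }   # sep is empty before the first line
    END { if (NR) print "" }                    # final newline, only if there was input
  '
}

printf '%s\n' Peter John Markus | quote_lines
```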
With sed and tr
$ sed 's/.*/"&"/' file | tr '\n' ' '
"Peter" "John" "Markus"
How it works:
s/.*/"&"/
This puts a quote before and after every line
tr '\n' ' '
This replaces newline characters with spaces so that all names appear on the same line.
With sed alone
$ sed ':a;$!{N;ba};s/^/"/; s/$/"/; s/\n/" "/g' file
"Peter" "John" "Markus"
How it works:
:a;$!{N;ba}
This reads the whole file in to the pattern space.
s/^/"/
This adds a quote at the beginning of the file
s/$/"/
This adds a quote to the end of the file.
s/\n/" "/g
This replaces every newline with the three characters: quote-space-quote.
With bash
To make the bash script in the question print on one line, one can use echo -n in place of echo. In other words, replace:
echo "${array[i]}"
With:
echo -n "${array[i]} "
Quoting all words on one line
From the comments, suppose that our file has all the names on one line and we want to quote each individually. Use:
$ cat file2
Peter John Markus
$ sed -r 's/[[:alnum:]]+/"&"/g' file2
"Peter" "John" "Markus"
The above is for GNU sed. On OSX or other BSD system, try:
sed -E 's/[[:alnum:]]+/"&"/g' file2
Perl to the rescue:
perl -pe 'chomp; $_ = qq("$_" );chop if eof' < input
Explanation:
-p reads the input line by line and prints what's in $_
chomp removes a newline
$_ = qq("$_" ) puts a " before and "<Space> after the string.
chop if eof removes the trailing space.

bash- remove \n every three lines

How can I remove the newline delimiter within every group of three lines?
Example:
input:
1
name
John
2
family
Grady
3
Tel
123456
output:
1
name John
2
family Grady
3
Tel 123456
This might work for you (GNU sed):
sed 'n;N;s/\n//' file
to replace the newline with a space use:
sed 'n;N;s/\n/ /' file
as an alternative use paste:
paste -sd'\n \n' file
One way using AWK:
awk '{ printf "%s%s", $0, (NR%3==2 ? FS : RS) }' file
awk 'NR%3==2{printf "%s ",$0;next}{print $0}' input.txt
Output:
1
name John
2
family Grady
3
Tel 123456
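Both awk commands hinge on NR%3==2 picking out the second line of each group. A sketch with the group size as a parameter follows; the function name join_last_pair is hypothetical, and it assumes the number of input lines is a multiple of the group size (otherwise the output may end with a space instead of a newline).

```shell
# join_last_pair [N] (hypothetical name): in every group of N lines
# (default 3), join the last two lines with a space.
join_last_pair() {
  awk -v n="${1:-3}" '
    # The second-to-last line of a group is followed by a space,
    # every other line by a newline.
    { printf "%s%s", $0, (NR % n == n - 1 ? " " : "\n") }
  '
}

printf '%s\n' 1 name John 2 family Grady 3 Tel 123456 | join_last_pair 3
```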
You could do this in Perl,
$ perl -pe 's/\n/ /g if $. % 3 == 2' file
1
name John
2
family Grady
3
Tel 123456
Assuming all those lines you want joined with the next one end with : (your original question):
1
name:
John
2
family: Grady
3
Tel:
123456
You can use sed for this, with:
sed ':a;/:$/{N;s/\n//;ba}'
The a is a branch label. The pattern :$ (colon at end of line) is detected and, if found, N appends the next line to the current one, the newline between them is removed with the s/\n// substitution command, and it branches back to label a with the ba command.
For your edited question where you just want to combine the second and third line of each three-line group regardless of content:
1
name
John
2
family
Grady
3
Tel
123456
Use:
sed '{n;N;s/\n/ /}'
In that command sequence, n will output the first line in the group and replace it with the second one. Then N will append the third line to that second one and s/\n/ / will change the newline between them into a space before finally outputting the combined two-three line.
Then it goes onto the next group of three and does the same thing.
Both those commands will generate the desired output for their respective inputs.
Yet another solution, in Bash:
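# Read each line verbatim (no backslash processing, no whitespace trimming)
while IFS= read -r line
do
    if [[ $line = *: ]]
    then
        # Line ends with a colon: print it without a newline
        printf '%s' "$line"
    else
        printf '%s\n' "$line"
    fi
done < input.txt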
Unix way:
$ paste -sd'\n \n' input
Output:
1
name John
2
family Grady
3
Tel 123456
