Combining Directory of Text Files into CSV, One File Per Line - bash

I have a large directory of text files, each rather complicated:
Say file1.txt:
Am no an listening depending up believing. Enough around remove to
barton agreed regret in or it. Advantage mr estimable be commanded
provision. Year well shot deny shew come now had. Shall downs stand
marry taken his for out. Do related mr account brandon an up. Wrong
for never ready ham these witty him. Our compass see age uncivil
matters weather forbade her minutes. Ready how but truth son new
under.
Am increasing at contrasted in favourable he considered astonished. As
if made held in an shot. By it enough to valley desire do. Mrs chief
great maids these which are ham match she. Abode to tried do thing
maids. Doubtful disposed returned rejoiced to dashwood is so up.
And file2.txt:
Among going manor who did. Do ye is celebrated it sympathize
considered. May ecstatic did surprise elegance the ignorant age. Own
her miss cold last. It so numerous if he outlived disposal. How but
sons mrs lady when. Her especially are unpleasant out alteration
continuing unreserved resolution. Hence hopes noisy may china fully
and. Am it regard stairs branch thirty length afford.
Blind would equal while oh mr do style. Lain led and fact none. One
preferred sportsmen resolving the happiness continued. High at of in
loud rich true. Oh conveying do immediate acuteness in he. Equally
welcome her set nothing has gravity whether parties. Fertile suppose
shyness mr up pointed in staying on respect.
What I need to do is to create a new file, say allfiles.txt that is:
Am no an listening depending up believing. Enough around remove to barton agreed regret in or it. Advantage mr estimable be commanded provision. Year well shot deny shew come now had. Shall downs stand marry taken his for out. Do related mr account brandon an up. Wrong for never ready ham these witty him. Our compass see age uncivil matters weather forbade her minutes. Ready how but truth son new under. Am increasing at contrasted in favourable he considered astonished. As if made held in an shot. By it enough to valley desire do. Mrs chief great maids these which are ham match she. Abode to tried do thing maids. Doubtful disposed returned rejoiced to dashwood is so up.
Among going manor who did. Do ye is celebrated it sympathize considered. May ecstatic did surprise elegance the ignorant age. Own her miss cold last. It so numerous if he outlived disposal. How but sons mrs lady when. Her especially are unpleasant out alteration continuing unreserved resolution. Hence hopes noisy may china fully and. Am it regard stairs branch thirty length afford. Blind would equal while oh mr do style. Lain led and fact none. One preferred sportsmen resolving the happiness continued. High at of in loud rich true. Oh conveying do immediate acuteness in he. Equally welcome her set nothing has gravity whether parties. Fertile suppose shyness mr up pointed in staying on respect.
This file would be just two lines in this case, with the full text of each input file on a single line. I have searched the archives but cannot seem to find an implementation for this in bash.

touch allfiles.txt                      # create allfiles.txt
for f in *.txt; do                      # for each .txt file in the current directory
    [ "$f" = allfiles.txt ] && continue # skip the output file itself: it matches *.txt too
    tr '\n' ' ' < "$f" >> allfiles.txt  # append the file's content with newlines replaced by spaces
    echo >> allfiles.txt                # terminate the joined line with a newline
done

for file in dir/*            # process all files in the directory
do
    tr '\n' ' ' < "$file"    # remove newlines
    echo ''                  # add newline between files
done > newfile               # write all the output of the loop to newfile

Here's a pure INTERCAL implementation, no bash, tr, or cat required:
PLEASE DO ,1 <- #1
DO .4 <- #0
DO .5 <- #0
DO COME FROM (30)
PLEASE ABSTAIN FROM (40)
DO WRITE IN ,1
DO .1 <- ,1SUB#1
DO (10) NEXT
PLEASE GIVE UP
(20) PLEASE RESUME '?.1$#256'~'#256$#256'
(10) DO (20) NEXT
DO FORGET #1
PLEASE DO .2 <- .4
DO (1000) NEXT
DO .4 <- .3~#255
PLEASE DO .3 <- !3~#15'$!3~#240'
DO .3 <- !3~#15'$!3~#240'
DO .2 <- !3~#15'$!3~#240'
PLEASE DO .1 <- .5
DO (1010) NEXT
DO .5 <- .2
DO ,1SUB#1 <- .3
(30) PLEASE READ OUT ,1
PLEASE NOTE: having had pressing business at the local pub
(40) the author got bored with this implementation

With awk:
awk 'FNR==1 && NR>1 {print ""} {printf "%s%s", (FNR>1 ? " " : ""), $0} END {print ""}' file1.txt file2.txt > allfiles.txt
At the start of every file after the first, print "" terminates the previous file's line; within a file, records are joined with single spaces.

Combined Perl/bash solution:
for f in *.txt; do
    perl -ne 'chomp; print "$_ "; END{ print "\n" }' "$f"
done > output.txt
Perl-only solution
#!/usr/bin/env perl
use strict;
use warnings;

foreach my $file (<*.txt>) {
    open my $fh, '<', $file or die "$file: $!";
    while (<$fh>) {
        chomp;
        print "$_ ";
    }
    close $fh;
    print "\n";
}

Here's a pure bash solution: no cat, tr, awk, etc...
Besides, it has nicely formatted output: you won't get double spaces, or leading or trailing spaces, as with the methods provided in the other answers.
for f in *.txt; do
    # There are purposely no quotes for $(<"$f")
    echo $(<"$f")
    echo    # emits a blank line between files
done > newfile
The only caveat is if a file's content starts with -e, -E or -n: that word won't be output, because echo will slurp it up as an option. But I guess this is very unlikely to happen!
The trick is the unquoted $(<"$f") inside echo: the shell word-splits the file's content, and echo joins the resulting words back with single spaces.
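If the -e/-E/-n caveat bothers you, here's a sketch (my addition, using the same unquoted word-splitting idea) that swaps echo for printf and an intermediate array:
for f in *.txt; do
    # Unquoted on purpose: the shell word-splits the file's content
    # (pathname expansion also applies, exactly as in the echo version).
    words=( $(<"$f") )
    printf '%s\n' "${words[*]}"    # rejoin with single spaces; printf has no option pitfall
done > newfile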
Using this trick, here's how you can use cat in a funny way to achieve what you want (but this time it's not a pure bash solution): same thing, it's a funny non-use of quotes!
for f in *.txt; do
    # There are purposely no quotes for $(<"$f")
    cat <<< $(<"$f")
    echo
done > newfile
If you only have two files, say file1.txt and file2.txt, you can do it without a loop, in a single cat command:
# there's purposely a lack of quotes
cat <<< $(<file1.txt)$'\n\n'$(<file2.txt) > newfile
or with a single echo (and same caveat as above), and pure bash:
# there's purposely a lack of quotes
echo $(<file1.txt)$'\n\n'$(<file2.txt) > newfile
Note: I added comments to point out that there are no quotes, as every bash programmer should feel uncomfortable when reading these unquoted parts!
Note 2: Can you do shorter?

This might work for you (paste -s joins lines with tabs by default, hence -d ' '; the sed then trims leading spaces and squeezes runs of spaces):
for file in *.txt; do paste -s -d ' ' "$file"; done | sed 's/^ *//;s/  */ /g'

awk '
FNR == 1 && FILENAME != ARGV[1] {print ""}
{printf "%s%s", (FNR > 1 ? " " : ""), $0}
END {print ""}
' *.txt > allfiles.txt

Related

Measure field/column width in fixed width output - Finding delimiters? [closed]

In the context of the bash shell and command output:
Is there a process/approach to help determine/measure the width of fields that appear to be fixed width?
(apart from the Mark I human eyeball and counting-on-the-screen method....)
If the output appears to be fixed width, is it possible/likely that it's actually delimited by some sort of non-printing character(s)?
If so, how would I go about hunting down said character?
I'm mostly after a way to do this in bash shell/script, but I'm not averse to a programming language approach.
Sample Worst Case Data:
Name                   value 1   empty_col   simpleHeader   complex multi-header
foo                    bar                   -someVal1      1someOtherVal
monty python circus                          -someVal2      2someOtherVal
exactly the field_widthNextVal               -someVal3      3someOtherVal
My current approach:
The best I have come up with is redirecting the output to a file, then using a ruler/index type of feature in the editor to manually work out field widths. I'm hoping there is a smarter/faster way...
What I'm thinking:
With Headers:
Perhaps an approach that measures from the first character of a header to the next character encountered after a run of multiple spaces?
Without Headers:
Drawing a bit of a blank on this one....?
This strikes me as the kind of problem that was cracked about 40 years ago though, so I'm guessing there are better solutions than mine to this stuff...
Some Helpful Information:
Column Widths
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
This is proving to be helpful for determining column widths. I don't fully understand how it works yet, so I can't give a complete explanation, but it might be helpful to a future someone else. Source: https://unix.stackexchange.com/questions/465170/parse-output-with-dynamic-col-widths-and-empty-fields
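To unpack what each stage produces, here's a small annotated sketch (my addition; the two-column header is hypothetical):
# Hypothetical header: "Name" padded to 8 characters, then "value".
printf 'Name    value\n' |
  grep -Po '\S+\s*' |                 # one match per output line: a word plus its trailing spaces
  awk '{printf "%d ", length($0)}'    # the length of each match is the column width
# prints: 8 5
Note that a header containing spaces (like value 1 or complex multi-header above) gets split into two "columns", so treat the result as a starting point.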
File Examination
Redirect output to a file:
command > file.data
Use hexdump or xxd against file.data to look at its raw information. See links for some basics on those tools:
hexdump output vs xxd output
https://nwsmith.blogspot.com/2012/07/hexdump-and-xxd-output-compared.html?m=1
hexdump
https://man7.org/linux/man-pages/man1/hexdump.1.html
https://linoxide.com/linux-how-to/linux-hexdump-command-examples/
https://www.geeksforgeeks.org/hexdump-command-in-linux-with-examples/
xxd
https://linux.die.net/man/1/xxd
https://www.howtoforge.com/linux-xxd-command/
tl;dr:
# Determine Column Widths
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(appropriate-command | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')
# Iterate
while IFS= read -r line
do
    # You can put the awk command in a separate line if this is clearer to you
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
    # Or do it all in one line if you prefer:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"
    *** Code Stuff Here ***
done <<< "$(appropriate-command)"
Some explanation of the above - for newbies (like me)
Okay, so I'm a complete newbie, but this is my answer, based on a grand total of about two days of clawing around in the dark. This answer is relevant to those who are also new and trying to process data in the bash shell and bash scripts.
Unlike the *nix wizards and warlocks that have presented many of the solutions you will find to specific problems (some impressively complex), this is just a simple outline to help people understand what it is that they probably don't know that they don't know. You will have to go and look this stuff up separately; it's way too big to cover it all here.
EDIT:
I would strongly suggest just buying a book/video/course for shell scripting. You do learn a lot doing it the school-of-hard-knocks way, as I have for the last couple of days, but it's proving to be painfully slow. The devil is very much in the details with this stuff. A good structured course probably instils good habits from the get-go too, rather than you developing your own shorthand habits 'that seem to work' but will likely, and unwittingly, bite you later on.
Resources:
Bash references:
https://linux.die.net/man/1/bash
https://tldp.org/LDP/Bash-Beginners-Guide
https://www.gnu.org/software/bash/manual/html_node
Common Bash Mistakes, Traps and Pitfalls:
https://mywiki.wooledge.org/BashPitfalls
http://www.softpanorama.org/Scripting/Shellorama/Bash_debugging/typical_mistakes_in_bash_scripts.shtml
https://wiki.bash-hackers.org/scripting/newbie_traps
My take is that there is no 'one right way that works for everything' to achieve this particular task of processing fixed-width command output. Notably, the field widths are dynamic and might change each time the command is run. It can be done somewhat haphazardly using standard bash tools (it depends on the types of values in each field, particularly whether they contain whitespace or unusual/control characters). That said, expect fringe cases to trip up the 'one bash pipeline to parse them all' approach, unless you have really looked at your data and it's quite well sanitised.
My uninformed, basic approach:
Pre-reqs:
To get much out of all this:
Learn the basics of how IFS= read -r line (and its variants) works; it's one way of processing multiple lines of data, one line at a time (see the sketch after this list). When doing this, you need to be aware of how things are expanded differently by the shell.
Grasp the basics of process substitution and command substitution, understand when data is being manipulated in a sub-shell, otherwise it disappears on you when you think you can recall it later.
It helps to grasp what Regular Expressions (regex) are. Half of the hieroglyphics that you encounter are probably regex in action.
Even further, it helps to understand when/what/why you need to 'escape' certain characters, at certain times, as this is why there is even more \ than you would expect amongst the hieroglyphics.
When doing redirection, be aware of the difference in > (overwrites without prompting) and >> (which appends to any existing data).
Understand differences in comparison operators and conditional tests (such as used with if statements and loop conditions).
if [ cond ] is not necessarily the same as if [[ cond ]]
look into the basics of arrays, and how to load, iterate over and query their elements.
bash -x script.sh is useful for debugging. Targeted debugging of specific lines is done by wrapping the lines of interest in set -x ... set +x within the script.
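For the IFS= read -r bullet above, here's a minimal sketch (my addition) combining it with process substitution, using the placeholder appropriate-command from later in this answer:
while IFS= read -r line
do
    # IFS= preserves leading/trailing whitespace (important for fixed-width data);
    # -r stops backslashes being treated as escape characters.
    printf '%s\n' "$line"
done < <(appropriate-command)   # process substitution keeps the loop in the current
                                # shell, so variables set inside it survive the loop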
As for the fixed width data:
If it's delimited:
Use the delimiter. Most *nix tools use a single white space as the default delimiter, but you can typically also set a specific delimiter (google how to do it for the specific tool); see the sketch just below.
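For example, to pull the second field out of comma-delimited data (a sketch; file.data is the same placeholder file used later in this answer):
cut -d',' -f2 file.data                       # cut: -d sets the delimiter
awk -F',' '{print $2}' file.data              # awk: -F sets the field separator
while IFS=',' read -r first second rest; do   # read: IFS does the splitting
    printf '%s\n' "$second"
done < file.data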
Optional Step:
If there is no obvious delimiter, you can check to see if there is some secret hidden delimiter to take advantage of. There probably isn't, but you can feel good about yourself for checking. This is done by looking at the hex data in the file. Redirect the output of a command to a file (if you don't have the data in a file already). Do it using command > file.data and then explore file.data using hexdump -Cv file.data (another tool is xxd).
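For example, a tab hiding between fields shows up as byte 09 in the hex panel (a sketch; the byte values are standard ASCII, though hexdump's column spacing may differ slightly on your system):
printf 'a\tb\n' > file.data
hexdump -Cv file.data
# 00000000  61 09 62 0a                                       |a.b.|
# 00000004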
If you're stuck with fixed width:
Basically to do something useful, you need to:
Read line by line (i.e. record by record).
Split the lines into their columns (i.e. field by field, this is the fixed-width aspect)
Check that you are really doing what you think you are doing, particularly when expanding or redirecting data. What you see in the shell as command output might not be exactly what you are presenting to your script/pipe (most commonly due to differences in how the shell expands args/variables, and how it quietly manipulates whitespace without telling you...)
Once you know exactly what your processing pipe/script is seeing, you can then tidy up any unwanted whitespace and so forth.
Starting Guidelines:
Feed the pipe/script an entire line at a time, then chop up fields (unless you really know what you are doing). Doing the field separation inside loops such as while IFS= read -r line; do stuff; done is less error prone in terms of the 'what is my pipe actually seeing' problem. When I was doing it outside, it tended to produce more scenarios where the data was modified without me understanding that it was being altered (let alone why), before it even reached the pipe/script. This obviously meant I got extremely confused as to why a pipe that worked in one setting on the command line fell over when I 'fed the same data' in a script or by some other method (but the pipe really wasn't actually getting the same data). This comes back to preserving whitespace with fixed-width data, particularly during expansion, redirection, process substitution and command substitution. Typically it amounts to liberal use of double quotes when referencing a variable, i.e. not $someData but "$someData", and likewise when capturing the entire output of a command. Use braces to make variable boundaries explicit, i.e. "${var}bar".
If there is nothing to leverage as a delimiter, you have some choices. Hack away directly at the fixed width data using tools like:
cut -c n1-n2 this directly cuts things out, starting from character n1 through to n2.
awk '{print $1}' this uses a single space by default to separate fields and print the first field.
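A quick demo of the difference (a sketch assuming a 10-character first column):
printf '%s\n' 'foo       bar' | cut -c 1-10        # prints "foo" plus its padding
printf '%s\n' 'foo       bar' | awk '{print $1}'   # prints just "foo", whitespace-split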
Or, you can try to be a bit more scientific and 'measure twice, cut once'.
You can work out the field widths fairly easily if there are headers. This line is particularly helpful (sourced from an answer I link below):
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
echo $fieldwidths
You can also look at all the data to see what length of data you are seeing in each field, and if you are actually getting the number of fields you expect (Thanks to David C. Rankin for this one!):
awk '{ for (i=1; i<=NF; i++) printf "%d\t",length($i) } {print ""}' file.data
With that information, you can then set about chopping fields up with a bit more certainty that you are actually capturing the entire field (and only the entire field).
Tool options are many and varied, but I'm finding GNU awk (gawk) and perl's unpack to be the clearest. As part of a pipe/script, consider this (sub in your relevant field widths, and whichever field you want out in the {print $fieldnumber}, obviously):
awk 'BEGIN {FIELDWIDTHS="10 20 30 10"} {print $1}' file.data
For command output with dynamic field widths, if you feed it into a while IFS= read -r line; do ...; done loop, you will need to parse each line with the awk above, since the field widths might have changed between runs. Since I originally couldn't get the expansion right, I built the awk command on a separate line and stored it in a variable, which I then called in the pipe. Once you have it figured out, you can shove it all back into one line if you want:
# Determine Column Widths:
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(appropriate-command | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')
# Iterate
while IFS= read -r line
do
    # Separate the awk command if you want:
    # This uses GNU awk to split on the column widths, piped to sed to remove leading and trailing spaces.
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
    # Or do it all in one line, rather than two:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"
    if [ "${DELETIONS[0]}" == 'all' ] && [ "${#DELETIONS[@]}" -eq 1 ] && [ "$field1" != 'UUID' ]; then
        *** Code Stuff ***
    fi
    *** More Code Stuff ***
done <<< "$(appropriate-command)"
Remove excess whitespace using various approaches:
tr -d '[:blank:]' and/or tr -d '[:space:]' (the latter eliminates newlines and vertical whitespace, not just horizontal like [:blank:] does; both also remove internal whitespace).
sed 's/^[ ]*//;s/[ ]*$//' cleans up only leading and trailing whitespace.
Now you should basically have clean, separated fields to work with one at a time, having started from multi-field, multi-line command output.
Once you get what is going on fairly well with the above, you can start to look into other more elegant approaches as presented in these answers:
Finding Dynamic Field Widths:
https://unix.stackexchange.com/a/465178/266125
Using perl's unpack:
https://unix.stackexchange.com/a/465204/266125
Awk and other good answers:
https://unix.stackexchange.com/questions/352185/awk-fixed-width-columns
Some stuff just can't be done in a single pass. Like the perl answer above, it basically breaks the problem down into two parts. The first is turning the fixed-width data into delimited data (just choose a delimiter that doesn't occur within any of the values in your fields/records!). Once you have delimited data, the processing becomes substantially easier from there on out.
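Here's a sketch of that first pass using GNU awk's FIELDWIDTHS (the widths and the | delimiter are illustrative assumptions):
# Rebuild each record with "|" between the fixed-width fields.
gawk 'BEGIN { FIELDWIDTHS = "10 20 30 10"; OFS = "|" }
      { $1 = $1; print }' file.data |   # $1 = $1 forces awk to rejoin the record with OFS
  sed 's/ *|/|/g'                       # trim the padding left before each delimiter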

Iterating with awk over some thousand files and writing to the same files in one or two runs

I have a lot of files in their own directory. All have the same name structure:
2019-10-18-42-IV-Friday.md
2019-10-18-42-IV-Saturday.md
2019-10-18-42-IV-Sunday.md
2019-10-18-43-43-IV-Monday.md
2019-10-18-42-IV Tuesday.md
and so on.
This is in detail:
yyyy-mm-dd-dd-week of year-actual quarter-day of week.md
I want to write one line to each file as a second line:
With awk I want to extract and expand the dates from the file name and then write them to the appropriate file.
This is the point where I fail.
%!awk -F"-" '{print "Today is " $6 ", the " $3 "." $2 "." $1 ", Kw " $4 ", in the " $5 ". Quarter."}'
That works well, I get the sentence I want to write into the files.
So put the whole thing in a loop:
ze.sh
#!/bin/bash
for i in *.md;
j = awk -F " " '{ print "** Today is " $6 ", the" $3"." $2"." $1", Kw " $4 ", in the " $5 ". Quarter. **"}' $i
Something with CAT, I suppose.
end
What do I have to do to make variable i iterate over all files, extract the values for j from $i, and then write $j to the second line of each file?
Thanks a lot for your help.
[Using manjaro linux and bash]
GNU bash, Version 5.0.11(1)-release (x86_64-pc-linux-gnu)
Linux version 5.2.21-1-MANJARO
Could you please try the following (untested; GNU awk is needed for it). For writing the date on the 2nd line, I have chosen the same format in which your Input_file names have the date.
awk -i inplace '
FNR==2{
split(FILENAME,array,"-")
print array[1]"-"array[2]"-"array[3]
}
1
' *.md
If possible, try without the -i inplace option first, so that changes are not saved into Input_file; once you are happy with the results, you can add it back as shown above to make the changes in place.
For awk versions that support in-place updates, see the link James posted:
Save modifications in place with awk
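If you want the full sentence from the question rather than just the date, here's an untested sketch in the same style (my addition); splitting on "-", "." and space, and indexing fields from the end of the array, copes with the irregular file names:
gawk -i inplace '
FNR==2{
    n = split(FILENAME, a, /[-. ]/)
    # a[1..3] = date, a[4] = week; a[n] = "md", so the day name (a[n-1])
    # and the quarter (a[n-2]) sit at the end whatever the field count.
    print "** Today is " a[n-1] ", the " a[3] "." a[2] "." a[1] ", Kw " a[4] ", in the " a[n-2] ". Quarter. **"
}
1
' *.md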
For updating a file in-place, sed is better suited than awk, because:
You don't need a recent version, older versions can do it too
Can work in both GNU and BSD flavors -> more portable
But first, to split a filename into its parts, you don't need an extra process; the read builtin can do it too. From your examples, we need to extract year, month, day and week numbers, a quarter string, and a weekday name string:
2019-10-18-42-IV-Friday.md
2019-10-18-42-IV-Saturday.md
2019-10-18-42-IV-Sunday.md
2019-10-18-43-43-IV-Monday.md
2019-10-18-42-IV Tuesday.md
For the first 3 lines, this simple expression would work:
IFS=-. read year month day week q dayname rest <<< "$filename"
The last line has a space before the weekday name instead of a -, but that's easy to fix:
IFS='-. ' read year month day week q dayname rest <<< "$filename"
Line 4 is harder to fix, because it has a different number of fields. To handle the extra field, we should add an extra variable term:
IFS='-. ' read year month day week q dayname ext rest <<< "$filename"
And then, if we can assume that the second 43 on that line can be ignored and we can just shift the arguments, then we use a conditional on the value of $ext.
That is, for most lines the value of ext will be md (the file extension).
If the value is different that means we have an extra field, and we should shift the values:
if [[ $ext != "md" ]]; then
    q=$dayname
    dayname=$ext
fi
Now, we can use the variables to format the line you want to insert into the file:
line="Today is $dayname, the $day.$month.$year, Kw $week, in the $q. Quarter."
Finally, we can formulate a sed statement, for example to append our custom formatted line after the first one, ideally in a way that will work with both GNU and BSD flavors of sed.
This will work equivalently with both GNU and BSD versions:
sed -i.bak -e "1 a\\"$'\n'"$line"$'\n' "$filename" && rm *.bak
Notice that .bak backup files are created that must be manually removed.
If you don't want backup files to be created, then I'm afraid you need to use slightly different format for GNU and BSD flavors:
# GNU
sed -i'' -e "1 a\\"$'\n'"$line"$'\n' "$filename"
# BSD
sed -i '' -e "1 a\\"$'\n'"$line"$'\n' "$filename"
In fact if you only need to support GNU flavor, then a simpler form will work too:
sed -i'' "1 a$line" "$filename"
You can put all of that together in a for filename in *.md; do ...; done loop.
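Assembled from the pieces above, that loop could look like this (the .bak backups are removed at the end):
for filename in *.md; do
    IFS='-. ' read year month day week q dayname ext rest <<< "$filename"
    if [[ $ext != "md" ]]; then   # extra field present: shift the values
        q=$dayname
        dayname=$ext
    fi
    line="Today is $dayname, the $day.$month.$year, Kw $week, in the $q. Quarter."
    sed -i.bak -e "1 a\\"$'\n'"$line"$'\n' "$filename"
done && rm -f *.bak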
You probably want to feed the file name into the AWK script, using the '-' to separate the components.
This script assumes the AWK output needs to be appended to the file:
for i in *.md ; do
    echo "$i" | awk -F- 'AWK COMMAND HERE' >> "$i"
done
If the new text has to be inserted (as the second line) into the file, the sed program can be used to update the file (using in-place edit '-i'). Something like:
for i in *.md ; do
    mark=$(echo "$i" | awk -F- 'AWK COMMAND HERE')
    sed -i -e "2i$mark" "$i"
done
This is the best solution for me, especially because it copes with the different delimiters.
Many thanks to everyone who was interested in this question and especially to those who posted solutions.
I wish I hadn't made it so hard because I mistyped the example data.
This is now "my" variant of the solution:
for filename in *.md; do
IFS='-. ' read year month day week q dayname rest <<< "$filename"
line="Today is $dayname, the $day.$month.$year, Kw $week, in the $q. Quarter."
sed -i.bak -e "1 a\\"$'\n'"$line"$'\n' "$filename" && rm *.bak;
done
Because it copes with the multiple field separators, this result is the best one for me to use.
But perhaps I am wrong, and the other solutions also offer the possibility of using different separators: at least '-' and '.' are required.
I am very surprised and pleased how quickly I received very good answers as a newcomer. Hopefully I can give something back.
And I'm also amazed how many different solutions are possible for the problems that arise.
If anyone is interested in what I've done, read on here:
I've had a fatal autoimmune disease for two years. Little by little, my brain is destroyed, intermittently.
Especially my memory has suffered a lot; I often don't remember what I did or learned yesterday, or what still has to be done.
That's why I created day files until 31.12.2030, with a markdown template for each day. There I then record what I have done and learned on those days and what still has to be done.
It was important to me to have the correct date within the individual file. Why no database, why markdown?
I want to have a format that I can use anywhere, on any device and with any OS. A format that doesn't belong to a company, that can change it or make it more expensive, that can take it off the market or limit it with licenses.
It's fast enough. The changes to 4,097 files as described above took less than 2 seconds on my i5 laptop (12 GB Ram, SSD).
Searching with fzf over all files is also very fast. I can simply have the files converted and output as what I just need.
My memory won't come back from this, but I have a chance to log what I forgot.
Thank you very much for your help and attention.

How to join lines not starting with specific pattern to the previous line in UNIX?

Please take a look at the sample file and the desired output below to understand what I am looking for.
It can be done with loops in a shell script but I am struggling to get an awk/sed one liner.
SampleFile.txt
These are leaves.
These are branches.
These are greenery which gives
oxygen, provides control over temperature
and maintains cleans the air.
These are tigers
These are bears
and deer and squirrels and other animals.
These are something you want to kill
Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Desired output
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
With sed:
sed ':a;N;/\nThese/!s/\n/ /;ta;P;D' infile
resulting in
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Here is how it works:
sed '
:a # Label to jump to
N # Append next line to pattern space
/\nThese/!s/\n/ / # If the newline is NOT followed by "These", append
# the line by replacing the newline with a space
ta # If we changed something, jump to label
P # Print part until newline
D # Delete part until newline
' infile
The N;P;D is the idiomatic way of keeping multiple lines in the pattern space; the conditional branching part takes care of the situation where we append more than one line.
This works with GNU sed; for other seds like the one found in Mac OS, the oneliner has to be split up so branching and label are in separate commands, the newlines may have to be escaped, and we need an extra semicolon:
sed -e ':a' -e 'N;/'$'\n''These/!s/'$'\n''/ /;ta' -e 'P;D;' infile
This last command is untested; see this answer for differences between different seds and how to handle them.
Another alternative is to enter the newlines literally:
sed -e ':a' -e 'N;/\
These/!s/\
/ /;ta' -e 'P;D;' infile
But then, by definition, it's no longer a one-liner.
Please try the following:
awk 'BEGIN {accum_line = "";} /^These/{if(length(accum_line)){print accum_line; accum_line = "";}} {accum_line = accum_line " " $0;} END {if(length(accum_line)){print accum_line; }}' < data.txt
The code consists of three parts:
The block marked by BEGIN is executed before anything else. It's useful for global initialization
The block marked by END is executed when the regular processing has finished. It is good for wrapping things up, like printing the last collected data when no further line starting with "These" arrives to trigger it (this case)
The rest is the code performed for each line. First, the pattern is searched for and the relevant things are done. Second, data collection is done regardless of the string contents.
awk '$1=="These"{print row; row=$0} $1!="These"{row=row " " $0} END{print row}'
you can take it from there: blank lines, separators, and
other unspecified behaviors (untested). Note the quotes around "These": unquoted, it would be an empty, uninitialized variable.
another awk if you have support for multi-char RS (gawk has)
$ awk -v RS="These" 'NR>1{$1=$1; print RS, $0}' file
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Explanation: Set the record delimiter to "These" and skip the first (empty) record. Reassigning a field forces awk to rebuild the record, collapsing the newlines; then print the record separator and the rest of the record.
$ awk '{printf "%s%s", (NR>1 ? (/^These/?ORS:OFS) : ""), $0} END{print ""}' file
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
Not a one-liner (but see end of answer!), but an awk-script:
#!/usr/bin/awk -f
NR == 1 { line = $0 }
/^These/ { print line; line = $0 }
! /^These/ { line = line " " $0 }
END { print line }
Explanation:
I'm accumulating, building up, lines that start with "These" with lines not starting with "These", outputting the completed lines whenever I find the next line with "These" at the beginning.
Store the first line (the first "record").
If the line starts with "These", print the accumulated (previous, now complete) line and replace whatever we have found so far with the current line.
If it doesn't start with "These", accumulate the line (i.e concatenate it with the previously read incomplete lines, with a space in between).
When there's no more input, print the last accumulated (now complete) line.
Run like this:
$ ./script.awk data.in
As a one-liner:
$ awk 'NR==1{c=$0} /^These/{print c;c=$0} !/^These/{c=c" "$0} END{print c}' data.in
... but why you would want to run anything like that on the command line is beyond me.
EDIT Saw that it was the specific string "These" (/^These/) that was what should be looked for. Previously had my code look for uppercase letters at the start of the line (/^[A-Z]/).
Here is a sed program which avoids branches. I tested it with the --posix option. The trick is to use an "anchor" (a string which does not occur in the file):
sed --posix -n '/^These/!{;s/^/DOES_NOT_OCCUR/;};H;${;x;s/^\n//;s/\nDOES_NOT_OCCUR/ /g;p;}'
Explanation:
write DOES_NOT_OCCUR at the beginning of lines not starting with "These":
/^These/!{;s/^/DOES_NOT_OCCUR/;};
append the pattern space to the hold space
H;
If the last line is read, exchange pattern space and hold space
${;x;
Remove the newline at the beginning of the pattern space which is added by the H command when it added the first line to the hold space
s/^\n//;
Replace all newlines followed by DOES_NOT_OCCUR with blanks and print the result
s/\nDOES_NOT_OCCUR/ /g;p;}
Note that the whole file is read into sed's process memory, but at only a few GB this should not be a problem.

Bash: Variable1 > get first n words > cut > Variable2

I've read so many entries here now that my head is exploding. I can't find the "right" solution; maybe my bad English is part of the reason, and for sure my really low bash skills.
I'm writing a script, which reads the input of an user (me) into a variable.
read TEXT
echo $TEXT
Hello, this is a sentence with a few words.
What I want is (I'm sure) maybe very simple: I now need the first n words in a second variable. Like
$TEXT tr/csplit/grep/truncate/cut/awk/sed/whatever get the first 5 words > $TEXT2
echo $TEXT2
Hello, this is a sentence
I've tried, for example, ${TEXT:0:10}, but this can cut in the middle of a word. And I don't want to use txt-file inputs/outputs, just variables. Is there any really low-level, simple solution for it, without losing myself in big, complex code blocks and hundreds of (/[{*+$'-:%"})]... and so on? :(
Thanks a lot for any support!
Using cut could be a simple solution, but the solution below with xargs works too:
firstFiveWords=$(xargs -n 5 <<< "Hello, this is a sentence with a few words." | awk 'NR>1{exit};1')
$ echo $firstFiveWords
Hello, this is a sentence
From the man page of xargs
-n max-args
Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.
and awk 'NR>1{exit};1' prints only the first line of its input.
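For comparison, here's a pure-bash sketch with no external processes (my addition; it assumes words are separated by whitespace, per the default IFS):
TEXT="Hello, this is a sentence with a few words."
read -ra words <<< "$TEXT"      # split $TEXT into an array, one word per element
TEXT2="${words[*]:0:5}"         # first 5 elements, rejoined with single spaces
echo "$TEXT2"                   # -> Hello, this is a sentence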

Finding lines containing words that occur more than once using grep

How do I find all lines that contain duplicate lower-case words?
I want to be able to do this using egrep; this is what I've tried thus far, but I keep getting invalid back references:
egrep '\<(.)\>\1' inputFile.txt
egrep -w '\b(\w)\b\1' inputFile.txt
For example, if I have the following file:
The sky was grey.
The fall term went on and on.
I hope every one has a very very happy holiday.
My heart is blue.
I like you too too too much
I love daisies.
It should find the following lines in the file:
The fall term went on and on.
I hope every one has a very very happy holiday.
I like you too too too much
It finds these lines because the words on, very and too occur more than once in each line.
This could be done with the -E or -P parameter:
grep -E '(\b[a-z]+\b).*\b\1\b' file
Example:
$ cat file
The fall term went on and on.
I hope every one has a very very happy holiday.
Hi foo bar.
$ grep -E '(\b[a-z]+\b).*\b\1\b' file
The fall term went on and on.
I hope every one has a very very happy holiday.
Got it, you need to find duplicate words (all lowercase):
sed -n '/\s\([a-z]*\)\s.*\1/p' infile
Tools exist to serve your request; restricting yourself to one tool is not a good way to go.
\1 is a backreference, a feature sed has; GNU grep supports backreferences too (the grep -E answer above uses one), though POSIX only guarantees them for basic regular expressions.
I know this is about grep, but here is an awk
It would be more flexible, since you can easily change the counter c:
c==2 some word appears exactly twice
c>2 some word appears more than twice
etc.
awk -F"[ \t.,]" '{c=0;for (i=1;i<=NF;i++) a[$i]++; for (i in a) c=c<a[i]?a[i]:c;delete a} c==2' file
The fall term went on and on.
I hope every one has a very very happy holiday.
It runs a loop through all words in a line and creates an array entry for every word.
Then a second loop checks whether any word is repeated, tracking the highest count in c.
try
egrep '[a-z]*' my_file
this will find all lower case chars in each line
egrep '[a-z]*' --color my_file
this will color the lower chars
