Extract words from files - bash

How can I extract all the words from a file, every word on a single line?
Example:
test.txt
This is my sample text
Output:
This
is
my
sample
text

The tr command can do this:
tr '[:blank:]' '\n' < test.txt
This asks tr to replace each blank (space or tab) with a newline. The character class is quoted so that the shell cannot expand it as a filename pattern.
The output goes to stdout, but it can be redirected to another file, result.txt:
tr '[:blank:]' '\n' < test.txt > result.txt

And here the obvious bash line:
for i in $(< test.txt)
do
printf '%s\n' "$i"
done
EDIT Still shorter:
printf '%s\n' $(< test.txt)
That's all there is to it; no special cases are needed, because the shell's word splitting already handles runs of separators as well as leading and trailing separators. Because the expansion is unquoted, words containing glob characters such as * are also subject to pathname expansion. You can adjust what counts as a word separator through the $IFS variable; see the bash manual.
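To illustrate the $IFS remark, here is a minimal sketch that splits a string on a custom separator. The helper name split_words is made up for this example, and set -f is an addition of mine so that words like * are not expanded to file names:

```shell
# Hypothetical helper: split_words SEPARATORS STRING
# Prints each field of STRING on its own line.
split_words() (
    # The function body is a subshell (note the parentheses), so the
    # changes to IFS and set -f do not leak into the calling shell.
    IFS=$1
    set -f                 # disable pathname expansion for the unquoted $2
    printf '%s\n' $2       # unquoted on purpose: the shell splits on IFS
)

split_words ':' 'This:is:my:*:text'
```

With a non-whitespace separator such as :, two adjacent separators produce an empty field, which is different from how runs of spaces are treated.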

The tr answer above doesn't handle multiple consecutive spaces well: every blank becomes its own newline, which leaves empty lines in the output. An alternative that does handle them is
perl -p -e '$_ = join("\n",split);' test.txt
E.g.
esben@mosegris:~/ange/linova/build master $ echo "test  test" | tr '[:blank:]' '\n'
test

test
But
esben@mosegris:~/ange/linova/build master $ echo "test  test" | perl -p -e '$_ = join("\n",split);'
test
test
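If Perl is not at hand, tr can squeeze repeated separators itself with -s. A minimal sketch, assuming words are separated by any whitespace; words_per_line is a hypothetical helper name, and the grep stage is only there to drop the empty line that leading whitespace would otherwise leave:

```shell
# Hypothetical helper: print each whitespace-separated word of stdin
# on its own line.
words_per_line() {
    # tr maps every whitespace character to a newline and, because of -s,
    # squeezes each run of newlines into one. grep removes an empty first
    # line caused by leading whitespace.
    tr -s '[:space:]' '\n' | grep -v '^$'
}

printf '  test   test\tdone\n' | words_per_line
```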

This might work for you:
# echo -e "this is\tmy\nsample text" | sed 's/\s\+/\n/g'
this
is
my
sample
text

A Perl answer:
pearl.214> cat file1
a b c d e f
pearl.215> perl -p -e 's/ /\n/g' file1
a
b
c
d
e
f
pearl.216>

Related

Replace one character by the other (and vice-versa) in shell

Say I have strings that look like this:
$ a='/o\\'
$ echo $a
/o\
$ b='\//\\\\/'
$ echo $b
\//\\/
I'd like a shell script (ideally a one-liner) to replace / occurrences by \ and vice-versa.
Suppose the command is called invert, it would yield (in a shell prompt):
$ invert $a
\o/
$ invert $b
/\\//\
For example using sed, it seems unavoidable to use a temporary character, which is not great, like so:
$ echo $a | sed 's#/#%#g' | sed 's#\\#/#g' | sed 's#%#\\#g'
\o/
$ echo $b | sed 's#/#%#g' | sed 's#\\#/#g' | sed 's#%#\\#g'
/\\//\
For some context, this is useful for proper printing of git log --graph --all | tac (I like to see newer commits at the bottom).
tr is your friend:
% echo 'abc' | tr ab ba
bac
% echo '/o\' | tr '\\/' '/\\'
\o/
(escaping the backslashes in the output might require a separate step)
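Wrapped as the invert command the question asks for, the tr approach could look like the sketch below. It uses printf rather than echo, since some echo implementations interpret backslashes in their arguments; the argument must be quoted by the caller:

```shell
# Sketch of the "invert" command from the question: swap / and \ in $1.
invert() {
    # Inside the quotes, \\ is how tr spells a single literal backslash,
    # so this maps / to \ and \ to / in one pass.
    printf '%s\n' "$1" | tr '/\\' '\\/'
}

invert '/o\'
invert '\//\\/'
```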
I think this can be done with (g)awk:
$ echo a/\\b\\/c | gawk -F "/" 'BEGIN{ OFS="\\" } { for(i=1;i<=NF;i++) gsub(/\\/,"/",$i); print $0; }'
a\/b/\c
$ echo a\\/b/\\c | gawk -F "/" 'BEGIN{ OFS="\\" } { for(i=1;i<=NF;i++) gsub(/\\/,"/",$i); print $0; }'
a/\b\/c
$
-F "/" sets the field separator: the input is split on "/", so the fields themselves no longer contain a "/" character. OFS="\\" makes the backslash the output field separator, which is how the original "/" separators come out as "\".
for(i=1;i<=NF;i++) gsub(/\\/,"/",$i); replaces every backslash (\) with a slash (/) in each field.
If you want to swap every instance of / with \ and vice versa, you can use the y command of sed, which is quite similar to what tr does:
$ a='/o\'
$ echo "$a"
/o\
$ echo "$a" | sed 'y|/\\|\\/|'
\o/
$ b='\//\\/'
$ echo "$b"
\//\\/
$ echo "$b" | sed 'y|/\\|\\/|'
/\\//\
If you are limited to GNU AWK, you can get the desired result as follows. Let the content of file.txt be
\//\\\\/
then
awk 'BEGIN{FPAT=".";OFS="";arr["/"]="\\";arr["\\"]="/"}{for(i=1;i<=NF;i+=1){if($i in arr){$i=arr[$i]}};print}' file.txt
gives output
/\\////\
Explanation: FPAT="." tells GNU AWK that every single character is a field, and the output field separator (OFS) is set to the empty string. The array arr maps each character to be replaced to its replacement; \ has to be escaped, so \\ denotes a literal \. For each line, the for loop iterates over all fields, and if a field holds a character that is a key of arr, it is exchanged for the corresponding value. After the loop the line is printed.
(tested in gawk 4.2.1)

Using grep or sed to get text between {}

This is my first thread on Stack Overflow.
I'm learning bash and I can't figure out how to use grep or sed for a specific purpose: I want to extract/print all the data between specific characters like { and } or [ and ].
I've searched a lot, but I can't find anything that covers the case where the two characters are not on the same line.
I hope you can help me!
Thanks in advance
The OP has { and } on two separate lines, so sed is the easier tool:
$ sed -n '/{/,/}/{//!p}' inputfile
For square brackets, you have to escape the characters:
$ sed -n '/\[/,/\]/{//!p}' inputfile
inputfile:
$ cat inputfile
Some text inside
{
between braces
}
some other text
[
between square bracket
]
some more text
output:
$ sed -n '/{/,/}/{//!p}' inputfile
between braces
$ sed -n '/\[/,/\]/{//!p}' inputfile
between square bracket
If they are on the same line, use perl-style-regex in grep and option -o:
$ echo 'Some text {between}' | grep -o -P '(?<=\{).*(?=\})'
between
$ echo 'Some text [between]' | grep -o -P '(?<=\[).*(?=\])'
between
You can use sed for the {} case like this
sed 's!.*{\(.*\)}.*!\1!'
\1 is a backreference: it stands for whatever the group \(.*\) matched.
You can try this sed but it's better with awk
sed '
/{/!d
s/[^{]*//
:A
$bB
N
/}/!bA
:B
s/}[^}]*$/}/
t
d
' infile
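Since awk is mentioned as the better fit but not shown, here is a minimal sketch of one way to do it. It assumes, as in the sample input, that the braces sit on their own lines and are not nested; between_braces is a hypothetical helper name:

```shell
# Hypothetical helper: print the lines between a line containing "{"
# and the next line containing "}", excluding both delimiter lines.
between_braces() {
    awk '
        /[}]/  { inside = 0 }   # closing line: stop before it gets printed
        inside { print }        # print only while inside a block
        /[{]/  { inside = 1 }   # opening line: start with the next line
    ' "$@"
}

printf '%s\n' 'Some text' '{' 'between braces' '}' 'some other text' | between_braces
```

The order of the three rules matters: the closing test runs before the print rule and the opening test after it, so neither delimiter line is printed.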
With modern grep (2.6.3+) it's easy:
[root@s]$ cat test
aa{bb]
cc}dd [ee] ff [gg
hh] ii {jj} kk
[root@s]$ <test grep -z -P '({[^}]+}|\[[^]]+\])' -o
{bb]
cc}
[ee]
[gg
hh]
{jj}
If your grep is 2.5.1 or older (where -z didn't exist and -P was poorly implemented), the input needs to be converted to a single line first.
Example: tr '\n' '\t' replaces all newlines with tab characters (if desired, the opposite replacement can be done after processing).
[root@s]$ <test tr '\n' '\t'|grep '\(\[[^]]*\]\|{[^}]*}\)' --color=always|tr '\t' '\n'
aa{bb]
cc}dd [ee] ff [gg
hh] ii {jj} kk
[root@s]$ <test tr '\n' '\t'|grep '\(\[[^]]*\]\|{[^}]*}\)' -o
{bb] cc}
[ee]
[gg hh]
{jj}
For both versions you can choose whichever presentation (-o or --color=always) you find more appealing.
P.S. all of the above assumes there's no nesting/escaping of those { } [ ] characters in the input

Using cut on stdout with tabs

I have a file which contains one line of text with tabs
echo -e "foo\tbar\tfoo2\nx\ty\tz" > file.txt
I'd like to get the first column with cut. It works if I do
$ cut -f 1 file.txt
foo
x
But if I read it in a bash script
while read line
do
new_name=`echo -e $line | cut -f 1`
echo -e "$new_name"
done < file.txt
Then I get instead
foo bar foo2
x y z
What am I doing wrong?
/edit: My script looks like that right now
while IFS=$'\t' read word definition
do
clean_word=`echo -e $word | external-command`
echo -e "$clean_word\t<b>$word</b><br>$definition" >> $2
done < $1
External command removes diacritics from a Greek word. Can the script be optimized any further without changing external-command?
What is happening is that you did not quote $line when reading the file. Then, the original tab-delimited format was lost and instead of tabs, spaces show in between words. And since cut's default delimiter is a TAB, it does not find any and it prints the whole line.
So quoting works:
while read line
do
new_name=`echo -e "$line" | cut -f 1`
#----------------^^^^^^^
echo -e "$new_name"
done < file.txt
Note, however, that you could have used IFS to set the tab as field separator and read more than one parameter at a time:
while IFS=$'\t' read name rest;
do
echo "$name"
done < file.txt
returning:
foo
x
And, again, note that awk is even faster for this purpose:
$ awk -F"\t" '{print $1}' file.txt
foo
x
So, unless you want to call some external command while looping the file, awk (or sed) is better.
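If the loop has to stay in the shell, the first column can also be taken with parameter expansion instead of spawning cut for every line. A minimal sketch, assuming tab-separated input on stdin; first_column is a hypothetical name:

```shell
# Hypothetical helper: print the first tab-separated field of each line.
first_column() {
    tab=$(printf '\t')     # a literal tab character
    # IFS= and -r keep the line exactly as read; the || part also
    # handles a last line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # %% removes the longest suffix starting at the first tab.
        printf '%s\n' "${line%%"$tab"*}"
    done
}

printf 'foo\tbar\tfoo2\nx\ty\tz\n' | first_column
```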

Bash: Strip trailing linebreak from output

When I execute commands in Bash (or to be specific, wc -l < log.txt), the output contains a linebreak after it. How do I get rid of it?
If your expected output is a single line, you can simply remove all newline characters from the output. It would not be uncommon to pipe to the tr utility, or to Perl if preferred:
wc -l < log.txt | tr -d '\n'
wc -l < log.txt | perl -pe 'chomp'
You can also use command substitution to remove the trailing newline:
echo -n "$(wc -l < log.txt)"
printf "%s" "$(wc -l < log.txt)"
If your expected output may contain multiple lines, you have another decision to make:
If you want to remove MULTIPLE newline characters from the end of the file, again use cmd substitution:
printf "%s" "$(< log.txt)"
If you want to strictly remove THE LAST newline character from a file, use Perl:
perl -pe 'chomp if eof' log.txt
Note that if you are certain you have a trailing newline character you want to remove, you can use head from GNU coreutils to select everything except the last byte. This should be quite quick:
head -c -1 log.txt
Also, for completeness, you can quickly check where your newline (or other special) characters are in your file using cat and the 'show-all' flag -A. The dollar sign character will indicate the end of each line:
cat -A log.txt
One way:
wc -l < log.txt | xargs echo -n
If you want to remove only the last newline, pipe through:
sed -z '$ s/\n$//'
sed won't add a \0 to the end of the stream if the delimiter is set to NUL via -z, whereas to create a POSIX text file (defined to end in a \n), it will always output a final \n without -z.
Eg:
$ { echo foo; echo bar; } | sed -z '$ s/\n$//'; echo tender
foo
bartender
And to prove no NUL added:
$ { echo foo; echo bar; } | sed -z '$ s/\n$//' | xxd
00000000: 666f 6f0a 6261 72 foo.bar
To remove multiple trailing newlines, pipe through:
sed -Ez '$ s/\n+$//'
Bash parameter expansion can also remove whitespace directly; note that each of these patterns removes only a single whitespace character:
testvar=$(wc -l < log.txt)
trailing_space_removed=${testvar%%[[:space:]]}
leading_space_removed=${testvar##[[:space:]]}
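To strip a whole run of trailing whitespace with parameter expansion alone, two expansions can be combined. This is a sketch under my own naming (rtrim is hypothetical), not part of the answer above:

```shell
# Hypothetical helper: print $1 with all trailing whitespace removed.
rtrim() {
    s=$1
    # Remove the longest prefix that ends in a non-whitespace character;
    # what remains is exactly the trailing whitespace run.
    trailing=${s##*[![:space:]]}
    # Cut that run off the end of the original string.
    printf '%s' "${s%"$trailing"}"
}

rtrim 'hello world   '
```

If the string has no trailing whitespace, trailing is empty and the string is printed unchanged.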
If you want to print output of anything in Bash without end of line, you echo it with the -n switch.
If you have it in a variable already, then echo it with the trailing newline cropped:
$ testvar=$(wc -l < log.txt)
$ echo -n $testvar
Or you can do it in one line, instead:
$ echo -n $(wc -l < log.txt)
If you assign its output to a variable, the command substitution automatically strips the trailing newlines:
linecount=`wc -l < log.txt`
printf does not append a newline, and the command substitution has already removed the trailing one:
$ printf '%s' $(wc -l < log.txt)
Detail:
printf will print your content in place of the %s string place holder.
If you do not tell it to print a newline (%s\n), it won't.
Adding this for my reference more than anything else ^_^
You can also strip a new line from the output using the bash expansion magic
VAR=$'helloworld\n'
CLEANED="${VAR%$'\n'}"
echo "${CLEANED}"
Using Awk:
awk -v ORS="" '1' log.txt
Explanation:
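-v assigns the awk variable ORS before the program starts
ORS, the output record separator, is set to the empty string, so each record is printed without the newline that separated the input records. The pattern 1 is always true and triggers the default print.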

how to concatenate lines into one string

I have a function in bash that outputs a bunch of lines to stdout. I want to combine them into a single line with some delimiter between them.
Before:
one
two
three
After:
one:two:three
What is an easy way to do this?
Use paste
$ echo -e 'one\ntwo\nthree' | paste -s -d':'
one:two:three
And another way (note that the final newline also becomes a trailing :):
cat file | tr -s "\n" ":"
This might work for you:
paste -sd':' file
For fun, here's a bash-only way:
echo $'one\n2 and 3\nfour' | { mapfile -t lines; IFS=:; echo "${lines[*]}"; }
outputs
one:2 and 3:four
The {} grouping is to ensure all the commands that refer to the array variable are executed in the same subshell. The variable will not exist once the pipeline ends.
http://www.gnu.org/software/bash/manual/bashref.html#index-mapfile-140
Taking @glennJackman's corrections verbatim
awk '{printf("%s%s", sep, $0); sep=":"} END {print ""}' file
Or as you specified bash
while IFS= read -r line ; do printf "%s:" "$line" ; done < file | sed 's/:$//'
I hope this helps
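A shell-only variant that avoids the trailing delimiter without a sed pass might look like this. It is a sketch with a hypothetical name, join_lines; unlike paste -d, it also accepts a separator longer than one character:

```shell
# Hypothetical helper: join_lines SEP
# Reads lines from stdin and prints them on one line, separated by SEP.
join_lines() {
    sep=$1
    out=
    first=1
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "$first" -eq 1 ]; then
            out=$line            # no separator before the first line
            first=0
        else
            out=$out$sep$line
        fi
    done
    printf '%s\n' "$out"
}

printf 'one\n2 and 3\nfour\n' | join_lines ':'
```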
Input.txt
one
two
three
Perl Solution : dummy.pl
@a = `cat /home/Input.txt`;
foreach my $x (@a)
{
chomp($x);
push(@array,"$x");
}
chomp(@array);
print "@array";
Run the script as :
$> perl dummy.pl | sed 's/ /:/g' > Output.txt
Output.txt
one:two:three

Resources