How to make the 'cut' command treat same sequental delimiters as one? - bash

I'm trying to extract a certain (the fourth) field from the column-based, 'space'-adjusted text stream. I'm trying to use the cut command in the following manner:
cat text.txt | cut -d " " -f 4
Unfortunately, cut doesn't treat several spaces as one delimiter. I could have piped through awk
awk '{ printf $4; }'
or sed
sed -E "s/[[:space:]]+/ /g"
to collapse the spaces, but I'd like to know if there any way to deal with cut and several delimiters natively?

Try:
tr -s ' ' <text.txt | cut -d ' ' -f4
From the tr man page:
-s, --squeeze-repeats replace each input sequence of a repeated character
that is listed in SET1 with a single occurrence
of that character

As you comment in your question, awk is really the way to go. To use cut is possible together with tr -s to squeeze spaces, as kev's answer shows.
Let me however go through all the possible combinations for future readers. Explanations are at the Test section.
tr | cut
tr -s ' ' < file | cut -d' ' -f4
awk
awk '{print $4}' file
bash
while read -r _ _ _ myfield _
do
echo "forth field: $myfield"
done < file
sed
sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' file
Tests
Given this file, let's test the commands:
$ cat a
this is line 1 more text
this is line 2 more text
this is line 3 more text
this is line 4 more text
tr | cut
$ cut -d' ' -f4 a
is
# it does not show what we want!
$ tr -s ' ' < a | cut -d' ' -f4
1
2 # this makes it!
3
4
$
awk
$ awk '{print $4}' a
1
2
3
4
bash
This reads the fields sequentially. By using _ we indicate that this is a throwaway variable as a "junk variable" to ignore these fields. This way, we store $myfield as the 4th field in the file, no matter the spaces in between them.
$ while read -r _ _ _ a _; do echo "4th field: $a"; done < a
4th field: 1
4th field: 2
4th field: 3
4th field: 4
sed
This catches three groups of spaces and no spaces with ([^ ]*[ ]*){3}. Then, it catches whatever coming until a space as the 4th field, that it is finally printed with \1.
$ sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' a
1
2
3
4

shortest/friendliest solution
After becoming frustrated with the too many limitations of cut, I wrote my own replacement, which I called cuts for "cut on steroids".
cuts provides what is likely the most minimalist solution to this and many other related cut/paste problems.
One example, out of many, addressing this particular question:
$ cat text.txt
0 1 2 3
0 1 2 3 4
$ cuts 2 text.txt
2
2
cuts supports:
auto-detection of most common field-delimiters in files (+ ability to override defaults)
multi-char, mixed-char, and regex matched delimiters
extracting columns from multiple files with mixed delimiters
offsets from end of line (using negative numbers) in addition to start of line
automatic side-by-side pasting of columns (no need to invoke paste separately)
support for field reordering
a config file where users can change their personal preferences
great emphasis on user friendliness & minimalist required typing
and much more. None of which is provided by standard cut.
See also: https://stackoverflow.com/a/24543231/1296044
Source and documentation (free software): http://arielf.github.io/cuts/

This Perl one-liner shows how closely Perl is related to awk:
perl -lane 'print $F[3]' text.txt
However, the #F autosplit array starts at index $F[0] while awk fields start with $1

With versions of cut I know of, no, this is not possible. cut is primarily useful for parsing files where the separator is not whitespace (for example /etc/passwd) and that have a fixed number of fields. Two separators in a row mean an empty field, and that goes for whitespace too.

Related

How to extract text by unspecified spaces

I'm trying to extract usernames from a text file in one per line format and from my research, it seems like the only way to do it is by spacing commands here's the format:
1 user 3
2 fusrfff 4
3 usrf 12
The only problem is because all of the users are different I can't define a static space amount. There's also the fact the UIDs (first numbers) go from 1-40k. There's a bunch of other information after the user group number too. Can anyone point me in the right direction? Thanks.
awk does not care about the amount of space between fields:
awk '{print $2}' your_file.txt
If you want to go with bash only, read does not care either:
while read uid username other_stuff; do
printf '%s\n' "$name"
done < your_file.txt
First replace spaces by one space. You can use sed 's/ +/ /g' or
tr -s " " < file.txt| cut -d" " -f2
This is using sed
$ cat file.txt| sed "s/ */ /g" | cut -d' ' -f2
user
fusrfff
usrf

read values of txt file from bash [duplicate]

This question already has answers here:
How to grep for contents after pattern?
(8 answers)
Closed 5 years ago.
I'm trying to read values from a text file.
I have test1.txt which looks like
sub1 1 2 3
sub8 4 5 6
I want to obtain values '1 2 3' when I specify 'sub1'.
The closest I get is:
subj="sub1"
grep "$subj" test1.txt
But the answer is:
sub8 4 5 6
I've read that grep gives you the next line to the match, so I've tried to change the text file to the following:
test2.txt looks like:
sub1
1 2 3
sub8
4 5 6
However, when I type
grep "$subj" test2.txt
The answer is:
sub1
It should be something super simple but I've tried awk, seg, grep,egrep, cat and none is working...I've also read some posts somehow related but none was really helpful
Awk works: awk '$1 == "'"$subj"'" { print $2, $3, $4 }' test1.txt
The command outputs fields two, three, and four for all lines in test1.txt where the first field is $subj (i.e.: the contents of the variable named subj).
With your original text file format:
target=sub1
while IFS=$' \t\n' read -r key values; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
This requires no external tools, using only functionality built into bash itself. See BashFAQ #1.
As has come up during debugging in comments, if you have a traditional Apple-format text file (CR newlines only), then you might want something more like:
target=sub1
while IFS=$' \t\n' read -r -d $'\r' key values || [[ $key ]]; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
Alternately, using awk (for a standard UNIX text file):
target="sub1"
awk -v target="$target" '$1 == target { $1 = ""; print; }' <test1.txt
...or, for a file with CR-only newlines:
target="sub1"
tr '\r' '\n' <test1.txt | awk -v target="$target" '$1 == target { $1 = ""; print; }'
This version will be slower if the text file being read is small (since awk, like any other external tool, takes time to start up); but faster if it's large (since awk's operation is much faster than that of bash's built-ins once it's done starting up).
grep "sub1" test1.txt | cut -c6-
or
grep -A 1 "sub1" test2.txt | tail -n 1
You doing it right, but it seems like test1.txt has a wrong value in it.
with grep foo you get all lines with foo in it. use grep -m1 foo to find the first line with foo in it only.
then you can use cut -d" " -f2- to get all the values behind foo, while seperated by empty spaces.
In the end the command would look like this ...
$ subj="sub1"
$ grep -m1 "$subj" test1.txt | cut -d" " -f2-
But this doenst explain why you could not find sub1 in the first place.
Did you read the proper file ?
There's a bunch of ways to do this (and shorter/more efficient answers than what I'm giving you), but I'm assuming you're a beginner at bash, and therefore I'll give you something that's easy to understand:
egrep "^$subj\>" file.txt | sed "s/^\S*\>\s*//"
or
egrep "^$subj\>" file.txt | sed "s/^[^[:blank:]]*\>[[:blank:]]*//"
The first part, egrep, will search for you subject at the beginning of the line in file.txt (that's what the ^ symbol does in the grep string). It also is looking for a whole word (the \> is looking for an end of word boundary -- that way sub1 doesn't match sub12 in the file.) Notice you have to use egrep to get the \>, as grep by default doesn't recognize that escape sequence. Once done finding the lines, egrep then passes it's output to sed, which will strip the first word and trailing whitespace off of each line. Again, the ^ symbol in the sed command, specifies it should only match at the beginning of the line. The \S* tells it to read as many non-whitespace characters as it can. Then the \s* tells sed to gobble up as many whitespace as it can. sed then replaces everything it matched with nothing, leaving the other stuff behind.
BTW, there's a help page in Stack overflow that tells you how to format your questions (I'm guessing that was the reason you got a downvote).
-------------- EDIT ---------
As pointed out, if you are on a Mac or something like that you have to use [:alnum:] instead of \S, and [:blank:] instead of \s in your sed expression (as these are portable to all platforms)
awk '/sub1/{ print $2,$3,$4 }' file
1 2 3
What happens? After regexp /sub1/ the three following fields are printed.
Any drawbacks? It affects the space.
Sed also works: sed -n -e 's/^'"$subj"' *//p' file1.txt
It outputs all lines matching $subj at the beginning of a line after having removed the matching word and the spaces following. If TABs are used the spaces should be replaced by something like [[:space:]].

Cut from column to end of line

I'm having a bit of an issue cutting the output up from egrep. I have output like:
From: First Last
From: First Last
From: First Last
I want to cut out the "From: " (essentially leaving the "First Last").
I tried
cut -d ":" -f 7
but the output is just a bunch of blank lines.
I would appreciate any help.
Here's the full code that I am trying to use if it helps:
egrep '^From:' $file | cut -d ":" -f 7
NOTE: I've already tested the egrep portion of the code and it works as expected.
The cut command lines in your question specify colon-separated fields and that you want the output to consist only of field 7; since there is no 7th field in your input, the result you're getting isn't what you intend.
Since the "From:" prefix appears to be identical across all lines, you can simply cut from the 7th character onward:
egrep '^From:' $file | cut -c7-
and get the result you intend.
you were really close.
I think you only need to replace ":" with " " as separator and add "-" after the "7": like this:
cut -d " " -f 2-
I tested and works pretty well.
The -f argument is for what fields. Since there is only one : in the line, there's only two fields. So changing -f 7 to -f 2- will give you want you want. Albeit with a leading space.
You can combine the egrep and cut parts into one command with sed:
sed -n 's/^From: //gp' $file
sed -n turns off printing by default, and then I am using p in the sed command explicitly to print the lines I want.
You can use sed:
sed 's/^From: *//'
OR awk:
awk -F ': *' '$1=="From"{print $2}'
OR grep -oP
grep -oP '^From: *\K.*'
Here is a Bash one-liner:
grep ^From file.txt | while read -a cols; do echo ${cols[#]:1}; done
See: Handling positional parameters at wiki.bash-hackers.org
cut itself is a very handy tool in bash
cut -d (delimiter character) -f (fields that you want as output)
a single field is given directly as -f 3 ,
range of fields can be selected as -f 5-9
so in your this particular case code would be
egrep '^From:' $file | cut -d\ -f 2-3
the delimiter is space here and can be escaped using a \
-f 1 corresponds to " From " and 2-3 corresponds to " First Last "

Cut command does not appear to be working

I'm piping a command to cut and nothing appears to be happening.
The output of the command looks like this:
Name File Info OS
11 FileName1 OS1
12 FileName2 OS2
13 FileName3 OS3
I'm trying to extract column 1,2 from all rows (starting with row 2) using the following:
my_command | cut -f1,2 and the output is exactly the same as the original.
Cut doen't behave well with multiple spaces as a delimiter. Use awk instead
mycommand | awk 'NR>1{print $1,$2}'
use tr -s to convert repeating spaces into single space. Now cut can be used where single space is delimiter seperating columns.
mycommand | tr -s ' ' | cut -d' ' -f1,2
If multiple spaces are used for a delimiter and the column positions are fixed, you would use column numbers with cut:
mycommand | cut -c1-27
Or you could lose the front spaces with:
mycommand | cut -c5-27
This will work even if your fields have embedded spaces. The awk method will fail if you have embedded spaces in your fields.

shell replace cr\lf by comma

I have input.txt
1
2
3
4
5
I need to get such output.txt
1,2,3,4,5
How to do it?
Try this:
tr '\n' ',' < input.txt > output.txt
With sed, you could use:
sed -e 'H;${x;s/\n/,/g;s/^,//;p;};d'
The H appends the pattern space to the hold space (saving the current line in the hold space). The ${...} surrounds actions that apply to the last line only. Those actions are: x swap hold and pattern space; s/\n/,/g substitute embedded newlines with commas; s/^,// delete the leading comma (there's a newline at the start of the hold space); and p print. The d deletes the pattern space - no printing.
You could also use, therefore:
sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
The -n suppresses default printing so the final d is no longer needed.
This solution assumes that the CRLF line endings are the local native line ending (so you are working on DOS) and that sed will therefore generate the local native line ending in the print operation. If you have DOS-format input but want Unix-format (LF only) output, then you have to work a bit harder - but you also need to stipulate this explicitly in the question.
It worked OK for me on MacOS X 10.6.5 with the numbers 1..5, and 1..50, and 1..5000 (23,893 characters in the single line of output); I'm not sure that I'd want to push it any harder than that.
In response to #Jonathan's comment to #eumiro's answer:
tr -s '\r\n' ',' < input.txt | sed -e 's/,$/\n/' > output.txt
tr and sed used be very good but when it comes to file parsing and regex you can't beat perl
(Not sure why people think that sed and tr are closer to shell than perl... )
perl -pe 's/\n/$1,/' your_file
if you want pure shell to do it then look at string matching
${string/#substring/replacement}
Use paste command. Here is using pipes:
echo "1\n2\n3\n4\n5" | paste -s -d, /dev/stdin
Here is using a file:
echo "1\n2\n3\n4\n5" > /tmp/input.txt
paste -s -d, /tmp/input.txt
Per man pages the s concatenates all lines and d allows to define the delimiter character.
Awk versions:
awk '{printf("%s,",$0)}' input.txt
awk 'BEGIN{ORS=","} {print $0}' input.txt
Output - 1,2,3,4,5,
Since you asked for 1,2,3,4,5, as compared to 1,2,3,4,5, (note the comma after 5, most of the solutions above also include the trailing comma), here are two more versions with Awk (with wc and sed) to get rid of the last comma:
i='input.txt'; awk -v c=$(wc -l $i | cut -d' ' -f1) '{printf("%s",$0);if(NR<c){printf(",")}}' $i
awk '{printf("%s,",$0)}' input.txt | sed 's/,\s*$//'
printf "1\n2\n3" | tr '\n' ','
if you want to output that to a file just do
printf "1\n2\n3" | tr '\n' ',' > myFile
if you have the content in a file do
cat myInput.txt | tr '\n' ',' > myOutput.txt
python version:
python -c 'import sys; print(",".join(sys.stdin.read().splitlines()))'
Doesn't have the trailing comma problem (because join works that way), and splitlines splits data on native line endings (and removes them).
cat input.txt | sed -e 's|$|,|' | xargs -i echo "{}"

Resources