How to extract text by unspecified spaces - bash

I'm trying to extract usernames from a text file that has one entry per line, and from my research it seems like the only way to do it is by splitting on spaces. Here's the format:
1 user 3
2 fusrfff 4
3 usrf 12
The only problem is that because all of the usernames are different lengths, I can't define a static column width. There's also the fact that the UIDs (the first numbers) go from 1 to 40k. There's a bunch of other information after the group number too. Can anyone point me in the right direction? Thanks.

awk does not care about the amount of space between fields:
awk '{print $2}' your_file.txt
If you want to go with bash only, read does not care either:
while read -r uid username other_stuff; do
printf '%s\n' "$username"
done < your_file.txt
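Either way, run against the sample lines from the question, the output is just the usernames:
$ awk '{print $2}' your_file.txt
user
fusrfff
usrf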

First replace runs of spaces with a single space. You can use sed -E 's/ +/ /g' or
tr -s " " < file.txt | cut -d" " -f2

This is using sed
$ sed 's/  */ /g' file.txt | cut -d' ' -f2
user
fusrfff
usrf

Related

Shell: Counting lines per column while ignoring empty ones

I am trying to simply count the lines in the .CSV per column, while at the same time ignoring empty lines.
I use the command below and it works for the 1st column:
cat /path/test.csv | cut -d, -f1 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 8
And below for the 2nd column:
cat /path/test.csv | cut -d, -f2 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 6
But when I try to count the 3rd column, it simply outputs the total number of lines in the whole .CSV.
cat /path/test.csv | cut -d, -f3 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 33
#Should be: 19?
I've also tried to use awk instead of cut, but get the same issue.
I have also tried creating a new file, thinking maybe the original had some spaces in the lines; still the same.
Can someone clarify what the difference is between reading the 1st and 2nd columns and the rest?
20355570_01.tif,,
20355570_02.tif,,
21377804_01.tif,,
21377804_02.tif,,
21404518_01.tif,,
21404518_02.tif,,
21404521_01.tif,,
21404521_02.tif,,
,22043764_01.tif,
,22043764_02.tif,
,22095060_01.tif,
,22095060_02.tif,
,23507574_01.tif,
,23507574_02.tif,
,,23507574_03.tif
,,23507804_01.tif
,,23507804_02.tif
,,23507804_03.tif
,,23509247_01.tif
,,23509247_02.tif
,,23509247_03.tif
,,23527663_01.tif
,,23527663_02.tif
,,23527663_03.tif
,,23527908_01.tif
,,23527908_02.tif
,,23527908_03.tif
,,23535506_01.tif
,,23535506_02.tif
,,23535562_01.tif
,,23535562_02.tif
,,23535636_01.tif
,,23535636_02.tif
That happens when the input file has DOS line endings (\r\n). Fix your file using dos2unix and your command will work for the 3rd column too.
dos2unix /path/test.csv
Or, you can remove the \r at the end while counting non-empty columns using awk:
awk -F, '{sub(/\r/,"")} $3!=""{n++} END{print n}' /path/test.csv
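To confirm the diagnosis before fixing anything, cat -v makes carriage returns visible as ^M; if the file really has DOS endings you will see something like:
$ head -n 3 /path/test.csv | cat -v
20355570_01.tif,,^M
20355570_02.tif,,^M
21377804_01.tif,,^M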
The problem is in the grep command: the way you wrote it will return 33 lines when you count the 3rd column.
It's better instead to use the following command to count the number of non-empty lines in the .CSV for a given column (example below is for the 3rd column):
cat /path/test.csv | cut -d , -f3 | grep -cve '^\s*$'
This will return the exact count for each column and avoids piping into wc.
See previous post here:
count (non-blank) lines-of-code in bash
edit: I think oguz ismail found the actual reason in their answer. If they are right and your file has Windows line endings, you can use one of the following commands without having to convert the file:
cut -d, -f3 yourFile.csv | tr -d '\r' | grep -c .
cut -d, -f3 yourFile.csv | grep -c $'[^\r]' # bash only
old answer: Since I cannot reproduce your problem with the provided input I take a wild guess:
The "empty" fields in the last column contain spaces. A field containing a space is not empty altough it looks like it is empty as you cannot see spaces.
To count only fields that contain something other than a space adapt your regex from . (any symbol) to [^ ] (any symbol other than space).
cut -d, -f3 yourFile.csv | grep -c '[^ ]'

bash to extract second half of name

Ok so with the new High Sierra, I am trying to write a script to automatically delete the local snapshots that eat up HDD space. I know you can shrink them using thinlocalsnapshots / 1000000000 4, but I feel like that is only a band-aid.
So what I am trying to do is extract the date 2018-02-##-###### from:
sudo tmutil listlocalsnapshots /
com.apple.TimeMachine.2018-02-15-170531
com.apple.TimeMachine.2018-02-15-181655
com.apple.TimeMachine.2018-02-15-223352
com.apple.TimeMachine.2018-02-16-000403
com.apple.TimeMachine.2018-02-16-013400
com.apple.TimeMachine.2018-02-16-033621
com.apple.TimeMachine.2018-02-16-063811
com.apple.TimeMachine.2018-02-16-080812
com.apple.TimeMachine.2018-02-16-090939
com.apple.TimeMachine.2018-02-16-100459
com.apple.TimeMachine.2018-02-16-110325
com.apple.TimeMachine.2018-02-16-122954
com.apple.TimeMachine.2018-02-16-141223
com.apple.TimeMachine.2018-02-16-151309
com.apple.TimeMachine.2018-02-16-161040
I have tried variations of
| awk '{print $ }' (insert number after $)
along with
| cut -d ' ' -f 10-.
Please, if you know what I am missing here, I would greatly appreciate it.
edit: Here is the script that will get rid of those pesky local snapshots, if anyone is interested. Thanks again:
#!/bin/bash
dates=$(tmutil listlocalsnapshots / | awk -F. '{print $4}')
for date in $dates
do
tmutil deletelocalsnapshots "$date"
done
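Since this deletes data, it may be worth a dry run first, echoing instead of deleting (a sketch based on the script above):
tmutil listlocalsnapshots / | awk -F. '{print "would delete:", $4}'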
You were close:
somecommand | cut -d"." -f4-
# or
somecommand | awk -F"." '{print $4}'
You can also try sed, but cut is made for this.
1- awk: you can either specify the field separator with the -F option, or print a substring
awk -F. '{print $4}'
awk '{print substr($0,23)}'
2- cut: equivalently.
cut -d. -f4
cut -c23-
3- Pure bash (sloooooow!): same as above.
while IFS=. read s1 s2 s3 d; do echo "$d"; done
while read line; do echo "${line:23}"; done
In practice, with a small number of records as in your use case, speed is not an issue and even pure bash or regexps (as in other answers) can be used. As the number of records grows, the higher speed of awk and cut becomes noticeable.
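For a quick sanity check, any of the variants can be fed a single snapshot name:
$ echo com.apple.TimeMachine.2018-02-15-170531 | cut -d. -f4
2018-02-15-170531
$ echo com.apple.TimeMachine.2018-02-15-170531 | awk -F. '{print $4}'
2018-02-15-170531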
Using grep and a regex:
$ sudo tmutil listlocalsnapshots / | grep -oP '\d{4}-\d{2}-\d{2}-\d{6}$'
2018-02-15-170531
2018-02-15-181655
2018-02-15-223352
2018-02-16-000403
2018-02-16-013400
2018-02-16-033621
2018-02-16-063811
2018-02-16-080812
2018-02-16-090939
2018-02-16-100459
2018-02-16-110325
2018-02-16-122954
2018-02-16-141223
2018-02-16-151309
2018-02-16-161040

Cut from column to end of line

I'm having a bit of an issue cutting the output up from egrep. I have output like:
From: First Last
From: First Last
From: First Last
I want to cut out the "From: " (essentially leaving the "First Last").
I tried
cut -d ":" -f 7
but the output is just a bunch of blank lines.
I would appreciate any help.
Here's the full code that I am trying to use if it helps:
egrep '^From:' $file | cut -d ":" -f 7
NOTE: I've already tested the egrep portion of the code and it works as expected.
The cut command lines in your question specify colon-separated fields and that you want the output to consist only of field 7; since there is no 7th field in your input, the result you're getting isn't what you intend.
Since the "From:" prefix appears to be identical across all lines, you can simply cut from the 7th character onward:
egrep '^From:' $file | cut -c7-
and get the result you intend.
You were really close.
I think you only need to replace ":" with " " as the separator and use -f 2- (note the trailing -), like this:
cut -d " " -f 2-
I tested it and it works pretty well.
The -f argument selects which fields to output. Since there is only one : in the line, there are only two fields. So changing -f 7 to -f 2- will give you what you want, albeit with a leading space.
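To see that leading space (and one way to strip it), a quick check:
$ echo 'From: First Last' | cut -d ':' -f 2-
 First Last
$ echo 'From: First Last' | cut -d ':' -f 2- | sed 's/^ //'
First Last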
You can combine the egrep and cut parts into one command with sed:
sed -n 's/^From: //gp' $file
sed -n turns off printing by default, and then I am using p in the sed command explicitly to print the lines I want.
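For example, on a line that matches and one that does not, only the match is printed; without -n the Subject: line would pass through too:
$ printf 'From: First Last\nSubject: hi\n' | sed -n 's/^From: //p'
First Last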
You can use sed:
sed 's/^From: *//'
OR awk:
awk -F ': *' '$1=="From"{print $2}'
OR grep -oP
grep -oP '^From: *\K.*'
Here is a Bash one-liner:
grep ^From file.txt | while read -a cols; do echo "${cols[@]:1}"; done
See: Handling positional parameters at wiki.bash-hackers.org
cut itself is a very handy tool in bash
cut -d (delimiter character) -f (fields that you want as output)
a single field is given directly as -f 3,
a range of fields can be selected as -f 5-9
so in this particular case the code would be
egrep '^From:' $file | cut -d\  -f 2-3
the delimiter is a space here, escaped using a \ (note there are two spaces: the escaped space delimiter, then a real space before -f)
-f 1 corresponds to "From:" and 2-3 corresponds to "First Last"

shell replace cr\lf by comma

I have input.txt
1
2
3
4
5
I need to get such output.txt
1,2,3,4,5
How to do it?
Try this:
tr '\n' ',' < input.txt > output.txt
With sed, you could use:
sed -e 'H;${x;s/\n/,/g;s/^,//;p;};d'
The H appends the pattern space to the hold space (saving the current line in the hold space). The ${...} surrounds actions that apply to the last line only. Those actions are: x swap hold and pattern space; s/\n/,/g substitute embedded newlines with commas; s/^,// delete the leading comma (there's a newline at the start of the hold space); and p print. The d deletes the pattern space - no printing.
You could also use, therefore:
sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
The -n suppresses default printing so the final d is no longer needed.
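Run against the sample input, either variant produces the requested single line:
$ printf '1\n2\n3\n4\n5\n' | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
1,2,3,4,5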
This solution assumes that the CRLF line endings are the local native line ending (so you are working on DOS) and that sed will therefore generate the local native line ending in the print operation. If you have DOS-format input but want Unix-format (LF only) output, then you have to work a bit harder - but you also need to stipulate this explicitly in the question.
It worked OK for me on MacOS X 10.6.5 with the numbers 1..5, and 1..50, and 1..5000 (23,893 characters in the single line of output); I'm not sure that I'd want to push it any harder than that.
In response to @Jonathan's comment to @eumiro's answer:
tr -s '\r\n' ',' < input.txt | sed -e 's/,$/\n/' > output.txt
tr and sed used to be very good, but when it comes to file parsing and regex you can't beat perl.
(Not sure why people think that sed and tr are closer to shell than perl...)
perl -pe 's/\n/,/ unless eof' your_file
if you want pure shell to do it then look at string matching
${string/#substring/replacement}
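For completeness, a pure-bash join without any external tools could look like this (a sketch; ${out:+,} expands to a comma only once out is non-empty, which avoids the trailing-comma problem):
out=
while IFS= read -r line
do
out=$out${out:+,}$line
done < input.txt
printf '%s\n' "$out" > output.txt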
Use the paste command. Here it is reading from a pipe:
printf '1\n2\n3\n4\n5\n' | paste -s -d, -
Here it is using a file:
printf '1\n2\n3\n4\n5\n' > /tmp/input.txt
paste -s -d, /tmp/input.txt
Per the man page, -s concatenates all lines and -d defines the delimiter character.
Awk versions:
awk '{printf("%s,",$0)}' input.txt
awk 'BEGIN{ORS=","} {print $0}' input.txt
Output - 1,2,3,4,5,
Since you asked for 1,2,3,4,5 as compared to 1,2,3,4,5, (note the comma after 5; most of the solutions above also include the trailing comma), here are two more versions with Awk (with wc and sed) to get rid of the last comma:
i='input.txt'; awk -v c=$(wc -l < "$i") '{printf("%s",$0);if(NR<c){printf(",")}}' "$i"
awk '{printf("%s,",$0)}' input.txt | sed 's/,\s*$//'
printf "1\n2\n3" | tr '\n' ','
if you want to output that to a file just do
printf "1\n2\n3" | tr '\n' ',' > myFile
if you have the content in a file do
cat myInput.txt | tr '\n' ',' > myOutput.txt
python version:
python -c 'import sys; print(",".join(sys.stdin.read().splitlines()))'
Doesn't have the trailing comma problem (because join works that way), and splitlines splits data on native line endings (and removes them).
sed -e 's|$|,|' input.txt | xargs | tr -d ' '

How to make the 'cut' command treat same sequential delimiters as one?

I'm trying to extract a certain (the fourth) field from the column-based, 'space'-adjusted text stream. I'm trying to use the cut command in the following manner:
cat text.txt | cut -d " " -f 4
Unfortunately, cut doesn't treat several spaces as one delimiter. I could have piped through awk
awk '{ printf $4; }'
or sed
sed -E "s/[[:space:]]+/ /g"
to collapse the spaces, but I'd like to know if there any way to deal with cut and several delimiters natively?
Try:
tr -s ' ' <text.txt | cut -d ' ' -f4
From the tr man page:
-s, --squeeze-repeats replace each input sequence of a repeated character
that is listed in SET1 with a single occurrence
of that character
As you comment in your question, awk is really the way to go. Using cut is possible together with tr -s to squeeze spaces, as kev's answer shows.
Let me, however, go through all the possible combinations for future readers. Explanations are in the Tests section.
tr | cut
tr -s ' ' < file | cut -d' ' -f4
awk
awk '{print $4}' file
bash
while read -r _ _ _ myfield _
do
echo "forth field: $myfield"
done < file
sed
sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' file
Tests
Given this file, let's test the commands:
$ cat a
this   is   line   1   more   text
this   is   line   2   more   text
this   is   line   3   more   text
this   is   line   4   more   text
tr | cut
$ cut -d' ' -f4 a
is
# it does not show what we want!
$ tr -s ' ' < a | cut -d' ' -f4
1
2 # this makes it!
3
4
$
awk
$ awk '{print $4}' a
1
2
3
4
bash
This reads the fields sequentially. By using _ we indicate a throwaway ("junk") variable, used to ignore those fields. This way, we store the 4th field of each line in $myfield, no matter how many spaces sit between the fields.
$ while read -r _ _ _ a _; do echo "4th field: $a"; done < a
4th field: 1
4th field: 2
4th field: 3
4th field: 4
sed
This catches three groups of non-spaces followed by spaces with ([^ ]*[ ]*){3}. Then, it captures whatever comes before the next space as the 4th field, which is finally printed with \2.
$ sed -r 's/^([^ ]*[ ]*){3}([^ ]*).*/\2/' a
1
2
3
4
shortest/friendliest solution
After becoming frustrated with the many limitations of cut, I wrote my own replacement, which I called cuts, for "cut on steroids".
cuts provides what is likely the most minimalist solution to this and many other related cut/paste problems.
One example, out of many, addressing this particular question:
$ cat text.txt
0 1 2 3
0 1 2 3 4
$ cuts 2 text.txt
2
2
cuts supports:
auto-detection of most common field-delimiters in files (+ ability to override defaults)
multi-char, mixed-char, and regex matched delimiters
extracting columns from multiple files with mixed delimiters
offsets from end of line (using negative numbers) in addition to start of line
automatic side-by-side pasting of columns (no need to invoke paste separately)
support for field reordering
a config file where users can change their personal preferences
great emphasis on user friendliness & minimalist required typing
and much more. None of which is provided by standard cut.
See also: https://stackoverflow.com/a/24543231/1296044
Source and documentation (free software): http://arielf.github.io/cuts/
This Perl one-liner shows how closely Perl is related to awk:
perl -lane 'print $F[3]' text.txt
However, the @F autosplit array starts at index $F[0] while awk fields start with $1
With the versions of cut I know of, no, this is not possible. cut is primarily useful for parsing files where the separator is not whitespace (for example /etc/passwd) and that have a fixed number of fields. Two separators in a row mean an empty field, and that goes for whitespace too.
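For example, /etc/passwd is exactly the kind of input cut handles well, since every line has the same colon-separated fields; the exact output varies by system, but it looks something like:
$ cut -d: -f1,7 /etc/passwd | head -n 3
root:/bin/bash
daemon:/usr/sbin/nologin
bin:/usr/sbin/nologin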
