How to delete a line of the text file from the output of checklist - bash

I have a text file:
$100 Birthday
$500 Laptop
$50 Phone
I created a --checklist from the text file
[ ] $100 Birthday
[*] $500 Laptop
[*] $50 Phone
the output is $500 $50
How can I delete the lines for $500 and $50 from the text file, please?
The expected output of text file:
$100 Birthday
Thank you!

with grep and cut
grep -xf <(grep '\[ ]' file2.txt | cut -d' ' -f3-) file1.txt
with grep and sed
grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
explanation
use grep to select lines from text file
$ grep Birthday file1.txt
100 Birthday
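cut splits a line into columns. -f 2 prints only the 2nd column, while -f 2- prints everything from the 2nd column onwards. The delimiter given with -d is a single space here (it must be quoted as ' ' or escaped with \ so that the shell passes it on)
and we can use a pipe | as input (instead of a file)
$ echo one two three | cut -d ' ' -f 2-
two three
$ grep Birthday file1.txt | cut -d ' ' -f 2-
Birthday
(if you escape the space with a backslash instead of quoting it, two spaces must follow the backslash: the first is the delimiter itself, the second separates it from -f)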
assuming we have a text file temp.txt
$ cat temp.txt
Birthday
Laptop
Phone
grep can also read its list of search patterns from another file (-f) instead
$ grep -f temp.txt file1.txt
100 Birthday
500 Laptop
50 Phone
or we print the file content with cat and hand its output to grep as if it were a file, using <(...)
$ grep -f <(cat temp.txt) file1.txt
100 Birthday
500 Laptop
50 Phone
Now let's generate temp.txt from the checklist. You only want to grep the lines containing [ ] and cut from the 3rd column onwards (again, some characters have a special meaning and must therefore be escaped: \[)
$ grep '\[ ]' file2.txt
[ ] 100 Birthday
$ grep '\[ ]' file2.txt | cut -d' ' -f3-
100 Birthday
You don't need temp.txt and can pass the list straight to grep -f with what is called process substitution <(...)
$ grep -f <(grep '\[ ]' file2.txt | cut -d' ' -f3-) file1.txt
100 Birthday
grep reads every line from temp.txt as a PATTERN, and some characters have a special meaning in a regex: ^ stands for the beginning of a line and $ for the end. To be nitpicky, the correct search pattern should therefore be '^100 Birthday$' so that it won't match "1100 Birthday 2".
You might have noticed that I dropped the $ currency sign from your input files for a reason. You can keep it, but then tell grep to take all input literally with -F, and add the -x flag, which matches only whole lines such as "100 Birthday" (no regex for line start/end needed)
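A minimal sketch of what -F and -x change, assuming two throwaway files (patterns.txt and data.txt are made-up names for this demonstration only):

```shell
# Hypothetical demo files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '$100 Birthday' > "$tmp/patterns.txt"
printf '%s\n' '$100 Birthday' '$100 Birthday cake' > "$tmp/data.txt"

# -F alone: literal substring match, so the longer line matches too.
loose=$(grep -F -f "$tmp/patterns.txt" "$tmp/data.txt")

# -F -x: the whole line must be equal to the pattern.
strict=$(grep -F -x -f "$tmp/patterns.txt" "$tmp/data.txt")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

With -F the $ needs no escaping, and with -x a longer line that merely contains the pattern no longer slips through.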
sed [OPTION] 's/regexp/replacement/command' [file]
sed is more common when it comes to text editing. Instead of grep | cut we can do it with one single command:
grep '\[ ]' | cut -f3- and sed 's/\[ ] *//'
are basically targeting the same lines and delete [ ] from them.
There are however some extra flags required, because sed is a stream editor and prints every input line by default. To emulate grep's behavior we use
-n option to suppress the automatic printing
p command to print only the changed lines
and for regexp
\[ ] (text to replace)
' *' = ' ' (whitespace) + * (star)
meaning: the previous character repeated 0 or more times, in particular all the whitespace that follows the checkbox
(replacement is empty because we want just delete)
so working similar sed command will look like this
sed -n 's/\[ ] *//p' file2.txt
And that's, in my opinion, all it takes for a checklist. You have, however, two redundant files and want to match your cloned checklist against the original file, so let me show you more complicated things.
Instead of deleting the checkbox, let's output captured groups. These examples explain it better than I can. \1 stands for the first capture group ( ), \2 for the second, and so on (kind of internal variables). The -E flag enables the unescaped ( ) syntax.
$ echo aaabcccdd | sed -E 's/(aaa)b(ccc)dd/\1/'
aaa
$ echo aaabcccdd | sed -E 's/(aaa)b(ccc)dd/\2/'
ccc
$ echo aaabcccdd | sed -E 's/(aaa)b(ccc)dd/\1 \2/'
aaa ccc
$ echo aaabcccdd | sed -E 's/(aaa)b(ccc)dd/lets \1 replace \2 this/'
lets aaa replace ccc this
so in this example sed 's/\[ ] \(.*\)/\1/' we use for regexp (in sed's default basic regex syntax the group parentheses must be escaped as \( \))
\[ ] (text to replace)
' ' (trailing whitespace)
and inside the first capture group \( \) the desired "100 Birthday"
.* = . (dot) + * (star)
meaning: the previous character repeated 0 or more times (in particular a dot here)
but the dot . itself is regex for ANY char now (special meaning)
so the capture group is all the rest of the line
and for replacement we use (only)
\1 first capture group
$ sed -n 's/\[ ] \(.*\)/\1/p' file2.txt
100 Birthday
But there is more :)
Instead of matching only a literal ' ' space, there are further regex tokens with a special meaning (extended regex)
\s will match a space or a tab
+ the previous character repeated 1 or more times (note the difference to *, 0 or more times)
\s+ will match a series of whitespace characters
and to make it work we need one more flag
-r use extended regular expressions (GNU sed; -E is the more portable spelling)
so with this command you can extract all search patterns from your cloned checklist...
$ sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt
100 Birthday
...and finally let it run against your original file (without the need of temp.txt)
$ grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
100 Birthday
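To actually update the text file rather than only print the surviving lines, the result can be written to a temporary file and moved over the original. A minimal sketch, assuming the original $ signs are kept in both files; keep.txt and file1.new are hypothetical helper names:

```shell
# Recreate the two files from the question in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '$100 Birthday' '$500 Laptop' '$50 Phone' > "$tmp/file1.txt"
printf '%s\n' '[ ] $100 Birthday' '[*] $500 Laptop' '[*] $50 Phone' > "$tmp/file2.txt"

# Extract the entries whose box is still unchecked.
sed -n 's/^\[ ] *//p' "$tmp/file2.txt" > "$tmp/keep.txt"

# -F takes "$" literally, -x matches whole lines only.
grep -Fxf "$tmp/keep.txt" "$tmp/file1.txt" > "$tmp/file1.new"
mv "$tmp/file1.new" "$tmp/file1.txt"

result=$(cat "$tmp/file1.txt")
echo "$result"
rm -rf "$tmp"
```

Writing to a second file first matters: redirecting grep's output straight into file1.txt would truncate the file before grep reads it.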

Related

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space separated substring on a newline.
tr -d '[:punct:]' to remove punctuation
sort and uniq to make a sorted file to use with comm which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in $(tr -d '[:punct:]' < file1); do grep -wqi -- "$A" file2 || echo "$A"; done
flags used for grep: q for quiet (no output needed), w for whole-word match, i to ignore case (so that "This" matches "this")
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if($i != "" && !(tolower($i) in a)) # if the field is non-empty and not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2
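The comm idea from the question can also be made to work. A minimal sketch, assuming the two sample files from the question; words_2.sorted is a hypothetical helper file, and case is folded with tr instead of relying on a comm -i flag, which not every comm implementation has:

```shell
# Recreate the sample files from the question.
tmp=$(mktemp -d)
printf '%s\n' 'This is a' 'file containing several' 'lines of text.' > "$tmp/file_1.txt"
printf '%s\n' this contains containing text > "$tmp/file_2.txt"

# comm needs both inputs sorted in the same collation order.
LC_ALL=C sort -u "$tmp/file_2.txt" > "$tmp/words_2.sorted"

# One word per line, lowercased, de-duplicated; comm -23 keeps
# the lines that occur only in the first input.
missing=$(
  tr -cs '[:alpha:]' '\n' < "$tmp/file_1.txt" |
  tr '[:upper:]' '[:lower:]' |
  LC_ALL=C sort -u |
  LC_ALL=C comm -23 - "$tmp/words_2.sorted"
)
echo "$missing"
rm -rf "$tmp"
```

Unlike the grep-based answers, the result comes out in sorted order (a, file, is, lines, of, several) rather than in the order of appearance.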

Using sed with delimiters similar to cut

Given a file foo.txt containing file names such as:
2015_275_14_1,Siboney_by_The_Tailor_Maids
2015_275_16_1,Louis_Armstrong_Cant_Give_You_Anything_But_Love
2015_275_17_1,Benny_Goodman_Trio_Nice_Work_Avalon
2015_275_18_1,Feather_On_Jazz_Jazz_In_The_Concert_Hall
2015_235_1_1,Integration_Report_1
2015_273_2_1_1,Cab_Calloway_Home_Movie_1
2015_273_2_2_1,Cab_Calloway_Home_Movie_2
I want to replace the _ in the part before the comma with . and the _ in the second part after the comma with a space.
I can accomplish each individually with:
sed -E -i '' 's/([0-9]{4})_([0-9]{3})_([0-9]{2})_([0-9])/\1.\2.\3.\4./'
for the first part, and the second part then with:
sed -E -i '' "s/_/ /g"
But I was hoping to accomplish it in an easier fashion by using cut with sed but that doesn't work:
cut -d "," -f 1 foo.txt | sed -E -i '' "s/_/./g" foo.txt && cut -d "," -f 2 foo.txt | sed -E -i '' "s/_/ /g" foo.txt
No good.
So, is there a way to accomplish this with sed or maybe awk or maybe something else where I'm treating the , as a delimiter such as in cut?
Desired output:
2015.275.14.1,Siboney by The Tailor Maids
You can use awk to attain your goal, here's the method.
$ awk -F',' '{gsub(/_/,".",$1);gsub(/_/," ",$2);printf "%s,%s\n",$1,$2}' file
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
Similar to @CWLiu's answer, but I use OFS (output field separator) instead of adding the comma back in and having to add a newline with printf.
awk -F ',' 'BEGIN {OFS = FS} {gsub(/_/, ".", $1); gsub(/_/, " ", $2); print;}' foo.txt
Explanation:
-F ',' sets the field separator
BEGIN {OFS = FS} sets the output field separator (default space) equal to the field separator so the comma is printed back out
gsub(/_/, ".", $1) global substitution on the first column
gsub(/_/, " ", $2) global substitution on the second column
print print the whole line
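A small sketch of why OFS matters here: as soon as a field is modified, awk rebuilds the record and joins the fields with OFS. The variable names are made up for the demonstration.

```shell
line='2015_275_14_1,Siboney_by_The_Tailor_Maids'

# Without OFS, modifying $1 makes awk rejoin the fields with a space.
without_ofs=$(printf '%s\n' "$line" | awk -F ',' '{gsub(/_/, ".", $1)} 1')

# With OFS = FS the comma survives the rebuild.
with_ofs=$(printf '%s\n' "$line" | awk -F ',' 'BEGIN {OFS = FS} {gsub(/_/, ".", $1)} 1')

printf '%s\n%s\n' "$without_ofs" "$with_ofs"
```

The first line printed has a space where the comma used to be; the second keeps the comma.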
$ awk 'BEGIN{FS=OFS=","} {gsub(/_/,".",$1); gsub(/_/," ",$2)} 1' file
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
Try this for GNU sed:
$ cat input.txt
2015_275_14_1,Siboney_by_The_Tailor_Maids
2015_275_16_1,Louis_Armstrong_Cant_Give_You_Anything_But_Love
2015_275_17_1,Benny_Goodman_Trio_Nice_Work_Avalon
2015_275_18_1,Feather_On_Jazz_Jazz_In_The_Concert_Hall
2015_235_1_1,Integration_Report_1
2015_273_2_1_1,Cab_Calloway_Home_Movie_1
2015_273_2_2_1,Cab_Calloway_Home_Movie_2
$ sed -r ':loop;/^[^_]+,/{s/_/ /g;bend};s/_/./;bloop;:end' input.txt
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
Explanation:
use s/_/./ to substitute _ with . until every _ before the , has been substituted, which is detected by ^[^_]+,;
then, once ^[^_]+, matches, use s/_/ /g to substitute every remaining _ (all of them after the ,) with a space
You could cut and paste:
$ paste -d, <(cut -d, -f1 infile | sed 'y/_/./') <(cut -d, -f2 infile | sed 'y/_/ /')
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
The process substitution <() lets you treat the output of commands like a file, and paste -d, pastes the output of each command side-by-side, separated by a comma.
The sed y command transliterates characters and is, in this case, equivalent to s/_/./g and s/_/ /g, respectively.
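A quick way to convince yourself of that equivalence, using one of the sample identifiers (the variable names are only for illustration):

```shell
sample='2015_275_14_1'

# y/_/./ maps every "_" to "." character by character ...
with_y=$(printf '%s\n' "$sample" | sed 'y/_/./')

# ... which gives the same result as a global substitution.
with_s=$(printf '%s\n' "$sample" | sed 's/_/./g')

printf '%s\n%s\n' "$with_y" "$with_s"
```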
You could also do it purely in sed, but it's a bit unwieldy:
sed 'h;s/.*,//;y/_/ /;x;s/,.*//;y/_/./;G;s/\n/,/' infile
Explained:
h # Copy pattern space to hold space
s/.*,// # Remove first part including comma
y/_/ / # Replace all "_" by spaces in the remaining second part
x # Swap pattern and hold space
s/,.*// # Remove second part including comma
y/_/./ # Replace all "_" by periods in the remaining first part
G # Append hold space to pattern space
s/\n/,/ # Replace linebreak with comma
Or, alternatively (from comment by potong):
sed 's/,/\n/;h;y/_/ /;x;y/_/./;G;s/\n.*\n/,/' infile
Explained:
s/,/\n/ # Replace comma by linebreak
h # Copy pattern space to hold space
y/_/ / # Replace all "_" by spaces
x # Swap pattern and hold space
y/_/./ # Replace all "_" by periods
G # Append hold space
s/\n.*\n/,/ # Remove second and third line in pattern space
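For completeness, the comma can also be treated as a delimiter by the shell itself. This is a rough sketch, not one of the answers above: it splits each line with IFS=, and read, and it assumes every line contains exactly one comma. The variable names id and title are made up.

```shell
# Two sample lines from the question in a temporary file.
tmp=$(mktemp -d)
printf '%s\n' \
  '2015_275_14_1,Siboney_by_The_Tailor_Maids' \
  '2015_273_2_1_1,Cab_Calloway_Home_Movie_1' > "$tmp/foo.txt"

# Split on the comma, then translate "_" differently in each part.
converted=$(
  while IFS=, read -r id title; do
    printf '%s,%s\n' "$(printf '%s' "$id" | tr '_' '.')" \
                     "$(printf '%s' "$title" | tr '_' ' ')"
  done < "$tmp/foo.txt"
)
echo "$converted"
rm -rf "$tmp"
```

This starts two tr processes per line, so for large files the awk or sed versions are likely to be much faster.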

Grep -A1 -f returns more results than it should

This is my problem:
I have a fasta file with genetic data like so (my.fasta):
>TR1|c0_g1_i1
GTCGAGCATGGTCTTGGTCATCT
>TR2|c0_g1_i1
AAGCAGTGCAGAAGAACTGGCGAA...
I also have a list of names which is a subset of the my.fasta file and I want to pull out the sequences for them (names.list):
TR3|c0_g1_i1
TR4|c0_g1_i1
What I want to get is this:
>TR3|c0_g1_i1
CGGATCATGGTCTTGGTCAAAA
>TR4|c0_g1_i1
ATTGGGGGTTTTAAACTGGCGAA...
I'm doing: grep -A1 -f names.list my.fasta | grep -v "^--$" > new.fasta
But! I have 30566 names in my names.list and when I do grep -c ">" new.fasta I get 31080.
I've grep ">" new.fasta | cut -d' ' -f1 | tr -d '>' > new.names.list
and then cat names.list new.names.list > names.all.list
and sort names.all.list | uniq -c | grep " 1 " | sed -r 's/ 1 //' > names.extra.list and ended up with 514 extra names. How did they get there?!
Names list for the whole my.fasta: http://speedy.sh/PQpdD/names.myfasta.list
Names list for the subset I want: http://speedy.sh/kzqKr/names.list
Thanks!
Some of your names are substrings of other names, for example TR74928|c6_g4_i1 and TR74928|c6_g4_i10, so grep returns more than one matching header for such a name.
To solve this:
sed -e 's/^/>/g' names.list > copy.list
to get the names prefixed with > just like in your file my.fasta, then:
grep -A1 -x -f copy.list my.fasta | grep -v "^--$" > new.fasta
to match exactly the lines containing your identifiers.
-x, --line-regexp
Select only those matches that exactly match the whole line. This
option has the same effect as anchoring the expression with ^ and $.
A simpler solution is:
grep -A1 -w -f names.list my.fasta | grep -v "^--$" > new.fasta
but this will work only if no identifier line in my.fasta has more than one "word" (the identifier).
-w, --word-regexp
Select only those lines containing matches that form whole words. The
test is that the matching substring must either be at the beginning of
the line, or preceded by a non-word constituent character. Similarly,
it must be either at the end of the line or followed by a non-word
constituent character. Word-constituent characters are letters,
digits, and the underscore.
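The effect is easy to reproduce on a tiny example. The sketch below uses invented sequences and a hypothetical helper file headers.list; -F is added so that the | in the identifiers is always taken literally:

```shell
# Tiny made-up fasta file in which one name is a prefix of another.
tmp=$(mktemp -d)
printf '%s\n' '>TR1|c0_g1_i1' 'AAAA' '>TR1|c0_g1_i10' 'CCCC' '>TR2|c0_g1_i1' 'GGGG' > "$tmp/my.fasta"
printf '%s\n' 'TR1|c0_g1_i1' > "$tmp/names.list"

# Substring matching: the single name hits two header lines.
loose_count=$(grep -c -F -f "$tmp/names.list" "$tmp/my.fasta")

# Prefix ">" and require a literal whole-line match instead.
sed 's/^/>/' "$tmp/names.list" > "$tmp/headers.list"
exact=$(grep -A1 -F -x -f "$tmp/headers.list" "$tmp/my.fasta" | grep -v '^--$')

echo "$loose_count"
echo "$exact"
rm -rf "$tmp"
```

The loose count is 2 for a single name, while the exact match returns only the intended header and its sequence.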

Get last four characters from a string

I am trying to parse the last 4 characters of Mac serial numbers from terminal. I can grab the serial with this command:
serial=$(ioreg -l |grep "IOPlatformSerialNumber"|cut -d ""="" -f 2|sed -e s/[^[:alnum:]]//g)
but I need to output just the last 4 characters.
Found it in a linux forum echo ${serial:(-4)}
Using a shell parameter expansion to extract the last 4 characters after the fact works, but you could do it all in one step:
ioreg -k IOPlatformSerialNumber | sed -En 's/^.*"IOPlatformSerialNumber".*(.{4})"$/\1/p'
ioreg -k IOPlatformSerialNumber returns far fewer lines than ioreg -l, so it speeds up the operation considerably (about 80% faster on my machine).
The sed command matches the entire line of interest, and replaces it with the last 4 characters before the " that ends the line; i.e., it returns the last 4 chars. of the value.
Note: The ioreg output line of interest looks something like this:
| "IOPlatformSerialNumber" = "A02UV13KDNMJ"
As for your original command: cut -d ""="" is the same as cut -d = - the shell simply removes the empty strings around the = before cut sees the value. Note that cut only accepts a single delimiter char.
You can also do: grep -Eo '.{4}$' <<< "$serial"
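If the shell in use does not support the ${serial:(-4)} substring syntax, the same result can be had with plain POSIX parameter expansion. A minimal sketch, assuming the sample ioreg line quoted above as input (it is hard-coded here so the snippet runs without ioreg):

```shell
# Sample line in the format shown above, standing in for real ioreg output.
line='    |   "IOPlatformSerialNumber" = "A02UV13KDNMJ"'

# Pull out the quoted value.
serial=$(printf '%s\n' "$line" | sed -n 's/.*"IOPlatformSerialNumber" = "\([^"]*\)".*/\1/p')

# ${serial%????} is the serial without its last four characters;
# stripping that prefix from the serial leaves exactly the last four.
last4=${serial#"${serial%????}"}
echo "$last4"
```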
I don't know what the output of ioreg -l looks like, but it seems to me that you are using a lot of pipes for something that awk alone could handle:
# -F= uses = as the field separator
awk -F= '/IOPlatformSerialNumber/ { # match lines containing IOPlatform...
gsub(/[^[:alnum:]]/, "", $2) # remove all non-alphanumeric chars from the 2nd field
print substr($2, length($2)-3, length($2)) # print last 4 characters
}'
Or even sed (a bit ugly because of the repetition): catch the first 4 alphanumeric characters occurring after the first =:
sed -rn '/IOPlatformSerialNumber/{
s/^[^=]*=[^a-zA-Z0-9]*([a-zA-Z0-9])[^a-zA-Z0-9]*([a-zA-Z0-9])[^a-zA-Z0-9]*([a-zA-Z0-9])[^a-zA-Z0-9]*([a-zA-Z0-9]).*$/\1\2\3\4/;p
}'
Test
$ cat a
aaa
bbIOPlatformSerialNumber=A_+23B/44C//55=ttt
IOPlatformSerialNumber=A_+23B/44C55=ttt
asdfasd
The last 4 alphanumeric characters between the 1st and 2nd = are 4C55:
$ awk -F= '/IOPlatformSerialNumber/ {gsub(/[^[:alnum:]]/, "", $2); print substr($2, length($2)-3, length($2))}' a
4C55
4C55
Without you posting some sample output of ioreg -l this is untested and a guess but it looks like all you need is something like:
ioreg -l | sed -r -n 's/IOPlatformSerialNumber=[[:alnum:]]+([[:alnum:]]{4})/\1/p'

Count how many times each word from a word list appears in a file?

I have a file, list.txt, which contains a list of words. I want to check how many times each word appears in another file, file1.txt, then output the results. A simple output of all of the numbers is sufficient, as I can manually add them to list.txt with a spreadsheet program, but if the script adds the numbers at the end of each line in list.txt, that is even better, e.g.:
bear 3
fish 15
I have tried this, but it does not work:
cat list.txt | grep -c file1.txt
You can do this in a loop that reads a single word at a time from a word-list file, and then counts the instances in a data file. For example:
while read; do
echo -n "$REPLY "
grep -F -ow "$REPLY" data.txt | wc -l
done < <(sort -u word_list.txt)
The "secret sauce" consists of:
using the implicit REPLY variable;
using process substitution to collect words from the word-list file; and
ensuring that you are grepping for whole words in the data file.
This awk method only has to pass through each file once:
awk '
# read the words in list.txt
NR == FNR {count[$1]=0; next}
# process file1.txt
{
for (i=1; i<=NF; i++)
if ($i in count)
count[$i]++
}
# output the results
END {
for (word in count)
print word, count[word]
}
' list.txt file1.txt
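Since for (word in count) visits the words in an unspecified order, the output may not line up with list.txt. A possible variation that remembers the original order; the extra order array and the sample data are assumptions made for this sketch:

```shell
# Small made-up sample data.
tmp=$(mktemp -d)
printf '%s\n' bear fish wolf > "$tmp/list.txt"
printf '%s\n' 'bear fish bear' 'fish' 'bearing fish' > "$tmp/file1.txt"

counts=$(awk '
  # first file: remember each word and its position in the list
  NR == FNR { order[++n] = $1; count[$1] = 0; next }
  # second file: count whole fields that are in the list
  { for (i = 1; i <= NF; i++) if ($i in count) count[$i]++ }
  # print in the order of list.txt
  END { for (k = 1; k <= n; k++) print order[k], count[order[k]] }
' "$tmp/list.txt" "$tmp/file1.txt")
echo "$counts"
rm -rf "$tmp"
```

Because whole fields are compared, bearing does not count towards bear, and words that never occur are still listed with 0.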
This might work for you (GNU sed):
tr -s ' ' '\n' < file1.txt |
sort |
uniq -c |
sed -e '1i\s|.*|& 0|' -e 's/\s*\(\S*\)\s\(\S*\)\s*/s|\\<\2\\>.*|\2 \1|/' |
sed -f - list.txt
Explanation:
Split file1.txt into words
Sort the words
Count the words
Create a sed script to match the words (initially zero out each word)
Run the above script against the list.txt
Single-line command:
cat file1.txt |tr " " "\n"|sort|uniq -c |sort -n -r -k 1 |grep -w -f list.txt
The last part of the command tells grep to read the words to match from the list (-f option) and to match whole words only (-w), i.e. if list.txt contains car, grep should ignore carriage.
However, keep in mind that your view of a whole word and grep's view might differ. E.g. although car will not match carriage, it will match car-wash, because "-" is treated as a word boundary. grep takes anything except letters, digits, and underscores as a word boundary, which should not be a problem, as this conforms to the accepted definition of a word in the English language.
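The word-boundary behaviour described above can be checked directly. The four input words are made up for the demonstration:

```shell
# "-" ends a word for grep -w, while "_" and letters do not.
matches=$(printf '%s\n' car carriage car-wash race_car | grep -w car)
echo "$matches"
```

Only car and car-wash are printed; carriage and race_car are skipped because the characters adjacent to car in them are word constituents.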
