find the number of entries in a file and remove those entries using a shell script - shell

I have collected the sizes of all files larger than 40k on my system and stored them in a text file. I need to process that file to count how many times each entry occurs and then delete all of those entries.
I have the following code, but it does not seem to be working properly.
#! /bin/sh
rm -rf /home/b/Desktop/CalcfileSizeGreater40.txt
filename="/home/b/Desktop/fileSizeGreater40.txt"
cat $filename | while read line
do
number_of_times=`cat $filename | grep $line | wc -l`
echo $line:$number_of_times
echo $line : $number_of_times >> /home/b/Desktop/CalcfileSizeGreater40.txt
sed '/$line/d' $filename >tmp
mv tmp $filename
done
When I look at CalcfileSizeGreater40.txt I can see:
131072 : 4
65553 : 9
65553 : 9
65553 : 9
65553 : 9
65553 : 9
65553 : 9
131072 : 4
65553 : 9
65553 : 9
65553 : 9
Any ideas as to where I am going wrong?

You can simplify this line:
number_of_times=`cat $filename | grep $line | wc -l`
to:
number_of_times=$(grep -c "$line" "$filename")
The use of $(...) in place of back-quotes is extra beneficial when you need to nest command execution. You can count occurrences with grep, and you never needed to use cat. It is a good idea to get into the habit of enclosing file names in variables in double quotes just in case the file names end up with spaces in them.
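To illustrate the nesting point, here is a small sketch. The path is the one from the question; the variable names `parent` and `parent_bq` are made up for this example.

```shell
filename="/home/b/Desktop/fileSizeGreater40.txt"

# $(...) nests without extra escaping: the inner substitution runs first.
parent=$(basename "$(dirname "$filename")")

# The back-quote form needs the inner back-quotes escaped with backslashes.
parent_bq=`basename \`dirname "$filename"\``

echo "$parent"      # prints: Desktop
echo "$parent_bq"   # prints: Desktop
```

Both forms give the same result, but the escaped back-quotes get harder to read with every extra level of nesting.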
Editing the file that you are using cat on is not a good idea. Because of the way you are operating, the initial cat will echo every line of the original file in turn, completely ignoring any changes you make to a (different) file of the same name with the editing commands. This is why some of your names showed up a lot in the output.
However, what you are basically trying to do is count the number of occurrences of each line in the file. This is conventionally done with:
sort "$filename" |
uniq -c
The sort groups all identical sets of lines together in the file, and uniq -c counts the number of occurrences of each distinct line. It does, however, output the count before the line, so that has to be reversed — we can use sed for that. So, your script could be just:
sizefile="/home/b/Desktop/CalcfileSizeGreater40.txt"
rm -f "$sizefile"
filename="/home/b/Desktop/fileSizeGreater40.txt"
sort "$filename" |
uniq -c |
sed 's/^[ ]*\([0-9][0-9]*\)[ ]\(.*\)/\2 : \1/' > "$sizefile"
I'd be cautious about using rm -fr on your CalcfileSizeGreater40.txt; rm -f is sufficient for a file, and you probably don't want to remove stuff if it isn't a file but is a directory.
One pleasant side effect of this is that the code is a lot more efficient than the original as it makes one pass through the file (unless it is so big that sort has to split it up to handle it).
I am finding the sed code a little difficult to follow.
I should have explained that the [ ] bits are meant to represent a blank and a tab. On my machine, it appears that uniq only generates spaces, so you could simplify that to:
sed 's/^ *\([0-9][0-9]*\) \(.*\)/\2 : \1/'
The regex looks for the start of a line, any number of blanks, and then a number (which it remembers as \1 because of the \(...\) enclosing it), followed by a space and then 'everything else', which is also remembered (as '\2'). The replacement then prints the 'everything else' followed by a space, colon, space and the count.
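As a quick check of the pipeline and the regex, here is a minimal sketch run on made-up data. The function name `count_sizes` and the sample values are assumptions for illustration, not part of the original script.

```shell
# count_sizes FILE: print "value : count" for each distinct line in FILE.
count_sizes() {
    LC_ALL=C sort "$1" |
    uniq -c |
    sed 's/^ *\([0-9][0-9]*\) \(.*\)/\2 : \1/'
}

# Sample data standing in for fileSizeGreater40.txt (made up for this demo).
sample=$(mktemp)
printf '%s\n' 65553 131072 65553 65553 131072 > "$sample"
count_sizes "$sample"
# prints:
# 131072 : 2
# 65553 : 3
rm -f "$sample"
```

Each distinct value appears exactly once in the output, which is what the original loop was trying to achieve.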

sort -g $filename | uniq -c
You will get the count followed by the number on every line:
10 500000
1 10000
You just need to swap the two fields on every line:
sort -g $filename | uniq -c | while read a b; do echo $b $a ; done

Related

1. How to use the input not including the first one 2.Using grep and sed to find the pattern entered by the user and how to create the next line

The command that I'm making wants the first input to be a file and search how many times a certain pattern occurs within the file, using grep and sed.
Ex:
$ cat file1
oneonetwotwotwothreefourfive
Intended output:
$ ./command file1 one two three
one 2
two 3
three 1
The problem is the file does not have any lines and is just a long list of letters. I'm trying to use sed to replace the pattern I'm looking for with "FIND" and move the list to the next line and this continues until the end of file. Then, use $grep FIND to get the line that contains FIND. Finally, use wc -l to find a number of lines. However, I cannot find the option to move the list to the next line
Ex:
$cat file1
oneonetwosixone
Intended output:
FIND
FIND
twosixFIND
Another problem that I've been having is how to use the rest of the input, not including the file.
Failed attempt:
file=$1
for PATTERN in 2 3 4 5 ... N
do
variable=$(sed 's/$PATTERN/find/g' $file | grep FIND $file | wc -l)
echo $PATTERN $variable
exit
Another failed attempt:
file=$1
PATTERN=$($2,$3 ... $N)
for PATTERN in $*
do variable=$(sed 's/$PATTERN/FIND/g' $file | grep FIND $file | wc-1)
echo $PATTERN $variable
exit
Any suggestions and help will be greatly appreciated. Thank you in advance.
Non-portable solution with GNU grep:
file=$1
shift
for pattern in "$@"; do
echo "$pattern" $(grep -o -e "$pattern" <"$file" | wc -l)
done
If you want to use sed and your "patterns" are actually fixed strings (which don't contain characters that have special meaning to sed), you could do something like:
file=$1
shift
for pattern in "$@"; do
echo "$pattern" $(
sed "s/$pattern/\n&\n/g" "$file" |\
grep -e "$pattern" | wc -l
)
done
Your code has several issues:
you should quote use of variables where word splitting may happen
don't use ALLCAPS variable names - they are reserved for use by the shell
if you put a string in single-quotes, variable expansion does not happen
if you give grep a file, it won't read standard input
your for loop has no terminating done
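Putting those points together, a corrected version of the question's loop might look like the sketch below. Note that it swaps in a different technique: instead of GNU `grep -o`, it relies on awk's `gsub`, which returns the number of replacements it made. The function name `count_patterns` is made up for this example, and the patterns are assumed to be plain strings without regex metacharacters or backslashes.

```shell
# count_patterns FILE PATTERN...: print each pattern and how often it occurs.
count_patterns() {
    file=$1
    shift                        # "$@" now holds only the patterns
    for pattern in "$@"; do
        # gsub returns the number of replacements made on each line;
        # replacing a match with itself ("&") leaves the text unchanged.
        count=$(awk -v pat="$pattern" '{ n += gsub(pat, "&") } END { print n + 0 }' "$file")
        echo "$pattern $count"
    done
}

sample=$(mktemp)
echo 'oneonetwotwotwothreefourfive' > "$sample"
count_patterns "$sample" one two three
# prints:
# one 2
# two 3
# three 1
rm -f "$sample"
```

Because the count comes from awk, the file does not need to be split into lines first.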
This might work for you (GNU bash,sed and uniq):
f(){ local file=$1;
shift;
local args="$@";
sed -E 's/'${args// /|}'/\n&\n/g
s/(\n\S+)\n\S+/\1/g
s/\n+/\n/g
s/.(.*)/echo "\1"|uniq -c/e
s/ *(\S+) (\S+)/\2 \1/mg' $file; }
Separate arguments into file and remaining arguments.
Apply arguments as alternation within a sed substitution command which splits words into lines separated by a newline either side.
Remove unwanted words and unwanted newlines.
Evaluate the manufactured file within a sed substitution using the uniq command with the -c option.
Rearrange the output and print the result.
The problem is the file does not have any lines
Great! So the problem reduces to putting newlines.
func() {
file=$1
shift
rgx=$(printf "%s\\|" "$@" | sed 's#\\|$##');
# put the newline between words
sed 's/\('"$rgx"'\)/&\n/g' "$file" |
# it's just standard here
sort | uniq -c |
# filter only input - i.e. exclude fourfive
grep -xf <(printf " *[0-9]\+ %s\n" "$@")
};
func <(echo oneonetwotwotwothreefourfive) one two three
outputs:
2 one
1 three
3 two

grep from 7 GB text file OR many smaller ones

I have about two thousand text files in folder.
I want to loop each one and search for specific word in line.
for file in "./*.txt";
do
cat $file | grep "banana"
done
I was wondering if join all text files into one file would be faster.
The whole directory has about 7 GB.
You're not actually looping, you're calling cat just once on the string ./*.txt, i.e., your script is equivalent to
cat ./*.txt | grep 'banana'
This is not equivalent to
grep 'banana' ./*.txt
though, as the output for the latter would prefix the filename for each match; you could use
grep -h 'banana' ./*.txt
to suppress filenames.
The problem you could run into is that ./*.txt expands to something that is longer than the maximum command line length allowed; to prevent that, you could do something like
printf '%s\0' ./*.txt | xargs -0 grep -h 'banana'
which is safe for file names containing blanks and shell metacharacters, and calls grep as few times as possible [1].
This can even be parallelized; to run 4 grep processes in parallel, each handling 5 files at a time:
printf '%s\0' ./*.txt | xargs -0 -L 5 -P 4 grep -h 'banana'
What I think you intended to run is this:
for file in ./*.txt; do
cat "$file" | grep "banana"
done
which would call cat/grep once per file.
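Another way around the command-line length limit is to let find batch the file names itself. The sketch below is one possible form; the function name `search_txt` is made up, and unlike the `./*.txt` glob it also descends into subdirectories.

```shell
# search_txt DIR WORD: print every line containing WORD in DIR's .txt files.
search_txt() {
    # find hands the file names to grep in batches sized to fit the
    # argument-length limit, so the shell never expands a huge glob.
    find "$1" -type f -name '*.txt' -exec grep -h -- "$2" {} +
}

# Example call (assumes the text files live in the current directory):
# search_txt . banana
```

As with the xargs version, grep is started only as many times as needed rather than once per file.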
[1] At first I thought that printf would run into trouble with command line length limitations as well, but it seems that as a shell built-in, it's exempt:
$ touch '%s\0' {1000000..10000000} > /dev/null
-bash: /usr/bin/touch: Argument list too long
$ printf '%s\0' {1000000..10000000} > /dev/null
$

How to use lines in a file as keyword for grep?

I've searched lots of questions on here and other sites, and people have suggested things that should fix my problem, but I think there's something wrong with my code that I just don't recognize.
I have 24 .fasta files from NGS sequencing that are 150bp long. There's approximately 1M reads for each file. The reads are from targeted sequencing where we electroplated vectors with cDNA for genes of interest, and a unique barcode sequence. I need to look through the sequencing files for the presence or absence of the barcode sequence which corresponds to a specific gene.
I have a .txt list of the barcodeSequences that I want to pass to grep to look for the barcode in the .fasta file. I've tried so many variations of this command. I can give grep each barcode individually but that's so time consuming, I know it's possible to give it the list of barcode sequences and search each .fasta for each of the barcodes and record how many times each barcode is found in each file.
Here's my code where I give it each barcode individually:
# Barcode 33
mkdir --mode 755 $dir/BC33
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep 'TATTAGAGTTTGAGAATAAGTAGT' > $dir/BC33/"$f"
done
I tried to adapt it so that I don't have to feed every barcode sequence in individually:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep -c -f BarcodeScreenSeq.txt | sort > $dir/Results/"$f"
echo "Finished $f"
done
But it is not searching for the barcode sequences. With this iteration it is just returning new files in the /Results directory that are empty. I also tried a nested loop, where I tried to make the barcode sequence a variable that changed like the $FILES, but that just gave me a new file with the names of my .fasta files:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
for b in `cat /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt`; do
cat "$f" | grep -c "$b" | sort > $dir/"$f"_Barcode
done ;
done
I want an output .txt file that has:
<barcode sequence>: <# of times that bc was found>
for each .fasta file because I want to put all the samples together to make one large excel sheet which shows each barcode and how many times it was found in each sample.
Please help, I've tried everything I can think of.
EDIT
Here is what the BarcodeScreenSeq.txt file would look like. It's just a txt file where each line is a barcode sequence:
head BarcodeScreenSeq.txt
TATTATGAGAAAGTTGAATAGTAG
ATGAAAGTTAGAGTTTATGATAAG
AATAGATAAGATTGATTGTGTTTG
TGTTAAATGTATGTAGTAATTGAG
ATAGATTTAAGTGAAGAGAGTTAT
GAATGTTTGTAAATGTATAGATAG
AAATTGTGAAAGATTGTTTGTGTA
TGTAAGTGAAATAGTGAGTTATTT
GAATTGTATAAAGTATTAGATGTG
AGTGAGATTATGAGTATTGATTTA
EDIT
lozzib@gliaserver:~/AG_Barcode_Seq$ file BarcodeScreenSeq.txt
BarcodeScreenSeq.txt: ASCII text, with CRLF line terminators
Windows Line Endings
Your BarcodeScreenSeq.txt has Windows line endings: each line ends with the two special characters \r\n. Linux tools such as grep only treat the Linux line ending \n as a line terminator, so they interpret your file ...
TATTATG\r\n
ATGAAAG\r\n
...
to look for the patterns TATTATG\r, ATGAAAG\r, ... (note the \r at the end). Because of the \r there is no match.
Either: Convert your file once by running dos2unix BarcodeScreenSeq.txt or sed -i 's/\r//g' BarcodeScreenSeq.txt. This will change your file.
Or: replace every BarcodeScreenSeq.txt in the following scripts by <(tr -d '\r' < BarcodeScreenSeq.txt). This won't change the file, but creates more overhead as the file is converted over and over again.
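If you are not sure whether a pattern file is affected, a check along these lines may help. This is only a sketch; the function names `has_crlf` and `strip_cr` are made up for illustration.

```shell
# has_crlf FILE: succeed if any line in FILE ends with a carriage return.
has_crlf() {
    cr=$(printf '\r')            # a literal CR character
    grep -q "${cr}\$" "$1"
}

# strip_cr FILE: write FILE to standard output with every \r removed.
strip_cr() {
    tr -d '\r' < "$1"
}

# Example: convert only when needed, keeping the original name.
# if has_crlf BarcodeScreenSeq.txt; then
#     strip_cr BarcodeScreenSeq.txt > BarcodeScreenSeq.unix.txt &&
#     mv BarcodeScreenSeq.unix.txt BarcodeScreenSeq.txt
# fi
```

The check avoids rewriting files that already have Unix line endings.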
Command
grep -c has only one counter. If you pass multiple search patterns at once (for instance using -f BarcodeScreenSeq.txt) you still get only one number for all patterns together.
To count the occurrences of each pattern individually you can use the following trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
sort | uniq -c |
awk '{print $2 ": " $1 }' > "Results/$file"
done
grep -o will print each match as a single line.
sort | uniq -c will count how often each line occurs.
awk is only there to change the format from #matches pattern to pattern: #matches.
Benefit: The command should be fairly fast.
Drawback: Patterns from BarcodeScreenSeq.txt that are not found in $file won't be listed at all. Your result will leave out lines of the form pattern: 0.
If you really need the lines of the form pattern: 0 you could use another trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
cat - BarcodeScreenSeq.txt |
sort | uniq -c |
awk '{print $2 ": " ($1 - 1) }' > "Results/$file"
done
cat - BarcodeScreenSeq.txt will insert the content of BarcodeScreenSeq.txt at the end of grep's output such that #matches is one bigger than it should be. The number is corrected by awk.
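To see the off-by-one correction at work, here is the same pipeline wrapped in a function and run on tiny made-up data. The function name `count_barcodes` and the sample sequences are assumptions for this demo; `grep -o` is available in GNU and BSD grep but is not required by POSIX.

```shell
# count_barcodes PATTERN_FILE DATA_FILE: print "pattern: count" for every
# pattern, including those that never occur in DATA_FILE.
count_barcodes() {
    grep -oFf "$1" "$2" |
    cat - "$1" |                       # add each pattern once more
    LC_ALL=C sort | uniq -c |
    awk '{ print $2 ": " ($1 - 1) }'   # subtract the extra occurrence
}

patterns=$(mktemp)
reads=$(mktemp)
printf '%s\n' AAA CCC > "$patterns"
printf '%s\n' GGAAATTAAAGG TTTT > "$reads"
count_barcodes "$patterns" "$reads"
# prints:
# AAA: 2
# CCC: 0
rm -f "$patterns" "$reads"
```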
You can read a text file one line at a time and process each line separately using a redirect, like so:
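for f in *.fasta; do
while read -r seq; do
# print "barcode: count"; note that grep -c counts matching lines
printf '%s: %s\n' "${seq}" "$(grep -c -e "${seq}" "${f}")"
done < /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt > "${dir}"/"${f}"_Barcode
done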

Delete the 3 last line of my txt with bash? [duplicate]

I want to remove some n lines from the end of a file. Can this be done using sed?
For example, to remove lines from 2 to 4, I can use
$ sed '2,4d' file
But I don't know the line numbers. I can delete the last line using
$ sed '$d' file
but I want to know the way to remove n lines from the end. Please let me know how to do that using sed or some other method.
I don't know about sed, but it can be done with head:
head -n -2 myfile.txt
If hardcoding n is an option, you can use sequential calls to sed. For instance, to delete the last three lines, delete the last one line thrice:
sed '$d' file | sed '$d' | sed '$d'
From the sed one-liners:
# delete the last 10 lines of a file
sed -e :a -e '$d;N;2,10ba' -e 'P;D' # method 1
sed -n -e :a -e '1,10!{P;N;D;};N;ba' # method 2
Seems to be what you are looking for.
A funny & simple sed and tac solution :
n=4
tac file.txt | sed "1,$n{d}" | tac
NOTE
double quotes " are needed for the shell to evaluate the $n variable in the sed command. In single quotes, no interpolation is performed.
tac is cat reversed, see man 1 tac
the {} in sed are there to separate $n and d (otherwise, the shell tries to interpolate a non-existent $nd variable)
Use sed, but let the shell do the math, with the goal being to use the d command by giving a range (to remove the last 23 lines):
sed -i "$(($(wc -l < file)-22)),\$d" file
To remove the last 23 lines, from inside out:
$(wc -l < file)
Gives the number of lines of the file: say 2196
We want to remove the last 23 lines, so for the left side of the range:
$((2196-22))
Gives: 2174
Thus the original sed after shell interpretation is:
sed -i '2174,$d' file
With -i doing inplace edit, file is now 2173 lines!
If you want to save the result into a new file instead, drop -i (with -i, sed writes nothing to standard output) and redirect:
sed '2174,$d' file > outputfile
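The arithmetic in this answer can be wrapped in a small function so that the line count is not hard-coded. This is a sketch under the assumption that N is a non-negative integer; the name `drop_last_n` is made up for illustration.

```shell
# drop_last_n FILE N: print FILE without its last N lines.
drop_last_n() {
    total=$(wc -l < "$1")
    first=$((total - $2 + 1))      # first line number to delete
    if [ "$first" -lt 1 ]; then
        first=1                    # N >= line count: delete everything
    fi
    sed "${first},\$d" "$1"
}

# Example: drop_last_n file 23 > outputfile
```

With a 2196-line file and N=23, `first` works out to 2174, matching the command shown above.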
You could use head for this.
Use
$ head --lines=-N file > new_file
where N is the number of lines you want to remove from the file.
The contents of the original file minus the last N lines are now in new_file
Just for completeness I would like to add my solution.
I ended up doing this with the standard ed:
ed -s sometextfile <<< $'-2,$d\nwq'
This deletes the last 2 lines using in-place editing (although it does use a temporary file in /tmp !!)
To truncate very large files truly in-place we have truncate command.
It doesn't know about lines, but tail + wc can convert lines to bytes:
file=bigone.log
lines=3
truncate -s -$(tail -$lines $file | wc -c) $file
There is an obvious race condition if the file is written at the same time.
In this case it may be better to use head - it counts bytes from the beginning of file (mind disk IO), so we will always truncate on line boundary (possibly more lines than expected if file is actively written):
truncate -s $(head -n -$lines $file | wc -c) $file
A handy one-liner for when you fail a login attempt by typing your password in place of the username:
truncate -s $(head -n -5 /var/log/secure | wc -c) /var/log/secure
This might work for you (GNU sed):
sed ':a;$!N;1,4ba;P;$d;D' file
Most of the above answers seem to require GNU commands/extensions:
$ head -n -2 myfile.txt
-2: Badly formed number
For a slightly more portable solution:
perl -ne 'push(@fifo,$_);print shift(@fifo) if @fifo > 10;'
OR
perl -ne 'push(@buf,$_);END{print @buf[0 ... $#buf-10]}'
OR
awk '{buf[NR-1]=$0;}END{ for ( i=0; i < (NR-10); i++){ print buf[i];} }'
Where "10" is "n".
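To avoid hard-coding the 10, the same buffering idea can take n as a parameter. The sketch below uses a ring buffer in awk, so it holds only n lines in memory rather than the whole file; the function name `drop_last_awk` is made up for this example.

```shell
# drop_last_awk N [FILE...]: print the input without its last N lines.
drop_last_awk() {
    n=$1
    shift
    if [ "$n" -le 0 ]; then
        cat -- "$@"          # nothing to drop
        return
    fi
    awk -v n="$n" '
        NR > n { print buf[NR % n] }   # this slot holds the line from n lines ago
        { buf[NR % n] = $0 }           # overwrite it with the current line
    ' "$@"
}

# Example: drop_last_awk 10 myfile.txt
```

A line is printed only once n newer lines have been read, so the final n lines are never emitted.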
With the answers here you'd have already learnt that sed is not the best tool for this application.
However, I do think there is a way to do this using sed; the idea is to append N lines to the hold space until you are able to read without hitting EOF. When EOF is hit, print the contents of the hold space and quit.
sed -e '$!{N;N;N;N;N;N;H;}' -e x
The sed command above will omit last 5 lines.
It can be done in 3 steps:
a) Count the number of lines in the file you want to edit:
n=`cat myfile |wc -l`
b) Subtract from that number the number of lines to delete:
x=$((n-3))
c) Tell sed to delete from that line number ($x) to the end:
sed "$x,\$d" myfile
You can get the total count of lines with wc -l <file> and use
head -n <total lines - lines to remove> <file>
Try the following command:
n = line number
tail -r file_name | sed '1,nd' | tail -r
This will remove the last 3 lines from file:
for i in $(seq 1 3); do sed -i '$d' file; done;
I prefer this solution;
head -$(gcalctool -s $(cat file | wc -l)-N) file
where N is the number of lines to remove.
sed -n ':pre
1,4 {N;b pre
}
:cycle
$!{P;N;D;b cycle
}' YourFile
posix version
To delete last 4 lines:
$ nl -b a file | sort -k1,1nr | sed '1, 4 d' | sort -k1,1n | sed 's/^ *[0-9]*\t//'
I came up with this, where n is the number of lines you want to delete:
count=`wc -l < file`
lines=`expr "$count" - n`
head -n "$lines" file > temp.txt
mv temp.txt file
rm -f temp.txt
It's a little roundabout, but I think it's easy to follow.
Count up the number of lines in the main file
Subtract the number of lines you want to remove from the count
Print out the number of lines you want to keep and store in a temp file
Replace the main file with the temp file
Remove the temp file
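The five steps above could be sketched as a function like this one. The name `trim_tail` is made up, N is assumed to be a non-negative integer, and the result is copied back with cat (rather than mv) so that the original file keeps its permissions.

```shell
# trim_tail FILE N: remove the last N lines of FILE, rewriting it in place.
trim_tail() {
    count=$(wc -l < "$1")                 # 1. count the lines
    keep=$((count - $2))                  # 2. lines that should remain
    if [ "$keep" -le 0 ]; then
        : > "$1"                          # nothing left to keep
        return
    fi
    tmp=$(mktemp) || return 1
    head -n "$keep" "$1" > "$tmp" &&      # 3. store the lines to keep
    cat "$tmp" > "$1"                     # 4. replace the original contents
    rm -f "$tmp"                          # 5. remove the temp file
}

# Example: trim_tail file 3
```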
For deleting the last N lines of a file, you can use the same concept of
$ sed '2,4d' file
You can use a combo with the tail command to reverse the file. If N is 5:
$ tail -r file | sed '1,5d' | tail -r > newfile
This also works where the head -n -5 file command doesn't (like on a Mac!).
#!/bin/sh
echo 'Enter the file name : '
read filename
echo 'Enter the number of lines from the end that needs to be deleted :'
read n
#Subtracting from the line number to get the nth line
m=`expr $n - 1`
# Calculate length of the file
len=`cat $filename|wc -l`
#Calculate the lines that must remain
lennew=`expr $len - $m`
sed "$lennew,$ d" $filename
A solution similar to https://stackoverflow.com/a/24298204/1221137 but with editing in place and not hardcoded number of lines:
n=4
seq $n | xargs -i sed -i -e '$d' my_file
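In docker, this worked for me. Note that the output has to go to a different file first, because redirecting straight onto file_path truncates it before head reads it:
head --lines=-N file_path > file_path.tmp && mv file_path.tmp file_path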
Say you have several lines:
$ cat <<EOF > 20lines.txt
> 1
> 2
> 3
[snip]
> 18
> 19
> 20
> EOF
Then you can grab:
# leave last 15 out
$ head -n5 20lines.txt
1
2
3
4
5
# skip first 14
$ tail -n +15 20lines.txt
15
16
17
18
19
20
POSIX compliant solution using ex / vi, in the vein of @Michel's solution above.
@Michel's ed example uses "not-POSIX" Here-Strings.
Increment the $-1 to remove n lines to the EOF ($), or just feed the lines you want to (d)elete. You could use ex to count line numbers or do any other Unix stuff.
Given the file:
cat > sometextfile <<EOF
one
two
three
four
five
EOF
Executing:
ex -s sometextfile <<'EOF'
$-1,$d
%p
wq!
EOF
Returns:
one
two
three
This uses POSIX Here-Docs so it is really easy to modify - especially using set -o vi with a POSIX /bin/sh.
While on the subject, the "ex personality" of "vim" should be fine, but YMMV.
This will remove the last 10 lines
sed -n -e :a -e '1,10!{P;N;D;};N;ba'

read values of txt file from bash [duplicate]

I'm trying to read values from a text file.
I have test1.txt which looks like
sub1 1 2 3
sub8 4 5 6
I want to obtain values '1 2 3' when I specify 'sub1'.
The closest I get is:
subj="sub1"
grep "$subj" test1.txt
But the answer is:
sub8 4 5 6
I've read that grep gives you the next line to the match, so I've tried to change the text file to the following:
test2.txt looks like:
sub1
1 2 3
sub8
4 5 6
However, when I type
grep "$subj" test2.txt
The answer is:
sub1
It should be something super simple, but I've tried awk, sed, grep, egrep, cat and none of them is working... I've also read some related posts, but none was really helpful.
Awk works: awk '$1 == "'"$subj"'" { print $2, $3, $4 }' test1.txt
The command outputs fields two, three, and four for all lines in test1.txt where the first field is $subj (i.e.: the contents of the variable named subj).
With your original text file format:
target=sub1
while IFS=$' \t\n' read -r key values; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
This requires no external tools, using only functionality built into bash itself. See BashFAQ #1.
As has come up during debugging in comments, if you have a traditional Apple-format text file (CR newlines only), then you might want something more like:
target=sub1
while IFS=$' \t\n' read -r -d $'\r' key values || [[ $key ]]; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
Alternately, using awk (for a standard UNIX text file):
target="sub1"
awk -v target="$target" '$1 == target { $1 = ""; print; }' <test1.txt
...or, for a file with CR-only newlines:
target="sub1"
tr '\r' '\n' <test1.txt | awk -v target="$target" '$1 == target { $1 = ""; print; }'
This version will be slower if the text file being read is small (since awk, like any other external tool, takes time to start up); but faster if it's large (since awk's operation is much faster than that of bash's built-ins once it's done starting up).
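One detail of the awk versions above: setting $1 = "" leaves the field separator in place, so the printed values start with a space. If that matters, a variant like the following sketch strips the key and the blanks after it instead; the function name `lookup` is made up for illustration.

```shell
# lookup KEY FILE: print the values that follow KEY in FILE.
lookup() {
    awk -v target="$1" '
        $1 == target {
            sub(/^[ \t]*[^ \t]+[ \t]*/, "")   # drop the key and the blanks around it
            print
        }
    ' "$2"
}

# Example: lookup sub1 test1.txt
```

Because the comparison is on the whole first field, `sub1` does not match a line that starts with `sub12`.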
grep "sub1" test1.txt | cut -c6-
or
grep -A 1 "sub1" test2.txt | tail -n 1
You're doing it right, but it seems like test1.txt has a wrong value in it.
With grep foo you get all lines with foo in them. Use grep -m1 foo to find only the first line with foo in it.
Then you can use cut -d" " -f2- to get all the values after foo, separated by single spaces.
In the end the command would look like this:
$ subj="sub1"
$ grep -m1 "$subj" test1.txt | cut -d" " -f2-
But this doesn't explain why you could not find sub1 in the first place.
Did you read the proper file?
There's a bunch of ways to do this (and shorter/more efficient answers than what I'm giving you), but I'm assuming you're a beginner at bash, and therefore I'll give you something that's easy to understand:
egrep "^$subj\>" file.txt | sed "s/^\S*\>\s*//"
or
egrep "^$subj\>" file.txt | sed "s/^[^[:blank:]]*\>[[:blank:]]*//"
The first part, egrep, will search for your subject at the beginning of the line in file.txt (that's what the ^ symbol does in the grep string). It also looks for a whole word (the \> matches an end-of-word boundary, so that sub1 doesn't match sub12 in the file). Notice you have to use egrep to get the \>, as grep by default doesn't recognize that escape sequence. Once it has found the lines, egrep passes its output to sed, which strips the first word and the whitespace after it off of each line. Again, the ^ symbol in the sed command specifies that it should only match at the beginning of the line. The \S* tells it to read as many non-whitespace characters as it can. Then the \s* tells sed to gobble up as much whitespace as it can. sed then replaces everything it matched with nothing, leaving the other stuff behind.
BTW, there's a help page in Stack overflow that tells you how to format your questions (I'm guessing that was the reason you got a downvote).
-------------- EDIT ---------
As pointed out, if you are on a Mac or something like that, you have to use [^[:blank:]] instead of \S and [[:blank:]] instead of \s in your sed expression, as in the second command above (these are portable to all platforms).
awk '/sub1/{ print $2,$3,$4 }' file
1 2 3
What happens? For every line matching the regexp /sub1/, fields two to four are printed.
Any drawbacks? The original spacing is not preserved: the fields are printed separated by single spaces.
Sed also works: sed -n -e 's/^'"$subj"' *//p' file1.txt
It outputs all lines matching $subj at the beginning of a line after having removed the matching word and the spaces following. If TABs are used the spaces should be replaced by something like [[:space:]].

Resources