I have been working on this all day and kind of got it to run, but I may still need some help to polish the command.
Situation: I am using bedtools, which takes two tab-delimited files containing genomic intervals (one per line) with some additional data (by column). More precisely, I am running the window function, which generates an output that contains, for each interval in file "a", all the intervals in file "b" that fall into the window I have defined with the parameters -l and -r. A more precise explanation can be found here.
An example of the function, taken from their documentation:
$ cat A.bed
chr1 1000 2000
$ cat B.bed
chr1 500 800
chr1 10000 20000
$ bedtools window -a A.bed -b B.bed -l 200 -r 20000
chr1 1000 2000 chr1 10000 20000
$ bedtools window -a A.bed -b B.bed -l 300 -r 20000
chr1 1000 2000 chr1 500 800
chr1 1000 2000 chr1 10000 20000
Question: The thing is that I want to use that stdout to do a number of things in one shot.
Count the number of lines in the original stdout. For that I use wc -l.
Then:
cut columns 4-6: cut -f 4-6
sort lines and keep only those not repeated: sort | uniq -u
save to a file: tee file.bed
count the number of lines of the new stdout: again wc -l
So I have managed to get it to work, more or less, with this:
windowBed -a ARS_saccer3.bed -b ./Peaks/WTappeaks_-Mit_sorted.bed -r 0 -l 10000 | tee >(wc -l) >(cut -f 7-13 | sort | uniq -u | tee ./Window/windowBed_UP10.bed | wc -l)
This kind of works: I get the output file correctly, and the values show on screen, but... like this
juan#juan-VirtualBox:~/Desktop/sf_Biolinux_sf/IGV/Collisions$ 448
543
The first number is from the second wc -l; I don't understand why it shows first. Also, after the second number, the cursor remains waiting for input instead of a new command prompt appearing, so I assume something in the command line as it is right now remains unfinished.
This is probably something very basic, but I will be very grateful to anyone who cares to explain the programming side a little more to me.
For anyone willing to offer solutions, bear in mind that I would like to keep this pipe on one line, without needing to run an additional sh script or anything else.
Thanks
When you create a "forked pipeline" like this, bash has to run the two halves of the fork concurrently, otherwise where would it buffer the stdout for the other half of the fork? So it is essentially like running both subshells in the background, which explains why you get the results in an order you did not expect (due to the concurrency) and why the output is dumped unceremoniously on top of your command prompt.
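To see the effect in isolation, here is a toy fork of my own (nothing to do with bedtools): both process substitutions run in parallel, so the order of the two counts is not guaranteed, and either one may print after the next prompt has already appeared.
# both substitutions read the same copy of stdin concurrently;
# the line count and the byte count come back in whatever order they finish
seq 3 | tee >(wc -l) >(wc -c) > /dev/null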
You can avoid both of these problems by writing the two outputs to separate temporary files, waiting for everything to finish, and then concatenating the temporary files in the order you expect, like this:
windowBed -a ARS_saccer3.bed -b ./Peaks/WTappeaks_-Mit_sorted.bed -r 0 -l 10000 | tee >(wc -l >tmp1) >(cut -f 7-13 | sort | uniq -u | tee ./Window/windowBed_UP10.bed | wc -l >tmp2)
wait
cat tmp1 tmp2
rm tmp1 tmp2
Related
I am trying to count how many files have words with the pattern [Gg]reen.
#!/bin/bash
for File in `ls ./`
do
cat ./$File | egrep '[Gg]reen' | sed -n '$='
done
When I do this I get this output:
1
1
3
1
1
So I want to count the lines to get 5 in total. I tried using wc -l after the sed but it didn't work; it counted the lines in all the files. I tried to use >file.txt but it didn't write anything to it. And when I use >> instead it writes, but when I execute the script again it appends the lines again.
Since, according to your question, you want to know how many files contain a pattern, you are interested in the number of files, not the number of pattern occurrences.
For instance,
grep -l '[Gg]reen' * | wc -l
would print the number of files that contain green or Green somewhere as a substring.
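As a side note, if you also want the per-file counts that your loop was printing, grep -c does that in one go (a count of matching lines per file, with the file name prefixed when more than one file is given):
# one 'file:count' line per file, counting lines that contain green or Green
grep -c '[Gg]reen' *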
Given two files (each of which may contain duplicates) in the following format:
file1 (a file that contains only numbers), for example:
10
40
20
10
10
file2 (a file that contains only numbers), for example:
30
40
10
30
0
How can I print the contents of the files so that the duplicates within each file are removed?
For example, the output for the two files above needs to be:
10
40
20
30
40
10
0
Note: the output may still contain duplicates (at most, a number will appear two times), but from any single file we take the content without duplicates!
How can I do it with sort, uniq, and cat, using only one command?
Namely, something like this: cat file1 file2 | sort | uniq (but, of course, this command is no good; it does not solve the problem, it is only to explain what I mean when I say "using only one command").
I will be happy to hear your ideas on how to do it :)
If I understood the question correctly, this awk should do it while preserving the order:
awk 'FNR==1{delete a}!a[$0]++' file1 file2
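With the sample file1 and file2 above, this prints exactly the expected output: duplicates are removed within each file while the overall order is preserved. (delete a without a subscript clears the whole array at the start of each file; it is a widely supported awk extension, and split("", a) does the same in stricter awks.)
$ awk 'FNR==1{delete a}!a[$0]++' file1 file2
10
40
20
30
40
10
0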
If you don't need to preserve the order, it can be as simple as:
sort -u file1; sort -u file2
If you don't want to use a list (;), something like this is also an option:
cat <(sort -u file1) <(sort -u file2)
I currently have a list of terms in words.txt, with each term on one line, and I want to count how many total occurrences of all those terms exist in the first 500 lines of multiple CSV files in the same directory.
I currently have something like this:
grep -Ff words.txt /some/directory |wc -l
How exactly can I get the program to display, for each file, the count for just the first 500 lines of that file? Do I have to create new files with the 500 lines? How can I do that for a large number of original files? I'm very new to coding and working on a dataset for research, so any help is much appreciated!
Edit: I want it to display something like this but for each file:
grep -Ff words.txt list1.csv |wc -l
/Users/USER/Desktop/FILE/list1.csv:28
This works for me.
head /some/directory/* -n 100 | grep -Ff words.txt | wc -l
Sample Result: 38
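If you need the per-file display shown in the edit, here is a sketch of my own (not the answer above); /some/directory and the 500-line limit are taken from the question, so adjust them to your data:
# print one 'file:count' line per CSV, counting matching lines in the
# first 500 lines only; grep -c gives the same number as grep | wc -l
for f in /some/directory/*.csv; do
    n=$(head -n 500 "$f" | grep -cFf words.txt)
    printf '%s:%s\n' "$f" "$n"
done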
The following command should print the first and last line from seq 100, but it only prints the first line:
seq 100 | (head -n1 ; tail -n1)
1
It does work for larger sequences, such as 10,000:
seq 10000 | (head -n1 ; tail -n1)
1
10000
UPDATE
I've selected #John1024's answer because my question was why this doesn't work, and he provides an acceptable answer.
Also, the "should" is apparently my opinion only; the reality is that head doesn't work this way: it may very well consume more stdin than I'd like and leave nothing for tail.
Of course, the problem that prompted this question in the first place was trying to read the first and last n lines of a file. Here's the solution I came up with, based on GNU sed:
sed -ne'1,9{p;b}' -e'10{x;s/$/--/;x;G;p;b}' -e':a;$p;N;21,$D;ba'
or more compact
sed -ne'1,9{p;b};10{x;s/$/--/;x;G;p;b};:a;$p;N;21,$D;ba'
Example output:
*Note: On my Mac, with MacPorts, GNU sed is invoked as gsed. Apple's built-in sed is finicky about semicolon-separated expressions and requires multiple -e arguments. This should work with Apple's sed: sed -ne'1,9{' -e'p;b' -e'}' -e'10{' -e'x;s/$/--/;x;G;p;b' -e'}' -e':a' -e'$p;N;21,$D;ba'*
seq 100 | gsed -ne'1,9{p;b}' -e'10{x;s/$/--/;x;G;p;b}' -e':a;$p;N;21,$D;ba'
1
2
3
4
5
6
7
8
9
10
--
91
92
93
94
95
96
97
98
99
100
Explanation
gsed -n invoke sed without automatically printing the pattern space
-e'1,9{p;b}' print the first 9 lines
-e'10{x;s/$/--/;x;G;p;b}' print line 10 with an appended '--' separator
-e':a;$p;N;21,$D;ba' print the last 10 lines
I see the same behavior with GNU head and tail on Linux.
It depends on how much input head -n1 consumes before it quits. If head reads all of stdin before it quits, then there is nothing left for tail to read and tail produces no output.
Observe:
$ seq 10000 | (head -n1 ; cat ) | head
1
1861
1862
1863
1864
1865
1866
1867
1868
Here, we can see that head -n1 consumes the first 1860 lines. The cat command sees all the remaining input.
Why is that? Observe how many bytes are in the first 1860 lines:
$ seq 1860 | wc
1860 1860 8193
It's a reasonable guess that head -n1 first reads 8kB of data from stdin, then prints the first line, and, seeing that it needs no more data, it quits. The rest of stdin is available for any subsequent process.
So, with seq 100, which produces less than 8kB of output in total, head reads all of stdin and leaves nothing for tail to read. With seq 10000, which produces more than 8kB, head will not read all the data in the pipeline. The data that it leaves will be available for tail.
As Charles Duffy points out, the details of this behavior are entirely implementation dependent and, upon any software upgrade, it may change.
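As an aside (my own note, not part of the answer): a single reader sidesteps the buffering question entirely, because only one process consumes stdin. For example, with awk:
$ seq 100 | awk 'NR==1{print} {last=$0} END{print last}'
1
100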
I want to write a shell script which will find occurrences of multiple strings like "Errors|Notice|Warnings" in a given log file, such as /var/log/messages. If any string matches, it should send a mail notification to a specified mail ID.
I can use:
grep -i -E '^Errors|Notice|Warnings' /var/log/messages
But my main problem is that the log file keeps growing, and if I want to add this script to cron, how can I record which lines or contents I had already checked on the last execution of my script?
For example, if the log file is 100 lines and I have read the file using cat or anything similar, and before the second execution the file grows to 300 lines, then I now want to read from line 101 to line 300.
Can anyone please suggest how I can record this?
You can use the following script to do that:
start=0
# resume from the line after the last one recorded, if any
[[ -f last-processed ]] && start=$(<last-processed)
start=$((start+1))
# scan only the new lines; record the new line count only if grep matched
tail -n +"$start" /var/log/messages | grep -i -E 'Errors|Notice|Warnings' && \
wc -l /var/log/messages | awk '{print $1}' > last-processed
By the way, you have a problem in your regex: the ^ anchors only the Errors alternative, so it should be 'Errors|Notice|Warnings' instead of '^Errors|Notice|Warnings'.
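To cover the mail part of the question, here is a rough sketch; the script path, the mail command, and the address are assumptions, so adapt them to your mailer:
#!/bin/bash
# hypothetical wrapper: /path/to/checklog.sh is the snippet above saved as a script
matches=$(/path/to/checklog.sh)
# send mail only when something actually matched
if [[ -n "$matches" ]]; then
    printf '%s\n' "$matches" | mail -s "log alert" admin@example.com
fi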
Rotating your log file could be the best solution.
But if you want to extract the lines from line_first to line_last of a file, you can use sed:
For example, to get lines 100 to 110 from an input stream:
$> line_first=100; line_last=110
$> seq 1 1000 | sed -n "${line_first},${line_last}p"
100
101
102
103
104
105
106
107
108
109
110
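For completeness, the same range idea can be combined with the bookkeeping from the other answer (a sketch assuming the same last-processed file and log path):
# remember how far we read last time, then scan only the new lines
[[ -f last-processed ]] || echo 0 > last-processed
line_first=$(( $(<last-processed) + 1 ))
sed -n "${line_first},\$p" /var/log/messages | grep -i -E 'Errors|Notice|Warnings'
wc -l /var/log/messages | awk '{print $1}' > last-processed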