I have a list of URLs, and would like to identify what is a directory and what is not:
https://www.example.com/folder/
https://www.example.com/folder9/
https://www.example.com/folder/file.sh
https://www.example.com/folder/text
I can use grep -e /$ to find which is which, but I'd like to do an inline command where I can redirect the output based on that logic.
I understand that awk may have the answer here, but don't have enough experience in awk to do this.
Something like:
cat urls | if /$ matches write to folders.txt else write to files.txt
I could drop it all to a file then read it twice but when it gets to thousands of lines I feel that would be inefficient.
Yes, awk is a great choice for this:
awk '/\/$/ { print > "folders.txt"; next }
{ print > "files.txt" }' urls.txt
/\/$/ { print > "folders.txt"; next } if the line ends with a /, write it to folders.txt and skip to the next line
{ print > "files.txt" } write all other lines to files.txt
You may want to use the expression /\/[[:space:]]*$/ instead of /\/$/ in case you have trailing spaces in your file.
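A quick way to check the whitespace-tolerant pattern is to run it on a small sample. This is only a sketch: the scratch directory and the sample URLs (one of which deliberately ends with a space) are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched (illustrative setup).
cd "$(mktemp -d)" || exit 1

# Sample input; the second URL has a trailing space on purpose.
printf '%s\n' \
  'https://www.example.com/folder/' \
  'https://www.example.com/folder9/ ' \
  'https://www.example.com/folder/file.sh' \
  'https://www.example.com/folder/text' > urls.txt

# Lines ending in "/" (optionally followed by whitespace) go to folders.txt;
# every other line goes to files.txt.
awk '/\/[[:space:]]*$/ { print > "folders.txt"; next }
     { print > "files.txt" }' urls.txt

cat folders.txt
cat files.txt
```

With the plain /\/$/ pattern, the URL with the trailing space would land in files.txt instead.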
All you need is:
awk '{print > ((/\/$/ ? "folders" : "files")".txt")}' urls.txt
With coreutils, grep and bash process substitution:
<urls tee >(grep '/$' > folders.txt) >(grep -v '/$' > files.txt) > /dev/null
I'm currently reading in a file of three letter strings in unix and was wondering how I would go about making the lines variables so that I can grep them in the code...
My idea goes something like this:
#!/bin/bash
IFS=''
while read line
do
code=$(line)
#This would be where I want to assign the line a variable
grep "$code" final.txt > deptandcourse.txt
#This is where I would want to grep according to that three letter string
done < strings.txt
Sample file (strings.txt):
ABC
BCA
BDC
I would like to put these letters in the variable line and then grep the file (final.txt) first for 'ABC', then 'BCA', then 'BDC'
line is a variable you've set to contain the contents of each line of the file you're reading from throughout the loop, so you don't need to reassign it to another variable. See this page for more information on using read in a loop.
Also, it looks like you might want to append to deptandcourse.txt with >> as using the > redirect will overwrite the file each time.
Maybe this is what you want:
while read -r line
do
grep "$line" final.txt >> deptandcourse.txt
done < strings.txt
As @JohnZwinck suggested in his comment:
grep -f strings.txt final.txt > deptandcourse.txt
which seems to be the best solution.
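Since the strings here are literal three-letter codes rather than regular expressions, it may be worth adding -F so grep treats them as fixed strings. A minimal sketch, assuming made-up contents for final.txt:

```shell
# Scratch directory and sample data are illustrative assumptions.
cd "$(mktemp -d)" || exit 1
printf '%s\n' ABC BCA BDC > strings.txt
printf '%s\n' 'ABC 101 Intro' 'XYZ 200 Other' 'BDC 310 Advanced' > final.txt

# -f: read the patterns from strings.txt, one per line.
# -F: treat each pattern as a literal string, not a regular expression.
grep -F -f strings.txt final.txt > deptandcourse.txt

cat deptandcourse.txt
```

Note that the output keeps the line order of final.txt, not the order of strings.txt, which differs from the loop version.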
You could also use awk to accomplish the same thing:
awk 'FNR==NR {
a[$0]
next
}
{
for(i in a)
if($0 ~ i)
print
}' strings.txt final.txt > deptandcourse.txt
I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The other two lines are removed because "cat" is found on both of them and, in each case, lies between { and }.
The following script successfully does this task:
while read -r line
do
sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading a file, but I cannot find any manuals explaining how to use this.
How can I improve the speed of the script?
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It converts file.txt directly to have the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
b[j++] = $0
}
END {
printf "" > FILENAME
for (i = 0; i in b; ++i)
print b[i] > FILENAME
}
' words.txt file.txt
If the files may grow too large for awk to hold in memory, the filtered lines can be written to stdout instead of being stored and written back. In that case the file is not modified in place:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
}
1
' words.txt file.txt
You can use grep to match two files like this:
grep -vf words.txt file.txt
I think that using the grep command should be much faster. For example:
grep -f words.txt -v file.txt
The -f option makes grep use the lines of words.txt as matching patterns.
The -v option inverts the match, i.e. it keeps the lines that do not match any of the patterns.
It doesn't solve the {} constraint, but that is easily handled, for example by adding the braces to the patterns in the file (or in a temporary file created at runtime).
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This basically just modifies the words.txt file on the fly and uses it as a word file for grep.
In pure native bash (4.x):
#!/usr/bin/env bash
# ^-- MUST start with a bash shebang (bash 4.x or newer for readarray), NOT /bin/sh
readarray -t words <words.txt # read words into array
IFS='|' # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]" # form a regex matching all words
while read -r; do # for each line in file...
if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
printf '%s\n' "$REPLY" # ...and print it if not.
fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to use stdin
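One thing to watch with the {.* and .*} wrapping is that .* can span two separate brace pairs, so a word sitting between two {...} groups still counts as a match. A possible variation, borrowing the [^{}]* idea from the awk answer above, keeps the match inside a single pair. This is a sketch only; the fourth sample line and the scratch directory are invented to show the difference.

```shell
# Scratch directory and sample data are illustrative assumptions.
cd "$(mktemp -d)" || exit 1
printf '%s\n' cat mice > words.txt
printf '%s\n' \
  'Once upon a time there was a cat.' \
  '{The cat} lived in the forest.' \
  'The {cat really liked to} eat mice.' \
  '{A dog} chased the cat {away}.' > file.txt

# Build one pattern per word that only matches inside a single {...} pair,
# then keep the lines that match none of them.
awk '{ print "{[^{}]*" $0 "[^{}]*}" }' words.txt |
  grep -v -f - file.txt > filtered.txt

cat filtered.txt
```

Here the fourth line survives, whereas the {.*cat.*} form would delete it because the text between the first { and the last } contains "cat".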
Update:
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
p=1
for (i in a) {
if ($0 ~ "{.*" i ".*}") { p=0; break }
}
}p' words.txt file.txt
The first block builds an array containing each line in words.txt. The second block runs for every line in file.txt. A flag p controls whether the line is printed. If the line matches the pattern, p is set to false. When the p outside the last block evaluates to true, the default action occurs, which is to print the line.
Is there a unix one liner to do this?
head -n 3 test.txt > out_dir/test.head.txt
grep hello test.txt > out_dir/test.tmp.txt
cat out_dir/test.head.txt out_dir/test.tmp.txt > out_dir/test.hello.txt
rm out_dir/test.head.txt out_dir/test.tmp.txt
I.e., I want to get the header and some grep lines from a given file, simultaneously.
Use awk:
awk 'NR<=3 || /hello/' test.txt > out_dir/test.hello.txt
You can say:
{ head -n 3 test.txt ; grep hello test.txt ; } > out_dir/test.hello.txt
Try using sed
sed -n '1,3p; /hello/p' test.txt > out_dir/test.hello.txt
The awk solution is the best, but I'll add a sed solution for completeness:
$ sed -n -e '1,3p' -e '4,$s/hello/hello/p' test.txt > "$output_file"
The -n says not to print a line unless told to. Each -e introduces a command: 1,3p prints the first three lines, and 4,$s/hello/hello/p looks at lines 4 through the end for the word hello and substitutes hello back in. The p on the end prints every line the substitution operated on.
There should be a way of using 4,$g/HELLO/p, but I couldn't get it to work. It's been a long time since I really messed with sed.
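The g/.../p form is ed/ex syntax rather than sed. In sed, a similar effect can probably be had by giving the address range a command block containing a pattern match. A small sketch with invented sample data (the output name test.hello.txt is also just an example):

```shell
# Scratch directory and sample data are illustrative assumptions.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'id,name' 'hello header' 'units' \
  'hello world' 'goodbye' 'say hello' > test.txt

# 1,3p            print the three header lines
# 4,${/hello/p;}  from line 4 to the end, print only lines containing "hello"
sed -n -e '1,3p' -e '4,${/hello/p;}' test.txt > test.hello.txt

cat test.hello.txt
```

Because the pattern is restricted to lines 4 onward, a header line that happens to contain "hello" is printed only once.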
Of course, I would go awk but here is an ed solution for the pre-vi nostalgics:
ed test.txt <<%
4,$ v/hello/d
w test.hello.txt
%
Ok, I have this line that outputs data to a text file. The only issue is I need the lines to be unique. So, if it is going to add a line that already exists how can I prevent that? This is my script:
tcpdump -lvi any "udp port 53" 2>/dev/null|grep -E 'A\?'|awk '{print $(NF-1)}' >> /tmp/domains
Do I pipe it to awk and somehow delete duplicates? Do I have another script run every minute that removes duplicates?
Here is the output of loading up Amazon.com:
amazon.com.
amazon.com.
www.amazon.com.
www.amazon.com.
amazon.com.
www.amazon.com.
a0e90b2a1701074fb52d450dc80084cb1.labs.cloudfront.net.
a0e90b2a1701074fb52d450dc80084cb1.labs.cloudfront.net.
ad.doubleclick.net.
ad.doubleclick.net.
ecx.images-amazon.com.
...more
And in looking at my output it looks like I need to figure out why there is a trailing dot.
You never need grep AND awk since awk can do anything grep can do so if you're using awk, just use awk:
tcpdump -lvi any "udp port 53" 2>/dev/null|
awk '/A\?/{ key=$(NF-1); if (!seen[key]++) print key }' > /tmp/domains
If you ever need to stop this script and restart it but only append new domains to the output file, you just need to read the output file first to populate the "seen" array, e.g.:
tcpdump -lvi any "udp port 53" 2>/dev/null|
awk -v outfile="/tmp/domains" '
BEGIN{
while ( (getline key < outfile) > 0 )
seen[key]++
close(outfile)
}
/A\?/{ key=$(NF-1); if (!seen[key]++) print key >> outfile }
'
This will print out only unseen input lines as they come in, rather than at the end like some other duplicate removing awk scripts posted.
awk '{host=$(NF-1)} !(host in list) {print host; list[host]++}'
If you only want to run the whole thing periodically and update the list, it may be easier to do something like
tcpdump and extract hostnames | sort -u /tmp/domains - > /tmp/domains.new
mv /tmp/domains.new /tmp/domains
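To make the periodic merge concrete, here is a minimal sketch. The function new_hostnames is a hypothetical stand-in for "tcpdump and extract hostnames", and the file lives in a scratch directory rather than /tmp/domains.

```shell
# Scratch directory and sample data are illustrative assumptions.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'amazon.com.' 'www.amazon.com.' > domains

# Hypothetical stand-in for the real capture pipeline.
new_hostnames() {
  printf '%s\n' 'www.amazon.com.' 'ad.doubleclick.net.' 'amazon.com.'
}

# Merge the existing list with the new names ("-" means stdin), dropping
# duplicates, and replace the old list only if sort succeeded.
new_hostnames | sort -u domains - > domains.new && mv domains.new domains

cat domains
```

Writing to a separate file first matters: redirecting straight into domains would truncate it before sort has read it.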
Change this
tcpdump -lvi any "udp port 53" 2>/dev/null|grep -E 'A\?'|awk '{print $(NF-1)}'
To:
tcpdump -lvi any "udp port 53" 2>/dev/null|grep -E 'A\?'|awk '{a[$(NF-1)]++}END{for(i in a)print i}'
Hrmm, do you need a list of domains (unique)? Or do you need the whole line?
You could try using the whole line as a key in the awk array, but the timestamps will be different, and packetsizes, etc.
gawk 'BEGIN{count=0} {arr[$0]=$(NF-1); if (length(arr) > count) { count++; print $0 }}' domain
though likely more useful to you is lines for each domain...
gawk '{ domain = $(NF-1); arr[ domain ] = $0 ;}
END { for (entry in arr) print "domain:",entry, arr[entry]} '
Some output would have been useful to see.
OK, I see the output now.
The trailing dot is expected: fully qualified domain names end in a dot. Good luck!
ps. use this one
cmd | gawk 'BEGIN{ count = 0 } {
arr[ $0 ] = $(NF-1);
if (length(arr) > count) {
count++;
print $0
}
}'
as it continuously adds new domains to the output. Better to not lookup domains and use ips instead...
replace $(NF-1) with |& host -t A domain or so
see Advanced Features :: Two-Way pipelines in the gawk info pages 'info gawk'
For it to be useful you need to insert the new domains into a sorted list. While I don't suggest using ncurses for this, piping the output to a java program that shows the data in a single, sorted table would be not too hard...
Unless you plan on running this for a long time or have a very busy site, you could ensure uniqueness by saving previous lookups to an awk hash. This works here:
tcpdump -lvi any "udp port 53" 2> /dev/null | grep -E 'A\?' | awk '!h[$(NF-1)]++ { print $(NF-1) }' > /tmp/domains
Otherwise, you need to save chunks of the tcpdump/grep output to a temporary file and merge it with /tmp/domains. The best way I know is to keep the output sorted individually and then do a unique merge-sort with sort -mu. This works here:
lim=10000
tmpfile=$(mktemp /tmp/unique.domain.XXXXXX)
unique_domains=/tmp/domains
tcpdump -lvi any "udp port 53" 2> /dev/null | grep -E 'A\?' | while read line; do
awk -v lim=$lim '!h[$(NF-1)]++ { print $(NF-1); ndomain++ }; ndomain > lim { exit }' | sort > $tmpfile
sort -mu $tmpfile $unique_domains 2> /dev/null > $unique_domains.tmp
mv $unique_domains.tmp $unique_domains
done
If you want to access /tmp/domains while this is running you need to add some file locking, for example with lockfile:
lim=10000
lock=/tmp/domains.lock
tmpfile=$(mktemp /tmp/unique.domain.XXXXXX)
unique_domains=/tmp/domains
tcpdump -lvi any "udp port 53" 2> /dev/null | grep -E 'A\?' | while read line; do
awk -v lim=$lim '!h[$(NF-1)]++ { print $(NF-1); ndomain++ }; ndomain > lim { exit }' | sort > $tmpfile
lockfile $lock
sort -mu $tmpfile $unique_domains 2> /dev/null > $unique_domains.tmp
mv $unique_domains.tmp $unique_domains
rm -f $lock
done
Now to get a snapshot of /tmp/domains you would do something like this:
lockfile /tmp/domains.lock
cp /tmp/domains unique_domains
sync
rm -f /tmp/domains.lock
Answer:
Here is a solution using a pipe to the bash function
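checkDuplicates() {
touch -- "$1" # Where $1 is a file that holds the data. It could be the same file that you write to or any other one.
while read -r nextCheck; do
# -x matches whole lines and -F matches literally, so "amazon.com." is not mistaken for "www.amazon.com."
grep -qxF -- "$nextCheck" "$1" || printf "%s\n" "$nextCheck"
done
}
myFile='/tmp/domains'
# Append with >>; a plain > would empty the file before the first check.
YOURANYCOMMAND | checkDuplicates "$myFile" >> "$myFile"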
Bonus trick:
This could be useful for the case when you want to see a difference between two files. For example:
fileA:
what
is
this
fileB:
what
I
is
dont
this
even
Then this code
cat 'fileB' | checkDuplicates 'fileA'
Is going to output
I
dont
even
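For the file-difference case on its own, a single grep call should give the same result without a loop. A sketch using the two sample files above (only_in_B.txt is an illustrative name):

```shell
# Scratch directory is an illustrative assumption; file contents follow the example.
cd "$(mktemp -d)" || exit 1
printf '%s\n' what is this > fileA
printf '%s\n' what I is dont this even > fileB

# -f fileA: take the patterns from fileA
# -F: literal strings, -x: whole-line matches only, -v: keep non-matching lines
grep -vxFf fileA fileB > only_in_B.txt

cat only_in_B.txt
```

This starts grep once instead of once per input line, which tends to matter as the files grow.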
In bash, is there a way to chain multiple commands, all taking the same input from stdin? That is, one command reads stdin, does some processing, writes the output to a file. The next command in the chain gets the same input as what the first command got. And so on.
For example, consider a large text file to be split into multiple files by filtering the content. Something like this:
cat food_expenses.txt | grep "coffee" > coffee.txt | grep "tea" > tea.txt | grep "honey cake" > cake.txt
This obviously does not work, because the second grep gets the first grep's output, not the original text file. I tried inserting tee's but that does not help. Is there some bash magic that can cause the first grep to send its input to the pipe, not the output?
And by the way, splitting a file was a simple example. Consider splitting (filtering by pattern search) a continuous live text stream coming over a network and writing the output to different named pipes or sockets. I would like to know if there is an easy way to do it using a shell script.
(This question is a cleaned-up version of my earlier one, based on responses that pointed out that it was unclear.)
For this example, you should use awk as semiuseless suggests.
But in general to have N arbitrary programs read a copy of a single input stream, you can use tee and bash's process substitution operator:
tee <food_expenses.txt \
>(grep "coffee" >coffee.txt) \
>(grep "tea" >tea.txt) \
>(grep "honey cake" >cake.txt) \
>/dev/null
Note that >(command) is a bash extension.
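Where process substitution is not available, named pipes created with mkfifo can play the same role. The following is a rough sketch, not a drop-in solution: the pipe names and the sample data are assumptions made for illustration.

```shell
# Scratch directory and sample data are illustrative assumptions.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'coffee 3.50' 'tea 2.00' 'honey cake 4.25' 'coffee 3.75' \
  > food_expenses.txt

# One named pipe per consumer.
mkfifo coffee.pipe tea.pipe

# Start the readers in the background; each waits for data on its pipe.
grep "coffee" < coffee.pipe > coffee.txt &
grep "tea" < tea.pipe > tea.txt &

# tee copies the single input stream into every pipe.
tee coffee.pipe tea.pipe < food_expenses.txt > /dev/null

# Wait for the readers to finish, then clean up the pipes.
wait
rm -f coffee.pipe tea.pipe

cat coffee.txt tea.txt
```

The readers have to be started before tee, because opening a named pipe for writing blocks until something opens it for reading.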
The obvious question is why you want to do this within one command.
If you don't want to write a script, and you want to run stuff in parallel, bash supports the concept of subshells, and these can run in parallel. By putting each command in parentheses and sending it to the background with &, you can run your greps (or whatever) concurrently, e.g.
$ (grep coffee food_expenses.txt > coffee.txt) & (grep tea food_expenses.txt > tea.txt) & wait
Note that in the above your cat may be redundant since grep takes an input file argument.
You can (instead) play around with redirecting output through different streams. You're not limited to stdout/stderr but can assign new streams as required. I can't advise more on this other than direct you to examples here
I like Stephen's idea of using awk instead of grep.
It ain't pretty, but here's a command that uses output redirection to keep all data flowing through stdout:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} {print $0}' 2> tea.txt
As you can see, it uses awk to send all lines matching 'coffee' to stderr, and all lines regardless of content to stdout. Then stderr is fed to a file, and the process repeats with 'tea'.
If you wanted to filter out content at each step, you might use this:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} $0 !~ /coffee/ {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} $0 !~ /tea/ {print $0}' 2> tea.txt
You could use awk to split into up to two files:
awk '/Coffee/ { print } /Tea/ { print > "/dev/stderr" }' inputfile > coffee.file.txt 2> tea.file.txt
I am unclear why the filtering needs to be done in different steps. A single awk program can scan all the incoming lines and dispatch the appropriate lines to individual files. This is a very simple dispatch that can feed multiple secondary commands (i.e. persistent processes that monitor the output files for new input, or the files could be sockets that are set up ahead of time and written to by the awk process).
If there is a reason to have every filter see every line, then just remove the "next;" statements, and every filter will see every line.
$ cat split.awk
BEGIN{}
/^coffee/ {
print $0 >> "/tmp/coffee.txt" ;
next;
}
/^tea/ {
print $0 >> "/tmp/tea.txt" ;
next;
}
{ # default
print $0 >> "/tmp/other.txt" ;
}
END {}
$
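As a sketch of the variant without next, where every filter sees every line: the patterns below are unanchored (unlike /^coffee/ above) so that one sample line can match two filters, and the catch-all rule is written as an explicit negation, since without next a bare default block would receive every line. File names and data are illustrative.

```shell
# Scratch directory and sample data are illustrative assumptions.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'coffee with tea biscuit' 'tea only' 'water' > food_expenses.txt

awk '
/coffee/            { print > "coffee.txt" }   # no next: later rules still run
/tea/               { print > "tea.txt" }
!/coffee/ && !/tea/ { print > "other.txt" }    # lines that matched no filter
' food_expenses.txt

cat coffee.txt tea.txt other.txt
```

The first sample line ends up in both coffee.txt and tea.txt, which is the behavior the question asked for.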
Here are two bash scripts without awk. The second one doesn't even use grep!
With grep:
#!/bin/bash
tail -F food_expenses.txt | \
while read line
do
for word in "coffee" "tea" "honey cake"
do
if [[ $line != ${line#*$word*} ]]
then
echo "$line"|grep "$word" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
fi
done
done
Without grep:
#!/bin/bash
tail -F food_expenses.txt | \
while read line
do
for word in "coffee" "tea" "honey cake"
do
if [[ $line != ${line#*$word*} ]] # does the line contain the word?
then
echo "$line" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
fi
done
done;
Edit:
Here's an AWK method:
awk 'BEGIN {
list = "coffee tea";
split(list, patterns)
}
{
for (pattern in patterns) {
if ($0 ~ patterns[pattern]) {
print > (patterns[pattern] ".txt")
}
}
}' food_expenses.txt
Working with patterns which include spaces remains to be resolved.
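One possible way to handle patterns containing spaces is to split the list on a different separator. This is a sketch under assumptions: "|" is chosen as the separator, spaces in output file names are replaced with underscores (so "honey cake" goes to honey_cake.txt), and the sample data is invented.

```shell
# Scratch directory and sample data are illustrative assumptions.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'coffee 3.50' 'honey cake 4.25' 'green tea 2.00' > food_expenses.txt

awk 'BEGIN {
    # Split on "|" instead of whitespace so a pattern may contain spaces.
    n = split("coffee|tea|honey cake", patterns, "|")
}
{
    for (i = 1; i <= n; i++) {
        if ($0 ~ patterns[i]) {
            file = patterns[i]
            gsub(/ /, "_", file)       # "honey cake" becomes "honey_cake"
            print > (file ".txt")      # parenthesized for portability
        }
    }
}' food_expenses.txt

cat coffee.txt tea.txt honey_cake.txt
```

The separator must not occur in any pattern, so a different character would be needed if the patterns themselves may contain "|".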
You can probably write a simple AWK script to do this in one shot. Can you describe the format of your file a little more?
Is it space/comma separated?
Do you have the item descriptions in a specific 'column', where columns are defined by some separator like a space, a comma or something else?
If you can afford multiple grep runs this will work:
grep coffee food_expenses.txt > coffee.txt
grep tea food_expenses.txt > tea.txt
and so on.
Assuming that your input is not infinite (as in the case of a network stream that you never plan on closing) I might consider using a subshell to put the data into a temp file, and then a series of other subshells to read it. I haven't tested this, but maybe it would look something like this
( cat inputstream > tempfile )
( grep tea tempfile > tea.txt )
( grep coffee tempfile > coffee.txt )
I'm not certain of an elegant solution to the file getting too large if your input stream is not bounded in size however.