Grep differences between two lists [closed] - windows

I've been using a simple grep command to output the differences between two files. It works fine on Windows 10 with Cygwin's grep and in Windows 10's Ubuntu-based Bash shell, but it is not working on Mac OS X Yosemite.
Here is the very simple command line I've been using:
grep -F -v -f list1.txt list2.txt > differences1.txt
On Mac OS X, issuing that command from Terminal causes a long pause and produces no output. I've checked both list1.txt and list2.txt to ensure they have the appropriate line terminations for their respective OS, and it still doesn't work. I've consulted the man page for grep along with the command-line help and can't discern any difference in parameters between the OSes that would cause this problem. For the record, the versions are GNU grep 2.16 in the Windows 10 Bash shell and BSD grep 2.5.1-FreeBSD on Mac OS X Yosemite.

techraf's suggestion to install the GNU tools so you can use grep the way you did in the Windows Bash shell will always work.
Even without that, FreeBSD awk has enough functionality to get the difference between two files with the logic below:
awk 'NR == FNR{a[$0]++; next} !($0 in a)' file1 file2
will do the diff between two files, as confirmed on awk version 20091126 (FreeBSD).
Assuming my files are like below:
file1:
1
2
3
4
file2:
2
5
To get the lines that are unique to file1, run:
awk 'NR == FNR{a[$0]++; next} !($0 in a)' file2 file1
This produces:
1
3
4
To get the lines that are unique to file2, run:
awk 'NR == FNR{a[$0]++; next} !($0 in a)' file1 file2
This produces:
5

Make sure you are running the correct grep (not an alias, a function, or some script that happened to be found earlier on your PATH):
/usr/bin/grep -F -v -f list1.txt list2.txt > differences1.txt
A number of tools in FreeBSD (on which Mac OS X is based) and GNU distributions differ in functionality. That said, the parameters in your command indeed look consistent across the versions.
You can install GNU grep with Homebrew:
brew tap homebrew/dupes; brew install grep
and then run it using the command ggrep.
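Once installed, the original command should work unchanged apart from the name; for example (assuming the same file names as in the question):
ggrep -F -v -f list1.txt list2.txt > differences1.txt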
As a side note, you can also install other GNU tools that differ from their BSD counterparts (like gdate for date) with:
brew install coreutils

Related

Efficient search pattern in large CSV file

I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers, the one by user @anubhava being the one I found most straightforward and elegant. For the sake of clarity I am going to repeat some information from the original question.
I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file with only the rows matching each category.
My input file dataset.csv is something like this:
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
I am using the following to get the results I want (again, this is thanks to @anubhava):
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
This works as expected, but I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not even at 20% of the whole process.
I am running this on Windows 10 with an AMD Ryzen PRO 3500 200MHz, 4 cores, 8 logical processors, 16GB of memory and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and memory at 51%. I am running awk inside a Cygwin64 terminal.
I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.
As always, I appreciate any advice.
with sorting:
awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)
or with gawk (which can handle an effectively unlimited number of open output files):
gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv
This is the right way to do it using any awk:
$ tail -n +2 file | sort -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'
The reason I say this is the right approach is that it doesn't rely on the 2nd field of the header line sorting before the data values, doesn't require awk to test NR > 1 for every line of input, and doesn't need an array to store $2s or any other values. It also keeps only one output file open at a time: the more files open at once, the slower any awk will run, and gawk in particular, once you get past the limit of open files supported by other awks, has to start opening and closing the files in the background as needed. Finally, it doesn't require you to empty existing output files before you run it (it does that automatically), and it only does the string concatenation to create the output file name once per output file, not once per line.
Just like the currently accepted answer, the sort above could reorder input lines that have the same $2 value. Add -s if that's undesirable and you have GNU sort; with other sorts you would need to replace the tail with a different awk command and add another sort argument.
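For reference, a minimal sketch of that variant with GNU sort's -s (stable) flag added; everything else is unchanged from the command above:
$ tail -n +2 file | sort -s -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'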

Sed on macOS produces extra file suffixed with -e [duplicate]

This question already has answers here:
sed command with -i option failing on Mac, but works on Linux
(13 answers)
sed in-place flag that works both on Mac (BSD) and Linux
(15 answers)
Closed 4 years ago.
I'm trying to get sed to replace a line in a file with the contents of another file. I got this to work; however, the in-place replacement somehow produces an extra file suffixed with -e.
This only seems to happen on macOS (High Sierra) and doesn't happen on Linux (Alpine), where I tried to reproduce it in a Docker container.
My commands that reproduce this in sequence:
$ echo 'someline' > target_file.txt
$ echo 'replacementcontent' > replacement.txt
$ sed -Ei -e "\#^someline\$#{
r replacement.txt
d
}" target_file.txt
$ cat target_file.txt
replacementcontent
$ ls
replacement.txt target_file.txt target_file.txt-e
The replacement worked as intended, but in a Linux environment the target_file.txt-e file would not be there.
I know there are differences between the macOS and Linux sed, but this just seems random; I'm likely just not understanding something.
Why does this happen, and can the command be written in an agnostic way (so that it works the same on both macOS and Linux)?
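The usual explanation is that BSD sed's -i option requires a backup-suffix argument, so in the command above it consumes the following -e as that suffix and writes the backup copy to target_file.txt-e, while GNU sed's -i takes an optional suffix and leaves -e alone. A commonly used workaround that should behave the same on both (a sketch, not a guarantee for every sed version) is to always supply a suffix and delete the backup afterwards:
sed -E -i.bak "\#^someline\$#{
r replacement.txt
d
}" target_file.txt && rm target_file.txt.bak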

Mac OS terminal solution to remove from a text file the lines found in another text file

I work in SEO and sometimes I have to manage lists of domains to be considered for certain actions in our campaigns. On my iMac, I have 2 lists: one provided for consideration, unfiltered.txt, and another that lists the domains I've already analyzed, used.txt. The one provided for consideration, the new one (unfiltered.txt), looks like this:
site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
... etc
The list of domains that needs to be used as a filter, i.e. to be eliminated (used.txt), looks like this:
site4.org
site5.me
site6.co.nz
gland.org.uk
kland.co.nz
site7.de
site8.it
... etc
Is there a way to use my OS X terminal to remove from unfiltered.txt all the lines found in used.txt? I found a software solution that partially solves the problem, but aside from the exact entries from used.txt it also eliminates domains that merely contain those entries as substrings. That means I get a broader filter and also eliminate domains that I still need.
For example, if my unfiltered.txt contains a domain named fogland.org.uk, it will be automatically eliminated if my used.txt file has a domain named gland.org.uk.
The files are pretty big (close to 100k lines). I have a pretty good configuration (SSD, i7 7th gen, 16GB RAM), but I'd rather not leave it running for hours just for this operation.
... hope it makes sense.
TIA
You can do that with awk. You pass both files to awk. Whilst parsing the first file, where the current record number across all files is the same as the record number in the current file, you make a note of each domain you have seen. Then, when parsing the second file, you only print records that correspond to ones you have not seen in the first file:
awk 'FNR==NR{seen[$0]++;next} !seen[$0]' used.txt unfiltered.txt
Sample Output for your input data
site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
awk is included and delivered as part of macOS - no need to install anything.
I have always used
grep -v -F -f expunge.txt filewith.txt > filewithout.txt
to do this. When "expunge.txt" is too large, you can do it in stages, cutting it into manageable chunks and filtering one after another:
cp filewith.txt original.txt
and loop as required:
grep -v -F -f chunkNNN.txt filewith.txt > filewithout.txt
mv filewithout.txt filewith.txt
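If you need to produce those chunks in the first place, split can do it; a minimal sketch (the 20000-line chunk size and the chunk prefix are arbitrary choices, not part of the original answer):
split -l 20000 expunge.txt chunk
This writes pieces named chunkaa, chunkab, and so on, which you can then feed to the loop above.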
You could even do this in a pipe:
grep -v -F -f chunk01.txt original.txt |\
grep -v -F -f chunk02.txt |\
grep -v -F -f chunk03.txt \
> purged.txt
(Note that only the first grep reads original.txt; the later ones read from the pipe.)
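One caveat, since the question specifically mentions the substring problem: -F without -x is still substring matching, so gland.org.uk would also remove fogland.org.uk. Adding grep's -x flag restricts matches to whole lines. A sketch using the question's file names (the output file name is made up):
grep -x -v -F -f used.txt unfiltered.txt > filtered.txt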
You can use comm. I haven't got a mac here to check but I expect it will be installed by default. Note that both files must be sorted. Then try:
comm -2 -3 unfiltered.txt used.txt
Check the man page for further details.
You can use comm and process substitution to do everything in one line:
comm -23 <(sort unfiltered.txt) <(sort used.txt) > unfiltered_new.txt
P.S. tested on my Mac running OSX 10.11.6 (El Capitan)

GAWK premature EOF with getline

Here's the deal: I need to read a specific number of bytes, which will be processed later on. I've encountered a strange phenomenon though, and I couldn't wrap my head around it. Maybe someone else can? :)
NOTE: The following code examples are slimmed-down versions just to show the effect!
A way of doing this, at least with gawk, is to set RS to a catch-all regex and then use RT to see what has been matched:
RS="[\x00-\xFF]"
Then quite simply use the following awk script:
BEGIN {
ORS=""
OFS=""
RS="[\x00-\xFF]"
}
{
print RT
}
This is working fine:
$ echo "abcdef" | awk -f bug.awk
abcdef
However, I'll need several files to be accessed, so I am forced to use getline:
BEGIN {
ORS=""
OFS=""
RS="[\x00-\xFF]"
while (getline)
{
print RT
}
}
This is seemingly equivalent to the above, but when running it, there is a nasty surprise:
$ echo "abcdef" | awk -f bug.awk
abc
This means that, for some reason, getline is encountering the EOF condition 3 bytes early. So, did I miss something that I should know about the internals of bash/Linux buffering, or did I find a dreadful bug?
Just for the record: I am using GNU Awk 4.0.1 on Ubuntu 14.04 LTS (Linux 3.13.0/36)
Any tips, guys?
UPDATE: I am using getline because I have previously read and preprocessed the file(s) and stored them in file(s) under /dev/shm/. Then I need to do a few final processing steps. The above examples are just bare-minimum scripts to show the problem.
Seems like this is a manifestation of the bug reported here, which (if I understand it correctly) has the effect of terminating the getline prematurely when close to the end of input, rather than at the end of input.
The bug fixes seem to have been committed on May 9 and May 10, 2014, so if you can upgrade to version 4.1 it should fix the problem.
If all you need to do is read a specified number of bytes, I'd suggest that awk is not the ideal tool, regardless of bugs. Instead, you might consider one of the following two standard utilities, which will be able to do the work rather more efficiently:
head -c $count
or
dd bs=$count count=1
With dd you can explicitly set the input file (if=PATH) and output file (of=PATH) if stdin/stdout are not appropriate. With head you can specify the input file as a positional parameter, but the output always goes to stdout.
See man head and man dd for more details.
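For example, a sketch of pulling a fixed-size chunk out of one of the preprocessed files with dd (the file names and byte count here are made up for illustration):
count=1024
dd if=/dev/shm/preprocessed.bin of=/dev/shm/chunk.bin bs=$count count=1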
Fortunately, using GNU Awk 4.1.3 (on a Mac), your program with getline works as expected:
echo "abcdef" | gawk 'BEGIN{ORS="";OFS="";RS="[\x00-\xFF]";
while (getline) {print RT}}'
abcdef
$ gawk --version
GNU Awk 4.1.3, API: 1.1

Pipes & xargs => top [closed]

I am trying to use pipes and xargs to start top with a particular pid, but I cannot get it to work and I don't know why:
ps aux|grep ProgramName|awk '{print $2}'|head -n1|xargs top -pid
I get the correct pid printed to the screen if I stop after head -n1, and manually adding that pid to the command top -pid XXX also works, but running the whole line as one command just does not bring up the top screen.
What am I doing wrong here?
EDIT: yes, "-pid" is indeed correct (further checking of the remote shell revealed it is actually a Mac OS based system, not a Linux one)
What am I doing wrong here?
Several things:
You are using grep and awk in the same pipeline. Since awk does pattern matching, there is no reason to use grep as a separate process.
You are using awk and head in the same pipeline. Since awk can control the number of items it prints, there is no need to use head.
Your grep command will find both the indicated program, and the grep program.
You are using xargs to provide a single command line argument. Either backticks or $() is a better choice.
top takes a -p switch, not a -pid switch. (At least on my computer.)
Adding it all up, try:
$ top -p $(ps aux | awk '/ProgramName/ && ! /awk/ { print $2; exit; }')
Your problem is
the arg to top should be "-p" not "-pid"
xargs is for running non-interactive programs
Try this:
top -p "$(pgrep ProgramName | head -n 1)"
or
top -p "$(pgrep --oldest ProgramName)"
or
top -p "$(pgrep --newest ProgramName)"
