Combining flat file modification and concatenating steps - bash

I perform this operation very often, and I am looking for a shortcut. Is there any way I can do the following without having to write to a temp file?
cut -c 3-5 file1 > temp1
cat temp1 file2 | sort > outfile
Thanks!

Like this:
cut -c 3-5 file1 | cat - file2 | sort > outfile
There may be ancient versions of cat which do not take - to mean standard input.
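For example (assuming some file2 to concatenate with), the - is read first:
echo "from stdin" | cat - file2    # the piped line prints before the contents of file2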

Just do them in sequence:
(cut -c 3-5 file1; cat file2) | sort > outfile
This has the added advantage of working in any Bourne-based shell without requiring bash- or zsh-specific features.
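For the record, a brace group is an equivalent way to group the two commands (mind the spaces around the braces and the semicolon before the closing one):
{ cut -c 3-5 file1; cat file2; } | sort > outfile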

This should do it:
cat <(cut -c 3-5 file1) file2 | sort > outfile

Related

BASH: Find lines which are not available in File2 vs File1

I am currently writing a bash script to find names that are available in File1 but not available in File2.
File1:
"Name"
"Jeff"
"Michael"
"Ringo"
"John"
File2:
"Name"
"Jeff"
"Michael"
"John"
"Bert"
From the example above, it should return "Ringo". So far, I am running a for loop to extract it.
for q in `cat File1 | tail -n +2 | sort`;do grep $q File2 >> output.txt;done
However, it would take forever to run it on ~150,000 records. So, is there a better solution you can share for this?
Thanks in advance for the answers.
comm is a standard utility for this.
tail -n +2 File1 | sort -u > tmp1
sort -u File2 > tmp2
comm -23 tmp1 tmp2 > output.txt
rm tmp1 tmp2
With bash, the temporary file cleanup can be avoided:
comm -23 \
<(tail -n +2 File1 | sort -u) \
<(sort -u File2) \
> output.txt
Note that sort works fine on files that do not fit in memory (implementations generally fall back to a merge sort with temporary files when memory usage would become too high), and comm itself requires minimal memory. I believe the overall runtime is O(n log n).
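To illustrate comm's column logic on a tiny made-up pair of sorted files (column 1 holds lines unique to the first file, column 2 lines unique to the second, column 3 lines common to both; -23 drops columns 2 and 3):
printf '"Jeff"\n"Michael"\n"Ringo"\n' > left
printf '"Bert"\n"Jeff"\n"Michael"\n' > right
comm -23 left right    # prints: "Ringo"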
I think you're looking for diff(1). If you have the GNU version, these flags plus some output processing will get you just the first column:
--suppress-common-lines
do not output common lines
--side-by-side, -y
output in two columns
But diff requires lines to be in the same order in both files. If that's not your case, grep with multiple expressions and the invert flag -v/--invert-match and -E/--extended-regexp might work better.
Also note that I am using command substitution instead of a for loop to run it in one go. The (x|y) extended regexp matches x OR y; paste -sd'|' joins the unique names with | without leaving the trailing delimiter that tr '\n' '|' would.
grep --invert-match --extended-regexp \
"^($(sort -u File2 | paste -sd'|' -))$" \
File1
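If the names are literal strings rather than patterns, a fixed-string whole-line match sidesteps the regexp assembly entirely (a sketch on the same files):
tail -n +2 File1 | grep -Fxvf File2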

How can I combine a set of text files, leaving off the first line of each?

As part of a normal workflow, I receive sets of text files, each containing a header row. It's more convenient for me to work with these as a single file, but if I cat them naively, the header rows in files after the first cause problems.
The files tend to be large enough (10^3–10^5 lines, 5–50 MB) and numerous enough that it's awkward and/or tedious to do this in an editor or step by step, e.g.:
$ wc -l *
20251 1.csv
124520 2.csv
31158 3.csv
175929 total
$ tail -n 20250 1.csv > 1.tmp
$ tail -n 124519 2.csv > 2.tmp
$ tail -n 31157 3.csv > 3.tmp
$ cat *.tmp > combined.csv
$ wc -l combined.csv
175926 combined.csv
It seems like this should be doable in one line. I've isolated the arguments that I need but I'm having trouble figuring out how to match them up with tail and subtract 1 from the line total (I'm not comfortable with awk):
$ wc -l * | grep -v "total" | xargs -n 2
20251 foo.csv
124520 bar.csv
31158 baz.csv
87457 zappa.csv
7310 bingo.csv
29968 niner.csv
2086 hella.csv
$ wc -l * | grep -v "total" | xargs -n 2 | tail -n
tail: option requires an argument -- n
Try 'tail --help' for more information.
xargs: echo: terminated by signal 13
You don't need to use wc -l to calculate the number of lines to output; tail can start output at line K (skipping the first K-1 lines) just by adding a + symbol when using the -n (or --lines) option, as described in the man page:
-n, --lines=K output the last K lines, instead of the last 10;
or use -n +K to output starting with the Kth
This makes combining all files in a directory without the first line of each file as simple as:
$ tail -q -n +2 * > combined.csv
$ wc -l *
20251 foo.csv
124520 bar.csv
31158 baz.csv
87457 zappa.csv
7310 bingo.csv
29968 niner.csv
2086 hella.csv
302743 combined.csv
605493 total
The -q flag suppresses the ==> file <== headers that tail otherwise prints when given multiple files, e.g. from a glob.
Both tail and sed answers work fine.
For the sake of an alternative, here is an awk command that does the same job:
awk 'FNR > 1' *.csv > combined.csv
The FNR > 1 condition skips the first row of each input file.
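FNR is the per-file record counter and resets at the start of each input file, while NR keeps counting across all files; you can watch the difference with, say:
awk '{print FILENAME, NR, FNR}' 1.csv 2.csv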
With GNU sed:
sed -ns '2,$p' 1.csv 2.csv 3.csv > combined.csv
or
sed -ns '2,$p' *.csv > combined.csv
Another sed alternative
sed -s 1d *.csv
deletes the first line from each input file; without -s it would only delete it from the first file.
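Note that both tail -q and sed -s are GNU extensions; on BSD/macOS a plain loop is a portable fallback (just make sure the output name doesn't match the glob):
for f in *.csv; do tail -n +2 "$f"; done > combined.txt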

How could I compare two files and remove similar rows in them (bash script)

I have two files of data with a similar number of columns. I'd like to save file2 into another file (file3) while excluding the rows that already exist in file1.
grep -v -i -f file1 file2 > file3
But the problem is that the separator between columns in file1 is "\t" while in the other one it is just " ". Therefore this command doesn't work.
Any suggestions?
Thanks folks!
You can convert tabs to spaces on the fly:
grep -vif <(tr '\t' ' ' < file1) file2 > file3
This is process substitution.
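Under the hood, bash substitutes a file name that yields the command's output, which you can see directly (the exact /dev/fd path varies):
echo <(tr '\t' ' ' < file1)    # prints something like /dev/fd/63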
Try:
grep -Fxvf file1 file2
Switch meanings, from the grep man page: -F treats the patterns as fixed strings, -x matches only whole lines, -v inverts the match, and -f reads the patterns from a file.
grep -v -f is problematic because it searches file2 for each line in file1. With large files it will take a very long time. Try this instead:
comm -13 <(tr '\t' ' ' < file1 | sort) <(sort file2)

Searching for Strings

I would like to have a shell script that searches two files and returns a list of strings:
File A contains just a list of unique alphanumeric strings, one per line, like this:
accc_34343
GH_HF_223232
cwww_34343
jej_222
File B contains a list of SOME of those strings (sometimes more than once), and a second column of information, like this:
accc_34343 dog
accc_34343 cat
jej_222 cat
jej_222 horse
I would like to create a third file that contains a list of the strings from File A that are NOT in File B.
I've tried using some loops with grep -v, but that doesn't work. So, in the above example, the new file would have this as its contents:
GH_HF_223232
cwww_34343
Any help is greatly appreciated!
Here's what you can do:
grep -v -f <(awk '{print $1}' file_b) file_a > file_c
Explanation:
grep -v : Use -v option to grep to invert the matching
-f : Use -f option to grep to specify that the patterns are from file
<(awk '{print $1}' file_b): The <(awk '{print $1}' file_b) is to simply extract the first column values from file_b without using a temp file; the <( ... ) syntax is process substitution.
file_a : Tell grep that the file to be searched is file_a
> file_c : Output to be written to file_c
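One caveat worth flagging: without -F and -x, grep treats each pattern as a regex and also matches substrings, so e.g. jej_222 would wrongly exclude a hypothetical jej_2223. A stricter variant:
grep -Fxvf <(awk '{print $1}' file_b) file_a > file_c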
comm is used to find intersections and differences between files:
comm -23 <(sort fileA) <(cut -d' ' -f1 fileB | sort -u)
result:
GH_HF_223232
cwww_34343
I assume your shell is bash/zsh/ksh:
awk 'FNR==NR{a[$1];next} !($0 in a)' fileB fileA
FNR==NR is only true while the first file (fileB) is read, so a collects its first-column keys; after that, each line of fileA is printed only if it is not among those keys.

BASH: Subtracting files on a key, line by line

I just want to subtract one CSV file from another, but not by comparing whole lines. Instead of comparing the lines, I'd like to check whether the lines match in one field.
e.g. the first file
EMAIL;NAME;SALUTATION;ID
foo#bar.com;Foo;Mr;1
bar#foo.com;Bar;Ms;2
and the second file
EMAIL;NAME
foo#bar.com;Foo
the resultfile should be
EMAIL;NAME;SALUTATION;ID
bar#foo.com;Bar;Ms;2
I think you know what I mean ;)
How is that possible in bash? It's easy for me to do this in Java, but I'd really like to learn how to do it in bash. Also, I can already subtract by comparing whole lines using sort:
#!/bin/bash
echo "Subtracting files..."
sort "/tmp/list1.csv" "/tmp/list2.csv" "/tmp/list2.csv" | uniq -u >> /tmp/subList.csv
echo "Files successfully subtracted."
But the lines aren't identical tuples, so I have to compare the lines by key.
Any suggestions? Thanks a lot.. Nils
One possible solution coming to my mind is this one (working with bash):
grep -v -f <(cut -d ";" -f1 /tmp/list2.csv) /tmp/list1.csv
That means:
cut -d ";" -f1 /tmp/list2.csv: Extract the first column of the second file.
grep -f some_file: Use a file as pattern source.
<(some_command): This is a process substitution. It executes the command and feeds the output to a named pipe which then can be used as file input to grep -f.
grep -v: Print only the lines not matching the pattern(s).
Update: the solution to the question, via join and awk.
join --header -1 1 -2 1 -t";" --nocheck-order -v 1 1.csv 2.csv | awk 'NR==1 {print gensub(";[^;]+$","","g"); next} 1'
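Note that join expects both inputs sorted on the join field; --nocheck-order merely silences the warning, and unsorted input can silently drop lines. If your files may be unsorted, a sketch that keeps the header aside:
head -n 1 1.csv > result.csv
join -t';' -v 1 \
<(tail -n +2 1.csv | sort -t';' -k1,1) \
<(tail -n +2 2.csv | sort -t';' -k1,1) >> result.csv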
These were the inverse answers:
$ join -1 1 -2 1 -t";" --nocheck-order -o 1.1,1.2,1.3,1.4 1.csv 2.csv
EMAIL;NAME;SALUTATION;ID
foo#bar.com;Foo;Mr;1
join to the rescue.
Or, skipping the NAME field without -o:
$ join -1 1 -2 1 -t";" --nocheck-order 1.csv 2.csv | awk 'BEGIN {FS=";" ; OFS=";"} {$NF=""; print }'
(But it still prints a trailing ; after the last field.)
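If you have GNU awk, decrementing NF drops the last field together with its separator, which avoids the trailing ; (a sketch):
join -1 1 -2 1 -t";" --nocheck-order 1.csv 2.csv | awk 'BEGIN {FS=";"; OFS=";"} {NF--; print}'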
HTH
