Delete rows that match pattern by id - bash

I have the following file containing n rows:
>name.1_i4_xyz_n
>name.1_i1_xyz_n
>name.1_i1_xyz_n
>name.1_i1_xyz_m
>name.1_i2_xyz_n
>name.1_i2_xyz_m
>name.1_i7_xyz_m
>name.1_i4_xyz_n
...
I want to delete rows that end with m. In the example the output would be:
>name.1_i4_n
>name.1_i4_n
...
Note that I've deleted i2 as it has two records and one of them ends with m. Same with i1.
Any help? I want to keep it simple and do it with just one line of code.
This is what I have so far:
$ grep "i._.*." < input.txt | sort -k 2 -t "_" | cut -d'_' -f1,2,4
>name.1_i1_m
>name.1_i1_n
>name.1_i1_n
>name.1_i2_m
>name.1_i2_n
>name.1_i4_n
>name.1_i4_n
>name.1_i7_m
...

To delete rows that end with m:
$ grep -v m$ file
>name.1_i4_xyz_n
>name.1_i1_xyz_n
>name.1_i1_xyz_n
>name.1_i2_xyz_n
>name.1_i4_xyz_n
Another solution that handles the ids, using awk and 2 runs:
$ awk 'BEGIN { FS="_" } # set delimiter
NR==FNR { # on the first run
if($0~/m$/) # if it ends in an m
d[$2] # make a del array entry of that index
next
}
($2 in d==0)' file file # on the second run don't print if index in del array
>name.1_i4_xyz_n
>name.1_i4_xyz_n
One-liner version:
$ awk 'BEGIN{FS="_"}NR==FNR{if($0~/m$/)d[$2];next}($2 in d==0)' file file

If the i... part does not appear in any other column, you can use
grep -vFf <(grep -E 'm$' file | cut -d _ -f 2) file
The part inside <() extracts all the i... ids that have a row ending with m. In your example: i1, i2, and i7.
The outer grep takes that list of literal search strings (from the <()) and prints only the lines not containing any of the search strings.
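For illustration, the process substitution on its own expands to the id list that the outer grep then excludes:
$ grep -E 'm$' file | cut -d _ -f 2
i1
i2
i7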

You can use awk like this:
awk -F_ '{if(/m$/) a[$2]; else rows[++n]=$0}
END{for (i=1; i<=n; i++) {split(rows[i], b, FS); if (!(b[2] in a)) print rows[i]}}' file
>name.1_i4_xyz_n
>name.1_i4_xyz_n

Another awk proposal (note that it hardcodes the _i4 id that survives in this example):
awk '/_i4/&&!/_m$/' filterm.awk
>name.1_i4_xyz_n
>name.1_i4_xyz_n

Sort and split CSV file using sed or awk

I have a CSV file (test.csv) that looks like this:
WH_01,TRAINAMS,A10,1221-ESD
WH_03,TRAINLON,L10A3,3005-21
WH_01,TRAINAMS,A101,PWR-120
WH_02,TRAINCLE,A1,074-HD-SATA
WH_01,TRAINAMS,A10,PWR-120
WH_02,TRAINCLE,A15,102-55665
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
1). I can sort the file based on the value in column# 2 as follows:
sort -t, -k2,2 test.csv > testsort.csv
2). Next I would like to split the file based on the value in column# 2. Using the above example, it should create 3 files:
testsort_1.csv:
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
testsort_2.csv:
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv:
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
How can I do this? Not sure if the sort is even required and if the above can be achieved without sorting.
Thank you.
Good move separating sort and awk.
$ sort -t, -k2,2 test.csv |awk -F, '!($2 in T) {T[$2]=++i} {print > ("testsort_" i ".csv")}'
$ tail -n +1 testsort*
==> testsort_1.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> testsort_2.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> testsort_3.csv <==
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
!($2 in T) - If the second field is not found in the indices of array T,
{T[$2]=++i} - increment the counter and save the second field as an index.
{print} - print every line
> "file" - overwrite, redirect, and append output to file
("." i ".") - concatenate "strings" and variable
Since you're not sure whether you need to sort, that almost certainly means you don't and you just think it'd be useful for some reason. You're just sorting on $2 and then splitting into different files based on the value of $2, so sorting is doing no good whatsoever.
All you actually need to do is:
awk -F, '{print > ($2".csv")}'
Look:
$ ls
test.csv
$ awk -F, '{print > ($2".csv")}' test.csv
$ ls
test.csv TRAINAMS.csv TRAINCLE.csv TRAINLON.csv
$ tail -n +1 TRAIN*
==> TRAINAMS.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> TRAINCLE.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> TRAINLON.csv <==
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
If you got past about 20 output file names and weren't using GNU awk then you'd have to close() each one whenever $2 changes and use >> instead of > to append to them.
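A rough sketch of that close()/append variant, assuming the same test.csv layout and the $2-named output files from above:
awk -F, '
    $2 != prev {                              # key changed (or first record)
        if (prev != "") close(prev ".csv")    # release the previously opened file
        prev = $2
    }
    { print >> ($2 ".csv") }                  # append, since a file may be closed and reopened
' test.csv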
If for some reason you really do need to use the output file names from your question then that'd be:
awk -F, '!($2 in map){map[$2]="testsort_"++cnt".csv"} {print > map[$2]}' test.csv
You can do it in a fairly simple fashion by keeping a counter for the filename and using sprintf to create the filename for each successive group of files. You use the FNR (file record number) to distinguish between the first and subsequent records.
For example:
$ sort -t, -k2 file.csv |
awk -F, -v cnt=1 -v fn="testsort_1.csv" '
    FNR==1 {
        prev=$2
        print $0 > fn
    }
    FNR>1 {
        if ($2!=prev) {
            cnt++
            fn=sprintf("%s_%d.csv", "testsort", cnt)
        }
        print $0 > fn
    }'
(note: you set the initial filename as a variable to begin with, and then create all subsequent filenames from your cnt (count) using sprintf. prev tracks the second field from the previous record. fn is the filename built by sprintf from the counter.)
A shorter version of the same script declaring prev as a variable initially, would be:
sort -t, -k2 file.csv |
awk -F, -v cnt=0 -v prev="" '{
    if ($2!=prev) {
        cnt++
        fn = "testsort_" cnt ".csv"
        prev=$2
    }
    print $0 > fn
}'
If you do not wish to have sequentially numbered files, but instead want the "testsort_number.csv" taken from the sorted records, look at @Cyrus's now-deleted answer, which provides an excellent (and shorter) solution in that regard. (I see you already have a great answer.)
Example Use/Output
With your input in file.csv, the following output files would be created:
$ for i in testsort_{1..3}.csv; do printf "\n%s\n" $i; cat $i; done
testsort_1.csv
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A10,PWR-120
WH_01,TRAINAMS,A101,PWR-120
testsort_2.csv
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859

awk to do group by sum of column

I have this CSV file and I am trying to write a shell script to calculate the sum of a column after doing a group by on it. The column number is 11 (STATUS).
My script is
awk -F, 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' $f > $parentdir/outputfile.csv;
The expected file output is
COMMITTED 2
but the actual output is just 2.
It prints only the count and not the group-by key. If I delete any other columns and run the same query, it works fine, but not with the sample data below.
FILE NAME;SEQUENCE NR;TRANSACTION ID;RUN NUMBER;START EDITCREATION;END EDITCREATION;END COMMIT;EDIT DURATION;COMMIT DURATION;HAS DEPENDENCY;STATUS;DETAILS
Buldhana_Refinesource_FG_IW_ETS_000001.xml;1;4a032127-b20d-4fa8-9f4d-7f2999c0c08f;1;20180831130210345;20180831130429638;20180831130722406;140;173;false;COMMITTED;
Buldhana_Refinesource_FG_IW_ETS_000001.xml;2;e4043fc0-3b0a-46ec-b409-748f98ce98ad;1;20180831130722724;20180831130947144;20180831131216693;145;150;false;COMMITTED;
Change the field separator (FS) to ; in your script:
awk -F';' 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' file
COMMITTED 2
You're using the wrong field separator. Use
awk -F\;
The ; must be escaped to use it as a literal. Apart from that, your approach seems OK.
Besides awk, you may also use
tail -n +2 $f | cut -f11 -d\; | sort | uniq -c
or
datamash --header-in -t \; -g 11 count 11 < $f
to do the same thing.
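For the two sample data rows, both alternatives report the same group count; note that uniq -c puts the count before the value:
$ tail -n +2 $f | cut -f11 -d\; | sort | uniq -c
      2 COMMITTED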

Replacing a specific field using awk, sed or any POSIX tool

I have a huge file1, which has values as follows:
a 1
b 2
c 3
d 4
e 5
I have another huge file2, which is colon delimited with seven fields as follows:
a:2543:2524:2542:252:536365:54654
c:5454:5454:654:54:87:54
d:87:65:1:98:32:87
I want to search file2 for the variables of file1 and replace the value in the 7th column of file2, so the output should be as follows:
a:2543:2524:2542:252:536365:1
c:5454:5454:654:54:87:3
d:87:65:1:98:32:4
Maybe this awk will work, assuming the files are as posted; a combined field separator lets both files split on their own delimiter, and OFS rebuilds the line when $7 is reassigned:
awk -F'[ :]' -v OFS=: 'NR==FNR{a[$1]=$2; next} $1 in a{$7=a[$1]} 1' file1 file2
a:2543:2524:2542:252:536365:1
c:5454:5454:654:54:87:3
d:87:65:1:98:32:4
So I came up with a solution; it ended up being a couple of lines of code. Maybe there is a better way to do it, but this works!
while read line ; do
    var1=`echo $line| awk '{print $1}'`
    var2=`echo $line| awk '{print $2}'`
    awk -v var1="$var1" -v var2="$var2" -F ':' 'BEGIN { OFS = ":"} $1==var1 {sub(".*",var2,$7)}{print}' file2 > file2.tmp
    mv file2.tmp file2
done < file1
cat file2
This should work. It assumes both files are sorted on the first column. I'd be very interested in any performance comparisons with your solution for very large files, both speed and memory. --complement is GNU-specific but easily replaced if not available.
file0=$1 #file with single value
file1=$2 #file with 7th value to be replaced
# normalize on colon delimiter
tr ' ' : <$file0|
# join on first field
join -t: $file1 -|
# delete column 7
cut --complement -d: -f7
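As a usage sketch, assuming the script above is saved as replace7.sh (a made-up name), the single-value file is passed first and the colon-delimited file second:
$ bash replace7.sh file1 file2
a:2543:2524:2542:252:536365:1
c:5454:5454:654:54:87:3
d:87:65:1:98:32:4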

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2 it should return 1, or else 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
The above code is not giving me the output which I am looking for.
Kindly have a look and suggest a correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
    grep -xF -f file2 file1 | sed $'s/$/\t1/'
    grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
# comm -2 suppresses lines unique to file2; lines common to both files arrive
# prefixed with a tab (column 3), lines unique to file1 have no prefix.
while read -r; do
    if [[ $REPLY = $'\t'* ]] ; then
        printf "%s\t1\n" "${REPLY#?}"
    else
        printf "%s\t0\n" "${REPLY}"
    fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have python installed.
If you're familiar with Python and are interested in this solution, it only needs a bit of output formatting.
#!/usr/bin/python
f1 = open('file1').read().splitlines()
f2 = open('file2').read().splitlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n, c in zip(f1, f1_in_f2):
    print(n, c)

Join lines based on pattern

I have the following file:
test
1
My
2
Hi
3
I need a way to use cat, grep or awk to give the following output:
test1
My2
Hi3
How can I achieve this in a single command? Something like
cat file.txt | grep ... | awk ...
Note that it's always a string followed by a number in the original text file.
sed 'N;s/\n//' file.txt
This should give the desired output when the content is in file.txt
paste -d "" - - < filename
This takes consecutive lines and pastes them together delimited by the empty string.
awk '{printf("%s", $0);} !(NR%2){printf("\n");}' file.txt
EDIT: I just noticed that your question requires the use of cat and grep. Both of those programs are unnecessary to achieve your stated aims. If you have some reason for including them that you haven't mentioned, try this (uselessly inefficient) version of the line I wrote immediately above:
cat file.txt | grep '^' | awk '{printf("%s", $0);} !(NR%2){printf("\n");}'
It is possible that this command uses features not present in the original awk program. You may need to invoke the new awk program, nawk instead.
If your input file is always 1 number then 1 string, and you only want the strings, all you have to do is take every other line.
If you only want the odd lines, you can do awk 'NR % 2' file.txt
If you want the evens, this becomes awk 'NR % 2==0' data
Here is the answer:
cat file.txt | awk 'BEGIN { lno = 0 } { val=$0; if (lno % 2 == 1) {printf "%s\n", $0} else {printf "%s", $0}; ++lno}'
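A shorter sketch of the same odd/even idea, not taken from the thread: print odd-numbered lines without a newline and let the default print supply one after each even-numbered line.
awk 'NR % 2 { printf "%s", $0; next } 1' file.txt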
