Find repeating sections and output each section to an individual file?

Take a text file with lines like:
/user$ cat ORIGFILE
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
If a session number (e.g. 200289) occurs more than once, each repeating section should be written to its own file, and the remaining lines to NEWFILE, like this:
/user$ cat se832p41iEC.200289
se832p41iEC.200289_EDI832I140401232506.txt
se832p41iEC.200289_EDI832I140401232507.txt
se832p41iEC.200289_EDI832I140401232508.txt
/user$ cat xe832p41iEC.201687
xe832p41iEC.201687_EDI832I140401232511.txt
xe832p41iEC.201687_EDI832I140401232512.txt
xe832p41iEC.201687_EDI832I140401232513.txt
/user$ cat NEWFILE
pt832p41iEC.213631_EDI832I140401232501.txt
pt832p41iEC.213632_EDI832I140401232502.txt
Thank you in advance.
Update: Just figured it out after @Jaypal's hint (thanks man):
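First - sort ORIGFILE | uniq -w 18 -u > NEWFILE
Second - sort ORIGFILE | uniq -w 18 -D > AWKFILE
Last - awk -F_ '{print $0 > $1}' AWKFILE
(-w 18 is a GNU uniq option that compares only the first 18 characters, i.e. the part before the underscore. Without it every line in ORIGFILE counts as unique, so uniq -D finds nothing.)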

Now that you have added your attempt, here is a way of doing it with awk:
$ ls
file
$ cat file
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
$ awk -F_ '{
    a[$1] = (a[$1] ? a[$1] RS $0 : $0)
    b[$1]++
}
END {
    for(x in a) print a[x] > (b[x]>1 ? x : "NEWFILE")
}' file
$ ls
NEWFILE file se832p41iEC.200289 xe832p41iEC.201687
$ head *
==> NEWFILE <==
pt832p41iEC.213631_EDI832I140401232501.txt
pt832p41iEC.213632_EDI832I140401232502.txt
==> file <==
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
==> se832p41iEC.200289 <==
se832p41iEC.200289_EDI832I140401232506.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
==> xe832p41iEC.201687 <==
xe832p41iEC.201687_EDI832I140401232512.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
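If buffering every line until END is a concern for large inputs, a two-pass variant is one possible alternative. This is a minimal sketch, not part of the answer above: it assumes the input is a regular file that can be read twice (not a pipe), counts each prefix on the first pass, and writes each line straight to its target on the second. The scratch directory and the array name count are illustrative choices.

```shell
# Work in a scratch directory so the generated files stay out of the way.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample input from the question.
cat > ORIGFILE <<'EOF'
se832p41iEC.200289_EDI832I140401232506.txt
pt832p41iEC.213631_EDI832I140401232501.txt
xe832p41iEC.201687_EDI832I140401232512.txt
pt832p41iEC.213632_EDI832I140401232502.txt
se832p41iEC.200289_EDI832I140401232508.txt
se832p41iEC.200289_EDI832I140401232507.txt
xe832p41iEC.201687_EDI832I140401232513.txt
xe832p41iEC.201687_EDI832I140401232511.txt
EOF

# Pass 1 (NR==FNR): count how often each prefix before "_" occurs.
# Pass 2: send the line to a file named after its prefix if the prefix
# repeats, otherwise to NEWFILE.
awk -F_ '
    NR == FNR { count[$1]++; next }
    { print > (count[$1] > 1 ? $1 : "NEWFILE") }
' ORIGFILE ORIGFILE
```

Lines keep their input order within each output file; sort the input file beforehand if sorted output is wanted.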

Related

Sort and split CSV file using sed or awk

I have a CSV file (test.csv) that looks like this:
WH_01,TRAINAMS,A10,1221-ESD
WH_03,TRAINLON,L10A3,3005-21
WH_01,TRAINAMS,A101,PWR-120
WH_02,TRAINCLE,A1,074-HD-SATA
WH_01,TRAINAMS,A10,PWR-120
WH_02,TRAINCLE,A15,102-55665
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
1) I can sort the file on the value in column 2 as follows:
sort -t, -k2,2 test.csv > testsort.csv
2) Next, I would like to split the file based on the value in column 2. Using the above example, it should create 3 files:
testsort_1.csv:
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
testsort_2.csv:
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv:
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
How can I do this? I am not sure whether the sort is even required, or whether the above can be achieved without sorting.
Thank you.

Good move separating sort and awk.
$ sort -t, -k2,2 test.csv |awk -F, '!($2 in T) {T[$2]=++i} {print > ("testsort_" i ".csv")}'
$ tail -n +1 testsort*
==> testsort_1.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> testsort_2.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> testsort_3.csv <==
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
!($2 in T) - if the second field is not yet an index of array T,
{T[$2]=++i} - increment the counter and store it in T under the second field as index.
{print} - print every line
> "file" - redirect output to the file; awk truncates it when it is first opened and appends on every later print
("testsort_" i ".csv") - concatenate the strings and the variable into the file name

Since you're not sure whether you need to sort, you almost certainly don't; you just think it'd be useful for some reason. You're sorting on $2 and then splitting into different files based on the value of $2, so the sort is doing no good whatsoever.
All you actually need to do is:
awk -F, '{print > ($2".csv")}'
Look:
$ ls
test.csv
$ awk -F, '{print > ($2".csv")}' test.csv
$ ls
test.csv TRAINAMS.csv TRAINCLE.csv TRAINLON.csv
$ tail -n +1 TRAIN*
==> TRAINAMS.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> TRAINCLE.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> TRAINLON.csv <==
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
If you got past about 20 output file names and weren't using GNU awk then you'd have to close() each one whenever $2 changes and use >> instead of > to append to them.
If for some reason you really do need to use the output file names from your question then that'd be:
awk -F, '!($2 in map){map[$2]="testsort_"++cnt".csv"} {print > map[$2]}' test.csv
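The close() approach mentioned above could look roughly like this. It is a sketch under two assumptions that are not in the answer: the input is sorted on field 2 first, so that each output file is opened only once, and the output files do not already exist (>> appends to whatever is there). The variable names out and prev and the scratch directory are illustrative.

```shell
# Work in a scratch directory so the generated files stay out of the way.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample CSV from the question.
cat > test.csv <<'EOF'
WH_01,TRAINAMS,A10,1221-ESD
WH_03,TRAINLON,L10A3,3005-21
WH_01,TRAINAMS,A101,PWR-120
WH_02,TRAINCLE,A1,074-HD-SATA
WH_01,TRAINAMS,A10,PWR-120
WH_02,TRAINCLE,A15,102-55665
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
EOF

# Group equal keys together, then keep only one output file open at a time.
sort -t, -k2,2 test.csv |
awk -F, '
    $2 != prev {                      # key changed: switch output file
        if (out != "") close(out)     # release the previous file
        out = $2 ".csv"
        prev = $2
    }
    { print >> out }                  # append rather than truncate
'
```

Because only one file is open at any moment, the number of distinct keys no longer matters to awk.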

You can do it in a fairly simple fashion by keeping a counter for the filename and using sprintf to create the filename for each successive group of records. You use FNR (the file record number) to distinguish between the first and subsequent records.
For example:
$ sort -t, -k2 file.csv |
awk -F, -v cnt=1 -v fn="testsort_1.csv" '
    FNR==1 {
        prev=$2
        print $0 > fn
    }
    FNR>1 {
        if ($2!=prev) {
            cnt++
            fn=sprintf("%s_%d.csv", "testsort", cnt)
        }
        print $0 > fn
        prev=$2
    }'
(Note: the initial filename is set as a variable up front, and all subsequent filenames are created from cnt (the count) using sprintf. prev tracks the second field of the previous record; fn is the current output filename.)
A shorter version of the same script, declaring prev as a variable initially, would be:
sort -t, -k2 file.csv |
awk -F, -v cnt=0 -v prev="" '{
    if ($2!=prev) {
        cnt++
        fn = "testsort_" cnt ".csv"
        prev=$2
    }
    print $0 > fn
}'
If you do not wish to have sequentially numbered files, but instead want the "testsort_number.csv" name taken from the sorted records, look at @Cyrus's now-deleted answer, which provides an excellent (and shorter) solution in that regard. (I see you already have a great answer.)
Example Use/Output
With your input in file.csv, the following output files would be created:
$ for i in testsort_{1..3}.csv; do printf "\n%s\n" $i; cat $i; done
testsort_1.csv
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A10,PWR-120
WH_01,TRAINAMS,A101,PWR-120
testsort_2.csv
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859

Why does awk not work in a script file when selecting something between two files

In my project, I have two files.
The content of file1 is like:
bme-zhangyl
chem-abbott
chem-hef
chem-lijun
chem-liuch
chem-lix
chem-nisf
chem-quanm
chem-sunli
chem-taohq
chem-wanggc
chem-wangyg
The content of file2 is like:
bme-zhangyl bme-zhangmm
phy-dongert phy-zhangwq
chem-lijun phy-zhangwq
ls-liulj bio-chenw
phy-zhangyb phy-zhangwq
mee-xingw mee-rongym
cs-likm cs-hisao
cs-nany cs-hisao
cs-pengym cs-hisao
chem-quanm cs-hisao
cs-likq cs-hisao
cs-wujx cs-liuyp
mse-mar mse-liangyy
ccse-xiezy ccse-xiezy
maad-chensm maad-wanmp
Now I have a script file with the following content:
#!/bash/sh
for i in $(cat file1)
do
groupname=`awk '($1=='"$i"'){print $2}' file2`
echo $groupname
done
Unfortunately, it displays nothing.
I have tried another way:
#!/bash/sh
for i in $(cat file1)
do
groupname=`awk '{if($1=='"$i"')print $2}' file2`
echo $groupname
done
and
#!/bash/sh
for i in $(cat file1)
do
groupname=`awk '{if($1==$i)print $2}' file2`
echo $groupname
done
They all fail. Nothing seems wrong to me; can anyone help?
The correct output should be:
bme-zhangmm
phy-zhangwq
cs-hisao

Using bare awk:
$ awk 'NR==FNR{a[$1];next}$1 in a{print $2}' file1 file2
Output:
bme-zhangmm
phy-zhangwq
cs-hisao
Explained:
$ awk '
NR==FNR {       # hash file1 strings
    a[$1]
    next
}
$1 in a {       # if file2 field 1 keyword was hashed from file1
    print $2    # output word from field 2
}' file1 file2
Updated: As a script:
#!/bin/sh
awk 'NR==FNR{a[$1];next}$1 in a{print $2}' file1 file2
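As for why the loop in the question prints nothing: once the shell has spliced $i into the program, awk sees $1==bme-zhangyl, which it reads as the difference of two unset variables rather than as a string, and in the last variant $i is an awk field reference, not the shell variable. If a per-name loop is still wanted, passing the value with -v is one way to do it. A minimal sketch; the names key, name and result are illustrative, and the sample files are shortened versions of those in the question:

```shell
# Work in a scratch directory so the sample files stay out of the way.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > file1 <<'EOF'
bme-zhangyl
chem-abbott
chem-lijun
chem-quanm
EOF

cat > file2 <<'EOF'
bme-zhangyl bme-zhangmm
phy-dongert phy-zhangwq
chem-lijun phy-zhangwq
cs-likm cs-hisao
chem-quanm cs-hisao
EOF

# Read file1 line by line and hand each name to awk as the awk variable key,
# so awk compares it as a string instead of parsing it as an expression.
result=$(
    while IFS= read -r name; do
        awk -v key="$name" '$1 == key { print $2 }' file2
    done < file1
)
printf '%s\n' "$result"
```

This starts awk once per name, so the single-pass command above remains the faster option for long lists.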

I have tested:
groupname=`awk '{if($1=="'$i'") print $2}' UGfrompwdguprst`
It works OK.

How to divide a list with the help of awk

I need help. I have a file vm.list:
VM-NAME1|WEEKDAY|2|
VM-NAME2|WEEKDAY|4|
VM-NAME3|WEEKDAY|3|
VM-NAME4|WEEKDAY|4|
VM-NAME5|WEEKDAY|4|
VM-NAME6|WEEKDAY|1|
VM-NAME7|WEEKDAY|1|
VM-NAME8|WEEKDAY|4|
VM-NAME9|WEEKDAY|2|
VM-NAME10|WEEKDAY|4|
VM-NAME11|WEEKDAY|4|
I need to divide the list into new lists depending on the value in the third field, and then run an action on each:
LIST1:
VM-NAME6
VM-NAME7
LIST2:
VM-NAME1
VM-NAME9
LIST3:
VM-NAME3
LIST4:
VM-NAME2
VM-NAME4
VM-NAME5
VM-NAME8
VM-NAME10
VM-NAME11
This is what I have so far:
for i in $(awk -F "|" '{print $3}' today.list | sort | uniq)
do echo $i
awk -F "|" '{ if ($3 == '$i') print $1 }' today.list
done
I understand that it is incorrect, but I am out of ideas.

Give this awk one-liner a try:
awk -F'|' '{a[$3]=a[$3]RS$1}END{for(x in a)print "List"x":" a[x]}' file

A version that expects the source file to be ordered by columns 3 and 1 (handled by sort in the process substitution). Each name is printed one record late, so the END block prints the last one.
$ awk -F\| '{ print q ($3!=p?ORS "LIST" $3 ":":"") }{ p=$3; q=$1 } END { print q }' <(sort -t\| -k3 -k1 file)
LIST1:
VM-NAME6
VM-NAME7
LIST2:
VM-NAME1
VM-NAME9
LIST3:
VM-NAME3
LIST4:
VM-NAME10
VM-NAME11
VM-NAME2
VM-NAME4
VM-NAME5
VM-NAME8
It won't store the whole record set in memory.

If you want the output in separate files, named List{1..4}, you can use this:
$ awk -F'|' '{print $1 > ("List" $3)}' vm.list
$ head -n100 List* # just a quick hack to print them with a header...
==> List1 <==
VM-NAME6
VM-NAME7
==> List2 <==
VM-NAME1
VM-NAME9
==> List3 <==
VM-NAME3
==> List4 <==
VM-NAME2
VM-NAME4
VM-NAME5
VM-NAME8
VM-NAME10
VM-NAME11

how to use any bash command to cut a file and save all columns in a different file

Good day,
I was wondering how to cut a file and save each part in a different file, where the delimiter is ]
example:
TOfile1 ] TOfile2
Thanks in advance

Using awk:
$ awk -F' []] ' '{for(i=1;i<=NF;i++) print $i > ("file" i)}' input
$ head file*
==> file1 <==
TOfile1
==> file2 <==
TOfile2
The field separator is set to ' []] ', that is, a ] with a space on each side. Since ] is a special character, we put it inside a character class so that it is treated literally.
We iterate over all fields, storing each one in a separate file. This lets the answer create as many files as there are fields, not just two.
Hence, if your input is something like the following:
$ cat input
TOfile1 ] TOfile2 ] Tofile3 ] Tofile4
TOfile1 ] TOfile2
$ awk -F' []] ' '{for(i=1;i<=NF;i++) print $i > ("file" i)}' input
$ head file*
==> file1 <==
TOfile1
TOfile1
==> file2 <==
TOfile2
TOfile2
==> file3 <==
Tofile3
==> file4 <==
Tofile4

sed -e 's/.*]//' myfile > TOFile2
sed -e 's/].*//' myfile > TOFile1

This appears to be precisely the standard use-case for the cut utility. Without further ado:
cut --delimiter=']' --fields=1 input.txt > TOFile1.txt
cut --delimiter=']' --fields=2 input.txt > TOFile2.txt
I'm using the long option names (GNU cut) here for readability. The short versions are -d and -f respectively.

awk: filter a file with another file

I'm trying to filter a file with another file.
I have a file d3_tmp and m2p_tmp; They are as follows:
$ cat d3_tmp
0x000001 0x4d 2
0x1107ce 0x4e 2
0x111deb 0x6b 2
$ cat m2p_tmp
mfn=0x000001 ==> pfn=0xffffffffffffffff
mfn=0x000002 ==> pfn=0xffffffffffffffff
mfn=0x000003 ==> pfn=0xffffffffffffffff
I want to print out the lines in m2p_tmp whose second column is not equal to the first column of d3_tmp. (The files are split with \t and =)
So the desired result is:
mfn=0x000002 ==> pfn=0xffffffffffffffff
mfn=0x000003 ==> pfn=0xffffffffffffffff
However, after I use the following awk command:
awk -F '[\t=]' ' FNR==NR { print $1; a[$1]=1; next } !($2 in a){printf "%s \t 0\n", $2}' d3_tmp m2p_tmp
The result is:
0x000001
0x1107ce
0x111deb
0x000001 0
0x000002 0
0x000003 0
I'm not sure why "$2 in a" does not work.
Could anyone help?
Thank you very much!

Using awk
awk 'NR==FNR{for (i=1;i<=NF;i++) a[$i];next} !($2 in a)' d3_tmp FS="[ =]" m2p_tmp
a[$i] collects every item in d3_tmp into array a; NR==FNR restricts that collection to the first file, d3_tmp.
In the second part, FS is set to a space or "=", and $2 of each line in m2p_tmp is checked against array a; the line is printed only if $2 is not in a.
The question has been edited, so I have to change the code as well.
awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp
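The FS=... placed between the two file names is an ordinary awk variable assignment operand: awk applies it only when it reaches that point in the argument list, so d3_tmp is split on default whitespace and only m2p_tmp uses the custom separator. A small sketch of that behavior, assuming d3_tmp is tab-separated as the question states (the scratch directory, the array name seen and the result variable are illustrative):

```shell
# Work in a scratch directory so the sample files stay out of the way.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# d3_tmp: tab-separated columns.
printf '0x000001\t0x4d\t2\n0x1107ce\t0x4e\t2\n0x111deb\t0x6b\t2\n' > d3_tmp

cat > m2p_tmp <<'EOF'
mfn=0x000001 ==> pfn=0xffffffffffffffff
mfn=0x000002 ==> pfn=0xffffffffffffffff
mfn=0x000003 ==> pfn=0xffffffffffffffff
EOF

# While reading d3_tmp, FS is still the default, so $1 is the first column.
# The FS assignment takes effect before m2p_tmp is opened, making $2 the
# value that follows "mfn=".
result=$(awk 'NR==FNR { seen[$1]; next } !($2 in seen)' d3_tmp FS='[ \t=]' m2p_tmp)
printf '%s\n' "$result"
```

Setting FS with -F or -v instead would apply the same separator to both files, which is fine only when both files can be split the same way.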

awk -v FS="[\t= ]" ' FNR==NR { a[$1]=$1; next } !($2 in a){print $0}' d3_tmp m2p_tmp
mfn=0x000002 ==> pfn=0xffffffffffffffff
mfn=0x000003 ==> pfn=0xffffffffffffffff
