Extract subset of a feed file with custom delimiter and create CSV file - shell

I get a feed file in the format below.
employee_id||034100151730105|L|
employee_cd||03410015|L|
dept_id||1730105|L|
dept_name||abc|L|
employee_firstname||pqr|L|
employee_lastname||ppp|L|
|R||L|
employee_id||034100151730108|L|
employee_cd||03410032|L|
dept_id||4230105|L|
dept_name||fdfd|L|
employee_firstname||sasas|L|
employee_lastname||dfdf|L|
|R||L|
.....
Is there an easy Unix script to extract a subset of the fields and create a CSV like the one below?
employee_cd,employee_firstname,dept_name
03410015,pqr,abc
03410032,sasas,fdfd
.....

I would suggest an awk solution (assuming the dept_name item always comes before the employee_firstname item):
awk -F'|' 'BEGIN{OFS=","; print "employee_cd,employee_firstname,dept_name";}
$1~/employee_cd|employee_firstname|dept_name/{ a[++c]=$3 }
END { for(i=1;i<length(a);i+=3) print a[i],a[i+2],a[i+1] }' file
The output:
employee_cd,employee_firstname,dept_name
03410015,pqr,abc
03410032,sasas,fdfd
Solution details:
OFS="," - sets the output field separator
$1~/employee_cd|employee_firstname|dept_name/ - matches lines whose first column is one of the needed items
a[++c]=$3 - captures each item value, indexed by its position in the sequence
for(i=1;i<length(a);i+=3) print a[i],a[i+2],a[i+1] - prints the captured values in threes, reordered to employee_cd, employee_firstname, dept_name
To save the output as a .csv file, append a redirection to the command:
the above command > output.csv
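If the field order ever changes, a variant that keys each value by its name and flushes on the record terminator avoids the positional assumption. A minimal sketch, assuming every record ends with the |R||L| line:
awk -F'|' '
BEGIN { print "employee_cd,employee_firstname,dept_name" }
$1 == "employee_cd" || $1 == "employee_firstname" || $1 == "dept_name" { v[$1] = $3 }
$2 == "R" {                        # the |R||L| line terminates a record
    printf "%s,%s,%s\n", v["employee_cd"], v["employee_firstname"], v["dept_name"]
    split("", v)                   # clear the captured values for the next record
}' file > output.csv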


How to get paragraphs of text by index number

I am wondering if there is a way to get paragraphs of text (the source file would be a pyx file) by number, the way sed does with lines:
sed -n ${i}p
At the moment I'd be interested in using awk with:
awk '/custom-pyx-tag\(/,/\)custom-pyx-tag/'
but I can't find documentation or examples about that.
I'm also trying to trim "\r\n" with gsub(/\r\n/,"; ") in the same awk command, but it doesn't work and I can't figure out why.
Any hint would be very appreciated, thanks
EDIT:
This is just one example and not my exact need, but I need to know how to do it for a multipurpose project.
Let's say I have exported the ID3 tags of a huge collection of audio files and stored them in a pyx-like format, so in the end I have one big file with this pattern repeating for each file in the collection:
audio-genre(
blablabla
)audio-genre
audio-artist(
bla.blabla
)audio-artist
audio-album(
bla-bla-bla
)audio-album
audio-track-num(
0x
)audio-track-num
audio-track-title(
bla.bla-bla
)audio-track-title
audio-lyrics(
blablablablabla
bla.bla.bla.bla
blah-blah-blah
blabla-blabla
)audio-lyrics
...
Now if I want to extract the artist of the 1234th audio file I can use:
awk '/audio-artist\(/, /)audio-artist/' | sed '/audio-artist/d' | sed -n 1234p
Since the artist is a single line it can be obtained with sed, but I don't know how to get an entire paragraph given its index. For example, if I want the lyrics of the 6543rd file, how could I do it?
In the end it is just a question of whether there is a command equivalent to
sed -n ${num}p
but to be used for paragraphs
awk -v indx=1234 '
BEGIN { RS = "" }                           # paragraph mode; the file has no blank lines, so it is read as one record
{
    split($0, arr, "audio-artist")          # split the record on the tag name
    for (i = 2; i <= length(arr); i += 2) { # every other chunk holds a value
        gsub("[()]", "", arr[i])            # strip the brackets
        arts[cnt += 1] = arr[i]             # index the artists by position
    }
}
END { print arts[indx] }' audioartist
One-liner:
awk -v indx=1234 'BEGIN {RS=""} NR==1 { split($0,arr,"audio-artist");for (i=2;i<=length(arr);i=i+2) { gsub("[()]","",arr[i]);arts[cnt+=1]=arr[i] } } END { print arts[indx] }' audioartist
Using awk on the file audioartist, we read the whole file as a single record by setting the record separator (RS) to "" (the file contains no blank lines, so paragraph mode leaves it in one piece). We then split the record into an array arr on the separator audio-artist. We walk arr from index 2 to the end in steps of 2, strip the opening and closing brackets, and build another array arts with an incrementing count as the index and the stripped artist as the value. At the END we print the arts entry specified by the passed indx variable (1234 in this case).
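The same approach generalizes from single-line values to multi-line paragraphs such as the lyrics. A minimal sketch that treats the opening and closing tag lines as block markers (tag, indx, and the filename collection.pyx are illustrative values, assuming each tag sits alone on its own line):
awk -v tag="audio-lyrics" -v indx=6543 '
$0 == tag "(" { if (++cnt == indx) on = 1; next }  # opening marker: count the blocks
$0 == ")" tag { if (on) exit; next }               # closing marker: stop once the wanted block is printed
on                                                 # while inside the wanted block, print its lines
' collection.pyx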

In bash extract select properties from a standard property file to a single delimited line?

In bash:
1) For a given groupname of interest, and
2) a list of keys of interest, for which we want a table of values, for this groupname,
3) read in a set of files, like those in /usr/share/applications (see simplified example below),
4) and produce a delimited table, with one line per file, and one field for each given key.
EXAMPLE
inputs
We want only the values of the Name and Exec keys, from only [Desktop Entry] groups, and from one or more files, like these:
[Desktop Entry]
Name=Root
Comment=Opens
Exec=e2
..
[Desktop Entry]
Comment=Close
Name=Root2
output
Two lines, one per input file, each in a delimited <Name>,<Exec> format, ready for import into a database:
Root,e2
Root2,
Each input file is:
One or more blocks of lines, each introduced by a [some-groupname] line.
Below each [.*] are one or more standard, unsorted key=value pairs.
Not every block contains the same set of keys.
[Forgive me if I am asking for a solution to an old problem, but I can't seem to find a good, quick bash way to do this. Yes, I could code it up with some while and read loops, etc., but surely it's been done before.]
Similar to this Q, but a more general answer is wanted.
If awk is an option, would you please try the following:
awk -v RS="[" -v FS="\n" '  # split the file into records on "[" and each record into fields on "\n"
{
    name = ""; exec = ""                              # reset the variables
    if ($1 == "Desktop Entry]") {                     # the groupname matches
        for (i = 2; i <= NF; i++) {                   # loop over the "key=value" lines
            if (sub(/^Name=/, "", $i)) name = $i      # the line starts with "Name="
            else if (sub(/^Exec=/, "", $i)) exec = $i # the line starts with "Exec="
        }
        print name "," exec
    }
}' file
You can feed multiple files as file1 file2 file3, dir/file* or whatever.
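For the original use case, the same program can be saved to a file (desktop.awk is just an illustrative name) and run over the whole directory in one pass, assuming the entries live in .desktop files:
awk -f desktop.awk /usr/share/applications/*.desktop > table.csv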

Converting CSV file to multiline text file

I have file which looks like following:
C_DocType_ID,SOReference,DocumentNo,ProductValue,Quantity,LineDescription,C_Tax_ID,TaxAmt
1000000,1904093563U,1904093563U,5210-1,1,0,1000000,0
1000000,1904093563U,1904093563U,6511,2,0,1000000,0
1000000,1904093563U,1904093563U,5001,1,0,1000000,0
1000000,1904083291U,1904083291U,5310,4,0,1000000,0
1000000,1904083291U,1904083291U,5311,3,0,1000000,0
1000000,1904083291U,1904083291U,6101,6,0,1000000,0
1000000,1904083291U,1904083291U,6102,1,0,1000000,0
1000000,1904083291U,1904083291U,6106,6,0,1000000,0
I need to convert it to text file which looks like this:
WOH~1.0~~1904093563Utest~~~ORD~~~~
WOL~~~5210-1~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6511~~~~~~~~2~~~~~~~~~~~~~~~~~~~~~
WOL~~~5001~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOH~1.0~~1904083291Utest~~~ORD~~~~~~
WOL~~~5310~~~~~~~~4~~~~~~~~~~~~~~~~~~~~~
WOL~~~5311~~~~~~~~3~~~~~~~~~~~~~~~~~~~~~
WOL~~~6101~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
WOL~~~6102~~~~~~~~1~~~~~~~~~~~~~~~~~~~~~
WOL~~~6106~~~~~~~~6~~~~~~~~~~~~~~~~~~~~~
The output file has a header record and line-item records. The header record contains the SOReference and some hardcoded fields; each line-item record contains the ProductValue and Quantity associated with that SOReference. The input file has 2 unique SOReferences, which is why the output file contains 2 header records and their associated line-item records.
Can this be done from the command line (awk/sed)? I have a series of files like this one which need to be converted to text.
With AWK, please try the following:
awk -F, '
FNR==1 { next }                                  # skip the header line
{
    if ($2 != prevcol2)                          # SOReference changed: emit a header record
        printf("WOH~1.0~~%stest~~~ORD~~~~\n", $2)
    printf("WOL~~~%s~~~~~~~~%s~~~~~~~~~~~~~~~~~~~~~\n", $4, $5)
    prevcol2 = $2
}' file.csv
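Since there is a series of such files, a simple wrapper loop can batch the conversion; a sketch assuming the awk program above is saved as convert.awk (an illustrative name):
for f in *.csv; do
    awk -f convert.awk "$f" > "${f%.csv}.txt"   # one .txt output per input .csv
done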

How to split a CSV file into multiple files based on column value

I have CSV file which could look like this:
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
There could be more or fewer rows, and I need to split it into multiple .dat files, each containing the rows that share the same value in the second column. (Then I will make a bar chart from each .dat file.) In this case there should be two files:
data1.dat
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
data2.dat
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
Is there any simple way of doing it with bash?
You can use awk to generate a file containing only the rows with a particular value in the second column:
awk -F ';' '($2==1){print}' data.dat > data1.dat
Just change the value in the $2== condition.
Or, if you want to do this automatically, just use:
awk -F ';' '{print > ("data"$2".dat")}' data.dat
which will output to files containing the value of the second column in the name.
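One caveat: if the second column has many distinct values, awk variants other than gawk can run out of open file descriptors. A sketch that works around this by closing each file after appending:
awk -F ';' '{ out = "data" $2 ".dat"; print >> out; close(out) }' data.dat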
Try this:
while IFS=";" read -r a b c; do echo "$a;$b;$c" >> "data${b}.dat"; done <file

comparing csv files

I want to write a shell script to compare two .csv files. The first contains filename,path; the second contains filename,path,target. Now I want to compare the two .csv files and output the target name where the file from the first .csv exists in the second .csv file.
Ex.
a.csv
build.xml,/home/build/NUOP/project1
eesX.java,/home/build/adm/acl
b.csv
build.xml,/home/build/NUOP/project1,M1
eesX.java,/home/build/adm/acl,M2
ddexse3.htm,/home/class/adm/33eFg
I want the output to be something like this:
M1 and M2
Please help
Thanks,
If you don't necessarily need a shell script, you can easily do it in Python like this:
import csv

# collect every (filename, path) row from a.csv
seen = set()
for row in csv.reader(open('a.csv')):
    seen.add(tuple(row))

# print the target for each b.csv row whose first two fields were seen
for row in csv.reader(open('b.csv')):
    if tuple(row[:2]) in seen:
        print(row[2])
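If you would rather stay in the shell, the same set-membership idea fits in a single awk call; a minimal sketch that matches on the first two comma-separated fields:
awk -F, 'NR==FNR { seen[$1 FS $2]; next }   # load the a.csv rows as keys
($1 FS $2) in seen { print $3 }             # print the target for matching b.csv rows
' a.csv b.csv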
If those M1 and M2 are always at fields 3 and 5, you can try this:
awk -F"," 'FNR==NR{
split($3,b," ")
split($5,c," ")
a[$1]=b[1]" "c[1]
next
}
($1 in a){
print "found: " $1" "a[$1]
}' file2.txt file1.txt
output
$ cat file2.txt
build.xml,/home/build/NUOP/project1,M1 eesX.java,/home/build/adm/acl,M2 ddexse3.htm,/home/class/adm/33eFg
filename, blah,M1 blah, blah, M2 blah , end
$ cat file1.txt
build.xml,/home/build/NUOP/project1 eesX.java,/home/build/adm/acl
$ ./shell.sh
found: build.xml M1 M2
Try http://sourceforge.net/projects/csvdiff/
Quote:
csvdiff is a Perl script to diff/compare two csv files with the possibility to select the separator. Differences will be shown like: "Column XYZ in record 999" is different. After this, the actual and the expected result for this column will be shown.
