I want to generate more data based on some sample data I already have in a file stored on a Unix system.
I'm looking for Unix shell code to do this.
ID,FN,LN,Gender
1,John,hopkins,M
2,Andrew,Singh,M
3,Ram,Lakshman,M
4,ABC,DEF,F
5,Virendra,Sehwag,F
6,Sachin,Tendulkar,F
You could use awk to read the existing data into an array and then keep printing it over and over with new IDs:
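awk -F, -v OFS=, -v n=100 '
BEGIN {
    l = 0                        # number of data rows read so far
}
/^[0-9]/ {
    a[l] = $2 "," $3 "," $4      # remember everything except the ID
    l++
}
{ print }                        # echo the header and the existing rows
END {
    # existing IDs are 1..l, so new IDs start at l + 1
    for (i = l + 1; i <= n; i++) {
        printf "%d,%s\n", i, a[(i - 1) % l]
    }
}
' file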
n is the number of IDs you want (existing IDs + generated).
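For comparison, here is a minimal Python sketch of the same idea: keep the existing rows, then cycle through their non-ID columns with fresh IDs until there are n rows. The function name extend_rows is made up for illustration, and the sketch assumes the first line is a header and the existing IDs are simply 1 through the number of rows.

```python
import csv
import io
from itertools import cycle


def extend_rows(text, n):
    """Return CSV text with the original rows plus generated ones, up to ID n.

    Assumption: the first line is a header and existing IDs are 1..k, no gaps.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(data)
    if data:  # nothing to cycle through if the sample has no data rows
        templates = cycle(row[1:] for row in data)  # drop the old ID column
        for new_id in range(len(data) + 1, n + 1):
            writer.writerow([new_id] + next(templates))
    return out.getvalue()


sample = "ID,FN,LN,Gender\n1,John,hopkins,M\n2,Andrew,Singh,M\n3,Ram,Lakshman,M\n"
print(extend_rows(sample, 7), end="")
```

With three sample rows and n=7, the generated rows get IDs 4 to 7 and reuse the names in their original order.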
I have a file with delimited integers which I've extracted from elsewhere and dumped into a file. Some lines contain a range, as per the below:
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3-9,10 have problems
Cars 1-5,5-10 are in the depot
Trains 1-10 are on time
Any way to expand the ranges on the text file so that it returns each individual number, with the , delimiter preserved? The text either side of the integers could be anything, and I need it preserved.
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
I guess this can be done relatively easily with awk, or with any other scripting language. Any help is very much appreciated.
You haven't tagged with perl but I'd recommend it in this case:
perl -pe 's/(\d+)-(\d+)/join(",", $1..$2)/ge' file
This substitutes all occurrences of one or more digits, followed by a hyphen, followed by one or more digits. It uses the numbers it has captured to create a list from the first number to the second and joins the list on a comma.
The e modifier is needed here so that an expression can be evaluated in the replacement part of the substitution.
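If Perl isn't an option, the same "evaluate the replacement" trick can be sketched in Python, where re.sub accepts a function in place of a replacement string. This is only an illustration; expand_ranges is a name I made up.

```python
import re

RANGE = re.compile(r"(\d+)-(\d+)")


def expand_ranges(line):
    """Replace every 'a-b' with 'a,a+1,...,b', leaving other text untouched."""
    def repl(m):
        lo, hi = int(m.group(1)), int(m.group(2))
        return ",".join(str(i) for i in range(lo, hi + 1))
    return RANGE.sub(repl, line)


print(expand_ranges("Users 1,2,3-9,10 have problems"))
print(expand_ranges("Cars 1-5,5-10 are in the depot"))
```

Like the one-liner, this leaves the duplicated 5 in the Cars line, because each range is expanded independently.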
To avoid repeated values and to sort the list, things get a little more complicated. At this point, I'd recommend using a script, rather than a one-liner:
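use strict;
use warnings;
use List::MoreUtils qw(uniq);

while (<>) {
    # expand every range, e.g. 3-9 becomes 3,4,5,6,7,8,9
    s/(\d+)-(\d+)/join(",", $1..$2)/ge;
    # $1 = leading text, $2 = number list, $4 = trailing text (newline excluded)
    if (/(.*\s)((\d+,)+\d+)(.*)/) {
        my @list = sort { $a <=> $b } uniq split(",", $2);
        $_ = $1 . join(",", @list) . $4 . "\n";
    }
} continue {
    print;
}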
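After expanding the ranges (as in the one-liner), I re-parse the line to extract the list of values. I use uniq from List::MoreUtils to remove any duplicates, then sort the values numerically. Note that List::MoreUtils is a CPAN module rather than a core one; recent versions of the core List::Util module also export uniq.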
Call the script like perl script.pl file.
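The same expand, de-duplicate and sort sequence can also be sketched in Python. This is an illustration rather than a translation of the script: the names expand and normalise are made up, and it tidies every comma-separated run of numbers on a line instead of only one.

```python
import re

RANGE = re.compile(r"(\d+)-(\d+)")
NUM_LIST = re.compile(r"\d+(?:,\d+)+")  # two or more comma-separated integers


def expand(m):
    """Turn a matched 'a-b' into 'a,a+1,...,b'."""
    lo, hi = int(m.group(1)), int(m.group(2))
    return ",".join(map(str, range(lo, hi + 1)))


def normalise(line):
    """Expand ranges, then sort and de-duplicate each list of numbers."""
    expanded = RANGE.sub(expand, line)

    def tidy(m):
        values = sorted({int(v) for v in m.group(0).split(",")})
        return ",".join(map(str, values))

    return NUM_LIST.sub(tidy, expanded)


print(normalise("Cars 1-5,5-10 are in the depot"))
```

Using a set drops the repeated 5, and sorted() puts the values back in numeric order before they are joined.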
A solution using GNU awk (it relies on gawk's four-argument split() and asort()):
{
    result = "";
    delete numbers;                  # start each line with an empty set
    count = split($0, fields, /[ ,-]+/, seps);
    for (i = 1; i <= count; i++) {
        if (fields[i] ~ /^[0-9]+$/) {
            if (seps[i] == ",") {
                numbers[fields[i] + 0] = fields[i] + 0;
            } else if (seps[i] == "-") {
                # range: include both endpoints
                for (j = fields[i] + 0; j <= fields[i+1] + 0; j++) {
                    numbers[j] = j;
                }
            } else if (seps[i] == " ") {
                # end of the number list: add the last value, sort, and emit
                numbers[fields[i] + 0] = fields[i] + 0;
                c = asort(numbers);
                for (r = 1; r < c; r++) {
                    result = result numbers[r] ",";
                }
                result = result numbers[c] " ";
                delete numbers;
            }
        } else {
            result = result fields[i] seps[i];
        }
    }
    print result;
}
$ cat tst.awk
match($0,/[0-9,-]+/) {
    split(substr($0,RSTART,RLENGTH),numsIn,/,/)
    numsOut = ""
    delete seen
    for (i=1;i in numsIn;i++) {
        n = split(numsIn[i],range,/-/)
        for (j=range[1]; j<=range[n]; j++) {
            if ( !seen[j]++ ) {
                numsOut = (numsOut=="" ? "" : numsOut ",") j
            }
        }
    }
    print substr($0,1,RSTART-1) numsOut substr($0,RSTART+RLENGTH)
}
$ awk -f tst.awk file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
Another awk solution:
$ awk '{while(match($0, /[0-9]+-[0-9]+/))
{k=substr($0, RSTART, RLENGTH);
split(k,a,"-");
f=a[1];
for(j=a[1]+1; j<=a[2]; j++) f=f","j;
sub(k,f)}}1' file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
Note that Cars 1-5,5-10 ends up with two 5 values when expanded, because the ranges overlap.
I am attempting to extract data between the nth occurrence of 2 patterns.
Pattern 1: CardDetail
Pattern 2: ]
The input file, input.txt, has thousands of lines that vary in what each line contains. The lines I'm concerned with grabbing data from will always contain CardDetail somewhere in the line. Finding the matching lines is easy enough using awk, but pulling the data between each match and placing each result on its own line is where I'm falling short.
input.txt contains data about network gear and any attached/child devices. It looks something like this:
DeviceDetail [baseProductId=router-5000, cardDetail=[CardDetail [baseCardId=router-5000NIC1, cardDescription=Router 5000 NIC, cardSerial=5000NIC1], CardDetail [baseCardId=router-5000NIC2, cardDescription=Router 5000 NIC, cardSerial=5000NIC2]], deviceSerial=5000PRIMARY, deviceDescription=Router 5000 Base Model]
DeviceDetail [baseProductId=router-100, cardDetail=[CardDetail [baseCardId=router-100NIC1, cardDescription=Router 100 NIC, cardSerial=100NIC1], CardDetail [baseCardId=router-100NIC2, cardDescription=Router 100 NIC, cardSerial=100NIC2]], deviceSerial=100PRIMARY, deviceDescription=Router 100 Base Model]
* UPDATE: I forgot to mention in the initial post that I also need the device's PARENT serials (deviceSerial) listed with them as well. *
What I would like the output.txt to look like is something like this:
"router-5000NIC1","Router 5000 NIC","5000NIC1","5000PRIMARY"
"router-5000NIC2","Router 5000 NIC","5000NIC2","5000PRIMARY"
"router-100NIC1","Router 100 NIC","100NIC1","100PRIMARY"
"router-100NIC2","Router 100 NIC","100NIC2","100PRIMARY"
The number of occurrences of CardDetail on a single line could vary from 0 to hundreds depending on the device. I need to be able to extract all of the data by field between each occurrence of CardDetail and the next occurrence of ] and put each set of fields on its own line in CSV format.
If you have gawk or mawk available, you can do this by (mis)using the record and field splitting capabilities:
awk -v RS='CardDetail *\\[' -v FS='[=,]' -v OFS=',' -v q='"' '
NR > 1 { sub("\\].*", ""); print q $2 q, q $4 q, q $6 q }' input.txt
Output:
"router-5000NIC1","Router 5000 NIC","5000NIC1"
"router-5000NIC2","Router 5000 NIC","5000NIC2"
"router-100NIC1","Router 100 NIC","100NIC1"
"router-100NIC2","Router 100 NIC","100NIC2"
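Is this sufficient? Note that it does not print the cardSerial values, because the second grep only matches values that are followed by a comma.
$ grep -P -o "(?<=CardDetail).*?(?=\])" input.txt | grep -P -o "(?<=\=).*?(?=\,)"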
router-5000NIC1
Router 5000 NIC
router-5000NIC2
Router 5000 NIC
router-100NIC1
Router 100 NIC
router-100NIC2
Router 100 NIC
Here is an example that uses regular expressions. If there are minor variations in the text format, this will handle them. Also this collects all the values in an array; you could then do further processing (sort values, remove duplicates, etc.) if you wish.
#!/usr/bin/awk -f
# Requires GNU awk: the three-argument match() is a gawk extension.
BEGIN {
    i_result = 0
    DQUOTE = "\""
}
{
    line = $0
    for (;;)
    {
        i = match(line, /CardDetail \[ *([^]]*) *\]/, a)
        if (0 == i)
            break
        # a[1] has the text from the parentheses
        s = a[1]
        # replace from this: a, b, c to this: "a","b","c"
        gsub(/ *, */, "\",\"", s)
        s = DQUOTE s DQUOTE
        results[i_result++] = s
        # continue scanning after the end of this match
        line = substr(line, RSTART + RLENGTH)
    }
}
END {
    for (i = 0; i < i_result; ++i)
        print results[i]
}
P.S. Just for fun I made a Python version.
#!/usr/bin/env python3
import re
import sys

DQUOTE = '"'
pat_card = re.compile(r"CardDetail \[ *([^]]*) *\]")
pat_comma = re.compile(r" *, *")
results = []

def collect_cards(line, results):
    while True:
        m = pat_card.search(line)
        if not m:
            return
        s = m.group(1)
        # replace from this: a, b, c to this: "a","b","c"
        s = DQUOTE + pat_comma.sub('","', s) + DQUOTE
        results.append(s)
        # continue scanning after the end of this match
        line = line[m.end():]

if __name__ == "__main__":
    for line in sys.stdin:
        collect_cards(line, results)
    for card in results:
        print(card)
EDIT: Here's a new version that also looks for "deviceSerial" (the parent device's serial in the sample input) and puts the matched text as the first field.
In AWK you concatenate strings just by putting them next to each other in an expression; there is an implicit concatenation operator when two strings are side by side. So this gets the deviceSerial text into a variable called s0, using concatenation to put double quotes around it, then later uses concatenation to put s0 at the start of the matched string.
#!/usr/bin/awk -f
# Requires GNU awk: the three-argument match() is a gawk extension.
BEGIN {
    i_result = 0
    DQUOTE = "\""
    COMMA = ","
}
{
    line = $0
    # the parent serial appears once per line, so look it up before the loop
    match(line, /deviceSerial=([A-Za-z_0-9]*),/, a)
    s0 = DQUOTE a[1] DQUOTE
    for (;;)
    {
        i = match(line, /CardDetail \[ *([^]]*) *\]/, a)
        if (0 == i)
            break
        # a[1] has the text from the parentheses
        s = a[1]
        # replace from this: foo=a, bar=b, other=c to this: a, b, c
        gsub(/[A-Za-z_][^=,]*=/, "", s)
        # replace from this: a, b, c to this: a","b","c
        gsub(/ *, */, "\",\"", s)
        s = s0 COMMA DQUOTE s DQUOTE
        results[i_result++] = s
        # continue scanning after the end of this match
        line = substr(line, RSTART + RLENGTH)
    }
}
END {
    for (i = 0; i < i_result; ++i)
        print results[i]
}
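For the layout requested in the update (parent deviceSerial as the last column), a minimal Python sketch might look like the following. The function name card_rows is made up, and the sketch assumes every CardDetail block is a flat "key=value, key=value" list whose values contain no commas, brackets or double quotes.

```python
import re

CARD = re.compile(r"CardDetail \[([^\]]*)\]")
SERIAL = re.compile(r"deviceSerial=([^,\]]*)")


def card_rows(line):
    """Yield one quoted CSV row per CardDetail block, parent serial last."""
    m = SERIAL.search(line)
    parent = m.group(1) if m else ""
    for card in CARD.finditer(line):
        # Each block looks like "key=value, key=value, ..."; keep the values.
        pairs = card.group(1).split(", ")
        values = [p.split("=", 1)[1] for p in pairs if "=" in p]
        yield ",".join('"%s"' % v for v in values + [parent])


sample = (
    "DeviceDetail [baseProductId=router-5000, cardDetail=[CardDetail "
    "[baseCardId=router-5000NIC1, cardDescription=Router 5000 NIC, "
    "cardSerial=5000NIC1], CardDetail [baseCardId=router-5000NIC2, "
    "cardDescription=Router 5000 NIC, cardSerial=5000NIC2]], "
    "deviceSerial=5000PRIMARY, deviceDescription=Router 5000 Base Model]"
)
for row in card_rows(sample):
    print(row)
```

Lines with no CardDetail block simply yield nothing, so the function can be called on every line of input.txt.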
Try this (the multi-character RS requires gawk or mawk):
#awk -f myawk.sh temp.txt
BEGIN { RS="CardDetail"; FS="[=,]"; OFS=","; print "Begin Processing "}
$0 ~ /baseCardId/ {gsub("]","",$0);print $2, $4 , $6}
END {print "Process Complete"}