How to remove duplicate entries from a file using shell

I have a file that is in the format:
0000000540|Q1.1|margi|Q1.1|margi|Q1.1|margi
0099940598|Q1.2|8888|Q1.3|5454|Q1.2|8888
0000234223|Q2.10|saigon|Q3.9|tango|Q1.1|money
I am trying to remove the duplicates that appear on the same line.
So, if a line has
0000000540|Q1.1|margi|Q1.1|margi|Q1.1|margi
I'd like it to be
0000000540|Q1.1|margi
If the line has
0099940598|Q1.2|8888|Q1.3|5454|Q1.2|8888
I'd like it to be
0099940598|Q1.2|8888|Q1.3|5454
I would like to do this in a shell script that takes an input file and outputs the file without the duplicates.
Thanks in advance to anyone who can help.

This should do it but may not be efficient for large files.
awk '
{
    delete p;
    n = split($0, a, "|");
    printf("%s", a[1]);
    for (i = 2; i <= n; i++)
    {
        if (!(a[i] in p))
        {
            printf("|%s", a[i]);
            p[a[i]] = "";
        }
    }
    printf "\n";
}
' YourFileName
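If you want this wrapped as the requested script, a minimal sketch (the names dedup.awk, dedup.sh, input.txt and output.txt are illustrative) is to save the body of the awk program above, everything between the single quotes, as dedup.awk and call it from a wrapper:
#!/bin/sh
# dedup.sh - strip duplicate |-separated fields from each line of the input
# Usage: ./dedup.sh input.txt > output.txt
awk -f dedup.awk "$1"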

Related

I have a file in Unix with a data set as below. I want to generate more data like this, but with no duplicates. Looking for Unix shell code. Below is a sample.

I want to generate more data based on some sample data I already have in a file stored in a Unix location.
Looking for Unix shell code.
ID,FN,LN,Gender
1,John,hopkins,M
2,Andrew,Singh,M
3,Ram,Lakshman,M
4,ABC,DEF,F
5,Virendra,Sehwag,F
6,Sachin,Tendulkar,F
You could use awk to read the existing data into an array and then keep printing it over and over with new IDs:
awk -F, -v OFS=, -v n=100 '
BEGIN {
    l = 0;
}
/^[0-9]/ {
    a[l] = $2 "," $3 "," $4;
    l++;
}
{ print }
END {
    # start at l + 1 so the generated IDs continue after the existing ones
    # instead of repeating the last existing ID
    for ( i = l + 1; i <= n; i++ ) {
        printf "%d,%s\n", i, a[i % l];
    }
}
'
n is the number of IDs you want (existing IDs + generated).
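If the awk program is saved to a file, say generate.awk (an illustrative name), an invocation might look like this, with data.csv holding the sample rows and more_data.csv receiving the originals plus the generated ones:
awk -F, -v OFS=, -v n=100 -f generate.awk data.csv > more_data.csv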

Compress ranges of numbers in bash

I have a csv file named "ranges.csv", which contains:
start_range,stop_range
9702220000,9702220999
9702222000,9702222999
9702223000,9702223999
9750000000,9750000999
9750001000,9750001999
9750002000,9750002999
I am trying to combine consecutive ranges where one row's stop_range equals the next row's start_range - 1, and output the result in another csv file named "ranges2.csv". So the output will be:
9702220000,9702220999
9702222000,9702223999
9750000000,9750002999
Moreover, I need to know how many original ranges a compressed range contains (example: for the new range 9750000000,9750002999 I need to know that before the compression there were 3 ranges). This information will help me create a new csv file named "ranges3.csv", which should contain only the compressed range built from the most original ranges (the most comprehensive area):
9750000000,9750002999
I was thinking about something like this:
if (stop_range = start_range-1)
new_stop_range = start_range-1
But I am not very smart and I am new to bash scripting.
I know how to output the results in another file but the function for what I need gives me headaches.
I think this does the trick:
#!/bin/bash
awk '
BEGIN { FS = OFS = "," }
NR == 2 {
    start = $1; stop = $2; i = 0
}
NR > 2 {
    if ($1 == (stop + 1)) {
        i++;
        stop = $2
    } else {
        # close the previous compressed range; ++i is its final row count
        if (++i > max) {
            maxr = start "," stop;
            max = i
        }
        start = $1
        i = 0
    }
    stop = $2
}
END {
    if (++i > max) {
        maxr = start "," stop;
    }
    print maxr
}
' ranges.csv
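Saved as, say, compress.sh (an illustrative name), running it over the sample data should print the most comprehensive range, ready to be redirected into ranges3.csv:
$ ./compress.sh > ranges3.csv
$ cat ranges3.csv
9750000000,9750002999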
Assuming your ranges are sorted, this code gives you the merged ranges only:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e; b=e="" }
($1==e+1){ e=$2; next }
{ b=$1; e=$2 }
END { print b,e }' file
Below you get the same but with the range count:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){print b,e,c; b=e=c="" }
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { print b,e,c }' file
If you want the largest one, you can sort on the third column. I don't want to hard-code a rule that picks the range with the highest count, as there might be ties.
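For example, with the counting program above saved as count_ranges.awk (an illustrative name), a pipeline along these lines would pick one winner; head -n 1 keeps a single range even when there are ties:
awk -f count_ranges.awk ranges.csv | sort -t, -k3,3nr | head -n 1 | cut -d, -f1,2 > ranges3.csv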
If you really only want all the ranges with the maximum merge:
awk 'BEGIN{FS=OFS=","}
(FNR>1) && ($1!=e+1){
a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
b=e=c=""
}
($1==e+1){ e=$2; c++; next }
{ b=$1; e=$2; c=1 }
END { a[c] = a[c] (a[c]?ORS:"") b OFS e
m=(c>m?c:m)
print a[m]
}' file

How do I make a list of missing integers from a sequence using bash

I have a file, let's say files_190911.csv, whose contents are as follows.
EDR_MPU023_09_20190911080534.csv.gz
EDR_MPU023_10_20190911081301.csv.gz
EDR_MPU023_11_20190911083544.csv.gz
EDR_MPU023_14_20190911091405.csv.gz
EDR_MPU023_15_20190911105513.csv.gz
EDR_MPU023_16_20190911105911.csv.gz
EDR_MPU024_50_20190911235332.csv.gz
EDR_MPU024_51_20190911235400.csv.gz
EDR_MPU024_52_20190911235501.csv.gz
EDR_MPU024_54_20190911235805.csv.gz
EDR_MPU024_55_20190911235937.csv.gz
EDR_MPU025_24_20190911000050.csv.gz
EDR_MPU025_25_20190911000155.csv.gz
EDR_MPU025_26_20190911000302.csv.gz
EDR_MPU025_29_20190911000624.csv.gz
I want to make a list of the missing sequence numbers using a bash script.
Every MPUXXX has its own sequence, so there are multiple series of sequences in that file.
The datetime for each missing entry should be taken from the previous sequence number.
From the sample above, the result will be like this.
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
It would be simpler if there were only a single sequence;
then I could use something like this:
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}'
But I know this can't be used for multiple sequences.
EDITED (Thanks @Cyrus!)
AWK is your friend:
#!/usr/bin/awk -f
BEGIN {
    FS = "[^0-9]*"
    last_seq = 0;
    next_serial = 0;
}
{
    cur_seq = $2;
    cur_serial = $3;
    if (cur_seq != last_seq) {
        # new MPU unit: start a fresh sequence
        last_seq = cur_seq;
    } else if (cur_serial != next_serial) {
        # hole in the sequence: print the missing names, reusing the
        # previous line's timestamp
        for (i = next_serial; i < cur_serial; i++) {
            print "EDR_MPU" last_seq "_" i "_" ts ".csv.gz"
        }
    }
    ts = $4;
    next_serial = cur_serial + 1;
}
And then you do:
$ < files_190911.csv awk -f script.awk
EDR_MPU023_12_20190911083544.csv.gz
EDR_MPU023_13_20190911083544.csv.gz
EDR_MPU024_53_20190911235501.csv.gz
EDR_MPU025_27_20190911000302.csv.gz
EDR_MPU025_28_20190911000302.csv.gz
The assignment to FS splits each line on that regex, so the fields are the runs of digits. The rest of the program detects holes in the sequences and prints them with the appropriate timestamp.
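To see what that field splitting does, here is a quick demonstration (not part of the script) on one of the sample names:
$ echo 'EDR_MPU023_09_20190911080534.csv.gz' | awk 'BEGIN { FS = "[^0-9]*" } { print $2, $3, $4 }'
023 09 20190911080534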

Expand range of numbers in file

I have a file of delimited integers which I've extracted from elsewhere. Some lines contain a range, as per the below:
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3-9,10 have problems
Cars 1-5,5-10 are in the depot
Trains 1-10 are on time
Is there any way to expand the ranges in the text file so that each individual number is returned, with the , delimiter preserved? The text on either side of the integers could be anything, and I need it preserved.
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
I guess this can be done relatively easily with awk, or any other scripting language. Any help very much appreciated.
You haven't tagged with perl but I'd recommend it in this case:
perl -pe 's/(\d+)-(\d+)/join(",", $1..$2)/ge' file
This substitutes all occurrences of one or more digits, followed by a hyphen, followed by one or more digits. It uses the numbers it has captured to create a list from the first number to the second and joins the list on a comma.
The e modifier is needed here so that an expression can be evaluated in the replacement part of the substitution.
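For example, running it against one of the sample lines should give:
$ echo 'Users 1,2,3-9,10 have problems' | perl -pe 's/(\d+)-(\d+)/join(",", $1..$2)/ge'
Users 1,2,3,4,5,6,7,8,9,10 have problems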
To avoid repeated values and to sort the list, things get a little more complicated. At this point, I'd recommend using a script, rather than a one-liner:
use strict;
use warnings;
use List::MoreUtils qw(uniq);

while (<>) {
    s/(\d+)-(\d+)/join(",", $1..$2)/ge;
    if (/(.*\s)((\d+,)+\d+)(.*)/) {
        my @list = sort { $a <=> $b } uniq split(",", $2);
        $_ = $1 . join(",", @list) . $4 . "\n";
    }
} continue {
    print;
}
After expanding the ranges (like in the one-liner), I've re-parsed the line to extract the list of values. I've used uniq from List::MoreUtils (not a core module, so it may need to be installed) to remove any duplicates and sorted the values.
Call the script like perl script.pl file.
A solution using GNU awk (the four-argument split() and asort() used here are gawk extensions):
{
    result = "";
    delete numbers;    # reset the collected numbers for each line
    count = split($0, fields, /[ ,-]+/, seps);
    for (i = 1; i <= count; i++) {
        if (fields[i] ~ /^[0-9]+$/) {
            if (seps[i] == ",") {
                numbers[fields[i]] = fields[i];
            } else if (seps[i] == "-") {
                # expand the range, including its starting number
                for (j = fields[i]; j <= fields[i+1]; j++) {
                    numbers[j] = j;
                }
            } else if (seps[i] == " ") {
                numbers[fields[i]] = fields[i];
                c = asort(numbers);
                for (r = 1; r < c; r++) {
                    result = result numbers[r] ",";
                }
                result = result numbers[c] " ";
            }
        } else {
            result = result fields[i] seps[i];
        }
    }
    print result;
}
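Because of those gawk extensions, run it with GNU awk; assuming the program is saved as expand.awk (an illustrative name):
gawk -f expand.awk file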
$ cat tst.awk
match($0,/[0-9,-]+/) {
split(substr($0,RSTART,RLENGTH),numsIn,/,/)
numsOut = ""
delete seen
for (i=1;i in numsIn;i++) {
n = split(numsIn[i],range,/-/)
for (j=range[1]; j<=range[n]; j++) {
if ( !seen[j]++ ) {
numsOut = (numsOut=="" ? "" : numsOut ",") j
}
}
}
print substr($0,1,RSTART-1) numsOut substr($0,RSTART+RLENGTH)
}
$ awk -f tst.awk file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
Another awk approach:
$ awk '{while(match($0, /[0-9]+-[0-9]+/))
{k=substr($0, RSTART, RLENGTH);
split(k,a,"-");
f=a[1];
for(j=a[1]+1; j<=a[2]; j++) f=f","j;
sub(k,f)}}1' file
Files 1,2,3,4,5,6,7,8,9,10 are OK
Users 1,2,3,4,5,6,7,8,9,10 have problems
Cars 1,2,3,4,5,5,6,7,8,9,10 are in the depot
Trains 1,2,3,4,5,6,7,8,9,10 are on time
Note that Cars 1-5,5-10 ends up with two 5 values when expanded, due to the overlapping ranges.

Unable to increment the last 2 digits of a variable declared in a file using a script

I have the file given below:
elix554bx.xayybol.42> vi setup.REVISION
# Revision information
setenv RSTATE R24C01
setenv CREVISION X3
exit
My requirement is to read the RSTATE value from the file, increment its last 2 digits, and overwrite the setup.REVISION file in place.
Can you please suggest how to do this?
If you're using vim, then you can use the sequence:
/RSTATE/
$<C-a>:x
The first line is followed by a return and searches for RSTATE. The second line jumps to the end of the line and uses Control-a (shown as <C-a> above, and in the vim documentation) to increment the number. Repeat as often as you want to increment the number. The :x is also followed by a return and saves the file.
The only tricky bit is that the leading 0 on the number makes vim think the number is in octal, not decimal. You can override that by using :set nrformats= followed by return to turn off octal and hex; the default value is nrformats=octal,hex.
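Putting that together, the whole session would look like this (the :set, search, and :x lines each end with a return; $ and <C-a> are plain normal-mode keystrokes):
:set nrformats=
/RSTATE
$
<C-a>
:x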
You can learn an awful lot about vim from the book Practical Vim: Edit Text at the Speed of Thought by Drew Neil. This information comes from Tip 10 in chapter 2.
Here's an awk one-liner type solution:
awk '{
    if ( $0 ~ /RSTATE/ ) {
        match($0, /[0-9]+$/);
        sub( /[0-9]+$/,
             sprintf( "%0" RLENGTH "d", substr($0, RSTART, RLENGTH) + 1 ),
             $0 );
    }
    print;
}' setup.REVISION > tmp$$
mv tmp$$ setup.REVISION
Returns:
setenv RSTATE R24C02
setenv CREVISION X3
exit
This will handle transitions from two to three to more digits appropriately.
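As a quick check of that rollover claim, a one-liner using the same match/sub idiom should turn R24C99 into R24C100:
$ echo 'setenv RSTATE R24C99' | awk '{ match($0, /[0-9]+$/); sub(/[0-9]+$/, sprintf("%0" RLENGTH "d", substr($0, RSTART, RLENGTH) + 1)) } 1'
setenv RSTATE R24C100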
I wrote a C# class for you.
class Reader
{
    public string ReadRs(string fileWithPath)
    {
        string keyword = "RSTATE";
        string rs = "";
        if (File.Exists(fileWithPath))
        {
            StreamReader reader = File.OpenText(fileWithPath);
            try
            {
                string line = "";
                bool found = false;
                // Stop at end of stream so a missing keyword cannot loop forever
                while (!reader.EndOfStream && !found)
                {
                    line = reader.ReadLine();
                    if (line.Contains(keyword))
                    {
                        found = true;
                    }
                }
                if (found)
                {
                    int index = line.IndexOf(keyword);
                    rs = line.Substring(index + keyword.Length + 1, line.Length - 1 - (index + keyword.Length));
                }
            }
            catch (IOException)
            {
                // Error
            }
            finally
            {
                reader.Close();
            }
        }
        return rs;
    }

    public int GetLastTwoDigits(string rsState)
    {
        int digits = -1;
        try
        {
            int length = rsState.Length;
            // Get the last two digits of the RSTATE
            digits = Int32.Parse(rsState.Substring(length - 2, 2));
        }
        catch (FormatException)
        {
            // Format error
            digits = -1;
        }
        return digits;
    }
}
You can use it like this:
Reader reader = new Reader();
string rsstate = reader.ReadRs("C://test.txt");
int digits = reader.GetLastTwoDigits(rsstate);
