How to delete a few rows of data from a text file using shell scripting, based on some conditions

I have a text file with more than 100k rows. The data below is a sample of the file I have. I want to apply some conditions to this data and delete certain rows. The text file does not actually have headers (ID, NAME, code-1, code-2, code-3); I added them only for reference. How can I achieve this with shell scripting?
Input test file:
| ID | NAME | Code-1 | code-2 | code-3 |
| $$ | 5HF | 1E | N | Y |
| $$ | 2MU | 3C | N | Y |
| $$ | 32E | 3C | N | N |
| AB | 3CH | 3C | N | N |
| MK | A1M | AS | P | N |
| $$ | Y01 | 01 | F | Y |
| $$ | BG0 | 0G | F | N |
Conditions:
If code-2 = 'N' and code-1 is not in ('3C', '3B', '32', '31', '3D'), keep the row only if ID = '$$'.
If code-2 = 'N' and code-1 is in ('3C', '3B', '32', '31', '3D'), accept any ID, but accept ID = '$$' only if code-3 = 'Y'.
If code-2 != 'N', accept all other IDs, and accept ID = '$$' only if code-3 = 'Y'.
Output:
| ID | NAME | Code-1 | code-2 | code-3 |
| $$ | 5HF | 1E | N | Y |
| $$ | 2MU | 3C | N | Y |
| AB | 3CH | 3C | N | N |
| MK | A1M | AS | P | N |
| $$ | Y01 | 01 | F | Y |

You're encouraged to demonstrate your own efforts when asking questions, but I understand this can be complicated if you are new to Bash. Here is my solution using awk. It took 0.545s to process 137k lines on my computer (with moderate specs).
awk '{
    ID=$2; NAME=$4; CODE1=$6; CODE2=$8; CODE3=$10;
    if (CODE2 == "N") {
        if (CODE1 ~ /(3C|3B|32|31|3D)/) {
            if (ID == "$$") {
                if (CODE3 == "Y") {
                    print;
                }
            }
            else {
                print;
            }
        }
        else {
            if (ID == "$$") {
                print;
            }
        }
    }
    else {
        if (ID == "$$") {
            if (CODE3 == "Y") {
                print;
            }
        }
        else {
            print;
        }
    }
}' file
Note it has certain restrictions:
a) It delimits values by spaces, not |. It will work with your exact input format, but won't work with rows that lack the surrounding spaces, e.g.
|$$|32E|3C|N|N|
|AB|3CH|3C|N|N|
b) For the same reason, the command will produce incorrect results if a column value contains extra spaces, e.g.
| $$ | 32E FOO | 3C | N | N |
| AB | 3CH BBT | 3C | N | N |
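If you cannot rely on the exact spacing around the pipes, a variant that splits on the | characters themselves and trims whitespace from each field avoids both restrictions. This is only a sketch under the assumptions that the file has exactly the five data columns shown and is named file; it also anchors the code-1 test so that a value like A3C does not match 3C:
awk -F'|' '
function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
{
    # With FS="|", $1 is the text before the first pipe (empty here),
    # so the data columns are $2..$6: ID, NAME, code-1, code-2, code-3.
    id = trim($2); code1 = trim($4); code2 = trim($5); code3 = trim($6)
    if (code2 == "N") {
        if (code1 ~ /^(3C|3B|32|31|3D)$/) {
            if (id != "$$" || code3 == "Y") print
        } else if (id == "$$") {
            print
        }
    } else if (id != "$$" || code3 == "Y") {
        print
    }
}' file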

Use AWK with delimiter to print specific columns

My file looks as follows:
+------------------------------------------+---------------+----------------+------------------+------------------+-----------------+
| Message | Status | Adress | Changes | Test | Calibration |
|------------------------------------------+---------------+----------------+------------------+------------------+-----------------|
| Hello World | Active | up | 1 | up | done |
| Hello Everyone Here | Passive | up | 2 | down | none |
| Hi there. My name is Eric. How are you? | Down | up | 3 | inactive | done |
+------------------------------------------+---------------+----------------+------------------+------------------+-----------------+
+----------------------------+---------------+----------------+------------------+------------------+-----------------+
| Message | Status | Adress | Changes | Test | Calibration |
|----------------------------+---------------+----------------+------------------+------------------+-----------------|
| What's up? | Active | up | 1 | up | done |
| Hi. I'm Otilia | Passive | up | 2 | down | none |
| Hi there. This is Marcus | Up | up | 3 | inactive | done |
+----------------------------+---------------+----------------+------------------+------------------+-----------------+
I want to extract a specific column using AWK.
I can use cut to do it; however, since the width of each table varies depending on how many characters are in each column, I'm not getting the desired output.
cat File.txt | cut -c -44
+------------------------------------------+
| Message |
|------------------------------------------+
| Hello World |
| Hello Everyone Here |
| Hi there. My name is Eric. How are you? |
+------------------------------------------+
+----------------------------+--------------
| Message | Status
|----------------------------+--------------
| What's up? | Active
| Hi. I'm Otilia | Passive
| Hi there. This is Marcus | Up
+----------------------------+--------------
or
cat File.txt | cut -c 44-60
+---------------+
| Status |
+---------------+
| Active |
| Passive |
| Down |
+---------------+
--+--------------
| Adress
--+--------------
| up
| up
| up
--+--------------
I tried using awk, but I don't know how to specify two different delimiters that would handle all the lines.
cat File.txt | awk 'BEGIN {FS="|";}{print $2,$3}'
Message Status
------------------------------------------+---------------+----------------+------------------+------------------+-----------------
Hello World Active
Hello Everyone Here Passive
Hi there. My name is Eric. How are you? Down
Message Status
----------------------------+---------------+----------------+------------------+------------------+-----------------
What's up? Active
Hi. I'm Otilia Passive
Hi there. This is Marcus Up
The output I'm looking for:
+------------------------------------------+
| Message |
|------------------------------------------+
| Hello World |
| Hello Everyone Here |
| Hi there. My name is Eric. How are you? |
+------------------------------------------+
+----------------------------+
| Message |
|----------------------------+
| What's up? |
| Hi. I'm Otilia |
| Hi there. This is Marcus |
+----------------------------+
or
+------------------------------------------+---------------+
| Message | Status |
|------------------------------------------+---------------+
| Hello World | Active |
| Hello Everyone Here | Passive |
| Hi there. My name is Eric. How are you? | Down |
+------------------------------------------+---------------+
+----------------------------+---------------+
| Message | Status |
|----------------------------+---------------+
| What's up? | Active |
| Hi. I'm Otilia | Passive |
| Hi there. This is Marcus | Up |
+----------------------------+---------------+
or random other columns
+------------------------------------------+----------------+------------------+
| Message | Adress | Test |
|------------------------------------------+----------------+------------------+
| Hello World | up | up |
| Hello Everyone Here | up | down |
| Hi there. My name is Eric. How are you? | up | inactive |
+------------------------------------------+----------------+------------------+
+----------------------------+---------------+------------------+
| Message |Adress | Test |
|----------------------------+---------------+------------------+
| What's up? |up | up |
| Hi. I'm Otilia |up | down |
| Hi there. This is Marcus |up | inactive |
+----------------------------+---------------+------------------+
Thanks in advance.
One idea using GNU awk:
awk -v fldlist="2,3" '
BEGIN { fldcnt=split(fldlist,fields,",") } # split fldlist into array fields[]
{ split($0,arr,/[|+]/,seps) # split current line on dual delimiters "|" and "+"
for (i=1;i<=fldcnt;i++) # loop through our array of fields (fldlist)
printf "%s%s", seps[fields[i]-1], arr[fields[i]] # print leading separator/delimiter and field
printf "%s\n", seps[fields[fldcnt]] # print trailing separator/delimiter and terminate line
}
' File.txt
NOTES:
requires GNU awk for the 4th argument to the split() function (seps == array of separators; see gawk string functions for details)
assumes our field delimiters (|, +) do not show up as part of the data
the input variable fldlist is a comma-delimited list of columns that mimics what would be passed to cut (e.g., when a line starts with a delimiter, field #1 is blank)
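As a quick illustration of that 4th split() argument (a throwaway sketch with a made-up input line, not part of the solution):
echo '| Hello World | Active |' |
gawk '{
    n = split($0, arr, /[|+]/, seps)    # arr[] gets the fields, seps[] the matched delimiters
    for (i = 1; i <= n; i++)
        printf "arr[%d]=[%s]  seps[%d]=[%s]\n", i, arr[i], i, seps[i]
}'
Note that arr[1] comes out empty because the line starts with a delimiter, which is why field #1 is blank in fldlist terms.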
For fldlist="2,3" this generates:
+------------------------------------------+---------------+
| Message | Status |
|------------------------------------------+---------------+
| Hello World | Active |
| Hello Everyone Here | Passive |
| Hi there. My name is Eric. How are you? | Down |
+------------------------------------------+---------------+
+----------------------------+---------------+
| Message | Status |
|----------------------------+---------------+
| What's up? | Active |
| Hi. I'm Otilia | Passive |
| Hi there. This is Marcus | Up |
+----------------------------+---------------+
For fldlist="2,4,6" this generates:
+------------------------------------------+----------------+------------------+
| Message | Adress | Test |
|------------------------------------------+----------------+------------------+
| Hello World | up | up |
| Hello Everyone Here | up | down |
| Hi there. My name is Eric. How are you? | up | inactive |
+------------------------------------------+----------------+------------------+
+----------------------------+----------------+------------------+
| Message | Adress | Test |
|----------------------------+----------------+------------------+
| What's up? | up | up |
| Hi. I'm Otilia | up | down |
| Hi there. This is Marcus | up | inactive |
+----------------------------+----------------+------------------+
For fldlist="4,3,2" this generates:
+----------------+---------------+------------------------------------------+
| Adress | Status | Message |
+----------------+---------------|------------------------------------------+
| up | Active | Hello World |
| up | Passive | Hello Everyone Here |
| up | Down | Hi there. My name is Eric. How are you? |
+----------------+---------------+------------------------------------------+
+----------------+---------------+----------------------------+
| Adress | Status | Message |
+----------------+---------------|----------------------------+
| up | Active | What's up? |
| up | Passive | Hi. I'm Otilia |
| up | Up | Hi there. This is Marcus |
+----------------+---------------+----------------------------+
Say that again? (fldlist="3,3,3"):
+---------------+---------------+---------------+
| Status | Status | Status |
+---------------+---------------+---------------+
| Active | Active | Active |
| Passive | Passive | Passive |
| Down | Down | Down |
+---------------+---------------+---------------+
+---------------+---------------+---------------+
| Status | Status | Status |
+---------------+---------------+---------------+
| Active | Active | Active |
| Passive | Passive | Passive |
| Up | Up | Up |
+---------------+---------------+---------------+
And if you make the mistake of trying to print the '1st' column, i.e., fldlist="1":
+
|
|
|
|
|
+
+
|
|
|
|
|
+
If GNU awk is available, please try markp-fuso's nice solution.
If not, here is a POSIX-compliant alternative:
#!/bin/bash
# define bash variables
cols=(2 3 6)                            # bash array of desired columns
col_list=$(IFS=,; echo "${cols[*]}")    # create a csv string

awk -v cols="$col_list" '
NR==FNR {
    if (match($0, /^[|+]/)) {           # the record contains a table
        if (match($0, /^[|+]-/))        # horizontally ruled line
            n = split($0, a, /[|+]/)    # split into columns
        else                            # "cell" line
            n = split($0, a, /\|/)
        len = 0
        for (i = 1; i < n; i++) {
            len += length(a[i]) + 1     # accumulated column position
            pos[FNR, i] = len
        }
    }
    next
}
{
    n = split(cols, a, /,/)             # split the variable `cols` on comma into an array
    for (i = 1; i <= n; i++) {
        col = a[i]
        if (pos[FNR, col] && pos[FNR, col+1]) {
            printf("%s", substr($0, pos[FNR, col], pos[FNR, col + 1] - pos[FNR, col]))
        }
    }
    print(substr($0, pos[FNR, col + 1], 1))
}
' file.txt file.txt
Result with cols=(2 3 6) as shown above:
+---------------+----------------+-----------------+
| Status | Adress | Calibration |
+---------------+----------------+-----------------|
| Active | up | done |
| Passive | up | none |
| Down | up | done |
+---------------+----------------+-----------------+
+---------------+----------------+-----------------+
| Status | Adress | Calibration |
+---------------+----------------+-----------------|
| Active | up | done |
| Passive | up | none |
| Up | up | done |
+---------------+----------------+-----------------+
It detects the column widths in the 1st pass, then splits each line at those column positions in the 2nd pass.
You can control the columns to print with the bash array cols, which is assigned at the beginning of the script. Please assign the array the list of desired column numbers in increasing order. If you want to use the bash variable in a different way, please let me know.
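The reason file.txt appears twice on the awk command line is the standard two-pass idiom: NR==FNR is only true while the first copy is being read, so the first block records the column positions and next skips the second block; on the second read those positions are applied. A minimal sketch of the idiom itself (unrelated to the table logic, just to show the mechanism):
awk 'NR==FNR { len[FNR] = length($0); next }                        # 1st pass: remember each line length
     { printf "line %d is %d characters long\n", FNR, len[FNR] }    # 2nd pass: use it
' file.txt file.txt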

How to split a row where there are 2 values in each cell, separated by a carriage return?

Someone gives me a file that sometimes contains badly formed data.
The data should look like this:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
+---------+-----------+--------+
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
Sometimes it comes like this, though:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
As you can see, Nicole and Mark are put in the same row, with their data separated by a carriage return.
I can split by rows, but that duplicates the data:
+---------+-----------+--------+
| Nicole | N | 12 |
| | M | 22 |
+---------+-----------+--------+
| Mark | N | 12 |
| | M | 22 |
+---------+-----------+--------+
This makes me lose the fact that Mark is associated with the "2nd row" of data.
(The data here is purely an example)
One way to do this is to transform each cell into a list by doing a Text.Split on the line feed / carriage return symbol.
TextSplit = Table.TransformColumns(Source,
{
{"Name", each Text.Split(_,"#(lf)"), type text},
{"Initial", each Text.Split(_,"#(lf)"), type text},
{"Age", each Text.Split(_,"#(lf)"), type text}
}
)
Now each column is a list of lists, which you can combine into one long list using List.Combine, and you can glue these columns together into a table with Table.FromColumns.
= Table.FromColumns(
{
List.Combine(TextSplit[Name]),
List.Combine(TextSplit[Initial]),
List.Combine(TextSplit[Age])
},
{"Name", "Initial", "Age"}
)
Putting this together, the whole query looks like this:
let
Source = <Your data source>,
TextSplit = Table.TransformColumns(Source,{{"Name", each Text.Split(_,"#(lf)"), type text},{"Initial", each Text.Split(_,"#(lf)"), type text},{"Age", each Text.Split(_,"#(lf)"), type text}}),
FromColumns = Table.FromColumns({List.Combine(TextSplit[Name]),List.Combine(TextSplit[Initial]),List.Combine(TextSplit[Age])},{"Name","Initial","Age"})
in
FromColumns

Ruby adding empty strings to hash for CSV spacing

I have:
hash = {"1"=>["A", "B", "C", ... "Z"], "2"=>["B", "C"], "3"=>["A", "C"]}
My goal is to use the hash as a source for creating a CSV whose column names are the letters of the alphabet and whose rows are the hash keys 1, 2, 3, etc.
I created an array with hash.values.unshift("") whose values serve as row 1 (the column labels).
desired output:
| A | B | C | ... | Z |
1| A | B | C | ... | Z |
2| | B | C | ....... |
3| A | | C | ....... |
Creating CSV:
CSV.open("groups.csv", 'w') do |csv|
csv << row1
hash.each do |v|
csv << v.flatten
end
end
This makes the CSV look almost like what I want, but there is no spacing to make the columns align.
Any advice on how to write a method that modifies my hash by comparing my full [A-Z] list against each hash key's values (rows) and inserting empty strings to provide the spacing?
Can the CSV class do it better?
Something like this?
require 'csv'

ALPHA = ('A'..'Z').to_a.freeze
hash = {"1"=>ALPHA, "2"=>["B", "C"], "3"=>["A", "C"]}

csv = CSV.generate("", col_sep: "|") do |csv|
  csv << [" "] + ALPHA # header
  hash.each do |k, v|
    alphabet = ALPHA.map { |el| [el, 0] }.to_h
    v.each { |el| alphabet[el] += 1 }
    csv << [k, *alphabet.map { |k, val| val == 1 ? k : " " }]
  end
end

csv.split("\n").each { |row| puts row }
output:
|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
1|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
2| |B|C| | | | | | | | | | | | | | | | | | | | | | |
3|A| |C| | | | | | | | | | | | | | | | | | | | | | |
If your values are truly single characters and don't need the CSV escaping, then I recommend bypassing CSV altogether and building the string in plain Ruby.
Assuming you want to align your lines correctly regardless of the number of digits in the row number (e.g. 1, 10, and 100), you can use printf-style formatting to guarantee horizontal alignment (assuming your row number width never exceeds the value of ROWNUM_WIDTH).
By the way, I changed the hash's keys to integers, hope that's ok.
#!/usr/bin/env ruby

FIELDS = ('A'..'Z').to_a
DATA = { 1 => FIELDS, 2 => %w(B C), 3 => %w(A C) }
ROWNUM_WIDTH = 3

output = ' ' * ROWNUM_WIDTH + " | #{FIELDS.join(' | ')} |\n"
DATA.each do |rownum, values|
  line = "%*d | " % [ROWNUM_WIDTH, rownum]
  FIELDS.each do |field|
    char = values.include?(field) ? field : ' '
    line << "#{char} | "
  end
  output << line << "\n"
end
puts output
=begin
Outputs:
| A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
1 | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
2 | | B | C | | | | | | | | | | | | | | | | | | | | | | | |
3 | A | | C | | | | | | | | | | | | | | | | | | | | | | | |
=end
all = [*?A..?Z]
hash = {"1"=>[*?A..?Z], "2"=>["B", "C"], "3"=>["A", "C"]}
hash.map do |k, v|
[k, *all.map { |k| v.include?(k) ? k : ' ' }]
end.unshift([' ', *all]).
map { |row| row.join('|') }
#⇒ [
# [0] " |A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z",
# [1] "1|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z",
# [2] "2| |B|C| | | | | | | | | | | | | | | | | | | | | | | ",
# [3] "3|A| |C| | | | | | | | | | | | | | | | | | | | | | | "
# ]

Can't iterate over array in Bash

I need to add a new column with an (ordinal) number after the last column in my table.
Both input and output files are .CSV tables.
The incoming table has more than 500,000 lines (rows) of data and 7 columns, e.g. https://www.dropbox.com/s/g2u68fxrkttv4gq/incoming_data.csv?dl=0
Incoming CSV table (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name |
-----------------
| 1 | Foo |
| 1 | Foo |
| 1 | Foo |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
Result CSV (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name | |
--------------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 4242 | Baz | 1 |
| 4242 | Baz | 2 |
| 4242 | Baz | 3 |
| 4242 | Baz | 4 |
| 702131 | Xyz | 1 |
| 702131 | Xyz | 2 |
| 702131 | Xyz | 3 |
| 702131 | Xyz | 4 |
The first column is the ID, so I've tried to group all lines with the same ID and iterate over them. My script (I don't know Bash scripting, to be honest):
FILE=$PWD/$1
# Delete header and extract IDs and delete non-unique values. Also change \n to ♥, because awk doesn't properly work with it.
IDS_ARRAY=$(awk -v FS="|" '{for (i=1;i<=NF;i++) if ($i=="\"") inQ=!inQ; ORS=(inQ?"♥":"\n") }1' $FILE | awk -F'|' '{if (NR!=1) {print $1}}' | awk '!seen[$0]++')

for id in $IDS_ARRAY; do
  # Group $FILE by $id from $IDS_ARRAY.
  cat $FILE | grep $id >> temp_mail_group.csv
  ROW_GROUP=$PWD/temp_mail_group.csv
  # Add a number after each row.
  # NF+1: add a column after the last existing one.
  awk -F'|' '{$(NF+1)=++i;}1' OFS="|" $ROW_GROUP >> "numbered_mails_$(date +%Y-%m-%d).csv"
  rm -f $PWD/temp_mail_group.csv
done
Right now this script works almost like I want it to, except that it thinks that (for example) IDs 2834 and 772834 are the same.
UPD: Although I marked one answer as accepted, it does not assign correct values to some groups of records with the same ID (right now I don't see a pattern).
You can do everything in a single script:
gawk 'BEGIN { FS="|"; OFS="|";}
/^-/ {print; next;}
$2 ~ /\s*id\s*/ {print $0,""; next;}
{print "", $2, $3, ++a[$2];}
'
$1 is the empty field before the first | in the input. I use an empty output column "" to get the leading |.
The trick is ++a[$2] which takes the second field in each row (= the ID column) and looks for it in the associative array a. If there is no entry, the result is 0. By pre-incrementing, we start with 1 and add 1 every time the ID reappears.
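To see that counting idiom in isolation (a throwaway sketch with made-up input):
printf '1\n1\n42\n42\n42\n' |
awk '{ print $1, ++seen[$1] }'
# prints: 1 1, 1 2, 42 1, 42 2, 42 3 - one running counter per distinct key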
Every time you write a loop in shell just to manipulate text you have the wrong approach. The guys who invented shell also invented awk for shell to call to manipulate text - don't disappoint them :-).
$ awk '
    BEGIN { w = 8 }
    {
        if (NR==1) {
            val = sprintf("%*s|",w,"")
        }
        else if (NR==2) {
            val = sprintf("%*s",w+1,"")
            gsub(/ /,"-",val)
        }
        else {
            val = sprintf(" %-*s|",w-1,++cnt[$2])
        }
        print $0 val
    }
' file
| id | Name | |
----------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 42 | Baz | 1 |
| 42 | Baz | 2 |
| 42 | Baz | 3 |
| 42 | Baz | 4 |
| 70 | Xyz | 1 |
| 70 | Xyz | 2 |
| 70 | Xyz | 3 |
| 70 | Xyz | 4 |
An awk way
This one does not bother extending the dotted line.
awk 'NR>2{$0=$0 (++a[$2])"|"}1' file
output
| id | Name |
-------------
| 1 | Foo |1|
| 1 | Foo |2|
| 1 | Foo |3|
| 42 | Baz |1|
| 42 | Baz |2|
| 42 | Baz |3|
| 42 | Baz |4|
| 70 | Xyz |1|
| 70 | Xyz |2|
| 70 | Xyz |3|
| 70 | Xyz |4|
Here's a way to do it with pure Bash:
inputfile=$1

prev_id=
while IFS= read -r line ; do
  printf '%s' "$line"
  IFS=$'| \t\n' read t1 id name t2 <<<"$line"
  if [[ $line == -* ]] ; then
    printf '%s\n' '---------'
  elif [[ $id == 'id' ]] ; then
    printf ' Number |\n'
  else
    if [[ $id != "$prev_id" ]] ; then
      id_count=0
      prev_id=$id
    fi
    printf '%2d |\n' "$(( ++id_count ))"
  fi
done <"$inputfile"

Format text in sphinx table cells

I have a table I am generating in Sphinx for comparing constructs in different languages. I would like the cells to contain code blocks in each language and have them come out looking like code (at least in a monospaced font). What I have so far is:
+-----------------------------+------------------------+
| Haskell | Scala |
+=============================+========================+
| | do var1<- expn1 | | for {var1 <- expn1; |
| | var2 <- expn2 | | var2 <- expn2; |
| | expn3 | | result <- expn3 |
| | | } yield result |
+-----------------------------+------------------------+
| | do var1 <- expn1 | | for {var1 <- expn1; |
| | var2 <- expn2 | | var2 <- expn2; |
| | return expn3 | | } yield expn3 |
+-----------------------------+------------------------+
| | do var1 <- expn1 >> expn2 | | for {_ <- expn1; |
| | return expn3 | | var1 <- expn2 |
| | | } yield expn3 |
+-----------------------------+------------------------+
This at least preserves line breaks, but it comes out in the same font as the rest of the document, which is a little annoying.
Is there any way to convert the cells to some better format?
Did you try using the .. code-block:: directive?
This works fine on my PC using Sphinx 1.4.1:
+----------------------------------+----------------------------------+
| Tweedledee | Tweedledum |
+----------------------------------+----------------------------------+
| .. code-block:: c | .. code-block:: c |
| :caption: foo.c | :caption: bar.c |
| | |
| extern int bar(int y); | extern int foo(int x); |
| int foo(int x) | int bar(int y) |
| { | { |
| return x > 0 ? bar(x-1)+1 | return y > 0 ? foo(x-1)*2 |
| : 0; | : 0; |
| } | } |
+----------------------------------+----------------------------------+
