Reformatting a text file from rows to columns - bash

I have multiple files in a directory that I need to reformat, putting the output into one file. The file structure is:
========================================================
Daily KPIs - DATE: 24/04/2013
========================================================
--------------------------------------------------------
Number of des = 5270
--------------------------------------------------------
Number of users = 210
--------------------------------------------------------
Number of active = 520
--------------------------------------------------------
Total non = 713
--------------------------------------------------------
========================================================
I need the output format to be:
Date,Numberofdes,Numberofusers,Numberofactive,Totalnon
24042013,5270,210,520,713
The directory has around 1,500 files with the same format, and I'm using CentOS 7.
Thanks

First we need a method to join the elements of an array into a string (cf. Join elements of an array?):
function join_array()
{
    local IFS=$1
    shift
    echo "$*"
}
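For example, a quick sanity check of the helper, using the values from the question:
$ join_array , 24042013 5270 210 520 713
24042013,5270,210,520,713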
Then we can cycle over the files and convert each one into a comma-separated list (assuming that the original files have names ending in .txt).
for f in *.txt
do
    sed -n 's/[^:=]\+[:=] *\(.*\)/\1/p' < "$f" | {
        mapfile -t fields
        join_array , "${fields[@]}"
    }
done
Here, the sed command looks inside each input file for lines that:
begin with a substring that contains neither a : nor a = character (the [^:=]\+ part);
then follow a : or a = and an arbitrary number of spaces (the [:=] * part);
finally, end with an arbitrary substring (the \(.*\) part).
The last substring is captured and printed instead of the original line. Any other line in the input files is discarded.
After that, the output of sed is read by mapfile into the indexed array variable fields (the -t ensures that trailing newlines from each line read are discarded) and finally the lines are joined thanks to our previously-defined join_array method.
The reason why we need to group mapfile together with the join after the pipe (the group runs in a subshell) is explained here: readarray (or pipe) issue.
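Putting it together, here is a minimal sketch (the output name combined.csv is my choice) that also prints the requested header line and strips the slashes from the date so it matches the desired 24042013 form:
echo "Date,Numberofdes,Numberofusers,Numberofactive,Totalnon" > combined.csv
for f in *.txt
do
    sed -n 's/[^:=]\+[:=] *\(.*\)/\1/p' < "$f" | {
        mapfile -t fields
        fields[0]=${fields[0]//\//}    # drop the slashes: 24/04/2013 -> 24042013
        join_array , "${fields[@]}"
    }
done >> combined.csv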

Related

Insert string variable value into the middle of another string variable's value in ksh

So I have a variable TRAILER which contains about 50 characters. This variable is defined earlier in my shell session; as you can probably tell, it's a trailer for a file we'll be sending. I need to insert the record count of that file into the trailer. This record count is going to be 9 digits long (left-padded with zeros if need be) and will start at index 2 of that string TRAILER. I want to retain all the other characters in the TRAILER string and just insert the RECORD_COUNT variable value into the TRAILER variable starting at index 2 (3rd character).
So the trailer variable is defined like this:
#Trailer details
TRAILER_RECORD_IDENTIFER="T"
LIFE_CYCLE="${LIFE_CYCLE_ENV}"
RECORD_COUNT="" #This will be calculated in the wrapper during the creation step
FILE_NUMBER="1111"
FILE_COUNT="1111"
CONTROL_TOTAL_1=" "
CONTROL_TOTAL_2=" "
CONTROL_TOTAL_3=" "
CONTROL_TOTAL_4=" "
CONTROL_TOTAL_5=" "
TRAILER="${TRAILER_RECORD_IDENTIFER}"\
"${LIFE_CYCLE}"\
"${RECORD_COUNT}"\
"${FILE_NUMBER}"\
"${FILE_COUNT}"\
"${CONTROL_TOTAL_1}"\
"${CONTROL_TOTAL_2}"\
"${CONTROL_TOTAL_3}"\
"${CONTROL_TOTAL_4}"\
"${CONTROL_TOTAL_5}"
Which then prints TRAILER as
TRAILER="TD11111111......" that would be 75 blank spaces for all of the white characters defined by the CONTROL_TOTAL variables.
These variables ALL get defined in the beginning of the shell. REcord count is defined but left blank ebcause we won't know the specific file until later int he shell.
Later in the shell i know the file that i want to use, i get the record coun:
cat ${ADE_DATA_FL_PATH_TMP} | wc -l | read ADE_DATA_FL_PATH_TMP_REC_COUNT >> ${LOG_FILE} 2>&1
Now I want to take ADE_DATA_FL_PATH_TMP_REC_COUNT and write that value into the TRAILER variable starting at the 2nd index, padded with zeros to be 9 characters long. So if my record count is 2700 records, the new trailer would look like...
TRAILER="TD00000270011111111......"
You can use printf for padding.
I use TD as the fixed first two characters; you can change this the way you want.
printf -v TRAILER "TD%.9d%s" "${ADE_DATA_FL_PATH_TMP_REC_COUNT}" "$(cut -c 12- <<< "${TRAILER}")"
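A quick illustration of the %.9d zero-padding, using the record count from the question:
$ printf "TD%.9d\n" 2700
TD000002700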
Perhaps this is a good time to switch to writing variable names in lowercase.

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
values.txt:
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've made as many attempts with awk/sed and so on as I can think of, but have failed to produce a command that performs the desired action.
Any guidance is appreciated. Thank you!
Read each line from values.txt
Split that line in 2 words
Use sed for each line to replace the 1st word with the 2nd word in all files in the myFiles/ directory
Note: I've used bash parameter expansion to split the line (${line% *} etc.), assuming values.txt is a space-separated, 2-column file. If that's not the case, you may use awk or cut to split the line.
while read -r line; do
    # '-i' edits files in place and 'g' replaces all occurrences of the pattern
    sed -i "s/${line% *}/${line#* }/g" myFiles/*
done < values.txt
You can do what you want with awk.
#! /usr/bin/awk -f
# snarf in first file, values.txt
FNR == NR {
    subs[$1] = $2
    next
}
# apply replacements to subsequent files
{
    for( old in subs ) {
        while( index($0, old) ) {
            start = index($0, old)
            len = length(old)
            $0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
        }
    }
    print
}
When you invoke it, put values.txt as the first file to be processed.
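A hypothetical invocation (the script name replace.awk is mine), which prints the rewritten lines to standard output rather than editing the files in place:
awk -f replace.awk values.txt myFiles/*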
Option One:
create a python script
with open('filename', 'r') as infile, etc., read the values.txt file into a python dict with 'from' as the key and 'to' as the value, then close the infile.
use os.listdir (or glob) to list the wanted directory, iterate over its files, and for each one either popen a sed 's/from/to/g', or read the file in, iterating over all its lines and doing the find/replace on each line.
Option Two:
bash script
read in a from/to pair
invoke
perl -p -i -e 's/from/to/g' dirname/*.txt
done
The second is probably easier to write, but offers less exception handling.
It's called 'Perl PIE' and it's a relatively famous hack for doing find/replace in lots of files at once.
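A minimal sketch of Option Two, assuming values.txt is space-separated and that the find/replace values contain no characters that are special to the shell, to perl's s/// delimiter, or to the replacement side:
while read -r from to; do
    perl -p -i -e "s/\Q$from\E/$to/g" myFiles/*
done < values.txt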

Adding an extra value into CSV data, according to filename

Let's say I have the following type of filename formats:
CO#ATH2000.dat, CO#MAR2000.dat
Each of these has data like the following:
....
"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
Now I also have the following .sh file that merges ALL those .dat files into one single output .dat file.
for filename in `ls CO#*`; do
    cat $filename >> CO#combined.dat
done
Now here is the problem. Inside CO#combined.dat, I want each line to have a 'standard' value before the start of the values, according to the source filename. For example, I want each file with ATH in its filename to have 3, at the start of each line, and each file with MAR in its filename to have 22,.
So the CO#combined.dat should be something like this:
....
3,"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
3,"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
20,"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
20,"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
So, in conclusion, I want the script to do the above procedure!
Thanks in advance!
With awk you can take advantage of the built-in FILENAME variable along with the fact that you can supply multiple files to a given invocation. awk processes each file in turn, setting FILENAME to the name of the file whose records are currently being read.
With that you can set your prefix according to whatever pattern you wish to search for in the file name. Finally you can print the prefix and the original record.
Here's a demonstration on simplified versions of your sample input:
$ cat CO\#ATH2000.dat
1
2
3
$ cat CO\#MAR2000.dat
A
B
C
$ awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO*.dat
3,1
3,2
3,3
22,A
22,B
22,C
This can be done simply:
for f in CO#*; do
    case ${f:3:3} in
        ATH) k=3 ;;
        *) k=22 ;;
    esac
    sed "s/^/$k,/" $f >> all
done
${f:3:3} extracts the code ATH or MAR from the filename (it's bash's substring expansion); case converts the code to its numerical counterpart; sed inserts the numerical value and a comma at the beginning of each line.
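For example:
$ f="CO#ATH2000.dat"; echo "${f:3:3}"
ATH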

Slow bash script to execute sed expression on each line of an input file

I have a simple bash script as follows
#!/bin/bash
#This script reads a file of row identifiers separated by new lines
# and outputs all query FASTA sequences whose headers contain that identifier.
# usage filter_fasta_on_ids.sh fasta_to_filter.fa < seq_ids.txt; > filtered.fa
while read SEQID; do
    sed -n -e "/$SEQID/,/>/ p" $1 | head -n -1
done
A fasta file has the following format:
> HeadER23217;count=1342
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
>ANotherName;count=3221
GGGTACAGACCTACAC
CAACTAGGGGACCAAT
Edit: changed the header names to better show their actual structure in the files.
The script I made above does filter the file correctly, but it is very slow. My input file has ~20,000,000 lines containing ~4,000,000 sequences, and I have a list of 80,000 headers that I want to filter on. Is there a faster way to do this using bash/sed or other tools (like python or perl)? Any ideas why the script above is taking hours to complete?
You're scanning the large file 80k times. I'll suggest a different approach with a different tool: awk. Load the selection list into a hashmap (awk array), and while scanning the large file, print any sequence that matches.
For example
$ awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a["Sequence ID " $i]; next}
$1 in a' headers fasta
The -F"\n" flag sets the field separator in the input file to be a new line. -v RS=">" sets the record separator to be a ">"
Sequence ID 1
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
Sequence ID 4
GGGTACAGACCTACAT
CAACTAGGGGACCAAT
Here, the headers file contains
$ cat headers
1
4
and the fasta file includes some more records in the same format.
If your headers already include the "Sequence ID" prefix, adjust the code accordingly. I didn't test this on large files, but it should be dramatically faster than your code, as long as you don't have memory restrictions on holding an 80K-element array. In that case, splitting the headers into multiple sections and combining the results should be trivial.
To allow any format of header and to have the resulting file be a valid FASTA file, you can use the following command:
awk -F"\n" -v RS=">" -v ORS=">" -v OFS="\n" 'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next} $1 in a' headers fasta > out
The ORS and OFS flags set the output field and record separators, in this case to be the same as the input fasta file.
You should take advantage of the fact (which you haven't explicitly stated, but I assume) that the huge fasta file contains the sequences in order (sorted by ID).
I'm also assuming the headers file is sorted by ID. If it isn't, make it so - sorting 80k integers is not costly.
When both are sorted, it boils down to a single simultaneous linear scan through both files. And since it runs in constant memory, it can work with input of any size, unlike the other awk example. I give an example in Python since I'm not comfortable with manual iteration in awk.
import sys

fneedles = open(sys.argv[1])
fhaystack = open(sys.argv[2])

def get_next_id():
    while True:
        line = next(fhaystack)
        if line.startswith(">Sequence ID "):
            return int(line[len(">Sequence ID "):])

def get_next_needle():
    return int(next(fneedles))

try:
    i = get_next_id()
    j = get_next_needle()
    while True:
        if i == j:
            print(i)
        while i <= j:
            i = get_next_id()
        while i > j:
            j = get_next_needle()
except StopIteration:
    pass
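A hypothetical invocation (the script name filter_ids.py is mine); note that it prints the matching IDs, not the full sequences:
python filter_ids.py seq_ids.txt fasta_to_filter.fa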
Sure it's a bit verbose, but it finds 80k of 4M sequences (339M of input) in about 10 seconds on my old machine. (It could also be rewritten in awk which would probably be much faster). I created the fasta file this way:
for i in range(4000000):
    print(">Sequence ID {}".format(i))
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
And the headers ("needles") this way:
import random
ids = list(range(4000000))
random.shuffle(ids)
ids = ids[:80000]
ids.sort()
for i in ids:
    print(i)
It's slow because you are reading the same file several times, when you could have sed read it once and process all the patterns. So you need to generate a sed script with a statement for each ID, using />/b to replace your head -n -1.
while read ID; do
    printf '/%s/,/>/ { />/b; p }\n' $ID
done | sed -n -f - data.fa
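For example, with the header HeadER23217 from the question's sample data, each pass of the loop emits a sed statement like:
/HeadER23217/,/>/ { />/b; p }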

How to read and replace Special characters in a fixed length file using shell script

I have a fixed length file in which some records have different special characters like Еӏєпа
I'm able to select the records containing special characters.
I want to read 2 columns from those records and update them with '*' padded with blanks.
Sample Data :
1234562013-09-01 01:05:30Еӏєпа Нцвѡі A other
5657812011-05-05 02:34:56abu jaya B other
Specifically, the 3rd and 4th columns, which contain special characters, should be replaced with a single '*' padded with blanks to fill the length.
I need result like below
1234562013-09-01 01:05:30* * A2013-09-01 02:03:40other
5657812011-05-05 02:34:56abu jaya B2013-09-01 07:06:10other
I tried the following commands:
sed -r "s/^(.{56}).{510}/\1$PAD/g;s/^(.{511}).{1023}/\1$PAD/g" errorline.txt
cut -c 57-568
Could someone help me out with this?
I would go with awk, something like:
awk '/[LIST__OF_SPECIAL_CHARS]/ {
    l = $0
    # for 3rd col
    # NOTE the * must be padded if you have a fixed length file
    # This can be done with spaces and/or (s)printf, read the docs
    if (substr($0, FROM, NUM_OF_CHARS) ~ /[LIST__OF_SPECIAL_CHARS]/) {
        l = substr(l, 1, START_OF_3RD_COL_MINUS_1) "*" substr(l, START_OF_4TH_COL)
    }
    # for 4th col
    # NOTE the * must be padded if you have a fixed length file
    # This can be done with spaces and/or (s)printf, read the docs
    if (substr($0, START_OF_4TH_COL, NUM_OF_CHARS) ~ /[LIST__OF_SPECIAL_CHARS]/) {
        l = substr(l, 1, START_OF_4TH_COL_MINUS_1) "*" substr(l, END_OF_4TH_COL_PLUS_1)
    }
    # after printing this line, skip to the next record
    print l
    next
}
{ # prints every other record
    print
}' INPUTFILE
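A hypothetical instantiation: assuming the 3rd column starts at character 26 (as in the sample rows) and is 20 characters wide (a guess), and using the [^a-zA-Z0-9 ] class from the sed answer below as the special-character test:
awk '{
    if (substr($0, 26, 20) ~ /[^a-zA-Z0-9 ]/) {                        # 3rd column contains a special char?
        $0 = substr($0, 1, 25) sprintf("%-20s", "*") substr($0, 46)    # replace it with '*' padded to 20 chars
    }
    print
}' errorline.txt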
sed "/.\{56\}.*[^a-zA-Z0-9 ].*.\{7\}/ s/\(.\{56\}\).\{20\}\(.\{7\}\)/\1* * \2/"errorline.txt
where:
56 is the length of the first part of your line, which doesn't contain special chars;
20 is the length of the second part, which may contain special chars;
7 is the length of the last part, at the end of your string;
"* * " is the string that will replace your special-char section.
Adapt those values to your string structure.
This sed reads the whole file and replaces only the lines with special chars.
