Grep moving spaces? - shell

{
while read -r line ; do
if grep -q '2015' $line
then
echo "inside then"
echo "$line"
echo "yes"
fi
done
} < testinput
Once I execute the above code the output is:
inside then
( nothing is printed in this line its spaces)
yes
Why is that input line is not getting printed in the second output line?
Your help is appreciated. The reason why I am asking is that I actually have to perform a few operations on the input line after the match using grep is successful.
Input file Sample :
2015-07-18-00.07.28.991321-240 I84033A497 LEVEL: Info
PID : 21233902 TID : 9510 PROC : db2sysc 0
INSTANCE: xxxxxxxx NODE : 000 DB : XXXXXXX
APPHDL : 0-8 APPID: *LOCAL.xxxxxxx.150718040727
AUTHID : XXXXXXXX
EDUID : 9510 EDUNAME: db2agent (XXXXXXXX) 0
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure
I need to capture the time when SQLHA shows up in the input file or log file. To do that first I find the match for time in the input file and then I save that time in the variables. Once I find SQLHA I will write the time saved in the variables into an output file. So for every occurrence of SQLHA in the log, I will write the time to the output file.

After the update about what is really wanted, it is fairly clear that
you should probably use awk, though sed would also be an option (but harder). You can do it in shell too, though that's messier.
awk '/^2015-/ { datetime = $1 } / SQLHA / { print datetime }' testinput
Using sed:
sed -n -e '/^2015-/ {s/ .*//; h; n; }' -e '/ SQLHA / { x; p; x; }' testinput
(If you find 2015- at the start of a line, remove the stuff after a space and save it in the hold space. If you find SQLHA with spaces on either side, swap the hold and pattern space (thus placing the saved date/time in the pattern space), then print it, then switch it back. The switch back means that if two lines contain SQLHA between occurrences of the date line, you'll get the same date printed twice, rather than a date and then the first of the SQLHA lines. You end up having to think about what can go wrong, as well as what to do when everything goes right — but that may be more for later than right now.)
Using just sh:
while read -r line
do
case "$line" in
(2015-*) datetime=$(set -- $line; echo $1);; # Absence of quotes is deliberate
(*SQLHA*) echo "$datetime";;
esac
done < testinput
There are many other ways to do that in shell. Some of them are safer than this. It'll work on the data shown safely, but you might get to run against maliciously created data.
while read -r line
do
case "$line" in
(2015-*) datetime=$(echo "$line" | sed 's/ .*//');;
(*SQLHA*) echo "$datetime";;
esac
done < testinput
This invokes sed once per date line. Using Bash, I guess you can use:
while read -r line
do
case "$line" in
(2015-*) datetime="${line/ */}";; # Replace blank and everything after with nothing
(*SQLHA*) echo "$datetime";;
esac
done < testinput
This is the least likely to go wrong and avoids executing an external command for each line. You could also avoid case…esac using if and probably [[ so as to get pattern matching. Etc.
Running your script on a Mac, I get error output such as:
grep: 2015-07-18-00.07.28.991321-240: No such file or directory
grep: I84033A497: No such file or directory
grep: LEVEL:: No such file or directory
Are you not seeing that? If you're not, then either you've sent errors to /dev/null (or some other location than the terminal) or you've not shown us exactly the code you're using — or there's a blank line at the top of your testinput file.

This will do what your script is trying to do:
#!/usr/bin/awk -f
/2015/ {
print "inside then"
print
print "yes"
}

This is what I have written(very basic).I will try to run the same program with Grep and post why i am getting the blank space soon.
while read -r line
do
if [[ $line == *2015* ]];
then
dtime=`echo $line | cut -c1-26`
fi
if [[ $line == *SQLHA* ]];
then
echo $dtime
fi
done
} < testinput
Input Used:
2015-07-18-00.07.28.991321-240 I84033A497 LEVEL: Info
EDUID : 9510 EDUNAME: db2agent (SIEB_RPT) 0
FUNCTION: DB2 Common,APIs for DB2 HA Infrastructure, sqlhaAmIin
2015-07-18-00.07.29.991321-240 I84033A497 LEVEL: Info
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlha
2015-07-18-00.07.48.991321-240 I84033A497 LEVEL: Info
EDUID : 9510 EDUNAME: db2agent (SIEB_RPT) 0
FUNCTION: DB2 Common, SQLHA APIs for DB2 HA Infrastructure, sqlha
O/p :
2015-07-18-00.07.29.991321
2015-07-18-00.07.48.991321

Related

Using sed in order to change a specific character in a specific line

I'm a beginner in bash and here is my problem. I have a file just like this one:
Azzzezzzezzzezzz...
Bzzzezzzezzzezzz...
Czzzezzzezzzezzz...
I try in a script to edit this file.ABC letters are unique in all this file and there is only one per line.
I want to replace the first e of each line by a number who can be :
1 in line beginning with an A,
2 in line beginning with a B,
3 in line beginning with a C,
and I'd like to loop this in order to have this type of result
Azzz1zzz5zzz1zzz...
Bzzz2zzz4zzz5zzz...
Czzz3zzz6zzz3zzz...
All the numbers here are random int variables between 0 and 9. I really need to start by replacing 1,2,3 in first exec of my loop, then 5,4,6 then 1,5,3 and so on.
I tried this
sed "0,/e/s/e/$1/;0,/e/s/e/$2/;0,/e/s/e/$3/" /tmp/myfile
But the result was this (because I didn't specify the line)
Azzz1zzz2zzz3zzz...
Bzzzezzzezzzezzz...
Czzzezzzezzzezzz...
I noticed that doing sed -i "/A/ s/$/ezzz/" /tmp/myfile will add ezzz at the end of A line so I tried this
sed -i "/A/ 0,/e/s/e/$1/;/B/ 0,/e/s/e/$2/;/C/ 0,/e/s/e/$3/" /tmp/myfile
but it failed
sed: -e expression #1, char 5: unknown command: `0'
Here I'm lost.
I have in a variable (let's call it number_of_e_per_line) the number of e in either A, B or C line.
Thank you for the time you take for me.
Just apply s command on the line that matches A.
sed '
/^A/{ s/e/$1/; }
/^B/{ s/e/$2/; }
# or shorter
/^C/s/e/$3/
'
s command by default replaces the first occurrence. You can do for example s/s/$1/2 to replace the second occurrence, s/e/$1/g (like "Global") replaces all occurrences.
0,/e/ specifies a range of lines - it filters lines from the first up until a line that matches /e/.
sed is not part of Bash. It is a separate (crude) programming language and is a very standard command. See https://www.grymoire.com/Unix/Sed.html .
Continuing from the comment. sed is a poor choice here unless all your files can only have 3 lines. The reason is sed processes each line and has no way to keep a separate count for the occurrences of 'e'.
Instead, wrapping sed in a script and keeping track of the replacements allows you to handle any file no matter the number of lines. You just loop and handle the lines one at a time, e.g.
#!/bin/bash
[ -z "$1" ] && { ## valiate one argument for filename provided
printf "error: filename argument required.\nusage: %s filename\n" "./$1" >&2
exit 1
}
[ -s "$1" ] || { ## validate file exists and non-empty
printf "error: file not found or empty '%s'.\n" "$1"
exit 1
}
declare -i n=1 ## occurrence counter initialized 1
## loop reading each line
while read -r line || [ -n "$line" ]; do
[[ $line =~ ^.*e.*$ ]] || continue ## line has 'e' or get next
sed "s/e/1/$n" <<< "$line" ## substitute the 'n' occurence of 'e'
((n++)) ## increment counter
done < "$1"
Your data file having "..." at the end of each line suggests your files is larger than the snippet posted. If you have lines beginning 'A' - 'Z', you don't want to have to write 26 separate /match/s/find/replace/ substitutions. And if you have somewhere between 3 and 26 (or more), you don't want to have to rewrite a different sed expression for every new file you are faced with.
That's why I say sed is a poor choice. You really have no way to make the task a generic task with sed. The downside to using a script is it will become a poor choice as the number of records you need to process increase (over 100000 or so just due to efficiency)
Example Use/Output
With the script in replace-e-incremental.sh and your data in file, you would do:
$ bash replace-e-incremental.sh file
Azzz1zzzezzzezzz...
Bzzzezzz1zzzezzz...
Czzzezzzezzz1zzz...
To Modify file In-Place
Since you make multiple calls to sed here, you need to redirect the output of the file to a temporary file and then replace the original by overwriting it with the temp file, e.g.
$ bash replace-e-incremental.sh file > mytempfile && mv -f mytempfile file
$ cat file
Azzz1zzzezzzezzz...
Bzzzezzz1zzzezzz...
Czzzezzzezzz1zzz...

Extracting file content using a for loop [duplicate]

I'm working on a long Bash script. I want to read cells from a CSV file into Bash variables. I can parse lines and the first column, but not any other column. Here's my code so far:
cat myfile.csv|while read line
do
read -d, col1 col2 < <(echo $line)
echo "I got:$col1|$col2"
done
It's only printing the first column. As an additional test, I tried the following:
read -d, x y < <(echo a,b,)
And $y is empty. So I tried:
read x y < <(echo a b)
And $y is b. Why?
You need to use IFS instead of -d:
while IFS=, read -r col1 col2
do
echo "I got:$col1|$col2"
done < myfile.csv
To skip a given number of header lines:
skip_headers=3
while IFS=, read -r col1 col2
do
if ((skip_headers))
then
((skip_headers--))
else
echo "I got:$col1|$col2"
fi
done < myfile.csv
Note that for general purpose CSV parsing you should use a specialized tool which can handle quoted fields with internal commas, among other issues that Bash can't handle by itself. Examples of such tools are cvstool and csvkit.
How to parse a CSV file in Bash?
Coming late to this question and as bash do offer new features, because this question stand about bash and because none of already posted answer show this powerful and compliant way of doing precisely this.
Parsing CSV files under bash, using loadable module
Conforming to RFC 4180, a string like this sample CSV row:
12,22.45,"Hello, ""man"".","A, b.",42
should be splitted as
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
bash loadable .C compiled modules.
Under bash, you could create, edit, and use loadable c compiled modules. Once loaded, they work like any other builtin!! ( You may find more information at source tree. ;)
Current source tree (Oct 15 2021, bash V5.1-rc3) do contain a bunch of samples:
accept listen for and accept a remote network connection on a given port
asort Sort arrays in-place
basename Return non-directory portion of pathname.
cat cat(1) replacement with no options - the way cat was intended.
csv process one line of csv data and populate an indexed array.
dirname Return directory portion of pathname.
fdflags Change the flag associated with one of bash's open file descriptors.
finfo Print file info.
head Copy first part of files.
hello Obligatory "Hello World" / sample loadable.
...
tee Duplicate standard input.
template Example template for loadable builtin.
truefalse True and false builtins.
tty Return terminal name.
uname Print system information.
unlink Remove a directory entry.
whoami Print out username of current user.
There is an full working cvs parser ready to use in examples/loadables directory: csv.c!!
Under Debian GNU/Linux based system, you may have to install bash-builtins package by
apt install bash-builtins
Using loadable bash-builtins:
Then:
enable -f /usr/lib/bash/csv csv
From there, you could use csv as a bash builtin.
With my sample: 12,22.45,"Hello, ""man"".","A, b.",42
csv -a myArray '12,22.45,"Hello, ""man"".","A, b.",42'
printf "%s\n" "${myArray[#]}" | cat -n
1 12
2 22.45
3 Hello, "man".
4 A, b.
5 42
Then in a loop, processing a file.
while IFS= read -r line;do
csv -a aVar "$line"
printf "First two columns are: [ '%s' - '%s' ]\n" "${aVar[0]}" "${aVar[1]}"
done <myfile.csv
This way is clearly the quickest and strongest than using any other combination of bash builtins or fork to any binary.
Unfortunely, depending on your system implementation, if your version of bash was compiled without loadable, this may not work...
Complete sample with multiline CSV fields.
Conforming to RFC 4180, a string like this single CSV row:
12,22.45,"Hello ""man"",
This is a good day, today!","A, b.",42
should be splitted as
1 12
2 22.45
3 Hello "man",
This is a good day, today!
4 A, b.
5 42
Full sample script for parsing CSV containing multilines fields
Here is a small sample file with 1 headline, 4 columns and 3 rows. Because two fields do contain newline, the file are 6 lines length.
Id,Name,Desc,Value
1234,Cpt1023,"Energy counter",34213
2343,Sns2123,"Temperatur sensor
to trigg for alarm",48.4
42,Eye1412,"Solar sensor ""Day /
Night""",12199.21
And a small script able to parse this file correctly:
#!/bin/bash
enable -f /usr/lib/bash/csv csv
file="sample.csv"
exec {FD}<"$file"
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\\n' "${headline[#]}"
numcols=${#headline[#]}
while read -ru $FD line;do
while csv -a row "$line" ; (( ${#row[#]} < numcols )) ;do
read -ru $FD sline || break
line+=$'\n'"$sline"
done
printf "$fieldfmt\\n" "${row[#]}"
done
This may render: (I've used printf "%q" to represent non-printables characters like newlines as $'\n')
Id : "1234"
Name : "Cpt1023"
Desc : "Energy\ counter"
Value : "34213"
Id : "2343"
Name : "Sns2123"
Desc : "$'Temperatur sensor\nto trigg for alarm'"
Value : "48.4"
Id : "42"
Name : "Eye1412"
Desc : "$'Solar sensor "Day /\nNight"'"
Value : "12199.21"
You could find a full working sample there: csvsample.sh.txt or
csvsample.sh.
Note:
In this sample, I use head line to determine row width (number of columns). If you're head line could hold newlines, (or if your CSV use more than 1 head line). You will have to pass number or columns as argument to your script (and the number of head lines).
Warning:
Of course, parsing CSV using this is not perfect! This work for many simple CSV files, but care about encoding and security!! For sample, this module won't be able to handle binary fields!
Read carefully csv.c source code comments and RFC 4180!
From the man page:
-d delim
The first character of delim is used to terminate the input line,
rather than newline.
You are using -d, which will terminate the input line on the comma. It will not read the rest of the line. That's why $y is empty.
We can parse csv files with quoted strings and delimited by say | with following code
while read -r line
do
field1=$(echo "$line" | awk -F'|' '{printf "%s", $1}' | tr -d '"')
field2=$(echo "$line" | awk -F'|' '{printf "%s", $2}' | tr -d '"')
echo "$field1 $field2"
done < "$csvFile"
awk parses the string fields to variables and tr removes the quote.
Slightly slower as awk is executed for each field.
In addition to the answer from #Dennis Williamson, it may be helpful to skip the first line when it contains the header of the CSV:
{
read
while IFS=, read -r col1 col2
do
echo "I got:$col1|$col2"
done
} < myfile.csv
If you want to read CSV file with some lines, so this the solution.
while IFS=, read -ra line
do
test $i -eq 1 && ((i=i+1)) && continue
for col_val in ${line[#]}
do
echo -n "$col_val|"
done
echo
done < "$csvFile"

Bash script to add double quotes in .CSV comma delimited file

I need to add double quotes to the csv file. My sample data is like this..
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29 06:10:01",200890899011202395,0899,0279395,356058102052972,89148000005117597971,67756296
I have tried some code available online with awk and sed, it is resulting as below , Error - **First digit in the number is being trimmed like for ex. in '378478' it is only displaying '78478'.
Also it is adding double quotes to already existing double quotes too!** nothing seems to be perfectly working. Please guide me!
"78478","COMPLETED","Tracfone","","",""2020/03/29 09:39:22"","","2787","","356074101197544","89148000005748235454","75176540"
"78328","COMPLETED",""Total Wireless"",""Unlimited Talk"," Text"," & Data (First 25GB High Speed"," then unlimited 2GB)"","50",""2020/03/29 06:10:01"","200890899011202395","0899","0279395","356058102052972","89148000005117597971","67756296"
"78329","COMPLETED",""Cricket Wireless"",""Unlimited Talk"," Text"," & 4G LTE Data w/ 15GB Hotspot"","60",""2020/03/29""
This is the code I am using:
awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file # or
sed -E 's/([^,]*) , (.*)/"\1" , "\2"/' file
My total code is the below one. my Intention was to first convert all .xlsx to .csv and then add double quotes to same csv and save it in the same file.i know the $file.csv part is wrong, hence i need some help
find "$Src_Dir" -type f -iname "*.xlsx" -print>path/temp
cat path/temp | while IFS="" read -r -d $'\0' file;
do
echo $file
ssconvert "${file}" --export-type=Gnumeric_stf:stf_csv
awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' $file > $file.csv
done
If you want to handle anything other than the simplest CSV files, you should probably move away from sed and awk. There are much better tools available.
For example, if you sudo apt install csvtool (or equivalent) on your favourite distro, you can use its call-per-line functionality to process each line in the input file. See the following script for an example:
#!/bin/bash
function quotify {
# Start empty line, process every field.
line=""
while [[ $# -ne 0 ]] ; do
# Append comma for all but first field, then quoted field.
[[ -n "${line}" ]] && line="${line},"
line="${line}\"$1\""
shift
done
# Output the fully quoted line.
echo "${line}"
}
# Needed to call functions. Also, ensure link: /bin/sh -> /bin/bash.
export -f quotify
# Pretty-print input and output.
echo "Input file:"
sed 's/^/ /' inputFile.csv
echo "Output file:"
csvtool call quotify inputFile.csv | sed 's/^/ /'
Note the quotify function which is called for each line in the CSV file, with the arguments set to each field within that line (sans quotes, whether the original fields had quotes or not).
It basically constructs a string of all the fields in the line, with quotes around them, then writes that to standard output, as shown below in the output from that script:
Input file:
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29"
Output file:
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29"
Even though using a separate tool is probably the easiest way to go, if you absolutely cannot install other packages, then you're going to have to code up something in a package you already have. The following bash script is a good place to start, as it uses no other tools to achieve its goal.
At the moment, it's tied to a very specific set of rules, as follows:
White space matters. Anything between the commas is considered part of the field. This especially matters when detecting a quoted field, it must have the quote as the first character, no abc, "d,e,f",ghi stuff since the "d,e,f" won't be handled correctly.
Quoted fields are allowed to contain commas, and "" sequences within them are turned into ".
It's probably not a good idea to supply ill-formatted CSV files :-)
But, with that in mind, here we go. I'll offer a brief textual description of each section but hopefully the comments in the code will be enough to figure out what's going on.
First, a function for finding the position if some string within another string, useful for working out the field bounds:
function findPos {
haystack="$1"
needle="$2"
# Remove everything past the needle.
prefix="${haystack%%${needle}*}"
# If nothing was removed, it wasn't found, so supply massive number.
# Otherwise, it was found at the length of the string with removed stuff.
position=999999
[[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
echo ${position}
}
Then we can use that in the function that works out the length of the next field. This basically just looks for the next comma for unquoted fields, and does special handling for quoted fields by building up the field from segments (it has to handle quotes within quotes and commas):
function getNextFieldLen {
line="$1"
# Empty line means all work done.
[[ -z "${line}" ]] && echo -1 && return
# Handle unquoted first, this is easy.
[[ "${line:0:1}" != '"' ]] && { echo $(findPos "${line}" ","); return; }
# Now handle quoted. Loop over all segments where a segment is defined as
# the text up to the next <"">, assuming it's before the next <",>.
field=""
nextQuoteComma=$(findPos "${line}" '",')
nextDoubleQuote=$(findPos "${line}" '""')
while [[ ${nextDoubleQuote} -lt ${nextQuoteComma} ]]; do
# Append segment to the field and go back for next segment.
field="${field}${line:0:${nextDoubleQuote}}\"\""
line="${line:${nextDoubleQuote}}"
line="${line:2}"
nextQuoteComma=$(findPos "${line}" '",')
nextDoubleQuote=$(findPos "${line}" '""')
done
# Add final segment (up to the comma) and output entire field.
field="${field}${line:0:${nextQuoteComma}}\""
echo "${#field}"
}
Finally, there's the top-level function which will quotify whatever comes in via standard input:
function quotifyStdIn {
# Process file line by line.
while read -r line; do
# Start with empty output line and non-comma separator.
outLine="" ; sep=""
# Place terminator to make processing easier, start field loop.
line="${line},"
fieldLen=$(getNextFieldLen "${line}")
while [[ ${fieldLen} -ge 0 ]]; do
# Get field and quotify if needed, adjust line (remove field and comma).
field="${line:0:${fieldLen}}"
[[ "${field:0:1}" = '"' ]] || field="\"${field}\""
line="${line:$((fieldLen+1))}"
#line="${line:${fieldLen}}"
#line="${line:1}"
# Append to output line and prepare for next field.
outLine="${outLine}${sep}${field}"; sep=","
fieldLen=$(getNextFieldLen "${line}")
done
# Output built line.
echo "${outLine}"
done
}
And, on the off-chance you want to read directly from a file (though providing a file name that's empty or "-" will use standard input so you can probably just use the file-based function for everything):
function quotifyFile {
file="$1"
# Empty file or "-" means standard input, otherwise take input from real file.
[[ ${#file} -eq 0 ]] && { quotifyStdIn; return; }
[[ "${file}" = "-" ]] && { quotifyStdIn; return; }
quotifyStdIn < "${file}"
}
And, finally, because every program that's not a "Hello, world" one deserves some form of test harness, this is what you can use to test the various capabilities:
(
echo 'paxdiablo,was here'
echo 'and,"then, strangely,",he,was,not'
echo '50,"My name is ""Pax"", and yours is ""Bob""",42'
echo '17,"""Love"" is grand",19'
) > harness.csv
echo "Before:"
sed "s/^/ /" harness.csv
echo "After:"
quotifyFile harness.csv | sed "s/^/ /"
rm -rf harness.csv
And, since a test harness is of little use unless you run the tests, here's the results of the first run:
Before:
paxdiablo,was here
and,"then, strangely,",he,was,not
50,"My name is ""Pax"", and yours is ""Bob""",42
17,"""Love"" is grand",19
After:
"paxdiablo","was here"
"and","then, strangely,","he","was","not"
"50","My name is ""Pax"", and yours is ""Bob""","42"
"17","""Love"" is grand","19"
Hopefully, that will be enough to get you going in the absence of being able to install packages. Of course, if one of the packages you can't install in bash itself, then you have problems that I can't help you with :-)
Your starting CSV is not a good CSV: the 2 rows have different number of columns
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 378478 | COMPLETED | Tracfone | - | - | 2020/03/29 09:39:22 | - | 2787 | - | 356074101197544 | 89148000005748235454 | 75176540 |
| 378328 | COMPLETED | Total Wireless | Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB) | 50 | 2020/03/29 | - | - | - | - | - | - |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
Using Miller (https://github.com/johnkerl/miller) you could run
mlr --csv --quote-all -N unsparsify input >output
to have
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29","","","","","",""
You can use it downloading the executable https://github.com/johnkerl/miller/releases/tag/v5.7.0

Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.

I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates.
In particular, I have files in which some lines begin with the '#' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:
#!/bin/bash
fdir=$1"*"
#for each fastq file
for f in $fdir
do
(
#find duplicated read names and write to file $f.txt
sort $f | uniq -d | grep ^# > "$f".txt
#loop over each duplicated readname
while read in; do
rname=$in
i=1
#while this readname still exists in the file increment and replace
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
rm "$f".txt
rm "$f".bu
done
echo "done" >> progress.txt
)&
background=( $(jobs -p) )
if (( ${#background[#]} ==40)); then
wait -n
fi
done
The problem with it is that its impractically slow. I ran it on a 48 core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files and I'm not sure why.
My question is where are the bugs coming from and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.
EDIT
Strangely the loop works fine on one file. Basically I ran
sort $f | uniq -d | grep ^# > "$f".txt
while read in; do
rname=$in
i=1
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
To give you an idea of what the files look like below are a few lines from one of them. The thing is that even though it works for the one file, it's slow. Like multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach.
With regard to the file deletions and other bugs I have no idea what was happening Maybe it was running into memory collisions or something when they were run in parallel?
Sample input:
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Sample output:
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Here's some code that produces the required output from your sample input.
Again, it is assumed that your input file is sorted by the first value (up to the first space character).
time awk '{
#dbg if (dbg) print "#dbg:prev=" prev
if (/^#/ && prev!=$1) {fixNum=0 ;if (dbg) print "prev!=$1=" prev "!=" $1}
if (/^#/ && (prev==$1 || NR==1) ) {
prev=$1
n=split($1,tmpArr,":") ; n++
#dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
fixNum++;tmpArr[6]=fixNum;
# magic to rebuild $1 here
for (i=1;i<n;i++) {
tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
}
$1=tmpFix ; $0=$0
print $0
}
else { tmpFix=""; print $0 }
}' file > fixedFile
output
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.
Assuming a non-csh, you should be able to copy/paste the code block into a terminal window cmd-line and replace file > fixFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ...file>file) will truncate the existing file and then try to write, SO you can lose all the data of a file trying to use the same name.
There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done that will make the code faster, but this should run very quickly. If not, please post the result of time command that should appear at the end of the run, i.e.
real 0m0.18s
user 0m0.03s
sys 0m0.06s
IHTH
#!/bin/bash
i=4
sort $1 | uniq -d | grep ^# > dups.txt
while read in; do
if [ $((i%4))=0 ] && grep -q "$in" dups.txt; then
x="$in"
x=${x/"0:0 "/$i":0 "}
echo "$x" >> $1"fixed.txt"
else
echo "$in" >> $1"fixed.txt"
fi
let "i+=1"
done < $1

appending text to specific line in file bash

So I have a file that contains some lines of text separated by ','. I want to create a script that counts how much parts a line has and if the line contains 16 parts i want to add a new one. So far its working great. The only thing that is not working is appending the ',' at the end. See my example below:
Original file:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
Expected result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
This is my code:
while read p; do
if [[ $p == "HEA"* ]]
then
IFS=',' read -ra ADDR <<< "$p"
echo ${#ADDR[#]}
arrayCount=${#ADDR[#]}
if [ "${arrayCount}" -eq 16 ];
then
sed -i "/$p/ s/\$/,xx/g" $f
fi
fi
done <$f
Result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
What im doing wrong? I'm sure its something small but i cant find it..
It can be done using awk:
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
-F, sets input field separator as comma
NF==16 is the condition that says execute block inside { and } if # of fields is 16
$0 = $0 FS "xx" appends xx at end of line
1 is the default awk action that means print the output
For using sed answer should be in the following:
Use ${line_number} s/..../..../ format - to target a specific line, you need to find out the line number first.
Use the special char & to denote the matched string
The sed statement should look like the following:
sed -i "${line_number}s/.*/&xx/"
I would prefer to leave it to you to play around with it but if you would prefer i can give you a full working sample.

Resources