I have a file of tab-separated values whose fields also contain spaces, like this:
! (desambiguación) http://es.dbpedia.org/resource/!_(desambiguación) 5
! (álbum) http://es.dbpedia.org/resource/!_(álbum_de_Trippie_Redd) 2
!! http://es.dbpedia.org/resource/!! 4
$9.99 http://es.dbpedia.org/resource/$9.99 6
Tomlinson http://es.dbpedia.org/resource/(10108)_Tomlinson 20
102 Miriam http://es.dbpedia.org/resource/(102)_Miriam 2
2003 QQ47 http://es.dbpedia.org/resource/(143649)_2003_QQ47 2
I want to extract the last number of every line:
5
2
4
6
20
2
2
For that, I have done this:
while read line;
do
NUMBER=$(echo $line | cut -f 3 -d ' ')
echo $NUMBER
done < $PAIRCOUNTS_FILE
The main problem is that some lines have more spaces than others, and cut doesn't work for me with the default delimiter (tab). I don't know why; maybe because I am using WSL.
I have tried cut with several options, but none of them work:
NUMBER=$(echo $line | cut -f 3 -d ' ')
NUMBER=$(echo $line | cut -f 4 -d ' ')
NUMBER=$(echo $line | cut -f 2)
NUMBER=$(echo $line | cut -f 3)
Hope you can help me with this. Thanks in advance.
I want to extract the last number of every line:
You could use grep
grep -Eo '[[:digit:]]+$' file
Or mapfile aka readarray which is a bash4+ feature.
mapfile -t array < file
printf '%s\n' "${array[@]##* }"
You can use awk:
awk '{print $NF}' file
With cut (if it is truly TAB separated and 3 fields per line):
cat file | cut -f3
If you have some variable number of fields per line, use rev|cut|rev to get the last field:
cat file | rev | cut -f1 | rev
Or with pure Bash and parameter expansion:
while IFS= read -r line; do
    last=${line##*$'\t'} # strip everything up to and including the last TAB
    printf "%s\n" "$last"
done <file
Or, read into a bash array and echo the last field:
while IFS=$'\t' read -r -a arr; do
    echo "${arr[${#arr[@]}-1]}"
done <file
If you have a mixture of tabs and spaces you can do what usually is a mistake and break a Bash variable on white spaces in general (tabs and spaces) into an array:
while IFS= read -r line; do
    arr=($line) # break on either tab or space without quotes
    echo "${arr[${#arr[@]}-1]}"
done <file
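For the mixed tabs-and-spaces case, here is a minimal sketch that avoids the unquoted expansion. It assumes each line ends with the number and has no trailing whitespace; the helper name last_field and the sample lines are made up for illustration. The [[:space:]] class matches both tabs and spaces, so it does not matter which separator a given line uses.

```shell
#!/usr/bin/env bash
# last_field (hypothetical helper): print the last whitespace-separated
# field of every line read from standard input.
last_field() {
    local line
    # The "|| [ -n ... ]" part also handles a final line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Remove the longest prefix that ends in a space or a tab.
        printf '%s\n' "${line##*[[:space:]]}"
    done
}

# Illustrative input: fields separated by tabs, first field contains spaces.
printf '%s\t%s\t%s\n' '! (álbum)' 'http://example.org/a' 2 \
                      '102 Miriam' 'http://example.org/b' 20 | last_field
```

This also hints at why the cut attempts in the question failed: an unquoted echo $line lets the shell split the line on whitespace and rejoin the words with single spaces, so the tabs are gone before cut sees them.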
I have a setup like this:
#read a list of files
tr -d \\015 < sample.txt | while IFS=, read -r NAME
do
#grep for lines and do stuff
for VAR in $(grep '.*: {$' $NAME)
do
do some stuff
done
done
The problem is that it doesn't work: for VAR in $(grep '.*: {$' $NAME) seems to add an unnecessary space and newline to its results.
If I echo $VAR I get the following:
blahblahblah:
{
Now consider this code:
#read a list of files
tr -d \\015 < sample.txt | while IFS=, read -r NAME
do
VAR=$(grep '.*: {$' $NAME)
echo $VAR
done
If I echo $VAR here I get:
blahblahblah: {
Why do I get the extra newline in the first example?
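A likely explanation, sketched under the assumption that the matching line really is "blahblahblah: {": grep adds nothing. An unquoted $(...) in a for list is split on whitespace, so the loop runs once per word rather than once per line. Reading the grep output with while read keeps each matching line whole. The helper name split_demo and the sample file contents are invented for the example.

```shell
#!/usr/bin/env bash
# split_demo (hypothetical helper): run both loop styles on one file.
split_demo() {
    local file=$1 VAR
    # Unquoted $(...) is word-split: one iteration per word.
    for VAR in $(grep ': {$' "$file"); do
        printf 'for: [%s]\n' "$VAR"
    done
    # read consumes one line per iteration and keeps inner whitespace.
    grep ': {$' "$file" | while IFS= read -r VAR; do
        printf 'while: [%s]\n' "$VAR"
    done
}

# Throwaway sample file (contents are illustrative only).
tmp=$(mktemp)
printf '%s\n' 'blahblahblah: {' 'other line' > "$tmp"
split_demo "$tmp"
rm -f "$tmp"
```

The for loop prints the two words in separate iterations, while the while loop prints the line once. In the second snippet of the question, echo $VAR rejoins the words with a single space, which is why the output looks intact there.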
Below is a shell script written to process a huge file. It reads a fixed-length file line by line, extracts substrings, and appends them to another file in delimited form. It works perfectly, but it is too slow.
array=() # Create array
while IFS='' read -r line || [[ -n "$line" ]] # Read a line
do
coOrdinates="$(echo -e "${line}" | grep POSITION | cut -d'(' -f2 | cut -d')' -f1 | cut -d':' -f1,2)"
if [[ -z "${coOrdinates// }" ]];
then
echo "Not adding"
else
array+=("$coOrdinates")
fi
done < "$1_CTRL.txt"
while read -r line;
do
result='"'
for e in "${array[@]}"
do
SUBSTRING1=`echo "$e" | sed 's/.*://'`
SUBSTRING=`echo "$e" | sed 's/:.*//'`
result1=`perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)"`
result1="$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
result=$result$result1'"'',''"'
done
echo $result >> $1_1.txt
done < "$1.txt"
Earlier, I had used the cut command and then changed it as shown above, but there is no improvement in the time taken.
Can you please suggest what changes could reduce the processing time?
Thanks in advance.
Update:
Sample content of the input file :
XLS01G702012 000034444132412342134
Control File :
OPTIONS (DIRECT=TRUE, ERRORS=1000, rows=500000) UNRECOVERABLE
load data
CHARACTERSET 'UTF8'
TRUNCATE
into table icm_rls_clientrel2_hg
trailing nullcols
(
APP_ID POSITION(1:3) "TRIM(:APP_ID)",
RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"
)
Output file:
"LS0","1G702012 0000"
perl:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature 'say';    # say() is not enabled by default
# read the control file
my $ctrl;
{
    local $/;         # slurp mode: read the whole control file at once
    open my $fh, "<", shift @ARGV;
    $ctrl = <$fh>;
    close $fh;
}
# flat list of (offset, length, offset, length, ...)
my @positions = ( $ctrl =~ /\((\d+):(\d+)\)/g );
# read the data file
open my $fh, "<", shift @ARGV;
while (<$fh>) {
    my @words;
    for (my $i = 0; $i < scalar(@positions); $i += 2) {
        push @words, substr($_, $positions[$i], $positions[$i+1]);
    }
    say join ",", map {qq("$_")} @words;
}
close $fh;
perl parse.pl x_CTRL.txt x.txt
"LS0","1G702012 00003"
The results differ from what you requested, which raises two questions:
In the POSITION(m:n) syntax of the control file, is n a length or an index?
In the data file, are those spaces or tabs?
I suggest, with pure bash and to avoid subshells:
if [[ $line =~ POSITION ]] ; then # grep POSITION
    coOrdinates="${line#*(}" # cut -d'(' -f2
    coOrdinates="${coOrdinates%%)*}" # cut -d')' -f1 (%% cuts at the first ")")
    # cut -d':' -f1,2 needs no equivalent: the value is already "m:n"
    if [[ -z "${coOrdinates// }" ]]; then
        echo "Not adding"
    else
        array+=("$coOrdinates")
    fi
fi
More efficient, as suggested by gniourf_gniourf:
if [[ $line =~ POSITION\(([[:digit:]]+):([[:digit:]]+)\) ]]; then
    array+=( "${BASH_REMATCH[1]}:${BASH_REMATCH[2]}" ) # keep the "m:n" form
fi
Similarly:
SUBSTRING1=${e#*:} # $( echo "$e" | sed 's/.*://' )
SUBSTRING=${e%:*} # $( echo "$e" | sed 's/:.*//' )
# Perl's substr(string, offset, length) is zero-based, like ${var:offset:length}
result1=${line:$SUBSTRING:$SUBSTRING1} # $( perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)" )
# trim, if necessary (replaces the echo | sed pipeline):
result1="${result1%${result1##*[^[:space:]]}}" # right
result1="${result1#${result1%%[^[:space:]]*}}" # left
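To check that these parameter expansions behave like the perl and sed calls they replace, here is a small sketch. The helper name extract_field is made up; it assumes, as the perl call in the question does, that the first number is a zero-based offset and the second a length.

```shell
#!/usr/bin/env bash
# extract_field (hypothetical helper): print LEN characters of LINE starting
# at zero-based OFFSET, with leading and trailing whitespace removed.
extract_field() {
    local line=$1 offset=$2 len=$3 field
    field=${line:offset:len}
    field=${field%"${field##*[^[:space:]]}"}   # trim right
    field=${field#"${field%%[^[:space:]]*}"}   # trim left
    printf '%s\n' "$field"
}

# First field of the sample record, using POSITION(1:3).
extract_field 'XLS01G702012 000034444132412342134' 1 3   # prints LS0
```

Everything runs inside the current shell, so no process is started per field, which is where the original loop spends most of its time.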
gniourf_gniourf also suggests moving the grep out of the loop:
while read ...; do
...
done < <(grep POSITION ...)
This gives extra efficiency: while/read loops are very slow in Bash, so prefiltering as much as possible speeds up the process quite a lot.
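Putting the two suggestions together, here is a minimal sketch that prefilters with grep and lets the regex do the parsing. The helper name read_positions and the global array name positions are made up, and the sketch assumes each POSITION(m:n) entry sits on its own line, as in the sample control file.

```shell
#!/usr/bin/env bash
# read_positions (hypothetical helper): fill the global array "positions"
# with "m:n" strings taken from the POSITION(m:n) entries of a control file.
read_positions() {
    local ctrl_file=$1 line
    local re='POSITION\(([[:digit:]]+):([[:digit:]]+)\)'
    positions=()
    # grep runs once, outside the loop; the loop only sees matching lines.
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            positions+=( "${BASH_REMATCH[1]}:${BASH_REMATCH[2]}" )
        fi
    done < <(grep POSITION "$ctrl_file")
}

# Demo on a throwaway control file using two lines from the sample.
ctrl=$(mktemp)
printf '%s\n' '(' 'APP_ID POSITION(1:3) "TRIM(:APP_ID)",' \
    'RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"' ')' > "$ctrl"
read_positions "$ctrl"
printf '%s\n' "${positions[@]}"
rm -f "$ctrl"
```

Process substitution is used instead of a pipe so that the array is filled in the current shell and is still set after the loop ends.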
Updated Answer
Here is a version where I parse the control file with awk, save the character positions and then use those when parsing the input file:
awk '
/APP_ID/ {
sub(/\).*/,"") # Strip closing parenthesis and all that follows
sub(/^.*\(/,"") # Strip everything up to opening parenthesis
split($0,a,":") # Extract the two character positions separated by colon into array "a"
next
}
/RELATIONSHIP/ {
sub(/\).*/,"") # Strip closing parenthesis and all that follows
sub(/^.*\(/,"") # Strip everything up to opening parenthesis
split($0,b,"[():]") # Extract character positions into array "b"
next
}
FNR==NR{next}
{ f1=substr($0,a[1]+1,a[2]); f2=substr($0,b[1]+1,b[2]); printf("\"%s\",\"%s\"\n",f1,f2)}
' ControlFile InputFile
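The script above names the two columns explicitly. As a sketch of how it might be generalised to any number of POSITION(m:n) entries, the following collects every pair from the control file first. The helper name fixed_to_csv is made up, and it keeps the same assumption as above: m is a zero-based offset and n a length.

```shell
#!/usr/bin/env bash
# fixed_to_csv (hypothetical helper): CONTROL_FILE DATA_FILE -> quoted CSV.
fixed_to_csv() {
    awk '
        # First file: collect every POSITION(m:n) pair.
        FNR == NR {
            if (match($0, /POSITION\([0-9]+:[0-9]+\)/)) {
                # Drop the leading "POSITION(" (9 chars) and the final ")".
                split(substr($0, RSTART + 9, RLENGTH - 10), p, ":")
                start[++n] = p[1] + 1   # awk strings are 1-based
                len[n] = p[2]
            }
            next
        }
        # Second file: cut each field and join with quotes and commas.
        {
            out = ""
            for (i = 1; i <= n; i++)
                out = out (i > 1 ? "," : "") "\"" substr($0, start[i], len[i]) "\""
            print out
        }
    ' "$1" "$2"
}

# Demo with throwaway files (contents are illustrative only).
ctrl=$(mktemp); data=$(mktemp)
printf '%s\n' 'A POSITION(1:3) "TRIM(:A)",' 'B POSITION(4:2) "TRIM(:B)"' > "$ctrl"
printf '%s\n' 'ABCDEFGHIJ' > "$data"
fixed_to_csv "$ctrl" "$data"   # prints "BCD","EF"
rm -f "$ctrl" "$data"
```

The data file is still read in a single awk pass, so the cost per record stays the same as in the two-column version.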
Original Answer
Not a complete, rigorous answer, but this should give you an idea of how to do the extraction with awk once you have the POSITION parameters from the control file:
awk -v a=2 -v b=3 -v c=5 -v d=21 '{f1=substr($0,a,b); f2=substr($0,c,d); printf("\"%s\",\"%s\"\n",f1,f2)}' InputFile
Sample Output
"LS0","1G702012 00003"
Try running that on your large input file to get an idea of the performance, then tweak the output. Reading the control file is not at all time-critical so don't bother with optimising that.
To avoid the (slow) while loop over the data file, you can use cut and paste:
#!/bin/bash
inFile=${1:-checkHugeFile}.in
ctrlFile=${1:-checkHugeFile}_CTRL.txt
outFile=${1:-checkHugeFile}.txt
cat /dev/null > "$outFile"
typeset -a array # Create array
while read -r line # Read a line
do
    coOrdinates="${line#*(}"
    coOrdinates="${coOrdinates%%)*}"
    [[ -z "${coOrdinates// }" ]] && { echo "Not adding"; continue; }
    array+=("$coOrdinates")
done < <(grep POSITION "$ctrlFile")
echo coOrdinates: "${array[@]}"
nr=0
for e in "${array[@]}"
do
    nr=$((nr+1))
    start=${e%:*}
    len=${e#*:}
    from=$(( start + 1 )) # cut counts characters from 1
    to=$(( start + len )) # last character of a field that is len long
    cut -c"$from-$to" "$inFile" > "${outFile}.$nr"
done
# g: replace every tab, so more than two fields also work
paste "$outFile".* | sed -e 's/^/"/' -e 's/\t/","/g' -e 's/$/"/' > "$outFile"
rm "$outFile".[0-9]
I have a read loop that is reading a variable but not behaving the way I expect. I want to read every line of my variable and process each one. Here is my loop:
while read -r line
do
echo $line | sed 's/<\/td>/<\/td>$/g' | cut -d'$' -f2,3,4 >> file.txt
done <<< "$TABLE"
I expect it to process every line, but instead it only processes the first one. If the loop body is simply echo $line >> file.txt, it works as expected. What's going on here? How do I get the behavior I want?
It seems your lines are delimited by \r instead of \n.
Use this while loop to iterate the input with use of read -d $'\r':
while read -rd $'\r' line; do
echo "$line" | sed 's~</td>~</td>$~g' | cut -d'$' -f2,3,4 >> file.txt
done <<< "$TABLE"
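An alternative sketch, in case it is not known in advance whether the rows end in CR, LF or CRLF: normalise the variable first and keep the plain read loop. The helper name to_lf and the sample value are made up for illustration.

```shell
#!/usr/bin/env bash
# to_lf (hypothetical helper): print the argument with CRLF and lone CR
# line endings converted to LF.
to_lf() {
    local s=$1
    s=${s//$'\r\n'/$'\n'}   # CRLF -> LF first
    s=${s//$'\r'/$'\n'}     # then any remaining lone CR -> LF
    printf '%s' "$s"
}

# Sample value with CR-delimited rows (illustrative only).
TABLE=$'<td>a</td><td>b</td>\r<td>c</td><td>d</td>'
TABLE=$(to_lf "$TABLE")

count=0
while IFS= read -r line; do
    count=$((count + 1))
    printf '%d: %s\n' "$count" "$line"
done <<< "$TABLE"
```

With the sample value the loop runs twice, once per row. Running printf '%s' "$TABLE" | od -c on the real data shows which line endings it actually contains.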
If $TABLE contains a multi-line string, I recommend
printf '%s\n' "$TABLE" |
while read -r line; do
echo $line | sed 's/<\/td>/<\/td>$/g' | cut -d'$' -f2,3,4 >> file.txt
done
This is also more portable since the '<<<' operator for here-strings is not POSIX.
I have if test '\n' = "$line" but this doesn't seem to catch the new lines. What is wrong in that code?
How about
if test "$line" = $'\n'
If you are testing that $line is exactly a newline, you can do either of
test "$line" = $'\n' # This is non-standard, and will only work in some shells
test "$line" = '
' # This (two-liner) will work in any shell
If you want to know if $line merely contains a newline (i.e., it is at least two lines long), you could do:
if echo "$line" | sed 1d | grep -q .; then
echo line is at least 2 lines
fi
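Both checks can also be written without starting any external process. This is a sketch using a case pattern; the helper names has_newline and is_newline and the variable nl are made up, and the code sticks to POSIX syntax.

```shell
#!/bin/sh
# nl holds a single literal newline character.
nl='
'

# has_newline (hypothetical helper): succeed if $1 contains a newline anywhere.
has_newline() {
    case $1 in
        *"$nl"*) return 0 ;;
        *)       return 1 ;;
    esac
}

# is_newline (hypothetical helper): succeed if $1 is exactly one newline.
is_newline() {
    [ "$1" = "$nl" ]
}

line=$(printf 'one\ntwo')
if has_newline "$line"; then
    echo "line contains a newline"
fi
```

If $line was filled by read, note that read strips the trailing newline, so an empty input line arrives as an empty string; in that case test -z "$line" is the check to use.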