append data from one file to another file from the right - shell

I have two files with below formats:
file1:
Sub_amount , date/time
12 , 2018040412
78 , 2018040413
26 , 2018040414
file2:
Unsub_amount , date/time
76 , 2018040412
98 , 2018040413
56 , 2018040414
What I need is to append file2 to file1 from the right. What I mean is:
Sub_amount, Unsub_amount , date/time
12 , 76 , 2018040412
78 , 98 , 2018040413
26 , 56 , 2018040414
In the end, what needs to be shown is:
date/time , Unsub_amount , Sub_amount
2018040412, 76 , 12
2018040413, 98 , 78
2018040414, 56 , 26
I would appreciate it if anyone can help :)
Thanks.

I would use awk for this:
awk -F'[[:blank:]]*,[[:blank:]]*' -v OFS="," '
# remove leading and trailing blanks from the line
{ gsub(/^[[:blank:]]+|[[:blank:]]+$/, "") }
# skip empty lines
/^$/ { next }
# store the Sub values from file1
NR == FNR { sub_amt[$2] = $1; next }
# print the data from file2, matching the cached value from file1
{ print $2, $1, sub_amt[$2] }
' file1 file2
date/time,Unsub_amount,Sub_amount
2018040412,76,12
2018040413,98,78
2018040414,56,26
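
For completeness, here is a minimal sketch of the same merge using the standard join(1) tool instead of awk. It assumes both files are already sorted on the date/time column (as in the sample) and that every date appears in both files; the temporary file names are just for illustration:
tail -n +2 file1 | sed 's/ *, */,/g' | sort -t, -k2,2 > sub.tmp
tail -n +2 file2 | sed 's/ *, */,/g' | sort -t, -k2,2 > unsub.tmp
printf 'date/time,Unsub_amount,Sub_amount\n'
join -t, -1 2 -2 2 -o 0,2.1,1.1 sub.tmp unsub.tmp
Here -o 0,2.1,1.1 prints the join field (the date) followed by the first column of file2 and the first column of file1.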

Edit: Not as elegant or robust as the awk solution, but an alternative. It trims the columns (using sed) and extracts them (using cut), then joins them (using paste). Assumes that the rows match up. Happy coding!
#!/usr/bin/env sh
# copy this code
# pbpaste > csv-col-merge.sh # paste/create the file
# chmod u+x csv-col-merge.sh # make it executable
# ./csv-col-merge.sh --gen-example
# ./csv-col-merge.sh demo/subs.csv demo/unsubs.csv demo/combined.csv
#
# learn more:
# - "paste": https://askubuntu.com/questions/616166/how-can-i-merge-files-on-a-line-by-line-basis
# - "<<- EOF": https://stackoverflow.com/questions/2953081/how-can-i-write-a-heredoc-to-a-file-in-bash-script
# - "$_": https://unix.stackexchange.com/questions/271659/vs-last-argument-of-the-preceding-command-and-output-redirection
# - "cat -": https://stackoverflow.com/questions/14004756/read-stdin-in-function-in-bash-script
# - "cut -d ',' -f 1": split lines with ',' and take the first column, see "man cut"
# - "sed -E 's/find/replace/g'"; -E for extended regex, for ? (optional) support, see "man sed"
# -
# --gen-example
if [ "$1" = '--gen-example' ]; then
mkdir -p demo && cd $_
cat <<- EOF > subs.csv
sub_amount, date/time
12, 2018040412
78, 2018040413
26, 2018040414
EOF
cat <<- EOF > unsubs.csv
unsub_amount, date/time
76, 2018040412
98, 2018040413
56, 2018040414
EOF
exit
fi
# load
trim () { cat - | sed -E 's/ ?, ?/,/g'; }
subs="$(cat "$1" | trim)"
unsubs="$(cat "$2" | trim)"
combined=$3
# intermediate
getcol () { cut -d ',' -f $1; }
col_subs="$(echo "$subs" | getcol 1)"
col_unsubs="$(echo "$unsubs" | getcol 1)"
col_subs_date="$(echo "$subs" | getcol 2)"
col_unsubs_date="$(echo "$unsubs" | getcol 2)"
if [ ! "$col_subs_date" = "$col_unsubs_date" ]; then echo 'Make sure date col match up'; exit; fi
# process
mkdir tmp
echo "$col_subs_date" > tmp/a
echo "$col_unsubs" > tmp/b
echo "$col_subs" > tmp/c
paste -d ',' tmp/a tmp/b tmp/c > "$combined"
rm -rf tmp
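
For reference, after generating the demo files and running the script on them, demo/combined.csv should contain something like this (a sketch of the expected result, matching the demo data above):
$ ./csv-col-merge.sh demo/subs.csv demo/unsubs.csv demo/combined.csv
$ cat demo/combined.csv
date/time,unsub_amount,sub_amount
2018040412,76,12
2018040413,98,78
2018040414,56,26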

Related

Concatenate string in .csv after x commas using shell/bash

I have several .csv files containing data. The data vendor created the files with the years indicated once in the first line, with missing values in between, and the variable names in the second line. Data follows from the third to the Xth line.
"year 1", , , "year 2", , ,"year 2", , ,
"Var1", "Var2", "Var3", "Var1", "Var2", "Var3", "Var1", "Var2", "Var3"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"
I am new to shell programming, but it shouldn't be too complicated to write a script that outputs the following:
"Var1_year1", "Var2_year1", "Var3_year1", "Var1_year2", "Var2_year2", "Var3_year2", "Var1_year3", "Var2_year3", "Var3_year3"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"
Something like:
#!/bin/bash
FILES=/Users/pathTo.csvfiles/*.csv
for f in $FILES
do
echo "Processing $f file..."
# 1. Replace the second line with 'Varname_YearX' where YearX comes from the first line
cat ????
# 2. Delete first line
sed -i '' 1d $f
done
echo "Processing complete."
Update: The .csv files vary in their number of lines. Only the first two lines need to be edited; the lines that follow are data.
If you want to merge the first and the second line of each CSV, try this.
# No point in using a variable for the wildcard
for f in /Users/pathTo.csvfiles/*.csv
do
awk -F , 'NR==1 { # Collect first line
# Squash quotes
gsub(/"/, "")
for(i=1;i<=NF;++i)
y[i] = ($i ~ /[^ ]/ ? $i : y[i-1])
next # Do not fall through to print
}
NR==2 { # Combine collected with current
gsub(/"/, "")
for(i=1;i<=NF;++i)
$i = y[i] "_" $i
}
# Print everything (except first)
1' "$f" > "$f.tmp"
mv "$f.tmp" "$f"
done
The loop over the first line simply copies the previous field's value to y[i] whenever the i-th field is empty or blank.
Ugly code using csvtool, various standard tools, and bash:
i=file.csv
paste -d_ <(head -2 $i | tail -1 | csvtool transpose -) \
<(head -1 $i | csvtool transpose - |
sed '$d;s/ //;/^$/{g;b};h') |
csvtool transpose - | sed 's/[^,]*/"&"/g' | cat - <(tail +3 $i)
Output:
"Var1_year1","Var2_year1","Var3_year1","Var1_year2","Var2_year2","Var3_year2","Var1_year2","Var2_year2","Var3_year2"
"ABC" , 1234 , 4567 , "DEF" , 789 , "ABC" , 1234 , 4567 , "DEF"

Count a specific character in each line of a text and remove this character at a specific position until it has a specific count

Hello, I need help with a script on a Solaris system.
I will explain it in detail.
I have these files:
i)
cat /tmp/BadTransactions/TRANSACTIONS_DAILY_20180730.txt
201807300000000004
201807300000000005
201807300000000006
201807300000000007
201807300000000008
201807200002056422
201807230003099849
201807230003958306
201806290003097219
201806080001062012
201806110001633519
201806110001675603
ii)
cat /tmp/BadTransactions/test_data_for_validation_script.txt
20180720|201807200002056422||57413620344272|030341-213T |580463|WIRE||EUR|EUR|20180720|20180720|||||||00000000000019.90|00000000000019.90|Debit||||||||||MPA|||574000|129|||||||||||||||||||||||||31313001103712|BFNJKL|K| I P P BONNIER PUBLICATIO|||FI|PERS7
20180723|201807230003099849||57100440165173|140197-216U|593619|WIRE||EUR|EUR|20180723|20180723|||||||00000000000060.00|00000000000060.00|Debit||||||||||MPA|||571004|106|||||||||||||||||||||||||57108320141339|Ura Basket / UraNaiset|||-div|||FI|PERS
20180723|201807230003958306||57206820079775|210489-0788|593619|WIRE||EUR|EUR|20180721|20180723|||||||00000000000046.00|00000000000046.00|Debit||||||||||MPA|||578800|106|||||||||||||||||||||||||18053000009026|IC Kodit||| c/o Newsec Asset Manag|||FI|PERS
20180629|201806290003097219||57206820079775|210489-0788|593619|WIRE||EUR|EUR|20180628|20180629|||||||00000000000856.00|00000000000856.00|Debit||||||||||MPA|||578800|106|||||||||||||||||||||||||18053000009018|IC Kodit||| c/o Newsec Asset Manag|||FI|PERS
20180608|201806080001062012||57206820079441|140197-216S|580463|WIRE||EUR|EUR|20180608|20180608|||||||00000000000019.90|00000000000019.90|Debit||||||||||MPA|||541002|129|||||||||||||||||||||||||57108320141339|N FN|K| IKI I P BONNIER PUBLICATION|||FI|PERS7
20180611|201806110001633519||57206820079525|140197-216B|593619|WIRE||EUR|EUR|20180611|20180611|||||||00000000000242.10|00000000000242.10|Debit||||||||||MPA|||535806|106|||||||||||||||||||||||||57108320141339|As Oy Haikkoonsilta|| mannerheimin|||FI|PERS9
20180611|201806110001675603||57206820079092|140197-216Z|580463|WIRE||EUR|EUR|20180611|20180611|||||||00000000000019.90|00000000000019.90|Debit||||||||||MPA|||536501|129|||||||||||||||||||||||||57108320141339|N ^NLKL|K| I P NJ BONNIER PUBLICAT|||FI|PERS7
The script has to check each line of
/tmp/BadTransactions/TRANSACTIONS_DAILY_20180730.txt and, if the strings are found in
/tmp/BadTransactions/test_data_for_validation_script.txt, it will create a
new file /tmp/BadTransactions/TRANSACTIONS_DAILY_NEW_20180730.txt.
From this new file it will count all the "|" in each line, and if there are more than 64 it will delete the "|" at the 61st position of the line. This continues until the line has 64 pipes.
For example, if one line has 67 "|", it will delete the 61st, then check again; now it has 66 "|", so it will delete the 61st "|" again, and so on until it reaches 64 pipes. So every line has to end up with exactly 64 "|".
Here is my code, but with it I have only managed to delete the 61st pipe in each line; I cannot make the loop check each line until it reaches 64 pipes.
I would appreciate it if you could help me.
#!/bin/bash
PATH=/usr/xpg4/bin:/bin:/usr/bin
while read line
do
grep "$line" /tmp/BadTransactions/test_data_for_validation_script.txt
awk 'NR==FNR { K[$1]; next } ($2 in K)' /tmp/BadTransactions/TRANSACTIONS_DAILY_20180730.txt FS="|" /opt/NorkomConfigS2/inbox/TRANSACTIONS_DAILY_20180730.txt > /tmp/BadTransactions/TRANSACTIONS_DAILY_NEW_20180730.txt
sed '/\([^|]*[|]\)\{65\}/ s/|//61' /tmp/BadTransactions/TRANSACTIONS_DAILY_NEW_20180730.txt
done < /tmp/BadTransactions/TRANSACTIONS_DAILY_20180730.txt > /tmp/BadTransactions/TRANSACTIONS_DAILY_NEW_20180730.txt
OK, in this problem you have several pieces:
You need to read a file line by line
Check each line against another file
Examine the matching line for the occurrences of "|"
Repeatedly delete the 61st "|" until the string is left with 64 of them
You could do something like this
#!/bin/bash
count() { ### We will use this to count how many pipes are there
string="${1}"; shift
char="${1}"
printf "%s" "${string}" | grep -o -e "${char}" | grep -c .
}
file1="/tmp/BadTransactions/TRANSACTIONS_DAILY_20180730.txt" ### File to read
file2="/tmp/BadTransactions/test_data_for_validation_script.txt" ### File to check for duplicates
file3="/tmp/BadTransactions/TRANSACTIONS_DAILY_NEW_20180730.txt" ### File where to save our final work
printf "" > "${file3}" ### Delete (eventual) history
exec 3<"${file1}" ### Put our data in file descriptor 3
while read -r line <&3; do ### read each line and put it in var "$line"
string="$(grep -e "${line}" "${file2}")" ### Check the line against second file
while [ "$(count "${string}" "|")" -gt 64 ]; do ### While we have more than 64 "|"
string="$(printf "%s" "${string}" | sed -e "s/|//61")" ### Delete the 61st occurrence
done
printf "%s" "${string}" >> "${file3}" ### Save the correct line in the third file
done
exec 3>&- ### Clean file descriptor 3
This is not tested, but should work.
N.B. Please note that I am taking for granted that grep will return only one matching line from the second file...
If that is not your case, you will have to check each value manually with something like:
for value in $(grep -e "${line}" "${file2}"); do
...
done
EDIT:
For systems like Solaris, or others that don't have GNU grep installed, you can substitute the count function as follows:
count() {
string="${1}"; shift
char="${1}"
printf "%s" "${string}" | awk -F"${char}" '{print NF-1}'
}
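A quick sanity check of either count variant (a hypothetical example, assuming the function has been sourced in your shell):
$ count "a|b|c|d" "|"
3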

How to split file by percentage of no. of lines?

How to split file by percentage of no. of lines?
Let's say I want to split my file into 3 portions (60%/20%/20% parts), I could do this manually, -_- :
$ wc -l brown.txt
57339 brown.txt
$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
bc <<< "34398 + 11466 + 11475"
57339
$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
34398 part1.txt
11466 part2.txt
11475 part3.txt
57339 total
But I'm sure there's a better way!
There is a utility that takes as arguments the line numbers that should become the first of each respective new file: csplit. This is a wrapper around its POSIX version:
#!/bin/bash
usage () {
printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}
# Collect csplit options
while getopts "ksf:n:" opt; do
case "$opt" in
k|s) args+=(-"$opt") ;; # k: no remove on error, s: silent
f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
*) usage; exit 1 ;;
esac
done
shift $(( OPTIND - 1 ))
fname=$1
shift
ratios=("$#")
len=$(wc -l < "$fname")
# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
(( total += ratio ))
cumsums+=("$total")
done
# Don't need the last element
unset cumsums[-1]
# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
linenums+=( $(( sum * len / total + 1 )) )
done
csplit "${args[#]}" "$fname" "${linenums[#]}"
After the name of the file to split up, it takes the ratios for the sizes of the split files relative to their sum, i.e.,
percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1
are all equivalent.
Usage similar to the case in the question is as follows:
$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
34403 part0
11468 part1
11468 part2
57339 total
Numbering starts with zero, though, and there is no txt extension. The GNU version supports a --suffix-format option that would allow for .txt extension and which could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.
This solution plays nice with very short files (split file of two lines into two) and the heavy lifting is done by csplit itself.
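If GNU csplit is available, a sketch of getting the .txt extension directly, using the split points the script computes for 60 20 20 on brown.txt:
csplit -s --prefix=part --suffix-format='%d.txt' brown.txt 34404 45872
This would produce part0.txt, part1.txt and part2.txt with the same line counts as above.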
$ cat file
a
b
c
d
e
$ cat tst.awk
BEGIN {
split(pcts,p)
nrs[1]
for (i=1; i in p; i++) {
pct += p[i]
nrs[int(size * pct / 100) + 1]
}
}
NR in nrs{ close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }
$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt
Change the " > " to just > to actually write to the output files.
Usage
The following bash script allows you to specify the percentage like
./split.sh brown.txt 60 20 20
You can also use the placeholder . which fills the remaining percentage up to 100%.
./split.sh brown.txt 60 20 .
The split files are written to
part1-brown.txt
part2-brown.txt
part3-brown.txt
The script always generates as many part files as there are numbers specified.
If the percentages sum up to 100, cat part* will always generate the original file (no duplicated or missing lines).
Bash Script: split.sh
#! /bin/bash
file="$1"
fileLength=$(wc -l < "$file")
shift
part=1
percentSum=0
currentLine=1
for percent in "$#"; do
[ "$percent" == "." ] && ((percent = 100 - percentSum))
((percentSum += percent))
if ((percent < 0 || percentSum > 100)); then
echo "invalid percentage" 1>&2
exit 1
fi
((nextLine = fileLength * percentSum / 100))
if ((nextLine < currentLine)); then
printf "" # create empty file
else
sed -n "$currentLine,$nextLine"p "$file"
fi > "part$part-$file"
((currentLine = nextLine + 1))
((part++))
done
Here is another approach, written as an awk script:
BEGIN {
split(w, weight)
total = 0
for (i in weight) {
weight[i] += total
total = weight[i]
}
}
FNR == 1 {
if (NR!=1) {
write_partitioned_files(weight,a)
split("",a,":") #empty a portably
}
name=FILENAME
}
{a[FNR]=$0}
END {
write_partitioned_files(weight,a)
}
function write_partitioned_files(weight, a) {
split("",threshold,":")
size = length(a)
for (i in weight){
threshold[length(threshold)] = int((size * weight[i] / total)+0.5)+1
}
l=1
part=0
for (i in threshold) {
close(out)
out = name ".part" ++part
for (;l<threshold[i];l++) {
print a[l] " > " out
}
}
}
Invoke as:
awk -v w="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...
Replace " > " with > in script to actually write partitioned files.
The variable w expects space separated numbers. Files are partitioned in that proportion. For example "2 1 1 3" will partition files into four with number of lines in proportion of 2:1:1:3. Any sequence of numbers adding up to 100 can be used as percentages.
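As noted above, to actually write the partitioned files, the inner print in write_partitioned_files would simply become:
print a[l] > out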
For large files the array a may consume too much memory. If that is an issue, here is an alternative awk script:
BEGIN {
split(w, weight)
for (i in weight) {
total += weight[i]; weight[i] = total #cumulative sum
}
}
FNR == 1 {
#get number of lines. take care of single quotes in filename.
name = gensub("'", "'\"'\"'", "g", FILENAME)
"wc -l '" name "'" | getline size
split("", threshold, ":")
for (i in weight){
threshold[length(threshold)+1] = int((size * weight[i] / total)+0.5)+1
}
part=1; close(out); out = FILENAME ".part" part
}
{
if(FNR>=threshold[part]) {
close(out); out = FILENAME ".part" ++part
}
print $0 " > " out
}
This passes through each file twice. Once for counting lines (via wc -l) and the other time while writing partitioned files. Invocation and effect is similar to the first method.
I like Benjamin W.'s csplit solution, but it's so long...
#!/bin/bash
# usage ./splitpercs.sh file 60 20 20
n=`wc -l <"$1"` || exit 1
echo $* | tr ' ' '\n' | tail -n+2 | head -n`expr $# - 1` |
awk -v n=$n 'BEGIN{r=1} {r+=n*$0/100; if(r > 1 && r < n){printf "%d\n",r}}' |
uniq | xargs csplit -sfpart "$1"
(the if(r > 1 && r < n) and uniq bits are to prevent creating empty files or strange behavior for small percentages, files with small numbers of lines, or percentages that add to over 100.)
I just followed your lead and made what you do manually into a script. It may not be the fastest or "best", but if you understand what you are doing now and can just "scriptify" it, you may be better off should you need to maintain it.
#!/bin/bash
# thisScript.sh yourfile.txt 20 50 10 20
YOURFILE=$1
shift
# changed to cat | wc so I dont have to remove the filename which comes from
# wc -l
LINES=$(cat $YOURFILE | wc -l )
startpct=0;
PART=1;
for pct in "$@"
do
# I am assuming that each parameter is on top of the last
# so 10 30 10 would become 10, 10+30 = 40, 10+30+10 = 50, ...
endpct=$( echo "$startpct + $pct" | bc)
# your math but changed parts of 100 instead of parts of 10.
# change bc <<< to echo "..." | bc
# so that one can capture the output into a bash variable.
FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
LASTLINE=$( echo "$LINES * $endpct / 100" | bc )
# use sed every time because the special case for head
# doesn't really help performance.
sed -n $FIRSTLINE,${LASTLINE}p $YOURFILE > part${PART}.txt
((PART++))
startpct=$endpct
done
# get the rest if the percentages don't add up to 100%
if [[ $( echo "$startpct < 100" | bc ) -gt 0 ]] ; then
FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
sed -n "$FIRSTLINE,\$p" $YOURFILE > part${PART}.txt
fi
wc -l part*.txt

bash routine to return the page number of a given line number from text file

Consider a plain text file containing the page-breaking ASCII control character "Form Feed" ($'\f'):
alpha\n
beta\n
gamma\n\f
one\n
two\n
three\n
four\n
five\n\f
earth\n
wind\n
fire\n
water\n\f
Note that each page has a random number of lines.
I need a bash routine that returns the page number of a given line number, from a text file containing the page-breaking ASCII control character.
After a long time researching the solution I finally came across this piece of code:
function get_page_from_line
{
local nline="$1"
local input_file="$2"
local npag=0
local ln=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( ++npag ))
ln=$(echo -n "$page" | wc -l)
total=$(( total + ln ))
if [ $total -ge $nline ]; then
echo "${npag}"
return
fi
done < "$input_file"
echo "0"
return
}
But, unfortunately, this solution proved to be very slow in some cases.
Any better solution ?
Thanks!
The idea to use read -d $'\f' and then to count the lines is good.
This version might appear inelegant: depending on nline, the file may end up being read twice (once by wc -l and once by head).
Give it a try, because it is super fast:
function get_page_from_line ()
{
local nline="${1}"
local input_file="${2}"
if [[ $(wc -l "${input_file}" | awk '{print $1}') -lt nline ]] ; then
printf "0\n"
else
printf "%d\n" $(( $(head -n ${nline} "${input_file}" | grep -c "^"$'\f') + 1 ))
fi
}
The performance of awk is better than that of the bash version above; awk was created for this kind of text processing.
Give this tested version a try:
function get_page_from_line ()
{
awk -v nline="${1}" '
BEGIN {
npag=1;
}
{
if (index($0,"\f")>0) {
npag++;
}
if (NR==nline) {
print npag;
linefound=1;
exit;
}
}
END {
if (!linefound) {
print 0;
}
}' "${2}"
}
When \f is encountered, the page number is increased.
NR is the current line number.
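Usage sketch, with the sample file from the question saved as pages.txt (a hypothetical file name); line 9 is "earth", the first line of page 3:
$ get_page_from_line 9 pages.txt
3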
----
For the record, there is another bash version.
This version uses only built-in commands to count the lines in the current page.
The speedtest.sh that you provided in the comments showed it is only a little ahead (approx. 20 sec), which makes it roughly equivalent to your version:
function get_page_from_line ()
{
local nline="$1"
local input_file="$2"
local npag=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( npag + 1 ))
IFS=$'\n'
for line in ${page}
do
total=$(( total + 1 ))
if [[ total -eq nline ]] ; then
printf "%d\n" ${npag}
unset IFS
return
fi
done
unset IFS
done < "$input_file"
printf "0\n"
return
}
awk to the rescue!
awk -v RS='\f' -v n=09 '$0~"^"n"." || $0~"\n"n"." {print NR}' file
3
updated anchoring as commented below.
$ for i in $(seq -w 12); do awk -v RS='\f' -v n="$i" \
    '$0~"^"n"." || $0~"\n"n"." {print n,"->",NR}' file; done
01 -> 1
02 -> 1
03 -> 1
04 -> 2
05 -> 2
06 -> 2
07 -> 2
08 -> 2
09 -> 3
10 -> 3
11 -> 3
12 -> 3
A script of similar length can be written in bash itself to locate and respond to the embedded <form-feed>'s contained in a file. (It will work in a POSIX shell as well, with a substitute for the string indexing and $(( )) or expr for the math; a sketch of that follows the example output below.) For example,
#!/bin/bash
declare -i ln=1 ## line count
declare -i pg=1 ## page count
fname="${1:-/dev/stdin}" ## read from file or stdin
printf "\nln:pg text\n" ## print header
while read -r l; do ## read each line
if [ "${l:0:1}" = $'\f' ]; then ## if form-feed found
((pg++))
printf "<ff>\n%2s:%2s '%s'\n" "$ln" "$pg" "${l:1}"
else
printf "%2s:%2s '%s'\n" "$ln" "$pg" "$l"
fi
((ln++))
done < "$fname"
Example Input File
The simple input file with embedded <form-feed>'s was created with:
$ echo -e "a\nb\nc\n\fd\ne\nf\ng\nh\n\fi\nj\nk\nl" > dat/affex.txt
Which when output gives:
$ cat dat/affex.txt
a
b
c
d
e
f
g
h
i
j
k
l
Example Use/Output
$ bash affex.sh <dat/affex.txt
ln:pg text
1: 1 'a'
2: 1 'b'
3: 1 'c'
<ff>
4: 2 'd'
5: 2 'e'
6: 2 'f'
7: 2 'g'
8: 2 'h'
<ff>
9: 3 'i'
10: 3 'j'
11: 3 'k'
12: 3 'l'
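As mentioned above, roughly the same routine can be written for a plain POSIX shell; here is a sketch (under the same assumptions as the bash version) that replaces ${l:0:1} and $'\f' with case and printf:
#!/bin/sh
ln=1 pg=1                      ## line and page counters
ff=$(printf '\f')              ## the form-feed character
fname="${1:-/dev/stdin}"       ## read from file or stdin
printf "\nln:pg text\n"        ## print header
while IFS= read -r l; do       ## read each line
    case $l in
        "$ff"*)                ## line starts with a form feed
            pg=$((pg + 1))
            l=${l#"$ff"}
            printf "<ff>\n" ;;
    esac
    printf "%2s:%2s '%s'\n" "$ln" "$pg" "$l"
    ln=$((ln + 1))
done < "$fname"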
With Awk, you can set RS (the record separator, default newline) to form feed (\f) and FS (the input field separator, default any sequence of whitespace) to newline (\n), and obtain the number of lines as the number of "fields" in a "record", which here is a "page".
The placement of form feeds in your data will produce some empty lines within a page so the counts are off where that happens.
awk -F '\n' -v RS='\f' '{ print NF }' file
You could reduce the number by one if $NF == "", and perhaps pass in the number of the desired page as a variable:
awk -F '\n' -v RS='\f' -v p="2" 'NR==p { print NF - ($NF == "") }' file
To obtain the page number for a particular line, just feed head -n number to the script (a sketch of that follows the loop below), or loop over the numbers until you have accrued the sum of lines.
line=1
page=1
for count in $(awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' file); do
old=$line
((line += count))
echo "Lines $old through line are on page $page"
((page++)
done
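And the head -n variant mentioned above can be sketched like this: the page of a given line is simply the number of form-feed separated records that remain after truncating the file at that line, e.g. for line 9:
head -n 9 file | awk -v RS='\f' 'END { print NR }'
which prints 3 for the sample data.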
This gnu awk script prints the "page" for the line number given as a command-line argument:
BEGIN { ffcount=1;
search = ARGV[2]
delete ARGV[2]
if (!search ) {
print "Please provide linenumber as argument"
exit(1);
}
}
$1 ~ search { printf( "line %s is on page %d\n", search, ffcount) }
/[\f]/ { ffcount++ }
Use it like awk -f formfeeds.awk formfeeds.txt 05 where formfeeds.awk is the script, formfeeds.txt is the file and '05' is a line number.
The BEGIN rule deals mostly with the command line argument. The other rules are simple rules:
$1 ~ search applies when the first field matches the commandline argument stored in search
/[\f]/ applies when there is a formfeed

Replace all control characters in a range of line with awk

I have a file with several lines. Some of these lines contain LFs (0x0A) and CRs (0x0D), which I want to have removed. The point is that I want to replace them with a SPACE, but only within a range of characters of every line, e.g. in a file:
30 30 30 30 30 30 30 30 30 30 **0D 0A** 30 30 0A; 0000000000..00
30 30 30 30 30 30 30 30 **0D 0A** 30 30 30 30 0A; 00000000..0000
I want to remove 0d, 0a from position 0 to 12 in every line of the file.
I got
awk '{l=substr($0,1,12);r=substr($0,13);gsub(/\x00-\1F/," ",l);print l r}' ${f} > ${f}.noLF
but this seems not to work. I guess substr stops at the first 0x0d.
Is there another solution?
awk '/\r$/ && length < 13 {sub(/\r$/,""); printf "%s ", $0; next} {print}' file
Here is something ugly that may work!
Save it as go
#!/bin/bash
while :
do
# Read 13 bytes from stdin, and replace carriage returns and linefeeds with spaces
dd bs=13 count=1 2>/dev/null | tr '\r\n' ' '
# Break out of loop if dd was not successful
[ ${PIPESTATUS[0]} -ne 0 ] && break
# Get rest of conventional line, breaking out of loop if EOF
read rest || break
echo $rest
done
It reads 13 bytes from your file and removes all carriage returns and linefeeds. Then it reads the rest of the conventional line and outputs that.
Use it like this:
chmod +x go
./go < yourfile
Example:
more file
q
wertyuiopqwertyuiop
qwerty
uiopqwertyuiop
./go < file
q wertyuiopqwertyuiop
qwerty uiopqwertyuiop
EDITED TO MATCH FURTHER QUESTIONS
#!/bin/bash
while :
do
# Read 13 bytes from stdin, and replace carriage returns and linefeeds with spaces
first13=$(dd bs=13 count=1 2>/dev/null)
ddexitstatus=$?
if echo "$first13" | grep -q "^KT"; then
echo "$first13"
else
echo "$first13" | tr '\r\n' ' '
fi
# Break out of loop if dd was not successful
[ $ddexitstatus -ne 0 ] && break
# Get rest of conventional line, breaking out of loop if EOF
read rest || break
echo $rest
done
