Measure field/column width in fixed width output - Finding delimiters? [closed] - bash

In the context of the bash shell and command output:
Is there a process/approach to help determine/measure the width of fields that appear to be fixed width?
(apart from the mark one human eyeball and counting on the screen method....)
If the output appears to be fixed width, is it possible/likely that it's actually delimited by some sort of non-printing character(s)?
If so, how would I go about hunting down said character?
I'm mostly after a way to do this in bash shell/script, but I'm not averse to a programming language approach.
Sample Worst Case Data:
Name                   value 1   empty_col   simpleHeader   complex multi-header
foo bar                                      -someVal1      1someOtherVal
monty python circus                          -someVal2      2someOtherVal
exactly the field_widthNextVal               -someVal3      3someOtherVal
My current approach:
The best I have come up with is redirecting the output to a file, then using a ruler/index type of feature in the editor to manually work out field widths. I'm hoping there is a smarter/faster way...
What I'm thinking:
With Headers:
Perhaps an approach that measures from the first character 'to the next character that is encountered, after having already encountered multiple spaces'?
Without Headers:
Drawing a bit of a blank on this one....?
This strikes me as the kind of problem that was cracked about 40 years ago though, so I'm guessing there are better solutions than mine to this stuff...
Some Helpful Information:
Column Widths
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
This is proving to be helpful for determining column widths. I don't fully understand how it works well enough yet to give a complete explanation, but it might be helpful to someone else in the future. Source: https://unix.stackexchange.com/questions/465170/parse-output-with-dynamic-col-widths-and-empty-fields
File Examination
Redirect output to a file:
command > file.data
Use hexdump or xxd against file.data to look at its raw contents. See the links below for some basics on those tools:
hexdump output vs xxd output
https://nwsmith.blogspot.com/2012/07/hexdump-and-xxd-output-compared.html?m=1
hexdump
https://man7.org/linux/man-pages/man1/hexdump.1.html
https://linoxide.com/linux-how-to/linux-hexdump-command-examples/
https://www.geeksforgeeks.org/hexdump-command-in-linux-with-examples/
xxd
https://linux.die.net/man/1/xxd
https://www.howtoforge.com/linux-xxd-command/

tl;dr:
# Determine Column Widths
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(appropriate-command | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')
# Iterate
while IFS= read -r line
do
# You can build the awk command in a separate variable if this is clearer to you
awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
# Or do it all in one line if you prefer:
field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"
*** Code Stuff Here ***
done <<< "$(appropriate-command)"
Some explanation of the above - for newbies (like me)
Okay, so I'm a complete newbie, but this is my answer, based on a grand total of about two days of clawing around in the dark. This answer is relevant to those who are also new and trying to process data in the bash shell and bash scripts.
Unlike the *nix wizards and warlocks who have presented many of the solutions you will find to specific problems (some impressively complex), this is just a simple outline to help people understand what it is that they probably don't know that they don't know. You will have to go and look this stuff up separately; it's way too big to cover it all here.
EDIT:
I would strongly suggest just buying a book/video/course for shell scripting. You do learn a lot doing it the school-of-hard-knocks way as I have for the last couple of days, but it's proving to be painfully slow. The devil is very much in the details with this stuff. A good structured course probably instils good habits from the get go too, rather than you developing your own habits/shorthand that 'seems to work' but will likely, and unwittingly, bite you later on.
Resources:
Bash references:
https://linux.die.net/man/1/bash
https://tldp.org/LDP/Bash-Beginners-Guide
https://www.gnu.org/software/bash/manual/html_node
Common Bash Mistakes, Traps and Pitfalls:
https://mywiki.wooledge.org/BashPitfalls
http://www.softpanorama.org/Scripting/Shellorama/Bash_debugging/typical_mistakes_in_bash_scripts.shtml
https://wiki.bash-hackers.org/scripting/newbie_traps
My take is that there is no 'one right way that works for everything' to achieve this particular task of processing fixed width command output. Notably, the fixed widths are dynamic and might change each time the command is run. It can be done somewhat haphazardly using standard bash tools (it depends on the types of values in each field, particularly whether they contain whitespace or unusual/control characters). That said, expect any fringe cases to trip up the 'one bash pipeline to parse them all' approach, unless you have really looked at your data and it's quite well sanitised.
My uninformed, basic approach:
Pre-reqs:
To get much out of all this:
Learn the basics of how IFS= read -r line (and its variants) works; it's one way of processing multiple lines of data, one line at a time. When doing this, you need to be aware of how things are expanded differently by the shell.
Grasp the basics of process substitution and command substitution, understand when data is being manipulated in a sub-shell, otherwise it disappears on you when you think you can recall it later.
It helps to grasp what Regular Expressions (regex) are. Half of the hieroglyphics that you encounter are probably regex in action.
Even further, it helps to understand when/what/why you need to 'escape' certain characters, at certain times, as this is why there are even more backslashes (\) than you would expect amongst the hieroglyphics.
When doing redirection, be aware of the difference in > (overwrites without prompting) and >> (which appends to any existing data).
Understand differences in comparison operators and conditional tests (such as used with if statements and loop conditions).
if [ cond ] is not necessarily the same as if [[ cond ]]
look into the basics of arrays, and how to load, iterate over and query their elements.
bash -x script.sh is useful for debugging. Targeted debugging of specific lines is done by wrapping them in set -x ... set +x within the script, as in the sketch below.
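For example, a minimal sketch of targeted tracing (the file name is just a placeholder):
#!/bin/bash
# ... quiet setup code ...
set -x    # from here on, bash prints each command before running it
fieldwidths=$(head -n 1 file.data | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
set +x    # tracing off again; the rest of the script runs quietly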
As for the fixed width data:
If it's delimited:
Use the delimiter. Most *nix tools use a single white space as a default delimiter, but you can typically also set a specific delimiter (google how to do it for the specific tool).
Optional Step:
If there is no obvious delimiter, you can check to see if there is some secret hidden delimiter to take advantage of. There probably isn't, but you can feel good about yourself for checking. This is done by looking at the hex data in the file. Redirect the output of a command to a file (if you don't have the data in a file already). Do it using command > file.data and then explore file.data using hexdump -Cv file.data (another tool is xxd).
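As a sketch of what that check looks like (the hex below is only illustrative): a tab delimiter would show up as 09 between fields and a NUL as 00; if all you see between fields is runs of 20 (the space character), the output really is just space-padded fixed width.
command > file.data
hexdump -Cv file.data | head -n 2
# 00000000  4e 61 6d 65 20 20 20 20  20 20 20 76 61 6c 75 65  |Name       value|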
If you're stuck with fixed width:
Basically to do something useful, you need to:
Read line by line (i.e. record by record).
Split the lines into their columns (i.e. field by field, this is the fixed-width aspect)
Check that you are really doing what you think you are doing; particularly if expanding or redirecting data. What you see on shell as command output, might not actually be exactly what you are presenting to your script/pipe (most commonly due to differences in how the shell expands args/variables, and tends to automatically manipulate whitespace without telling you...)
Once you know exactly what your processing pipe/script is seeing, you can then tidy up any unwanted whitespace and so forth.
Starting Guidelines:
Feed the pipe/script an entire line at a time, then chop up fields (unless you really know what you are doing). Doing the field separation inside any loops such as while IFS= read -r line; do stuff; done is less error prone in terms of the 'what is my pipe actually seeing' problem. When I was doing it outside, it tended to produce more scenarios where the data was being modified without me understanding that it was being altered (let alone why), before it even reached the pipe/script. This obviously meant I got extremely confused as to why a pipe that worked in one setting on the command line fell over when I 'fed the same data' in a script or by some other method (but the pipe really wasn't getting the same data). This comes back to preserving whitespace with fixed-width data, particularly during expansion, redirection, process substitution and command substitution. Typically it amounts to liberal use of double quotes when calling a variable, i.e. not $someData but "$someData". Use braces to make clear which variable you mean, i.e. ${var}bar rather than $varbar. Similarly when capturing the entire output of a command.
If there is nothing to leverage as a delimiter, you have some choices. Hack away directly at the fixed width data using tools like:
cut -c n1-n2 this directly cuts things out, starting from character n1 through to n2.
awk '{print $1}' this splits fields on runs of whitespace by default and prints the first field.
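For example, assuming (and it is only an assumption) that the Name column in the sample occupies the first 23 characters and the next column runs to character 33:
cut -c 1-23 file.data    # just the Name column
cut -c 24-33 file.data   # just the second column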
Or, you can try to be a bit more scientific and 'measure twice, cut once'.
You can work out the field widths fairly easily if there are headers. This line is particularly helpful (sourced from an answer I link below):
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
echo $fieldwidths
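A rough sketch of what each stage of that pipeline does, using a made-up header (your real numbers will differ):
printf 'Name       simpleHeader   otherCol\n' |
  grep -Po '\S+\s*' |                 # each header word plus its trailing spaces, one chunk per line
  awk '{printf "%d ", length($0)}'    # print the length of each chunk
# prints something like: 11 15 8
# i.e. exactly the space-separated width list that gawk's FIELDWIDTHS expects.
# Caveat: a header containing spaces (like "value 1" in the sample above) gets
# split into two chunks, so the widths only come out right for single-word headers.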
You can also look at all the data to see what length of data you are seeing in each field, and if you are actually getting the number of fields you expect (Thanks to David C. Rankin for this one!):
awk '{ for (i=1; i<=NF; i++) printf "%d\t",length($i) } {print ""}' file.data
With that information, you can then set about chopping fields up with a bit more certainty that you are actually capturing the entire field (and only the entire field).
Tool options are many and varied, but I'm finding GNU awk (gawk) and perl's unpack to be the clearest. As part of a pipe/script consider this (sub in your relevant field widths and whichever field you want out in the {print $fieldnumber}, obviously):
awk 'BEGIN {FIELDWIDTHS="10 20 30 10"}{print $1}'
For command output with dynamic field widths, if you feed it into a while IFS= read -r line; do; done loop, you will need to parse the output using the awk above, as each time the field widths might have changed. Since I originally couldn't get the expansion right, I built the awk command on a separate line and stored it in a variable, which I then called in the pipe. Once you have it figured out though, you can just shove it all back into one line if you want:
# Determine Column Widths:
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(appropriate-command | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')
# Iterate
while IFS= read -r line
do
# Separate the awk command if you want:
# This uses GNU awk's FIELDWIDTHS to split the line into fixed-width fields, then sed trims leading and trailing spaces from the result.
awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
# Or do it all in one line, rather than two:
field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"
if [ "${DELETIONS[0]}" == 'all' ] && [ "${#DELETIONS[#]}" -eq 1 ] && [ "$field1" != 'UUID' ]; then
*** Code Stuff ***
fi
*** More Code Stuff ***
done <<< "$(appropriate-command)"
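An alternative way to feed the loop, if you prefer to avoid the here-string entirely, is process substitution (a sketch):
while IFS= read -r line
do
    # ... same per-line processing as above ...
    :
done < <(appropriate-command)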
Remove excess whitespace using various approaches:
tr -d '[:blank:]' and/or tr -d '[:space:]' (the latter eliminates newlines and vertical whitespace, not just horizontal like [:blank:] does; both also remove internal whitespace).
sed 's/^[ ]*//;s/[ ]*$//' this cleans up only leading and trailing whitespace.
Now you should basically have clean, separated fields to work with one at a time, having started from multi-field, multi-line command output.
Once you get what is going on fairly well with the above, you can start to look into other more elegant approaches as presented in these answers:
Finding Dynamic Field Widths:
https://unix.stackexchange.com/a/465178/266125
Using perl's unpack:
https://unix.stackexchange.com/a/465204/266125
Awk and other good answers:
https://unix.stackexchange.com/questions/352185/awk-fixed-width-columns
Some stuff just can't be done in a single pass. Like the perl answer above, it basically breaks the problem down into two parts. The first is turning the fixed width data into delimited data (just choose a delimiter that doesn't occur within any of the values in your fields/records!). Once you have it as delimited data, the processing becomes substantially easier from there on out.
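As a sketch of that two-step idea using gawk (the widths and the | delimiter here are placeholders, chosen on the assumption that | never appears in the data):
# Step 1: turn fixed-width records into |-delimited records.
# Assigning $1=$1 forces gawk to rebuild each line using OFS.
awk 'BEGIN {FIELDWIDTHS="23 10 12 15 20"; OFS="|"} {$1=$1; print}' file.data > file.delim
# Step 2: ordinary delimiter-based tools now work; fields still carry their
# padding, so trim with the sed expression from earlier if needed.
cut -d'|' -f4 file.delim
awk -F'|' '{print $1, $4}' file.delim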

Related

How to rewrite a bad shell script to understand how to perform similar tasks? [closed]

So, I wrote a bad shell script (according to several questions, one of which I asked) and now I am wondering which way to go to perform the same, or similar, task(s).
I honestly have no clue about which tool may be best for what I need to achieve and I hope that, by understanding how to rewrite this piece of code, it will be easier to understand which way to go.
There we go:
# read reference file line by line
while read -r linE;
do
# field 2 will be grepped
pSeq=`echo $linE | cut -f2 -d" "`
# field 1 will be used as filename to store the grepped things
fName=`echo $linE | cut -f1 -d" "`
# grep the thing in a very big file
grep -i -B1 -A2 "^"$pSeq a_very_big_file.txt | sed 's/^--$//g' | awk 'NF' > $dir$fName".txt"
# grep the same thing in another very big file and store it in the same file as above
grep -i -B1 -A2 "^"$pSeq another_very_big_file.txt | sed 's/^--$//g' | awk 'NF' >> $dir$fName".txt"
done < reference_file.csv
At this point I am wondering... how to achieve the same result without using a while loop to read reference_file.csv line by line? What is the best way to go, to solve similar problems?
EDIT: when I mentioned the two very_big_files, I am talking > 5GB.
EDIT II: these should be the format of the files:
reference_file.csv:
object pattern
oj1 ptt1
oj2 ptt2
... ...
ojN pttN
a_very_big_file and another_very_big_file:
>head1
ptt1asequenceofcharacters
+
asequenceofcharacters
>head2
ptt1anothersequenceofcharacters
+
anothersequenceofcharacters
>headN
pttNathirdsequenceofcharacters
+
athirdsequenceofcharacters
Basically, I search for pattern in the two files, then I need to get the line above and the two below each match. Of course, not all the lines in the two files match with the patterns in the reference_file.csv.
Global Maxima
Efficient bash scripts are typically very creative and nothing you can achieve by incrementally improving a naive solution.
The most important part of finding efficient solutions is to know your data. Every restriction you can make allows optimizations. Some examples that can make a huge difference:
- The input is sorted or data in different files has the same order.
- The elements in a list are unique.
- One of the files to be processed is way bigger than the others.
- The symbol X never appears in the input or only appears at special places.
- The order of the output does not matter.
When I try to find an efficient solution, my first goal is to make it work without an explicit loop. For this, I need to know the available tools. Then comes the creative part of combining these tools. To me, this is like assembling a jigsaw puzzle without knowing the final picture. A typical mistake here is similar to the XY problem: After you assembled some pieces, you might be fooled into thinking you'd know the final picture and search for a piece Y that does not exist in your toolbox. Frustrated, you implement Y yourself (typically by using a loop) and ruin the solution.
If there is no right piece for your current approach, either use a different approach or give up on bash and use a better scripting/programming language.
Local Maxima
Even though you might not be able to get the best solution by improving a bad solution, you still can improve it. For this you don't need to be very creative if you know some basic anti-patterns and their better alternatives. Here are some typical examples from your script:
Some of these might seem very small, but starting a new process is way more expensive than one might suppose. Inside a loop, the cost of starting a process is multiplied by the number of iterations.
Extract multiple fields from a line
Instead of calling cut for each individual field, like this:
while read -r line; do
field1=$(echo "$line" | cut -f1 -d" ")
field2=$(echo "$line" | cut -f2 -d" ")
...
done < file
use read to split the line into all the fields at once:
while read -r field1 field2 otherFields; do
...
done < file
Combinations of grep, sed, awk
Everything grep (in its basic form) can do, sed can do better. And everything sed can do, awk can do better. If you have a pipe of these tools you can combine them into a single call.
Some examples of (in your case) equivalent commands, one per line:
sed 's/^--$//g' | awk 'NF'
sed '/^--$/d'
grep -vFxe--
grep -i -B1 -A2 "^$pSeq" | sed 's/^--$//g' | awk 'NF'
awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}'
Multiple grep on the same file
You want to read files at most once, especially if they are big. With grep -f you can search multiple patterns in a single run over one file. If you just wanted to get all matches, you would replace your entire loop with
grep -i -B1 -A2 -f <(cut -f2 -d' ' reference_file | sed 's/^/^/') \
a_very_big_file another_very_big_file
But since you have to store different matches in different files ... (see next point)
Know when to give up and switch to another language
Dynamic output files
Your loop generates multiple files. The typical command line utils like cut, grep and so on only generate one output. I know only one standard tool that generates a variable number of output files: split. But that does not filter based on values, but on position. Therefore, a non-loop solution for your problem seems unlikely. However, you can optimize the loop by rewriting it in a different language, e.g. awk.
Loops in awk are faster ...
time awk 'BEGIN{for(i=0;i<1000000;++i) print i}' >/dev/null # takes 0.2s
time for ((i=0;i<1000000;++i)); do echo $i; done >/dev/null # takes 3.3s
seq 1000000 > 1M
time awk '{print}' 1M >/dev/null # takes 0.1s
time while read -r l; do echo "$l"; done <1M >/dev/null # takes 5.4s
... but the main speedup will come from something different. awk has everything you need built into it, so you don't have to start new processes. Also ... (see next point)
Iterate the biggest file
Reduce the number of times you have to read the biggest files. So instead of iterating reference_file and reading both big files over and over, iterate over the big files once while holding reference_file in memory.
Final script
To replace your script, you can try the following awk script. This assumes that ...
the filenames (first column) in reference_file are unique
the two big files do not contain > except for the header
the patterns (second column) in reference_file are not prefixes of each other.
If this is not the case, simply remove the break.
awk -v dir="$dir" '
FNR==NR {max++; file[max]=$1; pat[max]=$2; next}
{
for (i=1;i<=max;i++)
if ($2~"^"pat[i]) {
printf ">%s", $0 > dir"/"file[i]
break
}
}' reference_file RS=\> FS=\\n a_very_big_file another_very_big_file
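To see what the trailing RS=\> FS=\\n assignments do to the two big files, here is a throwaway sketch (the NR>1 guard skips the empty record before the first >):
awk 'BEGIN{RS=">"; FS="\n"} NR>1 {print "record", NR-1, "header:", $1, "starts:", $2}' a_very_big_file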

Performance Tuning an AWK?

I've written a simple parser in BASH to take apart csv files and dump to a (temp) SQL-input file. The performance on this is pretty terrible; when running on a modern system I'm barely cracking 100 lines per second. I realize the ultimate answer is to rewrite this in a more performance oriented language, but as a learning opportunity, I'm curious where I can improve my BASH skills.
I suspect there are gains to be made by writing to RAM instead of to a file, then flushing all the text at once to the file, but I'm not clear on where/when BASH gets upset about memory usage (largest files I've parsed have been under 500MB).
The following code-block seems to eat most of the cycles, and as I understand it, needs to be processed linearly due to checking timestamps (the data has a timestamp, but no date stamp, so I was forced to ask the user for the start-day and check if the timestamp has cycled 24:00 -> 0:00), so parallel processing didn't seem like an option.
while read p; do
linetime=`printf "${p}" | awk '{printf $1}'`
# THE DATA LACKS FULL DATESTAMPS, SO FORCED TO ASK USER FOR START-DAY & CHECK IF THE DATE HAS CYCLED
if [[ "$lastline" > "$linetime" ]]
then
experimentdate=$(eval $datecmd)
fi
lastline=$linetime
printf "$p" | awk -v varout="$projname" -v experiment_day="$experimentdate " -v singlequote="$cleanquote" '{printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ("singlequote""varout""singlequote","singlequote""experiment_day $1""singlequote","singlequote""$1""singlequote","$2","$3");\n"}' >> $sql_input_file
Ignore the singlequote nonsense, I needed this to run on both OSX & 'nix, so I had to workaround some issues with OSX's awk and singlequotes.
Any suggestions for how I can improve performance?
You do not want to start awk for every line you process in a loop. Replace your loop with awk or replace awk with builtin commands.
Both awk calls are only used for printing. Replace these lines with additional parameters to the printf command.
I did not understand the code block for datecmd (it does not use $linetime, only the output variable experimentdate), but this is one that should be optimised: can you use regular expressions or some other trick?
So you do not have to tune awk, but rather decide to either use awk for everything or get it out of your while-loop.
Your performance would improve if you did all the processing with awk. Awk can read your input file directly, express conditionals, and run external commands.
Awk is not the only one either. Perl and Python would be well suited to this task.
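As a hedged sketch of what "all the processing in awk" could look like for this loop (it mirrors the original variables; $infile, the $datecmd pipe and the quoting are assumptions, not a tested drop-in):
awk -v proj="$projname" -v day="$experimentdate" -v datecmd="$datecmd" -v q="$cleanquote" '
NR > 1 && $1 < last {            # timestamp went backwards: assume the day has rolled over
    datecmd | getline day        # ask the user-supplied date command for the new day
    close(datecmd)
}
{
    last = $1
    printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values (%s%s%s,%s%s %s%s,%s%s%s,%s,%s);\n",
           q, proj, q,  q, day, $1, q,  q, $1, q,  $2, $3
}' "$infile" > "$sql_input_file"
Most of the speedup comes from not forking printf and awk for every line; the date command still forks, but only when the day actually rolls over.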

How can I get the SOA serial number from a file with sed?

I store my SOA data for multiple domains in a single file that gets $INCLUDEd by zone files. I've written a small sed script that is supposed to get the serial number, increment it, then re-save the SOA file. It all works properly as long as the SOA file is in the proper format, with the entire record on one line, but it fails as soon as the record gets split into multiple lines.
For example, this works as input data:
# IN SOA dnsserver. hostmaster.example.net. ( 2013112202 21600 900 691200 86400 )
But this does not:
# IN SOA dnsserver. hostmaster.example.net. (
2013112202 ; Serial number
21600 ; Refresh every day, 86400 is 1 day
900 ; Retry refresh every 15 min
691200 ; Expire every 8 days
86400 ) ; Minimum TTL 1 day
I like comments, and I would like to spread things out. But I need my script to be able to find the serial number so that I can increment it and rewrite the file.
The SED that works on the single line is this:
SOA=$(sed 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' $SOAfile)
But for multi-line ... I'm a bit lost. I know I can join lines with N, but how do I know if I even need to? Do I need to write separate sed scripts based on some other analysis I do of the original file?
Please help! :-)
I wouldn't use sed for this. While you might be able to brute-force something, it would require a large amount of concentration to come up with it, and it would look like line noise, and so be almost unmaintainable afterwards.
What about this in awk?
The easiest way might be to split your records based on the # character, like so:
SOA=$(awk 'BEGIN{RS="#"} NR==2{print $6}' $SOAfile)
But that will break if you have comments containing # before the uncommented line, or if you have any comments between the # and the serial number. You could make a pipe to avoid these issues...
SOA=$(sed 's/;.*//;/^#/p;1,/^#/d' $SOAfile | awk 'BEGIN{RS="#"} NR==2{print $6}')
It may seem redundant to remove comments and strip the top of the file, but there could be other lines like #include which (however unlikely) could contain your record separator.
Or you could do something like this in pure awk:
SOA=$(awk -v field=6 '/^#/ { if($2=="IN"){field++} for(i=1;i<field;i++){if(i==NF){field=field-NF;getline;i=1}} print $field}' $SOAfile)
Or, broken out for easier reading:
awk -v field=6 '
/^#/ {
if ($2=="IN") {field++;}
for (i=1;i<field;i++) {
if(i==NF) {field=field-NF;getline;i=1;}
}
print $field; }' $SOAfile
This is flexible enough to handle any line splitting you might have, as it counts to field along multiple lines. It also adjusts the field number based on whether your zone segment contains the optional "IN" keyword.
A pure-sed solution would, instead of counting fields, use the first string of digits after an open bracket after your /^#/, like this:
SOA=$(sed -n '/^#/,/^[^;]*)/H;${;x;s/.*#[^(]*([^0-9]*//;s/[^0-9].*//;p;}' $SOAfile)
Looks like line noise, right? :-) Broken out for easier reading, it looks like this:
/^#/,/^[^;]*)/H # "Hold" the meaningful part of the file...
${ # Once we reach the end...
x # Copy the hold space back to the main buffer
s/.*#[^(]*([^0-9]*// # Remove stuff ahead of the serial
s/[^0-9].*// # Remove stuff after the serial
p # And print.
}
The idea here is that starting from the first line that begins with #, we copy the file into sed's hold space, then at the end of the file, do some substitutions to strip out all the text up to the serial number, and then after the serial number, and print whatever remains.
All of these work on single line and multi line zone SOA records I've tested with.
You can try the following - it's your original sed program preceded by commands to first read all input lines, if applicable:
SOA=$(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' \
"$SOAfile")
This form will work with both single- and multi-line input files.
Multi-line input files are first read as a whole before applying the substitutions.
Note: The awkward separate -e options are needed to keep FreeBSD happy with respect to labels and branching commands, which need a literal \n for termination - using separate -e options is a more readable alternative to splicing in literal newlines with $'\n'.
Alternative solution, using awk:
SOA=$(awk -v RS='#' '$1 == "IN" && $2 == "SOA" { print $6 }' "$SOAfile")
Again, this will work with both single- and multi-line record definitions.
The only constraint is that comments must not precede the serial number.
Additionally, if a file contained multiple records, the above would collect ALL serial numbers, separated by a newline each.
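Once $SOA holds the serial, a minimal sketch for bumping it and writing it back (this assumes the serial value appears nowhere else in the file, and GNU sed for in-place editing):
NEWSOA=$((SOA + 1))
sed -i "s/$SOA/$NEWSOA/" "$SOAfile"    # BSD/macOS sed needs: sed -i '' ...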
Why sed? grep is simplest in this case:
grep -A1 -e '#.*SOA' 1 | grep -oe '[0-9]*'
or: (maybe better):
grep -A1 -e '#.*SOA' 1 | grep 'Serial number' | grep -oe '[0-9]*'
This might work for you (GNU sed):
sed -nr '/# IN SOA/{/[0-9]/!N;s/[^0-9]+([0-9]+).*/\1/p}' file
For lines that contain # IN SOA if the line contains no numbers append the next line. Then extract the first sequence of numbers from the line(s).

Delete a specific string with tr

Is it possible to delete a specific string with tr command in a UNIX-Shell?
For example: If I type:
tr -d "1."
and the input is 1.1231, it would show 23 as an output, but I want it to show 1231 (notice only the first 1 has gone). How would I do that?
If you know a solution or a better way, please explain the syntax since I don't want to just copy&paste but also to learn.
I have huge problems with awk, so if you use this, please explain it even more.
In your example above the cut command would suffice.
Example: echo '1.1231' | cut -d '.' -f 2 would return 1231.
For more information on cut, just type man cut.
You would be better off using some kind of regex (maybe something like sed).
For example, with the input 1.1231 you could use the following to get the 1231 output:
sed 's/1\.//g'
Maybe have a look here:
http://tldp.org/LDP/abs/html/string-manipulation.html
You could also use sed for this kind of thing:
$ echo "1.1231" | sed -e "s/1\.//"
1231
This is just using sed to run a regular expression search and replace, replacing "1." (with appropriate escaping) with "". It only deletes the first match by default.
If you are using bash, you can do this easily with parameter substitution:
$ a=1.1231
$ echo ${a#1.}
1231
This will remove the leading "1." string. If you want to remove up to and including the first occurrence, use ${a#*1.} and if you want to remove everything up to and including the last occurrence, use ${a##*1.}.
The TLDP page on string manipulation has further options (such as substring extraction).
Note that using standard sh built-in string manipulation tools for such simple transformations will always be much faster than using an external tool, such as sed, awk or cut, because the shell doesn't have to create a sub-process to perform the operation. However, for more complicated things (e.g. you need to use regular expressions or when the input is large), you're better off using the dedicated tools.
Since you asked specifically about awk, here is another one.
awk '{ gsub(/1\./,"") }1' input.txt
As any awk tutorial will tell you, the general form of an awk program is a sequence of 'condition { actions }'. If you have no actions, the default action is to print. If you have no conditions, the actions will be taken unconditionally. This program uses both of these special cases.
The first part is an action without a condition, i.e. it will be taken for all lines. The action is to substitute all occurrences of the regular expression /1\./ with nothing. So this will trim any '1.' (regardless of context) from a line.
The second part is a condition without an action, i.e. it will print if the condition is true, and the condition is always true. This is a common idiom for "we are done -- print whatever we have now". It consists simply of the constant 1 (which when used as a condition means "true", simply).
This could be reformulated in a number of ways. For example, you could factor the print into the first action;
awk '{ gsub(/1\./,""); print }' input.txt
Perhaps you want to substitute the integer part, i.e. any numbers before a period sign. The regex for that would be something like /[0-9]+\./.
gsub is not available in some very old awk implementations, so you might want to replace it with sub or some sort of loop if you need portability to legacy awk.
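For example, the program above on the sample input, plus the integer-part variant from the previous paragraph (using sub so only the first match goes):
$ echo "1.1231" | awk '{ gsub(/1\./,"") }1'
1231
$ echo "12.1231" | awk '{ sub(/[0-9]+\./,"") }1'
1231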

Reading millions of files (in a certain order) and putting them into one big file --- fast

In my bash script I have the following (for concreteness I preserve the original names;
sometimes people ask about the background etc., and then the original names make more sense):
tail -n +2 Data | while read count phi npa; do
cat Instances/$phi >> $nF
done
That is, the first line of file Data is skipped, and then all lines, which are of
the form "r c p n", are read, and the content of files Instances/p is appended
to file $nF (in the order given by Data).
In typical examples, Data has millions of lines. So perhaps I should write a
C++ application for that. However I wondered whether somebody knew a faster
solution just using bash?
Here I use cut instead of your while loop, but you could re-introduce that if it provides some utility to you. The loop would have to output the phi variable once per iteration.
tail -n +2 Data | cut -d' ' -f2 | sed 's|^|Instances/|' | xargs cat >> "$nF"
This batches as many filenames as possible into each cat invocation, which should improve efficiency. I also believe that using cut here will improve things further.
