Extract line from text file based on leading characters of each line - bash

I have a very large data dump that I need to manipulate. Basically, I receive a text file that contains data from multiple tables. The first two characters of each line tell me which table the line comes from. I need to read each line and append it to a text file, with each table getting its own text file.
For example, let's say the data file looks like this:
HDxxxxxxxxxxxxx
HDyyyyyyyyyyyyy
ENxxxxxxxxxxxxx
ENyyyyyyyyyyyyy
HSyyyyyyyyyyyyy
What I would need is the first two lines to be in a text file named HD_out.txt, the 3rd and 4th lines in one named EN_out.txt, and the last one in a file named HS_out.txt.
Does anyone know how this could be done with either a simple batch file or a UNIX shell script?

Use gawk to split the file based on the first 2 characters of each line (FIELDWIDTHS is a gawk extension):
gawk -v FIELDWIDTHS='2 99999' '{print $2 > $1"_out.txt"}' input.txt
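If gawk is not available, the same split can be sketched with substr(), which any POSIX awk should support. The file name input.txt and the sample rows come from the question; the temporary directory is only there to keep the sketch self-contained.

```shell
#!/usr/bin/env bash
# Sketch: split on the first two characters using plain awk (no FIELDWIDTHS).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' HDxxxxxxxxxxxxx HDyyyyyyyyyyyyy ENxxxxxxxxxxxxx \
              ENyyyyyyyyyyyyy HSyyyyyyyyyyyyy > input.txt

# substr($0, 1, 2) is the table prefix; substr($0, 3) is the rest of the line.
# Within one awk run, ">" truncates a file on first use and appends afterwards.
awk '{ out = substr($0, 1, 2) "_out.txt"; print substr($0, 3) > out }' input.txt

head -- *_out.txt
```

With a very large number of distinct prefixes, some awk implementations may run into a limit on open files; calling close(out) after each print and writing with ">>" instead of ">" is one possible workaround.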

Using bash:
while IFS= read -r line; do
    echo "${line:2}" >> "${line:0:2}_out.txt"
done < inputFile
${var:startposition:length} is bash's substring expansion (a form of parameter expansion). The loop splits your input file based on the first two characters of each line. If you want to keep the table prefix in the output, use echo "$line" >> "${line:0:2}_out.txt" instead of what is shown above.
Demo:
$ ls
file
$ cat file
HDxxxxxxxxxxxxx
HDyyyyyyyyyyyyy
ENxxxxxxxxxxxxx
ENyyyyyyyyyyyyy
HSyyyyyyyyyyyyy
$ while read -r line; do echo "${line:2}" >> "${line:0:2}_out.txt"; done < file
$ ls
EN_out.txt file HD_out.txt HS_out.txt
$ head *.txt
==> EN_out.txt <==
xxxxxxxxxxxxx
yyyyyyyyyyyyy
==> HD_out.txt <==
xxxxxxxxxxxxx
yyyyyyyyyyyyy
==> HS_out.txt <==
yyyyyyyyyyyyy
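A slightly more defensive variant of the loop, as a sketch: it keeps the table prefix, preserves leading and trailing whitespace (IFS=), still processes a final line that lacks a trailing newline, and uses printf so that a line starting with a dash is not taken as an echo option. Skipping lines shorter than two characters is my assumption, not something the question specifies.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample; the last line deliberately has no trailing newline.
printf 'HDxxx\nHDtrailing  \nENyyy\nHSzzz' > file

while IFS= read -r line || [[ -n $line ]]; do
    # Assumption: lines too short to carry a two-character prefix are skipped.
    [[ ${#line} -ge 2 ]] || continue
    printf '%s\n' "$line" >> "${line:0:2}_out.txt"
done < file

head -- *_out.txt
```

Because the loop appends with >>, rerunning it adds to existing output files; remove the old *_out.txt files first if that is not wanted.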

Related

How to split a text file content by a string?

Suppose I've got a text file that consists of two parts separated by delimiting string ---
aa
bbb
---
cccc
dd
I am writing a bash script to read the file and assign the first part to var part1 and the second part to var part2:
part1= ... # should be aa\nbbb
part2= ... # should be cccc\ndd
How would you suggest writing this in bash?
You can use awk:
foo="$(awk 'NR==1' RS='---\n' ORS='' file.txt)"
bar="$(awk 'NR==2' RS='---\n' ORS='' file.txt)"
This reads the file twice, but handling text files in the shell by storing their content in variables should generally be limited to small files anyway. Given that your file is small, this shouldn't be a problem. Note that a multi-character RS is not specified by POSIX; it works in gawk, for example.
Note: Depending on your actual task, you may be able to just use awk for the whole thing. Then you don't need to store the content in shell variables or read the file twice.
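If reading the file twice is a concern, a single pass in plain bash is also possible. This is only a sketch: the variable names part1 and part2 come from the question, and treating only the first --- line as the delimiter is my assumption.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' aa bbb --- cccc dd > file.txt

part1='' part2='' seen=0
while IFS= read -r line || [[ -n $line ]]; do
    if (( seen == 0 )) && [[ $line == '---' ]]; then
        seen=1          # only the first delimiter splits; later ones stay in part2
        continue
    fi
    if (( seen )); then
        part2+=$line$'\n'
    else
        part1+=$line$'\n'
    fi
done < file.txt

# Drop the final newline, much as command substitution would.
part1=${part1%$'\n'}
part2=${part2%$'\n'}

printf 'part1=%q\npart2=%q\n' "$part1" "$part2"
```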
A solution using sed:
foo=$(sed '/^---$/q;p' -n file.txt)
bar=$(sed '1,/^---$/b;p' -n file.txt)
The -n command line option tells sed to not print the input lines as it processes them (by default it prints them). sed runs a script for each input line it processes.
The first sed script
/^---$/q;p
contains two commands (separated by ;):
/^---$/q - quit when you reach the line matching the regex ^---$ (a line that contains exactly three dashes);
p - print the current line.
The second sed script
1,/^---$/b;p
contains two commands:
1,/^---$/b - starting with line 1 until the first line matching the regex ^---$ (a line that contains only ---), branch to the end of the script (i.e. skip the second command);
p - print the current line;
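The two scripts above rely on GNU sed accepting -n after the script and a bare b followed by a semicolon. A variant that should be more portable, as a sketch: options come first, each command gets its own -e, and the branch is replaced by a negated address range.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' aa bbb --- cccc dd > file.txt

# Quit at the delimiter line; print every line seen before it.
foo=$(sed -n -e '/^---$/q' -e 'p' file.txt)

# "!" negates the range: print only lines outside 1..first "---".
bar=$(sed -n '1,/^---$/!p' file.txt)

printf '%s\n' "$foo" "$bar"
```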
Using csplit:
csplit --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}" && sed -i '/---/d' foo_bar*
If your coreutils version is 8.22 or later, the --suppress-matched option can be used and the sed post-processing is not required:
csplit --suppress-matched --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}"

Concatenate awk-output, string, and text file

I have the following two tab-separated files in my current directory.
a.tsv
do not use this line
but this one
and that too
b.tsv
three fields here
not here
For each tsv file there is an associated txt file in the same directory, with the same filename but different suffix.
a.txt
This is the a-specific text.
b.txt
Text associated to b.
For each pair of files I want to create a new file with the same name but the suffix _new.txt. The new files should contain all lines from the respective tsv file that contain exactly 3 fields, afterwards the string \n####\n, and then the whole content of the respective txt file. Thus, the following output files should be created.
Desired output
a_new.txt
but this one
and that too
####
This is the a-specific text.
b_new.txt
three fields here
####
Text associated to b.
Working, but bad solution
for file in ./*.tsv
do awk -F'\t' 'NF==3' $file > ${file//.tsv/_3_fields.tsv}
done
for file in ./*_3_fields.tsv
do cat $file <(printf "\n####\n") ${file//_3_fields.tsv/.txt} > ${file//_3_fields.tsv/_new.txt}
done
Non-working code
I'd like to get the result with one script, and avoid creating the intermediate file with the suffix _3_fields.tsv.
I tried command substitution as follows:
for file in ./*.tsv
do cat <<< $(awk -F'\t' 'NF==3' $file) <(printf "\n####\n") ${file//.tsv/.txt} > ${file//.tsv/_new.txt}
done
But this doesn't write the awk-processed part into the new files.
Yet, the command substitution seems to work if I only write the awk-processed part into the new file like follows:
for file in ./*.tsv; do cat <<< $(awk -F'\t' 'NF==3' $file) > ${file//.tsv/_new.txt}; done
I'd be interested in why the second last code doesn't work as expected, and what a good solution would be to do the task.
Maybe you wanted to redirect a sequence of commands
for file in ./*.tsv
do
{
awk -F'\t' 'NF==3' "$file"
printf "\n####\n"
cat "${file//.tsv/.txt}"
} > "${file//.tsv/_new.txt}"
done
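Note that the space after the opening brace and the semicolon or newline before the closing brace are required.
It also seems you are confusing command substitution $() with process substitution <() or >(). Also, <<< redirects a string to standard input (a here-string), whereas < redirects a file. In your attempt, cat is given file operands, so it never reads its standard input; the here-string is ignored unless - is also passed as an operand.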
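A minimal sketch of the here-string attempt made to work by adding - as an operand to cat, so that standard input is read in sequence with the other inputs. The file names come from the question; the tab placement in the sample rows is an assumption, since the question's tabs are not visible.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample files modelled on the question (tab positions are assumed).
printf 'do\tnot\tuse\tthis\tline\nbut\tthis\tone\nand\tthat\ttoo\n' > a.tsv
printf 'three\tfields\there\nnot\there\n' > b.tsv
printf 'This is the a-specific text.\n' > a.txt
printf 'Text associated to b.\n' > b.txt

for file in ./*.tsv; do
    base=${file%.tsv}
    # "-" tells cat where standard input (the here-string) goes in the sequence.
    cat - <(printf '\n####\n') "$base.txt" \
        <<< "$(awk -F'\t' 'NF==3' "$file")" > "${base}_new.txt"
done

cat ./a_new.txt
```

The brace group shown in the answer avoids the here-string entirely and is arguably the simpler choice; this variant only illustrates what was missing from the original command.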

How to add a header to text file in bash?

I have a text file that I want to convert to a CSV file. Before converting it, I want to add a header to the text file so that the CSV file has the same header. The text file has one thousand columns, and I want one thousand column names. As a side note, the content of the text file is just rows of numbers separated by commas. Is there any way to add the header line in bash?
I tried the approach below and it didn't work. I first ran the following in Python.
> for i in range(1001):
> print "col" + "_" + "i"
I saved the output in a text file with this command (python header.py >> header.txt) and then prepended it to my original text file like this:
cat header.txt filename.txt > newfilename.txt
Then I converted the txt file to a csv file with "mv newfilename.txt newfilename.csv".
Unfortunately, this doesn't work: the header line has double the number of the other rows for some reason. I would appreciate any help solving this problem.
Based on the description, your file is already comma-separated, so it is already a CSV file. You just want to add a header line of column numbers.
$ awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i,(i==NF?ORS:FS)}1' file
This adds as many column headers as there are fields in the first row of the file.
e.g.
$ seq 5 | paste -sd, | # create 1,2,3,4,5 as a test input
awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1'
col_1,col_2,col_3,col_4,col_5
1,2,3,4,5
You can generate the column names in bash using one of the options below. Each example generates a header.txt file. You already have code to add this to the beginning of your file as a header.
Using bash loops
Bash loops for this many iterations will be inefficient, but will work.
for i in {1..10}; do
echo -n "col_$i "
done > header.txt
echo >> header.txt
or using seq
for i in $(seq 1 1000); do
echo -n "col_$i "
done > header.txt
echo >> header.txt
Using seq only
Using seq alone will be more efficient.
seq -f "col_%g" -s" " 1 1000 > header.txt
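Since the data in the question is comma-separated, a header joined with commas may be what is needed in the end. A small sketch, assuming a three-column stand-in for the real file (ncols would be 1000 for the question's data) and the file names from the question:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Tiny stand-in for the real data file.
printf '1,2,3\n4,5,6\n' > filename.txt
ncols=3

# One name per line, then paste joins the lines with commas (no trailing comma).
header=$(printf 'col_%d\n' $(seq 1 "$ncols") | paste -sd, -)

# Write the header followed by the data to the new file.
{ printf '%s\n' "$header"; cat filename.txt; } > newfilename.csv

head -n 2 newfilename.csv
```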
Use seq and sed
You can use the seq utility to construct your CSV header, with a little minor help from Bash expansions. You can then insert the new header row into your existing CSV file, or concatenate the header with your data.
For example:
# construct a quoted CSV header
columns=$(seq -f '"col_%g"' -s', ' 1 1001)
# strip the trailing comma
columns="${columns%,*}"
# insert headers as first line of foo.csv with GNU sed
sed -i -e "1 i\\${columns}" /tmp/foo.csv
Caveats
If you don't have GNU sed, you can also use cat, sponge, or other tools to concatenate your header and data, although most of your concatenation options will require redirection to a new combined file to avoid clobbering your existing data.
For example, given /tmp/data.csv as your original data file:
seq -f '"col_%g"' -s', ' 1 1001 > /tmp/header.csv
sed -i -e 's/,[[:space:]]*$//' /tmp/header.csv
cat /tmp/header.csv /tmp/data.csv > /tmp/new_file.csv
Also, note that while Bash solutions that avoid calling standard utilities are possible, doing it in pure Bash might be too slow or memory intensive for large data sets.
Your mileage may vary.
printf "col%s," {1..100} |
sed 's/,$//' |
cat - filename.txt >newfilename.txt
I believe sed should supply the missing final newline as a side effect. If not, maybe try 's/,$/\n/' though this isn't entirely portable, either. You could probably replace the cat with sed as well, something like
... | sed 's/,$//;r filename.txt'
but again, I'm not entirely sure how portable this is.

Bash - read specific line from a file with all sorts of data and store as a variable

I have looked for an answer to what seems like a simple question, but I feel as though all these questions (below) only briefly touch on the matter and/or over-complicate the solution.
Read a file and split each line into two variables with bash program
Bash read from file and store to variables
Need to assign the contents of a text file to a variable in a bash script
What I want to do is read specific lines from a file (titled 'input'), store them in variables, and then use them.
For example, in this code, every 9th line after a certain point contains a filename that I want to store as a variable for later use. How can I do that?
steps=49
for((i=1;i<=${steps};i++)); do
...
g=$((9 * $i + 28)) #In.omega filename
For the bigger picture, I basically need to print a specific line (line 9) from the file whose name is specified in the gth line of the file named "input"
sed '1,39p;d' data > temp
sed "9,9p;d" [filename specified in line g of input] >> temp
sed '41,$p;d' data >> temp
mv temp data
Say you want to assign the 49th line of the file $FILE to the variable ARG; you can do:
$ ARG=$(head -n 49 "$FILE" | tail -n 1)
To get line 9 of the file named in the gth line of the file named input:
sed -n 9p "$(sed -n ${g}p input)"
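Put together with the index arithmetic from the question, the lookup can be sketched like this. The file In.omega, its contents, and the filler lines in input are hypothetical stand-ins; only the formula for g and the file name input come from the question. The {p;q;} form makes sed stop reading once the wanted line has been printed.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical data file whose 9th line we want.
printf 'omega line %d\n' $(seq 1 12) > In.omega

# Hypothetical "input": 40 lines, with a file name on line 37 (9 * 1 + 28).
for n in $(seq 1 40); do
    if [ "$n" -eq 37 ]; then echo In.omega; else echo "filler $n"; fi
done > input

i=1
g=$((9 * i + 28))                       # line number in "input" that holds the file name
name=$(sed -n "${g}{p;q;}" input)       # print line g, then quit
ninth=$(sed -n '9{p;q;}' "$name")       # line 9 of the named file

printf '%s\n' "$ninth"
```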
arg=$(sed -n '2p' sample.txt)
Here arg is the variable, sample.txt is the file, and 2 is the line number.

Edit files in Bash

I have a few files that contain IP addresses. I'm creating a script and have to figure out how to create a new user file with an IP address that is based off the file created before it. If the last file contains an IP of A.B.C.D the new file needs to be A.B.C.(D+4).
I think I need to use the 'sed' and 'awk' commands, but haven't been able to get anything working. How would I go about writing this part of the script?
Here's something to get you started: suppose there is a file called input that looks like this:
Input: contents of input
127.0.0.1
127.0.0.2
127.0.0.3
127.0.0.200
On the command line you can run:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input > output
Explanation on what awk is doing here:
awk '...' - invoke awk, a tool used primarily for line-by-line manipulation of files; the text enclosed in single quotes is the awk program.
BEGIN{FS=OFS="."} - tell awk to use . as the delimiter for both input and output. FS stands for "Field Separator" and OFS for "Output Field Separator".
{$4=$4+4; print} - $4 is the 4th field. Since . is the delimiter, D corresponds to the 4th field, and we add the integer 4 to it. A bare print is shorthand for printing the entire line.
input - pass the input file name as an argument to awk; no cat is needed.
> output - redirect the output to a file so you can inspect it for any issues before making the user files based on it.
Output: contents of output
127.0.0.5
127.0.0.6
127.0.0.7
127.0.0.204
And then you can read output one line at a time to create new user files as needed, maybe another script with something along the lines of:
while IFS= read -r line
do
    echo "this is a user file" > "$line"
done < output
(and adjust it to your needs)
Finally, as long as you understand what's going on in the above, you can skip the output file altogether and just do this all in a one-liner:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input | while read line; do echo "hello world" > "$line"; done
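For a single address, such as the one taken from the most recently created file, the same increment can be sketched in bash alone. The starting address is a hypothetical value, and treating a last octet above 255 as an error is my assumption; the question does not say what should happen in that case.

```shell
#!/usr/bin/env bash
ip=127.0.0.200                 # hypothetical address read from the previous file

# Split the address on dots into its four octets.
IFS=. read -r a b c d <<< "$ip"

# 10# forces base 10, so an octet written as 08 is not parsed as octal.
new_d=$(( 10#$d + 4 ))

if (( new_d > 255 )); then
    echo "last octet would overflow: $new_d" >&2   # assumption: treat as an error
    exit 1
fi

new_ip="$a.$b.$c.$new_d"
echo "$new_ip"                 # prints 127.0.0.204
```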
