Basically what I am having trouble with is when I type: file *
I will get:
AdvDataStructures.text.ref: ASCII text
makefile: ASCII make commands text
makelib: ASCII English text
README.txt: ASCII Pascal program text
shell3_2016.sh: ASCII text
shell3_2016.sh~: ASCII text
smallTestDir: directory
smallTestDir.text.out: empty
smallTestDir.text.ref: ASCII text
testarg0.text.ref: ASCII text
testarg1.text.ref: ASCII text
testbaddir.text.ref: ASCII text
When I use
for i in `file *`
it reads in each word separated by space in for i. I need it to read in each line as: AdvDataStructures.text.ref: ASCII text ,so I can look through it for a pattern.
ALSO, I have no clue how to make it so when I read in the line, I somehow have to read in the amount of lines within the file that is called. Is there a way to like call the first word of the output so it knows to read in the file name?
Basically, an example of what I have to do is read in one line at a time (AdvDataStructures.text.ref: ASCII text), if a pattern finds a match in it (I know how to do this with egrep) it will the count the number of lines within the file(AdvDataStructures.text.ref)
The usual way is to use a while read loop:
file * | while read filename description; do
filename=${filename%:} # remove : after filename
...
done
Related
This question already has answers here:
Shell script read missing last line
(7 answers)
Closed 2 years ago.
Created a text file as hello_world.rtf with following two lines only:
Hello
World
and trying to read above file using below bash script from terminal:
while test= read -r line; do
> echo "The text read from file is: $line"
> done < hello_world.rtf
and it returns the following:
The text read from file is: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf500
The text read from file is: {\fonttbl\f0\fswiss\fcharset0 Helvetica;}
The text read from file is: {\colortbl;\red255\green255\blue255;}
The text read from file is: {\*\expandedcolortbl;;}
The text read from file is: \paperw12240\paperh15840\margl1440\margr1440\vieww10800\viewh8400\viewkind0
The text read from file is: \pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
The text read from file is:
The text read from file is: \f0\fs24 \cf0 Hello\
Any suggestion what is wrong here and how can I get the clean result?
RTF means Rich Text Format. It is a language for text formatting, developed and used mostly by Microsoft and deprecated for a while.
The text inside the file looks as you can see in the output of your code. It contains the words "Hello" and "World" but also formatting instructions.
Save the file as plain text, not RTF and it will contain only the text you typed in it.
test= in front of read does not have any effect in this context. You can remove it.
Make sure the last line of the file ends with a new-line character. read returns an non-zero exit status (and this means false) when it reaches the end of file and your code exits the while loop and does not display the last value read by read. If the file ends with a new-line character, the last line (that is read but not listed by the code) is empty, therefore nothing is lost.
It is a recommended practice for text files to always end with a newline character.
Alternatively you can print the value of line again after the loop. It contains the last line of the file (from the last end-of-line character until the end of file).
I'm working on a small text file with a list of words in it that I want to add a new word to, and then sort. The file doesn't have a newline at the end when I start, but does after the sort. Why? Can I avoid this behavior or is there a way to strip the newline back out?
Example:
words.txt looks like
apple
cookie
salmon
I then run printf "\norange" >> words.txt; sort words.txt -o words.txt
I use printf rather than echo figuring that'll avoid the newline, but the file then reads
apple
cookie
orange
salmon
#newline here
If I just run printf "\norange" >> words.txt orange appears at the bottom of the file, with no newline, ie;
apple
cookie
salmon
orange
This behavior is explicitly defined in the POSIX specification for sort:
The input files shall be text files, except that the sort utility shall add a newline to the end of a file ending with an incomplete last line.
As a UNIX "text file" is only valid if all lines end in newlines, as also defined in the POSIX standard:
Text file - A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the newline character. Although POSIX.1-2008 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.
Think about what you are asking sort to do.
You are asking it "take all the lines, and sort them in order."
You've given it a file containing four lines, which it splits to the following strings:
"salmon\n"
"cookie\n"
"orange"
It sorts these for you dutifully:
"cookie\n"
"orange"
"salmon\n"
And it then outputs them as a single string:
"cookie
orangesalmon
"
That is almost certainly exactly what you do not want.
So instead, if your file is missing the terminating newline that it should have had, the sort program understands that, most likely, you still intended that last line to be a line, rather than just a fragment of a line. It appends a \n to the string "orange", making it "orange\n". Then it can be sorted properly, without "orange" getting concatenated with whatever line happens to come immediately after it:
"cookie\n"
"orange\n"
"salmon\n"
So when it then outputs them as a single string, it looks a lot better:
"cookie
orange
salmon
"
You could strip the last character off the file, the one from the end of "salmon\n", using a range of handy tools such as awk, sed, perl, php, or even raw bash. This is covered elsewhere, in places like:
How can I remove the last character of a file in unix?
But please don't do that. You'll just cause problems for all other utilities that have to handle your files, like sort. And if you assume that there is no terminating newline in your files, then you will make your code brittle: any part of the toolchain which "fixes" your error (as sort kinda does here) will "break" your code.
Instead, treat text files the way they are meant to be treated in unix: a sequence of "lines" (strings of zero or more non-newline bytes), each followed by a newline.
So newlines are line-terminators, not line-separators.
There is a coding style where prints and echos are done with the newline leading. This is wrong for many reasons, including creating malformed text files, and causing the output of the program to be concatenated with the command prompt. printf "orange\n" is correct style, and also more readable: at a glance someone maintaining your code can tell you're printing the word "orange" and a newline, whereas printf "\norange" looks at first glance like it's printing a backslash and the phrase "no range" with a missing space.
I came here looking for going the other way (opposite to the way posted in
echo -e equivalent in Windows?):
I have a text file with ASCII chars, to be converted to a text file with ASCII equivalent of the characters in the text file.
eg. a text file containing "--" to be changed to a text file containing "4545".
Anyone got dos code for this?
Thanks,
For Microsoft QBasic
dim character$(1):dim number%
open "inputfile" for Binary as 1
open "outputfile" for Binary as 2
while not eof(1)
get #1,,character$:put #2,,asc(character$)
'here one byte string is converted into character number
wend
close
From a text file with variable number of columns per row (tab delimited), I would like to extract value with specific condition.
The text file looks like:
S1=dhs Sb=skf S3=ghw QS=ghr</b>
S1=dhf QS=thg S3=eiq<b/>
QS=bhf S3=ruq Gq=qpq GW=tut<b/>
Sb=ruw QS=ooe Gq=qfj GW=uvd<b/>
I would like to have a result like:
QS=ghr<b/>
QS=thg<b/>
QS=bhf<b/>
QS=ooe
Please excuse my naive question but I am a beginner trying to learn some basic bash scripting technique for text manipulation.
Thanks in advance!
You could use awk ,
awk '{for(i=1;i<=NF;i++){if($i~/^QS=/){print $i}}}' file
This awk command iterates through each fields and check for the column which has QS= string at the start. If it finds any, then the corresponding column would be printed.
Through grep,
grep -oP '(^|\t)\KQS=\S*' file
-o parameter means only matching. So it prints only the characters which are matched.
-P this enables the Perl-regex mode.
(^|\t) matches the start of a line or a tab character.
\K discards the previously matched tab or start of the line boundary.
QS= Now it matches the QS= string.
\S* Matches zero or more non-space characters.
So I have a strange question. I have written a script that re-formats data files. I basically create new files with the right column order, spacing, and such. I then unix2dos these files (the program I am formatting these files for is DIPS for windows, and I assume that the files should be ansi). When I go to open the files in the DIPS Program however an error occurs and the file won't open.
When I create the same kind of data file through the DIPS program and open it in note pad, it matches exactly with the data files I have created with my script.
On the other hand if I open the data files that I have created with my script in Kedit first, save them, and then open them in the DIPS program everything works.
My question is what could saving in Kedit possibly do that unix2dos does not?
(Also if I try using note pad or word pad to save instead of Kedit the file doesn't open in DIPS)
Here is what was created using the diff command in unix
"
1,16c1,16
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
--
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
18c18
--
440c440
--
442c442
-1
-1
"
Any help would be appreciated! Thanks!
Okay! Figured it out.
Simply when you unix2dos your file you do not strip any space characters in between the last letter in a line and the line break character. When saving in Kedit you do strip the spaces between the last letter in a line and the line break character.
In my script I had a poor programing practice in which I was writing a string like this;
echo "This is an example string " >> outfile.txt
The character count is 32, and if you could see the break line character (chr(10)) the line would read;
This is an example string
If you unix2dos outfile.txt the line looks the same as above but with a different break line character. However when you place the file into Kedit and save it, now the character count is 25 and the line looks like this;
This is an example string
This occurs because Kedit does not preserve spaces at the end of a line. It places the return or line break character at the last letter or "non space" character in a line.
So programs that read literal input like DIPS (i'm guessing) or more widely used AutoCAD scripting will have a real problem with extra spaces before the return character. Basically in AutoCAD scripting a space in a line is treated as a return character. So if you have ten extra spaces at the end of a line it's treated the same as ten returns instead of the one you probably intended.
OH and if this helped you out or though it was good please give me a vote up!
unix2dos converts the line-break characters at the end of each line, from unix line breaks (10) to dos line breaks (13, 10)
Kedit could possible change the encoding of the file (like from ansi to UTF-8)
You can change the encoding of a file with the iconv utility (on a linux box)