What changes when a file is saved in Kedit for Windows that the unix2dos command doesn't do?

I have a strange question. I have written a script that reformats data files: it creates new files with the right column order, spacing, and so on. I then run unix2dos on these files (they are destined for DIPS for Windows, and I assume the files should be ANSI). When I try to open the files in DIPS, however, an error occurs and the file won't open.
When I create the same kind of data file through DIPS itself and open it in Notepad, it looks exactly like the data files created by my script.
On the other hand, if I first open the files created by my script in Kedit, save them, and then open them in DIPS, everything works.
My question is: what could saving in Kedit possibly do that unix2dos does not?
(Also, if I use Notepad or WordPad to save instead of Kedit, the file still doesn't open in DIPS.)
Here is the output of the diff command on Unix:
"
1,16c1,16
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
--
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
18c18
--
440c440
--
442c442
-1
-1
"
Any help would be appreciated! Thanks!

Okay! Figured it out.
unix2dos does not strip any space characters between the last visible character on a line and the line-break character. Saving in Kedit does strip them.
My script had a poor programming practice: it wrote strings with trailing spaces, like this:
echo "This is an example string " >> outfile.txt
The line is 32 characters long because of the trailing spaces, and the line-feed character (chr(10)) comes only after them:
This is an example string
If you run unix2dos on outfile.txt, the line looks the same as above, only with a different line-break sequence. However, when you open the file in Kedit and save it, the character count drops to 25 and the line looks like this:
This is an example string
This happens because Kedit does not preserve spaces at the end of a line. It places the line break immediately after the last non-space character on the line.
So programs that read their input literally, like DIPS (I'm guessing) or the more widely used AutoCAD scripting, will have a real problem with extra spaces before the line break. In AutoCAD scripting, a space is treated the same as a return. So ten extra spaces at the end of a line are treated as ten returns instead of the one you probably intended.
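The trailing-space cleanup that Kedit performs can also be done on the Unix side before the file is handed to the Windows program. The following is a minimal sketch, not the original script: the file names are made up, and the seven trailing spaces are an assumption derived from the character counts of 32 and 25 quoted above. It strips trailing spaces and tabs and writes CR LF line endings in one pass, so a separate unix2dos step is not needed afterwards.

```shell
# Hypothetical input: seven trailing spaces stand in for the script's output.
printf 'This is an example string       \n' > outfile.txt

# Remove trailing spaces and tabs, then write each line with a CR LF ending.
awk '{ sub(/[ \t]+$/, ""); printf "%s\r\n", $0 }' outfile.txt > outfile_dos.txt

# Dump the bytes so the line ending and the absence of trailing spaces are visible.
od -c outfile_dos.txt
```

Fixing the echo statements in the generating script so that they never emit trailing spaces is the cleaner solution; the filter above is only a safety net.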

unix2dos converts the line-break characters at the end of each line from Unix line breaks (LF, 10) to DOS line breaks (CR LF, 13 10).
Kedit could possibly also change the encoding of the file (for example, from ANSI to UTF-8).
You can change the encoding of a file with the iconv utility (on a Linux box).

Related

Appending a count to a code in multiple files and saving the result

I'm looking for a bit of help here. I'm a complete newbie!
I need to look in a file for a code matching the pattern A00000_00_A and append a count to it, so the first time it appears it is replaced with A00000_00_A_001, second time A00000_00_A_002 etc. The output needs to be written back to the same file. Each file only contains 1 code, but it appears multiple times.
After some digging I have found:
perl -pi -e 's/Q\d{4,5}'_'\d{2}_./$&.'_'.++$A /ge' /users/documents/*.xml
but the issue is the counter does not reset in each file.
That is, the output of the first file is say Q00390_01_A_1 to Q00390_01_A_7, while the second file is Q00391_01_A_8 to Q00391_01_A_10.
What I want is Q00390_01_A_1 to Q00390_01_A_7 in the first file and Q00391_01_A_1 to Q00391_01_A_3 in the second.
Does anyone have any idea on how to edit the above code to make it do that? I'm a total newbie so ideally an edit to what I have would be brilliant. Thanks
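Run perl once per file, so that the counter starts from scratch each time:
cd /users/documents/
for f in *.xml; do
    perl -pi -e 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$& . "_" . sprintf("%04d", ++$A)/ge' "$f"
done
This matches the string facs= and any character, then "Q" or "M" followed by either four or five digits, then an underscore, then two digits, another underscore, and a word character. The entire match is then concatenated with an underscore and the value of $A zero-padded to four digits. Because each file is handled by a separate perl process, $A begins undefined in every file, so the first match always gets 0001.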

Mac OS X split csv not working

I am attempting to split a large 200,000 line csv file into smaller pieces using the terminal command:
split -l 20000 users.csv
From what I have read online this should chop up the 200,000 line csv into ten 20,000 line files but this doesn't happen. All I get is a text file called 'xaa' that is just the original csv, all 200,000 lines.
Like I said in the title I am running on Mac OS High Sierra v.10.13.5
What exactly am I missing here?
As Ken Thomases points out in the comments, the most likely cause is that the file uses something other than newline as its line separator, most probably CR (carriage return).
You can tell if this is the case using the file utility. A file with such line separators looks like this:
$ file foo
foo: ASCII text, with CR line terminators
The reason split would behave this way with those line separators is that the file would appear to be only one line long (no newline characters). So split would write that one (very long) line, then exit.
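If file does report CR line terminators, one way to proceed is to translate the carriage returns into newlines first and split the converted copy. This is a minimal sketch with a tiny made-up stand-in for users.csv; the output names users_lf.csv and the part_ prefix are assumptions chosen for the example.

```shell
# Small stand-in for users.csv that uses bare CR line endings.
printf 'a,1\rb,2\rc,3\rd,4\r' > users_cr.csv

# Translate every CR to LF so that split can see the line boundaries.
tr '\r' '\n' < users_cr.csv > users_lf.csv

# Two lines per piece here; the question used -l 20000.
split -l 2 users_lf.csv part_

wc -l part_aa part_ab
```

This conversion is meant for files that use CR alone. A file with CR LF endings already contains newlines, and running it through the same tr command would insert an empty line after every record.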
Alternatively, you can use split with the -b option, which splits the file by size in bytes (the size accepts k and m suffixes for kilobytes and megabytes). This may cut a line in two at a piece boundary, but that can be repaired by hand.

sort -o appends newline to end of file - why?

I'm working on a small text file with a list of words in it that I want to add a new word to, and then sort. The file doesn't have a newline at the end when I start, but does after the sort. Why? Can I avoid this behavior or is there a way to strip the newline back out?
Example:
words.txt looks like
apple
cookie
salmon
I then run printf "\norange" >> words.txt; sort words.txt -o words.txt
I use printf rather than echo figuring that'll avoid the newline, but the file then reads
apple
cookie
orange
salmon
#newline here
If I just run printf "\norange" >> words.txt orange appears at the bottom of the file, with no newline, ie;
apple
cookie
salmon
orange
This behavior is explicitly defined in the POSIX specification for sort:
The input files shall be text files, except that the sort utility shall add a newline to the end of a file ending with an incomplete last line.
This is because a UNIX "text file" is only valid if all of its lines end in newlines, as the POSIX standard also defines:
Text file - A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the newline character. Although POSIX.1-2008 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.
Think about what you are asking sort to do.
You are asking it "take all the lines, and sort them in order."
You've given it a file containing four lines, which it splits into the following strings:
"apple\n"
"cookie\n"
"salmon\n"
"orange"
It sorts these for you dutifully:
"apple\n"
"cookie\n"
"orange"
"salmon\n"
And it then outputs them as a single string:
"apple
cookie
orangesalmon
"
That is almost certainly exactly what you do not want.
So instead, if your file is missing the terminating newline that it should have had, the sort program understands that, most likely, you still intended that last line to be a line, rather than just a fragment of a line. It appends a \n to the string "orange", making it "orange\n". Then it can be sorted properly, without "orange" getting concatenated with whatever line happens to come immediately after it:
"apple\n"
"cookie\n"
"orange\n"
"salmon\n"
So when it then outputs them as a single string, it looks a lot better:
"apple
cookie
orange
salmon
"
You could strip the last character off the file, the one from the end of "salmon\n", using a range of handy tools such as awk, sed, perl, php, or even raw bash. This is covered elsewhere, in places like:
How can I remove the last character of a file in unix?
But please don't do that. You'll just cause problems for all other utilities that have to handle your files, like sort. And if you assume that there is no terminating newline in your files, then you will make your code brittle: any part of the toolchain which "fixes" your error (as sort kinda does here) will "break" your code.
Instead, treat text files the way they are meant to be treated in unix: a sequence of "lines" (strings of zero or more non-newline bytes), each followed by a newline.
So newlines are line-terminators, not line-separators.
There is a coding style where prints and echos are done with the newline leading. This is wrong for many reasons, including creating malformed text files, and causing the output of the program to be concatenated with the command prompt. printf "orange\n" is correct style, and also more readable: at a glance someone maintaining your code can tell you're printing the word "orange" and a newline, whereas printf "\norange" looks at first glance like it's printing a backslash and the phrase "no range" with a missing space.
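Put together, the advice above can be sketched as follows. The example starts from words.txt as described in the question, without a final newline, repairs the missing terminator once, and from then on appends words with a trailing newline. The tail -c 1 check is one common idiom for detecting a missing terminator, not something the question or the POSIX text prescribes.

```shell
# words.txt as in the question: no newline after the last word.
printf 'apple\ncookie\nsalmon' > words.txt

# Command substitution strips a trailing newline, so the result is
# non-empty only when the file ends in some other byte.
if [ -n "$(tail -c 1 words.txt)" ]; then
    printf '\n' >> words.txt
fi

# Append the new word with a terminating newline, then sort in place.
printf 'orange\n' >> words.txt
sort -o words.txt words.txt

cat words.txt
```

Once every line is properly terminated, sort has nothing to repair and the file keeps the same shape before and after sorting.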

Using pipe symbol and "print" in Windows

I am trying to make a shell script work in Windows. Sorry but I'm not very experienced in Windows (or even that much in shell to be honest). The script works well except for this one line:
print "9\n0\n1\n5\n0\n0\n\n" | /usr/ts23/mm_util
The mm_util is an interactive utility that takes numbers as input. It chooses selection 9 first, then 0, then 1, etc. I've changed the path to use the utility, which has an identical interface in Windows but the output is just the first screen. The "9" input isn't entered, and because of this the output (that is parsed) is incorrect. How can I change this so that the "9" is entered on the first screen?
Here is a method that does not require a file. It works on the command line:
(for %N in (9 0 1 5 0 0 "") do @echo(%~N)|c:\Users\ts23\mm_util
The "" is to get an empty line in the output, as you had in your original question. Your answer does not have the blank line.
The %~N notation strips enclosing quotes from the value.
The echo( is non-intuitive syntax that can reliably print a blank line, in case %~N expands to nothing.
Don't forget to double the percents if you put the code in a batch script.
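Try putting the nine-linebreak-zero sequence in a text file, and then execute type textfile.txt | /usr/ts23/mm_util (on Windows, type writes a file's contents to standard output, whereas print sends the file to a printer).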
And bear in mind that Windows uses the pre-UNIX convention that the linebreak is CR LF, not just LF.
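If the input file is prepared on the Unix side and copied over, the CR LF endings can be written directly. This is a small sketch only; the file name test.txt is an arbitrary choice, and whether mm_util actually requires CR LF rather than plain LF is an assumption that would need checking.

```shell
# printf reuses the format for every argument, so each value gets its own
# CR LF ending; the final empty argument produces the blank line.
printf '%s\r\n' 9 0 1 5 0 0 '' > test.txt

# Show the bytes to confirm the \r \n pairs.
od -c test.txt
```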
The way I got the output I wanted was by using this:
C:\Users\ts23\mm_util < test.txt
And then just put the following inside test.txt
9
0
1
5
0
0
The output I got was what I needed, hopefully this will help someone trying to do something like this in the future.

Replace chars in file by index

I am looking for a reliable method to replace a sequence of chars in a text file. I know that the file will always follow a specific format and that I need to replace a specific range of chars (i.e., start at char 20, replace the next 11 chars with '#').
I have found several examples using sed and awk which accomplish this on most files. However, the hangup in my case is that the range of chars in the file contains random gibberish chars, including several NULL chars. This causes the file commands to stop processing.
I know that the simplest fix would be to go to the process that creates the file and not pad the file with NULL chars. However, the file is generated by a process buried within ancient COBOL running on a mainframe and any changes there require nearly an act of congress.
So, knowing that I am stuck with what I have, is there any way to manipulate the file, from the command line, that can successfully overwrite the NULL chars?
Thanks in advance.
GNU dd can do that
echo '###########'|dd of=FILENAME seek=20 bs=1 count=11 conv=notrunc
Make sure the echo command provides enough characters as input.
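To see the effect on data that really contains NUL bytes, the command can be tried on a throwaway file first. The sketch below uses a made-up 40-byte record named record.dat; the layout (20 bytes of A, 11 bytes of gibberish with NULs, 9 bytes of B) is an assumption for the demonstration, and printf is used instead of echo only so that no trailing newline is fed to dd.

```shell
# Hypothetical record: 20 x 'A', then 11 bytes of NULs and gibberish, then 9 x 'B'.
printf 'AAAAAAAAAAAAAAAAAAAA\000\000xyz\000\000\000\000\000\000BBBBBBBBB' > record.dat

# bs=1 makes seek count single bytes; conv=notrunc keeps the rest of the file.
printf '###########' | dd of=record.dat bs=1 seek=20 count=11 conv=notrunc 2>/dev/null

# Inspect the result; the file is still 40 bytes long.
od -c record.dat
```

Note that seek=20 skips the first 20 bytes, so the overwrite begins at the 21st byte. If "char 20" is meant as a 1-based position, seek=19 would be the matching value.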
