I have extended ASCII characters in Oracle table data which I am able to extract to a file using sqlplus, with the \ escape character prefixed. I want to use nzload to load the exact same data into a Netezza table.
nzload adds a couple of extra bytes when it encounters this character sequence (c2bf) in the extracted file data:
echo "PROFESSIONAL¿" | od -x
0000000 5052 4f46 4553 5349 4f4e 414c c2bf 0a00
after nzload:
echo "PROFESSIONAL¿" | od -x
0000000 5052 4f46 4553 5349 4f4e 414c c382 c2bf
on the nzload command line, I have these options:
-escapechar \ -ctrlchars
Can anyone provide any help with this?
I'm not very savvy with Unicode conversion issues, but I've done this to myself before, and I'll demonstrate what I think is happening.
I believe what you are seeing here is not an issue with loading special characters with nzload; rather, it's an issue with how your display/terminal software is displaying the data and/or how Netezza is storing the character data. I suspect a double conversion to/from UTF-8 (the Unicode encoding that Netezza supports). Let's see if we can suss out which it is.
Here I am using PuTTY with the default (for me) Remote Character Set as Latin-1.
$ od -xa input.txt
0000000 5250 464f 5345 4953 4e4f 4c41 bfc2 000a
P R O F E S S I O N A L B ? nl
0000017
$ cat input.txt
PROFESSIONALÂ¿
Here we can see from od that the file has only the data we expect, however when we cat the file we see the extra character. If it's not in the file, then the character is likely coming from the display translation.
If I change the PuTTY settings to have UTF-8 be the remote character set, we see it this way:
$ od -xa input.txt
0000000 5250 464f 5345 4953 4e4f 4c41 bfc2 000a
P R O F E S S I O N A L B ? nl
0000017
$ cat input.txt
PROFESSIONAL¿
So, the same source data, but two different on-screen representations, which are, not coincidentally, the same as your two different outputs. The same data can be displayed at least two ways.
Now let's see how it loads into Netezza, once into a VARCHAR column, and again into an NVARCHAR column.
create table test_enc_vchar (col1 varchar(50));
create table test_enc_nvchar (col1 nvarchar(50));
$ nzload -db testdb -df input.txt -t test_enc_vchar -escapechar '\' -ctrlchars
Load session of table 'TEST_ENC_VCHAR' completed successfully
$ nzload -db testdb -df input.txt -t test_enc_nvchar -escapechar '\' -ctrlchars
Load session of table 'TEST_ENC_NVCHAR' completed successfully
The data loaded with no errors. Note that while I specify the escapechar option for nzload, none of the characters in this specific sample of input data require escaping, nor are they escaped.
I will now use the rawtohex function from the SQL Extension Toolkit as an in-database tool, much as we've used od from the command line.
select rawtohex(col1) from test_enc_vchar;
RAWTOHEX
------------------------------
50524F46455353494F4E414CC2BF
(1 row)
select rawtohex(col1) from test_enc_nvchar;
RAWTOHEX
------------------------------
50524F46455353494F4E414CC2BF
(1 row)
At this point both columns seem to have exactly the same data as the input file. So far, so good.
What if we select the column? For the record, I am doing this in a PuTTY session with remote character set of UTF-8.
select col1 from test_enc_vchar;
COL1
----------------
PROFESSIONALÂ¿
(1 row)
select col1 from test_enc_nvchar;
COL1
---------------
PROFESSIONAL¿
(1 row)
Same binary data, but different display. If I then copy the output of each of those selects into echo piped to od,
$ echo PROFESSIONALÂ¿ | od -xa
0000000 5250 464f 5345 4953 4e4f 4c41 82c3 bfc2
P R O F E S S I O N A L C stx B ?
0000020 000a
nl
0000021
$ echo PROFESSIONAL¿ | od -xa
0000000 5250 464f 5345 4953 4e4f 4c41 bfc2 000a
P R O F E S S I O N A L B ? nl
0000017
Based on this output, I'd wager that you are loading your sample data, which I'd also wager is UTF-8, into a VARCHAR column rather than an NVARCHAR column. This is not, in and of itself, a problem, but it can have display/conversion issues down the line.
Generally speaking, you'd want to load UTF-8 data into NVARCHAR columns.
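As an aside, you can reproduce the suspected double conversion outside the database with iconv (an illustrative one-liner, not part of the original load): treating the UTF-8 bytes as Latin-1 and encoding them to UTF-8 again turns c2 bf into c3 82 c2 bf, which is exactly the "after nzload" byte sequence in the question. The output should look something like:
$ printf 'PROFESSIONAL\xc2\xbf\n' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
 50 52 4f 46 45 53 53 49 4f 4e 41 4c c3 82 c2 bf
 0a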
I am trying to write a script that, given a file as an argument, checks whether the hex dump of the file has a length that is a multiple of 4. If it doesn't, the script must append 00 to the end of the dump until it does.
I tried with hexdump
D=$(hexdump filename)
if [ $((${#D}%4)) != 0 ]
then
D+=00
fi
and with od
D=$(od -t x filename)
if [ $((${#D}%4)) != 0 ]
then
D+=00
fi
but neither method works.
I think the problems are as follows:
I include the offset columns in variable D, when I should only be considering the hex dump itself.
For example:
cat file.txt
// Hello World!
//This is a file .txt
od -t x file.txt
//0000000 6c6c6548 6f77206f 21646c72 6968540a
//0000020 73692073 66206120 20656c69 7478742e
//0000040 0000000a
//0000041
The columns 0000000 0000020 0000040 0000041 must not be inside D.
In practice, inside D there must be 6c6c65486f77206f21646c726968540a736920736620612020656c697478742e0000000a
Also, I have to modify that dump so that it has a length that is a multiple of 4, and I don't know if simply adding 00 at the end is enough.
Any idea how this can be done?
D=$(xxd -p "$1" | tr -d '\n')
while [ $(( ${#D} % 4 )) -ne 0 ]; do D+=00; done
This does not include the offset columns.
Why do you need to examine the hex dump? Just pad the raw file with zeros if necessary.
padump () {
local size=$(stat -f '%z' "$1")
local i
( cat "$1"
for ((i=size; i%2>0; i++)); do
printf '\0'
done ) |
hexdump
}
Usage: padump filename
The argument to stat is unfortunately not portable; on Linux, try stat -c '%s' "$1" (the code above was written on macOS).
The modulo in the above is 2 because four hex digits correspond to two bytes of the file.
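For reference, here is the same function with only the stat invocation swapped to the Linux form mentioned above (an untested sketch; everything else is unchanged):
padump () {
    local size=$(stat -c '%s' "$1")   # file size in bytes (GNU stat)
    local i
    ( cat "$1"
      for ((i=size; i%2>0; i++)); do
          printf '\0'                 # pad to an even number of bytes
      done ) |
    hexdump
}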
Anyway, I'll note that in your attempt, od already adds more padding than you bargained for, and outputs the bytes in the wrong order. The final value, at offset 0000040, is the single byte 0x0a; the zeros before it actually belong after it and are padding. The other hex values are similarly swapped; 0x6c is the character l, but occurs first in the dump - read as a regular hex dump, 6c6c6548 spells lleH. You can fix this with options:
#!/bin/sh
od -t x2 --endian=big "$1" |
sed 's/^[^ ]* //;s/ //g;$d'
Because od already adds the padding, this is probably the simplest solution. The above also discards the offsets (the first column; sed 's/^[^ ]* //'), removes any remaining spaces between the hex digits (s/ //g), and drops the final line, which contains only an offset ($d). However, the --endian=big option is Linux only. Alternatively, use -t x1 and add padding as above.
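If you save that script as, say, pad4.sh (the name is just a placeholder), you can capture the dump into D and check the length the way the question intends:
D=$(./pad4.sh file.txt | tr -d '\n')   # join the output lines into one hex string
echo $(( ${#D} % 4 ))                  # prints 0 when the length is a multiple of 4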
There is a shell script (bash) that checks a CSV file for lines that don't match a pattern and sends a mail containing the offending lines. That works fine, but when the wrong lines are combined, Linux uses \r as the line break, and in the e-mail there is no line break. So I tried to send \r\n as the line break, but this has no effect; perl or bash deletes the \n newline.
Here is a minimal working script as example:
SUBJECT="Error while parse CSV"
TO="rcpt#domain.tld"
wrongLines=$(perl -ne 'print "Row $.: $_\r\n" if not /^00[1-9]\d{4,}$/' $file)
MESSAGE="Error while parse following Lines, pattern dont match: \r\n $wrongLines"
echo $MESSAGE |od -c
The output of od is:
0000000 E r r o r w h i l e p a r s
0000020 e f o l l o w i n g L i n e
0000040 s , p a t t e r n d o n t
0000060 m a t c h : \ r \ n R o w
0000100 2 : 4 9 2 7 8 3 8 7 4 3 \r R
0000120 o w 3 : 4 8 2 3 2 8 9 7 3 8
0000140 \r \n
0000143
But why is the \n between the rows deleted in the od output? I also tried \x0D\x0A instead of \r\n, but that doesn't help either. Any suggestions?
Your problem is that you're not using quotes!
Look:
$ a="A multi-line
input
variable"
$ echo $a
A multi-line input variable
$ echo "$a"
A multi-line
input
variable
$
Without quotes, you'll be a victim of word splitting and filename expansion (not illustrated in the example above).
Also, adding \r or \n (that is, verbatim backslash followed by r or n) is not going to help at all.
Conclusion: Quote every variable expansion! always! (unless you really mean a glob pattern — in which case you will also add a comment in the code to explain why you purposely didn't quote the expansion).
Side note: don't use upper case variable names!
It is recommended you use lower-case names for your own parameters so as not to confuse them with the all-uppercase variable names used by Bash internal variables and environment variables.
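Putting both points together, the snippet from the question might be rewritten roughly like this (a sketch, not tested against your real data; note the quoted expansions, the lower-case names, and that the \r\n hack is gone):
subject="Error while parsing CSV"
to="rcpt@domain.tld"
wrong_lines=$(perl -ne 'print "Row $.: $_" if not /^00[1-9]\d{4,}$/' "$file")   # $_ already ends in \n
message="Error while parsing the following lines, pattern doesn't match:
$wrong_lines"
echo "$message" | od -c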
I have a task to export the data of a table to flat files. I have written a shell script to do this. The problem is that after the first column value, i.e. 310, the output doesn't continue character by character; it seems to insert a tab and jump to a certain position.
I am not able to understand this. Could you please help me solve this by removing the tab and using spaces instead?
Is trimspool ON or Linesize causing the problem?
310 LIFRZONAAC 0000000 0000003
310 LIIPACCCLA 0000000 0000000
310 LIIPACREPL 0000000 0000000
310 LIIPANRDSI 0000000 0000000
310 LIIPAXNAD 0000000 0000000
310 LIIPBNRDSI 0000000 0000000
310 LIIPCAUDIP 0000000 0000000
310 LIIPCAUDMU 0000000 0000000
====================================================================================
My control file and shell script is as below.
Control file::
LOAD DATA CHARACTERSET WE8ISO8859P1
APPEND
PRESERVE BLANKS
INTO TABLE "MNCABEA1"
APPEND
(
SERVICIO CHAR(9),
FAMILIA_COMPONENTE CHAR(4),
PARAMETRO CHAR(7),
CO_CABECERA_SC CHAR(8),
CO_CABECERA_CARACT CHAR(7)
)
====================================================================================
Shell script::
cat<<ENDMNCABEA1 >MNCABEA1.sql
set head off
set feed off
set pagesize 0
set trimspool on
set linesize 500
SELECT SERVICIO||FAMILIA_COMPONENTE||PARAMETRO||DECODE(CO_CABECERA_SC,' ',REPLACE(CO_CABECERA_SC,' ','0'),LPAD(LTRIM(CO_CABECERA_SC),8,'0'))||DECODE(CO_CABECERA_CARACT,' ',REPLACE(CO_CABECERA_CARACT,' ','0'),LPAD(LTRIM(CO_CABECERA_CARACT),7,'0')) FROM MNCABEA1;
exit
ENDMNCABEA1
sqlplus -s $USUADM@$ORACLE_SID @MNCABEA1.sql > /var/opt/aat/shr/mn/par/ext/salida/MNCABEA1_TEST
You also need to set tab off; see the documentation. You might also want to set trimout on to avoid having trailing spaces on the line; as you aren't doing a spool, set trimspool probably isn't doing anything.
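With that applied, the settings block generated into MNCABEA1.sql might look something like this (a sketch; the set trimspool on line is dropped here since nothing is being spooled):
set head off
set feed off
set pagesize 0
set tab off
set trimout on
set linesize 500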
I'm on OSX 10.6.8
I'm having some issues sorting a text file by the first character.
I'm concatenating three files into one and need the final result sorted by the first alphabetical letter.
Each file has lines that look like this:
A025-001
A118-001
A118-002
B657-001
D316-001
So the file after concatenation via "cat" looks like this:
A025-001
....
A025-001 (where file 2 was appended)
....
A025-001 (where file 3 was appended)
I've tried "sort -k 1.1,1.1 result.txt > sortedresult.txt" and with a large amount of other options in the man page: i,b,f,s (just guessing in hopes that I may have found the right one)
I need all the entries to be put next to each other:
A025-001
A025-001
B.......
B.......
D.......
Hopefully, someone more knowledgeable than I am can help me solve this problem.
Thanks
Update: the data files themselves aren't working well with Unix tools. If I cat the results file, only a few of the many lines are shown. Opening them in vim shows a bunch of ^M characters. It seems as if sort is not going through the whole file.
There's a column header at the top, with fields in quotations, tab-separated, e.g. "Product" \t "Category" \t
The rest of the data is tab-separated but without quotations.
sample od -c:
0000000 " P r o d u c t N u m b e r "
0000020 \t " L o o k u p A t t r i b u
0000040 t e 1 G r o u p " \t " L o o
0000060 k u p A t t r i b u t e 1
0000100 N a m e " \t " L o o k u p A t
0000120 t r i b u t e 1 V a l u e "
0000140 \t " L o o k u p A t t r i b u
0000160 t e 1 V a l u e I m a g e
0000200 " \t " L o o k u p A t t r i b
Here's some of the data (not the column header):
0000660 " \n A 0 2 5 - 0 0 1 \t F a c e t
0000700 \t F a c e t C o l o r \t B l u e
0000720 \t C C D D D D \t O P T I O N \t \r
Does anyone know why it is doing this?
Update #2: The files were exported out of FileMaker as ASCII. You'll see a lot of extra tabs, just ignore those, once we get this figured out I'll sed them out. Here is the entire file along with a hexdump and od -c of the file: pastebin.com/UzaUgG6C
Looking at the pastebin, it seems FileMaker is terminating the column headers with \n and separating your records with \r. You need to normalize your line endings first.
cat result.txt | tr '\r' '\n' | sort
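Since the question builds result.txt by concatenating three files, the whole thing can be done in one pipeline (illustrative; the file names here are placeholders):
cat file1.txt file2.txt file3.txt | tr '\r' '\n' | sort > sortedresult.txt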
I think the problem is just the line endings. The ^M characters are carriage returns. UNIX tools generally expect newlines, and no carriage returns. Try the answers to this question or try running mac2unix if you have it.
Try
sort -k1.1,1.2 result.txt > sortedresult.txt
I hope this helps.
You should try simply:
cat file1.txt file2.txt file3.txt | sort > result.txt
Using -k 1.1,1.1 will not make any difference, as there is only one field.
To make the sort stable, so that entries whose first characters are the same keep their relative order, you can add the -s switch to -k 1.1,1.1.
cat file1.txt file2.txt file3.txt | sort -s -k 1.1,1.1 > result.txt
I think this is the solution you need.
I have a file in the following format
Column1 Column2
str1 1
str2 2
str3 3
I want the columns to be rearranged. I tried the command below:
cut -f2,1 file.txt
The command doesn't reorder the columns. Any idea why it's not working?
From the cut(1) man page:
Use one, and only one of -b, -c or -f. Each LIST is made up of one range, or many ranges separated by commas. Selected input is written in the same order that it is read, and is written exactly once.
It reaches field 1 first, so that is printed, followed by field 2.
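You can confirm this with a quick test (illustrative one-liners on a single tab-separated line; the output separator is a tab):
$ printf 'str1\t1\n' | cut -f1,2
str1	1
$ printf 'str1\t1\n' | cut -f2,1
str1	1
Both invocations print the fields in the order they appear in the input, regardless of the order given to -f.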
Use awk instead:
awk '{ print $2 " " $1}' file.txt
You may also combine cut and paste:
paste <(cut -f2 file.txt) <(cut -f1 file.txt)
via comments: It's possible to avoid process substitution and remove one instance of cut by doing:
paste file.txt file.txt | cut -f2,3
Using join:
join -t $'\t' -o 1.2,1.1 file.txt file.txt
Notes:
-t $'\t': in GNU join the more intuitive -t '\t' without the $ fails (coreutils v8.28 and earlier?); it's probably a bug that a workaround like $'...' should be necessary. See: unix join separator char.
Even though there's just one file being worked on, join syntax requires two filenames. Repeating the file name allows join to perform the desired action.
For systems with low resources join offers a smaller footprint than some of the tools used in other answers:
wc -c $(realpath `which cut join sed awk perl`) | head -n -1
43224 /usr/bin/cut
47320 /usr/bin/join
109840 /bin/sed
658072 /usr/bin/gawk
2093624 /usr/bin/perl
Using just the shell:
while read -r col1 col2
do
echo "$col2" "$col1"
done <"file"
You can use Perl for that:
perl -ane 'print "$F[1] $F[0]\n"' < file.txt
-e option means execute the command after it
-n means read line by line (open the file, in this case STDIN, and loop over lines)
-a means split such lines into an array called @F ("F" as in Field). Perl indexes arrays starting from 0, unlike cut, which indexes fields starting from 1.
You can add -F pattern (with no space between -F and pattern) to use pattern as a field separator when reading the file instead of the default whitespace
The advantage of running perl is that (if you know Perl) you can do much more computation on F than rearranging columns.
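For example, if the fields were colon-separated rather than whitespace-separated, the same swap might look like this (illustrative; the -l flag strips the newline on input and re-adds it on print, so the last field doesn't drag the newline along):
perl -F':' -lane 'print "$F[1] $F[0]"' < file.txt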
I've just been working on something very similar. I am not an expert, but I thought I would share the commands I have used. I had a multi-column CSV from which I only required 4 columns, which I then needed to reorder.
My file was pipe ('|') delimited, but that can be swapped out.
LC_ALL=C cut -d$'|' -f1,2,3,8,10 ./file/location.txt | sed -E "s/(.*)\|(.*)\|(.*)\|(.*)\|(.*)/\3\|\5\|\1\|\2\|\4/" > ./newcsv.csv
Admittedly it is really rough and ready but it can be tweaked to suit!
Just as an addition to the answers that suggest duplicating the columns and then doing the cut. For duplication, paste etc. will work only for files, not for streams. In that case, use sed instead.
cat file.txt | sed s/'.*'/'&\t&'/ | cut -f2,3
This works on both files and streams, which is useful if, instead of just reading from a file with cat, you do something else interesting before re-arranging the columns.
By comparison, the following does not work:
cat file.txt | paste - - | cut -f2,3
Here, paste with a doubled stdin placeholder does not duplicate stdin; the second - simply reads the next line.
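A quick demonstration of that behaviour (illustrative; the output columns are tab-separated):
$ printf 'a\nb\nc\nd\n' | paste - -
a	b
c	d
Each - consumes its own line from stdin, so consecutive lines get paired up rather than the same line being duplicated.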
Using sed
Use sed with basic regular expression subexpressions to capture and reorder the column content. This approach is best suited when there are only a few columns to reorder, as in this case.
The basic idea is to surround interesting portions of the search pattern with \( and \), which can be played back in the replacement pattern with \# where # represents the sequential position of the subexpression in the search pattern.
For example:
$ echo "foo bar" | sed "s/\(foo\) \(bar\)/\2 \1/"
yields:
bar foo
Text outside a subexpression is scanned but not retained for playback in the replacement string.
Although the question did not discuss fixed-width columns, we will address them here, as handling them is a worthy measure of any proposed solution. For simplicity, let's assume the file is space-delimited, although the solution can be extended to other delimiters.
Collapsing Spaces
To illustrate the simplest usage, let's assume that multiple spaces can be collapsed into single spaces, and that the second-column values are terminated with EOL (and not space-padded).
File:
bash-3.2$ cat f
Column1 Column2
str1 1
str2 2
str3 3
bash-3.2$ od -a f
0000000 C o l u m n 1 sp sp sp sp C o l u m
0000020 n 2 nl s t r 1 sp sp sp sp sp sp sp 1 nl
0000040 s t r 2 sp sp sp sp sp sp sp 2 nl s t r
0000060 3 sp sp sp sp sp sp sp 3 nl
0000072
Transform:
bash-3.2$ sed "s/\([^ ]*\)[ ]*\([^ ]*\)[ ]*/\2 \1/" f
Column2 Column1
1 str1
2 str2
3 str3
bash-3.2$ sed "s/\([^ ]*\)[ ]*\([^ ]*\)[ ]*/\2 \1/" f | od -a
0000000 C o l u m n 2 sp C o l u m n 1 nl
0000020 1 sp s t r 1 nl 2 sp s t r 2 nl 3 sp
0000040 s t r 3 nl
0000045
Preserving Column Widths
Let's now extend the method to a file with constant width columns, while allowing columns to be of differing widths.
File:
bash-3.2$ cat f2
Column1 Column2
str1 1
str2 2
str3 3
bash-3.2$ od -a f2
0000000 C o l u m n 1 sp sp sp sp C o l u m
0000020 n 2 nl s t r 1 sp sp sp sp sp sp sp 1 sp
0000040 sp sp sp sp sp nl s t r 2 sp sp sp sp sp sp
0000060 sp 2 sp sp sp sp sp sp nl s t r 3 sp sp sp
0000100 sp sp sp sp 3 sp sp sp sp sp sp nl
0000114
Transform:
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f2
Column2 Column1
1 str1
2 str2
3 str3
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f2 | od -a
0000000 C o l u m n 2 sp C o l u m n 1 sp
0000020 sp sp nl 1 sp sp sp sp sp sp sp s t r 1 sp
0000040 sp sp sp sp sp nl 2 sp sp sp sp sp sp sp s t
0000060 r 2 sp sp sp sp sp sp nl 3 sp sp sp sp sp sp
0000100 sp s t r 3 sp sp sp sp sp sp nl
0000114
Lastly, although the question's example does not have strings of unequal length, this sed expression supports that case as well.
File:
bash-3.2$ cat f3
Column1 Column2
str1 1
string2 2
str3 3
Transform:
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f3
Column2 Column1
1 str1
2 string2
3 str3
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f3 | od -a
0000000 C o l u m n 2 sp C o l u m n 1 sp
0000020 sp sp nl 1 sp sp sp sp sp sp sp s t r 1 sp
0000040 sp sp sp sp sp nl 2 sp sp sp sp sp sp sp s t
0000060 r i n g 2 sp sp sp nl 3 sp sp sp sp sp sp
0000100 sp s t r 3 sp sp sp sp sp sp nl
0000114
Comparison to other methods of column reordering under shell
Surprisingly for a file manipulation tool, awk is not well-suited for cutting from a field to end of record. In sed this can be accomplished using regular expressions, e.g. \(xxx.*$\) where xxx is the expression to match the column.
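For example, to move the first column after everything from the second column to the end of the record (a sketch in the same style as the expressions above):
$ echo "col1 col2 col3 col4" | sed "s/^\([^ ]*\) \(.*\)$/\2 \1/"
col2 col3 col4 col1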
Using paste and cut subshells gets tricky when implementing inside shell scripts. Code that works from the commandline fails to parse when brought inside a shell script. At least this was my experience (which drove me to this approach).
Expanding on the answer from @Met, also using Perl:
If the input and output are TAB-delimited:
perl -F'\t' -lane 'print join "\t", @F[1, 0]' in_file
If the input and output are whitespace-delimited:
perl -lane 'print join " ", @F[1, 0]' in_file
Here,
-e tells Perl to look for the code inline, rather than in a separate script file,
-n reads the input 1 line at a time,
-l removes the input record separator (\n on *NIX) after reading the line (similar to chomp), and adds the output record separator (\n on *NIX) to each print,
-a splits the input line on whitespace into array @F,
-F'\t' in combination with -a splits the input line on TABs, instead of whitespace, into array @F.
@F[1, 0] is the array slice made up of the 2nd and 1st elements of array @F, in this order. Remember that arrays in Perl are zero-indexed, while fields in cut are 1-indexed. So fields in @F[0, 1] are the same fields as the ones in cut -f1,2.
Note that such notation enables more flexible manipulation of input than in some other answers posted above (which are fine for a simple task). For example:
# reverses the order of fields:
perl -F'\t' -lane 'print join "\t", reverse @F' in_file
# prints last and first fields only:
perl -F'\t' -lane 'print join "\t", @F[-1, 0]' in_file