My text file is delimited by the pipe character '|'.
I want to export it to an Excel file (.xls) using a script on Unix.
Can anyone please help?
My suggestion would be:
Convert the delimiter | to ,
Save the file with a .csv extension
Open the file in Excel.
Note: if the file contents contain , anywhere other than as the token separator, this idea will not work.
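In the simple case (no commas or quotes in the data), the conversion is a one-liner, for example:
tr '|' ',' < input.txt > output.csv
(input.txt and output.csv are placeholder names.)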
If you want to convert your file to the .xls format itself, then you will have to use a library such as Apache POI; it has Perl support.
If you just want to open the file in Excel, then you can directly use Open With > Excel and set the separator to |.
Or put all the fields in " " and use , as the separator. If a value is within "", a comma inside the text will not be an issue. But double quotes within the text will then be a problem.
To avoid all of this you can use some other ASCII character as the separator.
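For instance, a small awk one-liner can wrap every field in quotes and join them with commas (a rough sketch; it assumes the data itself contains no double quotes):
awk -F'|' '{ for (i = 1; i <= NF; i++) printf "\"%s\"%s", $i, (i < NF ? "," : "\n") }' input.txt > output.csv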
I have a scenario where I import a CSV file and then validate its content. One cell contains the following special characters:
“ !#$%&’()*+,-./:;<=>?#[\]^_`{|}”~
When I read the CSV file with the above characters in the cell using CSV.read("csv_filepath"), I get the following:
“ !\#$%&’()*+,-./:;<=>?#[\\]^_`{|}”~
A backslash (\) is added before # and \. How can I read the exact content?
I am trying to remove newline characters from within quotes in a file.
I am able to achieve that using the code below:
awk -F"\"" '!length($NF){print;next}{printf("%s ", $0)}' filename.txt > filenamenew.txt
Note that I am creating a new file, filenamenew.txt. Is this avoidable? Can I do the command in place? The reason I ask is that the files are huge.
My file is pipe delimited. Sample input file:
"id"|"name"
"1"|"john
doe"
"2"|"second
name
in the list"
Using the above code I get the following output:
"id"|"name"
"1"|"john doe"
"2"|"second name in the list"
But the files are huge, and I see that some lines have ^M characters between the quotes. Second sample input file:
"id"|"name"
"1"|"john
doe"
"^M2"|"second^M^M
name
in the list"
Output using the above code:
"id"|"name"
"1"|"john doe"
name in the list"
So basically, if there is a ^M in the line, that string is not being printed. I read online that ^M is equal to \r, so I used:
tr -d '\r' < filename.txt
I also tried:
awk -F"|" '{sub(/^M/,"")}1' filename.txt
but it did not remove those characters (^M).
A little background on why I am doing this:
I am extracting data from a relational table and loading it into a flat file, then checking whether the counts between the table and the file match. But since there are \n characters inside some columns, count(*) on the table vs. wc -l on the file does not match.
Final resolution:
I don't want to delete these unprintable characters in the long run, but want to replace them with some character or value (so that the counts between table and file match), and then, when I am loading the data back into a table, replace the placeholder value with the \n or ^M that was originally present, so that there is no tampering with the data on my side.
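For example, adapting the awk command above to insert a visible placeholder instead of a space might look like this (a rough sketch; <LF> and <CR> are arbitrary tokens, and the sed step assumes GNU sed for \r):
awk -F"\"" '!length($NF){print;next}{printf("%s<LF>", $0)}' filename.txt | sed 's/\r/<CR>/g' > filenamenew.txt
On the way back into the table, <LF> and <CR> would be translated back to \n and \r.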
Any suggestions are appreciated. Thanks.
I am trying to concatenate a set of .txt files into a CSV file using the Windows command line.
So I use:
type *.txt > me_new_file.csv
But the fields of a given row, which are tab delimited, end up in one column. How do I take advantage of the tab separation in the original text files to create a CSV file in which the fields are aligned in columns correctly, using one or more command lines? I am thinking there might be something like...
type *.txt > me_new_file.csv delim= ' '
but I haven't been able to find anything yet.
Thank you for your help. I would also appreciate it if someone could direct me to a related answer.
From the command line you'd have a fairly complicated time of it. The Windows cmd.exe command processor is much, much simpler than dash, ash, or bash, et al.
The best thing would be to concatenate all of your files into the .csv file, open it in a text editor, and do a global find and replace, replacing each tab with ,
Be careful that your other data doesn't have any commas in it.
If the source files are tab delimited, then the output file is also tab delimited. Depending on the software you are using, you should be able to load the tab delimited data properly.
Suppose you are using Excel. If the output file has a .csv extension, then Excel will default to comma delimited columns when it opens the file. Of course that does not work for you. But if you rename the file to have some other extension like .txt, then when you open it with Excel, it will open a series of dialog boxes where you can specify the format, including tab delimited.
If you want to keep the .csv extension and have Excel automatically open it properly, then you need to transform the data. This can be done very easily with JREPL.BAT - a hybrid JScript/batch utility that performs a regular expression search and replace on text data. JREPL.BAT is pure script that runs natively on any Windows machine from XP onward.
The following encloses each value in quotes, just in case a value contains a comma literal.
type *.txt 2>nul | jrepl "\t" "\q,\q" /x /jendln "$txt='\x22'+$txt+'\x22'" /o output.csv
Beware: Your use of type *.txt will fail if the last line in any of your source .txt files does not end with a newline. In such a case, the first line of the next file will be appended to the last line of the previous file. Not good.
You can solve that problem by processing each file individually in a FOR loop.
(for %F in (*.txt) do jrepl "\t" "\q,\q" /x /jendln "$txt='\x22'+$txt+'\x22'" /f "%F") >output.csv
The above is designed to run on the command line. If used in a batch script, then a few changes are needed:
(for %%F in (*.txt) do call jrepl "\t" "\q,\q" /x /jendln "$txt='\x22'+$txt+'\x22'" /f "%%F") >output.csv
Note: My answer assumes none of the source files contain quotes. If they do contain quotes, then a more complicated search and replace is required. But it still can be done efficiently with JREPL.
The question: How (and where) can I specify the line terminator string of the DAT file when I pass the name of the DAT file on the command line via the "data" parameter rather than in the CTL file? I am using Oracle 11.2 SQL*Loader.
The goal: I need to load a huge amount of data quickly from a CSV file into Oracle 11.2 (or above). The field (column) separator is hex 1F (the US character = unit separator), the string delimiter is the double quote, and the record (row) separator is hex 1E (the RS character = record separator).
The problem: Using "stream record format" with "str terminator_string" in SQL*Loader is fine, but only when I can specify the name of the DAT file using the "infile" directive inside the CTL file. The name of my DAT file varies, however, so I pass it on the command line as the "data" parameter, and in that case I do not know how (or where) to specify the line terminator string of the DAT file.
Remark: The problem is the same as the unsolved problem in this question.
Admittedly this is more a workaround than a proper solution, but it should work: keep a fixed name in the control file, and then copy/rename/symlink each data file to that fixed name and process it. Or have a control file with an infile entry "THE_DAT_FILE", and then run "sed" to change this to the required file name and invoke sqlldr using this sed'd file.
So, something like:
Get the data file F1
Copy/symlink F1 to the_file.dat (symlink assuming Unix/Linux/Cygwin)
Run sqlldr with a STR which refers to INFILE as "the_file.dat"
When complete, delete/unlink the_file.dat
Repeat 1-4 for the next file(s) F2, F3, ... Fn
E.g.
for DAT_FILE in *.dat
do
ln -s $DAT_FILE /tmp/the_file.dat
sqlldr .....
rm /tmp/the_file.dat
done
Or
for DAT_FILE in *.dat
do
cat the_ctl_file | \
sed "s/THE_DAT_FILE/$DAT_FILE/" > /tmp/ctl_$DAT_FILE.cf
sqlldr ..... control=/tmp/ctl_$DAT_FILE.cf
done
I just ran into a similar situation, where I need to use the same control file for a set of files, all with the Windows EOL sequence as the record terminator and with embedded newlines in text fields.
Rather than code a specific control file for each one with the name on the INFILE directive, I coded the name as /dev/null with the STR as:
INFILE '/dev/null' "STR '\r\n'"
And then on the sqlldr command line I use the DATA option to specify the actual flat file.
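Applied to the hex separators in the original question, the control file might look something like this (a sketch; the table and column names are made up):
LOAD DATA
INFILE '/dev/null' "STR x'1E'"
INTO TABLE my_table
FIELDS TERMINATED BY x'1F' OPTIONALLY ENCLOSED BY '"'
(col1, col2, col3)
and the varying file name is then passed on the command line, for example:
sqlldr userid=scott/tiger control=load.ctl data=/path/to/current_file.dat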
This is a common issue for me, and my solution is a bit brutish, so I'm looking for a quick fix and an explanation of the problem.
The problem is that when I save a spreadsheet in Excel (Mac 2011) as a tab delimited file, it seems to work perfectly fine, until I try to parse the file line by line using Perl. For some reason Perl slurps the whole document in one line.
My brutish solution is to open the file in a web browser and copy and paste the information into a tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline at the end of the file before doing this fix, and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
Thanks!
The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a carriage return + line feed (CRLF), and *NIX systems use only a line feed (LF). Classic Mac OS used a bare carriage return (CR), which is what Excel for Mac 2011 still writes when saving text files; since Perl's default input record separator is \n, such a file is read as one long line.
These characters can be represented in regex as \r\n, \n, or \r (respectively).
Sometimes, to work through a text file, you need to deal with the newline characters. Try this for DOS-to-UNIX in Perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed (the ^M here is a literal carriage return, entered by typing Ctrl-V then Ctrl-M):
$ sed 's/^M$//' input.txt > output.txt
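And since Excel for Mac writes bare CR line endings, the variant most relevant here is Mac-to-UNIX; in the same spirit as the Perl one-liner above:
perl -pi -e 's/\r/\n/g' input.file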
Found a pretty simple solution to this: copy the data from Excel to the clipboard, paste it into a Google spreadsheet, and download the Google spreadsheet file as 'tab-separated values (.tsv)'. This gets around the problem, and you have tab delimiters with an end of line for each line.
Yet another solution ...
for a tab-delimited file, save the document as a 'Windows Formatted Text (.txt)' file type
for a comma-separated file, save the document as a 'Windows Comma Separated (.csv)' file type
Perl has a useful regex pattern \R which will match any common line ending. It actually matches any vertical whitespace (the same as \v) or the CR LF combination, so it is equivalent to \r\n|\v.
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which will give you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead).
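For example, as a quick command-line sketch (the file name is an assumption):
perl -0777 -ne 'my @records = split /\R/; printf "%d records\n", scalar @records' myfile.tsv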
Another option is the PerlIO::eol module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are.
Once you have loaded the module with use PerlIO::eol, you can use it in an open statement:
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles:
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform.
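For instance, a minimal read loop using that layer might look like this (a sketch; the file name and the tab-splitting are assumptions for a .tsv file):
use PerlIO::eol;
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
while ( my $line = <$fh> ) {
    chomp $line;                      # endings already normalized to LF
    my @fields = split /\t/, $line;   # tab-separated columns
}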