I'd like to ask if there is a way to change part of the values in a data frame column. I need to work with a CSV file that was saved in UTF-7, which puts "+AC0" in front of every negative value, like:
+AC0-0.9949341
I need to get rid of this. I already tried
data1$x <- as.character(data1$x)
data1$x[data1$x == "+AC0"] <- ""
and
data1[data1[["x"]] == "+AC0","x"] <- ""
From what I can tell it's UTF-7 data. So you should be able to force UTF-7 text encoding when reading the file and have it all automagically go away.
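For example, a minimal sketch in R, assuming your platform's iconv supports UTF-7 (check with "UTF-7" %in% iconvlist()); the file name here is made up, the column x is from the question:

# Option 1: decode the file as UTF-7 while reading it.
data1 <- read.csv("data.csv", fileEncoding = "UTF-7")

# Option 2 (fallback): read as-is, then replace the UTF-7 escape "+AC0-",
# which encodes a literal minus sign, and convert back to numeric.
data1$x <- as.numeric(gsub("+AC0-", "-", as.character(data1$x), fixed = TRUE))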
I am using XQuery/BaseX to look through large XML files to find historical data for some counters. All the files are zipped and stored on a drive. The important part of a file looks as follows:
<measInfo xmlns="http://www.hehe.org/foo/" measInfoId="uplink speed">
  <granPeriod duration="DS222S" endTime="2020-09-03T08:15:00+02:00"/>
  <repPeriod duration="DS222S"/>
  <measTypes>AFD123 AFD124 AFD125 AFD156</measTypes>
  <measValue measObjLdn="PLDS-PLDS/STBHG-532632">
    <measResults>23 42 12 43</measResults>
  </measValue>
</measInfo>
I built the following query:
declare default element namespace "http://www.hehe.org/foo/";
let $sought := ("AFD124", "AFD125")
let $datasource := collection("C:\Users\Patryk\Desktop\folderwitharchives")
let $filename := concat(convert:dateTime-to-integer(current-dateTime()), ".xml")
for $meas in $datasource/measCollecFile/measData/measInfo return
  for $measType at $i in $meas/tokenize(measTypes)[. = $sought] return
    file:append($filename,
      <meas
        measInfoId="{data($meas/@measInfoId)}"
        measObjLdn="{data($meas/measValue/@measObjLdn)}"
      >
        {$meas/granPeriod}
        {$meas/repPeriod}
        <measType>{$measType}</measType>
        <measValue>{$meas/measValue/tokenize(measResults, " ")[$i]}</measValue>
      </meas>)
The script works, but it takes a lot of time for some counters (measType). I read the documentation about indexing, and my idea is to somehow index all the measTypes (the parts of that string), so that whenever I need to look through the whole archive for a counter, it can be accessed quickly. Is that possible when operating directly on archives, or would I have to create a new database from them? I would prefer not to, due to the size of the files. How do I create indexes for such a case?
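For illustration only (not something I have working): as far as I can tell, indexes in BaseX live inside a database, so I assume the alternative would be something like the sketch below, with a made-up database name; I believe the ADDARCHIVES option lets BaseX unzip archives on import.

(: Sketch: load the zipped archives into a new database with a full-text
   index, so that token lookups in measTypes could use the index. :)
db:create(
  "measdb",
  "C:\Users\Patryk\Desktop\folderwitharchives",
  (),
  map { 'ftindex': true(), 'addarchives': true() }
)

That duplication of the data is exactly what I would prefer to avoid.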
It is not the answer to my question, but I have noticed that the execution time is much longer when I write XML nodes to a file. It is much faster to append a plain string to the file:
file:append($filename,
  concat($measInfo/@measInfoId, ",", $measInfo/measValue/@measObjLdn, ",",
    $measInfo/granPeriod, ",", $measInfo/repPeriod, ",", $measType, ",",
    $tokenizedValues[$i], "&#10;"))
Why is that, and how can I speed up writing XML nodes to a file?
Also, I have noticed that appending a value to the file inside a for loop takes much longer, and I suspect that this is because the file has to be reopened in each iteration. Is there a way to keep the file open throughout the whole query?
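For illustration, a sketch of that idea (same names and assumptions as the query above): build all the lines in memory first, then write once, so the file is opened a single time:

declare default element namespace "http://www.hehe.org/foo/";
let $sought := ("AFD124", "AFD125")
let $datasource := collection("C:\Users\Patryk\Desktop\folderwitharchives")
let $lines :=
  for $meas in $datasource/measCollecFile/measData/measInfo
  for $measType at $i in $meas/tokenize(measTypes)[. = $sought]
  return string-join((
    $meas/@measInfoId, $meas/measValue/@measObjLdn,
    $meas/granPeriod, $meas/repPeriod,
    $measType, $meas/measValue/tokenize(measResults, " ")[$i]), ",")
return file:write-text("out.csv", string-join($lines, "&#10;"))

file:write-text is part of the BaseX File Module and replaces the per-iteration file:append calls with a single write.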
I am facing a problem importing a flat file into SSIS.
The file is pipe-separated ("|") and uses ";;" as the row delimiter. However, the delimiter is inconsistent: sometimes, at the end of a row, there is only ";" or nothing at all (""). When importing into SSIS I get this result:
Column 1   Column 2   Column 3   Column 4   Column 5
a          b          c          d          e;|a1|b1|c1|d1|e1
This should instead look like
Column 1   Column 2   Column 3   Column 4   Column 5
a          b          c          d          e
a1         b1         c1         d1         e1
The problem arises because the first data row ends with only one ";" (or none).
Note this is an example; many of the rows are correct and have ";;" as the delimiter. I am only pointing out the problem.
The .csv file would look like
Column 1|Column 2|Column 3|Column 4|Column 5;;
a|b|c|d|e;
a1|b1|c1|d1|e1;;
and should instead look like
Column 1|Column 2|Column 3|Column 4|Column 5;;
a|b|c|d|e;;
a1|b1|c1|d1|e1;;
The data set is very big, with almost 600,000 rows and 50 columns.
The first problem I face is when I import the file, since SSIS's default data type is string [DT_STR] with a length of 50. Since there are sometimes several consecutive rows with wrong delimiters, I get very long strings in the last column's cell. I use Visual Studio, and in the Advanced Editor I changed the length to something very big.
[Screenshot: the Advanced Editor in Visual Studio where I changed the length]
So the question is: how do I, in SSIS and Visual Studio Community, take the values that end up crammed into one column's cell and split them out into an entirely new row (with the already defined columns)?
I have tried manually finding all the cases where there is an error and fixing them in the .csv file, after which SSIS works. However, this is not a durable solution, because I get a new file every month.
I have tried reading suggestions such as:
Split a single column of data with comma delimiters into multiple columns in SSIS
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/49a764e7-1a6f-4a6f-9c92-2462ffa3add2/regarding-ssis-split-multi-value-column-into-multiple-records?forum=sqlintegrationservices
but their problem is not the same: they have a column value to replicate, whereas I want an entirely new row.
Thanks for any help,
ss
EDIT: trying the answers from J Weezy and R M:
I tried to create a script task and follow that solution.
In Visual Studio, I add a Script Component and choose "Transformation". Under Input Columns I select all of them.
After this I direct the flat file source into the script component and run it. Running it like this (with the script component not doing anything) works.
Then I open "Edit Script" in the script component, and under public override void Input0_ProcessInputRow(Input0Buffer Row) I enter (using the help from R M):
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    public static string[] SplitLine(string input)
    {
        Regex lineSplit = new Regex("[0-9]\;$", RegexOptions.Compiled);
        List<string> list = new List<string>();
        string curr = null;
        foreach (Match match in lineSplit.Matches(input))
        {
            curr = match.Value;
            if (0 == curr.Length)
            {
                list.Add("");
            }
            list.Add(curr.TrimStart(';'));
        }
        return list.ToArray();
    }
}
However, this doesn't work (I am not even allowed to execute the task).
I have never worked with C# before, so everything is new to me. As I understand the code, it searches each line for the pattern where a digit is followed by a single ";" at the end of the line, so it will not match lines that end in digits followed by ";;" (two semicolons).
When there is a match, one ";" is added.
Please let me know what I am misunderstanding and doing wrong.
Maybe it is also wrong to put the script component after the flat file source, because adding ";" will not result in a new row, which is what I want.
Inconsistent row delimiters are bad data, and there really is no way to correct for this in either the connection manager or the data flow. Fixing bad data within the data flow is not what SSIS was designed for. Your best bet is to do one of the following:
Work with the data source provider to fix the issue on their end
Create a script task to first modify the file to correct the bad data
From there, you will be able to process the file normally in SSIS.
Update 1:
If the only problem is a duplicate delimiter (;;), then read in each row and use the Replace(";;", ";") function. If you have either multiple duplicated or invalid end-of-row delimiters, then you are better served by using a StringBuilder(). For a solution using StringBuilder(), see the weblink below.
https://stackoverflow.com/a/49949787/4630376
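A rough sketch of that pre-processing step in C# (the file path and the ";;" terminator are assumptions taken from the question; this is illustrative, not the exact code from the link):

using System.IO;
using System.Text.RegularExpressions;

public static void NormalizeRowDelimiters(string path)
{
    // Read each physical line, strip whatever run of semicolons it ends
    // with (two, one, or none), and append exactly two.
    string[] lines = File.ReadAllLines(path);
    for (int i = 0; i < lines.Length; i++)
    {
        lines[i] = Regex.Replace(lines[i], ";*$", "") + ";;";
    }
    // Write to a new file so the original stays untouched.
    File.WriteAllLines(path + ".fixed", lines);
}

The data flow would then read the corrected .fixed file through the normal flat file connection manager.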
Update 2:
One thing that I just remembered: you will need to handle only those characters that are outside of double quotes, assuming double quotes exist within the file as the text qualifier. This is important because, without it, you would remove characters that are within quotes, which may be valid data.
I would agree with J Weezy that you should create a script task to correct the bad data. In the script task you could use a regex to deal with the ";" vs. ";;" issue; a script task may be your only way of dealing with it.
While the code below in its current form will not work for your case, it could be changed to do so. I have used it to process a text/CSV file and correct formatting issues in each line of data. Note that I got this from another post on Stack Overflow.
public static string[] SplitLine(string input)
{
    // Matches either a double-quoted field (with "" as an escaped quote)
    // or an unquoted run of characters up to the next comma.
    Regex lineSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
    List<string> list = new List<string>();
    string curr = null;
    foreach (Match match in lineSplit.Matches(input))
    {
        curr = match.Value;
        if (0 == curr.Length)
        {
            list.Add("");
        }
        // Drop the leading comma that the match may have captured.
        list.Add(curr.TrimStart(','));
    }
    return list.ToArray();
}
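Inside the Script Component you might then call it per row roughly like this (the input column name RawLine is hypothetical; use whatever column your flat file source exposes):

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split the raw line into fields; each element of "fields" would then
    // be assigned to one of the output columns.
    string[] fields = SplitLine(Row.RawLine);
}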
Probably a simple question, but how do I delete the contents of a file after a specific line number? I want to keep, for example, the first 5 lines and delete the rest of the file's contents. I have been searching for a while and can't find a way to do this; I am an iOS developer, so Ruby is not a language I am very familiar with.
That is called truncating. The truncate method needs the byte position after which everything gets cut off, and the file's pos method delivers just that:
File.open("test.csv", "r+") do |f|
f.each_line.take(5)
f.truncate( f.pos )
end
The "r+" mode from File.open is read and write, without truncating existing files to zero size, like "w+" would.
The block form of File.open ensures that the file is closed when the block ends.
I'm not aware of any methods to delete from a file, so my first thought was to read the file and then write back to it. Something like this:
path = '/path/to/thefile'
start_line = 0
end_line = 4
File.write(path, File.readlines(path)[start_line..end_line].join)
File.readlines reads the file and returns an array of strings, where each element is one line of the file. You can then use the subscript operator with a range to select the lines you want.
This isn't going to be very memory efficient for large files, so you may want to optimise if that's something you'll be doing often; one possible approach is sketched below.
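A rough sketch of a more memory-friendly variant (the temporary file name is arbitrary): stream only the lines you need instead of loading the whole file.

require 'fileutils'

path = '/path/to/thefile'
kept = File.foreach(path).first(5)   # lazily reads just the first 5 lines
File.write("#{path}.tmp", kept.join)
FileUtils.mv("#{path}.tmp", path)    # replace the original with the trimmed copy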
I want to read a file in and show how large it is. .count is acting the way a destructive .count! would, consuming my input, so logfile.each afterwards doesn't iterate. What's going on?
logfile = open(input_fspec)
puts "logfile size: #{logfile.count} lines"
count will read all the lines from the input in order to count them. If you want to read the lines again (e.g. using readline or each), you will need to call logfile.rewind to move back to the start of the file.
In fact, what count actually returns is the number of lines that have not been read yet. For example, if you had already read through the file and called count afterwards, it would return 0.
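For example, a small sketch using the variable names from the question:

logfile = open(input_fspec)
puts "logfile size: #{logfile.count} lines"  # reads to end of file
logfile.rewind                               # back to the beginning
logfile.each { |line| puts line }            # now this iterates again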
You could do this instead, before you even open the file (note that it returns the size in bytes, not lines):
File.size(input_fspec)
I created a GUI and used uiimport to import a dataset into the MATLAB workspace. I would like to pass this imported data to another function in MATLAB. How do I pass the imported dataset into another function? I tried using diz, but it couldn't pick up the data in the MATLAB workspace. Any ideas?
[file_input, pathname] = uigetfile( ...
{'*.txt', 'Text (*.txt)'; ...
'*.xls', 'Excel (*.xls)'; ...
'*.*', 'All Files (*.*)'}, ...
'Select files');
uiimport(file_input);
M = dlmread(file_input);
X = freed(M);
I think that you need to assign the result of this statement:
uiimport(file_input);
to a variable, like this
dataset = uiimport(file_input);
and then pass that to your next function:
M = dlmread(dataset);
This is a very basic feature of MATLAB, which suggests to me that you would find it valuable to read some of the online help and some of the documentation. When you've done that, you'll probably find neater and quicker ways of doing this.
EDIT: Well, @Tim, if all else fails, RTFM. So I did, and my previous answer is incorrect. What you need to pass to dlmread is the name of the file to read. So you use either uiimport or dlmread to read the file, but not both. Which one you use depends on what you are trying to do and on the format of the input file. So, go RTFM and I'll do the same. If you are still having trouble, update your question and provide details of the contents of the file.
In your script you have three ways to read the file. Choose one of them depending on your file format. But first I would combine the file name with the path:
file_input = fullfile(pathname,file_input);
I wouldn't use UIIMPORT in a script, since the user can change the way the data is read, and the variable name depends on the file name and the user.
With DLMREAD you can only read numerical data from the file. You can also skip a number of rows or columns with
M = dlmread(file_input,'\t',1,1);
skipping the first row and one column on the left.
Or you can define a range in Excel style, as in the sketch below. See the DLMREAD documentation for more details.
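For instance (the 'B2..D10' range string is just an example):

% Read the block from B2 down to D10, using tab as the delimiter.
M = dlmread(file_input, '\t', 'B2..D10');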
The filename you pass to DLMREAD must be a string. Don't pass a file handle or any data: you will get "Filename must be a string" if it's not. Easy.
FREAD reads data from a binary file. See the documentation if you really have to do it.
There are many other functions to read the data from file. If you still have problems, show us an example of your file format, so we can suggest the best way to read it.