Sort letters in huge string - sorting

I have a 16 MB text document containing a single huge string of letters and numbers without any separators. Excerpt: "as81jsa8sm1o1kmka9s93m1l"
Is there a simple way to alphabetize all of the characters, without having to write a script? I'm afraid JS will crash under the weight of the file.
Thanks.

If you know the string only contains letters and numbers, you can use a bucket sort and achieve good performance.
I am not sure what language you are using, so I'll assume you can read the string character by character. My solution is pseudocode:
int[] buckets = new int[36]; // 26 for letters, 10 for digits; assumes only lowercase letters
while(string.hasNext()) {
char x = next character in string;
if(x.isAlpha()) {
buckets[x-'a']++;
}else {
buckets[26 + x - '0']++;
}
}
To print out the sorted string:
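string s = ""; // at the end of the loops, s will contain the sorted string
for(int i = 0; i < 26; ++i) {
int y = buckets[i]; // number of occurrences of the letter 'a' + i
for(int j = 0; j < y; ++j) {
s += (char)(i + 'a'); // append the letter itself, not its count
}
}
for(int i = 0; i < 10; ++i) {
int y = buckets[i+26]; // number of occurrences of the digit '0' + i
for(int j = 0; j < y; ++j) {
s += (char)(i + '0');
}
}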
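The counting idea above can be sketched in Python as well. This is only an illustration, not the answerer's code: the function name sort_characters is made up here, and it orders characters by code point, so digits come before letters (the pseudocode above emits letters first).

```python
from collections import Counter


def sort_characters(text: str) -> str:
    """Return the characters of text in sorted order using counting.

    One pass counts each distinct character; the output is then rebuilt
    from the counts, so no comparison sort over the full text is needed.
    """
    counts = Counter(text)
    # sorted(counts) sorts only the distinct characters (at most 36 here).
    return "".join(ch * counts[ch] for ch in sorted(counts))


if __name__ == "__main__":
    print(sort_characters("as81jsa8sm1o1kmka9s93m1l"))
```

For a 16 MB file, the text could be read in chunks and fed to the same Counter with counts.update(chunk), so the whole string never has to be sorted in memory.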

Related

Dynamic programming--Longest Common Substring: Understanding the space optimization

I am working through a very typical question: the Longest Common Substring of two strings.
The problem statement is quite clear:
For two strings s1 and s2, find the length of their longest common substring.
I understand the definition of the state represented by the dp array. It is a two-dimensional array whose two dimensions are the character indices into each string (1-based rather than 0-based).
The original solution code is below, and it is clear to me:
public int findLCSLength(String s1, String s2) {
int[][] dp = new int[s1.length()+1][s2.length()+1];
int maxLength = 0;
for(int i=1; i <= s1.length(); i++) {
for(int j=1; j <= s2.length(); j++) {
if(s1.charAt(i-1) == s2.charAt(j-1)) {
dp[i][j] = 1 + dp[i-1][j-1];
maxLength = Math.max(maxLength, dp[i][j]);
}
}
}
return maxLength;
}
This solution can obviously be optimized, since the state of dp[i][j] depends only on the previous row, which means two rows are sufficient for the dp array.
So I reduced the dp array to two rows and used the mod operation to map the row index into the range 0..1.
static int findLCSLength(String s1, String s2) {
int[][] dp = new int[2][s2.length()+1];
int maxLength = 0;
for(int i=1; i <= s1.length(); i++) {
for(int j=1; j <= s2.length(); j++) {
if(s1.charAt(i-1) == s2.charAt(j-1)) {
dp[i%2][j] = 1 + dp[(i-1)%2][j-1];
maxLength = Math.max(maxLength, dp[i%2][j]);
}
}
}
return maxLength;
}
However, my code didn't produce the correct answer for all test cases. I found a code snippet that gives the correct answer on all test cases; it has only one extra operation that I missed.
static int findLCSLength(String s1, String s2) {
int[][] dp = new int[2][s2.length()+1];
int maxLength = 0;
for(int i=1; i <= s1.length(); i++) {
for(int j=1; j <= s2.length(); j++) {
//This is the only extra line I missed
dp[i%2][j] = 0;
if(s1.charAt(i-1) == s2.charAt(j-1)) {
dp[i%2][j] = 1 + dp[(i-1)%2][j-1];
maxLength = Math.max(maxLength, dp[i%2][j]);
}
}
}
return maxLength;
}
One of the cases where my code fails is "passport" and "ppsspt": my code produced 4, but the correct answer is obviously 3.
I am confused by this line. What does it do, and why is it necessary?
I hope someone can help with that.
It resets the current count.
Your code sets this variable when:
if(s1.charAt(i-1) == s2.charAt(j-1)) {
But there's no else to set it to 0, which is effectively what that code does.
So consider when:
s1.charAt(i-1) != s2.charAt(j-1)
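The previous value stored in this array location (left over from row i-2, which shares the same i%2 slot) will carry over into the next substring comparison when it shouldn't. In the full two-dimensional version every cell starts at 0 and is written at most once, so the problem never appears there.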
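To make the effect of the reset concrete, here is a minimal Python sketch of the two-row version with an explicit else branch. It is a transcription of the idea for illustration only; the function name is made up here.

```python
def longest_common_substring_length(s1: str, s2: str) -> int:
    """Length of the longest common substring, using two rolling rows."""
    dp = [[0] * (len(s2) + 1) for _ in range(2)]
    best = 0
    for i in range(1, len(s1) + 1):
        cur, prev = dp[i % 2], dp[(i - 1) % 2]
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
            else:
                # Without this reset, cur[j] would still hold the value
                # written two rows earlier, because rows i and i-2 share
                # the same slot.
                cur[j] = 0
    return best


if __name__ == "__main__":
    print(longest_common_substring_length("passport", "ppsspt"))  # 3 ("ssp")
```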

Sort method c# string to char

I'm trying to write a sort method, but I get this error:
IndexOutOfRangeException on the line if (chars[i] > chars1[y]). amount is equal to 25.
string temp1;
for (int i = 0; i < amount; i++)
{
for (int y = i + 1; y < amount - 1; y++)
{
var chars = Duomenys[i].Pozicija.ToCharArray();
var chars1 = Duomenys[y].Pozicija.ToCharArray();
if (chars[i] > chars1[y])
{............}
You are using the same indices (i and y) to identify locations within the array Duomenys and within chars/chars1, which seem like very different things. As soon as i or y reaches the length of a Pozicija string, chars[i] or chars1[y] is out of range. I can't tell what you should be doing instead, given the lack of information provided.

Run length encoding using O(1) space

Can we do run-length encoding in place (assuming the input array is very large)?
It is easy for cases such as AAAABBBBCCCCDDDD, which becomes
A4B4C4D4
But how do we do it for a case such as ABCDEFG,
where the output would be A1B1C1D1E1F1G1?
My first thought was to start encoding from the end, so we will use the free space (if any), after that we can shift the encoded array to the start. A problem with this approach is that it will not work for AAAAB, because there is no free space (it's not needed for A4B1) and we will try to write AAAAB1 on the first iteration.
Below is a corrected solution (let's assume the sequence is AAABBC):
1. Encode all groups with two or more elements and leave the rest unchanged (this will not increase the length of the array) -> A3_B2C
2. Shift everything right, eliminating the empty spaces left by the first step -> _A3B2C
3. Encode the array from the start (reusing the already encoded groups, of course) -> A3B2C1
Every step is O(n) and as far as I can see only constant additional memory is needed.
Limitations:
Digits are not supported, but they would create problems with decoding anyway, as Petar Petrov mentioned.
We need some kind of "empty" character, but this can be worked around by padding with zeros: A03 instead of A3_
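The three passes described in this answer can be sketched in Python on a list of one-character strings. This is an illustrative reading of the answer, not its author's code. The name rle_in_place and the "_" blank marker are assumptions made here; the sketch also assumes that symbols are never digits, and that any trailing blanks in the buffer are spare capacity for inputs whose encoding is longer than the input.

```python
from typing import List

BLANK = "_"  # assumed "empty" marker; must not occur as a symbol in the input


def rle_in_place(buf: List[str]) -> int:
    """Run-length encode buf in place and return the encoded length."""
    n = len(buf)

    # Pass 1: encode every run of length >= 2 inside its own cells.
    i = 0
    while i < n:
        ch = buf[i]
        run = 1
        while i + run < n and buf[i + run] == ch:
            run += 1
        if ch != BLANK and run >= 2:
            token = ch + str(run)  # short temporary, O(log run) characters
            for k in range(run):
                buf[i + k] = token[k] if k < len(token) else BLANK
        i += run

    # Pass 2: shift all non-blank cells to the right end of the buffer.
    w = n - 1
    for r in range(n - 1, -1, -1):
        if buf[r] != BLANK:
            buf[w] = buf[r]
            w -= 1
    first = w + 1
    for k in range(first):
        buf[k] = BLANK

    # Pass 3: rewrite from the start, adding "1" after each single symbol.
    r, w = first, 0
    while r < n:
        buf[w] = buf[r]
        w += 1
        r += 1
        if r < n and buf[r].isdigit():
            # Already encoded group: copy its digits unchanged.
            while r < n and buf[r].isdigit():
                buf[w] = buf[r]
                w += 1
                r += 1
        else:
            if w >= n or (r < n and w >= r):
                raise ValueError("encoded form does not fit in the buffer")
            buf[w] = "1"
            w += 1

    for k in range(w, n):
        buf[k] = BLANK
    return w


if __name__ == "__main__":
    data = list("AAABBC")
    length = rle_in_place(data)
    print("".join(data[:length]))  # A3B2C1
```

Each pass walks the buffer once, and the only extra storage is the short count string built for one run at a time.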
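C++ solution, O(n) time and O(1) space (only safe when the write index k never overtakes the read index i, e.g. when every run has length 2 or more; a single character such as the A in ABC makes it overwrite input it has not read yet):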
string runLengthEncode(string str)
{
int len = str.length();
int j=0,k=0,cnt=0;
for(int i=0;i<len;i++)
{
j=i;
cnt=1;
while(i<len-1 && str[i]==str[i+1])
{
i++;
cnt++;
}
str[k++]=str[j];
string temp =to_string(cnt);
for(auto m:temp)
str[k++] = m;
}
str.resize(k);
return str;
}
In the code below, null marks empty items, which are ignored during encoding. Also, you can't encode digits: AAA2222 becomes A324, which reads as 324 times 'A' when it is really A3 followed by 24. Your question raises more questions.
Here's a "solution" in C#
public static void Encode(string[] input)
{
var writeIndex = 0;
var i = 0;
while (i < input.Length)
{
var symbol = input[i];
if (symbol == null)
{
break;
}
var nextIndex = i + 1;
var offset = 0;
var count = CountSymbol(input, symbol, nextIndex) + 1;
if (count == 1)
{
ShiftRight(input, nextIndex);
offset++;
}
input[writeIndex++] = symbol;
input[writeIndex++] = count.ToString();
i += count + offset;
}
Array.Clear(input, writeIndex, input.Length - writeIndex);
}
private static void ShiftRight(string[] input, int nextIndex)
{
var count = CountSymbol(input, null, nextIndex, (a, b) => a != b);
Array.Copy(input, nextIndex, input, nextIndex + 1, count);
}
private static int CountSymbol(string[] input, string symbol, int nextIndex)
{
return CountSymbol(input, symbol, nextIndex, (a, b) => a == b);
}
private static int CountSymbol(string[] input, string symbol, int nextIndex, Func<string, string, bool> cmp)
{
var count = 0;
var i = nextIndex;
while (i < input.Length && cmp(input[i], symbol))
{
count++;
i++;
}
return count;
}
The first solution does not handle single characters; for example, 'Hi!' will not work. I used a totally different approach, using the insert() function to add the counts in place. This handles every case, whether the number of repeated characters is > 10, > 100, or = 1.
#include<iostream>
#include<string>
#include<algorithm>
using namespace std;
int main(){
string name = "Hello Buddy!!";
int start = 0;
char distinct = name[0];
for(int i=1;i<name.length()+1;){
if(distinct!=name[i]){
string s = to_string(i-start);
name.insert(start+1,s);
name.erase(name.begin() + start + 1 + s.length(),name.begin() + s.length() + i);
i=start+s.length()+1;
start=i;
distinct=name[start];
continue;
}
i++;
}
cout<<name;
}
Let me know if you find anything incorrect.
O(n), in-place RLE; I couldn't think of anything better than this. It does not write a count if a character occurs just once. It writes a9a2 if a character occurs 11 times.
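#include <cstring>
#include <iostream>
using namespace std;

void RLE(char *str) {
    int len = strlen(str);
    int count = 1, j = 0;   // j is the write position
    for (int i = 0; i < len; i++) {
        if (str[i] == str[i + 1]) {   // str[len] is '\0', so the last run always ends here
            count++;
            continue;
        }
        char c = str[i];
        while (count > 9) {           // split long runs into chunks of 9: 11 a's -> a9a2
            str[j++] = c;
            str[j++] = '9';
            count -= 9;
        }
        str[j++] = c;
        if (count > 1)
            str[j++] = '0' + count;   // a single digit, 2..9
        count = 1;                    // start counting the next run
    }
    str[j] = '\0';                    // terminate the shortened string
    cout << str;
}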
I/P => aaabcdeeeefghijklaaaaa
O/P => a3bcde4fghijkla5
In-place solution using C++ (assumes the encoded string is never longer than the original string):
#include <bits/stdc++.h>
#include<stdlib.h>
using namespace std;
void replacePattern(char *str)
{
    // i is the write position, j is the read position.
    // Assumes the encoding of every prefix is no longer than that prefix,
    // so that i never overtakes j.
    int i = 0, j = 0;
    // for each run of equal characters
    while (str[j])
    {
        char c = str[j];
        int count = 0;
        while (str[j] == c)
        {
            j = j + 1;
            count++;
        }
        // write the character followed by its count, most significant digit first
        str[i++] = c;
        for (char d : to_string(count))
            str[i++] = d;
    }
    // add a null character to terminate the string
    str[i] = '\0';
}
// Driver code
int main()
{
char str[] = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabccccc";
replacePattern(str);
cout << str;
return 0;
}

Longest Common Subsequence among 3 Strings

I've implemented the dynamic programming solution to find the longest common subsequence among 2 strings. There is apparently a way to generalize this algorithm to find the LCS among 3 strings, but in my research I have not found any information on how to go about this. Any help would be appreciated.
To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally, as shown in the link you posted. Every element in the array corresponds to the problem of finding the LCS of the prefixes A' and B' (A cut at its row number, B cut at its column number). The problem is solved by calculating the value of every element in the array. You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that value have already been solved. That is why you traverse the 2-dimensional array diagonally.
This solution can be scaled to finding the longest common subsequence between N strings, but this requires a general way to iterate over an array of N dimensions such that any element is reached only when all sub-problems it depends on have been solved.
Instead of iterating the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:
string lcs(string[] strings)
{
if (strings.Length == 0)
return "";
if (strings.Length == 1)
return strings[0];
int max = -1;
int cacheSize = 1;
for (int i = 0; i < strings.Length; i++)
{
cacheSize *= strings[i].Length;
if (strings[i].Length > max)
max = strings[i].Length;
}
string[] cache = new string[cacheSize];
int[] indexes = new int[strings.Length];
for (int i = 0; i < indexes.Length; i++)
indexes[i] = strings[i].Length - 1;
return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
for (int i = 0; i < indexes.Length; i++ )
if (indexes[i] == -1)
return "";
bool match = true;
for (int i = 1; i < indexes.Length; i++)
{
if (strings[0][indexes[0]] != strings[i][indexes[i]])
{
match = false;
break;
}
}
if (match)
{
int[] newIndexes = new int[indexes.Length];
for (int i = 0; i < indexes.Length; i++)
newIndexes[i] = indexes[i] - 1;
string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
cache[calcCachePos(indexes, strings)] = result;
return result;
}
else
{
string[] subStrings = new string[strings.Length];
for (int i = 0; i < strings.Length; i++)
{
if (indexes[i] <= 0)
subStrings[i] = "";
else
{
int[] newIndexes = new int[indexes.Length];
for (int j = 0; j < indexes.Length; j++)
newIndexes[j] = indexes[j];
newIndexes[i]--;
int cachePos = calcCachePos(newIndexes, strings);
if (cache[cachePos] == null)
subStrings[i] = lcsBack(strings, newIndexes, cache);
else
subStrings[i] = cache[cachePos];
}
}
string longestString = "";
int longestLength = 0;
for (int i = 0; i < subStrings.Length; i++)
{
if (subStrings[i].Length > longestLength)
{
longestString = subStrings[i];
longestLength = longestString.Length;
}
}
cache[calcCachePos(indexes, strings)] = longestString;
return longestString;
}
}
int calcCachePos(int[] indexes, string[] strings)
{
int factor = 1;
int pos = 0;
for (int i = 0; i < indexes.Length; i++)
{
pos += indexes[i] * factor;
factor *= strings[i].Length;
}
return pos;
}
My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.
On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"
The LCS returned is "54442"
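For exactly three strings, the table-filling variant described at the start of this answer can be sketched directly with a 3-dimensional array. The Python below is a minimal illustration under that assumption, not a port of the C# code above: the name lcs3_length is made up here, and it returns only the length, not the subsequence itself.

```python
def lcs3_length(a: str, b: str, c: str) -> int:
    """Length of the longest common subsequence of three strings."""
    # dp[i][j][k] = LCS length of the prefixes a[:i], b[:j], c[:k].
    dp = [[[0] * (len(c) + 1) for _ in range(len(b) + 1)]
          for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            for k in range(1, len(c) + 1):
                if a[i - 1] == b[j - 1] == c[k - 1]:
                    # All three prefixes end in the same character.
                    dp[i][j][k] = dp[i - 1][j - 1][k - 1] + 1
                else:
                    # Drop the last character of one prefix at a time.
                    dp[i][j][k] = max(dp[i - 1][j][k],
                                      dp[i][j - 1][k],
                                      dp[i][j][k - 1])
    return dp[len(a)][len(b)][len(c)]


if __name__ == "__main__":
    print(lcs3_length("AGGT12", "12TXAYB", "12XBA"))  # 2 ("12")
```

Plain nested loops in increasing index order already guarantee that every cell a value depends on has been filled, at the cost of memory proportional to the product of the three lengths.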

Trimming a string with <= 2 characters

Suppose you are given an input string:
"my name is vikas"
Suggest an algorithm to modify it to:
"name vikas"
That is, remove words of length <= 2, or <= k characters to make it generic.
I think you can do this in place in O(n) time. Iterate over the string, keeping a pointer to the beginning of the word you're processing. If you find that the length of the word is greater than k, you overwrite the beginning of the string with this word. Here's some C code (it assumes that words are separated by exactly one space):
#include <stdio.h>
#include <string.h>

void modify(char *s, int k){
int n = strlen(s);
int j = 0, cnt = 0, r = 0, prev = -1;
s[n++] = ' '; // Sentinel (overwrites the '\0') to avoid a special case for the last word
for(int i=0; i<n; i++){
if(s[i] == ' '){
if (cnt > k){
if(r > 0) s[r++] = ' ';
while(j < i) s[r++] = s[j++];
}
cnt = 0;
}
else {
if (prev == ' ') j = i;
cnt++;
}
prev = s[i];
}
s[r] = '\0';
}
int main(){
char s[] = "my name is vikas";
modify(s, 2);
printf("%s\n", s);
}
"a short sentence of words" split ' ' filter {_.length > 2} mkString " "
(Scala)
Iterate over the individual characters of the string, keeping track of the current position and the "current word"; accumulate every word with length > k; then reassemble the string from the accumulated words?
This algorithm uses in-place rewriting and minimizes the number of copies between elements:
final int k = 2;
char[] test = " my name is el jenso ".toCharArray();
int l = test.length;
int pos = 0;
int cwPos = 0;
int copyPos = 0;
while (pos < l)
{
if (Character.isWhitespace(test[pos]))
{
int r = pos - cwPos;
if (r - 1 < k)
{
copyPos -= r;
cwPos = ++pos;
}
else
{
cwPos = ++pos;
test[copyPos++] = ' ';
}
}
else
{
test[copyPos++] = test[pos++];
}
}
System.out.println(new String(test, 0, copyPos));
split() by " " and omit if length() <= 2
Something like this will suffice (time complexity is optimal, I guess):
input
.Split(' ')
.Where(s => s.Length > k)
.Aggregate(new StringBuilder(), (sb, s) => sb.Length == 0 ? sb.Append(s) : sb.Append(' ').Append(s))
.ToString()
What about space complexity? Well, this can run in O(k) extra space (not counting the input and output, of course) if you think about it. It won't in .NET, because Split builds an actual array. But you can build iterators instead. And if you treat the string as just an iterator over characters, it becomes an O(1) algorithm.
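The iterator idea can be sketched in Python with a generator that consumes characters one at a time and buffers at most k + 1 characters of the current word. This is an illustration under the assumption that words are separated by spaces; the name drop_short_words is made up here.

```python
from typing import Iterable, Iterator, List


def drop_short_words(chars: Iterable[str], k: int) -> Iterator[str]:
    """Yield the characters of the input with words of length <= k removed."""
    pending: List[str] = []  # start of the current word, at most k + 1 chars
    streaming = False        # True once the current word is known to be kept
    emitted_any = False      # True once at least one word has been output
    for ch in chars:
        if ch == " ":
            # Word boundary: a word still sitting in the buffer was too short.
            pending.clear()
            streaming = False
        elif streaming:
            yield ch
        else:
            pending.append(ch)
            if len(pending) > k:
                # The word is long enough: flush the buffer, then stream.
                if emitted_any:
                    yield " "
                yield from pending
                pending.clear()
                streaming = True
                emitted_any = True


if __name__ == "__main__":
    print("".join(drop_short_words("my name is vikas", 2)))  # name vikas
```

Because a word is flushed as soon as it is known to exceed k characters, the buffer never grows beyond k + 1 entries regardless of the input size.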

Resources