Reorder a string by half the character - algorithm

This is an interview question.
Given a string such as: 123456abcdef consisting of n/2 integers followed by n/2 characters. Reorder the string to contain as 1a2b3c4d5e6f . The algortithm should be in-place.
The solution I gave was trivial - O(n^2). Just shift the characters by n/2 places to the left.
I tried using recursion as -
a. Swap later half of the first half with the previous half of the 2nd part - eg
123 456 abc def
123 abc 456 def
b. Recurse on the two halves.
The pbm I am stuck is that the swapping varies with the number of elements - for eg.
What to do next?
123 abc
12ab 3c
And what to do for : 12345 abcde
123abc 45ab
This is a pretty old question and may be a duplicate. Please let me know.. :)
Another example:
Input: 38726zfgsa
Output: 3z8f7g2s6a

Here's how I would approach the problem:
1) Divide the string into two partitions, number part and letter part
2) Divide each of those partitions into two more (equal sized)
3) Swap the second the third partition (inner number and inner letter)
4) Recurse on the original two partitions (with their newly swapped bits)
5) Stop when the partition has a size of 2
For example:
123456abcdef -> 123456 abcdef -> 123 456 abc def -> 123 abc 456 def
123abc -> 123 abc -> 12 3 ab c -> 12 ab 3 c
12 ab -> 1 2 a b -> 1 a 2 b
... etc
And the same for the other half of the recursion..
All can be done in place with the only gotcha being swapping partitions that aren't the same size (but it'll be off by one, so not difficult to handle).

It is easy to permute an array in place by chasing elements round cycles if you have a bit-map to mark which elements have been moved. We don't have a separate bit-map, but IF your characters are letters (or at least have the high order bit clear) then we can use the top bit of each character to mark this. This produces the following program, which is not recursive and so does not use stack space.
class XX
{
/** new position given old position */
static int newFromOld(int x, int n)
{
if (x < n / 2)
{
return x * 2;
}
return (x - n / 2) * 2 + 1;
}
private static int HIGH_ORDER_BIT = 1 << 15; // 16-bit chars
public static void main(String[] s)
{
// input data - create an array so we can modify
// characters in place
char[] x = s[0].toCharArray();
if ((x.length & 1) != 0)
{
System.err.println("Only works with even length strings");
return;
}
// Character we have read but not yet written, if any
char holding = 0;
// where character in hand was read from
int holdingPos = 0;
// whether picked up a character in our hand
boolean isHolding = false;
int rpos = 0;
while (rpos < x.length)
{ // Here => moved out everything up to rpos
// and put in place with top bit set to mark new occupant
if (!isHolding)
{ // advance read pointer to read new character
char here = x[rpos];
holdingPos = rpos++;
if ((here & HIGH_ORDER_BIT) != 0)
{
// already dealt with
continue;
}
int targetPos = newFromOld(holdingPos, x.length);
// pick up char at target position
holding = x[targetPos];
// place new character, and mark as new
x[targetPos] = (char)(here | HIGH_ORDER_BIT);
// Now holding a character that needs to be put in its
// correct place
isHolding = true;
holdingPos = targetPos;
}
int targetPos = newFromOld(holdingPos, x.length);
char here = x[targetPos];
if ((here & HIGH_ORDER_BIT) != 0)
{ // back to where we picked up a character to hold
isHolding = false;
continue;
}
x[targetPos] = (char)(holding | HIGH_ORDER_BIT);
holding = here;
holdingPos = targetPos;
}
for (int i = 0; i < x.length; i++)
{
x[i] ^= HIGH_ORDER_BIT;
}
System.out.println("Result is " + new String(x));
}
}

These days, if I asked someone that question, what I'm looking for them to write on the whiteboard first is:
assertEquals("1a2b3c4d5e6f",funnySort("123456abcdef"));
...
and then maybe ask for more examples.
(And then, depending, if the task is to interleave numbers & letters, I think you can do it with two walking-pointers, indexLetter and indexDigit, and advance them across swapping as needed til you reach the end.)

In your recursive solution why don't you just make a test if n/2 % 2 == 0 (n%4 ==0 ) and treat the 2 situations differently
As templatetypedef commented your recursion cannot be in-place.
But here is a solution (not in place) using the way you wanted to make your recursion :
def f(s):
n=len(s)
if n==2: #initialisation
return s
elif n%4 == 0 : #if n%4 == 0 it's easy
return f(s[:n/4]+s[n/2:3*n/4])+f(s[n/4:n/2]+s[3*n/4:])
else: #otherwise, n-2 %4 == 0
return s[0]+s[n/2]+f(s[1:n/2]+s[n/2+1:])

Here we go. Recursive, cuts it in half each time, and in-place. Uses the approach outlined by #Chris Mennie. Getting the splitting right was tricky. A lot longer than Python, innit?
/* In-place, divide-and-conquer, recursive riffle-shuffle of strings;
* even length only. No wide characters or Unicode; old school. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void testrif(const char *s);
void riffle(char *s);
void rif_recur(char *s, size_t len);
void swap(char *s, size_t midpt, size_t len);
void flip(char *s, size_t len);
void if_odd_quit(const char *s);
int main(void)
{
testrif("");
testrif("a1");
testrif("ab12");
testrif("abc123");
testrif("abcd1234");
testrif("abcde12345");
testrif("abcdef123456");
return 0;
}
void testrif(const char *s)
{
char mutable[20];
strcpy(mutable, s);
printf("'%s'\n", mutable);
riffle(mutable);
printf("'%s'\n\n", mutable);
}
void riffle(char *s)
{
if_odd_quit(s);
rif_recur(s, strlen(s));
}
void rif_recur(char *s, size_t len)
{
/* Turn, e.g., "abcde12345" into "abc123de45", then recurse. */
size_t pivot = len / 2;
size_t half = (pivot + 1) / 2;
size_t twice = half * 2;
if (len < 4)
return;
swap(s + half, pivot - half, pivot);
rif_recur(s, twice);
rif_recur(s + twice, len - twice);
}
void swap(char *s, size_t midpt, size_t len)
{
/* Swap s[0..midpt] with s[midpt..len], in place. Algorithm from
* Programming Pearls, Chapter 2. */
flip(s, midpt);
flip(s + midpt, len - midpt);
flip(s, len);
}
void flip(char *s, size_t len)
{
/* Reverse order of characters in s, in place. */
char *p, *q, tmp;
if (len < 2)
return;
for (p = s, q = s + len - 1; p < q; p++, q--) {
tmp = *p;
*p = *q;
*q = tmp;
}
}
void if_odd_quit(const char *s)
{
if (strlen(s) % 2) {
fputs("String length is odd; aborting.\n", stderr);
exit(1);
}
}

By comparing 123456abcdef and 1a2b3c4d5e6f we can note that only the first and the last characters are in their correct position. We can also note that for each remaining n-2 characters we can compute their correct position directly from their original position. They will get there, and the element that was there surely was not in the correct position, so it will have to replace another one. By doing n-2 such steps all the elements will get to the correct positions:
void funny_sort(char* arr, int n){
int pos = 1; // first unordered element
char aux = arr[pos];
for (int iter = 0; iter < n-2; iter++) { // n-2 unordered elements
pos = (pos < n/2) ? pos*2 : (pos-n/2)*2+1;// correct pos for aux
swap(&aux, arr + pos);
}
}

Score each digit as its numerical value. Score each letter as a = 1.5, b = 2.5 c = 3.5 etc. Run an insertion sort of the string based on the score of each character.
[ETA] Simple scoring won't work so use two pointers and reverse the piece of the string between the two pointers. One pointer starts at the front of the string and advances one step each cycle. The other pointer starts in the middle of the string and advances every second cycle.
123456abcdef
^ ^
1a65432bcdef
^ ^
1a23456bcdef
^ ^
1a2b6543cdef
^ ^

Related

Find all anagrams in a string O(n) solution

Here is the problem:
Given a string s and a non-empty string p, find all the start indices of p's anagrams in s.
Input: s: "cbaebabacd" p: "abc"
Output: [0, 6]
Input: s: "abab" p: "ab"
Output: [0, 1, 2]
Here is my solution
vector<int> findAnagrams(string s, string p) {
vector<int> res, s_map(26,0), p_map(26,0);
int s_len = s.size();
int p_len = p.size();
if (s_len < p_len) return res;
for (int i = 0; i < p_len; i++) {
++s_map[s[i] - 'a'];
++p_map[p[i] - 'a'];
}
if (s_map == p_map)
res.push_back(0);
for (int i = p_len; i < s_len; i++) {
++s_map[s[i] - 'a'];
--s_map[s[i - p_len] - 'a'];
if (s_map == p_map)
res.push_back(i - p_len + 1);
}
return res;
}
However, I think it is O(n^2) solution because I have to compare vectors s_map and p_map.
Does a O(n) solution exist for this problem?
lets say p has size n.
lets say you have an array A of size 26 that is filled with the number of a,b,c,... which p contains.
then you create a new array B of size 26 filled with 0.
lets call the given (big) string s.
first of all you initialize B with the number of a,b,c,... in the first n chars of s.
then you iterate through each word of size n in s always updating B to fit this n-sized word.
always B matches A you will have an index where we have an anagram.
to change B from one n-sized word to another, notice you just have to remove in B the first char of the previous word and add the new char of the next word.
Look at the example:
Input
s: "cbaebabacd"
p: "abc" n = 3 (size of p)
A = {1, 1, 1, 0, 0, 0, ... } // p contains just 1a, 1b and 1c.
B = {1, 1, 1, 0, 0, 0, ... } // initially, the first n-sized word contains this.
compare(A,B)
for i = n; i < size of s; i++ {
B[ s[i-n] ]--;
B[ s[ i ] ]++;
compare(A,B)
}
and suppose that compare(A,B) prints the index always A matches B.
the total complexity will be:
first fill of A = O(size of p)
first fill of B = O(size of s)
first comparison = O(26)
for-loop = |s| * (2 + O(26)) = |s| * O(28) = O(28|s|) = O(size of s)
____________________________________________________________________
2 * O(size of s) + O(size of p) + O(26)
which is linear in size of s.
Your solution is the O(n) solution. The size of the s_map and p_map vectors is a constant (26) that doesn't depend on n. So the comparison between s_map and p_map takes a constant amount of time regardless of how big n is.
Your solution takes about 26 * n integer comparisons to complete, which is O(n).
// In papers on string searching algorithms, the alphabet is often
// called Sigma, and it is often not considered a constant. Your
// algorthm works in (Sigma * n) time, where n is the length of the
// longer string. Below is an algorithm that works in O(n) time even
// when Sigma is too large to make an array of size Sigma, as long as
// values from Sigma are a constant number of "machine words".
// This solution works in O(n) time "with high probability", meaning
// that for all c > 2 the probability that the algorithm takes more
// than c*n time is 1-o(n^-c). This is a looser bound than O(n)
// worst-cast because it uses hash tables, which depend on randomness.
#include <functional>
#include <iostream>
#include <type_traits>
#include <vector>
#include <unordered_map>
#include <vector>
using namespace std;
// Finding a needle in a haystack. This works for any iterable type
// whose members can be stored as keys of an unordered_map.
template <typename T>
vector<size_t> AnagramLocations(const T& needle, const T& haystack) {
// Think of a contiguous region of an ordered container as
// representing a function f with the domain being the type of item
// stored in the container and the codomain being the natural
// numbers. We say that f(x) = n when there are n x's in the
// contiguous region.
//
// Then two contiguous regions are anagrams when they have the same
// function. We can track how close they are to being anagrams by
// subtracting one function from the other, pointwise. When that
// difference is uniformly 0, then the regions are anagrams.
unordered_map<remove_const_t<remove_reference_t<decltype(*needle.begin())>>,
intmax_t> difference;
// As we iterate through the haystack, we track the lead (part
// closest to the end) and lag (part closest to the beginning) of a
// contiguous region in the haystack. When we move the region
// forward by one, one part of the function f is increased by +1 and
// one part is decreased by -1, so the same is true of difference.
auto lag = haystack.begin(), lead = haystack.begin();
// To compare difference to the uniformly-zero function in O(1)
// time, we make sure it does not contain any points that map to
// 0. The the property of being uniformly zero is the same as the
// property of having an empty difference.
const auto find = [&](const auto& x) {
difference[x]++;
if (0 == difference[x]) difference.erase(x);
};
const auto lose = [&](const auto& x) {
difference[x]--;
if (0 == difference[x]) difference.erase(x);
};
vector<size_t> result;
// First we initialize the difference with the first needle.size()
// items from both needle and haystack.
for (const auto& x : needle) {
lose(x);
find(*lead);
++lead;
if (lead == haystack.end()) return result;
}
size_t i = 0;
if (difference.empty()) result.push_back(i++);
// Now we iterate through the haystack with lead, lag, and i (the
// position of lag) updating difference in O(1) time at each spot.
for (; lead != haystack.end(); ++lead, ++lag, ++i) {
find(*lead);
lose(*lag);
if (difference.empty()) result.push_back(i);
}
return result;
}
int main() {
string needle, haystack;
cin >> needle >> haystack;
const auto result = AnagramLocations(needle, haystack);
for (auto x : result) cout << x << ' ';
}
import java.util.*;
public class FindAllAnagramsInAString_438{
public static void main(String[] args){
String s="abab";
String p="ab";
// String s="cbaebabacd";
// String p="abc";
System.out.println(findAnagrams(s,p));
}
public static List<Integer> findAnagrams(String s, String p) {
int i=0;
int j=p.length();
List<Integer> list=new ArrayList<>();
while(j<=s.length()){
//System.out.println("Substring >>"+s.substring(i,j));
if(isAnamgram(s.substring(i,j),p)){
list.add(i);
}
i++;
j++;
}
return list;
}
public static boolean isAnamgram(String s,String p){
HashMap<Character,Integer> map=new HashMap<>();
if(s.length()!=p.length()) return false;
for(int i=0;i<s.length();i++){
char chs=s.charAt(i);
char chp=p.charAt(i);
map.put(chs,map.getOrDefault(chs,0)+1);
map.put(chp,map.getOrDefault(chp,0)-1);
}
for(int val:map.values()){
if(val!=0) return false;
}
return true;
}
}

Find word in string buffer/paragraph/text

This was asked in Amazon telephonic interview - "Can you write a program (in your preferred language C/C++/etc.) to find a given word in a string buffer of big size ? i.e. number of occurrences "
I am still looking for perfect answer which I should have given to the interviewer.. I tried to write a linear search (char by char comparison) and obviously I was rejected.
Given a 40-45 min time for a telephonic interview, what was the perfect algorithm he/she was looking for ???
The KMP Algorithm is a popular string matching algorithm.
KMP Algorithm
Checking char by char is inefficient. If the string has 1000 characters and the keyword has 100 characters, you don't want to perform unnecessary comparisons. The KMP Algorithm handles many cases which can occur, but I imagine the interviewer was looking for the case where: When you begin (pass 1), the first 99 characters match, but the 100th character doesn't match. Now, for pass 2, instead of performing the entire comparison from character 2, you have enough information to deduce where the next possible match can begin.
// C program for implementation of KMP pattern searching
// algorithm
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
void computeLPSArray(char *pat, int M, int *lps);
void KMPSearch(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int *lps = (int *)malloc(sizeof(int)*M);
int j = 0; // index for pat[]
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
while (i < N)
{
if (pat[j] == txt[i])
{
j++;
i++;
}
if (j == M)
{
printf("Found pattern at index %d \n", i-j);
j = lps[j-1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i])
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j-1];
else
i = i+1;
}
}
free(lps); // to avoid memory leak
}
void computeLPSArray(char *pat, int M, int *lps)
{
int len = 0; // length of the previous longest prefix suffix
int i;
lps[0] = 0; // lps[0] is always 0
i = 1;
// the loop calculates lps[i] for i = 1 to M-1
while (i < M)
{
if (pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
if (len != 0)
{
// This is tricky. Consider the example
// AAACAAAA and i = 7.
len = lps[len-1];
// Also, note that we do not increment i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
// Driver program to test above function
int main()
{
char *txt = "ABABDABACDABABCABAB";
char *pat = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
This code is taken from a really good site that teaches algorithms:
Geeks for Geeks KMP
Amazon and companies alike expect knowledge of Boyer–Moore string search or / and Knuth–Morris–Pratt algorithms.
Those are good if you want to show perfect knowledge. Otherwise, try to be creative and write something relatively elegant and efficient.
Did you ask about delimiters before you wrote anything? It could be that they may simplify your task to provide some extra information about a string buffer.
Even code below could be ok (it's really not) if you provide enough information in advance, properly explain runtime, space requirements, choice of data containers.
int find( std::string & the_word, std::string & text )
{
std::stringstream ss( text ); // !!! could be really bad idea if 'text' is really big
std::string word;
std::unordered_map< std::string, int > umap;
while( ss >> text ) ++umap[text]; // you have to assume that each word separated by white-spaces.
return umap[the_word];
}

Points and segments

I'm doing online course and got stuck at this problem.
The first line contains two non-negative integers 1 ≤ n, m ≤ 50000 — the number of segments and points on a line, respectively. The next n lines contain two integers a_i ≤ b_i defining the i-th segment. The next line contain m integers defining points. All the integers are of absolute value at most 10^8. For each segment, output the number of points it is used from the n-points table.
My solution is :
for point in points:
occurrence = 0
for l, r in segments:
if l <= point <= r:
occurrence += 1
print(occurrence),
The complexity of this algorithm is O(m*n), which is obviously not very efficient. What is the best way of solving this problem? Any help will be appreciated!
Sample Input:
2 3
0 5
7 10
1 6 11
Sample Output:
1 0 0
Sample Input 2:
1 3
-10 10
-100 100 0
Sample Output 2:
0 0 1
You can use sweep line algorithm to solve this problem.
First, break each segment into two points, open and close points.
Add all these points together with those m points, and sort them based on their locations.
Iterating through the list of points, maintaining a counter, every time you encounter an open point, increase the counter, and if you encounter an end point, decrease it. If you encounter a point in list m point, the result for this point is the value of counter at this moment.
For example 2, we have:
1 3
-10 10
-100 100 0
After sorting, what we have is:
-100 -10 0 10 100
At point -100, we have `counter = 0`
At point -10, this is open point, we increase `counter = 1`
At point 0, so result is 1
At point 10, this is close point, we decrease `counter = 0`
At point 100, result is 0
So, result for point -100 is 0, point 100 is 0 and point 0 is 1 as expected.
Time complexity is O((n + m) log (n + m)).
[Original answer] by how many segments is each point used
I am not sure I got the problem correctly but looks like simple example of Histogram use ...
create counter array (one item per point)
set it to zero
process the last line incrementing each used point counter O(m)
write the answer by reading histogram O(n)
So the result should be O(m+n) something like (C++):
const int n=2,m=3;
const int p[n][2]={ {0,5},{7,10} };
const int s[m]={1,6,11};
int i,cnt[n];
for (i=0;i<n;i++) cnt[i]=0;
for (i=0;i<m;i++) if ((s[i]>=0)&&(s[i]<n)) cnt[s[i]]++;
for (i=0;i<n;i++) cout << cnt[i] << " "; // result: 0 1
But as you can see the p[] coordinates are never used so either I missed something in your problem description or you missing something or it is there just to trick solvers ...
[edit1] after clearing the inconsistencies in OP the result is a bit different
By how many points is each segment used:
create counter array (one item per segment)
set it to zero
process the last line incrementing each used point counter O(m)
write the answer by reading histogram O(m)
So the result is O(m) something like (C++):
const int n=2,m=3;
const int p[n][2]={ {0,5},{7,10} };
const int s[m]={1,6,11};
int i,cnt[m];
for (i=0;i<m;i++) cnt[i]=0;
for (i=0;i<m;i++) if ((s[i]>=0)&&(s[i]<n)) cnt[i]++;
for (i=0;i<m;i++) cout << cnt[i] << " "; // result: 1,0,0
[Notes]
After added new sample set to OP it is clear now that:
indexes starts from 0
the problem is how many points from table p[n] are really used by each segment (m numbers in output)
Use Binary Search.
Sort the line segments according to 1st value and the second value. If you use c++, you can use custom sort like this:
sort(a,a+n,fun); //a is your array of pair<int,int>, coordinates representing line
bool fun(pair<int,int> a, pair<int,int> b){
if(a.first<b.first)
return true;
if(a.first>b.first)
return false;
return a.second < b.second;
}
Then, for every point, find the 1st line that captures the point and the first line that does not (after the line that does of course). If no line captures the point, you can return -1 or something (and not check for the point that does not).
Something like:
int checkFirstHold(pair<int,int> a[], int p,int min, int max){ //p is the point
while(min < max){
int mid = (min + max)/2;
if(a[mid].first <= p && a[mid].second>=p && a[mid-1].first<p && a[mid-1].second<p) //ie, p is in line a[mid] and not in line a[mid-1]
return mid;
if(a[mid].first <= p && a[mid].second>=p && a[mid-1].first<=p && a[mid-1].second>=p) //ie, p is both in line a[mid] and not in line a[mid-1]
max = mid-1;
if(a[mid].first < p && a[mid].second<p ) //ie, p is not in line a[mid]
min = mid + 1;
}
return -1; //implying no point holds the line
}
Similarly, write a checkLastHold function.
Then, find checkLastHold - checkFirstHold for every point, which is the answer.
The complexity of this solution will be O(n log m), as it takes (log m) for every calculation.
Here is my counter-based solution in Java.
Note that all points, segment start and segment end are read into one array.
If points of different PointType have the same x-coordinate, then the point is sorted after segment start and before segment end. This is done to count the point as "in" the segment if it coincides with both the segment start (counter already increased) and the segment end (counter not yet decreased).
For storing an answer in the same order as the points from the input, I create the array result of size pointsCount (only points counted, not the segments) and set its element with index SuperPoint.index, which stores the position of the point in the original input.
import java.util.Arrays;
import java.util.Scanner;
public final class PointsAndSegmentsSolution {
enum PointType { // in order of sort, so that the point will be counted on both segment start and end coordinates
SEGMENT_START,
POINT,
SEGMENT_END,
}
static class SuperPoint {
final PointType type;
final int x;
final int index; // -1 (actually does not matter) for segments, index for points
public SuperPoint(final PointType type, final int x) {
this(type, x, -1);
}
public SuperPoint(final PointType type, final int x, final int index) {
this.type = type;
this.x = x;
this.index = index;
}
}
private static int[] countSegments(final SuperPoint[] allPoints, final int pointsCount) {
Arrays.sort(allPoints, (o1, o2) -> {
if (o1.x < o2.x)
return -1;
if (o1.x > o2.x)
return 1;
return Integer.compare( o1.type.ordinal(), o2.type.ordinal() ); // points with the same X coordinate by order in PointType enum
});
final int[] result = new int[pointsCount];
int counter = 0;
for (final SuperPoint superPoint : allPoints) {
switch (superPoint.type) {
case SEGMENT_START:
counter++;
break;
case SEGMENT_END:
counter--;
break;
case POINT:
result[superPoint.index] = counter;
break;
default:
throw new IllegalArgumentException( String.format("Unknown SuperPoint type: %s", superPoint.type) );
}
}
return result;
}
public static void main(final String[] args) {
final Scanner scanner = new Scanner(System.in);
final int segmentsCount = scanner.nextInt();
final int pointsCount = scanner.nextInt();
final SuperPoint[] allPoints = new SuperPoint[(segmentsCount * 2) + pointsCount];
int allPointsIndex = 0;
for (int i = 0; i < segmentsCount; i++) {
final int start = scanner.nextInt();
final int end = scanner.nextInt();
allPoints[allPointsIndex] = new SuperPoint(PointType.SEGMENT_START, start);
allPointsIndex++;
allPoints[allPointsIndex] = new SuperPoint(PointType.SEGMENT_END, end);
allPointsIndex++;
}
for (int i = 0; i < pointsCount; i++) {
final int x = scanner.nextInt();
allPoints[allPointsIndex] = new SuperPoint(PointType.POINT, x, i);
allPointsIndex++;
}
final int[] pointsSegmentsCounts = countSegments(allPoints, pointsCount);
for (final int count : pointsSegmentsCounts) {
System.out.print(count + " ");
}
}
}

Filter only digit sequences containing a given set of digits

I have a large list of digit strings like this one. The individual strings are relatively short (say less than 50 digits).
data = [
'300303334',
'53210234',
'123456789',
'5374576807063874'
]
I need to find out a efficient data structure (speed first, memory second) and algorithm which returns only those strings that are composed of a given set of digits.
Example results:
filter(data, [0,3,4]) = ['300303334']
filter(data, [0,1,2,3,4,5]) = ['300303334', '53210234']
The data list will usually fit into memory.
For each digit, precompute a postings list that don't contain the digit.
postings = [[] for _ in xrange(10)]
for i, d in enumerate(data):
for j in xrange(10):
digit = str(j)
if digit not in d:
postings[j].append(i)
Now, to find all strings that contain, for example, just the digits [1, 3, 5] you can merge the postings lists for the other digits (ie: 0, 2, 4, 6, 7, 8, 9).
def intersect_postings(p0, p1):
i0, i1 = next(p0), next(p1)
while True:
if i0 == i1:
yield i0
i0, i1 = next(p0), next(p1)
elif i0 < i1: i0 = next(p0)
else: i1 = next(p1)
def find_all(digits):
p = None
for d in xrange(10):
if d not in digits:
if p is None: p = iter(postings[d])
else: p = intersect_postings(p, iter(postings[d]))
return (data[i] for i in p) if p else iter(data)
print list(find_all([0, 3, 4]))
print list(find_all([0, 1, 2, 3, 4, 5]))
A string can be encoded by a 10-bit number. There are 2^10, or 1,024 possible values.
So create a dictionary that uses an integer for a key and a list of strings for the value.
Calculate the value for each string and add that string to the list of strings for that value.
General idea:
Dictionary Lookup;
for each (string in list)
value = 0;
for each character in string
set bit N in value, where N is the character (0-9)
Lookup[value] += string // adds string to list for this value in dictionary
Then, to get a list of the strings that match your criteria, just compute the value and do a direct dictionary lookup.
So if the user asks for strings that contain only 3, 5, and 7:
value = (1 << 3) || (1 << 5) || (1 << 7);
list = Lookup[value];
Note that, as Matt pointed out in comment below, this will only return strings that contain all three digits. So, for example, it wouldn't return 37. That seems like a fatal flaw to me.
Edit
If the number of symbols you have to deal with is very large, then the number of possible combinations becomes too large for this solution to be practical.
With a large number of symbols, I'd recommend an inverted index as suggested in the comments, combined with a secondary filter that removes the strings that contain extraneous digits.
Consider a function f which constructs a bitmask for each string with bit i set if digit i is in the string.
For example,
f('0') = 0b0000000001
f('00') = 0b0000000001
f('1') = 0b0000000010
f('1100') = 0b0000000011
Then I suggest storing a list of strings for each bitmask.
For example,
Bitmask 0b0000000001 -> ['0','00']
Once you have prepared this data structure (which is the same size as your original list), you can then easily access all the strings for a particular filter by accessing all lists where the bitmask is a subset of the digits in your filter.
So for your example of filter [0,3,4] you would return the lists from:
Strings containing just 0
Strings containing just 3
Strings containing just 4
Strings containing 0 and 3
Strings containing 0 and 4
Strings containing 3 and 4
Strings containing 0 and 3 and 4
Example Python Code
from collections import defaultdict
import itertools
raw_data = [
'300303334',
'53210234',
'123456789',
'5374576807063874'
]
def preprocess(raw_data):
data = defaultdict(list)
for s in raw_data:
bitmask = 0
for digit in s:
bitmask |= 1<<int(digit)
data[bitmask].append(s)
return data
def filter(data,mask):
for r in range(len(mask)):
for m in itertools.combinations(mask,r+1):
bitmask = sum(1<<digit for digit in m)
for s in data[bitmask]:
yield s
data = preprocess(raw_data)
for a in filter(data, [0,1,2,3,4,5]):
print a
Just for kicks, I have coded up Jim's lovely algorithm and the Perl is here if anyone wants to play with it. Please do not accept this as an answer or anything, pass all credit to Jim:
#!/usr/bin/perl
use strict;
use warnings;
my $Debug=1;
my $Nwords=1000;
my ($word,$N,$value,$i,$j,$k);
my (#dictionary,%Lookup);
################################################################################
# Generate "words" with random number of characters 5-30
################################################################################
print "DEBUG: Generating $Nwords word dictionary\n" if $Debug;
for($i=0;$i<$Nwords;$i++){
$j = rand(25) + 5; # length of this word
$word="";
for($k=0;$k<$j;$k++){
$word = $word . int(rand(10));
}
$dictionary[$i]=$word;
print "$word\n" if $Debug;
}
# Add some obvious test cases
$dictionary[++$i]="0" x 50;
$dictionary[++$i]="1" x 50;
$dictionary[++$i]="2" x 50;
$dictionary[++$i]="3" x 50;
$dictionary[++$i]="4" x 50;
$dictionary[++$i]="5" x 50;
$dictionary[++$i]="6" x 50;
$dictionary[++$i]="7" x 50;
$dictionary[++$i]="8" x 50;
$dictionary[++$i]="9" x 50;
$dictionary[++$i]="0123456789";
################################################################################
# Encode words
################################################################################
for $word (#dictionary){
$value=0;
for($i=0;$i<length($word);$i++){
$N=substr($word,$i,1);
$value |= 1 << $N;
}
push(#{$Lookup{$value}},$word);
print "DEBUG: $word encoded as $value\n" if $Debug;
}
################################################################################
# Do lookups
################################################################################
while(1){
print "Enter permitted digits, separated with commas: ";
my $line=<STDIN>;
my #digits=split(",",$line);
$value=0;
for my $d (#digits){
$value |= 1<<$d;
}
print "Value: $value\n";
print join(", ",#{$Lookup{$value}}),"\n\n" if defined $Lookup{$value};
}
I like Jim Mischel's approach. It has pretty efficient look up and bounded memory usage. Code in C follows:
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <readline/readline.h>
#include <readline/history.h>
enum {
zero = '0',
nine = '9',
numbers = nine - zero + 1,
masks = 1 << numbers,
};
typedef uint16_t mask;
struct list {
char *s;
struct list *next;
};
typedef struct list list_cell;
typedef struct list *list;
static inline int is_digit(char c) { return c >= zero && c <= nine; }
static inline mask char2mask(char c) { return 1 << (c - zero); }
static inline mask add_char2mask(mask m, char c) {
return m | (is_digit(c) ? char2mask(c) : 0);
}
static inline int is_set(mask m, mask n) { return (m & n) != 0; }
static inline int is_set_char(mask m, char c) { return is_set(m, char2mask(c)); }
static inline int is_submask(mask sub, mask m) { return (sub & m) == sub; }
static inline char *sprint_mask(char buf[11], mask m) {
char *s = buf;
char i;
for(i = zero; i <= nine; i++)
if(is_set_char(m, i)) *s++ = i;
*s = 0;
return buf;
}
static inline mask get_mask(char *s) {
mask m=0;
for(; *s; s++)
m = add_char2mask(m, *s);
return m;
}
static inline int is_empty(list l) { return !l; }
static inline list insert(list *l, char *s) {
list cell = (list)malloc(sizeof(list_cell));
cell->s = s;
cell->next = *l;
return *l = cell;
}
static void *foreach(void *f(char *, void *), list l, void *init) {
for(; !is_empty(l); l = l->next)
init = f(l->s, init);
return init;
}
struct printer_state {
int first;
FILE *f;
};
static void *prin_list_member(char *s, void *data) {
struct printer_state *st = (struct printer_state *)data;
if(st->first) {
fputs(", ", st->f);
} else
st->first = 1;
fputs(s, st->f);
return data;
}
static void print_list(list l) {
struct printer_state st = {.first = 0, .f = stdout};
foreach(prin_list_member, l, (void *)&st);
putchar('\n');
}
static list *init_lu(void) { return (list *)calloc(sizeof(list), masks); }
static list *insert2lu(list lu[masks], char *s) {
mask i, m = get_mask(s);
if(m) // skip string without any number
for(i = m; i < masks; i++)
if(is_submask(m, i))
insert(lu+i, s);
return lu;
}
int usage(const char *name) {
fprintf(stderr, "Usage: %s filename\n", name);
return EXIT_FAILURE;
}
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
static inline void chomp(char *s) { if( (s = strchr(s, '\n')) ) *s = '\0'; }
list *load_file(FILE *f) {
char *line = NULL;
size_t len = 0;
ssize_t read;
list *lu = init_lu();
for(; (read = getline(&line, &len, f)) != -1; line = NULL) {
chomp(line);
insert2lu(lu, line);
}
return lu;
}
void read_reqs(list *lu) {
char *line;
char buf[11];
for(; (line = readline("> ")); free(line))
if(*line) {
add_history(line);
mask m = get_mask(line);
printf("mask: %s\nstrings: ", sprint_mask(buf, m));
print_list(lu[m]);
};
putchar('\n');
}
int main(int argc, const char* argv[] ) {
const char *name = argv[0];
FILE *f;
list *lu;
if(argc != 2) return usage(name);
f = fopen(argv[1], "r");
if(!f) handle_error("open");
lu = load_file(f);
fclose(f);
read_reqs(lu);
return EXIT_SUCCESS;
}
To compile use
gcc -lreadline -o digitfilter digitfilter.c
And test run:
$ cat data.txt
300303334
53210234
123456789
5374576807063874
$ ./digitfilter data.txt
> 034
mask: 034
strings: 300303334
> 0,1,2,3,4,5
mask: 012345
strings: 53210234, 300303334
> 0345678
mask: 0345678
strings: 5374576807063874, 300303334
Put each value into a set-- Eg.: '300303334'={3, 0, 4}.
Since the length of your data items are bound by a constant (50),
you can do these at O(1) time for each item using Java HashSet. The overall complexity of this phase adds up to O(n).
For each filter set, use containsAll() of HashSet to see whether
each of these data items is a subset of your filter. Takes O(n).
Takes O(m*n) in the overall where n is the number of data items and m the number of filters.

Efficient Way to Arrange Odd and Even Data Sequentially

I have data like (1,2,3,4,5,6,7,8) .I want to arrange them in a way like (1,3,5,7,2,4,6,8) in n/2-2 swap without using any array and loop must be use 1 or less.
Note that i have to do the swap in existing array of number.If there is other way like without swap and without extra array use,
Please give me some advice.
maintain two pointers: p1,p2. p1 goes from start to end, p2 goes from end to start, and swap non matching elements.
pseudo code:
specialSort(array):
p1 <- array.start()
p2 <- array.end()
while (p1 != p2):
if (*p1 %2 == 0):
p1 <- p1 + 1;
continue;
if (*p2 %2 == 1):
p2 <- p2 -1;
continue;
//when here, both p1 and p2 need a swap
swap(p1,p2);
Note that complexity is O(n), at least one of p1 or p2 changes in every second iteration, so the loop cannot repeat more the 2*n=O(n) times. [we can find better bound, but it is not needed]. space complexity is trivially O(1), we allocate a constant amount of space: 2 pointers only.
Note2: if your language does not support pointers [i.e. java,ml,...], it can be replaced with indexes: i1 going from start to end, i2 going from end to start, with the same algorithm principle.
#include <stdio.h>
#include <string.h>
char array[26] = "ABcdEfGiHjklMNOPqrsTUVWxyZ" ;
#define COUNTOF(a_) (sizeof(a_)/sizeof(a_)[0])
#define IS_ODD(e) ((e)&0x20)
#define IS_EVEN(e) (!IS_ODD(e))
void doswap (char *ptr, unsigned sizl, unsigned sizr);
int main(void)
{
unsigned bot,limit,cut,top,size;
size = COUNTOF(array);
printf("Before:%26.26s\n", array);
/* pass 1 count the number of EVEN chars */
for (limit=top=0; top < size; top++) {
if ( IS_EVEN( array[top] ) ) limit++;
}
/* skip initial segment of EVEN */
for (bot=0; bot < limit;bot++ ) {
if ( IS_ODD(array[bot])) break;
}
/* Find leading strech of misplaced ODD + trailing stretch of EVEN */
for (cut=bot;bot < limit; cut = top) {
/* count misplaced items */
for ( ;cut < size && IS_ODD(array[cut]); cut++) {;}
/* count shiftable items */
for (top=cut;top < size && IS_EVEN(array[top]); top++) {;}
/* Now, [bot...cut) and [cut...top) are two blocks
** that need to be swapped: swap them */
doswap(array+bot, cut-bot, top-cut);
bot += top-cut;
}
printf("Result:%26.26s\n", array);
return 0;
}
void doswap (char *ptr, unsigned sizl, unsigned sizr)
{
if (!sizl || !sizr) return;
if (sizl >= sizr) {
char tmp[sizr];
memcpy(tmp, ptr+sizl, sizr);
memmove(ptr+sizr, ptr, sizl);
memcpy(ptr, tmp, sizr);
}
else {
char tmp[sizr];
memcpy(tmp, ptr, sizl);
memmove(ptr, ptr+sizl, sizr);
memcpy(ptr+sizl, tmp, sizl);
}
}

Resources