Minimizing algorithm checks of subsets

I need help with algorithm I’m working on, my problem have 3 main components:

Item: an object with 2 fields

{
  Integer reward;
  Boolean_2D_Array area;
}

Generator:a module that generates items according to certain logic.

Collection: where I keep generated Items

For every items that arrives from the Generator I need to decide one of the following:

  • it goes into the collection
  • I throw it away
  • I throw other item from the collection and put it instead

The decision is made according to “preferable” criteria, item A is prefferable to item B iff:

  • A.reward >= B.reward
  • B.area is subset of A.area – This rule also means that there is no True value in B that is False in A, it can be achive as:

    XOR(A.area,OR(A.area,B.area)) == 0

in example:

A      B
0011   0001
0011   0011
1111   0111

in the above example B.area is subset of A.area but A.area is not subset of B.area.

So if I get item from the generator, and there it is not preferable of any other item then I put it in the collection,
In a preferable case – I’ll keep only one of the items. for example, If the above A came from the generater and I have B in my collection, I’ll throw B from the collection and keep A (assuming same reward for both).

My Problem:
how to manage the collection?

As an initial implementation I hold a simple linkedlist, and for every item i get from the generator I iterate over all existing items and check the preferability according to the reward size and subset, but is there a better way? how can I store the Items in a way that will allow me to skip some of the subset checks?
I mean, in my naiive implementation I do O(n^2) checks, is there a way to store the items in other data structure that will hold information regarding the area and by that minimize the checks needed?

P.S.
All True values in the item’s area are connected (you can travel from any true value to any other true value by moving [left,right,up,down] and tuching only true values all the way.

EDIT: as per Doc’s request

The number of items can reach 100K easely, acutally it can be much bigger, 10M or even more -> this is part of problem solving mechanism where the problem size can be infinite so the better the impl. will be the bigger problems I can solve.

the 2D bit array represent an aspect of the problem so also the bigger the better, for now it’s about less than 1000 bits.

the subset check implemented the best I can think (in c++, with xor & or of bytes) but when the collection gets big it takes a huge amount of time to progress.

0

Currently, I have no good idea to reduce the running time order O(n^2). However, you might improve the speed of the area comparison by using a technique similar to Bloom filters. The idea is to reduce the original set area to some area of reduced size, lets call it area_r, by applying a hash function to each of the element indexes of area.

By doing so, if B.area is subset of A.area, B.area_r must be also subset of A.area_r; from this, it follows if B.area_r is not contained in A.area_r, B.area is not a subset of A.area. The opposite is not necessarily true, of course. But this will allow you, for example, to reduce the original 1000 bits to a reduced set size of, say 64, and you need only to run the full subset test on the 1000 bits if the 64-bit test tells you “B.area_r is contained in A.area_r”. Since 64 bit is the typical word size of contemporary processors, the XOR / OR operation you suggested can be done in a single statement, without any loops.

3

A straight forward approach is to partition the Items according to how many bits are set in the area. Clearly, if a.bits > b.bits, then area a cannot be a subset of area b.

Going from that further, one might consider building a structure similar to a hypertree.

A      B
0011   0001
0011   0011
1111   0111
0000   0000 <- just for demonstrating purposes to have width and height both a power of 2

might be represented by (considering a 2×2 grouping)

X      Y
04     03
22     12

(where the numbers represent the numbers of set bits).
X and Y are nodes which have A and B as children, respectively. X and Y might also have more children.

To check if B is a subset of some other item, you then visit the compacted nodes (X=0422) and check if each cell has more bits set than the compact representation for B. If yes, you can repeat the test for all children of X, using the complete area.

I hope my explanations are comprehensible. 🙂
In the worst case, this will not be better than n^2, but in practice can hopefully significantly reduce the number of comparisons.

I just realise that this is probably a special case of Doc Brown’s answer, where the hash function is simply “count”. But maybe this explanations give you more insight.

Trả lời

Email của bạn sẽ không được hiển thị công khai. Các trường bắt buộc được đánh dấu *