Determine added, removed and kept items between two lists

Arminius 05/15/2018. 2 answers, 686 views
python algorithm python-3.x iterator

My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.

def diff(before, after):
    """Return diff between iterables of sortable items `before` and `after` as
    tuple of three lists of items that have been added, removed and kept.
    """
    before_iter = iter(sorted(before))
    after_iter = iter(sorted(after))
    add_items = []
    rem_items = []
    keep_items = []
    try:
        new = old = None
        while True:
            if new == old:
                if new is not None:
                    keep_items.append(new)
                new = old = None
                new = next(after_iter)
                old = next(before_iter)
            elif new < old:
                add_items.append(new)
                new = None
                new = next(after_iter)
            elif old < new:
                rem_items.append(old)
                old = None
                old = next(before_iter)
    except StopIteration:
        if new is not None:
            add_items.append(new)
        if old is not None:
            rem_items.append(old)
        add_items += list(after_iter)
        rem_items += list(before_iter)
    return add_items, rem_items, keep_items

if __name__ == '__main__':
    assert diff([], []) == ([], [], [])
    assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
    assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])

Main concerns:

  • The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?

  • The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)


Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:

  • The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.

2 Answers


Graipher 05/15/2018.

First, let's look at the naive implementation using sets:

def diff_set(before, after):
    b, a = set(before), set(after)
    return list(a - b), list(b - a), list(a & b)

This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?

Here are timings of your three example cases:

             Input                     diff             diff_set         diff_counter     
 ------------------------------ ------------------- ------------------- ------------------- 
  [], []                         1.72 µs ± 29.9 ns   1.21 µs ± 5.34 ns   9.91 µs ± 91.4 ns  
  [1,2,3], [4,3,2]               3.16 µs ± 183 ns    1.57 µs ± 28.4 ns   12 µs ± 180 ns     
  ['foo','bar'], ['baz','bar']   2.77 µs ± 83.4 ns   1.56 µs ± 37.7 ns   11.5 µs ± 240 ns

The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.

import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]

Here also the set implementation wins:

enter image description here

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.


For reference, here the implementation of diff_counter I used:

from collections import Counter

def diff_counter(before, after):
    c = Counter(after)
    c.subtract(before)
    return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]

Mathias Ettinger 05/15/2018.

With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.

But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:

new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
    return [], [], []
elif new is None:
    return [], before, []
elif old is None:
    return after, [], []

try:
    while True:
        …

However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.

An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:

>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter({3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1})

As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.

Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):

>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]

But kept elements doesn't have a nice way to retrieve them:

>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]

Related questions

Hot questions

Language

Popular Tags