Determine added, removed and kept items between two lists

Arminius 05/15/2018. 2 answers, 686 views
python algorithm python-3.x iterator

My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.

def diff(before, after):
    """Return diff between iterables of sortable items `before` and `after` as
    tuple of three lists of items that have been added, removed and kept.
    """
    before_iter = iter(sorted(before))
    after_iter = iter(sorted(after))
    add_items = []
    rem_items = []
    keep_items = []
    try:
        new = old = None
        while True:
            if new == old:
                if new is not None:
                    keep_items.append(new)
                new = old = None
                new = next(after_iter)
                old = next(before_iter)
            elif new < old:
                add_items.append(new)
                new = None
                new = next(after_iter)
            elif old < new:
                rem_items.append(old)
                old = None
                old = next(before_iter)
    except StopIteration:
        if new is not None:
            add_items.append(new)
        if old is not None:
            rem_items.append(old)
        add_items += list(after_iter)
        rem_items += list(before_iter)
    return add_items, rem_items, keep_items

if __name__ == '__main__':
    assert diff([], []) == ([], [], [])
    assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
    assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])

Main concerns:

  • The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?

  • The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)


Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:

  • The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.

2 Answers


Graipher 05/15/2018.

First, let's look at the naive implementation using sets:

def diff_set(before, after):
    b, a = set(before), set(after)
    return list(a - b), list(b - a), list(a & b)

This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?

Here are timings of your three example cases:

             Input                     diff             diff_set         diff_counter     
 ------------------------------ ------------------- ------------------- ------------------- 
  [], []                         1.72 µs ± 29.9 ns   1.21 µs ± 5.34 ns   9.91 µs ± 91.4 ns  
  [1,2,3], [4,3,2]               3.16 µs ± 183 ns    1.57 µs ± 28.4 ns   12 µs ± 180 ns     
  ['foo','bar'], ['baz','bar']   2.77 µs ± 83.4 ns   1.56 µs ± 37.7 ns   11.5 µs ± 240 ns

The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.

import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]

Here also the set implementation wins:

enter image description here

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.


For reference, here the implementation of diff_counter I used:

from collections import Counter

def diff_counter(before, after):
    c = Counter(after)
    c.subtract(before)
    return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]

Mathias Ettinger 05/15/2018.

With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.

But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:

new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
    return [], [], []
elif new is None:
    return [], before, []
elif old is None:
    return after, [], []

try:
    while True:
        …

However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.

An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:

>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter({3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1})

As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.

Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):

>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]

But kept elements doesn't have a nice way to retrieve them:

>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]

HighResolutionMusic.com - Download Hi-Res Songs

1 Martin Garrix

Yottabyte flac

Martin Garrix. 2018. Writer: Martin Garrix.
2 Alan Walker

Diamond Heart flac

Alan Walker. 2018. Writer: Alan Walker;Sophia Somajo;Mood Melodies;James Njie;Thomas Troelsen;Kristoffer Haugan;Edvard Normann;Anders Froen;Gunnar Greve;Yann Bargain;Victor Verpillat;Fredrik Borch Olsen.
3 Sia

I'm Still Here flac

Sia. 2018. Writer: Sia.
4 Blinders

Breach (Walk Alone) flac

Blinders. 2018. Writer: Dewain Whitmore;Ilsey Juber;Blinders;Martin Garrix.
5 Dyro

Latency flac

Dyro. 2018. Writer: Martin Garrix;Dyro.
6 Cardi B

Taki Taki flac

Cardi B. 2018. Writer: Bava;Juan Vasquez;Vicente Saavedra;Jordan Thorpe;DJ Snake;Ozuna;Cardi B;Selena Gomez.
7 Bradley Cooper

Shallow flac

Bradley Cooper. 2018. Writer: Andrew Wyatt;Anthony Rossomando;Mark Ronson;Lady Gaga.
8 Halsey

Without Me flac

Halsey. 2018. Writer: Halsey;Delacey;Louis Bell;Amy Allen;Justin Timberlake;Timbaland;Scott Storch.
9 Lady Gaga

I'll Never Love Again flac

Lady Gaga. 2018. Writer: Benjamin Rice;Lady Gaga.
10 Kelsea Ballerini

This Feeling flac

Kelsea Ballerini. 2018. Writer: Andrew Taggart;Alex Pall;Emily Warren.
11 Mako

Rise flac

Mako. 2018. Writer: Riot Music Team;Mako;Justin Tranter.
12 Dewain Whitmore

Burn Out flac

Dewain Whitmore. 2018. Writer: Dewain Whitmore;Ilsey Juber;Emilio Behr;Martijn Garritsen.
13 Bradley Cooper

Always Remember Us This Way flac

Bradley Cooper. 2018. Writer: Lady Gaga;Dave Cobb.
14 Little Mix

Woman Like Me flac

Little Mix. 2018. Writer: Nicki Minaj;Steve Mac;Ed Sheeran;Jess Glynne.
15 Charli XCX

1999 flac

Charli XCX. 2018. Writer: Charli XCX;Troye Sivan;Leland;Oscar Holter;Noonie Bao.
16 Rita Ora

Let You Love Me flac

Rita Ora. 2018. Writer: Rita Ora.
17 Diplo

Electricity flac

Diplo. 2018. Writer: Diplo;Mark Ronson;Picard Brothers;Wynter Gordon;Romy Madley Croft;Florence Welch.
18 Jonas Blue

Polaroid flac

Jonas Blue. 2018. Writer: Jonas Blue;Liam Payne;Lennon Stella.
19 Lady Gaga

Look What I Found flac

Lady Gaga. 2018. Writer: DJ White Shadow;Nick Monson;Mark Nilan Jr;Lady Gaga.
20 Avril Lavigne

Head Above Water flac

Avril Lavigne. 2018. Writer: Stephan Moccio;Travis Clark;Avril Lavigne.

Related questions

Hot questions

Language

Popular Tags