“Merging” numpy arrays together with a common dimension

Dang Khoa 11/18/2017. 4 answers, 314 views

I have two matrices, corresponding to data points (x,y1) and (x,y2):

 x  |  y1
----------
 0  |  0
 1  |  1
 2  |  2
 3  |  3
 4  |  4
 5  |  5

 x   |  y2
------------
 0.5 |  0.5
 1.5 |  1.5
 2.5 |  2.5
 3.5 |  3.5
 4.5 |  4.5
 5.5 |  5.5

I'd like to create a new matrix that combines the x values into a single column, and has NaNs in the appropriate y1, y2 columns:

 x   |  y1  |  y2
------------------
 0   |  0   |  NaN
 0.5 |  NaN |  0.5
 1   |  1   |  NaN
 1.5 |  NaN |  1.5
 ... |  ... |  ...
 5   |  5   |  NaN
 5.5 |  NaN |  5.5

Is there an easy way to do this? I'm new to Python and NumPy (coming from MATLAB) and I'm not sure how I would even begin with this. (For reference, my approach to this in MATLAB is simply using an outerjoin against two tables that are generated with array2table.)

cᴏʟᴅsᴘᴇᴇᴅ 11/18/2017
Do you have pandas?
Dang Khoa 11/18/2017
@cᴏʟᴅsᴘᴇᴇᴅ I can install it, more packages is no big deal.
hpaulj 11/18/2017
How would you do this with MATLAB? What kind of structure would you use?
Dang Khoa 11/18/2017
@hpaulj edited question to include my solution in MATLAB. I'd convert my two matrices to tables, then do an outerjoin.
hpaulj 11/18/2017
With pure numpy this is as awkward as using just matrix in MATLAB. I can approximate it with structured arrays (and recfunctions.join_by), which have some similarities to a MATLAB struct (see stackoverflow.com/questions/47277436/…). pandas is better for table-like operations.

If you can load your data into separate pandas dataframes, this becomes simple.

df

x  y1
0  0   0
1  1   1
2  2   2
3  3   3
4  4   4
5  5   5

df2

x   y2
0  0.5  0.5
1  1.5  1.5
2  2.5  2.5
3  3.5  3.5
4  4.5  4.5
5  5.5  5.5
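For reference, frames like these can be built directly from the arrays (a sketch; the column names are chosen to match the question's (x, y1) and (x, y2) data):

```python
import numpy as np
import pandas as pd

x = np.arange(6)
df = pd.DataFrame({'x': x, 'y1': x})                # (x, y1) pairs
df2 = pd.DataFrame({'x': x + 0.5, 'y2': x + 0.5})  # (x, y2) pairs
```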

Perform an outer merge, and sort on the x column.

df = df.merge(df2, how='outer').sort_values('x')
df

x   y1   y2
0     0    0  NaN
6   0.5  NaN  0.5
1     1    1  NaN
7   1.5  NaN  1.5
2     2    2  NaN
8   2.5  NaN  2.5
3     3    3  NaN
9   3.5  NaN  3.5
4     4    4  NaN
10  4.5  NaN  4.5
5     5    5  NaN
11  5.5  NaN  5.5

If you want an array, call .values on the result:

df.values

array([[0.0, 0.0, nan],
[0.5, nan, 0.5],
[1.0, 1.0, nan],
[1.5, nan, 1.5],
[2.0, 2.0, nan],
[2.5, nan, 2.5],
[3.0, 3.0, nan],
[3.5, nan, 3.5],
[4.0, 4.0, nan],
[4.5, nan, 4.5],
[5.0, 5.0, nan],
[5.5, nan, 5.5]], dtype=object)
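Note the object dtype above; depending on how the frames were loaded, .values may not come back as a homogeneous float array. If that matters, one option is an explicit cast (a sketch built on the same merge):

```python
import numpy as np
import pandas as pd

x = np.arange(6)
df = pd.DataFrame({'x': x, 'y1': x})
df2 = pd.DataFrame({'x': x + 0.5, 'y2': x + 0.5})
merged = df.merge(df2, how='outer').sort_values('x')

# Cast to a plain float64 ndarray; NaNs are preserved by the cast.
arr = merged.values.astype(float)
```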
Eric Duminil 11/18/2017
Nice. Using pandas makes sense here. You basically need a mix of numpy arrays and Python dicts.
cᴏʟᴅsᴘᴇᴇᴅ 11/18/2017
@EricDuminil Thank you. It would seem the most painless option to me. However, I saw your answer which seemed pretty impressive (I couldn't have thought of a numpy solution as you did) and passed you an upvote :)

Eric Duminil 11/18/2017.

Here's an attempt with plain numpy. It creates a matrix with 3 columns and as many rows as a1 and a2 combined, writes a1 and a2 into the appropriate columns, and sorts the rows by their first value.

Note that it only works if the x values are disjoint:

import numpy as np
x = np.arange(6)
# array([0, 1, 2, 3, 4, 5])
a1 = np.vstack((x,x)).T
# array([[0, 0],
#        [1, 1],
#        [2, 2],
#        [3, 3],
#        [4, 4],
#        [5, 5]])
a2 = a1 + 0.5
# array([[ 0.5,  0.5],
#        [ 1.5,  1.5],
#        [ 2.5,  2.5],
#        [ 3.5,  3.5],
#        [ 4.5,  4.5],
#        [ 5.5,  5.5]])
m = np.empty((12, 3))
m[:] = np.nan
# array([[ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan]])
m[:6, :2] = a1
# array([[  0.,   0.,  nan],
#        [  1.,   1.,  nan],
#        [  2.,   2.,  nan],
#        [  3.,   3.,  nan],
#        [  4.,   4.,  nan],
#        [  5.,   5.,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan],
#        [ nan,  nan,  nan]])
m[6:, ::2] = a2
# array([[ 0. ,  0. ,  nan],
#        [ 1. ,  1. ,  nan],
#        [ 2. ,  2. ,  nan],
#        [ 3. ,  3. ,  nan],
#        [ 4. ,  4. ,  nan],
#        [ 5. ,  5. ,  nan],
#        [ 0.5,  nan,  0.5],
#        [ 1.5,  nan,  1.5],
#        [ 2.5,  nan,  2.5],
#        [ 3.5,  nan,  3.5],
#        [ 4.5,  nan,  4.5],
#        [ 5.5,  nan,  5.5]])
m[m[:,0].argsort()]
# array([[ 0. ,  0. ,  nan],
#        [ 0.5,  nan,  0.5],
#        [ 1. ,  1. ,  nan],
#        [ 1.5,  nan,  1.5],
#        [ 2. ,  2. ,  nan],
#        [ 2.5,  nan,  2.5],
#        [ 3. ,  3. ,  nan],
#        [ 3.5,  nan,  3.5],
#        [ 4. ,  4. ,  nan],
#        [ 4.5,  nan,  4.5],
#        [ 5. ,  5. ,  nan],
#        [ 5.5,  nan,  5.5]])
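The steps above generalize to a small helper (a sketch; merge_sparse is a hypothetical name, not part of numpy):

```python
import numpy as np

def merge_sparse(a1, a2):
    """Merge two (n, 2) arrays of (x, y) pairs into an (n1+n2, 3) array,
    NaN-padded and sorted by x. Assumes the x values of a1 and a2
    are disjoint."""
    m = np.full((len(a1) + len(a2), 3), np.nan)
    m[:len(a1), :2] = a1    # x and y1 columns
    m[len(a1):, ::2] = a2   # x and y2 columns
    return m[m[:, 0].argsort()]

x = np.arange(6)
a1 = np.vstack((x, x)).T
a2 = a1 + 0.5
m = merge_sparse(a1, a2)
```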

Using pandas is the correct method here.

hpaulj 11/18/2017.

A structured array approach (incomplete):

Import the recfunctions helper library:

In [441]: import numpy.lib.recfunctions as rf

Define two structured arrays

In [442]: A = np.zeros((6,),[('x',int),('y',int)])

Oops, the 'x' keys in B are float, so for consistency let's make the A ones float as well. Don't mix floats and ints unnecessarily.

In [446]: A = np.zeros((6,),[('x',float),('y',int)])
In [447]: A['x']=np.arange(6)
In [448]: A['y']=np.arange(6)
In [449]: A
Out[449]:
array([( 0., 0), ( 1., 1), ( 2., 2), ( 3., 3), ( 4., 4), ( 5., 5)],
dtype=[('x', '<f8'), ('y', '<i4')])

In [450]: B = np.zeros((6,),[('x',float),('z',float)])
In [451]: B['x']=np.linspace(.5,5.5,6)
In [452]: B['z']=np.linspace(.5,5.5,6)
In [453]: B
Out[453]:
array([( 0.5,  0.5), ( 1.5,  1.5), ( 2.5,  2.5), ( 3.5,  3.5),
( 4.5,  4.5), ( 5.5,  5.5)],
dtype=[('x', '<f8'), ('z', '<f8')])

Look at the docs of the rf.join_by function:

In [454]: rf.join_by?

Do an outer join:

In [457]: rf.join_by('x',A,B,'outer')
Out[457]:
masked_array(data = [(0.0, 0, --) (0.5, --, 0.5) (1.0, 1, --) (1.5, --, 1.5) (2.0, 2, --)
(2.5, --, 2.5) (3.0, 3, --) (3.5, --, 3.5) (4.0, 4, --) (4.5, --, 4.5)
(5.0, 5, --) (5.5, --, 5.5)],
mask = [(False, False,  True) (False,  True, False) (False, False,  True)
(False,  True, False) (False, False,  True) (False,  True, False)
(False, False,  True) (False,  True, False) (False, False,  True)
(False,  True, False) (False, False,  True) (False,  True, False)],
fill_value = (  1.00000000e+20, 999999,   1.00000000e+20),
dtype = [('x', '<f8'), ('y', '<i4'), ('z', '<f8')])

Same thing, but with masking turned off:

In [460]: rf.join_by('x',A,B,'outer',usemask=False)
Out[460]:
array([( 0. ,      0,   1.00000000e+20), ( 0.5, 999999,   5.00000000e-01),
( 1. ,      1,   1.00000000e+20), ( 1.5, 999999,   1.50000000e+00),
( 2. ,      2,   1.00000000e+20), ( 2.5, 999999,   2.50000000e+00),
( 3. ,      3,   1.00000000e+20), ( 3.5, 999999,   3.50000000e+00),
( 4. ,      4,   1.00000000e+20), ( 4.5, 999999,   4.50000000e+00),
( 5. ,      5,   1.00000000e+20), ( 5.5, 999999,   5.50000000e+00)],
dtype=[('x', '<f8'), ('y', '<i4'), ('z', '<f8')])

Now we see the fill values explicitly. There must be a way of replacing the 1e20 with np.nan. Replacing 999999 with np.nan is messier, since np.nan is a float value, not an integer.
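One hedged way to get np.nan fills is to fill each field of the masked result individually; this sketch sidesteps the integer problem by declaring y as float up front:

```python
import numpy as np
import numpy.lib.recfunctions as rf

# Declare y as float so it can hold NaN after filling.
A = np.zeros((6,), [('x', float), ('y', float)])
A['x'] = np.arange(6)
A['y'] = np.arange(6)
B = np.zeros((6,), [('x', float), ('z', float)])
B['x'] = np.linspace(.5, 5.5, 6)
B['z'] = np.linspace(.5, 5.5, 6)

res = rf.join_by('x', A, B, 'outer')   # masked structured array, sorted by x

# Fill the masked slots of each field with NaN, then stack into a
# plain 2-D float array.
table = np.column_stack([res['x'].filled(np.nan),
                         res['y'].filled(np.nan),
                         res['z'].filled(np.nan)])
```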

Under the covers, join_by is probably first creating a blank array with the join dtype, then filling in the fields one by one.

kmcodes 11/18/2017.

Considering you may not need pandas for anything else, here is a standard-library solution.

I would break it down into two lists of lists (assuming the order of elements is important). So

xy1 = [[0,0],[1,1],......]
xy2 = [[0.5,0.5],[1.5,1.5],.......]

then merge these lists into a list x, adding NaN in either the x[i][1] or x[i][2] position wherever that column has no value. Each x[i][0] is then the key for a dictionary entry whose value is the list of the two remaining elements:

finalx = {item[0]: item[1:] for item in x}

finalx = {0: [0, NaN], 0.5: [NaN, 0.5], ......}

Hope this helps; this is more of a direction than a solution.
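That direction can be made concrete with a short standard-library sketch (variable names follow the answer; float('nan') stands in for NaN):

```python
import math

nan = float('nan')

xy1 = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
xy2 = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5], [3.5, 3.5], [4.5, 4.5], [5.5, 5.5]]

# Pad each pair to three columns, with NaN in the column it doesn't fill,
# then sort the combined list by its x values.
x = [[a, b, nan] for a, b in xy1] + [[a, nan, b] for a, b in xy2]
x.sort(key=lambda row: row[0])

finalx = {row[0]: row[1:] for row in x}
```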