More elaborate arrays#
More data types#
Casting#
“Bigger” type wins in mixed-type operations:
np.array([1, 2, 3]) + 1.5
array([2.5, 3.5, 4.5])
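To check in advance which type a mixed operation will produce, `np.result_type` can be queried. The following is a small sketch; the variable names are illustrative only.

```python
import numpy as np

int_array = np.array([1, 2, 3])
mixed = int_array + 1.5  # integer array combined with a Python float

# np.result_type reports the dtype NumPy picks when these types are combined
promoted = np.result_type(int_array.dtype, np.float64)
print(mixed.dtype, promoted)
```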
Assignment never changes the type!
a = np.array([1, 2, 3])
a.dtype
dtype('int64')
a[0] = 1.9 # <-- float is truncated to integer
a
array([1, 2, 3])
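Since assignment never changes the type, one way to keep the fractional part is to convert to a float array before assigning. A minimal sketch, where `a_float` is a name chosen for illustration:

```python
import numpy as np

a = np.array([1, 2, 3])
a_float = a.astype(float)  # new array; `a` itself keeps its integer dtype
a_float[0] = 1.9           # stored without truncation

print(a_float)
print(a.dtype.kind)        # 'i': the original is still an integer array
```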
Forced casts:
a = np.array([1.7, 1.2, 1.6])
b = a.astype(int) # <-- truncates to integer
b
array([1, 1, 1])
Rounding:
a = np.array([1.2, 1.5, 1.6, 2.5, 3.5, 4.5])
b = np.around(a)
b # still floating-point
array([1., 2., 2., 2., 4., 4.])
c = np.around(a).astype(int)
c
array([1, 2, 2, 2, 4, 4])
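The output above shows 2.5 rounding to 2 and 4.5 rounding to 4: `np.around` sends exact ties to the nearest even value. The sketch below contrasts this with a "ties always go up" rule built from `np.floor`, which is only an illustrative alternative and is shown here for inputs that are exactly representable.

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])

to_even = np.around(halves)       # ties go to the nearest even value
half_up = np.floor(halves + 0.5)  # illustrative alternative: ties go up

print(to_even)
print(half_up)
```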
Different data type sizes#
Integers (signed):
Class | Bits |
---|---|
int8 | 8 bits |
int16 | 16 bits |
int32 | 32 bits (same as int on 32-bit platforms) |
int64 | 64 bits (same as int on 64-bit platforms) |
np.array([1], dtype=int).dtype
dtype('int64')
np.iinfo(np.int32).max, 2**31 - 1
(2147483647, 2147483647)
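The limits reported by `np.iinfo` matter because fixed-size integer arrays wrap around when a result does not fit. A small sketch using `int8` to keep the numbers readable (variable names are illustrative):

```python
import numpy as np

info = np.iinfo(np.int8)
top = np.array([info.max], dtype=np.int8)  # largest int8 value, 127
one = np.array([1], dtype=np.int8)

wrapped = top + one  # the result stays int8, so it wraps around
print(info.min, info.max, wrapped)
```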
Unsigned integers:
Class | Bits |
---|---|
uint8 | 8 bits |
uint16 | 16 bits |
uint32 | 32 bits |
uint64 | 64 bits |
np.iinfo(np.uint32).max, 2**32 - 1
(4294967295, 4294967295)
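To see which of these sizes a given value needs, `np.min_scalar_type` returns the smallest dtype able to hold a scalar. A brief sketch, with example values chosen purely for illustration:

```python
import numpy as np

# Smallest dtype that can represent each scalar value
fits_8 = np.min_scalar_type(255)
fits_16 = np.min_scalar_type(256)
negative = np.min_scalar_type(-1)  # negative values need a signed type

print(fits_8, fits_16, negative)
```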
Floating-point numbers:
Data Type | Size (bits) |
---|---|
float16 | 16 bits |
float32 | 32 bits |
float64 | 64 bits (same as float) |
float96 | 96 bits, platform-dependent (same as np.longdouble) |
float128 | 128 bits, platform-dependent (same as np.longdouble) |
np.finfo(np.float32).eps
np.float32(1.1920929e-07)
np.finfo(np.float64).eps
np.float64(2.220446049250313e-16)
np.float32(1e-8) + np.float32(1) == 1
np.True_
np.float64(1e-8) + np.float64(1) == 1
np.False_
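The comparison above follows from the machine epsilon: `eps` is the gap between 1 and the next representable number, so an increment well below half of that gap is rounded away. A minimal sketch for `float32` (variable names are illustrative):

```python
import numpy as np

one = np.float32(1)
eps = np.finfo(np.float32).eps

above = one + eps                # next representable value after 1
lost = one + eps / np.float32(4)  # below half the gap: rounds back to 1

print(above != one, lost == one)
print(np.float32(1e-8) < eps)    # why 1e-8 disappears in float32
```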
Complex floating-point numbers:
Data Type | Size (bits) |
---|---|
complex64 | two 32-bit floats |
complex128 | two 64-bit floats |
complex192 | two 96-bit floats, platform-dependent |
complex256 | two 128-bit floats, platform-dependent |
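The "two floats" layout can be checked directly from an array's item size and from the dtype of its real part. A short sketch, with sample values chosen for illustration:

```python
import numpy as np

z = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)

print(z.itemsize)    # bytes per element: two 32-bit floats
print(z.real.dtype)  # each component is a 32-bit float
print(np.zeros(1, dtype=np.complex128).itemsize)
```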
Smaller data types
If you don’t know you need special data types, then you probably don’t.
Comparison of using float32 instead of float64:
Half the size in memory and on disk
Half the memory bandwidth required (may be a bit faster in some operations)
In [1]: a = np.zeros((int(1e6),), dtype=np.float64)
In [2]: b = np.zeros((int(1e6),), dtype=np.float32)
In [3]: %timeit a*a
1000 loops, best of 3: 1.78 ms per loop
In [4]: %timeit b*b
1000 loops, best of 3: 1.07 ms per loop
But: bigger rounding errors, sometimes in surprising places (so don’t use float32 unless you really need it)
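Both sides of this trade-off can be seen in a few lines. The sketch below compares memory use through `nbytes` and shows one rounding surprise: from 2**24 upward, `float32` can no longer represent every integer, so adding 1 has no effect. The variable names are illustrative.

```python
import numpy as np

a64 = np.zeros(1_000_000, dtype=np.float64)
a32 = np.zeros(1_000_000, dtype=np.float32)
print(a64.nbytes, a32.nbytes)  # float32 needs half the bytes

big = np.float32(2**24)
stuck = big + np.float32(1)    # 2**24 + 1 is not representable in float32
print(stuck == big)

fine = np.float64(2**24) + np.float64(1)  # float64 still resolves the step
print(fine == np.float64(2**24))
```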
Structured data types#
Field | Description |
---|---|
sensor_code | 4-character string |
position | float |
value | float |
samples = np.zeros((6,), dtype=[('sensor_code', 'S4'),
('position', float), ('value', float)])
samples.ndim
1
samples.shape
(6,)
samples.dtype.names
('sensor_code', 'position', 'value')
samples[:] = [('ALFA', 1, 0.37), ('BETA', 1, 0.11), ('TAU', 1, 0.13),
('ALFA', 1.5, 0.37), ('ALFA', 3, 0.11), ('TAU', 1.2, 0.13)]
samples
array([(b'ALFA', 1. , 0.37), (b'BETA', 1. , 0.11), (b'TAU', 1. , 0.13),
(b'ALFA', 1.5, 0.37), (b'ALFA', 3. , 0.11), (b'TAU', 1.2, 0.13)],
dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])
Field access works by indexing with field names:
samples['sensor_code']
array([b'ALFA', b'BETA', b'TAU', b'ALFA', b'ALFA', b'TAU'], dtype='|S4')
samples['value']
array([0.37, 0.11, 0.13, 0.37, 0.11, 0.13])
samples[0]
np.void((b'ALFA', 1.0, 0.37), dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])
samples[0]['sensor_code'] = 'TAU'
samples[0]
np.void((b'TAU', 1.0, 0.37), dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])
Multiple fields at once:
samples[['position', 'value']]
array([(1. , 0.37), (1. , 0.11), (1. , 0.13), (1.5, 0.37), (3. , 0.11),
(1.2, 0.13)],
dtype={'names': ['position', 'value'], 'formats': ['<f8', '<f8'], 'offsets': [4, 12], 'itemsize': 20})
Fancy indexing works, as usual:
samples[samples['sensor_code'] == b'ALFA']
array([(b'ALFA', 1.5, 0.37), (b'ALFA', 3. , 0.11)],
dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])
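Field access and boolean masks combine naturally for per-group summaries. The following sketch computes the mean `value` per sensor code; it rebuilds the array with the data as first assigned (before the first record was changed to `TAU`), and `mean_value` is a name introduced here for illustration.

```python
import numpy as np

samples = np.array(
    [(b'ALFA', 1.0, 0.37), (b'BETA', 1.0, 0.11), (b'TAU', 1.0, 0.13),
     (b'ALFA', 1.5, 0.37), (b'ALFA', 3.0, 0.11), (b'TAU', 1.2, 0.13)],
    dtype=[('sensor_code', 'S4'), ('position', float), ('value', float)])

mean_value = {}
for code in np.unique(samples['sensor_code']):
    # Boolean mask selects all records belonging to one sensor
    selected = samples[samples['sensor_code'] == code]
    mean_value[code.decode()] = selected['value'].mean()

print(mean_value)
```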
maskedarray: dealing with (propagation of) missing data#
For floats one could use NaNs, but masks work for all types:
x = np.ma.array([1, 2, 3, 4], mask=[0, 1, 0, 1])
x
masked_array(data=[1, --, 3, --],
mask=[False, True, False, True],
fill_value=999999)
y = np.ma.array([1, 2, 3, 4], mask=[0, 1, 1, 1])
x + y
masked_array(data=[2, --, --, --],
mask=[False, True, True, True],
fill_value=999999)
Masking versions of common functions:
np.ma.sqrt([1, -1, 2, -2])
masked_array(data=[1.0, --, 1.4142135623730951, --],
mask=[False, True, False, True],
fill_value=1e+20)
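Masked arrays also help when the missing data already arrives as NaN. The sketch below, with made-up sample values, masks invalid entries using `np.ma.masked_invalid` and then reduces over the valid ones only.

```python
import numpy as np

raw = np.array([1.0, np.nan, 3.0])
clean = np.ma.masked_invalid(raw)  # mask NaN and inf entries

print(clean.mean())       # masked entries are ignored
print(clean.count())      # number of unmasked entries
print(clean.filled(0.0))  # plain ndarray with masked entries replaced
```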
Note
There are other useful array siblings
While it is off topic in a chapter on NumPy, let’s take a moment to recall good coding practices, which really do pay off in the long run:
Good practices
Explicit variable names (no need of a comment to explain what is in the variable)
Style: spaces after commas, around =, etc.
A certain number of rules for writing “beautiful” code (and, more importantly, using the same conventions as everybody else!) are given in the Style Guide for Python Code and the Docstring Conventions page (to manage help strings).
Except in some rare cases, variable names and comments should be in English.