Array objects

The core object in NumPy is the ndarray, a multidimensional array that can be viewed as a table of elements of the same type (usually numbers).

The example from the official tutorial shows its basic attributes:

>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]])
>>> a.shape # shape of the array (size along each dimension)
(3, 5)
>>> a.ndim  # number of dimensions
2
>>> a.dtype.name # type of the elements
'int64'
>>> a.itemsize   # size of each element in bytes
8
>>> a.size  # total number of elements
15
>>> type(a)
<type 'numpy.ndarray'>
>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])
>>> type(b)
<type 'numpy.ndarray'>
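
The attributes above are related to one another. As a minimal sketch (the variable names total and expected_bytes are illustrative only), size is the product of the entries of shape, and nbytes, the memory taken by the elements, is size times itemsize:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)

# size is the product of the extents listed in shape
total = int(np.prod(a.shape))

# nbytes is the memory taken by the elements: size * itemsize
expected_bytes = a.size * a.itemsize

print(a.ndim == len(a.shape))      # ndim is the length of the shape tuple
print(total == a.size)
print(expected_bytes == a.nbytes)
```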

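Creating arrays

NumPy offers many ways to create arrays; the most common ones are listed below.

np.array()

The array function creates an array from a Python list or tuple.

One-dimensional array: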

>>> a = np.array([0, 1, 2, 3])
>>> a
array([0, 1, 2, 3])
>>> a.ndim
1
>>> a.shape
(4,)
>>> len(a)
4
>>> a.dtype
dtype('int64')

Multidimensional arrays:

>>> b = np.array([[0, 1, 2], [3, 4, 5]]) # a 2 x 3 array
>>> b
array([[0, 1, 2],
       [3, 4, 5]])
>>> b.ndim
2
>>> b.shape
(2, 3)
>>> len(b) # size of the first dimension
2

>>> c = np.array([[[1], [2]], [[3], [4]]])
>>> c
array([[[1],
        [2]],

       [[3],
        [4]]])
>>> c.shape
(2, 2, 1)

Take care to avoid the following mistake:

>>> a = np.array(1,2,3,4)    # WRONG
>>> a = np.array([1,2,3,4])  # RIGHT

The element type can be specified explicitly at creation time:

>>> d = np.array([ [1,2], [3,4] ], dtype=complex)
>>> d
array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

np.zeros(), np.ones(), np.empty()

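The zeros function creates an array filled with 0, ones creates an array filled with 1, and empty creates an array whose elements are uninitialized, so their values depend on the state of memory.

>>> np.zeros( (3,4) ) # a 3 x 4 array; the default type is float64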
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

>>> np.ones( (2,3,4), dtype=np.int16 ) # dtype can be specified
array([[[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]],
       [[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]]], dtype=int16)

>>> np.empty( (2,3) ) # uninitialized: values are arbitrary and depend on the state of memory
array([[  3.73603959e-262,   6.02658058e-154,   6.55490914e-260],
       [  5.30498948e-313,   3.14673309e-307,   1.00000000e+000]])

Likewise, take care to avoid the following mistake:

>>> np.zeros(3,4)      # WRONG
>>> np.zeros( (3,4) )  # RIGHT
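
The reason the first form fails is that the second positional argument of zeros is the dtype, not another dimension. A small sketch, assuming only that an invalid dtype raises TypeError (the flag name failed is illustrative):

```python
import numpy as np

try:
    np.zeros(3, 4)          # 4 is taken as the dtype argument, not a dimension
    failed = False
except TypeError:
    failed = True

z = np.zeros((3, 4))        # the shape is passed as a single tuple

print(failed)
print(z.shape)
```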

np.eye(), np.diag()

The eye function creates an identity matrix, and the diag function creates a diagonal matrix:

>>> a = np.eye(3)
>>> a
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

>>> c = np.diag(np.array([1, 2, 3, 4]))
>>> c
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

np.arange()

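The arange function creates a sequence of numbers from a start value, an end value, and a step; the end value is not included in the sequence: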

>>> a = np.arange(10) # 0 ... n-1
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> b = np.arange(1, 9, 2) # start, end (exclusive), step
>>> b
array([1, 3, 5, 7])

>>> c = np.arange(0, 2, 0.3) # the step may be a floating-point number
>>> c
array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8])

np.linspace()

The linspace function generates a linear sequence with a specified number of elements:

>>> a = np.linspace(0, 1, 6) # start, end, number of elements
>>> a
array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])
>>> b = np.linspace(0, 1, 5, endpoint=False) # the end value is excluded
>>> b
array([ 0. ,  0.2,  0.4,  0.6,  0.8])

>>> from numpy import pi
>>> x = np.linspace( 0, 2*pi, 100 ) 
>>> f = np.sin(x)
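
The pattern above, building a grid with linspace and evaluating a function on it, can be checked with a short sketch. The retstep=True argument is an addition here; it makes linspace also return the spacing between samples:

```python
import numpy as np

# 100 evenly spaced points on [0, 2*pi], both endpoints included
x, step = np.linspace(0, 2 * np.pi, 100, retstep=True)
f = np.sin(x)                     # sin is applied element by element

print(x.shape, f.shape)           # both arrays have 100 elements
print(np.isclose(step, 2 * np.pi / 99))   # 100 points span 99 intervals
```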

numpy.random.rand()

random.rand() generates an array of pseudorandom numbers with the given dimensions, uniformly distributed over the interval [0, 1).

>>> np.random.rand(3,2)
array([[ 0.14022471,  0.96360618],  # random
       [ 0.37601032,  0.25528411],  # random
       [ 0.49313049,  0.94909878]]) # random

The seed can be set with the random.seed function:

>>> np.random.seed(1234)
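
Seeding matters because it makes the pseudorandom sequence reproducible. A minimal sketch (the names first and second are illustrative): resetting the same seed yields the same draws.

```python
import numpy as np

np.random.seed(1234)
first = np.random.rand(3)

np.random.seed(1234)      # reset the generator to the same state
second = np.random.rand(3)

print(np.array_equal(first, second))   # the two draws are identical
```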

numpy.random.randn()

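Each element of the array returned by random.randn() is a sample drawn from the standard normal (Gaussian) distribution.

To sample from a general normal distribution

$N(\mu, \sigma^2)$

use an expression of the form:

sigma * np.random.randn(...) + mu

For example: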

>>> np.random.randn()
2.1923875335537315 # a random sample from the standard normal distribution

# a 2 x 4 array with elements sampled from the normal distribution N(3, 6.25)
>>> 2.5 * np.random.randn(2, 4) + 3
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],  #random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]]) #random
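
One way to convince yourself that the scaling works is to draw a large sample and compare its mean and standard deviation with mu and sigma. This is only a sketch; the seed, the sample size, and the name draws are arbitrary choices made here:

```python
import numpy as np

np.random.seed(0)                        # arbitrary seed, for reproducibility
mu, sigma = 3.0, 2.5                     # N(3, 6.25), since 2.5**2 == 6.25

draws = sigma * np.random.randn(100000) + mu

print(draws.mean())   # close to mu
print(draws.std())    # close to sigma
```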

Data types

NumPy infers the data type from the input:

>>> a = np.array([1, 2, 3])
>>> a.dtype
dtype('int64')

>>> b = np.array([1., 2., 3.])
>>> b.dtype
dtype('float64')

The desired type can be specified explicitly:

>>> c = np.array([1, 2, 3], dtype=float)
>>> c.dtype
dtype('float64')

The default data type is floating point (float64):

>>> a = np.ones((3, 3))
>>> a.dtype
dtype('float64')

Other types are supported as well, for example:

>>> d = np.array([1+2j, 3+4j, 5+6*1j])
>>> d.dtype
dtype('complex128')

>>> e = np.array([True, False, False, True])
>>> e.dtype
dtype('bool')

>>> f = np.array(['Bonjour', 'Hello', 'Hallo',])
>>> f.dtype     # strings of at most 7 characters
dtype('S7')

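Sizes of the different data types

Signed integers:

Type	Size
int8	8 bits
int16	16 bits
int32	32 bits (same as int on 32-bit platforms)
int64	64 bits (same as int on 64-bit platforms)

Unsigned integers:

Type	Size
uint8	8 bits
uint16	16 bits
uint32	32 bits
uint64	64 bits

Floating point:

Type	Size
float16	16 bits
float32	32 bits
float64	64 bits (same as float)
float96	96 bits (same as np.longdouble, platform dependent)
float128	128 bits (same as np.longdouble, platform dependent)

Complex (floating point):

Type	Size
complex64	two 32-bit floats
complex128	two 64-bit floats
complex192	two 96-bit floats, platform dependent
complex256	two 128-bit floats, platform dependent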

>>> np.array([1], dtype=int).dtype
dtype('int64')
>>> np.iinfo(np.int32).max, 2**31 - 1
(2147483647, 2147483647)

>>> np.iinfo(np.uint32).max, 2**32 - 1
(4294967295, 4294967295)

>>> np.finfo(np.float32).eps
1.1920929e-07
>>> np.finfo(np.float64).eps
2.2204460492503131e-16

>>> np.float32(1e-8) + np.float32(1) == 1
True
>>> np.float64(1e-8) + np.float64(1) == 1
False
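
The fixed sizes listed above have a practical consequence: integer arrays silently wrap around when a result does not fit. The following is a sketch only, with illustrative names; it adds two int32 arrays so that the result stays int32, then repeats the addition after widening to int64:

```python
import numpy as np

info = np.iinfo(np.int32)
big = np.array([info.max], dtype=np.int32)
one = np.ones(1, dtype=np.int32)

wrapped = big + one                    # int32 + int32 stays int32 and wraps
widened = big.astype(np.int64) + one   # int64 has room for the true result

print(wrapped.dtype, wrapped[0] == info.min)
print(widened.dtype, widened[0] == 2**31)
```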

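Type conversion

When an operation mixes different data types, the result takes the "larger" of the types involved, that is, the one able to represent both operands.

>>> np.array([1, 2, 3]) + 1.5
array([ 2.5,  3.5,  4.5])

Note in particular that assignment never changes the data type of an array: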

>>> a = np.array([1, 2, 3])
>>> a.dtype
dtype('int64')
>>> a[0] = 1.9     # a float assigned into an integer array is truncated to an integer
>>> a
array([1, 2, 3])
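
The two behaviours can be put side by side. In this sketch (names are illustrative), in-place assignment keeps the integer dtype and drops the fractional part, while an arithmetic expression produces a new array with a floating-point dtype:

```python
import numpy as np

a = np.array([1, 2, 3])
a[0] = 1.9                 # stored as 1: the array keeps its integer dtype
b = a + 1.5                # a new array, upcast to floating point

print(a, a.dtype.kind)     # kind 'i' means signed integer
print(b, b.dtype)
```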

Forced casts:

>>> a = np.array([1.7, 1.2, 1.6])
>>> b = a.astype(int) # cast to integer; the fractional part is truncated
>>> b
array([1, 1, 1])

Rounding:

>>> a = np.array([1.2, 1.5, 1.6, 2.5, 3.5, 4.5])
>>> b = np.around(a) # round to the nearest integer; ties go to the even value
>>> b                # still floating point
array([ 1.,  2.,  2.,  2.,  4.,  4.])

>>> c = np.around(a).astype(int)
>>> c
array([1, 2, 2, 2, 4, 4])
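
The difference between casting and rounding is easiest to see with negative values and exact halves. A small sketch comparing astype(int), np.around, and np.floor on the same input (np.floor is not used in the text above and is included only for contrast):

```python
import numpy as np

x = np.array([-1.7, -0.5, 0.5, 1.5, 2.5])

truncated = x.astype(int)   # drops the fractional part (toward zero)
rounded = np.around(x)      # nearest integer, ties go to the even value
floored = np.floor(x)       # largest integer not greater than x

print(truncated)
print(rounded)
print(floored)
```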

Structured data types

Field name	Field type
sensor_code	4-character string
position	float
value	float

>>> samples = np.zeros((6,), dtype=[('sensor_code', 'S4'),
...                                 ('position', float), ('value', float)])
>>> samples.ndim
1
>>> samples.shape
(6,)
>>> samples.dtype.names
('sensor_code', 'position', 'value')

>>> samples[:] = [('ALFA',   1, 0.37), ('BETA', 1, 0.11), ('TAU', 1,   0.13),
...               ('ALFA', 1.5, 0.37), ('ALFA', 3, 0.11), ('TAU', 1.2, 0.13)]
>>> samples     
array([('ALFA', 1.0, 0.37), ('BETA', 1.0, 0.11), ('TAU', 1.0, 0.13),
       ('ALFA', 1.5, 0.37), ('ALFA', 3.0, 0.11), ('TAU', 1.2, 0.13)],
      dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])

Field values can be accessed by field name:

>>> samples['sensor_code']    
array(['ALFA', 'BETA', 'TAU', 'ALFA', 'ALFA', 'TAU'],
      dtype='|S4')
>>> samples['value']
array([ 0.37,  0.11,  0.13,  0.37,  0.11,  0.13])
>>> samples[0]    
('ALFA', 1.0, 0.37)

>>> samples[0]['sensor_code'] = 'TAU'
>>> samples[0]    
('TAU', 1.0, 0.37)

Several fields can be accessed at once:

>>> samples[['position', 'value']]
array([(1.0, 0.37), (1.0, 0.11), (1.0, 0.13), (1.5, 0.37), (3.0, 0.11),
       (1.2, 0.13)],
      dtype=[('position', '<f8'), ('value', '<f8')])

More advanced indexing is supported as well:

>>> samples[samples['sensor_code'] == 'ALFA']    
array([('ALFA', 1.5, 0.37), ('ALFA', 3.0, 0.11)],
      dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])
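
Field access and boolean masks combine naturally, for instance to aggregate one field over the records that match a condition on another. This is a sketch with made-up records; it uses the 'U4' (unicode) string type instead of 'S4', an assumption made here so that comparison against a Python 3 str works:

```python
import numpy as np

# 'U4' is an assumption of this sketch; the text above uses 'S4'
dt = [('sensor_code', 'U4'), ('position', float), ('value', float)]
records = np.array([('ALFA', 1.0, 0.37), ('BETA', 1.0, 0.11),
                    ('TAU', 1.2, 0.13), ('ALFA', 3.0, 0.11)], dtype=dt)

# boolean mask built from one field, applied to the whole record array
alfa = records[records['sensor_code'] == 'ALFA']
mean_value = alfa['value'].mean()

print(len(alfa))
print(mean_value)
```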

Missing values

For floating-point data, NaN can stand in for missing values; masked arrays (np.ma) handle missing values for any data type:

>>> x = np.ma.array([1, 2, 3, 4], mask=[0, 1, 0, 1])
>>> x
masked_array(data = [1 -- 3 --],
             mask = [False  True False  True],
       fill_value = 999999)


>>> y = np.ma.array([1, 2, 3, 4], mask=[0, 1, 1, 1])
>>> x + y
masked_array(data = [2 -- -- --],
             mask = [False  True  True  True],
       fill_value = 999999)

>>> np.ma.sqrt([1, -1, 2, -2]) 
masked_array(data = [1.0 -- 1.4142135623730951 --],
             mask = [False  True False  True],
       fill_value = 1e+20)
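
Reductions on a masked array skip the masked entries. A minimal sketch, using the count, mean, and filled methods of masked arrays (the replacement value 0 is an arbitrary choice):

```python
import numpy as np

x = np.ma.array([1, 2, 3, 4], mask=[0, 1, 0, 1])

n_valid = x.count()        # number of unmasked entries
mean_valid = x.mean()      # average of the unmasked entries 1 and 3
filled = x.filled(0)       # plain ndarray with masked slots replaced by 0

print(n_valid, mean_valid)
print(filled)
```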

Indexing and slicing

Indexing

Array elements are accessed much like the items of a Python sequence:

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[0], a[2], a[-1]
(0, 2, 9)

>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

For a multidimensional array, each dimension can take its own index, with the indices separated by commas:

>>> a = np.diag(np.arange(3))
>>> a
array([[0, 0, 0],
       [0, 1, 0],
       [0, 0, 2]])
>>> a[1, 1]
1
>>> a[1]
array([0, 1, 0])

>>> a[2, 1] = 10 # third row, second column
>>> a
array([[ 0,  0,  0],
       [ 0,  1,  0],
       [ 0, 10,  2]])

Slicing

Like Python sequences, NumPy arrays support slicing:

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> a[2:9:3] # [start:end:step]
array([2, 5, 8])

>>> a[:4]    # the end position is not included
array([0, 1, 2, 3])

>>> a[1:3]
array([1, 2])

>>> a[::2]
array([0, 2, 4, 6, 8])

>>> a[3:]
array([3, 4, 5, 6, 7, 8, 9])

Combining indexing and slicing:

>>> a = np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]
>>> a
array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

>>> a[0, 3:5]
array([3, 4])

>>> a[4:, 4:]
array([[44, 45],
       [54, 55]])
    
>>> a[:, 2]
array([ 2, 12, 22, 32, 42, 52])

>>> a[2::2, ::2]
array([[20, 22, 24],
       [40, 42, 44]])

>>> np.diag(np.arange(1, 7, dtype=float))[:, 1:]
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 2.,  0.,  0.,  0.,  0.],
       [ 0.,  3.,  0.,  0.,  0.],
       [ 0.,  0.,  4.,  0.,  0.],
       [ 0.,  0.,  0.,  5.,  0.],
       [ 0.,  0.,  0.,  0.,  6.]])

Combining assignment and slicing:

>>> a = np.arange(10)
>>> a[5:] = 10
>>> a
array([ 0,  1,  2,  3,  4, 10, 10, 10, 10, 10])
>>> b = np.arange(5)
>>> a[5:] = b[::-1]
>>> a
array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])

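Copies and views

A slice is a view on the original array: it accesses the original array's data, and no copy of that data is made in memory. np.may_share_memory() can be used to check whether two arrays may share the same block of memory.

Editing a view therefore modifies the data of the original array as well: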

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = a[::2]
>>> b
array([0, 2, 4, 6, 8])
>>> np.may_share_memory(a, b)
True
>>> b[0] = 12
>>> b
array([12,  2,  4,  6,  8])
>>> a   # (!)
array([12,  1,  2,  3,  4,  5,  6,  7,  8,  9])

The copy method creates an independent copy of the original array's data:

>>> a = np.arange(10)
>>> c = a[::2].copy()  # force a copy
>>> np.may_share_memory(a, c)
False

>>> c[0] = 12
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
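
Besides np.may_share_memory(), the base attribute is another way to tell views from copies: for a view it refers to the array that owns the data, and for an array that owns its data it is None. A short sketch (np.shares_memory, the exact counterpart of may_share_memory, is an addition here):

```python
import numpy as np

a = np.arange(10)
view = a[::2]             # slicing gives a view
copied = a[::2].copy()    # copy() gives an independent array

print(view.base is a)               # the view refers back to a
print(copied.base is None)          # the copy owns its own data
print(np.shares_memory(a, view), np.shares_memory(a, copied))
```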

Fancy indexing

NumPy arrays can be indexed with slices, and also with boolean or integer arrays (masks). This is called fancy indexing. Unlike slicing, it creates a copy rather than a view.
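
The copy-versus-view distinction can be observed directly. In this sketch (names are illustrative), writing into the result of an integer-array index leaves the source untouched, while writing into a slice of the same elements changes it:

```python
import numpy as np

a = np.arange(10)

picked = a[[1, 3, 5]]      # integer-array index: the result holds new data
picked[0] = -1
unchanged = a[1]           # a[1] is still 1

sliced = a[1:6:2]          # slice over the same positions: a view on a
sliced[0] = -1
changed = a[1]             # a[1] is now -1

print(unchanged, changed)
```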

Using boolean arrays

>>> np.random.seed(3)
>>> a = np.random.randint(0, 21, 15)
>>> a
array([10,  3,  8,  0, 19, 10, 11,  9, 10,  6,  0, 20, 12,  7, 14])
>>> (a % 3 == 0) # build a mask array
array([False,  True, False,  True, False, False, False,  True, False,
        True,  True, False,  True, False, False], dtype=bool)

>>> mask = (a % 3 == 0)
>>> extract_from_a = a[mask] # equivalent to a[a%3==0]
>>> extract_from_a           # extract the elements where the mask is True
array([ 3,  0,  9,  6,  0, 12])

Boolean masks can be combined with assignment:

>>> a[a % 3 == 0] = -1
>>> a
array([10, -1,  8, -1, 19, 10, 11, -1, 10, -1, -1, 20, -1,  7, 14])

Using integer arrays

>>> a = np.arange(0, 100, 10)
>>> a
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

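An integer array can be used as an index; each of its elements is the index of one element of the indexed array. The same index may appear several times:

>>> a[[2, 3, 2, 4, 2]]  # note: [2, 3, 2, 4, 2] is a Python list
array([20, 30, 20, 40, 20])

The resulting array has the same shape as the index array:

>>> a = np.arange(0, 100, 10)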
>>> idx = np.array([[3, 4], [9, 7]])
>>> idx.shape
(2, 2)
>>> a[idx]
array([[30, 40],
       [90, 70]])

Here are some slightly more complex examples:

>>> a = np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]
>>> a
array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

>>> a[(0, 1, 2, 3, 4), (1, 2, 3, 4, 5)]
array([ 1, 12, 23, 34, 45])

>>> a[3:, [0, 2, 5]]
array([[30, 32, 35],
       [40, 42, 45],
       [50, 52, 55]])
       
>>> mask = np.array([1, 0, 1, 0, 0, 1], dtype=bool)
>>> a[mask, 2]
array([ 2, 22, 52])
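
When two integer sequences index a 2-D array together, they are paired element by element rather than combined as a grid. A sketch that checks this against an explicit loop (the names rows, cols, paired, and looped are illustrative):

```python
import numpy as np

a = np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]

rows = np.array([0, 1, 2, 3, 4])
cols = np.array([1, 2, 3, 4, 5])

paired = a[rows, cols]     # picks a[0, 1], a[1, 2], a[2, 3], ...
looped = np.array([a[r, c] for r, c in zip(rows, cols)])

print(paired)
print(np.array_equal(paired, looped))
```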

Integer-array indexing can also be used for assignment:

>>> a = np.arange(0, 100, 10)
>>> a[[9, 7]] = -100
>>> a
array([   0,   10,   20,   30,   40,   50,   60, -100,   80, -100])

References

https://docs.scipy.org/doc/numpy/reference/routines.array-creation.html

http://www.scipy-lectures.org/intro/numpy/index.html

https://docs.scipy.org/doc/

Numpy 简明教程 - 多维数组 (A Concise NumPy Tutorial: Multidimensional Arrays)