pandas 数据处理-爱代码爱编程
使用 “group by” 方式我们通常会有以下一个或几个步骤:
-
Splitting:根据某一准则对数据分组
-
Applying :对每一分组数据运用某个方法
-
Combining :将结果组合为数据结构
在上述步骤中,split 方法较直接,在 split 之后我们希望对分组数据做相关计算,在 apply 步骤中我们可能想对数据进行如下操作: -
Aggregation::聚合操作,对分组数据做汇总统计,如计算sums 或 means、统计分组个数 counts
-
Transformation:对分组数据做特定操作,如:分组数据进行标准化、按照分组数据数据选择值填充控制
-
Filtration:过滤数据。对组计算 True 或 False,按组进弃数据。如:丢弃某些 counts 较小的组,根据组sums 或 means 过滤数据
1 Split
pandas objects 可以基于任何轴进行分割,group by 会创建一个 GroupBy object 对象
import numpy as np
import pandas as pd
df = pd.DataFrame(
[
("bird", "Falconiformes", 389.0),
("bird", "Psittaciformes", 24.0),
("mammal", "Carnivora", 80.2),
("mammal", "Primates", np.nan),
("mammal", "Carnivora", 58),
],
index=["falcon", "parrot", "lion", "monkey", "leopard"],
columns=("class", "order", "max_speed"),)
grouped = df.groupby("class")
grouped = df.groupby("order", axis="columns")
grouped = df.groupby(["class", "order"])
grouped
Out[2]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023F22DEA9C8>
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
# 设置 ["A", "B"] 为行索引
df2 = df.set_index(["A", "B"])
Out[2]:
C D
A B
foo one 0.194055 -0.087457
bar one -1.542546 -1.442626
foo two 0.867688 -0.540060
bar three 1.622831 0.331491
foo two -0.364909 0.639529
bar two 0.771066 -0.675301
foo one 1.071776 0.884663
three 1.367875 1.474144
1.1 GroupBy sorting
groupby 之后数据会按照 key 值进行排序,可以显示声明不排序
df2.groupby(["X"], sort=False).sum()
1.2 GroupBy dropna
默认情况下, 在 groupby 操作中会排除 NA
key 的统计,可以通过设置 dropna=False
去除这一限制
df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
# dropna默认设置为 True
df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
a c
b
1.0 2 3
2.0 2 5
df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
1.3 GroupBy object attributes
groups 的属性是一个字典,key 为分组唯一值,value 为属于key的轴标签
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
gb= df.groupby(["A", "B"])
gb.groups
Out[32]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
# len 为字典键值个数
len(grouped)
# 可用的属性
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
1.4 Grouping DataFrame with Index levels and columns
arrays = [
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
# 多标签
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)
Out[53]:
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
# 按照轴索引的第二个轴和 A 列分组df.groupby([pd.Grouper(level=1), "A"]).sum()
Out[54]:
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
# 通过轴名称
df.groupby([pd.Grouper(level="second"), "A"]).sum()
# 直接根据名称
df.groupby(["second", "A"]).sum()
1.5 DataFrame column selection in GroupBy
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
grouped = df.groupby(["A"])
grouped_C = grouped["C"]
# 上面两个操作过程可以简写为
df["C"].groupby(df["A"])
2 遍历组 Iterating through groups
# 如果 key 为多个, group name 为元组
grouped = df.groupby('A')
for name, group in grouped:
print(name)
print(group)
bar
A B C D
1 bar one 0.254161 1.511763
3 bar three 0.215897 -0.990582
5 bar two -0.077118 1.211526
foo
A B C D
0 foo one -0.575247 1.346061
2 foo two -1.143704 1.627081
4 foo two 1.193555 -0.441652
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
3 选择组 Selecting a group
grouped.get_group("bar")
# 多个key的选择,传入数组
df.groupby(["A", "B"]).get_group(("bar", "one"))
4 数据聚合 Aggregation
GroupBy object 对象创建之后我们可以对分组数据做计算
grouped = df.groupby("A")
grouped.aggregate(np.sum)
Out[69]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
grouped = df.groupby(["A", "B"])
grouped.aggregate(np.sum)
Out[71]:
C D
A B
bar one 0.254161 1.511763
three 0.215897 -0.990582
two -0.077118 1.211526
foo one -0.983776 1.614581
three -0.862495 0.024580
two 0.049851 1.185429
从上面结果可以看到,对分组数据做计算之后,分组 name 作为了新的索引。如果不想分组name 最为新索引,可以用 as_index 操作数据
grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate(np.sum)
Out[73]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
# sum 操作可以简写为
df.groupby("A", as_index=False).sum()
# reset_index 方法直接重新设置索引
df.groupby(["A", "B"]).sum().reset_index()
Out[75]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
统计每组个数时,返回为 Series ,group name 为索引,值为组的大小
grouped = df.groupby(["A", "B"], as_index=False)
grouped.size()
Out[76]:
A B size
0 bar one 1
1 bar three 1
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
# 分组数据统计汇总
grouped.describe()
# 组内某个非
df4.groupby("A")["B"].nunique()
可以应用于分组数据的方法
汇总返回的都为 Series 类型数据
4.1 对分组数据一次执行多个方法
执行单个汇总方法,返回的为 Series 类型数据。当传入多个统计方法时候,返回为 DataFrame
grouped = df.groupby("A")
grouped["C"].agg([np.sum, np.mean, np.std])
Out[83]:
sum mean std
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
grouped[["C", "D"]].agg([np.sum, np.mean, np.std])
Out[84]:
C D
sum mean std sum mean std
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
应用 lambda 函数
grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
Out[88]:
<lambda_0> <lambda_1>
A
bar 0.331279 0.084917
foo 2.337259 -0.215962
4.2 命名聚合Named aggregation
animals = pd.DataFrame(
{
"kind": ["cat", "dog", "cat", "dog"],
"height": [9.1, 6.0, 9.5, 34.0],
"weight": [7.9, 7.5, 9.9, 198.0],
}
)
# 针对列指定函数
animals.groupby("kind").agg(
min_height=pd.NamedAgg(column="height", aggfunc="min"),
max_height=pd.NamedAgg(column="height", aggfunc="max"),
average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
)
Out[91]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
# 可以简写为
animals.groupby("kind").agg(
min_height=("height", "min"),
max_height=("height", "max"),
average_weight=("weight", np.mean),
)
# 不同列用不同函数
grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
Out[95]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
# 用户自定义函数User Defined Function (UDF)
animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())
5 过滤 Filtration
dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
dff.groupby("B").filter(lambda x: len(x) > 2)
Out[142]:
A B
2 2 b
3 3 b
4 4 b
5 5 b
dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False)
Out[143]:
A B
0 NaN NaN
1 NaN NaN
2 2.0 b
3 3.0 b
4 4.0 b
5 5.0 b
6 NaN NaN
7 NaN NaN