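Warning
Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Plan for dropping Python 2.7 for more details.
Warning
The minimum supported Python version will be bumped to 3.6 in a future release.
Warning
Panel has been fully removed. For N-D labeled data structures, please use xarray.
Warning
read_pickle() and read_msgpack() are only guaranteed backwards compatible back to pandas version 0.20.3 (GH27082).
These are the changes in pandas 0.25.0. See Release Notes for a full changelog including other versions of pandas.
Enhancements
Groupby aggregation with relabeling
Pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (GH18366, GH26512).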
In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
...:
In [2]: animals
Out[2]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
[4 rows x 3 columns]
In [3]: animals.groupby("kind").agg(
...: min_height=pd.NamedAgg(column='height', aggfunc='min'),
...: max_height=pd.NamedAgg(column='height', aggfunc='max'),
...: average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
...: )
...:
Out[3]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
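Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection and the second element is the aggregation function to apply. Pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.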
In [4]: animals.groupby("kind").agg(
...: min_height=('height', 'min'),
...: max_height=('height', 'max'),
...: average_weight=('weight', np.mean),
...: )
...:
Out[4]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
[2 rows x 3 columns]
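Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (see Deprecate groupby.agg() with a dictionary when renaming).
A similar approach is now available for Series groupby objects as well. Because there is no need for column selection, the values can just be the functions to apply.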
In [5]: animals.groupby("kind").height.agg(
...: min_height="min",
...: max_height="max",
...: )
...:
Out[5]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
[2 rows x 2 columns]
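This type of aggregation is the recommended alternative to the deprecated behavior of passing a dict to a Series groupby aggregation (see Deprecate groupby.agg() with a dictionary when renaming).
See Named aggregation for more.
Groupby aggregation with multiple lambdas
You can now provide multiple lambda functions to a list-like aggregation in pandas.core.groupby.GroupBy.agg (GH26430).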
In [6]: animals.groupby('kind').height.agg([
...: lambda x: x.iloc[0], lambda x: x.iloc[-1]
...: ])
...:
Out[6]:
<lambda_0> <lambda_1>
kind
cat 9.1 9.5
dog 6.0 34.0
[2 rows x 2 columns]
In [7]: animals.groupby('kind').agg([
...: lambda x: x.iloc[0] - x.iloc[1],
...: lambda x: x.iloc[0] + x.iloc[1]
...: ])
...:
Out[7]:
height weight
<lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind
cat -0.4 18.6 -2.0 17.8
dog -28.0 40.0 -190.5 205.5
[2 rows x 4 columns]
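Previously, these raised a SpecificationError.
Better repr for MultiIndex
Printing of MultiIndex instances now shows the tuples of each row and ensures that the tuple items are vertically aligned, so it is now easier to understand the structure of the MultiIndex (GH13480).
The repr now looks like this: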
In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]:
MultiIndex([( 'a', 0),
( 'a', 1),
( 'a', 2),
( 'a', 3),
( 'a', 4),
( 'a', 5),
( 'a', 6),
( 'a', 7),
( 'a', 8),
( 'a', 9),
...
('abc', 490),
('abc', 491),
('abc', 492),
('abc', 493),
('abc', 494),
('abc', 495),
('abc', 496),
('abc', 497),
('abc', 498),
('abc', 499)],
length=1000)
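Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):
In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3, 4]],
...: codes=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])
In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will truncate if it is wider than options.display.width (default: 80 characters).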
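Shorter truncated repr for Series and DataFrame
Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:
- For small Series or DataFrames, up to max_rows rows are shown (default: 60).
- For larger Series or DataFrames with a length above max_rows, only min_rows rows are shown (default: 10, i.e. the first and last 5 rows).
This dual option still allows you to see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.
To restore the previous behavior of a single threshold, set pd.options.display.min_rows = None.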
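As a rough illustration of the two thresholds, the sketch below compares the repr of a long Series with that of a short slice. The variable names are invented for this example, and option_context is used only so that the settings stay scoped to the block.

```python
import pandas as pd

long_series = pd.Series(range(100))

# Scope the display options to this block; 60 and 10 are the defaults described above.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    truncated = repr(long_series)       # 100 rows > max_rows, so the repr is truncated
    full = repr(long_series.head(20))   # 20 rows <= max_rows, so every row is shown

truncated_lines = truncated.splitlines()
full_lines = full.splitlines()
print(len(truncated_lines), len(full_lines))
```

The truncated repr ends up shorter than the repr of the 20-row slice, which is the point of having two separate thresholds.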
Json normalize with max_level param support
json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over which level to end normalization (GH23843).
For example, limiting normalization to one level:
In [9]: from pandas.io.json import json_normalize
In [10]: data = [{
....: 'CreatedBy': {'Name': 'User001'},
....: 'Lookup': {'TextField': 'Some text',
....: 'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
....: 'Image': {'a': 'b'}
....: }]
....:
In [11]: json_normalize(data, max_level=1)
Out[11]:
CreatedBy.Name Lookup.TextField Lookup.UserField Image.a
0 User001 Some text {'Id': 'ID001', 'Name': 'Name001'} b
[1 rows x 4 columns]
Series.explode to split list-like values to rows
Series and DataFrame have gained the DataFrame.explode() method to transform list-likes to individual rows. See the section on Exploding a list-like column in the docs for more information (GH16538, GH10511).
Here is a typical use case: you have a comma-separated string in a column.
In [12]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
....: {'var1': 'd,e,f', 'var2': 2}])
....:
In [13]: df
Out[13]:
var1 var2
0 a,b,c 1
1 d,e,f 2
[2 rows x 2 columns]
Creating a long form DataFrame is now straightforward using chained operations
In [14]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[14]:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
[6 rows x 2 columns]
Other enhancements
- DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling. (GH24867)
- Added support for ISO week year format ('%G-%V-%u') when parsing datetimes using to_datetime() (GH16607)
- Indexing of DataFrame and Series now accepts zerodim np.ndarray (GH24919)
- Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH25017)
- DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH24043)
- DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls to DataFrame.groupby() to speed up grouping categorical data. (GH24923)
- Series.str has gained the Series.str.casefold() method to remove all case distinctions present in a string (GH25405)
- DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH22484, GH24984)
- DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH24994)
- RangeIndex.union() now supports the sort argument. If sort=False an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible or a sorted Int64Index if not (GH24471)
- TimedeltaIndex.intersection() now also supports the sort keyword (GH24471)
- DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH13473)
- Added Sparse accessor for working with a DataFrame whose values are sparse (GH25681)
- RangeIndex has gained start, stop, and step attributes (GH25710)
- datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH25065)
- DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH6508)
- merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH26136)
- pandas.core.window.Rolling() supports exponential (or Poisson) window type (GH21303)
- Error message for missing required imports now includes the original import error's text (GH23868)
- DatetimeIndex and TimedeltaIndex now have a mean method (GH24757)
- DataFrame.describe() now formats integer percentiles without decimal point (GH26660)
- Added support for reading SPSS .sav files using read_spss() (GH26537)
- Added new option plotting.backend to be able to select a plotting backend different than the existing matplotlib one. Use pandas.set_option('plotting.backend', '') where `` (GH14130)
- pandas.offsets.BusinessHour supports multiple opening hours intervals (GH15481)
- read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH11499)
- pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable. Consult the IO User Guide for more details (GH9070)
- Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting if the given interval(s) are empty (GH27219)
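A few of the smaller enhancements listed above can be tried together. The following is a minimal sketch using made-up data (the column names and values are assumptions for illustration only); it exercises backtick quoting in DataFrame.query(), the errors argument of DataFrame.rename(), and the public RangeIndex attributes.

```python
import pandas as pd

# Hypothetical frame with a column name that contains a space.
df = pd.DataFrame({"unit price": [1.5, 2.0, 3.5], "qty": [10, 0, 4]})

# Backticks let query() refer to a column whose name contains a space.
expensive = df.query("`unit price` > 1.8")

# errors="raise" turns a rename of a nonexistent key into a KeyError.
rename_error = None
try:
    df.rename(columns={"missing": "x"}, errors="raise")
except KeyError as exc:
    rename_error = exc

# start, stop and step are public attributes of RangeIndex.
idx = pd.RangeIndex(0, 10, 2)
bounds = (idx.start, idx.stop, idx.step)
print(expensive, rename_error, bounds, sep="\n")
```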
Backwards incompatible API changes
Indexing with date strings with UTC offsets
Indexing a DataFrame or Series with a DatetimeIndex with a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH24076, GH16785)
In [15]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
In [16]: df
Out[16]:
0
2019-01-01 00:00:00-08:00 0
[1 rows x 1 columns]
Previous behavior:
In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
0
2019-01-01 00:00:00-08:00 0
New behavior:
In [17]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[17]:
0
2019-01-01 00:00:00-08:00 0
[1 rows x 1 columns]
MultiIndex constructed from levels and codes
Constructing a MultiIndex with NaN levels or codes value < -1 was allowed previously. Now, construction with codes value < -1 is not allowed and NaN levels' corresponding codes would be reassigned as -1. (GH19387)
Previous behavior:
In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
...: codes=[[0, -1, 1, 2, 3, 4]])
...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
codes=[[0, -1, 1, 2, 3, 4]])
In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
codes=[[0, -2]])
New behavior:
In [18]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
....: codes=[[0, -1, 1, 2, 3, 4]])
....:
Out[18]:
MultiIndex([(nan,),
(nan,),
(nan,),
(nan,),
(128,),
( 2,)],
)
In [19]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-225a01af3975> in <module>
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
206 else:
207 kwargs[new_arg_name] = new_arg_value
--> 208 return func(*args, **kwargs)
209
210 return wrapper
/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity, _set_identity)
270
271 if verify_integrity:
--> 272 new_codes = result._verify_integrity()
273 result._codes = new_codes
274
/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, codes, levels)
348 raise ValueError(
349 "On level {level}, code value ({code})"
--> 350 " < -1".format(level=i, code=level_codes.min())
351 )
352 if not level.is_unique:
ValueError: On level 0, code value (-2) < -1
Groupby.apply on DataFrame evaluates first group only once
The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer if it is safe to use a fast code path. Particularly for functions with side effects, this was an undesired behavior and may have led to surprises. (GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417)
Now every group is evaluated only a single time.
In [20]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [21]: df
Out[21]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
In [22]: def func(group):
....: print(group.name)
....: return group
....:
Previous behavior:
In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [23]: df.groupby("a").apply(func)
x
y
Out[23]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Concatenating sparse values
When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH25702).
In [24]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
Previous behavior:
In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame
New behavior:
In [25]: type(pd.concat([df, df]))
Out[25]: pandas.core.frame.DataFrame
This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.
This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).
Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.
The .str-accessor performs stricter type checks
Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH23163, GH23011, GH23551.
Previous behavior:
In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [2]: s
Out[2]:
0 b'a'
1 b'ba'
2 b'cba'
dtype: object
In [3]: s.str.startswith(b'a')
Out[3]:
0 True
1 False
2 False
dtype: bool
New behavior:
In [26]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)
In [27]: s
Out[27]:
0 b'a'
1 b'ba'
2 b'cba'
Length: 3, dtype: object
In [28]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-ac784692b361> in <module>
----> 1 s.str.startswith(b'a')
/pandas/pandas/core/strings.py in wrapper(self, *args, **kwargs)
1840 )
1841 )
-> 1842 raise TypeError(msg)
1843 return func(self, *args, **kwargs)
1844
TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
Categorical dtypes are preserved during groupby
Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. Pandas now will preserve these dtypes. (GH18502)
In [29]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)
In [30]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})
In [31]: df
Out[31]:
payload col
0 -1 foo
1 -2 bar
2 -1 bar
3 -2 qux
[4 rows x 2 columns]
In [32]: df.dtypes
Out[32]:
payload int64
col category
Length: 2, dtype: object
Previous Behavior:
In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')
New Behavior:
In [33]: df.groupby('payload').first().col.dtype
Out[33]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True)
Incompatible Index type unions
When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH23525).
Previous behavior:
In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects
In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')
New behavior:
In [34]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[34]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [35]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[35]: Index([1, 2, 3], dtype='object')
Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.
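The "compatible" integer and floating-point case can be sketched as follows. The index values are arbitrary, and the example only assumes that the plain pd.Index constructor infers integer and float dtypes from them.

```python
import pandas as pd

ints = pd.Index([1, 2])
floats = pd.Index([1.5])

# Integer values are coerced to floating point in the union.
merged = ints.union(floats)

# The union holds the same elements regardless of operand order.
reverse = floats.union(ints)
print(merged, reverse, sep="\n")
```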
DataFrame groupby ffill/bfill no longer return group labels
The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned. (GH21521)
In [36]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
In [37]: df
Out[37]:
a b
0 x 1
1 y 2
[2 rows x 2 columns]
Previous behavior:
In [3]: df.groupby("a").ffill()
Out[3]:
a b
0 x 1
1 y 2
New behavior:
In [38]: df.groupby("a").ffill()
Out[38]:
b
0 1
1 2
[2 rows x 1 columns]
DataFrame describe on an empty categorical / object column will return top and freq
When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' columns were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' columns will always be included, with numpy.nan in the case of an empty DataFrame (GH26397)
In [39]: df = pd.DataFrame({"empty_col": pd.Categorical([])})
In [40]: df
Out[40]:
Empty DataFrame
Columns: [empty_col]
Index: []
[0 rows x 1 columns]
Previous behavior:
In [3]: df.describe()
Out[3]:
empty_col
count 0
unique 0
New behavior:
In [41]: df.describe()
Out[41]:
empty_col
count 0
unique 0
top NaN
freq NaN
[4 rows x 1 columns]
__str__ methods now call __repr__ rather than vice versa
Pandas has until now mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a specific __repr__ method was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__ if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH26495).
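The standard Python delegation mentioned above can be seen with a small class that defines only __repr__. The Point class below is hypothetical and exists only for illustration; the pandas part simply checks that str() and repr() of a Series agree, which is what the change implies for objects without a specific __str__.

```python
import pandas as pd


class Point:
    """Hypothetical plain-Python class that defines only __repr__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point(x={self.x}, y={self.y})"


p = Point(1, 2)
# object.__str__ falls back to __repr__ when no __str__ is defined.
plain_python_same = str(p) == repr(p)

s = pd.Series([1, 2, 3])
# The Series string representation is defined once and shared by both calls.
pandas_same = str(s) == repr(s)
print(plain_python_same, pandas_same)
```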
Indexing an IntervalIndex with Interval objects
Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH16316).
In [42]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])
In [43]: ii
Out[43]:
IntervalIndex([(0, 4], (1, 5], (5, 8]],
closed='right',
dtype='interval[int64]')
The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.
Previous behavior:
In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True
In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True
New behavior:
In [44]: pd.Interval(1, 2, closed='neither') in ii
Out[44]: False
In [45]: pd.Interval(-10, 10, closed='both') in ii
Out[45]: False
The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.
Previous behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])
In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])
New behavior:
In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1
In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')
Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
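A minimal sketch of the -1 convention follows. It uses a separate, non-overlapping index with made-up bounds, on the assumption that get_indexer() requires a non-overlapping IntervalIndex; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical non-overlapping index, distinct from the ii used in this section.
index = pd.IntervalIndex.from_tuples([(0, 4), (4, 8)])

# The first query matches (0, 4] exactly; the second only lies inside it.
queries = pd.IntervalIndex.from_tuples([(0, 4), (1, 3)])

# Exact matches give their position; anything else gives -1.
positions = index.get_indexer(queries)
print(positions)
```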
These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.
In [46]: s = pd.Series(list('abc'), index=ii)
In [47]: s
Out[47]:
(0, 4] a
(1, 5] b
(5, 8] c
Length: 3, dtype: object
Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.
Previous behavior:
In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4] a
(1, 5] b
dtype: object
In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [48]: s[pd.Interval(1, 5)]
Out[48]: 'b'
In [49]: s.loc[pd.Interval(1, 5)]
Out[49]: 'b'
Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.
Previous behavior:
In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4] a
(1, 5] b
dtype: object
In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4] a
(1, 5] b
dtype: object
New behavior:
In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')
The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.
New behavior:
In [50]: idxr = s.index.overlaps(pd.Interval(2, 3))
In [51]: idxr
Out[51]: array([ True, True, False])
In [52]: s[idxr]
Out[52]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
In [53]: s.loc[idxr]
Out[53]:
(0, 4] a
(1, 5] b
Length: 2, dtype: object
Binary ufuncs on Series now align
Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH23293).
In [54]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
In [55]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])
In [56]: s1
Out[56]:
a 1
b 2
c 3
Length: 3, dtype: int64
In [57]: s2
Out[57]:
d 3
c 4
b 5
Length: 3, dtype: int64
Previous behavior
In [5]: np.power(s1, s2)
Out[5]:
a 1
b 16
c 243
dtype: int64
New behavior
In [58]: np.power(s1, s2)
Out[58]:
a 1.0
b 32.0
c 81.0
d NaN
Length: 4, dtype: float64
This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.
In [59]: np.power(s1, s2.array)
Out[59]:
a 1
b 16
c 243
Length: 3, dtype: int64
Categorical.argsort now places missing values at the end
Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH21801).
In [60]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
Previous behavior
In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)
In [3]: cat.argsort()
Out[3]: array([1, 2, 0])
In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
Categories (2, object): [a < b]
New behavior
In [61]: cat.argsort()
Out[61]: array([2, 0, 1])
In [62]: cat[cat.argsort()]
Out[62]:
[a, b, NaN]
Categories (2, object): [a < b]
Column order is preserved when passing a list of dicts to DataFrame
Starting with Python 3.7 the key-order of dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python>=3.6 (GH27309).
In [63]: data = [
....: {'name': 'Joe', 'state': 'NY', 'age': 18},
....: {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
....: {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
....: ]
....:
Previous Behavior:
The columns were lexicographically sorted previously:
In [1]: pd.DataFrame(data)
Out[1]:
age finances hobby name state
0 18 NaN NaN Joe NY
1 19 NaN Minecraft Jane KY
2 20 good NaN Jean OK
New Behavior:
The column order now matches the insertion-order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.
In [64]: pd.DataFrame(data)
Out[64]:
name state age hobby finances
0 Joe NY 18 NaN NaN
1 Jane KY 19 Minecraft NaN
2 Jean OK 20 NaN good
[3 rows x 5 columns]
Increased minimum versions for dependencies
Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH25725, GH24942, GH25752). Independently, some minimum supported versions of dependencies were updated (GH23519, GH25554). If installed, we now require:
| Package | Minimum Version | Required |
|---|---|---|
| numpy | 1.13.3 | X |
| pytz | 2015.4 | X |
| python-dateutil | 2.6.1 | X |
| bottleneck | 1.2.1 | |
| numexpr | 2.6.2 | |
| pytest (dev) | 4.0.2 | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
| Package | Minimum Version |
|---|---|
| beautifulsoup4 | 4.6.0 |
| fastparquet | 0.2.1 |
| gcsfs | 0.2.2 |
| lxml | 3.8.0 |
| matplotlib | 2.2.2 |
| openpyxl | 2.4.8 |
| pyarrow | 0.9.0 |
| pymysql | 0.7.1 |
| pytables | 3.4.2 |
| scipy | 0.19.0 |
| sqlalchemy | 1.1.4 |
| xarray | 0.8.2 |
| xlrd | 1.1.0 |
| xlsxwriter | 0.9.8 |
| xlwt | 1.2.0 |
See Dependencies and Optional dependencies for more.
Other API changes
- DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH24713)
- Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively. (GH24653)
- Timestamp.strptime() will now raise a NotImplementedError (GH25016)
- Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH24011)
- Bug in DatetimeIndex.snap() which didn't preserve the name of the input Index (GH25575)
- The arg argument in pandas.core.groupby.DataFrameGroupBy.agg() has been renamed to func (GH26089)
- The arg argument in pandas.core.window._Window.aggregate() has been renamed to func (GH26372)
- Most Pandas classes had a __bytes__ method, which was used for getting a python2-style bytestring representation of the object. This method has been removed as a part of dropping Python2 (GH26447)
- The .str-accessor has been disabled for 1-level MultiIndex, use MultiIndex.to_flat_index() if necessary (GH23679)
- Removed support of gtk package for clipboards (GH26563)
- Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH27063)
- Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone aware data. (GH27008, GH7056)
- ExtensionArray.argsort() places NA values at the end of the sorted array. (GH21801)
- DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types for a fixed format. (GH7775)
- Passing duplicate names in read_csv() will now raise a ValueError (GH17346)
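Two of the API changes listed above are easy to check interactively. The sketch below uses arbitrary example values and an in-memory CSV (both assumptions made for illustration); it compares to_numpy() with the methods it aliases and shows the error raised for duplicate names.

```python
import io

import pandas as pd

ts = pd.Timestamp("2019-07-18 12:00")
td = pd.Timedelta("1 day")

# to_numpy() is an alias for the scalar conversion methods.
same_ts = ts.to_numpy() == ts.to_datetime64()
same_td = td.to_numpy() == td.to_timedelta64()

# Duplicate entries in names are rejected with a ValueError.
dup_error = None
try:
    pd.read_csv(io.StringIO("1,2\n3,4\n"), names=["a", "a"])
except ValueError as exc:
    dup_error = exc
print(same_ts, same_td, dup_error)
```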
Deprecations
Sparse subclasses
The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better-provided by a Series or DataFrame with sparse values.
Previous way
In [65]: df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
In [66]: df.dtypes
Out[66]:
A Sparse[int64, nan]
Length: 1, dtype: object
New way
In [67]: df = pd.DataFrame({"A": pd.SparseArray([0, 0, 1, 2])})
In [68]: df.dtypes
Out[68]:
A Sparse[int64, 0]
Length: 1, dtype: object
The memory usage of the two approaches is identical. See Migrating (opens new window) for more (GH19239 (opens new window)).
msgpack格式
The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH27084 (opens new window))
其他弃用
- The deprecated
.ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH26438).
- Deprecated the units=M (months) and units=Y (year) parameters for units of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH16344)
- pandas.concat() has deprecated the join_axes keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH21951)
- The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH26421).
- The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64(). (GH24416)
- The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH26405).
- The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH26581).
- The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype and DataFrame.dtypes (GH26705).
- The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..) or to_numpy() can be used instead (GH19617).
- The 'outer' method on NumPy ufuncs, e.g. np.subtract.outer, has been deprecated on Series objects. Convert the input to an array with Series.array first (GH27186)
- Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH21344)
- read_table() has been undeprecated. (GH25220)
- Index.dtype_str is deprecated. (GH18262)
- Series.imag and Series.real are deprecated. (GH18262)
- Series.put() is deprecated. (GH18262)
- Index.item() and Series.item() are deprecated. (GH18262)
- The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types ordered=True must be explicitly passed in order to be preserved. (GH26336)
- Index.contains() is deprecated. Use key in index (__contains__) instead (GH17753).
- DataFrame.get_dtype_counts() is deprecated. (GH18262)
- Categorical.ravel() will return a Categorical instead of a np.ndarray (GH27199)
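Several of the deprecations above come with a suggested replacement. The following is a minimal sketch of three of them, assuming a reasonably recent pandas; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
idx = pd.RangeIndex(start=0, stop=10, step=2)

# Instead of the deprecated Series.get_values(), request a NumPy array explicitly.
values = s.to_numpy()
also_values = np.asarray(s)

# Instead of the private RangeIndex._start/_stop/_step, use the public attributes.
bounds = (idx.start, idx.stop, idx.step)

# Instead of Index.contains(key), use the `in` operator (__contains__).
has_four = 4 in idx
has_five = 5 in idx
```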
Removal of prior version deprecations/changes
- Removed Panel (GH25047, GH25191, GH25231)
- Removed the previously deprecated sheetname keyword in read_excel() (GH16442, GH20938)
- Removed the previously deprecated TimeGrouper (GH16942)
- Removed the previously deprecated parse_cols keyword in read_excel() (GH16488)
- Removed the previously deprecated pd.options.html.border (GH16970)
- Removed the previously deprecated convert_objects (GH11221)
- Removed the previously deprecated select method of DataFrame and Series (GH17633)
- Removed the previously deprecated behavior of Series treated as list-like in rename_categories() (GH17982)
- Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH17842)
- Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH17842)
- Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv(), and DataFrame.to_csv() (GH17877, GH17820)
- Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH17812)
- Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH17744)
- Removed the previously deprecated ordered and categories keyword arguments in astype (GH17742)
- Removed the previously deprecated cdate_range (GH17691)
- Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH17493)
- Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH17352)
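For code that still relied on the removed TimeGrouper or reindex_axis, the sketch below shows the commonly used replacements, pd.Grouper(freq=...) and reindex(columns=...). The sample frame and the "extra" column name are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["2019-01-01 00:00", "2019-01-01 12:00",
         "2019-01-02 00:00", "2019-01-02 12:00"]
    ),
)

# pd.TimeGrouper was removed; pd.Grouper(freq=...) groups rows by a time frequency.
daily = df.groupby(pd.Grouper(freq="D")).sum()

# reindex_axis was removed; reindex accepts the target labels via columns= or index=.
wide = df.reindex(columns=["value", "extra"])
```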
Performance improvements
- Significant speedup in SparseArray initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (GH24985)
- DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH25045)
- Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH22034)
- Improved performance of pandas.core.groupby.GroupBy.quantile() (GH20405)
- Improved performance of slicing and other selected operations on a RangeIndex (GH26565, GH26617, GH26722)
- RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH16685)
- Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH25784)
- Improved performance of read_csv() by faster parsing of N/A and boolean values (GH25804)
- Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH24813)
- Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH25708)
- Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH25922)
- Improved performance of nanops for dtypes that cannot store NaNs. Speedup is particularly prominent for Series.all() and Series.any() (GH25070)
- Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH23785)
- Improved performance of IntervalIndex.intersection() (GH24813)
- Improved performance of read_csv() by faster concatenating date columns without extra conversion to string for integer/float zero and float NaN; by faster checking the string for the possibility of being a date (GH25754)
- Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH24813)
- Restored performance of DatetimeIndex.__iter__() by re-enabling specialized code path (GH26702)
- Improved performance when building MultiIndex with at least one CategoricalIndex level (GH22044)
- Improved performance by removing the need for a garbage collect when checking for SettingWithCopyWarning (GH27031)
- For to_datetime(), changed the default value of the cache parameter to True (GH26043)
- Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH27136).
- Improved performance of pd.read_json() for index-oriented data. (GH26773)
- Improved performance of MultiIndex.shape (GH27384).
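The cache parameter of to_datetime() mentioned above only affects speed, not results. The sketch below parses the same repeated strings with the cache on and off to illustrate that; the input list is invented, and no timing claim is made here.

```python
import pandas as pd

# Many repeated date strings: the case where caching unique values can help.
raw = ["2019-07-18", "2019-07-19"] * 50

cached = pd.to_datetime(raw, cache=True)     # default since this release
uncached = pd.to_datetime(raw, cache=False)  # opt out explicitly
```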
Bug fixes
Categorical
- Bug in DataFrame.at() and Series.at() that would raise exception if the index was a CategoricalIndex (GH20629)
- Fixed bug in comparison of ordered Categorical that contained missing values with a scalar which sometimes incorrectly resulted in True (GH26504)
- Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval objects incorrectly raised a TypeError (GH25087)
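As a quick illustration of the first fix in this list, scalar access with .at on a CategoricalIndex-backed Series is expected to return the stored value. This is a small sketch with made-up data.

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.CategoricalIndex(["a", "b"]))

# Label-based scalar lookup on a CategoricalIndex.
value = s.at["a"]
```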
Datetimelike
- Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified instead of raising OutOfBoundsDatetime (GH23830)
- Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305)
- Bug in DataFrame and Series where timezone aware data with dtype='datetime64[ns]' was not cast to naive (GH25843)
- Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH25851)
- Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH25445)
- Bug in to_datetime() which does not replace the invalid argument with NaT when errors is set to 'coerce' (GH26122)
- Bug in adding DateOffset with nonzero month to DatetimeIndex would raise ValueError (GH26258)
- Bug in to_datetime() which raises unhandled OverflowError when called with mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH25512)
- Bug in isin() for datetimelike indexes (DatetimeIndex, TimedeltaIndex and PeriodIndex) where the levels parameter was ignored. (GH26675)
- Bug in to_datetime() which raises TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
- Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH26689)
- Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an OutOfBoundsDatetime error (GH26206).
- Bug in date_range() with unnecessary OverflowError being raised for very large or very small dates (GH26651)
- Bug where adding Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH24775)
- Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH26916)
- Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offset (GH26097)
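Two of the behaviors described in this list are easy to check by hand: errors='coerce' turning an unparseable entry into NaT, and Timestamp plus np.timedelta64 giving back a Timestamp. The snippet below is a minimal sketch with invented inputs; recent pandas versions may also emit a format-inference warning for the mixed list.

```python
import numpy as np
import pandas as pd

# The second entry cannot be parsed; with errors="coerce" it becomes NaT.
parsed = pd.to_datetime(["2019-07-18", "not a date"], errors="coerce")

# Adding a NumPy timedelta to a Timestamp yields another Timestamp.
shifted = pd.Timestamp("2019-07-18") + np.timedelta64(1, "D")
```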
Timedelta
- Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH25913)
- Bug with comparisons between Timedelta and NaT raising TypeError (GH26039)
- Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH26381)
- Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH26689)
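With the Timedelta/NaT comparison fix, such comparisons evaluate to a boolean rather than raising. A small sketch, assuming the usual NaT semantics where only != is true:

```python
import pandas as pd

td = pd.Timedelta("1 days")

# NaT compares unequal to everything, so only the != comparison is true.
equal = bool(td == pd.NaT)
less = bool(td < pd.NaT)
not_equal = bool(td != pd.NaT)
```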
Timezones
- Bug in DatetimeIndex.to_frame() where timezone aware data would be converted to timezone naive data (GH25809)
- Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH24992)
- Bug in Timestamp.tz_localize() and Timestamp.tz_convert() where freq was not propagated (GH25241)
- Bug in Series.at() where setting Timestamp with timezone raises TypeError (GH25506)
- Bug in DataFrame.update() when updating with timezone aware data would return timezone naive data (GH25807)
- Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH25978)
- Bug in to_datetime() with unit='ns' would drop timezone information from the parsed argument (GH26168)
- Bug in DataFrame.join() where joining a timezone aware index with a timezone aware column would result in a column of NaN (GH26335)
- Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH27088)
- Bug in DatetimeIndex.union() when combining a timezone aware and timezone unaware DatetimeIndex (GH21671)
- Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone aware Series (GH15552)
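The DatetimeIndex.to_frame() fix means the timezone should survive the conversion. A minimal sketch follows; the column name "when" is an arbitrary choice for this example.

```python
import pandas as pd

idx = pd.date_range("2019-01-01", periods=3, tz="UTC")

# The resulting column is expected to keep the UTC timezone.
frame = idx.to_frame(name="when")
tz_name = str(frame["when"].dt.tz)
```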
Numeric
- Bug in to_numeric() in which large negative numbers were being improperly handled (GH24910)
- Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH24910)
- Bug in to_numeric() in which invalid values for errors were being allowed (GH26466)
- Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH25514)
- Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable. (GH25729)
- Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH25557)
- Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require numeric index. (GH21662)
- Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH25928)
- Fixed bug where casting all-boolean array to integer extension array failed (GH25211)
- Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH26987)
- Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH27321)
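The divmod and floor-division entries above can be illustrated as follows. This is a sketch with made-up values; it assumes the documented behavior that a positive value floor-divided by zero gives Inf and zero divided by zero gives NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([7, 8, 9])

# Series.divmod returns a pair of Series: (quotient, remainder).
quotient, remainder = s.divmod(2)

# Floor division by zero: positive // 0 is filled with inf, 0 // 0 with NaN.
by_zero = pd.Series([1, 0]) // 0
```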
Conversion
- Bug in DataFrame.astype() where the errors parameter was ignored when passing a dict of columns and types. (GH25905)
Strings
- Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH23551)
- Improved error message when passing Series of wrong dtype to Series.str.cat() (GH22722)
Interval
- Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH23013)
- Fixed bug in Series/DataFrame not displaying NaN in IntervalIndex with missing values (GH25984)
- Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH25860)
- Bug in Index constructor where passing mixed closed Interval objects would result in a ValueError instead of an object dtype Index (GH27172)
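A short sketch of two Interval behaviors from this list: non-numeric endpoints are rejected, and mixing intervals with different closed sides in the Index constructor falls back to an object-dtype Index. The string endpoints here are just a convenient invalid input.

```python
import pandas as pd

# Intervals closed on different sides cannot share one IntervalIndex,
# so the generic constructor is expected to return an object-dtype Index.
mixed = pd.Index(
    [pd.Interval(0, 1, closed="left"), pd.Interval(1, 2, closed="right")]
)

# String endpoints are not allowed when constructing an Interval.
try:
    pd.Interval("a", "b")
    rejected = False
except ValueError:
    rejected = True
```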
Indexing
- Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH25753).
- Improved exception message when calling .iloc or .loc with a boolean indexer with different length (GH26658).
- Bug in KeyError exception message when indexing a MultiIndex with a non-existent key not displaying the original key (GH27250).
- Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH26658).
- Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH14885).
- Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH22252).
- Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame, when the indices were single-level MultiIndex (GH26303).
- Fixed bug where assigning an arrays.PandasArray to a pandas.core.frame.DataFrame would raise error (GH26390)
- Allow keyword arguments for callable local reference used in the DataFrame.query() string (GH26426)
- Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH27148)
- Bug which produced AttributeError on partial matching Timestamp in a MultiIndex (GH26944)
- Bug in Categorical and CategoricalIndex with Interval values when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH23705)
- Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH27110)
- Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when a list is passed using the in operator (__contains__) (GH21729)
- Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH22717)
- Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH12862)
- Bug in DataFrame.iloc() when indexing with a read-only indexer (GH17192)
- Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH20441)
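The boolean-indexer entries (GH26658) can be illustrated with a mask that is too short. The following sketch uses a three-element Series and catches the IndexError that the fixed behavior is expected to raise.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# A boolean mask must have the same length as the axis it indexes.
try:
    s.loc[[True, False]]
    raised = False
except IndexError:
    raised = True

# A mask of the correct length selects the matching rows.
matching = s.loc[[True, False, True]]
```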
Missing
- Fixed misleading exception message in Series.interpolate() if argument order is required, but omitted (GH10633, GH24014).
- Fixed class type displayed in exception message in DataFrame.dropna() if invalid axis parameter passed (GH25555)
- A ValueError will now be thrown by DataFrame.fillna() when limit is not a positive integer (GH27042)
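The fillna() change above can be sketched as follows: a positive limit caps how many missing values are filled, while limit=0 is expected to be rejected with a ValueError. The frame is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan]})

# limit=1 fills only the first missing value in the column.
filled = df.fillna(0.0, limit=1)

# A non-positive limit is expected to raise ValueError.
try:
    df.fillna(0.0, limit=0)
    rejected = False
except ValueError:
    rejected = True
```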
MultiIndex
- Bug in which an incorrect exception was raised by Timedelta when testing the membership of MultiIndex (GH24570)
I/O
- Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH17004)
- Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH25040)
- Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH21345)
- Bug in read_json() for orient='table' and float index, as it infers index dtype by default, which is not applicable because index dtype is already defined in the JSON schema (GH25433)
- Bug in read_json() for orient='table' and string of float column names, as it makes a column name type conversion to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH25435)
- Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the resulting DataFrame with the string "nan" instead of numpy.nan (GH25468)
- DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH25608)
- Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH16718)
- Bug in read_csv() not properly interpreting the UTF8 encoded filenames on Windows on Python 3.6+ (GH15086)
- Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH25772)
- Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH17280)
- Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with a ValueError when using a sub-selection via the start or stop arguments (GH11188)
- Bug in read_hdf() not properly closing store after a KeyError is raised (GH25766)
- Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH25772)
- Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH25960)
- Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH25941)
- Fixed bug in loading objects from S3 that contain # characters in the URL (GH25945)
- Adds use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries. (GH26104)
- Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH24889)
- Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH26168)
- Added cache_dates=True parameter to read_csv(), which allows caching unique dates when they are parsed (GH25990)
- DataFrame.to_excel() now raises a ValueError when the caller's dimensions exceed the limitations of Excel (GH26051)
- Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH26545)
- read_excel() now raises a ValueError when input is of type pandas.io.excel.ExcelFile and engine param is passed since pandas.io.excel.ExcelFile has an engine defined (GH26566)
- Bug while selecting from HDFStore with where='' specified (GH26610).
- Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH27006)
- Bug in read_hdf() where reading a timezone aware DatetimeIndex would raise a TypeError (GH11926)
- Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH27160)
- Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no columns (GH27339)
- Allow parsing of PeriodDtype columns when using read_csv() (GH26934)
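Two of the I/O additions above, a string col_space in DataFrame.to_html() and the cache_dates parameter of read_csv(), can be tried with in-memory data. The sketch below uses invented column names and an inline CSV string.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# col_space accepts a CSS length string, which ends up in the header cell style.
html = df.to_html(col_space="100px")

# cache_dates=True lets read_csv reuse the parse result for repeated date strings.
csv_text = StringIO("when,value\n2019-07-18,1\n2019-07-18,2\n")
parsed = pd.read_csv(csv_text, parse_dates=["when"], cache_dates=True)
```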
Plotting
- Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH25587)
- Bug in an error message in DataFrame.plot(). Improved the error message if non-numerics are passed to DataFrame.plot() (GH25481)
- Bug in incorrect ticklabel positions when plotting an index that is non-numeric / non-datetime (GH7612, GH15912, GH22334)
- Fixed bug causing plots of PeriodIndex timeseries to fail if the frequency is a multiple of the frequency rule code (GH14763)
- Fixed bug when plotting a DatetimeIndex with datetime.timezone.utc timezone (GH17173)
Groupby/resample/rolling
- Bug in pandas.core.resample.Resampler.agg() with a timezone aware index where OverflowError would raise when passing a list of functions (GH22660)
- Bug in pandas.core.groupby.DataFrameGroupBy.nunique() in which the names of column levels were lost (GH23222)
- Bug in pandas.core.groupby.GroupBy.agg() when applying an aggregation function to timezone aware data (GH23683)
- Bug in pandas.core.groupby.GroupBy.first() and pandas.core.groupby.GroupBy.last() where timezone information would be dropped (GH21603)
- Bug in pandas.core.groupby.GroupBy.size() when grouping only NA values (GH23050)
- Bug in Series.groupby() where observed kwarg was previously ignored (GH24880)
- Bug in Series.groupby() where using groupby with a MultiIndex Series with a list of labels equal to the length of the series caused incorrect grouping (GH25704)
- Ensured that ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH25692)
- Ensured that result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH25871, GH25167)
- Bug in pandas.core.window.Rolling.min() and pandas.core.window.Rolling.max() that caused a memory leak (GH25893)
- Bug in pandas.core.window.Rolling.count() and pandas.core.window.Expanding.count() where the axis keyword was previously ignored (GH13503)
- Bug in pandas.core.groupby.GroupBy.idxmax() and pandas.core.groupby.GroupBy.idxmin() with datetime column would return incorrect dtype (GH25444, GH15306)
- Bug in pandas.core.groupby.GroupBy.cumsum(), pandas.core.groupby.GroupBy.cumprod(), pandas.core.groupby.GroupBy.cummin() and pandas.core.groupby.GroupBy.cummax() with categorical column having absent categories, would return incorrect result or segfault (GH16771)
- Bug in pandas.core.groupby.GroupBy.nth() where NA values in the grouping would return incorrect results (GH26011)
- Bug in pandas.core.groupby.SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH26208)
- Bug in pandas.core.frame.DataFrame.groupby() where passing a pandas.core.groupby.grouper.Grouper would return incorrect groups when using the .groups accessor (GH26326)
- Bug in pandas.core.groupby.GroupBy.agg() where incorrect results are returned for uint64 columns. (GH26310)
- Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where MemoryError is raised with empty window (GH26005)
- Bug in pandas.core.window.Rolling.median() and pandas.core.window.Rolling.quantile() where incorrect results are returned with closed='left' and closed='neither' (GH26005)
- Improved pandas.core.window.Rolling, pandas.core.window.Window and pandas.core.window.EWM functions to exclude nuisance columns from results instead of raising errors and raise a DataError only if all columns are nuisance (GH12537)
- Bug in pandas.core.window.Rolling.max() and pandas.core.window.Rolling.min() where incorrect results are returned with an empty variable window (GH26005)
- Raise a helpful exception when an unsupported weighted window function is used as an argument of pandas.core.window.Window.aggregate() (GH26597)
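The Series.groupby() fix for the observed keyword can be seen with a categorical grouper that has an unused category. A minimal sketch with made-up data:

```python
import pandas as pd

# Category "c" is declared but never appears in the data.
cat = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])
s = pd.Series([1, 2, 3])

# observed=True keeps only categories present in the data.
observed = s.groupby(cat, observed=True).sum()

# observed=False also reports the unused category, with an empty-group sum of 0.
everything = s.groupby(cat, observed=False).sum()
```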
重塑(Reshaping)相关
- Bug in
pandas.merge()(opens new window) adds a string ofNone, ifNoneis assigned in suffixes instead of remain the column name as-is (GH24782 (opens new window)). - Bug in
merge()(opens new window) when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH24212 (opens new window), GH25009 (opens new window)) to_records()now accepts dtypes to itscolumn_dtypesparameter (GH24895 (opens new window))- Bug in
concat()(opens new window) where order ofOrderedDict(anddictin Python 3.6+) is not respected, when passed in asobjsargument (GH21510 (opens new window)) - Bug in
pivot_table() where columns with NaN values are dropped even if the dropna argument is False, when the aggfunc argument contains a list (GH22159)
- Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH3232).
- Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH22501)
- Bug where instantiating a DataFrame with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH26349).
- Bug where instantiating a DataFrame with a range (e.g. pd.DataFrame(range(3))) raised an error (GH26342).
- Bug in the DataFrame constructor where passing non-empty tuples would cause a segmentation fault (GH25691)
- Bug where Series.apply() failed when the series is a timezone-aware DatetimeIndex (GH25959)
- Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH26045)
- Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH26053)
- Bug in Series.nlargest() where True was treated as smaller than False (GH26154)
- Bug in DataFrame.pivot_table() where an IntervalIndex as pivot index would raise TypeError (GH25814)
- Bug in which DataFrame.from_dict() ignored the order of an OrderedDict when orient='index' (GH8425).
- Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH26825)
- Bug in pivot_table() where pivoting a timezone-aware column as the values would remove timezone information (GH14948)
- Bug in merge_asof() when specifying multiple by columns where one is of datetime64[ns, tz] dtype (GH26649)
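As a rough illustration of the constructor and from_dict() fixes listed above, the following sketch builds a DataFrame from a range, from a dict holding an iterator, and from an OrderedDict with orient='index'. The variable names and sample values are made up for this example, and the behaviour shown assumes pandas 0.25.0 or later.

```python
from collections import OrderedDict

import pandas as pd

# A range object is accepted directly as the data argument (GH26342).
df_range = pd.DataFrame(range(3))

# Dict values may be iterators or generators; each one is consumed once (GH26349).
df_iter = pd.DataFrame({'A': reversed(range(3))})

# With orient='index', the row labels follow the OrderedDict's insertion order (GH8425).
ordered = OrderedDict([('b', [1, 2]), ('a', [3, 4])])
df_ordered = pd.DataFrame.from_dict(ordered, orient='index')

print(df_range[0].tolist())
print(df_iter['A'].tolist())
print(df_ordered.index.tolist())
```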
Sparse
- Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH24985)
- Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH16807)
- Bug in SparseDataFrame where, when adding a column whose length of values does not match the length of the index, an AssertionError is raised instead of a ValueError (GH25484)
- Introduce a better error message in Series.sparse.from_coo() so it raises a TypeError for inputs that are not coo matrices (GH26554)
- Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH26946).
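The numpy.modf() entry above can be sketched as follows. This is a minimal example with made-up values; it assumes the array is created through pd.arrays.SparseArray, and it only checks that both outputs come back as sparse arrays.

```python
import numpy as np
import pandas as pd

# Sample values chosen so the fractional parts are exactly representable.
arr = pd.arrays.SparseArray([0.0, 1.5, 2.25])

# numpy.modf has two outputs: the fractional part and the integral part.
frac, whole = np.modf(arr)

print(type(frac).__name__, type(whole).__name__)
print(np.asarray(frac).tolist())
print(np.asarray(whole).tolist())
```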
Build changes
- Fix install error with PyPy on macOS (GH26536)
ExtensionArray
- Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH25696).
- Bug where Series.count() miscounted NA values in ExtensionArrays (GH26835)
- Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to Series backed by extension arrays (GH23293).
- Keyword argument deep has been removed from ExtensionArray.copy() (GH27083)
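To give a feel for the Series.__array_ufunc__ entry above, here is a small sketch that applies a NumPy ufunc to a Series backed by the nullable Int64 extension array. The choice of np.abs and of the Int64 dtype is an assumption made for illustration, and the exact result dtype may differ on older pandas versions.

```python
import numpy as np
import pandas as pd

# A Series backed by an extension array (nullable integers) with one missing value.
s = pd.Series([-1, 2, None], dtype="Int64")

# The ufunc is applied to the Series directly; the result stays a Series.
result = np.abs(s)

print(type(result).__name__)
print(result.dtype)
print(result.isna().tolist())
```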
Other
- Removed unused C functions from vendored UltraJSON implementation (GH26198)
- Allow Index and RangeIndex to be passed to numpy min and max functions (GH26125)
- Use actual class name in repr of empty objects of a Series subclass (GH27001).
- Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH13287)
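Two of the entries above are easy to try out. The sketch below passes a RangeIndex and an Index to numpy's min and max functions, then prints the repr of an empty instance of a Series subclass. MySeries is a hypothetical subclass defined only for this example.

```python
import numpy as np
import pandas as pd

# Index objects can be handed straight to numpy's reduction functions (GH26125).
smallest = np.min(pd.RangeIndex(5))
largest = np.max(pd.Index([3, 1, 2]))


class MySeries(pd.Series):
    """Hypothetical subclass used only for this illustration."""


# The repr of an empty subclass instance uses the subclass name (GH27001).
empty_repr = repr(MySeries([], dtype="float64"))

print(smallest, largest)
print(empty_repr)
```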
Contributors
(Translator's note: not officially published)