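Pandas is one of the most popular data-processing libraries in Python, designed to provide efficient, flexible, and easy-to-use tools for data analysis. This article walks through its core features, including data structures, data operations, and visualization.
Pandas provides two core data structures: Series and DataFrame.
Series: a one-dimensional array, similar to a NumPy 1-D array, in which every value carries a label (the row label, or index).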

    import pandas as pd

    data = [1, 2, 3, 4, 5]
    s = pd.Series(data)  # default integer index 0..4
    print(s)

Output:

    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

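The Series above uses the default integer index, so the labels are easy to miss. The following is a minimal sketch of label-based access; the labels `a`, `b`, `c` and the values are made up for illustration and do not come from the original article.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]        # look up a value by its label
by_position = s.iloc[0]  # look up a value by its integer position
subset = s[["a", "c"]]   # select several labels at once

print(by_label)
print(by_position)
print(subset)
```

Keeping label access (`s["b"]`) and positional access (`s.iloc[0]`) separate avoids ambiguity when the labels themselves are integers.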
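DataFrame: a two-dimensional table, similar to an Excel sheet, whose columns can hold different data types, which makes the data easy to manipulate.

    import pandas as pd

    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Country': ['USA', 'Canada', 'UK']
    }
    df = pd.DataFrame(data)  # each dict key becomes a column
    print(df)

Output:

          Name  Age Country
    0    Alice   25     USA
    1      Bob   30  Canada
    2  Charlie   35      UK
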
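To make the "different data types per column" point concrete, here is a small sketch that inspects the same table. It only uses the data from the example above; the variable name `ages` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Country": ["USA", "Canada", "UK"],
})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # one dtype per column

# A single column of a DataFrame is itself a Series
ages = df["Age"]
print(type(ages))
```

The exact dtype shown for the text columns depends on the installed Pandas version, so no output is listed here.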
Install Pandas:

    pip install pandas

Import the library:

    import pandas as pd

Pandas can read data from many sources, including CSV and Excel files, and write it back out in the same formats.
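File-based examples need a file on disk to run. As a self-contained sketch of the same read/write round trip, the snippet below writes a small table to an in-memory text buffer and reads it back. The use of `io.StringIO` in place of a real file is an assumption made for illustration; `to_csv` and `read_csv` accept file-like objects as well as paths.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Mary", "Mark"],
    "Age": [25, 30, 35],
})

# Write CSV text into an in-memory buffer instead of a file (illustrative choice)
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row labels

# Rewind the buffer and parse the CSV text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` on the way out keeps the round trip clean: otherwise the row labels are written as an extra unnamed column.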
Read a CSV file:

    import pandas as pd

    df = pd.read_csv('data.csv')
    print(df)

Read an Excel file:

    import pandas as pd

    df = pd.read_excel('data.xlsx')
    print(df)

Write data to CSV and Excel files:

    import pandas as pd

    data = {
        'Name': ['John', 'Mary', 'Mark'],
        'Age': [25, 30, 35],
        'Country': ['USA', 'Canada', 'UK']
    }
    df = pd.DataFrame(data)
    df.to_csv('data.csv', index=False)     # index=False omits the row labels
    df.to_excel('data.xlsx', index=False)  # needs an Excel engine such as openpyxl

Select columns and rows:

    import pandas as pd

    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Country': ['USA', 'Canada', 'UK']
    }
    df = pd.DataFrame(data)
    print(df['Name'])            # select a single column
    print(df[['Name', 'Age']])   # select multiple columns
    print(df.loc[0])             # select a row by label
    print(df.loc[[0, 2]])        # select multiple rows by label
    print(df[df['Age'] > 30])    # filter rows by condition

Slice and filter:

    import pandas as pd

    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Country': ['USA', 'Canada', 'UK']
    }
    df = pd.DataFrame(data)
    print(df.iloc[1:3, :])       # slice rows 1 and 2 by position, all columns
    print(df[df['Age'] > 30])    # filter rows by condition

Handle missing values:

    import pandas as pd
    import numpy as np

    data = {
        'Name': ['Alice', np.nan, 'Charlie'],
        'Age': [25, np.nan, 35],
        'Country': ['USA', 'Canada', np.nan]
    }
    df = pd.DataFrame(data)
    print(df.isnull())           # True where a value is missing
    df_filled = df.fillna(0)     # replace every missing value with 0
    print(df_filled)

Sort and rank:

    import pandas as pd

    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Country': ['USA', 'Canada', 'UK']
    }
    df = pd.DataFrame(data)
    df_sorted = df.sort_values('Age')   # ascending by default
    print(df_sorted)
    df['Rank'] = df['Age'].rank()       # 1.0 for the smallest Age
    print(df)

Group and aggregate:

    import pandas as pd

    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Country': ['USA', 'Canada', 'UK']
    }
    df = pd.DataFrame(data)
    grouped = df.groupby('Country')
    agg_result = grouped['Age'].mean()  # mean Age per country
    print(agg_result)

Combined with Matplotlib, Pandas also provides data visualization.
Line chart:

    import pandas as pd
    import matplotlib.pyplot as plt

    data = {
        'Year': [2010, 2011, 2012, 2013, 2014],
        'Sales': [100, 200, 150, 300, 250]
    }
    df = pd.DataFrame(data)
    df.plot(x='Year', y='Sales', kind='line')
    plt.show()

Bar chart:

    import pandas as pd
    import matplotlib.pyplot as plt

    data = {
        'Year': [2010, 2011, 2012, 2013, 2014],
        'Sales': [100, 200, 150, 300, 250]
    }
    df = pd.DataFrame(data)
    df.plot(x='Year', y='Sales', kind='bar')
    plt.show()

Time series resampling:

    import pandas as pd

    dates = pd.date_range('2023-01-01', '2023-01-10')
    data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], index=dates)
    # 'M' is the month-end frequency; pandas 2.2+ prefers the alias 'ME'
    monthly_data = data.resample('M').sum()
    print(monthly_data)

Concatenate DataFrames:

    import pandas as pd

    data1 = {
        'Name': ['Alice', 'Bob'],
        'Age': [25, 30]
    }
    df1 = pd.DataFrame(data1)
    data2 = {
        'Name': ['Charlie', 'Dave'],
        'Age': [35, 40]
    }
    df2 = pd.DataFrame(data2)
    # Rows are stacked; pass ignore_index=True to renumber the index
    df_merged = pd.concat([df1, df2])
    print(df_merged)

Pivot table:

    import pandas as pd

    data = {
        'Product': ['Phone', 'Laptop', 'Phone', 'Laptop'],
        'Price': [100, 900, 120, 1100],
        'Sales': [50, 200, 60, 300]
    }
    df = pd.DataFrame(data)
    # Sum Price and Sales for each product
    pivot_table = pd.pivot_table(df, values=['Price', 'Sales'],
                                 index='Product', aggfunc='sum')
    print(pivot_table)

A worked example: analyzing sales data.

    import pandas as pd

    df = pd.read_csv('sales_data.csv')
    print(df.head())                            # first five rows
    print(df.info())                            # column types and non-null counts
    print(df[['Sales', 'Profit']].describe())   # summary statistics

    # Total sales and profit per category
    category_sales_profit = df.groupby('Category')[['Sales', 'Profit']].sum()
    print(category_sales_profit)

    # Total sales and profit per calendar month
    df['OrderDate'] = pd.to_datetime(df['OrderDate'])
    df['Month'] = df['OrderDate'].dt.month
    monthly_sales_profit = df.groupby('Month')[['Sales', 'Profit']].sum()
    print(monthly_sales_profit)

The example assumes that `sales_data.csv` contains data like the following:

| OrderDate | Category | Sales | Profit |
|---|---|---|---|
| 2021-01-01 | Electronics | 100 | 10 |
| 2021-01-02 | Fashion | 200 | 20 |
| 2021-01-03 | Electronics | 150 | 15 |
| 2021-02-01 | Fashion | 300 | 30 |
| 2021-02-02 | Clothing | 250 | 25 |
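
Since `sales_data.csv` is not included with the article, the sketch below rebuilds the five rows of the table in memory and runs the same per-category and per-month aggregations. Treating the table as the full contents of the file is an assumption; the variable names `category_totals` and `monthly_totals` are introduced here for illustration.

```python
import pandas as pd

# The five rows from the table above, built in memory instead of read from disk
df = pd.DataFrame({
    "OrderDate": ["2021-01-01", "2021-01-02", "2021-01-03",
                  "2021-02-01", "2021-02-02"],
    "Category": ["Electronics", "Fashion", "Electronics",
                 "Fashion", "Clothing"],
    "Sales": [100, 200, 150, 300, 250],
    "Profit": [10, 20, 15, 30, 25],
})

# Total sales and profit per category
category_totals = df.groupby("Category")[["Sales", "Profit"]].sum()
print(category_totals)

# Total sales and profit per calendar month
df["OrderDate"] = pd.to_datetime(df["OrderDate"])
df["Month"] = df["OrderDate"].dt.month
monthly_totals = df.groupby("Month")[["Sales", "Profit"]].sum()
print(monthly_totals)
```

With these rows, Electronics totals 250 in sales and 25 in profit, Fashion 500 and 50, and Clothing 250 and 25; January totals 450 and 45, and February 550 and 55.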

Reposted from: http://dvvfk.baihongyu.com/