Getting Started with pandas Data Processing (1)

2017-09-12

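1. What is pandas

pandas is built on top of NumPy. It provides data structures and tools for data analysis, with support for handling missing values, convenient grouping and filtering, and summary and statistical analysis.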

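2. What pandas can do

Handle missing data
Filter, group, and sort data
Perform mathematical computations
Run statistical analysis
Import and export data from multiple sources (files, databases)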
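
As a rough illustration of the capabilities listed above, the sketch below runs a few of them on a tiny table. The column names (`city`, `sales`) and all values are made up for demonstration and do not come from any real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are invented for this sketch.
df = pd.DataFrame({
    "city": ["A", "B", "A", "B"],
    "sales": [10.0, np.nan, 30.0, 20.0],
})

# Handle missing data: replace NaN in the "sales" column with 0.
filled = df.fillna({"sales": 0.0})

# Filter rows with a condition, then sort by a column in descending order.
big = filled[filled["sales"] > 5].sort_values("sales", ascending=False)

# Group and aggregate: total sales per city.
totals = filled.groupby("city")["sales"].sum()

print(big)
print(totals)
```

Each step returns a new object and leaves `df` unchanged, which makes it easy to inspect intermediate results.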

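3. Main data structures: Series and DataFrame

3.1 The Series data structure

A Series is a one-dimensional, array-like structure made up of a sequence of values and an associated index.

A simple Series can be created from a list; an integer index is generated by default.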

import pandas as pd
obj = pd.Series([2, 8, -5, 9])

print(obj)

# Print the values as an array
print(obj.values)

# Print the Series index
print(obj.index)

Output

0    2
1    8
2   -5
3    9
dtype: int64

# Values as an array
[ 2  8 -5  9]

# Index
RangeIndex(start=0, stop=4, step=1)
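
Since the values of a Series are stored as a NumPy array, mathematical operations apply element by element and keep the index. A minimal sketch, reusing the same list as above:

```python
import numpy as np
import pandas as pd

obj = pd.Series([2, 8, -5, 9])

# Arithmetic is applied element-wise; the index is preserved.
doubled = obj * 2

# Reductions collapse the Series to a single value.
total = obj.sum()

# NumPy functions accept a Series and return a Series.
absolute = np.abs(obj)

print(doubled)
print(total)
print(absolute)
```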

A Series can use meaningful labels as its index. The example below uses a, b, c, d as the index.

obj_2 = pd.Series([4, 8, -5, 2], index=['a', 'b', 'c', 'd'])

# Filter with a boolean condition
print(obj_2[obj_2 > 0])

# Modify a value through its index label
obj_2['c'] = 10

print(obj_2[['a', 'c', 'd']])

# Use `in` to check whether an index label exists
print('a' in obj_2)

Output

# Result of the filter (> 0)
a    4
b    8
d    2
dtype: int64

# Result after modifying the value at index 'c'
a     4
c    10
d     2
dtype: int64

# Result of the index membership check
True
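
The filtering and membership checks above can be broken into smaller steps to see what each expression produces. This is only a sketch; the variable names `mask`, `positives`, and `subset` are introduced here for illustration.

```python
import pandas as pd

obj_2 = pd.Series([4, 8, -5, 2], index=['a', 'b', 'c', 'd'])

# The comparison yields a boolean Series that shares obj_2's index.
mask = obj_2 > 0

# Indexing with the mask keeps only the entries where it is True.
positives = obj_2[mask]

# .loc makes label-based selection explicit.
subset = obj_2.loc[['a', 'd']]

# `in` tests index labels, not stored values.
has_label = 'a' in obj_2   # 'a' is an index label
has_value = 4 in obj_2     # 4 is a value, not a label

print(mask)
print(positives)
print(subset)
print(has_label, has_value)
```

Note that `in` looks only at the index, so checking for a stored value requires something like `obj_2.isin([4]).any()` instead.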

A Series can be created from a dict; when only the dict is passed, its keys become the index.

my_dict = {1: 4, 'a': 2, 'c': 'hello'}
obj_3 = pd.Series(my_dict)
print(obj_3)
print(obj_3[1])
print(obj_3['a'])

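When both a dict and an index are passed, only the listed labels are kept, and labels missing from the dict are filled with NaN to mark missing values.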

obj_3 = pd.Series(my_dict, index=[1, 'b'])
print(obj_3)
print(obj_3.isnull())
print(obj_3.notnull())

Output

# Series created from a dict and an index
1    4.0
b    NaN
dtype: float64

# Result of isnull
1    False
b     True
dtype: bool

# Result of notnull
1     True
b    False
dtype: bool
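
Once missing values are marked with NaN, `notnull` can serve as a filter and `fillna` can substitute a default. The sketch below uses a hypothetical all-numeric dict (`scores`) rather than the mixed dict above, so that the resulting dtype is unambiguous.

```python
import pandas as pd

# Hypothetical data: every value is an integer.
scores = {'a': 1, 'b': 2, 'c': 3}

# 'x' is not a key of the dict, so its value becomes NaN;
# 'c' is not in the index, so it is dropped.
s = pd.Series(scores, index=['a', 'b', 'x'])

# Keep only the entries that have a value.
present = s[s.notnull()]

# Replace the missing entries with a default value.
filled = s.fillna(0)

print(s)
print(present)
print(filled)
```

The integers are converted to floats here because NaN is a floating-point value.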

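3.2 The DataFrame data structure

A DataFrame is a tabular, two-dimensional data structure. It holds an ordered collection of columns, each of which can store a different value type (strings, numbers, booleans), and it has both a row index and a column index.

3.2.1 Creating a DataFrame

A DataFrame can be built from a dict: the dict keys become the column labels, and a row index is added automatically by default.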

data = {'one': [1, 2, 3],
        'two': [4, 5, 6],
        'three': [7, 8, 9]}
print(pd.DataFrame(data))

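Output (from an older pandas release, which sorted the dict keys alphabetically; recent releases keep the insertion order one, two, three)

   one  three  two
0    1      7    4
1    2      8    5
2    3      9    6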
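
To control the column order and the row labels explicitly rather than rely on defaults, the `columns` and `index` arguments can be passed to the constructor. A minimal sketch; the row labels `r1` to `r3` are hypothetical.

```python
import pandas as pd

data = {'one': [1, 2, 3],
        'two': [4, 5, 6],
        'three': [7, 8, 9]}

# columns fixes the column order; index supplies the row labels
# (r1..r3 are invented labels for this example).
df = pd.DataFrame(data,
                  columns=['three', 'one', 'two'],
                  index=['r1', 'r2', 'r3'])

print(df)

# A single cell can be read by row label and column label.
print(df.loc['r2', 'two'])
```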

A DataFrame can also be built from a nested dict: the outer keys become the columns and the inner keys become the row index.

data = {'one': {'row1': 1, 'row2': 2},
        'two': {'row1': 1, 'row2': 2},
        'three': {'row1': 1, 'row2': 2, 'row3': 3},
       }

print(pd.DataFrame(data, columns=['one', 'two', 'three']))

Output

      one  two  three
row1  1.0  1.0      1
row2  2.0  2.0      2
row3  NaN  NaN      3
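
In the output above, the columns that received NaN are shown as floats while the complete column stays integer. The sketch below inspects this with the same nested dict by printing each column's dtype and counting its missing values.

```python
import pandas as pd

data = {'one': {'row1': 1, 'row2': 2},
        'two': {'row1': 1, 'row2': 2},
        'three': {'row1': 1, 'row2': 2, 'row3': 3}}

# The row index is the union of all inner keys.
df = pd.DataFrame(data)

# Columns with a missing row3 are stored as floats so they can hold NaN.
print(df.dtypes)

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```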

3.2.2 Getting a Series and an array from a DataFrame

A column can be extracted from a DataFrame as a Series using dict-style subscripting.

# Get a Series from the DataFrame
df = pd.DataFrame(data, columns=['one', 'two', 'three'])
print(df['one'])
print(type(df['one']))

# Get an array from the DataFrame
print(type(df.values))
print(df.values)

Output

# DataFrame => Series
row1    1.0
row2    2.0
row3    NaN
Name: one, dtype: float64
<class 'pandas.core.series.Series'>

# DataFrame => array
<class 'numpy.ndarray'>
[[  1.   1.   1.]
 [  2.   2.   2.]
 [ nan  nan   3.]]
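
Besides columns, a single row can also be pulled out as a Series with `.loc`, and `to_numpy()` is the accessor that newer pandas documentation suggests in place of `.values`. A short sketch under the assumption that the same nested dict is used:

```python
import numpy as np
import pandas as pd

data = {'one': {'row1': 1, 'row2': 2},
        'two': {'row1': 1, 'row2': 2},
        'three': {'row1': 1, 'row2': 2, 'row3': 3}}
df = pd.DataFrame(data, columns=['one', 'two', 'three'])

# A column is a Series indexed by the row labels.
col = df['one']

# A row is a Series indexed by the column labels.
row = df.loc['row1']

# The whole table as a 2-D NumPy array; mixed int/float columns
# are upcast to a common float dtype.
arr = df.to_numpy()

print(col)
print(row)
print(arr.shape, arr.dtype)
```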