pandas Basics

2017. 8. 7. 13:59

API documentation: http://pandas.pydata.org/pandas-docs/stable/

1. Two data types

1) DataFrame: the whole dataset, i.e. rectangular (tabular) data made of rows and columns

2) Series: a single column of a DataFrame

=> data_df = data_series.to_frame() (converts a Series to a DataFrame)
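A minimal sketch of the two types. The table and its column names (name, score) are made up for illustration and are not from the original notes.

```python
import pandas as pd

# Hypothetical data: two columns, three rows
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, 2.5, 4.0]})

# Selecting one column gives a Series
data_series = df["score"]

# to_frame() turns the Series back into a one-column DataFrame
data_df = data_series.to_frame()

print(type(data_series).__name__)  # Series
print(type(data_df).__name__)      # DataFrame
print(data_df.shape)               # (3, 1)
```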

2. Basic usage

# import the pandas library (pd is the alias used in the rest of these notes)

import pandas as pd

# read a tab-separated file into a DataFrame

df = pd.read_csv('/tmp/test1.csv', sep='\t')

print(df.head())

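# print the type (DataFrame)

print(type(df))

# check the number of rows and columns

print(df.shape)

# print the column names

print(df.columns)

# print each column's dtype (int, float, ...)

print(df.dtypes)

# info() prints its summary itself and returns None, so no print() is needed

df.info()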
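The same inspection steps can be tried without a file on disk. This is a sketch that assumes a small tab-separated text held in memory; the columns id, name and score are hypothetical stand-ins for the contents of /tmp/test1.csv.

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for /tmp/test1.csv
raw = "id\tname\tscore\n1\tt1\t0.5\n2\tt2\t1.5\n"
df = pd.read_csv(io.StringIO(raw), sep="\t")

print(df.shape)          # (2, 3): 2 rows, 3 columns
print(list(df.columns))  # ['id', 'name', 'score']
print(df.dtypes)         # id is int64, score is float64
df.info()                # prints the summary directly
```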

3. Reading and writing data by format

1) pickle

- write

data.to_pickle('/tmp/test1.pickle')

- read

pd.read_pickle('/tmp/test1.pickle')
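A pickle round trip can be sketched as follows. The temporary directory and the sample frame are assumptions made so the example does not touch /tmp/test1.pickle.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with one missing value
data = pd.DataFrame({"A": [1, 2], "B": [0.5, None]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test1.pickle")
    data.to_pickle(path)              # write
    restored = pd.read_pickle(path)   # read

# Values, dtypes and index come back as they were written
print(restored.equals(data))
```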

2) csv

- write

data.to_csv('/tmp/test1.csv') or data.to_csv('/tmp/test1.csv', sep='\t')

data.to_csv('/tmp/test1.csv', index=False)

- read

pd.read_csv('/tmp/test1.csv')
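What index=False changes is easiest to see on the CSV text itself. In this sketch to_csv() is called without a path, so it returns the text instead of writing a file; the sample frame is hypothetical.

```python
import io
import pandas as pd

data = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# Default: the row index is written as an unnamed first column
with_index = data.to_csv()
# index=False writes only the data columns
without_index = data.to_csv(index=False)

print(with_index.splitlines()[0])     # ,A,B
print(without_index.splitlines()[0])  # A,B

# Reading the index-free text back gives the original columns
restored = pd.read_csv(io.StringIO(without_index))
print(list(restored.columns))         # ['A', 'B']
```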

3) excel

- write

data.to_excel('/tmp/test1.xls')

data.to_excel('/tmp/test1.xls', sheet_name='test1', index=False)

- read

pd.read_excel('/tmp/test1.xls')

4) Others

to_clipboard

to_dense

to_dict

to_gbq

to_hdf

to_msgpack

to_html

to_json

to_records

to_string

to_sparse

to_sql

to_stata

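4. read_csv options

When data is read from a file, missing values are represented as NaN (shown as NaN, NAN, or nan)

Check whether a value is missing

if not pd.notnull(value):  # value is a placeholder; same as pd.isnull(value)

Find rows with missing data

df.loc[df['column_name'].isnull(), :]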
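A small sketch of both checks, assuming a hypothetical frame whose value column has one missing entry:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["t1", "t2", "t3"], "value": [1.0, np.nan, 3.0]})

# Scalar checks
print(pd.isnull(np.nan))  # True
print(pd.notnull(1.0))    # True

# Boolean mask on one column selects the rows where it is missing
missing_rows = df.loc[df["value"].isnull(), :]
print(missing_rows.index.tolist())  # [1]
```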

import pandas as pd

pd.read_csv('/tmp/test1.csv')

====================

1 t1 t2

2 t1 NaN

pd.read_csv('/tmp/test1.csv', keep_default_na=False)

====================

1 t1 t2

2 t1
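The two outputs above can be reproduced from an in-memory CSV. The header names id, c1 and c2 are assumptions, since the original file contents are not shown.

```python
import io
import pandas as pd

# Hypothetical file contents: the last field of the second row is empty
raw = "id,c1,c2\n1,t1,t2\n2,t1,\n"

default_df = pd.read_csv(io.StringIO(raw))
raw_df = pd.read_csv(io.StringIO(raw), keep_default_na=False)

# Default: the empty field becomes NaN
print(default_df["c2"].isnull().tolist())  # [False, True]

# keep_default_na=False: the empty field stays an empty string
print(repr(raw_df.loc[1, "c2"]))           # ''
```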

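5. Working with DataFrame data

Set index labels: df.index = [0, 1, 2, 3]

Set column names: df.columns = ['A', 'B', 'C', 'D']

1) pd.concat([df1, df2, df3]): concatenate row-wise

2) df.append(df2): add a single row (DataFrame.append was removed in pandas 2.0; the pd.concat form below does the same)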

new_df = pd.DataFrame([['a', 'b', 'c', 'd']], columns=['A', 'B', 'C', 'D'])

pd.concat([df1, new_df])

3) Add new data, ignoring its row index

pd.concat([df1, new_df], ignore_index=True)  # before pandas 2.0: df1.append(new_df, ignore_index=True)
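The effect of ignore_index on the row labels, sketched with two made-up frames and pd.concat (which works on both old and new pandas versions):

```python
import pandas as pd

df1 = pd.DataFrame([["a1", "b1"], ["a2", "b2"]], columns=["A", "B"])
new_df = pd.DataFrame([["a3", "b3"]], columns=["A", "B"])

# Original row labels are kept, so label 0 appears twice
kept = pd.concat([df1, new_df])
print(kept.index.tolist())        # [0, 1, 0]

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([df1, new_df], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```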

4) add columns

axis is the direction in the data table: axis=0 (default) adds rows, axis=1 adds columns

pd.concat([df1, df2, df3], axis=1, ignore_index=True): concatenate column-wise (with ignore_index=True the column labels are renumbered 0, 1, 2, ...)

con_concat['new1'] = ['a', 'b', 'c', 'd']

con_concat['new1'] = pd.Series( ['n1', 'n2', 'n3', 'n4'] )
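A sketch of the column-wise operations with hypothetical one-column frames. It also shows one difference between the two assignment forms: a list must match the row count, while a Series is aligned on the index and leaves NaN where it has no value.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]})
df2 = pd.DataFrame({"B": [4, 5, 6]})

col_concat = pd.concat([df1, df2], axis=1)
print(col_concat.columns.tolist())  # ['A', 'B']

# ignore_index=True with axis=1 renumbers the column labels
numbered = pd.concat([df1, df2], axis=1, ignore_index=True)
print(numbered.columns.tolist())    # [0, 1]

# A list is assigned by position and must have one value per row
col_concat["new1"] = ["a", "b", "c"]

# A Series is aligned on the index; row 2 has no match and becomes NaN
col_concat["new2"] = pd.Series(["n1", "n2"])
print(col_concat["new2"].isnull().tolist())  # [False, False, True]
```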

5) join

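left, right, outer, inner (pd.concat accepts only join='inner' or join='outer'; merge accepts all four through how=, default 'inner')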

pd.concat( [df1, df2], ignore_index=False, join='inner' )

pd.concat( [df1, df2], axis=1, join='inner' )

new_df = test1.merge(test2, left_on='test1_name', right_on='test2_name')

new_df = test1.merge(test2, left_on=['A', 'B', 'C', 'D'], right_on=['A', 'B', 'C', 'D']) # multi-column join
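A sketch comparing the default inner merge with a left merge. The frames and the value columns v1 and v2 are hypothetical; only the key column names follow the notes above.

```python
import pandas as pd

test1 = pd.DataFrame({"test1_name": ["x", "y", "z"], "v1": [1, 2, 3]})
test2 = pd.DataFrame({"test2_name": ["y", "z", "w"], "v2": [20, 30, 40]})

# Default how='inner': only keys present on both sides survive
inner = test1.merge(test2, left_on="test1_name", right_on="test2_name")
print(inner["test1_name"].tolist())  # ['y', 'z']

# how='left': every row of test1 is kept, unmatched rows get NaN
left = test1.merge(test2, how="left", left_on="test1_name", right_on="test2_name")
print(len(left))                     # 3
print(left["v2"].isnull().tolist())  # [True, False, False]
```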

6) groupby

test2 = test1.groupby(['year'])['month'].mean()

test1.groupby('year')['month'].unique().to_frame()  # unique() is defined on a grouped column, so select the column first
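Both groupby calls on a tiny made-up table with the year and month columns used above:

```python
import pandas as pd

test1 = pd.DataFrame({"year": [2015, 2015, 2016, 2016], "month": [1, 3, 4, 4]})

# Mean of month per year: the result is a Series indexed by year
test2 = test1.groupby(["year"])["month"].mean()
print(test2.to_dict())  # {2015: 2.0, 2016: 4.0}

# Distinct month values per year, as a one-column DataFrame
uniq = test1.groupby("year")["month"].unique().to_frame()
print(uniq.loc[2015, "month"].tolist())  # [1, 3]
print(uniq.loc[2016, "month"].tolist())  # [4]
```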

7) reindex

test2.reindex(range(2010, 2017))
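reindex keeps the values whose labels exist and inserts NaN for the new labels. A sketch with a hypothetical Series indexed by year:

```python
import pandas as pd

# Hypothetical per-year result that only covers 2015 and 2016
test2 = pd.Series([2.0, 4.0], index=[2015, 2016])

full = test2.reindex(range(2010, 2017))
print(full.index.tolist())  # [2010, 2011, 2012, 2013, 2014, 2015, 2016]
print(full.isnull().sum())  # 5
print(full[2016])           # 4.0
```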

8) missing data count

test1.shape[0] - test1.count()

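9) replace missing data with another value

test1.fillna(0) => fill NaN with 0

test1.ffill() => fill with the previous value in the same column (older form test1.fillna(method='ffill'), deprecated since pandas 2.1)

test1.bfill() => fill with the next value in the same column (older form test1.fillna(method='bfill'), deprecated since pandas 2.1)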
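Counting and filling missing values, sketched on a hypothetical two-column frame. ffill and bfill work down the rows of each column, not across columns.

```python
import numpy as np
import pandas as pd

test1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, np.nan]})

# Missing count per column: total rows minus non-missing rows
print((test1.shape[0] - test1.count()).to_dict())  # {'A': 1, 'B': 2}

print(test1.fillna(0)["A"].tolist())  # [1.0, 0.0, 3.0]
print(test1.ffill()["A"].tolist())    # [1.0, 1.0, 3.0]
# The last NaN in B has no later value, so bfill leaves it missing
print(test1.bfill()["B"].tolist())    # [5.0, 5.0, nan]
```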

10) drop rows that contain NaN

test1.dropna()

11) compute while skipping NaN (skipna=True is the default)

test1.sum(skipna=True)
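A sketch of dropna and skipna on a hypothetical frame in which only the first row is complete:

```python
import numpy as np
import pandas as pd

test1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# dropna() removes every row that has at least one NaN
print(test1.dropna().index.tolist())      # [0]

# skipna=True ignores NaN in the sum
print(test1.sum(skipna=True).to_dict())   # {'A': 4.0, 'B': 9.0}

# skipna=False lets NaN propagate into the result
print(test1.sum(skipna=False).isnull().tolist())  # [True, True]
```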

12) print head / tail

test1.head(n=10), test1.tail(n=10)

13) data analysis

df.describe(): shows count, mean, std, min, 25%, 50%, 75%, max

df.corr(): correlation between columns, Pearson by default ( df.corr(method="spearman") or df.corr(method="kendall") )
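A quick sketch of both calls. The columns x and y are hypothetical, with y chosen as exactly 2 * x so the correlation is 1 up to floating-point rounding.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

# Row labels of the summary table
print(df.describe().index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Correlation matrix; y is a linear function of x
pearson = df.corr().loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
print(round(pearson, 6), round(spearman, 6))
```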

14) map

Replaces the values of a column (Series) according to a mapping

df['test'] = df['test'].map({0: 'SET', 1: 'VER', 2: 'ER'})
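A sketch of map with the same dictionary. The input values are made up; the value 3 is included to show that keys missing from the dictionary become NaN.

```python
import pandas as pd

df = pd.DataFrame({"test": [0, 1, 2, 3]})

df["test"] = df["test"].map({0: "SET", 1: "VER", 2: "ER"})

# 3 is not a key of the mapping, so it turns into NaN
print(df["test"].tolist())  # ['SET', 'VER', 'ER', nan]
```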

15) apply

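Works on both DataFrames and Series. On a Series the function receives one value at a time; on a DataFrame it receives a whole column (or a row with axis=1), so a per-value test has to be applied to a single column

df['test1'] = df['A'].apply(lambda v: 1 if v >= 1.3 else 0)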
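The difference between Series.apply and DataFrame.apply, sketched on a hypothetical column A:

```python
import pandas as pd

df = pd.DataFrame({"A": [0.5, 1.3, 2.0]})

# Series.apply: the lambda sees one value at a time
df["test1"] = df["A"].apply(lambda v: 1 if v >= 1.3 else 0)
print(df["test1"].tolist())  # [0, 1, 1]

# DataFrame.apply: the lambda sees a whole column as a Series
col_max = df[["A"]].apply(lambda col: col.max())
print(col_max.to_dict())     # {'A': 2.0}
```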

16) applymap

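Applies a function to every cell (needs import numpy as np here; in pandas 2.1 and later the same method is called DataFrame.map and applymap is deprecated)

df.applymap(lambda x: np.log(x) if isinstance(x, float) else x)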
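A version-tolerant sketch of the cell-wise call. Picking the method with getattr is an assumption made here so the example runs on pandas both before and after 2.1; the frame is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.e], "label": ["a", "b"]})

# Use DataFrame.map where it exists (pandas 2.1+), else fall back to applymap
elementwise = getattr(df, "map", None) or df.applymap

# Floats are replaced by their natural log, other cells are left alone
result = elementwise(lambda x: np.log(x) if isinstance(x, float) else x)

print(result["x"].round(6).tolist())  # [0.0, 1.0]
print(result["label"].tolist())       # ['a', 'b']
```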

6. DataFrame plot

test1.plot()

1) histograms

test1['A'].plot.hist()

test1['A'].plot.hist(alpha=0.5, bins=20)

2) Density plot

test1['A'].plot.kde()

3) Scatter plot

test1.plot.scatter(x='age', y='salary')  # scatter needs two columns, so call it on the DataFrame

4) Hexbin plot

test1.plot.hexbin(x='age', y='salary')

test1.plot.hexbin(x='age', y='salary', gridsize=10)

5) box plot

test1['A'].plot.box()
