
[Basics] Pandas Library Summary (1) / Python

sillon 2022. 5. 16. 15:54


1. Previewing data: head(), tail()

head() shows the first rows of the data and tail() shows the last rows.
Pass a number inside the parentheses to choose how many rows to show; the default is 5 rows.

import pandas as pd

df = pd.read_csv("~/auto-mpg.csv",header=None)
df.head()

18.0	8	307.0	130.0	3504.0	12.0	70	1	chevrolet chevelle malibu
15.0	8	350.0	165.0	3693.0	11.5	70	1	buick skylark 320
18.0	8	318.0	150.0	3436.0	11.0	70	1	plymouth satellite
16.0	8	304.0	150.0	3433.0	12.0	70	1	amc rebel sst
17.0	8	302.0	140.0	3449.0	10.5	70	1	ford torino

print(df.columns)

[Output]
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

Next, assign column names.
Alternatively, pass names=[...] as an option to read_csv() to set them while loading.

df.columns = ['mpg', 'cyclinders','displacement','horsepower','weight',
             'accerleration','model year','origin','name']
df.head()

18.0	8	307.0	130.0	3504.0	12.0	70	1	chevrolet chevelle malibu
15.0	8	350.0	165.0	3693.0	11.5	70	1	buick skylark 320
18.0	8	318.0	150.0	3436.0	11.0	70	1	plymouth satellite
16.0	8	304.0	150.0	3433.0	12.0	70	1	amc rebel sst
17.0	8	302.0	140.0	3449.0	10.5	70	1	ford torino
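The names= option mentioned above can be sketched as follows. Because auto-mpg.csv may not be at hand, this reads two rows from an in-memory string instead; csv_text, col_names, and df_named are illustrative names that do not appear in the original post.

```python
import io
import pandas as pd

# Two illustrative rows in the same layout as the file above (no header line).
csv_text = (
    "18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu\n"
    "15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320\n"
)

# Same labels (and spellings) as the df.columns assignment above.
col_names = ['mpg', 'cyclinders', 'displacement', 'horsepower', 'weight',
             'accerleration', 'model year', 'origin', 'name']

# header=None says the data has no header row; names= supplies the labels while reading.
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)
print(df_named.columns.tolist())
```

This gives the same labels as assigning df.columns afterwards, in a single step.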

df.tail()

27.0	4	140.0	86.00	2790.0	15.6	82	1	ford mustang gl
44.0	4	97.0	52.00	2130.0	24.6	82	2	vw pickup
32.0	4	135.0	84.00	2295.0	11.6	82	1	dodge rampage
28.0	4	120.0	79.00	2625.0	18.6	82	1	ford ranger
31.0	4	119.0	82.00	2720.0	19.4	82	1	chevy s-10
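As a small sketch of the row-count argument, the snippet below uses a hypothetical 10-row frame (demo is not part of the original data) to show the default and an explicit number of rows.

```python
import pandas as pd

# Hypothetical 10-row frame, used only to make the row counts easy to see.
demo = pd.DataFrame({'x': range(10)})

print(len(demo.head()))   # number of rows returned with no argument
print(demo.head(3))       # first 3 rows
print(demo.tail(2))       # last 2 rows, original index labels are kept
```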

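2. Checking summary information

2-1. info()

Shows the basic information of the DataFrame at once (class, number of rows and columns, non-null count and dtype of each column, memory usage).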

print(df.info())  # info() prints the summary itself and returns None, hence the trailing None below

[Output]
 	<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg              398 non-null float64
cyclinders       398 non-null int64
displacement     398 non-null float64
horsepower       398 non-null object
weight           398 non-null float64
accerleration    398 non-null float64
model year       398 non-null int64
origin           398 non-null int64
name             398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
None

2-2. shape

Shows the number of rows and columns as a (rows, columns) tuple.

df.shape

[Output]
(398, 9)
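A minimal sketch of how these two might be used in code, on a hypothetical 3-row frame (demo, buffer, and summary are illustrative names): shape is an ordinary tuple that can be unpacked, while info() writes text and returns None.

```python
import io
import pandas as pd

# Hypothetical frame used only for illustration.
demo = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# info() writes its report to a buffer (stdout by default) and returns None.
buffer = io.StringIO()
result = demo.info(buf=buffer)
summary = buffer.getvalue()
print(result)
print(summary)
```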

2-3. dtypes

Shows the data type of each variable (column).

df.dtypes

[Output]
mpg              float64
cyclinders         int64
displacement     float64
horsepower        object
weight           float64
accerleration    float64
model year         int64
origin             int64
name              object
dtype: object

You can also check the dtype of a single Series (here, the mpg column).

df.mpg.dtypes

[Output]
float64
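The sketch below shows dtypes on a hypothetical frame and one possible reason a numeric-looking column is not read as a number. The '?' marker is an assumption made for illustration; the original post does not say why horsepower is object.

```python
import pandas as pd

demo = pd.DataFrame({
    'clean': [130.0, 165.0, 150.0],
    'mixed': ['130.0', '?', '150.0'],  # hypothetical: one non-numeric marker
})

print(demo.dtypes)          # dtype of every column, as a Series
print(demo['clean'].dtype)  # dtype of a single column

# errors='coerce' turns values that cannot be parsed into NaN.
converted = pd.to_numeric(demo['mixed'], errors='coerce')
print(converted.dtype)
```

A single unparseable value is enough to keep the whole column non-numeric, so checking dtypes early is a cheap way to spot columns that need cleaning.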

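3. Checking basic statistics

3-1. Summary statistics: describe()

describe() provides the summary statistics of the data; by default only the numeric columns are included.
It reports the number of values (count), the mean, the standard deviation (std), the quartiles (25%, 50%, 75%), and the maximum and minimum (max, min).
The 50% quantile is, by definition, the median.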

df.describe()

[Output]
              mpg  cyclinders  displacement       weight  accerleration  \
count  398.000000  398.000000    398.000000   398.000000     398.000000   
mean    23.514573    5.454774    193.425879  2970.424623      15.568090   
std      7.815984    1.701004    104.269838   846.841774       2.757689   
min      9.000000    3.000000     68.000000  1613.000000       8.000000   
25%     17.500000    4.000000    104.250000  2223.750000      13.825000   
50%     23.000000    4.000000    148.500000  2803.500000      15.500000   
75%     29.000000    8.000000    262.000000  3608.000000      17.175000   
max     46.600000    8.000000    455.000000  5140.000000      24.800000   

       model year      origin  
count  398.000000  398.000000  
mean    76.010050    1.572864  
std      3.697627    0.802055  
min     70.000000    1.000000  
25%     73.000000    1.000000  
50%     76.000000    1.000000  
75%     79.000000    2.000000  
max     82.000000    3.000000
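The relation between the 50% row and the median can be checked directly. The values below are hypothetical and chosen only for this sketch.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset.
s = pd.Series([9.0, 17.5, 23.0, 29.0, 46.6])
stats = s.describe()

print(stats['50%'], s.median())        # the 50% row and the median agree
print(stats['25%'], s.quantile(0.25))  # the other rows are quantiles as well
```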

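Passing include='all' as an argument additionally reports information on the columns that hold strings:
the number of unique values (unique), the most frequent value (top), and its frequency (freq).
For numeric columns these three rows are NaN, and the numeric statistics are NaN for the string columns.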

df.describe(include='all')

[Output]
               mpg  cyclinders  displacement horsepower       weight  \
count   398.000000  398.000000    398.000000        398   398.000000   
unique         NaN         NaN           NaN         94          NaN   
top            NaN         NaN           NaN      150.0          NaN   
freq           NaN         NaN           NaN         22          NaN   
mean     23.514573    5.454774    193.425879        NaN  2970.424623   
std       7.815984    1.701004    104.269838        NaN   846.841774   
min       9.000000    3.000000     68.000000        NaN  1613.000000   
25%      17.500000    4.000000    104.250000        NaN  2223.750000   
50%      23.000000    4.000000    148.500000        NaN  2803.500000   
75%      29.000000    8.000000    262.000000        NaN  3608.000000   
max      46.600000    8.000000    455.000000        NaN  5140.000000   

        accerleration  model year      origin        name  
count      398.000000  398.000000  398.000000         398  
unique            NaN         NaN         NaN         305  
top               NaN         NaN         NaN  ford pinto  
freq              NaN         NaN         NaN           6  
mean        15.568090   76.010050    1.572864         NaN  
std          2.757689    3.697627    0.802055         NaN  
min          8.000000   70.000000    1.000000         NaN  
25%         13.825000   73.000000    1.000000         NaN  
50%         15.500000   76.000000    1.000000         NaN  
75%         17.175000   79.000000    2.000000         NaN  
max         24.800000   82.000000    3.000000         NaN
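A smaller version of the same call makes the extra rows easier to read. The four rows below are a hypothetical sample and demo is an illustrative name.

```python
import pandas as pd

# Hypothetical sample with one numeric and one string column.
demo = pd.DataFrame({
    'origin': [1, 1, 2, 3],
    'name': ['ford pinto', 'ford pinto', 'vw pickup', 'amc rebel sst'],
})

summary = demo.describe(include='all')

# unique/top/freq are filled only for the string column,
# mean only for the numeric column.
print(summary.loc[['count', 'unique', 'top', 'freq', 'mean']])
```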

3-2. Counting values: count()

count() returns the number of non-null values in each column as a Series.

df.count()

[Output]
mpg              398
cyclinders       398
displacement     398
horsepower       398
weight           398
accerleration    398
model year       398
origin           398
name             398
dtype: int64

print(type(df.count()))

[Output]
<class 'pandas.core.series.Series'>
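In the dataset above every column has 398 values, so count() matches the number of rows. The hypothetical frame below (demo is an illustrative name) shows that missing values are left out of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

print(len(demo))     # number of rows, missing values included
print(demo.count())  # non-null values per column
```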

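3-3. Frequency of each unique value: value_counts()

This method applies to a Series: it counts how many times each unique value occurs in a column and returns the result as another Series.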

df['origin'].value_counts()

[Output]
1    249
3     79
2     70
Name: origin, dtype: int64

That is, the value 1 occurs 249 times, 3 occurs 79 times, and 2 occurs 70 times.
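How to read the result can be sketched on a hypothetical six-element Series: the index holds the unique values and the data holds their frequencies, sorted from most to least frequent. normalize=True is an existing option that returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical values standing in for the origin column.
origin = pd.Series([1, 1, 1, 3, 2, 3], name='origin')

counts = origin.value_counts()
print(counts)                 # index = unique values, data = frequencies
print(counts.index.tolist())  # ordered by descending frequency

shares = origin.value_counts(normalize=True)
print(shares)                 # proportions instead of counts
```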

3-4. Computing statistics directly: mean(), median(), max(), min(), std()

describe() shows everything at once, but sometimes you want to pull out just one particular statistic.

These methods work on both a DataFrame and a Series (a single column).

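print(df.mean(numeric_only=True)) # applied to the DataFrame; numeric_only=True skips string columns (required on pandas 2.0+)
print('\n')
print(df['mpg'].mean()) # applied to a Series
print('\n')
print(df[['mpg','weight']].mean()) # two columns selected, i.e. applied to a DataFrame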

[Output]
mpg                23.514573
cyclinders          5.454774
displacement      193.425879
weight           2970.424623
accerleration      15.568090
model year         76.010050
origin              1.572864
dtype: float64

23.514572864321615

mpg         23.514573
weight    2970.424623
dtype: float64
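The other methods in the heading are called the same way. The sketch below uses only the mpg and weight values of the five rows shown by df.head() above; demo and table are illustrative names, and agg() is one possible way to collect several statistics in a single call.

```python
import pandas as pd

# mpg and weight of the five rows printed by df.head() above.
demo = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0, 17.0],
    'weight': [3504.0, 3693.0, 3436.0, 3433.0, 3449.0],
})

print(demo['mpg'].median())
print(demo['mpg'].max(), demo['mpg'].min())
print(demo['mpg'].std())  # sample standard deviation (ddof=1 by default)

# Several statistics at once: one row per statistic, one column per variable.
table = demo.agg(['mean', 'median', 'min', 'max', 'std'])
print(table)
```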

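3-5. Correlation coefficients (strings excluded): corr()

A correlation coefficient measures the linear association between two variables and takes a value between -1 and 1.
It rests on assumptions such as linearity, but it is still worth a look during EDA.
corr() builds a matrix over the non-string variables and returns the correlation coefficient of each pair.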

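df.corr()  # on pandas 2.0+ use df.corr(numeric_only=True); string columns are no longer dropped automatically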

[Output]
                    mpg  cyclinders  displacement    weight  accerleration  \
mpg            1.000000   -0.775396     -0.804203 -0.831741       0.420289   
cyclinders    -0.775396    1.000000      0.950721  0.896017      -0.505419   
displacement  -0.804203    0.950721      1.000000  0.932824      -0.543684   
weight        -0.831741    0.896017      0.932824  1.000000      -0.417457   
accerleration  0.420289   -0.505419     -0.543684 -0.417457       1.000000   
model year     0.579267   -0.348746     -0.370164 -0.306564       0.288137   
origin         0.563450   -0.562543     -0.609409 -0.581024       0.205873   

               model year    origin  
mpg              0.579267  0.563450  
cyclinders      -0.348746 -0.562543  
displacement    -0.370164 -0.609409  
weight          -0.306564 -0.581024  
accerleration    0.288137  0.205873  
model year       1.000000  0.180662  
origin           0.180662  1.000000

You can also compute the correlation coefficient for just two selected variables.

print('\n')
print(df[['mpg','weight']].corr())

[Output]
             mpg    weight
mpg     1.000000 -0.831741
weight -0.831741  1.000000
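To make clear what the number means, the sketch below computes the Pearson coefficient by hand and compares it with pandas. Pearson is the default method of corr(). The data are the mpg and weight values of the five rows shown by df.head() above, so the result will differ from the full-dataset value; demo, r_pandas, and r_manual are illustrative names.

```python
import pandas as pd

# mpg and weight of the five rows printed by df.head() above.
demo = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0, 17.0],
    'weight': [3504.0, 3693.0, 3436.0, 3433.0, 3449.0],
})

# Series.corr() returns the coefficient of one pair as a single float.
r_pandas = demo['mpg'].corr(demo['weight'])

# Pearson by hand: sum of products of deviations over the product of their norms.
dx = demo['mpg'] - demo['mpg'].mean()
dy = demo['weight'] - demo['weight'].mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(r_pandas, r_manual)
```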

Reference

Book: [파이썬 머신러닝 판다스 데이터 분석]

https://yganalyst.github.io/data_handling/Pd_5/
