[KT AIVLE 4th Cohort] EDA

EDA (Exploratory Data Analysis)

What is EDA?

Data analysis is carried out to find meaningful patterns in large datasets and to obtain the insights needed for decision-making.

It is also needed before building an AI model, to select the input features that are likely to help predict the target.

There are two broad approaches to data analysis: (1) EDA and (2) CDA (Confirmatory Data Analysis).

If CDA corresponds to inferential statistics, EDA can be seen as corresponding to descriptive statistics.

Approach | Data analysis process | Statistical technique
EDA | Collect data > Explore visually > Identify patterns > Discover insights | Descriptive statistics (summarizes the characteristics of the collected data)
CDA | Set a hypothesis > Collect data > Run statistical analysis > Test the hypothesis | Inferential statistics (infers the characteristics of the population from a sample)

Insights obtained through EDA can be set up as hypotheses for CDA, and the measure used to test them is the p-value (significance probability).

EDA process

  1. Understand and preprocess the data (check for missing values and outliers)

  2. Univariate analysis (check summary statistics and distributions)

  3. Bivariate analysis (check correlations between variables)


1. Understanding and preprocessing the data

When the dataset is large, looking only at the first or last rows is not enough; draw a random sample and inspect it.
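A minimal sketch of this first look, assuming a small made-up DataFrame named `df` (the column names and values are hypothetical and are not the dataset used in this post). It draws a random sample of rows and counts missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Income": rng.normal(70, 25, size=1000).round(1),
    "Urban": rng.choice(["Yes", "No"], size=1000),
})
df.loc[[3, 10, 500], "Income"] = np.nan  # inject a few missing values

# Inspect random rows instead of only head() / tail()
sample = df.sample(n=5, random_state=42)
print(sample)

# Count missing values per column
missing = df.isna().sum()
print(missing)
```

Fixing `random_state` only makes the sample reproducible; any seed would do.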

Univariate analysis also helps in understanding the data. The following measures can be used:

  • Center of the data: mean, median, mode

  • Dispersion of the data: range, variance

  • Shape of the distribution: skewness (skew), kurtosis

(Note that the mean is strongly affected by outliers, whereas the median remains representative even when outliers are present.)
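The measures above can be computed directly with pandas. The sketch below uses two made-up series (hypothetical values) that differ only in one extreme value, to show how the mean moves while the median stays put:

```python
import pandas as pd

values = pd.Series([10, 12, 12, 13, 15])          # hypothetical data
with_outlier = pd.Series([10, 12, 12, 13, 150])   # same data with one extreme value

for name, s in [("clean", values), ("with outlier", with_outlier)]:
    print(name)
    print("  mean    :", s.mean())
    print("  median  :", s.median())
    print("  mode    :", s.mode().tolist())
    print("  range   :", s.max() - s.min())
    print("  variance:", s.var())    # sample variance (ddof=1)
    print("  skewness:", s.skew())
    print("  kurtosis:", s.kurt())   # excess kurtosis (0 for a normal distribution)
```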


2. Univariate analysis

Summarize each variable with summary statistics in a table and visualize its distribution with plots.

Numeric variables

Summary statistics are obtained with describe( ).

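import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # built in when running in Jupyter

def eda_1_num(data, var, bins=30):
    """Show summary statistics, a histogram, and a box plot for a numeric column."""

    # Summary statistics
    print('<< Summary statistics >>')
    display(data[[var]].describe().T)
    print('=' * 100)

    # Visualization: histogram with KDE on top, box plot below
    print('<< Plots >>')
    plt.figure(figsize=(10, 6))

    plt.subplot(2, 1, 1)
    sns.histplot(data[var], bins=bins, kde=True)
    plt.grid()

    plt.subplot(2, 1, 2)
    sns.boxplot(x=data[var])
    plt.grid()
    plt.tight_layout()
    plt.show()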

Result

var = 'Income'
eda_1_num(data, var)


Categorical variables

Summary statistics (counts and proportions per category) are obtained with value_counts( ).

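import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # built in when running in Jupyter

def eda_1_cat(data, var):
    """Show counts and ratios per category, then draw a count plot."""
    t1 = data[var].value_counts()                  # frequency of each category
    t2 = data[var].value_counts(normalize=True)    # share of each category
    t3 = pd.concat([t1, t2], axis=1)
    t3.columns = ['count', 'ratio']
    display(t3)

    sns.countplot(x=var, data=data)
    plt.show()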

Result

var = 'ShelveLoc'
eda_1_cat(data, var)



3. Bivariate analysis


Numeric -> Numeric

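import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as spst

# Note: `data` must already exist as a DataFrame when this function is defined
def analyze(var, target, data=data):
    sns.scatterplot(x=var, y=target, data=data)
    plt.show()

    # sns.regplot(x=var, y=target, data=data)
    # plt.show()

    # pearsonr cannot handle NaN, so drop rows with missing values first
    temp = data[[var, target]].dropna()
    result = spst.pearsonr(temp[var], temp[target])
    print(f'Correlation coefficient : {result[0]}, p-value : {result[1]}')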

Result

analyze('Population', 'Sales')


As a rule of thumb, |correlation coefficient| > 0.5 indicates a strong correlation.

A p-value (significance probability) < 0.05 means the correlation coefficient is statistically significant.


Limitations of the correlation coefficient

  • It does not capture non-linear relationships

  • It does not tell you the slope of the line, only how closely the points follow one

=> Therefore, always look at the scatter plot as well
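Both limitations can be seen on synthetic data. The sketch below uses made-up values (not the dataset from this post): a perfect quadratic relationship gives a Pearson coefficient of essentially zero, while two exact lines with very different slopes both give a coefficient of 1.

```python
import numpy as np
from scipy import stats as spst

x = np.linspace(-3, 3, 61)  # symmetric around 0

# Perfect but non-linear dependence: r is numerically zero
r_quadratic, _ = spst.pearsonr(x, x ** 2)

# Two exact lines with very different slopes: r is 1 for both
r_flat, _ = spst.pearsonr(x, 0.1 * x)
r_steep, _ = spst.pearsonr(x, 10 * x)

print(f"y = x^2  : r = {r_quadratic:.3f}")
print(f"y = 0.1x : r = {r_flat:.3f}")
print(f"y = 10x  : r = {r_steep:.3f}")
```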

Staircase-shaped scatter plots

Here the correlation does not hold within each interval. In this case the numeric variable can be converted into categories and analyzed as a categorical variable. (pd.cut)
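A small sketch of that conversion with `pd.cut`. The data, the bin edges, and the labels are all illustrative assumptions; in practice the edges would be read off the steps visible in the scatter plot.

```python
import pandas as pd

# Hypothetical data with a step-like relationship between Age and Sales
df = pd.DataFrame({"Age": [23, 31, 38, 45, 52, 67, 74],
                   "Sales": [9.1, 9.4, 7.2, 7.0, 7.3, 5.1, 4.8]})

# Intervals are (0, 35], (35, 60], (60, 100] because right=True by default
df["AgeGroup"] = pd.cut(df["Age"],
                        bins=[0, 35, 60, 100],
                        labels=["young", "middle", "senior"])

# Compare the target across the new categories
group_means = df.groupby("AgeGroup", observed=True)["Sales"].mean()
print(group_means)
```

The new `AgeGroup` column can then be passed to a categorical -> numeric analysis.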


Categorical -> Numeric

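import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as spst

def analyze(var, target, data=data):
    sns.barplot(x=var, y=target, data=data)
    plt.show()

    # Drop rows where the category is missing, then take the categories
    # from the filtered data so that NaN is not treated as a group
    temp = data.loc[data[var].notnull()]
    cate = temp[var].unique()
    arg = []
    for i in cate:
        arg.append(temp.loc[temp[var] == i, target])

    ## t-test (two groups)
    if len(cate) == 2:
        result = spst.ttest_ind(arg[0], arg[1])
        print(f't-statistic : {result[0]}, p-value : {result[1]}')

    ## ANOVA (three or more groups)
    else:
        result = spst.f_oneway(*arg)
        print(f'F-statistic : {result[0]}, p-value : {result[1]}')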

Result

analyze('Urban', 'Sales')


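t-statistic and F-statistic

The t-statistic is the difference between the means of two groups divided by its standard error. The form below uses a separate variance for each group (Welch's form); spst.ttest_ind assumes equal variances by default (equal_var=True) and uses a pooled variance instead.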

\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]
  • ( $\bar{x}_1$ ) and ( $\bar{x}_2$ ): the means of the two samples
  • ( $s_1$ ) and ( $s_2$ ): the standard deviations of the two samples
  • ( $n_1$ ) and ( $n_2$ ): the sizes of the two samples

As a rule of thumb, |t-statistic| > 2 indicates that the two means differ.
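To make the formula concrete, here is a sketch that computes the t-statistic by hand on two made-up groups (hypothetical values) and compares it with SciPy. `equal_var=False` is passed so that `ttest_ind` uses the same unpooled standard error as the formula above.

```python
import numpy as np
from scipy import stats as spst

# Hypothetical target values for two groups (for example Urban = Yes / No)
a = np.array([7.1, 8.4, 6.9, 9.2, 7.7, 8.0])
b = np.array([6.2, 7.0, 5.8, 6.9, 6.4, 7.3, 6.1])

# Standard error of the mean difference, using sample variances (ddof=1)
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

# Welch's t-test: same statistic as the manual computation
t_scipy, p_value = spst.ttest_ind(a, b, equal_var=False)

print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p-value = {p_value:.4f}")
```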

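The F-statistic is the ratio of the between-group variance to the within-group variance.

It is mainly used in analysis of variance (ANOVA, ANalysis Of Variance) to test whether the differences in means among two or more groups are statistically significant.

\[F = \frac{\text{MSB (between-group variance)}}{\text{MSW (within-group variance)}} = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 / (k - 1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 / (N - k)}\]

Here $k$ is the number of groups, $n_i$ and $\bar{x}_i$ are the size and mean of group $i$, $\bar{x}$ is the overall mean, and $N$ is the total number of observations. The numerator compares each group mean with the overall mean; the denominator compares each individual value with its own group mean.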

[Figure: ANOVA illustration]

As a rule of thumb, an F-statistic of about 2 or more suggests that the group means differ.

A p-value (significance probability) < 0.05 means the t-statistic or F-statistic is statistically significant.
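The F-statistic can be reproduced by hand in the same way. The sketch below uses three made-up groups (hypothetical values), computes the between-group and within-group mean squares, and checks the ratio against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats as spst

# Hypothetical target values for three groups
groups = [np.array([5.1, 6.0, 5.5, 5.8]),
          np.array([7.2, 6.9, 7.8, 7.4]),
          np.array([9.0, 8.1, 8.7, 9.4])]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: group means vs. overall mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: individual values vs. their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ssb / (k - 1)) / (ssw / (n_total - k))   # MSB / MSW
f_scipy, p_value = spst.f_oneway(*groups)

print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p-value = {p_value:.6f}")
```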


Numeric -> Categorical

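import matplotlib.pyplot as plt
import seaborn as sns

feature = 'Age'
# `target` must already hold the name of the categorical target column.
# common_norm=False normalizes the density of each target class separately.
sns.kdeplot(x=feature, data=data, hue=target,
            common_norm=False)
plt.show()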


Categorical -> Categorical

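import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as spst
from statsmodels.graphics.mosaicplot import mosaic

def analyze(var, targ, data=data):
    mosaic(data, [var, targ])
    # Reference line at the overall rate; assumes the target is coded as 0/1
    plt.axhline(1 - data[targ].mean(), color='r')
    plt.show()

    # Cross-tabulation of the two categorical variables
    table = pd.crosstab(data[var], data[targ])
    print(f'Contingency table\n {table}')
    print('-' * 50)

    # Chi-square test of independence
    result = spst.chi2_contingency(table)
    print(f'Chi-square statistic : {result[0]}')
    print(f'p-value : {result[1]}')
    print(f'Degrees of freedom : {result[2]}')
    print(f'Expected frequencies\n {result[3]}')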

Result

analyze('MaritalStatus', 'Attrition')


Chi-square statistic (χ² statistic)

The chi-square statistic measures the difference between the frequencies that would be expected if the independent and dependent variables were unrelated and the frequencies actually observed.

It is used to test the independence of two categorical variables.

\[\chi^2 = \sum \frac{(O - E)^2}{E}\]

where:

  • ( O ): observed frequency
  • ( E ): expected frequency

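As a rule of thumb, a chi-square statistic larger than about twice the degrees of freedom (ν) suggests a difference, since the statistic is expected to be close to ν when the two variables are independent.

The degrees of freedom (ν) are ( number of categories of x - 1 ) × ( number of categories of y - 1 ).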
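As a worked example, the sketch below builds a made-up 3 x 2 contingency table (the counts and labels are hypothetical), derives the expected frequencies, the chi-square statistic, and the degrees of freedom by hand, and compares them with `scipy.stats.chi2_contingency`. `correction=False` is passed so that SciPy applies no continuity correction and the plain formula above is used.

```python
import numpy as np
import pandas as pd
from scipy import stats as spst

# Hypothetical contingency table: rows = categories of x, columns = categories of y
table = pd.DataFrame([[30, 10], [25, 15], [20, 20]],
                     index=["Single", "Married", "Divorced"],
                     columns=["Stayed", "Left"])

observed = table.to_numpy()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected frequency under independence: row total * column total / grand total
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2_scipy, p_value, dof_scipy, expected_scipy = spst.chi2_contingency(
    table, correction=False)

print(f"manual chi2 = {chi2_manual:.4f}, scipy chi2 = {chi2_scipy:.4f}")
print(f"degrees of freedom = {dof_manual}, p-value = {p_value:.4f}")
```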
