Sklearn preprocessing - PolynomialFeatures - How to keep column names / headers of the output array / dataframe

Question

TL;DR: How do I get headers for the numpy array returned by sklearn.preprocessing.PolynomialFeatures()?

Let's say I have the following code...

import pandas as pd
import numpy as np
from sklearn import preprocessing as pp

a = np.ones(3)
b = np.ones(3) * 2
c = np.ones(3) * 3

input_df = pd.DataFrame([a,b,c])
input_df = input_df.T
input_df.columns=['a', 'b', 'c']

input_df

    a   b   c
0   1   2   3
1   1   2   3
2   1   2   3

poly = pp.PolynomialFeatures(2)
output_nparray = poly.fit_transform(input_df)
print(output_nparray)

[[ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]
 [ 1.  1.  2.  3.  1.  2.  3.  4.  6.  9.]]

How can I get this 3x10 matrix (output_nparray) to carry over the a, b, c labels, showing how each output column relates to the input data above?

cross-validation python python-2.7 scikit-learn validation

by Afflatus

3 Answers

by Guiem Bosch · Accepted Answer · 2018-03-16 06:32:16

A working example, all in one line (I'm assuming "readability" is not a goal here):

target_feature_names = ['x'.join('{}^{}'.format(name, exponent) for name, exponent in pairs if exponent != 0) for pairs in (zip(input_df.columns, p) for p in poly.powers_)]
output_df = pd.DataFrame(output_nparray, columns = target_feature_names)
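
For readers who do want readability, the same idea can be unpacked into a small helper. This is only a sketch: the function name labels_from_powers and the sample_powers list are made up for illustration, with sample_powers being a hand-written stand-in for what poly.powers_ holds for three input columns and degree 2. It also labels the all-zero row "1" and joins terms with " x ", whereas the one-liner above leaves that label empty and joins with 'x'.

```python
def labels_from_powers(column_names, powers):
    """Build one label per exponent row, e.g. [1, 1, 0] -> 'a^1 x b^1'."""
    labels = []
    for row in powers:
        # Keep only the variables that actually appear (non-zero exponent).
        terms = ["{}^{}".format(name, exponent)
                 for name, exponent in zip(column_names, row)
                 if exponent != 0]
        # An all-zero row is the constant (bias) column; "1" is our own choice of label.
        labels.append(" x ".join(terms) if terms else "1")
    return labels


# Hand-written stand-in for poly.powers_ (3 input columns, degree 2).
sample_powers = [
    [0, 0, 0],
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [2, 0, 0], [1, 1, 0], [1, 0, 1],
    [0, 2, 0], [0, 1, 1], [0, 0, 2],
]

labels = labels_from_powers(["a", "b", "c"], sample_powers)
print(labels)
```

With a fitted transformer you would pass input_df.columns and poly.powers_ instead of the hard-coded lists.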

Update: as @OmerB pointed out, you can now use the get_feature_names method:

>> poly.get_feature_names(input_df.columns)
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']

by OmerB · Accepted Answer · 2017-08-08 09:23:59

scikit-learn 0.18 added a nifty get_feature_names() method!

>> input_df.columns
Index(['a', 'b', 'c'], dtype='object')

>> poly.fit_transform(input_df)
array([[ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  1.,  2.,  3.,  1.,  2.,  3.,  4.,  6.,  9.]])

>> poly.get_feature_names(input_df.columns)
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']

Note: you have to pass it the column names yourself, since sklearn doesn't read them from the DataFrame on its own.
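
To get from the list of names to a labeled DataFrame, something like the sketch below should work. One caveat that goes beyond this answer: to my knowledge, get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of get_feature_names_out(), so the sketch checks which of the two methods is available. The input frame is rebuilt here only to keep the example self-contained.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Same data as in the question: three rows of a=1, b=2, c=3.
input_df = pd.DataFrame({"a": [1.0] * 3, "b": [2.0] * 3, "c": [3.0] * 3})

poly = PolynomialFeatures(2)
output_nparray = poly.fit_transform(input_df)

column_names = list(input_df.columns)
if hasattr(poly, "get_feature_names_out"):
    # Newer scikit-learn releases.
    feature_names = list(poly.get_feature_names_out(column_names))
else:
    # Older releases (0.18 and later), as used in the answer above.
    feature_names = list(poly.get_feature_names(column_names))

# Attach the generated names and keep the original row index.
output_df = pd.DataFrame(output_nparray, columns=feature_names, index=input_df.index)
print(output_df)
```

Passing the column names explicitly keeps the call valid on both old and new versions.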

by Afflatus · Accepted Answer · 2016-04-19 20:05:25

This works:

def PolynomialFeatures_labeled(input_df,power):
    '''Basically this is a wrapper for the sklearn preprocessing function.
    The problem with that function is that if you give it a labeled dataframe, it outputs an unlabeled array with potentially
    a whole bunch of unlabeled columns.

    Inputs:
    input_df = Your labeled pandas dataframe (list of x's not raised to any power)
    power = the highest polynomial degree you want (use the same value you would pass to pp.PolynomialFeatures(power) directly)

    Output:
    A labeled pandas dataframe. The labels are built from the powers_ matrix, which is an attribute of the
    fitted PolynomialFeatures object.
    '''
    poly = pp.PolynomialFeatures(power)
    output_nparray = poly.fit_transform(input_df)
    powers_nparray = poly.powers_

    input_feature_names = list(input_df.columns)
    target_feature_names = ["Constant Term"]
    for feature_distillation in powers_nparray[1:]:
        intermediary_label = ""
        final_label = ""
        for i in range(len(input_feature_names)):
            if feature_distillation[i] == 0:
                continue
            else:
                variable = input_feature_names[i]
                exponent = feature_distillation[i]   # separate name so the `power` argument is not overwritten
                intermediary_label = "%s^%d" % (variable, exponent)
                if final_label == "":         #If the final label isn't yet specified
                    final_label = intermediary_label
                else:
                    final_label = final_label + " x " + intermediary_label
        target_feature_names.append(final_label)
    output_df = pd.DataFrame(output_nparray, columns = target_feature_names)
    return output_df

output_df = PolynomialFeatures_labeled(input_df,2)
output_df

    Constant Term   a^1 b^1 c^1 a^2 a^1 x b^1   a^1 x c^1   b^2 b^1 x c^1   c^2
0               1   1   2   3   1           2           3   4           6   9
1               1   1   2   3   1           2           3   4           6   9
2               1   1   2   3   1           2           3   4           6   9
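
On recent scikit-learn versions the wrapper can probably be much shorter. The sketch below assumes scikit-learn 1.2 or later for set_output(transform="pandas") and falls back to get_feature_names() on older releases. The function name polynomial_features_labeled is made up for this example, and the labels follow sklearn's own naming ('a b', 'a^2') rather than the "a^1 x b^1" style above.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures


def polynomial_features_labeled(input_df, degree):
    """Return the polynomial features of input_df as a labeled DataFrame."""
    poly = PolynomialFeatures(degree)
    if hasattr(poly, "set_output"):
        # Assumed available from scikit-learn 1.2: the transformer returns a DataFrame itself.
        return poly.set_output(transform="pandas").fit_transform(input_df)
    # Fallback for older releases: build the DataFrame by hand.
    values = poly.fit_transform(input_df)
    names = poly.get_feature_names(list(input_df.columns))
    return pd.DataFrame(values, columns=names, index=input_df.index)


# Same data as in the question.
input_df = pd.DataFrame({"a": [1.0] * 3, "b": [2.0] * 3, "c": [3.0] * 3})
labeled_df = polynomial_features_labeled(input_df, 2)
print(labeled_df)
```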