
[Intro to Data Mining] Building a Competition Baseline Quickly with Tree Models, Plus Ideas for Going Further

Published: 2025-07-30

This article is an introductory tutorial for data mining competitions. Using the vehicle loan default prediction challenge as its example, it shows how to build a baseline quickly with the LightGBM tree model. It covers data loading and memory optimization, EDA, and feature filtering, trains the model with 5-fold cross-validation, produces predictions, and shares ideas for further improvement, helping beginners form a systematic picture of competitions and get started.


Project introduction:

This project is an introductory competition tutorial. It demonstrates how to build a competition baseline quickly with a tree model and shares ideas for pushing the score further. The goal is to help beginners form a systematic understanding of competitions, get started smoothly, and achieve good results.

Introducing the tree model LightGBM:

LightGBM is a fast, parallelizable gradient-boosted tree framework in the same family as XGBoost. It integrates several ensemble-learning ideas and improves on XGBoost's node-splitting implementation, giving lower memory usage and faster training.

LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/

Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html

Usage guide: "LightGBM operations you should know!"

Why tree models: tree models excel at generating rules. From data with features and labels, they distill decision rules and present them in a tree structure, solving both classification and regression problems.

Tasks on tabular data are home turf for decision-tree models; boosted tree models like XGBoost and LightGBM have become standard equipment in today's data mining competitions.
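As a minimal illustration of the "trees distill rules" point above (this sketch uses scikit-learn's DecisionTreeClassifier and the iris dataset, neither of which appears in the rest of this tutorial):

```python
# Minimal illustration: a decision tree learns human-readable split rules
# from labeled tabular data, and export_text prints them as an indented tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Render the learned decision rules as text
print(export_text(clf, feature_names=load_iris().feature_names))
```

Boosted models like LightGBM build hundreds of such trees and add their outputs together, which is why they fit tabular data so well.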

In [1]
# Install LightGBM
# Default (CPU) version
!pip install lightgbm
# GPU version, faster training
# !pip install lightgbm --install-option=--gpu
       
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.16.4)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.33.6)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.22.1)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.3.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
       

This tutorial takes the iFLYTEK competition, the Vehicle Loan Default Prediction Challenge, as its example and builds a baseline model for it with a tree model.

Competition page: http://challenge.xfyun.cn/topic/info?type=car-loan

Task: train a model on the training set to predict the value of the loan_default field in the test set, i.e. whether the borrower will default on payments, where 1 means the customer is overdue and 0 means the customer is not.

Hardware requirements: nothing demanding; the CPU version is enough to run this project. Tree models only start to need significant memory when the features are numerous or high-dimensional.

In [2]
# Unzip the competition dataset
%cd /home/aistudio/data/data101719/
!unzip data.zip
       
/home/aistudio/data/data101719
Archive:  data.zip
  inflating: sample_submit.csv       
  inflating: test.csv                
  inflating: train.csv
In [3]
# Import dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, roc_auc_score
from tqdm import tqdm
import gc
import time
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
       
In [4]
# Memory optimization helper to avoid running out of RAM:
# downcast each column to the narrowest dtype that can hold its value range
def reduce_mem(df, cols):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in tqdm(cols):
        col_type = df[col].dtypes
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('{:.2f} Mb, {:.2f} Mb ({:.2f} %)'.format(start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    gc.collect()
    return df
In [5]
# Read the competition dataset
train = pd.read_csv('./train.csv')  # training set
test = pd.read_csv('./test.csv')    # test set
# Optimize the memory usage of both datasets
train = reduce_mem(train, [f for f in train.columns])
test = reduce_mem(test, [f for f in test.columns])
       
100%|██████████| 53/53 [00:01<00:00, 42.04it/s]
100%|██████████| 52/52 [00:00<00:00, 559.02it/s]
       
60.65 Mb, 18.02 Mb (70.28 %)
11.90 Mb, 3.55 Mb (70.19 %)
       

In [6]
# Set up the submission file format required by the competition: 'customer_id', 'loan_default'
# 'loan_default' is the label to predict for the test set: 1 means the customer is overdue, 0 means not overdue.
sample_submit = pd.DataFrame(columns=['customer_id', 'loan_default'])
sample_submit['customer_id'] = test['customer_id']
   

Exploratory data analysis (EDA):

Global analysis: the overall picture of the data, including data types, size, and quality

Univariate analysis: exploratory analysis of each variable, including categorical, continuous, and text variables

Cross-feature analysis: interactions between features and the label, and between features themselves

Train/test distribution analysis: a mismatch between the training and test distributions is a major cause of gaps between offline and online scores

Reference article: "A Beginner's Competition Study Handbook"
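The train/test distribution check above can be sketched with a two-sample Kolmogorov-Smirnov test per feature. This is not part of the original notebook; the DataFrames and column names below are synthetic stand-ins for the competition data:

```python
# Hedged sketch: flag features whose train/test distributions differ,
# a common cause of offline/online score gaps. Columns "a" and "b" are
# illustrative; "b" is deliberately shifted in the fake test set.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = pd.DataFrame({"a": rng.normal(0, 1, 1000), "b": rng.normal(0, 1, 1000)})
test = pd.DataFrame({"a": rng.normal(0, 1, 1000), "b": rng.normal(2, 1, 1000)})

for col in train.columns:
    stat, p = ks_2samp(train[col], test[col])
    flag = "SHIFTED" if p < 0.01 else "ok"
    print(f"{col}: KS={stat:.3f} p={p:.3g} -> {flag}")
```

Features flagged this way are candidates for dropping or re-encoding, since the model cannot generalize from a training distribution the test set does not share.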

In [7]
# Overview of the data. This competition has many fields;
# making good use of the features is one of its main difficulties.
train.info()
       

RangeIndex: 150000 entries, 0 to 149999
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   customer_id                    150000 non-null  int32  
 1   main_account_loan_no           150000 non-null  int16  
 2   main_account_active_loan_no    150000 non-null  int16  
 3   main_account_overdue_no        150000 non-null  int8   
 4   main_account_outstanding_loan  150000 non-null  int32  
 5   main_account_sanction_loan     150000 non-null  int32  
 6   main_account_disbursed_loan    150000 non-null  int32  
 7   sub_account_loan_no            150000 non-null  int8   
 8   sub_account_active_loan_no     150000 non-null  int8   
 9   sub_account_overdue_no         150000 non-null  int8   
 10  sub_account_outstanding_loan   150000 non-null  int32  
 11  sub_account_sanction_loan      150000 non-null  int32  
 12  sub_account_disbursed_loan     150000 non-null  int32  
 13  disbursed_amount               150000 non-null  int32  
 14  asset_cost                     150000 non-null  int32  
 15  branch_id                      150000 non-null  int8   
 16  supplier_id                    150000 non-null  int16  
 17  manufacturer_id                150000 non-null  int8   
 18  area_id                        150000 non-null  int8   
 19  employee_code_id               150000 non-null  int16  
 20  mobileno_flag                  150000 non-null  int8   
 21  idcard_flag                    150000 non-null  int8   
 22  Driving_flag                   150000 non-null  int8   
 23  passport_flag                  150000 non-null  int8   
 24  credit_score                   150000 non-null  int16  
 25  main_account_monthly_payment   150000 non-null  int32  
 26  sub_account_monthly_payment    150000 non-null  int32  
 27  last_six_month_new_loan_no     150000 non-null  int8   
 28  last_six_month_defaulted_no    150000 non-null  int8   
 29  average_age                    150000 non-null  int8   
 30  credit_history                 150000 non-null  int8   
 31  enquirie_no                    150000 non-null  int8   
 32  loan_to_asset_ratio            150000 non-null  float16
 33  total_account_loan_no          150000 non-null  int16  
 34  sub_account_inactive_loan_no   150000 non-null  int16  
 35  total_inactive_loan_no         150000 non-null  int8   
 36  main_account_inactive_loan_no  150000 non-null  int16  
 37  total_overdue_no               150000 non-null  int8   
 38  total_outstanding_loan         150000 non-null  int32  
 39  total_sanction_loan            150000 non-null  int32  
 40  total_disbursed_loan           150000 non-null  int32  
 41  total_monthly_payment          150000 non-null  int32  
 42  outstanding_disburse_ratio     150000 non-null  float64
 43  main_account_tenure            150000 non-null  int32  
 44  sub_account_tenure             150000 non-null  int32  
 45  disburse_to_sactioned_ratio    150000 non-null  float32
 46  active_to_inactive_act_ratio   150000 non-null  float16
 47  year_of_birth                  150000 non-null  int16  
 48  disbursed_date                 150000 non-null  int16  
 49  Credit_level                   150000 non-null  int8   
 50  employment_type                150000 non-null  int8   
 51  age                            150000 non-null  int8   
 52  loan_default                   150000 non-null  int8   
dtypes: float16(2), float32(1), float64(1), int16(10), int32(17), int8(22)
memory usage: 18.0 MB
In [8]
# Count the distinct values in each field; fields with nunique == 1 can be dropped outright.
train.nunique()
       
customer_id                      150000
main_account_loan_no                104
main_account_active_loan_no          35
main_account_overdue_no              19
main_account_outstanding_loan     48609
main_account_sanction_loan        30564
main_account_disbursed_loan       32862
sub_account_loan_no                  36
sub_account_active_loan_no           21
sub_account_overdue_no                8
sub_account_outstanding_loan       2108
sub_account_sanction_loan          1519
sub_account_disbursed_loan         1725
disbursed_amount                  19235
asset_cost                        38902
branch_id                            82
supplier_id                        2888
manufacturer_id                      10
area_id                              22
employee_code_id                   3241
mobileno_flag                         1
idcard_flag                           1
Driving_flag                          2
passport_flag                         2
credit_score                        570
main_account_monthly_payment      21499
sub_account_monthly_payment        1304
last_six_month_new_loan_no           24
last_six_month_defaulted_no          14
average_age                         100
credit_history                      100
enquirie_no                          23
loan_to_asset_ratio                1994
total_account_loan_no               103
sub_account_inactive_loan_no         90
total_inactive_loan_no               27
main_account_inactive_loan_no        91
total_overdue_no                     19
total_outstanding_loan            49406
total_sanction_loan               31216
total_disbursed_loan              33557
total_monthly_payment             21843
outstanding_disburse_ratio         4391
main_account_tenure               12816
sub_account_tenure                 1230
disburse_to_sactioned_ratio         375
active_to_inactive_act_ratio        211
year_of_birth                        48
disbursed_date                        1
Credit_level                         14
employment_type                       3
age                                  48
loan_default                          2
dtype: int64
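The constant fields visible above (mobileno_flag, idcard_flag, and disbursed_date each have a single distinct value) can also be found programmatically rather than by eye. A small sketch with an illustrative DataFrame (not the competition data):

```python
# Sketch: find columns with only one distinct value; they carry no signal
# and can be excluded from the feature list.
import pandas as pd

df = pd.DataFrame({
    "mobileno_flag": [1, 1, 1],          # constant -> useless for the model
    "credit_score": [600, 650, 700],
})
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # ['mobileno_flag']
```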
               

Feature engineering (important!):

1. Feature interaction: combining features with each other and deriving new features from existing ones

2. Feature encoding: one-hot encoding, label encoding, etc.

3. Feature selection: pruning useless features by analyzing feature importance and correlations

Feature engineering largely exists to help the model learn. Wherever the model learns poorly or struggles, feature engineering steps in: hand-picked filters and hand-built combined features make things that the model would otherwise find hard to learn much easier, and thus yield better results.
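The first two ideas above can be sketched in a few lines of pandas. This example is not part of the original notebook; the values are made up, though the column names echo fields from this competition:

```python
# Hedged sketch of feature interaction and feature encoding.
import pandas as pd

df = pd.DataFrame({
    "disbursed_amount": [100, 200, 300],
    "asset_cost": [400, 400, 600],
    "employment_type": ["salaried", "self", "salaried"],
})

# 1. Feature interaction: derive a ratio feature from two raw features
df["amount_to_cost"] = df["disbursed_amount"] / df["asset_cost"]

# 2. Feature encoding: label-encode a categorical column as integers
df["employment_type_le"] = pd.factorize(df["employment_type"])[0]

print(df[["amount_to_cost", "employment_type_le"]])
```

Which interactions are worth building is competition-specific; feature importance from a first baseline run (shown later in this tutorial) is a good guide.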

In [9]
# Filter out useless features
all_cols = [f for f in train.columns if f not in ['customer_id','loan_default','mobileno_flag','idcard_flag','disbursed_date']]
   

Building the baseline model:

This section demonstrates how to put together a competition baseline quickly with a tree model. Feature engineering and model tuning should then be tailored to the specific requirements of the competition.

In [10]
# Training features
x_train = train[all_cols]
# Training labels
y_train = train['loan_default']
# Test set to predict
x_test = test[all_cols]
In [11]
# Define the training and prediction function
def train_predict(clf, train_x, train_y, test_x, clf_name='lgb'):
    # 5-fold cross-validation
    folds = 5
    seed = 2025
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        # Tree model parameters
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2 ** 7,
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.01,
            'seed': 2025,
            'n_jobs': -1,
            'verbose': -1,
        }
        # Early-stopping patience and evaluation interval should be tuned per task
        model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                          verbose_eval=500, early_stopping_rounds=200)
        # Predict on the validation fold
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        # Predict on the test set
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        # Print the validation scores so far
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # After training, print the importance of each feature
    print(pd.DataFrame({
            'column': all_cols,
            'importance': model.feature_importance()/5,
        }).sort_values(by='importance', ascending=False))
    return train, test
In [12]
# Train the model and generate predictions
lgb_train, lgb_test = train_predict(lgb, x_train, y_train, x_test)
       
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757221	valid_1's auc: 0.665608
Early stopping, best iteration is:
[648]	training's auc: 0.774819	valid_1's auc: 0.666395
[0.6663954692558639]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.756217	valid_1's auc: 0.6646
Early stopping, best iteration is:
[774]	training's auc: 0.786664	valid_1's auc: 0.665809
[0.6663954692558639, 0.6658088579217993]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757318	valid_1's auc: 0.664588
[1000]	training's auc: 0.809107	valid_1's auc: 0.665196
Early stopping, best iteration is:
[840]	training's auc: 0.794933	valid_1's auc: 0.665534
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.758371	valid_1's auc: 0.650627
[1000]	training's auc: 0.809869	valid_1's auc: 0.652059
Early stopping, best iteration is:
[996]	training's auc: 0.809559	valid_1's auc: 0.652149
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757135	valid_1's auc: 0.662366
Early stopping, best iteration is:
[692]	training's auc: 0.779432	valid_1's auc: 0.662648
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_scotrainre_list: [0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_score_mean: 0.6625071824914504
lgb_score_std: 0.005338481209206612
                           column  importance
18               employee_code_id      1421.0
15                    supplier_id      1374.6
14                      branch_id      1341.0
29            loan_to_asset_ratio      1307.0
12               disbursed_amount      1150.2
13                     asset_cost      1089.6
44                  year_of_birth       995.4
21                   credit_score       781.6
17                        area_id       760.6
39     outstanding_disburse_ratio       635.6
27                 credit_history       565.6
40            main_account_tenure       560.8
26                    average_age       560.8
22   main_account_monthly_payment       445.8
16                manufacturer_id       434.6
38          total_monthly_payment       371.6
3   main_account_outstanding_loan       339.4
43   active_to_inactive_act_ratio       304.6
35         total_outstanding_loan       264.8
36            total_sanction_loan       233.0
46                employment_type       228.6
4      main_account_sanction_loan       213.2
37           total_disbursed_loan       205.6
28                    enquirie_no       188.8
5     main_account_disbursed_loan       182.2
31   sub_account_inactive_loan_no       155.4
0            main_account_loan_no       155.4
25    last_six_month_defaulted_no       155.2
30          total_account_loan_no       152.6
42    disburse_to_sactioned_ratio       141.6
33  main_account_inactive_loan_no       134.4
1     main_account_active_loan_no       126.4
2         main_account_overdue_no       126.4
24     last_six_month_new_loan_no       122.8
47                            age       117.6
34               total_overdue_no        87.4
45                   Credit_level        53.4
19                   Driving_flag        27.0
23    sub_account_monthly_payment        12.4
41             sub_account_tenure        12.4
6             sub_account_loan_no        10.8
9    sub_account_outstanding_loan         8.0
20                  passport_flag         7.0
32         total_inactive_loan_no         5.6
10      sub_account_sanction_loan         5.6
11     sub_account_disbursed_loan         3.0
8          sub_account_overdue_no         0.2
7      sub_account_active_loan_no         0.2
In [13]
# Save the prediction results
sample_submit['loan_default'] = lgb_test
# The competition requires outputs of 0 or 1, so the predicted probabilities must be converted.
# Here values above 0.25 become 1, and values at or below 0.25 become 0.
sample_submit['loan_default'] = sample_submit['loan_default'].apply(lambda x: 1 if x > 0.25 else 0).values
# Write the submission file
sample_submit.to_csv('result.csv', index=False)
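Rather than hand-picking a cutoff like 0.25, the threshold can be searched on the out-of-fold predictions (the notebook imports f1_score but never uses it; lgb_train above holds the OOF probabilities). A hedged sketch on synthetic data, since the real arrays live inside the notebook session:

```python
# Hedged sketch (synthetic y_true/oof_pred): pick the probability cutoff
# that maximizes F1 on out-of-fold predictions instead of guessing one.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
# crude scores correlated with the label, standing in for OOF probabilities
oof_pred = np.clip(y_true * 0.4 + rng.random(1000) * 0.6, 0, 1)

best_t, best_f1 = max(
    ((t, f1_score(y_true, (oof_pred > t).astype(int))) for t in np.arange(0.05, 0.95, 0.01)),
    key=lambda x: x[1],
)
print(f"best threshold={best_t:.2f}, F1={best_f1:.4f}")
```

Whatever threshold wins offline should still be sanity-checked against the competition's official metric before submitting.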

