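This article is an introductory tutorial on data-mining competitions. Using the Vehicle Loan Default Prediction Challenge as an example, it shows how to build a quick baseline with the LightGBM tree model. It covers data loading and memory optimization, exploratory data analysis (EDA), feature selection, model training with 5-fold cross-validation, and writing out the predictions. It also shares ideas for further improvement, to help beginners form a systematic view of competitions and get started.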
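As an introductory competition tutorial, this project demonstrates how to build a competition baseline quickly with a tree model and shares ideas for improving on it. The aim is to help beginners form a systematic understanding of competitions, get started more easily, and achieve good results.
LightGBM is a gradient boosting decision tree (GBDT) framework open-sourced by Microsoft that supports fast parallel training. It is not built on top of XGBoost; it is a separate implementation of the same family of boosted-tree methods and offers several boosting modes (for example gbdt, dart, and rf). Compared with XGBoost's original split-finding approach, it changes how node splits are found (histogram-based split finding and leaf-wise tree growth), which typically gives lower memory usage and faster training.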
LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/
Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html
Usage guide: "LightGBM operations you should know"
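Advantages of tree models: tree models are a powerful tool for generating rules. From data with features and labels they derive decision rules and present those rules as a tree structure, which lets them solve both classification and regression problems.
Tasks on tabular data are largely home turf for decision-tree models, and boosted tree models such as XGBoost and LightGBM have become standard equipment in data-mining competitions.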
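To make the "decision rules as a tree" idea concrete, here is a minimal sketch that fits a shallow scikit-learn decision tree and prints its learned rules as text. The synthetic data and the feature names `f0` to `f3` are assumptions for illustration only; they are not part of the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary-classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names

# A depth-2 tree keeps the rule set small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the fitted tree as nested if/else style rules.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Boosted models such as LightGBM combine many such trees, so the individual rules are no longer readable one by one, but each tree still has this structure.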
In [1]
# Install LightGBM
# Default (CPU) version
!pip install lightgbm
# GPU version, faster training
# (flag kept from the original; recent pip releases may no longer accept --install-option, so check the LightGBM installation guide)
# !pip install lightgbm --install-option=--gpu
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.16.4)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.33.6)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.22.1)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.3.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
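This tutorial uses the iFLYTEK competition "Vehicle Loan Default Prediction Challenge" as its example and builds the baseline with a tree model.
Competition page: http://challenge.xfyun.cn/topic/info?type=car-loan
Task: train a model on the training set to predict the loan_default field of the test set, that is, whether the borrower will fall behind on payments. A value of 1 means the customer is overdue and 0 means the customer is not overdue.
Runtime requirements: no demanding hardware is needed; the CPU environment is enough to run this project. Tree models generally put pressure on memory only when there are many features or the data is high-dimensional.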
In [2]
# Extract the competition dataset
%cd /home/aistudio/data/data101719/
!unzip data.zip
/home/aistudio/data/data101719
Archive:  data.zip
  inflating: sample_submit.csv
  inflating: test.csv
  inflating: train.csv
In [3]
# Import dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, roc_auc_score
from tqdm import tqdm
import gc
import time
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
In [4]
# Memory-optimization helper to avoid running out of memory:
# downcast each numeric column to the smallest dtype whose range covers its values
def reduce_mem(df, cols):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in tqdm(cols):
        col_type = df[col].dtypes
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Integer columns: try int8, int16, int32, then int64
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Float columns: only the value range is checked, so float16 can lose precision
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    # Print memory before, memory after, and the percentage saved
    print('{:.2f} Mb, {:.2f} Mb ({:.2f} %)'.format(start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    gc.collect()
    return df
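One caveat of range-based downcasting is worth seeing in numbers: a float column can fit the float16 range and still lose precision, because float16 keeps only about three significant decimal digits. The sketch below uses made-up values (not competition data) to compare the rounding error of float16 and float32.

```python
import numpy as np
import pandas as pd

# Made-up values that fit comfortably inside the float16 range.
s = pd.Series([0.7312, 0.7314, 1234.56], dtype=np.float64)

s16 = s.astype(np.float16)
s32 = s.astype(np.float32)

# Largest absolute rounding error introduced by each downcast.
err16 = (s - s16.astype(np.float64)).abs().max()
err32 = (s - s32.astype(np.float64)).abs().max()

print('max abs error float16:', err16)
print('max abs error float32:', err32)
```

If a ratio or amount feature needs finer resolution than this, one option is to stop the float branch at float32 instead of float16.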
In [5]
# Load the competition data
train = pd.read_csv('./train.csv')  # training set
test = pd.read_csv('./test.csv')    # test set
# Reduce the memory footprint of both datasets
train = reduce_mem(train, [f for f in train.columns])
test = reduce_mem(test, [f for f in test.columns])
100%|██████████| 53/53 [00:01<00:00, 42.04it/s]
100%|██████████| 52/52 [00:00<00:00, 559.02it/s]
60.65 Mb, 18.02 Mb (70.28 %)
11.90 Mb, 3.55 Mb (70.19 %)
In [6]
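# Build the submission file in the format the competition requires: 'customer_id', 'loan_default'
# 'loan_default' is the label to predict for the test set; 1 means the customer is overdue, 0 means not overdue.
sample_submit = pd.DataFrame(columns=['customer_id', 'loan_default'])
sample_submit['customer_id'] = test['customer_id']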
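Global data analysis: the overall state of the data, including data types, size, and quality.
Univariate analysis: exploratory analysis of each variable, including categorical, continuous, and text variables.
Cross-feature analysis: feature-versus-label analysis and feature-versus-feature analysis.
Train/test distribution analysis: a mismatch between the training-set and test-set distributions is a major cause of offline and online scores disagreeing.
Reference article: "Beginner's Competition Study Handbook"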
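The train/test distribution check in the list above can be sketched with a two-sample Kolmogorov-Smirnov test per numeric column. This is one possible approach, not the method the notebook uses; the helper name `compare_distributions` and the synthetic `demo_train` / `demo_test` frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_distributions(train_df, test_df, cols):
    """Return per-column KS statistics, largest (most shifted) first."""
    rows = []
    for col in cols:
        stat, p_value = ks_2samp(train_df[col], test_df[col])
        rows.append({
            'column': col,
            'ks_stat': stat,
            'p_value': p_value,
            'train_mean': train_df[col].mean(),
            'test_mean': test_df[col].mean(),
        })
    report = pd.DataFrame(rows).sort_values('ks_stat', ascending=False)
    return report.reset_index(drop=True)

# Synthetic demo: 'shifted' has a different mean in the test set.
rng = np.random.RandomState(0)
demo_train = pd.DataFrame({
    'stable': rng.normal(0, 1, 2000),
    'shifted': rng.normal(0, 1, 2000),
})
demo_test = pd.DataFrame({
    'stable': rng.normal(0, 1, 1000),
    'shifted': rng.normal(1.0, 1, 1000),
})

report = compare_distributions(demo_train, demo_test, ['stable', 'shifted'])
print(report)
```

On the competition data the same helper could be called with the shared feature columns of `train` and `test`; columns with a large KS statistic are candidates for closer inspection.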
In [7]
# Overview of the data size. The dataset has many fields, so making good use of the features is one of the main difficulties of this competition.
train.info()
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   customer_id                    150000 non-null  int32
 1   main_account_loan_no           150000 non-null  int16
 2   main_account_active_loan_no    150000 non-null  int16
 3   main_account_overdue_no        150000 non-null  int8
 4   main_account_outstanding_loan  150000 non-null  int32
 5   main_account_sanction_loan     150000 non-null  int32
 6   main_account_disbursed_loan    150000 non-null  int32
 7   sub_account_loan_no            150000 non-null  int8
 8   sub_account_active_loan_no     150000 non-null  int8
 9   sub_account_overdue_no         150000 non-null  int8
 10  sub_account_outstanding_loan   150000 non-null  int32
 11  sub_account_sanction_loan      150000 non-null  int32
 12  sub_account_disbursed_loan     150000 non-null  int32
 13  disbursed_amount               150000 non-null  int32
 14  asset_cost                     150000 non-null  int32
 15  branch_id                      150000 non-null  int8
 16  supplier_id                    150000 non-null  int16
 17  manufacturer_id                150000 non-null  int8
 18  area_id                        150000 non-null  int8
 19  employee_code_id               150000 non-null  int16
 20  mobileno_flag                  150000 non-null  int8
 21  idcard_flag                    150000 non-null  int8
 22  Driving_flag                   150000 non-null  int8
 23  passport_flag                  150000 non-null  int8
 24  credit_score                   150000 non-null  int16
 25  main_account_monthly_payment   150000 non-null  int32
 26  sub_account_monthly_payment    150000 non-null  int32
 27  last_six_month_new_loan_no     150000 non-null  int8
 28  last_six_month_defaulted_no    150000 non-null  int8
 29  average_age                    150000 non-null  int8
 30  credit_history                 150000 non-null  int8
 31  enquirie_no                    150000 non-null  int8
 32  loan_to_asset_ratio            150000 non-null  float16
 33  total_account_loan_no          150000 non-null  int16
 34  sub_account_inactive_loan_no   150000 non-null  int16
 35  total_inactive_loan_no         150000 non-null  int8
 36  main_account_inactive_loan_no  150000 non-null  int16
 37  total_overdue_no               150000 non-null  int8
 38  total_outstanding_loan         150000 non-null  int32
 39  total_sanction_loan            150000 non-null  int32
 40  total_disbursed_loan           150000 non-null  int32
 41  total_monthly_payment          150000 non-null  int32
 42  outstanding_disburse_ratio     150000 non-null  float64
 43  main_account_tenure            150000 non-null  int32
 44  sub_account_tenure             150000 non-null  int32
 45  disburse_to_sactioned_ratio    150000 non-null  float32
 46  active_to_inactive_act_ratio   150000 non-null  float16
 47  year_of_birth                  150000 non-null  int16
 48  disbursed_date                 150000 non-null  int16
 49  Credit_level                   150000 non-null  int8
 50  employment_type                150000 non-null  int8
 51  age                            150000 non-null  int8
 52  loan_default                   150000 non-null  int8
dtypes: float16(2), float32(1), float64(1), int16(10), int32(17), int8(22)
memory usage: 18.0 MB
In [8]
# Count the distinct values in each field; fields whose nunique is 1 carry no information and can be dropped directly.
train.nunique()
customer_id                      150000
main_account_loan_no                104
main_account_active_loan_no          35
main_account_overdue_no              19
main_account_outstanding_loan     48609
main_account_sanction_loan        30564
main_account_disbursed_loan       32862
sub_account_loan_no                  36
sub_account_active_loan_no           21
sub_account_overdue_no                8
sub_account_outstanding_loan       2108
sub_account_sanction_loan          1519
sub_account_disbursed_loan         1725
disbursed_amount                  19235
asset_cost                        38902
branch_id                            82
supplier_id                        2888
manufacturer_id                      10
area_id                              22
employee_code_id                   3241
mobileno_flag                         1
idcard_flag                           1
Driving_flag                          2
passport_flag                         2
credit_score                        570
main_account_monthly_payment      21499
sub_account_monthly_payment        1304
last_six_month_new_loan_no           24
last_six_month_defaulted_no          14
average_age                         100
credit_history                      100
enquirie_no                          23
loan_to_asset_ratio                1994
total_account_loan_no               103
sub_account_inactive_loan_no         90
total_inactive_loan_no               27
main_account_inactive_loan_no        91
total_overdue_no                     19
total_outstanding_loan            49406
total_sanction_loan               31216
total_disbursed_loan              33557
total_monthly_payment             21843
outstanding_disburse_ratio         4391
main_account_tenure               12816
sub_account_tenure                 1230
disburse_to_sactioned_ratio         375
active_to_inactive_act_ratio        211
year_of_birth                        48
disbursed_date                        1
Credit_level                         14
employment_type                       3
age                                  48
loan_default                          2
dtype: int64
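Rather than reading the constant columns off the listing by eye, they can be collected programmatically. A minimal sketch follows; the helper name `constant_columns` and the tiny `demo` frame are assumptions for illustration, with column names borrowed from the dataset.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns with at most one distinct value."""
    counts = df.nunique()  # NaN is not counted as a value by default
    return counts[counts <= 1].index.tolist()

# Tiny demo frame: 'mobileno_flag' is constant, the others are not.
demo = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'mobileno_flag': [1, 1, 1],
    'Driving_flag': [0, 1, 0],
})

to_drop = constant_columns(demo)
demo_reduced = demo.drop(columns=to_drop)
print(to_drop)
```

Applied to the `train` frame above, this would be expected to return `mobileno_flag`, `idcard_flag`, and `disbursed_date`, the three fields whose nunique is 1 in the listing.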
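1. Feature interaction: combining features with one another and deriving new features from existing ones.
2. Feature encoding: one-hot encoding, label encoding, and so on.
3. Feature selection: pruning useless features based on feature importance and correlation analysis.
Feature engineering is largely about helping the model learn. Where the model learns poorly or struggles to learn at all, manually selected and manually constructed combination features make those patterns easier to pick up, which leads to better results.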
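The first two items in the list above can be sketched in a few lines of pandas. The derived columns `amount_to_cost` and `branch_id_freq`, the `emp` prefix, and the demo values are all assumptions for illustration; the notebook itself does not build these features.

```python
import pandas as pd

# Small made-up frame using column names from the dataset.
demo = pd.DataFrame({
    'disbursed_amount': [50000, 42000, 61000, 38000],
    'asset_cost': [65000, 60000, 70000, 55000],
    'branch_id': [1, 2, 1, 3],
    'employment_type': [0, 1, 1, 2],
})

# Feature interaction: ratio of two numeric features.
demo['amount_to_cost'] = demo['disbursed_amount'] / demo['asset_cost']

# Feature encoding (frequency encoding): how often each category occurs.
demo['branch_id_freq'] = demo['branch_id'].map(demo['branch_id'].value_counts())

# Feature encoding (one-hot): one indicator column per category.
demo = pd.get_dummies(demo, columns=['employment_type'], prefix='emp')

print(demo.head())
```

Tree models usually handle integer-coded categories without one-hot encoding, so for LightGBM the frequency encoding and ratio features tend to be the more useful of these; whether they help on this dataset would need to be checked with cross-validation.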
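In [9]
# Filter out useless features: the ID, the label, and the three constant columns (nunique == 1)
all_cols = [f for f in train.columns if f not in ['customer_id','loan_default','mobileno_flag','idcard_flag','disbursed_date']]
The focus here is on showing how to build a competition baseline quickly with a tree model; feature engineering and model tuning need to be tailored to the specific competition.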
In [10]
# Training features
x_train = train[all_cols]
# Training labels
y_train = train['loan_default']
# Test set to predict on
x_test = test[all_cols]
In [11]
# Define the training and prediction function
def train_predict(clf, train_x, train_y, test_x, clf_name='lgb'):
    # 5-fold cross-validation
    folds = 5
    seed = 2025
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])  # out-of-fold predictions on the training set
    test = np.zeros(test_x.shape[0])    # test predictions averaged over the folds
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # Positional indexing (.iloc) for both features and labels
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        # Tree model parameters
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2 ** 7,
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.01,
            'seed': 2025,
            'n_jobs': -1,
            'verbose': -1,
        }
        # Early stopping and the logging interval should be tuned for the task at hand.
        # verbose_eval and early_stopping_rounds work with the LightGBM 3.1.1 shown in the install log;
        # LightGBM 4.0+ removed them from train() in favor of callbacks.
        model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=500, early_stopping_rounds=200)
        # Predict on the validation fold
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        # Predict on the test set
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        # Print the validation scores so far
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # After training, print the feature importance.
    # Note: this is the last fold's split-count importance divided by 5, not an average over all folds.
    print(pd.DataFrame({
        'column': list(train_x.columns),
        'importance': model.feature_importance()/5,
    }).sort_values(by='importance', ascending=False))
    return train, test
In [12]
# Train the model and predict
lgb_train, lgb_test = train_predict(lgb, x_train, y_train, x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[500]  training's auc: 0.757221  valid_1's auc: 0.665608
Early stopping, best iteration is:
[648]  training's auc: 0.774819  valid_1's auc: 0.666395
[0.6663954692558639]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[500]  training's auc: 0.756217  valid_1's auc: 0.6646
Early stopping, best iteration is:
[774]  training's auc: 0.786664  valid_1's auc: 0.665809
[0.6663954692558639, 0.6658088579217993]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[500]  training's auc: 0.757318  valid_1's auc: 0.664588
[1000]  training's auc: 0.809107  valid_1's auc: 0.665196
Early stopping, best iteration is:
[840]  training's auc: 0.794933  valid_1's auc: 0.665534
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[500]  training's auc: 0.758371  valid_1's auc: 0.650627
[1000]  training's auc: 0.809869  valid_1's auc: 0.652059
Early stopping, best iteration is:
[996]  training's auc: 0.809559  valid_1's auc: 0.652149
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[500]  training's auc: 0.757135  valid_1's auc: 0.662366
Early stopping, best iteration is:
[692]  training's auc: 0.779432  valid_1's auc: 0.662648
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_score_list: [0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_score_mean: 0.6625071824914504
lgb_score_std: 0.005338481209206612
                           column  importance
18               employee_code_id      1421.0
15                    supplier_id      1374.6
14                      branch_id      1341.0
29            loan_to_asset_ratio      1307.0
12               disbursed_amount      1150.2
13                     asset_cost      1089.6
44                  year_of_birth       995.4
21                   credit_score       781.6
17                        area_id       760.6
39     outstanding_disburse_ratio       635.6
27                 credit_history       565.6
40            main_account_tenure       560.8
26                    average_age       560.8
22   main_account_monthly_payment       445.8
16                manufacturer_id       434.6
38          total_monthly_payment       371.6
3   main_account_outstanding_loan       339.4
43   active_to_inactive_act_ratio       304.6
35         total_outstanding_loan       264.8
36            total_sanction_loan       233.0
46                employment_type       228.6
4      main_account_sanction_loan       213.2
37           total_disbursed_loan       205.6
28                    enquirie_no       188.8
5     main_account_disbursed_loan       182.2
31   sub_account_inactive_loan_no       155.4
0            main_account_loan_no       155.4
25    last_six_month_defaulted_no       155.2
30          total_account_loan_no       152.6
42    disburse_to_sactioned_ratio       141.6
33  main_account_inactive_loan_no       134.4
1     main_account_active_loan_no       126.4
2         main_account_overdue_no       126.4
24     last_six_month_new_loan_no       122.8
47                            age       117.6
34               total_overdue_no        87.4
45                   Credit_level        53.4
19                   Driving_flag        27.0
23    sub_account_monthly_payment        12.4
41             sub_account_tenure        12.4
6             sub_account_loan_no        10.8
9    sub_account_outstanding_loan         8.0
20                  passport_flag         7.0
32         total_inactive_loan_no         5.6
10      sub_account_sanction_loan         5.6
11     sub_account_disbursed_loan         3.0
8          sub_account_overdue_no         0.2
7      sub_account_active_loan_no         0.2
In [13]
# Store the predictions in the submission frame
sample_submit['loan_default'] = lgb_test
# The competition requires 0 or 1 as output, so the predicted probabilities must be converted.
# Here values greater than 0.25 become 1, and values less than or equal to 0.25 become 0.
sample_submit['loan_default'] = sample_submit['loan_default'].apply(lambda x: 1 if x > 0.25 else 0).values
# Save the result file
sample_submit.to_csv('result.csv', index=False)
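The 0.25 cut-off is a fixed choice. One hedged way to choose it from data is to scan thresholds on the out-of-fold predictions and keep the one with the best F1 score. This assumes the leaderboard metric is F1, which the notebook hints at by importing `f1_score` but does not state. The helper name `best_f1_threshold` and the synthetic labels and probabilities are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, y_prob, grid=None):
    """Scan candidate thresholds and return (threshold, F1) with the highest F1."""
    if grid is None:
        grid = np.arange(0.05, 0.96, 0.01)
    scores = [f1_score(y_true, (y_prob > t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

# Synthetic imbalanced labels and noisy probabilities (illustrative only).
rng = np.random.RandomState(0)
y_true = (rng.rand(2000) < 0.2).astype(int)
y_prob = np.clip(0.2 + 0.15 * y_true + rng.normal(0, 0.1, 2000), 0, 1)

thr, score = best_f1_threshold(y_true, y_prob)
print('best threshold:', thr, 'F1:', score)
```

With the notebook's variables the call would be `best_f1_threshold(y_train, lgb_train)`, since `lgb_train` holds the out-of-fold predictions; the chosen threshold would then replace 0.25 when binarizing `lgb_test`.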