Title: Predicting long time contributors with knowledge units of programming languages: an empirical study

[Replication Instructions]

1) Extracting KUs from Java source code:
- Configure all the paths in util/ConstantUtil.java
- Run the main script to extract KUs from the commits of the selected projects.
  Main script: MasterRepositoryReleaseLevelCommitAnalyzerOtherProject.java
  Detailed info: This script is multi-threaded to speed up the KU extraction. Set the number of threads via "numberOfThreads" (default is 30). The script delegates the main work to ChildRepositoryReleaseLevelCommitAnalyzerOtherProject.java. When all jobs have finished, the KUs extracted for every file in each commit are saved to the designated location.

2) Generate KU-based features for the different profile dimensions:
- Configure all the paths in util/ConstantUtil.java
- Run the following scripts to generate the vector of KUs for each dimension:
  Studied Project Developer Expertise Dimension: studiedProjects/KUFeatureExtractorMaster.java
  Other Project Developer Expertise Dimension: otherProjects/KUFeatureExtractorMaster.java
  Collaborator Expertise Dimension: collaborator/KUFeatureExtractorMaster.java
  Studied Project Characteristics Dimension: projectdim/ProjectDimKUAnalyzer.java
  Other Project Characteristics Dimension: projectdim/OtherProjDimKUAnalyzer.java
- Output: These scripts generate CSV files containing the feature vector for each dimension.

3) Construct the studied models, including KULTC and the baseline model:
- We built our studied models in Python.
- Required packages: sklearn, numpy, pandas, xgboost, imblearn, lightgbm, pickle, and multiprocessing
- Run the script to build the studied models:
  Main script: ltc_KU/ku-model-variation/kultc_feature_variation.py
  Detailed info:
  * Create a list of the models to build
  * The corresponding feature dimensions are mapped automatically
  * Set the required information (e.g., boot limit)
  * The script first runs AutoSpearman to filter out highly correlated features
  * It then builds each model and saves the model results to the desired location

4) Hyper-parameter tuning analysis:
- Run the following script to build models while varying the parameters and classification algorithms.
  Script: hyper-parameter-analysis/hyper-parameter-tuning-analyzer.py

5) Rank the models by their performance (AUC):
- We apply the Scott-Knott ESD method to rank the models by their performance.
  Generate the sk-rank input: ltc-model-analysis/model-sk-rank-analysis.py
  Apply sk-rank: ku-model-variation/sk-rank-models.py

6) Model feature importance analysis:
- Run the SHAP analysis:
  We apply SHAP for our model's feature importance analysis. SHAP is a robust, flexible, and widely used local (i.e., instance-based) model interpretation method.
  Required package: shap
  Script: ltc-model-analysis/shap_feature_importance_analysis_trained_all_data_single_model.py
  Run this script to generate Shapley values for each feature of every instance in the dataset. The script then saves all the raw SHAP values it generates.
- Analyze the SHAP values:
  Script: ku-model-variation/full_model_feature_shap.py
- Scott-Knott test to rank the features by their importance for the whole model:
  Script: call the method generate_sk_rank_input_data() in shap_feature_importance_analysis_trained_all_data_single_model.py
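Step 6 ranks features by their SHAP importance. The scripts themselves are not shown here, but the standard way to turn per-instance Shapley values into a global feature ranking is to average their absolute values per feature; a minimal sketch (the function and variable names are illustrative, not taken from the replication scripts):

```python
# Hypothetical sketch: aggregate per-instance SHAP values into a global
# feature ranking by mean absolute SHAP value per feature.
import numpy as np

def rank_features_by_shap(shap_values, feature_names):
    """shap_values: (n_instances, n_features) array of Shapley values.
    Returns feature names sorted by mean |SHAP|, most important first."""
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(importance)[::-1]  # descending importance
    return [feature_names[i] for i in order]
```

The resulting ordered list is the kind of per-feature importance summary that a Scott-Knott test can then group into statistically distinct ranks.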
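Step 3 mentions that AutoSpearman removes highly correlated features before model construction. As a rough illustration of the idea (not the actual AutoSpearman procedure, which also handles multicollinearity via variance inflation factors), a greedy Spearman-correlation filter can be sketched as follows; the threshold of 0.7 is a commonly used value, assumed here for illustration:

```python
# Simplified sketch of correlation-based feature filtering in the spirit of
# AutoSpearman: for each pair of features whose |Spearman rho| exceeds the
# threshold, greedily drop the later feature.
import pandas as pd

def drop_correlated_features(df, threshold=0.7):
    """Return df restricted to features with pairwise |Spearman rho| <= threshold."""
    corr = df.corr(method="spearman").abs()
    keep = list(df.columns)
    for i, a in enumerate(df.columns):
        if a not in keep:
            continue  # already dropped earlier
        for b in df.columns[i + 1:]:
            if b in keep and corr.loc[a, b] > threshold:
                keep.remove(b)
    return df[keep]
```

For example, a feature that is a monotone transform of another (Spearman rho = 1) would be dropped, while weakly correlated features survive.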