Untangling How Machines 'Learn' Perovskite Crystallization Chemistry Through Stepwise Data Sample Comparisons

Pendleton, Ian M.; Caucci, Mary K.; Tynes, Michael; Dharna, Aaron; Najeeb, Mansoor Ani; Chan, Emory M.; Norquist, Alexander J.; Schrier, Joshua

Organizations

MDF Open

Year

2020

Source Name

darpa_sd2_perovskites

License

CC-BY 4.0

Contacts

jschrier@fordham.edu anorquis@haverford.edu ipendlet@umich.edu
Included in this content:
  • 0045.perovskitedata.csv - main dataset used in this article. A more detailed description can be found in the “dataset overview” section below.
  • Chemical Inventory.csv - the hand-curated file of all chemicals used in the construction of the perovskite dataset. This file includes identifiers, chemical properties, and other information.
  • ExcessMolarVolumeData.xlsx - record of experimental data, computations, and final dataset used in the generation of the excess molar volume plots.
  • MLModelMetrics.xlsx - all of the ML metrics organized in one place (excludes the reactant-set-specific breakdown; see ML_Logs.zip for those files).
  • OrganoammoniumDensityDataset.xlsx - complete set of the data used to generate the density values. Example calculations included.
  • model_matchup_main.py - Python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the “ML Code” section below. This file is also hosted on
  • SolutionVolumeDataset - complete set of 219 solutions in the perovskite dataset. Tabs include the automatically generated reagent information from ESCALATE, hand-curated reagent information from early runs, and the generation of the dataset used in the creation of Figures
  • error_auditing.zip - code and historical datasets used for reporting the dataset auditing.
  • “AllCode.zip” which contains:
    • model_matchup_main_20191231.py - Python pipeline used to generate all of the ML runs associated with the article. More detailed instructions on the operation of this code are included in the “ML Code” section below. This file is also hosted on
    • GIT: https://github.com/ipendlet/MLScripts/blob/master/temp_densityconc/0045.perovskitedata.csv
    • VmE_CurveFitandPlot.py - Python code for generating the third-order polynomial fit to the VmE vs. mole fraction of FAH included in the main text. Requires the ‘MolFractionResults.csv’ file to function (also included).
    • Calculation_Vm_Ve_CURVEFITTING.nb - Mathematica code for generating the third-order polynomial fit to the VmE vs. mole fraction of FAH included in the main text.
    • Covariance_Analysis.py - Python code for ingesting and plotting the covariance of features and volumes in the perovskite dataset. Includes the renaming dictionaries used for the publication.
    • FeatureComparison_Plotting.py - Python code for reading in and plotting features for the ‘GBT’ and ‘OHGBT’ folders in this directory. The code parses the contents of these folders and generates the feature comparison metrics used for Figure 9 and the associated Figure S8. Some assembly required.
    • Requirements.txt - all of the packages used in the generation of this paper.
    • 0045.perovskitedata.csv - the main dataset described throughout the article. This file is required to run some of the code and is therefore kept near the code.
  • “ML_Logs.zip” which contains:
    • A folder for every model generated for this article. Each folder contains a number of files:
      • Features_named_important.csv and features_value_importance.csv - linked files that together describe the weighted feature contributions (only present for GBT models).
      • AnalysisLog.txt - log file of the run, including all options, data curation, and model training summaries.
      • LeaveOneOut_Summary.csv - results of the leave-one-reactant-set-out studies on the model (if performed).
      • LOOModelInfo.txt - hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
      • STTSModelInfo.txt - hyperparameter information for each model in the study (associated with the given dataset, sometimes includes duplicate runs).
      • StandardTestTrain_Summary.csv - results of the 6-fold cross-validation ML performance (for the hold-out case).
      • LeaveOneOut_FullDataset_ByAmine.csv - results of the leave-one-reactant-set-out studies performed on the full dataset (all experiments), broken down by reactant set (delineated by the amine).
      • LeaveOneOut_StratifiedData_ByAmine.csv - results of the leave-one-reactant-set-out studies performed on a random stratified sample (96 random experiments), broken down by reactant set (delineated by the amine).
      • model_matchup_main_*.py - code used to generate all of the runs contained in a particular folder. The code is exactly what was used at run time to generate a given dataset (requires the 0045.perovskitedata.csv file to run).
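
The leave-one-reactant-set-out files listed above report model performance when all experiments for one amine are held out at a time. As a rough illustration of that kind of split (not the code in model_matchup_main.py), the sketch below groups rows by a key and holds out one group per round. The function name and the "amine" and "crystal" column names are hypothetical and are not taken from the dataset's actual headers.

```python
from collections import defaultdict


def leave_one_group_out(rows, group_key):
    """Yield (held_out_group, train_rows, test_rows) once per distinct group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    for held_out in sorted(groups):
        test = groups[held_out]
        # Training data is every row whose group differs from the held-out one.
        train = [r for g, members in groups.items() if g != held_out for r in members]
        yield held_out, train, test


# Toy rows standing in for experiments. "amine" and "crystal" are
# hypothetical column names chosen for illustration only.
experiments = [
    {"amine": "A", "crystal": 1},
    {"amine": "A", "crystal": 0},
    {"amine": "B", "crystal": 1},
    {"amine": "C", "crystal": 0},
    {"amine": "C", "crystal": 1},
]

splits = list(leave_one_group_out(experiments, "amine"))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

To try this on the real data, the rows could be read from 0045.perovskitedata.csv with csv.DictReader, with the grouping key replaced by whichever header in that file identifies the amine.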