01. Data Loading and Exploratory Data Analysis (EDA)

Introduction

This notebook focuses on the exploratory data analysis and statistical evaluation of Windows PE files for malware detection. We aim to develop a neural network model that can effectively distinguish between malicious and benign files while minimizing false positive rates. Our analysis will establish the foundation for model development through systematic feature investigation.

Analysis Objectives

  1. Data Understanding and Preprocessing

    • Analyze distribution of malware vs benign files
    • Evaluate feature quality and relationships
    • Handle missing values and duplicates
    • Prepare data for neural network modeling
  2. Feature Analysis Framework

    • Investigate PE header metadata features
    • Analyze textual content as potential features
    • Examine binary data characteristics
    • Identify discriminative feature patterns

Data Extraction Pipeline

Figure 1: Data Extraction Pipeline illustrating the steps from raw PE files to feature extraction.

Analysis Pipeline

  1. Data Quality Assessment

    • Examine dataset composition and balance
    • Identify and handle missing values
    • Remove duplicates and cross-contamination
    • Standardize feature formats
  2. Statistical Analysis

    • Feature distribution analysis
    • Correlation investigation
    • Statistical significance testing
    • Feature importance evaluation
  3. Feature Engineering Strategy

    • Metadata feature processing
    • Text-based feature extraction
    • Binary data representation analysis
    • Feature selection for model development

Workflow Diagram

Figure 2: Overall Workflow outlining the end-to-end process from data preprocessing to model evaluation.

Success Metrics

Our analysis will focus on:

  • Identifying features that minimize false positive rates
  • Understanding text-based feature effectiveness
  • Evaluating binary data representation options
  • Establishing baselines for model performance
  • Preparing metrics for model evaluation

Data Loading and Initial Inspection

In [33]:
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [34]:
import logging
import sys
import warnings

import numpy as np
from IPython.display import Image
from windows_malware_classifier.analysis.feature_analysis_tools import (
    calculate_shap_values,
    display_importance_rankings,
    display_shap_impacts,
    extract_high_correlations,
    run_statistical_tests,
)
from windows_malware_classifier.preprocessing.data_preparation_tools import (
    analyze_dataset_quality,
    detect_outliers_iqr,
    display_column_types,
    calculate_pe_statistics,
    load_malware_dataset,
    optimize_memory_usage,
    impute_numeric_neural_network,
)
from windows_malware_classifier.visualization.distributions_plots import (
    plot_category_distributions,
    plot_feature_histograms,
)
In [35]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
warnings.filterwarnings("ignore")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)

logger = logging.getLogger(__name__)

In line with the task requirements, the analysis is restricted to Windows PE files. The loader filters out every non-PE sample before the train/test split, which removes 1,222 of the 25,117 rows in the raw dataset.

In [36]:
train_df, test_df = load_malware_dataset(split_data=True, random_state=RANDOM_SEED)
2025-05-18 17:19:27,333 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Attempting to load dataset from: /Users/vytautasbunevicius/windows-malware-classifier/data/raw/malware_dataset.csv
2025-05-18 17:19:27,654 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Filtered dataset to PE files only
2025-05-18 17:19:27,655 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Original dataset shape: (25117, 98)
2025-05-18 17:19:27,658 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - PE files dataset shape: (23895, 98)
2025-05-18 17:19:28,372 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Saved train dataset to /Users/vytautasbunevicius/windows-malware-classifier/data/raw/malware_dataset_train.csv
2025-05-18 17:19:28,373 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Saved test dataset to /Users/vytautasbunevicius/windows-malware-classifier/data/raw/malware_dataset_test.csv
2025-05-18 17:19:28,373 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Train shape: (19116, 98), Test shape: (4779, 98)
In [37]:
train_df.head()
Out[37]:
filename size md5 sha256 entropy is_malicious is_pe file_type is_exe is_dll object_key machine_type timestamp num_sections characteristics characteristics_flags size_of_code size_of_init_data size_of_uninit_data entry_point base_of_code image_base section_alignment file_alignment major_os_version minor_os_version major_image_version minor_image_version subsystem dll_characteristics size_of_stack_reserve size_of_heap_reserve loader_flags section_0_name section_0_entropy section_0_virt_size section_0_size section_0_chars section_0_ptr_raw_data section_1_name section_1_entropy section_1_virt_size section_1_size section_1_chars section_1_ptr_raw_data section_2_name section_2_entropy section_2_virt_size section_2_size section_2_chars section_2_ptr_raw_data section_3_name section_3_entropy section_3_virt_size section_3_size section_3_chars section_3_ptr_raw_data section_4_name section_4_entropy section_4_virt_size section_4_size section_4_chars section_4_ptr_raw_data sections_avg_entropy sections_min_entropy sections_max_entropy num_imports num_imported_dlls suspicious_imports has_exports num_exports has_resources num_resources resource_langs resource_types resource_entropy has_signature has_debug has_tls has_configuration is_signature_clean num_strings avg_string_len num_urls num_ips num_emails num_registry num_file_paths contains_unicode contains_nullbytes suspicious_pattern_count detected_patterns is_text_file line_count avg_line_length contains_base64 contains_hex_strings byte_distribution
20968 1/nb0KdOOmE7UtMiwvFNmRwTqvfXKVMVGd.exe 571392 ed125c3cecce28197ac78d02b2b726dc 068f8f5419192944a9428ea625fe56e1e8ad5cc3554798... 7.73 1 1 exe 1 0 1/nb0KdOOmE7UtMiwvFNmRwTqvfXKVMVGd.exe 332.00 1595948062.00 3.00 270.00 IMAGE_FILE_EXECUTABLE_IMAGE|IMAGE_FILE_LINE_NU... 568832.00 2048.00 0.00 576714.00 8192.00 4194304.00 8192.00 512.00 4.00 0.00 0.00 0.00 2.00 34112.00 1048576.00 1048576.00 0.00 .text 7.74 568528.00 568832.00 1610612768.00 512.00 .reloc 0.10 12.00 512.00 1107296320.00 569344.00 .rsrc 4.38 1464.00 1536.00 1073741888.00 569856.00 NaN 0.00 0.00 0.00 0.00 0.00 NaN 0.00 0.00 0.00 0.00 0.00 4.07 0.10 7.74 1.00 1.00 0.00 0 0.00 1 2.00 0.00 1.00 4.17 0 0 0 0 0 5520.00 15.15 0.00 11.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10432 1/5541yjpjyOvUqiXjg5mtsk1IJMilPaUD.exe 724480 89d1c5b0a8b0b1f9c16580c8c2715a86 b6072e84d6cfb921a3fb0a38bc13e148a308b7b4158cd9... 7.15 1 1 exe 1 0 1/5541yjpjyOvUqiXjg5mtsk1IJMilPaUD.exe 332.00 708992537.00 8.00 33166.00 IMAGE_FILE_EXECUTABLE_IMAGE|IMAGE_FILE_LINE_NU... 417792.00 305664.00 0.00 421408.00 4096.00 4194304.00 4096.00 512.00 4.00 0.00 0.00 0.00 2.00 0.00 1048576.00 1048576.00 0.00 CODE 6.61 417384.00 417792.00 1610612768.00 1024.00 DATA 3.91 4724.00 5120.00 3221225536.00 418816.00 BSS 0.00 3317.00 0.00 3221225472.00 423936.00 .idata 5.04 8688.00 8704.00 3221225536.00 423936.00 .tls 0.00 16.00 0.00 3221225472.00 432640.00 3.73 0.00 7.44 384.00 8.00 5.00 0 0.00 1 404.00 0.00 1.00 6.75 0 0 1 0 0 8830.00 8.06 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20465 1/leGvSQva4e3YV52SQpSfaXLPdNdZDF7T.exe 178890 81765089205fd56e1fb8551217c7aae4 ed656132c965b692b3b0906e8ffad4f9d431a33f22653d... 5.49 1 1 exe 1 0 1/leGvSQva4e3YV52SQpSfaXLPdNdZDF7T.exe 332.00 1597988712.00 6.00 270.00 IMAGE_FILE_EXECUTABLE_IMAGE|IMAGE_FILE_LINE_NU... 61440.00 147456.00 0.00 28576.00 4096.00 4194304.00 4096.00 4096.00 4.00 0.00 0.00 0.00 2.00 0.00 1048576.00 1048576.00 0.00 .text 6.52 59438.00 61440.00 1610612768.00 4096.00 .rdata 3.08 5840.00 8192.00 1073741888.00 65536.00 .data 3.09 12296.00 8192.00 3221225536.00 73728.00 .idata 3.53 2334.00 4096.00 3221225536.00 81920.00 .rsrc 4.86 106720.00 110592.00 1073741888.00 86016.00 3.51 0.00 6.52 83.00 5.00 2.00 0 0.00 1 5.00 0.00 1.00 2.60 0 1 0 0 0 992.00 7.46 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16759 1/Vw2RIkAUVD38nXC7xmVYEDMB5NnvN45h.exe 976384 aaf02255794de006522a31b1e4a84d23 77cd50a78f234331630b2a437f8b01a7cbeee5d74b0ac4... 6.97 1 1 exe 1 0 1/Vw2RIkAUVD38nXC7xmVYEDMB5NnvN45h.exe 332.00 708992537.00 8.00 33166.00 IMAGE_FILE_EXECUTABLE_IMAGE|IMAGE_FILE_LINE_NU... 637440.00 337920.00 0.00 641228.00 4096.00 4194304.00 4096.00 512.00 4.00 0.00 0.00 0.00 2.00 0.00 1048576.00 1048576.00 0.00 CODE 6.62 637204.00 637440.00 1610612768.00 1024.00 DATA 4.20 7560.00 7680.00 3221225536.00 638464.00 BSS 0.00 3413.00 0.00 3221225472.00 646144.00 .idata 4.87 9218.00 9728.00 3221225536.00 646144.00 .tls 0.00 16.00 0.00 3221225472.00 655872.00 3.72 0.00 7.22 402.00 10.00 5.00 0 0.00 1 317.00 0.00 1.00 6.26 0 0 1 0 0 12456.00 8.74 0.00 0.00 0.00 0.00 0.00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2718 0/IMrn5gOSgoXbM4hipCIis0DBrRG9onG5.dll 61440 cf6fe5f60bdb122a741dc7f045247ea3 11147ba1376bf82a0547e7583dd16cfc2fb2d60a1138f6... 4.75 0 1 dll 0 1 0/IMrn5gOSgoXbM4hipCIis0DBrRG9onG5.dll 332.00 1377160472.00 3.00 8482.00 IMAGE_FILE_EXECUTABLE_IMAGE|IMAGE_FILE_LARGE_A... 49152.00 8192.00 0.00 54126.00 8192.00 268435456.00 8192.00 4096.00 4.00 0.00 0.00 0.00 3.00 34144.00 1048576.00 1048576.00 0.00 .text 5.51 45940.00 49152.00 1610612768.00 4096.00 .rsrc 1.22 1152.00 4096.00 1073741888.00 53248.00 .reloc 0.01 12.00 4096.00 1107296320.00 57344.00 NaN 0.00 0.00 0.00 0.00 0.00 NaN 0.00 0.00 0.00 0.00 0.00 2.25 0.01 5.51 1.00 1.00 0.00 0 0.00 1 1.00 0.00 1.00 3.54 0 1 0 0 0 876.00 20.32 0.00 3.00 0.00 0.00 1.00 0.00 0.00 NaN NaN 0.00 NaN NaN 0.00 0.00 NaN
In [38]:
test_insights = calculate_pe_statistics(test_df)
logging.info("Test Set Insights:")
logging.info(test_insights)
2025-05-18 17:19:28,583 - root - INFO - Test Set Insights:
2025-05-18 17:19:28,583 - root - INFO - {'total_samples': 4779, 'malicious_count': 2904, 'benign_count': 1875, 'malware_ratio': 60.76585059635907, 'feature_count': 98, 'numeric_features': 86, 'categorical_features': 12, 'section_features': 31, 'security_features': 4, 'missing_values': 42364, 'missing_value_columns': 15, 'memory_usage': '3.61 MB', 'has_timestamps': True, 'unique_machine_types': 5, 'avg_file_size': {'value': 506699.3008997698, 'metric': 'bytes'}, 'malware_to_benign_ratio': 1.5488}
In [39]:
train_insights = calculate_pe_statistics(train_df)
logging.info("Training Set Insights:")
logging.info(train_insights)
2025-05-18 17:19:28,666 - root - INFO - Training Set Insights:
2025-05-18 17:19:28,666 - root - INFO - {'total_samples': 19116, 'malicious_count': 11737, 'benign_count': 7379, 'malware_ratio': 61.39882820673781, 'feature_count': 98, 'numeric_features': 86, 'categorical_features': 12, 'section_features': 31, 'security_features': 4, 'missing_values': 169956, 'missing_value_columns': 15, 'memory_usage': '14.44 MB', 'has_timestamps': True, 'unique_machine_types': 6, 'avg_file_size': {'value': 501682.720862105, 'metric': 'bytes'}, 'malware_to_benign_ratio': 1.5905949315625423}

Detailed Dataset Insights

The test set contains 4,779 samples (2,904 malicious, 1,875 benign) and the training set 19,116 samples (11,737 malicious, 7,379 benign), an 80/20 split with similar malware ratios (60.77% vs 61.40%). Files average 507 KB in the test set and 502 KB in the training set. The test set spans 5 unique machine types (training: 6) and has 42,364 missing values across 15 columns (training: 169,956 missing values in 15 columns).

The test set occupies 3.61 MB of memory (training: 14.44 MB). Both splits keep all 98 columns: 86 numeric and 12 categorical. Of these, 31 are section features and 4 are security features; these two groups are subsets of the 98 columns rather than additional ones. Timestamps are present in both splits.
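The ratios quoted above can be re-derived from the class counts in the log output. The following is a minimal sketch that uses only those counts; the helper name `class_balance` is illustrative and is not part of the project's package.

```python
# Class counts copied from the calculate_pe_statistics log output above.
splits = {
    "train": {"malicious": 11737, "benign": 7379},
    "test": {"malicious": 2904, "benign": 1875},
}


def class_balance(malicious: int, benign: int) -> dict:
    """Return the sample total, malware percentage, and malware-to-benign ratio."""
    total = malicious + benign
    return {
        "total": total,
        "malware_pct": 100 * malicious / total,
        "malware_to_benign": malicious / benign,
    }


summary = {name: class_balance(**counts) for name, counts in splits.items()}
all_samples = summary["train"]["total"] + summary["test"]["total"]
test_fraction = summary["test"]["total"] / all_samples

for name, stats in summary.items():
    print(
        f"{name}: {stats['total']} samples, "
        f"{stats['malware_pct']:.2f}% malicious, "
        f"ratio {stats['malware_to_benign']:.4f}"
    )
print(f"test fraction: {test_fraction:.3f}")
```

A check like this is cheap to keep next to the prose so that quoted percentages stay in sync with the logged counts.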

Data Type Analysis

In [40]:
# Display column types
result = display_column_types(train_df)
print(type(result))
result
<class 'pandas.core.frame.DataFrame'>
Out[40]:
filename size md5 sha256 entropy is_malicious is_pe file_type is_exe is_dll object_key machine_type timestamp num_sections characteristics characteristics_flags size_of_code size_of_init_data size_of_uninit_data entry_point base_of_code image_base section_alignment file_alignment major_os_version minor_os_version major_image_version minor_image_version subsystem dll_characteristics size_of_stack_reserve size_of_heap_reserve loader_flags section_0_name section_0_entropy section_0_virt_size section_0_size section_0_chars section_0_ptr_raw_data section_1_name section_1_entropy section_1_virt_size section_1_size section_1_chars section_1_ptr_raw_data section_2_name section_2_entropy section_2_virt_size section_2_size section_2_chars section_2_ptr_raw_data section_3_name section_3_entropy section_3_virt_size section_3_size section_3_chars section_3_ptr_raw_data section_4_name section_4_entropy section_4_virt_size section_4_size section_4_chars section_4_ptr_raw_data sections_avg_entropy sections_min_entropy sections_max_entropy num_imports num_imported_dlls suspicious_imports has_exports num_exports has_resources num_resources resource_langs resource_types resource_entropy has_signature has_debug has_tls has_configuration is_signature_clean num_strings avg_string_len num_urls num_ips num_emails num_registry num_file_paths contains_unicode contains_nullbytes suspicious_pattern_count detected_patterns is_text_file line_count avg_line_length contains_base64 contains_hex_strings byte_distribution
Data Type object int64 object object float64 int64 int64 object int64 int64 object float64 float64 float64 float64 object float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 object float64 float64 float64 float64 float64 object float64 float64 float64 float64 float64 object float64 float64 float64 float64 float64 object float64 float64 float64 float64 float64 object float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 int64 float64 int64 float64 float64 float64 float64 int64 int64 int64 int64 int64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 object float64 float64 float64 float64 float64 float64

In [41]:
train_df, test_df, stats = optimize_memory_usage(
    train_df=train_df, test_df=test_df, categorical_threshold=0.5, verbose=True
)
2025-05-18 17:19:28,940 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Initial memory usage - Train: 28.03MB, Test: 7.01MB
2025-05-18 17:19:29,146 - windows_malware_classifier.preprocessing.data_preparation_tools - INFO - Optimization complete - Train: 14.43MB reduced (51.5%), Test: 3.55MB reduced (50.7%) | Conversions - Categorical: 8, Numeric: 78, Boolean: 8
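The internals of `optimize_memory_usage` are not shown in this notebook. As a rough illustration of the kind of decisions such a step typically makes, the sketch below picks the narrowest integer type that holds a column's range and flags low-cardinality columns as categorical. Both rules, including the reading of `categorical_threshold` as a unique-value ratio, are assumptions rather than the package's confirmed behavior.

```python
# Signed integer types ordered from narrowest to widest.
INT_RANGES = [
    ("int8", -(2**7), 2**7 - 1),
    ("int16", -(2**15), 2**15 - 1),
    ("int32", -(2**31), 2**31 - 1),
    ("int64", -(2**63), 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer type that fits all values."""
    lo, hi = min(values), max(values)
    for name, dtype_min, dtype_max in INT_RANGES:
        if dtype_min <= lo and hi <= dtype_max:
            return name
    raise OverflowError("values do not fit in int64")


def should_be_categorical(values, threshold=0.5):
    """Assumed rule: convert when the share of distinct values is below threshold."""
    return len(set(values)) / len(values) < threshold


# Small illustrative columns; the size values come from the train_df.head() output.
print(smallest_int_dtype([0, 1, 1, 0]))  # a 0/1 flag such as is_malicious
print(smallest_int_dtype([571392, 724480, 178890]))  # file sizes in bytes
print(should_be_categorical(["exe", "exe", "dll", "exe", "exe"]))
```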
In [ ]:
analysis_results = analyze_dataset_quality(
    train_df=train_df, test_df=test_df, verbose=True, parts_to_display=[1, 2, 3]
)
In [43]:
cross_set_duplicates = set(train_df["sha256"]).intersection(set(test_df["sha256"]))
train_df = train_df[~train_df["sha256"].isin(cross_set_duplicates)]
test_df = test_df[~test_df["sha256"].isin(cross_set_duplicates)]
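The cross-set filtering in the cell above removes every sample whose SHA-256 hash occurs in both splits, from both sides. The same logic is sketched below on plain Python records so that it can be checked without the dataset; the hashes are made-up placeholders and `drop_cross_set_duplicates` is a hypothetical helper name.

```python
def drop_cross_set_duplicates(train_rows, test_rows, key="sha256"):
    """Remove records whose hash occurs in both splits, from both splits."""
    shared = {row[key] for row in train_rows} & {row[key] for row in test_rows}
    train_clean = [row for row in train_rows if row[key] not in shared]
    test_clean = [row for row in test_rows if row[key] not in shared]
    return train_clean, test_clean, shared


# Made-up hashes, for illustration only.
train_rows = [
    {"sha256": "aa", "is_malicious": 1},
    {"sha256": "bb", "is_malicious": 0},
    {"sha256": "cc", "is_malicious": 1},
]
test_rows = [
    {"sha256": "bb", "is_malicious": 0},
    {"sha256": "dd", "is_malicious": 1},
]

train_clean, test_clean, shared = drop_cross_set_duplicates(train_rows, test_rows)
print(f"shared hashes: {sorted(shared)}")
print(f"train: {len(train_rows)} -> {len(train_clean)}")
print(f"test:  {len(test_rows)} -> {len(test_clean)}")
```

Dropping shared hashes from both splits, as the notebook does, discards those samples entirely. Removing them from the training set only would keep them available for evaluation; which option is preferable depends on how much test data can be spared.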
In [44]:
train_df, test_df = impute_numeric_neural_network(train_df.copy(), test_df.copy())
In [45]:
analysis_results_ = analyze_dataset_quality(
    train_df=train_df, test_df=test_df, verbose=True, parts_to_display=[1, 2]
)

Duplicate Analysis

   Dataset    MD5 Duplicates  SHA256 Duplicates
0  Train      38              38
1  Test       1               1
2  Cross_set  39              39

Potential Malware Variants

   Dataset  Count
0  Train    70
1  Test     9

Missing Values Analysis

No missing values found across all feature categories.
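The implementation of `impute_numeric_neural_network` is not shown here. One common pattern that would produce a report like the one above is to learn fill values on the training split only, apply them to both splits, and keep an indicator column recording where a value was missing. The sketch below illustrates that pattern with median imputation; it is an assumption about the general approach, not a description of the package's function, and the helper names are hypothetical.

```python
from statistics import median


def fit_medians(rows, columns):
    """Learn one median per column from the training rows, ignoring None.

    Raises statistics.StatisticsError if a column has no observed values.
    """
    return {c: median(r[c] for r in rows if r[c] is not None) for c in columns}


def impute(rows, medians):
    """Fill None with the learned median and add a <column>_missing indicator."""
    imputed = []
    for row in rows:
        new_row = dict(row)
        for column, fill_value in medians.items():
            new_row[f"{column}_missing"] = int(row[column] is None)
            if row[column] is None:
                new_row[column] = fill_value
        imputed.append(new_row)
    return imputed


# Made-up values, for illustration only.
train_rows = [{"line_count": 10}, {"line_count": None}, {"line_count": 30}]
test_rows = [{"line_count": None}, {"line_count": 5}]

medians = fit_medians(train_rows, ["line_count"])  # fitted on train only
train_imputed = impute(train_rows, medians)
test_imputed = impute(test_rows, medians)
print(medians)
print(test_imputed)
```

Fitting on the training split and reusing the fitted values on the test split avoids leaking test-set statistics into preprocessing.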

In [46]:
results = calculate_shap_values(
    df=train_df,
    target="is_malicious",
    n_estimators=100,
    binary_threshold=0.05,
    max_samples=10000,  # Use default value instead of None
    background_samples=50,
    batch_size=1500,
    random_state=RANDOM_SEED,
)
================================================================================
                            Feature Analysis Summary                            
================================================================================

Dataset Information

Total samples: 18,952

Feature Distribution

- Numerical    :  29 features
- Categorical  :   6 features
- Binary       : 156 features
--------------------------------------------------------------------------------
 97%|=================== | 2914/3000 [00:15<00:00]       
2025-05-18 17:21:19,447 - root - INFO - ✓ Successfully analyzed numerical features
2025-05-18 17:21:19,454 - root - INFO - ✓ Successfully analyzed categorical features
2025-05-18 17:21:27,007 - root - INFO - ✓ Successfully analyzed binary features
In [47]:
display_importance_rankings(results.importance_scores)
display_shap_impacts(results.shap_values)
====================================================================================================
                                  Top Feature Importance Analysis                                   
====================================================================================================

Top 15 Most Important Features Overall

feature importance feature_type
5 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0791 categorical
90 section_3_name_.pdata 0.0756 categorical
25 sections_max_entropy 0.0614 numerical
20 section_4_entropy 0.0604 numerical
15 characteristics_flags_IMAGE_FILE_RELOCS_STRIPP... 0.0517 categorical
4681 is_text_file_missing_1 0.0440 binary
17 section_3_entropy 0.0414 numerical
28 avg_string_len 0.0413 numerical
7 section_0_entropy 0.0388 numerical
4684 contains_base64_missing_1 0.0387 binary
0 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0385 categorical
118 section_4_name_unknown 0.0353 categorical
21 section_4_virt_size 0.0344 numerical
1 entropy 0.0333 numerical
2 timestamp 0.0329 numerical

Top 15 Numerical Features

feature importance
25 sections_max_entropy 0.0614
20 section_4_entropy 0.0604
17 section_3_entropy 0.0414
28 avg_string_len 0.0413
7 section_0_entropy 0.0388
21 section_4_virt_size 0.0344
1 entropy 0.0333
2 timestamp 0.0329
22 section_4_ptr_raw_data 0.0189
18 section_3_virt_size 0.0145
0 size 0.0138
15 section_2_virt_size 0.0134
3 size_of_code 0.0113
8 section_0_virt_size 0.0110
19 section_3_ptr_raw_data 0.0105

Top 15 Categorical Features

feature importance
5 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0791
90 section_3_name_.pdata 0.0756
15 characteristics_flags_IMAGE_FILE_RELOCS_STRIPP... 0.0517
0 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0385
118 section_4_name_unknown 0.0353
8 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0206
11 characteristics_flags_IMAGE_FILE_RELOCS_STRIPP... 0.0191
1 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0175
43 section_1_name_.data 0.0146
93 section_3_name_.rsrc 0.0124
104 section_4_name_.didat 0.0115
109 section_4_name_.pdata 0.0107
98 section_3_name_unknown 0.0099
72 section_2_name_.rsrc 0.0093
69 section_2_name_.rdata 0.0080

Top 15 Binary Features

feature importance
4681 is_text_file_missing_1 0.0440
4684 contains_base64_missing_1 0.0387
4678 contains_nullbytes_missing_1 0.0327
4677 contains_unicode_missing_1 0.0312
8 machine_type_34404.0 0.0234
1 file_type_exe 0.0231
4685 contains_hex_strings_missing_1 0.0210
3 is_dll_1.0 0.0207
682 subsystem_2.0 0.0164
683 subsystem_3.0 0.0134
79 characteristics_8226.0 0.0128
3291 has_exports_1.0 0.0126
758 size_of_stack_reserve_1048576.0 0.0116
4019 has_debug_1.0 0.0116
618 major_os_version_4.0 0.0089
====================================================================================================
                                      Feature Impact Analysis                                       
====================================================================================================

Top Features that Indicate Malware

feature mean_impact feature_type
5 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0149 categorical
90 section_3_name_.pdata 0.0222 categorical
25 sections_max_entropy 0.0034 numerical
20 section_4_entropy 0.0029 numerical
15 characteristics_flags_IMAGE_FILE_RELOCS_STRIPP... 0.0018 categorical
4681 is_text_file_missing_1 0.0109 binary
28 avg_string_len 0.0017 numerical
7 section_0_entropy 0.0029 numerical
4684 contains_base64_missing_1 0.0075 binary
0 characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... 0.0032 categorical
118 section_4_name_unknown 0.0012 categorical
21 section_4_virt_size 0.0057 numerical
1 entropy 0.0039 numerical
2 timestamp 0.0050 numerical
4678 contains_nullbytes_missing_1 0.0064 binary

Top Features that Indicate Benign Software

feature mean_impact feature_type
17 section_3_entropy -0.0010 numerical
11 characteristics_flags_IMAGE_FILE_RELOCS_STRIPP... -0.0003 categorical
93 section_3_name_.rsrc -0.0004 categorical
3 size_of_code -0.0030 numerical
8 section_0_virt_size -0.0013 numerical
19 section_3_ptr_raw_data -0.0018 numerical
5 entry_point -0.0032 numerical
98 section_3_name_unknown -0.0002 categorical
72 section_2_name_.rsrc -0.0008 categorical
618 major_os_version_4.0 -0.0001 binary
10 section_1_entropy -0.0007 numerical
24 sections_min_entropy -0.0005 numerical
14 section_2_entropy -0.0012 numerical
46 characteristics_271.0 -0.0001 binary
16 section_2_ptr_raw_data -0.0015 numerical

Interpretation of SHAP Analysis

SHAP analysis reveals the features that drive the model's decisions. The top features span the categorical, numerical, and binary groups and mix structural, entropy-based, and string-content indicators. Some features treated as numeric are really codes or dates: machine_type and subsystem are enumerated PE header codes, and timestamp is the header's link-time stamp rather than a file-system creation time. In the tables above, a positive mean impact pushes predictions toward malware and a negative mean impact pushes them toward benign.
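To make the signed impacts concrete, the sketch below computes exact Shapley values for a tiny scoring function by enumerating every feature coalition, which is the quantity SHAP approximates for real models. The scoring function, its three inputs, and the baseline are hypothetical and unrelated to the notebook's model; `calculate_shap_values` may well aggregate its scores differently.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: features outside a coalition take baseline values."""
    n = len(instance)

    def value(subset):
        x = [instance[j] if j in subset else baseline[j] for j in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_score(x):
    """Hypothetical score with one interaction term (not the notebook's model)."""
    return 0.1 * x[0] - 0.05 * x[1] + 0.02 * x[0] * x[2]


instance = [7.5, 1.0, 1.0]
baseline = [5.0, 0.0, 0.0]
phi = shapley_values(toy_score, instance, baseline)

print("per-feature contributions:", [round(p, 4) for p in phi])
print("sum of contributions:     ", round(sum(phi), 4))
print("score minus baseline:     ", round(toy_score(instance) - toy_score(baseline), 4))
```

The contributions always sum to the difference between the instance's score and the baseline score. Averaging signed contributions over many samples gives a directional impact, while averaging their absolute values gives a magnitude-style ranking; a feature can therefore rank high in importance while its mean signed impact stays close to zero.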

Section-related features emerge as strong discriminators. section_3_name_.pdata (a .pdata section in section slot 3) ranks second overall with an importance of 0.0756 and has the largest mean impact toward malware (0.0222). Entropy-based features also rank highly: sections_max_entropy (0.0614 importance, 0.0034 impact), section_4_entropy (0.0604 importance, 0.0029 impact), and section_0_entropy (0.0388 importance, 0.0029 impact) all lean toward malware, though with small mean impacts. section_3_entropy (0.0414 importance, -0.0010 impact) is the exception and leans slightly toward benign files.

Executable characteristics carry significant importance. The output truncates these feature names, but the top-ranked feature overall is a characteristics_flags combination beginning with IMAGE_FILE_EXECUTABLE_IMAGE (0.0791 importance, 0.0149 impact), and a combination beginning with IMAGE_FILE_RELOCS_STRIPPED ranks fifth (0.0517 importance, 0.0018 impact). Both lean toward malware, which suggests distinct compilation or linking patterns in malicious executables. A second IMAGE_FILE_EXECUTABLE_IMAGE combination (0.0385 importance, 0.0032 impact) points the same way, whereas another IMAGE_FILE_RELOCS_STRIPPED combination (0.0191 importance, -0.0003 impact) leans marginally toward benign.
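The characteristics column is a bit mask from the COFF file header, and the characteristics_flags strings are its decoded form. A short decoder makes the numeric values in the tables easier to read. The flag values below follow the Microsoft PE format specification; the function name and the assumption that flags are joined with "|" in ascending bit order (as the truncated strings in the train_df.head() output suggest) are illustrative.

```python
# COFF file header Characteristics flags (Microsoft PE format specification).
CHARACTERISTICS_FLAGS = {
    0x0001: "IMAGE_FILE_RELOCS_STRIPPED",
    0x0002: "IMAGE_FILE_EXECUTABLE_IMAGE",
    0x0004: "IMAGE_FILE_LINE_NUMS_STRIPPED",
    0x0008: "IMAGE_FILE_LOCAL_SYMS_STRIPPED",
    0x0010: "IMAGE_FILE_AGGRESSIVE_WS_TRIM",
    0x0020: "IMAGE_FILE_LARGE_ADDRESS_AWARE",
    0x0080: "IMAGE_FILE_BYTES_REVERSED_LO",
    0x0100: "IMAGE_FILE_32BIT_MACHINE",
    0x0200: "IMAGE_FILE_DEBUG_STRIPPED",
    0x0400: "IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP",
    0x0800: "IMAGE_FILE_NET_RUN_FROM_SWAP",
    0x1000: "IMAGE_FILE_SYSTEM",
    0x2000: "IMAGE_FILE_DLL",
    0x4000: "IMAGE_FILE_UP_SYSTEM_ONLY",
    0x8000: "IMAGE_FILE_BYTES_REVERSED_HI",
}


def decode_characteristics(value):
    """Return the flag names set in a characteristics value, lowest bit first."""
    mask = int(value)  # the column is stored as float64 in the DataFrame
    return [name for bit, name in CHARACTERISTICS_FLAGS.items() if mask & bit]


# 270.0 and 8482.0 appear in the train_df.head() output above; 271 and 8226
# appear as one-hot features (characteristics_271.0, characteristics_8226.0).
for raw in (270.0, 271.0, 8226.0, 8482.0):
    print(int(raw), "|".join(decode_characteristics(raw)))
```

Decoding shows, for example, that 270 and 271 differ only in IMAGE_FILE_RELOCS_STRIPPED, which helps when interpreting one-hot features built from the raw numeric value.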

Core binary and numerical characteristics add further signal. avg_string_len (0.0413 importance, 0.0017 impact) slightly favors malware. The features ending in _missing_1 appear to be missing-value indicators: is_text_file_missing_1 (0.0440 importance, 0.0109 impact), contains_base64_missing_1 (0.0387 importance, 0.0075 impact), and contains_nullbytes_missing_1 (0.0327 importance, 0.0064 impact) point toward malware when the underlying text-analysis field is absent, and contains_unicode_missing_1 (0.0312 importance) and contains_hex_strings_missing_1 (0.0210 importance) also rank among the top binary features. Because missingness may reflect how the samples were collected rather than a property of malware itself, these indicators deserve scrutiny before they are relied on. The timestamp (0.0329 importance, 0.0050 impact) leans toward malware on average, while entry_point (-0.0032 impact) and size_of_code (0.0113 importance, -0.0030 impact) lean toward benign files.

Based on these findings, we group our features by their importance and impact:

  1. High Importance Categorical Features

    • characteristics_flags_IMAGE_FILE_EXECUTABLE_IM... (0.0791 importance, 0.0149 impact) - Flag combination that includes the executable-image flag (name truncated in the output)
    • section_3_name_.pdata (0.0756 importance, 0.0222 impact) - .pdata as the name of section 3
    • characteristics_flags_IMAGE_FILE_RELOCS_STRIPP... (0.0517 importance, 0.0018 impact) - Flag combination that includes the relocations-stripped flag (name truncated in the output)
    • section_4_name_unknown (0.0353 importance, 0.0012 impact) - Unknown section 4 name
  2. Critical Numerical Features

    • sections_max_entropy (0.0614 importance, 0.0034 impact) - Maximum section entropy
    • section_4_entropy (0.0604 importance, 0.0029 impact) - Entropy of section 4 (the fifth section, counting from zero)
    • section_3_entropy (0.0414 importance, -0.0010 impact) - Entropy of section 3
    • avg_string_len (0.0413 importance, 0.0017 impact) - Average length of embedded strings
    • section_0_entropy (0.0388 importance, 0.0029 impact) - Entropy of section 0 (the first section)
    • timestamp (0.0329 importance, 0.0050 impact) - PE header link-time stamp
  3. Key Binary Indicators

    • is_text_file_missing_1 (0.0440 importance, 0.0109 impact) - is_text_file value missing
    • contains_base64_missing_1 (0.0387 importance, 0.0075 impact) - contains_base64 value missing
    • contains_nullbytes_missing_1 (0.0327 importance, 0.0064 impact) - contains_nullbytes value missing
    • contains_unicode_missing_1 (0.0312 importance, impact not shown) - contains_unicode value missing
    • file_type_exe (0.0231 importance, impact not shown) - Executable file indicator
    • subsystem_2.0 (0.0164 importance, impact not shown) - Windows GUI subsystem
  4. Supporting Features

    • section_4_virt_size (0.0344 importance, 0.0057 impact) - Virtual size of section 4
    • section_2_virt_size (0.0134 importance, impact not shown) - Virtual size of section 2
    • size_of_code (0.0113 importance, -0.0030 impact) - Size of code section
    • entry_point (outside the top 15 numerical features by importance, -0.0032 impact) - Entry point address

Feature Distribution Analysis

The next phase examines the distributions of these key features through visualization, concentrating on features that combine high importance scores with a clear directional impact on classification.

In [48]:
entropy_numerical = ["sections_max_entropy", "section_4_entropy", "section_0_entropy"]

size_numerical = ["entry_point", "size_of_code", "size_of_stack_reserve"]

behavior_numerical = ["timestamp", "avg_string_len", "machine_type", "subsystem"]

section_categorical = ["file_type", "section_3_name", "characteristics_flags"]

all_numerical = entropy_numerical + size_numerical + behavior_numerical
all_categorical = section_categorical

Numerical Entropy Features

In [49]:
fig = plot_feature_histograms(
    df=train_df,
    features=entropy_numerical,
    target="is_malicious",
    nbins=40,
    custom_layout={"title_text": "Distribution of Numerical Entropy Features by Class"},
    save_path="../images/eda/numerical_entropy_distribution.png",
)
In [50]:
Image(filename="../images/eda/numerical_entropy_distribution.png")
Out[50]:

This plot displays the distributions of three entropy-related characteristics (sections_max_entropy, section_4_entropy, and section_0_entropy) for both benign (light blue) and malicious (darker blue) files.

There are distinct patterns in how entropy is distributed across different file sections. For sections_max_entropy, both types show peaks around value 6, with malicious files exhibiting multiple sharp spikes between 6-7 and a particularly prominent peak near 6.5. The section_4_entropy shows a very concentrated spike near 0 for both classes, with a small secondary peak around 4 for benign files. In section_0_entropy, we see a complex distribution with multiple peaks, where malicious files show distinctive spikes around values 6 and 7, while benign files have a more prominent peak around 6.

These entropy patterns provide valuable insights for malware detection. The higher entropy values in malicious files, particularly in section_0 and the maximum section entropy, could indicate encryption, packing, or other obfuscation techniques commonly used by malware. The near-zero entropy in section_4 across both classes suggests this section typically contains more predictable or structured data.

Looking at these entropy distributions alongside other binary characteristics like section permissions or import patterns could provide an even more robust approach to identifying malicious files, as entropy alone shows some overlap between benign and malicious samples.
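For reference, the entropy values discussed here are on a 0 to 8 scale, which is the range of Shannon entropy measured in bits per byte. The sketch below shows that calculation; it is a standard formula and is assumed, not confirmed, to match how the dataset's entropy columns were produced.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1 / p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


constant = b"\x00" * 4096  # e.g. zero padding: fully predictable
text_like = b"This program cannot be run in DOS mode." * 100
uniform = bytes(range(256)) * 16  # every byte value equally frequent

for label, blob in (("constant", constant), ("text", text_like), ("uniform", uniform)):
    print(f"{label:>8}: {shannon_entropy(blob):.2f} bits/byte")
```

Compressed or encrypted content approaches the upper end of the scale, which is why high section entropy is often treated as a hint of packing, while padding and sparse data sit near zero.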

Numerical Size-Related Features

In [51]:
fig = plot_feature_histograms(
    df=train_df,
    features=size_numerical,
    target="is_malicious",
    nbins=40,
    custom_layout={"title_text": "Distribution of Size-Related Features by Class"},
    save_path="../images/eda/numerical_size_distribution.png",
)
In [52]:
Image(filename="../images/eda/numerical_size_distribution.png")
Out[52]:

This plot displays the distributions of three size-related characteristics (entry_point, size_of_code, and size_of_stack_reserve) for both benign (light blue) and malicious (darker blue) files.

Both benign and malicious files show some similar patterns, but with notable distinctions. For entry_point, both types show sharp peaks near 0, with malicious files exhibiting a significantly higher initial spike. The size_of_code feature shows extremely concentrated peaks near 0 for both classes, with malicious files again showing a higher peak. For size_of_stack_reserve, the distribution is more spread out, with both classes showing multiple peaks in the 0-10M range, though malicious files display a more prominent peak near 0.

These patterns in size-related features reveal potentially important indicators for malware detection. The consistently higher peaks for malicious files near 0 across multiple features could suggest attempts at minimizing file footprints or specific compilation patterns associated with malicious software. The similar distribution shapes but different peak heights indicate that while these features alone might not be definitive, they could be valuable when combined with other indicators.

To build a more comprehensive detection approach, examining how these size-related characteristics correlate with other file attributes, such as structural features or import patterns, could provide additional insights for distinguishing malicious files.

Numerical Behavioral Features¶

In [53]:
fig = plot_feature_histograms(
    df=train_df,
    features=behavior_numerical,
    target="is_malicious",
    nbins=40,
    custom_layout={"title_text": "Distribution of Behavioral Features by Class"},
    save_path="../images/eda/numerical_behavioral_features.png",
)
In [54]:
Image(filename="../images/eda/numerical_behavioral_features.png")
Out[54]:
[Figure: Distribution of Behavioral Features by Class]

This plot displays the distributions of four behavior-related characteristics (timestamp, avg_string_len, machine_type, and subsystem) for both benign (light blue) and malicious (dark blue) files.

It appears that malicious files tend to show a distinct pattern in their timestamp values, exhibiting a significantly sharper peak around the 2B mark compared to benign files. For avg_string_len, both types of files show very sharp spikes near 0, though the benign files have a notably higher peak. The machine_type feature shows interesting patterns with clear spikes around the 30k mark for malicious files, while benign files have their highest peak near 0. For subsystem, both classes show multiple peaks between 0-5, with benign files generally showing higher peaks.

The pronounced differences in distribution patterns across all features suggest multiple potential indicators for distinguishing malicious files from benign ones. The timestamp clustering around 2B for malicious files could indicate coordinated creation or modification times. The machine_type distribution showing distinct peaks at different positions for malicious versus benign files is particularly noteworthy.

Given these clear distributional differences across multiple features, further investigation into how these characteristics interact might provide even stronger signals for classification. Additional analysis of other system-level attributes could help build a more comprehensive understanding of malicious file behavior patterns.

Categorical Section Characteristics¶

In [55]:
fig = plot_category_distributions(
    df=train_df,
    features=section_categorical,
    target="is_malicious",
    top_n=10,
    custom_layout={
        "title_text": "Distribution of Section Characteristics by Class",
        "height": 800,
        "width": 1600,
    },
    save_path="../images/eda/categorical_section_characteristics.png",
)
In [56]:
Image(filename="../images/eda/categorical_section_characteristics.png")
Out[56]:
[Figure: Distribution of Section Characteristics by Class]

This plot shows the distributions of three section-related characteristics (file_type, section_3_name, and characteristics_flags) for both benign and malicious files.

It appears that malicious files show distinct patterns: high concentrations in ".rsrc", ".idata", and ".reloc" sections for section_3_name, and significant presence in the IMG_EXE... flags within characteristics_flags compared to benign files. In file_type, malicious files show notable presence in "exe" (85.3%) and "dll" categories.

The differences in distribution are particularly notable in several areas: section_3_name shows benign files concentrating in ".pdata" (95.1%) and having strong presence in ".bss" (51.5%) categories. The characteristics_flags demonstrate distinct patterns with benign files showing higher percentages in several IMG_EXE variations, particularly noticeable in the middle categories. The file_type distribution shows benign files dominate the "unknown" category (100%) and have significant presence in "dos" (20%).

These observations suggest that these section characteristics could serve as strong indicators for classification. The distinct patterns in section_3_name, particularly in ".pdata" and ".bss", along with the varied distributions in characteristics_flags are especially promising for differentiating between benign and malicious files.

Following this analysis, a logical next step would be to investigate potential outliers within these distributions. Examining files that deviate significantly from the observed trends could reveal novel techniques used by malicious actors or highlight specific types of benign software that exhibit unusual section characteristics. This outlier analysis will be the focus of our subsequent investigation.

Outlier Detection (IQR)¶

In [57]:
numerical_features = train_df.select_dtypes(include=[np.number]).columns.tolist()
In [58]:
anomalies = detect_outliers_iqr(train_df)

Anomaly Detection Summary

Description Value
0 Total Records Analyzed 18952
1 Numerical Features Found 184

Feature-wise Anomaly Analysis

Feature Name Anomalies Percentage IQR Bounds Flagged Values
0 size 1144 6.04% [-775246.00, 1535842.00] [1536000.00, 4239088.00]
1 entropy 569 3.00% [3.48, 9.33] [0.13, 3.48]
7 timestamp 5169 27.27% [1083695160.50, 1908202284.50] [0.00, 4294967295.00]
8 num_sections 504 2.66% [-1.50, 10.50] [11.00, 25.00]
9 characteristics 1175 6.20% [-11694.00, 20178.00] [33166.00, 49582.00]
10 size_of_code 862 4.55% [-538880.00, 978688.00] [978944.00, 125768626.00]
11 size_of_init_data 1422 7.50% [-274432.00, 479232.00] [479744.00, 3734747915.00]
12 size_of_uninit_data 1964 10.36% [0.00, 0.00] [1.00, 137035776.00]
13 entry_point 1015 5.36% [-502775.25, 872766.75] [873230.00, 1948744943.00]
14 base_of_code 559 2.95% [-2048.00, 14336.00] [20480.00, 137039872.00]
15 image_base 4447 23.46% [-2566848512.00, 4289265664.00] [4294967296.00, 18446735277616529408.00]
16 section_alignment 2 0.01% [-2048.00, 14336.00] [2097152.00, 2097152.00]
17 file_alignment 3410 17.99% [512.00, 512.00] [16.00, 4096.00]
18 major_os_version 3504 18.49% [1.00, 9.00] [0.00, 10.00]
19 minor_os_version 1321 6.97% [0.00, 0.00] [1.00, 51.00]
20 major_image_version 45 0.24% [-7.50, 12.50] [13.00, 21315.00]
21 minor_image_version 871 4.60% [0.00, 0.00] [1.00, 26001.00]
22 subsystem 22 0.12% [0.50, 4.50] [0.00, 16.00]
24 size_of_stack_reserve 5278 27.85% [1048576.00, 1048576.00] [0.00, 33554432.00]
25 size_of_heap_reserve 471 2.49% [1048576.00, 1048576.00] [0.00, 16777216.00]
27 section_0_entropy 2161 11.40% [4.42, 7.94] [0.00, 8.00]
28 section_0_virt_size 916 4.83% [-543715.00, 995861.00] [996536.00, 137035776.00]
29 section_0_size 821 4.33% [-533504.00, 965632.00] [967168.00, 4116480.00]
30 section_0_chars 2562 13.52% [1610612768.00, 1610612768.00] [1073741888.00, 4026531904.00]
31 section_0_ptr_raw_data 3534 18.65% [-256.00, 1792.00] [2048.00, 115712.00]
33 section_1_virt_size 2067 10.91% [-74034.50, 127529.50] [127596.00, 49270652.00]
34 section_1_size 2007 10.59% [-72192.00, 124416.00] [124928.00, 4079616.00]
35 section_1_chars 4537 23.94% [268435568.00, 2415919088.00] [0.00, 3763339296.00]
36 section_1_ptr_raw_data 817 4.31% [-531904.00, 961600.00] [962048.00, 4116992.00]
38 section_2_virt_size 2419 12.76% [-33802.12, 56818.88] [56864.00, 93568008.00]
39 section_2_size 2388 12.60% [-13312.00, 23552.00] [24064.00, 11509760.00]
41 section_2_ptr_raw_data 902 4.76% [-605440.00, 1108736.00] [1108992.00, 4122112.00]
43 section_3_virt_size 3268 17.24% [-20458.50, 34097.50] [34152.00, 40349696.00]
44 section_3_size 3222 17.00% [-13824.00, 23040.00] [23552.00, 6451200.00]
45 section_3_chars 2733 14.42% [-1610612832.00, 2684354720.00] [3221225472.00, 4026531904.00]
46 section_3_ptr_raw_data 1657 8.74% [-367104.00, 611840.00] [612864.00, 11669504.00]
48 section_4_virt_size 3593 18.96% [-1932.00, 3220.00] [3224.00, 30527488.00]
49 section_4_size 3428 18.09% [-2304.00, 3840.00] [4096.00, 30527488.00]
50 section_4_chars 3098 16.35% [-1660944480.00, 2768240800.00] [3221225472.00, 4026531904.00]
51 section_4_ptr_raw_data 3012 15.89% [-205056.00, 341760.00] [342016.00, 4120576.00]
52 sections_avg_entropy 211 1.11% [1.39, 6.64] [0.00, 7.99]
53 sections_min_entropy 33 0.17% [-3.55, 5.97] [6.05, 7.99]
54 sections_max_entropy 550 2.90% [3.90, 9.44] [0.00, 3.90]
55 num_imports 938 4.95% [-246.50, 413.50] [414.00, 3314.00]
56 num_imported_dlls 1388 7.32% [-12.50, 23.50] [24.00, 92.00]
57 suspicious_imports 344 1.82% [-4.50, 7.50] [8.00, 15.00]
59 num_exports 3507 18.50% [-1.50, 2.50] [3.00, 11116.00]
60 has_resources 1545 8.15% [1.00, 1.00] [0.00, 0.00]
61 num_resources 3006 15.86% [-17.00, 31.00] [32.00, 820.00]
63 resource_types 1547 8.16% [1.00, 1.00] [0.00, 2.00]
64 resource_entropy 2215 11.69% [1.37, 5.85] [0.00, 8.00]
67 has_tls 2938 15.50% [0.00, 0.00] [1.00, 1.00]
70 num_strings 1282 6.76% [-7502.88, 14886.12] [14887.00, 128647.00]
71 avg_string_len 1312 6.92% [-3.46, 25.97] [25.98, 5406.87]
72 num_urls 4255 22.45% [-1.50, 2.50] [3.00, 1001.00]
73 num_ips 3255 17.17% [-3.00, 5.00] [6.00, 5635.00]
74 num_emails 2480 13.09% [0.00, 0.00] [1.00, 350.00]
75 num_registry 33 0.17% [0.00, 0.00] [1.00, 50.00]
76 num_file_paths 3745 19.76% [0.00, 0.00] [1.00, 6774.00]
119 section_0_name_missing 23 0.12% [0.00, 0.00] [1.00, 1.00]
125 section_1_name_missing 96 0.51% [0.00, 0.00] [1.00, 1.00]
131 section_2_name_missing 526 2.78% [0.00, 0.00] [1.00, 1.00]

Anomaly Severity Categories

Category Count
0 High Anomaly Features (>10%) 29
1 Moderate Anomaly Features (5-10%) 11
2 Low Anomaly Features (<5%) 144

Overall Statistics

Description Value
0 Total Rows with Anomalies 18390
1 Percentage of Rows with Anomalies 97.03%
2 Features with Anomalies 62
3 Total Numerical Features 184

Analysis of Outliers¶

IQR-based detection reveals large gaps between the computed bounds and the observed values across our dataset of 18,952 records and 184 numerical features. These anomalies are particularly pronounced in system configurations, section metrics, and network-related features, with 97.03% of rows (18,390) containing at least one anomaly across 62 features.

Our primary findings highlight substantial system configuration anomalies. The size_of_stack_reserve exhibits the highest anomaly rate (27.85%): its IQR bounds collapse to [1,048,576, 1,048,576] because the first and third quartiles are identical, yet actual values range from 0 to 33,554,432, indicating diverse stack allocations. Similarly, timestamp shows a 27.27% anomaly rate, with IQR bounds of [1,083,695,160.50, 1,908,202,284.50] flagging values from 0 to 4,294,967,295, reflecting extreme variation in file creation times. The image_base also stands out with 23.46% anomalies, where flagged values range from 4,294,967,296 to 18,446,735,277,616,529,408 against IQR bounds of [-2,566,848,512, 4,289,265,664]. Values at or above 2^32 are consistent with 64-bit (PE32+) images, whose preferred base address usually lies above the 32-bit range, although corrupted data cannot be ruled out for the most extreme values.

Section-level analysis uncovers distinct patterns. The section_0_entropy has an 11.40% anomaly rate, with IQR bounds of [4.42, 7.94] flagging values from 0 to 8.00, indicating variability in the first section’s complexity. Virtual size anomalies increase progressively across sections (anomaly rate, flagged range): section_0_virt_size (4.83%, [996,536, 137,035,776]), section_1_virt_size (10.91%, [127,596, 49,270,652]), section_2_virt_size (12.76%, [56,864, 93,568,008]), section_3_virt_size (17.24%, [34,152, 40,349,696]), and section_4_virt_size (18.96%, [3,224, 30,527,488]). Similarly, section_X_chars features show high anomaly rates, such as section_1_chars (23.94%, flagged [0, 3,763,339,296] vs bounds [268,435,568, 2,415,919,088]) and section_4_chars (16.35%, flagged [3,221,225,472, 4,026,531,904] vs bounds [-1,660,944,480, 2,768,240,800]), suggesting structural diversity or packing.

Network indicators reveal concerning patterns. The num_urls has a 22.45% anomaly rate, with IQR bounds of [-1.50, 2.50] flagging values from 3 to 1,001, indicating unusually high URL counts. The num_ips shows 17.17% anomalies, with values from 6 to 5,635 exceeding bounds of [-3.00, 5.00], while num_emails has 13.09% anomalies, ranging from 1 to 350 against [0, 0]. These suggest potential malicious behavior or data extraction artifacts.

Other notable anomalies (anomaly rate, flagged range vs IQR bounds) include size_of_init_data (7.50%, [479,744, 3,734,747,915] vs [-274,432, 479,232]), entry_point (5.36%, [873,230, 1,948,744,943] vs [-502,775.25, 872,766.75]), and size (6.04%, [1,536,000, 4,239,088] vs [-775,246, 1,535,842]), reflecting significant deviations in file size and structure. Of the 184 numerical features, 29 have high anomaly rates (>10%), 11 are moderate (5-10%), and 144 are low (<5%).

Importantly, for the purpose of mirroring the production environment where these anomalies are expected and relevant, we will retain all identified outliers in the subsequent correlation analysis.
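The degenerate bounds seen above (for example [0.00, 0.00] or [1,048,576, 1,048,576]) follow directly from how IQR fences are computed. The sketch below is illustrative only: the internals of detect_outliers_iqr are not shown in this notebook, so the helper names `iqr_bounds` and `flag_outliers` are hypothetical, and the multiplier k = 1.5 is an assumption. That assumption is at least consistent with the reported num_sections bounds of [-1.50, 10.50], which correspond to Q1 = 3 and Q3 = 6.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def flag_outliers(values, k=1.5):
    """Boolean mask of values that fall outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_bounds(values, k)
    return (values < lower) | (values > upper)


# Synthetic zero-inflated feature: 90% zeros, so Q1 == Q3 == 0 and the
# fences collapse to [0, 0]; every non-zero value is then flagged.
zero_inflated = np.array([0] * 90 + list(range(1, 11)))
lower, upper = iqr_bounds(zero_inflated)
n_flagged = int(flag_outliers(zero_inflated).sum())
```

When more than three quarters of a feature's values are identical, the rule flags every other value, so high anomaly rates for sparse counts and binary flags say more about the shape of the distribution than about suspicious samples.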

Correlation Analysis¶

Next, we will examine correlations between features to identify potential redundancies and relationships, focusing particularly on:

  • Section Correlations
    • Entropy correlations between sections (e.g., section_0_entropy with sections_max_entropy)
    • Size relationships across sections (e.g., section_0_size with section_1_ptr_raw_data)
    • Alignment patterns between sections (e.g., image_base with section_alignment)
  • Resource Dependencies
    • Size metrics relationships (e.g., size_of_uninit_data with section_0_virt_size)
    • Version dependencies (e.g., major_image_version with minor_image_version)
    • Memory allocation patterns (e.g., base_of_code with entry_point)
  • System Configuration Relationships
    • Resource type correlations (e.g., resource_types with resource_entropy)
    • Alignment dependencies (e.g., file_alignment with size_of_stack_reserve)
    • String count relationships (e.g., size with num_strings)

This correlation analysis will help us understand feature dependencies and potential redundancies, crucial for efficient feature selection and dimensionality reduction. Feature pairs whose absolute correlation coefficient exceeds 0.95 (|r| > 0.95) will be examined for potential consolidation, while maintaining the discriminative power needed for accurate malware detection.

We will begin by extracting feature pairs with |r| > 0.95 from our 184 numerical features, followed by detailed analysis of these highly correlated feature clusters to understand their relationships and potential redundancies in the context of malware detection.
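The pair extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions: `high_correlation_pairs` is a hypothetical helper, not the notebook's extract_high_correlations (whose implementation is not shown), and it does not separate regular features from missing indicators.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """List each feature pair once where |Pearson r| exceeds the threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so that (a, b) and (b, a) are not
    # both reported and the diagonal of ones is excluded.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "correlation"]
    # NaN comparisons evaluate to False, so masked cells drop out here.
    strong = pairs[pairs["correlation"].abs() > threshold]
    strong = strong.sort_values("correlation", key=lambda s: s.abs(), ascending=False)
    return strong.reset_index(drop=True)


# Toy data: b is an exact negative copy of a, and c is uncorrelated with a.
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [-1, -2, -3, -4, -5],
        "c": [1, 0, 1, 0, 1],
    }
)
toy_pairs = high_correlation_pairs(toy, threshold=0.95)
```

Filtering on the absolute value matters because strongly negative pairs, such as a one-hot encoded binary category, are just as redundant as strongly positive ones.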

In [59]:
numerical_features = train_df.select_dtypes(include=[np.number]).columns.tolist()
In [60]:
corr_matrix = train_df[numerical_features].corr()
In [61]:
train_df.shape
Out[61]:
(18952, 196)
In [62]:
regular_df, missing_df = extract_high_correlations(corr_matrix, threshold=0.95)

Correlation Analysis Summary

Metric Count
Total correlated pairs (threshold > 0.95) 22
Regular feature pairs 7
Missing indicator pairs 15

Regular Feature Correlations

  Feature 1 Feature 2 Correlation
0 is_exe is_dll -1.000
1 has_resources resource_types 0.999
2 image_base section_alignment 0.997
3 section_0_size section_1_ptr_raw_data 0.994
4 section_4_virt_size section_4_size 0.980
5 major_image_version minor_image_version 0.976
6 size_of_uninit_data section_0_virt_size 0.961

Missing Indicator Correlations

  Feature 1 Feature 2 Correlation
0 contains_unicode_missing contains_nullbytes_missing 1.000
1 contains_unicode_missing is_text_file_missing 1.000
2 contains_unicode_missing contains_base64_missing 1.000
3 contains_unicode_missing contains_hex_strings_missing 1.000
4 contains_nullbytes_missing is_text_file_missing 1.000
5 contains_nullbytes_missing contains_base64_missing 1.000
6 contains_nullbytes_missing contains_hex_strings_missing 1.000
7 is_text_file_missing contains_base64_missing 1.000
8 is_text_file_missing contains_hex_strings_missing 1.000
9 contains_base64_missing contains_hex_strings_missing 1.000
10 is_malicious contains_unicode_missing 0.950
11 is_malicious contains_nullbytes_missing 0.950
12 is_malicious is_text_file_missing 0.950
13 is_malicious contains_base64_missing 0.950
14 is_malicious contains_hex_strings_missing 0.950

Interpretation of Correlation Analysis¶

Building upon our correlation analysis, we’ve identified highly correlated feature pairs using a threshold of 0.95, revealing key relationships that inform our feature selection strategy. This analysis separates correlations into regular feature pairs and missing indicator pairs, providing insights into both redundancy and systematic missingness patterns.

Our analysis identified 22 highly correlated pairs (|correlation| > 0.95), with 7 involving regular features and 15 involving missing indicators. This is a focused subset compared to a potentially larger total (e.g., 2,577 pairs if a lower threshold like 0.7 were used), emphasizing only the strongest relationships:

  • Regular Feature Correlations: Seven pairs show significant redundancy among structural and configuration features.
  • Missing Indicator Correlations: Fifteen pairs, dominated by perfect correlations (1.000) among missingness flags, suggest a common underlying cause for data absence.

Regular Feature Correlations¶

  • is_exe and is_dll: Perfect negative correlation (-1.000), reflecting mutual exclusivity (a file is either an executable or a DLL). We’ll retain is_exe as it directly indicates executable status and drop is_dll.
  • has_resources and resource_types: Near-perfect correlation (0.999), as the presence of resources implies specific types. We’ll keep has_resources for its binary simplicity and exclude resource_types.
  • image_base and section_alignment: Very strong correlation (0.997), indicating aligned memory structuring. We’ll retain image_base for its fundamental role in PE files and remove section_alignment.
  • section_0_size and section_1_ptr_raw_data: High correlation (0.994), as the size of section 0 often dictates the starting point of section 1. We’ll keep section_0_size for interpretability and drop section_1_ptr_raw_data.
  • section_4_virt_size and section_4_size: Strong correlation (0.980), reflecting overlap between virtual and physical sizes. We’ll retain section_4_size for its concrete measure and exclude section_4_virt_size.
  • major_image_version and minor_image_version: High correlation (0.976), suggesting version numbers move together. We’ll keep major_image_version as the primary indicator and drop minor_image_version.
  • size_of_uninit_data and section_0_virt_size: Notable correlation (0.961), linking uninitialized data to section 0’s virtual allocation. We’ll retain size_of_uninit_data for its broader scope and exclude section_0_virt_size.

Missing Indicator Correlations¶

The 15 pairs include 10 perfect correlations (1.000) among contains_unicode_missing, contains_nullbytes_missing, is_text_file_missing, contains_base64_missing, and contains_hex_strings_missing, indicating that when one of these features is missing, the others are too—likely due to a shared extraction failure. Additionally, each of these correlates strongly (0.950) with is_malicious, suggesting missingness may be a signal for malice, though not perfectly redundant. We’ll retain these as separate binary flags to capture their predictive value, addressing missingness through imputation rather than removal.

Feature Selection Decisions¶

Only the 22 pairs above exceed the 0.95 threshold; features that do not appear in either table, such as is_pe and section_1_entropy_missing, have no correlation above that level in this dataset. Based on these results:

  • Eliminate is_dll, resource_types, section_alignment, section_1_ptr_raw_data, section_4_virt_size, minor_image_version, and section_0_virt_size due to redundancy with retained counterparts.
  • Keep missing indicators as they offer unique signals despite high correlations, leveraging imputation to handle their systematic patterns.

These decisions balance quantitative correlation strength with domain knowledge of PE file structure and malware analysis, ensuring a streamlined yet informative feature set.
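The elimination step listed above could be applied along these lines. This is only a sketch: the actual selection is deferred to the feature engineering notebook, and the constant `REDUNDANT_FEATURES` and helper `drop_redundant` are hypothetical names introduced for illustration.

```python
import pandas as pd

# One feature from each highly correlated regular pair, as listed above.
REDUNDANT_FEATURES = [
    "is_dll",
    "resource_types",
    "section_alignment",
    "section_1_ptr_raw_data",
    "section_4_virt_size",
    "minor_image_version",
    "section_0_virt_size",
]


def drop_redundant(df: pd.DataFrame, columns=REDUNDANT_FEATURES) -> pd.DataFrame:
    """Return a copy without the redundant columns; absent names are skipped."""
    return df.drop(columns=list(columns), errors="ignore")


# Tiny stand-in frame; the real train_df has many more columns.
demo = pd.DataFrame({"is_exe": [1, 0], "is_dll": [0, 1], "size": [1024, 2048]})
reduced = drop_redundant(demo)
```

Applying the same list to both the training and test frames keeps their schemas aligned.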

Statistical Significance Testing¶

Having identified key relationships through correlation analysis and feature importance (from prior SHAP analysis), we will proceed with formal statistical testing (alpha = 0.01) to validate these observations and test the following hypotheses:

Populations:

  • Population 1 (Malicious): The population of all malicious Windows PE files from which the malicious samples in this dataset were drawn.
  • Population 2 (Benign): The population of all benign Windows PE files from which the benign samples in this dataset were drawn.

Hypotheses:

  • H1 (Maximum Section Entropy): The mean sections_max_entropy (importance 0.0645) of Population 1 is significantly greater than that of Population 2.
  • H2 (Third Section Entropy): The mean section_3_entropy (importance 0.0356) of Population 1 is significantly greater than that of Population 2.
  • H3 (First Section Entropy): The mean section_0_entropy (importance 0.0368) of Population 1 is significantly greater than that of Population 2.
  • H4 (Fourth Section Entropy): The mean section_4_entropy (importance 0.0593) of Population 1 is significantly greater than that of Population 2.
  • H5 (File Type Distribution): The distribution of is_exe (importance 0.0249 as a binary proxy) differs significantly between populations.
  • H6 (PE Characteristics): The distribution of characteristics_flags_IMAGE_FILE_EXECUTABLE_IMAGE_32BIT_MACHINE (importance 0.0735) differs significantly between populations.
  • H7 (Section Name Patterns): The distribution of section_3_name_.pdata patterns (importance 0.0750) differs significantly between populations.
  • H8 (String Properties): The mean avg_string_len (importance 0.0390) differs significantly between populations.

Statistical Tests and Confidence Intervals: We will use non-parametric tests chosen according to feature type:

  • Mann-Whitney U test: For comparing entropy distributions and string length (H1-H4, H8).
  • Chi-squared test with rare category consolidation: For categorical/binary features (H5-H7).

We will calculate 99% confidence intervals for all mean differences. Effect sizes will be computed using Cohen’s d for continuous variables and Cramer’s V for categorical variables. Due to multiple testing, we will apply Bonferroni correction to maintain the family-wise error rate at α = 0.01. This rigorous statistical approach ensures our feature selection is grounded in robust evidence, accounting for the observed correlations and patterns in our dataset.
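The tests named above can be sketched with SciPy as follows. This is a minimal illustration, not the implementation of run_statistical_tests, which is not shown in this notebook. The 2×2 file_type counts are copied from the contingency table logged in this notebook's output; the entropy samples are synthetic, and the `bonferroni` helper is a hypothetical name.

```python
import numpy as np
from scipy import stats

# file_type contingency table from the logged output
# (rows: exe, dll; columns: benign, malicious).
file_type_table = np.array([[1911, 10901], [5305, 835]])

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
chi2, p_chi2, dof, expected = stats.chi2_contingency(file_type_table)

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))).
n = file_type_table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(file_type_table.shape) - 1)))

# One-sided Mann-Whitney U on synthetic entropy-like samples (illustrative only).
rng = np.random.default_rng(0)
malicious = rng.normal(loc=6.5, scale=0.8, size=500)
benign = rng.normal(loc=5.5, scale=0.8, size=500)
u_stat, p_mw = stats.mannwhitneyu(malicious, benign, alternative="greater")


def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)


corrected = bonferroni([p_mw, p_chi2])
```

With the default continuity correction, the chi-squared statistic and Cramér's V for this table agree with the values reported for H5 in the results table (8993.04 and 0.6889).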

In [63]:
results = run_statistical_tests(train_df, alpha=0.01)
results_df = results.data

subtitle = f"Analysis of {len(train_df)} samples (α = 0.01)"
significant_mask = results_df["Significant"] == "Yes"
significant_tests = significant_mask.astype(int).sum()
total_tests = len(results_df)

print(
    f"{subtitle}\n"
    f"{significant_tests}/{total_tests} tests significant at α = 0.01 after Bonferroni correction"
)

display(results)
2025-05-18 17:21:30,525 - root - INFO - Raw p-value for H1: Maximum Section Entropy: 0.0
2025-05-18 17:21:30,529 - root - INFO - Raw p-value for H2: Third Section Entropy: 1.0
2025-05-18 17:21:30,532 - root - INFO - Raw p-value for H3: First Section Entropy: 0.0
2025-05-18 17:21:30,533 - root - INFO - Raw p-value for H4: Fourth Section Entropy: 1.0
2025-05-18 17:21:30,596 - root - INFO - Contingency Table for file_type:
is_malicious  0.00   1.00
file_type                
exe           1911  10901
dll           5305    835
2025-05-18 17:21:30,597 - root - INFO - Expected Frequencies for file_type:
[[4878.18657661 7933.81342339]
 [2337.81342339 3802.18657661]]
2025-05-18 17:21:30,654 - root - INFO - Contingency Table for characteristics:
is_malicious     0.00  1.00
characteristics            
11298.0            58     0
258.0             579  3682
259.0              53  1343
263.0              17     9
270.0              24   882
271.0              49  3010
290.0              39    80
291.0               3    22
302.0               0    10
303.0               2    15
33166.0             8   783
33167.0            16   282
33198.0             0    16
3330.0              6     1
33679.0             0    12
34.0              745   220
35.0               14    58
38.0               16     1
39.0               46     6
41358.0            15    38
47.0               77     7
547.0               0     6
551.0             106     1
558.0               7     0
559.0              48    14
771.0               0    35
775.0               8     2
782.0               2     8
783.0              21   361
815.0               4     4
8226.0           3478    68
8230.0            205     0
8238.0             64     0
8450.0            974   536
8454.0             22     1
8462.0            166   161
8482.0             82     9
8742.0            141     0
8750.0             47     4
8966.0             11     0
8974.0             36    11
Other              27    38
2025-05-18 17:21:30,655 - root - INFO - Expected Frequencies for characteristics:
[[2.20835796e+01 3.59164204e+01]
 [1.62238160e+03 2.63861840e+03]
 [5.31528915e+02 8.64471085e+02]
 [9.89953567e+00 1.61004643e+01]
 [3.44960743e+02 5.61039257e+02]
 [1.16471845e+03 1.89428155e+03]
 [4.53094133e+01 7.36905867e+01]
 [9.51878430e+00 1.54812157e+01]
 [3.80751372e+00 6.19248628e+00]
 [6.47277332e+00 1.05272267e+01]
 [3.01174335e+02 4.89825665e+02]
 [1.13463909e+02 1.84536091e+02]
 [6.09202195e+00 9.90797805e+00]
 [2.66525960e+00 4.33474040e+00]
 [4.56901646e+00 7.43098354e+00]
 [3.67425074e+02 5.97574926e+02]
 [2.74140988e+01 4.45859012e+01]
 [6.47277332e+00 1.05272267e+01]
 [1.97990713e+01 3.22009287e+01]
 [2.01798227e+01 3.28201773e+01]
 [3.19831152e+01 5.20168848e+01]
 [2.28450823e+00 3.71549177e+00]
 [4.07403968e+01 6.62596032e+01]
 [2.66525960e+00 4.33474040e+00]
 [2.36065851e+01 3.83934149e+01]
 [1.33262980e+01 2.16737020e+01]
 [3.80751372e+00 6.19248628e+00]
 [3.80751372e+00 6.19248628e+00]
 [1.45447024e+02 2.36552976e+02]
 [3.04601098e+00 4.95398902e+00]
 [1.35014436e+03 2.19585564e+03]
 [7.80540312e+01 1.26945969e+02]
 [2.43680878e+01 3.96319122e+01]
 [5.74934572e+02 9.35065428e+02]
 [8.75728155e+00 1.42427184e+01]
 [1.24505699e+02 2.02494301e+02]
 [3.46483748e+01 5.63516252e+01]
 [5.36859434e+01 8.73140566e+01]
 [1.94183200e+01 3.15816800e+01]
 [4.18826509e+00 6.81173491e+00]
 [1.78953145e+01 2.91046855e+01]
 [2.47488392e+01 4.02511608e+01]]
2025-05-18 17:21:30,700 - root - ERROR - Error in Chi-squared for section_3_name: Cannot setitem on a Categorical with a new category (Other), set the categories first
2025-05-18 17:21:30,707 - root - INFO - Raw p-value for H8: Average String Length: 0.0
2025-05-18 17:21:30,713 - root - INFO - P-values before correction: [0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
2025-05-18 17:21:30,714 - root - INFO - Corrected p-values: [0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
Analysis of 18952 samples (α = 0.01)
4/8 tests significant at α = 0.01 after Bonferroni correction
Hypothesis | Test | Feature | Statistic | P-value | P-value (corrected) | Effect Size | Direction | Significant
H1: Maximum Section Entropy | Mann-Whitney U | sections_max_entropy | 67717109.0000 | 0 | 0 | 1.0000 | Greater | Yes
H2: Third Section Entropy | Mann-Whitney U | section_3_entropy | 33718258.5000 | 1 | 1 | 0.0000 | Greater | No
H3: First Section Entropy | Mann-Whitney U | section_0_entropy | 60879658.5000 | 0 | 0 | 0.0000 | Greater | Yes
H4: Fourth Section Entropy | Mann-Whitney U | section_4_entropy | 24744577.5000 | 1 | 1 | 0.0000 | Greater | No
H5: File Type Distribution | Chi-squared | file_type | 8993.0400 | 0 | 0 | 0.6889 | N/A | Yes
H6: PE Characteristics | Chi-squared | characteristics | --- | --- | --- | --- | N/A | No
H7: Third Section Name | Chi-squared | section_3_name | --- | --- | --- | --- | N/A | No
H8: Average String Length | Mann-Whitney U | avg_string_len | 22931036.5000 | 0 | 0 | 0.0000 | Two-sided | Yes

Interpretation of Statistical Test Results¶

Correlation analysis revealed notable multicollinearity patterns across our feature set, with 22 highly correlated pairs (|correlation| > 0.95) identified, including 7 regular feature pairs and 15 missing indicator pairs. These relationships have critical implications for our feature selection strategy, though the extent of perfect correlation clusters is less pronounced than initially anticipated.

Our statistical testing on 18,952 samples (α = 0.01, Bonferroni-corrected) supported 4 of the 8 hypotheses. For entropy measures, we tested sections_max_entropy, section_0_entropy, section_3_entropy, and section_4_entropy with a one-sided alternative (malicious greater than benign). Only sections_max_entropy (p = 0.0, importance 0.0645) and section_0_entropy (p = 0.0, importance 0.0368) showed significantly greater values in the malicious population, while section_3_entropy (p = 1.0, importance 0.0356) and section_4_entropy (p = 1.0, importance 0.0593) did not. A one-sided p-value of 1.0 means the data give no support to the "greater" alternative; it does not mean the two distributions are identical. The correlation analysis found no pairs above 0.95 among the section_[0-4]_entropy features or between them and suspicious_imports, so these features are not treated as redundant with one another, and the two significant entropy features remain valuable.

Among resource features, the correlation table shows has_resources and resource_types at 0.999, but no pair among resource_types, resource_entropy, and num_resources, and no pair involving suspicious_imports or file_alignment, exceeded the 0.95 threshold. None of the resource features was included in the hypothesis tests, so their discriminative power remains unconfirmed in this dataset and is left for further investigation alongside their SHAP importance.

Binary content metrics such as byte_distribution and avg_line_length do not appear in the correlation or test results above, so no conclusions about their relationships with entropy or major_image_version are drawn here. The avg_string_len feature (p = 0.0, importance 0.0390) differed significantly between the populations, and it does not appear in any highly correlated pair.

Most notably, binary type indicators show strong discriminative relationships. The perfect negative correlation between is_exe and is_dll (-1.000) is supported by the chi-squared test for file_type (χ² = 8993.04, p = 0.0, effect size 0.6889), with observed frequencies (exe: 1911 benign, 10901 malicious; dll: 5305 benign, 835 malicious) deviating significantly from expected (exe: 4878.19, 7933.81; dll: 2337.81, 3802.19). This aligns with is_exe’s importance (0.0249) and suggests a clear distributional difference. The characteristics contingency table also shows strong class skew (e.g., 258: 579 benign, 3682 malicious; 8226: 3478 benign, 68 malicious), which is consistent with the utility of flags like characteristics_flags_IMAGE_FILE_EXECUTABLE_IMAGE_32BIT_MACHINE (importance 0.0735). However, the results table reports no statistic or p-value for H6 in this run, so that hypothesis is not yet formally confirmed. H7 (section_3_name) could not be tested because of the categorical handling error shown in the log.

Importantly, these patterns suggest opportunities for dimensionality reduction through feature engineering, particularly for perfectly correlated pairs like is_exe and is_dll, while preserving the discriminative power of significant features (p = 0.0).

Next Steps: Feature Engineering and Selection¶

Next, we will proceed with feature engineering and selection, focusing particularly on:

  • Entropy Feature Consolidation
    • Retain sections_max_entropy and section_0_entropy due to their statistical significance (p = 0.0), while deprioritizing section_3_entropy and section_4_entropy (p = 1.0) unless further correlations or tests justify inclusion.
    • Explore composite entropy measures if additional high correlations (> 0.95) emerge, weighted by SHAP importance and test significance.
    • Validate discriminative power maintenance post-consolidation.
  • File Type Integration
    • Consolidate is_exe and is_dll into a single feature (e.g., file_type) given their perfect negative correlation (-1.000) and significant distributional difference (p = 0.0).
    • Preserve the observed skew (malicious favor exe, benign favor dll) in the engineered feature.
  • Characteristics Optimization
    • Retain key characteristics flags (e.g., tied to IMAGE_FILE_EXECUTABLE_IMAGE_32BIT_MACHINE) based on the strong class skew in the characteristics contingency table, pending a valid chi-squared result for H6.
    • Consolidate redundant flags if further correlation data identifies overlaps above 0.95.

This feature engineering approach will help us reduce multicollinearity (e.g., is_exe vs is_dll) while preserving the discriminative power of significant features (p = 0.0 for H1, H3, H5, and H8). The H6 test, which produced no statistic in this run, will be rerun, and the section_3_name test error will be addressed by ensuring proper categorical handling in future analyses.
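The section_3_name failure logged above ("Cannot setitem on a Categorical with a new category (Other)") occurs when a label that is not among a pandas Categorical's declared categories is assigned into it. One possible fix is sketched below; `consolidate_rare` is a hypothetical helper, and the `min_count` threshold is an assumption rather than the value used by run_statistical_tests.

```python
import pandas as pd


def consolidate_rare(series: pd.Series, min_count: int = 5, other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    is_categorical = isinstance(series.dtype, pd.CategoricalDtype)
    # A Categorical only accepts declared categories, so register the new
    # label before assigning it.
    if is_categorical and other_label not in series.cat.categories:
        series = series.cat.add_categories([other_label])
    out = series.copy()
    out[out.isin(rare)] = other_label
    if is_categorical:
        # Drop the now-empty rare categories (and the new label if unused).
        out = out.cat.remove_unused_categories()
    return out


# Illustrative section names: two frequent values and two rare ones.
names = pd.Series([".pdata"] * 6 + [".rsrc"] * 6 + [".bss", ".tls"], dtype="category")
merged = consolidate_rare(names, min_count=5)
```

Removing unused categories afterwards keeps empty rows out of the contingency table, where they would otherwise produce zero expected frequencies.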

We will begin by implementing these feature engineering steps in our next notebook, notebooks/02_feature_engineering_and_selection.ipynb.

Saving Processed Data¶

We will now save the current train_df and test_df dataframes as parquet files for use in that notebook, so that feature engineering starts from exactly the data analyzed here. The goal of that next stage is a final feature set that maintains discriminative power while minimizing redundancy and improving model stability.

In [64]:
train_df.to_parquet("../data/processed/train_df.parquet", index=False)
test_df.to_parquet("../data/processed/test_df.parquet", index=False)