
Build Your Own DEX Parser: A Python Scripting Tutorial for Android Analysts


Introduction: Diving Deep into Android’s Core – The DEX File Format

For Android reverse engineers, malware analysts, and security researchers, a working knowledge of the Dalvik Executable (DEX) file format is essential. DEX is the compiled bytecode format that Android runs, essentially the ‘executable’ of the Android world. Tools like jadx (decompilation to Java) and apktool (disassembly to smali) are convenient, but writing your own parser shows you the raw structure and metadata that those tools abstract away. This tutorial walks through a foundational DEX parser in Python, covering the header, string IDs, and type IDs.

Understanding the DEX File Structure

A DEX file is a highly structured binary file designed for efficient execution on resource-constrained devices. It’s broadly organized into several sections, each containing specific types of data:

  • Header: Contains file metadata like checksum, file sizes, and offsets to other sections.
  • String IDs: A list of offsets to string data, typically used for class names, method names, field names, and literal strings.
  • Type IDs: References to string IDs that represent type descriptors (e.g., Ljava/lang/String; for String objects).
  • Proto IDs: References to string IDs and type IDs defining method prototypes (return types and parameter types).
  • Field IDs: References to type IDs and string IDs defining fields (class, type, name).
  • Method IDs: References to class, prototype, and name definitions for methods.
  • Class Defs: Definitions for each class, including access flags, superclass, interfaces, source file, annotations, fields, methods, and static values.
  • Data Section: Contains the actual bytecode, debugging information, annotations, and other variable-length data.
  • Map List: A list of all sections in the DEX file, their types, and their offsets/sizes.

Prerequisites

To follow along, you’ll need:

  • Python 3.x installed.
  • Basic understanding of binary data, hexadecimals, and file I/O.
  • A sample Android APK file to extract the `classes.dex` file from (you can rename an APK to `.zip` and extract it).
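If you would rather not rename and unzip by hand, the extraction can be scripted with the standard `zipfile` module. The following is a minimal sketch, not part of the parser itself; the function name `extract_dex_files` and the `out_dir` parameter are illustrative assumptions. It also picks up `classes2.dex`, `classes3.dex`, and so on, which multidex APKs contain.

```python
import zipfile

def extract_dex_files(apk_path, out_dir="."):
    """Extract every top-level classes*.dex entry from an APK; return the written paths."""
    written = []
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            # Multidex APKs contain classes.dex, classes2.dex, classes3.dex, ...
            if name.startswith("classes") and name.endswith(".dex"):
                written.append(apk.extract(name, out_dir))
    return written
```

Calling `extract_dex_files("app.apk", "out")` on a hypothetical `app.apk` would write the DEX files into `out/` and return their paths.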

Step 1: Setting Up Your Python Environment

No special libraries are strictly required beyond Python’s standard library. The `struct` module will be our primary tool for parsing binary data.
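As a quick warm-up, here is a small sketch of how `struct` format strings behave. The four sample bytes are an arbitrary illustration; the point is that the `<` prefix selects little-endian byte order, which is what the parsing code in this tutorial uses throughout.

```python
import struct

raw = bytes.fromhex("78563412")  # four sample bytes as they might sit on disk

little = struct.unpack("<I", raw)[0]  # "<" = little-endian, "I" = unsigned 32-bit int
big = struct.unpack(">I", raw)[0]     # ">" = big-endian, same four bytes

print(hex(little))                  # 0x12345678
print(hex(big))                     # 0x78563412
print(struct.calcsize("<8sI20s"))   # 32 (8 + 4 + 20 bytes, no padding with "<")
```

`struct.unpack` always returns a tuple, which is why single values are read with `[0]`.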

First, let’s create a Python file, say `dex_parser.py`.

import struct
import sys

def main(dex_path):
    try:
        with open(dex_path, 'rb') as f:
            print(f"[*] Successfully opened {dex_path}")
            # We'll add our parsing logic here
    except FileNotFoundError:
        print(f"[-] Error: File not found at {dex_path}")
        sys.exit(1)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: python dex_parser.py <path_to_classes.dex>")
        sys.exit(1)
    main(sys.argv[1])

You can run this initial script to ensure it opens your DEX file:

python dex_parser.py path/to/your/classes.dex

Step 2: Parsing the DEX Header

The DEX header is the most crucial part, as it provides pointers to all other sections. It’s a fixed-size structure, typically 0x70 bytes (112 bytes). We’ll use Python’s `struct` module to unpack these bytes into meaningful values.

Here’s a breakdown of some key header fields and their sizes/types:

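  • `magic` (8 bytes): The magic bytes `dex\n035\0`, where `035` is the format version (newer files carry higher versions such as `037`, `038`, or `039`).
  • `checksum` (4 bytes): Adler-32 checksum of the rest of the file (everything after `magic` and this field).
  • `signature` (20 bytes): SHA-1 hash of the rest of the file (everything after `magic`, `checksum`, and this field).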
  • `file_size` (4 bytes): Total size of the DEX file.
  • `header_size` (4 bytes): Size of the header section itself (0x70).
  • `endian_tag` (4 bytes): Indicates endianness (should be 0x12345678).
  • `string_ids_size` (4 bytes): Number of string identifiers.
  • `string_ids_off` (4 bytes): Offset from the start of the file to the string ID list.
  • `type_ids_size` (4 bytes): Number of type identifiers.
  • `type_ids_off` (4 bytes): Offset from the start of the file to the type ID list.
  • … and many more offsets for other sections.

Let’s add code to `main` to parse the header:

def parse_dex_header(f):
    f.seek(0) # Ensure we're at the beginning of the file
    header_format = (
        "<8s"   # magic (char[8])
        "I"    # checksum (uint)
        "20s"  # signature (char[20])
        "I"    # file_size (uint)
        "I"    # header_size (uint)
        "I"    # endian_tag (uint)
        "I"    # link_size (uint)
        "I"    # link_off (uint)
        "I"    # map_off (uint)
        "I"    # string_ids_size (uint)
        "I"    # string_ids_off (uint)
        "I"    # type_ids_size (uint)
        "I"    # type_ids_off (uint)
        "I"    # proto_ids_size (uint)
        "I"    # proto_ids_off (uint)
        "I"    # field_ids_size (uint)
        "I"    # field_ids_off (uint)
        "I"    # method_ids_size (uint)
        "I"    # method_ids_off (uint)
        "I"    # class_defs_size (uint)
        "I"    # class_defs_off (uint)
        "I"    # data_size (uint)
        "I"    # data_off (uint)
    )

    header_raw = f.read(struct.calcsize(header_format))
    header_data = struct.unpack(header_format, header_raw)

    header = {
        'magic': header_data[0].decode('ascii').strip('\x00'),
        'checksum': hex(header_data[1]),
        'signature': header_data[2].hex(),
        'file_size': header_data[3],
        'header_size': header_data[4],
        'endian_tag': hex(header_data[5]),
        'link_size': header_data[6],
        'link_off': header_data[7],
        'map_off': header_data[8],
        'string_ids_size': header_data[9],
        'string_ids_off': header_data[10],
        'type_ids_size': header_data[11],
        'type_ids_off': header_data[12],
        'proto_ids_size': header_data[13],
        'proto_ids_off': header_data[14],
        'field_ids_size': header_data[15],
        'field_ids_off': header_data[16],
        'method_ids_size': header_data[17],
        'method_ids_off': header_data[18],
        'class_defs_size': header_data[19],
        'class_defs_off': header_data[20],
        'data_size': header_data[21],
        'data_off': header_data[22],
    }
    return header

def main(dex_path):
    try:
        with open(dex_path, 'rb') as f:
            print(f"[*] Successfully opened {dex_path}")
            header = parse_dex_header(f)
            print("\n--- DEX Header ---")
            for key, value in header.items():
                print(f"{key}: {value}")
    except FileNotFoundError:
        print(f"[-] Error: File not found at {dex_path}")
        sys.exit(1)
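Since the header stores both a checksum and a signature, a natural next step is to recompute them and compare. The sketch below assumes the whole file has been read into a `bytes` object and uses the field offsets implied by the header layout above (checksum at byte 8, signature at bytes 12 to 31). The function name `verify_dex_integrity` is an illustrative choice.

```python
import hashlib
import struct
import zlib

def verify_dex_integrity(data):
    """Recompute checksum and signature for a whole DEX file held in `data` (bytes)."""
    stored_checksum = struct.unpack_from("<I", data, 8)[0]  # header bytes 8..11
    stored_signature = data[12:32]                          # header bytes 12..31
    # Adler-32 covers everything after the checksum field (offset 12 onward).
    checksum_ok = (zlib.adler32(data[12:]) & 0xFFFFFFFF) == stored_checksum
    # SHA-1 covers everything after the signature field (offset 32 onward).
    signature_ok = hashlib.sha1(data[32:]).digest() == stored_signature
    return checksum_ok, signature_ok
```

A mismatch does not necessarily mean the file is malicious, but it is a useful hint that the DEX was patched after it was built.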

Step 3: Parsing String Identifiers (String IDs)

The string ID list is an array of `uint` offsets. Each offset points to a string data item in the data section. That item begins with a ULEB128 (Unsigned Little Endian Base 128) value giving the string’s length in UTF-16 code units (not bytes), followed by the string itself, encoded as MUTF-8 and terminated by a null byte.

Decoding ULEB128

ULEB128 is a variable-length encoding for integers. It allows smaller numbers to take up fewer bytes. Each byte in a ULEB128 sequence has its most significant bit (MSB) indicating if more bytes follow. The actual value is stored in the lower 7 bits.

def read_uleb128(f):
    result = 0
    shift = 0
    while True:
        byte = struct.unpack('<B', f.read(1))[0]
        result |= (byte & 0x7f) << shift
        if (byte & 0x80) == 0: # MSB is 0, so this is the last byte
            break
        shift += 7
    return result
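To see the encoding at work on concrete bytes, here is a sketch of the same loop operating on an in-memory buffer instead of a file. The name `decode_uleb128` and its `(value, bytes_consumed)` return shape are assumptions made for this illustration.

```python
def decode_uleb128(data, offset=0):
    """Decode one ULEB128 value from `data` at `offset`; return (value, bytes_consumed)."""
    result = 0
    shift = 0
    pos = offset
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if (byte & 0x80) == 0:            # MSB clear: this was the last byte
            return result, pos - offset
        shift += 7

print(decode_uleb128(b"\x05"))          # (5, 1)
print(decode_uleb128(b"\x80\x01"))      # (128, 2)
print(decode_uleb128(b"\xe5\x8e\x26"))  # (624485, 3)
```

In the last example the payloads are 0x65, 0x0E, and 0x26, so the value is 0x65 + (0x0E << 7) + (0x26 << 14) = 624485.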

Parsing the String ID List

Now, let’s incorporate string ID parsing into our `main` function.

def parse_string_ids(f, header):
    string_ids = []
    f.seek(header['string_ids_off'])

    print(f"\n[*] Parsing {header['string_ids_size']} String IDs from offset {hex(header['string_ids_off'])}")

    for i in range(header['string_ids_size']):
        string_data_off = struct.unpack('<I', f.read(4))[0] # Read uint offset to string data
        current_pos = f.tell() # Save current position

        f.seek(string_data_off) # Jump to the string data item
        # The ULEB128 prefix counts UTF-16 code units, not bytes, so it cannot
        # be used as the read length for non-ASCII strings.
        utf16_size = read_uleb128(f)

        # The encoded bytes follow the ULEB128 prefix and end at the first 0x00.
        string_bytes = bytearray()
        while True:
            byte = f.read(1)
            if byte in (b'', b'\x00'): # Null terminator (or unexpected end of file)
                break
            string_bytes += byte
        # The data is MUTF-8; plain UTF-8 decoding is exact for ordinary BMP text only.
        string_val = string_bytes.decode('utf-8', errors='replace')
        string_ids.append(string_val)

        f.seek(current_pos) # Return to previous position in string_ids list
    return string_ids

def main(dex_path):
    try:
        with open(dex_path, 'rb') as f:
            print(f"[*] Successfully opened {dex_path}")
            header = parse_dex_header(f)
            print("\n--- DEX Header ---")
            for key, value in header.items():
                print(f"{key}: {value}")

            string_list = parse_string_ids(f, header)
            print("\n--- Top 10 String IDs ---")
            for i, s in enumerate(string_list[:10]):
                print(f"{i}: {s}")
    except FileNotFoundError:
        print(f"[-] Error: File not found at {dex_path}")
        sys.exit(1)
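One caveat: DEX string data is stored as MUTF-8 (Modified UTF-8), not standard UTF-8. It encodes U+0000 as the two bytes `C0 80` and stores characters outside the Basic Multilingual Plane as a surrogate pair, each half in its own 3-byte sequence. A plain `'utf-8'` decode mishandles both cases. The following is a minimal sketch of a stricter decoder; the name `decode_mutf8` is an assumption, and it raises `UnicodeDecodeError` on otherwise malformed input.

```python
def decode_mutf8(raw):
    """Decode MUTF-8 bytes (without the trailing 0x00) into a Python str."""
    # MUTF-8 stores U+0000 as the two-byte sequence C0 80.
    raw = bytes(raw).replace(b"\xc0\x80", b"\x00")
    # Supplementary characters arrive as two 3-byte surrogate encodings;
    # 'surrogatepass' lets them through as lone surrogate code points.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Round-tripping through UTF-16 joins each surrogate pair into one character.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le", errors="replace")
```

To try it, pass the bytes read for each string to `decode_mutf8` in place of the `.decode('utf-8', ...)` call.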

Step 4: Parsing Type Identifiers (Type IDs)

The type ID list contains an array of `uint` indices. Each index is a reference into the string ID list, pointing to a string that represents a type descriptor (e.g., `Ljava/lang/String;`, `I` for int, `V` for void, etc.).

def parse_type_ids(f, header, string_ids):
    type_ids = []
    f.seek(header['type_ids_off'])

    print(f"\n[*] Parsing {header['type_ids_size']} Type IDs from offset {hex(header['type_ids_off'])}")

    for i in range(header['type_ids_size']):
        descriptor_idx = struct.unpack('<I', f.read(4))[0] # Read uint index into string_ids
        if descriptor_idx < len(string_ids):
            type_ids.append(string_ids[descriptor_idx])
        else:
            type_ids.append(f"<INVALID STRING ID: {descriptor_idx}>")
    return type_ids

def main(dex_path):
    try:
        with open(dex_path, 'rb') as f:
            print(f"[*] Successfully opened {dex_path}")
            header = parse_dex_header(f)
            print("\n--- DEX Header ---")
            for key, value in header.items():
                print(f"{key}: {value}")

            string_list = parse_string_ids(f, header)
            print("\n--- Top 10 String IDs ---")
            for i, s in enumerate(string_list[:10]):
                print(f"{i}: {s}")

            type_list = parse_type_ids(f, header, string_list)
            print("\n--- Top 10 Type IDs ---")
            for i, t in enumerate(type_list[:10]):
                print(f"{i}: {t}")

    except FileNotFoundError:
        print(f"[-] Error: File not found at {dex_path}")
        sys.exit(1)
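Raw descriptors such as `[Ljava/lang/String;` are compact but awkward to read. As an optional convenience, the sketch below converts a descriptor into Java-style notation. The helper name `descriptor_to_java` and the `PRIMITIVES` table name are illustrative assumptions; the single-letter codes are the standard primitive type descriptors.

```python
PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "S": "short", "C": "char",
    "I": "int", "J": "long", "F": "float", "D": "double",
}

def descriptor_to_java(descriptor):
    """Convert a type descriptor such as '[Ljava/lang/String;' to 'java.lang.String[]'."""
    dims = len(descriptor) - len(descriptor.lstrip("["))  # each leading '[' is one array dimension
    base = descriptor[dims:]
    if base in PRIMITIVES:
        name = PRIMITIVES[base]
    elif base.startswith("L") and base.endswith(";"):
        name = base[1:-1].replace("/", ".")  # strip 'L' and ';', use dotted package names
    else:
        raise ValueError(f"not a valid type descriptor: {descriptor!r}")
    return name + "[]" * dims
```

For example, `descriptor_to_java("[[I")` gives `int[][]`, and `descriptor_to_java("Ljava/lang/String;")` gives `java.lang.String`.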

Conclusion: Your First Steps into DEX Analysis

Congratulations! You've successfully built a rudimentary DEX parser that can read the file header, extract all string literals, and resolve type descriptors. This foundational understanding is invaluable for advanced Android analysis. From here, you can extend your parser to:

  • Parse `proto_ids`, `field_ids`, and `method_ids` to understand method signatures and class members.
  • Dive into `class_defs` to reconstruct class hierarchies, access flags, and implemented interfaces.
  • Most importantly, parse the actual Dalvik bytecode within the `code_item` structures referenced by method definitions, enabling you to build your own disassembler or even a basic decompiler.
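As a starting point for the first of these extensions, here is a hedged sketch of reading the field ID list. It assumes the whole file is available as a `bytes` object and that each entry is 8 bytes: a `ushort` class index and a `ushort` type index (both into the type ID list), followed by a `uint` name index into the string ID list. The function name `parse_field_ids` is hypothetical. Method IDs use the same 8-byte shape, with a prototype index in place of the type index.

```python
import struct

def parse_field_ids(data, offset, count, string_ids, type_ids):
    """Parse `count` 8-byte field ID entries from `data` starting at `offset`."""
    fields = []
    for i in range(count):
        # Entry layout: class_idx (ushort), type_idx (ushort), name_idx (uint)
        class_idx, type_idx, name_idx = struct.unpack_from("<HHI", data, offset + i * 8)
        fields.append((type_ids[class_idx], type_ids[type_idx], string_ids[name_idx]))
    return fields
```

With the parser from this tutorial, `offset` and `count` would come from `header['field_ids_off']` and `header['field_ids_size']`, and the two lookup lists from `parse_type_ids` and `parse_string_ids`.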

By interacting directly with the binary format, you gain a deeper appreciation for how Android applications are structured and executed, providing a powerful advantage in reverse engineering and security research.
