Adding an Extractor¶
This guide shows how to add a new extractor to recognize column reference patterns in Python code.
Overview¶
Extractors identify DataFrame column references in various AST expression patterns. Each extractor is a simple function that returns ColumnRef objects that the checker uses for validation.
Location: frame-check-core/src/frame_check_core/extractors/
When to Add an Extractor¶
Add a new extractor when you want to track column references in a pattern that isn't currently recognized.
The Pattern¶
import ast
from frame_check_core.refs import ColumnRef
def extract_my_pattern(node: ast.expr) -> list[ColumnRef] | None:
"""
Extract column references from my pattern.
Returns:
List of ColumnRef objects if pattern matches, None otherwise.
"""
# 1. Check if node matches your pattern
if not isinstance(node, ast.SomeType):
return None
# 2. Extract column references
refs = []
# ... extraction logic ...
# 3. Return refs or None
return refs if refs else None
Function Signature¶
Every extractor must have this signature:
Return Type¶
list[ColumnRef]: Pattern matched, here are the column references foundNone: Pattern didn't match, try the next extractor
The ColumnRef Dataclass¶
@dataclass(slots=True)
class ColumnRef:
node: ast.Subscript # Original AST node for location tracking
df_name: str # DataFrame variable name (e.g., 'df')
col_names: list[str] # Column names being accessed (e.g., ['A'])
Example: Existing extract_column_ref¶
The simplest extractor handles df['col'] patterns:
def extract_column_ref(node: ast.expr) -> list[ColumnRef] | None:
if not is_subscript(node):
return None
if not is_name(node.value):
return None
slice_node = node.slice
# Single column: df['col']
if is_constant(slice_node) and isinstance(slice_node.value, str):
return [ColumnRef(node, node.value.id, [slice_node.value])]
# Multi-column: df[['a', 'b']]
if isinstance(slice_node, ast.List):
col_names: list[str] = []
for elt in slice_node.elts:
if not isinstance(elt, ast.Constant):
return None
if not isinstance(elt.value, str):
return None
col_names.append(elt.value)
if not col_names:
return None
return [ColumnRef(node, node.value.id, col_names)]
return None
Example: Existing extract_column_refs_from_binop¶
Handles binary operations like df['A'] + df['B']:
def extract_column_refs_from_binop(node: ast.expr) -> list[ColumnRef] | None:
if not is_binop(node):
return None
refs: list[ColumnRef] = []
stack = [node.left, node.right]
while stack:
n = stack.pop()
if is_binop(n):
# Nested binop: recurse into both sides
stack.extend([n.left, n.right])
elif ref := extract_single_column_ref(n):
refs.append(ref)
else:
# Non-column operand (e.g., constant) - pattern doesn't match
return None
return refs if refs else None
Step-by-Step: Adding a Method Call Extractor¶
Let's add support for df['A'].fillna(df['B']).
Step 1: Create the module¶
Create frame-check-core/src/frame_check_core/extractors/method.py:
"""
Extractor for method call expressions containing column references.
Handles patterns like:
df['A'].fillna(df['B'])
df['A'].replace(df['B'], df['C'])
df['A'].where(df['B'] > 0)
Does NOT handle:
df.groupby('A') # Method on DataFrame, not column
df['A'].sum() # No column refs in arguments
"""
import ast
from frame_check_core.refs import ColumnRef
from .column import extract_single_column_ref
__all__ = ["extract_column_refs_from_method"]
def extract_column_refs_from_method(node: ast.expr) -> list[ColumnRef] | None:
"""
Extract column references from a method call on a column.
Matches: df['col'].method(...) where args may contain column refs.
Args:
node: The AST expression to analyze.
Returns:
List of ColumnRef objects for all columns found, or None if
the pattern doesn't match.
"""
# Must be a Call node: something(...)
if not isinstance(node, ast.Call):
return None
# func must be an Attribute: something.method
if not isinstance(node.func, ast.Attribute):
return None
# The object being called on should be a column ref: df['col'].method
base_ref = extract_single_column_ref(node.func.value)
if base_ref is None:
return None
refs = [base_ref]
# Check positional args for additional column references
for arg in node.args:
if arg_ref := extract_single_column_ref(arg):
refs.append(arg_ref)
# Check keyword args too
for kw in node.keywords:
if kw_ref := extract_single_column_ref(kw.value):
refs.append(kw_ref)
return refs
Step 2: Register in registry.py¶
Update frame-check-core/src/frame_check_core/extractors/registry.py:
# Import all extractors
from .binop import extract_column_refs_from_binop
from .column import extract_column_ref
from .method import extract_column_refs_from_method # ADD THIS
# Extractors to use, in priority order (earlier = tried first)
EXTRACTORS: list[ExtractorFunc] = [
extract_column_ref, # df['col'] and df[['a', 'b']] - most common
extract_column_refs_from_binop, # df['A'] + df['B'] - binary operations
extract_column_refs_from_method, # df['A'].fillna(df['B']) - method calls
]
That's it! Just add your function to the list in the order you want it tried.
Step 3: Export from init.py (optional)¶
If you want users to be able to import your extractor directly, add it to frame-check-core/src/frame_check_core/extractors/__init__.py:
# Individual extractors (for direct use)
from .binop import extract_column_refs_from_binop
from .column import extract_column_ref, extract_single_column_ref
from .method import extract_column_refs_from_method # ADD THIS
__all__ = [
"Extractor",
"ColumnRef",
"extract_column_ref",
"extract_single_column_ref",
"extract_column_refs_from_binop",
"extract_column_refs_from_method", # ADD THIS
"extract",
]
Step 4: Add tests¶
Create frame-check-core/tests/extractors/test_method.py:
"""Tests for the method call extractor."""
import ast
import pytest
from frame_check_core.extractors.method import extract_column_refs_from_method
def parse_expr(code: str) -> ast.expr:
"""Helper to parse a single expression."""
return ast.parse(code, mode="eval").body
def test_fillna_with_column():
expr = parse_expr("df['A'].fillna(df['B'])")
refs = extract_column_refs_from_method(expr)
assert refs is not None
assert len(refs) == 2
assert {r.col_names[0] for r in refs} == {"A", "B"}
def test_method_with_keyword_arg():
expr = parse_expr("df['A'].replace(to_replace=df['B'], value=df['C'])")
refs = extract_column_refs_from_method(expr)
assert refs is not None
assert len(refs) == 3
assert {r.col_names[0] for r in refs} == {"A", "B", "C"}
def test_method_no_column_args():
expr = parse_expr("df['A'].fillna(0)")
refs = extract_column_refs_from_method(expr)
assert refs is not None
assert len(refs) == 1
assert refs[0].col_names == ["A"]
def test_not_a_method_call():
expr = parse_expr("df['A']")
refs = extract_column_refs_from_method(expr)
assert refs is None
def test_method_on_non_column():
expr = parse_expr("some_list.append(df['A'])")
refs = extract_column_refs_from_method(expr)
assert refs is None
def test_df_names_preserved():
expr = parse_expr("df1['A'].fillna(df2['B'])")
refs = extract_column_refs_from_method(expr)
assert refs is not None
df_names = {r.df_name for r in refs}
assert df_names == {"df1", "df2"}
Step 5: Add integration tests¶
Add tests to frame-check-core/tests/test_checker.py:
def test_fillna_with_valid_column():
code = """
import pandas as pd
df = pd.DataFrame({"A": [1, None], "B": [2, 3]})
df["C"] = df["A"].fillna(df["B"])
"""
fc = Checker.check(code)
assert len(fc.diagnostics) == 0
assert "C" in fc.dfs["df"].columns
def test_fillna_with_invalid_column():
code = """
import pandas as pd
df = pd.DataFrame({"A": [1, None], "B": [2, 3]})
df["C"] = df["A"].fillna(df["X"]) # X doesn't exist
"""
fc = Checker.check(code)
assert len(fc.diagnostics) == 1
assert "X" in fc.diagnostics[0].message
Step 6: Run tests¶
Using the Registry API¶
Viewing Registered Extractors¶
from frame_check_core.extractors import Extractor
# Get all registered extractors
extractors = Extractor.get_registered()
for func in extractors:
print(func.__name__)
Output:
Extracting Column References¶
import ast
from frame_check_core.extractors import Extractor
expr = ast.parse("df['A'] + df['B']", mode="eval").body
refs = Extractor.extract(expr)
for ref in refs:
print(f"{ref.df_name}[{ref.col_names}]")
Type Guards¶
The refs.py module provides type guards for cleaner code:
from frame_check_core.refs import is_name, is_constant, is_subscript, is_binop
# Instead of:
if isinstance(node, ast.Name):
...
# Use:
if is_name(node):
# node is now narrowed to ast.Name
print(node.id) # Type checker knows this is valid
Tips¶
-
Return
Noneearly: If the pattern doesn't match, returnNoneimmediately so other extractors can try. -
Use
extract_single_column_ref: This is your building block for recognizingdf['col']patterns within larger expressions. It returns a singleColumnRefinstead of a list. -
Order matters: Extractors earlier in the
EXTRACTORSlist are tried first. Put more specific patterns before general ones. -
Handle partial matches carefully: Decide whether partial matches (e.g.,
df['A'] + 1) should return partial results orNone. -
Test edge cases: Empty lists, nested patterns, mixed DataFrames, etc.
-
Keep extractors focused: Each extractor should handle one pattern type.
Debugging Tips¶
Use ast.dump() to understand node structure:
import ast
code = "df['A'].fillna(df['B'])"
tree = ast.parse(code, mode="eval")
print(ast.dump(tree, indent=2))
Output:
Expression(
body=Call(
func=Attribute(
value=Subscript(
value=Name(id='df', ctx=Load()),
slice=Constant(value='A'),
ctx=Load()),
attr='fillna',
ctx=Load()),
args=[
Subscript(
value=Name(id='df', ctx=Load()),
slice=Constant(value='B'),
ctx=Load())],
keywords=[]))
This helps you understand exactly what AST pattern you need to match!
Summary¶
Adding an extractor is simple:
- Create a function that takes
ast.exprand returnslist[ColumnRef] | None - Import it in
registry.py - Add it to the
EXTRACTORSlist in the order you want it tried - Optionally export it from
__init__.py - Add tests
That's it! No decorators, no priority numbers, just a simple list.