Finding definitions from a source file and a line number in Python
My job at Datadog keeps me busy with new and questioning challenges. I recently stumbled upon a problem that sounded easy but was more difficult than I imagined.
Here's the thing: considering a filename and a line number, can you tell which function, method or class this line of code belongs to?
I started to dig into the standard library, but I did not find anything solving this problem. It sounded like I had to write this myself.
The first steps sound easy. Open a file, read it, find the line number. Right.
Then, how do you know which functions this line is in? You don't, expect if you parse the whole file and keep tracks of function definitions. A regular expression parsing each line might be a solution?
Well, you had to be careful as function definitions can span multiple lines.
Using the AST
I decided that a good and robust strategy was not going to use manual parsing or the like, but using Python abstract syntax tree (AST) directly. By leveraging Python's own parsing code, I was sure I was not going to fail while parsing a Python source file.
This can be simply be accomplished with:
import ast def parse_file(filename): with open(filename) as f: return ast.parse(f.read(), filename=filename)
And you're done. Are you? No, because that only works in 99.99% of the case. If your source file is using an encoding that is now ASCII or UTF-8, then the function fails. I know you think I'm crazy to think about this but I like my code to be robust.
It turns out Python has a cookie to specify the encoding in the form of
# encoding: utf-8 as defined in PEP 263. Reading this cookie would help to find the encoding.
To do that, we need to open the file in binary mode, use a regular expression to match the data, and… Well, it's dull, and somebody already implemented it for us so let's use the fantastic
tokenize.open function provided by Python:
import ast import tokenize def parse_file(filename): with tokenize.open(filename) as f: return ast.parse(f.read(), filename=filename)
That should work in 100% of the time. Until proven otherwise.
Browsing the AST
parse_file function now returns a Python AST. If you never played with Python AST, it's a gigantic tree that represents your source code just before it is compiled down to Python bytecode.
In the tree, there should be statements and expression. In our case, we're interested in finding the function definition that is the closest to our line number. Here's an implementation of that function:
def filename_and_lineno_to_def(filename, lineno): candidate = None for item in ast.walk(parse_file(filename)): if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)): if item.lineno > lineno: # Ignore whatever is after our line continue if candidate: distance = lineno - item.lineno if distance < (lineno - candidate.lineno): candidate = item else: candidate = item if candidate: return candidate.name
This iterates over all the node of the AST and returns the node where the line number is the closest to our definition. If we have a file that contains:
class A(object): X = 1 def y(self): return 42
filename_and_lineno_to_def returns for the lines 1 to 5:
The naive approach described earlier likely works for 90% of your code, but there are some edge cases. For example, when defining function closures, the above algorithm fails. With the following code:
class A(object): X = 1 def y(self): def foo(): return 42 return foo
filename_and_lineno_to_def returns for lines 1 to 7:
Oops. Clearly, lines 6 and 7 do not belong to the
foo function. Our approach is too naive to see that starting at line 6, we're back in the
The correct way of handling that is to consider each function definition as an interval:
Whatever the line number we request is, we should return the node that is responsible for the smallest interval that the line is in.
What we need in this case is a correct data structure to solve our problem: an interval tree fits perfectly our use case. It allows for searching rapidly pieces of code that match our line number.
To solve our problem we need several things:
- A way to compute the beginning and end line numbers for a function.
- A tree that is fed with the intervals we computed just before.
- A way to select the best matching intervals if a line is part of several functions (closure).
Computing Function Intervals
The interval of a function is the first and last lines that compose its body. It's pretty easy to find those by walking through the function AST node:
def _compute_interval(node): min_lineno = node.lineno max_lineno = node.lineno for node in ast.walk(node): if hasattr(node, "lineno"): min_lineno = min(min_lineno, node.lineno) max_lineno = max(max_lineno, node.lineno) return (min_lineno, max_lineno + 1)
Given any AST node, the function returns a tuple of the first and last line number of that node.
Building The Tree
Rather than implementing an interval tree, we'll use the intervaltree library. We need to create a tree and feed it with the computed interval:
def file_to_tree(filename): with tokenize.open(filename) as f: parsed = ast.parse(f.read(), filename=filename) tree = intervaltree.IntervalTree() for node in ast.walk(parsed): if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)): start, end = _compute_interval(node) tree[start:end] = node return tree
Here you go: the function parses the Python file passed as an argument and converts it to its AST representation. It then walks it and feeds the interval tree with every class and function definition.
Querying the Tree
Now that the tree is built, it should be queried with the line number. This is pretty simple:
matches = file_to_tree(filename)[lineno] if matches: return min(matches, key=lambda i: i.length()).data.name
The build tree might return several matches if there are several intervals containing our line number. In that case, we pick the smallest interval and return the name of the node — which is our class or function name!
We did it! We started with a naive approach and iterated to a final solution covering 100% of our cases. Picking the right data structure, interval trees here, helped us solving this in an intelligent approach.