Written by Yi Wang
The Trick of gdb python3
PyTorch authors use GDB to debug C++ code triggered by Python code, as described in their guide.
The basic idea is to run the command gdb python3. In the GDB session, we can set a breakpoint on a C++ function by name, say at::Tensor::neg. GDB cannot find this function yet, so it asks whether to make the breakpoint pending on a future shared library load. We should answer yes. Then we type the command run, which makes GDB start the Python interpreter. The interpreter prompts us for Python source code, and we can type import torch followed by other code, pressing Enter after each line. When the interpreter executes the import statement, it loads the related shared libraries. GDB watches the loading and sets the breakpoint. The Python code that follows then executes and triggers the breakpoint, which brings us back to the GDB prompt. Here we can do the usual C++ debugging work, such as using bt to check the backtrace and l to show the C++ code invoked by the Python program.
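The interactive steps above can also be written down as a GDB command file, so the same session is easy to repeat. The following is a minimal sketch under assumptions not taken from the PyTorch guide: the file name neg.gdb and the one-line Python program are made up for illustration, and a debug build of PyTorch is assumed to be importable by python3. The setting "set breakpoint pending on" makes GDB accept the pending breakpoint without asking.

```shell
#!/bin/sh
# Hypothetical file name for the GDB command file; any path works.
GDB_CMDS=neg.gdb

# Write the GDB commands that mirror the interactive session:
# accept pending breakpoints, set the breakpoint, start Python,
# then print the backtrace and the surrounding source once it stops.
cat > "$GDB_CMDS" <<'EOF'
set breakpoint pending on
break at::Tensor::neg
run
bt
list
EOF

# Print the command to run. The Python one-liner is an assumption:
# it is expected, not guaranteed, to reach at::Tensor::neg.
echo "gdb -x $GDB_CMDS --args python3 -c 'import torch; torch.ones(2).neg()'"
```

Because GDB is not started in batch mode here, it stays at its prompt after running these commands, so debugging can continue interactively.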
Build OneFlow in Debug Mode
Linux-Only
OneFlow supports only Linux, not macOS or Windows. I built OneFlow successfully on an AWS GPU host running Amazon Linux 2, which is similar to CentOS.
Use Conda or Docker
The official OneFlow documentation recommends building with Conda or a Docker image (https://github.com/Oneflow-Inc/oneflow#option-1-build-with-conda-recommended). I use Anaconda. The reason to use Conda or Docker is to pin the version of the C++ compiler and the rest of the build toolchain. A newer version of g++ might require changes to the source code; see, for example, https://github.com/Oneflow-Inc/oneflow/issues/8397.
Build the Debug Version
Please be aware that we must build a debug version of OneFlow for this trick to work, because GDB needs the debug symbols for the output of bt and l to make sense.
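Before starting GDB, it can be worth checking that a shared library actually carries debug information. The function below is a sketch that assumes the readelf tool from GNU binutils is installed and that the DWARF data is embedded in the library itself rather than kept in a separate file. The helper name has_debug_info and the library path in the usage comment are hypothetical.

```shell
#!/bin/sh
# has_debug_info FILE: succeed if the ELF file FILE has a .debug_info
# section, which is where DWARF debug information is stored.
has_debug_info() {
  readelf -S "$1" 2>/dev/null | grep -q '\.debug_info'
}

# Hypothetical usage; replace the path with a library from your own build:
# has_debug_info path/to/some_library.so && echo "debug symbols present"
```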
I built a CPU-only version of OneFlow, hence the cpu.cmake file. My AWS host is outside China, so I used the file in the international directory.
Report Errors
I ran into some issues while building OneFlow. Once I reported them, the OneFlow authors responded promptly. Kudos to these awesome developers.
Build Step-by-Step
This section records my step-by-step process.
1. Download and install Anaconda. The default installation destination is ~/anaconda3. The installer adds environment variable settings to ~/.bashrc, so source that file or reconnect to the host for the changes to take effect.
2. Follow https://github.com/Oneflow-Inc/conda-env to create and activate the Conda environment.
3. Git clone the source code
mkdir ~/w
cd ~/w
git clone https://github.com/Oneflow-Inc/oneflow
4. Build OneFlow
cd oneflow
mkdir build
cd build
CMAKE_BUILD_TYPE=Debug cmake .. -C ../cmake/caches/international/cpu.cmake
make -k -j $(nproc)
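One way to confirm which build type CMake recorded is to read it back from CMakeCache.txt in the build directory. This is a small sketch; the helper name build_type_of is made up for illustration, and it relies only on the NAME:TYPE=VALUE layout of CMake cache entries.

```shell
#!/bin/sh
# build_type_of DIR: print the CMAKE_BUILD_TYPE value recorded in
# DIR/CMakeCache.txt (prints nothing if the entry is absent).
build_type_of() {
  sed -n 's/^CMAKE_BUILD_TYPE:[A-Z]*=//p' "$1/CMakeCache.txt"
}

# Example, using the build directory from the steps above:
# build_type_of ~/w/oneflow/build
```

If this prints nothing, or a value other than Debug, the configuration did not end up as the debug build intended above.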
Run and Debug
After the build, a file named source.sh appears in the ~/w/oneflow/build directory. It sets the PYTHONPATH environment variable. Run source source.sh in that directory to make it take effect.
Then, we can run the Python interpreter under GDB with gdb python3.
At the GDB prompt, I set a breakpoint at oneflow::one::Tensor::is_eager, pending on a future shared library load.
Then, I made GDB run the Python interpreter by typing the run command. At the Python interpreter prompt, I could import oneflow.
Importing inside GDB takes much longer than usual. If it fails with ImportError, please make sure that you have run source source.sh as mentioned above.
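When the import fails, a quick check is whether PYTHONPATH contains an entry inside the OneFlow checkout. The sketch below assumes that source.sh adds such an entry, which is an assumption about its contents rather than something verified here. The helper name pythonpath_has_prefix is made up for illustration.

```shell
#!/bin/sh
# pythonpath_has_prefix DIR: succeed if some PYTHONPATH entry equals DIR
# or lies underneath it.
pythonpath_has_prefix() {
  prefix=${1%/}          # drop a trailing slash, if any
  old_ifs=$IFS
  IFS=:                  # PYTHONPATH entries are separated by colons
  for entry in ${PYTHONPATH:-}; do
    case "$entry" in
      "$prefix"|"$prefix"/*) IFS=$old_ifs; return 0 ;;
    esac
  done
  IFS=$old_ifs
  return 1
}

# Example, using the checkout location from the build steps:
# pythonpath_has_prefix "$HOME/w/oneflow" || echo "run: source source.sh"
```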
Now, let us create a tensor.
After pressing Enter, the execution of this line triggers the breakpoint. GDB's message tells us that the function oneflow::one::CopyBetweenMirroredTensorAndNumpy called tensor->is_eager() at line 98 in a source file.
To display more context, we can type l. At line 98, there is the call to tensor->is_eager().
We might be curious about why and how creating a tensor in the Python world triggers the call to Tensor::is_eager. We can reveal more details by typing the bt command.
At the bottom of the call stack is _start, which is the entry point of the Python interpreter. Looking upward, we can see the call boundary between Python and the OneFlow shared library: the function _PyMethodDef_RawFastCallKeywords in Python calls the OneFlow C++ function oneflow::one::functional::tensor, which, in turn, triggers the call to oneflow::one::Tensor::is_eager.