Use GDB to Walk Through OneFlow Source Code

OneFlow
Jul 11, 2022

Written by Yi Wang

The Trick of gdb python3

PyTorch authors use gdb to debug C++ code triggered by some Python code. The guide is here.

The basic idea is to run the command gdb python3. In the GDB session, we set a breakpoint on a C++ function by name, say at::Tensor::neg. GDB cannot find this function yet, so it asks whether to make the breakpoint pending on future shared library load. We answer y. Then we type the command run, which makes GDB start the Python interpreter, and the interpreter prompts us for Python source code.

We can now type import torch followed by other code and press Enter. When the interpreter executes the import statement, it loads the related shared libraries; GDB watches the loading and resolves the pending breakpoint. When the Python code that follows reaches the function, it triggers the breakpoint and brings us back to the GDB prompt. There we can do the usual C++ debugging work, such as using bt to print the backtrace and l to list the C++ code invoked by the Python program.
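The same steps can also be written into a GDB command file, so that the session does not have to be typed by hand each time. The following is only a sketch: write_gdb_script is a hypothetical helper name, and my_script.py stands for whatever Python script reaches the C++ function. The GDB commands themselves (set breakpoint pending on, break, run, bt, list) are standard.

```shell
#!/usr/bin/env bash
# Sketch: write a GDB command file that replays the interactive steps.
# write_gdb_script is a hypothetical helper, not part of GDB or PyTorch.
write_gdb_script() {
  local symbol="$1" outfile="$2"
  cat > "$outfile" <<EOF
# Accept pending breakpoints without asking the y/n question.
set breakpoint pending on
break ${symbol}
run
# The next two commands run once the program has stopped at the breakpoint.
bt
list
EOF
}

gdb_script="$(mktemp)"
write_gdb_script "at::Tensor::neg" "$gdb_script"
cat "$gdb_script"

# To use it (assumes a debug build and a script that reaches the symbol):
#   gdb -x "$gdb_script" --args python3 my_script.py
```

If the breakpoint is never reached, the program simply runs to completion and bt reports that there is no stack, which is a quick hint that the symbol name or the Python code needs another look.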

Build OneFlow in Debug Mode

Linux-Only

OneFlow supports only Linux, not macOS or Windows. I built OneFlow successfully on an AWS GPU host running Amazon Linux 2, which is similar to CentOS.

Use Conda or Docker

The official OneFlow documentation recommends building with Conda or a Docker image: https://github.com/Oneflow-Inc/oneflow#option-1-build-with-conda-recommended. I use Anaconda. The reason to use Conda or Docker is to pin the versions of the C++ compiler and the rest of the build toolchain. Using a newer version of g++ might require some changes to the source code, for example, https://github.com/Oneflow-Inc/oneflow/issues/8397.

Build the Debug Version

Please be aware that we must build a debug version of OneFlow for this trick to work, because GDB needs the debug symbols for the output of bt and l to make sense.

I built a CPU-only version of OneFlow, hence the cpu.cmake file in the cmake command below. My AWS host is outside China, so I used the file in the international directory.

Report Errors

I ran into some issues while building OneFlow. Once I reported them, the OneFlow authors responded promptly. Kudos to these awesome developers.

Build Step-by-Step

This section records my step-by-step process.

  1. Download and install Anaconda. The default installation destination will be ~/anaconda3. The installation process adds environment variable settings to ~/.bashrc. So, source it or reconnect to the host to make the changes take effect.
  2. Follow https://github.com/Oneflow-Inc/conda-env to create and activate the Conda environment.
  3. Git clone the source code.

     mkdir ~/w
     cd ~/w
     git clone https://github.com/Oneflow-Inc/oneflow

  4. Build OneFlow.

     cd oneflow
     mkdir build
     cd build
     CMAKE_BUILD_TYPE=Debug cmake .. -C ../cmake/caches/international/cpu.cmake
     make -k -j $(nproc)
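Since the whole trick depends on debug symbols, it may be worth confirming that the build tree was really configured as Debug before waiting for the long build. CMake records the setting in CMakeCache.txt, so a small check could look like the sketch below. cmake_build_type is a hypothetical helper name, and the BUILD_DIR default simply mirrors the paths used in this post.

```shell
#!/usr/bin/env bash
# Sketch: read the configured build type back from CMakeCache.txt.
# cmake_build_type is a hypothetical helper, not a CMake command.
cmake_build_type() {
  # Cache entries look like: CMAKE_BUILD_TYPE:STRING=Debug
  sed -n 's/^CMAKE_BUILD_TYPE:[A-Z]*=//p' "$1/CMakeCache.txt"
}

# Assumed location; override BUILD_DIR if the build tree lives elsewhere.
build_dir="${BUILD_DIR:-$HOME/w/oneflow/build}"
if [ -f "$build_dir/CMakeCache.txt" ]; then
  echo "build type: $(cmake_build_type "$build_dir")"
else
  echo "no CMakeCache.txt in $build_dir; run cmake first"
fi
```

If this prints something other than Debug, passing -DCMAKE_BUILD_TYPE=Debug on the cmake command line is the explicit way to set it.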

Run and Debug

After the build finishes, the ~/w/oneflow/build directory contains a file named source.sh, which sets the PYTHONPATH environment variable. Run source source.sh in that directory to make it take effect.

Then, we can run the Python interpreter under GDB by typing gdb python3.

At the GDB prompt, I set a breakpoint at oneflow::one::Tensor::is_eager and let GDB make it pending on future shared library load.

Then, I made GDB run the Python interpreter by typing the run command. At the Python interpreter prompt, I imported oneflow.

Importing under GDB takes much longer than usual. If it raises ImportError, make sure that you have run source source.sh as described above.
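When the import fails, a quick way to narrow it down is to check whether any entry on PYTHONPATH actually contains an oneflow package directory. The sketch below assumes the package is a directory named oneflow under one of the PYTHONPATH entries; find_on_pythonpath is a hypothetical helper name, and it does not look at site-packages or other places Python searches.

```shell
#!/usr/bin/env bash
# Sketch: print the first PYTHONPATH entry that contains a given package directory.
# find_on_pythonpath is a hypothetical helper, not part of Python or OneFlow.
find_on_pythonpath() {
  local pkg="$1" entry old_ifs="$IFS"
  IFS=':'
  for entry in ${PYTHONPATH:-}; do
    if [ -d "$entry/$pkg" ]; then
      IFS="$old_ifs"
      echo "$entry"
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1
}

if dir="$(find_on_pythonpath oneflow)"; then
  echo "oneflow package found under: $dir"
else
  echo "oneflow not found on PYTHONPATH; run source source.sh in the build directory first"
fi
```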

Now, let us create a tensor.

After pressing Enter, the execution of this line triggers the breakpoint. The message GDB prints when it stops tells us that the function oneflow::one::CopyBetweenMirroredTensorAndNumpy called tensor->is_eager() at line 98 in a source file.

To display more context, we can type l. At line 98, there is the call to tensor->is_eager().

We might be curious about why/how the creation of a tensor in the Python world would trigger the call to Tensor::is_eager. We can reveal more details by typing the bt command.

At the bottom of the call stack is _start, which is the entry point of the Python interpreter. Looking upward, we can see the call boundary between Python and the OneFlow shared library: the function _PyMethodDef_RawFastCallKeywords in Python calls the OneFlow C++ function oneflow::one::functional::tensor, which, in turn, triggers the call to oneflow::one::Tensor::is_eager.

Related articles:

  1. The Journey of an Operator in a Deep Learning Framework
  2. How to derive ring all-reduce’s mathematical property step by step

You are welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.

You are also welcome to join our Discord group to discuss and ask OneFlow-related questions, and to connect with OneFlow contributors and users all around the world.
