Xilinx DMA PCIe tutorial - Part 1
This document is a thorough tutorial on how to implement a DMA controller with Xilinx IP. My idea was to write a comprehensive guide with all the Do’s and Don’ts related to implementing a DMA system with Xilinx IPs.
I’ve spent a lot of time searching for such a tutorial but could not find any useful ones that fit my needs. I wanted to design a quick DMA controller to check the bandwidth of Xilinx PCIe, preferably free, that would work flawlessly.
Eventually I decided to do it on my own, and after successfully designing it with very impressive throughput results, I thought it would be a good idea to publish an article.
This article, tutorial or guide (whatever you wish to call it) is NOT a full solution, nor a partial one. In fact, I did not put (almost) any source code here. I truly believe that in order to learn something and really understand it, one needs to get his hands dirty as much as he can. So whoever feels offended by this approach is welcome to search for code elsewhere 😊.
I did my best to stay away from the usual RTFM stuff, and really give the tips and tricks that helped me implement a fully working DMA system with Xilinx.
This guide is divided into 3 parts:
- Part 1 is a general explanation of the notion of DMA. I’ve put in some links to DMA tutorials (since there are many), but also interesting facts related to DMA with a Scatter-Gather descriptor table.
- Part 2 is dedicated to the Xilinx XDMA. This is the main core in my project and I’ve explained all the tabs (almost all options, except advanced features which I did not use).
- Part 3 goes over all the main blocks in my project. This is an extensive overview of all the blocks I’ve used and I think it will give a nice starting point to whoever wants to implement DMA with the Xilinx XDMA core.
Part 1:
DMA — Don’t Mess Around!
A DMA (Direct Memory Access) engine is a key element in achieving high bandwidth utilization for PCI Express applications. It frees up CPU resources from data streaming and helps to improve overall system performance. In a typical system with PCIe architecture, the PCIe Endpoints often contain a DMA engine. This engine is controlled by the system host to transfer data between system memory and the Endpoint.
I will not go over the DMA concept and methodology (Wikipedia can help with that). I will briefly explain the 2 main types of DMA transfers:
1. “Common-buffer DMA” (“continuous DMA”)
2. “Scatter/gather DMA”
Common-buffer DMA is based on a single buffer. The base physical pointer of the buffer is passed to the hardware and the data transfer starts from that address. Pretty straightforward indeed.
The main disadvantage is that it is one single buffer. If it needs to be rather big (and big can be a few MB), there could be a problem allocating it, and there is the issue of ownership. Who owns it? Is it the hardware (driver) or the host (PCI root port)?
Scatter-gather solves the problem of the contiguous buffer by allocating many small chunks, each with its own address and size. These are listed in a descriptor table: a table of address and length pairs which prepares the engine for all the transfers.
The DMA engine incurs very little overhead when switching between buffer locations, because no software intervention is necessary; everything is done in hardware logic.
The resulting buffer is not physically contiguous. This is a virtual-to-physical mapping: the hardware knows it, the driver knows it, but the software does not; it sees a fully contiguous buffer. This is the simple philosophy behind it. The physical memory is divided into discrete units called pages. Page size varies from one architecture to the next, although most systems currently use 4096-byte pages. These physically non-contiguous pages can be assembled into a single, contiguous array from the device’s point of view.
This table of addresses and sizes must be stored and handled somewhere (some physical storage, like DDR or BRAM, for example) prior to the DMA transaction, and the ownership comes for free, as each chunk is owned by the hardware (driver). Obviously, the physical pointer to each small buffer needs to be passed to the hardware DMA mechanism, and someone needs to handle these pointers and decide when to use each one. The driver can do it, or the FPGA can handle the passing of descriptors. In Xilinx language this is referred to as “Descriptor Bypass”. I’ll explain it later in more detail.
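To make the descriptor idea concrete, here is a minimal C sketch of what a single scatter-gather descriptor could look like. The field names and widths here are my own illustration; the authoritative XDMA descriptor layout (a magic/control word, a length field, and 64-bit source, destination and next-descriptor addresses) is defined in PG195.

```c
#include <stdint.h>

/* Illustrative scatter-gather descriptor: one entry per small buffer chunk.
 * Field names are illustrative; see PG195 for the exact XDMA descriptor format. */
struct sg_descriptor {
    uint32_t control;    /* magic + control flags (e.g. stop, completed, EOP)   */
    uint32_t length;     /* number of bytes to move for this chunk              */
    uint64_t src_addr;   /* physical source address (host memory for H2C)       */
    uint64_t dst_addr;   /* physical destination address (card memory for H2C)  */
    uint64_t next_desc;  /* physical address of the next descriptor in the chain*/
};
```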
As pointed out, even when allocating 10 MB as one contiguous (virtual) space, there could be thousands of descriptors allocated for this task, each covering one 4096-byte page (10 MB / 4 KB = 2560 descriptors). Obviously, a small number of descriptors, i.e. a smaller table, is much preferred, since such a table can be stored in the FPGA internal RAM (BRAM) and thus we gain lower latency, a less error-prone mechanism, etc.
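If you want to check this number for arbitrary buffer sizes, here is a tiny sketch (assuming 4096-byte pages and one descriptor per page); it rounds up so a partially filled last page still gets its own descriptor.

```c
#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size; matches most current systems */

/* One descriptor per page, rounding up for a partially filled last page. */
static size_t num_descriptors(size_t buffer_bytes)
{
    return (buffer_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
    /* A 10 MB buffer split into 4 KB pages needs 2560 descriptors. */
    printf("10 MB buffer -> %zu descriptors\n",
           num_descriptors(10u * 1024u * 1024u));
    return 0;
}
```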
When the driver maps the non-contiguous physical memory to a contiguous view, we call it virtual memory. The driver can map the buffers before the operating system has fully started up, so the chances of allocating large chunks are higher (but not guaranteed); in that case the mapping is done in kernel space. If the driver maps the buffer after the OS has started, there will be many more pages to allocate, as the PCIe peripherals have already taken the big memory chunks; in that case the mapping is done in user space.
In FPGA applications involving PCIe and DMA, the DMA usually refers to moving data between the host memory and the internal memory that resides in the FPGA. Moving data from the host to the FPGA is referred to as the Host to Card (H2C) direction, which is a DMA read channel, while moving data from the FPGA to the host is the Card to Host (C2H) direction, which is a DMA write channel.
For this tutorial, I decided to go with the H2C direction, based on a scatter-gather driver configuration. I chose the FPGA to manage the table; this is the ‘ownership’ I referred to earlier. The driver is in charge of allocating the memory, but the FPGA decides when to pass a new descriptor from the table. This is the basic idea of the DMA controller.
Hardware
For my project I chose to work with the KCU105 Xilinx UltraScale evaluation board with a PCIe x16 lane width. This board is built around a Kintex UltraScale XCKU040-2FFVA1156E device and has a lot of other goodies to play with.
PCIe + DMA solutions:
Clicking on the ‘+’ icon in the Vivado block design (BD) and searching for ‘PCI’ brings up the available PCIe-related IP options.
There are various solutions the user can choose from. To start, we’ll need the Xilinx AXI Bridge for PCI Express. This is the basic building block which enables the PCIe interface.
Still, this block does not include any DMA implementation, so the user must wrap his own DMA mechanism on top of it, meaning a DMA mechanism written from scratch and wrapped over this block. There are various solutions for such a wrapper, and for different DMA controller methodologies as well. I will name here several applicable solutions:
- The Northwest Logic DMA back-end core
- XAPP1052
- XAPP1171
- XAPP1289
- A MicroBlaze-based approach (forum link)
After going over the various options, I eventually decided to go with Xilinx’s own DMA/Bridge Subsystem for PCI Express (XDMA); if it’s free and gives all these wonderful features, why not use it? There are some drawbacks, which I will mention later.
Regarding PG195: PG195, the XDMA product guide, is a very informative manual (about 150 pages). I’ve read it a few times, just to get the hang of it. It has many simulations and example designs (chapter 6) which were useful in the early stages of my design. I must say that going over this chapter made me feel a bit dizzy, as it is not very clear what to do in each simulation. Eventually I succeeded in setting up the basic PCIe simulation with my DMA control mechanism, but it was not so easy to get running.
Driver
Xilinx has their own driver, with a very informative manual (AR65444). The AR is a straightforward manual with all the code needed (in C) to set up the driver and run a DMA test (H2C and C2H). It has a readme file which explains exactly what to do and how to compile the driver. It gives a good starting point for understanding the DMA concept, as the core registers are used extensively there, and I used it successfully.
You can choose between working with the Windows driver or the Linux driver; the concept is the same.
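To illustrate how simple the driver makes an H2C transfer, here is a minimal Linux sketch. It assumes the AR65444 XDMA driver is loaded and has created a /dev/xdma0_h2c_0 character device (the device name and channel index may differ on your setup); the driver builds the scatter-gather descriptor list and runs the H2C engine behind this single pwrite() call.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;                 /* one page of test data */
    uint8_t *buf = malloc(len);
    if (buf == NULL)
        return 1;
    memset(buf, 0xA5, len);                  /* simple test pattern */

    int fd = open("/dev/xdma0_h2c_0", O_WRONLY);   /* H2C channel 0 (assumed name) */
    if (fd < 0) {
        perror("open /dev/xdma0_h2c_0");
        free(buf);
        return 1;
    }

    /* Write 'len' bytes to card offset 0; the driver handles the descriptors. */
    ssize_t rc = pwrite(fd, buf, len, 0);
    if (rc != (ssize_t)len)
        perror("pwrite");

    close(fd);
    free(buf);
    return rc == (ssize_t)len ? 0 : 1;
}
```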
AR71435 is a very thorough tutorial on the interaction between the driver and the XDMA. It can give additional information on how to use the driver. I suggest reading it and running the various tests implemented there.
Jason Lawley, a Xilinx expert on PCIe applications, has a great tutorial on getting the best performance out of Xilinx’s DMA engine. I strongly urge anyone who plans to design a DMA controller to first go over this tutorial, together with AR65444.
So, after this long introduction, I think it’s time to move on to the real deal.
Requirements
To complete this hands-on tutorial, you will need the following:
- Vivado 2018.2: the screenshots are taken from this version. Obviously, a higher version will also work, but the section numbers I’ve written here may differ.
- KCU105 Evaluation Board
- Xilinx PCIe Driver
Parts 2 & 3 of this tutorial can be found on my Hackster.io webpage or my LinkedIn webpage, along with other published articles.