1 ARMv8-A Memory systems

You should understand the operation of the memory system and access ordering in cases where your code interacts directly either with the hardware or with code executing on other cores, or if it directly loads or writes instructions to be executed, or modifies translation tables.

If you are an application developer, hardware interaction on an OS such as Linux is probably through a device driver, interaction with other cores is through Pthreads or another multithreading API, and interaction with a paged memory system is through the operating system. In these cases, the relevant code takes care of the memory ordering issues. However, this is not the case for all operating systems, and you must check whether the same is true for the OS you work with.

However, if you are, for example, writing an operating system kernel or device drivers, or implementing a hypervisor, you must have a good understanding of the memory ordering rules of the ARM architecture.

You might need to enforce ordering explicitly when your code requires memory accesses to be seen in a particular order by other cores or devices in the system.

2 The memory model

Compilers give you a wide range of options that aim to increase the speed, or reduce the size, of the executable files they generate. For each line in the source code, there are many possible choices of assembly instructions that could be used.

The ARMv8-A architecture employs a weakly ordered model of memory. This means that the order of memory accesses is not necessarily required to be the same as the program order for load and store operations.

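1. On ARM, as long as no dependency exists, there is no requirement on the order in which instructions execute: load instructions (written "L") and store instructions (written "S") can be reordered freely. This is a relaxed model, commonly called weak ordering.
2. On x86, for instructions executed by the same CPU, a load followed by a load (L-L), a store followed by a store (S-S), and a load followed by a store (L-S) cannot be reordered. Only a store followed by a load (S-L) can be [Note 1]. This memory ordering is called TSO (Total Store Order), commonly called strong ordering.

Why can only a store followed by a load be reordered? If a write has not yet completed, there is usually little impact. If a read has not yet completed, however, later instructions that depend on the data might be unable to continue, which causes a stall, so the CPU gives priority to completing reads.
You can think of this as reads having a higher priority than writes, so a load can run ahead of a store. In the other three cases, either the two instructions have the same priority or the later instruction has the lower priority.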
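
The S-L case described above is the classic "store buffering" test. The following is a minimal sketch in C11, assuming two cores that each run one of the functions. The names `x`, `y`, `core0`, `core1`, and `reset` are illustrative and do not come from the ARM documentation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Two shared locations, both initially 0. Names are illustrative. */
static atomic_int x, y;

static void check_order(memory_order mo) {
    /* Accept only orders that are valid for both a store and a load. */
    assert(mo == memory_order_relaxed || mo == memory_order_seq_cst);
}

/* Intended for core 0: store to x, then load y (an S-L pair). */
int core0(memory_order mo) {
    check_order(mo);
    atomic_store_explicit(&x, 1, mo);
    return atomic_load_explicit(&y, mo);
}

/* Intended for core 1: store to y, then load x (an S-L pair). */
int core1(memory_order mo) {
    check_order(mo);
    atomic_store_explicit(&y, 1, mo);
    return atomic_load_explicit(&x, mo);
}

/* Restore the initial state between runs. */
void reset(void) {
    atomic_store(&x, 0);
    atomic_store(&y, 0);
}
```

When the two functions run concurrently with `memory_order_relaxed`, both can return 0 on ARM and on x86 alike, because both models allow the load to pass the earlier store. With `memory_order_seq_cst`, at least one of them returns 1.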

During the optimization process, the processor and system elements can reorder memory read operations with respect to each other to improve data throughput. Writes can also be reordered. This means that the required bandwidth between the processor and external memory can be reduced and the long latencies that are associated with such external memory accesses are hidden.

For such reordering to be possible, the architecture must provide memory types that permit these optimizations.

Hardware can reorder reads and writes to Normal memory. Such accesses are ordered only by address or data dependencies, by explicit memory barrier instructions, and by half (one-way) barriers. Certain situations require stronger ordering rules. You can provide information to the core about this through the memory type attribute of the translation table entry that describes that memory.

High-performance systems can support techniques such as speculative memory reads, multiple issuing of instructions, or out-of-order execution and these, along with other techniques, offer further possibilities for hardware reordering of memory access:

Multiple issue of instructions

Processors can issue and execute multiple instructions per cycle. Some instructions can reach the execution stage of the pipeline in parallel and, as a result, might execute in a different order from their order in the program.

Out-of-order execution

Many processors support out-of-order execution of non-dependent instructions. Because of the multiple issue of instructions, some instructions can stall in the execution stage while they wait for others to complete, but these stalls do not stop non-dependent instructions from completing. This can also change the order in which instructions execute relative to program order.

Speculation

When the processor encounters a conditional instruction, such as a branch, it can begin to execute instructions before it knows for certain whether that particular instruction is executed or not.

The result is therefore available sooner if conditions prove that the speculation was correct.

Instruction fetch speculation is the fetch of instructions that are not defined by the program execution order.

Speculative loads

If a load instruction that reads from a Cacheable location is speculatively executed, this can result in a cache linefill and the potential eviction of an existing cache line.

Load and store optimizations

As reads and writes to external memory can have a long latency, processors can reduce the number of transfers, for example by merging several stores into one larger transaction.

External memory systems

In many System on Chip (SoC) devices, there are several agents capable of initiating transfers and multiple routes to the slave devices that are read or written.

Some of these devices, such as a DRAM controller, are capable of accepting simultaneous requests from different masters. Transactions can be buffered, or reordered.

This means that accesses from different masters can take varying numbers of cycles to complete and might overtake each other.

Cache coherent multi-core processing

In a cluster, hardware cache coherency can migrate cache lines between cores.

Different cores might see updates to cached memory locations in a different order to each other.

Also, these might not be coherent with external memory.

Optimizing compilers

An optimizing compiler can reorder instructions to hide latencies or make best use of hardware features.

It can often move a memory access forward, to make it earlier, and give it more time to complete before the value is required.

Its instruction scheduling can also take advantage of the multi-issue pipeline of a specific core.

In a single core system, the effects of such reordering are transparent to the programmer, because the individual processor can check for hazards and ensure that data dependencies are respected. However, in cases where you have multiple cores that communicate through shared memory, or share data in other ways, memory ordering considerations become more important.

3 Memory types

The ARMv8-A architecture defines two mutually exclusive memory types, Normal and Device. All regions of memory are configured as one or the other of these two types.

3.1 Normal memory

Normal memory is used for all code and for most data regions in memory. Examples of Normal memory include areas of RAM, Flash, or ROM in physical memory. This kind of memory provides the highest processor performance as it is weakly ordered and the compiler can perform more optimizations. The processor can reorder, repeat, and merge accesses to Normal memory.

The processor can speculatively access address locations that are marked as Normal, so that data or instructions can be read from memory without being explicitly referenced in the program, or before the actual execution of an explicit reference. Such speculative accesses can occur as a result of branch prediction, speculative cache linefills, out-of-order data loads, or other hardware optimizations.

For best performance, always mark application code and data as Normal. In circumstances where enforced memory ordering is required, do this by using explicit barrier operations. Normal memory accepts weakly ordered memory accesses without any issues. There is no requirement for Normal accesses to complete in order with respect to either other Normal accesses or to Device accesses.

However, the processor must always handle hazards that are caused by address dependencies. For example, consider the following simple code sequence:

STR X0, [X2]
LDR X1, [X2]

A single processor running a single thread always ensures that the value that is placed in X1 is the value that was written from register X0 through to the address stored in X2.

This applies to more complex dependencies. Consider the following code:

ADD X4, X3, #3
ADD X5, X3, #2
...
STR X0, [X3]
STRB W1, [X4]
LDRH W2, [X5]

In this case, the accesses take place to addresses that overlap each other. The processor must ensure that the memory is updated as if the STR and STRB occurred in order, so that the LDRH returns the most up-to-date value. It would still be valid for the processor to merge the STR and STRB into a single access that contained the latest, correct data written.
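
The byte-level effect of the overlapping accesses can be modelled in C. This is only a sketch of the architecturally visible result, assuming little-endian data accesses and a base address of 0 for X3. The helpers `store_le`, `load_le`, and `overlap_example` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian store of `size` bytes; models STR (8 bytes) and STRB (1 byte). */
static void store_le(uint8_t *mem, size_t addr, uint64_t value, size_t size) {
    assert(size <= 8);
    for (size_t i = 0; i < size; i++)
        mem[addr + i] = (uint8_t)(value >> (8 * i));
}

/* Little-endian load of `size` bytes; models LDRH (2 bytes). */
static uint64_t load_le(const uint8_t *mem, size_t addr, size_t size) {
    assert(size <= 8);
    uint64_t value = 0;
    for (size_t i = 0; i < size; i++)
        value |= (uint64_t)mem[addr + i] << (8 * i);
    return value;
}

/* Replays the sequence from the text in program order and returns W2. */
uint16_t overlap_example(uint64_t x0, uint8_t w1) {
    uint8_t mem[16] = {0};
    size_t x3 = 0;        /* assumed base address */
    size_t x4 = x3 + 3;   /* ADD X4, X3, #3 */
    size_t x5 = x3 + 2;   /* ADD X5, X3, #2 */

    store_le(mem, x3, x0, 8);              /* STR  X0, [X3] */
    store_le(mem, x4, w1, 1);              /* STRB W1, [X4] */
    return (uint16_t)load_le(mem, x5, 2);  /* LDRH W2, [X5] */
}
```

The halfword that is read back combines byte 2 of the STR data with the byte written by the STRB, which is the result the processor must produce however it merges or buffers the two stores.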

3.2 Device memory

The Device memory type is used with memory-mapped peripherals and all memory regions where an access might have a side effect. For example, a read to a timer is not repeatable, as it returns different values for each read. A write to a control register can trigger an interrupt. The Device memory type imposes more restrictions on the core.

Accesses to these types of memory must occur exactly the number of times that executing the program suggests they should. Two writes to the same location must be performed as two writes, and two reads from the same location must both take place. This is important when you are accessing peripheral control registers.

There is, however, no guarantee about ordering between memory accesses to different devices, or usually between accesses of different memory types.

Speculative data accesses cannot be performed to regions of memory that are marked as Device.

Trying to execute code from a region marked as Device is UNPREDICTABLE.

When an instruction can result in UNPREDICTABLE behavior, the ARM architecture can specify a narrow range of permitted behaviors. This is defined as a number of CONSTRAINED UNPREDICTABLE behaviors. The implementation can either handle the instruction fetch as if it were to a memory location with the normal Non-cacheable attribute, or it can take a permission fault.

There are four different types of device memory, defining the rules which memory accesses must obey.

As the memory type weakens, those rules are relaxed.

Device-nGnRnE (most restrictive)

Device-nGnRE

Device-nGRE

Device-GRE (least restrictive)

The letter suffixes refer to the following three properties:

Gathering or non-Gathering (G or nG)

This determines whether multiple accesses can be merged into a single transaction for this memory region. If the address is marked as non-Gathering (nG), then the number and size of accesses that are performed to that location must exactly match the number and size of explicit accesses in the code. If the address is marked as Gathering (G), then the processor can, for example, merge two byte writes into a single halfword write.

Reordering (R or nR)

This determines whether accesses to the same device can be reordered with respect to each other. If the address is marked as non-Reordering (nR), then accesses within the same block always appear on the bus in program order. The size of this block is IMPLEMENTATION DEFINED. Where the size of this block is large, it could span several table entries. In this case, the ordering rule is observed with respect to any other accesses also marked as nR.

Early Write Acknowledgement (E or nE)

This determines whether an intermediate write buffer between the processor and the device being accessed is allowed to send an acknowledgement of a write completion.

If the address is marked as non-Early Write Acknowledgement (nE), then the write response must come from the peripheral. If the address is marked as Early Write Acknowledgement (E), then a buffer in the interconnect logic can signal write acceptance before the write is actually received by the end device. This is essentially a message to the external memory system.
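
The three properties can be read directly off the attribute byte that software programs into MAIR_ELx. The sketch below assumes the base ARMv8.0 Device encoding `0b0000dd00`. That encoding comes from the architecture's register description rather than from this article, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Which relaxations a Device memory type permits. */
typedef struct {
    bool gathering;   /* G: accesses may be merged */
    bool reordering;  /* R: accesses may be reordered */
    bool early_ack;   /* E: early write acknowledgement allowed */
} device_attrs;

/* MAIR_ELx attribute bytes for the four Device types (0b0000dd00). */
enum {
    DEVICE_nGnRnE = 0x00,
    DEVICE_nGnRE  = 0x04,
    DEVICE_nGRE   = 0x08,
    DEVICE_GRE    = 0x0C
};

/* A Device attribute has bits [7:4] and bits [1:0] all zero. */
bool is_device_attr(uint8_t attr) {
    return (attr & 0xF3) == 0;
}

/* Decode bits [3:2]: each step from 0 to 3 relaxes one more rule. */
device_attrs decode_device_attr(uint8_t attr) {
    assert(is_device_attr(attr));
    unsigned dd = (attr >> 2) & 0x3u;
    device_attrs a = {
        .gathering  = dd == 3,
        .reordering = dd >= 2,
        .early_ack  = dd >= 1,
    };
    return a;
}
```

Decoding the four values in order shows the rules relaxing one at a time, from Device-nGnRnE, which permits nothing, to Device-GRE, which permits all three.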

4 Memory attributes

The memory map of a system can be divided into several regions. Each region can have different memory attributes, such as access permissions that include read and write permissions for different privilege levels, memory type, and cache policies.

The following figure shows an example system memory map:

[Figure: example system memory map]

Functional pieces of code and data are grouped in the memory map and the attributes for each of these areas are controlled separately by the Memory Management Unit.

In addition to the memory type, memory attributes also provide control over cacheability, shareability, access, and execution permissions. Shareable and cache properties apply only to Normal memory. Device regions are always Non-cacheable and Outer-shareable. For Cacheable locations, you can use attributes to indicate cache allocation policy to the processor.

4.1 Cacheable and shareable memory attributes

Regions of memory that are marked as Normal can be specified as either cached or non-cached. Memory caching can be separately controlled through inner and outer attributes, for multiple levels of cache. The division between inner and outer is IMPLEMENTATION DEFINED, but typically the inner attributes are used by caches in the processor. The outer attributes are used by external memory where they can be used by caches external to the core or cluster.

The shareable attribute is used to define whether a location is shared with multiple cores. Marking a region as Non-shareable means that it is only used by a particular core, whereas marking it as Inner Shareable or Outer Shareable, or both, means that the location is shared with other observers. For example, a GPU or DMA device might be considered another observer.

The division between inner and outer is also IMPLEMENTATION DEFINED. These attributes define sets of observers for which the shareability attributes make the caches transparent for data accesses. This also means that the system must provide hardware coherency management so that cores in the Inner Shareable domain see a coherent copy of locations that are marked as Inner Shareable.

If a processor or other master in the system does not support coherency, then it must treat the shareable regions as Non-cacheable.

4.2 Domains

Data memory accesses can take longer and consume more power with cache coherency hardware than they otherwise would do. This overhead can be minimized by maintaining coherency between a smaller number of masters while ensuring that they are physically close in the processor. For this reason, the architecture splits the system into domains, and makes it possible to limit the overhead to those locations where the coherency is required.

Shareability is assigned to each memory transaction in the system, based on:

Memory attributes for the region accessed (determined by MMU translation tables).

Core configuration (can differ between cores in a cluster).

Implementation of interconnect.

Integration between interconnect and the masters that are connected to it.

But there are also specific operations that can be performed with a domain defining their scope.

The following shareability domain options are available:

Non-shareable

A domain consisting only of the local agent. It covers accesses that never require synchronization with other cores, processors, or devices. This domain is not typically used in Symmetric Multi-Processing (SMP) systems.

Note: SMP is a software architecture that dynamically determines the roles of individual cores. Each core in the cluster has the same view of memory and of shared hardware. Any application, process, or task can run on any core and the operating system scheduler can migrate tasks between cores to achieve optimal system load.

Inner Shareable

Outer Shareable

Full system

An operation on the full system affects all observers in the system.
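
For memory regions, the shareability domain is selected by the SH field of the translation table entry. The sketch below assumes the standard VMSAv8-64 block and page descriptor layout, where SH[1:0] occupies bits [9:8]. This layout is taken from the architecture reference rather than from this article, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SH[1:0] encodings in a block or page descriptor. */
typedef enum {
    SH_NON_SHAREABLE   = 0,
    SH_RESERVED        = 1,
    SH_OUTER_SHAREABLE = 2,
    SH_INNER_SHAREABLE = 3
} shareability;

#define DESC_SH_SHIFT 8
#define DESC_SH_MASK  (UINT64_C(0x3) << DESC_SH_SHIFT)

/* Extract the shareability field from a descriptor. */
shareability descriptor_shareability(uint64_t desc) {
    return (shareability)((desc & DESC_SH_MASK) >> DESC_SH_SHIFT);
}

/* Return a copy of the descriptor with its shareability field replaced. */
uint64_t set_shareability(uint64_t desc, shareability sh) {
    assert(sh != SH_RESERVED);
    return (desc & ~DESC_SH_MASK) | ((uint64_t)sh << DESC_SH_SHIFT);
}
```

A kernel that maps ordinary RAM for an SMP system would typically choose `SH_INNER_SHAREABLE`, so that all cores running the same OS see a coherent view.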

5 Barriers

The ARM architecture includes barrier instructions to force access ordering and access completion at a specific point.

Barriers are used to prevent unsafe optimizations from occurring and to enforce a specific memory ordering. Use of unnecessary barrier instructions can therefore reduce software performance. Consider carefully whether a barrier is necessary in a specific situation, and if so, which is the correct barrier to use.

There are three types of barrier instruction.

5.1 Instruction Synchronization Barrier (ISB)

This is used to guarantee that any subsequent instructions are fetched again, so that privilege and access are checked with the current MMU configuration. It is used to ensure that any previously executed context-changing operations, such as writes to system control registers, have completed by the time the ISB completes.

In hardware terms, for example, this might mean that the instruction pipeline is flushed. Typical uses of this would be in memory management, cache control, and context switching code, or where code is being moved about in memory.

The following example shows how to enable the floating-point unit and SIMD, which you can do in AArch64 by setting the FPEN field, bits [21:20], of the CPACR_EL1 register. The ISB is a context synchronization event that guarantees that the enable is complete before any subsequent FPU or NEON instructions are executed.

MRS X1, CPACR_EL1 // Copy contents of CPACR to X1
ORR X1, X1, #(0x3 << 20) // Set bits [21:20] of X1 (enable FPU and SIMD)
MSR CPACR_EL1, X1 // Write contents of X1 to CPACR
ISB

An ISB flushes the pipeline and ensures that the effects of any completed context-changing operation before the ISB are visible to any instruction after the ISB. Instructions from the cache or memory are refetched.

It also ensures that any context-changing operations after the ISB instruction only take effect after the ISB has completed and are not seen by instructions before the ISB.

This does not mean that an ISB is required after each instruction that modifies a processor register. For example, reads or writes to PSTATE fields, ELRs, SPs, and SPSRs always occur in program order relative to other instructions.
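
The read-modify-write of CPACR_EL1 in the FPU and SIMD example earlier in this section amounts to the following bit manipulation. This is a sketch of the value computation only: the register access itself needs MRS and MSR at EL1 followed by an ISB, and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* FPEN is a two-bit field at bits [21:20] of CPACR_EL1. */
#define CPACR_EL1_FPEN_SHIFT 20
#define CPACR_EL1_FPEN_MASK  (UINT64_C(0x3) << CPACR_EL1_FPEN_SHIFT)

static_assert(CPACR_EL1_FPEN_MASK == UINT64_C(0x300000),
              "FPEN must cover bits [21:20]");

/* Value to write back after reading CPACR_EL1: both FPEN bits set. */
uint64_t cpacr_enable_fp_simd(uint64_t cpacr) {
    return cpacr | CPACR_EL1_FPEN_MASK;
}

/* True when FPEN is 0b11, so FP and SIMD do not trap at EL0 or EL1. */
bool cpacr_fp_simd_enabled(uint64_t cpacr) {
    return (cpacr & CPACR_EL1_FPEN_MASK) == CPACR_EL1_FPEN_MASK;
}
```

Setting only bit 20 would leave FPEN at 0b01, which still traps FP and SIMD instructions executed at EL0. That is why the assembly sequence ORs in `0x3 << 20` rather than a single bit.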

5.2 Data Memory Barrier (DMB)

This prevents reordering of data access instructions across the DMB instruction. All data accesses (that is, loads or stores, but not instruction fetches) performed by this processor before the DMB are visible to all other masters within the specified shareability domain before any of the data accesses after the DMB.

For example:

LDR X0, [X1] // Must be seen by the memory system before the
// STR below.
DMB ISHLD
ADD X2, X2, #1 // May be executed before or after the memory
// system sees LDR.
STR X3, [X4] // Must be seen by the memory system after the
// LDR above.

It also ensures that any explicit preceding data or unified cache maintenance operations have completed before any subsequent data accesses are executed.

For example:

DC CSW, X5 // Data clean by Set/way
LDR X0, [X1] // Effect of data cache clean might not be seen by
// this instruction
DMB ISH
LDR X2, [X3] // Effect of data cache clean is seen by this
// instruction
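
In portable C11, the closest counterpart to the `DMB ISHLD` example earlier in this section is an acquire fence between a relaxed load and a relaxed store. The sketch below assumes, as is common but not guaranteed, that an AArch64 compiler implements the fence with a load barrier. `load_fence_store` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>

/* Load from *src, then store `value` to *dst, with the load ordered first. */
int load_fence_store(atomic_int *src, atomic_int *dst, int value) {
    assert(src != dst);

    /* Plays the role of: LDR X0, [X1] */
    int seen = atomic_load_explicit(src, memory_order_relaxed);

    /* Orders the earlier load before all later loads and stores,
     * which is the Load - Load, Load - Store behavior of DMB ISHLD. */
    atomic_thread_fence(memory_order_acquire);

    /* Plays the role of: STR X3, [X4] */
    atomic_store_explicit(dst, value, memory_order_relaxed);
    return seen;
}
```

Register-only work such as the `ADD` in the assembly example has no equivalent constraint here: the fence orders memory accesses, not arithmetic.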

5.3 Data Synchronization Barrier (DSB)

DSB enforces the same ordering as the Data Memory Barrier, but it also blocks execution of any further instructions, not just loads or stores, until synchronization is complete. This can be used to prevent execution of a SEV instruction, for instance, that would signal to other cores that an event occurred. It waits until all cache, TLB, and branch predictor maintenance operations that are issued by this processor have completed for the specified shareability domain.

For example:

DC ISW, X5 // Operation must have completed before DSB can
// complete
STR X0, [X1] // Access must have completed before DSB can complete
DSB ISH
ADD X2, X2, #3 // Cannot be executed until DSB completes

5.4 Using barriers

The DMB and DSB instructions take a parameter that specifies the types of access, before or after the barrier, on which the barrier operates, and the shareability domain to which it applies.

The available options are listed in the following table.

| <option> | Description | Ordered Accesses (before - after) | Shareability Domain |
|---|---|---|---|
| OSHLD | Operation that waits only for loads to complete, and only to the Outer Shareable domain. | Load - Load, Load - Store | Outer Shareable |
| OSHST | Operation that waits only for stores to complete, and only to the Outer Shareable domain. | Store - Store | Outer Shareable |
| OSH | Operation only to the Outer Shareable domain. | Any - Any | Outer Shareable |
| NSHLD | Operation that waits only for loads to complete and only out to the point of unification. | Load - Load, Load - Store | Non-shareable |
| NSHST | Operation that waits only for stores to complete and only out to the point of unification. | Store - Store | Non-shareable |
| NSH | Operation only out to the point of unification. | Any - Any | Non-shareable |
| ISHLD | Operation that waits only for loads to complete, and only to the Inner Shareable domain. | Load - Load, Load - Store | Inner Shareable |
| ISHST | Operation that waits only for stores to complete, and only to the Inner Shareable domain. | Store - Store | Inner Shareable |
| ISH | Operation only to the Inner Shareable domain. | Any - Any | Inner Shareable |
| LD | Operation that waits only for loads to complete. | Load - Load, Load - Store | Full system |
| ST | Operation that waits only for stores to complete. | Store - Store | Full system |
| SY | Full system operation. This is the default and can be omitted. | Any - Any | Full system |

The ordered access field specifies which classes of accesses the barrier operates on. There are three options.

Load - Load/Store

This means that the barrier requires all loads to complete before the barrier but does not require stores to complete. Both loads and stores that appear after the barrier in program order must wait for the barrier to complete.

Store - Store

This means that the barrier only affects store accesses and that loads can still be freely reordered around the barrier.

Any - Any

This means that both loads and stores must complete before the barrier. Both loads and stores that appear after the barrier in program order must wait for the barrier to complete.
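
The option names combine a domain with an access class, and the A64 instruction encoding reflects this. The sketch below assumes the published A64 encodings, in which the option is the 4-bit CRm field at bits [11:8], with CRm[3:2] selecting the domain and CRm[1:0] the access class. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* CRm[3:2]: shareability domain of the barrier. */
typedef enum { DOM_OSH = 0, DOM_NSH = 1, DOM_ISH = 2, DOM_SY = 3 } barrier_domain;

/* CRm[1:0]: LD = Load - Load/Store, ST = Store - Store, ANY = Any - Any. */
typedef enum { ACC_LD = 1, ACC_ST = 2, ACC_ANY = 3 } barrier_access;

/* Build the 4-bit option, for example ISH + LD gives ISHLD (0b1001). */
unsigned barrier_option(barrier_domain domain, barrier_access access) {
    assert(access >= ACC_LD && access <= ACC_ANY);
    return ((unsigned)domain << 2) | (unsigned)access;
}

/* 32-bit encoding of DMB <option>. */
uint32_t dmb_encoding(barrier_domain domain, barrier_access access) {
    return UINT32_C(0xD50330BF) | (uint32_t)(barrier_option(domain, access) << 8);
}

/* 32-bit encoding of DSB <option>. */
uint32_t dsb_encoding(barrier_domain domain, barrier_access access) {
    return UINT32_C(0xD503309F) | (uint32_t)(barrier_option(domain, access) << 8);
}
```

For instance, `dmb_encoding(DOM_ISH, ACC_ANY)` yields the encoding of `DMB ISH`, and `DOM_SY` with `ACC_ANY` yields the default full-system `SY` option.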

A more subtle effect of the ordering rules is that the instruction interface, data interface, and MMU table walker of a core are considered as separate observers. This means that you might need, for example, to use DSB instructions to ensure that an access to one interface is guaranteed to be observable on a different interface.

If you execute a data cache clean instruction, DC CVAU, X0 for example, you must insert a DSB instruction after it to be sure that subsequent translation table walks, modifications to translation table entries, instruction fetches, or updates to instructions in memory can all see the new values.

For example, consider an update of the translation tables:

STR X0, [X1] // Update a translation table entry
DSB ISHST // Ensure the write has completed
TLBI VAE1IS, X2 // Invalidate the TLB entry for the entry that
// changes
DSB ISH // Ensure that the TLB invalidation is complete
ISB

A DSB is required to ensure that the maintenance operations complete and an ISB is required to ensure that the effects of those operations are seen by the instructions that follow.

The processor might speculatively access an address that is marked as Normal at any time. So when considering whether barriers are required, consider more than just explicit accesses that are generated by load or store instructions.

5.5 One-way barriers

A64 adds new load and store instructions with implicit barrier semantics. The instructions are less restrictive than either DMB or DSB instructions. They also require that all loads and stores before or after the implicit barrier are observed in program order.

Load-Acquire (LDAR)

All loads and stores that are after an LDAR in program order, and that match the shareability domain of the target address, must be observed after the LDAR.

Store-Release (STLR)

All loads and stores preceding an STLR that match the shareability domain of the target address must be observed before the STLR.

There are also exclusive versions of the above, LDAXR and STLXR, available.
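
A simple spinlock is a typical use of acquire and release semantics on a read-modify-write. The sketch below uses C11 `atomic_flag`. It assumes, without the standard requiring it, that an ARMv8.0 compiler builds the test-and-set from an exclusive load-acquire and store pair. The `spinlock` type and its functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag held;  /* set while some thread owns the lock */
} spinlock;

/* Put the lock into the unlocked state. */
void spin_init(spinlock *lock) {
    assert(lock != NULL);
    atomic_flag_clear_explicit(&lock->held, memory_order_relaxed);
}

/* Acquire semantics: accesses in the critical section cannot move above this. */
bool spin_trylock(spinlock *lock) {
    return !atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

/* Busy-wait until the lock is obtained. */
void spin_lock(spinlock *lock) {
    while (!spin_trylock(lock)) {
        /* spin */
    }
}

/* Release semantics: accesses in the critical section cannot move below this. */
void spin_unlock(spinlock *lock) {
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

The acquire on lock and the release on unlock together form the two one-way barriers that keep the critical section's accesses between them.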

Unlike the data barrier instructions, which take a qualifier to control which shareability domains see the effect of the barrier, the LDAR and STLR instructions use the attribute of the address accessed.

An LDAR instruction guarantees that any memory access instructions after the LDAR are only visible after the load-acquire. A store-release guarantees that all earlier memory accesses are visible before the store-release becomes visible, and that the store is visible at the same time to all parts of the system capable of storing cached data.
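
The usual pattern built on these guarantees is message passing: one thread writes data and then sets a flag with a store-release, and another reads the flag with a load-acquire before reading the data. The following is a minimal C11 sketch, assuming that the compiler maps the release store and acquire load to STLR and LDAR on AArch64, which is typical but not mandated by the C standard. The names `payload`, `ready`, `publish`, and `try_consume` are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;        /* ordinary, non-atomic data */
static atomic_bool ready;  /* flag that publishes the data */

/* Producer side: write the data, then release the flag. */
void publish(int value) {
    payload = value;
    /* Store-release: the write to payload is visible before the flag. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: returns true and fills *out once the data is published. */
bool try_consume(int *out) {
    assert(out != NULL);
    /* Load-acquire: the read of payload cannot move above this load. */
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

In a real program `publish` and `try_consume` would run on different threads. Without the release and acquire pair, the consumer could observe the flag set and still read a stale payload.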

The following figure shows how accesses can cross a one-way barrier in one direction but not in the other.

[Figure: accesses crossing a one-way barrier in one direction only]

5.6 Use of barriers in C code

The C11 and C++11 languages have a good platform-independent memory model that is preferable to intrinsics.

All versions of C and C++ have sequence points, but C11 and C++11 also provide memory models. Sequence points only prevent the compiler from reordering operations in the C or C++ source code. Nothing stops the processor from reordering instructions in the generated object code, or read and write buffers from reordering the sequence in which data transfers are sent to the cache. In other words, sequence points are only relevant for single-threaded code. For multi-threaded code, use either the memory model features of C11 / C++11 or other synchronization mechanisms, such as mutexes, which are provided by the operating system. Examples of sequence points in code include function calls and accesses to volatile variables.

The C language specification defines sequence points as follows:

“At certain specified points in the execution sequence, called sequence points, all side effects of previous evaluations shall be complete, and no side effects of subsequent evaluations shall have taken place.”

5.7 Barriers in Linux

The Linux kernel includes several platform-independent barrier functions. See the Linux kernel documentation in the memory-barriers.txt file at: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/ for more details.

References

[1]《ARMv8-A Memory systems》

https://developer.arm.com/documentation/100941/0100

[2]《对优化说不 - Linux中的Barrier》

https://zhuanlan.zhihu.com/p/96

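Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.
Article link: https://blog.csdn.net/weixin_49382066/article/details/128811630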
