Mapping MMIO region write-back does not work










I want all read and write requests to a PCIe device to be cached by the CPU caches. However, it does not work as I expected.

These are my assumptions about write-back MMIO regions:

  1. Writes to the PCIe device happen only on cache write-back.

  2. The size of each TLP payload is the cache block size (64B).

However, the captured TLPs do not follow my assumptions:

  1. Writes to the PCIe device happen on every write to the MMIO region.

  2. The size of each TLP payload is 1B.

I write 8 bytes of 0xff to the MMIO region with the following user-space program and device driver.



Part of the user program:

    struct pcie_ioctl ioctl_control;
    ioctl_control.bar_select = BAR_ID;
    ioctl_control.num_bytes_to_write = atoi(argv[1]);
    if (ioctl(fd, IOCTL_WRITE_0xFF, &ioctl_control) < 0)
        printf("ioctl failed\n");



Part of the device driver:

    case IOCTL_WRITE_0xFF:
    {
        int i;
        char *buff;
        struct pci_cdev_struct *pci_cdev = pci_get_drvdata(fpga_pcie_dev.pci_device);

        /* Fetch the request (BAR index and byte count) from user space. */
        if (copy_from_user(&ioctl_control, (void __user *)arg, sizeof(ioctl_control)))
            return -EFAULT;

        buff = kmalloc(sizeof(char) * ioctl_control.num_bytes_to_write, GFP_KERNEL);
        if (!buff)
            return -ENOMEM;
        for (i = 0; i < ioctl_control.num_bytes_to_write; i++)
            buff[i] = 0xff;

        /* Plain memcpy() into the mapped BAR; these are the writes that get captured. */
        memcpy(pci_cdev->bar[ioctl_control.bar_select], buff, ioctl_control.num_bytes_to_write);
        kfree(buff);
        break;
    }
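On an uncacheable mapping each store instruction is sent out as its own write, so the width of the stores used by the copy routine matters. The following is only a user-space sketch of that idea, not the driver's code: `fill_wide` is a hypothetical helper that uses 8-byte stores where alignment allows and reports how many stores it issued. In a real driver the kernel's MMIO accessors (for example `writeq()` or `memcpy_toio()`) would be the place to look instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the original driver): fill `len` bytes at
 * `dst` with `value`, using 8-byte stores where alignment allows and 1-byte
 * stores otherwise. Returns the number of store operations issued.
 */
static size_t fill_wide(volatile void *dst, uint8_t value, size_t len)
{
    volatile uint8_t *p = dst;
    size_t stores = 0;
    uint64_t pattern;

    /* Replicate the byte into all 8 bytes of the wide pattern. */
    memset(&pattern, value, sizeof(pattern));

    /* Byte stores until the pointer is 8-byte aligned. */
    while (len > 0 && ((uintptr_t)p & 7) != 0) {
        *p++ = value;
        len--;
        stores++;
    }
    /* One 8-byte store per aligned 8-byte chunk. */
    while (len >= 8) {
        *(volatile uint64_t *)p = pattern;
        p += 8;
        len -= 8;
        stores++;
    }
    /* Remaining tail bytes. */
    while (len > 0) {
        *p++ = value;
        len--;
        stores++;
    }
    return stores;
}
```

For an aligned 8-byte destination this issues a single store, whereas a byte-wise copy issues eight, which is the difference one would expect to see as packet counts on an uncacheable mapping.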



I modified the MTRRs to make the corresponding MMIO region write-back. The MMIO region starts at 0x0c7300000 and its length is 0x100000 (1MB). The following are the cat /proc/mtrr results for the different policies. Please note that I made the regions mutually exclusive (non-overlapping).



Uncacheable



reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: uncachable
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable


Write-combining



reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-combining
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable


Write-back



reg00: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable
reg01: base=0x380000000000 (58720256MB), size=524288MB, count=1: uncachable
reg02: base=0x0c0000000 ( 3072MB), size= 64MB, count=1: uncachable
reg03: base=0x0c4000000 ( 3136MB), size= 32MB, count=1: uncachable
reg04: base=0x0c6000000 ( 3168MB), size= 16MB, count=1: uncachable
reg05: base=0x0c7000000 ( 3184MB), size= 1MB, count=1: uncachable
reg06: base=0x0c7100000 ( 3185MB), size= 1MB, count=1: uncachable
reg07: base=0x0c7200000 ( 3186MB), size= 1MB, count=1: uncachable
reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-back
reg09: base=0x0c7400000 ( 3188MB), size= 1MB, count=1: uncachable
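To check which entry covers a given physical address, the listings above can be parsed mechanically. This is a minimal sketch assuming the line layout shown in the listings (with the size unit being either KB or MB); `struct mtrr_entry`, `parse_mtrr_line` and `mtrr_covers` are names made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* One parsed /proc/mtrr entry (struct and field names are illustrative). */
struct mtrr_entry {
    int reg;
    unsigned long long base;  /* physical base address in bytes */
    unsigned long long size;  /* region size in bytes */
    char type[32];            /* e.g. "uncachable", "write-back" */
};

/*
 * Parse a line such as
 *   "reg08: base=0x0c7300000 ( 3187MB), size= 1MB, count=1: write-back"
 * Returns 0 on success, -1 if the line does not match the expected layout.
 */
static int parse_mtrr_line(const char *line, struct mtrr_entry *e)
{
    unsigned long long size;
    char unit;
    int count;

    if (sscanf(line, "reg%d: base=0x%llx (%*uMB), size=%llu%cB, count=%d: %31s",
               &e->reg, &e->base, &size, &unit, &count, e->type) != 6)
        return -1;

    if (unit == 'M')
        e->size = size << 20;
    else if (unit == 'K')
        e->size = size << 10;
    else
        return -1;
    return 0;
}

/* Non-zero if `addr` lies inside the entry's [base, base + size) range. */
static int mtrr_covers(const struct mtrr_entry *e, unsigned long long addr)
{
    return addr >= e->base && addr - e->base < e->size;
}
```

Feeding each line of /proc/mtrr through `parse_mtrr_line` and testing `mtrr_covers(&e, 0xc7300000ULL)` should single out reg08 in the listings above.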


The following are waveform captures of an 8B write under the different policies. I used an integrated logic analyzer (ILA) to capture these waveforms. Please watch pcie_endpoint_litepcietlpdepacketizer_tlp_req_payload_dat while pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid is set. You can count the number of packets by counting the assertions of pcie_endpoint_litepcietlpdepacketizer_tlp_req_valid in these waveforms.




  1. Uncacheable: link -> correct, 1B x 8 packets


  2. Write-combining: link -> correct, 8B x 1 packet


  3. Write-back: link -> unexpected, 1B x 8 packets
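The first two counts follow from simple arithmetic, which the toy model below spells out. It is an idealized sketch, not a description of the hardware: it assumes the written bytes sit inside one 64-byte line, that an uncacheable mapping sends every store separately, and that a write-combining mapping merges all of them into one packet. All names are illustrative.

```c
#include <stddef.h>

/* Memory types covered by this toy model (illustrative names). */
enum mem_type { MEM_UC, MEM_WC };

struct write_estimate {
    size_t packets;     /* number of write TLPs expected at the device */
    size_t bytes_each;  /* payload of each TLP, in bytes */
};

/*
 * Idealized estimate for `nbytes` contiguous bytes inside one 64-byte line,
 * written with stores of `store_width` bytes each.
 * Returns {0, 0} for inputs outside the model.
 */
static struct write_estimate estimate_writes(enum mem_type type,
                                             size_t nbytes, size_t store_width)
{
    struct write_estimate e = {0, 0};

    if (store_width == 0 || nbytes == 0 || nbytes > 64 ||
        nbytes % store_width != 0)
        return e;

    if (type == MEM_WC) {
        /* Assumes all stores merge in one write-combining buffer. */
        e.packets = 1;
        e.bytes_each = nbytes;
    } else {
        /* Uncacheable: every store goes out by itself. */
        e.packets = nbytes / store_width;
        e.bytes_each = store_width;
    }
    return e;
}
```

With 8 bytes written one byte at a time, the model gives 8 packets of 1B for uncacheable and 1 packet of 8B for write-combining, matching the first two captures. The write-back capture matches the uncacheable case, not the 64B write-back assumption.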

The system configuration is as follows.




  • CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz


  • OS: Linux kernel 4.15.0-38


  • PCIe Device: Xilinx FPGA KC705 programmed with litepcie

Related Links



  1. Generating a 64-byte read PCIe TLP from an x86 CPU

  2. How to Implement a 64B PCIe* Burst Transfer on Intel® Architecture

  3. Write Combining Buffer Out of Order Writes and PCIe

  4. Do Ryzen support write-back caching for Memory Mapped IO (through PCIe interface)?

  5. MTRR (Memory Type Range Register) control

  6. PATting Linux

  7. Down to the TLP: How PCI express devices talk (Part I)









  • Is it possible that something else (like PAT) is overriding the MTRR setting and making the region actually UC instead of WB? And BTW, the single-byte transactions might come from memcpy being implemented as rep movsb inside the kernel (because your CPU is new enough to have ERMSB, which makes rep movsb fairly good).

    – Peter Cordes
    Nov 15 '18 at 1:48

  • @PeterCordes Thanks, Peter. I found a conflict in PAT: the region is set as uncached-minus in PAT. I will try to solve it.

    – Taekyung Heo
    Nov 15 '18 at 5:22
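The PAT conflict mentioned in the comments can be written down as a small lookup. Intel's description of UC- (uncached-minus) is that it behaves like UC except that a write-combining MTRR can override it. The sketch below encodes only that rule, for the three MTRR types used in the question; the enum and function names are made up for illustration.

```c
/* Memory types used in this sketch (illustrative names). */
enum x86_mem_type { TYPE_UC, TYPE_WC, TYPE_WB };

/*
 * Effective memory type for a page whose PAT entry selects UC-
 * (uncached-minus), given the MTRR type of the physical range.
 * Only the three MTRR types that appear in the question are modeled.
 */
static enum x86_mem_type effective_type_pat_uc_minus(enum x86_mem_type mtrr)
{
    /* UC- behaves like UC unless the MTRR says write-combining. */
    return (mtrr == TYPE_WC) ? TYPE_WC : TYPE_UC;
}
```

If the mapping really is UC- in PAT, this would be consistent with the captures in the question: the write-combining MTRR gave one 8B packet, while the write-back MTRR behaved exactly like the uncacheable one.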
















linux caching x86 fpga pci-e






asked Nov 15 '18 at 1:21, edited Nov 15 '18 at 2:03
Taekyung Heo







1 Answer
In short, it seems that mapping an MMIO region as write-back does not work, by design.

Please post an answer if anyone finds that it is possible.

I found John McCalpin's articles and answers on this. First, mapping an MMIO region as write-back is not possible. Second, a workaround is possible on some processors.

  1. Mapping an MMIO region as write-back is not possible

    Quote from this link

    FYI: The WB type will not work with memory-mapped IO. You can
    program the bits to set up the mapping as WB, but the system will
    crash as soon as it gets a transaction that it does not know how to
    handle. It is theoretically possible to use WP or WT to get cached
    reads from MMIO, but coherence has to be handled in software.

    Quote from this link

    Only when I set both PAT and MTRR to WB does the kernel crash

  2. A workaround is possible on some processors

    Notes on Cached Access to Memory-Mapped IO Regions, John McCalpin

    There is one set of mappings that can be made to work on at least some
    x86-64 processors, and it is based on mapping the MMIO space twice.
    Map the MMIO range with a set of attributes that allow write-combining
    stores (but only uncached reads). Map the MMIO range a second time
    with a set of attributes that allow cache-line reads (but only
    uncached, non-write-combined stores).
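The "map it twice" idea can at least be illustrated in user space. The sketch below is only an analogy under stated assumptions: it maps an ordinary temporary file twice, so it shows two virtual views aliasing the same storage, but it cannot select cache attributes the way a driver would (there, the store view would presumably come from something like `ioremap_wc()`, and the read view from a separate mapping with cacheable-read attributes). `struct dual_map` and its helpers are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two views of the same backing storage (names are illustrative). */
struct dual_map {
    unsigned char *store_view;       /* stands in for the write-combining mapping */
    const unsigned char *load_view;  /* stands in for the cacheable-read mapping */
    size_t len;
};

/* Map `len` bytes of a temporary file twice. Returns 0 on success, -1 on error. */
static int dual_map_open(struct dual_map *m, size_t len)
{
    FILE *f = tmpfile();
    void *a, *b;
    int fd;

    if (!f)
        return -1;
    fd = fileno(f);
    if (ftruncate(fd, (off_t)len) != 0) {
        fclose(f);
        return -1;
    }
    a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    fclose(f);  /* the mappings stay valid after the descriptor is closed */

    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, len);
        if (b != MAP_FAILED)
            munmap(b, len);
        return -1;
    }
    m->store_view = a;
    m->load_view = b;
    m->len = len;
    return 0;
}

/* Remove both views. */
static void dual_map_close(struct dual_map *m)
{
    munmap(m->store_view, m->len);
    munmap((void *)m->load_view, m->len);
}
```

Writing 8 bytes of 0xff through `store_view` and reading them back through `load_view` shows the aliasing. On real MMIO with cached reads, software would additionally have to keep the cached view coherent itself, as the first quote points out.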








  • I wonder if anything changes with Skylake-SP where 64-byte stores are possible to UC MMIO regions with a single AVX512 instruction. But probably cache eviction is still different from a movntps [rdi], zmm0, or movaps on a UC memory region.

    – Peter Cordes
    Nov 15 '18 at 11:06











answered Nov 15 '18 at 6:07, edited Nov 15 '18 at 11:24
Taekyung Heo






