How to optimise the re-ordering of an array?
I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:
for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;
To optimize the code for a specific list of idx[i]
I tried to compile a 4 million line c function with the idx values filled in:
void reorder( unsigned short * restrict i, unsigned short * restrict o)
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.
Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:
- ordered reading, random writes : 60 ms (openmp didn't help)
- ordered writing, random reads : 20 ms (openmp -> 4 ms)
I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?
Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:
https://godbolt.org/z/bzGWad
Is this a difficult problem for compilers or am I missing something simple?
Edit 21/11/2018 Added a complete but minimal example of the problem
Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.
#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>
#define N 2048
// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";
int main()
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;
ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++)
for( size_t j=0; j<N; j++)
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j)
return sort_order[i] < sort_order[j]; );
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++)
input[i] = i;
output[i]= 0;
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s
I would like to get the reorder_omp
function to be closer to the speed of the copy_omp
function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.
Edit again: 21/11/2018 the code to write the function that does not compile
//top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output) n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "n";
out.close();
Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.
c++ arrays gcc optimization compiler-optimization
|
show 3 more comments
I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:
for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;
To optimize the code for a specific list of idx[i]
I tried to compile a 4 million line c function with the idx values filled in:
void reorder( unsigned short * restrict i, unsigned short * restrict o)
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.
Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:
- ordered reading, random writes : 60 ms (openmp didn't help)
- ordered writing, random reads : 20 ms (openmp -> 4 ms)
I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?
Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:
https://godbolt.org/z/bzGWad
Is this a difficult problem for compilers or am I missing something simple?
Edit 21/11/2018 Added a complete but minimal example of the problem
Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.
#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>
#define N 2048
// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";
int main()
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;
ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++)
for( size_t j=0; j<N; j++)
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j)
return sort_order[i] < sort_order[j]; );
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++)
input[i] = i;
output[i]= 0;
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s
I would like to get the reorder_omp
function to be closer to the speed of the copy_omp
function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.
Edit again: 21/11/2018 the code to write the function that does not compile
//top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output) n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "n";
out.close();
Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.
c++ arrays gcc optimization compiler-optimization
3
Soidx
is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
– Peter Cordes
Nov 15 '18 at 14:36
3
What is the pattern? The two triplets shown have aoutput + (0, 1, 2) = input + (1, 2, 0)
andoutput + (0, 1, 2) = input + (1, 0, 2)
pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
– Jonathan Leffler
Nov 15 '18 at 14:47
3
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you useshort *__restrict dst
to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
– Peter Cordes
Nov 15 '18 at 14:47
2
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55
|
show 3 more comments
I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:
for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;
To optimize the code for a specific list of idx[i]
I tried to compile a 4 million line c function with the idx values filled in:
void reorder( unsigned short * restrict i, unsigned short * restrict o)
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.
Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:
- ordered reading, random writes : 60 ms (openmp didn't help)
- ordered writing, random reads : 20 ms (openmp -> 4 ms)
I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?
Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:
https://godbolt.org/z/bzGWad
Is this a difficult problem for compilers or am I missing something simple?
Edit 21/11/2018 Added a complete but minimal example of the problem
Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.
#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>
#define N 2048
// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";
int main()
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;
ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++)
for( size_t j=0; j<N; j++)
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j)
return sort_order[i] < sort_order[j]; );
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++)
input[i] = i;
output[i]= 0;
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s
I would like to get the reorder_omp
function to be closer to the speed of the copy_omp
function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.
Edit again: 21/11/2018 the code to write the function that does not compile
//top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output) n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "n";
out.close();
Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.
c++ arrays gcc optimization compiler-optimization
I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:
for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;
To optimize the code for a specific list of idx[i]
I tried to compile a 4 million line c function with the idx values filled in:
void reorder( unsigned short * restrict i, unsigned short * restrict o)
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.
Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:
- ordered reading, random writes : 60 ms (openmp didn't help)
- ordered writing, random reads : 20 ms (openmp -> 4 ms)
I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?
Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:
https://godbolt.org/z/bzGWad
Is this a difficult problem for compilers or am I missing something simple?
Edit 21/11/2018 Added a complete but minimal example of the problem
Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.
#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>
#define N 2048
// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output)
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";
int main()
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;
ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++)
for( size_t j=0; j<N; j++)
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j)
return sort_order[i] < sort_order[j]; );
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++)
input[i] = i;
output[i]= 0;
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s
I would like to get the reorder_omp
function to be closer to the speed of the copy_omp
function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.
Edit again: 21/11/2018 the code to write the function that does not compile
//top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output) n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "n";
out.close();
Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.
c++ arrays gcc optimization compiler-optimization
c++ arrays gcc optimization compiler-optimization
edited Nov 21 '18 at 23:10
Jon
asked Nov 15 '18 at 14:25
JonJon
64659
64659
3
Soidx
is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
– Peter Cordes
Nov 15 '18 at 14:36
3
What is the pattern? The two triplets shown have aoutput + (0, 1, 2) = input + (1, 2, 0)
andoutput + (0, 1, 2) = input + (1, 0, 2)
pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
– Jonathan Leffler
Nov 15 '18 at 14:47
3
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you useshort *__restrict dst
to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
– Peter Cordes
Nov 15 '18 at 14:47
2
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55
|
show 3 more comments
3
Soidx
is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
– Peter Cordes
Nov 15 '18 at 14:36
3
What is the pattern? The two triplets shown have aoutput + (0, 1, 2) = input + (1, 2, 0)
andoutput + (0, 1, 2) = input + (1, 0, 2)
pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
– Jonathan Leffler
Nov 15 '18 at 14:47
3
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you useshort *__restrict dst
to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
– Peter Cordes
Nov 15 '18 at 14:47
2
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55
3
3
So
idx
is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?– Peter Cordes
Nov 15 '18 at 14:36
So
idx
is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?– Peter Cordes
Nov 15 '18 at 14:36
3
3
What is the pattern? The two triplets shown have a
output + (0, 1, 2) = input + (1, 2, 0)
and output + (0, 1, 2) = input + (1, 0, 2)
pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.– Jonathan Leffler
Nov 15 '18 at 14:47
What is the pattern? The two triplets shown have a
output + (0, 1, 2) = input + (1, 2, 0)
and output + (0, 1, 2) = input + (1, 0, 2)
pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.– Jonathan Leffler
Nov 15 '18 at 14:47
3
3
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use
short *__restrict dst
to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.– Peter Cordes
Nov 15 '18 at 14:47
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use
short *__restrict dst
to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.– Peter Cordes
Nov 15 '18 at 14:47
2
2
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55
idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55
|
show 3 more comments
1 Answer
1
active
oldest
votes
(comment section is too short)
I would test the following idea:
Maintain reverse index table, so that naive algorithm becomes:
for (i = 0; i<n; i++)
dest[index[i]] = src[i];
Instead of using naive algorithm:
2.1 Create temporary array of pairs (value, destindex)
struct pair
int value;
int destindex;
;
for (i = 0; i < n; i++)
pairs[i] = .value=src[i], .destindex=index[i];2.2 Use merge or quick sort to sort array of pairs by
.destindex
field2.3 Copy values from array of pairs into
dest
array
There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53321584%2fhow-to-optimise-the-re-ordering-of-an-array%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
(comment section is too short)
I would test the following idea:
Maintain reverse index table, so that naive algorithm becomes:
for (i = 0; i<n; i++)
dest[index[i]] = src[i];
Instead of using naive algorithm:
2.1 Create temporary array of pairs (value, destindex)
struct pair
int value;
int destindex;
;
for (i = 0; i < n; i++)
pairs[i] = .value=src[i], .destindex=index[i];2.2 Use merge or quick sort to sort array of pairs by
.destindex
field2.3 Copy values from array of pairs into
dest
array
There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
add a comment |
(comment section is too short)
I would test the following idea:
Maintain reverse index table, so that naive algorithm becomes:
for (i = 0; i<n; i++)
dest[index[i]] = src[i];
Instead of using naive algorithm:
2.1 Create temporary array of pairs (value, destindex)
struct pair
int value;
int destindex;
;
for (i = 0; i < n; i++)
pairs[i] = .value=src[i], .destindex=index[i];2.2 Use merge or quick sort to sort array of pairs by
.destindex
field2.3 Copy values from array of pairs into
dest
array
There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
add a comment |
(comment section is too short)
I would test the following idea:
Maintain reverse index table, so that naive algorithm becomes:
for (i = 0; i<n; i++)
dest[index[i]] = src[i];
Instead of using naive algorithm:
2.1 Create temporary array of pairs (value, destindex)
struct pair
int value;
int destindex;
;
for (i = 0; i < n; i++)
pairs[i] = .value=src[i], .destindex=index[i];2.2 Use merge or quick sort to sort array of pairs by
.destindex
field2.3 Copy values from array of pairs into
dest
array
There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.
(comment section is too short)
I would test the following idea:
Maintain reverse index table, so that naive algorithm becomes:
for (i = 0; i<n; i++)
dest[index[i]] = src[i];
Instead of using naive algorithm:
2.1 Create temporary array of pairs (value, destindex)
struct pair
int value;
int destindex;
;
for (i = 0; i < n; i++)
pairs[i] = .value=src[i], .destindex=index[i];2.2 Use merge or quick sort to sort array of pairs by
.destindex
field2.3 Copy values from array of pairs into
dest
array
There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.
answered Nov 21 '18 at 17:19
gudokgudok
2,79121324
2,79121324
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
add a comment |
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53321584%2fhow-to-optimise-the-re-ordering-of-an-array%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
3
So
idx
is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?– Peter Cordes
Nov 15 '18 at 14:36
3
What is the pattern? The two triplets shown have a
output + (0, 1, 2) = input + (1, 2, 0)
andoutput + (0, 1, 2) = input + (1, 0, 2)
pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.– Jonathan Leffler
Nov 15 '18 at 14:47
3
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use
short *__restrict dst
to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.– Peter Cordes
Nov 15 '18 at 14:47
2
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55