Vectorize a 6 for loop cumulative sum in python










4















The mathematical problem is:



enter image description here



The expression within the sums is actually much more complex than the one above, but this is for a minimal working example to not over-complicate things. I have written this in Python using 6 nested for loops and as expected it performs very badly (the true form performs badly and needs evaluating millions of times), even with help from Numba, Cython and friends. Here it is written using nested for loops and a cumulative sum:



import numpy as np

def func1(a,b,c,d):
'''
Minimal working example of multiple summation
'''
B = 0
for ai in range(0,a):
for bi in range(0,b):
for ci in range(0,c):
for di in range(0,d):
for ei in range(0,ai+bi):
for fi in range(0,ci+di):
B += (2)**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*np.math.factorial(ei)


return a, b, c, d, B


The expression is controlled with 4 numbers as input and for func1(4,6,3,4) the output for B is 21769947.844726562.



I have looked around for assistance with this and have found several Stack posts, with some examples being:



Exterior product in NumPy : Vectorizing six nested loops



Vectorizing triple for loop in Python/Numpy with different array shapes



Python vectorizing nested for loops



I have tried to use what I have learned from these useful posts, but after many attempts, I keep arriving at the wrong answer. Even vectorizing one of the inner sums will reap a huge performance gain for the real problem, but the fact that the sum ranges are different seems to be throwing me off. Does anyone have any tips on how to progress with this?










share|improve this question






















  • You may be able to shave off some time by remembering factorial calls with a lookup table. Unless np already implements that

    – Mitchel Paulin
    Nov 14 '18 at 17:16











  • This does not answer your question but if speed is your concern, can't you just write this part of the code in C++? I'd expect at least a 10x speedup, but even a 100x wouldn't surprise me.

    – szmate1618
    Nov 14 '18 at 17:17











  • That's seems ok . What is the expected answer ?

    – B. M.
    Nov 14 '18 at 18:25











  • Have you tried PyPy?

    – Gabriel
    Nov 14 '18 at 18:29






  • 1





    Caching np.math.factorial gives a 30% speedup in my machine. cache_fat_ei = [ np.math.factorial(ei) for ei in range(0,a+b) ].

    – Gabriel
    Nov 14 '18 at 19:15















4















The mathematical problem is:



enter image description here



The expression within the sums is actually much more complex than the one above, but this is for a minimal working example to not over-complicate things. I have written this in Python using 6 nested for loops and as expected it performs very badly (the true form performs badly and needs evaluating millions of times), even with help from Numba, Cython and friends. Here it is written using nested for loops and a cumulative sum:



import numpy as np

def func1(a,b,c,d):
'''
Minimal working example of multiple summation
'''
B = 0
for ai in range(0,a):
for bi in range(0,b):
for ci in range(0,c):
for di in range(0,d):
for ei in range(0,ai+bi):
for fi in range(0,ci+di):
B += (2)**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*np.math.factorial(ei)


return a, b, c, d, B


The expression is controlled with 4 numbers as input and for func1(4,6,3,4) the output for B is 21769947.844726562.



I have looked around for assistance with this and have found several Stack posts, with some examples being:



Exterior product in NumPy : Vectorizing six nested loops



Vectorizing triple for loop in Python/Numpy with different array shapes



Python vectorizing nested for loops



I have tried to use what I have learned from these useful posts, but after many attempts, I keep arriving at the wrong answer. Even vectorizing one of the inner sums will reap a huge performance gain for the real problem, but the fact that the sum ranges are different seems to be throwing me off. Does anyone have any tips on how to progress with this?










share|improve this question






















  • You may be able to shave off some time by remembering factorial calls with a lookup table. Unless np already implements that

    – Mitchel Paulin
    Nov 14 '18 at 17:16











  • This does not answer your question but if speed is your concern, can't you just write this part of the code in C++? I'd expect at least a 10x speedup, but even a 100x wouldn't surprise me.

    – szmate1618
    Nov 14 '18 at 17:17











  • That's seems ok . What is the expected answer ?

    – B. M.
    Nov 14 '18 at 18:25











  • Have you tried PyPy?

    – Gabriel
    Nov 14 '18 at 18:29






  • 1





    Caching np.math.factorial gives a 30% speedup in my machine. cache_fat_ei = [ np.math.factorial(ei) for ei in range(0,a+b) ].

    – Gabriel
    Nov 14 '18 at 19:15













4












4








4


1






The mathematical problem is:



enter image description here



The expression within the sums is actually much more complex than the one above, but this is for a minimal working example to not over-complicate things. I have written this in Python using 6 nested for loops and as expected it performs very badly (the true form performs badly and needs evaluating millions of times), even with help from Numba, Cython and friends. Here it is written using nested for loops and a cumulative sum:



import numpy as np

def func1(a,b,c,d):
'''
Minimal working example of multiple summation
'''
B = 0
for ai in range(0,a):
for bi in range(0,b):
for ci in range(0,c):
for di in range(0,d):
for ei in range(0,ai+bi):
for fi in range(0,ci+di):
B += (2)**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*np.math.factorial(ei)


return a, b, c, d, B


The expression is controlled with 4 numbers as input and for func1(4,6,3,4) the output for B is 21769947.844726562.



I have looked around for assistance with this and have found several Stack posts, with some examples being:



Exterior product in NumPy : Vectorizing six nested loops



Vectorizing triple for loop in Python/Numpy with different array shapes



Python vectorizing nested for loops



I have tried to use what I have learned from these useful posts, but after many attempts, I keep arriving at the wrong answer. Even vectorizing one of the inner sums will reap a huge performance gain for the real problem, but the fact that the sum ranges are different seems to be throwing me off. Does anyone have any tips on how to progress with this?










share|improve this question














The mathematical problem is:



enter image description here



The expression within the sums is actually much more complex than the one above, but this is for a minimal working example to not over-complicate things. I have written this in Python using 6 nested for loops and as expected it performs very badly (the true form performs badly and needs evaluating millions of times), even with help from Numba, Cython and friends. Here it is written using nested for loops and a cumulative sum:



import numpy as np

def func1(a,b,c,d):
'''
Minimal working example of multiple summation
'''
B = 0
for ai in range(0,a):
for bi in range(0,b):
for ci in range(0,c):
for di in range(0,d):
for ei in range(0,ai+bi):
for fi in range(0,ci+di):
B += (2)**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*np.math.factorial(ei)


return a, b, c, d, B


The expression is controlled with 4 numbers as input and for func1(4,6,3,4) the output for B is 21769947.844726562.



I have looked around for assistance with this and have found several Stack posts, with some examples being:



Exterior product in NumPy : Vectorizing six nested loops



Vectorizing triple for loop in Python/Numpy with different array shapes



Python vectorizing nested for loops



I have tried to use what I have learned from these useful posts, but after many attempts, I keep arriving at the wrong answer. Even vectorizing one of the inner sums will reap a huge performance gain for the real problem, but the fact that the sum ranges are different seems to be throwing me off. Does anyone have any tips on how to progress with this?







python numpy for-loop vectorization






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 14 '18 at 17:12









YetiYeti

145113




145113












  • You may be able to shave off some time by remembering factorial calls with a lookup table. Unless np already implements that

    – Mitchel Paulin
    Nov 14 '18 at 17:16











  • This does not answer your question but if speed is your concern, can't you just write this part of the code in C++? I'd expect at least a 10x speedup, but even a 100x wouldn't surprise me.

    – szmate1618
    Nov 14 '18 at 17:17











  • That's seems ok . What is the expected answer ?

    – B. M.
    Nov 14 '18 at 18:25











  • Have you tried PyPy?

    – Gabriel
    Nov 14 '18 at 18:29






  • 1





    Caching np.math.factorial gives a 30% speedup in my machine. cache_fat_ei = [ np.math.factorial(ei) for ei in range(0,a+b) ].

    – Gabriel
    Nov 14 '18 at 19:15

















  • You may be able to shave off some time by remembering factorial calls with a lookup table. Unless np already implements that

    – Mitchel Paulin
    Nov 14 '18 at 17:16











  • This does not answer your question but if speed is your concern, can't you just write this part of the code in C++? I'd expect at least a 10x speedup, but even a 100x wouldn't surprise me.

    – szmate1618
    Nov 14 '18 at 17:17











  • That's seems ok . What is the expected answer ?

    – B. M.
    Nov 14 '18 at 18:25











  • Have you tried PyPy?

    – Gabriel
    Nov 14 '18 at 18:29






  • 1





    Caching np.math.factorial gives a 30% speedup in my machine. cache_fat_ei = [ np.math.factorial(ei) for ei in range(0,a+b) ].

    – Gabriel
    Nov 14 '18 at 19:15
















You may be able to shave off some time by remembering factorial calls with a lookup table. Unless np already implements that

– Mitchel Paulin
Nov 14 '18 at 17:16





You may be able to shave off some time by remembering factorial calls with a lookup table. Unless np already implements that

– Mitchel Paulin
Nov 14 '18 at 17:16













This does not answer your question but if speed is your concern, can't you just write this part of the code in C++? I'd expect at least a 10x speedup, but even a 100x wouldn't surprise me.

– szmate1618
Nov 14 '18 at 17:17





This does not answer your question but if speed is your concern, can't you just write this part of the code in C++? I'd expect at least a 10x speedup, but even a 100x wouldn't surprise me.

– szmate1618
Nov 14 '18 at 17:17













That's seems ok . What is the expected answer ?

– B. M.
Nov 14 '18 at 18:25





That's seems ok . What is the expected answer ?

– B. M.
Nov 14 '18 at 18:25













Have you tried PyPy?

– Gabriel
Nov 14 '18 at 18:29





Have you tried PyPy?

– Gabriel
Nov 14 '18 at 18:29




1




1





Caching np.math.factorial gives a 30% speedup in my machine. cache_fat_ei = [ np.math.factorial(ei) for ei in range(0,a+b) ].

– Gabriel
Nov 14 '18 at 19:15





Caching np.math.factorial gives a 30% speedup in my machine. cache_fat_ei = [ np.math.factorial(ei) for ei in range(0,a+b) ].

– Gabriel
Nov 14 '18 at 19:15












4 Answers
4






active

oldest

votes


















3














EDIT 3:



Final (I think) version, a bit cleaner and faster incorporating ideas from max9111's answer.



import numpy as np
from numba import as nb

@nb.njit()
def func1_jit(a, b, c, d):
# Precompute
exp_min = 5 - (a + b + c + d)
exp_max = b
exp = 2. ** np.arange(exp_min, exp_max + 1)
fact_e = np.empty((a + b - 2))
fact_e[0] = 1
for ei in range(1, len(fact_e)):
fact_e[ei] = ei * fact_e[ei - 1]
# Loops
B = 0
for ai in range(0, a):
for bi in range(0, b):
for ci in range(0, c):
for di in range(0, d):
for ei in range(0, ai + bi):
for fi in range(0, ci + di):
B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
return B


This is already quite faster than any of the previous options, but we are still not taking advantage of multiple CPUs. One way to do it is within the function itself, for example parallelizing the outer loop. This adds some overhead on each call to create the threads, so for small inputs is actually a bit slower, but should be significantly faster for bigger values:



import numpy as np
from numba import as nb

@nb.njit(parallel=True)
def func1_par(a, b, c, d):
# Precompute
exp_min = 5 - (a + b + c + d)
exp_max = b
exp = 2. ** np.arange(exp_min, exp_max + 1)
fact_e = np.empty((a + b - 2))
fact_e[0] = 1
for ei in range(1, len(fact_e)):
fact_e[ei] = ei * fact_e[ei - 1]
# Loops
B = np.empty((a,))
for ai in nb.prange(0, a):
Bi = 0
for bi in range(0, b):
for ci in range(0, c):
for di in range(0, d):
for ei in range(0, ai + bi):
for fi in range(0, ci + di):
Bi += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
B[ai] = Bi
return np.sum(B)


Or, if you have many points where you want to evaluate the function, you can parallelize at that level too. Here a_arr, b_arr, c_arr and d_arr are vectors of values where the function is to be evaluated:



from numba import as nb

@nb.njit(parallel=True)
def func1_arr(a_arr, b_arr, c_arr, d_arr):
B_arr = np.empty((len(a_arr),))
for i in nb.prange(len(B_arr)):
B_arr[i] = func1_jit(a_arr[i], b_arr[i], c_arr[i], d_arr[i])
return B_arr


The best configuration depends on your inputs, usage pattern, hardware, etc. so you can combine the different ideas to suit your case.




EDIT 2:



Actually, forget what I said before. The best thing is to JIT-compile the algorithm, but in a more effective manner. Compute first the expensive parts (I took the exponential and the factorial) and then pass it to the compiled loopy function:



import numpy as np
from numba import njit

def func1(a, b, c, d):
exp_min = 5 - (a + b + c + d)
exp_max = b
exp = 2. ** np.arange(exp_min, exp_max + 1)
ee = np.arange(a + b - 2)
fact_e = scipy.special.factorial(ee)
return func1_inner(a, b, c, d, exp_min, exp, fact_e)

@njit()
def func1_inner(a, b, c, d, exp_min, exp, fact_e):
B = 0
for ai in range(0, a):
for bi in range(0, b):
for ci in range(0, c):
for di in range(0, d):
for ei in range(0, ai + bi):
for fi in range(0, ci + di):
B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
return B


This is, in my experiments, the fastest option by far, and takes little extra memory (only the precomputed values, with size linear on the input).



a, b, c, d = 4, 6, 3, 4
# The original function
%timeit func1_orig(a, b, c, d)
# 2.07 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# The grid-evaluated function
%timeit func1_grid(a, b, c, d)
# 256 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# The precompuation + JIT-compiled function
%timeit func1_jit(a, b, c, d)
# 19.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)



Well there is always the option of grid-evaluating the whole thing:



import numpy as np
import scipy.special

def func1(a, b, c, d):
ai, bi, ci, di, ei, fi = np.ogrid[:a, :b, :c, :d, :a + b - 2, :c + d - 2]
# Compute
B = (2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei)
# Mask out of range elements for last two inner loops
m = (ei < ai + bi) & (fi < ci + di)
return np.sum(B * m)

print(func1(4, 6, 3, 4))
# 21769947.844726562


I used scipy.special.factorial because apparently np.factorial does not work with arrays for some reason.



Obivously, the memory cost of this will grow very fast as you increment the parameters. The code actually performs more computations than necessary, because the two inner loops have varying number of iterations, so (in this method) you have to use the largest and then remove what you don't need. The hope is that vectorization will make up for that. A small IPython benchmark:



a, b, c, d = 4, 6, 3, 4
# func1_orig is the original loop-based version
%timeit func1_orig(a, b, c, d)
# 2.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# func1 here is the vectorized version
%timeit func1(a, b, c, d)
# 210 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



EDIT:



Note the previous approach is not an all-or-nothing thing either. You can choose to grid-evaluate only some of the loops. For example, the two innermost loops could be vectorized like this:



def func1(a, b, c, d):
B = 0
e = np.arange(a + b - 2).reshape((-1, 1))
f = np.arange(c + d - 2)
for ai in range(0, a):
for bi in range(0, b):
ei = e[:ai + bi]
for ci in range(0, c):
for di in range(0, d):
fi = f[:ci + di]
B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
return B


This still has loops, but it does avoid extra computations, and the memory requirements are much lower. Which one is best depends on the sizes of the inputs I guess. In my tests, with the original values (4, 6, 3, 4) this is even slower than the original function; also, for that case it seems that creating new arrays for ei and fi on each loop is faster than operating on a slice of a pre-created one. However, if you multiply the input by 4 (14, 24, 12, 16) then this is much faster than the original (about x5), although still slower than the fully vectorized one (about x3). On the other hand, I could compute the value for the input scaled by ten (40, 60, 30, 40) with this one (in ~5min) but not with the previous one because of the memory (I didn't test how long it'd take with the original function). Using @numba.jit helps a bit, although not enormously (cannot use nopython due to the factorial function). You can experiment with vectorizing more or less loops depending on the size of your inputs.






share|improve this answer

























  • Thank you so much for this great answer. I had not considered using a grid based approach but it lends itself to this problem well. I will experiment with both options and see how they perform with the other answers provided. I need to run the function for ~ 200,000 combinations of (a,b,c,d) up to ~ 50. I estimate the original function will take months/years to complete it. Another issue I need to address is precision loss with large integers. Perhaps gmpy2 may be of use?

    – Yeti
    Nov 15 '18 at 11:30











  • @Yeti I have just added another different option. Precomputing the expensive parts to avoid repeated calculation and JIT-compile the rest seems to be the best option here.

    – jdehesa
    Nov 15 '18 at 11:38











  • @Yeti About precision, I don't know enough about the topic to give advice (I mean, I know that adding many numbers can cause significant precision loss, but I don't know much about methods to work around that). I don't know if storing all the values in an array and summing them at the end with np.sum (which takes memory) makes a difference or not. In the experiments I made, both methods (accumulation and final sum) seemed to produce very close results, if not exactly the same (I think I was always using 64-bit floats, though).

    – jdehesa
    Nov 15 '18 at 11:49











  • I am experimenting with the JIT compilation option; and I must be stupid but I can't see where exp_min comes from, but exp_max makes sense.

    – Yeti
    Nov 15 '18 at 12:58











  • @jdehesa I implemented the factorization in Numba, which gives quite a significant improvement. You can add this to your answer and I will delete mine afterwards (It is only meant as a comment)

    – max9111
    Nov 15 '18 at 13:12


















2














This is only a comment on @jdehesa answer.



If a function isn't supported by Numba itself it is usually recommendable to implement it yourself. In case of the factorization this isn't a complicated task.



Code



import numpy as np
import numba as nb

@nb.njit()
def factorial(a):
res=1.
for i in range(1,a+1):
res*=i
return res

@nb.njit()
def func1(a, b, c, d):
B = 0.

exp_min = 5 - (a + b + c + d)
exp_max = b
exp = 2. ** np.arange(exp_min, exp_max + 1)

fact_e=np.empty(a + b - 2)
for i in range(a + b - 2):
fact_e[i]=factorial(i)

for ai in range(0, a):
for bi in range(0, b):
for ci in range(0, c):
for di in range(0, d):
for ei in range(0, ai + bi):
for fi in range(0, ci + di):
B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
return B


Parallel version



@nb.njit(parallel=True)
def func_p(a_vec,b_vec,c_vec,d_vec):
res=np.empty(a_vec.shape[0])
for i in nb.prange(a_vec.shape[0]):
res[i]=func1(a_vec[i], b_vec[i], c_vec[i], d_vec[i])
return res


Example



a_vec=np.random.randint(low=2,high=10,size=1000000)
b_vec=np.random.randint(low=2,high=10,size=1000000)
c_vec=np.random.randint(low=2,high=10,size=1000000)
d_vec=np.random.randint(low=2,high=10,size=1000000)

res_2=func_p(a_vec,b_vec,c_vec,d_vec)


The single threaded version leads to 5.6µs (after the first run) on your example.



The parallel version will almost lead to another Number_of_Cores speedup for calculating many values. Keep in mind that the compilation overhead is larger for the parallel version (above 0.5s for the first call).






share|improve this answer

























  • Thanks for this, I'm looking into it, but I'm getting a Numba error which (among other things) says This should not have happened, a problem has occurred in Numba's internals. What version of Numba are you using? I have 0.39.0.

    – jdehesa
    Nov 15 '18 at 13:29











  • @ jdehesa Also 0.39.0. What is above the This should not have happened...? Something like Untyped global name... or cannot determine Numba type of...

    – max9111
    Nov 15 '18 at 13:38












  • The full error message was this (func1_2 is your func1 here). Not sure what that is about, since I later updated my function with basically the same and it works (see my update). I also noted that each element in fact_e can just be computed from the previous one instead of calling factorial (btw no need to delete the answer since it's a valuable contribution).

    – jdehesa
    Nov 15 '18 at 14:01


















1














Using this cartesian_product function
you can transform your nested loop into matrices and then you can simply calculate your respective nested sigmas in a vectorized manner:



In [37]: def nested_sig(args):
...: base_prod = cartesian_product(*arrays)
...: second_prod = cartesian_product(base_prod[:,:2].sum(1), base_prod[:,2:].sum(1))
...: total = np.column_stack((base_prod, second_prod))
...: # the items in each row denotes the following variables in order:
...: # ai, bi, ci, di, ei, fi
...: x = total[:, 4] - total[:, 5] - total[:, 0] - total[:, 2] - total[:, 3] + 1
...: y = total[:, 4] - total[:, 5]
...: result = np.power(2, x) * (np.power(total[:, 4], 2) - 2*y - 7*total[:, 3]) * np.math.factorial(total[:,4])
...: return result





share|improve this answer






























    1














    I see three sources of improvement in your code :



    • range(0,a) is enter image description here


    • you make a lot of work in the inner loop


    • you sum term in a random way, with risk of loss of precision for bigger entries.


    Here a version (probably not yet good) which try to improve this points.



    @numba.njit
    def func1o(a,b,c,d):
    "2**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*ei!"
    POW=2.; SUM=0.;
    L=
    for ai in arange(0.,a+1):
    for bi in range(0,b+1):
    for ci in range(0,c+1):
    for di in range(0,d+1):
    FACT=1.
    for ei in arange(0,ai+bi+1):
    for fi in range(0,ci+di+1):
    L.append(POW*SUM*FACT)
    POW /= 2
    SUM -= 2*ei
    POW *= 2
    SUM += 2*(ei-fi)+1
    FACT *= ei+1
    POW /=2
    SUM -= 7*di
    POW /= 2
    POW /= 2
    A=np.array(L)
    I=np.abs(A).argsort()
    return A[I].sum()





    share|improve this answer
























      Your Answer






      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "1"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53305490%2fvectorize-a-6-for-loop-cumulative-sum-in-python%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      4 Answers
      4






      active

      oldest

      votes








      4 Answers
      4






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      3














      EDIT 3:



      Final (I think) version, a bit cleaner and faster incorporating ideas from max9111's answer.



      import numpy as np
      from numba import as nb

      @nb.njit()
      def func1_jit(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is already quite faster than any of the previous options, but we are still not taking advantage of multiple CPUs. One way to do it is within the function itself, for example parallelizing the outer loop. This adds some overhead on each call to create the threads, so for small inputs is actually a bit slower, but should be significantly faster for bigger values:



      import numpy as np
      from numba import as nb

      @nb.njit(parallel=True)
      def func1_par(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = np.empty((a,))
      for ai in nb.prange(0, a):
      Bi = 0
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      Bi += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      B[ai] = Bi
      return np.sum(B)


      Or, if you have many points where you want to evaluate the function, you can parallelize at that level too. Here a_arr, b_arr, c_arr and d_arr are vectors of values where the function is to be evaluated:



      from numba import as nb

      @nb.njit(parallel=True)
      def func1_arr(a_arr, b_arr, c_arr, d_arr):
      B_arr = np.empty((len(a_arr),))
      for i in nb.prange(len(B_arr)):
      B_arr[i] = func1_jit(a_arr[i], b_arr[i], c_arr[i], d_arr[i])
      return B_arr


      The best configuration depends on your inputs, usage pattern, hardware, etc. so you can combine the different ideas to suit your case.




      EDIT 2:



      Actually, forget what I said before. The best thing is to JIT-compile the algorithm, but in a more effective manner. Compute first the expensive parts (I took the exponential and the factorial) and then pass it to the compiled loopy function:



      import numpy as np
      from numba import njit

      def func1(a, b, c, d):
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      ee = np.arange(a + b - 2)
      fact_e = scipy.special.factorial(ee)
      return func1_inner(a, b, c, d, exp_min, exp, fact_e)

      @njit()
      def func1_inner(a, b, c, d, exp_min, exp, fact_e):
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is, in my experiments, the fastest option by far, and takes little extra memory (only the precomputed values, with size linear on the input).



      a, b, c, d = 4, 6, 3, 4
      # The original function
      %timeit func1_orig(a, b, c, d)
      # 2.07 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # The grid-evaluated function
      %timeit func1_grid(a, b, c, d)
      # 256 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
      # The precompuation + JIT-compiled function
      %timeit func1_jit(a, b, c, d)
      # 19.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)



      Well there is always the option of grid-evaluating the whole thing:



      import numpy as np
      import scipy.special

      def func1(a, b, c, d):
      ai, bi, ci, di, ei, fi = np.ogrid[:a, :b, :c, :d, :a + b - 2, :c + d - 2]
      # Compute
      B = (2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei)
      # Mask out of range elements for last two inner loops
      m = (ei < ai + bi) & (fi < ci + di)
      return np.sum(B * m)

      print(func1(4, 6, 3, 4))
      # 21769947.844726562


      I used scipy.special.factorial because apparently np.factorial does not work with arrays for some reason.



      Obivously, the memory cost of this will grow very fast as you increment the parameters. The code actually performs more computations than necessary, because the two inner loops have varying number of iterations, so (in this method) you have to use the largest and then remove what you don't need. The hope is that vectorization will make up for that. A small IPython benchmark:



      a, b, c, d = 4, 6, 3, 4
      # func1_orig is the original loop-based version
      %timeit func1_orig(a, b, c, d)
      # 2.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # func1 here is the vectorized version
      %timeit func1(a, b, c, d)
      # 210 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



      EDIT:



      Note the previous approach is not an all-or-nothing thing either. You can choose to grid-evaluate only some of the loops. For example, the two innermost loops could be vectorized like this:



      def func1(a, b, c, d):
      B = 0
      e = np.arange(a + b - 2).reshape((-1, 1))
      f = np.arange(c + d - 2)
      for ai in range(0, a):
      for bi in range(0, b):
      ei = e[:ai + bi]
      for ci in range(0, c):
      for di in range(0, d):
      fi = f[:ci + di]
      B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
      return B


      This still has loops, but it does avoid extra computations, and the memory requirements are much lower. Which one is best depends on the sizes of the inputs I guess. In my tests, with the original values (4, 6, 3, 4) this is even slower than the original function; also, for that case it seems that creating new arrays for ei and fi on each loop is faster than operating on a slice of a pre-created one. However, if you multiply the input by 4 (14, 24, 12, 16) then this is much faster than the original (about x5), although still slower than the fully vectorized one (about x3). On the other hand, I could compute the value for the input scaled by ten (40, 60, 30, 40) with this one (in ~5min) but not with the previous one because of the memory (I didn't test how long it'd take with the original function). Using @numba.jit helps a bit, although not enormously (cannot use nopython due to the factorial function). You can experiment with vectorizing more or less loops depending on the size of your inputs.






      share|improve this answer

























      • Thank you so much for this great answer. I had not considered using a grid based approach but it lends itself to this problem well. I will experiment with both options and see how they perform with the other answers provided. I need to run the function for ~ 200,000 combinations of (a,b,c,d) up to ~ 50. I estimate the original function will take months/years to complete it. Another issue I need to address is precision loss with large integers. Perhaps gmpy2 may be of use?

        – Yeti
        Nov 15 '18 at 11:30











      • @Yeti I have just added another different option. Precomputing the expensive parts to avoid repeated calculation and JIT-compile the rest seems to be the best option here.

        – jdehesa
        Nov 15 '18 at 11:38











      • @Yeti About precision, I don't know enough about the topic to give advice (I mean, I know that adding many numbers can cause significant precision loss, but I don't know much about methods to work around that). I don't know if storing all the values in an array and summing them at the end with np.sum (which takes memory) makes a difference or not. In the experiments I made, both methods (accumulation and final sum) seemed to produce very close results, if not exactly the same (I think I was always using 64-bit floats, though).

        – jdehesa
        Nov 15 '18 at 11:49











      • I am experimenting with the JIT compilation option; and I must be stupid but I can't see where exp_min comes from, but exp_max makes sense.

        – Yeti
        Nov 15 '18 at 12:58











      • @jdehesa I implemented the factorization in Numba, which gives quite a significant improvement. You can add this to your answer and I will delete mine afterwards (It is only meant as a comment)

        – max9111
        Nov 15 '18 at 13:12















      3














      EDIT 3:



      Final (I think) version, a bit cleaner and faster incorporating ideas from max9111's answer.



      import numpy as np
      from numba import as nb

      @nb.njit()
      def func1_jit(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is already quite faster than any of the previous options, but we are still not taking advantage of multiple CPUs. One way to do it is within the function itself, for example parallelizing the outer loop. This adds some overhead on each call to create the threads, so for small inputs is actually a bit slower, but should be significantly faster for bigger values:



      import numpy as np
      from numba import as nb

      @nb.njit(parallel=True)
      def func1_par(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = np.empty((a,))
      for ai in nb.prange(0, a):
      Bi = 0
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      Bi += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      B[ai] = Bi
      return np.sum(B)


      Or, if you have many points where you want to evaluate the function, you can parallelize at that level too. Here a_arr, b_arr, c_arr and d_arr are vectors of values where the function is to be evaluated:



      from numba import as nb

      @nb.njit(parallel=True)
      def func1_arr(a_arr, b_arr, c_arr, d_arr):
      B_arr = np.empty((len(a_arr),))
      for i in nb.prange(len(B_arr)):
      B_arr[i] = func1_jit(a_arr[i], b_arr[i], c_arr[i], d_arr[i])
      return B_arr


      The best configuration depends on your inputs, usage pattern, hardware, etc. so you can combine the different ideas to suit your case.




      EDIT 2:



      Actually, forget what I said before. The best thing is to JIT-compile the algorithm, but in a more effective manner. Compute first the expensive parts (I took the exponential and the factorial) and then pass it to the compiled loopy function:



      import numpy as np
      from numba import njit

      def func1(a, b, c, d):
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      ee = np.arange(a + b - 2)
      fact_e = scipy.special.factorial(ee)
      return func1_inner(a, b, c, d, exp_min, exp, fact_e)

      @njit()
      def func1_inner(a, b, c, d, exp_min, exp, fact_e):
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is, in my experiments, the fastest option by far, and takes little extra memory (only the precomputed values, with size linear on the input).



      a, b, c, d = 4, 6, 3, 4
      # The original function
      %timeit func1_orig(a, b, c, d)
      # 2.07 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # The grid-evaluated function
      %timeit func1_grid(a, b, c, d)
      # 256 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
      # The precompuation + JIT-compiled function
      %timeit func1_jit(a, b, c, d)
      # 19.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)



      Well there is always the option of grid-evaluating the whole thing:



      import numpy as np
      import scipy.special

      def func1(a, b, c, d):
      ai, bi, ci, di, ei, fi = np.ogrid[:a, :b, :c, :d, :a + b - 2, :c + d - 2]
      # Compute
      B = (2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei)
      # Mask out of range elements for last two inner loops
      m = (ei < ai + bi) & (fi < ci + di)
      return np.sum(B * m)

      print(func1(4, 6, 3, 4))
      # 21769947.844726562


      I used scipy.special.factorial because apparently np.factorial does not work with arrays for some reason.



      Obivously, the memory cost of this will grow very fast as you increment the parameters. The code actually performs more computations than necessary, because the two inner loops have varying number of iterations, so (in this method) you have to use the largest and then remove what you don't need. The hope is that vectorization will make up for that. A small IPython benchmark:



      a, b, c, d = 4, 6, 3, 4
      # func1_orig is the original loop-based version
      %timeit func1_orig(a, b, c, d)
      # 2.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # func1 here is the vectorized version
      %timeit func1(a, b, c, d)
      # 210 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



      EDIT:



      Note the previous approach is not an all-or-nothing thing either. You can choose to grid-evaluate only some of the loops. For example, the two innermost loops could be vectorized like this:



      def func1(a, b, c, d):
      B = 0
      e = np.arange(a + b - 2).reshape((-1, 1))
      f = np.arange(c + d - 2)
      for ai in range(0, a):
      for bi in range(0, b):
      ei = e[:ai + bi]
      for ci in range(0, c):
      for di in range(0, d):
      fi = f[:ci + di]
      B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
      return B


      This still has loops, but it does avoid extra computations, and the memory requirements are much lower. Which one is best depends on the sizes of the inputs I guess. In my tests, with the original values (4, 6, 3, 4) this is even slower than the original function; also, for that case it seems that creating new arrays for ei and fi on each loop is faster than operating on a slice of a pre-created one. However, if you multiply the input by 4 (14, 24, 12, 16) then this is much faster than the original (about x5), although still slower than the fully vectorized one (about x3). On the other hand, I could compute the value for the input scaled by ten (40, 60, 30, 40) with this one (in ~5min) but not with the previous one because of the memory (I didn't test how long it'd take with the original function). Using @numba.jit helps a bit, although not enormously (cannot use nopython due to the factorial function). You can experiment with vectorizing more or less loops depending on the size of your inputs.






      share|improve this answer

























      • Thank you so much for this great answer. I had not considered using a grid based approach but it lends itself to this problem well. I will experiment with both options and see how they perform with the other answers provided. I need to run the function for ~ 200,000 combinations of (a,b,c,d) up to ~ 50. I estimate the original function will take months/years to complete it. Another issue I need to address is precision loss with large integers. Perhaps gmpy2 may be of use?

        – Yeti
        Nov 15 '18 at 11:30











      • @Yeti I have just added another different option. Precomputing the expensive parts to avoid repeated calculation and JIT-compile the rest seems to be the best option here.

        – jdehesa
        Nov 15 '18 at 11:38











      • @Yeti About precision, I don't know enough about the topic to give advice (I mean, I know that adding many numbers can cause significant precision loss, but I don't know much about methods to work around that). I don't know if storing all the values in an array and summing them at the end with np.sum (which takes memory) makes a difference or not. In the experiments I made, both methods (accumulation and final sum) seemed to produce very close results, if not exactly the same (I think I was always using 64-bit floats, though).

        – jdehesa
        Nov 15 '18 at 11:49











      • I am experimenting with the JIT compilation option; and I must be stupid but I can't see where exp_min comes from, but exp_max makes sense.

        – Yeti
        Nov 15 '18 at 12:58











      • @jdehesa I implemented the factorization in Numba, which gives quite a significant improvement. You can add this to your answer and I will delete mine afterwards (It is only meant as a comment)

        – max9111
        Nov 15 '18 at 13:12













      3












      3








      3







      EDIT 3:



      Final (I think) version, a bit cleaner and faster incorporating ideas from max9111's answer.



      import numpy as np
      from numba import as nb

      @nb.njit()
      def func1_jit(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is already quite faster than any of the previous options, but we are still not taking advantage of multiple CPUs. One way to do it is within the function itself, for example parallelizing the outer loop. This adds some overhead on each call to create the threads, so for small inputs is actually a bit slower, but should be significantly faster for bigger values:



      import numpy as np
      from numba import as nb

      @nb.njit(parallel=True)
      def func1_par(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = np.empty((a,))
      for ai in nb.prange(0, a):
      Bi = 0
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      Bi += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      B[ai] = Bi
      return np.sum(B)


      Or, if you have many points where you want to evaluate the function, you can parallelize at that level too. Here a_arr, b_arr, c_arr and d_arr are vectors of values where the function is to be evaluated:



      from numba import as nb

      @nb.njit(parallel=True)
      def func1_arr(a_arr, b_arr, c_arr, d_arr):
      B_arr = np.empty((len(a_arr),))
      for i in nb.prange(len(B_arr)):
      B_arr[i] = func1_jit(a_arr[i], b_arr[i], c_arr[i], d_arr[i])
      return B_arr


      The best configuration depends on your inputs, usage pattern, hardware, etc. so you can combine the different ideas to suit your case.




      EDIT 2:



      Actually, forget what I said before. The best thing is to JIT-compile the algorithm, but in a more effective manner. Compute first the expensive parts (I took the exponential and the factorial) and then pass it to the compiled loopy function:



      import numpy as np
      from numba import njit

      def func1(a, b, c, d):
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      ee = np.arange(a + b - 2)
      fact_e = scipy.special.factorial(ee)
      return func1_inner(a, b, c, d, exp_min, exp, fact_e)

      @njit()
      def func1_inner(a, b, c, d, exp_min, exp, fact_e):
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is, in my experiments, the fastest option by far, and takes little extra memory (only the precomputed values, with size linear on the input).



      a, b, c, d = 4, 6, 3, 4
      # The original function
      %timeit func1_orig(a, b, c, d)
      # 2.07 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # The grid-evaluated function
      %timeit func1_grid(a, b, c, d)
      # 256 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
      # The precompuation + JIT-compiled function
      %timeit func1_jit(a, b, c, d)
      # 19.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)



      Well there is always the option of grid-evaluating the whole thing:



      import numpy as np
      import scipy.special

      def func1(a, b, c, d):
      ai, bi, ci, di, ei, fi = np.ogrid[:a, :b, :c, :d, :a + b - 2, :c + d - 2]
      # Compute
      B = (2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei)
      # Mask out of range elements for last two inner loops
      m = (ei < ai + bi) & (fi < ci + di)
      return np.sum(B * m)

      print(func1(4, 6, 3, 4))
      # 21769947.844726562


      I used scipy.special.factorial because apparently np.factorial does not work with arrays for some reason.



      Obivously, the memory cost of this will grow very fast as you increment the parameters. The code actually performs more computations than necessary, because the two inner loops have varying number of iterations, so (in this method) you have to use the largest and then remove what you don't need. The hope is that vectorization will make up for that. A small IPython benchmark:



      a, b, c, d = 4, 6, 3, 4
      # func1_orig is the original loop-based version
      %timeit func1_orig(a, b, c, d)
      # 2.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # func1 here is the vectorized version
      %timeit func1(a, b, c, d)
      # 210 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



      EDIT:



      Note the previous approach is not an all-or-nothing thing either. You can choose to grid-evaluate only some of the loops. For example, the two innermost loops could be vectorized like this:



      def func1(a, b, c, d):
      B = 0
      e = np.arange(a + b - 2).reshape((-1, 1))
      f = np.arange(c + d - 2)
      for ai in range(0, a):
      for bi in range(0, b):
      ei = e[:ai + bi]
      for ci in range(0, c):
      for di in range(0, d):
      fi = f[:ci + di]
      B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
      return B


      This still has loops, but it does avoid extra computations, and the memory requirements are much lower. Which one is best depends on the sizes of the inputs I guess. In my tests, with the original values (4, 6, 3, 4) this is even slower than the original function; also, for that case it seems that creating new arrays for ei and fi on each loop is faster than operating on a slice of a pre-created one. However, if you multiply the input by 4 (14, 24, 12, 16) then this is much faster than the original (about x5), although still slower than the fully vectorized one (about x3). On the other hand, I could compute the value for the input scaled by ten (40, 60, 30, 40) with this one (in ~5min) but not with the previous one because of the memory (I didn't test how long it'd take with the original function). Using @numba.jit helps a bit, although not enormously (cannot use nopython due to the factorial function). You can experiment with vectorizing more or less loops depending on the size of your inputs.






      share|improve this answer















      EDIT 3:



      Final (I think) version, a bit cleaner and faster incorporating ideas from max9111's answer.



      import numpy as np
      from numba import as nb

      @nb.njit()
      def func1_jit(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is already quite faster than any of the previous options, but we are still not taking advantage of multiple CPUs. One way to do it is within the function itself, for example parallelizing the outer loop. This adds some overhead on each call to create the threads, so for small inputs is actually a bit slower, but should be significantly faster for bigger values:



      import numpy as np
      from numba import as nb

      @nb.njit(parallel=True)
      def func1_par(a, b, c, d):
      # Precompute
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      fact_e = np.empty((a + b - 2))
      fact_e[0] = 1
      for ei in range(1, len(fact_e)):
      fact_e[ei] = ei * fact_e[ei - 1]
      # Loops
      B = np.empty((a,))
      for ai in nb.prange(0, a):
      Bi = 0
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      Bi += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      B[ai] = Bi
      return np.sum(B)


      Or, if you have many points where you want to evaluate the function, you can parallelize at that level too. Here a_arr, b_arr, c_arr and d_arr are vectors of values where the function is to be evaluated:



      from numba import as nb

      @nb.njit(parallel=True)
      def func1_arr(a_arr, b_arr, c_arr, d_arr):
      B_arr = np.empty((len(a_arr),))
      for i in nb.prange(len(B_arr)):
      B_arr[i] = func1_jit(a_arr[i], b_arr[i], c_arr[i], d_arr[i])
      return B_arr


      The best configuration depends on your inputs, usage pattern, hardware, etc. so you can combine the different ideas to suit your case.




      EDIT 2:



      Actually, forget what I said before. The best thing is to JIT-compile the algorithm, but in a more effective manner. Compute first the expensive parts (I took the exponential and the factorial) and then pass it to the compiled loopy function:



      import numpy as np
      from numba import njit

      def func1(a, b, c, d):
      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)
      ee = np.arange(a + b - 2)
      fact_e = scipy.special.factorial(ee)
      return func1_inner(a, b, c, d, exp_min, exp, fact_e)

      @njit()
      def func1_inner(a, b, c, d, exp_min, exp, fact_e):
      B = 0
      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      This is, in my experiments, the fastest option by far, and takes little extra memory (only the precomputed values, with size linear on the input).



      a, b, c, d = 4, 6, 3, 4
      # The original function
      %timeit func1_orig(a, b, c, d)
      # 2.07 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # The grid-evaluated function
      %timeit func1_grid(a, b, c, d)
      # 256 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
      # The precompuation + JIT-compiled function
      %timeit func1_jit(a, b, c, d)
      # 19.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)



      Well there is always the option of grid-evaluating the whole thing:



      import numpy as np
      import scipy.special

      def func1(a, b, c, d):
      ai, bi, ci, di, ei, fi = np.ogrid[:a, :b, :c, :d, :a + b - 2, :c + d - 2]
      # Compute
      B = (2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei)
      # Mask out of range elements for last two inner loops
      m = (ei < ai + bi) & (fi < ci + di)
      return np.sum(B * m)

      print(func1(4, 6, 3, 4))
      # 21769947.844726562


      I used scipy.special.factorial because apparently np.factorial does not work with arrays for some reason.



      Obivously, the memory cost of this will grow very fast as you increment the parameters. The code actually performs more computations than necessary, because the two inner loops have varying number of iterations, so (in this method) you have to use the largest and then remove what you don't need. The hope is that vectorization will make up for that. A small IPython benchmark:



      a, b, c, d = 4, 6, 3, 4
      # func1_orig is the original loop-based version
      %timeit func1_orig(a, b, c, d)
      # 2.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      # func1 here is the vectorized version
      %timeit func1(a, b, c, d)
      # 210 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



      EDIT:



      Note the previous approach is not an all-or-nothing thing either. You can choose to grid-evaluate only some of the loops. For example, the two innermost loops could be vectorized like this:



      def func1(a, b, c, d):
      B = 0
      e = np.arange(a + b - 2).reshape((-1, 1))
      f = np.arange(c + d - 2)
      for ai in range(0, a):
      for bi in range(0, b):
      ei = e[:ai + bi]
      for ci in range(0, c):
      for di in range(0, d):
      fi = f[:ci + di]
      B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
      return B


      This still has loops, but it does avoid extra computations, and the memory requirements are much lower. Which one is best depends on the sizes of the inputs I guess. In my tests, with the original values (4, 6, 3, 4) this is even slower than the original function; also, for that case it seems that creating new arrays for ei and fi on each loop is faster than operating on a slice of a pre-created one. However, if you multiply the input by 4 (14, 24, 12, 16) then this is much faster than the original (about x5), although still slower than the fully vectorized one (about x3). On the other hand, I could compute the value for the input scaled by ten (40, 60, 30, 40) with this one (in ~5min) but not with the previous one because of the memory (I didn't test how long it'd take with the original function). Using @numba.jit helps a bit, although not enormously (cannot use nopython due to the factorial function). You can experiment with vectorizing more or less loops depending on the size of your inputs.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 16 '18 at 8:19

























      answered Nov 14 '18 at 17:44









      jdehesajdehesa

      24.8k43554




      24.8k43554












      • Thank you so much for this great answer. I had not considered using a grid based approach but it lends itself to this problem well. I will experiment with both options and see how they perform with the other answers provided. I need to run the function for ~ 200,000 combinations of (a,b,c,d) up to ~ 50. I estimate the original function will take months/years to complete it. Another issue I need to address is precision loss with large integers. Perhaps gmpy2 may be of use?

        – Yeti
        Nov 15 '18 at 11:30











      • @Yeti I have just added another different option. Precomputing the expensive parts to avoid repeated calculation and JIT-compile the rest seems to be the best option here.

        – jdehesa
        Nov 15 '18 at 11:38











      • @Yeti About precision, I don't know enough about the topic to give advice (I mean, I know that adding many numbers can cause significant precision loss, but I don't know much about methods to work around that). I don't know if storing all the values in an array and summing them at the end with np.sum (which takes memory) makes a difference or not. In the experiments I made, both methods (accumulation and final sum) seemed to produce very close results, if not exactly the same (I think I was always using 64-bit floats, though).

        – jdehesa
        Nov 15 '18 at 11:49











      • I am experimenting with the JIT compilation option; and I must be stupid but I can't see where exp_min comes from, but exp_max makes sense.

        – Yeti
        Nov 15 '18 at 12:58











      • @jdehesa I implemented the factorization in Numba, which gives quite a significant improvement. You can add this to your answer and I will delete mine afterwards (It is only meant as a comment)

        – max9111
        Nov 15 '18 at 13:12

















      • Thank you so much for this great answer. I had not considered using a grid based approach but it lends itself to this problem well. I will experiment with both options and see how they perform with the other answers provided. I need to run the function for ~ 200,000 combinations of (a,b,c,d) up to ~ 50. I estimate the original function will take months/years to complete it. Another issue I need to address is precision loss with large integers. Perhaps gmpy2 may be of use?

        – Yeti
        Nov 15 '18 at 11:30











      • @Yeti I have just added another different option. Precomputing the expensive parts to avoid repeated calculation and JIT-compile the rest seems to be the best option here.

        – jdehesa
        Nov 15 '18 at 11:38











      • @Yeti About precision, I don't know enough about the topic to give advice (I mean, I know that adding many numbers can cause significant precision loss, but I don't know much about methods to work around that). I don't know if storing all the values in an array and summing them at the end with np.sum (which takes memory) makes a difference or not. In the experiments I made, both methods (accumulation and final sum) seemed to produce very close results, if not exactly the same (I think I was always using 64-bit floats, though).

        – jdehesa
        Nov 15 '18 at 11:49











      • I am experimenting with the JIT compilation option; and I must be stupid but I can't see where exp_min comes from, but exp_max makes sense.

        – Yeti
        Nov 15 '18 at 12:58











      • @jdehesa I implemented the factorization in Numba, which gives quite a significant improvement. You can add this to your answer and I will delete mine afterwards (It is only meant as a comment)

        – max9111
        Nov 15 '18 at 13:12
















      Thank you so much for this great answer. I had not considered using a grid based approach but it lends itself to this problem well. I will experiment with both options and see how they perform with the other answers provided. I need to run the function for ~ 200,000 combinations of (a,b,c,d) up to ~ 50. I estimate the original function will take months/years to complete it. Another issue I need to address is precision loss with large integers. Perhaps gmpy2 may be of use?

      – Yeti
      Nov 15 '18 at 11:30





      Thank you so much for this great answer. I had not considered using a grid based approach but it lends itself to this problem well. I will experiment with both options and see how they perform with the other answers provided. I need to run the function for ~ 200,000 combinations of (a,b,c,d) up to ~ 50. I estimate the original function will take months/years to complete it. Another issue I need to address is precision loss with large integers. Perhaps gmpy2 may be of use?

      – Yeti
      Nov 15 '18 at 11:30













      @Yeti I have just added another different option. Precomputing the expensive parts to avoid repeated calculation and JIT-compile the rest seems to be the best option here.

      – jdehesa
      Nov 15 '18 at 11:38





      @Yeti I have just added another different option. Precomputing the expensive parts to avoid repeated calculation and JIT-compile the rest seems to be the best option here.

      – jdehesa
      Nov 15 '18 at 11:38













      @Yeti About precision, I don't know enough about the topic to give advice (I mean, I know that adding many numbers can cause significant precision loss, but I don't know much about methods to work around that). I don't know if storing all the values in an array and summing them at the end with np.sum (which takes memory) makes a difference or not. In the experiments I made, both methods (accumulation and final sum) seemed to produce very close results, if not exactly the same (I think I was always using 64-bit floats, though).

      – jdehesa
      Nov 15 '18 at 11:49





      @Yeti About precision, I don't know enough about the topic to give advice (I mean, I know that adding many numbers can cause significant precision loss, but I don't know much about methods to work around that). I don't know if storing all the values in an array and summing them at the end with np.sum (which takes memory) makes a difference or not. In the experiments I made, both methods (accumulation and final sum) seemed to produce very close results, if not exactly the same (I think I was always using 64-bit floats, though).

      – jdehesa
      Nov 15 '18 at 11:49













      I am experimenting with the JIT compilation option; and I must be stupid but I can't see where exp_min comes from, but exp_max makes sense.

      – Yeti
      Nov 15 '18 at 12:58





      I am experimenting with the JIT compilation option; and I must be stupid but I can't see where exp_min comes from, but exp_max makes sense.

      – Yeti
      Nov 15 '18 at 12:58













      @jdehesa I implemented the factorization in Numba, which gives quite a significant improvement. You can add this to your answer and I will delete mine afterwards (It is only meant as a comment)

      – max9111
      Nov 15 '18 at 13:12





      @jdehesa I implemented the factorization in Numba, which gives quite a significant improvement. You can add this to your answer and I will delete mine afterwards (It is only meant as a comment)

      – max9111
      Nov 15 '18 at 13:12













      2














      This is only a comment on @jdehesa answer.



      If a function isn't supported by Numba itself it is usually recommendable to implement it yourself. In case of the factorization this isn't a complicated task.



      Code



      import numpy as np
      import numba as nb

      @nb.njit()
      def factorial(a):
      res=1.
      for i in range(1,a+1):
      res*=i
      return res

      @nb.njit()
      def func1(a, b, c, d):
      B = 0.

      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)

      fact_e=np.empty(a + b - 2)
      for i in range(a + b - 2):
      fact_e[i]=factorial(i)

      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      Parallel version



      @nb.njit(parallel=True)
      def func_p(a_vec,b_vec,c_vec,d_vec):
      res=np.empty(a_vec.shape[0])
      for i in nb.prange(a_vec.shape[0]):
      res[i]=func1(a_vec[i], b_vec[i], c_vec[i], d_vec[i])
      return res


      Example



      a_vec=np.random.randint(low=2,high=10,size=1000000)
      b_vec=np.random.randint(low=2,high=10,size=1000000)
      c_vec=np.random.randint(low=2,high=10,size=1000000)
      d_vec=np.random.randint(low=2,high=10,size=1000000)

      res_2=func_p(a_vec,b_vec,c_vec,d_vec)


      The single threaded version leads to 5.6µs (after the first run) on your example.



      The parallel version will almost lead to another Number_of_Cores speedup for calculating many values. Keep in mind that the compilation overhead is larger for the parallel version (above 0.5s for the first call).






      share|improve this answer

























      • Thanks for this, I'm looking into it, but I'm getting a Numba error which (among other things) says This should not have happened, a problem has occurred in Numba's internals. What version of Numba are you using? I have 0.39.0.

        – jdehesa
        Nov 15 '18 at 13:29











      • @ jdehesa Also 0.39.0. What is above the This should not have happened...? Something like Untyped global name... or cannot determine Numba type of...

        – max9111
        Nov 15 '18 at 13:38












      • The full error message was this (func1_2 is your func1 here). Not sure what that is about, since I later updated my function with basically the same and it works (see my update). I also noted that each element in fact_e can just be computed from the previous one instead of calling factorial (btw no need to delete the answer since it's a valuable contribution).

        – jdehesa
        Nov 15 '18 at 14:01















      2














      This is only a comment on @jdehesa answer.



      If a function isn't supported by Numba itself it is usually recommendable to implement it yourself. In case of the factorization this isn't a complicated task.



      Code



      import numpy as np
      import numba as nb

      @nb.njit()
      def factorial(a):
      res=1.
      for i in range(1,a+1):
      res*=i
      return res

      @nb.njit()
      def func1(a, b, c, d):
      B = 0.

      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)

      fact_e=np.empty(a + b - 2)
      for i in range(a + b - 2):
      fact_e[i]=factorial(i)

      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      Parallel version



      @nb.njit(parallel=True)
      def func_p(a_vec,b_vec,c_vec,d_vec):
      res=np.empty(a_vec.shape[0])
      for i in nb.prange(a_vec.shape[0]):
      res[i]=func1(a_vec[i], b_vec[i], c_vec[i], d_vec[i])
      return res


      Example



      a_vec=np.random.randint(low=2,high=10,size=1000000)
      b_vec=np.random.randint(low=2,high=10,size=1000000)
      c_vec=np.random.randint(low=2,high=10,size=1000000)
      d_vec=np.random.randint(low=2,high=10,size=1000000)

      res_2=func_p(a_vec,b_vec,c_vec,d_vec)


      The single threaded version leads to 5.6µs (after the first run) on your example.



      The parallel version will almost lead to another Number_of_Cores speedup for calculating many values. Keep in mind that the compilation overhead is larger for the parallel version (above 0.5s for the first call).






      share|improve this answer

























      • Thanks for this, I'm looking into it, but I'm getting a Numba error which (among other things) says This should not have happened, a problem has occurred in Numba's internals. What version of Numba are you using? I have 0.39.0.

        – jdehesa
        Nov 15 '18 at 13:29











      • @ jdehesa Also 0.39.0. What is above the This should not have happened...? Something like Untyped global name... or cannot determine Numba type of...

        – max9111
        Nov 15 '18 at 13:38












      • The full error message was this (func1_2 is your func1 here). Not sure what that is about, since I later updated my function with basically the same and it works (see my update). I also noted that each element in fact_e can just be computed from the previous one instead of calling factorial (btw no need to delete the answer since it's a valuable contribution).

        – jdehesa
        Nov 15 '18 at 14:01













      2












      2








      2







      This is only a comment on @jdehesa answer.



      If a function isn't supported by Numba itself it is usually recommendable to implement it yourself. In case of the factorization this isn't a complicated task.



      Code



      import numpy as np
      import numba as nb

      @nb.njit()
      def factorial(a):
      res=1.
      for i in range(1,a+1):
      res*=i
      return res

      @nb.njit()
      def func1(a, b, c, d):
      B = 0.

      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)

      fact_e=np.empty(a + b - 2)
      for i in range(a + b - 2):
      fact_e[i]=factorial(i)

      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      Parallel version



      @nb.njit(parallel=True)
      def func_p(a_vec,b_vec,c_vec,d_vec):
      res=np.empty(a_vec.shape[0])
      for i in nb.prange(a_vec.shape[0]):
      res[i]=func1(a_vec[i], b_vec[i], c_vec[i], d_vec[i])
      return res


      Example



      a_vec=np.random.randint(low=2,high=10,size=1000000)
      b_vec=np.random.randint(low=2,high=10,size=1000000)
      c_vec=np.random.randint(low=2,high=10,size=1000000)
      d_vec=np.random.randint(low=2,high=10,size=1000000)

      res_2=func_p(a_vec,b_vec,c_vec,d_vec)


      The single threaded version leads to 5.6µs (after the first run) on your example.



      The parallel version will almost lead to another Number_of_Cores speedup for calculating many values. Keep in mind that the compilation overhead is larger for the parallel version (above 0.5s for the first call).






      share|improve this answer















      This is only a comment on @jdehesa answer.



      If a function isn't supported by Numba itself it is usually recommendable to implement it yourself. In case of the factorization this isn't a complicated task.



      Code



      import numpy as np
      import numba as nb

      @nb.njit()
      def factorial(a):
      res=1.
      for i in range(1,a+1):
      res*=i
      return res

      @nb.njit()
      def func1(a, b, c, d):
      B = 0.

      exp_min = 5 - (a + b + c + d)
      exp_max = b
      exp = 2. ** np.arange(exp_min, exp_max + 1)

      fact_e=np.empty(a + b - 2)
      for i in range(a + b - 2):
      fact_e[i]=factorial(i)

      for ai in range(0, a):
      for bi in range(0, b):
      for ci in range(0, c):
      for di in range(0, d):
      for ei in range(0, ai + bi):
      for fi in range(0, ci + di):
      B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
      return B


      Parallel version



      @nb.njit(parallel=True)
      def func_p(a_vec,b_vec,c_vec,d_vec):
      res=np.empty(a_vec.shape[0])
      for i in nb.prange(a_vec.shape[0]):
      res[i]=func1(a_vec[i], b_vec[i], c_vec[i], d_vec[i])
      return res


      Example



      a_vec=np.random.randint(low=2,high=10,size=1000000)
      b_vec=np.random.randint(low=2,high=10,size=1000000)
      c_vec=np.random.randint(low=2,high=10,size=1000000)
      d_vec=np.random.randint(low=2,high=10,size=1000000)

      res_2=func_p(a_vec,b_vec,c_vec,d_vec)


      The single threaded version leads to 5.6µs (after the first run) on your example.



      The parallel version will almost lead to another Number_of_Cores speedup for calculating many values. Keep in mind that the compilation overhead is larger for the parallel version (above 0.5s for the first call).







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Nov 15 '18 at 13:24

























      answered Nov 15 '18 at 13:10









      max9111max9111

      2,2761612




      2,2761612












      • Thanks for this, I'm looking into it, but I'm getting a Numba error which (among other things) says This should not have happened, a problem has occurred in Numba's internals. What version of Numba are you using? I have 0.39.0.

        – jdehesa
        Nov 15 '18 at 13:29











      • @ jdehesa Also 0.39.0. What is above the This should not have happened...? Something like Untyped global name... or cannot determine Numba type of...

        – max9111
        Nov 15 '18 at 13:38












      • The full error message was this (func1_2 is your func1 here). Not sure what that is about, since I later updated my function with basically the same and it works (see my update). I also noted that each element in fact_e can just be computed from the previous one instead of calling factorial (btw no need to delete the answer since it's a valuable contribution).

        – jdehesa
        Nov 15 '18 at 14:01

















      • Thanks for this, I'm looking into it, but I'm getting a Numba error which (among other things) says This should not have happened, a problem has occurred in Numba's internals. What version of Numba are you using? I have 0.39.0.

        – jdehesa
        Nov 15 '18 at 13:29











      • @ jdehesa Also 0.39.0. What is above the This should not have happened...? Something like Untyped global name... or cannot determine Numba type of...

        – max9111
        Nov 15 '18 at 13:38












      • The full error message was this (func1_2 is your func1 here). Not sure what that is about, since I later updated my function with basically the same and it works (see my update). I also noted that each element in fact_e can just be computed from the previous one instead of calling factorial (btw no need to delete the answer since it's a valuable contribution).

        – jdehesa
        Nov 15 '18 at 14:01
















      Thanks for this, I'm looking into it, but I'm getting a Numba error which (among other things) says This should not have happened, a problem has occurred in Numba's internals. What version of Numba are you using? I have 0.39.0.

      – jdehesa
      Nov 15 '18 at 13:29





      Thanks for this, I'm looking into it, but I'm getting a Numba error which (among other things) says This should not have happened, a problem has occurred in Numba's internals. What version of Numba are you using? I have 0.39.0.

      – jdehesa
      Nov 15 '18 at 13:29













      @ jdehesa Also 0.39.0. What is above the This should not have happened...? Something like Untyped global name... or cannot determine Numba type of...

      – max9111
      Nov 15 '18 at 13:38






      @ jdehesa Also 0.39.0. What is above the This should not have happened...? Something like Untyped global name... or cannot determine Numba type of...

      – max9111
      Nov 15 '18 at 13:38














      The full error message was this (func1_2 is your func1 here). Not sure what that is about, since I later updated my function with basically the same and it works (see my update). I also noted that each element in fact_e can just be computed from the previous one instead of calling factorial (btw no need to delete the answer since it's a valuable contribution).

      – jdehesa
      Nov 15 '18 at 14:01





      The full error message was this (func1_2 is your func1 here). Not sure what that is about, since I later updated my function with basically the same and it works (see my update). I also noted that each element in fact_e can just be computed from the previous one instead of calling factorial (btw no need to delete the answer since it's a valuable contribution).

      – jdehesa
      Nov 15 '18 at 14:01











      1














      Using this cartesian_product function
      you can transform your nested loop into matrices and then you can simply calculate your respective nested sigmas in a vectorized manner:



      In [37]: def nested_sig(args):
      ...: base_prod = cartesian_product(*arrays)
      ...: second_prod = cartesian_product(base_prod[:,:2].sum(1), base_prod[:,2:].sum(1))
      ...: total = np.column_stack((base_prod, second_prod))
      ...: # the items in each row denotes the following variables in order:
      ...: # ai, bi, ci, di, ei, fi
      ...: x = total[:, 4] - total[:, 5] - total[:, 0] - total[:, 2] - total[:, 3] + 1
      ...: y = total[:, 4] - total[:, 5]
      ...: result = np.power(2, x) * (np.power(total[:, 4], 2) - 2*y - 7*total[:, 3]) * np.math.factorial(total[:,4])
      ...: return result





      share|improve this answer



























        1














        Using this cartesian_product function
        you can transform your nested loop into matrices and then you can simply calculate your respective nested sigmas in a vectorized manner:



        In [37]: def nested_sig(args):
        ...: base_prod = cartesian_product(*arrays)
        ...: second_prod = cartesian_product(base_prod[:,:2].sum(1), base_prod[:,2:].sum(1))
        ...: total = np.column_stack((base_prod, second_prod))
        ...: # the items in each row denotes the following variables in order:
        ...: # ai, bi, ci, di, ei, fi
        ...: x = total[:, 4] - total[:, 5] - total[:, 0] - total[:, 2] - total[:, 3] + 1
        ...: y = total[:, 4] - total[:, 5]
        ...: result = np.power(2, x) * (np.power(total[:, 4], 2) - 2*y - 7*total[:, 3]) * np.math.factorial(total[:,4])
        ...: return result





        share|improve this answer

























          1












          1








          1







          Using this cartesian_product function
          you can transform your nested loop into matrices and then you can simply calculate your respective nested sigmas in a vectorized manner:



          In [37]: def nested_sig(args):
          ...: base_prod = cartesian_product(*arrays)
          ...: second_prod = cartesian_product(base_prod[:,:2].sum(1), base_prod[:,2:].sum(1))
          ...: total = np.column_stack((base_prod, second_prod))
          ...: # the items in each row denotes the following variables in order:
          ...: # ai, bi, ci, di, ei, fi
          ...: x = total[:, 4] - total[:, 5] - total[:, 0] - total[:, 2] - total[:, 3] + 1
          ...: y = total[:, 4] - total[:, 5]
          ...: result = np.power(2, x) * (np.power(total[:, 4], 2) - 2*y - 7*total[:, 3]) * np.math.factorial(total[:,4])
          ...: return result





          share|improve this answer













          Using this cartesian_product function
          you can transform your nested loop into matrices and then you can simply calculate your respective nested sigmas in a vectorized manner:



          In [37]: def nested_sig(args):
          ...: base_prod = cartesian_product(*arrays)
          ...: second_prod = cartesian_product(base_prod[:,:2].sum(1), base_prod[:,2:].sum(1))
          ...: total = np.column_stack((base_prod, second_prod))
          ...: # the items in each row denotes the following variables in order:
          ...: # ai, bi, ci, di, ei, fi
          ...: x = total[:, 4] - total[:, 5] - total[:, 0] - total[:, 2] - total[:, 3] + 1
          ...: y = total[:, 4] - total[:, 5]
          ...: result = np.power(2, x) * (np.power(total[:, 4], 2) - 2*y - 7*total[:, 3]) * np.math.factorial(total[:,4])
          ...: return result






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 14 '18 at 17:58









          KasrâmvdKasrâmvd

          78.9k1091128




          78.9k1091128





















              1














              I see three sources of improvement in your code :



              • range(0,a) is enter image description here


              • you make a lot of work in the inner loop


              • you sum term in a random way, with risk of loss of precision for bigger entries.


              Here a version (probably not yet good) which try to improve this points.



              @numba.njit
              def func1o(a,b,c,d):
              "2**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*ei!"
              POW=2.; SUM=0.;
              L=
              for ai in arange(0.,a+1):
              for bi in range(0,b+1):
              for ci in range(0,c+1):
              for di in range(0,d+1):
              FACT=1.
              for ei in arange(0,ai+bi+1):
              for fi in range(0,ci+di+1):
              L.append(POW*SUM*FACT)
              POW /= 2
              SUM -= 2*ei
              POW *= 2
              SUM += 2*(ei-fi)+1
              FACT *= ei+1
              POW /=2
              SUM -= 7*di
              POW /= 2
              POW /= 2
              A=np.array(L)
              I=np.abs(A).argsort()
              return A[I].sum()





              share|improve this answer





























                1














                I see three sources of improvement in your code :



                • range(0,a) is enter image description here


                • you make a lot of work in the inner loop


                • you sum term in a random way, with risk of loss of precision for bigger entries.


                Here a version (probably not yet good) which try to improve this points.



                @numba.njit
                def func1o(a,b,c,d):
                "2**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*ei!"
                POW=2.; SUM=0.;
                L=
                for ai in arange(0.,a+1):
                for bi in range(0,b+1):
                for ci in range(0,c+1):
                for di in range(0,d+1):
                FACT=1.
                for ei in arange(0,ai+bi+1):
                for fi in range(0,ci+di+1):
                L.append(POW*SUM*FACT)
                POW /= 2
                SUM -= 2*ei
                POW *= 2
                SUM += 2*(ei-fi)+1
                FACT *= ei+1
                POW /=2
                SUM -= 7*di
                POW /= 2
                POW /= 2
                A=np.array(L)
                I=np.abs(A).argsort()
                return A[I].sum()





                share|improve this answer



























                  1












                  1








                  1







                  I see three sources of improvement in your code :



                  • range(0,a) is enter image description here


                  • you make a lot of work in the inner loop


                  • you sum term in a random way, with risk of loss of precision for bigger entries.


                  Here a version (probably not yet good) which try to improve this points.



                  @numba.njit
                  def func1o(a,b,c,d):
                  "2**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*ei!"
                  POW=2.; SUM=0.;
                  L=
                  for ai in arange(0.,a+1):
                  for bi in range(0,b+1):
                  for ci in range(0,c+1):
                  for di in range(0,d+1):
                  FACT=1.
                  for ei in arange(0,ai+bi+1):
                  for fi in range(0,ci+di+1):
                  L.append(POW*SUM*FACT)
                  POW /= 2
                  SUM -= 2*ei
                  POW *= 2
                  SUM += 2*(ei-fi)+1
                  FACT *= ei+1
                  POW /=2
                  SUM -= 7*di
                  POW /= 2
                  POW /= 2
                  A=np.array(L)
                  I=np.abs(A).argsort()
                  return A[I].sum()





                  share|improve this answer















                  I see three sources of improvement in your code :



                  • range(0,a) is enter image description here


                  • you make a lot of work in the inner loop


                  • you sum term in a random way, with risk of loss of precision for bigger entries.


                  Here a version (probably not yet good) which try to improve this points.



                  @numba.njit
                  def func1o(a,b,c,d):
                  "2**(ei-fi-ai-ci-di+1)*(ei**2-2*(ei*fi)-7*di)*ei!"
                  POW=2.; SUM=0.;
                  L=
                  for ai in arange(0.,a+1):
                  for bi in range(0,b+1):
                  for ci in range(0,c+1):
                  for di in range(0,d+1):
                  FACT=1.
                  for ei in arange(0,ai+bi+1):
                  for fi in range(0,ci+di+1):
                  L.append(POW*SUM*FACT)
                  POW /= 2
                  SUM -= 2*ei
                  POW *= 2
                  SUM += 2*(ei-fi)+1
                  FACT *= ei+1
                  POW /=2
                  SUM -= 7*di
                  POW /= 2
                  POW /= 2
                  A=np.array(L)
                  I=np.abs(A).argsort()
                  return A[I].sum()






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Nov 14 '18 at 20:47

























                  answered Nov 14 '18 at 20:40









                  B. M.B. M.

                  13.3k12034




                  13.3k12034



























                      draft saved

                      draft discarded
















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid


                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.

                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53305490%2fvectorize-a-6-for-loop-cumulative-sum-in-python%23new-answer', 'question_page');

                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      這個網誌中的熱門文章

                      How to read a connectionString WITH PROVIDER in .NET Core?

                      Node.js Script on GitHub Pages or Amazon S3

                      Museum of Modern and Contemporary Art of Trento and Rovereto