Improve runtime of python code with indexing and string concatenation










2















I tried hard to improve the runtime of the following code snippet, which turned out to be the CPU-bottleneck in an asyncio-client package that I'm developing:



data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s


What I do is basically very simple:




  • elements is a list of str (each <= 7 characters) that I eventually write at specific positions into a tab-separated file.


  • indices is a list of int giving the positions of each of the elements in the file

  • If there is no element at a certain position, an empty string is inserted

I finally write the string into a text file using aiofiles.



So far, I tried to use a generator to create the data on the fly, as well as to use numpy for faster indexing, but with no success. Any idea how to make this code run faster would be great. Here is a reproducible example with timing:



import numpy as np
import timeit

n = 1_000_000 # total number of items
k = 500_000 # number of elements to insert
elements = ['ade/gua'] * k # elements to insert, <= 7 unicode characters
indices = list(range(0, n, 2)) # indices where to insert, sorted
assert len(elements) == len(indices)

# This is where I started
def baseline():
data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s

# Generate values on the fly
def generator():
def f():
it = iter(enumerate(indices))
i, ix = next(it)
for j in range(n):
if j == ix:
yield elements[i]
try:
i, ix = next(it)
except:
pass
else:
yield ''
s = 't'.join(f()) # iterating though generation seem too costly
return s

# Harness numpy
indices_np = np.array(indices) # indices could also be numpy array
def numpy():
data = np.full(n, '', dtype='<U7')
data[indices_np] = elements # this is faster with numpy
s = 't'.join(data) # much slower. array2string or savetxt does not help
return s

assert baseline() == generator() == numpy()

timeit.timeit(baseline, number=10) # 0.8463204780127853
timeit.timeit(generator, number=10) # 2.048296730965376 -> great job
timeit.timeit(numpy, number=10) # 4.486689139157534 -> life sucks


Edit 1



To address some of the points raised in the comments:



  • I write the string aiofiles.open(filename, mode='w') as file and file.write()


  • Indices can generally not be expressed as a range


  • Indices can assumed to be always sorted at no extra cost.


  • ASCII characters are sufficient


Edit 2



Based on the answer of Mad Physicist, I tried the following code, with no success.



def buffer_plumbing():
m = len(elements) # total number of data points to insert
k = 7 # each element is 7 bytes long, only ascii
total_bytes = n - 1 + m * 7 # total number of bytes for the buffer

# find out the number of preceeding gaps for each element
gap = np.empty_like(indices_np)
gap[0] = indices_np[0] # that many gaps a the beginning
np.subtract(indices_np[1:], indices_np[:-1], out=gap[1:])
gap[1:] -= 1 # subtract one to get the gaps (except for the first)

# pre-allocate a large enough byte buffer
s = np.full(total_bytes , 't', dtype='S1')

# write element into the buffer
start = 0
for i, (g, e) in enumerate(zip(gap, elements)):
start += g
s[start: start + k].view(dtype=('S', k))[:] = e
start += k + 1
return s.tostring().decode('utf-8')

timeit.timeit(buffer_plumbing, number=10) # 26.82









share|improve this question
























  • I have to ask, what does 'ade/cyt' mean?

    – Mad Physicist
    Nov 15 '18 at 13:32











  • You are relocating the string umpteen times when you do the join. Your best optimization may be to avoid constructing it in the first place. What is the final purpose of that string? Are you writing it somewhere?

    – Mad Physicist
    Nov 15 '18 at 13:36











  • @MadPhysicist ade/cyt refered to nucleotides (as was biologically wrong, btw). I write it to a text file.

    – dmy
    Nov 15 '18 at 13:40











  • Why don't you replace the 'ade/gua' and other strings with integers. For example 1 for 'ade/gua', 2 for 'ade/cyt', etc... And with data = np.zeros(n, dtype=np.int64) you do data[ix] = elements[i]. Or what is your expected output? How many combinations do you expect?

    – Scotty1-
    Nov 15 '18 at 14:02






  • 1





    Can you show how you write the string exactly. I think your biggest speedup will come from not concatenating the whole thing if you're writing it to a file.

    – Mad Physicist
    Nov 16 '18 at 3:41















2















I tried hard to improve the runtime of the following code snippet, which turned out to be the CPU-bottleneck in an asyncio-client package that I'm developing:



data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s


What I do is basically very simple:




  • elements is a list of str (each <= 7 characters) that I eventually write at specific positions into a tab-separated file.


  • indices is a list of int giving the positions of each of the elements in the file

  • If there is no element at a certain position, an empty string is inserted

I finally write the string into a text file using aiofiles.



So far, I tried to use a generator to create the data on the fly, as well as to use numpy for faster indexing, but with no success. Any idea how to make this code run faster would be great. Here is a reproducible example with timing:



import numpy as np
import timeit

n = 1_000_000 # total number of items
k = 500_000 # number of elements to insert
elements = ['ade/gua'] * k # elements to insert, <= 7 unicode characters
indices = list(range(0, n, 2)) # indices where to insert, sorted
assert len(elements) == len(indices)

# This is where I started
def baseline():
data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s

# Generate values on the fly
def generator():
def f():
it = iter(enumerate(indices))
i, ix = next(it)
for j in range(n):
if j == ix:
yield elements[i]
try:
i, ix = next(it)
except:
pass
else:
yield ''
s = 't'.join(f()) # iterating though generation seem too costly
return s

# Harness numpy
indices_np = np.array(indices) # indices could also be numpy array
def numpy():
data = np.full(n, '', dtype='<U7')
data[indices_np] = elements # this is faster with numpy
s = 't'.join(data) # much slower. array2string or savetxt does not help
return s

assert baseline() == generator() == numpy()

timeit.timeit(baseline, number=10) # 0.8463204780127853
timeit.timeit(generator, number=10) # 2.048296730965376 -> great job
timeit.timeit(numpy, number=10) # 4.486689139157534 -> life sucks


Edit 1



To address some of the points raised in the comments:



  • I write the string aiofiles.open(filename, mode='w') as file and file.write()


  • Indices can generally not be expressed as a range


  • Indices can assumed to be always sorted at no extra cost.


  • ASCII characters are sufficient


Edit 2



Based on the answer of Mad Physicist, I tried the following code, with no success.



def buffer_plumbing():
m = len(elements) # total number of data points to insert
k = 7 # each element is 7 bytes long, only ascii
total_bytes = n - 1 + m * 7 # total number of bytes for the buffer

# find out the number of preceeding gaps for each element
gap = np.empty_like(indices_np)
gap[0] = indices_np[0] # that many gaps a the beginning
np.subtract(indices_np[1:], indices_np[:-1], out=gap[1:])
gap[1:] -= 1 # subtract one to get the gaps (except for the first)

# pre-allocate a large enough byte buffer
s = np.full(total_bytes , 't', dtype='S1')

# write element into the buffer
start = 0
for i, (g, e) in enumerate(zip(gap, elements)):
start += g
s[start: start + k].view(dtype=('S', k))[:] = e
start += k + 1
return s.tostring().decode('utf-8')

timeit.timeit(buffer_plumbing, number=10) # 26.82









share|improve this question
























  • I have to ask, what does 'ade/cyt' mean?

    – Mad Physicist
    Nov 15 '18 at 13:32











  • You are relocating the string umpteen times when you do the join. Your best optimization may be to avoid constructing it in the first place. What is the final purpose of that string? Are you writing it somewhere?

    – Mad Physicist
    Nov 15 '18 at 13:36











  • @MadPhysicist ade/cyt refered to nucleotides (as was biologically wrong, btw). I write it to a text file.

    – dmy
    Nov 15 '18 at 13:40











  • Why don't you replace the 'ade/gua' and other strings with integers. For example 1 for 'ade/gua', 2 for 'ade/cyt', etc... And with data = np.zeros(n, dtype=np.int64) you do data[ix] = elements[i]. Or what is your expected output? How many combinations do you expect?

    – Scotty1-
    Nov 15 '18 at 14:02






  • 1





    Can you show how you write the string exactly. I think your biggest speedup will come from not concatenating the whole thing if you're writing it to a file.

    – Mad Physicist
    Nov 16 '18 at 3:41













2












2








2








I tried hard to improve the runtime of the following code snippet, which turned out to be the CPU-bottleneck in an asyncio-client package that I'm developing:



data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s


What I do is basically very simple:




  • elements is a list of str (each <= 7 characters) that I eventually write at specific positions into a tab-separated file.


  • indices is a list of int giving the positions of each of the elements in the file

  • If there is no element at a certain position, an empty string is inserted

I finally write the string into a text file using aiofiles.



So far, I tried to use a generator to create the data on the fly, as well as to use numpy for faster indexing, but with no success. Any idea how to make this code run faster would be great. Here is a reproducible example with timing:



import numpy as np
import timeit

n = 1_000_000 # total number of items
k = 500_000 # number of elements to insert
elements = ['ade/gua'] * k # elements to insert, <= 7 unicode characters
indices = list(range(0, n, 2)) # indices where to insert, sorted
assert len(elements) == len(indices)

# This is where I started
def baseline():
data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s

# Generate values on the fly
def generator():
def f():
it = iter(enumerate(indices))
i, ix = next(it)
for j in range(n):
if j == ix:
yield elements[i]
try:
i, ix = next(it)
except:
pass
else:
yield ''
s = 't'.join(f()) # iterating though generation seem too costly
return s

# Harness numpy
indices_np = np.array(indices) # indices could also be numpy array
def numpy():
data = np.full(n, '', dtype='<U7')
data[indices_np] = elements # this is faster with numpy
s = 't'.join(data) # much slower. array2string or savetxt does not help
return s

assert baseline() == generator() == numpy()

timeit.timeit(baseline, number=10) # 0.8463204780127853
timeit.timeit(generator, number=10) # 2.048296730965376 -> great job
timeit.timeit(numpy, number=10) # 4.486689139157534 -> life sucks


Edit 1



To address some of the points raised in the comments:



  • I write the string aiofiles.open(filename, mode='w') as file and file.write()


  • Indices can generally not be expressed as a range


  • Indices can assumed to be always sorted at no extra cost.


  • ASCII characters are sufficient


Edit 2



Based on the answer of Mad Physicist, I tried the following code, with no success.



def buffer_plumbing():
m = len(elements) # total number of data points to insert
k = 7 # each element is 7 bytes long, only ascii
total_bytes = n - 1 + m * 7 # total number of bytes for the buffer

# find out the number of preceeding gaps for each element
gap = np.empty_like(indices_np)
gap[0] = indices_np[0] # that many gaps a the beginning
np.subtract(indices_np[1:], indices_np[:-1], out=gap[1:])
gap[1:] -= 1 # subtract one to get the gaps (except for the first)

# pre-allocate a large enough byte buffer
s = np.full(total_bytes , 't', dtype='S1')

# write element into the buffer
start = 0
for i, (g, e) in enumerate(zip(gap, elements)):
start += g
s[start: start + k].view(dtype=('S', k))[:] = e
start += k + 1
return s.tostring().decode('utf-8')

timeit.timeit(buffer_plumbing, number=10) # 26.82









share|improve this question
















I tried hard to improve the runtime of the following code snippet, which turned out to be the CPU-bottleneck in an asyncio-client package that I'm developing:



data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s


What I do is basically very simple:




  • elements is a list of str (each <= 7 characters) that I eventually write at specific positions into a tab-separated file.


  • indices is a list of int giving the positions of each of the elements in the file

  • If there is no element at a certain position, an empty string is inserted

I finally write the string into a text file using aiofiles.



So far, I tried to use a generator to create the data on the fly, as well as to use numpy for faster indexing, but with no success. Any idea how to make this code run faster would be great. Here is a reproducible example with timing:



import numpy as np
import timeit

n = 1_000_000 # total number of items
k = 500_000 # number of elements to insert
elements = ['ade/gua'] * k # elements to insert, <= 7 unicode characters
indices = list(range(0, n, 2)) # indices where to insert, sorted
assert len(elements) == len(indices)

# This is where I started
def baseline():
data = [''] * n
for i, ix in enumerate(indices):
data[ix] = elements[i]
s = 't'.join(data)
return s

# Generate values on the fly
def generator():
def f():
it = iter(enumerate(indices))
i, ix = next(it)
for j in range(n):
if j == ix:
yield elements[i]
try:
i, ix = next(it)
except:
pass
else:
yield ''
s = 't'.join(f()) # iterating though generation seem too costly
return s

# Harness numpy
indices_np = np.array(indices) # indices could also be numpy array
def numpy():
data = np.full(n, '', dtype='<U7')
data[indices_np] = elements # this is faster with numpy
s = 't'.join(data) # much slower. array2string or savetxt does not help
return s

assert baseline() == generator() == numpy()

timeit.timeit(baseline, number=10) # 0.8463204780127853
timeit.timeit(generator, number=10) # 2.048296730965376 -> great job
timeit.timeit(numpy, number=10) # 4.486689139157534 -> life sucks


Edit 1



To address some of the points raised in the comments:



  • I write the string aiofiles.open(filename, mode='w') as file and file.write()


  • Indices can generally not be expressed as a range


  • Indices can assumed to be always sorted at no extra cost.


  • ASCII characters are sufficient


Edit 2



Based on the answer of Mad Physicist, I tried the following code, with no success.



def buffer_plumbing():
m = len(elements) # total number of data points to insert
k = 7 # each element is 7 bytes long, only ascii
total_bytes = n - 1 + m * 7 # total number of bytes for the buffer

# find out the number of preceeding gaps for each element
gap = np.empty_like(indices_np)
gap[0] = indices_np[0] # that many gaps a the beginning
np.subtract(indices_np[1:], indices_np[:-1], out=gap[1:])
gap[1:] -= 1 # subtract one to get the gaps (except for the first)

# pre-allocate a large enough byte buffer
s = np.full(total_bytes , 't', dtype='S1')

# write element into the buffer
start = 0
for i, (g, e) in enumerate(zip(gap, elements)):
start += g
s[start: start + k].view(dtype=('S', k))[:] = e
start += k + 1
return s.tostring().decode('utf-8')

timeit.timeit(buffer_plumbing, number=10) # 26.82






python numpy optimization






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 16 '18 at 12:02







dmy

















asked Nov 15 '18 at 13:25









dmydmy

132




132












  • I have to ask, what does 'ade/cyt' mean?

    – Mad Physicist
    Nov 15 '18 at 13:32











  • You are relocating the string umpteen times when you do the join. Your best optimization may be to avoid constructing it in the first place. What is the final purpose of that string? Are you writing it somewhere?

    – Mad Physicist
    Nov 15 '18 at 13:36











  • @MadPhysicist ade/cyt refered to nucleotides (as was biologically wrong, btw). I write it to a text file.

    – dmy
    Nov 15 '18 at 13:40











  • Why don't you replace the 'ade/gua' and other strings with integers. For example 1 for 'ade/gua', 2 for 'ade/cyt', etc... And with data = np.zeros(n, dtype=np.int64) you do data[ix] = elements[i]. Or what is your expected output? How many combinations do you expect?

    – Scotty1-
    Nov 15 '18 at 14:02






  • 1





    Can you show how you write the string exactly. I think your biggest speedup will come from not concatenating the whole thing if you're writing it to a file.

    – Mad Physicist
    Nov 16 '18 at 3:41

















  • I have to ask, what does 'ade/cyt' mean?

    – Mad Physicist
    Nov 15 '18 at 13:32











  • You are relocating the string umpteen times when you do the join. Your best optimization may be to avoid constructing it in the first place. What is the final purpose of that string? Are you writing it somewhere?

    – Mad Physicist
    Nov 15 '18 at 13:36











  • @MadPhysicist ade/cyt refered to nucleotides (as was biologically wrong, btw). I write it to a text file.

    – dmy
    Nov 15 '18 at 13:40











  • Why don't you replace the 'ade/gua' and other strings with integers. For example 1 for 'ade/gua', 2 for 'ade/cyt', etc... And with data = np.zeros(n, dtype=np.int64) you do data[ix] = elements[i]. Or what is your expected output? How many combinations do you expect?

    – Scotty1-
    Nov 15 '18 at 14:02






  • 1





    Can you show how you write the string exactly. I think your biggest speedup will come from not concatenating the whole thing if you're writing it to a file.

    – Mad Physicist
    Nov 16 '18 at 3:41
















I have to ask, what does 'ade/cyt' mean?

– Mad Physicist
Nov 15 '18 at 13:32





I have to ask, what does 'ade/cyt' mean?

– Mad Physicist
Nov 15 '18 at 13:32













You are relocating the string umpteen times when you do the join. Your best optimization may be to avoid constructing it in the first place. What is the final purpose of that string? Are you writing it somewhere?

– Mad Physicist
Nov 15 '18 at 13:36





You are relocating the string umpteen times when you do the join. Your best optimization may be to avoid constructing it in the first place. What is the final purpose of that string? Are you writing it somewhere?

– Mad Physicist
Nov 15 '18 at 13:36













@MadPhysicist ade/cyt refered to nucleotides (as was biologically wrong, btw). I write it to a text file.

– dmy
Nov 15 '18 at 13:40





@MadPhysicist ade/cyt refered to nucleotides (as was biologically wrong, btw). I write it to a text file.

– dmy
Nov 15 '18 at 13:40













Why don't you replace the 'ade/gua' and other strings with integers. For example 1 for 'ade/gua', 2 for 'ade/cyt', etc... And with data = np.zeros(n, dtype=np.int64) you do data[ix] = elements[i]. Or what is your expected output? How many combinations do you expect?

– Scotty1-
Nov 15 '18 at 14:02





Why don't you replace the 'ade/gua' and other strings with integers. For example 1 for 'ade/gua', 2 for 'ade/cyt', etc... And with data = np.zeros(n, dtype=np.int64) you do data[ix] = elements[i]. Or what is your expected output? How many combinations do you expect?

– Scotty1-
Nov 15 '18 at 14:02




1




1





Can you show how you write the string exactly. I think your biggest speedup will come from not concatenating the whole thing if you're writing it to a file.

– Mad Physicist
Nov 16 '18 at 3:41





Can you show how you write the string exactly. I think your biggest speedup will come from not concatenating the whole thing if you're writing it to a file.

– Mad Physicist
Nov 16 '18 at 3:41












1 Answer
1






active

oldest

votes


















2














You can pre-sort your data, after converting it to a pair of numpy arrays. This will allow you to manipulate a single pre-existing buffer rather than copying strings over and over as you reallocate them. The difference between my suggestion and your attempt is that we will use ndarray.tobytes (or ndarray.tostring) with the assumption that you only have ASCII characters. In fact, you can completely bypass the copy operation involved in converting into a bytes object by using ndarray.tofile directly.



If you have elements in-hand, you know that the total length of your line will be the combined length of the elements and n-1 tab separators. The start of an element in the full string is therefore it's index (the number of tabs that precede it) plus the cumulative length of all the elements that come before it. Here is a simple implementation of a single-buffer fill using mostly Python loops:



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)
elements = np.array(elements, dtype='S7')
order = np.argsort(indices)

elements = elements[order]
indices = indices[order]
lengths = lengths[order]

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


There are lots of possible optimizations to play with here, such as the possibility of allocating s using np.empty and only filling in the required elements with tabs. This will be left as an excise for the reader.



Another possibility is to avoid converting elements to a numpy array entirely (it probably just wastes space and time). You can then rewrite the for loop as



for i, l, o in zip(cumulative, lengths, order):
s[i:i + l].view(dtype=('S', l))[:] = elements[o]


You can dump the result into a bytes object with



s = s.tobytes()


OR



s = s.tostring()


You can write the result as-is to a file opened for binary writing. In fact, if you don't need a copy of the buffer in the form of a bytes, you can just write to the file directly:



s.tofile(f)


That will save you some memory and processing time.



In the same vein, you may be better off just writing to a file directly, piece by piece. This saves you not only the need to allocate the full buffer, but also the cumulative lengths. In fact, the only thing you need this way is the diff of successive indices to tell you how many tabs to insert:



indices = np.asanyarray(indices)
order = np.argsort(indices)

indices = indices[order]

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, o in zip(tabs, order):
f.write('t' * t)
f.write(elements[o])
f.write('t' * (n - indices[-1] - 1))


This second approach has two major advantages besides the reduced amount of calculation. The first is that it works with unicode characters rather than ASCII only. The second is that it does not allocate any buffers besides strings of tabs, which should be extremely fast.



In both cases, having elements and indices sorted into ascending order by index will speed things up dramatically. The first case reduces to



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


And the second becomes just



indices = np.asanyarray(indices)

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, e in zip(tabs, elements):
f.write('t' * t)
f.write(e)
f.write('t' * (n - indices[-1] - 1))





share|improve this answer























  • Thanks for the extensive answer. It tried the last approach, but I think there is a bug, because I need 't' also between consecutive elements as a delimitier. Besides, this massively slows down everything du to the large number of calls to file.write.

    – dmy
    Nov 16 '18 at 9:41











  • I tried everything now and its just getting worse. Seems like time to surrender...

    – dmy
    Nov 16 '18 at 12:03











  • @dmy. I'll play with it some more once I'm on a desktop. This is interesting. Remember to include the I/O in your timing. That's likely to be a bigger bottleneck than the in-memory stuff.

    – Mad Physicist
    Nov 16 '18 at 13:54











  • I'm using asyncio for this reason so I made the CPU to be my bottleneck.

    – dmy
    Nov 16 '18 at 14:04











  • If you're writing to a file, you're writing to a file. Writing x number of bytes will take a fixed amount of time, which will generally only go up if you try to parallelize with other disk operations.

    – Mad Physicist
    Nov 16 '18 at 14:07










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53320519%2fimprove-runtime-of-python-code-with-indexing-and-string-concatenation%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









2














You can pre-sort your data, after converting it to a pair of numpy arrays. This will allow you to manipulate a single pre-existing buffer rather than copying strings over and over as you reallocate them. The difference between my suggestion and your attempt is that we will use ndarray.tobytes (or ndarray.tostring) with the assumption that you only have ASCII characters. In fact, you can completely bypass the copy operation involved in converting into a bytes object by using ndarray.tofile directly.



If you have elements in-hand, you know that the total length of your line will be the combined length of the elements and n-1 tab separators. The start of an element in the full string is therefore it's index (the number of tabs that precede it) plus the cumulative length of all the elements that come before it. Here is a simple implementation of a single-buffer fill using mostly Python loops:



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)
elements = np.array(elements, dtype='S7')
order = np.argsort(indices)

elements = elements[order]
indices = indices[order]
lengths = lengths[order]

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


There are lots of possible optimizations to play with here, such as the possibility of allocating s using np.empty and only filling in the required elements with tabs. This will be left as an excise for the reader.



Another possibility is to avoid converting elements to a numpy array entirely (it probably just wastes space and time). You can then rewrite the for loop as



for i, l, o in zip(cumulative, lengths, order):
s[i:i + l].view(dtype=('S', l))[:] = elements[o]


You can dump the result into a bytes object with



s = s.tobytes()


OR



s = s.tostring()


You can write the result as-is to a file opened for binary writing. In fact, if you don't need a copy of the buffer in the form of a bytes, you can just write to the file directly:



s.tofile(f)


That will save you some memory and processing time.



In the same vein, you may be better off just writing to a file directly, piece by piece. This saves you not only the need to allocate the full buffer, but also the cumulative lengths. In fact, the only thing you need this way is the diff of successive indices to tell you how many tabs to insert:



indices = np.asanyarray(indices)
order = np.argsort(indices)

indices = indices[order]

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, o in zip(tabs, order):
f.write('t' * t)
f.write(elements[o])
f.write('t' * (n - indices[-1] - 1))


This second approach has two major advantages besides the reduced amount of calculation. The first is that it works with unicode characters rather than ASCII only. The second is that it does not allocate any buffers besides strings of tabs, which should be extremely fast.



In both cases, having elements and indices sorted into ascending order by index will speed things up dramatically. The first case reduces to



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


And the second becomes just



indices = np.asanyarray(indices)

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, e in zip(tabs, elements):
f.write('t' * t)
f.write(e)
f.write('t' * (n - indices[-1] - 1))





share|improve this answer























  • Thanks for the extensive answer. It tried the last approach, but I think there is a bug, because I need 't' also between consecutive elements as a delimitier. Besides, this massively slows down everything du to the large number of calls to file.write.

    – dmy
    Nov 16 '18 at 9:41











  • I tried everything now and its just getting worse. Seems like time to surrender...

    – dmy
    Nov 16 '18 at 12:03











  • @dmy. I'll play with it some more once I'm on a desktop. This is interesting. Remember to include the I/O in your timing. That's likely to be a bigger bottleneck than the in-memory stuff.

    – Mad Physicist
    Nov 16 '18 at 13:54











  • I'm using asyncio for this reason so I made the CPU to be my bottleneck.

    – dmy
    Nov 16 '18 at 14:04











  • If you're writing to a file, you're writing to a file. Writing x number of bytes will take a fixed amount of time, which will generally only go up if you try to parallelize with other disk operations.

    – Mad Physicist
    Nov 16 '18 at 14:07















2














You can pre-sort your data, after converting it to a pair of numpy arrays. This will allow you to manipulate a single pre-existing buffer rather than copying strings over and over as you reallocate them. The difference between my suggestion and your attempt is that we will use ndarray.tobytes (or ndarray.tostring) with the assumption that you only have ASCII characters. In fact, you can completely bypass the copy operation involved in converting into a bytes object by using ndarray.tofile directly.



If you have elements in-hand, you know that the total length of your line will be the combined length of the elements and n-1 tab separators. The start of an element in the full string is therefore it's index (the number of tabs that precede it) plus the cumulative length of all the elements that come before it. Here is a simple implementation of a single-buffer fill using mostly Python loops:



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)
elements = np.array(elements, dtype='S7')
order = np.argsort(indices)

elements = elements[order]
indices = indices[order]
lengths = lengths[order]

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


There are lots of possible optimizations to play with here, such as the possibility of allocating s using np.empty and only filling in the required elements with tabs. This will be left as an excise for the reader.



Another possibility is to avoid converting elements to a numpy array entirely (it probably just wastes space and time). You can then rewrite the for loop as



for i, l, o in zip(cumulative, lengths, order):
s[i:i + l].view(dtype=('S', l))[:] = elements[o]


You can dump the result into a bytes object with



s = s.tobytes()


OR



s = s.tostring()


You can write the result as-is to a file opened for binary writing. In fact, if you don't need a copy of the buffer in the form of a bytes, you can just write to the file directly:



s.tofile(f)


That will save you some memory and processing time.



In the same vein, you may be better off just writing to a file directly, piece by piece. This saves you not only the need to allocate the full buffer, but also the cumulative lengths. In fact, the only thing you need this way is the diff of successive indices to tell you how many tabs to insert:



indices = np.asanyarray(indices)
order = np.argsort(indices)

indices = indices[order]

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, o in zip(tabs, order):
f.write('t' * t)
f.write(elements[o])
f.write('t' * (n - indices[-1] - 1))


This second approach has two major advantages besides the reduced amount of calculation. The first is that it works with unicode characters rather than ASCII only. The second is that it does not allocate any buffers besides strings of tabs, which should be extremely fast.



In both cases, having elements and indices sorted into ascending order by index will speed things up dramatically. The first case reduces to



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


And the second becomes just



indices = np.asanyarray(indices)

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, e in zip(tabs, elements):
f.write('t' * t)
f.write(e)
f.write('t' * (n - indices[-1] - 1))





share|improve this answer























  • Thanks for the extensive answer. It tried the last approach, but I think there is a bug, because I need 't' also between consecutive elements as a delimitier. Besides, this massively slows down everything du to the large number of calls to file.write.

    – dmy
    Nov 16 '18 at 9:41











  • I tried everything now and its just getting worse. Seems like time to surrender...

    – dmy
    Nov 16 '18 at 12:03











  • @dmy. I'll play with it some more once I'm on a desktop. This is interesting. Remember to include the I/O in your timing. That's likely to be a bigger bottleneck than the in-memory stuff.

    – Mad Physicist
    Nov 16 '18 at 13:54











  • I'm using asyncio for this reason so I made the CPU to be my bottleneck.

    – dmy
    Nov 16 '18 at 14:04











  • If you're writing to a file, you're writing to a file. Writing x number of bytes will take a fixed amount of time, which will generally only go up if you try to parallelize with other disk operations.

    – Mad Physicist
    Nov 16 '18 at 14:07













2












2








2







You can pre-sort your data, after converting it to a pair of numpy arrays. This will allow you to manipulate a single pre-existing buffer rather than copying strings over and over as you reallocate them. The difference between my suggestion and your attempt is that we will use ndarray.tobytes (or ndarray.tostring) with the assumption that you only have ASCII characters. In fact, you can completely bypass the copy operation involved in converting into a bytes object by using ndarray.tofile directly.



If you have elements in-hand, you know that the total length of your line will be the combined length of the elements and n-1 tab separators. The start of an element in the full string is therefore it's index (the number of tabs that precede it) plus the cumulative length of all the elements that come before it. Here is a simple implementation of a single-buffer fill using mostly Python loops:



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)
elements = np.array(elements, dtype='S7')
order = np.argsort(indices)

elements = elements[order]
indices = indices[order]
lengths = lengths[order]

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


There are lots of possible optimizations to play with here, such as the possibility of allocating s using np.empty and only filling in the required elements with tabs. This will be left as an excise for the reader.



Another possibility is to avoid converting elements to a numpy array entirely (it probably just wastes space and time). You can then rewrite the for loop as



for i, l, o in zip(cumulative, lengths, order):
s[i:i + l].view(dtype=('S', l))[:] = elements[o]


You can dump the result into a bytes object with



s = s.tobytes()


OR



s = s.tostring()


You can write the result as-is to a file opened for binary writing. In fact, if you don't need a copy of the buffer in the form of a bytes, you can just write to the file directly:



s.tofile(f)


That will save you some memory and processing time.



In the same vein, you may be better off just writing to a file directly, piece by piece. This saves you not only the need to allocate the full buffer, but also the cumulative lengths. In fact, the only thing you need this way is the diff of successive indices to tell you how many tabs to insert:



indices = np.asanyarray(indices)
order = np.argsort(indices)

indices = indices[order]

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, o in zip(tabs, order):
f.write('t' * t)
f.write(elements[o])
f.write('t' * (n - indices[-1] - 1))


This second approach has two major advantages besides the reduced amount of calculation. The first is that it works with unicode characters rather than ASCII only. The second is that it does not allocate any buffers besides strings of tabs, which should be extremely fast.



In both cases, having elements and indices sorted into ascending order by index will speed things up dramatically. The first case reduces to



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


And the second becomes just



indices = np.asanyarray(indices)

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, e in zip(tabs, elements):
f.write('t' * t)
f.write(e)
f.write('t' * (n - indices[-1] - 1))





share|improve this answer













You can pre-sort your data, after converting it to a pair of numpy arrays. This will allow you to manipulate a single pre-existing buffer rather than copying strings over and over as you reallocate them. The difference between my suggestion and your attempt is that we will use ndarray.tobytes (or ndarray.tostring) with the assumption that you only have ASCII characters. In fact, you can completely bypass the copy operation involved in converting into a bytes object by using ndarray.tofile directly.



If you have elements in-hand, you know that the total length of your line will be the combined length of the elements and n-1 tab separators. The start of an element in the full string is therefore it's index (the number of tabs that precede it) plus the cumulative length of all the elements that come before it. Here is a simple implementation of a single-buffer fill using mostly Python loops:



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)
elements = np.array(elements, dtype='S7')
order = np.argsort(indices)

elements = elements[order]
indices = indices[order]
lengths = lengths[order]

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


There are lots of possible optimizations to play with here, such as the possibility of allocating s using np.empty and only filling in the required elements with tabs. This will be left as an excise for the reader.



Another possibility is to avoid converting elements to a numpy array entirely (it probably just wastes space and time). You can then rewrite the for loop as



for i, l, o in zip(cumulative, lengths, order):
s[i:i + l].view(dtype=('S', l))[:] = elements[o]


You can dump the result into a bytes object with



s = s.tobytes()


OR



s = s.tostring()


You can write the result as-is to a file opened for binary writing. In fact, if you don't need a copy of the buffer in the form of a bytes, you can just write to the file directly:



s.tofile(f)


That will save you some memory and processing time.



In the same vein, you may be better off just writing to a file directly, piece by piece. This saves you not only the need to allocate the full buffer, but also the cumulative lengths. In fact, the only thing you need this way is the diff of successive indices to tell you how many tabs to insert:



indices = np.asanyarray(indices)
order = np.argsort(indices)

indices = indices[order]

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, o in zip(tabs, order):
f.write('t' * t)
f.write(elements[o])
f.write('t' * (n - indices[-1] - 1))


This second approach has two major advantages besides the reduced amount of calculation. The first is that it works with unicode characters rather than ASCII only. The second is that it does not allocate any buffers besides strings of tabs, which should be extremely fast.



In both cases, having elements and indices sorted into ascending order by index will speed things up dramatically. The first case reduces to



lengths = np.array([len(e) for e in elements])
indices = np.asanyarray(indices)

cumulative = np.empty_like(lengths)
cumulative[0] = 0
np.cumsum(lengths[:-1], out=cumulative[1:])
cumulative += lengths

s = np.full(cumulative[-1] + n - 1, 't', dtype='S1')
for i, l, e in zip(cumulative, lengths, elements):
s[i:i + l].view(dtype=('S', l))[:] = e


And the second becomes just



indices = np.asanyarray(indices)

tabs = np.empty_like(indices)
tabs[0] = indices[0]
np.subtract(indices[1:], indices[:-1], out=tabs[1:])
tabs[1:] -= 1

for t, e in zip(tabs, elements):
f.write('t' * t)
f.write(e)
f.write('t' * (n - indices[-1] - 1))






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 16 '18 at 5:57









Mad PhysicistMad Physicist

38.1k1677109




38.1k1677109












  • Thanks for the extensive answer. It tried the last approach, but I think there is a bug, because I need 't' also between consecutive elements as a delimitier. Besides, this massively slows down everything du to the large number of calls to file.write.

    – dmy
    Nov 16 '18 at 9:41











  • I tried everything now and its just getting worse. Seems like time to surrender...

    – dmy
    Nov 16 '18 at 12:03











  • @dmy. I'll play with it some more once I'm on a desktop. This is interesting. Remember to include the I/O in your timing. That's likely to be a bigger bottleneck than the in-memory stuff.

    – Mad Physicist
    Nov 16 '18 at 13:54











  • I'm using asyncio for this reason so I made the CPU to be my bottleneck.

    – dmy
    Nov 16 '18 at 14:04











  • If you're writing to a file, you're writing to a file. Writing x number of bytes will take a fixed amount of time, which will generally only go up if you try to parallelize with other disk operations.

    – Mad Physicist
    Nov 16 '18 at 14:07

















  • Thanks for the extensive answer. It tried the last approach, but I think there is a bug, because I need 't' also between consecutive elements as a delimitier. Besides, this massively slows down everything du to the large number of calls to file.write.

    – dmy
    Nov 16 '18 at 9:41











  • I tried everything now and its just getting worse. Seems like time to surrender...

    – dmy
    Nov 16 '18 at 12:03











  • @dmy. I'll play with it some more once I'm on a desktop. This is interesting. Remember to include the I/O in your timing. That's likely to be a bigger bottleneck than the in-memory stuff.

    – Mad Physicist
    Nov 16 '18 at 13:54











  • I'm using asyncio for this reason so I made the CPU to be my bottleneck.

    – dmy
    Nov 16 '18 at 14:04











  • If you're writing to a file, you're writing to a file. Writing x number of bytes will take a fixed amount of time, which will generally only go up if you try to parallelize with other disk operations.

    – Mad Physicist
    Nov 16 '18 at 14:07
















Thanks for the extensive answer. It tried the last approach, but I think there is a bug, because I need 't' also between consecutive elements as a delimitier. Besides, this massively slows down everything du to the large number of calls to file.write.

– dmy
Nov 16 '18 at 9:41





Thanks for the extensive answer. It tried the last approach, but I think there is a bug, because I need 't' also between consecutive elements as a delimitier. Besides, this massively slows down everything du to the large number of calls to file.write.

– dmy
Nov 16 '18 at 9:41













I tried everything now and its just getting worse. Seems like time to surrender...

– dmy
Nov 16 '18 at 12:03





I tried everything now and its just getting worse. Seems like time to surrender...

– dmy
Nov 16 '18 at 12:03













@dmy. I'll play with it some more once I'm on a desktop. This is interesting. Remember to include the I/O in your timing. That's likely to be a bigger bottleneck than the in-memory stuff.

– Mad Physicist
Nov 16 '18 at 13:54





@dmy. I'll play with it some more once I'm on a desktop. This is interesting. Remember to include the I/O in your timing. That's likely to be a bigger bottleneck than the in-memory stuff.

– Mad Physicist
Nov 16 '18 at 13:54













I'm using asyncio for this reason so I made the CPU to be my bottleneck.

– dmy
Nov 16 '18 at 14:04





I'm using asyncio for this reason so I made the CPU to be my bottleneck.

– dmy
Nov 16 '18 at 14:04













If you're writing to a file, you're writing to a file. Writing x number of bytes will take a fixed amount of time, which will generally only go up if you try to parallelize with other disk operations.

– Mad Physicist
Nov 16 '18 at 14:07





If you're writing to a file, you're writing to a file. Writing x number of bytes will take a fixed amount of time, which will generally only go up if you try to parallelize with other disk operations.

– Mad Physicist
Nov 16 '18 at 14:07



















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53320519%2fimprove-runtime-of-python-code-with-indexing-and-string-concatenation%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

How to read a connectionString WITH PROVIDER in .NET Core?

In R, how to develop a multiplot heatmap.2 figure showing key labels successfully

Museum of Modern and Contemporary Art of Trento and Rovereto