Python Numpy: Structured Arrays vs Same Datatype Array Operation Cost










I want to create an array of arrays of the structure:



[line_number,count,temperature,humidity,sensor1_on,sensor2_on]


Where the first two need to be uint32, while temperature and humidity can be uint8, and the sensor_ons can be of type bool.



I later need to sort the 2d array based on the combination of line_number and then count. I also need to perform averages and other statistical computation on lists of all the temperature and humidity data (separately).



I found structured arrays which are convenient for data storage and retrieval:



np_data=np.zeros([num_lines],
dtype='uint32,'#Line No
'uint32,'# Count
'uint8,' #TEMP
'uint8,' #HUMID
'bool,' #S1 On
'bool'#S2 On
)


versus a single homogeneous array:



np_data=np.zeros([num_lines,5],dtype='uint32') 
# I would pack my bools into the last uint32 and then unpack later
# but it seems like a waste of space


Do I lose anything (numpy processing power, vectorized processing, sorting speed, etc) by creating the structured array vs the one with all the same data types? Is there another solution one would recommend?










  • I think you just need to do some timings on realistic data. We can make guesses from experience, but they'll be just that - guesses.

    – hpaulj
    Nov 15 '18 at 0:34















python arrays numpy






edited Nov 14 '18 at 23:47

Joel

asked Nov 14 '18 at 23:19

azazelspeaks

1 Answer
I ran some performance tests on several array types. The results are posted as an answer to this question:
is ndarray faster than recarray access?

(Ignore the downvote on my question. Apparently someone didn't like how I asked it.)



The short version: extracting data from a masked array was much slower than the same operation on an ndarray. Access to a structured array or a recarray was also slower than access to a plain ndarray, but every case finished in a fraction of a second. Masked arrays clearly carry overhead (maybe similar to a record array?). There is a good discussion of the differences between these array types here:
numpy-discussion:structured-arrays-recarrays-and-record-arrays
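If you want to check this on your own data, a timing comparison could look roughly like the sketch below. It is not the test I ran; the row count, the field names, and the random fill values are all assumptions made for illustration, and the numbers it prints will depend on your machine and NumPy version.

```python
import timeit

import numpy as np

num_lines = 100_000  # assumed size; substitute a realistic row count
rng = np.random.default_rng(0)

# Structured layout with named fields (the names are illustrative only).
dt = np.dtype([
    ("line_number", "u4"), ("count", "u4"),
    ("temperature", "u1"), ("humidity", "u1"),
    ("sensor1_on", "?"), ("sensor2_on", "?"),
])
structured = np.zeros(num_lines, dtype=dt)
structured["temperature"] = rng.integers(0, 100, num_lines)

# Homogeneous layout from the question: one uint32 column per value.
plain = np.zeros((num_lines, 5), dtype="uint32")
plain[:, 2] = structured["temperature"]

# Time the same statistic on both layouts.
t_struct = timeit.timeit(lambda: structured["temperature"].mean(), number=200)
t_plain = timeit.timeit(lambda: plain[:, 2].mean(), number=200)

print(f"structured field mean: {t_struct:.4f} s for 200 runs")
print(f"ndarray column mean:   {t_plain:.4f} s for 200 runs")
```

Both layouts return the same mean; only the elapsed time differs, so repeat the measurement with the operations you actually care about (sorting, slicing, statistics).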



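There are other limitations. For example, many (most/all) of the NumPy matrix and math operations require a single numeric data type, so they cannot be applied to a structured array as a whole. I don't think this affects your case, since you are using the structured array like a table: each field you pull out (for example the temperature column) is a regular single-type array that supports the usual statistics.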
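As a rough sketch of that table-style usage, the snippet below sorts a small structured array by line number and then count, and averages two fields separately. The field names and the three sample rows are made up for illustration; `np.sort` with `order=` and `np.lexsort` are standard NumPy calls, but which one is faster for your data is something to measure rather than assume.

```python
import numpy as np

# Named fields instead of the default f0..f5 (names are illustrative).
dt = np.dtype([
    ("line_number", "u4"), ("count", "u4"),
    ("temperature", "u1"), ("humidity", "u1"),
    ("sensor1_on", "?"), ("sensor2_on", "?"),
])

# Made-up sample rows.
data = np.array(
    [
        (2, 1, 20, 40, True, False),
        (1, 2, 22, 42, False, False),
        (1, 1, 24, 44, True, True),
    ],
    dtype=dt,
)

# Sort by line_number first, then by count within each line_number.
sorted_data = np.sort(data, order=["line_number", "count"])

# Alternative: lexsort on the plain field arrays. The LAST key is the
# primary sort key, so line_number goes last.
idx = np.lexsort((data["count"], data["line_number"]))
sorted_via_lexsort = data[idx]

# Each field is an ordinary single-type array, so statistics work directly.
mean_temp = sorted_data["temperature"].mean()
mean_humid = sorted_data["humidity"].mean()

print(sorted_data["line_number"], sorted_data["count"])
print(mean_temp, mean_humid)
```

Sorting on the extracted field arrays with `np.lexsort` avoids comparing whole records, which may matter on large inputs; timing both variants on realistic data is the only way to know.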






edited Nov 15 '18 at 16:46

answered Nov 15 '18 at 15:13

kcw78
