Delete duplicate entries in extremely big Database










Last week I accidentally introduced a bug in our website: it kept reporting an error when receiving data to insert into the database, even though the new entry was actually being added correctly. Because of that, the information kept being resent, the site kept reporting a write error and writing again, and every entry got duplicated anywhere from a few dozen to a few hundred times. I don't know why the loop eventually stopped, but at some point it returned an OK and stopped writing.



I didn't realize at first that the error came from my own code, so it took me a couple of days to fix it; meanwhile our database kept being bombarded with duplicates.



In hindsight I know I didn't handle this very well, but now I would like to fix what I broke.



I have been searching Stack Overflow for ways to clean up duplicate rows, and there are dozens of questions and answers on the topic, so I frankensteined together a solution that seems to work:



DELETE FROM app_info
WHERE `idinfo` NOT IN (SELECT minid
                       FROM (SELECT MIN(`idinfo`) AS minid
                             FROM app_info
                             GROUP BY `when`, `idbooth`, `iddongle`, `typeinfo`,
                                      `money`, `money2`, `currency`, `stock`,
                                      `i1`, `i2`, `i3`, `i4`, `i5`,
                                      `str1`, `str2`, `pbnew`,
                                      `in1`, `in2`, `in3`, `in4`,
                                      `in5`, `in6`, `in7`, `in8`) e)
  AND `idinfo` < 1545000
  AND `idinfo` > 1541500;


This works, but I have to run it on brackets of IDs and limit the search to each range; otherwise it takes too long and gets cancelled before doing anything. Even so, each batch takes 3 or 4 minutes during which the website stops working (the database is too busy, I guess), and at this rate it will take me ages to clean the whole DB.



I have been thinking of a way to streamline this. My idea was to select all distinct entries starting from the day I introduced the bug, grouping them to build a list, and then maybe write a PHP script that loops through that list, finds all the rows matching each item, and deletes all but one.



I have tried it; building the list is easy, I think:



SELECT * FROM `App_info` 
WHERE `when` > '2018-11-05'
GROUP BY `typeInfo` , `str2`
ORDER BY `App_info`.`when` ASC


But I have not been able to reformulate my first query to use the information that the second one generates. At first I supposed that substituting the WHERE condition I used to make the ID brackets with the values of the fields I grouped by in the second query would help, but if I do it like that I get 0 rows, so it does nothing.



Also, depending on the entry selected, if it has too many copies (like the ones with hundreds of duplicates) it brings the database down anyway, so this doesn't seem to be the solution I was looking for.



I don't know what else to try. How can I clean up the duplicate entries when there are so many of them, without bringing the database down? I guess hogging the DB will be inevitable, but I can schedule a maintenance downtime, so that would not be a problem.

























mysql

edited Nov 13 '18 at 14:55 – Madhur Bhaiya
asked Nov 13 '18 at 14:50 – Helwar












  • Thanks for formatting it

    – Helwar
    Nov 13 '18 at 14:56











  • I think you are missing Having COUNT(*) > 1 in the subquery

    – Madhur Bhaiya
    Nov 13 '18 at 15:01





























2 Answers






I suggest using a temporary table to store the duplicate IDs.



  1. Use the SELECT query that gives you the duplicate IDs to be deleted, but insert those IDs into a new table. With the GROUP BY this may take a while, but it will not lock the database.


  2. Run whatever tests you need to verify that the temp table contains only the IDs that need to be deleted.


  3. In a maintenance window, after a backup, run a DELETE with WHERE ID IN (SELECT ID FROM temp_table).


  4. If this still takes too long, you can delete in batches from the temp table (see the sketch below).


The major advantage is that you're not running that big heavy query on your table while at the same time locking the table for deletes, which may even cause a deadlock.
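
A minimal sketch of these steps, reusing the NOT IN query from the question. The helper-table name dup_ids, the assumed type of idinfo, the date cut-off, and the batch size are illustrative, not part of the original answer.

-- Step 1: collect the IDs of the rows to delete into a helper table.
-- This is the question's NOT IN query, restricted to the affected period and
-- turned into an INSERT ... SELECT, so app_info is only read, not locked for deletes.
CREATE TABLE dup_ids (idinfo INT UNSIGNED NOT NULL PRIMARY KEY);  -- column type assumed

INSERT INTO dup_ids (idinfo)
SELECT `idinfo`
FROM app_info
WHERE `when` > '2018-11-05'                 -- only rows written since the bug appeared
  AND `idinfo` NOT IN (SELECT minid
                       FROM (SELECT MIN(`idinfo`) AS minid
                             FROM app_info
                             WHERE `when` > '2018-11-05'
                             GROUP BY `when`, `idbooth`, `iddongle`, `typeinfo`,
                                      `money`, `money2`, `currency`, `stock`,
                                      `i1`, `i2`, `i3`, `i4`, `i5`,
                                      `str1`, `str2`, `pbnew`,
                                      `in1`, `in2`, `in3`, `in4`,
                                      `in5`, `in6`, `in7`, `in8`) keepers);

-- Step 2: sanity-check dup_ids (row counts, spot checks) before deleting anything.

-- Steps 3 and 4: in the maintenance window, delete in batches and rerun the
-- statement until it reports 0 rows affected.
DELETE FROM app_info
WHERE `idinfo` IN (SELECT idinfo FROM dup_ids)
LIMIT 10000;

Because each batch only deletes rows listed in dup_ids and is capped by LIMIT, locks on app_info are held briefly instead of for one huge statement.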






answered Nov 13 '18 at 15:53 – Cyrus























  • Sorry for taking so long to accept an answer. I was quite busy last week. This is what helped me most, thank you! I could do it in an hour approx, had to break it in a few batches but not as many as how I was doing earlier. Thanks again.

    – Helwar
    Nov 19 '18 at 7:08


















A more efficient option is to use an INSERT IGNORE query. The steps (sketched in SQL after the list):



  • Create a new working table with the same schema as the existing table.


  • Add a UNIQUE constraint on the columns that should identify a unique row.


  • Run INSERT IGNORE to copy the data from the original table into the new table. Duplicate rows violate the UNIQUE constraint and are silently skipped instead of raising an error.


  • Rename the original table to something else and rename the new table to the original name.


  • Drop the redundant table.
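
A rough sketch of these steps in MySQL, assuming the app_info table from the question. The working-table names and the column list in the UNIQUE key are illustrative; the full 24-column list from the question might exceed MySQL's index length limits, so only the columns that really identify a duplicate (or a hash of them) would be used in practice.

-- 1. New table with the same schema as the original.
CREATE TABLE app_info_dedup LIKE app_info;

-- 2. UNIQUE key over the columns that define a duplicate (subset shown here).
ALTER TABLE app_info_dedup
  ADD UNIQUE KEY uq_dedup (`when`, `idbooth`, `iddongle`, `typeinfo`, `str2`);

-- 3. Copy the data; rows that violate the UNIQUE key are silently skipped.
INSERT IGNORE INTO app_info_dedup SELECT * FROM app_info;

-- 4. Swap the tables (RENAME TABLE is atomic, but requires non-TEMPORARY tables).
RENAME TABLE app_info TO app_info_old, app_info_dedup TO app_info;

-- 5. Drop the old copy once the new table has been verified.
DROP TABLE app_info_old;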

Hope this helps.






answered Nov 13 '18 at 16:04 – TeeKea





















