minimizing the cost of uploading a very large tar file to Google Cloud Storage

























I'm trying to upload and then untar a very large file (1.3 TB) into Google Cloud Storage at the lowest possible price.



I initially thought about creating a really cheap instance just to download the file and put it in a bucket, then creating a new instance with a good amount of RAM to untar the file and put the result in a new bucket.
However, since the bucket price depends on the number of I/O requests, I'm not sure this is the cheapest option, and it might not be the best for performance either.



What would be the best strategy to untar the file in the cheapest way?










      google-cloud-platform google-cloud-storage google-compute-engine bucket














      edited Nov 15 '18 at 19:58









      Dan











      asked Nov 15 '18 at 8:31









      user1672455























          1 Answer














          First, some background information on pricing:



          Google has pretty good documentation about how to ingest data into GCS. From that guide:




          Today, when you move data to Cloud Storage, there are no ingress traffic charges. The gsutil tool and the Storage Transfer Service are both offered at no charge. See the GCP network pricing page for the most up-to-date pricing details.




          The "network pricing page" just says:




          [Traffic type: Ingress] Price: No charge, unless there is a resource such as a load balancer that is processing ingress traffic. Responses to requests count as egress and are charged.




          There is additional information on the GCS pricing page about your idea to use a GCE VM to write to GCS:




          There are no network charges for accessing data in your Cloud Storage buckets when you do so with other GCP services in the following scenarios:



          • Your bucket and GCP service are located in the same multi-regional or regional location. For example, accessing data in an asia-east1 bucket with an asia-east1 Compute Engine instance.



          Later on that same page, there is also information about the per-request pricing:




          Class A Operations: storage.*.insert[1]



          [1] Simple, multipart, and resumable uploads with the JSON API are each considered one Class A operation.




          Class A operations are billed per 10,000 operations, at either $0.05 or $0.10 depending on the storage class. I believe you would only be doing 1 Class A operation (or at most, 1 Class A operation per file that you upload), so this probably wouldn't add up to much cost overall.
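          To make that arithmetic concrete, here is a rough sketch that estimates the Class A charge for a given number of uploaded files. The two rates are the ones quoted above and may have changed since; the count of 1,000,000 files is purely hypothetical (the question doesn't say how many files are in the tarball), and the function name is made up for illustration.

```python
from decimal import Decimal

# Rates quoted above: price in USD per 10,000 Class A operations.
RATE_LOW = Decimal("0.05")
RATE_HIGH = Decimal("0.10")


def class_a_cost(num_uploads, rate_per_10k):
    """Estimate the Class A charge, assuming one operation per uploaded file."""
    return (Decimal(num_uploads) / Decimal(10000)) * rate_per_10k


if __name__ == "__main__":
    files = 1_000_000  # hypothetical file count, not taken from the question
    for rate in (RATE_LOW, RATE_HIGH):
        cost = class_a_cost(files, rate)
        print(f"{files} uploads at ${rate} per 10,000 operations: ${cost:.2f}")
```

          Under those assumptions, even a million small files would cost only a few dollars in request charges, so the per-request pricing is unlikely to be what drives the total bill.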




          Now to answer your question:



          For your use case, it sounds like you want to have the files in the tarball be individual files in GCS (as opposed to just having a big tarball stored in one file in GCS). The first step is to untar it somewhere, and the second step is to use gsutil cp to copy it to GCS.
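          Those two steps could be scripted roughly as follows. This is only a sketch: the paths and the bucket URL passed in are placeholders you would supply, the helper names are invented for illustration, and it assumes gsutil is installed and authenticated on the machine and that there is enough free disk space to hold the extracted files.

```python
import os
import subprocess
import tarfile


def extract_tarball(tar_path, dest_dir):
    """Step 1: unpack the archive into dest_dir (created if missing)."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        # Newer Python versions offer extraction filters that reject unsafe
        # member paths; use one when it is available.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest_dir, **kwargs)


def build_upload_command(src_dir, bucket_url):
    """Step 2: the gsutil call; -m runs copies in parallel, -r recurses."""
    return ["gsutil", "-m", "cp", "-r", src_dir, bucket_url]


def untar_and_upload(tar_path, work_dir, bucket_url):
    """Run both steps; raises CalledProcessError if gsutil fails."""
    extract_tarball(tar_path, work_dir)
    subprocess.run(build_upload_command(work_dir, bucket_url), check=True)
```

          Keeping the command construction in its own function makes it easy to print and inspect the exact gsutil invocation before pointing it at 1.3 TB of data.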



          Unless you have to (for example, because there isn't enough space on the machine that currently holds the tarball), I wouldn't recommend copying the tarball to an intermediate VM in GCE before uploading to GCS, for two reasons:




          1. gsutil cp already handles a bunch of annoying edge cases for you: parallel uploads, resuming an upload in case there's a network failure, retries, checksum comparisons, etc.

          2. Using any GCE VMs will add cost to this whole copy operation -- costs for the disks plus costs for the VMs themselves.

          If you want to try the procedure out with something lower-risk first, make a small directory with a few megabytes of data and a few files and use gsutil cp to copy it, then check how much you were charged for that. From the GCS pricing page:




          Charges accrue daily, but Cloud Storage bills you only at the end of the billing period. You can view unbilled usage in your project's billing page in the Google Cloud Platform Console.




          So you'd just have to wait a day to see how much you were billed.






                edited Nov 15 '18 at 20:01

























                answered Nov 15 '18 at 19:08









                Dan





























