Parquet binary UTF8 as string on hive










0














There is a parquet file with a binary (UTF8) column named abc in it.

Is it possible to create an external table on hive that will contain the same column abc, but casted to string?



The structure of the parquet file:



$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);










share|improve this question























  • Try select cast(abc as string) from mytable or similar? There's also decode function or base64. If you explain what is inside binary data that would be more useful.
    – Hitobat
    Nov 12 '18 at 18:24










  • @Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
    – dmigo
    Nov 13 '18 at 12:40










  • Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
    – Hitobat
    Nov 13 '18 at 14:00















0














There is a parquet file with a binary (UTF8) column named abc in it.

Is it possible to create an external table on hive that will contain the same column abc, but casted to string?



The structure of the parquet file:



$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);










share|improve this question























  • Try select cast(abc as string) from mytable or similar? There's also decode function or base64. If you explain what is inside binary data that would be more useful.
    – Hitobat
    Nov 12 '18 at 18:24










  • @Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
    – dmigo
    Nov 13 '18 at 12:40










  • Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
    – Hitobat
    Nov 13 '18 at 14:00













0












0








0







There is a parquet file with a binary (UTF8) column named abc in it.

Is it possible to create an external table on hive that will contain the same column abc, but casted to string?



The structure of the parquet file:



$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);










share|improve this question















There is a parquet file with a binary (UTF8) column named abc in it.

Is it possible to create an external table on hive that will contain the same column abc, but casted to string?



The structure of the parquet file:



$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);







hive parquet sqldatatypes






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 13 '18 at 14:23

























asked Nov 12 '18 at 18:06









dmigo

557936




557936











  • Try select cast(abc as string) from mytable or similar? There's also decode function or base64. If you explain what is inside binary data that would be more useful.
    – Hitobat
    Nov 12 '18 at 18:24










  • @Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
    – dmigo
    Nov 13 '18 at 12:40










  • Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
    – Hitobat
    Nov 13 '18 at 14:00
















  • Try select cast(abc as string) from mytable or similar? There's also decode function or base64. If you explain what is inside binary data that would be more useful.
    – Hitobat
    Nov 12 '18 at 18:24










  • @Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
    – dmigo
    Nov 13 '18 at 12:40










  • Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
    – Hitobat
    Nov 13 '18 at 14:00















Try select cast(abc as string) from mytable or similar? There's also decode function or base64. If you explain what is inside binary data that would be more useful.
– Hitobat
Nov 12 '18 at 18:24




Try select cast(abc as string) from mytable or similar? There's also decode function or base64. If you explain what is inside binary data that would be more useful.
– Hitobat
Nov 12 '18 at 18:24












@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40




@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40












Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00




Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00












2 Answers
2






active

oldest

votes


















1














There are three different types involved:



  • There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.

  • Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:

    • The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.

    • The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.


The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.






share|improve this answer






























    0














    Apparently, you can simply specify string as the type of the column and it will be taken care of.



    CREATE EXTERNAL TABLE `dataset`( 
    `abc` string)
    STORED AS parquet
    LOCATION
    '...';





    share|improve this answer






















      Your Answer






      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "1"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53267746%2fparquet-binary-utf8-as-string-on-hive%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      1














      There are three different types involved:



      • There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.

      • Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:

        • The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.

        • The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.


      The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.






      share|improve this answer



























        1














        There are three different types involved:



        • There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.

        • Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:

          • The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.

          • The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.


        The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.






        share|improve this answer

























          1












          1








          1






          There are three different types involved:



          • There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.

          • Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:

            • The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.

            • The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.


          The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.






          share|improve this answer














          There are three different types involved:



          • There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.

          • Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:

            • The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.

            • The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.


          The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 14 '18 at 18:33

























          answered Nov 14 '18 at 17:32









          Zoltan

          1,128313




          1,128313























              0














              Apparently, you can simply specify string as the type of the column and it will be taken care of.



              CREATE EXTERNAL TABLE `dataset`( 
              `abc` string)
              STORED AS parquet
              LOCATION
              '...';





              share|improve this answer



























                0














                Apparently, you can simply specify string as the type of the column and it will be taken care of.



                CREATE EXTERNAL TABLE `dataset`( 
                `abc` string)
                STORED AS parquet
                LOCATION
                '...';





                share|improve this answer

























                  0












                  0








                  0






                  Apparently, you can simply specify string as the type of the column and it will be taken care of.



                  CREATE EXTERNAL TABLE `dataset`( 
                  `abc` string)
                  STORED AS parquet
                  LOCATION
                  '...';





                  share|improve this answer














                  Apparently, you can simply specify string as the type of the column and it will be taken care of.



                  CREATE EXTERNAL TABLE `dataset`( 
                  `abc` string)
                  STORED AS parquet
                  LOCATION
                  '...';






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Nov 13 '18 at 14:36

























                  answered Nov 13 '18 at 14:24









                  dmigo

                  557936




                  557936



























                      draft saved

                      draft discarded
















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid


                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.

                      To learn more, see our tips on writing great answers.





                      Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                      Please pay close attention to the following guidance:


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid


                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.

                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53267746%2fparquet-binary-utf8-as-string-on-hive%23new-answer', 'question_page');

                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      這個網誌中的熱門文章

                      How to read a connectionString WITH PROVIDER in .NET Core?

                      Node.js Script on GitHub Pages or Amazon S3

                      Museum of Modern and Contemporary Art of Trento and Rovereto