Remove HTML elements inside Markdown










Goal



Transform a Markdown file that contains embedded HTML into pure Markdown.



Code: in.md



# Title

## Subtitle

### Sub-subtitle

<span><div>Line before image</div><div><br></div><div><img src="img.png" width=404 height=255><br></div><div><br></div><div>Line after image</div><div><br></div><div>Text</div></span><h1><span>Heading 1</span></h1><span><div>Text</div><div><br></div></span><h2><span>Heading 2</span></h2><span><div>Text</div></span><h3><span>Heading 3</span></h3><div><span>Text</span></div><div><span><br></span></div><span><div>Line before code</div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code><div><span style="">Line after code</span><br></div><div><span style=""><br></span></div><div><span style=""><a href="http://pandoc.org">Link</a></span></div><div><span style=""><br></span></div><div><ul><li>Unordered bullet 1<br></li><li>Unordered bullet 2<br></li></ul></div><div><span style=""><br></span></div><div><ol><li>Ordered bullet 1<br></li><li>Ordered bullet 2<br></li></ol></div><div><span style=""><br></span></div></span><blockquote style="margin:0 0 0 40px;border:none;padding:0px;"><span><div><span style="">Quote line 1</span></div></span><span><div><span style="">Quote line 2</span></div></span></blockquote><span><div><span style=""><br></span></div><div><span style="">Text</span></div><div><span style=""><br></span></div><div><i>Italic</i></div><div><i><br></i></div><div>Text</div><div></div></span>

Markdown text

More Markdown text


Attempts



I tried several Pandoc commands:



Attempt 1



pandoc -f markdown -t markdown_strict --atx-headers in.md -o out.md



  • Line breaks are added.
  • The HTML is not converted to Markdown.

Extracts from the result:



<h3>
<span>H3</span>
</h3>
<span>txt</span>

<span><br></span>


and



<ul>
<li>
bullet<br>
</li>
<li>
list<br>
</li>
</ul>


Running the same command a second time on the result changes nothing.



Attempt 2



pandoc -f markdown -t markdown_strict-native_divs-native_spans --atx-headers in.md -o out.md



Result: same as Attempt 1.



Attempt 3



pandoc -f markdown-markdown_in_html_blocks -t markdown_strict-native_divs-native_spans --atx-headers in.md -o out.md



Result: same as Attempt 1, but with fewer line breaks.



Attempt 4



pandoc -f markdown -t markdown_strict-native_divs-native_spans-raw_html --atx-headers in.md -o out.md



All HTML elements are stripped out, but no Markdown formatting is applied. Extracts from the result:



Heading 1
Text

Heading 2
Text

Heading 3
Text


and



Unordered bullet 1
Unordered bullet 2
Unordered bullet 3


Misc



  • I cannot change how in.md is originally generated.
  • Pandoc does not have to be part of the solution. However, using Pandoc seems to make sense because (1) the transformation needs to be executed by an Azure DevOps release pipeline, and running a simple command fits nicely into that workflow, and (2) the desired result is simply one clean Markdown file.
  • I can script a solution using regular expressions (and will, if no other solution makes sense), but a Pandoc command (or another solution) seems less prone to human error on my part.

Thank you for any thoughts or advice.










Tags: shell, markdown, pandoc

asked Nov 15 '18 at 1:24 by hcdocs






















1 Answer

My suggestion is to convert the full document to HTML first, then convert the result to your desired Markdown format:

pandoc --from=markdown --to=html in.md |
pandoc --from=html --to=markdown-raw_html-native_divs --output out.md

Note that the input seems to contain invalid HTML (e.g., per the HTML standard, div must not occur inside span or code elements), so the embedded HTML doesn't quite mean what it was intended to mean.

You will also notice some spans that contain only line breaks, which make the output look ugly. The best solution would be to remove them with a pandoc filter.
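A pandoc filter is the cleaner route for removing those empty elements. As a cruder stand-in, the sketch below strips the empty wrappers with sed before the two pandoc passes run. It is only a rough illustration: the function names strip_empty_wrappers and convert are made up for this example, the patterns cover only the wrapper shapes that appear in the sample in.md, and regex-based HTML editing is brittle on input that is formatted differently.

```shell
#!/bin/sh
# Sketch only: a sed pre-pass, NOT the pandoc filter suggested in the answer.
# strip_empty_wrappers and convert are hypothetical helper names.

# Remove <div> wrappers whose only content is a <br>, optionally wrapped
# in a <span ...> or <i>, plus completely empty <div></div> pairs.
# Reads HTML-laden Markdown on stdin, writes the cleaned text to stdout.
strip_empty_wrappers() {
  sed -e 's#<div><span[^>]*><br></span></div>##g' \
      -e 's#<div><i><br></i></div>##g' \
      -e 's#<div><br></div>##g' \
      -e 's#<div></div>##g'
}

# Clean the input, then run the two pandoc passes from the answer.
# Usage: convert in.md out.md
convert() {
  strip_empty_wrappers < "$1" |
    pandoc --from=markdown --to=html |
    pandoc --from=html --to=markdown-raw_html-native_divs --output "$2"
}
```

In a release pipeline this could be saved as a script and called as `convert in.md out.md`. Elements that carry real content, such as the div holding the image, are left untouched by the patterns.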






answered Nov 15 '18 at 9:00 by tarleb (edited Nov 15 '18 at 15:47)




























