Remove HTML elements inside Markdown
Goal
Transform a Markdown file that contains embedded HTML into pure Markdown.
Code: in.md
# Title
## Subtitle
### Sub-subtitle
<span><div>Line before image</div><div><br></div><div><img src="img.png" width=404 height=255><br></div><div><br></div><div>Line after image</div><div><br></div><div>Text</div></span><h1><span>Heading 1</span></h1><span><div>Text</div><div><br></div></span><h2><span>Heading 2</span></h2><span><div>Text</div></span><h3><span>Heading 3</span></h3><div><span>Text</span></div><div><span><br></span></div><span><div>Line before code</div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code><div><span style="">Line after code</span><br></div><div><span style=""><br></span></div><div><span style=""><a href="http://pandoc.org">Link</a></span></div><div><span style=""><br></span></div><div><ul><li>Unordered bullet 1<br></li><li>Unordered bullet 2<br></li></ul></div><div><span style=""><br></span></div><div><ol><li>Ordered bullet 1<br></li><li>Ordered bullet 2<br></li></ol></div><div><span style=""><br></span></div></span><blockquote style="margin:0 0 0 40px;border:none;padding:0px;"><span><div><span style="">Quote line 1</span></div></span><span><div><span style="">Quote line 2</span></div></span></blockquote><span><div><span style=""><br></span></div><div><span style="">Text</span></div><div><span style=""><br></span></div><div><i>Italic</i></div><div><i><br></i></div><div>Text</div><div></div></span>
Markdown text
More Markdown text
Attempts
I tried a number of Pandoc scripts:
Attempt 1
pandoc -f markdown -t markdown_strict --atx-headers in.md -o out.md
Line breaks added
No conversion
Extracts from result
<h3>
<span>H3</span>
</h3>
<span>txt</span>
<span><br></span>
and
<ul>
<li>
bullet<br>
</li>
<li>
list<br>
</li>
</ul>
Running the transformation command a second time on the result changes nothing.
Attempt 2
pandoc -f markdown -t markdown_strict-native_divs-native_spans --atx-headers in.md -o out.md
Result
Same as above
Attempt 3
pandoc -f markdown-markdown_in_html_blocks -t markdown_strict-native_divs-native_spans --atx-headers in.md -o out.md
Result
Same as above with fewer line breaks
Attempt 4
pandoc -f markdown -t markdown_strict-native_divs-native_spans-raw_html --atx-headers in.md -o out.md
Extracts from result
All HTML elements are stripped out, but no Markdown is applied:
Heading 1
Text
Heading 2
Text
Heading 3
Text
and
Unordered bullet 1
Unordered bullet 2
Unordered bullet 3
Misc
I cannot adjust how in.md is generated originally.
Pandoc does not have to be part of the solution. However, using Pandoc seems to make sense because (1) the transformation needs to be executed by an Azure DevOps release pipeline, and running a simple command fits nicely in that workflow, and (2) the desired result is simply one clean Markdown file.
I can script a solution using Regex (and will, if no other solution makes sense), but if a Pandoc command (or another solution) accomplishes it, that seems less prone to my human error.
Thank you for any thoughts or advice.
shell markdown pandoc
asked Nov 15 '18 at 1:24
hcdocs
1 Answer
My suggestion is to convert the full document to HTML first, then convert the result to your desired Markdown format:
pandoc --from=markdown --to=html in.md |
pandoc --from=html --to=markdown-raw_html-native_divs --output out.md
Note that the input seems to contain invalid HTML (e.g., div must not occur in span or code elements per the HTML standard), so the embedded HTML doesn't quite mean what it's supposed to mean.
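To see where the input breaks that nesting rule, a small check along the following lines may help. It is only a sketch using Python's standard html.parser module; the class and function names (DivNestingChecker, find_misnested_divs) are made up for this illustration, and the sample string is a shortened piece of the in.md input from the question. It is not a full HTML validator: it only looks for a div opened while a span or code element is still open.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Parents named in the answer: they may only contain inline (phrasing) content.
PHRASING_ONLY_PARENTS = {"span", "code"}


class DivNestingChecker(HTMLParser):
    """Records every <div> opened while a <span> or <code> is still open."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.offending_parents = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            parents = [t for t in self.open_tags if t in PHRASING_ONLY_PARENTS]
            if parents:
                # Remember the innermost span/code that wraps this div.
                self.offending_parents.append(parents[-1])
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def find_misnested_divs(html):
    """Return the innermost span/code parent for each misplaced div."""
    checker = DivNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.offending_parents


# Shortened fragment of the in.md input shown in the question.
sample = (
    "<span><div>Line before code</div>"
    "<code><pre><code><div>Code line 1</div></code></pre></code></span>"
)
print(find_misnested_divs(sample))
```

Each entry in the returned list names the inline element that wraps a div, which gives a rough idea of how much of the input pandoc has to interpret leniently.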
One will also notice some spans containing only newlines, which make the output look ugly. The best solution for this would be to remove them via a pandoc filter.
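The answer does not include such a filter, so the following is only a sketch of what one could look like. It works on pandoc's JSON representation of a document, where a span appears as {"t": "Span", "c": [attributes, inlines]}, and drops every span whose content is nothing but line breaks or spaces (or is empty). The function names are hypothetical, and removing empty spans is an assumption that may be wrong for documents that use empty spans as link anchors.

```python
import json

# Inline element types that count as "only whitespace" inside a span.
BLANK_INLINES = {"LineBreak", "SoftBreak", "Space"}


def is_blank_span(node):
    """True for a Span whose inlines are all breaks/spaces (or empty)."""
    return (
        isinstance(node, dict)
        and node.get("t") == "Span"
        and all(
            isinstance(inline, dict) and inline.get("t") in BLANK_INLINES
            for inline in node["c"][1]
        )
    )


def strip_blank_spans(node):
    """Recursively copy a pandoc JSON tree without blank spans."""
    if isinstance(node, list):
        # Clean the children first so that a span wrapping only blank
        # spans becomes empty and is removed as well.
        cleaned = [strip_blank_spans(child) for child in node]
        return [child for child in cleaned if not is_blank_span(child)]
    if isinstance(node, dict):
        return {key: strip_blank_spans(value) for key, value in node.items()}
    return node


# Hand-written fragment: a paragraph with text followed by <span style=""><br></span>.
fragment = {
    "t": "Para",
    "c": [
        {"t": "Str", "c": "Text"},
        {"t": "Span", "c": [["", [], [["style", ""]]], [{"t": "LineBreak"}]]},
    ],
}
print(json.dumps(strip_blank_spans(fragment)))
```

To run this as a real filter, the script would read the whole JSON document from standard input, pass it through strip_blank_spans, and write the result to standard output; pandoc's --filter option feeds filters that way. A Lua filter would avoid the Python dependency in the pipeline, but the logic would be the same.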
edited Nov 15 '18 at 15:47
answered Nov 15 '18 at 9:00
tarleb