Remove HTML elements inside Markdown

Goal

Transform a Markdown file that contains embedded HTML into pure Markdown.

Code: `in.md`

# Title

## Subtitle

### Sub-subtitle

<span><div>Line before image</div><div><br></div><div><img src="img.png" width=404 height=255><br></div><div><br></div><div>Line after image</div><div><br></div><div>Text</div></span><h1><span>Heading 1</span></h1><span><div>Text</div><div><br></div></span><h2><span>Heading 2</span></h2><span><div>Text</div></span><h3><span>Heading 3</span></h3><div><span>Text</span></div><div><span><br></span></div><span><div>Line before code</div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code><div><span style="">Line after code</span><br></div><div><span style=""><br></span></div><div><span style=""><a href="http://pandoc.org">Link</a></span></div><div><span style=""><br></span></div><div><ul><li>Unordered bullet 1<br></li><li>Unordered bullet 2<br></li></ul></div><div><span style=""><br></span></div><div><ol><li>Ordered bullet 1<br></li><li>Ordered bullet 2<br></li></ol></div><div><span style=""><br></span></div></span><blockquote style="margin:0 0 0 40px;border:none;padding:0px;"><span><div><span style="">Quote line 1</span></div></span><span><div><span style="">Quote line 2</span></div></span></blockquote><span><div><span style=""><br></span></div><div><span style="">Text</span></div><div><span style=""><br></span></div><div><i>Italic</i></div><div><i><br></i></div><div>Text</div><div></div></span>

Markdown text

More Markdown text

Attempts

I tried several Pandoc commands:

Attempt 1

pandoc -f markdown -t markdown_strict --atx-headers in.md -o out.md

Result: line breaks are added, but the HTML is not converted to Markdown.

Extracts from result

<h3>
<span>H3</span>
</h3>
<span>txt</span>

<span><br></span>

and

<ul>
<li>
bullet<br>
</li>
<li>
list<br>
</li>
</ul>

Running the same command a second time on the result changes nothing.

Attempt 2

pandoc -f markdown -t markdown_strict-native_divs-native_spans --atx-headers in.md -o out.md

Result

Same as above

Attempt 3

pandoc -f markdown-markdown_in_html_blocks -t markdown_strict-native_divs-native_spans --atx-headers in.md -o out.md

Result

Same as above with fewer line breaks

Attempt 4

pandoc -f markdown -t markdown_strict-native_divs-native_spans-raw_html --atx-headers in.md -o out.md

Extracts from result

All HTML elements are stripped out, but no Markdown is applied:

Heading 1
Text

Heading 2
Text

Heading 3
Text

and

Unordered bullet 1
Unordered bullet 2
Unordered bullet 3

Misc

I cannot adjust how in.md is generated originally.

Pandoc does not have to be part of the solution. However, using Pandoc seems to make sense because (1) the transformation needs to be executed by an Azure DevOps release pipeline, and running a simple command fits nicely in that workflow and (2) the desired result is simply one clean Markdown file.

I can script a solution using Regex (and will, if no other solution makes sense), but if a Pandoc command (or another solution) accomplishes it, that seems less prone to my human error.

Thank you for any thoughts or advice.

asked Nov 15 '18 at 1:24

hcdocs

shell markdown pandoc

1 Answer
My suggestion is to convert the full document to HTML first, then convert the result to your desired Markdown format:

pandoc --from=markdown --to=html in.md | 
 pandoc --from=html --to=markdown-raw_html-native_divs --output out.md

Note that the input seems to contain invalid HTML (e.g., div must not occur in span or code elements per the HTML standard), so the embedded HTML doesn't quite mean what it's supposed to mean.

One will also notice some spans containing only newlines, which make the output look ugly. The best solution for this would be to remove them via a pandoc filter.
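
The filter idea above could be sketched in Python along the following lines. This is only a minimal sketch, not the answer author's filter: it assumes pandoc's JSON filter interface (the document AST arrives as JSON on stdin and the modified AST is written to stdout) and uses only the standard library. The helper names `is_blank_span` and `prune`, and the file name used afterwards, are hypothetical.

```python
#!/usr/bin/env python3
"""Pandoc JSON filter sketch: drop spans that hold only breaks or spaces."""
import json
import sys

# Inline node types treated as "blank" content (assumption for this sketch).
BLANK = {"LineBreak", "SoftBreak", "Space"}


def is_blank_span(node):
    """True for a Span whose inlines are all breaks or spaces (or absent)."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    _attr, inlines = node["c"]
    return all(inline.get("t") in BLANK for inline in inlines)


def prune(node):
    """Return a copy of an AST fragment with blank spans removed."""
    if isinstance(node, list):
        # Prune children first so that nested blank spans collapse too.
        pruned = [prune(child) for child in node]
        return [child for child in pruned if not is_blank_span(child)]
    if isinstance(node, dict):
        return {key: prune(value) for key, value in node.items()}
    return node


if __name__ == "__main__":
    # Do nothing when no document is piped in (e.g. run interactively).
    raw = "" if sys.stdin.isatty() else sys.stdin.read()
    if raw.strip():
        json.dump(prune(json.loads(raw)), sys.stdout)
```

Assuming the script is saved as `drop_blank_spans.py`, it could be passed to the second pandoc call with `--filter drop_blank_spans.py`. Note that the sketch drops blank spans regardless of their attributes, so a span that exists only to carry an id would be removed as well.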

edited Nov 15 '18 at 15:47

answered Nov 15 '18 at 9:00

tarleb
