MarkItDown: Microsoft's Tool That Converts Any Document into Context for Your AI

If you’re building a RAG pipeline, an AI assistant, or any flow where an LLM needs to read real documents, you’ve hit the same wall: your data isn’t clean. It’s an HR Word doc, a PDF from your vendor, a PowerPoint from last quarter’s planning. None of that is ready for an LLM by default.

Microsoft has an answer for that. MarkItDown is an open-source Python library that converts virtually any document format into clean, structured Markdown — the format that LLMs actually understand well. It was born inside Microsoft Research as an internal tool for the AutoGen multi-agent framework, released as open-source at the end of 2024, and has since accumulated over 91,000 stars on GitHub. The latest stable version, v0.1.5, came out on February 20, 2026.


Why Markdown for LLMs?

It’s not arbitrary. Markdown occupies an ideal middle ground: it’s close to plain text (low token overhead), but preserves document structure — headings, lists, tables, links. Mainstream models are trained extensively on Markdown and handle it natively. When you feed an LLM a raw PDF full of messy whitespace and lost hierarchy, retrieval quality drops. When you feed it clean Markdown, chunking, embedding, and citations work much better.

MarkItDown’s job is to bridge the gap between messy source material and reliable input for the LLM.
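
To make the chunking point concrete, here is a minimal sketch of heading-based chunking over Markdown text. It uses only the standard library and is not part of MarkItDown; the function name chunk_by_heading and the "path"/"text" keys are made up for illustration. It also assumes no "#" lines appear inside code blocks, which a real splitter would need to handle.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading, keeping the heading trail."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headings
    body: list[str] = []              # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            trail = " > ".join(title for _, title in path)
            chunks.append({"path": trail, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close any heading at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading trail (for example "Report > Revenue"), which can be stored next to the embedding and reused when the assistant cites a source.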


What It Converts

The library handles a wide variety of input formats:

  • Office Documents: DOCX, PPTX, XLSX, XLS
  • PDFs: via pdfminer (PDFs with text layer; OCR requires the plugin)
  • Images: JPG, PNG — with LLM-generated descriptions when you provide a client
  • Audio: WAV, MP3 — via voice transcription
  • Web Content: HTML, URLs
  • Structured Data: CSV, JSON, XML
  • Compressed Archives: ZIP (processes content recursively)

The architecture is clean: each format has a dedicated DocumentConverter class, registered on startup. Processing happens entirely in memory — no temporary files — which matters for both performance and security.
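
The converter-per-format idea can be sketched in a few lines. This is an illustration of the pattern only, not MarkItDown's actual classes or method signatures: Converter, CsvToMarkdown, PlainText, Registry and build_registry are all hypothetical names, and the sketch dispatches on file extension for simplicity. Everything works on in-memory byte streams, mirroring the no-temp-files design described above.

```python
import csv
import io
import os
from typing import BinaryIO, Protocol


class Converter(Protocol):
    """Hypothetical interface: one small class per input format."""

    def accepts(self, extension: str) -> bool: ...
    def convert(self, stream: BinaryIO) -> str: ...


class CsvToMarkdown:
    def accepts(self, extension: str) -> bool:
        return extension == ".csv"

    def convert(self, stream: BinaryIO) -> str:
        text = stream.read().decode("utf-8")
        rows = list(csv.reader(io.StringIO(text, newline="")))
        if not rows:
            return ""
        header, *data = rows
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in data]
        return "\n".join(lines)


class PlainText:
    def accepts(self, extension: str) -> bool:
        return extension in {".txt", ".md"}

    def convert(self, stream: BinaryIO) -> str:
        return stream.read().decode("utf-8")


class Registry:
    def __init__(self) -> None:
        self._converters: list[Converter] = []

    def register(self, converter: Converter) -> None:
        self._converters.append(converter)

    def convert(self, filename: str, stream: BinaryIO) -> str:
        extension = os.path.splitext(filename)[1].lower()
        # The first registered converter that accepts the extension wins.
        for converter in self._converters:
            if converter.accepts(extension):
                return converter.convert(stream)
        raise ValueError(f"no converter registered for {extension!r}")


def build_registry() -> Registry:
    """Register every known converter once, at startup."""
    registry = Registry()
    registry.register(CsvToMarkdown())
    registry.register(PlainText())
    return registry
```

Adding a format then means writing one more class and registering it; the dispatch code does not change.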


Get Started in 4 Lines

pip install 'markitdown[all]'

from markitdown import MarkItDown

md = MarkItDown()

result = md.convert("report_q4.xlsx")
print(result.text_content)

That’s it. result.text_content gives you structured Markdown, preserving sheet names, table rows, and any headings.


CLI Usage

MarkItDown also comes as a command-line tool, useful for batch preprocessing:

# Convert a single file
markitdown document.pdf -o output.md

# Convert and pipe to another tool
markitdown report.docx | grep "## "
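
For batch jobs that outgrow a shell loop, the same preprocessing can be driven from Python. The sketch below is one possible shape, not something shipped with MarkItDown: convert_tree and its parameters are hypothetical, and the conversion itself is passed in as a function so the directory walk can be tried without any converter installed.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_tree(
    src: Path,
    dst: Path,
    to_markdown: Callable[[Path], str],
    suffixes: Iterable[str] = (".pdf", ".docx", ".pptx", ".xlsx"),
) -> list[Path]:
    """Mirror src into dst, writing one .md file per matching document."""
    wanted = {s.lower() for s in suffixes}
    written: list[Path] = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        target = dst / path.relative_to(src)
        # Append ".md" rather than replacing the suffix, so report.pdf and
        # report.docx do not overwrite each other.
        target = target.with_name(target.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(to_markdown(path), encoding="utf-8")
        written.append(target)
    return written


# Usage with MarkItDown, reusing the calls from the quick-start section:
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   convert_tree(Path("docs"), Path("markdown"),
#                lambda p: md.convert(str(p)).text_content)
```

Passing the converter in as a callable also makes it easy to wrap it with retries or error logging for files that fail to convert.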

Image Description with an LLM Client

For images, MarkItDown can call an LLM to generate a description, which is useful when you need images inside a document to be semantically searchable (audio goes through speech transcription instead):

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

result = md.convert("architecture_diagram.png")
print(result.text_content)
# Returns a structured description of the image content

This works with any OpenAI-compatible client — you’re not locked into OpenAI specifically.


The OCR Plugin

For PDFs and Office files containing images with embedded text, the markitdown-ocr plugin extends the base library with LLM Vision-based OCR:

pip install 'markitdown[all]' markitdown-ocr

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("scanned_contract.pdf")
print(result.text_content)

If no llm_client is provided, the plugin loads but silently falls back to the standard text extractor.


MCP Server: MarkItDown Inside Claude Desktop

One of the most interesting additions in the v0.1.x cycle: MarkItDown now comes with an official MCP server (markitdown-mcp), which means you can expose document conversion as a tool within Claude Desktop or any MCP-compatible client.

In practice: instead of manually preprocessing files before sending them to Claude, your MCP configuration handles the conversion on the fly. Claude can call MarkItDown, get structured Markdown back, and reason over the content without you manually managing the pipeline step.
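
As a rough sketch of what that configuration might look like, the snippet below prints a server entry in the "mcpServers" shape that Claude Desktop reads from its claude_desktop_config.json. Treat the details as assumptions: it presumes the markitdown-mcp command is on your PATH (for example after pip install markitdown-mcp), the "markitdown" label is arbitrary, and mcp_server_entry is a hypothetical helper. Check the markitdown-mcp README for the currently recommended launch command.

```python
import json
from typing import Sequence


def mcp_server_entry(
    command: str = "markitdown-mcp",  # assumed to be installed and on PATH
    args: Sequence[str] = (),
) -> dict:
    """Build a config fragment that registers MarkItDown as an MCP server."""
    return {
        "mcpServers": {
            # "markitdown" is just a label shown in the client.
            "markitdown": {"command": command, "args": list(args)},
        }
    }


# Print the fragment so it can be merged into the client's config file by hand.
print(json.dumps(mcp_server_entry(), indent=2))
```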


Where MarkItDown Fits (and Where It Doesn’t)

MarkItDown is the right tool when you need fast, lightweight conversion for LLM consumption — RAG pipelines, document indexing, AI assistants that need to read business files.

It’s not the right tool when:

  • You need high-fidelity formatting for human readers → use Pandoc instead
  • Your documents are complex scientific PDFs with tables, equations, and reading order challenges → Docling (IBM) handles those better
  • You’re building a large-scale document ETL pipeline with 40+ connectors → Unstructured.io is more appropriate there

Knowing where each tool ends is as useful as knowing what it does.


v0.1.5: What Changed

The latest version (February 20, 2026) updated two dependencies to pick up security fixes:

  • Updates mammoth to 1.11.0 to resolve CVE-2025-11849
  • Updates pdfminer.six to 20251107 to resolve GHSA-wf5f-4jwr-ppcp

Recent versions also switched XML parsing from minidom to defusedxml, another security hardening move. If you’re running an older version of MarkItDown in any environment that processes untrusted documents, it’s worth updating.
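
A quick way to find out whether an environment is behind is to compare the installed version against the release mentioned above. This is a small standard-library sketch; parse_release, needs_update and installed_markitdown_version are hypothetical helper names, and the comparison deliberately ignores pre-release suffixes such as "a1".

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def parse_release(v: str) -> tuple[int, ...]:
    """Turn '0.1.5' into (0, 1, 5), stopping at the first non-numeric part."""
    parts: list[int] = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # e.g. '5a1': keep the 5, ignore the suffix
            break
    return tuple(parts)


def needs_update(installed: str, minimum: str = "0.1.5") -> bool:
    """True when the installed release is older than the minimum."""
    return parse_release(installed) < parse_release(minimum)


def installed_markitdown_version() -> Optional[str]:
    """Return the installed markitdown version, or None if it is absent."""
    try:
        return version("markitdown")
    except PackageNotFoundError:
        return None


found = installed_markitdown_version()
if found is None:
    print("markitdown is not installed in this environment")
elif needs_update(found):
    print(f"markitdown {found} is older than 0.1.5; consider upgrading")
else:
    print(f"markitdown {found} is at or above 0.1.5")
```

Comparing integer tuples instead of raw strings avoids the classic trap where "0.1.10" sorts before "0.1.5".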


The Practical Takeaway

The preprocessing step in AI pipelines is boring, but it’s where you win or lose a lot of quality. MarkItDown is a well-maintained library, MIT-licensed, from a team that built it for real production use inside one of the world’s largest AI research groups. It handles the messy conversion layer so you can focus on what the LLM does with the data.

Install it, point it at your stack of documents, and give your LLM something it can actually work with.

pip install 'markitdown[all]'

GitHub: microsoft/markitdown


How are you handling document preprocessing in your AI projects? Are you building something custom or relying on libraries like this? Share your experience in the comments 👇