PDF to Text Conversion with Rust on AWS Lambda

DEV Community: rust (fayismahmood)

A complete guide to building a serverless PDF conversion service using Rust, pdf_oxide, and cargo-lambda.

Demo: https://pdf-to-text-phi.vercel.app
Repo: https://github.com/fayismahmood/pdf-to-text

Prerequisites

You'll need Rust 1.70+ installed, AWS CLI configured with credentials, and basic familiarity with AWS Lambda concepts. For cargo-lambda installation, follow the official guide.

Your AWS IAM user also needs permissions for:

  • lambda:* (or the required Lambda deployment permissions)
  • iam:CreateRole
  • iam:AttachRolePolicy
  • iam:PassRole

Without these permissions, cargo lambda deploy may fail during the initial deployment when creating the Lambda execution role.

Installing cargo-lambda

cargo-lambda is a Cargo subcommand for building, testing, and deploying AWS Lambda functions written in Rust.

macOS / Linux
curl -L https://cargo-lambda.info/install.sh | sh

Verify: cargo lambda --version

Project Setup

Create new project
cargo lambda new pdf-converter --http
cd pdf-converter

This scaffolds an HTTP function built on the lambda_http crate, which accepts events from API Gateway, Lambda function URLs, and Application Load Balancers.

Update Cargo.toml
[package]
name = "pdf-converter"
version = "0.1.0"
edition = "2021"

[dependencies]
lambda_http = "1.0"
pdf_oxide = "0.3"
tokio = { version = "1", features = ["macros"] }

Code Implementation

src/http_handler.rs
use lambda_http::{Body, Error, Request, RequestExt, Response};
use pdf_oxide::{PdfDocument, converters::ConversionOptions};

/// Output formats selectable through the `file_type` query parameter.
pub(crate) enum FileType {
    Html,
    Text,
    Markdown,
}

impl FileType {
    /// Case-insensitive parse; returns `None` for unknown values.
    fn from_str(s: &str) -> Option<Self> {
        match s.to_lowercase().as_str() {
            "html" => Some(FileType::Html),
            "text" => Some(FileType::Text),
            "markdown" => Some(FileType::Markdown),
            _ => None,
        }
    }
}

pub(crate) async fn function_handler(event: Request) -> Result<Response<Body>, Error> {
    // Read `file_type` from the query string; a missing parameter means "text".
    let file_type = event
        .query_string_parameters_ref()
        .and_then(|params| params.first("file_type"))
        .unwrap_or("text");

    // Unknown values also fall back to plain text.
    let file_type = FileType::from_str(file_type).unwrap_or(FileType::Text);

    // The request body is the raw PDF.
    let body = event.body().to_vec();
    let pdf_data = PdfDocument::from_bytes(body)?;

    let options = ConversionOptions::default();
    let page_count = pdf_data.page_count()?;
    let mut result = String::new();

    // Convert page by page and append each page's output.
    for i in 0..page_count {
        let page_content = match file_type {
            FileType::Html => pdf_data.to_html(i, &options)?,
            FileType::Text => pdf_data.to_plain_text(i, &options)?,
            FileType::Markdown => pdf_data.to_markdown(i, &options)?,
        };
        result.push_str(&page_content);
    }

    let content_type = match file_type {
        FileType::Html => "text/html",
        FileType::Text => "text/plain",
        FileType::Markdown => "text/markdown",
    };

    let resp = Response::builder()
        .status(200)
        .header("content-type", content_type)
        .body(result.into())
        .map_err(Box::new)?;
    Ok(resp)
}
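
The format selection in the handler above can be pulled out and checked without any Lambda or PDF dependency. The following is a minimal std-only sketch, not the author's code: it swaps the hand-written from_str for the standard FromStr trait, and the names content_type and resolve_file_type are hypothetical helpers introduced for illustration.

```rust
use std::str::FromStr;

/// Output formats accepted in the `file_type` query parameter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Html,
    Text,
    Markdown,
}

impl FromStr for FileType {
    type Err = String;

    /// Case-insensitive parse; unknown values are reported as an error
    /// so the caller decides whether to fall back or reject the request.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "html" => Ok(FileType::Html),
            "text" => Ok(FileType::Text),
            "markdown" => Ok(FileType::Markdown),
            other => Err(format!("unsupported file_type: {other}")),
        }
    }
}

impl FileType {
    /// MIME type sent back in the `content-type` response header
    /// (hypothetical helper, same mapping as the handler's match).
    pub fn content_type(self) -> &'static str {
        match self {
            FileType::Html => "text/html",
            FileType::Text => "text/plain",
            FileType::Markdown => "text/markdown",
        }
    }
}

/// Mirrors the handler's fallback: a missing or unknown parameter means plain text.
/// (Hypothetical helper name.)
pub fn resolve_file_type(param: Option<&str>) -> FileType {
    param
        .and_then(|p| p.parse::<FileType>().ok())
        .unwrap_or(FileType::Text)
}
```

Implementing FromStr lets callers write "markdown".parse::<FileType>() and keeps the error available, which would make it straightforward to return a 400 response for unknown formats instead of silently falling back to text.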

Local Testing

Start the local server
cargo lambda watch

Send a PDF via curl
curl -X POST 'http://localhost:9000/function/function_handler?file_type=markdown' \
-H 'Content-Type: application/pdf' \
--data-binary @document.pdf

Deployment

Build for production
cargo lambda build --release --arm64

The --arm64 flag targets AWS Graviton processors for better cost/performance.

Deploy to AWS
cargo lambda deploy

The first deployment will create an IAM role automatically. Subsequent deployments will reuse it.

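Via AWS CLI
# Package the function as a zip archive
cargo lambda build --release --arm64 --output-format zip

# Upload the new code (assumes the pdf-converter function already exists)
aws lambda update-function-code \
--function-name pdf-converter \
--zip-file fileb://target/lambda/pdf-converter/bootstrap.zip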

Performance Benchmarks

pdf_oxide Performance

pdf_oxide is one of the fastest PDF libraries available, according to benchmarks on 3,830 real-world PDFs.

Python PDF Libraries Comparison

Rust PDF Libraries Comparison

Among Rust libraries, pdf_oxide is 5× faster than pdf_extract and 17× faster than oxidize_pdf.

AWS Lambda Cold Start

Rust's minimal runtime and small compiled binary keep cold starts short.


Memory Usage

With pdf_oxide's efficient design, memory usage stays low.


API Usage

Request
POST /function/function_handler?file_type={html|text|markdown}
Content-Type: application/pdf

<binary PDF data>
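
Since the endpoint accepts arbitrary bytes, a cheap sanity check on the body can reject obviously wrong uploads before they reach the PDF parser. The sketch below is an assumption layered on top of the article's handler, not part of it: looks_like_pdf and precheck_status are hypothetical names, and the check only looks for the %PDF- marker that PDF files start with. Some real-world files carry a few stray bytes before that header, so a strict prefix test may reject files that a lenient parser would accept.

```rust
/// Returns true when the payload starts with the PDF header marker `%PDF-`.
/// This is only a cheap sanity check, not full validation.
pub fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Hypothetical pre-check: maps an obviously wrong body to an HTTP status code
/// before the bytes are handed to the PDF parser.
pub fn precheck_status(body: &[u8]) -> u16 {
    if body.is_empty() {
        400 // Bad Request: nothing was uploaded
    } else if !looks_like_pdf(body) {
        415 // Unsupported Media Type: body does not start like a PDF
    } else {
        200 // looks plausible; real parsing can still fail later
    }
}
```

In the handler, a non-200 result from such a check could be turned into an early response with that status code instead of letting the parser error surface as a generic failure.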

Response

Returns the converted content with appropriate content-type header.

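Example with AWS CLI
A direct invoke bypasses the HTTP front end, so the payload must be a complete API Gateway-style event (file_type under queryStringParameters, the PDF base64-encoded in body) rather than a bare JSON object. Here event.json is a placeholder name for a file holding such an event:
aws lambda invoke \
--function-name pdf-converter \
--payload file://event.json \
--cli-binary-format raw-in-base64-out \
response.json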
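
When a request reaches the function through API Gateway or a function URL, a binary body such as a PDF arrives base64-encoded in the event's body field with isBase64Encoded set to true, so a hand-built event for direct invocation has to encode the PDF the same way. The following is a minimal std-only sketch of that encoding step; base64_encode is a hypothetical helper written out for illustration, and in a real project the base64 crate would be the usual choice.

```rust
/// The standard base64 alphabet (RFC 4648).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes bytes as standard base64 with `=` padding.
pub fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into the low 24 bits of a u32.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // The first two output characters are always present.
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);

        // Missing input bytes become `=` padding.
        if chunk.len() > 1 {
            out.push(ALPHABET[(n >> 6) as usize & 63] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[n as usize & 63] as char);
        } else {
            out.push('=');
        }
    }
    out
}
```

For example, base64_encode(b"%PDF-") yields "JVBERi0=", which is why base64-encoded PDFs are recognizable by their JVBERi0 prefix.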
