md-skill
Summary
This skill provides a Python cheat sheet for manipulating PDFs, covering reading, editing, and regenerating PDF files using pdfplumber, reportlab, and pypdf. It includes techniques for overlaying text on existing layouts without altering the original design, such as updating prices in a price list.
Overview
Cheat Sheet: PDF Manipulation with Python
A practical summary of the 3 main libraries that are often used together to read, edit, and regenerate PDFs, including the technique used to update prices in a price list (overlaying text on top of the original layout without damaging the design).
1. When to Use Which Library
| Need | Library |
|---|---|
| Read text, tables, coordinates, and colors from an existing PDF | pdfplumber |
| Create new content (text, boxes, lines) from scratch | reportlab |
| Merge/split/rotate/encrypt/overlay PDF pages | pypdf |
Install everything in one go:
```bash
pip install pdfplumber reportlab pypdf --break-system-packages
```
2. pdfplumber: Reading PDFs
Extract text per word with coordinates
```python
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words()
    for w in words:
        print(w["text"], w["x0"], w["x1"], w["top"], w["bottom"])
```
Note: top/bottom are measured from the top of the page (unlike reportlab; see section 6).
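As a minimal sketch of the conversion that note implies, the helper below turns a word dictionary shaped like the output of `extract_words()` into the `(x, y, width, height)` arguments that reportlab's `canvas.rect()` takes. The function name `word_to_rect`, the `pad` parameter, and the sample values are illustrative assumptions, not part of either library.

```python
def word_to_rect(word, page_height, pad=0.0):
    """Convert a pdfplumber-style word box (top-down) to reportlab rect args (bottom-up)."""
    x = word["x0"] - pad
    # reportlab measures y from the bottom of the page to the lower edge of the box
    y = page_height - word["bottom"] - pad
    width = (word["x1"] - word["x0"]) + 2 * pad
    height = (word["bottom"] - word["top"]) + 2 * pad
    return x, y, width, height


# Hypothetical word on a 792 pt tall page
sample = {"text": "15,000", "x0": 100.0, "x1": 140.0, "top": 50.0, "bottom": 62.0}
print(word_to_rect(sample, page_height=792.0))  # (100.0, 730.0, 40.0, 12.0)
```

Note that the lower edge comes from `bottom`, not `top`: using `top` would give the y of the box's upper edge instead.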
Extract tables
```python
tables = page.extract_tables()
if tables:  # the list is empty when no table is detected
    for row in tables[0]:
        print(row)
```
Check the background color of an area (rects)
Useful when you want to cover old text with exactly the same color as the yellow/highlight box behind it:
```python
for r in page.rects:
    if r["fill"]:
        print(r["x0"], r["x1"], r["top"], r["bottom"], r["non_stroking_color"])
```
Check the original font and character size
```python
for c in page.chars[:5]:
    print(c["text"], c["fontname"], c["size"], c["non_stroking_color"])
```
3. reportlab: Creating New Content / Overlays
Basic canvas; its size must match the target PDF page
```python
import io
from reportlab.pdfgen import canvas

packet = io.BytesIO()
c = canvas.Canvas(packet, pagesize=(page.width, page.height))  # taken from pdfplumber
# ... draw on the canvas here ...
c.save()
packet.seek(0)
```
Register a custom font (TTF) before using it
```python
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

pdfmetrics.registerFont(TTFont("MyFont", "/usr/share/fonts/truetype/liberation/LiberationSans-Bold.ttf"))
c.setFont("MyFont", 12)
```
Draw a solid box (to cover old text) + new text
```python
c.setFillColorRGB(1, 1, 0.4)  # box color (match the original background)
c.rect(x, y, w, h, fill=1, stroke=0)
c.setFillColorRGB(0, 0, 0)  # text color
c.drawString(x, y, "New Text")
```
⚠️ Don't use Unicode subscript/superscript characters (₀¹²) in reportlab: the built-in fonts don't have those glyphs, so they render as black boxes. Use the `<sub>`/`<super>` tags in `Paragraph` instead of a plain canvas string.
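One way to follow that advice is to rewrite the Unicode digits as markup before handing the string to a `Paragraph`. The sketch below is standard-library only; the function name `to_paragraph_markup` is hypothetical, and it handles only the subscript and superscript digits 0 to 9, which is an assumption about what the input contains.

```python
from itertools import groupby
from xml.sax.saxutils import escape

_SUB = dict(zip("₀₁₂₃₄₅₆₇₈₉", "0123456789"))
_SUPER = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789"))


def _kind(ch):
    """Classify a character as subscript digit, superscript digit, or plain text."""
    if ch in _SUB:
        return "sub"
    if ch in _SUPER:
        return "super"
    return None


def to_paragraph_markup(text):
    """Replace runs of Unicode sub/superscript digits with <sub>/<super> tags."""
    parts = []
    for kind, run in groupby(text, key=_kind):
        chars = "".join(run)
        if kind == "sub":
            parts.append("<sub>" + "".join(_SUB[c] for c in chars) + "</sub>")
        elif kind == "super":
            parts.append("<super>" + "".join(_SUPER[c] for c in chars) + "</super>")
        else:
            parts.append(escape(chars))  # plain text: escape &, <, > for the markup parser
    return "".join(parts)


print(to_paragraph_markup("H₂O, 25 m²"))  # H<sub>2</sub>O, 25 m<super>2</super>
```

The returned string is meant to be passed as the text of a reportlab `Paragraph` (from `reportlab.platypus`) rather than to `drawString()`.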
4. pypdf: Merge, Overlay, Rotate, etc.
Basic read & write
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("file.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
with open("output.pdf", "wb") as f:  # context manager closes the file handle
    writer.write(f)
```
⚠️ merge_page(): direction matters!
```python
base_page.merge_page(overlay_page)
```
This means `overlay_page` is drawn ON TOP OF `base_page`. The pattern is A.merge_page(B) → B appears on top of A.
So if you want the new overlay text to be visible (not hidden by the original design):
```python
orig_page.merge_page(overlay_page)  # correct: overlay on top, visible
writer.add_page(orig_page)
```
If reversed (`overlay_page.merge_page(orig_page)`), the overlay ends up covered by the original design. Always check the rendered result (see the render check in section 6): this behavior has reportedly flipped between pypdf versions, so don't trust it without visual verification.
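A minimal sketch of that visual check, assuming the optional `pdf2image` package and the poppler binaries it wraps are installed. The helper names `preview_name` and `render_previews` and the output naming scheme are hypothetical.

```python
from pathlib import Path


def preview_name(pdf_path, page_number):
    """Build a PNG file name such as 'price_list_updated_p1.png' next to the PDF."""
    p = Path(pdf_path)
    return str(p.with_name(f"{p.stem}_p{page_number}.png"))


def render_previews(pdf_path, dpi=100):
    """Render every page to a PNG so the overlay order can be checked by eye."""
    # Imported lazily because pdf2image is an optional extra, not one of the 3 core libraries
    from pdf2image import convert_from_path

    saved = []
    for number, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = preview_name(pdf_path, number)
        image.save(target)
        saved.append(target)
    return saved
```

Calling `render_previews("price_list_updated.pdf")` would write one PNG per page; opening the first one is usually enough to see whether the overlay text sits above or below the original design.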
Rotate / Split / Encrypt (brief)
```python
page.rotate(90)  # rotate 90 degrees clockwise
writer.encrypt("user_pw", "owner_pw")  # password protect
```
5. Practical Example: Updating Prices in a PDF (Without Changing the Layout)
The full pattern used to raise every price in a price list by 20% while keeping the design, product images, and colors the same:
```python
import io
import re

import pdfplumber
from pypdf import PdfReader, PdfWriter
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas

pdfmetrics.registerFont(TTFont("Bold", "/usr/share/fonts/truetype/liberation/LiberationSans-Bold.ttf"))


def overlay_untuk_halaman(page, factor=1.2):
    """Build a one-page overlay PDF that covers each old price and draws the new one."""
    packet = io.BytesIO()
    c = canvas.Canvas(packet, pagesize=(page.width, page.height))
    for w in page.extract_words():
        if not re.match(r"^\d{1,3}(,\d{3})+$", w["text"]):
            continue  # skip anything that is not in the "15,000" price format
        harga_lama = int(w["text"].replace(",", ""))
        harga_baru = f"{round(harga_lama * factor):,}"
        # find the background color behind this text (default: white)
        bg = (1, 1, 1)
        for r in page.rects:
            if r["fill"] and r["x0"] <= w["x0"]+5 and r["x1"] >= w["x1"]-5 \
                    and r["top"] <= w["top"]+2 and r["bottom"] >= w["bottom"]-2:
                color = r["non_stroking_color"]
                # setFillColorRGB needs 3 components; keep white for other color spaces
                if isinstance(color, (list, tuple)) and len(color) == 3:
                    bg = color
                break
        # convert pdfplumber coordinates (top-down) -> reportlab (bottom-up)
        y_bawah = page.height - w["bottom"]
        h = w["bottom"] - w["top"]
        c.setFillColorRGB(*bg)
        c.rect(w["x0"]-2, y_bawah-1, (w["x1"]-w["x0"])+20, h+2, fill=1, stroke=0)
        c.setFillColorRGB(0, 0, 0)
        c.setFont("Bold", h*0.82)
        c.drawString(w["x0"]-4, y_bawah+1, harga_baru)
    c.save()
    packet.seek(0)
    return packet


with pdfplumber.open("price_list.pdf") as pdf:
    original = PdfReader("price_list.pdf")
    writer = PdfWriter()
    for i, page in enumerate(pdf.pages):
        overlay = PdfReader(overlay_untuk_halaman(page)).pages[0]
        orig_page = original.pages[i]
        orig_page.merge_page(overlay)  # overlay on top, old price is covered
        writer.add_page(orig_page)
    with open("price_list_updated.pdf", "wb") as f:
        writer.write(f)
```
6. Common Gotchas
- The coordinate systems run in opposite directions. `pdfplumber` uses `top` (distance from the top of the page); `reportlab` uses `y` measured from the bottom. Conversion: `y_reportlab = page_height - top_pdfplumber` (use `bottom` instead of `top` to get the lower edge of a box).
- The direction of `merge_page()` determines which page ends up on top. Always render to an image (`pdf2image`) and check visually before sending the final result.
- The cover box color must match the original background (plain white vs. yellow highlight); otherwise the result looks "patched".
- The cover box must be slightly wider than the old text, so no leftover digits of the old number show through when the new number is shorter.
- Fonts must be registered (`registerFont`) before they are used in `setFont()`, unlike system fonts, which are automatically available in Word/LibreOffice.
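The color-matching gotcha has one more wrinkle: `setFillColorRGB()` takes exactly three components, while a fill color read from a PDF may be grayscale or CMYK. The helper below is a sketch under the assumption that `non_stroking_color` can arrive as `None`, a bare number, or a sequence of 1, 3, or 4 values in the range 0 to 1. The name `to_rgb` is hypothetical, and the CMYK branch is a naive conversion that ignores color profiles.

```python
def to_rgb(color, default=(1.0, 1.0, 1.0)):
    """Normalize a PDF fill color to an (r, g, b) tuple with components in 0..1."""
    if color is None:
        return default
    if isinstance(color, (int, float)):
        color = (color,)  # bare number: treat it as a gray level
    parts = tuple(float(v) for v in color)
    if len(parts) == 1:  # grayscale
        return (parts[0],) * 3
    if len(parts) == 3:  # already RGB
        return parts
    if len(parts) == 4:  # CMYK, naive conversion
        c, m, y, k = parts
        return ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    return default  # unknown shape: fall back to white


print(to_rgb((0, 0, 1, 0)))  # (1.0, 1.0, 0.0), i.e. yellow
```

With this in place, the cover box color could be set as `c.setFillColorRGB(*to_rgb(r["non_stroking_color"]))`.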
Note: this document is a summary of general techniques based on open-source libraries (pdfplumber, reportlab, pypdf), rewritten from project experience; it is not a copy of any internal material.
Install & Usage
```bash
mkdir -p .claude/skills && curl -o .claude/skills/md-skill.md https://raw.githubusercontent.com/MRGHOZ/md-skill/main/SKILL.md
```
Then run `/md-skill` in Claude Code.
Usage Examples
/md-skill Extract all tables from invoice.pdf and save as CSV
/md-skill Overlay 'New Price: $99' at coordinates (100, 200) on page 1 of price_list.pdf
/md-skill Merge three PDF files into one and encrypt with password 'secret123'
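The first usage example (tables to CSV) could be sketched roughly as follows. The helper name `table_to_csv` and the sample rows are hypothetical; the rows are assumed to have the shape of one entry from `page.extract_tables()`, with empty cells arriving as `None`.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Write empty cells (assumed to be None) as empty strings
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical rows shaped like one extracted table
rows = [["Item", "Price"], ["Widget", "15,000"], ["Gadget", None]]
print(table_to_csv(rows))
```

The `csv` module quotes `15,000` automatically because it contains the delimiter, so prices with thousands separators survive the round trip.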
Frequently Asked Questions
How to install md-skill?
To install md-skill, run: mkdir -p .claude/skills && curl -o .claude/skills/md-skill.md https://raw.githubusercontent.com/MRGHOZ/md-skill/main/SKILL.md (this creates the skills directory and downloads the skill file). Then run /md-skill in Claude Code.
What is md-skill best for?
md-skill is a skill categorized under General. Created by MRGHOZ.
What can I use md-skill for?
md-skill is useful for:
- Extracting text, tables, and coordinates from existing PDFs for data analysis.
- Overlaying new text onto a PDF page while preserving the original layout and design.
- Creating new PDF content from scratch, including text, boxes, and lines.
- Merging, splitting, rotating, or encrypting PDF pages using pypdf.
- Identifying background colors and font properties of text in a PDF for precise editing.
- Automating price list updates by overlaying new prices on existing PDF templates.