Compare commits

...

5 Commits

Author SHA1 Message Date
Ásgeir Thor Johnson
3964e72269 Create pptx.md 2025-10-02 08:12:47 +00:00
Ásgeir Thor Johnson
14448f4257 Create xlsx.md 2025-10-02 08:12:35 +00:00
Ásgeir Thor Johnson
4402f2ea4b Create pdf.md 2025-10-02 08:12:17 +00:00
Ásgeir Thor Johnson
d19c913737 Create docx.md 2025-10-02 08:12:02 +00:00
Ásgeir Thor Johnson
1ca43585da Create claude-4.5-sonnet.md 2025-10-02 07:54:21 +00:00
5 changed files with 2419 additions and 0 deletions

File diff suppressed because it is too large Load Diff

177
Anthropic/docx.md Normal file
View File

@@ -0,0 +1,177 @@
# Word Document Handler (/mnt/skills/public/docx/SKILL.md)
---
name: Word Document Handler
description: Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction
when_to_use: "When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
version: 0.0.1
---
# DOCX creation, editing, and analysis
## Overview
A user may ask you to create, edit, or analyze the contents of a .docx file. A .docx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.
## Workflow Decision Tree
### Reading/Analyzing Content
Use "Text extraction" or "Raw XML access" sections below
### Creating New Document
Use "Creating a new Word document" workflow
### Editing Existing Document
- **Your own document + simple changes**
Use "Basic OOXML editing" workflow
- **Someone else's document**
Use **"Redlining workflow"** (recommended default)
- **Legal, academic, business, or government docs**
Use **"Redlining workflow"** (required)
## Reading and analyzing content
### Text extraction
If you just need to read the text contents of a document, you should convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes:
```bash
# Convert document to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
```
### Raw XML access
You need raw XML access for: comments, complex formatting, document structure, embedded media, and metadata. For any of these features, you'll need to unpack a document and read its raw XML contents.
#### Unpacking a file
`python ooxml/scripts/unpack.py <office_file> <output_directory>`
#### Key file structures
* `word/document.xml` - Main document contents
* `word/comments.xml` - Comments referenced in document.xml
* `word/media/` - Embedded images and media files
* Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags
## Creating a new Word document
When creating a new Word document from scratch, use **docx-js**, which allows you to create Word documents using JavaScript/TypeScript.
### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with document creation.
2. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (You can assume all dependencies are installed, but if not, refer to the dependencies section below)
3. Export as .docx using Packer.toBuffer()
## Editing an existing Word document
When editing an existing Word document, you need to work with the raw Office Open XML (OOXML) format. This involves unpacking the .docx file, editing the XML content, and repacking it.
### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical validation rules, and patterns before proceeding.
2. Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
3. Edit the XML files (primarily `word/document.xml` and `word/comments.xml`)
4. **CRITICAL**: Validate immediately after each edit and fix any validation errors before proceeding: `python ooxml/scripts/validate.py <dir> --original <file>`
5. Pack the final document: `python ooxml/scripts/pack.py <input_directory> <office_file>`
## Redlining workflow for document review
This workflow allows you to plan comprehensive tracked changes using markdown before implementing them in OOXML. **CRITICAL**: For complete tracked changes, you must implement ALL changes systematically.
### Comprehensive tracked changes workflow
1. **Get markdown representation**: Convert document to markdown with tracked changes preserved:
```bash
pandoc --track-changes=all path-to-file.docx -o current.md
```
2. **Create comprehensive revision checklist**: Create a detailed checklist of ALL changes needed, with tasks listed in sequential order.
- All tasks should start as unchecked items using `[ ]` format
- **DO NOT use markdown line numbers** - they don't map to XML structure
- **DO use:**
- Section/heading numbers (e.g., "Section 3.2", "Article IV")
- Paragraph identifiers if numbered
- Grep patterns with unique surrounding text
- Document structure (e.g., "first paragraph", "signature block")
- Example: `[ ] Section 8: Change "30 days" to "60 days" (grep: "notice period of.*days prior")`
- Consider that text may be split across multiple `<w:t>` elements due to formatting
- Save as `revision-checklist.md`
3. **Setup tracked changes infrastructure**:
- Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
- Run setup script: `python skills/docx/scripts/setup_redlining.py <unpacked_directory>`
- This automatically:
- Creates `word/people.xml` with Claude as author (ID 0)
- Updates `[Content_Types].xml` to include people.xml content type
- Updates `word/_rels/document.xml.rels` to add people.xml relationship
- Adds `<w:trackRevisions/>` to `word/settings.xml`
- Generates and adds a random 8-character hex RSID (e.g., "6CEA06C3")
- Displays the generated RSID for reference
- **CRITICAL**: Note the RSID displayed by the script - you MUST use this same RSID for ALL tracked changes
4. **Apply changes from checklist systematically**:
- **MANDATORY - READ ENTIRE FILE**: Read [`ooxml.md`](ooxml.md) (~500 lines) completely from start to finish. **NEVER set any range limits when reading this file.** Pay special attention to the section titled "Tracked Change Patterns".
- **CRITICAL for sub-agents**: If delegating work to sub-agents, each sub-agent MUST also read the "Tracked Change Patterns" section of `ooxml.md` before making any XML edits
- **Process each checklist item sequentially**: Go through revision checklist line by line
- **Locate text using grep**: Use grep to find the exact text location in `word/document.xml`
- **Read context with Read tool**: Use Read tool to view the complete XML structure around each change
- **Apply tracked changes**: Use Edit/MultiEdit tools for precision
- **Use consistent RSID**: Use the SAME RSID from step 3 for ALL tracked changes (IMPORTANT: RSID attributes go on `w:r` tags and are invalid on `w:del` or `w:ins` tags)
- **Track changes format**: All insertions use `<w:ins w:id="X" w:author="Claude" w:date="...">`, deletions use `<w:del w:id="X" w:author="Claude" w:date="...">`
5. **MANDATORY - Review and complete checklist**:
- **Verify all changes**: Convert document to markdown and use grep/search to verify each change:
```bash
pandoc --track-changes=all <packed_file.docx> -o verification.md
grep -E "pattern" verification.md # Check for each updated term
```
- **Update checklist systematically**: Mark items [x] only after verification confirms the change
- **CRITICAL - Complete any incomplete tasks**: If items remain unchecked, you MUST complete them before proceeding
- **Document incomplete items**: Note any items not addressed and specific reasons why
- **Ensure 100% completion**: All checklist items must be [x] before proceeding
6. **Final validation and packaging**:
- Final validation: `python ooxml/scripts/validate.py <directory> --original <file>`
- Pack only after validation passes: `python ooxml/scripts/pack.py <input_directory> <office_file>`
- Only consider task complete when validation passes AND checklist is 100% complete
## Converting Documents to Images
To visually analyze Word documents, convert them to images using a two-step process:
1. **Convert DOCX to PDF**:
```bash
soffice --headless --convert-to pdf document.docx
```
2. **Convert PDF pages to JPEG images**:
```bash
pdftoppm -jpeg -r 150 document.pdf page
```
This creates files like `page-1.jpg`, `page-2.jpg`, etc.
Options:
- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
- `page`: Prefix for output files
Example for specific range:
```bash
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page # Converts only pages 2-5
```
## Code Style Guidelines
**IMPORTANT**: When generating code for DOCX operations:
- Write concise code
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
## Dependencies
Required dependencies (install if not available):
- **pandoc**: `sudo apt-get install pandoc` (for text extraction)
- **docx**: `npm install -g docx` (for creating new documents)
- **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion)
- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images)

297
Anthropic/pdf.md Normal file
View File

@@ -0,0 +1,297 @@
# PDF Processing (/mnt/skills/public/pdf/SKILL.md)
---
name: PDF Processing
description: Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms.
when_to_use: When Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
version: 0.0.1
dependencies: pytesseract>=0.3.10, pdf2image>=1.16.0
---
# PDF Processing Guide
## Overview
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
## Quick Start
```python
from pypdf import PdfReader, PdfWriter
# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
# Extract text
text = ""
for page in reader.pages:
text += page.extract_text()
```
## Python Libraries
### pypdf - Basic Operations
#### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)
```
#### Split PDF
```python
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)
```
#### Extract Metadata
```python
reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```
#### Rotate Pages
```python
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)
```
### pdfplumber - Text and Table Extraction
#### Extract Text with Layout
```python
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)
```
#### Extract Tables
```python
with pdfplumber.open("document.pdf") as pdf:
for i, page in enumerate(pdf.pages):
tables = page.extract_tables()
for j, table in enumerate(tables):
print(f"Table {j+1} on page {i+1}:")
for row in table:
print(row)
```
#### Advanced Table Extraction
```python
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table: # Check if table is not empty
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)
# Combine all tables
if all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)
```
### reportlab - Create PDFs
#### Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")
# Add a line
c.line(100, height - 140, 400, height - 140)
# Save
c.save()
```
#### Create PDF with Multiple Pages
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
# Build PDF
doc.build(story)
```
## Command-Line Tools
### pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5
```
### qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
### pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf
# Split
pdftk input.pdf burst
# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```
## Common Tasks
### Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
# Convert PDF to images
images = convert_from_path('scanned.pdf')
# OCR each page
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)
```
### Add Watermark
```python
from pypdf import PdfReader, PdfWriter
# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]
# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)
```
### Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix
# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
```
### Password Protection
```python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
# Add password
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)
```
## Quick Reference
| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | `qpdf --empty --pages ...` |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |
## Next Steps
- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries (pdf-lib), see REFERENCE.md
- If you need to fill out a PDF form, follow the instructions in FORMS.md
- For troubleshooting guides, see REFERENCE.md

410
Anthropic/pptx.md Normal file
View File

@@ -0,0 +1,410 @@
# PowerPoint Suite (/mnt/skills/public/pptx/SKILL.md)
---
name: PowerPoint Suite
description: Presentation creation, editing, and analysis.
when_to_use: "When Claude needs to work with presentations (.pptx files) for: (1) Creating new presentations, (2) Modifying or editing content, (3) Working with layouts, (4) Adding comments or speaker notes, or any other presentation tasks"
version: 0.0.3
---
# PPTX creation, editing, and analysis
## Overview
A user may ask you to create, edit, or analyze the contents of a .pptx file. A .pptx file is essentially a ZIP archive containing XML files and other resources that you can read or edit. You have different tools and workflows available for different tasks.
## Reading and analyzing content
### Text extraction
If you just need to read the text contents of a presentation, you should convert the document to markdown:
```bash
# Convert document to markdown
python -m markitdown path-to-file.pptx
```
### Raw XML access
You need raw XML access for: comments, speaker notes, slide layouts, animations, design elements, and complex formatting. For any of these features, you'll need to unpack a presentation and read its raw XML contents.
#### Unpacking a file
`python ooxml/scripts/unpack.py <office_file> <output_dir>`
**Note**: The unpack.py script is located at `skills/pptx/ooxml/scripts/unpack.py` relative to the project root. If the script doesn't exist at this path, use `find . -name "unpack.py"` to locate it.
#### Key file structures
* `ppt/presentation.xml` - Main presentation metadata and slide references
* `ppt/slides/slide{N}.xml` - Individual slide contents (slide1.xml, slide2.xml, etc.)
* `ppt/notesSlides/notesSlide{N}.xml` - Speaker notes for each slide
* `ppt/comments/modernComment_*.xml` - Comments for specific slides
* `ppt/slideLayouts/` - Layout templates for slides
* `ppt/slideMasters/` - Master slide templates
* `ppt/theme/` - Theme and styling information
* `ppt/media/` - Images and other media files
#### Typography and color extraction
**When given an example design to emulate**: Always analyze the presentation's typography and colors first using the methods below:
1. **Read theme file**: Check `ppt/theme/theme1.xml` for colors (`<a:clrScheme>`) and fonts (`<a:fontScheme>`)
2. **Sample slide content**: Examine `ppt/slides/slide1.xml` for actual font usage (`<a:rPr>`) and colors
3. **Search for patterns**: Use grep to find color (`<a:solidFill>`, `<a:srgbClr>`) and font references across all XML files
## Creating a new PowerPoint presentation **without a template**
When creating a new PowerPoint presentation from scratch, use the **html2pptx** workflow to convert HTML slides to PowerPoint with accurate positioning.
### Design Principles
**CRITICAL**: Before creating any presentation, analyze the content and choose appropriate design elements:
1. **Consider the subject matter**: What is this presentation about? What tone, industry, or mood does it suggest?
2. **Check for branding**: If the user mentions a company/organization, consider their brand colors and identity
3. **Match palette to content**: Select colors that reflect the subject
4. **State your approach**: Explain your design choices before writing code
**Requirements**:
- ✅ State your content-informed design approach BEFORE writing code
- ✅ Use web-safe fonts only: Arial, Helvetica, Times New Roman, Georgia, Courier New, Verdana, Tahoma, Trebuchet MS, Impact
- ✅ Create clear visual hierarchy through size, weight, and color
- ✅ Ensure readability: strong contrast, appropriately sized text, clean alignment
- ✅ Be consistent: repeat patterns, spacing, and visual language across slides
#### Color Palette Selection
**Choosing colors creatively**:
- **Think beyond defaults**: What colors genuinely match this specific topic? Avoid autopilot choices.
- **Consider multiple angles**: Topic, industry, mood, energy level, target audience, brand identity (if mentioned)
- **Be adventurous**: Try unexpected combinations - a healthcare presentation doesn't have to be green, finance doesn't have to be navy
- **Build your palette**: Pick 3-5 colors that work together (dominant colors + supporting tones + accent)
- **Ensure contrast**: Text must be clearly readable on backgrounds
**Example color palettes** (use these to spark creativity - choose one, adapt it, or create your own):
1. **Classic Blue**: Deep navy (#1C2833), slate gray (#2E4053), silver (#AAB7B8), off-white (#F4F6F6)
2. **Teal & Coral**: Teal (#5EA8A7), deep teal (#277884), coral (#FE4447), white (#FFFFFF)
3. **Bold Red**: Red (#C0392B), bright red (#E74C3C), orange (#F39C12), yellow (#F1C40F), green (#2ECC71)
4. **Warm Blush**: Mauve (#A49393), blush (#EED6D3), rose (#E8B4B8), cream (#FAF7F2)
5. **Burgundy Luxury**: Burgundy (#5D1D2E), crimson (#951233), rust (#C15937), gold (#997929)
6. **Deep Purple & Emerald**: Purple (#B165FB), dark blue (#181B24), emerald (#40695B), white (#FFFFFF)
7. **Cream & Forest Green**: Cream (#FFE1C7), forest green (#40695B), white (#FCFCFC)
8. **Pink & Purple**: Pink (#F8275B), coral (#FF574A), rose (#FF737D), purple (#3D2F68)
9. **Lime & Plum**: Lime (#C5DE82), plum (#7C3A5F), coral (#FD8C6E), blue-gray (#98ACB5)
10. **Black & Gold**: Gold (#BF9A4A), black (#000000), cream (#F4F6F6)
11. **Sage & Terracotta**: Sage (#87A96B), terracotta (#E07A5F), cream (#F4F1DE), charcoal (#2C2C2C)
12. **Charcoal & Red**: Charcoal (#292929), red (#E33737), light gray (#CCCBCB)
13. **Vibrant Orange**: Orange (#F96D00), light gray (#F2F2F2), charcoal (#222831)
14. **Forest Green**: Black (#191A19), green (#4E9F3D), dark green (#1E5128), white (#FFFFFF)
15. **Retro Rainbow**: Purple (#722880), pink (#D72D51), orange (#EB5C18), amber (#F08800), gold (#DEB600)
16. **Vintage Earthy**: Mustard (#E3B448), sage (#CBD18F), forest green (#3A6B35), cream (#F4F1DE)
17. **Coastal Rose**: Old rose (#AD7670), beaver (#B49886), eggshell (#F3ECDC), ash gray (#BFD5BE)
18. **Orange & Turquoise**: Light orange (#FC993E), grayish turquoise (#667C6F), white (#FCFCFC)
#### Visual Details Options
**Geometric Patterns**:
- Diagonal section dividers instead of horizontal
- Asymmetric column widths (30/70, 40/60, 25/75)
- Rotated text headers at 90° or 270°
- Circular/hexagonal frames for images
- Triangular accent shapes in corners
- Overlapping shapes for depth
**Border & Frame Treatments**:
- Thick single-color borders (10-20pt) on one side only
- Double-line borders with contrasting colors
- Corner brackets instead of full frames
- L-shaped borders (top+left or bottom+right)
- Underline accents beneath headers (3-5pt thick)
**Typography Treatments**:
- Extreme size contrast (72pt headlines vs 11pt body)
- All-caps headers with wide letter spacing
- Numbered sections in oversized display type
- Monospace (Courier New) for data/stats/technical content
- Condensed fonts (Arial Narrow) for dense information
- Outlined text for emphasis
**Chart & Data Styling**:
- Monochrome charts with single accent color for key data
- Horizontal bar charts instead of vertical
- Dot plots instead of bar charts
- Minimal gridlines or none at all
- Data labels directly on elements (no legends)
- Oversized numbers for key metrics
**Layout Innovations**:
- Full-bleed images with text overlays
- Sidebar column (20-30% width) for navigation/context
- Modular grid systems (3×3, 4×4 blocks)
- Z-pattern or F-pattern content flow
- Floating text boxes over colored shapes
- Magazine-style multi-column layouts
**Background Treatments**:
- Solid color blocks occupying 40-60% of slide
- Gradient fills (vertical or diagonal only)
- Split backgrounds (two colors, diagonal or vertical)
- Edge-to-edge color bands
- Negative space as a design element
### Layout Tips
**When creating slides with charts or tables:**
- **Two-column layout (PREFERRED)**: Use a header spanning the full width, then two columns below - text/bullets in one column and the featured content in the other. This provides better balance and makes charts/tables more readable. Use flexbox with unequal column widths (e.g., 40%/60% split) to optimize space for each content type.
- **Full-slide layout**: Let the featured content (chart/table) take up the entire slide for maximum impact and readability
- **NEVER vertically stack**: Do not place charts/tables below text in a single column - this causes poor readability and layout issues
### Workflow
1. **MANDATORY - READ ENTIRE FILE**: Read [`html2pptx.md`](html2pptx.md) completely from start to finish. **NEVER set any range limits when reading this file.** Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with presentation creation.
2. Create an HTML file for each slide with proper dimensions (e.g., 720pt × 405pt for 16:9)
- Use `<p>`, `<h1>`-`<h6>`, `<ul>`, `<ol>` for all text content
- Use `class="placeholder"` for areas where charts/tables will be added (render with gray background for visibility)
- **CRITICAL**: Rasterize gradients and icons as PNG images FIRST using Sharp, then reference in HTML
- **LAYOUT**: For slides with charts/tables/images, use either full-slide layout or two-column layout for better readability
3. Create and run a JavaScript file using the [`html2pptx.js`](scripts/html2pptx.js) library to convert HTML slides to PowerPoint and save the presentation
- Use the `html2pptx()` function to process each HTML file
- Add charts and tables to placeholder areas using PptxGenJS API
- Save the presentation using `pptx.writeFile()`
4. **Visual validation**: Generate thumbnails and inspect for layout issues
- Create thumbnail grid: `python scripts/thumbnail.py output.pptx workspace/thumbnails --cols 4`
- Read and carefully examine the thumbnail image for:
* Text overflow or truncation
* Misaligned elements
* Incorrect colors or fonts
* Missing content
* Layout problems
- If issues found, diagnose and fix before proceeding
## Creating a new PowerPoint presentation **from a template**
When given a PowerPoint template, you can create a new presentation by replacing the text content in the template slides.
### Workflow
1. **Unpack the template**: Extract the template's XML structure
```bash
python ooxml/scripts/unpack.py template.pptx unpacked_template
```
2. **Read the presentation structure**: Read `unpacked_template/ppt/presentation.xml` to understand the overall structure and slide references
3. **Examine template slides**: Check the first few slide XML files to understand the structure
```bash
# View slide structure
python -c "from lxml import etree; tree = etree.parse('unpacked_template/ppt/slides/slide1.xml'); print(etree.tostring(tree, pretty_print=True, encoding='unicode'))"
```
4. **Copy template to working file**: Make a copy of the template for editing
```bash
cp template.pptx working.pptx
```
5. **Generate text shape inventory**:
```bash
python scripts/inventory.py working.pptx > template-inventory.json
```
The inventory provides a structured view of ALL text shapes in the presentation:
```json
{
"slide-0": {
"shape-0": {
"shape_id": "2",
"shape_name": "Title 1",
"placeholder_type": "TITLE",
"text_content": "Original title text here...",
"default_font_size": 44.0,
"default_font_name": "Calibri Light"
},
"shape-1": {
"shape_id": "3",
"shape_name": "Content Placeholder 2",
"placeholder_type": "BODY",
"text_content": "Original content text...",
"default_font_size": 18.0
}
},
"slide-1": {
...
}
}
```
**Understanding the inventory**:
- Each slide is identified as "slide-N" (zero-indexed)
- Each text shape within a slide is identified as "shape-N" (zero-indexed by occurrence)
- `placeholder_type` indicates the shape's role: TITLE, BODY, SUBTITLE, etc.
- `text_content` shows the current text (useful for identifying which shape to replace)
- `default_font_size` and `default_font_name` show the shape's default formatting
6. **Create replacement text JSON**: Based on the inventory, create a JSON file specifying which shapes to update with new text
- **IMPORTANT**: Reference shapes using the slide and shape identifiers from the inventory (e.g., "slide-0", "shape-1")
- **CRITICAL**: Each shape's "paragraphs" field must contain **properly formatted paragraph objects**, not plain text strings
- Each paragraph object can include:
- `text`: The actual text content (required)
- `alignment`: Text alignment (e.g., "CENTER", "LEFT", "RIGHT")
- `bold`: Boolean for bold text
- `italic`: Boolean for italic text
- `bullet`: Boolean to enable bullet points (when true, `level` is also required)
- `level`: Integer for bullet indent level (0 = no indent, 1 = first level, etc.)
- `font_size`: Float for custom font size
- `font_name`: String for custom font name
- `color`: String for RGB color (e.g., "FF0000" for red)
- `theme_color`: String for theme-based color (e.g., "DARK_1", "ACCENT_1")
- **IMPORTANT**: When bullet: true, do NOT include bullet symbols (•, -, *) in text - they're added automatically
- **ESSENTIAL FORMATTING RULES**:
- Headers/titles should typically have `"bold": true`
- List items should have `"bullet": true, "level": 0` (level is required when bullet is true)
- Preserve any alignment properties (e.g., `"alignment": "CENTER"` for centered text)
- Include font properties when different from default (e.g., `"font_size": 14.0`, `"font_name": "Lora"`)
- Colors: Use `"color": "FF0000"` for RGB or `"theme_color": "DARK_1"` for theme colors
- The replacement script expects **properly formatted paragraphs**, not just text strings
- **Overlapping shapes**: Prefer shapes with larger default_font_size or more appropriate placeholder_type
- Save the updated inventory with replacements to `replacement-text.json`
- **WARNING**: Different template layouts have different shape counts - always check the actual inventory before creating replacements
Example paragraphs field showing proper formatting:
```json
"paragraphs": [
{
"text": "New presentation title text",
"alignment": "CENTER",
"bold": true
},
{
"text": "Section Header",
"bold": true
},
{
"text": "First bullet point without bullet symbol",
"bullet": true,
"level": 0
},
{
"text": "Red colored text",
"color": "FF0000"
},
{
"text": "Theme colored text",
"theme_color": "DARK_1"
},
{
"text": "Regular paragraph text without special formatting"
}
]
```
**Shapes not listed in the replacement JSON are automatically cleared**:
```json
{
"slide-0": {
"shape-0": {
"paragraphs": [...] // This shape gets new text
}
// shape-1 and shape-2 from inventory will be cleared automatically
}
}
```
**Common formatting patterns for presentations**:
- Title slides: Bold text, sometimes centered
- Section headers within slides: Bold text
- Bullet lists: Each item needs `"bullet": true, "level": 0`
- Body text: Usually no special properties needed
- Quotes: May have special alignment or font properties
7. **Apply replacements using the `replace.py` script**
```bash
python scripts/replace.py working.pptx replacement-text.json output.pptx
```
The script will:
- First extract the inventory of ALL text shapes using functions from inventory.py
- Validate that all shapes in the replacement JSON exist in the inventory
- Clear text from ALL shapes identified in the inventory
- Apply new text only to shapes with "paragraphs" defined in the replacement JSON
- Preserve formatting by applying paragraph properties from the JSON
- Handle bullets, alignment, font properties, and colors automatically
- Save the updated presentation
Example validation errors:
```
ERROR: Invalid shapes in replacement JSON:
- Shape 'shape-99' not found on 'slide-0'. Available shapes: shape-0, shape-1, shape-4
- Slide 'slide-999' not found in inventory
```
```
ERROR: Replacement text made overflow worse in these shapes:
- slide-0/shape-2: overflow worsened by 1.25" (was 0.00", now 1.25")
```
## Creating Thumbnail Grids
To create visual thumbnail grids of PowerPoint slides for quick analysis and reference:
```bash
python scripts/thumbnail.py template.pptx [output_prefix]
```
**Features**:
- Creates: `thumbnails.jpg` (or `thumbnails-1.jpg`, `thumbnails-2.jpg`, etc. for large decks)
- Default: 5 columns, max 30 slides per grid (5×6)
- Custom prefix: `python scripts/thumbnail.py template.pptx my-grid`
- Note: The output prefix should include the path if you want output in a specific directory (e.g., `workspace/my-grid`)
- Adjust columns: `--cols 4` (range: 3-6, affects slides per grid)
- Grid limits: 3 cols = 12 slides/grid, 4 cols = 20, 5 cols = 30, 6 cols = 42
- Slides are zero-indexed (Slide 0, Slide 1, etc.)
**Use cases**:
- Template analysis: Quickly understand slide layouts and design patterns
- Content review: Visual overview of entire presentation
- Navigation reference: Find specific slides by their visual appearance
- Quality check: Verify all slides are properly formatted
**Examples**:
```bash
# Basic usage
python scripts/thumbnail.py presentation.pptx
# Combine options: custom name, columns
python scripts/thumbnail.py template.pptx analysis --cols 4
```
## Converting Slides to Images
To visually analyze PowerPoint slides, convert them to images using a two-step process:
1. **Convert PPTX to PDF**:
```bash
soffice --headless --convert-to pdf template.pptx
```
2. **Convert PDF pages to JPEG images**:
```bash
pdftoppm -jpeg -r 150 template.pdf slide
```
This creates files like `slide-1.jpg`, `slide-2.jpg`, etc.
Options:
- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
- `-l N`: Last page to convert (e.g., `-l 5` stops at page 5)
- `slide`: Prefix for output files
Example for specific range:
```bash
pdftoppm -jpeg -r 150 -f 2 -l 5 template.pdf slide # Converts only pages 2-5
```
## Code Style Guidelines
**IMPORTANT**: When generating code for PPTX operations:
- Write concise code
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
## Dependencies
Required dependencies (should already be installed):
- **markitdown**: `pip install "markitdown[pptx]"` (for text extraction from presentations)
- **pptxgenjs**: `npm install -g pptxgenjs` (for creating presentations via html2pptx)
- **playwright**: `npm install -g playwright` (for HTML rendering in html2pptx)
- **react-icons**: `npm install -g react-icons react react-dom` (for icons)
- **sharp**: `npm install -g sharp` (for SVG rasterization and image processing)
- **LibreOffice**: `sudo apt-get install libreoffice` (for PDF conversion)
- **Poppler**: `sudo apt-get install poppler-utils` (for pdftoppm to convert PDF to images)

289
Anthropic/xlsx.md Normal file
View File

@@ -0,0 +1,289 @@
# Excel Spreadsheet Handler (/mnt/skills/public/xlsx/SKILL.md)
---
name: Excel Spreadsheet Handler
description: Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization
when_to_use: "When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas"
version: 0.0.1
dependencies: openpyxl, pandas
---
# Requirements for Outputs
## All Excel files
### Zero Formula Errors
- Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?)
### Preserve Existing Templates (when updating templates)
- Study and EXACTLY match existing format, style, and conventions when modifying files
- Never impose standardized formatting on files with established patterns
- Existing template conventions ALWAYS override these guidelines
## Financial models
### Color Coding Standards
Unless otherwise stated by the user or existing template
#### Industry-Standard Color Conventions
- **Blue text (RGB: 0,0,255)**: Hardcoded inputs, and numbers users will change for scenarios
- **Black text (RGB: 0,0,0)**: ALL formulas and calculations
- **Green text (RGB: 0,128,0)**: Links pulling from other worksheets within same workbook
- **Red text (RGB: 255,0,0)**: External links to other files
- **Yellow background (RGB: 255,255,0)**: Key assumptions needing attention or cells that need to be updated
### Number Formatting Standards
#### Required Format Rules
- **Years**: Format as text strings (e.g., "2024" not "2,024")
- **Currency**: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)")
- **Zeros**: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-")
- **Percentages**: Default to 0.0% format (one decimal)
- **Multiples**: Format as 0.0x for valuation multiples (EV/EBITDA, P/E)
- **Negative numbers**: Use parentheses (123) not minus -123
### Formula Construction Rules
#### Assumptions Placement
- Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells
- Use cell references instead of hardcoded values in formulas
- Example: Use =B5*(1+$B$6) instead of =B5*1.05
#### Formula Error Prevention
- Verify all cell references are correct
- Check for off-by-one errors in ranges
- Ensure consistent formulas across all projection periods
- Test with edge cases (zero values, negative numbers)
- Verify no unintended circular references
#### Documentation Requirements for Hardcodes
- Comment or in cells beside (if end of table). Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]"
- Examples:
- "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]"
- "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]"
- "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity"
- "Source: FactSet, 8/20/2025, Consensus Estimates Screen"
# XLSX creation, editing, and analysis
## Overview
A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks.
## Important Requirements
**LibreOffice Required for Formula Recalculation**: You can assume LibreOffice is installed for recalculating formula values using the `recalc.py` script. The script automatically configures LibreOffice on first run
## Reading and analyzing data
### Data analysis with pandas
For data analysis, visualization, and basic operations, use **pandas** which provides powerful data manipulation capabilities:
```python
import pandas as pd
# Read Excel
df = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
# Analyze
df.head() # Preview data
df.info() # Column info
df.describe() # Statistics
# Write Excel
df.to_excel('output.xlsx', index=False)
```
## Excel File Workflows
## CRITICAL: Use Formulas, Not Hardcoded Values
**Always use Excel formulas instead of calculating values in Python and hardcoding them.** This ensures the spreadsheet remains dynamic and updateable.
### ❌ WRONG - Hardcoding Calculated Values
```python
# Bad: Calculating in Python and hardcoding result
total = df['Sales'].sum()
sheet['B10'] = total # Hardcodes 5000
# Bad: Computing growth rate in Python
growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue']
sheet['C5'] = growth # Hardcodes 0.15
# Bad: Python calculation for average
avg = sum(values) / len(values)
sheet['D20'] = avg # Hardcodes 42.5
```
### ✅ CORRECT - Using Excel Formulas
```python
# Good: Let Excel calculate the sum
sheet['B10'] = '=SUM(B2:B9)'
# Good: Growth rate as Excel formula
sheet['C5'] = '=(C4-C2)/C2'
# Good: Average using Excel function
sheet['D20'] = '=AVERAGE(D2:D19)'
```
This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes.
## Common Workflow
1. **Choose tool**: pandas for data, openpyxl for formulas/formatting
2. **Create/Load**: Create new workbook or load existing file
3. **Modify**: Add/edit data, formulas, and formatting
4. **Save**: Write to file
5. **Recalculate formulas (MANDATORY IF USING FORMULAS)**: Use the recalc.py script
```bash
python recalc.py output.xlsx
```
6. **Verify and fix any errors**:
- The script returns JSON with error details
- If `status` is `errors_found`, check `error_summary` for specific error types and locations
- Fix the identified errors and recalculate again
- Common errors to fix:
- `#REF!`: Invalid cell references
- `#DIV/0!`: Division by zero
- `#VALUE!`: Wrong data type in formula
- `#NAME?`: Unrecognized formula name
### Creating new Excel files
```python
# Using openpyxl for formulas and formatting
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active
# Add data
sheet['A1'] = 'Hello'
sheet['B1'] = 'World'
sheet.append(['Row', 'of', 'data'])
# Add formula
sheet['B2'] = '=SUM(A1:A10)'
# Formatting
sheet['A1'].font = Font(bold=True, color='FF0000')
sheet['A1'].fill = PatternFill('solid', start_color='FFFF00')
sheet['A1'].alignment = Alignment(horizontal='center')
# Column width
sheet.column_dimensions['A'].width = 20
wb.save('output.xlsx')
```
### Editing existing Excel files
```python
# Using openpyxl to preserve formulas and formatting
from openpyxl import load_workbook
# Load existing file
wb = load_workbook('existing.xlsx')
sheet = wb.active # or wb['SheetName'] for specific sheet
# Working with multiple sheets
for sheet_name in wb.sheetnames:
sheet = wb[sheet_name]
print(f"Sheet: {sheet_name}")
# Modify cells
sheet['A1'] = 'New Value'
sheet.insert_rows(2) # Insert row at position 2
sheet.delete_cols(3) # Delete column 3
# Add new sheet
new_sheet = wb.create_sheet('NewSheet')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
```
## Recalculating formulas
Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided `recalc.py` script to recalculate formulas:
```bash
python recalc.py <excel_file> [timeout_seconds]
```
Example:
```bash
python recalc.py output.xlsx 30
```
The script:
- Automatically sets up LibreOffice macro on first run
- Recalculates all formulas in all sheets
- Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.)
- Returns JSON with detailed error locations and counts
- Works on both Linux and macOS
## Formula Verification Checklist
Quick checks to ensure formulas work correctly:
### Essential Verification
- [ ] **Test 2-3 sample references**: Verify they pull correct values before building full model
- [ ] **Column mapping**: Confirm Excel columns match (e.g., column 64 = BL, not BK)
- [ ] **Row offset**: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6)
### Common Pitfalls
- [ ] **NaN handling**: Check for null values with `pd.notna()`
- [ ] **Far-right columns**: FY data often in columns 50+
- [ ] **Multiple matches**: Search all occurrences, not just first
- [ ] **Division by zero**: Check denominators before using `/` in formulas (#DIV/0!)
- [ ] **Wrong references**: Verify all cell references point to intended cells (#REF!)
- [ ] **Cross-sheet references**: Use correct format (Sheet1!A1) for linking sheets
### Formula Testing Strategy
- [ ] **Start small**: Test formulas on 2-3 cells before applying broadly
- [ ] **Verify dependencies**: Check all cells referenced in formulas exist
- [ ] **Test edge cases**: Include zero, negative, and very large values
### Interpreting recalc.py Output
The script returns JSON with error details:
```json
{
"status": "success", // or "errors_found"
"total_errors": 0, // Total error count
"total_formulas": 42, // Number of formulas in file
"error_summary": { // Only present if errors found
"#REF!": {
"count": 2,
"locations": ["Sheet1!B5", "Sheet1!C10"]
}
}
}
```
## Best Practices
### Library Selection
- **pandas**: Best for data analysis, bulk operations, and simple data export
- **openpyxl**: Best for complex formatting, formulas, and Excel-specific features
### Working with openpyxl
- Cell indices are 1-based (row=1, column=1 refers to cell A1)
- Use `data_only=True` to read calculated values: `load_workbook('file.xlsx', data_only=True)`
- **Warning**: If opened with `data_only=True` and saved, formulas are replaced with values and permanently lost
- For large files: Use `read_only=True` for reading or `write_only=True` for writing
- Formulas are preserved but not evaluated - use recalc.py to update values
### Working with pandas
- Specify data types to avoid inference issues: `pd.read_excel('file.xlsx', dtype={'id': str})`
- For large files, read specific columns: `pd.read_excel('file.xlsx', usecols=['A', 'C', 'E'])`
- Handle dates properly: `pd.read_excel('file.xlsx', parse_dates=['date_column'])`
## Code Style Guidelines
**IMPORTANT**: When generating Python code for Excel operations:
- Write minimal, concise Python code without unnecessary comments
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
**For Excel files themselves**:
- Add comments to cells with complex formulas or important assumptions
- Document data sources for hardcoded values
- Include notes for key calculations and model sections