Case Study
Concurrent Broken-Link Sweeper
Zero-dependency Go CLI that extracts URLs from Markdown/HTML, then sweeps them concurrently with goroutines, semaphores, and defensive HTTP handling.
- Go
- goroutines
- net/http
- sync
- regexp
- go
- golang
- concurrency
- networking
- seo
- zero-dependency
Overview
For the fifth build in this series, we use Go (Golang). Google designed the language with concurrency and networked services in mind, which makes it a natural fit for a network sweeper.
This project is an exercise in systems-level programming. It recursively scans a local directory of Markdown or HTML files, extracts every external URL with a regular expression, and then uses goroutines plus a semaphore (a buffered channel) to check the links concurrently over HTTP. The semaphore caps the sweep at 15 in-flight requests, so dozens of sites can be checked in parallel without exhausting your machine's sockets.
Like Projects 1–4, this uses zero external dependencies — relying entirely on Go’s standard library (net/http, sync, regexp).
What it implements
- Recursive directory scan: walks .md and .html files for external URLs
- Regex URL extraction: deduplicates via map keys before any network I/O
- Concurrent sweeper: sync.WaitGroup + buffered channel semaphore (15 concurrent requests)
- Defensive HTTP client: 5s timeouts, TLS bypass for legacy sites, HEAD with GET fallback on 405
- Thread-safe reporting: mutex-protected broken-link collection and color-coded terminal output
Project setup
1. Install Go
Grab Go from go.dev if needed.
2. Initialize the module
mkdir link-sweeper && cd link-sweeper
go mod init sweeper
3. Sample Markdown file
Create test_post.md:
# Welcome to my blog
Check out my [GitHub](https://github.com) and my [cool project](https://this-website-definitely-does-not-exist-2026.com).
Here is a [Google link](https://google.com) and a [broken link](http://httpstat.us/404).
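Before running any network checks, it can help to see what the extraction step pulls out of this sample. The sketch below reuses the same pattern as urlRegex in main.go and applies it to the sample text. The helper name extractUnique and the extra "Duplicate" line are assumptions added for illustration, not part of the project.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Same pattern as urlRegex in main.go.
var urlRegex = regexp.MustCompile(`https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}(?:/[^"'>\)\s]*)?`)

// extractUnique is a hypothetical helper: it returns the unique URLs found
// in text, sorted so the output order is stable (map iteration is random).
func extractUnique(text string) []string {
	seen := make(map[string]struct{})
	for _, m := range urlRegex.FindAllString(text, -1) {
		seen[m] = struct{}{} // map keys deduplicate repeated links
	}
	out := make([]string, 0, len(seen))
	for u := range seen {
		out = append(out, u)
	}
	sort.Strings(out)
	return out
}

func main() {
	sample := "# Welcome to my blog\n" +
		"Check out my [GitHub](https://github.com) and my [cool project](https://this-website-definitely-does-not-exist-2026.com).\n" +
		"Here is a [Google link](https://google.com) and a [broken link](http://httpstat.us/404).\n" +
		"Duplicate: https://github.com\n" // assumed extra line to show deduplication

	for _, u := range extractUnique(sample) {
		fmt.Println(u)
	}
}
```

The closing parenthesis of each Markdown link ends the match, so the four distinct URLs come out clean. One limitation worth knowing: a bare URL with a path that ends a sentence, such as "https://example.com/page.", keeps the trailing period, because the path portion of the pattern does not exclude dots.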
The code (main.go)
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"regexp"
	"sync"
	"time"
)

// --- ANSI Terminal Colors ---
const (
	Reset  = "\033[0m"
	Red    = "\033[91m"
	Green  = "\033[92m"
	Yellow = "\033[93m"
	Cyan   = "\033[96m"
	Bold   = "\033[1m"
)

// Regex to find external HTTP/HTTPS URLs in text
var urlRegex = regexp.MustCompile(`https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}(?:/[^"'>\)\s]*)?`)

// --- Custom HTTP Client ---
// The strict timeout keeps a slow or unresponsive host from hanging the sweep.
// InsecureSkipVerify disables certificate verification so that older sites with
// expired or self-signed certificates are not reported as broken. Only do this
// in a tool like this one, where responses are checked for status and discarded.
var httpClient = &http.Client{
	Timeout: 5 * time.Second,
	Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	},
}

func main() {
	if len(os.Args) < 2 {
		fmt.Printf("%sUsage: go run main.go <directory_path>%s\n", Yellow, Reset)
		os.Exit(1)
	}

	targetDir := os.Args[1]
	fmt.Printf("%s🚀 Initializing Concurrent Link Sweeper in: %s%s\n", Cyan, targetDir, Reset)

	uniqueURLs := extractURLsFromDir(targetDir)
	if len(uniqueURLs) == 0 {
		fmt.Printf("%sNo URLs found in the specified directory.%s\n", Yellow, Reset)
		return
	}
	fmt.Printf("Found %s%d unique URLs%s. Commencing concurrent sweep...\n\n", Bold, len(uniqueURLs), Reset)

	var wg sync.WaitGroup
	// Buffered channel used as a counting semaphore: at most 15 checks in flight.
	semaphore := make(chan struct{}, 15)
	var brokenLinks []string
	var mu sync.Mutex // guards brokenLinks

	startTime := time.Now()
	for url := range uniqueURLs {
		wg.Add(1)
		go func(targetURL string) {
			defer wg.Done()

			semaphore <- struct{}{} // acquire a slot (blocks while 15 are busy)
			isBroken, statusCode := checkURL(targetURL)
			<-semaphore // release the slot

			if isBroken {
				// A status of 0 means the request failed before any response arrived.
				fmt.Printf(" %s[❌ BROKEN ]%s %s (Status: %d)\n", Red, Reset, targetURL, statusCode)
				mu.Lock()
				brokenLinks = append(brokenLinks, targetURL)
				mu.Unlock()
			} else {
				fmt.Printf(" %s[✅ ONLINE ]%s %s\n", Green, Reset, targetURL)
			}
		}(url)
	}
	wg.Wait()

	duration := time.Since(startTime)
	fmt.Printf("\n%s--- Sweep Complete ---%s\n", Cyan, Reset)
	fmt.Printf("Time elapsed: %v\n", duration)
	fmt.Printf("Total Links Scanned: %d\n", len(uniqueURLs))
	if len(brokenLinks) > 0 {
		fmt.Printf("Broken Links Found: %s%d%s\n", Red, len(brokenLinks), Reset)
	} else {
		fmt.Printf("Broken Links Found: %s0%s 🎉\n", Green, Reset)
	}
}

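// extractURLsFromDir walks dir recursively and returns the set of unique
// external URLs found in .md and .html files.
func extractURLsFromDir(dir string) map[string]struct{} {
	urls := make(map[string]struct{})
	walkErr := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err // stop the walk; reported below
		}
		ext := filepath.Ext(path)
		if !info.IsDir() && (ext == ".md" || ext == ".html") {
			content, err := os.ReadFile(path)
			if err != nil {
				return nil // skip unreadable files and keep walking
			}
			for _, match := range urlRegex.FindAllString(string(content), -1) {
				urls[match] = struct{}{} // map keys deduplicate URLs
			}
		}
		return nil
	})
	if walkErr != nil {
		// Report the failure instead of silently returning a partial result.
		fmt.Fprintf(os.Stderr, "%sWarning: directory walk stopped early: %v%s\n", Yellow, walkErr, Reset)
	}
	return urls
}
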
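// checkURL reports whether url is broken and the final HTTP status code.
// A status code of 0 means the request itself failed (DNS, timeout, refused).
func checkURL(url string) (isBroken bool, statusCode int) {
	// HEAD first: it avoids downloading the response body.
	resp, err := httpClient.Head(url)
	if err != nil {
		return true, 0
	}
	defer resp.Body.Close()

	// Some servers reject HEAD with 405 Method Not Allowed; retry with GET.
	if resp.StatusCode == http.StatusMethodNotAllowed {
		respGet, errGet := httpClient.Get(url)
		if errGet != nil {
			return true, 0
		}
		defer respGet.Body.Close()
		return respGet.StatusCode >= 400, respGet.StatusCode
	}
	return resp.StatusCode >= 400, resp.StatusCode
}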
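The HEAD with GET fallback can be exercised without touching the public internet. The following is a minimal sketch using the standard library's net/http/httptest package. The server routes (/ok, /missing, /get-only), the helper newTestServer, and the plain client without the TLS override are assumptions made for this example; checkURL is copied from main.go with only the client name changed.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Plain client for the local test server (no TLS override needed here).
var client = &http.Client{Timeout: 5 * time.Second}

// checkURL follows the same logic as the function in main.go.
func checkURL(url string) (isBroken bool, statusCode int) {
	resp, err := client.Head(url)
	if err != nil {
		return true, 0
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusMethodNotAllowed {
		respGet, errGet := client.Get(url)
		if errGet != nil {
			return true, 0
		}
		defer respGet.Body.Close()
		return respGet.StatusCode >= 400, respGet.StatusCode
	}
	return resp.StatusCode >= 400, resp.StatusCode
}

// newTestServer is a hypothetical fixture with three routes:
// /ok answers 200, /missing answers 404, /get-only rejects non-GET with 405.
func newTestServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/missing", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	})
	mux.HandleFunc("/get-only", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	for _, path := range []string{"/ok", "/missing", "/get-only"} {
		broken, code := checkURL(srv.URL + path)
		fmt.Printf("%-10s broken=%v status=%d\n", path, broken, code)
	}
}
```

The /get-only route is the interesting case: the HEAD request receives 405, the function retries with GET, and the link is reported as online with status 200. Note that the fallback triggers only on 405; a server that answers HEAD with some other error status would still be reported as broken.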
Execution
Run against the current directory to parse test_post.md:
go run main.go .
Compile to a production binary:
go build -o sweeper main.go
./sweeper .
Point it at this site’s content tree for a real-world sweep:
./sweeper /path/to/davidcolecloud/src/content
Why this shines on a portfolio
- Goroutines and channels: not a sequential for loop waiting on each site. sync.WaitGroup plus a semaphore channel explicitly caps the number of in-flight requests, exactly what infrastructure teams look for.
- Production-ready defensive coding: strict timeouts, TLS bypass for legacy sites, and automatic HEAD → GET fallback for finicky servers. Real-world network failure handling, not happy-path-only code.
The technical sandbox (Projects 1–8)
These micro-projects form a growing technical sandbox for davidcole.cloud:
| # | Stack | Role |
|---|---|---|
| 1 | TypeScript / Node | Serve Markdown APIs |
| 2 | Python | Local AI log analytics |
| 3 | Java | Modernize legacy flat-file data |
| 4 | SQL / PostgreSQL | Recursive hierarchies natively in the database |
| 5 | Go | Sweep the network concurrently |
| 6 | Python | MCP-style context server for local Linux agents |
| 7 | Python | 2PC fleet migrations for isolated tenant databases |
| 8 | Python | Raw TCP sockets with asyncio event-loop concurrency |
Together they demonstrate senior-level full-stack and systems engineering, which is useful context for hiring managers, consulting clients, and technical peers reviewing this portfolio.
Related work
- Project 4: Recursive Hierarchy Builder — database-side tree compilation
- Local-first SEO — why link health matters for client sites
- Project 1: Markdown API Router — the content layer this sweeper validates
- Project 6: Local Context Server — MCP-style agent bridge on the same homelab stack