Case Study

Concurrent Broken-Link Sweeper

Zero-dependency Go CLI that extracts URLs from Markdown/HTML, then sweeps them concurrently with goroutines, semaphores, and defensive HTTP handling.

  • Go
  • goroutines
  • net/http
  • sync
  • regexp
  • go
  • golang
  • concurrency
  • networking
  • seo
  • zero-dependency

Overview

For the final build in this series, we use Go (Golang). Go was designed at Google with concurrency primitives (goroutines and channels) built into the language and a capable networking stack in the standard library, which makes it a natural fit for a network sweeper.

This project demonstrates concurrent network programming using only the standard library. It recursively scans a local directory for Markdown and HTML files, extracts all external URLs with a regular expression, and uses goroutines plus a semaphore pattern (a buffered channel) to check dozens of URLs concurrently. The semaphore caps the number of in-flight requests, so the sweep does not exhaust your machine’s sockets or ephemeral ports.

Like Projects 1–4, this uses zero external dependencies — relying entirely on Go’s standard library (net/http, sync, regexp).
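The semaphore pattern described above can be sketched in isolation. This is a minimal illustration, not the project's code: runBounded and the sleeping tasks are hypothetical stand-ins for real HTTP checks, and the peak counter exists only to show that the cap holds.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded starts one goroutine per task but lets at most `limit` of them
// execute at once. It returns the highest number of tasks that were ever
// running at the same moment.
func runBounded(limit int, tasks []func()) int64 {
	var wg sync.WaitGroup
	var inFlight, peak int64
	sem := make(chan struct{}, limit) // buffered channel used as a counting semaphore

	for _, task := range tasks {
		wg.Add(1)
		go func(run func()) {
			defer wg.Done()

			sem <- struct{}{}        // acquire: blocks while `limit` tasks are running
			defer func() { <-sem }() // release the slot when this task is done

			n := atomic.AddInt64(&inFlight, 1)
			// Record a new peak if this task pushed the count higher.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			run()
			atomic.AddInt64(&inFlight, -1)
		}(task)
	}

	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	// 50 hypothetical tasks, each simulating a slow network call.
	tasks := make([]func(), 50)
	for i := range tasks {
		tasks[i] = func() { time.Sleep(5 * time.Millisecond) }
	}
	peak := runBounded(15, tasks)
	fmt.Printf("peak concurrency: %d (limit 15)\n", peak)
}
```

Note that every goroutine is created up front and only the work inside the semaphore is capped. For a few hundred URLs that is cheap; for very large inputs, acquiring the slot before starting the goroutine would bound the goroutine count as well.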

What it implements

  • Recursive directory scan: walks the directory tree and reads every .md and .html file
  • Regex URL extraction: collects external URLs and deduplicates them via map keys before any network I/O
  • Concurrent sweeper: sync.WaitGroup + buffered channel semaphore (at most 15 requests in flight)
  • Defensive HTTP client: 5s timeouts, TLS certificate verification disabled so legacy sites with expired or self-signed certificates still get checked, HEAD with GET fallback on 405
  • Thread-safe reporting: mutex-protected broken-link collection and color-coded terminal output

Project setup

1. Install Go

Grab Go from go.dev if needed.

2. Initialize the module

mkdir link-sweeper && cd link-sweeper
go mod init sweeper

3. Sample Markdown file

Create test_post.md:

# Welcome to my blog
Check out my [GitHub](https://github.com) and my [cool project](https://this-website-definitely-does-not-exist-2026.com).
Here is a [Google link](https://google.com) and a [broken link](http://httpstat.us/404).
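To see what the extractor should pick up from this file, the urlRegex pattern from main.go can be run against the same text on its own. This is a small sketch: extractUnique is a hypothetical helper name, and the sorting is added only to make the output stable.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Same pattern as urlRegex in main.go.
var urlRegex = regexp.MustCompile(`https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}(?:/[^"'>\)\s]*)?`)

// extractUnique returns the distinct URLs found in text, sorted for stable output.
func extractUnique(text string) []string {
	seen := make(map[string]struct{}) // map keys deduplicate repeated URLs
	for _, m := range urlRegex.FindAllString(text, -1) {
		seen[m] = struct{}{}
	}
	urls := make([]string, 0, len(seen))
	for u := range seen {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	return urls
}

func main() {
	sample := `# Welcome to my blog
Check out my [GitHub](https://github.com) and my [cool project](https://this-website-definitely-does-not-exist-2026.com).
Here is a [Google link](https://google.com) and a [broken link](http://httpstat.us/404).`

	for _, u := range extractUnique(sample) {
		fmt.Println(u)
	}
}
```

The closing parenthesis of each Markdown link is excluded because the path part of the pattern stops at `)`. One limitation to keep in mind: a bare URL with a path that is followed by sentence punctuation, such as `https://example.com/page.`, keeps the trailing dot, since `.` is a legal path character in this pattern.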

The code (main.go)

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"regexp"
	"sync"
	"time"
)

// --- ANSI Terminal Colors ---
const (
	Reset  = "\033[0m"
	Red    = "\033[91m"
	Green  = "\033[92m"
	Yellow = "\033[93m"
	Cyan   = "\033[96m"
	Bold   = "\033[1m"
)

// Regex to find external HTTP/HTTPS URLs in text
var urlRegex = regexp.MustCompile(`https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}(?:/[^"'>\)\s]*)?`)

// --- Custom HTTP Client ---
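// A custom client enforces a strict 5-second timeout so a slow or unresponsive
// server cannot hang the sweeper. Certificate verification is disabled so sites
// with expired or self-signed certificates are still checked. That trade-off is
// acceptable for a read-only link check; do not reuse this client for requests
// that carry sensitive data.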
var httpClient = &http.Client{
	Timeout: 5 * time.Second,
	Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	},
}

func main() {
	if len(os.Args) < 2 {
		fmt.Printf("%sUsage: go run main.go <directory_path>%s\n", Yellow, Reset)
		os.Exit(1)
	}
	targetDir := os.Args[1]

	fmt.Printf("%s🚀 Initializing Concurrent Link Sweeper in: %s%s\n", Cyan, targetDir, Reset)

	uniqueURLs := extractURLsFromDir(targetDir)
	if len(uniqueURLs) == 0 {
		fmt.Printf("%sNo URLs found in the specified directory.%s\n", Yellow, Reset)
		return
	}

	fmt.Printf("Found %s%d unique URLs%s. Commencing concurrent sweep...\n\n", Bold, len(uniqueURLs), Reset)

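	var wg sync.WaitGroup
	// Buffered channel used as a counting semaphore: at most 15 requests in flight.
	semaphore := make(chan struct{}, 15)

	var brokenLinks []string
	var mu sync.Mutex // guards brokenLinks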

	startTime := time.Now()

	for url := range uniqueURLs {
		wg.Add(1)

		go func(targetURL string) {
			defer wg.Done()

			semaphore <- struct{}{} // acquire a slot (blocks while 15 checks are running)
			isBroken, statusCode := checkURL(targetURL)
			<-semaphore // release the slot

			if isBroken {
				fmt.Printf("  %s[❌ BROKEN ]%s %s (Status: %d)\n", Red, Reset, targetURL, statusCode)
				mu.Lock()
				brokenLinks = append(brokenLinks, targetURL)
				mu.Unlock()
			} else {
				fmt.Printf("  %s[✅ ONLINE ]%s %s\n", Green, Reset, targetURL)
			}
		}(url)
	}

	wg.Wait()
	duration := time.Since(startTime)

	fmt.Printf("\n%s--- Sweep Complete ---%s\n", Cyan, Reset)
	fmt.Printf("Time elapsed: %v\n", duration)
	fmt.Printf("Total Links Scanned: %d\n", len(uniqueURLs))
	if len(brokenLinks) > 0 {
		fmt.Printf("Broken Links Found: %s%d%s\n", Red, len(brokenLinks), Reset)
	} else {
		fmt.Printf("Broken Links Found: %s0%s 🎉\n", Green, Reset)
	}
}

func extractURLsFromDir(dir string) map[string]struct{} {
	urls := make(map[string]struct{})

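	// Walk the tree. An error on the root (for example, a missing directory)
	// aborts the walk and is reported below instead of being silently dropped.
	walkErr := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		ext := filepath.Ext(path)
		if info.IsDir() || (ext != ".md" && ext != ".html") {
			return nil
		}
		content, err := os.ReadFile(path)
		if err != nil {
			return nil // skip unreadable files rather than aborting the sweep
		}

		// Map keys deduplicate URLs that appear more than once.
		for _, match := range urlRegex.FindAllString(string(content), -1) {
			urls[match] = struct{}{}
		}
		return nil
	})
	if walkErr != nil {
		fmt.Fprintf(os.Stderr, "%sWarning: could not fully scan %s: %v%s\n", Yellow, dir, walkErr, Reset)
	}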

	return urls
}

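// checkURL reports whether a URL is broken (status >= 400 or no response).
// A statusCode of 0 means the request failed before any HTTP response arrived
// (DNS failure, refused connection, timeout).
func checkURL(url string) (isBroken bool, statusCode int) {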
	resp, err := httpClient.Head(url)
	if err != nil {
		return true, 0
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusMethodNotAllowed {
		respGet, errGet := httpClient.Get(url)
		if errGet != nil {
			return true, 0
		}
		defer respGet.Body.Close()
		return respGet.StatusCode >= 400, respGet.StatusCode
	}

	return resp.StatusCode >= 400, resp.StatusCode
}
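The HEAD → GET fallback can be exercised without touching the public internet by pointing the same logic at a local test server from net/http/httptest. This is a sketch under assumptions: newFinickyServer is a hypothetical helper that rejects every HEAD request with 405, and checkURL here is a condensed copy of the function above using a plain client.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

var client = &http.Client{Timeout: 5 * time.Second}

// checkURL mirrors the logic in main.go: HEAD first, GET only when the
// server answers 405 Method Not Allowed.
func checkURL(url string) (isBroken bool, statusCode int) {
	resp, err := client.Head(url)
	if err != nil {
		return true, 0
	}
	resp.Body.Close()

	if resp.StatusCode == http.StatusMethodNotAllowed {
		resp, err = client.Get(url)
		if err != nil {
			return true, 0
		}
		resp.Body.Close()
	}
	return resp.StatusCode >= 400, resp.StatusCode
}

// newFinickyServer (hypothetical) rejects HEAD with 405, answers GET /ok
// with 200, and returns 404 for every other GET path.
func newFinickyServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch {
		case r.Method == http.MethodHead:
			w.WriteHeader(http.StatusMethodNotAllowed)
		case r.URL.Path == "/ok":
			w.WriteHeader(http.StatusOK)
		default:
			w.WriteHeader(http.StatusNotFound)
		}
	}))
}

func main() {
	srv := newFinickyServer()
	defer srv.Close()

	for _, path := range []string{"/ok", "/missing"} {
		broken, status := checkURL(srv.URL + path)
		fmt.Printf("%-9s broken=%v status=%d\n", path, broken, status)
	}
}
```

Without the fallback, both paths would be reported as broken with status 405. With it, /ok comes back as online (200) and /missing as broken (404). Servers that reject HEAD with other codes, such as 403 or 501, are not retried by this logic.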

Execution

Run against the current directory to parse test_post.md:

go run main.go .

Compile to a production binary:

go build -o sweeper main.go
./sweeper .

Point it at this site’s content tree for a real-world sweep:

./sweeper /path/to/davidcolecloud/src/content

Why this shines on a portfolio

  1. Goroutines and channels: each URL is checked in its own goroutine instead of a sequential for loop waiting on each site. sync.WaitGroup tracks completion, and a semaphore channel caps in-flight requests at 15, which keeps socket and file-descriptor usage bounded. This is the kind of explicit concurrency control infrastructure teams look for.
  2. Defensive network handling: strict timeouts, tolerance for legacy sites with invalid TLS certificates, and automatic HEAD → GET fallback for servers that reject HEAD. It handles real-world network failures, not just the happy path.
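The effect of the client timeout mentioned above can be demonstrated against a server that never answers. The sketch below rests on assumptions: newStalledServer and headWithTimeout are hypothetical helpers, and the 100 ms timeout is chosen only so the demo finishes quickly (main.go uses 5 seconds).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// shortTimeout is deliberately tiny so the demo finishes quickly.
const shortTimeout = 100 * time.Millisecond

// headWithTimeout issues a HEAD request with the given overall timeout and
// reports whether a response arrived and how long the call took.
func headWithTimeout(url string, timeout time.Duration) (ok bool, elapsed time.Duration) {
	client := &http.Client{Timeout: timeout}
	start := time.Now()
	resp, err := client.Head(url)
	elapsed = time.Since(start)
	if err != nil {
		return false, elapsed // includes the "Client.Timeout exceeded" case
	}
	resp.Body.Close()
	return true, elapsed
}

// newStalledServer (hypothetical) returns a server whose handler blocks
// until release is called, simulating a host that accepts the connection
// but never responds.
func newStalledServer() (srv *httptest.Server, release func()) {
	done := make(chan struct{})
	srv = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-done // hold the request open until the caller releases it
	}))
	return srv, func() { close(done) }
}

func main() {
	srv, release := newStalledServer()
	ok, elapsed := headWithTimeout(srv.URL, shortTimeout)
	release()   // let the blocked handler return
	srv.Close() // safe now: no handler is still running

	fmt.Printf("completed=%v, gave up after about %v\n", ok, elapsed.Round(10*time.Millisecond))
}
```

Without a Timeout on the client, the Head call in this setup would block until the server released the request, which is the hang the sweeper's 5-second limit is meant to prevent.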

The technical sandbox (Projects 1–8)

These micro-projects form a growing technical sandbox for davidcole.cloud:

#   Stack               Role
1   TypeScript / Node   Serve Markdown APIs
2   Python              Local AI log analytics
3   Java                Modernize legacy flat-file data
4   SQL / PostgreSQL    Recursive hierarchies natively in the database
5   Go                  Sweep the network concurrently
6   Python              MCP-style context server for local Linux agents
7   Python              2PC fleet migrations for isolated tenant databases
8   Python              Raw TCP sockets with asyncio event-loop concurrency

Together they demonstrate senior-level full-stack and systems engineering, which is useful context for hiring managers, consulting clients, and technical peers reviewing this portfolio.