
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/Project516/BlogAPI/llms.txt

Use this file to discover all available pages before exploring further.

The Blog API uses a two-layer caching strategy to serve blog post data quickly without hitting the upstream source on every request. When the server starts, it attempts to load previously cached data from /tmp/cache.json into memory. If that file does not exist, the in-memory cache starts empty. The only way to populate or refresh the cache — whether on a fresh start or after new posts are published — is to call POST /blogs/cache.

How the cache is populated

The cache is populated through two mechanisms. On startup, the server reads /tmp/cache.json if it exists and loads its contents into the in-memory cache list:
startup
import json

try:
    with open("/tmp/cache.json", "r") as file:
        cache = json.load(file)
except FileNotFoundError:
    cache = []
On demand, calling POST /blogs/cache triggers a live scrape, overwrites the in-memory cache, and writes the result back to /tmp/cache.json:
on-demand refresh
cache = scrape_blogs(
    "https://raw.githubusercontent.com/Project516/project516.github.io/refs/heads/master/blog.html"
)
with open("/tmp/cache.json", "w") as file:
    json.dump(cache, file)

How the scraper works

When POST /blogs/cache is called, the scraper fetches the raw HTML of the upstream blog index from GitHub and parses it with BeautifulSoup. It locates every <article> element in the document, then extracts three pieces of data from each one:
  • The text content of the <h2> tag as the post title
  • The href attribute of the first <a> tag, prefixed with https://project516.dev/
  • The datetime attribute of the <time> tag as the post date
Any article that does not contain an <a> tag is skipped.
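The extraction steps above can be sketched with BeautifulSoup as follows. This is a minimal illustration, not the server's actual code; the function name scrape_articles is ours, and it assumes the upstream HTML has already been fetched into a string:

```python
from bs4 import BeautifulSoup

def scrape_articles(html: str) -> list[dict]:
    """Parse blog index HTML into a list of post dicts."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for article in soup.find_all("article"):
        link = article.find("a")
        if link is None:
            continue  # articles without an <a> tag are skipped
        posts.append({
            "title": article.find("h2").get_text(strip=True),
            "link": "https://project516.dev/" + link["href"],
            "date": article.find("time")["datetime"],
        })
    return posts
```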

Blog post data shape

Each cached blog post is a JSON object with three fields:
blog post object
{
  "title": "My blog post title",
  "link": "https://project516.dev/posts/my-blog-post",
  "date": "2024-11-01"
}
title
string
required
The text content of the post’s <h2> element, with surrounding whitespace stripped.
link
string
required
The absolute URL to the blog post, constructed by prepending https://project516.dev/ to the href found in the article’s <a> tag.
date
string
required
The datetime attribute value from the post’s <time> element, typically in YYYY-MM-DD format.
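If you consume this data, a small shape check can catch malformed entries early. The helper below is illustrative only (validate_post is not part of the API) and simply verifies that all three required string fields are present:

```python
def validate_post(post: dict) -> bool:
    """Return True if the dict matches the cached blog post shape."""
    required = {"title": str, "link": str, "date": str}
    return all(isinstance(post.get(field), expected)
               for field, expected in required.items())

validate_post({"title": "My blog post title",
               "link": "https://project516.dev/posts/my-blog-post",
               "date": "2024-11-01"})  # True
```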

When to refresh the cache

You must call POST /blogs/cache to pick up any new blog posts published to the upstream source. The read endpoints (GET /blogs, GET /blogs/latest, GET /blogs/search) all read directly from the in-memory cache and never trigger a scrape themselves.
refresh the cache
curl -X POST http://localhost:8000/blogs/cache
A successful refresh returns:
success response
{
  "message": "Blogs cached successfully"
}
If the server restarts and /tmp/cache.json is not present — for example, after a system reboot that clears /tmp — the in-memory cache will be empty and all read endpoints will return no data until you call POST /blogs/cache again.
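You can reproduce this cold-start behaviour in isolation. The sketch below mimics the startup logic against a throwaway path that does not exist, standing in for /tmp/cache.json after a reboot:

```python
import json
import os
import tempfile

# A path that does not exist yet, simulating /tmp cleared by a reboot.
path = os.path.join(tempfile.mkdtemp(), "cache.json")

try:
    with open(path, "r") as file:
        cache = json.load(file)
except FileNotFoundError:
    cache = []  # read endpoints will serve no data until a refresh

print(cache)  # → []
```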
POST /blogs/cache is rate-limited to 1 request per minute per IP address. If you need to trigger multiple refreshes in quick succession during testing, wait at least 60 seconds between calls.
If the upstream GitHub URL is unreachable when you call POST /blogs/cache, the scraper raises an exception and the server returns an HTTP 500 error. The in-memory cache is not modified — the exception is thrown before the cache variable is reassigned, so existing cached data is preserved. Verify connectivity to raw.githubusercontent.com if you receive a 500 response.
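The preservation guarantee follows from Python assignment semantics: the right-hand side of `cache = scrape_blogs(...)` raises before any reassignment occurs. The stand-in scrape_blogs below simulates an unreachable upstream to demonstrate this:

```python
def scrape_blogs(url):
    """Stand-in that simulates an unreachable upstream source."""
    raise ConnectionError("upstream unreachable")

cache = [{"title": "old post"}]  # existing cached data

try:
    cache = scrape_blogs(
        "https://raw.githubusercontent.com/Project516/project516.github.io/refs/heads/master/blog.html"
    )
except ConnectionError:
    pass  # the server would return HTTP 500 here

# The exception fired before assignment, so the old data survives.
print(cache)  # → [{'title': 'old post'}]
```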
