
Phil Nash

Phil Nash is a Developer at Sonar specializing in NodeJS, Ruby, JavaScript, GraphQL, and DevOps. With a strong background in web development and a keen interest in APIs and continuous integration, Phil actively shares his knowledge at conferences like API World and DevOpsCon.

Phil is known for his engaging talks that cover topics such as building better APIs with GraphQL and leveraging continuous integration for faster development.

Phil maintains a strong online presence through his website, where he shares valuable resources and blog posts, and actively engages with the developer community on social media platforms like Twitter and GitHub.

Check out the latest blog posts from Phil Nash below.

After the recent supply chain attacks on the npm ecosystem, notably the Shai-Hulud 2.0 worm, GitHub took a number of actions to shore up the security of publishing packages to hopefully avoid further attacks. One of the outcomes was that long-lived npm tokens were revoked in favour of short-lived tokens or using trusted publishing.

I have GitHub Actions set up to publish new versions of npm packages that I maintain when a new tag is pushed to the repository. This workflow used long-lived npm tokens to authenticate, so when it came to updating a package recently I needed to update the publishing method too. The npm documentation on trusted publishing for npm packages was useful up to a point, but there were some things I needed to do to get my package published successfully that the docs either didn't cover explicitly or didn't make obvious enough. I also came across this thread on GitHub where other people had similar issues. I wanted to share those things here.

TL;DR

Briefly, the changes that worked for me were to add the following to my GitHub Action publishing workflow:

# Permission to generate an OIDC token
permissions:
  id-token: write

jobs:
  publish:
    steps:
      ...
      # Ensure the latest npm is installed
      - run: npm install -g npm@latest
      ...
      # Add the --provenance flag to the publish command
      - run: npm publish --provenance

And ensure that the package.json refers to the correct repository:

{
  ...
  "repository": {
    "type": "git",
    "url": "git+https://github.com/${username}/${packageName}.git"
  },
  ...
}

For a bit more detail and alternative ways to set some of these settings, read on.

Package settings

Ok, so this is embarrassing, but initially I couldn't find the settings I needed to enable trusted publishing. The npm docs say:

Navigate to your package settings on npmjs.com and find the "Trusted Publisher" section.

I spent far too long looking around the https://www.npmjs.com/settings/${username}/packages page for the "Trusted Publisher" section. What I needed was the specific package settings, available here: https://www.npmjs.com/package/${packageName}/access.

You need to set up trusted publishing for each of your packages individually. That might be fine if you only maintain a few, but it's going to be a huge hassle if you have a lot.

Once you have filled in the trusted publisher settings, it's on to updating your project so that it can be published successfully.

Permissions

This is in the npm docs, so I'm just including it for completeness. You need to give the workflow permission to generate an OIDC token that it can then use to publish the package. To do this requires one permission being set in your workflow file.

permissions:
  id-token: write

npm version

The docs clearly call out that:

Note: Trusted publishing requires npm CLI version 11.5.1 or later.

I needed to upgrade the version of npm used by my GitHub Actions workflow, so I added a simple step to install the latest version of npm as part of the run before publishing:

- run: npm install -g npm@latest

Automatic provenance

The docs also say:

When you publish using trusted publishing, npm automatically generates and publishes provenance attestations for your package. This happens by default—you don't need to add the --provenance flag to your publish command.

I did not find this to be the case. I needed to add the --provenance flag so that my package would publish successfully.

- run: npm publish --provenance

This was something that seemed to help others too. You may only need to pass --provenance the first time, with it continuing to work automatically beyond that, but it can't hurt to keep it in your publish script (for when you need to update another package and you copy things over).

You can also set your package to generate provenance attestations on publishing by setting the provenance option in publishConfig in your package.json file.

{
  ...
  "publishConfig": {
    "provenance": true
  }
  ...
}

Or you can set the NPM_CONFIG_PROVENANCE environment variable.

- env:
    NPM_CONFIG_PROVENANCE: true
  run: npm publish

Repository details

Finally, I don't know if this last part helped as I did already have it set, but others in this GitHub thread found that setting the repository field in the package's package.json to specifically point to the GitHub repository also helped.

{
  ...
  "repository": {
    "type": "git",
    "url": "git+https://github.com/${username}/${packageName}.git"
  },
  ...
}

When you set up your trusted publisher in npm you do have to provide the repository details, so it makes sense to me that the package should agree with those details too.
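To put all of those pieces together, here's a sketch of what a complete publishing workflow could look like. The tag trigger, runner, Node.js version, and action versions are assumptions based on my setup, so adjust them to match your own project.

name: Publish package

on:
  push:
    tags:
      - "v*"

# Permission to generate an OIDC token for trusted publishing
permissions:
  id-token: write
  contents: read

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          registry-url: "https://registry.npmjs.org"
      # Trusted publishing requires npm 11.5.1 or later
      - run: npm install -g npm@latest
      - run: npm ci
      # The --provenance flag is what finally made publishing work for me
      - run: npm publish --provenance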

Keep the ecosystem safe

Short-lived tokens, trusted publishing, and provenance all help keep the entire ecosystem safe. If you've read this far, it is because you are also updating your packages to publish with this method.

I know there are people out there with many more packages, and packages that are much more popular than any of mine, but I hope this helps. It does amuse me that I went through this for a package that I'm pretty sure I'm the only user of, but at least I now know how to do it for the future.

I hope to see trusted publishing continue to expand to more providers (it is limited to GitHub and GitLab at the time of writing) and to be used by more packages. And I hope to see fewer worms charging through the package ecosystem and threatening all of our applications in the future.

The Date object in JavaScript is frequently one that causes trouble. So much so that it is set to be replaced by Temporal soon. This is the story of an issue that I faced that will be much easier to handle once Temporal is more widespread.

The issue

In January 2025 I was in Santa Clara, California writing some JavaScript to perform some reporting. I wanted to be able to get the number of events that happened within a month, so I would create a date object for the first day of the month, add one month to it, and then subtract a day to get the last day. Seems straightforward, right?

I got a really weird result though. I reduced the issue to the following code.

const date = new Date("2024-01-01T00:00:00.000Z");
date.toISOString();
// => "2024-01-01T00:00:00.000Z" as expected
date.setMonth(1);
date.toISOString();
// => "2023-03-04T00:00:00.000Z" WTF?

I added a month to the 1st of January 2024 and landed on the 4th March, 2023. What happened?

Times and zones

You might have thought it was odd for me to set this scene on the West coast of the US, but it turned out this mattered. This code would have run fine in UTC and everywhere East of it.

JavaScript dates are more than just dates; they are responsible for time as well. Even though I only wanted to deal with days and months in this example, the time still mattered.

I did know this, so I set the time to UTC thinking that this would work for me wherever I was. That was my downfall. Let's break down what happened.

Midnight on the 1st January, 2024 in UTC is still 4pm on the 31st December, 2023 in Pacific Time (UTC-8). date.setMonth(1) sets the month to February (as months are 0-indexed, unlike days). But we started on the 31st December, 2023, so JavaScript has to handle the non-existent date of the 31st February, 2023. It does this by overflowing into the next month, so we get the 3rd March. Finally, to print it out, the date is translated back into UTC, giving the final result: midnight on the 4th March, 2023.
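To make that sequence easier to see, here's the same code again, printed in local time rather than UTC. This sketch assumes it's running in Pacific Time (UTC-8); in another time zone the local strings will differ.

const date = new Date("2024-01-01T00:00:00.000Z");
date.toString();
// => "Sun Dec 31 2023 16:00:00 GMT-0800 (Pacific Standard Time)"
date.setMonth(1); // tries to set 31st February 2023, which overflows to 3rd March 2023
date.toString();
// => "Fri Mar 03 2023 16:00:00 GMT-0800 (Pacific Standard Time)"
date.toISOString();
// => "2023-03-04T00:00:00.000Z"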

All of these steps feel reasonable when you break them down; the confusion stems from how unexpected the result is.

So, how do you fix this?

Always use UTC

Since I didn't actually care about the time and I knew I wanted to work with UTC, I fixed this code using the Date object's setUTCMonth method. My original code subtracted a day to get the last day of the month, so I used the setUTCDate method too. All set${timePeriod} methods have a setUTC${timePeriod} equivalent to help you work with this.

const date = new Date("2024-01-01T00:00:00.000Z");
date.toISOString();
// => "2024-01-01T00:00:00.000Z"
date.setUTCMonth(1);
date.toISOString();
// => "2024-02-01T00:00:00.000Z"

So this fixed my issue. Can it be better though?

Bring on Temporal

One of the reasons this went wrong was that I was trying to manipulate dates, but I was actually manipulating dates and times without thinking about it. I mentioned Temporal at the top of the post because it has objects specifically for this.

If I were to write this code using Temporal, I would be able to use Temporal.PlainDate to represent a calendar date: a date without a time or time zone.

This simplifies things already, but Temporal also makes it more obvious how to manipulate dates. Rather than setting months and dates or adding milliseconds to update a date, you add a duration. You can either construct a duration with the Temporal.Duration object or use an object that defines a duration.
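For example, these two calls describe the same one-month duration; a small sketch to show the Temporal.Duration form alongside the plain object form used below.

const date = Temporal.PlainDate.from("2024-01-01");

const oneMonth = Temporal.Duration.from({ months: 1 });
date.add(oneMonth);      // using a Temporal.Duration object
date.add({ months: 1 }); // using a plain object that describes the same duration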

Temporal also makes objects immutable, so every time you change a date it returns a new object.

In this case I wanted to add a month, so with Temporal it would look like this:

const startDate = Temporal.PlainDate.from("2024-01-01");
// => Temporal.PlainDate 2024-01-01
const nextMonth = startDate.add({ months: 1 });
// => Temporal.PlainDate 2024-02-01
const endDate = nextMonth.subtract({ days: 1 });
// => Temporal.PlainDate 2024-01-31

Date manipulation without worrying about times, wonderful!

Of course, there are many more benefits to the very well thought-out Temporal API and I cannot wait for it to be a part of every JavaScript runtime.

Mind the time zone

Temporal has still not made it to many JavaScript engines. At the time of writing, it is available in Firefox and nowhere else, so if you want to test this out open up Firefox or check out one of the polyfills @js-temporal/polyfill or temporal-polyfill.

If you still have to use Date make sure you keep your time zone in mind. I'd try to move to, or at least learn how to use, Temporal now.

And watch out for time zones, even when you try to avoid them they can end up giving you a headache.

When you’re building a retrieval-augmented generation (RAG) app, the first thing you need to do is prepare your data. You need to:

  • Split your unstructured data into chunks
  • Turn those chunks into vector embeddings
  • Store those embeddings in a vector database

There are many ways that you can create vector embeddings in Python. In this post, we’ll take a look at four ways to generate vector embeddings: locally, via API, via a framework, and with Astra DB's Vectorize.

Note: This post was originally written for DataStax, but didn't survive a content migration as part of IBM's purchase (https://www.ibm.com/new/announcements/ibm-to-acquire-datastax-helping-clients-bring-the-power-of-unstructured-data-to-enterprise-ai-applications). I thought the content was useful, so have republished it here.

Local vector embeddings

There are many pre-trained embedding models available on Hugging Face that you can use to create vector embeddings. Sentence Transformers (SBERT) is a library that makes it easy to use these models for vector embedding, as well as cross-encoding for reranking. It even has tools for fine-tuning models, if that's something you need.

You can install the library with:

pip install sentence_transformers

A popular local model for vector embedding is all-MiniLM-L6-v2. It’s trained as a good all-rounder that produces a 384-dimension vector from a chunk of text.

To use it, import sentence_transformers and create a model using the identifier from Hugging Face, in this case "all-MiniLM-L6-v2". If you want to use a model that isn't part of the sentence-transformers project, like the multilingual BGE-M3, you can include the organization in the identifier, like "BAAI/bge-m3". Once you've loaded the model, use the encode method to create the vector embedding. The full code looks like this:

from sentence_transformers import SentenceTransformer


model = SentenceTransformer("all-MiniLM-L6-v2")
sentence = "A robot may not injure a human being or, through inaction, allow a human being to come to harm."
embedding = model.encode(sentence)

print(embedding)
# => [ 1.95171311e-03  1.51085425e-02  3.36140348e-03  2.48030387e-02 ... ]

If you pass an array of texts to the model, they’ll all be encoded:

from sentence_transformers import SentenceTransformer


model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
    "A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.",
    "A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.",
]
embeddings = model.encode(sentences)

print(embeddings)
# => [[ 0.00195174  0.01510859  0.00336139 ...  0.07971715  0.09885529  -0.01855042]
# [-0.04523939 -0.00046248  0.02036596 ...  0.08779042  0.04936493  -0.06218244]
# [-0.05453169  0.01125113 -0.00680178 ...  0.06443197  0.08771271  -0.00063468]]

There are many more models you can use to generate vector embeddings with the sentence-transformers library and, because you’re running locally, you can try them out to see which is most appropriate for your data. You do need to watch out for any restrictions that these models might have. For example, the all-MiniLM-L6-v2 model doesn’t produce good results for more than 128 tokens and can only handle a maximum of 256 tokens. BGE-M3, on the other hand, can encode up to 8,192 tokens. However, the BGE-M3 model is a couple of gigabytes in size and all-MiniLM-L6-v2 is under 100MB, so there are space and memory constraints to consider, too.
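If you're not sure what a model you've loaded can handle, the SentenceTransformer object can tell you; a quick sketch using the model from above:

from sentence_transformers import SentenceTransformer


model = SentenceTransformer("all-MiniLM-L6-v2")

# The maximum number of tokens the model will consider per input
print(model.max_seq_length)
# => 256

# The number of dimensions in the vectors the model produces
print(model.get_sentence_embedding_dimension())
# => 384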

Local embedding models like this are useful when you’re experimenting on your laptop, or if you have hardware that PyTorch can use to speed up the encoding process. It’s a good way to get comfortable running different models and seeing how they interact with your data.

If you don't want to run your models locally, there are plenty of available APIs you can use to create embeddings for your documents.

APIs

There are several services that make embedding models available as APIs. These include LLM providers like OpenAI, Google, or Cohere, as well as specialist providers like Jina AI or model hosts like Fireworks.

These services provide HTTP APIs, often with a Python package to make it easy to call them. You will typically need an API key from the service. Once you have that set up, you can generate vector embeddings by sending your text to the API.

For example, with Google's google-genai SDK and a Gemini API key you can generate a vector embedding with their Gemini embedding model like this:

from google import genai


client = genai.Client(api_key="GEMINI_API_KEY")

result = client.models.embed_content(
        model="gemini-embedding-001",
        contents="A robot may not injure a human being or, through inaction, allow a human being to come to harm.")

print(result.embeddings)
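For comparison, here's roughly the same call made with OpenAI's embedding API via the official openai package; the model name and API key are placeholders.

from openai import OpenAI


client = OpenAI(api_key="OPENAI_API_KEY")

result = client.embeddings.create(
    model="text-embedding-3-small",
    input="A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
)

print(result.data[0].embedding)
# => a list of 1,536 floats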

Each API can be different, though many providers do make OpenAI-compatible APIs. However, each time you try a new provider you might find you have a new API to learn. Unless, of course, you try one of the available frameworks that are intended to simplify this.

Frameworks

There are several projects available, like LangChain or LlamaIndex, that create abstractions over the common components of the GenAI ecosystem, including embeddings.

Both LangChain and LlamaIndex have methods for creating vector embeddings via APIs or local models, all with the same interface. For example, you can create the same Gemini embedding as the code snippet above with LangChain like this:

from langchain_google_genai import GoogleGenerativeAIEmbeddings


embeddings = GoogleGenerativeAIEmbeddings(
    model="gemini-embedding-001",
    google_api_key="GEMINI_API_KEY"
)
result = embeddings.embed_query("A robot may not injure a human being or, through inaction, allow a human being to come to harm.")
print(result)

As a comparison, here is how you would generate an embedding using an OpenAI embeddings model and LangChain:

from langchain_openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key="OPENAI_API_KEY"
)
result = embeddings.embed_query("A robot may not injure a human being or, through inaction, allow a human being to come to harm.")
print(result)

We had to change the name of the import and the API key we used, but otherwise the code is identical. This makes it easy to swap them out and experiment.

If you're using LangChain to build your entire RAG pipeline, these embeddings fit in well with the vector database interfaces. You can provide an embedding model to the database object and LangChain handles generating the embeddings as you insert documents or perform queries. For example, here's how you can combine the Google embeddings model with the LangChain wrapper for Astra DB.

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_astradb import AstraDBVectorStore


embeddings = GoogleGenerativeAIEmbeddings(
    model="gemini-embedding-001",
    google_api_key="GEMINI_API_KEY"
)

vector_store = AstraDBVectorStore(
    collection_name="astra_vector_langchain",
    embedding=embeddings,
    api_endpoint="ASTRA_DB_API_ENDPOINT",
    token="ASTRA_DB_APPLICATION_TOKEN"
)

vector_store.add_documents(documents) # a list of document objects to store in the db

You can use the same vector_store object and associated embeddings to perform the vector search, too.

results = vector_store.similarity_search("Are robots allowed to protect themselves?")

LlamaIndex has a similar set of abstractions that enable you to combine different embedding models and vector stores. Check out this LlamaIndex introduction to RAG to learn more.

If you're new to embeddings, LangChain has a handy list of embedding models and providers that can help you find different options to try.

Directly in the database

The methods we’ve talked through so far have involved creating a vector independently of storing it in or using it to search against a vector database. When you want to store those vectors in a vector database like Astra DB, it looks a bit like this:

from astrapy import DataAPIClient


client = DataAPIClient("ASTRA_DB_APPLICATION_TOKEN")
database = client.get_database("ASTRA_DB_API_ENDPOINT")
collection = database.get_collection("COLLECTION_NAME")

result = collection.insert_one(
    {
         "text": "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
         "$vector": [0.04574034, 0.038084425, -0.00916391, ...]
    }
)

The above assumes that you have already created your vector-enabled collection with the right number of dimensions for the model you’re using.

Performing a vector search then looks like this:

cursor = collection.find(
    {},
    sort={"$vector": [0.04574034, 0.038084425, -0.00916391, ...]}
)

for document in cursor:
    print(document)

In these examples, you have to create your vectors first, before storing or searching against the database with them. In the case of the frameworks, you might not see this happen, as it has been abstracted away, but the operations are being performed.

With Astra DB, you can have the database generate the vector embeddings for you as you either insert the document into the collection or at the point of performing the search. This is called Astra Vectorize and it simplifies a crucial step in your RAG pipeline.

To use Vectorize, you first need to set up an embedding provider integration. There's one built-in integration that you can use with no extra work, the NVIDIA NV-Embed-QA model, or you can choose one of the other embedding providers and configure them with your own API key.

When you create a collection, you can choose which embedding provider you want to use with the requisite number of dimensions.

When you set up your collection this way you can add content and have it automatically vectorized by using the special property $vectorize.

result = collection.insert_one(
    {
         "$vectorize": "A robot may not injure a human being or, through inaction, allow a human being to come to harm."
    }
)

Then, when a user query comes in, you can perform a vector search by sorting using the $vectorize property. Astra DB will create the vector embedding and then make the search in one step.

cursor = collection.find(
    {},
    sort={"$vectorize": "Are robots allowed to protect themselves?"},
    limit=5
)

There are several advantages to this approach:

  • The Astra DB team has done the work to make the embedding creation robust already
  • Making two separate API calls to create embeddings and then store them is often slower than letting Astra DB handle it
  • Using the built-in NVIDIA embeddings model is even quicker than that
  • You have less code to write and maintain

A world of vector embedding options

As we have seen, there are many choices you can make in how to implement vector embeddings, which model you use, and which provider you use. It's an important step in your RAG pipeline, and it's worth spending the time to find out which model and method are right for your application and your data.

You can choose to host your own models, rely on third-party APIs, abstract the problem away through frameworks, or entrust Astra DB to create embeddings for you. Of course, if you want to avoid code entirely, then you can drag-and-drop your components into place with Langflow.

This is one of those cathartic blog posts. One in which I spent several frustrating hours trying to debug something that really should have just worked. Once I had finally found out what was going on I felt that I had to write it all down just in case someone else is out there dealing with the same issue. So if you have found yourself in a situation where using fetch in Node.js for a multipart/form-data request doesn't work, this might help you out.

If that doesn't apply to you, have a read anyway and share my pain.

What happened?

Today, while trying to write a Node.js client for an API, I got stuck on one particular endpoint. It was an endpoint for uploading files, so it required the body to be formatted as multipart/form-data. JavaScript makes it easy to create such a request: you use a FormData object to gather your data, including files, and you submit it via fetch. The formatting of the request body is then handled for you and it normally just works.

Today it did not "just work".

HTTP 422

Since version 18, Node.js has supported the fetch API, via a project called undici. The undici project is added as a dependency to Node.js and the fetch function is exposed to the global scope.

Writing the code for this upload endpoint should have been straightforward. I put together something like this:

import { readFile } from "node:fs/promises";
import { extname, basename } from "node:path";
import mime from "mime"; // provides the getType call used below

async function uploadFile(url, filePath) {
  const data = await readFile(filePath);
  const type = mime.getType(extname(filePath));
  const file = new File([data], basename(filePath), { type });

  const form = new FormData();
  form.append("file", file);

  const headers = new Headers();
  headers.set("Accept", "application/json");

  return fetch(url, {
    method: "POST",
    headers,
    body: form
  });
}

The real code has a few more complexities, but this is a good approximation of what I expected to be able to write.

I lined up a test against the API, fired it off and was disappointed to receive a 422 response with the message "Invalid multipart formatting".

I pored back over the code, not that there was a lot of it, to try to work out what I had done wrong. Unable to find anything, I turned to other tools.

I tried to proxy and inspect my request to see if anything was obviously wrong. Then I tried sending the request from another tool to see if I could get the API endpoint to respond with a success. Using Bruno I was able to make a successful request.

With a correct request and an incorrect request, I compared the two. But I didn't get very far. The URL, the headers, and the request body all looked the same, yet one method of sending the request worked and the other didn't.

Digging into the API

The API client I am writing is for Langflow. It's an open-source, low-code tool for building generative AI flows and agents. Langflow is part of DataStax, where I am working and doing things like hooking Langflow up to Bluesky to create fun generative AI bots.

Because Langflow is open-source, once I had run out of ideas with my code I could dig into the code behind the API to see if I could work out what was going on there. I found where the error message was coming from along with a hint as to what might be wrong.

Multipart requests

A multipart request is often made up of multiple parts that may not be of the same type. This is how you are able to submit text fields and upload an image file in the same request. To separate the different parts, a multipart request defines a unique string to act as a boundary between them. This boundary string is shared in the Content-Type header and looks like this:

Content-Type: multipart/form-data; boundary=ExampleBoundaryString

An example multipart request would then look like this:

POST /foo HTTP/1.1
Content-Length: 68137
Content-Type: multipart/form-data; boundary=ExampleBoundaryString

--ExampleBoundaryString
Content-Disposition: form-data; name="description"

Description input value
--ExampleBoundaryString
Content-Disposition: form-data; name="myFile"; filename="foo.txt"
Content-Type: text/plain

[content of the file foo.txt chosen by the user]
--ExampleBoundaryString--

A server is then able to split up and parse the different parts of the request using the boundary.

This Content-Type example is courtesy of MDN.

The Langflow API was checking to see whether the request body started with the boundary string and ended with the boundary string plus two dashes and then \r\n.

boundary_start = f"--{boundary}".encode()
boundary_end = f"--{boundary}--\r\n".encode()

if not body.startswith(boundary_start) or not body.endswith(boundary_end):
    return JSONResponse(
        status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
        content={"detail": "Invalid multipart formatting"},
    )

The combination of a carriage return and line feed (CRLF), a holdover from the days of typewriters, turned out to be my undoing. After throwing some print lines in this code (I do not know how to debug Python any other way) I confirmed that the end of the body did not match because it was missing the CRLF.

A missing CRLF. Hours of debugging trying to spot a missing \r\n. A Python application causing me trouble over insignificant whitespace.

Convention over specification

It turns out that the spec for multipart/form-data, RFC 7578, does not mandate that the body of the request ends with a CRLF. It does say that each part of the request must be delimited with a CRLF, "--", and then the value of the boundary parameter. It does not say that there needs to be a CRLF at the end of the body, nor does it say that there can't be one.

In fact, it turns out that many popular HTTP clients, including curl, do add this CRLF. It's a bit of a convention in HTTP clients, so it seems that when the team at Langflow wanted to implement a check on the validity of a multipart request, they included the CRLF in their expectations.

On the other hand, I can only presume the team building undici looked at the spec and realised they didn't need to add unnecessary whitespace and left the CRLF out.

And this is where I landed. Stuck between an HTTP client that wouldn't add a CRLF and an API that expected it. It took me far too long to figure this out.

Fixing both sides

The latest version of Node.js, as I write this, is 23.6.0 and it still behaves this way. However, the code has been updated in undici version 7.1.0 to include the trailing CRLF and I am sure it will be in a release version of Node.js soon. I'm loath to call this a fix, as there was technically nothing wrong with what they were previously doing, but convention wins here.

On the other side of things, I made a pull request to loosen Langflow's definition of a valid multipart request. I'll have to wait to see how that goes.

As for my own code, I installed the latest version of undici into the project, imported fetch and it started working immediately.
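For reference, the change in my project was small; a sketch assuming undici 7.1.0 or later is installed as a direct dependency:

// Use the fetch implementation (and multipart handling) from undici 7.1.0+
// instead of the Node.js global, so the body ends with the CRLF the server expects
import { fetch, FormData, Headers } from "undici";

// ...then build the FormData and call fetch(url, { method: "POST", headers, body: form })
// exactly as before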

So, if you're using fetch in Node.js between version 18.0.0 and 23.6.0 and you're making requests to a server that expects a multipart request to end in CRLF, you too have felt this pain and I am sorry. Yes, it's specific, but what else is the web for if not for sharing very specific problems and how you eventually managed to fix them?

Scraping web pages is one way to fetch content for your retrieval-augmented generation (RAG) application. But parsing the content from a web page can be a pain.

Mozilla's open-source library Readability.js is a useful tool for extracting just the important parts of a web page. Let's look at how to use it as part of a data ingestion pipeline for a RAG application.

Note: This post was originally written for DataStax, but didn't survive a content migration as part of IBM's purchase (https://www.ibm.com/new/announcements/ibm-to-acquire-datastax-helping-clients-bring-the-power-of-unstructured-data-to-enterprise-ai-applications). I thought the content was useful, so have republished it here.

Retrieving unstructured data from a web page

Web pages are a source of unstructured data that we can use in RAG-based apps. But web pages are often full of content that is irrelevant; things like headers, sidebars, and footers. They contain useful context for someone browsing the site, but detract from the main subject of a page.

To get the best data for RAG, we need to remove irrelevant content. When you’re working within one site, you can use tools like cheerio to parse the HTML yourself based on your knowledge of the site's structure. But if you're scraping pages across different layouts and designs, you need a good way to return just the relevant content and avoid the rest.

Repurposing reader view

Most web browsers come with a reader view that strips out everything but the article title and content. Here is the difference between the browser and reader mode when applied to a blog post on my personal site.

Mozilla makes the underlying library for Firefox's reader mode available as a standalone open-source module: Readability.js. So we can use Readability.js in a data pipeline to strip irrelevant content and return high quality results from scraping a web page.

How to scrape data with Node.js and Readability.js

Let's take a look at an example of scraping the article content from my previous blog post on creating vector embeddings in Node.js. Here's some JavaScript you can use to retrieve the HTML for the page:

const html = await fetch(
  "https://philna.sh/blog/2024/09/25/how-to-create-vector-embeddings-in-node-js/"
).then((res) => res.text());
console.log(html);

This includes all the HTML tags as well as the navigation, footer, share links, calls to action and other things you can find on most web sites.

To improve on this, you could install a module like cheerio and select only the important parts:

npm install cheerio

import * as cheerio from "cheerio";

const html = await fetch(
  "https://philna.sh/blog/2024/09/25/how-to-create-vector-embeddings-in-node-js/"
).then((res) => res.text());

const $ = cheerio.load(html);

console.log($("h1").text(), "\n");
console.log($("section#blog-content > div:first-child").text());

With this code you get the title and text of the article. As I said earlier, this is great if you know the structure of the HTML, but that won't always be the case.

Instead, install Readability.js and jsdom:

npm install @mozilla/readability jsdom

Readability.js normally runs in a browser environment and uses the live document rather than a string of HTML, so we need to include jsdom to provide that in Node.js. Now we can turn the HTML we already loaded into a document and pass it to Readability.js to parse out the content.

import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const url =
  "https://philna.sh/blog/2024/09/25/how-to-create-vector-embeddings-in-node-js/";
const html = await fetch(url).then((res) => res.text());

const doc = new JSDOM(html, { url });
const reader = new Readability(doc.window.document);
const article = reader.parse();

console.log(article);

When you inspect the article, you can see that it has parsed a number of things from the HTML.

There's the title, author, excerpt, publish time, and both the content and textContent. The textContent property is the plain text content of the article, ready for you to split into chunks, create vector embeddings, and ingest into a vector database. The content property is the original HTML, including links and images. This could be useful if you want to extract links or process the images somehow.
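For example, here are the properties I found most useful on the parsed article object (the comments describe what each holds rather than actual values):

console.log(article.title);       // the parsed title of the article
console.log(article.byline);      // author information, if Readability could find it
console.log(article.excerpt);     // a short description of the article
console.log(article.textContent); // plain text, ready to chunk and embed
console.log(article.content);     // the cleaned-up HTML, with links and images intact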

You might also want to see whether the document is likely to return good results. Reader view works well on articles, but is less useful for other types of content. You can do a quick check to see if the HTML is suitable for processing with Readability.js with the function isProbablyReaderable. If this function returns false you may want to parse the HTML in a different way, or even inspect the contents at that URL to see whether it has useful content for you.

import { isProbablyReaderable } from "@mozilla/readability";

const doc = new JSDOM(html, { url });
const reader = new Readability(doc.window.document);

if (isProbablyReaderable(doc.window.document)) {
  const article = reader.parse();
  console.log(article);
} else {
  // do something else
}

If the page fails this check, you might want to flag the URL to see whether it does include useful information for your RAG application, or whether it should be excluded.

Using Readability with LangChain.js

If you're using LangChain.js for your application, you can also use Readability.js to return the content from an HTML page. It fits nicely into your data ingestion pipelines, working with other LangChain components, like text chunkers and vector stores.

The following example uses LangChain.js to load the same page as above, return the relevant content from the page using the MozillaReadabilityTransformer, split the text into chunks using the RecursiveCharacterTextSplitter, create vector embeddings with OpenAI, and store the data in Astra DB.

You'll need to install the following dependencies:

npm install @langchain/core @langchain/community @langchain/openai @datastax/astra-db-ts @mozilla/readability jsdom

To run the example, you will need to create an Astra DB database and store the database's endpoint and application token in your environment as ASTRA_DB_APPLICATION_TOKEN and ASTRA_DB_API_ENDPOINT. You will also need an OpenAI API key stored in your environment as OPENAI_API_KEY.

Import the dependencies:

import { HTMLWebBaseLoader } from "@langchain/community/document_loaders/web/html";
import { MozillaReadabilityTransformer } from "@langchain/community/document_transformers/mozilla_readability";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings } from "@langchain/openai";
import { AstraDBVectorStore } from "@langchain/community/vectorstores/astradb";

We use the HTMLWebBaseLoader to load the raw HTML from the URL we provide. The HTML is then passed through the MozillaReadabilityTransformer to extract the text, which is then split into chunks by the RecursiveCharacterTextSplitter. Finally, we create an embedding provider and an Astra DB vector store that will be used to turn the text chunks into vector embeddings and store them in the vector database.

const loader = new HTMLWebBaseLoader(
  "https://philna.sh/blog/2024/09/25/how-to-create-vector-embeddings-in-node-js/"
);
const transformer = new MozillaReadabilityTransformer();
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-small",
});
const vectorStore = new AstraDBVectorStore(embeddings, {
  token: process.env.ASTRA_DB_APPLICATION_TOKEN,
  endpoint: process.env.ASTRA_DB_API_ENDPOINT,
  collection: "content",
  collectionOptions: {
    vector: {
      dimension: 1536,
      metric: "cosine",
    },
  },
});
await vectorStore.initialize();

The initialisation of all the components makes up most of the work. Once everything is set up, you can load, transform, split, embed and store the documents like this:

const docs = await loader.load();
const sequence = transformer.pipe(splitter);
const vectorizedDocs = await sequence.invoke(docs);
await vectorStore.addDocuments(vectorizedDocs);

More accurate data from web scraping with Readability.js

Readability.js is a battle-tested library powering Firefox's reader mode that we can use to scrape only relevant data from web pages. This cleans up web content and makes it much more useful for RAG.

As we've seen, you can do this directly with the library or using LangChain.js and the MozillaReadabilityTransformer.

Getting data from a web page is only the first step in your ingestion pipeline. From here you'll need to split your text into chunks, create vector embeddings, and store everything in a vector database. Then you'll be ready to build your RAG-powered application.

Have you ever had one of those times when you think you're doing everything right, yet still you get an unexpected bug in your application? Particularly when it is state-related and you thought you did everything you could to isolate the state by making copies instead of mutating it in place.

Especially when you are, say, building a game that copies a blank initial state when you create a new room and, no matter what you do, you still find that every player is in every room.

If you find yourself in this sort of situation, like I might have recently, then it's almost certain that you have nested state, you are only making a shallow clone, and you should be using structuredClone.

Shallow copies of nested states

Here's a simple version of the issue I described above. We have a default state and when we generate a new room that state is cloned to the room. The room has a function to add a player to its state.

const defaultState = {
  roomName: "",
  players: []
}

class Room {
  constructor(name) {
    this.state = { ...defaultState, roomName: name }
  }

  addPlayer(playerName) {
    this.state.players.push(playerName);
  }
}

You can create a new room and add a player to it.

const room = new Room("room1");
room.addPlayer("Phil");
console.log(room.state);
// { roomName: "room1", players: ["Phil"] }

But if you try to create a second room, you'll find the player is already in there.

const room2 = new Room("room2");
console.log(room2.state);
// { roomName: "room2", players: ["Phil"] }

It turns out the player even entered the default state.

console.log(defaultState);
// { roomName: "", players: ["Phil"] }

The issue is the clone of the default state that was made in the constructor. It uses object spread syntax (though it could have used Object.assign) to make a shallow clone of the default state object, but a shallow clone only copies the primitive values in the object. Values like arrays or objects aren't cloned; instead, the reference to the original object is copied.

You can see this because the players array in the above example is equal across the default state and the two rooms.

defaultState.players === room.state.players;
// true
defaultState.players === room2.state.players;
// true
room.state.players === room2.state.players;
// true

Since all these references point to the same object, whenever you make an update to any room's players, all rooms and the default state will reflect that.

How to make a deep clone

There have been many ways to make deep clones of objects in JavaScript over the years; examples include Lodash's cloneDeep and using JSON to stringify and then parse an object. However, it turns out that the web platform already had an underlying algorithm for performing deep clones through APIs like postMessage.
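For example, the JSON approach makes a genuinely deep copy of plain data, but it quietly mangles anything that isn't JSON-serialisable:

const original = {
  players: ["Phil"],
  created: new Date(),
  greet: () => "hello",
};

const copy = JSON.parse(JSON.stringify(original));
// copy.players is a new array, which is what we wanted...
// ...but copy.created is now a string and copy.greet has disappeared entirely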

In 2015 it was suggested on a W3C mailing list that the algorithm be exposed publicly, though it took until late 2021 for Deno, Node.js, and Firefox to release support for structuredClone, with the other browsers following in 2022.

If you want to make a deep clone of an object in JavaScript you should use structuredClone.

Using structuredClone

Let's see the function in action. If we update the Room class from the example above to use structuredClone, it looks like this:

class Room {
  constructor(name) {
    this.state = structuredClone(defaultState);
    this.state.roomName = name;
  }

  addPlayer(playerName) {
    this.state.players.push(playerName);
  }
}

Creating one room acts as it did before:

const room = new Room("room1");
room.addPlayer("Phil");
console.log(room.state);
// { roomName: "room1", players: ["Phil"] }

But creating a second room now works as expected; the players are no longer shared.

const room2 = new Room("room2");
console.log(room2.state);
// { roomName: 'room2', players: [] }

And the players arrays are no longer equal references, but completely different objects:

defaultState.players === room.state.players;
// false
defaultState.players === room2.state.players;
// false
room.state.players === room2.state.players;
// false

It is worth reading how the structured clone algorithm works and what doesn't work. For example, you cannot deep clone Function objects or DOM nodes, but circular references are handled with no problem.
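For example, a circular reference survives the clone, while a function does not:

const a = { name: "a" };
a.self = a; // a circular reference

const clone = structuredClone(a);
console.log(clone.self === clone);
// => true, the cycle is recreated in the clone

structuredClone({ greet: () => "hello" });
// => throws a DataCloneError, because functions cannot be cloned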

What a state

If you find yourself in a pickle when trying to clone your state, it may be because you are only making a shallow clone. If you have a simple state that is made up of primitives, then it is fine to use a shallow clone. Once you introduce nested objects that's when you need to consider using structuredClone to create a deep clone and avoid just copying references.

If you do find yourself facing this issue, I hope it takes you less time than it took me to realise what was going on.

When you’re building a retrieval-augmented generation (RAG) app, job number one is preparing your data. You’ll need to take your unstructured data and split it up into chunks, turn those chunks into vector embeddings, and finally, store the embeddings in a vector database.

There are many ways that you can create vector embeddings in JavaScript. In this post, we’ll investigate four ways to generate vector embeddings in Node.js: locally, via API, via a framework, and with Astra DB's Vectorize.

Note: This post was originally written for DataStax, but didn't survive a content migration as part of IBM's purchase (https://www.ibm.com/new/announcements/ibm-to-acquire-datastax-helping-clients-bring-the-power-of-unstructured-data-to-enterprise-ai-applications). I thought the content was useful, so have republished it here.

Local vector embeddings

There are lots of open-source models available on Hugging Face that can be used to create vector embeddings. Transformers.js is a module that lets you use machine learning models in JavaScript, both in the browser and in Node.js. It uses the ONNX runtime to achieve this; it works with models that have published ONNX weights, of which there are plenty, and we can use some of those models to create vector embeddings.

You can install the module with:

npm install @huggingface/transformers

The package can actually perform many tasks, but feature extraction is what you want for generating vector embeddings.

A popular, local model for vector embedding is all-MiniLM-L6-v2. It’s trained as a good all-rounder and produces a 384-dimension vector from a chunk of text.

To use it, import the pipeline function from Transformers.js and create an extractor that will perform "feature-extraction" using your provided model. You can then pass a chunk of text to the extractor and it will return a tensor object which you can turn into a plain JavaScript array of numbers.

All in all, it looks like this:

import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2"
);

const response = await extractor(
  [
    "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
  ],
  { pooling: "mean", normalize: true }
);

console.log(Array.from(response.data));
// => [-0.004044221248477697,  0.026746056973934174,   0.0071970801800489426, ... ]

You can actually embed multiple texts at a time if you pass an array to the extractor. Then you can call tolist on the response and that will return you a list of arrays as your vectors.

const response = await extractor(
  [
    "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
    "A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.",
    "A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.",
  ],
  { pooling: "mean", normalize: true }
);

console.log(response.tolist());
// [
//   [ -0.006129210349172354,  0.016346964985132217,   0.009711502119898796, ...],
//   [-0.053930871188640594,  -0.002175076398998499,   0.032391052693128586, ...],
//   [-0.05358131229877472,  0.021030642092227936, 0.0010665050940588117, ...]
// ]

There are many models you can use to create vector embeddings from text, and, because you’re running locally, you can try them out to see which works best for your data. You should pay attention to the length of text that these models can handle. For example, the all-MiniLM-L6-v2 model does not provide good results for more than 128 tokens and can handle a maximum of 256 tokens, so it’s useful for sentences or small paragraphs. If you have a bigger source of text data than that, you’ll need to split your data into appropriately sized chunks.

Local embedding models like this are useful if you’re experimenting on your own machine, or have the right hardware to run them efficiently when deployed. It's an easy way to get comfortable with different models and get a feel for how things work without having to sign up to a bunch of different API services.

Having said that, there are a lot of useful vector embedding models available as an API, so let's take a look at them next.

APIs

There is an abundance of services that provide embedding models as APIs. These include LLM providers, like OpenAI, Google, or Cohere, as well as specialist providers like Voyage AI or Jina. Most providers have general-purpose embedding models, but some provide models trained for specific datasets, like Voyage AI's finance, law, and code optimised models.

These services provide HTTP APIs, often with an npm package to make it easy to call them. You'll typically need an API key from the service, and you can then generate embeddings by sending your text to the API.

For example, you can use Google's text embedding models through the Gemini API like this:

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.API_KEY });
const text =
  "A robot may not injure a human being or, through inaction, allow a human being to come to harm.";

const response = await ai.models.embedContent({
  model: "gemini-embedding-001",
  contents: text,
});

console.log(response.embeddings[0].values);
// => [ -0.0049246787, 0.031826325, -0.0075687882, ... ]
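For comparison, here's roughly the same call using the official openai package; the model name is one of OpenAI's embedding models and the API key is read from the environment.

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input:
    "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
});

console.log(response.data[0].embedding);
// => an array of 1536 numbers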

Each API is different though, so while making a request to create embeddings is normally fairly straightforward, you’ll likely have to learn a new method for each API you want to call—unless of course, you try one of the available frameworks that are intended to simplify this.

Frameworks

There are many projects out there, like LangChain or LlamaIndex, that create abstractions over the various parts of the GenAI toolchain, including embeddings.

Both LangChain and LlamaIndex enable you to generate embeddings via APIs or local models, all with the same interface. For example, here’s how you can create the same embedding as above using the Gemini API and LangChain together:

import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";

const embeddings = new GoogleGenerativeAIEmbeddings({
  apiKey: process.env.API_KEY,
  model: "gemini-embedding-001",
});
const text =
  "A robot may not injure a human being or, through inaction, allow a human being to come to harm.";

const embedding = await embeddings.embedQuery(text);
console.log(embedding);
// => [-0.0049246787, 0.031826325, -0.0075687882, ...]

To compare, this is what it looks like to use the OpenAI embeddings model through LangChain:

import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.API_KEY,
  model: "text-embedding-3-large",
});
const text =
  "A robot may not injure a human being or, through inaction, allow a human being to come to harm.";

const embedding = await embeddings.embedQuery(text);
console.log(embedding);
// => [0.009445431, -0.0073068426, -0.00814802, ...]

Aside from changing the name of the import and sometimes the options, the embedding models all have a consistent interface to make it easier to swap them out.

If you’re using LangChain to create your entire pipeline, these embedding interfaces work very well alongside the vector database interfaces. You can provide an embedding model to the database integration and LangChain handles generating the embeddings as you insert documents or perform vector searches. For example, here is how to embed some documents using Google's embeddings and store them in Astra DB via LangChain:

import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";
import { AstraDBVectorStore } from "@langchain/community/vectorstores/astradb";

const embeddings = new GoogleGenerativeAIEmbeddings({
  apiKey: process.env.API_KEY,
  model: "gemini-embedding-001",
});

const vectorStore = await AstraDBVectorStore.fromDocuments(
  documents, // a list of document objects to put in the store
  embeddings, // the embeddings model
  astraConfig // config to connect to Astra DB
);

When you provide the embeddings model to the database object, you can then use it to perform vector searches too.

const results = await vectorStore.similaritySearch(
  "Are robots allowed to protect themselves?"
);

LlamaIndex allows for similar creation of embedding models and vector stores that use them. Check out the LlamaIndex documentation on RAG.

As a bonus, the lists of models that LangChain and LlamaIndex integrate are good examples of popular embedding models.

Directly in the database

So far, the methods above mostly involve creating a vector embedding independently of storing the embedding in a vector database. When you want to store those vectors in a vector database like Astra DB, it looks a bit like this:

import { DataAPIClient } from "@datastax/astra-db-ts";
const client = new DataAPIClient(process.env.ASTRA_DB_APPLICATION_TOKEN);
const db = client.db(process.env.ASTRA_DB_API_ENDPOINT);
const collection = db.collection(process.env.ASTRA_DB_COLLECTION);

await collection.insertOne({
  text: "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
  $vector: [0.04574034, 0.038084425, -0.00916391, ...]
});

This assumes you have already created a vector enabled collection with the correct number of dimensions for the model you are using.

You can also search against the documents in your collection using a vector like this:

const cursor = collection.find({}, {
  sort: { $vector: [0.04574034, 0.038084425, -0.00916391, ...] },
  limit: 5,
});
const results = await cursor.toArray();

In this case, you have to create your vectors first, and then store or search against the database with them. Even in the case of the frameworks, that process happens, but it’s just abstracted away.

With Astra DB, you can have the database generate the embeddings for you as you’re inserting documents into a collection or as you perform a vector search against a collection.

This is called Astra DB Vectorize; here's how it works.

First, set up an embedding provider integration. There is a built-in integration offering the NVIDIA NV-Embed-QA model, or you can choose one of the other providers and configure them with your own API key.

Then when you set up a collection, you can choose which embedding provider you want to use and set the correct number of dimensions.

Now, when you add a document to this collection, you can add the content using the special key $vectorize and a vector embedding will be created.

await collection.insertOne({
  $vectorize:
    "A robot may not injure a human being or, through inaction, allow a human being to come to harm.",
});

When you want to perform a vector search against this collection, you can sort by the special $vectorize field and again, Astra DB will handle creating vector embeddings and then performing the search.

const cursor = collection.find(
  {},
  {
    sort: { $vectorize: "Are robots allowed to protect themselves?" },
    limit: 5,
  }
);
const results = await cursor.toArray();

This has several advantages:

  • It's robust, as Astra DB handles the interaction with the embedding provider
  • It can be quicker than making two separate API calls to create embeddings and then store them
  • It's less code for you to write

Choose the method that works best for your application

There are many models, providers, and methods you can use to turn text into vector embeddings. Creating vector embeddings from your content is a vital part of the RAG pipeline and it does require some experimentation to get it right for your data.

You have the choice to host your own models, call on APIs, use a framework, or let Astra DB handle creating vector embeddings for you. And, if you want to avoid code altogether, you could choose to use Langflow's drag-and-drop interface to create your RAG pipeline.

Retrieval-augmented generation (RAG) applications begin with data, so getting your data in the right shape to work well with vector databases and large language models (LLMs) is the first challenge you’re likely to face when you get started building. In this post, we'll discuss the different ways to work with text data in JavaScript, exploring how to split it up into chunks and prepare it for use in a RAG app.

Note: This post was originally written for DataStax, but didn't survive a content migration as part of IBM's purchase (https://www.ibm.com/new/announcements/ibm-to-acquire-datastax-helping-clients-bring-the-power-of-unstructured-data-to-enterprise-ai-applications). I thought the content was useful, so have republished it here.

Why chunking is important

Often you will have swaths of unstructured text content that need to be broken down into smaller chunks of data. These chunks of text are turned into vector embeddings and stored in a vector database like Astra DB.

Compared to a whole document, smaller chunks of text capture fewer topics or ideas, which means that their embeddings will contain more focused meaning. This makes each chunk easier to match against incoming user queries and makes for more accurate retrieval. When you improve your retrieval, you can feed fewer, but more relevant, tokens to an LLM and create a more accurate and useful RAG system.

If you want to read more on the theory behind text chunking check out this post from Unstructured on best practices for chunking.

Chunking in JavaScript

Let's move beyond theory and take a look at the options in JavaScript to chunk your data. The libraries we're going to look at are: llm-chunk, LangChain, LlamaIndex, and semantic-chunking. You can experiment with these chunking libraries using this app I put together.

We'll also take a look at using the Unstructured API for more complex use cases. Let's get started with the simplest of these modules.

llm-chunk

llm-chunk describes itself as a "super simple and easy-to-use text splitter" and it is! You can install it using npm with:

npm install llm-chunk

It consists of one function, chunk, and it takes just a few options. You can choose maximum and minimum sizes for the chunks it produces, pick the size of the overlap between chunks, and choose a strategy for splitting the text, either by sentence or paragraph.

By default, it will split text up by paragraphs with a maximum length of 1,000 characters, and no minimum length or overlap.

import { chunk } from "llm-chunk";
const text = loadText(); // get some text to split
const chunks = chunk(text);

You can choose to split the text by sentences, alter the maximum and minimum number of characters per chunk, or set how many characters each chunk should overlap.

const chunks = chunk(text, { minLength: 128, maxLength: 1024, overlap: 128 });

For more complex use cases, you can opt to pass a set of delimiters. These get turned into a regular expression and used as the basis for splitting up the text instead of just by paragraph or sentence. llm-chunk is simple, fast, and a great option when you are starting your RAG journey.
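
For example, here's a sketch of switching to the sentence strategy; the splitter option name is as I remember it from the llm-chunk docs, so double-check it against the version you install.

// Split by sentences instead of the default paragraph strategy
const sentenceChunks = chunk(text, {
  splitter: "sentence",
  maxLength: 512,
  overlap: 64,
});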

LangChain

LangChain is much more than a text splitter. It is a library that helps you load data, split it up, embed it, store that data in and retrieve it from vector databases, feed it as a prompt to LLMs, and more. There’s a lot to explore with LangChain, but we’re going to concentrate on text splitting.

You can install the LangChain text splitters with this command:

npm install @langchain/textsplitters

If you are using the main langchain module then @langchain/textsplitters is included as a dependency.

LangChain includes three main splitter classes, the CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter. Let's take a look at what each of them do.

CharacterTextSplitter

This is the simplest of the splitters provided by LangChain. It splits up a document by a character, then merges segments back together until they reach the desired chunk size, overlapping the chunks by the desired number of characters.

The default character for splitting up text is "\n\n". This means it initially aims to split the text by paragraphs, though you can also provide the character you want to split by. Here's how you would use this LangChain splitter:

import { CharacterTextSplitter } from "@langchain/textsplitters";
const text = loadText(); // get some text to split
const splitter = new CharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128,
});
const output = await splitter.splitText(text);
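
If your text isn't separated by blank lines, you can provide a different character to split on using the separator option; for example, to split on single newlines:

const lineSplitter = new CharacterTextSplitter({
  separator: "\n", // split on single newlines rather than blank lines
  chunkSize: 1024,
  chunkOverlap: 128,
});
const lineOutput = await lineSplitter.splitText(text);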

You can also output LangChain Documents which is useful if you are using the rest of LangChain to create a data pipeline:

const output = await splitter.createDocuments([text]);

RecursiveCharacterTextSplitter

The CharacterTextSplitter is naive and doesn't take into account much of the structure of a piece of text. The RecursiveCharacterTextSplitter goes further by using a list of separators to progressively break text down until it creates chunks that fit the size you want. By default, it splits text first by paragraphs, then sentences, then words.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const text = loadText(); // get some text to split
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128,
});
const output = await splitter.splitText(text);

You can provide different characters for the RecursiveCharacterTextSplitter to split the text on, so it can be used to split up other types of text. There are two classes available that make it easy to split up Markdown or LaTeX: the MarkdownTextSplitter and LatexTextSplitter respectively.
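
Using the MarkdownTextSplitter looks just like the other splitters, for example:

import { MarkdownTextSplitter } from "@langchain/textsplitters";
const markdown = loadText(); // get some markdown to split
const markdownSplitter = new MarkdownTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128,
});
const markdownOutput = await markdownSplitter.splitText(markdown);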

But this can also be used to split up code for a number of different languages. For example, if you wanted to split up a JavaScript file, you could do this:

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 300,
  chunkOverlap: 0,
});
const jsOutput = await splitter.splitText(jsCode);

The RecursiveCharacterTextSplitter is a versatile text splitter and is likely a good first stop when you're building up a pipeline to ingest data for your RAG application.

TokenTextSplitter

Taking a different approach, the TokenTextSplitter turns the text first into tokens using js-tiktoken, splits the tokens into chunks and then converts the tokens back into text.

Tokens are the way that LLMs consume content, a token can be a whole word or just a part of it. OpenAI has a good representation of how their models break text into tokens. You can use the TokenTextSplitter like this:

import { TokenTextSplitter } from "@langchain/textsplitters";
const text = loadText(); // get some text to split
const splitter = new TokenTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128,
});
const output = await splitter.splitText(text);

LlamaIndex

LlamaIndex is also responsible for much more than just text splitting. We're going to home in on the splitting capabilities though, which you can use outside of LlamaIndex too.

LlamaIndex considers chunks of a document as Nodes and the rest of the library works with Nodes. There are three available processors: SentenceSplitter, MarkdownNodeParser, and SentenceWindowNodeParser.

You can install the entire LlamaIndex suite with:

npm install llamaindex

If you just want the text parsers, you can install just the LlamaIndex core:

npm install @llamaindex/core

SentenceSplitter

The SentenceSplitter is the simplest of the LlamaIndex splitters. It splits the text into sentences and then combines them into a string that is smaller than the provided chunkSize.

LlamaIndex does this differently from the previous splitters; it measures the size of a chunk in tokens rather than characters. There are approximately four characters to a token and a good default is 1024 characters with an overlap of 128, so you should aim for chunks of 256 tokens and an overlap of 32 tokens.

The SentenceSplitter returns an array of chunks like this:

import { SentenceSplitter } from "@llamaindex/core/node-parser";
const text = loadText(); // get some text to split
const splitter = new SentenceSplitter({
  chunkSize: 256,
  chunkOverlap: 32,
});
const output = splitter.splitText(text);

MarkdownNodeParser

If you have Markdown to split up then the MarkdownNodeParser will split it up into logical sections based on the headers in the document.

This splitter doesn't let you set a chunkSize or overlap, so you do give up that level of control. It also works over LlamaIndex Documents or Nodes. In this example we turn our text into a Document first, then get the Nodes from the Documents.

import { MarkdownNodeParser } from "@llamaindex/core/node-parser";
import { Document } from "@llamaindex/core/schema";
const text = loadText(); // get some text to split
const splitter = new MarkdownNodeParser();
const nodes = splitter.getNodesFromDocuments([new Document({ text })]);
const output = nodes.map((node) => node.text);

SentenceWindowNodeParser

The final LlamaIndex parser breaks text down into sentences and then produces a Node for each sentence with a window of sentences to either side. You can choose how big the window is. Choosing a window size of one will produce Nodes with three sentences, the current sentence, one before and one after. A window size of two produces Nodes with five sentences, the current sentence, and two either side.

This parser works on Documents as well; you use it like so:

import { SentenceWindowNodeParser } from "@llamaindex/core/node-parser";
import { Document } from "@llamaindex/core/schema";
const text = loadText(); // get some text to split
const splitter = new SentenceWindowNodeParser({ windowSize: 3 });
const nodes = splitter.getNodesFromDocuments([new Document({ text })]);
const output = nodes.map((node) => node.text);

semantic-chunking

semantic-chunking is not a popular text splitter in the JavaScript world, but I wanted to include it as something that’s a bit different. It still gives you control over the maximum size of your chunks, but the way it splits up the chunks uses a more interesting method.

It first splits the text into sentences, and then it generates embedding vectors for each sentence using a locally downloaded model (by default it uses Xenova/all-MiniLM-L6-v2 but you can choose a different one if you want). It then groups the sentences into chunks based on how similar they are using cosine similarity. The intention is to group sentences that go together and have related contents, and then, when the topic changes, start a new chunk.

This is a smarter type of chunking than just splitting by chunk size, and likely even smarter than the markdown parsers that at least take section headings into account. The trade-off is that it is likely to be slower as there is more computation to be done.
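
To make that concrete, this is roughly the comparison that drives the grouping; it's an illustration of cosine similarity in plain JavaScript, not code from the library itself.

// Cosine similarity: close to 1 means two embeddings point in the same direction
function cosineSimilarity(a, b) {
  let dot = 0;
  let magnitudeA = 0;
  let magnitudeB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magnitudeA += a[i] * a[i];
    magnitudeB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magnitudeA) * Math.sqrt(magnitudeB));
}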

It is still simple to use though; install it with:

npm install semantic-chunking

You can pass a similarityThreshold to the chunkit function, which is the minimum cosine similarity required for two sentences to be included in the same chunk. A high threshold provides a high bar for a sentence to be included and likely results in smaller chunks. A low threshold will allow for fewer and bigger chunks. As always, it’s worth experimenting with this setting to find what works for your data.

import { chunkit } from "semantic-chunking";
const text = loadText(); // get some text to split
const chunks = await chunkit(text, {
  maxTokenSize: 256,
  similarityThreshold: 0.5,
});

There are other options around further combining chunks; check out the documentation for more detail.

Unstructured

Unstructured is a platform for extracting data from files and chunking it in a smart way. There is an open-source toolkit for this, a no-code platform, and an API. We're going to investigate the API here.

The API has support for extracting data from loads of different file types, including images and documents like PDFs that may contain images or tables of data. Below is a simple example of calling the Unstructured API; you can read about more of the capabilities, particularly around extracting data from PDFs and images, in the Unstructured API documentation.

First you should install the API client:

npm install unstructured-client

You will need an API key. You can get a free API key, or the paid service has a two-week free trial. Once you have an API key, you can use the client to call the API.

In this example, I am chunking text from a markdown file, but check out the API parameters you can use for behaviour with other types of file.

import { UnstructuredClient } from "unstructured-client";
import { ChunkingStrategy } from "unstructured-client/sdk/models/shared/index.js";
import { readFileSync } from "node:fs";

const client = new UnstructuredClient({
  serverURL: "https://api.unstructuredapp.io",
  security: {
    apiKeyAuth: "YOUR_API_KEY_HERE",
  },
});

const data = readFileSync("./post.md");

const res = await client.general.partition({
  partitionParameters: {
    files: {
      content: data,
      fileName: "post.md",
    },
    chunkingStrategy: ChunkingStrategy.BySimilarity,
  },
});
if (res.statusCode == 200) {
  console.log(res.elements);
}

You'll notice I selected a chunking strategy of similarity; you can read about the other chunking strategies available in the documentation as well as partitioning strategies, which are used for documents like images and PDFs.

What's your chunking strategy?

It turns out that there are many options for splitting text up into chunks in the JavaScript ecosystem: from the quick and simple, like llm-chunk, to fully featured libraries with full GenAI pipelines, like LangChain or LlamaIndex, to more complex methods like semantic-chunking and the Unstructured API, with all the options it brings for loading and splitting documents.

Getting your chunking strategy right is important for getting good results from your RAG application, so ensure you read about the best practices and try out the available options to see the sort of results you get from them. If you want to experiment with the libraries llm-chunk, LangChain, LlamaIndex, and semantic-chunking, check out the example application, Chunkers.

However you turn your text into chunks for your RAG application, it's good to understand all of the available options.

Generative AI enables us to build incredible new types of applications, but large language model (LLM) responses can be slow. If we wait for the full response before updating the user interface, we might be making our users wait more than they need to. Thankfully, most LLM APIs, including OpenAI, Anthropic, and Langflow, provide streaming endpoints that you can use to stream responses as they are generated. In this post, we're going to see how to use JavaScript's fetch API to immediately update your front-end application as an LLM generates output and create a better user experience.

<div class="info"> <p>This post was originally written for DataStax, but didn't survive a content migration as part of <a href="https://www.ibm.com/new/announcements/ibm-to-acquire-datastax-helping-clients-bring-the-power-of-unstructured-data-to-enterprise-ai-applications">IBM's purchase</a>. I thought the content was useful, so have republished it here.</p> </div>

Slow responses without streaming

Let's start with an example to show what a slow result looks like to a user. You can try it here. This example GitHub repo demonstrates an Express application that serves up static files and has one endpoint that streams some lorem ipsum text to the front-end. The following code is how the server streams the response:

app.get("/stream", async (_req, res) => {
  res.set("Content-Type", "text/plain");
  res.set("Transfer-Encoding", "chunked");

  const textChunks = text.replace(/\n/g, "<br>").split(/(?<=\.)/g);

  for (let chunk of textChunks) {
    res.write(chunk);
    await sleep(250);
  }

  res.end();
});

The Content-Type header shows that we are returning text and the Transfer-Encoding header tells the browser that the text will be arriving as a stream of chunks.

Express enables you to write content to the response at any time, using res.write. In this case, we break a section of text up into sentences and then write each sentence after a 250 millisecond gap. This is standing in for our streaming LLM response, as it is simpler than putting together an entire GenAI application to demonstrate this.

Let's take a look at how we would normally implement a fetch request to get some data.

const response = await fetch("/stream");
const text = await response.text();
output.innerHTML = text;

The text function collects the entire response then decodes it into text. In this example, we write the text to the page. This is fine if the response is fast, but our example server above returns a new chunk of text every quarter of a second, so how long it takes response.text() to resolve depends on how long the full response is.

Streaming with fetch

Setting up to stream a response from a server is not as straightforward as waiting for the full text, but the results are worth it. We're going to use the fact that the body of a response is a ReadableStream. We can then set up a streaming pipeline that decodes the incoming stream and writes it to the page.

We need to decode the stream because the chunks of stream we receive are bytes in the form of a Uint8Array. We want them as text, so we can use a TextDecoderStream to decode the bytes as they flow through. The TextDecoderStream is an implementation of a TransformStream; it has both a readable and writable stream, so it can be used in streaming pipelines.

We then want to write the text to the page, so we can build a custom WritableStream to handle that. Let's see the code:

const response = await fetch("/stream");
const decoderStream = new TextDecoderStream("utf-8");
const writer = new WritableStream({
  write(chunk) {
    output.innerHTML += chunk;
  },
});

response.body.pipeThrough(decoderStream).pipeTo(writer);

And now let's compare this version side-by-side to the version that waits for the whole response before it decodes and writes it to the page.

As you can see, the text takes the same amount of time to load whether we use the streaming version or the regular version, but users can start reading the response much earlier when it streams straight to the page.

When you are dealing with LLMs, or other responses that progressively generate a response, streaming the response and rendering it to the page as you receive it gives users a much better perceived performance over waiting for the full response.

More to know about streams

There's more to streams than the code above. Our basic implementation of a WritableStream included just a write function, which is called when new chunks of data are ready to be written. You can also define the following functions:

  • start: called as soon as the stream is constructed and used to set up any resources needed for writing the data
  • close: called once there are no more chunks to write to the stream and used to release any resources
  • abort: called if the application signals that the stream should be immediately closed and, like close, used to clean up any resources. Unlike close, abort will run even if there are chunks still to be written and cause them to be thrown away.

You can also pass a queuingStrategy to a WritableStream that allows you to control how fast the stream receives data. This is known as "backpressure," and you can read more about it on MDN.
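
Putting that together, here's a sketch of a sink that defines all of those functions along with a queuing strategy; the CSS class and log messages are just for illustration.

const writer = new WritableStream(
  {
    start() {
      // set up before any chunks arrive
      output.innerHTML = "";
    },
    write(chunk) {
      // called for each chunk of decoded text
      output.innerHTML += chunk;
    },
    close() {
      // called once the server finishes the response
      output.classList.add("complete");
    },
    abort(reason) {
      // called if the stream is torn down early; unwritten chunks are discarded
      console.error("Stream aborted:", reason);
    },
  },
  // buffer at most 5 chunks before the stream applies backpressure
  new CountQueuingStrategy({ highWaterMark: 5 })
);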

Catching errors

If there is an error in the stream, like a break in the connection between the front-end and the server, you will need to catch the error. The stream method pipeTo is the final method that is called in a pipeline and returns a Promise. You can either catch errors with the catch method on the Promise:

response.body
  .pipeThrough(decoderStream)
  .pipeTo(writer)
  .catch((error) => {
    console.log("Something went wrong with the stream!");
  });

Or you can await the result and use a try/catch block:

try {
  await response.body
    .pipeThrough(decoderStream)
    .pipeTo(writer);
} catch (error) {
  console.log("Something went wrong with the stream!");
}

Server-sent events

In the example above, the stream from the server was just text. Many LLM APIs, including Anthropic, Google, OpenAI, and Langflow, send more data back than just the text response and use the server-sent events standard to format those streams.

To send data as server-sent events we'd need to change a couple of things from our original server. The Content-Type becomes text/event-stream. Instead of sending plain text, each message must be labeled as either "event," "data," "id," or "retry," and messages are separated by two newlines. Taking the server side example from earlier, we could update it like so:

app.get("/stream", async (_req, res) => {
  res.set("Content-Type", "text/event-stream");
  res.set("Transfer-Encoding", "chunked");

  const textChunks = text.replace(/\n/g, "<br>").split(/(?<=\.)/g);

  for (let chunk of textChunks) {
    res.write(`event: message\ndata: ${chunk}\n\n`);
    await sleep(250);
  }

  res.end();
});

Now each message is of the form:

event: message
data: ${textChunk}

This means we need to parse these messages on the client side. Normally, this is easily done by making a connection to the server using the EventSource object and letting the browser parse the messages into events. If you are building with GenAI, you're most likely sending user queries to the server over a POST request, which is not supported by EventSource.
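
If you were able to use a GET request, the browser would do the parsing for you with just a couple of lines:

// EventSource only works over GET requests
const source = new EventSource("/stream");
source.addEventListener("message", (event) => {
  output.innerHTML += event.data;
});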

To use server-sent events with a POST request, you’ll need to parse the responses. One way to do this is to use the eventsource-parser module, which even makes a transform stream available; this fits in nicely with our existing application.

To handle server-sent events in our existing front-end, we can import an EventSourceParserStream from eventsource-parser/stream and use it in our pipeline.

import { EventSourceParserStream } from "eventsource-parser/stream";

const response = await fetch("/stream");
const decoderStream = new TextDecoderStream("utf-8");
const parserStream = new EventSourceParserStream();
const writer = new WritableStream({
  write(event) {
    output.innerHTML += event.data;
  },
});

response.body
  .pipeThrough(decoderStream)
  .pipeThrough(parserStream)
  .pipeTo(writer);

Note that now we get an event emitted from the stream that has a data property containing the data sent from the server. This is still text in this application, but you could send, for example, a JSON object containing structured data instead. You can see this implementation of server-sent event streaming in the example GitHub repo.

Async iterables

There is an easier way to handle the incoming chunks of a streaming response using a for await...of loop.

In this case we don't need to create our own WritableStream, though we do still need to use a TextDecoder to decode the bytes into a string. It relies on the body of the response being a ReadableStream that implements the async iterable protocol, and looks like this:

const decoder = new TextDecoder();
for await (const chunk of response.body) {
  const text = decoder.decode(chunk, { stream: true });
  streamOutput.innerHTML += text;
}

Sadly, Safari does not support this technique, so we have to avoid it in the front-end for now.

Get comfortable with streaming

Understanding how to stream a response from a web server and consume it on the client is vital to creating a great user experience when working with LLMs. Reading the fetch response body as a ReadableStream is the web-native way to do so. When you stream a potentially slow text response, you can improve the perceived performance of your application, leading to happier users and customers.

If you're playing with Langflow, you'll find that you can stream responses from the API, and you should take advantage of this.

Vercel's AI SDK uses fetch to make streaming easier for a number of frameworks with the useChat hook. For React Server Component users, you can also stream UI using Vercel's AI RSC API.

If you want to play around with streaming, you can check out the example code from this blog post on GitHub.

Grouping items in an array is one of those things you've probably done a load of times. Each time you would have written a grouping function by hand or perhaps reached for lodash's groupBy function.

The good news is that JavaScript is now getting grouping methods so you won't have to anymore. Object.groupBy and Map.groupBy are new methods that will make grouping easier and save us time or a dependency.

Grouping until now

Let's say you have an array of objects representing people and you want to group them by their age. You might use a forEach loop like this:

const people = [
  { name: "Alice", age: 28 },
  { name: "Bob", age: 30 },
  { name: "Eve", age: 28 },
];

const peopleByAge = {};

people.forEach((person) => {
  const age = person.age;
  if (!peopleByAge[age]) {
    peopleByAge[age] = [];
  }
  peopleByAge[age].push(person);
});
console.log(peopleByAge);
/*
{
  "28": [{"name":"Alice","age":28}, {"name":"Eve","age":28}],
  "30": [{"name":"Bob","age":30}]
}
*/

Or you may choose to use reduce, like this:

const peopleByAge = people.reduce((acc, person) => {
  const age = person.age;
  if (!acc[age]) {
    acc[age] = [];
  }
  acc[age].push(person);
  return acc;
}, {});

Either way, it's slightly awkward code. You always have to check the object to see whether the grouping key exists and if not, create it with an empty array. Then you can push the item into the array.

Grouping with Object.groupBy

With the new Object.groupBy method, you can achieve the same outcome like this:

const peopleByAge = Object.groupBy(people, (person) => person.age);

Much simpler! Though there are some things to be aware of.

Object.groupBy returns a null-prototype object. This means that the object does not inherit any properties from Object.prototype. This is great because it means you won't accidentally overwrite any properties on Object.prototype, but it also means that the object doesn't have any of the methods you might expect, like hasOwnProperty or toString.

const peopleByAge = Object.groupBy(people, (person) => person.age);
console.log(peopleByAge.hasOwnProperty("28"));
// TypeError: peopleByAge.hasOwnProperty is not a function

The callback function you pass to Object.groupBy should return a string or a Symbol. If it returns anything else, it will be coerced to a string.

In our example, we have been returning the age as a number, but in the result it is coerced to a string. You can still access the properties using a number, though, because square bracket notation also coerces its argument to a string.

console.log(peopleByAge[28]);
// => [{"name":"Alice","age":28}, {"name":"Eve","age":28}]
console.log(peopleByAge["28"]);
// => [{"name":"Alice","age":28}, {"name":"Eve","age":28}]

Grouping with Map.groupBy

Map.groupBy does almost the same thing as Object.groupBy except it returns a Map. This means that you can use all the usual Map functions. It also means that you can return any type of value from the callback function.

const ceo = { name: "Jamie", age: 40, reportsTo: null };
const manager = { name: "Alice", age: 28, reportsTo: ceo };

const people = [
  ceo,
  manager,
  { name: "Bob", age: 30, reportsTo: manager },
  { name: "Eve", age: 28, reportsTo: ceo },
];

const peopleByManager = Map.groupBy(people, (person) => person.reportsTo);

In this case, we are grouping people by who they report to. Note that to retrieve items from this Map by an object, the objects have to have the same identity.

peopleByManager.get(ceo);
// => [{ name: "Alice", age: 28, reportsTo: ceo }, { name: "Eve", age: 28, reportsTo: ceo }]
peopleByManager.get({ name: "Jamie", age: 40, reportsTo: null });
// => undefined

In the above example, the second line uses an object that looks like the ceo object, but it is not the same object so it doesn't return anything from the Map. To retrieve items successfully from the Map, make sure you keep a reference to the object you want to use as the key.

When will this be available?

The two groupBy methods are part of a TC39 proposal that is currently at stage 3. This means that there is a good chance it will become a standard and, as such, there are implementations appearing.

Chrome 117 just launched with support for these two methods, and Firefox released support in version 119. Safari had implemented these methods under different names; I'm sure they will update that soon. As the methods are in Chrome, they have been implemented in V8, so they will be available in Node.js the next time V8 is updated.

Why use static methods?

You might wonder why this is being implemented as Object.groupBy and not Array.prototype.groupBy. According to the proposal, there is a library that used to monkey patch Array.prototype with an incompatible groupBy method. When considering new APIs for the web, backwards compatibility is hugely important. This was highlighted a few years ago when trying to implement Array.prototype.flatten, in an event known as SmooshGate.

Fortunately, using static methods actually seems better for future extensibility. When the Records and Tuples proposal comes to fruition, we can add a Record.groupBy method for grouping arrays into an immutable record.

JavaScript is filling in the gaps

Grouping items together is clearly an important thing we do as developers. lodash.groupBy is currently downloaded from npm between 1.5 and 2 million times a week. It's great to see JavaScript filling in these gaps and making it easier for us to do our jobs.

For now, go get Chrome 117 and try these new methods out for yourself.

With the recent release of version 20.6.0, Node.js now has built-in support for .env files. You can now load environment variables from a .env file into process.env in your Node.js application completely dependency-free.

Loading an .env file is now as simple as:

node --env-file .env

What is .env?

.env files are used to configure environment variables that will be present within a running application. The idea comes from the Twelve-Factor App methodology, which says to store everything that is likely to vary between deploys (e.g. dev, staging, production) in the environment.

Config should not be a part of your application code and should not be checked in to version control. Things like API credentials, or other secrets, should be stored separately and loaded in the environment in which they are needed. A .env file lets you manage your config for applications where it isn't practical to set variables in the environment, like your development machine or <abbr title="continuous integration">CI</abbr>.

There are libraries in many different languages that support using a .env file to load variables into the environment; they are usually called "dotenv", and the Node.js dotenv package is no different. But now, Node.js itself supports this behaviour.

How do you use .env in Node.js?

A .env file looks like this:

PASSWORD=supersecret
API_KEY=84de8263ccad4d3dabba0754e3c68b7a
# .env files can have comments too

By convention you would save this as .env in the root of your project, though you can call it whatever you want.

You can then set the variables in the file as environment variables by starting Node.js with the --env-file flag pointing to your .env file. When loaded, the variables are available as properties of process.env.

$ node --env-file .env
Welcome to Node.js v20.6.0.
Type ".help" for more information.
> console.log(process.env.PASSWORD)
supersecret
undefined
> console.log(process.env.API_KEY)
84de8263ccad4d3dabba0754e3c68b7a
undefined

Supported features

Support right now is fairly basic compared to dotenv, which has extra features that Node.js doesn't yet support.

But the feature is under active development. Since the 20.7.0 release, you can now specify multiple files. The variables from the last file will override any previous files.

node --env-file .env --env-file .env.development

There is more work to be done, and some of these features may be added. You can follow the discussion on GitHub here.

Incorrect features

In the 20.6.0 release, the documentation says, "If the same variable is defined in the environment and in the file, the value from the environment takes precedence." This is the way that all dotenv packages work by default. However, that is not currently true of Node.js's implementation and variables in the .env file will override the environment.

This has been fixed as of version 20.7.0. Variables defined in the environment now take precedence over variables in a .env file.
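
You can check this for yourself by setting a variable in the shell before starting Node.js; with the .env file from earlier, which sets PASSWORD=supersecret, the shell value now wins:

$ PASSWORD=from-the-environment node --env-file=.env -p "process.env.PASSWORD"
from-the-environment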

Benefits to Node.js's implementation

Even though this implementation is missing some features, it has some benefits over using a third-party package. Node.js loads and parses the .env file as it is starting up, so you can include environment variables that configure Node.js itself, like NODE_OPTIONS.

So, you can have an .env file that looks like this:

NODE_OPTIONS="--no-warnings --inspect=127.0.0.1:9229"

Then, when you run node --env-file=.env the process will run without emitting warnings and it will activate the inspector on the IP address 127.0.0.1:9229.

Note: you cannot put NODE_OPTIONS="--env-file .env" in your .env file. It is disallowed to avoid infinite loops.

Node.js just keeps improving

Go try out Node.js version 20.6.0! Version 20 has brought new features, like a stable test runner, mock timers, and now .env file support, as well as many other upgrades, fixes and improvements. Version 20 becomes the active <abbr title="long term support">LTS</abbr> version of Node.js in October, so now is a good time to test these new features out and start considering upgrading your application to take advantage.

Generating pagination links is not as straightforward as it may seem. So, while rebuilding my own site with Astro, I released a <Pagination /> component on npm as @philnash/astro-pagination that anyone can use in their Astro site. Read on to find out more.

Pagination

Pagination is something that most content sites need. It is often better to list collections with lots of entries, like blog posts, across multiple pages because a single page would be overwhelming to scroll through.

Astro provides the paginate function to the getStaticPaths callback to make it easy to turn a collection into a number of paths and pages. Once you have turned your collection into a list of pages, you then need to render links that your users can use to navigate through the list.
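
As a quick reminder of that first step, a paginated listing page's frontmatter looks something like this sketch, assuming a blog content collection and a route file at src/pages/blog/page/[page].astro:

---
import { getCollection } from "astro:content";

export async function getStaticPaths({ paginate }) {
  const posts = await getCollection("blog");
  // creates /blog/page/1, /blog/page/2, and so on, 10 posts at a time
  return paginate(posts, { pageSize: 10 });
}

const { page } = Astro.props;
---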

While this may at first seem straightforward, as with many things on the web, there are hidden depths to it. Not only do you need to write the code that parses the current page and produces links to the previous page, the next page and a window of pages around the current one, you also need to produce accessible HTML that will be easy to use for any of your site's visitors.

An Astro Pagination component

To make this easy for anyone building with Astro, I released @philnash/astro-pagination on npm. It is a <Pagination /> component that you can use in your Astro site to create pagination links based on a Page object.

How to use it

Start by installing the package:

npm install @philnash/astro-pagination

Then in a list page, you can import and use the Pagination component. The component requires two properties, a page and a urlPattern. The page should be an Astro Page object, typically provided through Astro.props. The urlPattern should be the path you are using for your paginated pages, with a {} where the page number should go. A simplified Astro page might look something like this:

---
import Pagination from "@philnash/astro-pagination";

export async function getStaticPaths({ paginate }) { /* ... */ }
const { page } = Astro.props;
---

{ /* render the items from the page */ }

<Pagination page={page} urlPattern="/blog/page/{}" />

This will render HTML that looks like:

<nav role="navigation" aria-label="Pagination">
  <ul>
    <li>
      <a
        href="/blog/page/4"
        class="previous-page"
        aria-label="Go to previous page"
        >« Prev</a
      >
    </li>
    <li>
      <a class="number" href="/blog/page" aria-label="Go to page 1"> 1 </a>
    </li>
    <li>
      <span class="start-ellipsis">…</span>
    </li>
    <li>
      <a class="number" href="/blog/page/3" aria-label="Go to page 3"> 3 </a>
    </li>
    <li>
      <a class="number" href="/blog/page/4" aria-label="Go to page 4"> 4 </a>
    </li>
    <li>
      <em aria-current="page" aria-label="Current page, page 5"> 5 </em>
    </li>
    <li>
      <a class="number" href="/blog/page/6" aria-label="Go to page 6"> 6 </a>
    </li>
    <li>
      <a class="number" href="/blog/page/7" aria-label="Go to page 7"> 7 </a>
    </li>
    <li>
      <span class="end-ellipsis">…</span>
    </li>
    <li>
      <a class="number" href="/blog/page/10" aria-label="Go to page 10"> 10 </a>
    </li>
    <li>
      <a href="/blog/page/6" class="next-page" aria-label="Go to next page"
        >Next »</a
      >
    </li>
  </ul>
</nav>

This renders like this:

Well, it renders like that with my site's CSS applied. You will need to style it yourself.

The generated markup includes a bunch of things, including accessibility features based on research from a11ymatters on accessible pagination. There is:

  • a link to the previous page
  • a link to the first page
  • a window of links around the current page
  • ellipses to show where pages exist between the first/last page and the current window
  • a link to the last page
  • a link to the next page
  • a <nav> element around the links, with a role attribute set to "navigation" and an aria-label attribute to describe it as "Pagination"
  • a list of links, allowing assistive technology to announce how many items there are in the list and navigate through them
  • an aria-label attribute on each link to provide a full description of the link's destination
  • an aria-current attribute on the element representing the current page
  • a helpful class name on each of the important elements to allow for styling

Advanced usage

There are more properties you can pass to the <Pagination /> component that give you greater control over the output. They include properties like previousLabel and nextLabel, which let you set the text for the previous and next links, or windowSize, which lets you determine how many pages are shown in the middle of the range. You can see all the available options in the documentation.
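
For example, to change the link text and widen the window of numbered links (the values here are just illustrative):

<Pagination
  page={page}
  urlPattern="/blog/page/{}"
  previousLabel="Newer posts"
  nextLabel="Older posts"
  windowSize={2}
/>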

Future improvements

While the <Pagination /> component is ready to be used, and is already in use on my own site, there are definitely some improvements that I will be adding. For example, you should be able to:

  • use a component for the previousLabel and nextLabel
  • style the links like other Astro components
  • add class names to the elements, so you can style using utility CSS frameworks
  • handle internationalisation

Ultimately, I'd love for this to be a component that every Astro site considers using for pagination, from my blog right here to the Astro blog itself.

Any feedback?

I tried to make this component simple to use and flexible. As I've described above, there's plenty more to do, but it's always worth asking for feedback too.

So, would you use this component to simplify pagination in your own Astro sites? Is there anything you'd add or change? Let me know on Twitter, another social network of your choice, or in the GitHub issues.

Astro Pagination

Check out the source on GitHub (give it a star, if you fancy) and let me know if you use this component.

Bluesky is the new social network in town and it's an exciting place to explore right now. I was fortunate enough to get an invite early on and take part in the early community. But Bluesky is not just a Twitter clone, it's an application on top of The AT Protocol, a (still being built) federated protocol for social networks with some interesting properties.

Because there's a protocol, that also means there's an API. If you remember far enough back to the early days of Twitter, the API drove a lot of exciting applications and features for users. There was a swarm of activity around Twitter's API and now, even in the early days, there is a similar excitement about Bluesky's API. So let's look into how to get started with the API using TypeScript and build a bot that runs on a schedule using GitHub Actions.

What you will need

In order to build against the Bluesky API, you will need an account. As I write this, accounts are invite only, but as the application and protocol stabilise, I expect more invites to become available.

You will also need Node.js and npm installed.

Getting started with the API

Let's write a quick script to post a status to Bluesky from JavaScript. In your terminal create a new Node.js project:

mkdir bluesky-bot
cd bluesky-bot
npm init --yes

Add an .nvmrc file containing the Node.js version (22 in this case) so that we can guarantee the Node.js version this will run on.

echo 22 > .nvmrc

Install the AT Protocol/Bluesky API client:

npm install @atproto/api

This library gives us easy access to the Bluesky API, a rich text library for formatting links and mentions, and lower level access to the AT Protocol itself.

Send your first Bluesky post

Create a file called index.js and open it in your editor.

Start by requiring the AtpAgent class from the package.

const { AtpAgent } = require("@atproto/api");

Create an asynchronous function that will connect to the service and send a post. Within that function, instantiate a new agent, setting the service to "https://bsky.social", the only available AT Protocol service at the moment. Then log the agent in with your Bluesky identifier (your username) and an app password. This is just a quick script to get going with, so we're going to embed our credentials for now; in reality, you want to keep credentials out of your code and load them through environment variables or similar.

async function sendPost() {
  const agent = new AtpAgent({ service: "https://bsky.social" });
  await agent.login({
    identifier: "YOUR_IDENTIFIER_HERE",
    password: "YOUR_PASSWORD_HERE",
  });
  // ...
}

Once you've logged in, you can then use the agent to post a status.

async function sendPost(text) {
  const agent = new AtpAgent({ service: "https://bsky.social" });
  await agent.login({
    identifier: "YOUR_IDENTIFIER_HERE",
    password: "YOUR_PASSWORD_HERE",
  });
  await agent.post({ text });
}

Now you can call the sendPost method with the text you want to send to the API:

sendPost("Hello from the Bluesky API!");

Run this code in your terminal with node index.js and you will see your first post from the API on your Bluesky account.

Sending posts with rich text

If you want to send links or mentions on the platform you can't just send plain text. Instead you need to send rich text, and the library provides a function to create that. Let's update the above code to generate rich text and use it to make a post.

First, require the RichText module.

const { AtpAgent, RichText } = require("@atproto/api");

Then take the text you want to send and create a new RichText object with it. Use that object to detect the facets in the text, then pass both the text and the facets to the post method.

async function sendPost(text) {
  const agent = new AtpAgent({ service: "https://bsky.social" });
  await agent.login({
    identifier: "YOUR_IDENTIFIER_HERE",
    password: "YOUR_PASSWORD_HERE",
  });
  const richText = new RichText({ text });
  await richText.detectFacets(agent);
  await agent.post({
    text: richText.text,
    facets: richText.facets,
  });
}

If you call the sendPost function with text that includes a user mention or a link, it will be correctly linked in Bluesky and notify the mentioned user.

sendPost("Hello from the Bluesky API! Hi @philna.sh!");

That's the basics on creating posts using the Bluesky API. Now let's take a look at scheduling the posts.

Scheduling with GitHub Actions

GitHub Actions lets you automate things in your repositories. This means we can use it to automate posting to Bluesky. One of the triggers for a GitHub Action is the schedule which lets you run a workflow at specified times using cron syntax.

We can add a GitHub Actions workflow to this application that will start working when we push the repo up to GitHub. Before we do that, we should remove our credentials first. Update the code that logs in to the Bluesky service to use environment variables instead of hard coding the credentials:

  await agent.login({
    identifier: process.env.BSKY_HANDLE,
    password: process.env.BSKY_PASSWORD,
  });

Next, create a directory called .github with a directory called workflows inside.

mkdir -p .github/workflows

Create a YAML file called post.yml and open it in your editor. Add the following:

name: "Post to Bluesky"

on: workflow_dispatch

jobs:
  post:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Use Node.js
        uses: actions/setup-node@v3
        with:
          node-version-file: ".nvmrc"
      - run: npm ci
      - name: Send post
        run: node index.js
        env:
          BSKY_HANDLE: ${{ secrets.BSKY_HANDLE }}
          BSKY_PASSWORD: ${{ secrets.BSKY_PASSWORD }}

This workflow file does a bunch of things. It sets up one job called post that will:

  • run on the latest Ubuntu
  • checkout the repository
  • install the Node.js version listed in the .nvmrc file we created earlier
  • install the dependencies
  • run the index.js file, with two secrets added to the environment

In the workflow above, the workflow_dispatch trigger is used instead of the schedule trigger. The workflow_dispatch trigger allows you to start a workflow by visiting it in the GitHub UI and pressing a button. It's a great way to test that your workflow is working without having to wait for the schedule or push a new commit.

Create a GitHub repo, and push all this code up to it. In the repo settings, find Actions secrets and variables. Add two secrets called BSKY_HANDLE and BSKY_PASSWORD which contain the credentials you were using earlier.

Testing it out

With your code and secrets in place head to the Actions tab for your repo. Click on the workflow called "Post to Bluesky" and then find the button that says "Run workflow". This is the workflow_dispatch trigger and it will run the workflow, eventually running your code and posting to Bluesky. Use this to test out the workflow and any changes to the code before you eventually write the schedule.

Scheduling the workflow

Once you are happy with your code and the workflow is working, it's time to set up a schedule. Remove the workflow_dispatch trigger and replace it with the schedule trigger, which looks like this:

on:
  schedule:
    - cron: "30 5,17 * * *"

I don't read cron, but crontab.guru tells me that this would run the workflow at 5:30 and 17:30 every day. I recommend playing around with that tool to get your schedule correct.

Once you are happy, save, commit and push to GitHub and your Bluesky bot will set off posting.

A template to make this easier

To make this easier, I created a template that has all of the above ready to go for you. Hit the big green "Use this template" button on the repo and you will get your own project ready to go. All you need to do is provide your own function that will return the text that will get posted to Bluesky. There are also instructions in the README to walk you through it all.

My first bot

I've used this template repo to create my first bot on the Bluesky platform. It's a simple but fun one. It posts an hourly dad joke to Bluesky from the icanhazdadjoke.com API. You can find the code for this bot on GitHub too.

Bluesky is going to be a lot of fun

When Twitter first started, the availability of the API caused a wave of creativity from developers. Even though Bluesky remains in very early, invite-only mode, there are already a lot of things being built, and it is exciting to see.

I'm looking forward to creating bots with this method, but also exploring more of the API, data and protocol to see what can be achieved.

If you have an account with Bluesky, come follow me here. See you in the sky!

ChatGPT has taken the world by storm and this week OpenAI released the ChatGPT API. I've spent some time playing with ChatGPT in the browser, but the best way to really get on board with these new capabilities is to try building something with it. With the API available, now is that time.

This was inspired by Greg Baugues's implementation of a chatbot command line interface (CLI) in 16 lines of Python. I thought I'd start by trying to build the same chatbot but using JavaScript.

(It turns out that Ricky Robinett also had this idea and published his bot code here, it's pleasing to see how similar the implementations are!)

The code

It turns out that Node.js requires a bit more code to deal with command line input than Python, so where Greg's version was 16 lines mine takes 31. Having built this little bot, I'm no less excited about the potential for building with this API though.

Here's the full code, I'll explain what it is doing further down.

import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output, env } from "node:process";
import { Configuration, OpenAIApi } from "openai";

const configuration = new Configuration({ apiKey: env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);
const readline = createInterface({ input, output });

const chatbotType = await readline.question(
  "What type of chatbot would you like to create? "
);
const messages = [{ role: "system", content: chatbotType }];
let userInput = await readline.question("Say hello to your new assistant.\n\n");

while (userInput !== ".exit") {
  messages.push({ role: "user", content: userInput });
  try {
    const response = await openai.createChatCompletion({
      messages,
      model: "gpt-3.5-turbo",
    });

    const botMessage = response.data.choices[0].message;
    if (botMessage) {
      messages.push(botMessage);
      userInput = await readline.question("\n" + botMessage.content + "\n\n");
    } else {
      userInput = await readline.question("\nNo response, try asking again\n");
    }
  } catch (error) {
    console.log(error.message);
    userInput = await readline.question("\nSomething went wrong, try asking again\n");
  }
}

readline.close();

When you run this code it looks like this:

<figure class="post-image post-image-left"> <img src="/posts/chatgpt/chatgpt.gif" alt="An example of the chatbot running. I ask it to respond in haiku and it does twice." loading="lazy" /> </figure>

Let's dig into how it works and how you can build your own.

Building a chatbot

You will need an OpenAI platform account to interact with the ChatGPT API. Once you have signed up, create an API key from your account dashboard.

As long as you have Node.js installed, the only other thing you'll need is the openai Node.js module.

Let's start a Node.js project and create this CLI application. First create a directory for the project, change into it and initialise it with npm:

mkdir chatgpt-cli
cd chatgpt-cli
npm init --yes

Install the openai module as a dependency:

npm install openai

Open package.json and add the key "type": "module" to the configuration, so we can build this as an ES module which will allow us to use top level await.

Create a file called index.js and open it in your editor.

Interacting with the OpenAI API

There are two parts to the code, dealing with input and output on the command line and dealing with the OpenAI API. Let's start by looking at how the API works.

First we import two objects from the openai module, the Configuration and OpenAIApi. The Configuration class will be used to create a configuration that holds the API key, you can then use that configuration to create an OpenAIApi client.

import { env } from "node:process";
import { Configuration, OpenAIApi } from "openai";

const configuration = new Configuration({ apiKey: env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);

In this case, we'll store the API key in the environment and read it with env.OPENAI_API_KEY.

To interact with the API, we now use the OpenAI client to create chat completions for us. OpenAI's text-generating models don't actually converse with you, but are built to take input and come up with plausible-sounding text that would follow that input, a completion. With ChatGPT, the model is configured to receive a list of messages and then come up with a completion for the conversation. Messages in this system can come from one of three different entities: the "system", "user", and "assistant". The "assistant" is ChatGPT itself, the "user" is the person interacting, and the "system" allows the program (or the user, as we'll see in this example) to provide instructions that define how the assistant behaves. Changing the system prompt to alter how the assistant behaves is one of the most interesting things to play around with and allows you to create different types of assistants.

With our openai object configured as above, we can create messages to send to an assistant and request a response like this:

const messages = [
  { role: "system", content: "You are a helpful assistant" },
  { role: "user", content: "Can you suggest somewhere to eat in the centre of London?" }
];
const response = await openai.createChatCompletion({
  messages,
  model: "gpt-3.5-turbo",
});
console.log(response.data.choices[0].message);
// => "Of course! London is known for its diverse and delicious food scene..."

As the conversation goes on, we can add the user's questions and assistant's responses to the messages array, which we send with each request. That gives the bot history of the conversation, context for which it can build further answers on.

To create the CLI, we just need to hook this up to user input in the terminal.

Interacting with the terminal

Node.js provides the Readline module which makes it easy to receive input and write output to streams. To work with the terminal, those streams will be stdin and stdout.

We can import stdin and stdout from the node:process module, renaming them to input and output to make them easier to use with Readline. We also import the createInterface function from node:readline/promises.

import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";

We then pass the input and output streams to createInterface and that gives us an object we can use to write to the output and read from the input, all with the question function:

const readline = createInterface({ input, output });

const chatbotType = await readline.question(
  "What type of chatbot would you like to create? "
);

The above code hooks up the input and output stream. The readline object is then used to post the question to the output and return a promise. When the user replies by writing into the terminal and pressing return, the promise resolves with the text that the user wrote.

Completing the CLI

With both of those parts, we can write all of the code. Create a new file called index.js and enter the code below.

We start with the imports we described above:

import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output, env } from "node:process";
import { Configuration, OpenAIApi } from "openai";

Then we initialise the API client and the Readline module:

const configuration = new Configuration({ apiKey: env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);
const readline = createInterface({ input, output });

Next, we ask the first question of the user: "What type of chatbot would you like to create?". We will use the answer to this to create a "system" message in a new array of messages that we will continue to add to as the conversation goes on.

const chatbotType = await readline.question(
  "What type of chatbot would you like to create? "
);
const messages = [{ role: "system", content: chatbotType }];

We then prompt the user to start interacting with the chatbot and start a loop that keeps sending their input to the API while it is not equal to the string ".exit". If the user enters ".exit" the program will end, like in the Node.js REPL.

let userInput = await readline.question("Say hello to your new assistant.\n\n");

while (userInput !== ".exit") {
  // loop
}

readline.close();

Inside the loop, we add the userInput to the messages array as a "user" message. Then, within a try/catch block, we send the messages to the OpenAI API. We set the model to "gpt-3.5-turbo", which is the model name for ChatGPT.

When we get a response from the API we get the message out of the response.data.choices array. If there is a message we store it as an "assistant" message in the array of messages and output it to the user, waiting for their input again using readline. If there is no message in the response from the API, we alert the user and wait for further user input. Finally, if there is an error making a request to the API we catch the error, log the message and tell the user to try again.

while (userInput !== ".exit") {
  messages.push({ role: "user", content: userInput });
  try {
    const response = await openai.createChatCompletion({
      messages,
      model: "gpt-3.5-turbo",
    });

    const botMessage = response.data.choices[0].message;
    if (botMessage) {
      messages.push(botMessage);
      userInput = await readline.question("\n" + botMessage.content + "\n\n");
    } else {
      userInput = await readline.question("\nNo response, try asking again\n");
    }
  } catch (error) {
    console.log(error.message);
    userInput = await readline.question(
      "\nSomething went wrong, try asking again\n"
    );
  }
}

Put that all together and you have your assistant. The full code is at the top of this post or on GitHub.

You can now run the assistant by passing it your OpenAI API key as an environment variable on the command line:

OPENAI_API_KEY=YOUR_API_KEY node index.js

This will start your interaction with the assistant, starting with it asking what kind of assistant you want. Once you've declared that, you can start chatting with it.

Experimenting helps us to understand

Personally, I'm not actually sure how useful ChatGPT is. It is clearly impressive, its ability to return text that reads as if it was written by a human is incredible. However, it returns content that is not necessarily correct, regardless of how confidently it presents that content.

Experimenting with ChatGPT is the only way that we can try to understand what it is useful for, so building a simple chatbot like this gives us grounds for that experiment. Learning that the system prompt can give the bot different personalities and make it respond in different ways is very interesting.

You might have heard, for example, that you can ask ChatGPT to help you with programming, but you could also specify a JSON structure and effectively use it as an API. As you experiment with that you will likely find that it should not be an information API, but more likely something you can use to understand natural text and turn it into a JSON object. To me this is exciting, as it means that ChatGPT could help create more natural voice assistants that can translate meaning from speech better than the existing crop, which expect commands to be given in a more exact manner. I still have experimenting to do with this idea, and having this tool gives me that opportunity.
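
As a quick sketch of that idea, you could seed the messages array with a system message describing the JSON you want back. The wording and fields here are made up purely for illustration:

const messages = [
  {
    role: "system",
    content:
      "You turn requests for meetings into JSON in the form " +
      '{ "title": string, "date": string, "attendees": string[] }. ' +
      "Respond with only the JSON.",
  },
  { role: "user", content: "Coffee with Alice next Tuesday at 10am" },
];

Whether the model sticks to that structure reliably is exactly the kind of thing this experimenting should reveal.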

This is just the beginning

If experimenting with this technology is how we come to understand what we can build with it, and what we should or should not build with it, then making it easier to experiment is the next goal. Mine is to expand this tool so that it can save, interact with, and edit multiple assistants, so that you can continue to work with them and improve them over time.

In the meantime, you can check out the full code for this first assistant on GitHub, and follow the repo to keep up with improvements.

I recently came across this blog post from Ruud van Asseldonk titled "The yaml document from hell". I've always heard that yaml has its pitfalls, but hadn't looked into the details and thankfully hadn't been affected, mainly due to my very infrequent and simple use of yaml. If you are in the same boat as me, I recommend reading that article now as I almost can't believe I've avoided any issues with it.

The article digs into the issues in the yaml spec itself, and then describes what happens in Python's PyYAML and Golang's yaml library with an example file, the titular yaml document from hell. I wanted to see how things were in the JavaScript ecosystem.

Yaml in JavaScript

A search for JavaScript yaml parsers on npm brings up yaml (which I have used in my own project) and js-yaml. js-yaml has the most weekly downloads according to npm and the most stars on GitHub; however, yaml seems to be under more active development, having been published most recently (a month ago at the time of writing) compared to js-yaml's last publish almost 2 years ago. There is also yamljs, but the project hasn't received a commit since November 2019 and hasn't been released for 6 years, so I am going to disregard it for now.

Let's see what yaml and js-yaml do with the yaml document from hell.

The document itself

To save yourself from going back and forth between van Asseldonk's article and this one, here is the yaml document.

server_config:
  port_mapping:
    # Expose only ssh and http to the public internet.
    - 22:22
    - 80:80
    - 443:443

  serve:
    - /robots.txt
    - /favicon.ico
    - *.html
    - *.png
    - !.git  # Do not expose our Git repository to the entire world.

  geoblock_regions:
    # The legal team has not approved distribution in the Nordics yet.
    - dk
    - fi
    - is
    - no
    - se

  flush_cache:
    on: [push, memory_pressure]
    priority: background

  allow_postgres_versions:
    - 9.5.25
    - 9.6.24
    - 10.23
    - 12.13

So how do our JavaScript libraries handle this file?

The failures

Anchors, aliases, and tags

Let's start with the failures. As described in the original article under the subhead "Anchors, aliases, and tags" this section is invalid:

  serve:
    - /robots.txt
    - /favicon.ico
    - *.html
    - *.png
    - !.git  # Do not expose our Git repository to the entire world.

This causes both of our JavaScript yaml libraries to throw an error, each referencing an undefined alias. This is because the * is a way to reference an anchor created earlier in the document using an &. In our document's case, that anchor was never created, so this is a parsing error.

If you want to learn more about anchors and aliases, they seem to be most useful in build pipelines. Both Bitbucket and GitLab have written about how to use anchors to avoid repeating sections in yaml files.
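
To see what a resolving anchor and alias look like, here's a small sketch using the yaml package with a made-up document; the & defines the anchor and the * refers back to it:

import { parse } from "yaml";

// &homepage defines an anchor on the first value, *homepage refers back to it.
const doc = `
serve:
  - &homepage /index.html
  - /about.html
  - *homepage
`;

console.log(parse(doc));
// => { serve: [ '/index.html', '/about.html', '/index.html' ] }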

For the purposes of trying to get the file to parse, we can make those aliases strings as they were likely intended.

  serve:
    - /robots.txt
    - /favicon.ico
    - "*.html"
    - "*.png"
    - !.git  # Do not expose our Git repository to the entire world.

Now we get another parsing error from our libraries; both of them complain about an unknown or unresolved tag. The ! at the start of !.git is the character triggering this behaviour.

Tags seem to be the most complicated part of yaml to me. They depend on the parser you are using and allow that parser to do something custom with the content that follows the tag. My understanding is that you could use this in JavaScript to, say, tag some content to be parsed into a Map instead of an Object or a Set instead of an Array. Van Asseldonk explains the danger with this alarming sentence:

This means that loading an untrusted yaml document is generally unsafe, as it may lead to arbitrary code execution.

PyYAML apparently has a safe_load method that will avoid this, but Go's yaml package doesn't. It seems that the JavaScript libraries also lack this feature, so the warning for untrusted yaml documents stands.

If you do want to take advantage of the tag feature in yaml, you can check out the yaml package's documentation on custom data types or js-yaml's supported yaml types and unsafe type extensions.
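
As a rough sketch of what a custom tag could look like, here's a made-up !set tag using js-yaml's Type and Schema APIs. Treat it as an outline of the approach rather than a definitive example, and check the js-yaml docs for the details:

import jsyaml from "js-yaml";

// A custom !set tag that turns a yaml sequence into a JavaScript Set.
const setType = new jsyaml.Type("!set", {
  kind: "sequence",
  construct: (data) => new Set(data),
});
const schema = jsyaml.DEFAULT_SCHEMA.extend([setType]);

const doc = `
allowed_extensions: !set
  - .html
  - .png
`;

console.log(jsyaml.load(doc, { schema }));
// => { allowed_extensions: Set(2) { '.html', '.png' } }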

To make the yaml file parse, let's encase all the weird yaml artifacts in quotes to make them strings:

  serve:
    - /robots.txt
    - /favicon.ico
    - "*.html"
    - "*.png"
    - "!.git"  # Do not expose our Git repository to the entire world.

With the serve block looking like it does above, the file now parses. So what happens to the rest of the potential yaml gotchas?

Accidental numbers

One thing that I am gathering from this investigation so far is that if you need something to be a string, do not be ambiguous about it: surround it in quotes. That was true for the aliases and tags above and it is also true for accidental numbers. In the following section of the yaml file you see a list of version numbers:

  allow_postgres_versions:
    - 9.5.25
    - 9.6.24
    - 10.23
    - 12.13

Version numbers are strings; numbers can't have more than one decimal point in them. But when this is parsed by either JavaScript library, the result is as follows:

  allow_postgres_versions: [ '9.5.25', '9.6.24', 10.23, 12.13 ]

Now we have an array of strings and numbers. If a yaml parser thinks something looks like a number it will parse it as such. And when you come to use those values they might not act as you expect.

Version numbers in GitHub Actions

I have had this issue within GitHub Actions before. It was in a Ruby project, but this applies to anyone trying to use version numbers in a GitHub Actions yaml file. I tried to use a list of Ruby version numbers, and this worked fine up until Ruby version 3.1 was released. I had 3.0 in the array, and within GitHub Actions this was parsed as the integer 3. This might seem fine, except that when you give an integer version to GitHub Actions it picks the latest minor point release for that version. So, once Ruby 3.1 was released, the number 3.0 would select version 3.1. I had to make the version number a string, "3.0", and then it was applied correctly.

Accidental numbers cause issues. If you need a string, make sure you provide a string.
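
You can see the difference quickly in either JavaScript library; here's a small sketch with the yaml package:

import { parse } from "yaml";

// Unquoted, 3.0 is parsed as the number 3; quoted, it stays a string.
console.log(parse("ruby: 3.0"));
// => { ruby: 3 }
console.log(parse('ruby: "3.0"'));
// => { ruby: '3.0' }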

The successes

It's not all bad in the JavaScript world. After working through the issues above, we might now be in the clear. Let's take a look at what parsed correctly from this yaml file.

Sexagesimal numbers

Under the port mapping section of the yaml file we see:

  port_mapping:
    # Expose only ssh and http to the public internet.
    - 22:22
    - 80:80
    - 443:443

That 22:22 is dangerous in yaml version 1.1; PyYAML parses it as a sexagesimal (base 60) number, giving the result 1342. Thankfully both JavaScript libraries have implemented yaml 1.2 and 22:22 is parsed correctly as a string in this case.

  port_mapping: [ '22:22', '80:80', '443:443' ]

The Norway problem

In yaml 1.1 no is parsed as false. This is known as "the Norway problem" because listing countries as two character identifiers is fairly common and having this yaml:

  geoblock_regions:
    - dk
    - fi
    - is
    - no
    - se

parsed into this JavaScript:

  geoblock_regions: [ 'dk', 'fi', 'is', false, 'se' ]

is just not helpful. The good news is that, unlike Go's yaml library, both JavaScript libraries have implemented yaml 1.2 and dropped no as an alternative for false. The geoblock_regions section is successfully parsed as follows:

  geoblock_regions: [ 'dk', 'fi', 'is', 'no', 'se' ]

Non-string keys

You might believe that keys in yaml would be parsed as strings, as they are in JSON. However, they can be any value, and once again there are values that may trip you up. Much like the Norway problem, in which yes and no can be parsed as true and false, the same goes for on and off. This shows up in our yaml file in the flush_cache section:

  flush_cache:
    on: [push, memory_pressure]
    priority: background

Here the key is on, but in some libraries it is parsed as a boolean. In Python, even more confusingly, the boolean is then stringified and appears as the key "True". Thankfully this is handled by the JavaScript libraries and on stays as the key "on".

  flush_cache: { on: [ 'push', 'memory_pressure' ], priority: 'background' }

This is of particular concern in GitHub Actions again, where on is used to determine what events should trigger an Action. I wonder if GitHub had to work around this when implementing their parsing.

Parsing as yaml version 1.1

Many of the issues that our JavaScript libraries sidestep are problems from yaml 1.1, which both libraries avoid by fully implementing yaml 1.2. If you do wish to throw caution to the wind, or you have to parse a yaml file explicitly with yaml 1.1 settings, the yaml library can do that for you. You can pass a second argument to the parse function to tell it to use version 1.1, like so:

import { parse } from "yaml";
const yaml = parse(yamlContents, { version: "1.1" });
console.log(yaml);

Now you get a result with all of the fun described above:

{
  server_config: {
    port_mapping: [ 1342, '80:80', '443:443' ],
    serve: [ '/robots.txt', '/favicon.ico', '*.html', '*.png', '!.git' ],
    geoblock_regions: [ 'dk', 'fi', 'is', false, 'se' ],
    flush_cache: { true: [ 'push', 'memory_pressure' ], priority: 'background' },
    allow_postgres_versions: [ '9.5.25', '9.6.24', 10.23, 12.13 ]
  }
}

Note that in this case I left the aliases and tags quoted as strings so that the file could be parsed successfully.

Stick with version 1.2, the default in both JavaScript yaml libraries, and you'll get a much more sensible result.

Isn't yaml fun?

In this post we've seen that it's easy to write malformed yaml if you aren't aware of aliases or tags. It's also easy to write mixed arrays of strings and numbers. There are also languages and libraries in which yaml 1.1 is still hanging around, where on, yes, off, and no are booleans and some numbers can be parsed as base 60.

My advice, after going through all of this, is to err on the side of caution when writing yaml. If you want a key or a value to be a string, surround it in quotes and explicitly make it a string.

On the other hand, if you are parsing someone else's yaml then you will need to program defensively and try to handle the edge cases, like accidental numbers, that can still cause issues.

Finally, if you have the option, choose a different format to yaml. Yaml is supposed to be human-friendly, but the surprises and the bugs that it can produce are certainly not developer-friendly and ultimately that defeats the purpose.

The conclusion to the original yaml document from hell post suggests many alternatives to yaml that will work better. I can't help but think that in the world of JavaScript something JSON-based, but friendlier to author, should be the solution.

There is a package that simply strips comments from JSON, or there's JSON5, a JSON format that aims to be easier to write and maintain by hand. JSON5 supports comments as well as trailing commas, multiline strings, and various number formats. Either of these is a good start if you want to make authoring JSON easier and parsing hand-authored files more consistent.
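
For a sense of what that looks like, here's a small sketch using the json5 package with a made-up config file:

import JSON5 from "json5";

// JSON5 allows comments, unquoted keys, and trailing commas.
const config = JSON5.parse(`{
  // the port the server listens on
  port: 3000,
  hosts: [
    "example.com",
    "www.example.com",
  ],
}`);

console.log(config.port);
// => 3000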

If you can avoid yaml, I recommend it. If you can't, good luck.

Two factor authentication (2FA) is a great way to improve the security of user accounts in an application. It helps protect against common issues with passwords, like users picking easily guessable passwords or reusing the same password across multiple sites. There are different ways to implement two factor authentication, including SMS, authenticator applications, and WebAuthn.

SMS is the most widely used and won't be going away, so it falls on us as developers to build the best SMS 2FA experience we can for our users. The WebOTP API is one way we can help reduce friction in the login experience and even provide some protection against phishing.

What is the WebOTP API?

The WebOTP API is an extension to the Credential Management API. The Credential Management API started by giving us the ability to store and access credentials in a browser's password manager, but now encompasses WebAuthn and two factor authentication. The WebOTP API allows us to request permission from the user to read a 2FA code out of an incoming SMS message.

When you implement the WebOTP API, the second step of a login can go from awkwardly reading and copying a string of digits out of an SMS to a single button press. A great improvement, I think you'll agree.

<img src="/posts/webotp/webotp.gif" alt="An animation showing a login experience where after entering a username and password, a permissions dialog pops up asking for permission to read a 2FA code from an SMS. When approved, the code is entered into an input and the form submitted." loading="lazy" />

How does it work?

To implement WebOTP you will need to do two things:

  1. Update the message you send with the WebOTP format
  2. Add some JavaScript to the login page to request permission to read the message

The SMS message

To have the WebOTP API recognise a message as an incoming 2FA code you need to add a line to the end of the message that you send. That line must include an @ symbol followed by the domain for the site that your user will be logging in to, then a space, the # symbol and then the code itself. If your user is logging in on example.com and the code you are sending them is 123456 then the message needs to look like this:

Your code to log in to the application is 123456

@example.com #123456

The domain ties the message to the website the user should be logging in to. This helps protect against phishing: WebOTP can't be used to request the code from an SMS if the domain the user is logging in to doesn't match the domain in the message. Obviously it can't stop a user copying a code across from a message, but it might give them pause if they come to expect this behaviour.

The JavaScript

Once you have your messages set up in the right format, you need some JavaScript on your second-factor page that will trigger the WebOTP API, ask the user for permission to access the message, and collect the code.

The most minimal version of this code looks like this:

if ('OTPCredential' in window) {
  navigator.credentials.get({
    otp: {
      transport: ['sms']
    }
  }).then((otp) => {
    submitOTP(otp.code);
  });
}

We ask the navigator.credentials object to get a one time password (OTP) from the SMS transport. If the browser detects an incoming message with the right domain and a code in it, the user will be prompted for access. If the user approves, the promise resolves with an otp object which has a code property. You can then submit that code with the form and complete the user's login process.

A more complete version of the code, that handles things like finding an input and form, cancelling the request if the form is submitted, and submitting the form if the request is successful, looks like this:

if ('OTPCredential' in window) {
  window.addEventListener('DOMContentLoaded', e => {
    const input = document.querySelector('input[autocomplete="one-time-code"]');
    if (!input) return;
    const ac = new AbortController();
    const form = input.closest('form');
    if (form) {
      form.addEventListener('submit', e => ac.abort());
    }
    navigator.credentials.get({
      otp: { transport:['sms'] },
      signal: ac.signal
    }).then(otp => {
      input.value = otp.code;
      if (form) {
        form.submit();
      }
    }).catch(err => {
      console.error(err);
    });
  });
}

This will work for many sites, but copying and pasting code isn't the best way to share code, so I came up with something a bit easier.

Declarative WebOTP with web components

On Safari, you can get similar behaviour to the WebOTP API by adding one attribute to the <input> element for the OTP code. Setting autocomplete="one-time-code" will trigger Safari to offer the code from the SMS via autocomplete.

Inspired by this, I wanted to make WebOTP just as easy. So I published a web component, <web-otp-input>, that handles the entire process. You can see all the code and how to use it on GitHub. For a quick example, you can add the component to your page as an ES module:

<script type="module" src="https://unpkg.com/@philnash/web-otp-input"></script>

Or install it to your project from npm:

npm install @philnash/web-otp-input

and import it to your application:

import { WebOTPInput } from "@philnash/web-otp-input";

You can then wrap the <web-otp-input> around your existing <input> within a <form>, like this:

<form action="/verification" method="POST">
  <div>
    <label for="otp">Enter your code:</label>
    <web-otp-input>
      <input type="text" autocomplete="one-time-code" inputmode="numeric" id="otp" name="otp" />
    </web-otp-input>
  </div>
  <button type="submit">Submit</button>
</form>

Then the WebOTP experience will happen automatically for anyone on a browser that supports it, without writing any additional JavaScript.

WebOTP: a better experience

The WebOTP API makes two factor authentication with SMS a better experience. For browsers that support it, entering the code that is sent as a second factor becomes a breeze for users.

There are even circumstances where it works for desktop browsers too. For a user with Chrome on the desktop and Chrome on Android, signed in to their Google account on both, signing in on the desktop will cause a notification on the mobile device asking to approve sending the code to the desktop. Approving that on the mobile device transfers the code to the desktop browser. You don't even have to write more code to handle this; all you need is the JavaScript in this article.

For more on WebOTP, check out these articles:

If you are building two factor authentication or phone verification, consider implementing the WebOTP API as well to make that process easier for your users.

Mastodon is different to most online services. It is a federated network, so when you set up an account you need to choose a server to use. Your username then becomes a combination of your handle and that server you signed up to. For example, I am currently @philnash@mastodon.social.

But what if you want to personalise that a bit more? What if you wanted to use your own domain for your Mastodon account without having to host a whole Mastodon server? Using your own domain means that no matter what instance you used, or if you moved instance, you could share one Mastodon username that always pointed to the right profile and was personalised to your own site.

WebFinger to the rescue

It turns out that you can do this. Maarten Balliauw wrote about how Mastodon uses WebFinger to attach extra information to an email address. Information like an associated profile page or ActivityPub stream.

Implementing WebFinger requires your domain to respond to a request to /.well-known/webfinger with a JSON representation of the associated accounts. If you have a Mastodon account you can check out what your WebFinger JSON looks like by making a request to https://#{instance}/.well-known/webfinger?resource=acct:#{username}@#{instance}. For example, my WebFinger JSON is available at this URL: https://mastodon.social/.well-known/webfinger?resource=acct:philnash@mastodon.social.

To associate a Mastodon account with your own domain, you can serve this JSON yourself from a /.well-known/webfinger endpoint.
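
For reference, the JSON a Mastodon instance returns looks something like this (abridged; check the URL for your own account for the exact output):

{
  "subject": "acct:philnash@mastodon.social",
  "aliases": ["https://mastodon.social/@philnash"],
  "links": [
    {
      "rel": "http://webfinger.net/rel/profile-page",
      "type": "text/html",
      "href": "https://mastodon.social/@philnash"
    },
    {
      "rel": "self",
      "type": "application/activity+json",
      "href": "https://mastodon.social/users/philnash"
    }
  ]
}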

WebFinger with Jekyll

As Maarten pointed out in his post, you can copy the JSON response from your Mastodon instance to a file that you then serve from your own site. My site is powered by Jekyll, so I wanted to make it easy for me, and anyone else using Jekyll, to create and serve that WebFinger JSON. I've also built Jekyll plugins before, like jekyll-gzip, jekyll-brotli, jekyll-zopfli, and jekyll-web_monetization.

I got to work and built jekyll-mastodon_webfinger.

How to use it

You can serve up your own WebFinger JSON on your Jekyll site to point to your Mastodon profile by following these steps:

  1. Add jekyll-mastodon_webfinger to your Gemfile:

    bundle add jekyll-mastodon_webfinger
    
  2. Add the plugin to your list of plugins in _config.yml:

    plugins:
      - jekyll/mastodon_webfinger
    
  3. Add your Mastodon username and instance to _config.yml:

    mastodon:
      username: philnash
      instance: mastodon.social
    

Next time you build the site, you will find a /.well-known/webfinger file in your output directory, and when you deploy you will be able to refer to your Mastodon account using your own domain.

You can see the result of this by checking the WebFinger endpoint on my domain: https://philna.sh/.well-known/webfinger or by searching for @phil@philna.sh on your Mastodon instance.

<figure> <img src="/posts/mastodon/search.png" alt="When you search for @phil@philna.sh on your Mastodon instance, you will find my account"> </figure>

As this is a static file it sort of acts like a catch-all email address. You can actually search for @any_username@philna.sh and you will find me. If you wanted to restrict this, you would need to build an endpoint that could respond dynamically to the request.
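
A minimal sketch of such a dynamic endpoint, here using Express (not part of the Jekyll setup in this post) and with my account details as example values, might look like this:

import express from "express";

const app = express();

// The WebFinger JSON for the one account we want to answer for.
const webfinger = {
  subject: "acct:phil@philna.sh",
  links: [
    {
      rel: "self",
      type: "application/activity+json",
      href: "https://mastodon.social/users/philnash",
    },
  ],
};

app.get("/.well-known/webfinger", (req, res) => {
  // Only respond for the known account, otherwise return a 404.
  if (req.query.resource === "acct:phil@philna.sh") {
    res.json(webfinger);
  } else {
    res.sendStatus(404);
  }
});

app.listen(3000);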

Other ways to serve Mastodon WebFinger responses

I'm not the only one to have considered this. Along with Maarten's original post on the topic, others have built tools or posted about how to do this with your own site.

Lindsay Wardell wrote up how to integrate Mastodon with Astro including showing how to display her feed within her Astro site.

Dominik Kundel put together a Netlify plugin that generates a Mastodon WebFinger file for your Netlify hosted site.

Take a trip into the Fediverse

An interesting side effect of Mastodon's increase in popularity is learning and understanding the protocols that underpin a federated social network like this. WebFinger and ActivityPub are having their moment, and I look forward to seeing what further integrations and applications can be built on top of them.

In the meantime, you can use the techniques in this post to use your own domain as an alias for your Mastodon profile. And if you fancy it, connect with me on Mastodon by searching for @phil@philna.sh or at https://mastodon.social/@philnash.

Link shortening has been around for a long time and Bitly is arguably the king of the link shorteners. It has support for shortening long URLs as well as custom short links, custom domains, and metrics to track how each link is performing.

For those of us with the power of code at our fingertips, Bitly also has an API. With the Bitly API you can build all of the functionality of Bitly into your own applications and expose it to your users. In this post you'll learn how to use the Bitly Ruby gem to work with the Bitly API in your Ruby applications.

Getting started

To start shortening or expanding links with the Bitly gem, you'll need Ruby installed and a Bitly account.

To make API requests against the Bitly API you will need an access token. Log in to your Bitly account and head to the API settings. Here you can enter your account password and generate a new token. This token will only be shown once, so copy it now.

Using the Bitly API

Open a terminal and install the Bitly gem:

gem install bitly

Let's explore the gem in the terminal. Open an irb session:

irb

Require the gem:

require "bitly"

Create an authenticated API client using the token you created in your Bitly account:

client = Bitly::API::Client.new(
  token: "Enter your access token here"
)

Shortening a URL

You can now use this client object to access all the Bitly APIs. For example, you can shorten a URL to a Bitlink like this:

long_url = "https://twitter.com/philnash"
bitlink = client.shorten(long_url: long_url)
bitlink.link
# => "https://bit.ly/3zYdN21"

The shorten endpoint is a simplified method of shortening a URL. You can also use the create endpoint and set other attributes, like adding a title, tags or deeplinks into native applications.

long_url = "https://twitter.com/philnash"
bitlink = client.create_bitlink(
  long_url: long_url,
  title: "Phil Nash on Twitter",
  tags: ["social media", "worth following"]
)
bitlink.link
# => "https://bit.ly/3zYdN21"
bitlink.title
# => "Phil Nash on Twitter"

Expanding a URL

The API client can also be used to expand Bitlinks. You can use the expand method with any Bitlink, not just ones that you have shortened yourself. When you expand a URL you will get back publicly available information about the URL.

bitlink = client.expand(bitlink: "bit.ly/3zYdN21")
bitlink.long_url
# => "https://twitter.com/philnash"
bitlink.title
# => nil
# (title is not public information)

If the URL was a Bitlink from your own account you get more detailed information when you use the bitlink method.

bitlink = client.bitlink(bitlink: "bit.ly/3zYdN21")
bitlink.long_url
# => "https://twitter.com/philnash"
bitlink.title
# => "Phil Nash on Twitter"

Other Bitlink methods

Once you have a bitlink object, you can call other methods on it. If you wanted to update the information about the link, for example, you can use the update method:

bitlink.update(title: "Phil Nash on Twitter. Go follow him")
bitlink.title
# => "Phil Nash on Twitter. Go follow him"

You can also fetch metrics for your link, including the clicks_summary, link_clicks and click_metrics_by_country. For example:

click_summary = bitlink.clicks_summary
click_summary.total_clicks
# => 1
# (not very popular yet)

Methods that return a list of metrics implement Enumerable so you can loop through them using each:

country_clicks = bitlink.click_metrics_by_country
country_clicks.each do |metric|
  puts "#{metric.value}: #{metric.clicks}"
end
# => AU: 1
# (it was just me clicking it)

With these methods, you can create or fetch short links and then retrieve metrics about them. Your application can shorten links and measure their impact with a few lines of code.

There's more!

For advanced uses, you can also authorise other Bitly accounts to create and fetch short links via OAuth2 as well as manage your account's users, groups, and organisations.

A useful little tool

To find out more about using the Bitly API with Ruby, you can read the Bitly API documentation, the Bitly gem's generated documentation and the docs in the GitHub repo.

If you are interested, you can also read a bit more about the backstory of the Bitly Ruby gem. I've been working on this project since 2009, would you believe?

Are you using the Bitly API or do you have any feedback or feature requests? Let me know on Twitter at @philnash or open an issue in the repo.

Sometimes the platform we are building on provides more functionality than we can keep in our own heads. However, depending on the problem, we often find ourselves trying to write the code to solve the issue rather than finding and using the existing solution provided by the platform. I almost fell for this recently when trying to parse a query string.

I can do it myself

A colleague had a query string they needed to parse in a function and asked for recommendations on how to do so. For some reason I decided to roll up my sleeves and code directly into Slack. I came up with something like this:

function parse(input) {
  return input
    .split("&")
    .map((pairs) => pairs.split("="))
    .reduce((acc, [k, v]) => {
      acc[k] = decodeURIComponent(v);
      return acc;
    }, {});
}

Looks pretty good, right? It even uses reduce which makes all array operations look extra fancy. And it works:

parse("name=Phil&hello=world");
// => { name: 'Phil', hello: 'world' }

Except for when it doesn't work. Like if the query string uses the + character to encode spaces.

parse("name=Phil+Nash");
// => { name: 'Phil+Nash' }

Or if you have more than one value for a key, like this:

parse("name=Phil&name=John")
// => { name: "John" }

I could have worked to fix these issues, but that would just bring up more questions. Like how should multiple values be represented? Always as an array? Only as an array if there is more than one value?

The problem with all of this is that even after thinking about these extra issues, there are likely more hiding out there. All this thinking and coding is a waste anyway, because we have URLSearchParams.

URLSearchParams has a specification and several implementations

The URLSearchParams class is specified in the URL standard and started appearing in browsers in 2016 and in Node.js in 2017 (in version 7.5.0 as part of the URL standard library module and then in version 10.0.0 in the global namespace).

It handles all of the parsing problems above:

const params = new URLSearchParams("name=Phil&hello=world");
params.get("name");
// => "Phil"
params.get("hello");
// => "world"

const multiParams = new URLSearchParams("name=Phil&name=John");
multiParams.get("name");
// => "Phil"
// ???
multiParams.getAll("name");
// => [ 'Phil', 'John' ]

And it handles more, like iterating over the parameters:

for (const [key, value] of params) {
  console.log(`${key}: ${value}`)
}
// => name: Phil
// => hello: world

for (const [key, value] of multiParams) {
  console.log(`${key}: ${value}`)
}
// => name: Phil
// => name: John

Or adding to the parameters and serialising back to a query string:

multiParams.append("name", "Terry");
multiParams.append("favouriteColour", "red");
multiParams.toString();
// => 'name=Phil&name=John&name=Terry&favouriteColour=red'

The final thing I like about URLSearchParams is that it is available in both the browser and Node.js. Node.js has had the querystring module since version 0.10.0, but when APIs like this are available on both the client and server side, JavaScript developers can be more productive regardless of the environment in which they are working.
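
For example, the same API works whether you start from the query string the browser gives you or from a full URL in Node.js:

// In the browser, you can read the current page's query string:
// const params = new URLSearchParams(window.location.search);

// In Node.js (or the browser), you can pull the params out of a full URL:
const url = new URL("https://example.com/?name=Phil&hello=world");
url.searchParams.get("name");
// => "Phil"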

As an aside, one of the things I appreciate about the Deno project is their aim for Deno to use web platform APIs where possible.

Use the platform that is available

This post started as a story about choosing to write code to solve a problem that the platform already had solved. Once I realised my mistake I jumped straight back into Slack to correct myself and recommend URLSearchParams. When you understand the capabilities of the platform you are working with you can both code more efficiently and avoid bugs.

I never have to write code to parse URL parameters; I wrote this post to remind myself of that. You never have to either.

When I am developing web applications in Node.js, I like the server to restart when I make changes, so I use nodemon. When I am developing an application that consumes webhooks or that I want to share publicly, I use ngrok. In fact, I like ngrok so much, I volunteered to help maintain the Node.js wrapper for ngrok.

Now, you can run ngrok and nodemon separately and things work fine. But what if you always want to run them together, with just one command to do so? Since nodemon is a Node package and ngrok has a Node wrapper, we can do this. Here's how.

An example with a basic Express app

You might already have an application you want to do this with, but for the purposes of this post, let's create an example Express application. We can create a basic application with the Express generator like this:

npx express-generator test-app
cd test-app
npm install

Start the application with:

npm start

Open your browser to localhost:3000 and you will see the welcome page. You can also open localhost:3000/users and it will say "respond with a resource". Open routes/users.js and change the route from:

router.get('/', function(req, res, next) {
  res.send('respond with a resource');
});

to:

router.get('/', function(req, res, next) {
  res.send('respond with a list of users');
});

Refresh localhost:3000/users and you will see it still returns "respond with a resource". Stop the application, restart it with npm start, and when you reload the page it will have changed. We can make this better with nodemon.

Starting the app with nodemon

To start the application with nodemon, you first need to install the package:

npm install nodemon --save-dev

You can then run the application by opening package.json and changing the start script from "start": "node ./bin/www" to "start": "nodemon ./bin/www". This works great and your application now restarts when you make changes.

Adding ngrok to the mix

This all works great with nodemon on its own, but now we want to give the application a public URL while we are developing it. We can use ngrok for this and we can build it in using the ngrok Node package. Start by installing ngrok:

npm install ngrok --save-dev

Now, we could add ngrok to the ./bin/www script that the Express generator created for us. But if you do this, then every time you change something, nodemon will restart the application and your ngrok tunnel. If you're using a free or unregistered ngrok account then your ngrok URL will keep changing on every restart. Instead, let's build a script that starts an ngrok tunnel and then uses nodemon to run the application script ./bin/www.

Create a new file in the bin directory called ./bin/dev. You might need to make this file executable with chmod 755 ./bin/dev. Open it in your editor.

Start by adding a shebang for Node. We'll also add a guard to make sure this script isn't run in production.

#!/usr/bin/env node

if (process.env.NODE_ENV === "production") {
  console.error(
    "Do not use nodemon in production, run bin/www directly instead."
  );
  process.exitCode = 1;
  return;
}

In this case, if the environment variable NODE_ENV is set to production the script will just return early.

Next, require the ngrok and nodemon packages.

const ngrok = require("ngrok");
const nodemon = require("nodemon");

Use ngrok to open up a tunnel connecting to port 3000 on localhost, the port that Express uses. To open the tunnel, we call ngrok's connect method, which returns a promise that resolves with the URL of the ngrok tunnel.

ngrok
  .connect({
    proto: "http",
    addr: "3000",
  })
  .then(url => {
    console.log(url);
  })

If you run this in the terminal with ./bin/dev, you will see an ngrok URL logged. So now we have started the ngrok tunnel, but we aren't yet running the Node application.

Let's make the logging a bit nicer and then move on to starting the application with nodemon.

ngrok
  .connect({
    proto: "http",
    addr: "3000",
  })
  .then(url => {
    console.log(`ngrok tunnel opened at: ${url}`);
    console.log("Open the ngrok dashboard at: https://localhost:4040\n");

    nodemon({
      script: "./bin/www",
      exec: `NGROK_URL=${url} node`,
    });
  })

Here we call nodemon and pass two options via an object. The script option is the file we want to run to start the application, in this case ./bin/www. The exec option tells nodemon what command to use to run that script: we set the NGROK_URL environment variable to the URL that ngrok created for us, so that we can refer to the ngrok URL within the application if we need it, and the rest of the exec command is just node.

Start the application with ./bin/dev and you will see the application start up. You can load it at localhost:3000 or at the ngrok URL that is logged. You will also find that if you change the response in routes/users.js then it will update on the next refresh. Now you have ngrok and nodemon working together.

Finessing the script

This is working now, but there are a couple more things we can do to improve the script. We can listen to events on nodemon to give us more information about what is happening to the application, and when the underlying application quits, we should close the ngrok tunnel too. We should also catch any errors that might happen when ngrok connects a tunnel. Here's the full script:

#!/usr/bin/env node

if (process.env.NODE_ENV === "production") {
  console.error(
    "Do not use nodemon in production, run bin/www directly instead."
  );
  process.exitCode = 1;
  return;
}

const ngrok = require("ngrok");
const nodemon = require("nodemon");

ngrok
  .connect({
    proto: "http",
    addr: "3000",
  })
  .then((url) => {
    console.log(`ngrok tunnel opened at: ${url}`);
    console.log("Open the ngrok dashboard at: https://localhost:4040\n");

    nodemon({
      script: "./bin/www",
      exec: `NGROK_URL=${url} node`,
    }).on("start", () => {
      console.log("The application has started");
    }).on("restart", files => {
      console.group("Application restarted due to:")
      files.forEach(file => console.log(file));
      console.groupEnd();
    }).on("quit", () => {
      console.log("The application has quit, closing ngrok tunnel");
      ngrok.kill().then(() => process.exit(0));
    });
  })
  .catch((error) => {
    console.error("Error opening ngrok tunnel: ", error);
    process.exitCode = 1;
  });

Go back to package.json, change the start script back to node ./bin/www, and add a new script to run the application in dev mode:

  "scripts": {
    "start": "node ./bin/www",
    "dev": "node ./bin/dev"
  },

Now you can start your application with npm run dev and it will use nodemon to restart on file changes and open an ngrok tunnel which doesn't change when the application restarts.

<figure> <img src="/posts/nodemon-ngrok-working-together.png" alt='A terminal window showing the application running with both nodemon and ngrok. The logs show that some pages have been visited, then one of the routes changed and the app was reloaded.' loading="lazy" /> </figure>

Nodemon and ngrok working in tandem

You can adjust the script above to work for any of your applications. Indeed, aside from needing Node to run the nodemon and ngrok packages, you could use this for any application you are building. For more details, check out the nodemon documentation and the ngrok documentation.

If you are a VS Code user and you prefer having ngrok at the tip of your command prompt, take a look at my ngrok for VS Code plugin.

The script in this post was inspired by Santiago Palladino's version, which I brought up to date and added usage instructions to. Thanks to Santiago, Alex Bubenshchykov, the author of the ngrok Node package, Remy Sharp, the author of nodemon, and Alan Shreve, the creator of ngrok, all for making this possible.