jdrae.github.io

Writing Technical Blog Posts with Codex and Obsidian

2026-06-01T00:00:00+00:00

I like writing. More precisely, I feel like I need to write regularly. I usually organize my thoughts in a diary, and when I learn something technical, I write it down separately in my notes.

So I have always thought I should create a blog and manage my writing a little more systematically. But whenever I actually set up a blog, I ended up not posting much. Eventually I would delete it, recreate it, and repeat the same cycle.

The reason I wrote regularly but did not upload much was simple. Once a technical post goes online, it starts to feel like a responsibility. What if I write something incorrect? What if the post looks too rough? Is this even worth publishing on a blog? Those thoughts kept following me.

But now, thanks to the progress of AI, I finally felt that it was time to start blogging again.

Why a Jekyll blog again?

Whenever I thought about a blog, I first thought of a GitHub Pages-based Jekyll blog that I could customize. The problems were that I had to build the blog myself, and managing posts was more tedious than I expected.

But now we are in the agent era. I described the conditions I wanted, and in a single day I was able to create something close to the blog I had imagined. Managing posts has also become much easier because an agent can help with it.

The conditions I wanted were fairly clear.

It should have category pages
It should support toggling between Korean and English posts
It should be deployable with GitHub Pages
Posts should be managed as Markdown

The blog is still something I am gradually improving. My next goal is to register it with Google Search Console and clean up the SEO-related settings.

Writing starts in Obsidian

Here is the actual writing process.

In the past, I would have written a draft, searched for and verified details one by one, polished the sentences again, and only then published the final post. It might have taken an entire day. But now that GPT and Codex exist, I no longer need to carry every part of that process by myself.

I use Obsidian and Codex together to write and revise posts, and I think it works well enough that I wanted to share the process here.

First, I create a blog folder in Obsidian and add the following subfolders inside it.

blog/
  raw/
  editing/
  published/

Each folder has a simple role.

raw/: rough drafts that have not been polished yet.
editing/: posts currently being revised by Codex.
published/: posts that have been finalized and published to the blog.

At first, I freely write the outline and the ideas I want to include in raw/. At this stage, writing down thoughts quickly matters more than sentence quality. It is fine if the tone is rough, and it is fine if the paragraph order is not perfect.

Then I add the blog folder as a Codex project. Codex copies the post from raw/ into editing/ and writes the revised version there. I ask Codex to review the style, typos, structure, Markdown formatting, and content. In other words, I use it as my own editor.

After reviewing the final version in editing/, I move it to published/. That keeps editing/ as a workspace that only contains the post currently being revised.
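The folder workflow above can be sketched as a few small helpers. This is a minimal illustration, not what Codex actually runs: the function names (`init_blog`, `start_editing`, `publish`) are hypothetical, and the only assumption taken from the post is that raw/ stays untouched while editing/ holds the working copy.

```python
import shutil
import tempfile
from pathlib import Path

STAGES = ("raw", "editing", "published")


def init_blog(root: Path) -> None:
    """Create the raw/, editing/, and published/ folders if they are missing."""
    for stage in STAGES:
        (root / stage).mkdir(parents=True, exist_ok=True)


def start_editing(root: Path, name: str) -> Path:
    """Copy a draft from raw/ into editing/ without touching the original."""
    source = root / "raw" / name
    target = root / "editing" / name
    if target.exists():
        raise FileExistsError(f"{name} is already being edited")
    shutil.copy2(source, target)
    return target


def publish(root: Path, name: str) -> Path:
    """Move the reviewed file from editing/ to published/."""
    target = root / "published" / name
    shutil.move(str(root / "editing" / name), str(target))
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        blog = Path(tmp) / "blog"
        init_blog(blog)
        (blog / "raw" / "draft.md").write_text("rough notes", encoding="utf-8")
        start_editing(blog, "draft.md")
        publish(blog, "draft.md")
        print(sorted(p.relative_to(blog).as_posix() for p in blog.rglob("*.md")))
```

Copying rather than moving out of raw/ is the point of the design: the rough draft survives even if the edited version goes in a direction I do not like.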

Defining editing rules with AGENTS.md

The most important file in this process is AGENTS.md.

If you add AGENTS.md to the blog/ folder, you can define the rules Codex should follow when editing posts. For example, I wrote principles like these.

# AGENTS.md

- Codex acts as a senior editor for blog posts.
- Preserve the core content and intent of the original draft.
- Write in polite, friendly, professional, and simple Korean.
- Do not directly edit files inside raw/ or published/.
- When editing a post in raw/, first copy it into editing/.
- Make all edits only in the copied file inside editing/.
- Normalize image links as Markdown relative links that work in Obsidian.

Once these rules are defined, I do not need to repeat the same instructions every time. I can simply ask, “Please edit this post.” In particular, making the roles of raw/, editing/, and published/ explicit helps reduce the chance of accidentally modifying the original draft.

Everyone runs their blog differently, so the contents of AGENTS.md should be adjusted to fit the workflow. The important part is to clearly tell Codex what role it should play and which files it is allowed to modify.

Uploading to a Jekyll blog

Once the revision is done, it is time to upload the post to the blog.

I also added my personal GitHub Pages blog repository as a Codex project. Then I ask Codex to move the post into the folder where posts are stored, and to adjust the front matter and image paths to match the blog’s Jekyll format.

The rough flow looks like this.

Review the final post in editing/.
Move the post into the blog repository’s _posts folder or the relevant post folder.
Clean up the Jekyll front matter to match the blog format.
Check that image paths work correctly in the deployed environment.
Run a local build and check for problems.
If needed, translate and create the English version as well.

It is worth going through your own posting and deployment process in chat at least once. After that, you can ask Codex to “summarize the posting flow we just followed and write it into AGENTS.md,” and the process becomes much more stable the next time.
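The front matter step in the flow above can be sketched as follows. Jekyll expects post files named `YYYY-MM-DD-title.md`, which is what `post_filename` produces. Everything else here is an assumption for illustration: the helper names are hypothetical, `layout: post` depends on the theme, and `lang` is a made-up custom key standing in for whatever the Korean/English toggle actually uses.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def post_filename(title: str, published: date) -> str:
    """Build a file name in the YYYY-MM-DD-title.md form that Jekyll expects."""
    return f"{published.isoformat()}-{slugify(title)}.md"


def front_matter(title: str, published: date, categories: list[str], lang: str) -> str:
    """Build a YAML front matter block; `lang` is a hypothetical custom field."""
    lines = [
        "---",
        "layout: post",  # assumption: the theme defines a "post" layout
        f'title: "{title}"',  # titles containing double quotes would need escaping
        f"date: {published.isoformat()}",
        f"categories: [{', '.join(categories)}]",
        f"lang: {lang}",
        "---",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    day = date(2026, 6, 1)
    print(post_filename("Writing Technical Blog Posts with Codex and Obsidian", day))
    print(front_matter("Writing Technical Blog Posts with Codex and Obsidian", day, ["blog"], "en"))
```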

Publishing feels less burdensome now

Using this method has made publishing feel much less burdensome.

I can quickly write a rough draft in a stream-of-consciousness style, and typos and tone are revised according to the existing rules. Codex can suggest titles, split the post into sections, and clean up the Markdown syntax, which makes the final post much easier to read.

It is also useful that I can ask for verification when I am not fully confident about the content. Of course, the final responsibility still belongs to me as the person publishing the post. But it is much more efficient than struggling alone with every sentence.

Then did GPT write this post?

So does that mean GPT wrote this post?

I do not think so. When writing this post, I did not ask, “Write a blog post about how to write technical blog posts with Codex and Obsidian.” I had a topic I chose, a line of reasoning and context I had thought through, and a claim I wanted to make.

GPT and Codex only made the formal parts of the post easier to read. In my usage, AI is closer to an editor than an author.

I think this is the difference between AI slop and writing that is not slop. Does the writing contain the author’s thoughts and experience? If it satisfies that condition, isn’t it reasonable to get help with sentence cleanup, structure, and typo fixes?

I think the same idea can apply to coding. Do you know why the code was designed that way? If you can explain the reason, then even if you received help from an agent, the code is still close to something you made. An agent is not a tool that replaces my thinking. It is a tool that helps me implement the design and intent I already have more quickly and accurately.

How Developers Without Design Knowledge Can Create Consistent UI with AI

2026-03-26T00:00:00+00:00

If someone asked me what feels most difficult when building websites or apps, the first thing that comes to mind is design.

In the past, I might have thought, “Isn’t functionality more important than design?” But in practice, what first catches a user’s eye is often not a “well-built feature,” but a screen that “looks well made.” That does not mean functionality is unimportant. It is simply hard to deny that what users encounter first is not code, but the interface.

On top of that, designers are getting increasingly good at vibe coding these days. There are so many services now that look beautiful and work well. It feels like developers are entering an era where we either need to be able to build products that look reasonably good, or create products that are clearly differentiated in functionality even if the design is lacking.

In any case, there is no downside to making something look good, so I started thinking it would be useful to learn at least a little about design.

How can I design well… or rather, make AI design well?

That said, studying design from the ground up in a formal way felt too inefficient for me. In this era of agents, I decided to first look for ways to “make AI design well” rather than trying to “become good at design myself.”

I tried Figma AI and Stitch first, but simply entering prompts did not produce satisfying results in one shot. They were fine for getting layout ideas, but it was not easy to get output polished enough to apply directly to a real service. Of course, this may have been because I was not good at writing design prompts.

So I watched various design methodology videos on YouTube and looked into how actual designers use AI agents.

That was when I came across the idea of a design system.

The problem is not “taste,” but “making things concrete”

After learning about design systems, I started to understand why I had been struggling. The problem was not so much that I lacked design taste, but that I could not turn the vague image in my head into concrete rules.

A design system is a framework that defines the design elements a service will use in advance and helps keep them consistent. For example, it defines colors, font sizes, spacing, button styles, card shapes, modal designs, and so on.

AI agents often produce slightly different designs each time you ask. So instead of asking them to “make it pretty” every time, I learned that if you make them refer to a predefined design system, they can produce much more consistent results.

A design system does not solve UX for you

However, I handled the screen planning myself.

I first drew wireframes by hand, then moved them into Figma. Since this part is related to UX, I decided it was better to think through the user flow and screen structure myself rather than hand everything over to AI.

A design system is closer to a tool that makes UI implementation easier. Creating a design system does not magically produce perfect UI/UX without any planning. It is much better than starting from nothing, but you still need to decide where you are headed.

Creating and applying a design system

1. Explore references

First, I decided on the mood of the app I wanted to build.

At minimum, it is useful to define the following:

The overall concept of the app
- Examples: simple, analog, dark, minimal, emotional, and so on
Main color
One or two supporting colors

You can collect references from Pinterest, or look for websites and apps you want to follow. The important thing is to gather visual material that lets you confirm, “This is the kind of feeling I want.”

AI also needs a reference point. Just as it would be awkward to tell a person, “Please just make it pretty,” AI will also make things however it wants if you do not give it proper direction.

2. Create a design system with Figma MCP

After deciding on references and the concept, I connected to Figma MCP and asked it to create a design system.

At this stage, I kept checking the results in Figma and making adjustments. It is also helpful to prepare basic components like buttons, modals, and cards in advance, because it makes later work much easier.

If you search for “design system prompt,” you can find many examples of prompts for creating design systems. I used a GitHub prompt that appeared near the top of the search results at the time.

Design System Foundation Prompt

When creating a design system, it is good to check that it includes at least the following:

Color Token
Spacing Scale
Typography Scale
Radius
Shadow / Elevation
Basic components
- Button
- Input
- Card
- Modal
- List Item
States for each component
- Default
- Pressed
- Disabled
- Error
- Loading

3. Turn the design system into code in the frontend project

Next, I connected Figma MCP from the frontend project folder and asked it to write code based on the design system implemented in Figma.

The important point here is to avoid hardcoding colors, font sizes, spacing, and similar values.

For example, if each button directly uses a color value like #3366FF, you will have to search through every file later when you want to change the main color. If you manage these values as design tokens instead, you can update them in one place.
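To make the token idea concrete, here is a minimal sketch that keeps the values in one place and renders them as CSS custom properties. The token names and every value except #3366FF are invented for illustration, and my actual project may organize tokens differently.

```python
# Hypothetical design tokens: one source of truth for visual values.
TOKENS = {
    "color-primary": "#3366FF",
    "color-surface": "#FFFFFF",
    "space-sm": "8px",
    "space-md": "16px",
    "radius-md": "12px",
    "font-size-body": "16px",
}


def to_css_variables(tokens: dict[str, str]) -> str:
    """Render tokens as CSS custom properties on :root."""
    body = "\n".join(f"  --{name}: {value};" for name, value in tokens.items())
    return f":root {{\n{body}\n}}"


if __name__ == "__main__":
    print(to_css_variables(TOKENS))
    # A component refers to the token, not to the raw value:
    print(".button { background: var(--color-primary); }")
```

Changing the main color then means editing the single `color-primary` entry instead of searching every file for the hex value.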

I also asked it to create a Markdown file defining the design system in writing. With this in place, when assigning work to an AI agent later, I can give a clear instruction such as, “Implement this based on this document.”

4. Apply it to the actual frontend screens

Finally, I applied the code-based design system to the actual frontend screens.

After that, whenever I modified the screen design, I kept checking whether it matched the design system. I already had app screens that I had planned myself in Figma, and most of the structure was fixed. So I was mostly in a situation where I only needed to adjust details such as spacing, font size, and radius.

The advantage of this approach is that it feels less like “designing from scratch every time” and more like “refining within a defined set of rules.” It also makes instructions to AI much clearer, and the results become less inconsistent.

Additional note: A method I used when creating a landing page

Developers often lack design knowledge, so it can be difficult to explain exactly what kind of design style they want.

The prompt introduced in the Reddit post below provides 25 different design styles, randomly chooses one, rewrites a detailed prompt for that style, and finally uses it to create a landing page.

Reddit - I made a prompt to generate unique beautiful landing pages

The downside of using a random approach is that you may need to try several times until you get the design you want.

So I first read the descriptions of the design styles defined in that prompt, chose a design philosophy I liked, and then asked AI to write a detailed prompt for that philosophy.

In other words, instead of immediately asking it to create a landing page, I first went through a meta-prompting process to make the design more concrete. After a few attempts, I was able to get landing pages that were closer to what I wanted.

Recommended resources

A video I found interesting

Design Systems are a Waste of Time Now

Notes from AWS Unicorn Day Seoul 2026

2026-03-19T00:00:00+00:00

Through several company examples and sessions, I was able to see what kinds of synergy can emerge when development with AI is combined with AWS.

In this post, I want to summarize the parts that stood out to me from the sessions I attended, along with key terms and concepts that came up during the talks. Since this is a reconstruction based on brief notes I took on site, some wording or details may differ slightly from the actual presentations.

From Implementing Text2SQL to Reducing the Data Team’s Workload: Practical Operations Tips

Son Hoeyeon, Solutions Architect; Park Seoyoung, Solutions Architect (AWS)

As data-driven decision-making becomes increasingly important in business settings, it is still difficult for people who do not know SQL to access data directly. This session covers how to quickly implement Text2SQL in a startup environment using LLMs, prompt engineering, and RAG, and shares practical know-how for improving accuracy in real services.

Text2SQL implementation and operations tips

When operating Text2SQL, it seemed important to design constraints and operational metrics together so that users can reliably get the results they want, rather than focusing only on the ability to convert natural language into SQL.

Constraints must be clearly defined for multi-turn queries.
Users often continue their questions across multiple turns, so it is important to clearly limit how much previous conversational context should be reflected and which tables and columns can be used.
Few-shot examples and schema pruning work well together.
Providing suitable examples to the LLM and excluding schema information that is not relevant to the query can reduce noise. As a result, you can expect more accurate and consistent SQL generation.
A/B testing should be performed based on user feedback.
To check whether generated SQL matches the user’s actual intent, it is necessary to collect user feedback and experimentally compare the effects of changes to prompts or model configuration.
Dynamic model selection can be considered.
Instead of using the same model for every query, this approach selects an appropriate model based on query difficulty, cost, and latency requirements.
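As a rough illustration of schema pruning, the sketch below keeps only the tables whose names and columns share words with the question. The schema, the function names, and the word-overlap scoring are all assumptions of mine. The session did not specify a method, and a real system would more likely use embeddings or metadata search.

```python
import re

# Hypothetical schema: table name -> column names.
SCHEMA = {
    "orders": ["order_id", "user_id", "total_amount", "created_at"],
    "users": ["user_id", "email", "signup_date"],
    "inventory": ["sku", "warehouse", "quantity"],
}


def tokenize(text: str) -> set[str]:
    """Split text into lowercase alphabetic words (underscores act as separators)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def prune_schema(question: str, schema: dict[str, list[str]], top_k: int = 2) -> dict[str, list[str]]:
    """Keep the top_k tables whose names and columns share the most words with the question."""
    words = tokenize(question)
    scored = []
    for table, columns in schema.items():
        vocab = tokenize(table + " " + " ".join(columns))
        score = len(words & vocab)
        if score > 0:
            scored.append((score, table))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return {table: schema[table] for _, table in scored[:top_k]}


if __name__ == "__main__":
    question = "What was the total amount of orders per user last month?"
    print(prune_schema(question, SCHEMA))
```

Only the pruned dictionary would be placed in the prompt, so the model never sees the unrelated inventory table.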

To track performance, useful observability metrics include response time, the average number of turns per session, the SQL generation success rate, and user feedback results. In an AWS environment, Amazon Bedrock and Amazon CloudWatch can be used together to observe model calls and application-level operational metrics.

Terms

Schema pruning
A method of selecting only the tables, columns, and relationships from the full database schema that are highly relevant to the current question and passing them to the LLM. Reducing unnecessary schema information can lower the chance that the model references the wrong table or generates incorrect joins.
Dynamic model selection
A strategy for dynamically choosing which model to use based on request complexity, cost, latency, and accuracy requirements. For example, simple queries can be handled by a cheaper and faster model, while complex analytical queries can be handled by a more capable model.
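Dynamic model selection can be sketched as a small router. The model labels, costs, hint words, and thresholds below are all placeholders I made up, and the keyword heuristic stands in for whatever difficulty estimate a real service would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str             # placeholder label, not a real model ID
    cost_per_call: float  # illustrative relative cost
    max_complexity: int   # highest complexity score this tier should handle


# Ordered from cheapest to most capable; all values are invented for illustration.
TIERS = [
    ModelOption("small-fast-model", 1.0, 2),
    ModelOption("large-capable-model", 10.0, 99),
]

COMPLEX_HINTS = ("join", "compare", "trend", "rank", "per", "growth")


def complexity_score(question: str) -> int:
    """Crude heuristic: count words that hint at joins or aggregation."""
    words = question.lower().split()
    return sum(1 for word in words if word.strip("?,.") in COMPLEX_HINTS)


def select_model(question: str) -> ModelOption:
    """Return the cheapest tier whose complexity limit covers the question."""
    score = complexity_score(question)
    for option in TIERS:
        if score <= option.max_complexity:
            return option
    return TIERS[-1]


if __name__ == "__main__":
    print(select_model("How many users signed up yesterday?").name)
    print(select_model("Compare revenue growth per region and rank the trend by month").name)
```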

Building an Ontology with Our Service’s Data

Park Jinwoo, Solutions Architect (AWS)

This session explores ontology, a topic that many customers have recently been thinking about, and explains how to build an ontology on AWS. It covers how to model data as a graph using AWS Agentic AI services, Neptune, RDB, and analytics services, and how to add existing structured and unstructured data to an ontology. It also presents ways to use agents to remove data silos and apply the results in service, planning, and marketing.

What is a good approach if you want to apply LLMs while effectively making use of existing database assets? In this session, one answer was ontology and graph-based data usage.

An ontology is a way to explicitly define the concepts, components, relationships, conditions, and entities used within a specific domain. It is also closely connected to knowledge graphs and the Semantic Web.

The core use cases are integrating scattered data, inferring implicit information based on relationships between data, and better understanding user intent. For example, if you manage a logistics system, you could implement a digital twin to experiment with various scenarios and use the results to suggest new processes.

However, there are practical barriers to data integration. It is easy to think, “Why not put all the data in one place and ask AI about it?” But in reality, “putting all the data in one place” itself is very difficult.

The representative reasons are as follows.

Tacit knowledge exists.
Knowledge such as personal experience, know-how, and intuition is difficult to turn into data because it is not clearly documented or represented in systems.
Data silos exist.
Different teams may use different formats, storage systems, and terminology. Even the same word, such as “user,” may have different meanings across teams.

Therefore, an ontology should not be built by indiscriminately integrating all data. It is better to start from a business objective and select only the data that objective requires. Final decisions should also consider the overall context rather than looking only at individual pieces of data.

You can organize what data to include in an ontology through questions like these:

What problem are we trying to solve?
Does the data needed for that purpose actually exist?
How is the data collected and loaded?
How is the data currently being used?

One approach introduced for implementing ontology in an AWS environment was to use the open-source SDK Strands Agents together with Amazon Neptune, AWS’s graph database. This approach can be extended into a pattern where an agent receives natural language queries, converts them into graph database queries, explores graph relationships, and then explains the results back to the user.
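To show what "exploring graph relationships" means at the smallest scale, here is an in-memory sketch that stores facts as (subject, predicate, object) triples and searches for a chain between two entities. It uses neither Neptune nor Strands Agents. The logistics data and function names are invented, and a real deployment would send a graph query to the database instead.

```python
from collections import deque
from typing import Optional

# Hypothetical logistics facts as (subject, predicate, object) triples.
TRIPLES = [
    ("order:1001", "shipped_from", "warehouse:seoul"),
    ("warehouse:seoul", "served_by", "carrier:fastship"),
    ("carrier:fastship", "operates_in", "region:kr"),
    ("order:1002", "shipped_from", "warehouse:busan"),
]


def neighbors(node: str) -> list[tuple[str, str]]:
    """Return (predicate, object) pairs for edges leaving the node."""
    return [(p, o) for s, p, o in TRIPLES if s == node]


def find_path(start: str, goal: str) -> Optional[list[tuple[str, str, str]]]:
    """Breadth-first search returning the chain of triples from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, obj in neighbors(node):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(node, predicate, obj)]))
    return None


if __name__ == "__main__":
    for triple in find_path("order:1001", "region:kr") or []:
        print(triple)
```

The returned chain is the kind of evidence an agent could turn back into a natural language explanation for the user.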

Amazon Redshift’s Zero-ETL integration also helps connect data from operational databases to analytics environments more easily, allowing fresher data to be used for analytics, AI/ML, and reporting. However, Zero-ETL does not eliminate every data transformation process, nor does it make a database immediately and perfectly understandable to an LLM. Data modeling, permissions, quality management, and business terminology still need to be designed separately.

It is also worth carefully deciding whether a graph DB is truly necessary when implementing an enterprise ontology. For example, travel information connects many domains such as flights, accommodation, tourism, and restaurants, so a graph DB may seem suitable. On the other hand, in actual service implementation, the join structure of an existing RDB may be easier, faster, and more intuitive.

In the end, I felt that the important question is not “Should we use a graph DB?” but whether the graph model provides enough value compared with the problem we are trying to solve and the complexity of the data relationships.

Terms

OWL (Web Ontology Language)
A standard language used to express ontologies on the web. It can define classes, properties, relationships, constraints, and other elements in a machine-understandable form, and is used in Semantic Web and knowledge graph implementations.
Semantic Web
A concept that aims to assign meaning and relationships to information on the web so that machines, not only humans, can understand and process the meaning of data. Ontology, RDF, and OWL are often mentioned together in this context.
Digital Twin
A model that replicates a real-world system, device, process, or space in a digital environment. It can be used to monitor current state based on operational data or simulate what outcomes may occur under certain conditions.
Data silo
A state in which data is separated by department, system, or service and is not connected across boundaries. When data silos are severe, information about the same customer or product can be scattered across multiple systems, making it difficult to understand the full context.
Strands Agents
An open-source AI agent SDK released by AWS. It enables a model-driven approach to building AI agents and can integrate not only with AWS services such as Amazon Bedrock, but also with various external models and tools.
Zero-ETL
An approach intended to reduce the burden of building separate ETL pipelines and make it easier to connect data from operational data sources to analytics systems. AWS provides Zero-ETL integrations between Amazon Redshift and several data sources, with the goal of using fresher data for analytics and AI/ML.

Building AWS Serverless OpenClaw with Vibe Coding

Jung Dohyun, Principal Consultant (Roboco Co., Ltd.)

This session shares practical know-how from building the Serverless-OpenClaw project in just one day by using vibe coding to migrate OpenClaw, a recent open-source project that has drawn attention, to AWS serverless infrastructure. Based on the speaker’s long experience as a software developer and technical trainer at AWS, he designed an architecture that combines Fargate, Lambda, API Gateway, and DynamoDB to maintain strong security while achieving operating costs of around $1 per month. The session covers the full process of implementing architecture design, security hardening, and cost optimization strategies through vibe coding, and introduces practical best practices such as TDD-based quality assurance, interview-based design, incremental implementation, and prompt strategies for effectively giving context to AI.

Personally, this was the session that left the strongest impression on me. I had been used to configuring deployment steps one by one myself, so it felt especially new to learn that by using the AWS CLI, a significant portion of deployment work can be delegated to an agent.

Of course, entrusting deployment to an agent does not mean that every process automatically becomes safe. If anything, you need to define constraints, validation pipelines, cost ceilings, and security standards even more carefully. So in this summary, I focused more on the prompt strategy and development approach used to implement the architecture than on the architecture itself.

First, you need to run an interview session to design the deployment architecture according to the nature of the project. In this session, cost optimization is set as the main goal, and requirements are made concrete based on an AWS CDK stack. During the interview, various trade-offs are compared, and the maximum monthly cost suitable for the project is fixed.

For example, I was willing to spend up to about $20 per month on a personal project, so I was recommended a Lightsail instance. I then used Docker to run the frontend, backend, and database all on that instance. Considering future scalability, it was cheaper and easier to manage than the Railway + Vercel combination I had used before, and I was also satisfied with its performance and speed.

Another point that stood out to me was the validation approach. Having a person manually verify everything on every deployment is inefficient, and Human in the Loop (HIL) can become a bottleneck. Instead, it may be more realistic to have AI review things once more, receive the result as a report, and let a person perform the final check.

The session introduced an approach that forces every commit to pass the following validation pipeline.

TDD-based validation
The implementation is modified until all test cases pass.
Pre-commit hook
Before committing, ESLint, Vitest, type checks, and similar validations are run.
Pre-push hook
Before pushing to the remote repository, E2E tests and CDK Synth validation are performed.
Validation of README constraints
For example, it checks whether NAT Gateway usage is prohibited and whether the monthly cost ceiling is being respected.
Cost and security checklist validation
Skills or checklists such as /cost and /security are used to review cost and security requirements.
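One of the checks above, the README constraint that forbids NAT Gateways, can be sketched as a scan over a synthesized CloudFormation template. This is my own illustration and not code from the referenced repository. It assumes the template is available as JSON, and the function names are hypothetical. `AWS::EC2::NatGateway` is the CloudFormation resource type for a NAT Gateway.

```python
import json

FORBIDDEN_TYPES = {"AWS::EC2::NatGateway"}  # constraint taken from the README rule


def forbidden_resources(template: dict) -> list[str]:
    """Return the logical IDs of resources whose type is forbidden."""
    resources = template.get("Resources", {})
    return sorted(
        logical_id
        for logical_id, body in resources.items()
        if body.get("Type") in FORBIDDEN_TYPES
    )


def check_template_text(text: str) -> int:
    """Exit-code style result: 0 when the constraint holds, 1 otherwise."""
    offenders = forbidden_resources(json.loads(text))
    for logical_id in offenders:
        print(f"forbidden resource: {logical_id}")
    return 1 if offenders else 0


if __name__ == "__main__":
    sample = json.dumps({
        "Resources": {
            "Table": {"Type": "AWS::DynamoDB::Table"},
            "Nat": {"Type": "AWS::EC2::NatGateway"},
        }
    })
    print(check_template_text(sample))
```

A pre-push hook could run a check like this after synthesis and block the push when the result is non-zero.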

At first, applying this kind of validation pipeline directly may seem complicated. But if you clone the GitHub repository in the references and ask an agent, “I want to apply this validation pipeline to my project,” you can guide it to write the automation code.

Terms

CDK Synth
The step in which the AWS CDK synthesizes an app into a CloudFormation template (cdk synth). Running it before deployment catches infrastructure definitions that fail to synthesize.

References

https://github.com/serithemage/serverless-openclaw

Spark in Action 3: Runtime, Scheduling, and a Real-Time Processing Example

2025-09-08T00:00:00+00:00

This is the third post in my notes on Spark in Action by Petar Zečević and Marko Bonaći. In this post, I will summarize the runtime components and scheduling methods that make up a Spark application, and finally look at a real-time dashboard example.

While the previous posts covered RDDs, partitioning, and shuffling, this post is closer to how Spark actually runs on a cluster.

Spark Runtime Components

A Spark application runs through the collaboration of several runtime components.

Client

The client is the entity that starts the driver. Examples include spark-submit, spark-shell, and custom applications using the Spark API.

Driver

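The driver orchestrates and monitors the execution of a Spark application. There is exactly one driver per application, and it can be thought of as a wrapper around the application.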

The driver is responsible for the following.

Request memory and CPU resources from the cluster manager.
Split application logic into stages and tasks.
Send tasks to multiple executors.
Collect task execution results.

Deployment mode can be divided into two types depending on where the driver runs.

Cluster Deployment Mode

In cluster deployment mode, the driver is separated from the client.

The driver runs inside the cluster as a separate JVM process. Therefore, resources for the driver process, such as JVM heap memory, are managed by the cluster.

Client Deployment Mode

In client deployment mode, the driver runs in the client’s JVM process.

Executor

An executor is a JVM process that executes tasks requested by the driver and returns results back to the driver.

An executor runs tasks in parallel across multiple task slots. Task slots are implemented as threads, so they do not have to match the number of physical CPU cores. A common rule of thumb is to configure around two to three times as many slots as CPU cores.

SparkContext

SparkContext is the basic interface for accessing a Spark runtime instance. The driver creates and starts a SparkContext instance.

When running an application through the Spark API, the application must start SparkContext directly.

Only one SparkContext can be created per JVM. There is an option to use multiple contexts, but it is closer to a testing feature and is generally not recommended.

Scheduling

Spark scheduling can be viewed from three perspectives.

Resources are scheduled in units of executors (JVM processes) and CPU task slots.
The cluster manager allocates CPU and memory resources to each executor.
Job scheduling is executed inside the application.

Cluster Resource Scheduling

Cluster resource scheduling is the process of allocating resources to executors of multiple Spark applications running on a single cluster.

The cluster manager starts, stops, and restarts processes, and limits the maximum number of CPU cores available to each executor.

The cluster manager does the following.

Starts executor processes requested by the driver.
Starts the driver process as well when using cluster deployment mode.

Executors are not shared between applications. Therefore, if multiple applications run simultaneously on a single cluster, resource contention can occur.

Spark Job Scheduling

Spark job scheduling is the process of scheduling CPU and memory resources for running tasks inside a single Spark application.

The driver has several scheduler objects. Once executors are running, it decides which executor will run which task.

Multiple jobs sharing the same SparkContext compete for executor resources. SparkContext is thread-safe.

Job scheduling determines CPU resource usage in the cluster. It also indirectly affects memory usage, because running more tasks in a single JVM uses more heap memory.

CPU resources are managed at the task level. Memory resources, on the other hand, are managed by dividing them into multiple segments.

FIFO Scheduler

The FIFO scheduler lets the job that requested resources first occupy as many task slots as it needs.

If the job that started first does not use many resources, other jobs can also run simultaneously. But if the first job needs to occupy all resources, the next job must wait until the first job releases them.

FAIR Scheduler

The FAIR scheduler distributes resources evenly in a round-robin manner.

Even if a job requests task slots later, it does not necessarily have to wait until a long-running job completes.

Using scheduler pools allows weights and minimum shares to be configured. If a weight is set, jobs in a particular pool can receive more resources than jobs in other pools. The minimum share is the minimum number of CPU cores that each pool can always use.
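The difference between the two schedulers can be illustrated with a toy allocation of task slots. This is a deliberately simplified sketch: it ignores pools, weights, and minimum shares, and it treats allocation as a one-time split, whereas Spark assigns tasks continuously as slots free up. The function names are my own.

```python
def fifo_allocate(demands: list[int], slots: int) -> list[int]:
    """Give each job, in arrival order, as many slots as it asks for until none remain."""
    allocation = []
    for demand in demands:
        granted = min(demand, slots)
        allocation.append(granted)
        slots -= granted
    return allocation


def fair_allocate(demands: list[int], slots: int) -> list[int]:
    """Hand out one slot at a time in round-robin order to jobs that still want more."""
    allocation = [0] * len(demands)
    while slots > 0:
        progressed = False
        for i, demand in enumerate(demands):
            if slots > 0 and allocation[i] < demand:
                allocation[i] += 1
                slots -= 1
                progressed = True
        if not progressed:
            break  # every job is satisfied; leftover slots stay idle
    return allocation


if __name__ == "__main__":
    demands = [10, 4]  # job 0 arrived first and wants 10 slots, job 1 wants 4
    print("FIFO:", fifo_allocate(demands, slots=8))
    print("FAIR:", fair_allocate(demands, slots=8))
```

With eight slots, the first job takes everything under FIFO, while the round-robin split lets the later job start right away.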

Speculative Execution

Speculative execution is a feature for reducing the problem of straggler tasks, which take unusually longer than other tasks in the same stage.

Spark can launch a copy of the same task, processing the same partition, on another executor. If the original task is delayed and the speculative copy completes first, Spark uses the result of the speculative copy to reduce overall job latency.

Related settings include the following.

spark.speculation = true: enables speculative execution.
spark.speculation.interval: the interval for checking whether speculative tasks should be launched.
spark.speculation.quantile: the fraction of tasks in a stage that must be completed before speculative tasks are launched.
spark.speculation.multiplier: how many times slower than the median task a running task must be before it is considered delayed.

However, speculative tasks must be used carefully. For example, if the task writes data to a database, the same data may be written twice.
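
To make the roles of quantile and multiplier concrete, here is a rough sketch of the decision they control. The function name and the default values are illustrative assumptions, and Spark's real logic has more conditions than this.

```python
from statistics import median


def speculation_candidates(finished_durations, running_durations, total_tasks,
                           quantile=0.75, multiplier=1.5):
    """Return indexes of running tasks that look like stragglers.

    quantile and multiplier play the roles of spark.speculation.quantile and
    spark.speculation.multiplier; the values here are only examples.
    """
    # Do nothing until a large enough fraction of the stage has finished.
    if len(finished_durations) < quantile * total_tasks:
        return []
    # A task is a straggler if it has run much longer than the median task.
    threshold = multiplier * median(finished_durations)
    return [i for i, elapsed in enumerate(running_durations) if elapsed > threshold]


too_early = speculation_candidates([10, 12, 11], [13, 40], total_tasks=5)
stragglers = speculation_candidates([10, 12, 11, 9], [40], total_tasks=5)
```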

Data Locality

Data locality is a strategy for running tasks on executors located as close as possible to the data.

Preferred Locations

For each partition, Spark keeps a list of preferred locations: the hostnames or executors that hold that partition's data. It can use this location information to run computation close to the data.

However, preferred location information is available only for RDDs created from HDFS data and cached RDDs.

HDFS RDDs retrieve location information from the HDFS cluster through the Hadoop API. For cached RDDs, Spark directly manages the executor locations where each partition is cached.

Data Locality Levels

When Spark cannot secure the best task slot, it waits for a certain amount of time. If it still cannot secure one, it tries scheduling to the next-best location.

Depending on where a task runs, data locality levels are divided as follows.

PROCESS_LOCAL: runs on the executor that cached the partition.
NODE_LOCAL: runs on a node that can directly access the partition. This is a location that can access the data without going through the network, and another executor on the same machine may fall into this category.
RACK_LOCAL: runs on another machine mounted in the same rack as the machine storing the partition. Since only YARN can refer to rack information in the cluster, this level is possible only on YARN.
NO_PREF: no preferred location exists. The data can be accessed at the same speed from anywhere in the cluster.
ANY: runs the task in another location when data locality cannot be secured.

Here, a rack is a standard-sized frame for mounting servers and network equipment. Within the same rack, even if data is transferred over the network, it only needs to pass through the switch.

Memory Scheduling

Memory scheduling is the process in which the cluster manager allocates memory to executor JVM processes, and Spark manages memory used by jobs and tasks.

Memory Managed by the Cluster Manager

The memory allocated to an executor is configured with spark.executor.memory.

Memory Managed by Spark

In Spark 1.5.2 and earlier, executor memory was divided to store cached data and temporary shuffle data. Because usage in the divided memory regions could exceed their limits, a safety ratio was defined. The default allocation used 54% for cache, 16% for shuffling, and the remaining 30% for other Java objects and resource storage.
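
The 54% and 16% figures come from multiplying each region's fraction by its safety ratio. A quick check, assuming the legacy values were 0.6 and 0.9 for storage and 0.2 and 0.8 for shuffle (these four numbers are my assumption and are not stated above):

```python
def effective_fraction(memory_fraction, safety_fraction):
    """Share of the executor heap that a region can actually use."""
    return memory_fraction * safety_fraction


# Assumed legacy values: fraction of heap per region and its safety ratio.
storage = effective_fraction(0.6, 0.9)   # cached data
shuffle = effective_fraction(0.2, 0.8)   # temporary shuffle data
other = 1.0 - storage - shuffle          # remaining Java objects and resources

heap_mb = 4096  # hypothetical executor heap size
storage_mb = round(heap_mb * storage)
shuffle_mb = round(heap_mb * shuffle)
```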

Starting with Spark 1.6.0, memory is managed in a unified way, with execution and storage sharing one region. Therefore, if there is no shuffling, the cache may occupy the entire region. However, memory that execution is already using cannot be taken over by storage.

Example: Real-Time Dashboard

Finally, I will summarize a real-time dashboard example.

from kafka import KafkaProducer  # kafka-python client


class KafkaProducerWrapper(object):
    producer = None

    @staticmethod
    def getProducer(brokerList):
        # Create the producer lazily, on first use inside the worker process.
        if KafkaProducerWrapper.producer is None:
            KafkaProducerWrapper.producer = KafkaProducer(
                bootstrap_servers=brokerList,
                key_serializer=str.encode,
                value_serializer=str.encode
            )
        return KafkaProducerWrapper.producer

if __name__ == "__main__":
    # ... omitted

    # data key types for the output map
    SESSION_COUNT = "SESS"
    REQ_PER_SEC = "REQ"
    ERR_PER_SEC = "ERR"
    ADS_PER_SEC = "AD"

    requests = reqsPerSecond.map(lambda sc: (sc[0], {REQ_PER_SEC: sc[1]}))
    errors = errorsPerSecond.map(lambda sc: (sc[0], {ERR_PER_SEC: sc[1]}))
    finalSessionCount = sessionCount.map(
        lambda c: (
            int((datetime.now() - zerotime).total_seconds() * 1000),
            {SESSION_COUNT: c}
        )
    )
    ads = adsPerSecondAndType.map(
        lambda stc: (stc[0][0], {ADS_PER_SEC + "#" + stc[0][1]: stc[1]})
    )

    # all the streams are unioned and combined
    finalStats = finalSessionCount \
        .union(requests) \
        .union(errors) \
        .union(ads) \
        .reduceByKey(lambda m1, m2: dict(list(m1.items()) + list(m2.items())))

    def sendMetrics(itr):
        global brokerList
        prod = KafkaProducerWrapper.getProducer([brokerList])
        for m in itr:
            mstr = ",".join([str(x) + "->" + str(m[1][x]) for x in m[1]])
            prod.send(
                statsTopic,
                key=str(m[0]),
                value=str(m[0]) + ":(" + mstr + ")"
            )
        prod.flush()

    # Each partition uses its own Kafka producer to send formatted messages.
    finalStats.foreachRDD(lambda rdd: rdd.foreachPartition(sendMetrics))

    print("Starting the streaming context... Kill me with ^C")

    ssc.start()
    ssc.awaitTermination()

In this example, the number of active sessions is processed in one-second mini-batches, so results keyed by per-second timestamps are combined and sent to Kafka.

The Kafka producer object initialized in the driver cannot be sent to workers. Instead, the producer is initialized inside tasks that run on workers.

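In the Scala version of this example, the KafkaProducerWrapper companion object lazily creates a single Kafka producer instance. The Python version above gets the same effect by keeping the producer in a class attribute and creating it on first use.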

Using foreachPartition, a producer object can be initialized once per JVM and used to send messages to Kafka. Since multiple partitions share the same executor JVM, the producer object can also be shared.

Closing

In this post, I summarized Spark runtime components, resource scheduling, data locality, memory scheduling, and a real-time dashboard example.

Spark in Action 2: Understanding Partitioning and Shuffling

2025-08-29T00:00:00+00:00

This is the second post in my notes on Spark in Action by Petar Zečević and Marko Bonaći. In the first post, I looked at Spark’s basic execution flow and RDDs. In this post, I will summarize partitioning and shuffling, which directly affect performance.

Understanding how partitions are divided in Spark and when data movement occurs makes it much easier to see why a job becomes slow.

Data Partitioning

Partitioning is the process of splitting data across multiple cluster nodes. In Spark, partitioning has a major impact on performance and resource usage.

An RDD partition is a subset of RDD data. Spark splits files into partitions and stores them across cluster nodes, and the set of these distributed partitions forms a single RDD.

The number of partitions affects how work is distributed across the cluster. It is also directly connected to the number of tasks created when transformation operations are executed on an RDD.

If there are too few partitions, the cluster cannot be fully utilized. In addition, each task may have to process too much data and exceed the executor’s memory resources.

A common recommendation is to use three to four times as many partitions as the number of cores in the cluster. However, if there are too many tasks, task management itself can become a bottleneck.

Partitioner

A Partitioner performs partitioning by assigning a partition number to each element of an RDD.

HashPartitioner

HashPartitioner is the default partitioner. It calculates the partition using each element’s Java hash code with the formula partitionIndex = hashCode % numOfPartitions.

Because it is hash-based, it cannot guarantee that all partitions will be exactly the same size. However, as long as the number of partitions is not too small, the data is generally distributed fairly evenly.
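
The formula can be tried out with a small sketch. java_string_hash is a hypothetical helper that imitates Java's String.hashCode for plain ASCII keys so the result does not depend on Python's randomized string hashing. As far as I know, Spark also keeps the index non-negative when the hash code is negative, which Python's % operator already does for a positive modulus.

```python
from collections import Counter


def java_string_hash(s):
    """Imitate Java's String.hashCode(), including 32-bit signed overflow."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_partitions):
    # partitionIndex = hashCode % numOfPartitions
    # Python's % is non-negative for a positive modulus, so negative hash
    # codes still map to a valid partition index.
    return java_string_hash(key) % num_partitions


keys = ["cust%d" % i for i in range(1000)]
sizes = Counter(hash_partition(k, 4) for k in keys)
```

Printing sizes shows how many of the 1000 keys landed in each of the four partitions, which is a quick way to see how even the distribution is for a given key set.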

RangePartitioner

RangePartitioner splits data in a sorted RDD into roughly equal range intervals. It determines range boundaries based on sampled data.

The book explains that it is not often used in practice.

Custom Partitioner for Pair RDDs

When processing Pair RDDs composed of key-value pairs, a custom Partitioner can be used. It is useful when data must be placed into specific partitions according to a particular criterion.

Shuffling

Shuffling refers to physical data movement between partitions.

Shuffling occurs when data from multiple partitions must be combined to create partitions for a new RDD.

val prods = transByCust.aggregateByKey(List[String]())(
  (prods, tran) => prods ::: List(tran(3)),
  (prods1, prods2) => prods1 ::: prods2
)

For example, to group data by key, Spark must look through all partitions of the RDD and physically gather elements with the same key. During this process, data moves between partitions.

Two types of functions are used in aggregateByKey.

Transformation function: merges values within each partition and changes the value type.
Merge function: performs final merging of multiple values through the shuffling stage.
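
The two functions can be simulated in plain Python without a cluster. This is only a sketch of the idea: aggregate_by_key is a hypothetical helper, the partitions are ordinary lists, and the sample data is made up.

```python
def aggregate_by_key(partitions, zero, seq_func, comb_func):
    # Step 1: inside each partition, fold values into a per-key accumulator
    # (the transformation function, which may change the value type).
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            acc[key] = seq_func(acc.get(key, zero), value)
        partials.append(acc)
    # Step 2: after the shuffle, merge the accumulators that share a key
    # (the merge function).
    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            merged[key] = comb_func(merged[key], partial) if key in merged else partial
    return merged


partitions = [
    [("cust1", "apple"), ("cust2", "pear"), ("cust1", "milk")],
    [("cust1", "bread"), ("cust2", "tea")],
]
prods = aggregate_by_key(
    partitions,
    zero=[],
    seq_func=lambda items, item: items + [item],  # value -> list of values
    comb_func=lambda a, b: a + b,                 # list + list
)
```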

The task performed immediately before shuffling is called a map task, and the task performed immediately after is called a reduce task.

External Shuffle Service

When shuffling is performed, executors must read intermediate files produced by other executors using a pull method. If an executor fails partway through, the intermediate data it produced may become unavailable and the job may stop.

An external shuffle service provides a single point where executors can read intermediate shuffle files, optimizing the data exchange process.

Representative settings include the following.

spark.shuffle.manager: configures the shuffling algorithm. hash and sort can be used, and the default is sort.
spark.shuffle.consolidateFiles: configures whether intermediate files generated during shuffling should be consolidated. The default is false.
spark.shuffle.spill: configures whether data should be spilled to disk when memory resources are exceeded. The default is true.

Reducing Unnecessary Shuffling

To improve Spark job performance, reducing unnecessary shuffling is important. Shuffling is expensive because it involves network and disk I/O.

When Explicitly Changing the Partitioner

Shuffling occurs when using a custom Partitioner or a HashPartitioner with a different number of partitions from the previous HashPartitioner.

If possible, it is better to keep the default Partitioner.

When Removing the Partitioner

map and flatMap remove the Partitioner. If operators such as join or groupByKey are used afterward, shuffling may occur.

If there is no need to change the key, it is better to use mapValues or flatMapValues. Another option is to use mapPartitions, mapPartitionsWithIndex, glom, and similar methods so that data is mapped only within partitions, while setting preservesPartitioning = true.

Changing RDD Partitions

There are cases where partitioning must be explicitly changed to distribute workload.

partitionBy

partitionBy can be used only on Pair RDDs. It creates a new RDD by receiving a Partitioner object to use for partitioning.

coalesce

coalesce is used to change the number of partitions.

When reducing the number of partitions, it selects the same number of parent RDD partitions as the new number of partitions, then splits and merges elements from the remaining partitions.

If shuffle = false is set, transformation operators before coalesce also run with the new number of partitions. Conversely, if shuffle = true is set, transformation operators before coalesce use the original number of partitions, and only the operations afterward use the changed number of partitions.

repartition

repartition is equivalent to calling coalesce with shuffle set to true.

repartitionAndSortWithinPartitions

repartitionAndSortWithinPartitions receives a new Partitioner and sorts elements within each partition. Since sorting is performed together during the shuffling stage, it performs better than calling repartition and then sorting separately.

RDD Dependencies

Spark’s execution model is a directed acyclic graph (DAG), in which RDDs are vertices and dependencies between RDDs are edges.

Whenever a transformation operator is called, a new vertex (the new RDD) and a new edge (its dependency) are created. The new RDD depends on the previous RDD, and this graph is called RDD lineage.

RDD dependencies can be broadly divided into narrow dependencies and wide dependencies.

Narrow Dependencies

Narrow dependencies occur in transformation operations that do not require data to be transferred to other partitions.

One-to-one dependency: most narrow transformations other than union fall into this category.
Range dependency: combines dependencies on multiple parent RDDs into one. union falls into this category.

Wide Dependencies

Wide dependencies are formed when shuffling is performed. For example, a join creates a wide dependency whenever the two RDDs are not already partitioned by the same partitioner.

Stages

Spark divides a single Spark job into multiple stages based on the points where shuffling occurs.

Stage results are stored as intermediate files on the disks of executor machines. Spark creates tasks for each stage and partition, then passes them to executors.

Tasks in a stage that ends with shuffling are called shuffle-map tasks. Tasks created in the final stage are called result tasks.
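
The way shuffle points cut a job into stages can be sketched as follows. This is a deliberate simplification: split_into_stages is a hypothetical helper, each operation is just a name with a flag, and the shuffling operator is placed at the start of the new stage even though in Spark its map side belongs to the earlier one.

```python
def split_into_stages(operations):
    """operations: list of (name, needs_shuffle) pairs in execution order."""
    stages = [[]]
    for name, needs_shuffle in operations:
        if needs_shuffle and stages[-1]:
            stages.append([])  # a shuffle closes the current stage
        stages[-1].append(name)
    return stages


job = [
    ("textFile", False),
    ("map", False),
    ("reduceByKey", True),
    ("filter", False),
    ("sortByKey", True),
    ("collect", False),
]
stages = split_into_stages(job)
```

Here two shuffles produce three stages; tasks in the first two would be shuffle-map tasks and tasks in the last one result tasks.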

Checkpoints

If RDD lineage becomes too long, recovery cost increases when a failure occurs. In this case, checkpoints can be used to store the entire RDD data at an intermediate point.

If a failure occurs, Spark can recover from the checkpoint instead of re-running all operations from the beginning.

Closing

In this post, I summarized Spark partitioning, shuffling, and RDD dependencies.

In the next post, I will look at the runtime components that make up a running Spark application, and how cluster resources and tasks are scheduled.

Spark in Action 1: From MapReduce to RDDs

2025-08-19T00:00:00+00:00

I am going to record my notes from reading Spark in Action by Petar Zečević and Marko Bonaći in three parts. In this first post, I will start with MapReduce and Hadoop as background for understanding Spark, then summarize Spark’s basic execution flow and the concept of RDDs.

What Is MapReduce?

MapReduce is a large-scale data processing model introduced in Google’s paper MapReduce: Simplified Data Processing on Large Clusters. Its core idea is to make cluster computing easier to handle through a simpler model.

The MapReduce processing flow can be viewed in three broad steps.

Split a job into smaller pieces and map them across multiple nodes in a cluster for distributed processing.
Each node processes the task assigned to it and produces intermediate results.
The split intermediate results are aggregated in the reduce phase to produce the final result.

MapReduce tries to solve three major problems.

Parallel processing: split work into smaller units and process them simultaneously.
Data distribution: split data across multiple nodes for storage and processing.
Fault tolerance: handle failures in distributed components.

For example, the master periodically sends pings to all worker nodes. If a worker does not respond for a certain period of time, the master determines that the worker has a problem, resets the map tasks that worker was handling to their initial state, and reschedules them on another worker.

An important idea in this model is not to move data to where computation happens, but to send the program to where the data is stored. For large-scale data, network transfer costs are high, so it is important to compute as close to the data as possible.

Word Count Example

The most representative example is word count.

map: split each sentence into words and return a list of (word, 1) pairs.
shuffle phase: group map results by key so that the same word is passed to the same reducer.
reduce: sum the occurrences for each word to produce the final result.

The shuffle phase can become a bottleneck, but it makes aggregation by word simple in the subsequent reduce phase.
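
The three phases can be written out in a few lines of single-machine Python. This only mirrors the data flow; the function names and sample sentences are made up, and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(sentence):
    # Emit a (word, 1) pair for every word in the sentence.
    return [(word, 1) for word in sentence.lower().split()]


def shuffle_phase(pairs):
    # Group the pairs by key so each word ends up with one list of counts.
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    # Sum the occurrences of each word.
    return {word: sum(counts) for word, counts in grouped.items()}


sentences = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for s in sentences for pair in map_phase(s)]
word_counts = reduce_phase(shuffle_phase(mapped))
```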

What Is Spark?

Spark is a big data processing platform that replaces Hadoop’s MapReduce.

Hadoop is a Java-based open-source framework for distributed computing. People usually think of it together with the Hadoop Distributed File System, or HDFS, and the MapReduce processing engine.

Spark is similar to Hadoop in that it is a general-purpose distributed computing platform. However, because it is designed to keep large amounts of data in memory, better performance can be expected for iterative computation or interactive analysis.

In Hadoop MapReduce, if the result of one job needs to be used in another job, it must be saved to HDFS and then read again. This makes it inefficient for iterative algorithms. Also, not every problem can be naturally decomposed using only MapReduce operations.

Spark can be viewed as a processing engine that emerged to address these limitations.

Cases Where Spark Is Not Suitable

Spark is not the right tool for every situation.

Because it uses a distributed architecture, some overhead occurs in processing time. This overhead is not a major problem for large datasets, but for small datasets, another framework may be more efficient.

Spark is also not suitable for OLTP systems, which process large volumes of atomic transactions. Instead, it is better suited for batch processing or analytical workloads, namely OLAP.

Hadoop’s Core Ideas

Hadoop is based on three main ideas.

Parallelization: split many operations into smaller parts.
Distribution: split data across multiple nodes for storage.
Fault tolerance: handle failures in distributed components.

Spark shares these basic assumptions of distributed processing. The difference lies in how data is reused and how execution plans are constructed.

Spark’s Execution Process

Suppose we store a 300 MB file in an HDFS cluster. HDFS can split this file into blocks of 128 MB, 128 MB, and 44 MB, and store them across three nodes in the cluster. If the replication factor is set to the default value of 3, HDFS also replicates each block to two other nodes.
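
The block arithmetic is easy to check. split_into_blocks is a hypothetical helper that only does the division; it does not talk to HDFS.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file of the given size is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(300)
replication_factor = 3
stored_copies = len(blocks) * replication_factor  # block replicas in the cluster
```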

Spark asks Hadoop for the location of each block, or partition, of the file. It then loads each block into the RAM of the HDFS node where that block is stored. This is called data locality.

Using data locality allows computation to happen near where the data exists, rather than moving large amounts of data over the network.

The distributed collection referenced by an RDD is a set of multiple partitions. Users do not need to think every time about the fact that this collection is split across multiple nodes.

For example, when filtering is performed, only the filtered information is stored in RAM. If cache is used afterward, the same RDD can be reused in memory by another job without loading the file again. This filtering operation runs in parallel across multiple nodes.

RDD

RDD stands for Resilient Distributed Dataset. It is Spark’s basic abstraction and the core concept for handling data in a distributed environment.

RDDs have three major characteristics.

Immutability

An RDD is a read-only dataset. Transformation operators do not modify an existing RDD directly; they always create a new RDD object. In other words, once an RDD is created, it is immutable.

Resilience

An RDD has fault tolerance. Even if a node fails, the RDD can be restored.

Spark records the log of transformation operators used to create a dataset. If a failure occurs, it does not rebuild the entire dataset. Instead, it recomputes only the dataset held by the failed node and restores the RDD.

Distribution

An RDD is a dataset stored on one or more nodes. Users can use it like a logical collection without directly handling which physical node stores the data.

This can be understood as location transparency. Even if the physical pieces of a file are stored in multiple places, users access the data through a file name or RDD reference.

Transformation Operators and Action Operators

Spark operations can be broadly divided into transformation operators and action operators.

Transformation operators: manipulate data and create a new RDD. Examples include filter and map.
Action operators: actually return computation results. Examples include count and foreach.

Spark uses lazy evaluation. Calling a transformation operator does not immediately trigger computation. Actual computation is executed when an action operator is called.

Thanks to this approach, Spark can collect execution plans and compute them in a more efficient way.
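
Lazy evaluation can be imitated with a small class that only records transformations and runs them when an action is called. LazyDataset is a toy written for this post and is not how Spark implements RDDs.

```python
class LazyDataset:
    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    # Transformations: return a new object and record the step; nothing runs.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the recorded steps and return a result.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


calls = []


def traced_double(x):
    calls.append(x)  # remember that the function really ran
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the plan exists so far
result = ds.collect()
```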

Scala for Comprehension Example

The book also covers Scala code. For example, the following code reads lines from a file and creates a Set.

import scala.io.Source.fromFile

val employees = Set() ++ (
  for {
    line <- fromFile(empPath).getLines
  } yield line.trim
)

At each cycle of the for loop, the line.trim value is added to a temporary collection. When the loop ends, this temporary collection is returned and then merged into the Set.

Shared Variables

In a distributed environment, multiple nodes in a cluster sometimes need to refer to the same data. In this case, Spark’s shared variables can be used.

val bcEmployees = sc.broadcast(employees)
val isEmp = (user: String) => bcEmployees.value.contains(user)

A broadcast variable, the kind of shared variable used here, is sent exactly once to each node in the cluster and automatically cached in memory. If it is not used, the same data may be transferred over the network repeatedly, once for every task performing the work.

Spark distributes broadcast variables using a P2P protocol. Each node exchanges and spreads the variable with other nodes, which is also called a gossip protocol. This prevents the master from being slowed down significantly by the distribution.

When accessing a shared variable, the value method must be used.

Closing

In this post, I first looked at MapReduce and Hadoop as background for understanding Spark, then summarized Spark’s basic execution flow and RDDs.

In the next post, I will summarize partitioning, shuffling, and RDD dependencies, which are important for understanding Spark performance.

How Search Result Rankings Are Calculated: Learning to Rank

2021-08-31T00:00:00+00:00

There are countless documents on the web, and we can now search for almost any information that exists in the world. That makes a different question more important: “How do I find the information I want among all of that information?”

To think about it simply, I could ask a search engine to show me every document containing the keyword Plato. But would that really be a good search experience? If I had to read every document one by one to find information about Plato, it might be faster to email a philosophy professor instead.

What we need, then, is a way to rank search results. Among the many documents containing what I searched for, the system should show the best-written and most likely useful documents in order.

Seen this way, we have already identified the core of Information Retrieval fairly well.

Find documents containing the search terms
Define what “most useful” means
Calculate rankings according to that criterion

Basic Principles of Search

Before going into details, let’s first look at how search works. To simplify the explanation, I will refer to the various components of a search engine collectively as the “search bot.”

Extracting Index Terms from Documents

Before search can begin, there must first be data to show as results. Crawlers collect various documents from websites and store them in a database. At this point, the content of each document is also analyzed and stored.

One important piece of information in a document is words. For example, if we want to find documents containing the word Plato, it is much more efficient to store in advance which documents are connected to the word Plato than to scan every document in the database each time.

Word	Documents
Plato	document1, document2, …
Nietzsche	document2, document3, …

This structure, which connects words to documents, is called an inverted index. To extract words from documents for indexing, morphological analysis and stopword removal are also needed.
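
A minimal inverted index for the table above might look like this. Real morphological analysis is replaced here by lowercasing and whitespace splitting, and the stopword list and document texts are made up for illustration.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "was"}  # tiny illustrative list


def build_inverted_index(documents):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            token = token.strip(".,!?")
            if token and token not in STOPWORDS:
                index[token].add(doc_id)
    return index


documents = {
    "document1": "Plato was a student of Socrates.",
    "document2": "Nietzsche criticized Plato.",
    "document3": "Nietzsche and the eternal recurrence.",
}
index = build_inverted_index(documents)
plato_docs = sorted(index["plato"])
```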

User Queries and Intent

The user now enters what they want to know into the search box. Examples include Plato biography, Korea Olympic schedule, or good shoes for jogging. This is called a query.

To provide more accurate results, the search bot tries to understand the user intent. For example, if someone searches for Korea Olympic schedule while the Tokyo Olympics are taking place, it would be more appropriate to show Korea’s event schedule for the Tokyo Olympics than the schedule for the PyeongChang Olympics held in Korea.

Also, as voice search has become more active, it has become important to handle not only simple keyword searches but also natural language queries such as When was Plato born?

Database Search and Ranking

Once the user’s query and intent are obtained, the search bot first uses the inverted index to retrieve candidate documents. For example, if the query is good shoes for jogging, it retrieves documents containing words such as jogging, shoes, and good.

It then calculates rankings by combining factors such as user intent, document credibility, and relevance between the query and document. The quality of this ranking calculation strongly affects the search experience.

Hey Google, Learn to Rank

Learning to Rank (LTR) is also called Machine-Learned Ranking (MLR). As discussed earlier, statistical information about keywords in a query is not enough to create good search results. Various features such as click counts, document credibility, freshness, and relevance to user intent must be extracted, and the optimal ranking must be learned from them.

An LTR model is generally built through the following process.

Create a judgment list
- Match suitable documents to a given query.
Define features
- Decide which features the model will learn from, such as click count, likes, document length, or title matching score.
Create training data
- Set feature values for each document included in the judgment list.
Train and evaluate the model
- Precision: the proportion of results returned by the model that are actually relevant.
- Recall: the proportion of all relevant results that the model returned.
- nDCG: a metric that evaluates search result quality while considering rank.
Apply it to the search engine

nDCG: Normalized Discounted Cumulative Gain

Models learn in the direction of reducing error. There are several ways to evaluate the quality of a search ranking model, but here we will look at a representative metric, nDCG (Normalized Discounted Cumulative Gain).

DCG_p calculates relevance for the top p search results while discounting the weight according to rank. Users usually look at higher-ranked search results more often, so the relevance of the first result is more important than the relevance of the hundredth result. DCG reflects this property.

However, since recommendation models or search models may return different result ranges, normalization is needed for comparison. Dividing DCG_p by IDCG_p gives the normalized value, nDCG. Here, IDCG_p is the DCG when the top p search results are ordered ideally.

Higher nDCG values indicate better search results.
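
As a sketch, nDCG can be computed as below. I am assuming one common definition of DCG, where the relevance at each rank is divided by log2(rank + 1); other variants exist. The ideal ordering here is built only from the returned list.

```python
import math


def dcg(relevances, p):
    """Discounted cumulative gain over the top p results."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:p], start=1))


def ndcg(relevances, p):
    """DCG divided by the DCG of the ideally ordered list (IDCG)."""
    ideal = dcg(sorted(relevances, reverse=True), p)
    return dcg(relevances, p) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1], p=3)   # already in the ideal order
reversed_order = ndcg([1, 2, 3], p=3)
```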

Learning to Rank Approaches

To return ordered search results, let’s define a function f. f(d, q) takes document d and query q as input and returns the document’s score or rank. The goal is to learn a function such that nDCG is maximized when all documents are sorted by f(d, q).

LTR can be approached broadly in three ways: pointwise, pairwise, and listwise.

Pointwise Learning to Rank

As the simplest example, consider the following formula.

f(d, q) = 10 * titleScore(d, q) + 2 * descScore(d, q)

This example comes from Search as Machine Learning. It calculates scores for all documents, then sorts them in descending score order.

The pointwise approach looks at each document individually and learns by reducing the difference between the calculated score and the target score. It is easy to understand and relatively simple to implement. However, if the error for the first-ranked document and the error for the hundredth-ranked document are treated in the same way, it becomes difficult to sufficiently reflect the greater importance of top-ranked results in real search.

Pairwise Learning to Rank

The pairwise approach compares pairs of documents to adjust rankings. Given a pair of documents (x_i, x_j), the pair can be labeled 1 if x_i ranks higher than x_j, and -1 if it ranks lower.

The fact that x_i ranks higher than x_j can be interpreted as meaning that we can classify which document is more relevant based on the difference between their features. Based on this idea, RankSVM finds a decision boundary that separates document pairs and learns ranking direction from it.

The pairwise approach has the advantage of learning relative order between documents. However, because it does not directly optimize the quality of the entire list, a gap can appear between the evaluation metric and the training objective.

Listwise Learning to Rank

The listwise approach compares the ideal order of the entire document list with the order returned by the model. For example, the order from rank 1 to rank 100 can be considered one permutation among 100! possible permutations. This method calculates and compares the probability that the search result permutation returned by the model is the actual target permutation.

When calculating permutations, it considers position-specific probabilities such as the probability that document i is ranked first and the probability that document j is ranked second. Therefore, it can give greater influence to higher rankings.

However, calculating ranking probabilities for all documents is computationally expensive. For this reason, simplified methods such as Top-one probability are sometimes used instead of calculating the full permutation.
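
The top-one idea can be sketched as follows, assuming the form I know from ListNet: each document's top-one probability is the softmax of its score, and the loss is the cross entropy between the target distribution and the model's distribution. The function names are mine.

```python
import math


def top_one_probabilities(scores):
    """Probability that each document is ranked first, via softmax."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(target_scores, model_scores):
    """Cross entropy between target and model top-one distributions."""
    p_target = top_one_probabilities(target_scores)
    p_model = top_one_probabilities(model_scores)
    return -sum(t * math.log(m) for t, m in zip(p_target, p_model))


target = [3.0, 2.0, 1.0]
loss_good = listwise_loss(target, [3.0, 2.0, 1.0])  # same order as target
loss_bad = listwise_loss(target, [1.0, 2.0, 3.0])   # reversed order
```

The loss is lower when the model's scores put the documents in the target order, and only n probabilities are needed instead of n! permutations.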

Summary

Search does not end with simply finding documents that contain keywords. It also requires calculating rankings for candidate documents so that users can find the information they actually want more quickly.

Learning to Rank is an approach that does not leave ranking calculations only to manually written rules, but instead lets a model learn from various features and evaluation data. Pointwise predicts the score of a single document, pairwise learns relative order between document pairs, and listwise directly handles the order of the entire search result list.

Ultimately, a good search system must solve both “finding documents” and “sorting documents well.” Learning to Rank is one representative method for handling the second problem.

A Quick Tour of NLP: From TF-IDF to Transformer

2021-08-18T00:00:00+00:00

Natural language processing (NLP) is a field that represents text as numbers, learns relationships among those numbers, and produces meaningful results from them. In this post, I will walk through the broad flow from traditional search techniques such as TF-IDF and BM25 to Word2Vec, RNNs, Attention, and Transformer.

TF-IDF and BM25

TF-IDF is the value obtained by multiplying the frequency of a given keyword, or Term Frequency, by its Inverse Document Frequency.

Inverse Document Frequency inversely reflects how many documents in the entire collection contain that keyword. Common words receive lower IDF values, while words that appear frequently in a specific document but rarely across the whole corpus receive higher values. The reason for applying a logarithm is to reduce the excessive gap in IDF values as the number of documents grows.

In short, TF represents how often a keyword appears within a document, and IDF represents how rare that keyword is across all documents. A document is represented as a vector composed of TF-IDF values for each word. When a query comes in, we can calculate cosine similarity between the query’s TF-IDF vector and document vectors, then return similar documents.
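
Here is a small sketch of that retrieval flow. I am assuming raw term counts for TF and log(N / df) for IDF, which is only one of several common variants, and the three documents are made up.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [{term: tf * idf[term] for term, tf in Counter(doc).items()}
               for doc in tokenized_docs]
    return vectors, idf


def cosine_similarity(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "plato wrote the republic".split(),
    "nietzsche wrote about the will".split(),
    "the republic is a dialogue by plato".split(),
]
vectors, idf = tf_idf_vectors(docs)

query = "plato republic".split()
query_vec = {term: tf * idf.get(term, 0.0) for term, tf in Counter(query).items()}
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine_similarity(query_vec, vectors[i]),
                 reverse=True)
```

The word "the" appears in every document, so its IDF is zero and it contributes nothing to the similarity.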

BM25 is a ranking function that improves TF-IDF-based search scores. It improves search result quality by adding document length normalization and smoothing to TF-IDF’s simple frequency-based score.

The BM25 score of each query term is the product of two parts: an IDF component on the left and a normalized TF component on the right. f_td is the frequency of term t in document d. The k and b values in the denominator of the TF component are constant parameters, and the document length l(d) divided by the average document length avgdl is used for length normalization.

In the IDF component, N is the total number of documents, and df_t is the number of documents containing the term. Adding 0.5 avoids cases where the denominator becomes zero. This can be considered a form of smoothing.
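For reference, one common way to write BM25 with the symbols used here is shown below. Several variants of the formula exist (some add 1 inside the logarithm, for example), so treat this as an assumed standard form rather than the exact one this post was based on.

```latex
\mathrm{score}(q, d) = \sum_{t \in q}
\underbrace{\log\frac{N - df_t + 0.5}{df_t + 0.5}}_{\text{IDF}}
\cdot
\underbrace{\frac{f_{td}\,(k + 1)}{f_{td} + k\left(1 - b + b\,\frac{l(d)}{\mathrm{avgdl}}\right)}}_{\text{normalized TF}}
```

With b = 0 the length normalization disappears, and as f_td grows the TF part approaches k + 1 instead of growing without bound.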

From Frequency to Meaning: Dimensionality Reduction Techniques

Linear Discriminant Analysis

One simple dimensionality reduction method used in the 1990s is Linear Discriminant Analysis. This method requires training data with predefined labels for binary classification.

First, it calculates the average position, or centroid, of TF-IDF vectors belonging to one class. It also calculates the average position of TF-IDF vectors in the other class, then draws a line connecting the two centroids. When classifying new data, it takes the dot product between this line vector and the data’s TF-IDF vector to determine which class the data is closer to.
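The centroid-based procedure above can be sketched with NumPy. The vectors, the class names `spam` and `ham`, and the choice of the midpoint between centroids as the decision threshold are assumptions made for illustration.

```python
import numpy as np

# Hypothetical TF-IDF vectors (rows = documents) for two labeled classes.
spam = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
ham = np.array([[0.1, 0.7, 0.9], [0.0, 0.9, 0.8]])

spam_centroid = spam.mean(axis=0)
ham_centroid = ham.mean(axis=0)

# The line connecting the two centroids, pointing from ham toward spam.
direction = spam_centroid - ham_centroid

# Projection of the midpoint between the centroids, used as the threshold (an assumption).
threshold = direction @ (spam_centroid + ham_centroid) / 2


def classify(vector):
    # A larger projection onto the line means the vector lies closer to the spam side.
    return "spam" if direction @ vector > threshold else "ham"
```

For example, `classify(np.array([0.85, 0.1, 0.05]))` lands on the spam side because its projection onto the line exceeds the midpoint.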

LSA: Latent Semantic Analysis

Latent Semantic Analysis (LSA) is an algorithm that analyzes TF-IDF vectors to extract topics from documents. If Linear Discriminant Analysis is closer to a supervised learning method for binary classification, LSA is an unsupervised learning method that does not require predefined topics.

LSA borrows its idea from PCA (Principal Component Analysis), which is used to reduce the dimensionality of high-dimensional data such as images.

LSA uses Singular Value Decomposition (SVD) to generate a topic-document matrix from a term-document matrix or TF-IDF matrix.

SVD decomposes the original matrix into the product of three matrices. In this decomposition, U and V are orthogonal matrices, while S, or Sigma, is a diagonal matrix. The diagonal elements of S are called singular values. The size of S is connected to the number of topics, and reducing this size results in Truncated SVD.
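A small NumPy sketch may make the decomposition and the truncation step more concrete. The term-document matrix below is made up, and k = 2 is an arbitrary choice for the number of topics.

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

# X = U @ diag(s) @ Vt, with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics to keep (Truncated SVD)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

topic_document = np.diag(s_k) @ Vt_k  # shape (k, n_documents)
X_approx = U_k @ topic_document       # rank-k approximation of X
```

Each column of `topic_document` describes one document with k numbers instead of one number per term.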

LDA: Latent Dirichlet Allocation

LDA (Latent Dirichlet Allocation) assumes that a document contains multiple topics in different proportions, and that each word was selected from one of those topics.

For example, suppose we have words like [bicycle, Han River, swimsuit, ocean]. Document 1 could consist of [bicycle, Han River], document 2 of [swimsuit, ocean], and document 3 of [Han River, ocean]. If the topics are [biking, swimming, travel], document 1 could contain multiple topics in proportions such as biking 0.7, swimming 0.1, and travel 0.2.

Conversely, when document 1 contains [bicycle, Han River], we can estimate which topic it is closest to.

First, we set k topics that exist in the document collection. These k topics are assumed to be distributed across documents according to a Dirichlet distribution. Then each word in each document is assigned to one of the k topics. For a word to be classified into the correct topic, we consider both how that word is classified in other documents and how the other words in the same document are classified. Repeating this process across all words in all documents eventually converges to stable values.

Word2Vec

If LSA is closer to understanding the meaning or topic of a document, Word2Vec extracts dense vector representations of individual words. It starts from the assumption that a word’s meaning can be inferred from the words around it.

Word2Vec obtains word vectors using two main methods: Skip-gram and CBOW (Continuous Bag of Words).

For example, suppose we have a sentence like “today's lunch is a delicious hamburger.” Skip-gram predicts surrounding words such as “today's,” “lunch,” and “hamburger” when “delicious” is given as input. Conversely, CBOW predicts “delicious” when “today's,” “lunch,” and “hamburger” are given as input.
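The difference between the two methods is easiest to see in the training pairs they generate. The sketch below assumes the stop words "is" and "a" were removed beforehand and uses a window of two words on each side; `training_pairs` is a hypothetical helper, not part of any Word2Vec library.

```python
def training_pairs(tokens, window=2):
    """Return (skip_gram_pairs, cbow_pairs) for a tokenized sentence."""
    skip_gram, cbow = [], []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Skip-gram: the center word is the input, each context word is a target.
        skip_gram.extend((center, word) for word in context)
        # CBOW: the context words are the input, the center word is the target.
        cbow.append((context, center))
    return skip_gram, cbow


# Stop words removed (an assumption made for this example).
tokens = ["today's", "lunch", "delicious", "hamburger"]
skip_gram, cbow = training_pairs(tokens, window=2)
```

Here `cbow[2]` pairs the context `["today's", "lunch", "hamburger"]` with the target `"delicious"`, while Skip-gram produces one `("delicious", word)` pair per context word.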

What matters when extracting word vectors is not the final output itself, but the hidden layer weights created during training. Since the input is a one-hot vector, the weights affected by that input word can be used as the word vector.

CNN

CNNs (Convolutional Neural Networks), which are mainly used in two-dimensional image domains, can also be applied to text. In text, one-dimensional convolution filters are used to capture local relationships among words.

A convolution filter moves horizontally over a word-vector matrix and performs convolution across the input. This operation multiplies the word embeddings inside the filter by the filter weights, sums the results, and usually applies an activation function such as ReLU. Since each step can be calculated independently, parallel processing is possible.
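As a rough sketch, a single one-dimensional filter can be written with NumPy as follows. The sizes and the random embeddings are placeholders, and words are laid out as rows here, so the filter slides down the rows rather than across columns.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, kernel_size = 6, 4, 3

embeddings = rng.normal(size=(seq_len, embed_dim))  # one row per word
kernel = rng.normal(size=(kernel_size, embed_dim))  # weights of one convolution filter
bias = 0.0


def conv1d_relu(x, w, b):
    steps = x.shape[0] - w.shape[0] + 1
    # Each window is independent of the others, so these steps could run in parallel.
    out = np.array([np.sum(x[i:i + w.shape[0]] * w) + b for i in range(steps)])
    return np.maximum(out, 0.0)  # ReLU


feature_map = conv1d_relu(embeddings, kernel, bias)  # one value per window position
pooled = feature_map.max()                           # global max pooling over positions
```

A real model would use many such filters at once, for example through a framework's one-dimensional convolution layer, instead of a Python loop.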

Each convolution filter produces a different output, and this output is passed as input to the next neural network stage. Dimensionality can then be reduced through pooling, or overfitting can be reduced through dropout. In the final layer, an activation function is applied to represent each data point as a single value. This value is passed to the loss function to calculate error, and the filter weights are updated through backpropagation. Optimizers such as Adam or RMSProp are used to reduce the loss.

RNN and LSTM

CNNs and Word2Vec mostly identify patterns through surrounding words. However, text contains many words that are semantically connected even when they are far apart. To handle this kind of sequential information, RNNs (Recurrent Neural Networks) are used. An RNN passes the hidden state computed at the current time step t as an additional input to the next time step t+1.

Backpropagation in an RNN is called BPTT (BackPropagation Through Time). It calculates the error between the final output and the target value, then traces backward to determine how much the weights at previous steps contributed. The problem is that as the input sequence becomes longer, the unrolled network becomes deeper, and vanishing or exploding gradients become more likely.

LSTM (Long Short-Term Memory) is a structure that mitigates these gradient problems while strengthening an RNN’s memory capability. It introduces a state at each step of the neural network, creating a memory that increasingly covers the entire input text as it progresses.

This memory state is controlled by three gates. The forget gate removes unnecessary memory, and the candidate gate (often called the input gate) selects components to newly strengthen. Finally, the output gate applies an activation function based on the updated memory vector and input data to produce the output. This output is passed to the next LSTM step.
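One LSTM time step can be sketched in NumPy using the standard gate equations. The dimensions, the random weights, and the single stacked weight matrix `W` are assumptions for illustration. The input gate and the candidate values are kept as separate terms, which is how the standard formulation splits the "candidate" step.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x] to the four gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much old memory to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much of the candidate to write
    g = np.tanh(z[2 * n:3 * n])   # candidate memory values
    o = sigmoid(z[3 * n:4 * n])   # output gate
    c = f * c_prev + i * g        # updated memory (cell) state
    h = o * np.tanh(c)            # output passed to the next step
    return h, c


rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(4, input_dim)):  # a sequence of four input vectors
    h, c = lstm_step(x, h, c, W, b)
```

Because the memory state is updated by addition (`f * c_prev + i * g`) rather than by repeated matrix multiplication, gradients have an easier path backward through time.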

GRU (Gated Recurrent Unit) is another commonly used structure with a similar purpose.

Seq2Seq and Attention

Seq2Seq refers to an encoder-decoder structure made of LSTMs or GRUs. It feeds input text into the encoder to create a vector, then passes this vector into the decoder to generate results. During training, the expected output values are also fed into the decoder. It is suitable for translation tasks where input and output lengths differ, and because of LSTM characteristics, it can generate variable-length text.

However, Seq2Seq models represent input text as a fixed-size vector. As the text becomes longer, it becomes harder to compress all meaning into a single vector.

Attention allows the decoder to revisit relevant input words when predicting each output word. In other words, when selecting y_i, it uses the encoder output h_j weighted by the attention weight a_ij. The context vector c_i for y_i can be represented as sum(a_ij * h_j).

Attention scores are calculated using the current decoder output and encoder hidden states, then passed through a softmax function to create a probability vector. This vector and the current decoder output are then used to calculate the next decoder hidden state.
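The computation of one context vector can be sketched as follows. Dot-product scoring is assumed here for simplicity (other scoring functions exist), and the encoder states and the decoder state are random placeholders.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


rng = np.random.default_rng(0)
src_len, hidden_dim = 5, 8
encoder_states = rng.normal(size=(src_len, hidden_dim))  # h_j, one row per input word
decoder_state = rng.normal(size=hidden_dim)              # current decoder output

scores = encoder_states @ decoder_state  # one score per input word
weights = softmax(scores)                # a_ij, sums to 1 over j
context = weights @ encoder_states       # c_i = sum_j a_ij * h_j
```

The context vector has the same size as a single encoder state, no matter how long the input sentence is.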

Transformer

Transformer removes the RNN-based neural networks used in Seq2Seq encoders and decoders, and implements both encoder and decoder using only attention.

However, removing RNNs also removes sequential position information from words. Transformer solves this with Positional Encoding. Positional Encoding builds a position vector for each word by applying sine functions to the even-indexed dimensions and cosine functions to the odd-indexed dimensions, then adds this vector to the word embedding.
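The sinusoidal encoding from the original Transformer paper can be sketched in NumPy like this. The function name and the sizes are placeholders, and an even `d_model` is assumed.

```python
import numpy as np


def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]   # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even-indexed dimensions
    pe[:, 1::2] = np.cos(angles)  # odd-indexed dimensions
    return pe


pe = positional_encoding(max_len=50, d_model=16)
# The encoding is added to the word embeddings of the same shape:
# embeddings_with_position = embeddings + pe[:seq_len]
```

Each position gets a distinct pattern of values, and no parameters need to be learned for it.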

The resulting word embeddings pass through Multi-Head Self-Attention in the encoder. Attention calculates relationships between a specific word’s query and other words’ keys and values. It first takes the dot product between the query and the full key matrix, scales the result by the square root of the key dimension to compute attention scores, then applies softmax to obtain probability values. Multiplying this probability vector by the values produces a weighted sum of the values that reflects the relationship between the query and keys.

In the encoder, Q, K, and V are all produced from the same input, so self-attention is performed.

After attention, the data passes through a Feed Forward Network (FFN). The FFN applies ReLU to the first linear layer, then computes the result through a second linear layer. The same linear-layer weights are applied at every position within a single encoder layer, but different layers have different weights.

Add & Norm, applied after both the attention and FFN sublayers, refers to residual connection and layer normalization. A residual connection adds a sublayer’s input to its output.

The encoder result is then passed to the decoder. The decoder first performs self-attention. Here, the mask prevents the model from referring to target words after the current time step by setting the attention scores at future positions to very large negative values, so that their softmax weights become effectively zero.
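The masking step can be sketched in NumPy as below. The value -1e9 stands in for negative infinity, and using the same matrix for q, k, and v without learned projections is a simplification made for the example.

```python
import numpy as np


def masked_self_attention(q, k, v):
    T, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                     # (T, T) scaled dot products
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where key position > query position
    scores = np.where(future, -1e9, scores)             # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # four target positions, eight dimensions each
output, weights = masked_self_attention(x, x, x)
```

The weight matrix is lower triangular, so the first position can only attend to itself and its output equals its own value vector.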

The decoder’s second attention uses the encoder outputs as key and value, and the decoder values as query, allowing it to refer to encoder information. The same process is then repeated to produce the final output.

Summary

TF-IDF and BM25 compare text based on word frequency and importance within documents. LSA and LDA attempt to discover hidden topics in documents, and Word2Vec represents words themselves as meaningful vectors. CNNs, RNNs, and LSTMs are neural network-based approaches for learning patterns and sequence in text.

Finally, Attention and Transformer learn which words to treat as more important in long contexts, and evolved in a direction that reduces the burden of sequential computation. The flow of NLP ultimately leads to the question: “How do we turn text into numbers, and how do we learn meaningful relationships among those numbers?”

References

Natural Language Processing in Action with Python, 2020
https://wikidocs.net/book/2155
https://m.blog.naver.com/ckdgus1433/221608376139
https://d2l.ai/chapter_recurrent-modern/lstm.html
http://incredible.ai/nlp/2020/02/20/Sequence-To-Sequence-with-Attention/

Speaker Recognition Through Self-Attention Encoding and Pooling

2021-08-17T00:00:00+00:00

This post is a review based on the paper Self-attention encoding and pooling for speaker recognition.

Overview

Not every frame in utterance data is equally important. Some frames contain more information for distinguishing a speaker, while others may be relatively less important. Attention is a technique that reflects these differences as weights and helps the model focus on more important frames.

This paper introduces a method for performing speaker recognition using Self-Attention, specifically the Transformer architecture proposed by Google. In particular, it moves away from conventional statistical pooling and designs a pooling layer that applies attention, making more active use of the strengths of self-attention.

In speaker recognition, attention has mostly been studied around the pooling layer. However, many previous studies used RNNs or applied multi-head attention, which had the downside of high computational cost. This paper focuses on reducing the number of parameters so that the model can be used even on mobile devices.

Because Transformer builds its encoder using only attention functions instead of RNNs, it can reduce computational complexity. Referring to this structure, the paper uses single-head self-attention when extracting speaker embeddings, and also applies a self-attention function to the pooling layer. As a result, it significantly reduces the number of parameters while maintaining performance. It is also meaningful because there were not many attempts at the time to apply deep learning-based speaker authentication to mobile devices.

Then how was attention proposed, and how is the attention pooling proposed in this paper different from existing pooling methods? Before looking at the paper in detail, let’s first briefly review the background.

Getting to Transformer

This section explains the background using models from the text domain rather than speaker recognition. However, utterance frames with temporal order can be understood as analogous to words with order in a sentence. In other words, the problem of deciding which words in an input sentence to focus on in a translation model resembles the problem of finding which frames best reveal speaker characteristics in speaker recognition.

Seq2Seq

Attention was first proposed in text-based domains. For tasks such as translation or chatbots, where a model must receive sentences of varying lengths and generate another sentence, a model was needed that could handle different input and output lengths. Sequence-to-Sequence (Seq2Seq) was well suited to this need.

Seq2Seq uses an RNN to predict the next word based on previously predicted words. It can handle input sentences of different lengths because it compresses the input sentence into a fixed-length context vector, the final hidden state of the encoder.

However, compressing the input sentence into a single vector inevitably causes information loss. Also, because later words are predicted only from information about earlier words, performance degrades as sentences become longer. This is the long-term dependency problem.

Attention Mechanism

The Attention Mechanism improves on these problems in Seq2Seq.

In attention, the context vector is not a single fixed piece of information. Instead, it is computed based on attention scores that change at each point when an output word is predicted. For example, when predicting the t-th output word, the model refers to the hidden states of all input words, computes a softmax result, and uses the weights for each input word to create information for the current step.

The word with the highest weight does not simply become the output word. Instead, the context vector built from these weights is used as an additional input for predicting the t-th word. Because the entire input sentence is selectively considered each time an output word is predicted, more stable performance can be expected even for longer sentences.

Transformer

However, the Attention Mechanism still followed the recurrent structure of Seq2Seq. Google then proposed Transformer.

Both Seq2Seq and attention-based models consist of an encoder that processes input words and a decoder that processes output words. Transformer also uses an encoder-decoder structure, but removes the recurrent word-by-word processing method and builds the encoder and decoder only with attention. As a result, computation time is reduced and inputs can be processed in parallel.

The difference between self-attention and regular attention lies in whether Q, K, and V passed to the attention function come from the same source or different sources. The Transformer encoder uses self-attention. Each decoder layer uses masked self-attention first, followed by regular attention over the encoder outputs.

Methods for reducing sequential computation existed before, but reflecting dependencies between distant words required a lot of computation. Transformer uses Positional Encoding to reflect word order while simplifying the computation process. However, the paper reviewed here does not use positional encoding.

Changes in the Pooling Layer

Utterance data used in speaker recognition has varying lengths. Therefore, after obtaining vectors for each frame, a pooling technique is needed to convert them into an utterance-level vector.

Early methods used average pooling, which sums frame vectors and takes their average. Later, statistics pooling was proposed, which considers not only the mean of frame vectors but also their standard deviation. According to the paper, however, it has not been clearly reported what effect the standard deviation actually provides. Related details can be found in Attentive Statistics Pooling for Deep Speaker Embedding.

After that, attentive statistics pooling, which applies attention, was introduced and showed performance improvements. In contrast, this paper proposes self-attention pooling, which removes the statistical component.

Attentive statistics pooling uses attention scores extracted from frame vectors as weights to compute the mean and standard deviation. This paper, on the other hand, introduces learnable parameters and applies an attention function. The meaningful point is that the parameters of the pooling layer are adjusted together as training progresses.
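The earlier pooling methods can be compared side by side in NumPy. This is only a sketch: the frame vectors are random, and the single scoring vector `w` is a simplified stand-in for the small learned network that attentive statistics pooling uses to score frames.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 64))  # (T, d): hypothetical frame-level vectors

# Average pooling: one mean vector per utterance.
average = frames.mean(axis=0)  # (d,)

# Statistics pooling: mean and standard deviation concatenated.
statistics = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])  # (2d,)

# Attentive statistics pooling: attention weights replace the uniform 1/T.
w = rng.normal(size=64)       # stand-in for learned scoring parameters
alpha = softmax(frames @ w)   # (T,), sums to 1
mean = alpha @ frames
var = alpha @ (frames ** 2) - mean ** 2
std = np.sqrt(np.clip(var, 0.0, None))
attentive = np.concatenate([mean, std])  # (2d,)
```

With uniform weights of 1/T, the attentive version reduces to plain statistics pooling, which shows that attention only changes how much each frame counts.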

Model Architecture

Self-Attention Encoder

The paper designs the model by borrowing the encoder part of Transformer. In speaker recognition, the encoder’s role is to compute attention scores for input frames and apply these weights back to the input to extract speaker embeddings.

The encoder is a stack of N identical encoder layers. Each encoder layer contains a self-attention mechanism and a position-wise feed-forward layer. The outputs of both layers pass through residual connection and layer normalization before being passed to the next layer.

Transformer uses multi-head attention so that several heads can attend to different aspects of the input in parallel, but this paper applies single-head attention to reduce the number of parameters.

# class Encoder

self.layer_stack = nn.ModuleList([
    EncoderLayer(d_m, d_ff, d_k, d_v, dropout=dropout)
    for _ in range(n_layers)
])

The encoder consists of N=2 layers, and each layer has the following two sublayers.

# class EncoderLayer

self.slf_attn = SelfAttention(d_m, d_k, d_v, dropout=dropout)
self.pos_ffn = PositionwiseFeedForward(d_m, d_ff, dropout=dropout)

1. Single-Head Self-Attention Mechanism

# class SelfAttention

self.w_q = nn.Linear(d_m, d_k)
self.w_k = nn.Linear(d_m, d_k)
self.w_v = nn.Linear(d_m, d_v)

First, learnable parameters w_q and w_k with dimensions (d_m, d_k), and w_v with dimensions (d_m, d_v), are defined. The paper uses d_k = d_v.

In conventional multi-head attention, the relationship is usually d_m / num_head = d_k = d_v. Since this paper uses a single head, this can be viewed as d_m / 1 = d_m = d_k = d_v.

# class SelfAttention

q = self.w_q(x)
k = self.w_k(x)
v = self.w_v(x)

attn = self.attention_func(q, k, v) # scaled dot-product attention

If the input x has shape (T, d_m), after multiplication with each parameter, the resulting tensors become q: (T, d_k), k: (T, d_k), and v: (T, d_v). The generated q, k, and v are used as inputs to the attention function.

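import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature # temperature=np.power(d_k, 0.5), i.e. sqrt(d_k)
        self.softmax = nn.Softmax(dim=2) # normalize over the key (frame) axis
        # attn_dropout is accepted but not applied in this excerpt

    def forward(self, q, k, v):
        # q, k: (bs, T, d_k), v: (bs, T, d_v)
        attn = torch.bmm(q, k.transpose(1, 2)) # (bs, T, T) raw scores
        attn = attn / self.temperature # scale by sqrt(d_k)
        attn = self.softmax(attn) # attention weights for each query frame
        output = torch.bmm(attn, v) # (bs, T, d_v) weighted sum of values
        return output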

The attention function used here is scaled dot-product attention, proposed in the Transformer paper. This method is used because it is faster than additive attention.

It multiplies q: (T, d_k) by k.transpose: (d_k, T), divides the resulting (T, T) scores by the temperature sqrt(d_k), passes them through softmax, and then multiplies the result by v: (T, d_v). The output has shape (T, d_v). In the final multiplication by v, information from specific frames is emphasized more strongly.

attn = self.layer_norm(attn + residual) # residual connection

The attention result passes through residual connection and layer normalization before being passed to the next layer.

2. Position-Wise Feed-Forward

import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFeedForward(nn.Module):
    """Implements position-wise feedforward sublayer.

    FFN(x) = max(0, xW1 + b1)W2 + b2
    """

    def __init__(self, d_m, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_m, d_ff) # expand each frame to d_ff
        self.w_2 = nn.Linear(d_ff, d_m) # project back to d_m
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_m)

    def forward(self, x): # (bs, T, d_m)
        residual = x
        output = self.w_2(F.relu(self.w_1(x))) # Linear - ReLU - Linear
        output = self.dropout(output)
        output = self.layer_norm(output + residual) # residual connection
        return output

The next sublayer has a Linear - ReLU - Linear structure. The (T, d_v) result obtained earlier, where d_v = d_m, is multiplied by (d_m, d_ff), then again by (d_ff, d_m), producing a (T, d_m) result.

Self-Attention Pooling Layer

In the pooling layer, the (T, d_m) result is converted into an utterance vector with shape (1, d_m).

First, w_c: (1, d_m) is multiplied by the transpose of the encoder output (d_m, T). The result is passed through softmax to create attention scores, then multiplied again by the encoder output (T, d_m). Through this process, a final utterance vector with shape (1, d_m) is obtained.

class SelfAttentionPooling(nn.Module):
    def __init__(self, d_m, dropout=0.1):
        super().__init__()
        self.d_m = d_m
        self.softmax = nn.Softmax(dim=2)
        self.w_c = nn.Linear(d_m, 1)

    def forward(self, x): # (bs, T, d_m)
        attn = self.w_c(x).transpose(1, 2) # (bs, 1, T)
        attn = self.softmax(attn)
        attn = torch.bmm(attn, x) # (bs, 1, d_m)
        return attn

DNN Classifier

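# class Transformer

def forward(self, x, is_test=False):
    x = self.encoder(x) # (bs, T, d_m)
    x = self.pooling(x) # (bs, 1, d_m)
    x = self.fc1(x)
    x = self.relu(x)
    x = self.fc2(x)
    x = self.relu(x)
    if is_test:
        # at test time, the second FC layer's output is the speaker embedding
        return torch.squeeze(x)
    x = self.fc3(x) # used only during training
    x = self.relu(x)
    return torch.squeeze(x)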

To extract speaker embeddings, the (1, d_m) output of the pooling layer passes through three fully connected layers. After training, the output of the second fully connected layer is used when obtaining actual speaker embeddings.

Experimental Setup

Protocol

Vox1
- train: VoxCeleb1 development set
- test: VoxCeleb1 test set
Vox2
- train: VoxCeleb2 development set
- test: VoxCeleb1 test set
Vox1-E
- train: VoxCeleb2 development set
- test: VoxCeleb1 development + test

Preprocessing

30-dimensional MFCC
Data augmentation and test-time augmentation are not used
Cepstral Mean Variance Normalization is applied
Training is based on 300 frames

Training

ReLU
Adam optimizer
Learning rate: 1e-4
Non-linearity, batch normalization, TDNN are used
PLDA backend
Baseline: x-vector

Parameters

Number of encoder layers: N = 2
d_k = d_v = 512
d_ff = 2048
Dropout
- encoder: 0.1
- other: 0.2
Dense layer dimension
- first: 90
- others: 400 (similar to i-vector)
AMSoftmax
- scaling factor: 30
- margin: 0.4

Results

Vox1 Protocol

It showed a slight improvement over x-vector with LDA/PLDA and VGG-M.
When AMSoftmax was used, performance improved by 8.93% over x-vector LDA/PLDA and 7.99% over VGG-M.

Vox2 Protocol / Vox1-E Protocol

It improved by about 20% and 15% over x-vector with LDA/PLDA.
ResNet-34 and ResNet-50 showed better results because they use far more parameters.
In Vox2, SAEP showed performance similar to ResNet-34 while using about 94% fewer parameters.

Effect of Key and Value Dimensions

When d_k = d_v was set to 64, 128, and 512, the number of parameters was 0.83M, 0.88M, and 1.16M, respectively.
When d_ff = 1024 and d_v = d_k = 64, it recorded 7.83% EER on the Vox2 protocol, with only 0.45M parameters.
This is meaningful because it requires almost one-tenth the number of parameters compared with x-vector.

Summary

This paper shows that applying a self-attention encoder and self-attention pooling to a speaker recognition model can significantly reduce the number of parameters while maintaining performance. I found it especially interesting that it considered a speaker authentication model usable in environments with limited computational resources, such as mobile devices.

The core idea is not to treat every frame equally, but to give attention to frames that contain more speaker information. Existing statistical pooling creates utterance vectors based on mean and standard deviation, while self-attention pooling directly adjusts frame-level importance through learnable parameters.

I think this is a good example showing that the ideas behind Transformer are not limited to natural language processing, but can also be applied to other domains with temporal order, such as speaker recognition.
