Smooth embeddings for arXiv scientific paper titles

Following up on my recent project creating fake arXiv abstracts with RNNs, I have developed a way to embed titles of papers into a vector space. The way I did this is heavily inspired by this paper by Bowman, et al. I followed a slightly simplified approach, in which I simply try to autoencode the titles with a seq2seq network, and use the hidden state that gets passed from the encoder to the decoder as the embedding. This by itself does not generate very smooth embeddings however, which Bowman et al address by including a variational autoencoder in-between the encoder and decoder RNNs. I was lazy and simply added a small amount of noise to the hidden representation during training, which had a similar effect.

Having such an embedding allows one to to some pretty entertaining things. First of all, one can interpolate between two paper titles, by taking the embeddings of two titles, and sampling a number of points that lie between them. Here is one such example:

signature of antiferromagnetic long-range order in the optical spectrum of strongly correlated potential
signature of antiferromagnetic long-range order in the optical excitations of highly correlated systems
signature of antiferromagnetic order nuclei in the 0d term of quenched systems
existence of antiferromagnetic order nuclei in the static region of mesoscopic systems
existence of self-gravitating one-dimensional rings of the maxwell chain "
existence of self-gravitating random static solutions of the toda system
existence of axially symmetric field solutions of the einstein-vlasov system
existence of axially symmetric static solutions of the einstein-vlasov system

(note that I normalised everything to lower case). Here are some more examples, and here are some examples with more fine-grained sampling between the points.

Another thing that one can do is calculate “analogies” as one can do with word2vec embeddings, such as “king is to queen as man is to woman”, by adding and subtracting their respective vectors, i.e. queen-king+man=woman. This seems to work reasonably well for some examples, and was not reported by Bowman, et al. For example, I got:

  "Nonlinear Kalman Filtering for with convolutional neural networks"
- "Convolutional neural networks"
+ "Generative adversarial networks
= "nonlinear non-standard interference of with adversarial networks"

which is kind of weird but not too bad. Another try got the network waxing philosophical:

  "What is the origin of species?" 
- "On the origin of species"
+ "On the theory of relativity"
= "what is theory. a theory"

In general it seems to be able to do substitutions of words quite alright if the positions of the words are similar. Doing this for 3 completely random titles with no obvious relations leads to gibberish output, but I wouldn’t expect anything else.

Future ideas include putting a dense layer before and after the hidden units, in order to get even more robust embeddings (right now they are the states of the 2 layers of RNNs concatenated). Another idea is to somehow separate “semantic” and “syntactic” aspects of the embedding, so that some dimensions would cover the subject matter, and others the grammatical structure that the idea is presented in.


RNN-generated arXiv abstracts


I scraped the entirety of arXiv abstracts to do some experiments. To get started, I trained a char-rnn on all the q-bio abstracts and generated a bunch of synthetic abstracts. Some of the results were quite fun, see below:

Various brain areas reveal spatiotemporal activity patterns that repeat over time: resulting intracellular elements of genetic regulatory networks are quantified. Using a ” experimental study of neural networks, the framework of cellular Markov models to the importance of complexity induces a identification of challenges for understanding specialized biological structures.


Modelling forest composition function for meaningful laws in cortical networks, in the light of simplifying assumption of interaction networks with the same importance they exploit networks used by previous models in topological detail. Existing methods largely depend on a kinetic SIR model under physical networks. We have used the stationary law of overlapping phylogenetic tree distributions as a popular utility. Making use of eigenvalue laws and a scheme augmented along the population and eventually simplify a network .


It also tries to generate LaTeX but it doesn’t get it quite right yet:
Geometry of DNA looping where the residence of 26 ‘ alleles diffusing out than amplitude distributions ( $ F ( x ) $ -test are abrupt at short times $ O ( n = 0.5 ) < $ ^ { 2+ } $ due to a balance matrix , and the synergism of the model and a statistical mechanics level comparable .
I experimented with generating arXiv categories and titles along with the abstracts.
Categories: q-bio.PE stat.AP stat.ME
Title: Joint Resolution Basis GDDA-BLAST Reaction: Mechanism of Biodynamics Waves problems in swarms
Abstract: In a reply that is robust from male molecules in the ecosystem and have presented to apply it to the city in proteomics evolution. An important entity presently processing a MS/MS spectrum outbreak, monitoring, requires on tests but not only an important difficulty in big datasets, opening gained from the usual graph lens and (human) sensitivity analysis. Future dimension test subjects are valid how the epigenetic basis for protein sequence directionality increases the increase within size state. We corrected review particular methods.

Weekly Vim Focus: Week 3

It has been a while since the last posts from this series, so the title weekly doesn’t really apply. The previous 2 bite-sized commitments to improve my Vim skills worked well, with about 75% of the things I tried to internalize now in daily use. This week I will focus on some movements, sticking to the number 4.

  • H and L move to top and bottom screen.
  • ctrl+F and ctlr+B move a page forward and back.
    I find the behaviour of this a bit strange, because it places you on the second to last line of the current page.
  • ' followed by a mark moves to the line of that mark.
    Useful automatic marks are . for the location of the last edit and ' for the location before the last jump.
  • nG moves to line n.
    I used to use :n for this, but that is not really a movement and thus can’t be combined with other commands.

For more movements also have a look at this Vim Movements Wallpaper. I won’t use it personally, but it makes for a handy reference.

4 Bit Synthesizer

Built a 4-bit synthesizer together with a friend, based on this project. We assembled it on a breadboard, still need to transfer it to a PCB. It is based on an ATMega48 microcontroller, and the sound is generated through a R-2R ladder DAC (the resistors soldered into a chunk in the picture)

Here is a sample of the lo-fi goodness:

The synth is controlled via MIDI, and can generate a single voice in one of 4 waveforms at a time. Several synths can be daisychained to be controlled via one MIDI connector and feed a single output though, so hopefully once we get to soldering everything together we can do that.

Quick Productivity Trick: Capture Distractions

This is probably obvious and/or part of systems like GTD, but I came up with this for myself recently, and it has worked remarkably well.

When working and trying to focus on one task at hand, which may not be my favorite task of all times, I find myself having to deal with distracting ideas popping into my head. Some of them are good ideas, and some of them are completely useless, but this can really only be determined after spending some time thinking about them, which is exactly what I am trying to avoid. I used to frequently cave in on this type of internal distraction, and my productivity would take a hit. The worst part is that once you start considering one such idea, it tends to lead to another and another.

What I have started doing now is just capture these ideas on paper, and then get back to them in my next break. Often times I will realize that the idea was stupid, and doesn’t deserve being pursued any further. But having it there puts my mind at ease, and makes it a lot easier to get on with the current task. The simple act of moving the distracting thought onto paper seems to make it easier to remove it from my mind. I use a simple notepad for this, which I keep next to my keyboard.

Weekly Vim Focus: Week 2

Week 1 of my vim productivity boosting experiment went well. During the week most of the commands I aimed for ended up being used often.
I got a lot of helpful comments, both here and on Hacker News. I ended up unmapping my arrow keys, and I do feel like this made a difference, I am now comfortable with those keys. As a bonus, it feels a lot more natural using hjkl in compound movement commands than it did with the arrow keys.

This week I will focus on repeating previously issued commands. Since 4 at a time worked well for me last time, I will stick to that number.

  • . repeats the last edit
  • gv marks the previous visual mark
  • & repeats a substitution
  • @: repeats a command

For next week I will go through the comments and pick the ones I found most useful in there.

Weekly Vim Focus

I’m a long time user of vim, but sometimes I feel I have retained some bad habits from when I first started using it, or that I am stagnating in my skills. Sure, I learn a new trick every now and then, but most of them I tend to forget again. So I am going to pick a few commands, and make an effort to stick to them consciously for a week. Hopefully. they will make it into my permanent repertoire that way. Depending on how well this goes, I might make a series out of it.

Here’s what I am going for this week:

  • use c for change instead of deleting and then going to insert mode
  • use ctrl-O to jump to the previous location (I just found out this also works across buffers)
  • use ctrl-R to paste registers in insert mode
  • use hjkl instead of arrow keys for navigation. (this is a big one, tried it before, and failed. But it makes a lot of sense to keep the fingers on home row as much as possible. Might go as far as remapping the arrow keys to force myself to do this.)

Seems like a short enough list to be tractable. More to come next week if I find this approach to be helpful.

cells. A massively multi-agent Python programming game.

This article describes a project of mine that has been laying around my harddrive in a rough draft for a couple of months. I want to continue developing it at some point, but can’t right now for lack of time. Many of the details described here will very likely be subject to change, but the basic concept stands.


Cells is a programming game, meaning that the player programms the agents before actual gameplay starts, and then watches pits his code against that of his oponents. It features two or more teams of hundreds or even thousands of identical agents, which I call “cells”. These cells live in a 2-dimensional simulated environment, and compete for the control of resources scattered around the world.

By identical I mean that they share the same code, but since their state need not be identical, they can specialize during the course of the game, if the programmer designed it that way. There is no central controlling instance, and therefor the behavior of each team is somewhat like that of a swarm, or a tribe of ants(the original inspiration for this). While it is theoretically possible to make one cell the commander and let all the rest do its bidding, a more decentralized and emergent style is encouraged by the design.

The two main aspects of this game are the cells and the world they inhabit. A brief description of the two follows.

The World

  • Elevations: I mentioned earlier that the playing field is 2-dimensional, which is not entirely accurate. It is really 3-dimensional, because there are different elevations. These are indicated by shades of brown in the screenshot above. Cells may only cross a certain elevation difference, ie they can’t move up or down a too steep cliff. This is important, because it makes it possible to build walls.
  • Energy sources: As explained below, cells need energy to survive. There are two sources of energy, one that is scattered around the whole world, and is not renewable. And the other, which is generated by plants, symbolized by green dots on the map. These produce this energy at different rates, and do so for ever.
  • Coordinates: The world is divided into integer coordinates. A cell may only move to an adjacent coordinate any given turn. Elevations are also given as integers.

The cells

  • Energy: Cells need a certain amount of energy every turn to survive. They can store as much energy as they want in themselves. If their energy level reaches zero, they die, and they leave behind an amount of energy at the coordinate they died in.
  • Senses: Cells are very myopic. They can only sense the fields they are on, and the ones directly adjacent to them. The information they can sense is the elevation and energy of the coordinates, the presence of any cells present and the team they are on. I’m not quite sure if they should be aware of global coordinates. Right now they are.
  • Communication: Cells can always communicate with each other. They send messages to each other through a global message cue(each team has one), and messages get delivered in the next turn. In the current version, they can send arbitrary information, but I might restrict that later on, to perhaps an integer.
  • Fighting: Fighting in cells takes place in a very game-theory-ish way. If a cell encounters an enemy, it can either choose to attack or to do nothing. If both choose to attack, they both sustain heavy damage(read: loss of energy). If both do nothing, they walk away unharmed. If one attacks, and the other doesn’t, the defending one sustains damage, but not as much as if it had attacked. I think this opens up the decision to some interesting tactical considerations.
  • Reproduce: Cells, being cells, can reproduce by splitting into 2. This requires a certain amount of energy, and the spawner can pass some arguments to the constructor of the spawnee.
  • Manipulate terrain: Cells can lift one unit of earth from the ground, reducing the elevation there by one, and later dump it at a different location. The screenshot shows an example of this, the read tribe is building circular walls around the plants they control.

This is the game so far, as mentioned earlier, it is still very much in draft form. If you would like to play around with it, I have it on git here. The only requirement is pygame.

In the next post I will outline the changes and additions I would like to make in the future. If you want to participate or have any ideas please feel free to comment or contact me.

My solution to the Quora challenge

Spoiler Alert! This post contains details of how I solved this problem. If you want to try this problem yourself, I strongly recommend you don’t read on. Coming up with your own method to do this efficiently is all the fun.

A few weeks ago my friend Albert nerd-sniped me with this coding problem from Quora, which they seem to use in their hiring process. I was immediately intrigued, and spent the next day or so on it.

It goes like this: A datacenter needs to be cooled using a duct. It is required to pass through every room of the datacenter exactly once, and some can’t be crossed(marked by a 1). It must start at the entrance(2) and end at the exit(3). The task is not to find a way of doing this, but rather counting how many ways there are to do it. I won’t reproduce the specifics, but here is a snippet:

Here is an example datacenter:

2  0  0  0

0  0  0  0

0  0  3  1

There are two possible ways to run the duct here:

0--0--3  1


2  0--0--0
|  |     |
0  0  0--0
|  |  |
0--0  3  1

Write a program to compute the number of possible ways to run the duct. For the above example, the correct answer is 2.

Check their challenge page for specifics.

My first implementation was a simple forward search. Start at the entrance and iteratively grow the tubing one step at a time. I implemented this with a queue, which contained all the legal tubing variations for a given length N. From this I generated all the possibilities for length N+1 by growing by one in all possible directions.

This proved to be disastrously slow, and memory consumption was high, too. Since the branches of the search tree grow exponentially, it is highly beneficial to prune them as early as possible. I did this in 3 ways.

  • Merging equivalent branches: To find new solutions it really only matters wether a given room already contains a tube or not. The direction of the tubes is irrelevant. This means that leaf nodes in a tree that have the same ending, and have passed through the same rooms can be merged into one. The number of solutions derived from this merged leaf needs to be multiplied by the number of original variants it contains.
  • Prune branches that cut off areas: If a part of the datacenter is cut off and can’t be reached from current the end of the tubing, the branch will not result in valid solutions. They can be pruned.
  • Dead ends: This was the last optimization I came up with, and it was surprisingly effective. Basically, to pass through a room, it must have 2 or more adjacent rooms free. If there is only one free, this means the tube can enter, but not leave any more. As a result all the branches that contain this pattern are doomed an can be pruned.

The resulting algorithm is quite speedy. I wrote it in C++, using STL containers for all my data structures. I was only really interested in the algorithmic optimizations, and didn’t worry too much about memory optimizations or other tricks to shave off milliseconds. Also, I performed all these optimizations all of the time. It might well be that their effect is beneficial in the beginning but outweighed by their cost in the final stages, when there is not much left to prune.

Another optimization I thought of but never implemented:

  • Uneven number of exits: If there is an area that can only be reached by a number of bottlenecks of width one, this can only be even if the exit is on the same side as the current tube end, and uneven if they lie on different sides. Otherwise the configuration will eventually end up with a dead end as described in the third optimization.

I’m not sure how much time this would save, since the check is quite elaborate compared to the other tests I mentioned.

The reference problem with the 7×8 grid ran in about 15 seconds on my 2Ghz Core2Duo. It found 301716 solutions. Larger examples take significantly longer, so I guess the solution still has exponential scaling behavior, which is probably not surprising since it is very similar to a hamiltonian path problem.

The code is not really fit for public consumption, but if someone asks for it I might clean it up and post it here.