Smooth embeddings for arXiv scientific paper titles

Following up on my recent project creating fake arXiv abstracts with RNNs, I have developed a way to embed paper titles into a vector space. My approach is heavily inspired by this paper by Bowman et al., but slightly simplified: I autoencode the titles with a seq2seq network and use the hidden state that gets passed from the encoder to the decoder as the embedding. By itself this does not produce very smooth embeddings, which Bowman et al. address by placing a variational autoencoder between the encoder and decoder RNNs. I was lazy and simply added a small amount of noise to the hidden representation during training, which had a similar effect.
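
To make that concrete, here is a minimal sketch of such a noisy seq2seq autoencoder in PyTorch. The names and hyperparameters are illustrative assumptions rather than the exact code behind this post; the only departure from a plain autoencoder is the Gaussian noise added to the hidden state during training:

    import torch
    import torch.nn as nn

    class NoisyTitleAutoencoder(nn.Module):
        """Autoencode a title; the encoder's final hidden state is the embedding."""
        def __init__(self, vocab_size, emb_dim=64, hidden_dim=256, noise_std=0.1):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
            self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)
            self.noise_std = noise_std

        def encode(self, tokens):                    # tokens: (batch, seq_len)
            _, h = self.encoder(self.embed(tokens))  # h: (1, batch, hidden_dim)
            return h                                 # this is the title embedding

        def forward(self, tokens):
            h = self.encode(tokens)
            if self.training:                        # noise only during training,
                h = h + self.noise_std * torch.randn_like(h)  # to smooth the space
            dec_out, _ = self.decoder(self.embed(tokens), h)  # teacher forcing
            return self.out(dec_out)                 # per-position vocab logits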

Having such an embedding allows one to do some pretty entertaining things. First of all, one can interpolate between two paper titles by taking their embeddings and sampling a number of points that lie between them. Here is one such example:

signature of antiferromagnetic long-range order in the optical spectrum of strongly correlated potential
signature of antiferromagnetic long-range order in the optical excitations of highly correlated systems
signature of antiferromagnetic order nuclei in the 0d term of quenched systems
existence of antiferromagnetic order nuclei in the static region of mesoscopic systems
existence of self-gravitating one-dimensional rings of the maxwell chain "
existence of self-gravitating random static solutions of the toda system
existence of axially symmetric field solutions of the einstein-vlasov system
existence of axially symmetric static solutions of the einstein-vlasov system

(note that I normalised everything to lower case). Here are some more examples, and here are some examples with more fine-grained sampling between the points.
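
The interpolation itself is nothing fancy: linearly mix the two embeddings and decode each intermediate point. A sketch, reusing the hypothetical model above and assuming some decode_fn that greedily decodes a title from a given hidden state:

    def interpolate(model, decode_fn, tokens_a, tokens_b, steps=8):
        """Decode a sequence of points on the line between two title embeddings."""
        h_a, h_b = model.encode(tokens_a), model.encode(tokens_b)
        for i in range(steps):
            alpha = i / (steps - 1)              # 0.0 ... 1.0
            h = (1 - alpha) * h_a + alpha * h_b  # linear mix in embedding space
            print(decode_fn(model, h))           # greedy decoding from h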

Another thing one can do is calculate “analogies”, as with word2vec embeddings, where “king is to queen as man is to woman” corresponds to adding and subtracting the respective vectors, i.e. queen - king + man = woman. This seems to work reasonably well for some examples, and was not reported by Bowman et al. For example, I got:

  "Nonlinear Kalman Filtering for with convolutional neural networks"
- "Convolutional neural networks"
+ "Generative adversarial networks"
= "nonlinear non-standard interference of with adversarial networks"

which is kind of weird but not too bad. Another try got the network waxing philosophical:

  "What is the origin of species?" 
- "On the origin of species"
+ "On the theory of relativity"
= "what is theory. a theory"

In general it seems able to substitute words reasonably well when the words occupy similar positions in the two titles. Doing this for 3 completely random titles with no obvious relation leads to gibberish output, but I wouldn’t expect anything else.
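
In code, the analogies are just vector arithmetic on the embeddings, again assuming the hypothetical encode/decode helpers from above:

    def analogy(model, decode_fn, tokens_a, tokens_b, tokens_c):
        """word2vec-style arithmetic: decode emb(a) - emb(b) + emb(c)."""
        h = model.encode(tokens_a) - model.encode(tokens_b) + model.encode(tokens_c)
        return decode_fn(model, h)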

Future ideas include putting a dense layer before and after the hidden units, in order to get even more robust embeddings (right now the embedding is simply the concatenated states of the two RNN layers). Another idea is to somehow separate “semantic” and “syntactic” aspects of the embedding, so that some dimensions would cover the subject matter, and others the grammatical structure in which the idea is presented.

RNN-generated arXiv abstracts

I scraped the entirety of arXiv abstracts to do some experiments. To get started, I trained a char-rnn on all the q-bio abstracts and generated a bunch of synthetic abstracts (a sketch of how such sampling works follows the examples). Some of the results were quite fun, see below:

Various brain areas reveal spatiotemporal activity patterns that repeat over time: resulting intracellular elements of genetic regulatory networks are quantified. Using a ” experimental study of neural networks, the framework of cellular Markov models to the importance of complexity induces a identification of challenges for understanding specialized biological structures.

 

Modelling forest composition function for meaningful laws in cortical networks, in the light of simplifying assumption of interaction networks with the same importance they exploit networks used by previous models in topological detail. Existing methods largely depend on a kinetic SIR model under physical networks. We have used the stationary law of overlapping phylogenetic tree distributions as a popular utility. Making use of eigenvalue laws and a scheme augmented along the population and eventually simplify a network .

 

It also tries to generate LaTeX but it doesn’t get it quite right yet:

Geometry of DNA looping where the residence of 26 ‘ alleles diffusing out than amplitude distributions ( $ F ( x ) $ -test are abrupt at short times $ O ( n = 0.5 ) < $ ^ { 2+ } $ due to a balance matrix , and the synergism of the model and a statistical mechanics level comparable .
I experimented with generating arXiv categories and titles along with the abstracts.

Categories: q-bio.PE stat.AP stat.ME
Title: Joint Resolution Basis GDDA-BLAST Reaction: Mechanism of Biodynamics Waves problems in swarms
Abstract: In a reply that is robust from male molecules in the ecosystem and have presented to apply it to the city in proteomics evolution. An important entity presently processing a MS/MS spectrum outbreak, monitoring, requires on tests but not only an important difficulty in big datasets, opening gained from the usual graph lens and (human) sensitivity analysis. Future dimension test subjects are valid how the epigenetic basis for protein sequence directionality increases the increase within size state. We corrected review particular methods.
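
For reference, sampling from a char-rnn means feeding the network’s own output back in, one character at a time. A sketch, assuming a hypothetical model with a step(char_index, state) -> (logits, state) interface:

    import numpy as np

    def sample(model, char_to_ix, ix_to_char, seed="The", length=400, temperature=1.0):
        """Generate text from a trained character-level RNN."""
        state = model.initial_state()                # hypothetical interface
        for ch in seed:                              # warm up on the seed text
            logits, state = model.step(char_to_ix[ch], state)
        out = list(seed)
        for _ in range(length):
            p = np.exp((logits - logits.max()) / temperature)  # softmax with
            p = p / p.sum()                                    # temperature
            ix = np.random.choice(len(p), p=p)       # sample the next character
            out.append(ix_to_char[ix])
            logits, state = model.step(ix, state)
        return "".join(out)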

Weekly Vim Focus: Week 3

It has been a while since the last post in this series, so the “weekly” in the title doesn’t really apply. The previous 2 bite-sized commitments to improve my Vim skills worked well, with about 75% of the things I tried to internalize now in daily use. This week I will focus on some movements, again sticking to the number 4.

  • H and L move to the top and bottom of the screen.
  • ctrl+F and ctrl+B move a page forward and back.
    I find this behaviour a bit strange, because it places you on the second to last line of the current page.
  • ' followed by a mark moves to the line of that mark.
    Useful automatic marks are . for the location of the last edit and ' for the location before the last jump.
  • nG moves to line n.
    I used to use :n for this, but that is not really a movement and thus can’t be combined with other commands (for example, d4G deletes everything from the cursor through line 4, which :4 cannot do).

For more movements, also have a look at this Vim Movements Wallpaper. I won’t use it personally, but it makes for a handy reference.

4 Bit Synthesizer

Built a 4-bit synthesizer together with a friend, based on this project. We assembled it on a breadboard and still need to transfer it to a PCB. It is based on an ATMega48 microcontroller, and the sound is generated through an R-2R ladder DAC (the resistors soldered into a chunk in the picture).
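
For intuition: an R-2R ladder is a binary-weighted voltage divider, so an n-bit input code maps linearly onto the reference voltage. A toy calculation of the 16 output levels (the 5 V reference is an assumption; the actual circuit may use a different one):

    def r2r_voltage(code, vref=5.0, bits=4):
        """Ideal output of an n-bit R-2R ladder DAC for a digital input code."""
        return vref * code / (1 << bits)    # 4 bits -> 16 discrete levels

    print([round(r2r_voltage(c), 3) for c in range(16)])  # 0.0 V up to 4.6875 V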

Here is a sample of the lo-fi goodness:

[audio sample]

The synth is controlled via MIDI and can generate a single voice, in one of 4 waveforms, at a time. However, several synths can be daisy-chained to be controlled via one MIDI connector while feeding a single output, so hopefully once we get around to soldering everything together we can do that.

Quick Productivity Trick: Capture Distractions

This is probably obvious and/or part of systems like GTD, but I came up with this for myself recently, and it has worked remarkably well.

When working and trying to focus on the task at hand, which may not be my favorite task of all time, I find myself having to deal with distracting ideas popping into my head. Some of them are good ideas and some of them are completely useless, but this can really only be determined after spending some time thinking about them, which is exactly what I am trying to avoid. I used to frequently cave in to this type of internal distraction, and my productivity would take a hit. The worst part is that once you start considering one such idea, it tends to lead to another, and another.

What I have started doing now is simply capturing these ideas on paper and getting back to them in my next break. Often I will realize that the idea was stupid and doesn’t deserve to be pursued any further. But having it there puts my mind at ease, and makes it a lot easier to get on with the current task. The simple act of moving the distracting thought onto paper seems to make it easier to remove it from my mind. I use a simple notepad for this, which I keep next to my keyboard.

Weekly Vim Focus: Week 2

Week 1 of my Vim productivity boosting experiment went well; during the week most of the commands I aimed for ended up being used often.
I got a lot of helpful comments, both here and on Hacker News. I ended up unmapping my arrow keys, and I do feel like this made a difference: I am now comfortable with hjkl. As a bonus, using hjkl in compound movement commands feels a lot more natural than the arrow keys did.

This week I will focus on repeating previously issued commands. Since 4 at a time worked well for me last time, I will stick to that number.

  • . repeats the last edit
  • gv reselects the previous visual selection
  • & repeats the last substitution
  • @: repeats the last command-line command

For next week I will go through the comments and pick the ones I found most useful in there.