Smooth embeddings for arXiv scientific paper titles

Following up on my recent project creating fake arXiv abstracts with RNNs, I have developed a way to embed the titles of papers into a vector space. The approach is heavily inspired by this paper by Bowman et al. I followed a slightly simplified version: I autoencode the titles with a seq2seq network and use the hidden state that gets passed from the encoder to the decoder as the embedding. By itself, however, this does not generate very smooth embeddings, which Bowman et al. address by placing a variational autoencoder between the encoder and decoder RNNs. I was lazy and simply added a small amount of noise to the hidden representation during training, which had a similar effect.
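
As a rough illustration of the noise trick, here is a minimal NumPy sketch. The function name, the Gaussian noise distribution and the `noise_std` value are all assumptions for the sake of the example; the post only says that "a small amount of noise" was added during training.

```python
import numpy as np

def noisy_hidden_state(hidden, noise_std=0.1, training=True, rng=None):
    """Perturb the encoder's final hidden state before it reaches the decoder.

    Gaussian noise and noise_std=0.1 are illustrative assumptions.
    At inference time (training=False) the state is passed through unchanged.
    """
    if not training:
        return hidden
    rng = np.random.default_rng() if rng is None else rng
    return hidden + rng.normal(0.0, noise_std, size=hidden.shape)

# Toy batch of 4 hidden states with 8 units each.
hidden = np.zeros((4, 8))
train_out = noisy_hidden_state(hidden, rng=np.random.default_rng(0))
eval_out = noisy_hidden_state(hidden, training=False)
```

The idea is that the decoder has to reconstruct the title from a slightly perturbed point, so nearby points in the embedding space are pushed to decode to similar titles.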

Having such an embedding allows one to do some pretty entertaining things. First of all, one can interpolate between two paper titles by taking the embeddings of the two titles and sampling a number of points that lie between them. Here is one such example:

signature of antiferromagnetic long-range order in the optical spectrum of strongly correlated potential
signature of antiferromagnetic long-range order in the optical excitations of highly correlated systems
signature of antiferromagnetic order nuclei in the 0d term of quenched systems
existence of antiferromagnetic order nuclei in the static region of mesoscopic systems
existence of self-gravitating one-dimensional rings of the maxwell chain "
existence of self-gravitating random static solutions of the toda system
existence of axially symmetric field solutions of the einstein-vlasov system
existence of axially symmetric static solutions of the einstein-vlasov system

(Note that I normalised everything to lower case.) Here are some more examples, and here are some examples with more fine-grained sampling between the points.
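
The interpolation step can be sketched as follows, assuming evenly spaced points on the straight line between the two embeddings (the post does not say exactly how the points were sampled). The function name is hypothetical, and each returned vector would then be fed to the decoder to produce a title.

```python
import numpy as np

def interpolate(z_start, z_end, num_points=8):
    """Return num_points embeddings evenly spaced from z_start to z_end (inclusive)."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    weights = np.linspace(0.0, 1.0, num_points)
    return [(1.0 - w) * z_start + w * z_end for w in weights]

# Toy 2-dimensional "embeddings"; real ones would come from the encoder.
points = interpolate([0.0, 0.0], [1.0, 2.0], num_points=5)
```

Increasing `num_points` gives the more fine-grained sampling mentioned above.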

Another thing one can do is calculate “analogies”, as with word2vec embeddings, such as “king is to queen as man is to woman”, by adding and subtracting the respective vectors, i.e. queen-king+man=woman. This seems to work reasonably well for some examples, and was not reported by Bowman et al. For example, I got:

  "Nonlinear Kalman Filtering for with convolutional neural networks"
- "Convolutional neural networks"
+ "Generative adversarial networks"
= "nonlinear non-standard interference of with adversarial networks"

which is kind of weird but not too bad. Another try got the network waxing philosophical:

  "What is the origin of species?" 
- "On the origin of species"
+ "On the theory of relativity"
= "what is theory. a theory"

In general, it seems to be able to substitute words quite well if the positions of the words are similar. Doing this for three completely random titles with no obvious relations leads to gibberish output, but I wouldn’t expect anything else.
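
The analogy arithmetic itself is just vector addition and subtraction in the embedding space. A minimal sketch, where `encode` and `decode` are hypothetical stand-ins for the trained encoder and decoder (here replaced by toy functions so the snippet runs on its own):

```python
import numpy as np

def title_analogy(encode, decode, title, minus, plus):
    """Compute decode(encode(title) - encode(minus) + encode(plus))."""
    z = encode(title) - encode(minus) + encode(plus)
    return decode(z)

# Toy stand-ins: embed a title as (character count, number of spaces).
toy_encode = lambda s: np.array([len(s), s.count(" ")], dtype=float)
toy_decode = lambda z: tuple(z.tolist())

result = title_analogy(toy_encode, toy_decode, "a b c", "a b", "x y z w")
```

With the real model, `encode` would return the hidden state for a title and `decode` would run the decoder RNN from the combined vector.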

Future ideas include putting a dense layer before and after the hidden units, in order to get even more robust embeddings (right now they are the states of the 2 layers of RNNs concatenated). Another idea is to somehow separate “semantic” and “syntactic” aspects of the embedding, so that some dimensions would cover the subject matter, and others the grammatical structure that the idea is presented in.
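
To make the shapes concrete, here is a rough NumPy sketch of the dense-layer idea. Everything specific in it is an assumption: the hidden size of 16, the embedding size of 8, the tanh activation and the random untrained weights are placeholders, not values from the actual model.

```python
import numpy as np

def dense(x, weights, bias):
    """A single fully connected layer with a tanh activation (assumed)."""
    return np.tanh(x @ weights + bias)

rng = np.random.default_rng(0)
hidden_size, embed_size, num_layers = 16, 8, 2  # illustrative sizes

# Current embedding: the hidden states of the 2 RNN layers, concatenated.
layer_states = [rng.normal(size=hidden_size) for _ in range(num_layers)]
concatenated = np.concatenate(layer_states)  # shape (32,)

# Proposed: a dense layer down to the embedding, and one back up to the decoder states.
w_in = rng.normal(scale=0.1, size=(num_layers * hidden_size, embed_size))
w_out = rng.normal(scale=0.1, size=(embed_size, num_layers * hidden_size))
embedding = dense(concatenated, w_in, np.zeros(embed_size))
restored = dense(embedding, w_out, np.zeros(num_layers * hidden_size))
decoder_states = np.split(restored, num_layers)  # one initial state per decoder layer
```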


RNN-generated arXiv abstracts


I scraped the entirety of arXiv abstracts to do some experiments. To get started, I trained a char-rnn on all the q-bio abstracts and generated a bunch of synthetic abstracts. Some of the results were quite fun; see below:

Various brain areas reveal spatiotemporal activity patterns that repeat over time: resulting intracellular elements of genetic regulatory networks are quantified. Using a ” experimental study of neural networks, the framework of cellular Markov models to the importance of complexity induces a identification of challenges for understanding specialized biological structures.


Modelling forest composition function for meaningful laws in cortical networks, in the light of simplifying assumption of interaction networks with the same importance they exploit networks used by previous models in topological detail. Existing methods largely depend on a kinetic SIR model under physical networks. We have used the stationary law of overlapping phylogenetic tree distributions as a popular utility. Making use of eigenvalue laws and a scheme augmented along the population and eventually simplify a network .


It also tries to generate LaTeX, but it doesn’t get it quite right yet:

Geometry of DNA looping where the residence of 26 ‘ alleles diffusing out than amplitude distributions ( $ F ( x ) $ -test are abrupt at short times $ O ( n = 0.5 ) < $ ^ { 2+ } $ due to a balance matrix , and the synergism of the model and a statistical mechanics level comparable .

I experimented with generating arXiv categories and titles along with the abstracts:

Categories: q-bio.PE stat.AP stat.ME
Title: Joint Resolution Basis GDDA-BLAST Reaction: Mechanism of Biodynamics Waves problems in swarms
Abstract: In a reply that is robust from male molecules in the ecosystem and have presented to apply it to the city in proteomics evolution. An important entity presently processing a MS/MS spectrum outbreak, monitoring, requires on tests but not only an important difficulty in big datasets, opening gained from the usual graph lens and (human) sensitivity analysis. Future dimension test subjects are valid how the epigenetic basis for protein sequence directionality increases the increase within size state. We corrected review particular methods.
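
For a char-rnn to produce output in this shape, each training record presumably has to be serialised as plain text with the same field labels. A minimal sketch of such a formatter follows; the function name and the exact layout are assumptions based on the sample above, not the actual preprocessing code.

```python
def format_record(categories, title, abstract):
    """Serialise one paper as labelled plain-text lines for character-level training."""
    return (
        f"Categories: {' '.join(categories)}\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
    )

# Hypothetical record, for illustration only.
record = format_record(
    ["q-bio.PE", "stat.AP"],
    "A placeholder title",
    "A placeholder abstract.",
)
```

Concatenating many such records into one text file would let the network learn the field labels as just another part of the character sequence.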