Foundation models for reasoning on charts

Visual language is the form of communication that relies on pictorial symbols outside of text to convey information. It is ubiquitous in our digital life in the form of iconography, infographics, tables, plots, and charts, extending to the real world in street signs, comic books, food labels, etc. For that reason, having computers better understand this type of media can help with scientific communication and discovery, accessibility, and data transparency.


Cross posted from the Google AI blog


While computer vision models have made tremendous progress using learning-based solutions since the advent of ImageNet, the focus has been on natural images, where all sorts of tasks, such as classification, visual question answering (VQA), captioning, detection and segmentation, have been defined, studied and in some cases advanced to reach human performance. However, visual language has not garnered a similar level of attention, possibly because of the lack of large-scale training sets in this space. But over the last few years, new academic datasets have been created with the goal of evaluating question answering systems on visual language images, like PlotQA, InfographicsVQA, and ChartQA.

Example from ChartQA. Answering the question requires reading the information and computing the sum and the difference.

Existing models built for these tasks relied on integrating optical character recognition (OCR) information and the corresponding coordinates into larger pipelines, but the process is error prone, slow, and generalizes poorly. These methods prevailed because existing end-to-end computer vision models based on convolutional neural networks (CNNs) or transformers pre-trained on natural images could not be easily adapted to visual language. But such models are ill-prepared for the challenges of answering questions on charts, which include reading the relative heights of bars or the angles of slices in pie charts, understanding axis scales, correctly mapping pictograms to their legend values via colors, sizes and textures, and finally performing numerical operations with the extracted numbers.

In light of these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. MatCha, which stands for math and charts, is a pixels-to-text foundation model (a pre-trained model with built-in inductive biases that can be fine-tuned for multiple applications) trained on two complementary tasks: (a) chart de-rendering and (b) math reasoning. In chart de-rendering, given a plot or chart, the image-to-text model is required to generate its underlying data table or the code used to render it. For math reasoning pre-training, we pick textual numerical reasoning datasets and render the inputs into images, which the image-to-text model needs to decode to produce answers. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables. With these methods we surpass the previous state of the art on ChartQA by more than 20% and match the best summarization systems that have 1,000 times more parameters. Both papers will be presented at ACL 2023.

Chart de-rendering

Plots and charts are usually generated by an underlying data table and a piece of code. The code defines the overall layout of the figure (e.g., type, direction, color/shape scheme) and the underlying data table establishes the actual numbers and their groupings. Both the data and code are sent to a compiler/rendering engine to create the final image. To understand a chart, one needs to discover the visual patterns in the image and effectively parse and group them to extract the key information. Reversing the plot rendering process demands all such capabilities and can thus serve as an ideal pre-training task.

A chart created from a table in the Airbus A380 Wikipedia page using random plotting options. The pre-training task for MatCha consists of recovering the source table or the source code from the image.

In practice, it is challenging to simultaneously obtain charts, their underlying data tables, and their rendering code. To collect sufficient pre-training data, we independently accumulate [chart, code] and [chart, table] pairs. For [chart, code], we crawl all GitHub IPython notebooks with appropriate licenses and extract blocks with figures. A figure and the code block right before it are saved as a [chart, code] pair. For [chart, table] pairs, we explored two sources. For the first source, synthetic data, we manually write code to convert web-crawled Wikipedia tables from the TaPas codebase to charts. We sampled from and combined several plotting options depending on the column types. In addition, we also add [chart, table] pairs generated in PlotQA to diversify the pre-training corpus. The second source is web-crawled [chart, table] pairs. We directly use the [chart, table] pairs crawled in the ChartQA training set, containing around 20k pairs in total from four websites: Statista, Pew, Our World in Data, and OECD.
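
As a rough illustration of how such a synthetic [chart, table] pair can be produced, the sketch below renders a toy table with randomly sampled plotting options. The table contents and the specific options are made up for the example; they are not the actual Wikipedia-derived data or the option grid used for MatCha.

import random
import pandas as pd
import matplotlib.pyplot as plt

# A toy table standing in for a web-crawled Wikipedia table.
table = pd.DataFrame({
    "year": [2017, 2018, 2019, 2020],
    "orders": [45, 33, 18, 4],
})

# Randomly sample a few plotting options, in the spirit of the paper.
kind = random.choice(["bar", "line"])
color = random.choice(["tab:blue", "tab:orange", "tab:green"])

fig, ax = plt.subplots()
if kind == "bar":
    ax.bar(table["year"], table["orders"], color=color)
else:
    ax.plot(table["year"], table["orders"], color=color, marker="o")
ax.set_xlabel("year")
ax.set_ylabel("orders")
fig.savefig("chart.png")

# The pre-training target is the serialized source table (or the code above).
target = table.to_csv(index=False)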

Math reasoning

We incorporate numerical reasoning knowledge into MatCha by learning math reasoning skills from textual math datasets. We use two existing textual math reasoning datasets, MATH and DROP, for pre-training. MATH is synthetically created, containing two million training examples per module (type) of questions. DROP is a reading-comprehension–style QA dataset where the input is a paragraph context and a question.

To solve questions in DROP, the model needs to read the paragraph, extract relevant numbers and perform numerical computation. We found both datasets to be complementary. MATH contains a large number of questions across different categories, which helps us identify math operations needed to explicitly inject into the model. DROP’s reading-comprehension format resembles the typical QA format wherein models simultaneously perform information extraction and reasoning. In practice, we render inputs of both datasets into images. The model is trained to decode the answer.

To improve the math reasoning skills of MatCha we incorporate examples from MATH and DROP into the pre-training objective, by rendering the input text as images.
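
Below is a minimal sketch of what rendering a textual example into an image could look like with Pillow. The canvas size, font and wrapping are arbitrary choices for illustration, not the exact rendering used in MatCha pre-training.

import textwrap
from PIL import Image, ImageDraw

def render_text(text, width=640, height=256):
    """Render a DROP/MATH-style input as black text on a white image."""
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    # Wrap long passages so they fit the fixed canvas; uses Pillow's default font.
    wrapped = "\n".join(textwrap.wrap(text, width=80))
    draw.multiline_text((10, 10), wrapped, fill="black")
    return image

example = ("Passage: The team scored 14 points in the first half and 10 in "
           "the second. Question: How many points did the team score in total?")
render_text(example).save("math_example.png")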

End-to-end results

We use a Pix2Struct model backbone, which is an image-to-text transformer tailored for website understanding, and pre-train it with the two tasks described above. We demonstrate the strengths of MatCha by fine-tuning it on several visual language tasks — tasks involving charts and plots for question answering and summarization where no access to the underlying table is possible. MatCha surpasses previous models’ performance by a large margin and also outperforms the previous state of the art, which assumes access to underlying tables.

In the figure below, we first evaluate two baseline models that incorporate information from an OCR pipeline, which until recently was the standard approach for working with charts. The first is based on T5, the second on VisionTaPas. We also compare against PaLI-17B, which is a large (~1000 times larger than the other models) image plus text-to-text transformer trained on a diverse set of tasks but with limited capabilities for reading text and other forms of visual language. Finally, we report the Pix2Struct and MatCha model results.

Experimental results on two chart QA benchmarks, ChartQA & PlotQA (using relaxed accuracy), and a chart summarization benchmark, Chart-to-Text (using BLEU4). MatCha surpasses the state of the art by a large margin on QA compared to larger models, and matches these larger models on summarization.

For QA datasets, we use the official relaxed accuracy metric that allows for small relative errors in numerical outputs. For chart-to-text summarization, we report BLEU scores. MatCha achieves noticeably improved results compared to baselines for question answering, and comparable results to PaLI in summarization, where large size and extensive long text/captioning generation pre-training are advantageous for this kind of long-form text generation.
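
For concreteness, here is a small sketch of a relaxed-accuracy check in the spirit of the ChartQA metric: exact match for textual answers and a small relative tolerance for numeric ones. The 5% tolerance follows the commonly used setting; treat the details as an assumption rather than the official scorer.

def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Exact match for text answers, relative tolerance for numeric ones."""
    try:
        pred, tgt = float(prediction), float(target)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if tgt == 0:
        return pred == tgt
    return abs(pred - tgt) / abs(tgt) <= tolerance

def relaxed_accuracy(predictions, targets):
    matches = [relaxed_match(p, t) for p, t in zip(predictions, targets)]
    return sum(matches) / len(matches)

print(relaxed_accuracy(["42.1", "Peru"], ["42", "peru"]))  # 1.0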

Derendering plus large language model chains

While extremely performant for their number of parameters, particularly on extractive tasks, we observed that fine-tuned MatCha models could still struggle with end-to-end complex reasoning (e.g., mathematical operations involving large numbers or multiple steps). Thus, we also propose a two-step method to tackle this: 1) a model reads a chart, then outputs the underlying table, 2) a large language model (LLM) reads this output and then tries to answer the question solely based on the textual input.

For the first model, we fine-tuned MatCha solely on the chart-to-table task, increasing the output sequence length to guarantee it could recover all or most of the information in the chart. DePlot is the resulting model. In the second stage, any LLM (such as FlanPaLM or Codex) can be used for the task, and we can rely on the standard methods to increase performance on LLMs, for example chain-of-thought and self-consistency. We also experimented with program-of-thoughts where the model produces executable Python code to offload complex computations.
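
The glue code for this two-step setup can be as simple as the sketch below. Both `deplot_to_table` and `call_llm` are hypothetical placeholders for whatever image-to-text model and LLM API are available, and the prompt template is a simplified stand-in for the one-shot chain-of-thought prompts used in the paper.

# Hypothetical glue code for the two-step DePlot + LLM setup.
PROMPT_TEMPLATE = """Read the table and answer the question step by step.

Table:
{table}

Question: {question}
Answer:"""

def answer_chart_question(chart_image, question, deplot_to_table, call_llm):
    table = deplot_to_table(chart_image)          # step 1: chart -> table text
    prompt = PROMPT_TEMPLATE.format(table=table, question=question)
    return call_llm(prompt)                       # step 2: reason over the text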

An illustration of the DePlot+LLM method. This is a real example using FlanPaLM and Codex. The blue boxes are input to the LLM and the red boxes contain the answer generated by the LLMs. We highlight some of the key reasoning steps in each answer.

As shown in the example above, the DePlot model in combination with LLMs outperforms fine-tuned models by a significant margin, especially so in the human-sourced portion of ChartQA, where the questions are more natural but demand more difficult reasoning. Furthermore, DePlot+LLM can do so without access to any training data.

We have released the new models and code at our GitHub repo, where you can try them out yourself in Colab. Check out the papers for MatCha and DePlot for more details on the experimental results. We hope that our results can benefit the research community and make the information in charts and plots more accessible to everyone.

Acknowledgements

This work was carried out by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen and Yasemin Altun from our Language Team as part of Fangyu's internship project. Nigel Collier from Cambridge also was a collaborator. We would like to thank Joshua Howland, Alex Polozov, Shrestha Basu Mallick, Massimo Nicosia and William Cohen for their valuable comments and suggestions.

Learning to Reason Over Tables from Less Data

The task of recognizing textual entailment, also known as natural language inference, consists of determining whether a piece of text (the hypothesis) can be implied or contradicted (or neither) by another piece of text (the premise). While this problem is often considered an important test for the reasoning skills of machine learning (ML) systems and has been studied in depth for plain text inputs, much less effort has been put into applying such models to structured data, such as websites, tables, databases, etc. Yet, recognizing textual entailment is especially relevant whenever the contents of a table need to be accurately summarized and presented to a user, and is essential for high fidelity question answering systems and virtual assistants.

In "Understanding tables with intermediate pre-training", published in Findings of EMNLP 2020, we introduce the first pre-training tasks customized for table parsing, enabling models to learn better, faster and from less data. We build upon our earlier TAPAS model, which was an extension of the BERT bi-directional Transformer model with special embeddings to find answers in tables. Applying our new pre-training objectives to TAPAS yields a new state of the art on multiple datasets involving tables. On TabFact, for example, it reduces the gap between model and human performance by ~50%. We also systematically benchmark methods of selecting relevant input for higher efficiency, achieving 4x gains in speed and memory, while retaining 92% of the results. All the models for different tasks and sizes are released on GitHub repo, where you can try them out yourself in a colab Notebook.


Cross posted from the Google AI blog


Textual Entailment

The task of textual entailment is more challenging when applied to tabular data than plain text. Consider, for example, a table from Wikipedia with some sentences derived from its content. Assessing if the content of the table entails or contradicts the sentence may require looking over multiple columns and rows, and possibly performing simple numeric computations, like averaging, summing, differencing, etc.

A table together with some statements from TabFact. The content of the table can be used to support or contradict the statements.

Following the methods used by TAPAS, we encode the content of a statement and a table together, pass them through a Transformer model, and obtain a single number with the probability that the statement is entailed or refuted by the table.

The TAPAS model architecture uses a BERT model to encode the statement and the flattened table, read row by row. Special embeddings are used to encode the table structure. The vector output of the first token is used to predict the probability of entailment.
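
For readers who want to poke at this setup, fine-tuned TabFact checkpoints were later released through the Hugging Face transformers library. The sketch below assumes that API, the checkpoint name, and the entailment label index, so double-check the details against the version you have installed.

import pandas as pd
import torch
from transformers import TapasForSequenceClassification, TapasTokenizer

# Assumed checkpoint id and label convention (index 1 = "entailed").
model_name = "google/tapas-base-finetuned-tabfact"
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForSequenceClassification.from_pretrained(model_name)

table = pd.DataFrame({"player": ["Greg Norman", "Billy Mayfair"],
                      "rank": ["1", "1"]})  # TAPAS expects string cells
statement = "Greg Norman and Billy Mayfair tie in rank"

inputs = tokenizer(table=table, queries=[statement], return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1)[0, 1].item())  # probability of entailment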

Because the only information in the training examples is a binary value (i.e., "correct" or "incorrect"), training a model to understand whether a statement is entailed or not is challenging and highlights the difficulty in achieving generalization in deep learning, especially when the provided training signal is scarce. Seeing isolated entailed or refuted examples, a model can easily pick up on spurious patterns in the data to make a prediction, for example the presence of the word "tie" in "Greg Norman and Billy Mayfair tie in rank", instead of truly comparing their ranks, which is what is needed to successfully apply the model beyond the original training data.

Pre-training Tasks

Pre-training tasks can be used to “warm-up” models by providing them with large amounts of readily available unlabeled data. However, pre-training typically includes primarily plain text and not tabular data. In fact, TAPAS was originally pre-trained using a simple masked language modelling objective that was not designed for tabular data applications. In order to improve the model performance on tabular data, we introduce two novel pretraining binary-classification tasks called counterfactual and synthetic, which can be applied as a second stage of pre-training (often called intermediate pre-training).

In the counterfactual task, we source sentences from Wikipedia that mention an entity (person, place or thing) that also appears in a given table. Then, 50% of the time, we modify the statement by swapping the entity for another alternative. To make sure the statement is realistic, we choose a replacement among the entities in the same column in the table. The model is trained to recognize whether the statement was modified or not. This pre-training task includes millions of such examples, and although the reasoning about them is not complex, they typically will still sound natural.

For the synthetic task, we follow a method similar to semantic parsing in which we generate statements using a simple set of grammar rules that require the model to understand basic mathematical operations, such as sums and averages (e.g., "the sum of earnings"), or to understand how to filter the elements in the table using some condition (e.g.,"the country is Australia"). Although these statements are artificial, they help improve the numerical and logical reasoning skills of the model.

Example instances for the two novel pre-training tasks. Counterfactual examples swap entities mentioned in a sentence that accompanies the input table for a plausible alternative. Synthetic statements use grammar rules to create new sentences that require combining the information of the table in complex ways.
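
As a toy illustration of the synthetic task, here is how a grammar-based statement generator could look. The single sum template and the corruption scheme below are simplified stand-ins for the grammar described in the paper.

import random
import pandas as pd

def synthetic_statement(table):
    """Create a (statement, label) pair about the sum of a numeric column."""
    numeric = [c for c in table.columns if pd.api.types.is_numeric_dtype(table[c])]
    column = random.choice(numeric)
    true_value = table[column].sum()
    # Half of the statements are corrupted, so the model has to verify the value.
    is_entailed = random.random() < 0.5
    value = true_value if is_entailed else true_value + random.randint(1, 10)
    return f"the sum of {column} is {value}", is_entailed

table = pd.DataFrame({"country": ["Australia", "Peru"], "earnings": [120, 80]})
print(synthetic_statement(table))  # e.g. ('the sum of earnings is 200', True)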

Results

We evaluate the success of the counterfactual and synthetic pre-training objectives on the TabFact dataset by comparing to the baseline TAPAS model and to two prior models that have exhibited success in the textual entailment domain, LogicalFactChecker (LFC) and Structure Aware Transformer (SAT). The baseline TAPAS model exhibits improved performance relative to LFC and SAT, but the pre-trained model (TAPAS+CS) performs significantly better, achieving a new state of the art.

We also apply TAPAS+CS to question answering tasks on the SQA dataset, which requires that the model find answers from the content of tables in a dialog setting. The inclusion of CS objectives improves the previous best performance by more than 4 points, demonstrating that this approach also generalizes performance beyond just textual entailment.

Results on TabFact (left) and SQA (right). Using the synthetic and counterfactual datasets, we achieve new state-of-the-art results in both tasks by a large margin.

Data and Compute Efficiency

Another aspect of the counterfactual and synthetic pre-training tasks is that since the models are already tuned for binary classification, they can be applied without any fine-tuning to TabFact. We explore what happens to each of the models when trained only on a subset (or even none) of the data. Without looking at a single example, the TAPAS+CS model is competitive with a strong Table-BERT baseline, and when only 10% of the data are included, the results are comparable to the previous state of the art.

Dev accuracy on TabFact relative to the fraction of the training data used.

A general concern when trying to use large models such as this to operate on tables is that their high computational requirements make it difficult for them to parse very large tables. To address this, we investigate whether one can heuristically select subsets of the input to pass through the model in order to optimize its computational efficiency.

We conducted a systematic study of different approaches to filter the input and discovered that simple methods that select for word overlap between a full column and the subject statement give the best results. By dynamically selecting which tokens of the input to include, we can use fewer resources or work on larger inputs at the same cost. The challenge is doing so without losing important information and hurting accuracy.

For instance, the models discussed above all use sequences of 512 tokens, which is around the normal limit for a transformer model (although recent efficiency methods like the Reformer or Performer are proving effective in scaling the input size). The column selection methods we propose here can allow for faster training while still achieving high accuracy on TabFact. For 256 input tokens we get a very small drop in accuracy, but the model can now be pre-trained, fine-tuned and make predictions up to two times faster. With 128 tokens the model still outperforms the previous state-of-the-art model, with an even more significant speed-up — 4x faster across the board.

Accuracy on TabFact using different sequence lengths, by shortening the input with our column selection method.
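
To make the heuristic concrete, here is a minimal sketch of overlap-based column selection under a token budget. Whitespace tokenization, the scoring, and the budget value are simplifications of the methods compared in the paper.

def select_columns(statement, columns, budget=128):
    """Keep the columns whose tokens overlap most with the statement.

    `columns` maps a column name to its list of cell values; `budget` is a
    rough cap on the number of whitespace tokens kept from the table.
    """
    statement_tokens = set(statement.lower().split())

    def overlap(name, cells):
        column_tokens = set(name.lower().split())
        for cell in cells:
            column_tokens.update(str(cell).lower().split())
        return len(column_tokens & statement_tokens)

    ranked = sorted(columns.items(), key=lambda kv: overlap(*kv), reverse=True)
    selected, used = {}, 0
    for name, cells in ranked:
        cost = len(name.split()) + sum(len(str(c).split()) for c in cells)
        if selected and used + cost > budget:
            break
        selected[name] = cells
        used += cost
    return selected

columns = {"player": ["Greg Norman", "Billy Mayfair"], "rank": ["1", "1"],
           "country": ["Australia", "USA"]}
print(select_columns("Greg Norman and Billy Mayfair tie in rank", columns, budget=8))
# {'player': ['Greg Norman', 'Billy Mayfair'], 'rank': ['1', '1']}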

Using both the column selection method we proposed and the novel pre-training tasks, we can create table parsing models that need fewer data and less compute power to obtain better results.

We have made available the new models and pre-training techniques at our GitHub repo, where you can try them out yourself in Colab. In order to make this approach more accessible, we also shared models of varying sizes all the way down to “tiny”. It is our hope that these results will help spur development of table reasoning among the broader research community.

Acknowledgements

This work was carried out by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller from our Language Team in Zürich. We would like to thank Jordan Boyd-Graber, Yasemin Altun, Emily Pitler, Benjamin Boerschinger, Srini Narayanan, Slav Petrov, William Cohen and Jonathan Herzig for their useful comments and suggestions.

Efficient multi-lingual language model fine-tuning

Most of the world’s text is not in English. To enable researchers and practitioners to build impactful solutions in their domains, understanding how our NLP architectures fare in many languages needs to be more than an afterthought. In this post, we introduce our latest paper that studies multilingual text classification and introduces MultiFiT, a novel method based on ULMFiT. MultiFiT, trained on 100 labeled documents in the target language, outperforms multi-lingual BERT. It also outperforms the cutting-edge LASER algorithm—even though LASER requires a corpus of parallel texts, and MultiFiT does not.

Cross posted from the fast.ai blog

This is joint work by Sebastian Ruder, Piotr Czapla, Marcin Kardas, Sylvain Gugger, Jeremy Howard, and Julian Eisenschlos and benefits from the hundreds of insights into multilingual transfer learning from the whole fast.ai forum community. We invite you to read the full EMNLP 2019 paper or check out the code here.


Introduction

If you have ever worked on an NLP task in any language other than English, we feel your pain. The last couple of years have brought impressive progress in deep learning-based approaches for natural language processing tasks and there’s much to be excited about. However, those advances can be slow to transfer beyond English. In the past, most of academia showed little interest in publishing research or building datasets that go beyond the English language, even though industry applications desperately need language-agnostic techniques. Luckily, thanks to efforts around democratizing access to machine learning and initiatives such as the Bender rule, the tides are changing.

Existing approaches for cross-lingual NLP rely on either:

  • Parallel data across languages—that is, a corpus of documents with exactly the same contents, but written in different languages. This is very hard to acquire in a general setting.
  • A shared vocabulary—that is, a vocabulary that is common across multiple languages. This approach over-represents languages with a lot of data; for some examples, have a look at this blog post. An example is multilingual BERT, which is very resource-intensive to train and can struggle when languages are dissimilar.

The main appeal of cross-lingual models like multilingual BERT are their zero-shot transfer capabilities: given only labels in a high-resource language such as English, they can transfer to another language without any training data in that language. We argue that many low-resource applications do not provide easy access to training data in a high-resource language. Such applications include disaster response on social media, help desks that deal with community needs or support local business owners, etc. In such settings, it is often easier to collect a few hundred training examples in the low-resource language. The utility of zero-shot approaches in general is quite limited; by definition, if you are applying a model to some language, then you have some documents in that language. So it makes sense to use them to help train your model!

In addition, when the target language is very different to the source language (most often English), zero-shot transfer may perform poorly or fail altogether. We have seen this with cross-lingual word embeddings and more recently for multilingual BERT.

We show that we can fine-tune efficient monolingual language models that are competitive with multilingual BERT, in many languages, on a few hundred examples. Our proposed approach, Multilingual Fine-Tuning (MultiFiT), is different in a number of ways from the current mainstream of NLP models: We do not build on BERT, but leverage a more efficient variant of an LSTM architecture. Consequently, our approach is much cheaper to pretrain and more efficient in terms of space and time complexity. Lastly, we emphasize having nimble monolingual models vs. a monolithic cross-lingual one. We also show that we can achieve superior zero-shot transfer by using a cross-lingual model as the teacher. This highlights the potential of combining monolingual and cross-lingual information.

Our approach

Our method is based on Universal Language Model Fine-Tuning (ULMFiT). For more context, we invite you to check out the previous blog post that explains it in depth. MultiFiT extends ULMFiT to make it more efficient and more suitable for language modelling beyond English: It utilizes tokenization based on subwords rather than words and employs a QRNN rather than an LSTM. In addition, it leverages a number of other improvements.

Subword tokenization   ULMFiT uses word-based tokenization, which works well for morphologically poor languages like English, but results in very large and sparse vocabularies for morphologically rich languages, such as Polish and Turkish. Some languages such as Chinese don’t really even have the concept of a “word”, so they require heuristic segmentation approaches, which tend to be complicated, slow, and inaccurate. At the other extreme, as can be seen below, character-based models use individual characters as tokens. While in this case the vocabulary (and thus the number of parameters) can be small, such models require modelling longer dependencies and can thus be harder to train and less expressive than word-based models.

From character-based to word-based tokenization.

To mitigate this, similar to current neural machine translation models and pretrained language models like BERT and GPT-2, we employ SentencePiece subword tokenization, which has since been incorporated into the fast.ai text package. Subword tokenization strikes a balance between the two approaches by using a mixture of character, subword and word tokens, depending on how common they are.

This way we can have short (on average) representations of sentences, yet are still able to encode rare words. We use a unigram language model based on Wikipedia that learns a vocabulary of tokens together with their probability of occurrence. It assumes that tokens occur independently (hence the unigram in the name). During tokenization this method finds the most probable segmentation into tokens from the vocabulary. In the image below we show an example of tokenizing “_subwords” using a vocabulary trained on English Wikipedia (“_” is used by SentencePiece to denote a whitespace).

A graph of possible subword tokenizations for the token '_subwords'. The number next to each token is its negative log likelihood. The most probable tokenization corresponds to the shortest weighted path connecting the blue nodes (indicated in red).
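
For concreteness, here is a sketch of training and applying a unigram SentencePiece model with the sentencepiece package. The corpus file, vocabulary size and output paths are placeholders, and the keyword-based API shown may differ slightly between library versions.

import sentencepiece as spm

# "wiki.txt" and the vocabulary size are placeholders for a real Wikipedia dump.
spm.SentencePieceTrainer.train(
    input="wiki.txt", model_prefix="wiki_unigram",
    vocab_size=15000, model_type="unigram")

sp = spm.SentencePieceProcessor(model_file="wiki_unigram.model")
print(sp.encode("subwords", out_type=str))  # e.g. ['▁sub', 'word', 's']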

To sum up, subword tokenization has two very desirable properties for multilingual language modelling:

  1. Subwords more easily represent inflections, including common prefixes and suffixes and are thus well-suited for morphologically rich languages.
  2. Subword tokenization is a good fit for open-vocabulary problems and eliminates out-of-vocabulary tokens, as token coverage is close to 100%.

QRNN   ULMFiT used a state-of-the-art language model at the time, the AWD-LSTM. The AWD-LSTM is a regular LSTM with tuned dropout hyper-parameters. While recent state-of-the-art language models have been increasingly based on Transformers, such as the Transformer-XL, recurrent models still seem to have the edge on smaller datasets such as the Penn Treebank and WikiText-2.

To make our model more efficient, we replace the AWD-LSTM with a Quasi-Recurrent Neural Network (QRNN). The QRNN strikes a balance between a CNN and an LSTM: it can be parallelized across time and minibatch dimensions like a CNN and inherits the LSTM’s sequential bias, as the output depends on the order of elements in the sequence. Specifically, the QRNN alternates convolutional layers, which are parallel across timesteps, with a recurrent pooling function, which is parallel across channels.

We can see in the figure below how it differs from an LSTM and a CNN. In the LSTM, computation at each timestep depends on the results from the previous timestep (indicated by the non-continuous blocks), while CNNs and QRNNs are more easily parallelizable (indicated by the continuous blocks).

The computation structure of the QRNN compared with an LSTM and a CNN (Bradbury et al., 2019)

In our experiments, we obtain a 2-3x speed-up during training using QRNNs. QRNNs have been used in a number of applications in the past, such as state-of-the-art speech recognition.
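
To make the parallel/sequential split concrete, below is a compact NumPy sketch of the QRNN's fo-pooling recurrence. The convolutions that would normally produce the gates are replaced by random activations, so this is only an illustration of the pooling step, not a full layer.

import numpy as np

def fo_pool(z, f, o):
    """QRNN fo-pooling: c_t = f_t * c_{t-1} + (1 - f_t) * z_t and h_t = o_t * c_t.

    z, f, o have shape (timesteps, hidden) and would normally come from masked
    convolutions computed in parallel across time; only this loop is sequential,
    and it contains no matrix multiplications.
    """
    c = np.zeros_like(z[0])
    hidden = []
    for t in range(len(z)):
        c = f[t] * c + (1.0 - f[t]) * z[t]
        hidden.append(o[t] * c)
    return np.stack(hidden)

timesteps, hidden_size = 5, 4
z = np.tanh(np.random.randn(timesteps, hidden_size))
f = 1.0 / (1.0 + np.exp(-np.random.randn(timesteps, hidden_size)))  # sigmoid gates
o = 1.0 / (1.0 + np.exp(-np.random.randn(timesteps, hidden_size)))
print(fo_pool(z, f, o).shape)  # (5, 4)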

Other improvements   Instead of using ULMFiT’s slanted triangular learning rate schedule and gradual unfreezing, we achieve faster training and convergence by employing a cosine variant of the one-cycle policy that is available in the fast.ai library. Finally, we use label smoothing, which transforms the one-hot labels to a “smoother” distribution and has been found particularly useful when learning from noisy labels.
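
For reference, label smoothing itself is essentially a one-liner; the epsilon and number of classes below are arbitrary choices for the example.

import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Blend one-hot targets with the uniform distribution over classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

print(smooth_labels(np.array([[0.0, 1.0, 0.0, 0.0]])))  # [[0.025 0.925 0.025 0.025]]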

ULMFiT ensembles the predictions of a forward and backward language model. Even though bidirectionality has been found to be important in contextual word vectors, we did not see big improvements for our downstream tasks (text classification) with ELMo-style joint training. As joint training is quite memory-intensive and we emphasize efficiency, we opted to just train forward language models for all languages.

The full model can be seen in the below figure. It consists of a subword embedding layer, four QRNN layers, an aggregation layer, and two linear layers. The aggregation and linear layers are the same as used in ULMFiT.

The MultiFiT language model with a classifier head. The dimensionality of each layer can be seen in each box at the top (Figure 1 in the paper).

Results

We compare our model to state-of-the-art cross-lingual models including multilingual BERT and LASER (which uses parallel sentences) on two multilingual document classification datasets. Perhaps surprisingly, we find that our monolingual language models fine-tuned only on 100 labeled examples of the corresponding task in the target language outperform zero-shot inference (trained on 1000 examples in the source language) with multilingual BERT and LASER. MultiFiT also outperforms the other methods when all models are fine-tuned on 1000 target language examples.

For the detailed results, have a look at the paper.

Zero-shot Transfer with a Cross-lingual Teacher

Still, if a powerful cross-lingual model and labeled data in a high-resource language are available, it would be nice to make use of them in some way. To this end, we propose to use the classifier that is learned on top of the cross-lingual model on the source language data as a teacher to obtain labels for training our model on the target language. This way, we can perform zero-shot transfer using our monolingual language model by bootstrapping from a cross-lingual one.

To illustrate how this works, take a look at the following diagram:

The steps of the cross-lingual bootstrapping method for zero-shot cross-lingual transfer (Figure 2 in the paper).

The process consists of three main steps, sketched in code after the list:

  1. Our monolingual language model is pre-trained on Wikipedia data in the target language (a) and fine-tuned on in-domain data of the corresponding task (b).
  2. We now train a classifier on top of a cross-lingual model such as LASER using labelled data in a high-resource source language and perform zero-shot inference as usual with this classifier to predict labels on target language documents.
  3. In the final step (c), we can now use these predicted labels to fine-tune a classifier on top of our fine-tuned monolingual language model.
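
Schematically, the whole loop could be wired up as follows. Every callable passed in (pre-training, fine-tuning, the teacher's predictions, classifier training) is a hypothetical placeholder for the corresponding MultiFiT or LASER component, not a real API.

def bootstrap_zero_shot(wiki_corpus, task_corpus, target_documents,
                        pretrain_lm, finetune_lm, teacher_predict,
                        train_classifier):
    """Zero-shot transfer by distilling a cross-lingual teacher."""
    # (a) + (b): pre-train on target-language Wikipedia, fine-tune in-domain.
    language_model = finetune_lm(pretrain_lm(wiki_corpus), task_corpus)

    # Teacher step: zero-shot labels from the cross-lingual classifier that
    # was trained on labelled data in the high-resource source language.
    pseudo_labels = [teacher_predict(doc) for doc in target_documents]

    # (c): train the monolingual classifier on the teacher's predictions.
    return train_classifier(language_model,
                            list(zip(target_documents, pseudo_labels)))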

This is similar to distillation, which has recently been used to train smaller language models or distill task-specific information into downstream models. In contrast to previous work, we do not just seek to distill the information of a big model into a small model but into one with a different inductive bias. In addition to circumventing the need for labels in the target language, our approach thus brings another benefit: as the monolingual model is specialized to the target language, its inductive bias might be more suitable than the more language-agnostic representations learned by the cross-lingual model. It might thus be able to make better use of labels in the target language, even if they are noisy.

We obtain evidence for this hypothesis as the monolingual language model fine-tuned on zero-shot predictions outperforms its teacher in all settings.

Robustness to Noise

Another hypothesis for why this teaching works so well is that pre-training makes the monolingual language model robust to noise to some extent. The pre-trained information stored in the model may act as a regularizer, biasing it towards the correct labels that are in line with its knowledge of the language.

To test this, we compare a pre-trained language model with a non-pre-trained one, fine-tuning both on 1k or 10k labelled examples whose labels are perturbed with a probability ranging from 0 to 0.75, as shown in the diagram below.

Comparison of MultiFiT's robustness to label noise with and without pretraining. The red line shows the theoretical accuracy of a perfect model that achieves 100% accuracy with all labels.

As we can see, the pre-trained models are much more robust to label noise. Even with 30% noisy labels, they still maintain about the same performance, whereas the performance of the models without pre-training quickly decays. This highlights robustness to noise as an additional benefit of transfer learning and may facilitate faster crowd-sourcing and data annotation.

Next Steps

We are initially releasing seven pre-trained language models in German, Spanish, French, Italian, Japanese, Russian, and Chinese, since these are the languages in the datasets we studied. You can find the code here. We hope to release many more, with the help of the community.

The fast.ai community has been very helpful in collecting datasets in many more languages, and applying MultiFiT to them, nearly always with state-of-the-art results. Due to space limitations, those datasets were not included in the paper, and we opted to select well-used, balanced multi-lingual datasets. Special thanks to Aayush Yadav, Alexey Demyanchuk, Benjamin van der Burgh, Cahya Wirawan, Charin Polpanumas, Nirant Kasliwal and Tomasz Pietruszka.

Another interesting question to explore further is how very low-resource languages or dialects can benefit from larger corpora in similar languages. We are looking forward to seeing what problems you apply MultiFiT to, so don’t hesitate to ask and share your results in the fast.ai forums.

Neural Networks in 100 lines of pure Python

Can we build a Deep learning framework in pure Python and Numpy? Can we make it compact, clear and extendable? Let's set out to explore those ideas and see what we can create!


In today's day and age, there are multiple frameworks to choose from, with various important features such as auto-differentiation, graph-based optimized computation, and hardware acceleration. It's easy to take those features for granted, but every once in a while peeking under the hood can teach us what makes things work and what doesn't.

What we'll cover:

  • Automatic back-propagation
  • How to implement a few basic layers
  • How to create a training loop

If you want to jump straight to the code, feel free to scroll to the second half of the post or try it live in this Colab Notebook.

The design choices are heavily based on Thinc's Deep Learning library, since they came up with the smart idea of using closures and the stack to share and keep track of intermediate results while calculating gradients. This post is also inspired by Jeremy Howard's great PyTorch tutorial, but takes it one more step towards the bottom by getting rid of autograd.

A word on notation

Notation can often be a source of confusion, but it can also help us develop the right intuition. In the context of calculus for back propagation, we can focus on functions or on variables to think about derivatives. These are the edges and nodes in the computational graph. We will get to the same results either way, but I find that focusing on variables helps to make things more natural.

  • Let $f:\mathbb{R}^n\to\mathbb{R}$ and $x\in\mathbb{R}^n$, then the gradient is the $n$-dimensional row vector of partial derivatives $\frac{\partial f}{\partial x_j}(x)$
  • If $f:\mathbb{R}^n\to\mathbb{R}^m$ and $x\in\mathbb{R}^n$ then the Jacobian matrix is an $m\times n$ matrix such that the $i^{\text{th}}$ row is the gradient of $f$ restricted to the $i^{\text{th}}$ coordinate:
    $$Jf(x)_{ij} = \frac{\partial f_i}{\partial x_j}(x)$$
  • For a function $f$ and two vectors $a$ and $b$ such that $a = f(b)$, whenever $f$ is clear from context we can write $\frac{\partial a}{\partial b}$ to denote the Jacobian $Jf$, or gradient if $a$ is a real number.

The chain rule

If we have three vectors in three vector spaces $a\in A$, $b\in B$, and $c\in C$ and two differentiable functions $f:A\to B$ and $g:B\to C$ such that $f(a) = b$ and $g(b) = c$ we can get the Jacobian of the composition as a matrix product of the Jacobian of $f$ and $g$:

$$\frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a}$$

That's the chain rule in all its glory. The back-propagation algorithm, originally described in the '60s and '70s, is the application of the chain rule to calculate gradients of a real function with respect to its various parameters.

Keep in mind that our end goal is to find a minimum (hopefully global) of a function by taking steps in the opposite direction of the said gradient, because locally at least this will take it downwards. This is how it looks when we have two parameters to optimize.

Source: 7 Simple Steps To Visualize And Animate The Gradient Descent Algorithm

Reverse mode differentiation

If we compose many functions $f_i(a_i) = a_{i+1}$ instead of just two, then we can apply that formula repeatedly and conclude that

$$\frac{\partial a_n}{\partial a_1} = \frac{\partial a_n}{\partial a_{n-1}} \cdots \frac{\partial a_3}{\partial a_2}\cdot\frac{\partial a_2}{\partial a_1}$$

Of the many ways we can compute this product, the most common are from left to right or from right to left.

In practice, if $a_n$ is a scalar we will calculate the full gradient by starting with the row vector $\frac{\partial a_n}{\partial a_{n-1}}$ and multiplying by the Jacobians $\frac{\partial a_i}{\partial a_{i-1}}$ on the right one at a time. This operation is sometimes referred to as a VJP, or Vector-Jacobian Product. The whole process is known as reverse mode differentiation, because as intermediate results we gradually compute the gradients working from the last one $\frac{\partial a_n}{\partial a_{n-1}}$ backward through $\frac{\partial a_n}{\partial a_{n-2}}$, $\frac{\partial a_n}{\partial a_{n-3}}$, etc. In the end, we know the gradient of $a_n$ with respect to all other variables.

$$\frac{\partial a_n}{\partial a_{n-1}} \rightarrow \frac{\partial a_n}{\partial a_{n-2}} \rightarrow \cdots \rightarrow \frac{\partial a_n}{\partial a_1}$$

In contrast, forward mode does the exact opposite. It starts from one Jacobian like $\frac{\partial a_2}{\partial a_1}$ and multiplies on the left by $\frac{\partial a_3}{\partial a_2}$ to calculate $\frac{\partial a_3}{\partial a_1}$. If we keep multiplying by $\frac{\partial a_i}{\partial a_{i-1}}$ we'll gradually get the derivatives of all other variables with respect to $a_1$. When $a_1$ is a scalar, all the matrices on the right side of the products are column vectors, and this is called a Jacobian-Vector Product (or JVP).

$$\frac{\partial a_n}{\partial a_1} \leftarrow\cdots\leftarrow \frac{\partial a_3}{\partial a_1} \leftarrow \frac{\partial a_2}{\partial a_1}$$

For back propagation, as you might have guessed, we are interested in the first of these two, since there's a single value we want on top, the loss function, derived with respect to multiple variables, the weights. Forward mode could still be used to compute those gradients, but it would be far less efficient since we would have to repeat the process multiple times.

This sounds like a lot of high-dimensional matrix products, but the trick is that very often the Jacobian matrices will be sparse, block-structured or even diagonal, and because we only care about the result of multiplying them by a row vector on the left, we won't need to compute or store them explicitly.
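
As a tiny NumPy illustration of why these Jacobians are never materialized, consider an element-wise operation: its Jacobian is diagonal, so the VJP collapses to an element-wise product. The snippet below is only a sanity check, not part of the framework we build later in the post.

import numpy as np

x = np.random.randn(5)
y = np.tanh(x)                 # an element-wise "layer"
upstream = np.random.randn(5)  # row vector: gradient of the loss w.r.t. y

# Building the diagonal Jacobian explicitly, only to check the shortcut.
jacobian = np.diag(1 - y ** 2)
vjp_explicit = upstream @ jacobian

# What a backward function actually computes: no matrix is ever formed.
vjp_cheap = upstream * (1 - y ** 2)
print(np.allclose(vjp_explicit, vjp_cheap))  # True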

In this article, we are focusing on a sequential composition of layers, but the same ideas apply to any algorithm or computational graph that we use to get to the result. Yet another nice description of reverse and forward mode in a more general context is given here.

In the typical setting for supervised machine learning, we have a big complex function that takes in a tensor of numerical features for our labeled samples, and several tensors that correspond to weights that characterize the model.

The loss, as a scalar function of both the samples and weights, is a measure of how far the model output was from the expected labels. We want to minimize it to find the most suitable weights. In deep learning, this function is expressed as a composition of simpler functions, each of which is easy to differentiate. All of them but the last are what we refer to as layers, and each layer usually has two sets of parameters: the inputs (which can be the output of the previous layer) and the weights.

The last function is the loss metric, which also has two sets of parameters: the model output $y$ and the true labels $\hat{y}$. For example, if the loss metric $l$ is the mean square error, then $\frac{\partial l}{\partial y}$ is $\frac{2}{n}(y - \hat{y})$, where $n$ is the number of elements. The gradient of the loss will be the starting row vector to apply reverse mode differentiation.

Autograd

The ideas behind Automatic Differentiation are quite mature. It can be done at runtime or during compilation, which can have a dramatic impact on performance. I recommend the HIPS autograd Python package for a thorough explanation of some of the concepts.

The core idea, however, is always the same, and we have known it ever since we started computing derivatives in school. If we can keep track of the computations that resulted in the final scalar output and we know how to differentiate each simple operation (sums and products, powers, exponentials, and logarithms, ...), we can recover the gradient of the output.

Let's say that we have some intermediate linear layer $f$ which is characterized by a matrix multiplication (let's leave the bias out for a sec):

$$y = f(x, w) = x\cdot w$$

In order to tune the values of $w$ with gradient descent, we need to compute the gradient $\frac{\partial l}{\partial w}$. The key observation is that knowing how changes in $y$ will affect $l$ is enough to do that.

The contract that any layer has to satisfy is the following: if I tell you the gradient of the loss with respect to your outputs, you can tell me the gradient with respect to your inputs, which are in turn the previous layer outputs.

Now we can apply the chain rule twice: for the gradient with respect to $w$ we get that

$$\frac{\partial l}{\partial w} =\frac{\partial l}{\partial y}\cdot \frac{\partial y}{\partial w} = x^t \cdot \frac{\partial l}{\partial y} $$

and with respect to $x$

$$\frac{\partial l}{\partial x} =\frac{\partial l}{\partial y}\cdot \frac{\partial y}{\partial x} = \frac{\partial l}{\partial y} \cdot w^t $$

So we can both pass a gradient backward to enable previous layers to update themselves and update the internal layer weights to optimize the loss, and that's it!

Implementation time

Let's look at the code, or try it live in this Colab Notebook. We can start with a class that encapsulates a tensor and its gradients

import numpy as np

class Parameter():
  def __init__(self, tensor):
    self.tensor = tensor
    self.gradient = np.zeros_like(self.tensor)

Now we can create the layer class. The key idea is that during a forward pass we return both the layer output and a function that can receive the gradient of the loss with respect to the outputs and return the gradient with respect to the inputs, updating the weight gradients in the process.

This is because while evaluating the model layer by layer there's no way to calculate the gradients, since we don't know the final loss yet; instead, the best thing we can do is return a function that can calculate the gradient later. That function will only be called after we complete the forward evaluation, when we know the loss and have all the necessary information to compute the gradients in that layer.

The training process then has three steps: calculate the forward pass, run the backward steps to accumulate the gradients, and finally update the weights. It’s important to do the update at the end, since weights can be reused in multiple layers and we don’t want to mutate them prematurely.

class Layer:
  def __init__(self):
    self.parameters = []

  def forward(self, X):
    """
    Override me! A simple no-op layer, it passes forward the inputs
    """
    return X, lambda D: D

  def build_param(self, tensor):
    """
    Creates a parameter from a tensor, and saves a reference for the update step
    """
    param = Parameter(tensor)
    self.parameters.append(param)
    return param

  def update(self, optimizer):
    for param in self.parameters: optimizer.update(param)

It's standard to delegate the job of updating the parameters to an optimizer, which receives an instance of a parameter after every batch. The simplest and best-known optimization method out there is mini-batch stochastic gradient descent.

class SGDOptimizer():
  def __init__(self, lr=0.1):
    self.lr = lr

  def update(self, param):
    param.tensor -= self.lr * param.gradient
    param.gradient.fill(0)

Under this framework, and using the results we computed earlier, the linear layer looks like this snippet of code. We are using NumPy's overloaded @ operator for matrix multiplication.

class Linear(Layer):
  def __init__(self, inputs, outputs):
    super().__init__()
    tensor = np.random.randn(inputs, outputs) * np.sqrt(1 / inputs)
    self.weights = self.build_param(tensor)
    self.bias = self.build_param(np.zeros(outputs))

  def forward(self, X):
    def backward(D):
      self.weights.gradient += X.T @ D
      self.bias.gradient += D.sum(axis=0)
      return D @ self.weights.tensor.T
    return X @ self.weights.tensor +  self.bias.tensor, backward

Now, the next most used types of layers are activations, which are non-linear point-wise functions. The Jacobian of a point-wise function is diagonal, which means that when multiplied by the gradient it also acts as a point-wise multiplication.

class ReLu(Layer):
  def forward(self, X):
    mask = X > 0
    return X * mask, lambda D: D * mask

A little harder is to compute the derivative of a Sigmoid, which is again applied pointwise:

class Sigmoid(Layer):
  def forward(self, X):
    S = 1 / (1 + np.exp(-X))
    def backward(D):
      return D * S * (1 - S)
    return S, backward

When we have many layers in a sequence, as we traverse them and get the successive outputs we can save the backward functions in a list that will be used in the reverse order, to get all the way to the gradient with respect to the first layer's inputs and return it. This is where the magic happens:

class Sequential(Layer):
  def __init__(self, *layers):
    super().__init__()
    self.layers = layers
    for layer in layers:
      self.parameters.extend(layer.parameters)

  def forward(self, X):
    backprops = []
    Y = X
    for layer in self.layers:
      Y, backprop = layer.forward(Y)
      backprops.append(backprop)
    def backward(D):
      for backprop in reversed(backprops): 
        D = backprop(D)
      return D
    return Y, backward

As we mentioned earlier, we will need a way to calculate the loss function associated with a batch of samples, and its gradient. One example would be MSE loss, typically used in regression problems, and it can be implemented in this manner:

def mse_loss(Yp, Yt):
  diff = Yp - Yt
  return np.square(diff).mean(), 2 * diff / len(diff)

Almost there now: we have two types of layers and a way to combine them, so what does the training loop look like? We can use an API similar to scikit-learn or Keras.

class Learner():
  def __init__(self, model, loss, optimizer):
    self.model = model
    self.loss = loss
    self.optimizer = optimizer

  def fit_batch(self, X, Y):
    Y_, backward = self.model.forward(X)
    L, D = self.loss(Y_, Y)
    backward(D)
    self.model.update(self.optimizer)
    return L

  def fit(self, X, Y, epochs, bs):
    losses = []
    for epoch in range(epochs):
      p = np.random.permutation(len(X))
      X, Y = X[p], Y[p]
      loss = 0.0
      for i in range(0, len(X), bs):
        loss += self.fit_batch(X[i:i + bs], Y[i:i + bs])
      losses.append(loss)
    return losses

And that is all. If you were keeping track, we even have a few lines of code to spare.

But does it work?

Thought you would never ask. We can start testing with some synthetic data.

X = np.random.randn(100, 10)
W = np.random.randn(10, 1)
B = np.random.randn(1)
Y = X @ W + B

model = Linear(10, 1)
learner = Learner(model, mse_loss, SGDOptimizer(lr=0.05))
learner.fit(X, Y, epochs=10, bs=10)
Training loss in 10 epochs

We can also check that the learned weights coincide with the true ones

print(np.linalg.norm(model.weights.tensor - W), (model.bias.tensor - B)[0])
> 1.848553648022619e-05 5.69305886743976e-06

OK, so that was easy. Let's come up with a non-linear dataset, how about $y = x_1 x_2$? We will add a Sigmoid non-linearity and another linear layer to make our model more expressive. Here we go:

import matplotlib.pyplot as plt

X = np.random.randn(1000, 2)
Y = (X[:, 0] * X[:, 1]).reshape(-1, 1)  # column vector, matching the model output shape

losses1 = Learner(
    Sequential(Linear(2, 1)), 
    mse_loss, 
    SGDOptimizer(lr=0.01)
).fit(X, Y, epochs=50, bs=50)

losses2 = Learner(
    Sequential(
        Linear(2, 10), 
        Sigmoid(), 
        Linear(10, 1)
    ), 
    mse_loss, 
    SGDOptimizer(lr=0.3)
).fit(X, Y, epochs=50, bs=50)

plt.plot(losses1)
plt.plot(losses2)
plt.legend(['1 Layer', '2 Layers'])
plt.show()
Training loss comparing a one layer model vs. a two layer model with a sigmoid activation

Wrapping Up

I hope you have found this instructive. We only defined three types of layers and one loss function, so there's much more to be done. In a follow-up post, we will implement binary cross entropy loss as well as other non-linear activations to start building more expressive models. Stay tuned...

Reach out on Twitter at @eisenjulian for questions and requests. Thanks for reading!

References

[1] Thinc Deep Learning Library https://github.com/explosion/thinc
[2] PyTorch Tutorial https://pytorch.org/tutorials/beginner/nn_tutorial.html
[3] Calculus on Computational Graphs http://colah.github.io/posts/2015-08-Backprop/
[4] HIPS Autograd https://github.com/HIPS/autograd

Text Classification with TensorFlow Estimators

Throughout this post we will explain how to classify text using Estimators, Datasets and Feature Columns, with a scalable high-level API in TensorFlow.

Posted by Sebastian Ruder and Julian Eisenschlos, Google Developer Experts

Here’s the outline of what we’ll cover:

  • Loading data using Datasets.
  • Building baselines using pre-canned estimators.
  • Using word embeddings.
  • Building custom estimators with convolution and LSTM layers.
  • Loading pre-trained word vectors.
  • Evaluating and comparing models using TensorBoard.

Welcome to Part 4 of a blog series that introduces TensorFlow Datasets and Estimators. You don’t need to read all of the previous material, but take a look if you want to refresh any of the following concepts. Part 1 focused on pre-made Estimators, Part 2 discussed feature columns, and Part 3 showed how to create custom Estimators.

Here in Part 4, we will build on top of all the above to tackle a different family of problems in Natural Language Processing (NLP). In particular, this article demonstrates how to solve a text classification task using custom TensorFlow estimators, embeddings, and the tf.layers module. Along the way, we’ll learn about word2vec and transfer learning as a technique to bootstrap model performance when labeled data is a scarce resource.

We will show you relevant code snippets. Here’s the complete Jupyter Notebook that you can run locally or on Google Colaboratory. The plain .py source file is also available here. Note that the code was written to demonstrate how Estimators work functionally and was not optimized for maximum performance.

The Task

The dataset we will be using is the IMDB Large Movie Review Dataset, which consists of 25,000 highly polar movie reviews for training, and 25,000 for testing. We will use this dataset to train a binary classification model, able to predict whether a review is positive or negative.

For illustration, here’s a piece of a negative review (with 2 stars) in the dataset:

Now, I LOVE Italian horror films. The cheesier they are, the better. However, this is not cheesy Italian. This is week-old spaghetti sauce with rotting meatballs. It is amateur hour on every level. There is no suspense, no horror, with just a few drops of blood scattered around to remind you that you are in fact watching a horror film.

Keras provides a convenient handler for importing the dataset, which is also available as a serialized numpy array .npz file to download here. For text classification, it is standard to limit the size of the vocabulary to prevent the dataset from becoming too sparse and high dimensional, causing potential overfitting. For this reason, each review consists of a series of word indexes that go from 4 (the most frequent word in the dataset: the) to 4999, which corresponds to orange. Index 1 represents the beginning of the sentence and index 2 is assigned to all unknown (also known as out-of-vocabulary or OOV) tokens. These indexes have been obtained by pre-processing the text data in a pipeline that cleans, normalizes and tokenizes each sentence first and then builds a dictionary indexing each of the tokens by frequency.

After we’ve loaded the data in memory, we pad each of the sentences with 0 to a fixed size (here: 200), so that we have two two-dimensional arrays of shape $25000 \times 200$ for training and testing respectively.

import numpy as np
import tensorflow as tf

# The IMDB dataset and the padding helpers ship with tf.keras.
imdb = tf.keras.datasets.imdb
sequence = tf.keras.preprocessing.sequence

vocab_size = 5000
sentence_size = 200
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
x_train = sequence.pad_sequences(
    x_train_variable, 
    maxlen=sentence_size, 
    padding='post', 
    value=0)
x_test = sequence.pad_sequences(
    x_test_variable,
    maxlen=sentence_size, 
    padding='post', 
    value=0)

Input Functions

The Estimator framework uses input functions to split the data pipeline from the model itself. Several helper methods are available to create them, whether your data is in a .csv file or a pandas.DataFrame, and whether or not it fits in memory. In our case, we can use Dataset.from_tensor_slices for both the train and test sets.

x_len_train = np.array([min(len(x), sentence_size) for x in x_train_variable])
x_len_test = np.array([min(len(x), sentence_size) for x in x_test_variable])

def parser(x, length, y):
    features = {"x": x, "len": length}
    return features, y

def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

We shuffle the training data and do not predefine the number of epochs we want to train for, while we only need one epoch of the test data for evaluation. We also add an additional "len" key that captures the length of the original, unpadded sequence, which we will use later.

Datasets can work with out-of-memory sources (not needed in this case) by streaming records one by one, and the shuffle method uses a buffer_size argument to continuously sample from a fixed-size buffer without loading the entire dataset into memory.
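
Although the IMDB arrays fit comfortably in memory, the same input-function pattern extends to file-backed data. The sketch below is purely illustrative and assumes hypothetical CSV shards named reviews-*.csv; the parsing step is left as a placeholder.

def streaming_input_fn():
    # Stream records line by line from (hypothetical) CSV shards instead of
    # building the dataset from in-memory tensors.
    files = tf.data.Dataset.list_files('reviews-*.csv')
    dataset = files.interleave(tf.data.TextLineDataset, cycle_length=4)
    # The shuffle buffer bounds memory use while still mixing records.
    dataset = dataset.shuffle(buffer_size=10000)
    # A dataset.map(...) call would parse each line into features and a label here.
    dataset = dataset.batch(100)
    return dataset.make_one_shot_iterator().get_next()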

Building a baseline

It’s good practice to start any machine learning project by trying basic baselines. The simpler the better: having a simple and robust baseline is key to understanding exactly how much performance we gain by adding extra complexity. It may very well be the case that a simple solution is good enough for our requirements.

With that in mind, let us start by trying out one of the simplest models for text classification. That would be a sparse linear model that gives a weight to each token and adds up all of the results, regardless of the order. As this model does not care about the order of words in a sentence, we normally refer to it as a Bag-of-Words approach. Let’s see how we can implement this model using an Estimator.

We start out by defining the feature column that is used as input to our classifier. As we saw in Part 2, categorical_column_with_identity is the right choice for this pre-processed text input. If we were feeding raw text tokens, other feature_columns could do a lot of the pre-processing for us (a sketch of one such option follows the next snippet). We can now use the pre-made LinearClassifier.

column = tf.feature_column.categorical_column_with_identity('x', vocab_size)
classifier = tf.estimator.LinearClassifier(
    feature_columns=[column], 
    model_dir=os.path.join(model_dir, 'bow_sparse'))
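
For reference, this is roughly what such columns could look like if the input were raw token strings; the feature key 'tokens' and the vocabulary file are hypothetical, not part of this post’s pipeline.

# Hash raw string tokens into vocab_size buckets, no explicit vocabulary needed.
raw_tokens_column = tf.feature_column.categorical_column_with_hash_bucket(
    'tokens', hash_bucket_size=vocab_size, dtype=tf.string)
# Alternatively, look tokens up in an explicit vocabulary file (one token per line).
vocab_tokens_column = tf.feature_column.categorical_column_with_vocabulary_file(
    'tokens', vocabulary_file='vocab.txt', num_oov_buckets=1)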

Finally, we create a simple function that trains the classifier and additionally creates a precision-recall curve. As we do not aim to maximize performance in this blog post, we only train our models for $25,000$ steps.

from tensorboard import summary as summary_lib  # assumed import; pr_curve is provided by the tensorboard package

def train_and_evaluate(classifier):
    classifier.train(input_fn=train_input_fn, steps=25000)
    eval_results = classifier.evaluate(input_fn=eval_input_fn)
    predictions = np.array([p['logistic'][0] for p in classifier.predict(input_fn=eval_input_fn)])
    tf.reset_default_graph() 
    # Add a PR summary in addition to the summaries that the classifier writes
    pr = summary_lib.pr_curve('precision_recall', predictions=predictions, labels=y_test.astype(bool), num_thresholds=21)
    with tf.Session() as sess:
        writer = tf.summary.FileWriter(os.path.join(classifier.model_dir, 'eval'), sess.graph)
        writer.add_summary(sess.run(pr), global_step=0)
        writer.close()
        
train_and_evaluate(classifier)

One of the benefits of choosing a simple model is that it is much more interpretable. The more complex a model, the harder it is to inspect and the more it tends to work like a black box. In this example, we can load the weights from our model’s last checkpoint and take a look at what tokens correspond to the biggest weights in absolute value. The results look like what we would expect.

# Load the tensor with the model weights
weights = classifier.get_variable_value('linear/linear_model/x/weights').flatten()
# Sort the token indexes by weight and keep the extremes (most negative and most positive)
sorted_indexes = np.argsort(weights)
extremes = np.concatenate((sorted_indexes[-8:], sorted_indexes[:8]))
# word_inverted_index is a dictionary that maps from indexes back to tokens
extreme_weights = sorted(
    [(weights[i], word_inverted_index[i]) for i in extremes])
# Create plot
y_pos = np.arange(len(extreme_weights))
plt.bar(y_pos, [pair[0] for pair in extreme_weights], align='center', alpha=0.5)
plt.xticks(y_pos, [pair[1] for pair in extreme_weights], rotation=45, ha='right')
plt.ylabel('Weight')
plt.title('Most significant tokens') 
plt.show()
[Figure: bar plot of the tokens with the largest positive and negative weights]

As we can see, tokens with the most positive weight such as ‘refreshing’ are clearly associated with positive sentiment, while tokens that have a large negative weight unarguably evoke negative emotions. A simple but powerful modification that one can do to improve this model is weighting the tokens by their tf-idf scores.
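
One way to wire that in, sketched here under the assumption that the input function also emits one weight per token position under a hypothetical 'tf_idf' key, is tf.feature_column.weighted_categorical_column:

# The weights feature must have the same shape as the 'x' token ids.
weighted_column = tf.feature_column.weighted_categorical_column(
    categorical_column=tf.feature_column.categorical_column_with_identity(
        'x', vocab_size),
    weight_feature_key='tf_idf')
tfidf_classifier = tf.estimator.LinearClassifier(
    feature_columns=[weighted_column],
    model_dir=os.path.join(model_dir, 'bow_sparse_tfidf'))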

Embeddings

The next step in complexity we can add is word embeddings. Embeddings are a dense, low-dimensional representation of sparse, high-dimensional data. This allows our model to learn a more meaningful representation of each token, rather than just an index. While an individual dimension is not meaningful, the low-dimensional space (when learned from a large enough corpus) has been shown to capture relations such as tense, plural, gender, thematic relatedness, and many more. We can add word embeddings by converting our existing feature column into an embedding_column. The representation seen by the model is the mean of the embeddings for each token (see the combiner argument in the docs). We can plug the embedded features into a pre-canned DNNClassifier.

A note for the keen observer: an embedding_column is just an efficient way of applying a fully connected layer to the sparse binary feature vector of tokens, which is multiplied by a constant depending on the chosen combiner. A direct consequence of this is that it wouldn’t make sense to use an embedding_column directly in a LinearClassifier, because two consecutive linear layers without non-linearities in between add no predictive power to the model, unless of course the embeddings are pre-trained.

embedding_size = 50
word_embedding_column = tf.feature_column.embedding_column(
    column, dimension=embedding_size)
classifier = tf.estimator.DNNClassifier(
    hidden_units=[100],
    feature_columns=[word_embedding_column], 
    model_dir=os.path.join(model_dir, 'bow_embeddings'))
train_and_evaluate(classifier)

We can use TensorBoard to visualize our 50-dimensional word vectors projected into $\mathbb{R}^3$ using t-SNE. We expect similar words to be close to each other. This can be a useful way to inspect our model weights and find unexpected behaviors.
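
To make the projector label each point with its token, one can drop a metadata file next to the checkpoints. The sketch below reuses the word_inverted_index dictionary from earlier; note that the embedding tensor name is only an educated guess at how the DNNClassifier scopes its variables, so it may need adjusting.

from tensorflow.contrib.tensorboard.plugins import projector

# Write one token per line, in vocabulary-index order.
with open(os.path.join(classifier.model_dir, 'metadata.tsv'), 'w') as f:
    for i in range(vocab_size):
        f.write(word_inverted_index.get(i, '<OOV>') + '\n')

config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# Assumed variable name; inspect the checkpoint to find the exact one.
embedding.tensor_name = 'dnn/input_from_feature_columns/input_layer/x_embedding/embedding_weights'
embedding.metadata_path = 'metadata.tsv'
writer = tf.summary.FileWriter(classifier.model_dir)
projector.visualize_embeddings(writer, config)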

[Figure: TensorBoard Projector view of the learned word embeddings, projected with t-SNE]

At this point one possible approach would be to go deeper, adding more fully connected layers and playing around with layer sizes and training functions. However, by doing that we would add extra complexity while ignoring important structure in our sentences. Words do not live in a vacuum: meaning is compositional, formed by a word and its neighbors.

Convolutions are one way to take advantage of this structure, similar to how we can model salient clusters of pixels for image classification. The intuition is that certain sequences of words, or n-grams, usually have the same meaning regardless of their overall position in the sentence. Introducing a structural prior via the convolution operation allows us to model the interaction between neighboring words and consequently gives us a better way to represent such meaning.

The following image shows how a filter matrix $F$ of shape $d \times m$ slides across each 3-gram window of tokens to build a new feature map. Afterwards, a pooling layer is usually applied to combine adjacent results.

[Figure: a convolutional filter sliding over a window of token embeddings to produce a feature map, followed by pooling]
Source: Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks by Severyn et al. [2015]

Let us look at the full model architecture. The use of dropout layers is a regularization technique that makes the model less likely to overfit.

[Figure: the full model architecture: embedding, dropout, 1-D convolution, max pooling, a dense hidden layer, dropout, and the output layer]

As seen in previous blog posts, the tf.estimator framework provides a high-level API for training machine learning models, defining train(), evaluate() and predict() operations, handling checkpointing, loading, initializing, serving, building the graph and the session out of the box. There is a small family of pre-made estimators, like the ones we used earlier, but it’s most likely that you will need to build your own.

Writing a custom estimator means writing a model_fn(features, labels, mode, params) that returns an EstimatorSpec. The first step will be mapping the features into our embedding layer:

input_layer = tf.contrib.layers.embed_sequence(
    features['x'], 
    vocab_size, 
    embedding_size,
    initializer=params['embedding_initializer'])

Then we use tf.layers to process each output sequentially.

training = (mode == tf.estimator.ModeKeys.TRAIN)
dropout_emb = tf.layers.dropout(inputs=input_layer, 
                                rate=0.2, 
                                training=training)
conv = tf.layers.conv1d(
    inputs=dropout_emb,
    filters=32,
    kernel_size=3,
    padding="same",
    activation=tf.nn.relu)
pool = tf.reduce_max(input_tensor=conv, axis=1)
hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)
dropout_hidden = tf.layers.dropout(inputs=hidden, rate=0.2, training=training)
logits = tf.layers.dense(inputs=dropout_hidden, units=1)

Finally, we will use a Head to simplify the writing of our last part of the model_fn. The head already knows how to compute predictions, loss, train_op, metrics and export outputs, and can be reused across models. This is also used in the pre-made estimators and provides us with the benefit of a uniform evaluation function across all of our models. We will use binary_classification_head, which is a head for single label binary classification that uses sigmoid_cross_entropy_with_logits as the loss function under the hood.

head = tf.contrib.estimator.binary_classification_head()
optimizer = tf.train.AdamOptimizer()    
def _train_op_fn(loss):
    tf.summary.scalar('loss', loss)
    return optimizer.minimize(
        loss=loss,
        global_step=tf.train.get_global_step())

return head.create_estimator_spec(
    features=features,
    labels=labels,
    mode=mode,
    logits=logits,
    train_op_fn=_train_op_fn)
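
Putting the pieces together, the whole custom estimator might read as follows. This is just the snippets above assembled into a single function, named cnn_model_fn so it matches the later sections; it is a sketch rather than a verbatim copy of the notebook.

def cnn_model_fn(features, labels, mode, params):
    # Map token ids to dense vectors, as shown earlier.
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])

    training = (mode == tf.estimator.ModeKeys.TRAIN)
    dropout_emb = tf.layers.dropout(inputs=input_layer, rate=0.2, training=training)
    conv = tf.layers.conv1d(inputs=dropout_emb, filters=32, kernel_size=3,
                            padding='same', activation=tf.nn.relu)
    pool = tf.reduce_max(input_tensor=conv, axis=1)
    hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)
    dropout_hidden = tf.layers.dropout(inputs=hidden, rate=0.2, training=training)
    logits = tf.layers.dense(inputs=dropout_hidden, units=1)

    # The head takes care of predictions, loss, metrics and export outputs.
    head = tf.contrib.estimator.binary_classification_head()
    optimizer = tf.train.AdamOptimizer()

    def _train_op_fn(loss):
        tf.summary.scalar('loss', loss)
        return optimizer.minimize(
            loss=loss, global_step=tf.train.get_global_step())

    return head.create_estimator_spec(
        features=features, labels=labels, mode=mode,
        logits=logits, train_op_fn=_train_op_fn)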

Running this model is just as easy as before:

initializer = tf.random_uniform_initializer(-1.0, 1.0)
params = {'embedding_initializer': initializer}
cnn_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn,
                                        model_dir=os.path.join(model_dir, 'cnn'),
                                        params=params)
train_and_evaluate(cnn_classifier)

LSTM Networks

Using the Estimator API and the same model head, we can also create a classifier that uses a Long Short-Term Memory (LSTM) cell instead of convolutions. Recurrent models such as this are some of the most successful building blocks for NLP applications. An LSTM processes the entire document sequentially, recursing over the sequence with its cell while storing the current state of the sequence in its memory.

One of the drawbacks of recurrent models compared to CNNs is that, because of the nature of recursion, they turn out deeper and more complex, which usually means slower training and worse convergence. LSTMs (and RNNs in general) can suffer from convergence issues like vanishing or exploding gradients; that said, with sufficient tuning they can obtain state-of-the-art results on many problems. As a rule of thumb, CNNs are good at feature extraction, while RNNs excel at tasks that depend on the meaning of the whole sentence, like question answering or machine translation.

Each cell processes one token embedding at a time and updates its internal state based on a differentiable computation that depends on both the embedding vector $x_t$ at time $t$ and the previous state $h_{t-1}$. In order to get a better understanding of how LSTMs work, you can refer to Chris Olah’s blog post.
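
For reference, the standard LSTM update implemented by such a cell (in the usual notation, with $\sigma$ the sigmoid and $\odot$ the element-wise product) is:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$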

[Figure: diagram of the repeating LSTM cell and its gates]
Source: Understanding LSTM Networks by Chris Olah

The complete LSTM model can be expressed by the following simple flowchart:

[Figure: flowchart of the complete LSTM model, from embedded tokens through the LSTM cell to the final dense layer]

In the beginning of this post, we padded all documents up to 200 tokens, which is necessary to build a proper tensor. However, when a document contains fewer than 200 words, we don’t want the LSTM to continue processing padding tokens as it does not add information and degrades performance. For this reason, we additionally want to provide our network with the length of the original sequence before it was padded. Internally, the model then copies the last state through to the sequence’s end. We can do this by using the "len" feature in our input functions. We can now use the same logic as above and simply replace the convolutional, pooling, and flatten layers with our LSTM cell.

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(100)
# input_layer is the embedded input defined earlier; passing the original
# sequence lengths stops the RNN from processing the padding tokens.
_, final_states = tf.nn.dynamic_rnn(
        lstm_cell, input_layer, sequence_length=features['len'], dtype=tf.float32)
logits = tf.layers.dense(inputs=final_states.h, units=1)
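
For completeness, the LSTM variant can be assembled into its own model_fn in exactly the same way as the CNN; the sketch below simply reuses the embedding and head code from above, and the function and directory names are illustrative.

def lstm_model_fn(features, labels, mode, params):
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])

    # Only the final state is needed for classification; sequence_length keeps
    # the RNN from running over the padding tokens.
    lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(100)
    _, final_states = tf.nn.dynamic_rnn(
        lstm_cell, input_layer, sequence_length=features['len'],
        dtype=tf.float32)
    logits = tf.layers.dense(inputs=final_states.h, units=1)

    head = tf.contrib.estimator.binary_classification_head()
    optimizer = tf.train.AdamOptimizer()

    def _train_op_fn(loss):
        tf.summary.scalar('loss', loss)
        return optimizer.minimize(
            loss=loss, global_step=tf.train.get_global_step())

    return head.create_estimator_spec(
        features=features, labels=labels, mode=mode,
        logits=logits, train_op_fn=_train_op_fn)

lstm_classifier = tf.estimator.Estimator(
    model_fn=lstm_model_fn,
    model_dir=os.path.join(model_dir, 'lstm'),
    params={'embedding_initializer': tf.random_uniform_initializer(-1.0, 1.0)})
train_and_evaluate(lstm_classifier)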

Pre-trained vectors

Most of the models that we have shown before rely on word embeddings as a first layer. So far, we have initialized this embedding layer randomly. However, much previous work has shown that using embeddings pre-trained on a large unlabeled corpus as initialization is beneficial, particularly when training on only a small number of labeled examples. The most popular pre-trained embedding is word2vec. Leveraging knowledge from unlabeled data via pre-trained embeddings is an instance of transfer learning.

To this end, we will show you how to use them in an Estimator. We will use the pre-trained vectors from another popular model, GloVe.

embeddings = {}
with open('glove.6B.50d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.strip().split()
        w = values[0]
        vectors = np.asarray(values[1:], dtype='float32')
        embeddings[w] = vectors

After loading the vectors into memory from a file we store them as a numpy.array using the same indexes as our vocabulary. The created array is of shape (5000, 50). At every row index, it contains the 50-dimensional vector representing the word at the same index in our vocabulary.

embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, embedding_size))
# word_index is assumed to map each token to the same index used by the padded
# dataset above (i.e. already accounting for the reserved low indexes).
for w, i in word_index.items():
    v = embeddings.get(w)
    if v is not None and i < vocab_size:
        embedding_matrix[i] = v

Finally, we can use a custom initializer function and pass it in the params object to our cnn_model_fn, without any modifications.

def my_initializer(shape=None, dtype=tf.float32, partition_info=None):
    assert dtype is tf.float32
    return embedding_matrix
params = {'embedding_initializer': my_initializer}
cnn_pretrained_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir=os.path.join(model_dir, 'cnn_pretrained'),
    params=params)
train_and_evaluate(cnn_pretrained_classifier)

Running TensorBoard

Now we can launch TensorBoard and see how the different models we’ve trained compare against each other in terms of training time and performance.

In a terminal, we run

> tensorboard --logdir={model_dir}

We can visualize many metrics collected during training and evaluation, including the loss value of each model at each training step, as well as the precision-recall curves. This is of course most useful for selecting which model works best for our use case, as well as for choosing classification thresholds.

[Figure: two TensorBoard screenshots]
Training loss across steps on the left and precision-recall curves on the test data for each of our models on the right

Getting Predictions

To obtain predictions on new sentences we can use the predict method of the Estimator instances, which will load the latest checkpoint for each model and evaluate on the unseen examples. But before passing the data into the model we have to clean it up, tokenize it, and map each token to the corresponding index, as we do below.

import string

def text_to_index(sentence):
    # Remove punctuation characters except for the apostrophe
    translator = str.maketrans('', '', string.punctuation.replace("'", ''))
    tokens = sentence.translate(translator).lower().split()
    return np.array([1] + [word_index[t] if t in word_index else 2 for t in tokens])

def print_predictions(sentences, classifier):
    indexes = [text_to_index(sentence) for sentence in sentences]
    x = sequence.pad_sequences(indexes, 
                               maxlen=sentence_size, 
                               padding='post', 
                               value=0)
    length = np.array([min(len(x), sentence_size) for x in indexes])
    predict_input_fn = tf.estimator.inputs.numpy_input_fn(x={"x": x, "len": length}, shuffle=False)
    predictions = [p['logistic'][0] for p in classifier.predict(input_fn=predict_input_fn)]
    print(predictions)
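
For instance, it could be called on a couple of made-up reviews (purely illustrative inputs):

print_predictions(
    ['I absolutely loved this movie, every scene was a delight',
     'This was a boring waste of two hours'],
    cnn_classifier)
# Prints one probability per sentence; values close to 1.0 indicate a positive
# prediction and values close to 0.0 a negative one.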

It is worth noting that the checkpoint itself is not sufficient to make predictions; the actual code used to build the estimator is necessary as well in order to map the saved weights to the corresponding tensors. It’s a good practice to associate saved checkpoints with the branch of code with which they were created.

If you are interested in exporting the models to disk in a fully recoverable way, you might want to look into the SavedModel class, which is especially useful for serving your model through an API using TensorFlow Serving or loading it in the browser with TensorFlow.js.
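
As a rough sketch of what that could look like for the estimators in this post (the placeholder names and shapes mirror the "x" and "len" features used throughout; this is not taken from the original notebook):

def serving_input_receiver_fn():
    # Accept already-tokenized, padded examples, matching the training features.
    features = {
        'x': tf.placeholder(dtype=tf.int32, shape=[None, sentence_size], name='x'),
        'len': tf.placeholder(dtype=tf.int32, shape=[None], name='len'),
    }
    return tf.estimator.export.ServingInputReceiver(features, features)

export_dir = cnn_classifier.export_savedmodel(
    os.path.join(model_dir, 'export'), serving_input_receiver_fn)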

In this blog post, we explored how to use estimators for text classification, in particular for the IMDB Reviews Dataset. We trained and visualized our own embeddings, as well as loaded pre-trained ones. We started from a simple baseline and made our way to convolutional neural networks and LSTMs.

For more details, be sure to check out:


Thanks for reading! If you like you can find us online at ruder.io and @eisenjulian. Send our way all your feedback and questions.

]]>