Python basics for data analysis:

When I was asked to write a post, I decided to share with the NLP community how learning a programming language can help Computational Linguists to optimize their everyday work.

Since I started working as a Computational Linguist at Bitext, I have had to deal with large amounts of data and information every day, and I want to share my favorite tool for processing, analyzing, and extracting information from large files: Python.

Python, as some of you might know, is a free, open-source programming language that offers a lot of functionality for NLP and data analysis. I prefer it over other programming languages because it is widely used in academia, research, and industry, and because it gives the user access to libraries like NLTK, which are very useful when the data you are analyzing is linguistic.

Let me show you a couple of the things I use Python for on a daily basis when analyzing data. You will see that even some Python basics can help us greatly.

Information Retrieval:

A Python script can help you extract from a big file exactly the information you are interested in. This simplifies the process, reduces the time it takes, and avoids tedious manual work.

As an example, let’s imagine I have a big text file in Russian. With just over 10 lines of simple code, I can extract all the masculine singular forms of the past active participle of Russian verbs (not counting irregular verbs).

[Image: Python script that extracts the participle forms]

This script would iterate through millions of lines and give back a result in seconds, optimizing the process and saving us a lot of time.
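As a rough illustration, the extraction described above could be sketched as follows. This is only a heuristic under stated assumptions: it treats any word ending in "-вший" (optionally followed by the reflexive "-ся") as a regular masculine singular past active participle, and the names `PARTICIPLE_RE`, `extract_participles`, and `extract_from_file` are illustrative.

```python
import re
from collections import Counter

# Assumption: regular past active participles (masculine singular nominative)
# end in "-вший", optionally followed by the reflexive particle "-ся".
# This is a surface heuristic, so lexicalized words like "бывший" also match.
PARTICIPLE_RE = re.compile(r"\b[а-яё]+вший(?:ся)?\b", re.IGNORECASE)


def extract_participles(lines):
    """Count every matching form (lowercased) in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for form in PARTICIPLE_RE.findall(line):
            counts[form.lower()] += 1
    return counts


def extract_from_file(path):
    """Stream a UTF-8 file line by line so large files never sit in memory."""
    with open(path, encoding="utf-8") as handle:
        return extract_participles(handle)
```

Because the file is read one line at a time, memory use stays flat no matter how long the file is. A real script would probably check the candidates against a verb lexicon to filter out false positives.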


Text processing:

Imagine that in the above-mentioned file, we discover that the forms we have extracted were not meant to be masculine (e.g., ‘слышавший’) but feminine (e.g., ‘слышавшая’). With this simple Python script, we can change the masculine forms into feminine ones.

[Image: Python script that changes the masculine forms into feminine ones]

Besides the fast processing time, using a script rules out manual errors (provided the code is correct, of course). The script goes through the whole file and changes every instance in the same way.
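A replacement like the one described above could be sketched as below. It assumes the same "-вший" heuristic for spotting the masculine forms and lowercase endings; `feminize` and `convert_file` are illustrative names.

```python
import re

# Assumption: the masculine forms end in "-вший" or reflexive "-вшийся".
# Group 1 keeps the stem up to "вш"; group 2 keeps the optional "ся".
MASCULINE_RE = re.compile(r"\b([а-яё]+вш)ий(ся)?\b", re.IGNORECASE)


def feminize(text):
    """Rewrite '-вший(ся)' as '-вшая(ся)', leaving all other text untouched."""
    return MASCULINE_RE.sub(
        lambda m: m.group(1) + "ая" + (m.group(2) or ""), text
    )


def convert_file(source_path, target_path):
    """Write a converted copy of the file, processing one line at a time."""
    with open(source_path, encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            target.write(feminize(line))
```

Writing to a separate target file keeps the original intact, which makes it easy to compare the two versions before discarding anything.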


Text normalization:

This is very useful when we get a file full of great information that we can only use after converting it to a different format. The reformatting may be necessary to make the file usable by some other software, or simply to make it more readable.

For example, let’s imagine we get the inflection of Spanish verbs from a source that uses this messy format:


To use it, however, we want it presented in this format:

Conjugated-form        Lemma           Person             Number           Tense           Aspect

This Python script will read any number of lines and return them in the desired format.

[Image: Python script that converts the lines to the desired format]


This type of change would be impractical to make by hand in most text editors if the file is several million lines long, but a Python script changes the whole file uniformly in seconds.
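To give an idea of how such a normalization might work, here is a minimal sketch. The input format is purely hypothetical: it assumes each line holds `key=value` pairs separated by semicolons, in any order and with stray spaces, using the invented keys `form`, `lemma`, `pers`, `num`, `tense`, and `asp`. The output is one tab-separated line in the column order shown above.

```python
# Hypothetical input keys, listed in the desired output order:
# Conjugated-form, Lemma, Person, Number, Tense, Aspect.
COLUMNS = ["form", "lemma", "pers", "num", "tense", "asp"]


def normalize_line(line):
    """Turn one messy 'key=value;...' line into a tab-separated row."""
    fields = {}
    for chunk in line.strip().split(";"):
        if "=" not in chunk:
            continue  # skip empty or malformed chunks
        key, value = chunk.split("=", 1)
        fields[key.strip().lower()] = value.strip()
    # Missing keys become empty cells so every row has six columns.
    return "\t".join(fields.get(column, "") for column in COLUMNS)


def normalize_file(source_path, target_path):
    """Normalize a whole file line by line, skipping blank lines."""
    with open(source_path, encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            if line.strip():
                target.write(normalize_line(line) + "\n")
```

Parsing each line into a dictionary first means the order of the fields in the source does not matter, and a missing field shows up as an empty cell instead of shifting the columns.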

These are only simple examples of what you can do with Python to process and analyze your data, but they show the great potential this programming language has. At Bitext, we use Python to interact easily with Language Processing APIs, handle big language databases, generate or process the inflection and derivation of any language, and carry out many other tasks. If you are interested in more complex examples, complete the form and we will send you a presentation!

