Python code profiling and accelerating your calculations with numba

You wrote up your excellent idea as a Python program or module, but you are unsatisfied with its performance. Chances are high that most of us have been there at least once. I was there just last week.

I found an excellent method for outlier detection (Extended Isolation Forest). eIF was initially written in Python and later optimised in Cython (using C++). The C++ version is ~40x faster than the vanilla Python one, but it lacks the possibility to save the model (which is crucial for my project). Since adding model saving to the C++ version is rather complicated business, I decided to optimise the Python code instead. Initially I hoped for a ~5-10x speed improvement. The final effect surprised me, as the rewritten Python code was ~40x faster than the initial version, matching the performance of the C++ version!

How is that possible? Speeding up your code isn’t trivial. First, you need to find which parts of your code are slow (so-called code profiling). Once you know that, you can start tinkering with the code itself (code optimisation).

Code profiling

Traditionally, I’ve been relying on %timeit, which reports the precise execution time of expressions in Python.

%timeit F3.fit(X)
# 1.25 s ± 792 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

As awesome as %timeit is, it won’t really tell you which parts of your code are time consuming. At least not directly. For that you’ll need something more advanced.

Code profiling became easier thanks to line_profiler. You can install it, load it and use it in a Jupyter notebook as follows:

# install line_profiler in your system
!pip install line_profiler 
# load the module into current Jupyter notebook
%load_ext line_profiler

# evaluate populate_nodes function of F3.fit program
%lprun -f F3.populate_nodes F3.fit(X)

The example above tells you that although line 134 takes just 11.7 µs per single execution, overall it accounts for 42.5% of the execution time, as it’s executed over 32k times. So starting the optimisation of the code from this single line could have a dramatic effect on the overall execution time.

Code optimisation

The first thing I noticed in the original Python code was that, in order to calculate the outlier score, individual samples were streamed one by one through the individual trees of the iForest.

        for i in  range(len(X_in)):
            h_temp = 0
            for j in range(self.ntrees):
                h_temp += PathFactor(X_in[i],self.Trees[j]).path*1.0            # Compute path length for each point
            Eh = h_temp/self.ntrees                                             # Average of path length travelled by the point in all trees.
            S[i] = 2.0**(-Eh/self.c)                                            # Anomaly Score
        return S

Since those are operations on arrays, lots of time can be saved if either all samples are processed by each tree at once, or each sample is processed by all trees at once. Implementing this wasn’t difficult and, combined with cleaning unnecessary variables & classes out of the code, resulted in a ~6-7x speed-up. A rough sketch of the idea is shown below.
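
To illustrate the idea (a minimal sketch, not the actual eif_new.py code; tree_path_lengths is a hypothetical helper returning the path length of every sample in a given tree, and c is the normalisation constant from the original code, self.c):

import numpy as np

def anomaly_scores(X, trees, c, tree_path_lengths):
    """Accumulate path lengths tree-by-tree for all samples at once."""
    Eh = np.zeros(len(X))
    for tree in trees:
        Eh += tree_path_lengths(X, tree)  # all samples traverse one tree at a time
    Eh /= len(trees)                      # average path length per sample
    return 2.0 ** (-Eh / c)               # anomaly score, same formula as above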

Speeding up array operations with numba

Further improvements were much milder and required detailed code profiling. As mentioned above, a single line took 42% of the overall execution time. Upon closer inspection, I realised that calling X.min(axis=0) and X.max(axis=0) was really time-consuming.

x = np.random.random(size=(256, 12))
%timeit x.min(axis=0), x.max(axis=0)
# 15.6 µs ± 43.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Python code can be optimised with numba. For example, calculating the min and max simultaneously using the numba just-in-time compiler results in over 7x faster execution!

from numba import jit

@jit
def minmax(x):
    """np.min(x, axis=0), np.max(x, axis=0) for 2D array but faster"""
    m, n = len(x), len(x[0])
    mi, ma = np.empty(n), np.empty(n)
    mi[:] = ma[:] = x[0]
    for i in range(1, m):
        for j in range(n):
            if x[i, j]>ma[j]: ma[j] = x[i, j]
            elif x[i, j]<mi[j]: mi[j] = x[i, j]
    return mi, ma

%timeit minmax(x) 
# 2.19 µs ± 4.61 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# make sure the results are the same
assert np.allclose(minmax(x), (x.min(axis=0), x.max(axis=0)))

Apart from that, there were several other parts that could be optimised with numba. You can have a look at eif_new.py and compare it with the older and the C++ versions using this notebook. If you want to know the details, just comment below – I’ll be more than happy to discuss them 🙂

If you’re looking for ways of speeding up array operations, definitely check out numexpr besides numba. The eIF case didn’t really need numexpr optimisations, but it’s a really impressive project and I can imagine many people could benefit from it. So spread the word!
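
For completeness, here is a minimal numexpr example (not taken from the eIF code): it evaluates a whole array expression in a single multi-threaded pass, avoiding the temporary arrays that chained NumPy operations allocate.

import numpy as np
import numexpr as ne

a = np.random.random(int(1e7))
b = np.random.random(int(1e7))

# plain NumPy allocates a temporary array for every intermediate result
slow = 2 * a + 3 * b ** 2

# numexpr compiles the expression and evaluates it in one pass over the data
fast = ne.evaluate("2 * a + 3 * b ** 2")

assert np.allclose(slow, fast)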

Create book of abstracts from spreadsheet / google forms

Lately a friend of mine complained about collecting and collating abstract submissions from numerous applicants. Having a Book of Abstracts is crucial, and we faced a similar problem organising #NGSchool events.

Note, you’ll need to be somewhat familiar with LaTeX in order to edit the main.tex file to your liking. If you are not afraid of that, the way to proceed is as follows:

  1. Create a google form to collect the necessary info, such as this one
  2. Create a new spreadsheet to accumulate responses: Responses > Create new spreadsheet
  3. Download the responses spreadsheet as Abstracts.xlsx
  4. Clone the abstracts repository and install the dependencies:
    git clone https://github.com/lpryszcz/abstracts.git
    cd abstracts
    # install dependencies
    sudo apt install texlive-base texlive-latex-recommended texlive-fonts-recommended texlive-latex-extra make
    
  5. Edit main.tex to your liking
  6. Copy Abstracts.xlsx to the repository
  7. Create the pdf (a rough sketch of what the conversion step does is shown after this list):
    # prepare abstracts.tex
    ./xls2tex.py
    
    # create main.pdf
    make all
    
    # in case of problems, remove the clutter and run this step again
    rm main.{aux,blg,log,out,toc,pdf}
    

    You’ll find the abstract book in main.pdf.
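
For the curious, the conversion step boils down to roughly the following (a simplified sketch, not the actual xls2tex.py; the column order of the form responses and the openpyxl dependency are assumptions, and LaTeX special characters are not escaped here):

import openpyxl  # pip install openpyxl

def xlsx2tex(fname="Abstracts.xlsx", out="abstracts.tex"):
    """Dump form responses as LaTeX sections (illustrative sketch only)."""
    wb = openpyxl.load_workbook(fname, read_only=True)
    ws = wb[wb.sheetnames[0]]
    rows = ws.iter_rows(values_only=True)
    next(rows)  # skip the header row
    with open(out, "w") as tex:
        # assumed column order: timestamp, name, affiliation, title, abstract
        for timestamp, name, affiliation, title, abstract in rows:
            tex.write("\\section*{%s}\n%s (%s)\n\n%s\n\n"
                      % (title, name, affiliation, abstract))

xlsx2tex()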

Running Jupyter as public service

Some time ago, I wrote about setting up IPython as a public service. Today, I’ll write about setting up Jupyter, IPython’s descendant, which besides Python supports tons of other languages and frameworks.

The Jupyter notebook will be running under a separate user, so your personal files are safe, but not as a system service; therefore, you will need to restart it after a system reboot. I recommend running it in a SCREEN session, so you can easily log into the server and check Jupyter’s state.

  1. Install & set up Jupyter
    # install build essentials & Jupyter
    sudo apt-get install build-essential python-dev
    sudo pip install jupyter
    
    # create new user
    sudo adduser jupyter
     
    # login as new user
    su jupyter
    
    # make sure to add `unset XDG_RUNTIME_DIR` to ~/.bashrc
    # otherwise you'll encounter: OSError: [Errno 13] Permission denied: '/run/user/1003/jupyter'
    echo 'unset XDG_RUNTIME_DIR' >> ~/.bashrc
    source ~/.bashrc
    
    # generate ssl certificates
    mkdir ~/.ssl
    openssl req -x509 -nodes -days 999 -newkey rsa:1024 -keyout ~/.ssl/mykey.key -out ~/.ssl/mycert.pem
    
    # generate config
    jupyter notebook --generate-config
    
    # generate pass and checksum
    ipython -c "from IPython.lib import passwd; passwd()"
    # enter your password twice, save it and copy password hash
    ## Out[1]: 'sha1:[your hashed password here]'
     
    # add to ~/.jupyter/jupyter_notebook_config.py
    c.NotebookApp.ip = '*'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8881
    c.NotebookApp.password = u'sha1:[your hashed password here]'
    c.NotebookApp.certfile = u'/home/jupyter/.ssl/mycert.pem'
    c.NotebookApp.keyfile = u'/home/jupyter/.ssl/mykey.key'
    
    # create some directory for notebook files ie. ~/Public/jupyter
    mkdir -p ~/Public/jupyter && cd ~/Public/jupyter
     
    # start notebook server
    jupyter notebook
    
  2. Add kernels
    You can add multiple kernels to Jupyter. Here I’ll cover the installation of some:

    • Python
      sudo pip install ipykernel
      
      # if you wish to use matplotlib, make sure to add to 
      # ~/.ipython/profile_default/ipython_kernel_config.py
      c.InteractiveShellApp.matplotlib = 'inline'
      
    • BASH kernel
      sudo pip install bash_kernel
      sudo python -m bash_kernel.install
      
    • Perl
      This didn’t work for me :/

      sudo cpan Devel::IPerl
    • IRkernel
      Follow this tutorial.

    • Haskell
      sudo apt-get install cabal-install
      git clone http://www.github.com/gibiansky/IHaskell
      cd IHaskell
      ./ubuntu-install.sh
      

Then, just navigate to https://YOURDOMAIN.COM:8881/, accept the self-signed certificate and enjoy!
Alternatively, you can obtain a certificate from Let’s Encrypt.

Using existing domain encryption aka Apache proxy
If your domain is already served over HTTPS, you may consider setting up Jupyter on localhost and redirecting all incoming (already encrypted) traffic to a particular port on localhost (as suggested by @shebang).

# enable Apache mods
sudo a2enmod proxy proxy_http proxy_wstunnel && sudo service apache2 restart

# add to your Apache config
    <Location "/jupyter" >
        ProxyPass http://localhost:8881/jupyter
        ProxyPassReverse http://localhost:8881/jupyter
    </Location>
    <Location "/jupyter/api/kernels/" >
        ProxyPass        ws://localhost:8881/jupyter/api/kernels/
        ProxyPassReverse ws://localhost:8881/jupyter/api/kernels/
    </Location>

# update your Jupyter config (~/.jupyter/jupyter_notebook_config.py)
c.NotebookApp.ip = 'localhost'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8881
c.NotebookApp.base_url = '/jupyter'
c.NotebookApp.password = u'sha1:[your hashed password here]'
c.NotebookApp.allow_origin = '*'

Note, it’s crucial to add the Apache proxy for the kernels (/jupyter/api/kernels/), otherwise you won’t be able to use terminals, failing with the ‘Error during WebSocket handshake: Unexpected response code: 400’ error.

On handy docker images

Motivated by the successful stripping of problematic dependencies from Redundans, I decided to generate a smaller Docker image, starting from the Alpine Linux image (2MB / 5MB after downloading) instead of Ubuntu (49MB / 122MB). Previously, I couldn’t really rely on Alpine Linux, because it was impossible to get these problematic dependencies running… But now it’s a whole new world of possibilities 😉

There are very few dependencies left, so I got started… (You can find all the commands below.)

  1. First, I checked what can be installed from the package manager.
    Only Python and Perl.

  2. Then I checked whether any of the provided binaries work.
    For example, GapCloser is provided as a binary. It took me some time to find the source code…
    Anyway, none of the binaries worked out of the box. That was expected, as Alpine Linux is super stripped-down…

  3. I installed build-base in order to be able to build things.
    Additionally, BWA needs zlib-dev.

  4. Alpine Linux doesn’t use the standard glibc library, but musl libc (you can read more about the differences between the two), so some programmes (e.g. BWA) may be quite reluctant to compile.
    After some hours of trying & thanks to the help of mp15, I found a solution, and not such a complicated one 🙂

  5. I realised that Dockerfile doesn’t like the standard BASH brace expansion that otherwise works in the Docker Alpine console…
    so ls *.{c,h} should be ls *.c *.h

  6. After that, LAST and GapCloser compilation was relatively easy 😉

Below, you can find the code from the Dockerfile (without the RUN commands).

apk add --update --no-cache python perl bash wget build-base zlib-dev
mkdir -p /root/src && cd /root/src && wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.15.tar.bz2 && tar xpfj bwa-0.7.15.tar.bz2 && ln -s bwa-0.7.15 bwa && cd bwa && \
cp kthread.c kthread.c.org && echo "#include <stdint.h>" > kthread.c && cat kthread.c.org >> kthread.c && \
sed -ibak 's/u_int32_t/uint32_t/g' `grep -l u_int32_t *.c *.h` && make && cp bwa /bin/ && \
cd /root/src && wget http://liquidtelecom.dl.sourceforge.net/project/soapdenovo2/GapCloser/src/r6/GapCloser-src-v1.12-r6.tgz && tar xpfz GapCloser-src-v1.12-r6.tgz && ln -s v1.12-r6/ GapCloser && cd GapCloser && make && cp bin/GapCloser /bin/ && \
cd /root/src && wget http://last.cbrc.jp/last-744.zip && unzip last-744.zip && ln -s last-744 last && cd last && make && make install && \
cd /root/src && rm -r last* bwa* GapCloser* v* 

# SSPACE && redundans in /root/src
cd /root/src && wget -q http://www.baseclear.com/base/download/41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && tar xpfz 41SSPACE-STANDARD-3.0_linux-x86_64.tar.gz && ln -s SSPACE-STANDARD-3.0_linux-x86_64 SSPACE && wget -O- -q http://cpansearch.perl.org/src/GBARR/perl5.005_03/lib/getopts.pl > SSPACE/dotlib/getopts.pl && \
wget --no-check-certificate -q -O redundans.tgz https://github.com/lpryszcz/redundans/archive/master.tar.gz && tar xpfz redundans.tgz && mv redundans-master redundans && ln -s /root/src/redundans /redundans && rm *gz

apk del wget build-base zlib-dev 
apk add libstdc++

After building & pushing, I noticed that the Alpine-based image is slightly smaller (99MB) than the one based on Ubuntu (127MB). Surprisingly, the Alpine-based image is larger (273MB) than the Ubuntu-based one (244MB) after downloading. So, I’m afraid all of these hours didn’t really bring any substantial reduction in image size.

Conclusion?
I was very motivated to build my application on Alpine Linux and expected a substantial size reduction. But I’d say that relying on the Alpine Linux image doesn’t always pay off in terms of a smaller image size, not to mention the time it took to get there… And this I know from my own experience.
But maybe I did something wrong? I’d be really glad for some advice/comments!

Nevertheless, stripping a few dependencies from my application (namely Biopython, numpy & scipy) resulted in a much more compact image even with the Ubuntu base (127MB vs 191MB; and 244MB vs 440MB after downloading). So I think this is the way to go 🙂

On simplifying dependencies

Lately, to make Redundans more user-friendly, I have simplified its dependencies by replacing Biopython, numpy, scipy and SQLite with some (relatively) simple functions or modules.

Here, I will focus on replacing Biopython, particularly SeqIO.index_db, with FastaIndex. You may ask yourself why I have invested time in reinventing the wheel. I’m a big fan of Biopython, yet it’s a huge project and some solutions are not optimal or require problematic dependencies. This is the case with SeqIO.index_db, which relies on SQLite3. Again, I’m a big fan of SQLite, yet building Biopython with SQLite enabled proved not to be very straightforward on non-standard systems or for less experienced users. Besides, with some NFS settings, the SQLite3 db cannot be created at all.

Ok, let’s start from the basics. SeqIO.index_db allows random access to sequence files, so you can rapidly retrieve any entry from a very large file. This is achieved by storing the ID and position of each entry of a given file in a database, an SQLite3 db. Then, if you want to retrieve a particular record, SeqIO.index_db checks whether this record is present in the SQLite3 db, retrieves the record’s position in the file and reads only a small chunk of that file, instead of parsing the entire file every time you want to get some record(s).
A similar feature is offered by samtools faidx, but in this case the coordinates of each entry are stored in a tab-delimited file, .fai (more info about .fai). This format can easily be read & written by any programme, so I decided to use it. In addition, I realised that samtools faidx is flexible enough that you can add extra columns to the .fai without breaking its functionality, but more about that later…
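
To make this concrete, here is a minimal sketch (not FastaIndex itself) of how a .fai line enables random access: the index stores the sequence name, its length, the byte offset of its first base, the bases per line and the bytes per line, so retrieving a record is just a seek and a short read.

def load_fai(fname):
    """Parse a samtools-style .fai: name, length, offset, linebases, linewidth."""
    index = {}
    for line in open(fname):
        name, length, offset, linebases, linewidth = line.split()[:5]
        index[name] = (int(length), int(offset), int(linebases), int(linewidth))
    return index

def get_sequence(fasta, index, name):
    """Seek straight to the record and read only its bases."""
    length, offset, linebases, linewidth = index[name]
    nlines = -(-length // linebases)                    # number of sequence lines (ceil)
    nbytes = length + nlines * (linewidth - linebases)  # bases plus newline characters
    with open(fasta) as handle:
        handle.seek(offset)
        return handle.read(nbytes).replace("\n", "")

# usage:
# index = load_fai("contigs.fa.fai")
# seq = get_sequence("contigs.fa", index, "NODE_2_length_7674_cov_46.7841_ID_3")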

In Redundans, I’ve been using SeqIO.index_db during assembly reduction (fasta2homozygous.py). Besides storing the index, I’ve also been generating statistics for every FastA file, like the number of contigs, cumulative size, N50, N90, GC and so on. I realised that these two can easily be combined by extending the .fai with four additional columns storing the number of occurrences of A, C, G & T in every sequence. Such a .fai is still compatible with samtools faidx and provides a very easy way of calculating a bunch of statistics about the file.
All of this I’ve implemented in FastaIndex. Besides being a dependency-free & very handy indexer, it can also be used as an alternative to samtools faidx to retrieve sequences from large FastA files.
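
As an illustration of why the extra A/C/G/T columns are handy, the basic FastA statistics can be computed from the extended index alone, without touching the sequences again (a sketch, not the FastaIndex code; it assumes the columns name, length, offset, linebases, linewidth, A, C, G, T):

def fasta_stats(fai_fname):
    """Compute basic assembly statistics from an extended .fai."""
    lengths, gc, acgt = [], 0, 0
    for line in open(fai_fname):
        fields = line.split()
        lengths.append(int(fields[1]))
        A, C, G, T = map(int, fields[5:9])
        gc += G + C
        acgt += A + C + G + T
    lengths.sort(reverse=True)
    total = sum(lengths)
    # N50: length of the contig at which half of the cumulative size is reached
    cumsum, n50 = 0, 0
    for l in lengths:
        cumsum += l
        if cumsum >= 0.5 * total:
            n50 = l
            break
    return {"contigs": len(lengths), "size": total,
            "GC [%]": 100.0 * gc / acgt, "N50": n50}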

# retrieve bases from 20 to 60 from NODE_2
./FastaIndex.py -i test/run1/contigs.fa -r NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC
#Time elapsed: 0:00:00.014243

samtools faidx test/run1/contigs.fa NODE_2_length_7674_cov_46.7841_ID_3:20-60
>NODE_2_length_7674_cov_46.7841_ID_3:20-60
CATAGAACGACTGGTATAAGCCAAACATGACCCATTGTTGC

Tracing exceptions in multiprocessing in Python

I had problems with debugging my programme that uses multiprocessing.Pool.

Traceback (most recent call last):
  File "src/homologies2mysql_multi.py", line 294, in <module>
    main()
  File "src/homologies2mysql_multi.py", line 289, in main
    o.noupload, o.verbose)
  File "src/homologies2mysql_multi.py", line 242, in homologies2mysql
    for i, data in enumerate(p.imap_unordered(worker, pairs), 1):
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 520, in next
    raise value
ValueError: need more than 1 value to unpack

I could run it without multiprocessing, but then I’d have to wait several days for the program to reach the point where it crashes.
Luckily, Python is equipped with traceback, which allows handy tracing of exceptions.
You can then add a decorator to the problematic function that will report a nice error message:

import traceback, functools, multiprocessing
 
def trace_unhandled_exceptions(func):
    @functools.wraps(func)
    def wrapped_func(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:
            print('Exception in ' + func.__name__)
            traceback.print_exc()
    return wrapped_func
 
@trace_unhandled_exceptions
def go():
    print(1)
    raise Exception()
    print(2)
 
p = multiprocessing.Pool(1)
 
p.apply_async(go)
p.close()
p.join()

The error message will look like:

1
Exception in go
Traceback (most recent call last):
  File "<stdin>", line 5, in wrapped_func
  File "<stdin>", line 4, in go
Exception

Solution found on StackOverflow.

Batch conversion of .xlsx (Microsoft Office) to .tsv (tab-delimited) files

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it would be a rather time-consuming task; plus, Office quotes text fields, which is not very convenient for downstream analysis…
I found a handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at a time. Thus, I have rewritten xlsx2tsv.py a little to save all sheets from a given .xlsx file into a separate folder. In addition, multiple .xlsx files can be processed at once. My version can be found on github.

xlsx2tsv.py *.xlsx
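
In essence, my version boils down to something like the sketch below (simplified, not the actual script on github; it assumes openpyxl is installed):

import os, sys
import openpyxl  # pip install openpyxl

def xlsx2tsv(fname):
    """Save every sheet of an .xlsx file as a .tsv inside a folder named after the file."""
    outdir = os.path.splitext(fname)[0]
    os.makedirs(outdir, exist_ok=True)
    wb = openpyxl.load_workbook(fname, read_only=True)
    for sheet in wb.sheetnames:
        with open(os.path.join(outdir, sheet + ".tsv"), "w") as out:
            for row in wb[sheet].iter_rows(values_only=True):
                out.write("\t".join("" if v is None else str(v) for v in row) + "\n")

if __name__ == "__main__":
    # process multiple .xlsx files at once, e.g. xlsx2tsv.py *.xlsx
    for fname in sys.argv[1:]:
        xlsx2tsv(fname)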

Installing new version of Python without root

UPDATE 2019: If you have access to sudo, I’d definitely recommend installing Python through the system package manager and using venv, i.e.

# install python3.7 
sudo apt install python3.7-venv python3.7-dev
# create virtual environment using python3.7
python3.7 -m venv py37
# activate environment
source py37/bin/activate
# check version
python --version # python -c 'import sys; print(sys.version_info)'
# install new package
pip install numpy
# leave virtual environment
deactivate

Some time ago I recommended using a Python virtual environment to install local versions of Python packages. However, this will not solve the issue of an outdated Python version on the server you are working on. Here, pythonbrew may be of help.

# install pythonbrew to ~/.pythonbrew
curl -kL http://xrl.us/pythonbrewinstall | bash

# add to ~/.bashrc to automatically activate pythonbrew
[[ -s "$HOME/.pythonbrew/etc/bashrc" ]] && source "$HOME/.pythonbrew/etc/bashrc"                                                         

# open new terminal tab (Ctrl+Shift+T) or window (Ctrl+Shift+N)

# install python 2.7.10
pythonbrew install 2.7.10

# and enable the new version
pythonbrew switch 2.7.10

# from now on, you can enjoy the version of your choice and install dependencies
which python
#/home/.../.pythonbrew/pythons/Python-2.7.10/bin/python
python --version
#Python 2.7.10
which pip
#/home/.../.pythonbrew/pythons/Python-2.7.10/bin/pip

Serving IPython notebook on public domain

I’ve been involved in teaching basic programming in Python. There are several good tutorials and on-line courses (just to mention Python@CodeCademy), but I’ve recognised there is a need for an interactive workplace for the students. I got the idea to set up IPython on a public domain, as many of the students don’t have Python installed locally or miss certain dependencies…
The task of installing IPython and serving it publicly seems very easy… But I’ve encountered numerous difficulties on the way, caused by different versions of IPython (i.e. the split into Jupyter in v4), Apache configuration and firewall setup, just to mention a few. Anyway, I’ve succeeded and decided to share my experiences here 🙂
First of all, I strongly recommend setting up a separate user for serving IPython, as only this way your personal files will be safe.

  1. Install IPython notebook and prepare a new user
    # install python-dev and build essentials
    sudo apt-get install build-essential python-dev
    
    # install ipython; v3 is recommended
    sudo pip install ipython[all]==3.2.1
    
    # create new user
    sudo adduser ipython
    
    # login as new user
    su ipython
    
  2. Configure IPython notebook
    # create new profile
    ipython profile create nbserver
    
    # generate pass and checksum
    ipython -c "from IPython.lib import passwd; passwd()"
    # enter your password twice, save it and copy password hash
    ## Out[1]: 'sha1:[your hashed password here]'
    
    # add to ~/.ipython/profile_nbserver/ipython_notebook_config.py after `c = get_config()`
    c.NotebookApp.ip = 'localhost'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8889
    c.NotebookApp.base_url = '/ipython'
    c.NotebookApp.password = u'sha1:[your hashed password here]'
    c.NotebookApp.allow_origin='*'
    
    # create some directory for notebook files ie. ~/Public/ipython
    mkdir -p ~/Public/ipython
    cd ~/Public/ipython
    
    # start notebook server
    ipython notebook --profile=nbserver
    
  3. Configure Apache2
    # enable mods
    sudo a2enmod proxy proxy_http proxy_wstunnel
    sudo service apache2 restart
    
    # add ipython proxy config to your enabled site ie. /etc/apache2/sites-available/000-default.conf
        # IPython
        <Location "/ipython" >
            ProxyPass http://localhost:8889/ipython
            ProxyPassReverse http://localhost:8889/ipython
        </Location>
    
        <Location "/ipython/api/kernels/" >
            ProxyPass        ws://localhost:8889/ipython/api/kernels/
            ProxyPassReverse ws://localhost:8889/ipython/api/kernels/
        </Location>
        #END
          
    # restart apache2
    sudo service apache2 restart
    

Your public IPython will be accessible at http://yourdomain.com/ipython .
It took me the longest to realise that the c.NotebookApp.allow_origin='*' line is crucial in the IPython notebook configuration; otherwise the kernel keeps losing the connection with a ‘Connection failed‘ or ‘WebSocket error‘ error. Additionally, one of the servers I’ve been trying has a proxy setup that blocks some high ports, so it was impossible to connect to the WebSocket even with the Apache proxy in place…
If you want to read more, especially about setting up an SSL-enabled notebook, have a look at the jupyter documentation.

Identification of potential transcription factor binding sites (TFBS) across species

My colleague asked me for help with the identification of targets for some transcription factors (TFs). The complication is that the target motif for these TFs is known in human, (A/G)GGTGT(C/G/T)(A/G), but the exact binding motif is not known in the species of interest. Nevertheless, we decided to scan the genome for matches of this motif. To facilitate that, I’ve written a small program, regex2bed.py, that finds sequence motifs in the genome. The program employs regular expressions to find matches on the forward and reverse-complement strands and reports bed-formatted output.

regex2bed.py -vcs -i DANRE.fa -r "[AG]GGTGT[CGT][AG]" > tf.bed 2> tf.log
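
The core of the approach is simple enough to sketch here (an illustration only, not the actual regex2bed.py): scan the forward strand with a compiled regular expression, scan the reverse complement and map the coordinates back, and print bed-formatted intervals.

import re

def revcomp(seq):
    """Reverse complement; IUPAC ambiguity codes are ignored for brevity."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def motif2bed(chrom, seq, pattern="[AG]GGTGT[CGT][AG]"):
    """Yield BED intervals (0-based, half-open) for motif matches on both strands."""
    regex = re.compile(pattern, re.IGNORECASE)
    # forward strand
    for m in regex.finditer(seq):
        yield (chrom, m.start(), m.end(), m.group(0), 0, "+")
    # reverse strand: scan the reverse complement and convert coordinates back
    rc = revcomp(seq)
    for m in regex.finditer(rc):
        yield (chrom, len(seq) - m.end(), len(seq) - m.start(), m.group(0), 0, "-")

# usage
for fields in motif2bed("1", "AAAGGGTGTCAGGG"):
    print("\t".join(map(str, fields)))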

regex2bed.py is quite fast, scanning a 1.5G genome in ~1 minute on a modern desktop. The program reports some basic stats, i.e. the number of matches on the +/- strands for each chromosome, to stderr.
Most likely, you will find hundreds of thousands of putative TFBS. Therefore, it’s good to filter them, e.g. by focusing on those in the proximity of genes of interest. This can be accomplished using a combination of awk, bedtools and two other scripts: bed2region.py and intersect2bed.py.

# crosslink with genes within 100 kb upstream of coding genes
awk '$3=="gene"' genome.gtf > gene.gtf
cat tf.bed | bed2region.py 100000 | bedtools intersect -s -loj -a - -b gene.gtf  | intersect2bed.py > tf.genes100k.bed

And this is what the example output looks like:

1       68669   68677   GGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578  f7; f10; F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       71354   71362   aggtgtgg        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a
1       76322   76330   AGGTGTGG        0       +       ENSDARG00000034862; ENSDARG00000088581; ENSDARG00000100181; ENSDARG00000100782; ENSDARG00000076900; ENSDARG00000075827; ENSDARG00000096578      f7; f10; LAMP1 (2 of 2); F7 (4 of 4); PROZ (2 of 2); f7i; cul4a

All the above-mentioned programs can be found on github.
If you want to learn more about regular expressions, have a look at the Python re module.