counting the matched lines

user3138373 07/11/2018. 4 answers, 105 views
text-processing grep

I have 2 input files. Input file1 looks like this

Equus caballus
Monodelphis domestica
Saccharomyces cerevisiae S288c

Input2 looks like this(showing first 10 lines)

>CM000377.2/60448635-60448529 Equus caballus chromosome 1, whole genome shotgun sequence.   ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTATTCTTATCAGTTTAAAACTAGTGGTGAAATGAGATGTAGACAGTAACATTTGAATTACAACATCA
>CM000377.2/105043590-105043453 Equus caballus chromosome 1, whole genome shotgun sequence. ATTGCTTCTTGGCCTTTTGGCTAAGATCAAGTATAGTATCTGTTCTCATCAATTTAAAAATGGCAATATAAATAGACCCATAGTAGATCCAGATAATGGTGTTATCAGAAAAGGACTTTAAGTAATTTAATATGTTCA
>CM000377.2/137942042-137941941 Equus caballus chromosome 1, whole genome shotgun sequence. ATCGCTTCTCAGACTTTTGGCTAAGATCAAGCGTAGTATCTGTTCTTATCAGTAATTAACTTCAGAAAAGTTAACTCATCTTCAGCAAGGCAGTAATCCCCT
>CM000377.2/97988860-97989002 Equus caballus chromosome 1, whole genome shotgun sequence.   ATCGCTTCTTGGCCTTTTGGCTAAGATCAAGTGTAGGAATCAATGAATTTCTGGTTATGGAGGCTAAAATGATATCTAATCTTGACTTAATCTAGGTCTCTTCAGTATTTGTCACCCTTTACTACATTCTCTGCTGATGCACT
>CM000377.2/77415658-77415776 Equus caballus chromosome 1, whole genome shotgun sequence.   ACTGCTTCTTCGCCTTTTGGCTAAAATCAAGTATAGTATCTGTTCTTACCAGTTTAAGTACTTTTTGTGCTTCTCATGGCTATAAGCCATAATTGCTGTTATAACGGTAAGGATTTTTC
>CM000377.2/172045138-172045024 Equus caballus chromosome 1, whole genome shotgun sequence. ATTGCTTCTTGGCCTTTCAGCTAAGATCAAGTGTTGTATCTGTTCGTATCAGTTTAAATCATTCTGCACCAAAGATATGTCTCTTCTTCTCCATTTATTAATTTGTTCACTTATT
>CM000378.2/50070490-50070688 Equus caballus chromosome 2, whole genome shotgun sequence.   ATTGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTAATTGATTATCTCAAGTTAAGGAGAACTCACTACATCCCAAAGTCTCATTCTTTGTCTGAGTCTTGACACACATACTTCTTTCTGTGAGTATGTCCCTATTGCCTGCAATTGGCAATCTAAACATTCAGTGAAAATCTTCATTAGCTTTGAATGAACCATGT
>CM000378.2/21366877-21367061 Equus caballus chromosome 2, whole genome shotgun sequence.   AAAGCGTCTCAGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTAGCTAGTCTATAACCTGATTGATATGTCCATTTTACCCCAATATCATACCATTATGATTACTGTGGCTTTATATAGCAAATCTTGAACTCAGGTAGTATAAATCCTCTAACTCTGTTCTTTGTCAAAATGGTCTTGGCTATT
>CM000378.2/56987690-56987788 Equus caballus chromosome 2, whole genome shotgun sequence.   ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGAACGTCGGCGCCCTCGTGAGGAGGCACAGCCTCTCGTTCCCTGCTCCTACACTCCTT
>CM000378.2/18244103-18244249 Equus caballus chromosome 2, whole genome shotgun sequence.   ATCGCTTCTCGGCCTTTTGGCTGAGATCAAGTGTAGAGCTTTGAATAGTATAATAATATTATTTTGATAGTAATAACAATAAACAATCGCTAGCATTAATGAGAGCTTAGTGTATGCCAGTCACCATGCTAAGTGCTCTAGATGCTT
>CM000370.1/74459482-74459563 Monodelphis domestica chromosome 3, whole genome shotgun sequence.    ATCACTTCTCTGCCTTTTGGCTAAGATCAAGTGTAGTATCAATAGATGCAGAAAGAGCTTTTGACAAAATACAACACCCATT
>CM000370.1/105243828-105243703 Monodelphis domestica chromosome 3, whole genome shotgun sequence.  ATTGTTTCTTGGCCTTTTGGCTAAGATCAAGTGTAGAAATATTGTTAAATAATTACTTGTAAGATCTCGGAGAAACTAGAGAAGGTATTTATTGTACCTGGGAGTTTCCCATTCCTGGAACTCTCT
>CM000370.1/143474511-143474342 Monodelphis domestica chromosome 3, whole genome shotgun sequence.  ATTGCTTCTCAACCTTTTGGCTAAGATCAAGTGTAGTATCTATATCCCAATGATGTTTGGGATACTTAGTATTTGGGCAGCTAGAACTCCTCTTCCTGAGTTAAAATCCAGCCAATCACTAGCTGTGTGGCCTTGGGTAAGTCACTTAACCCAGTTTGCCTCAGTTGTCT
>CM000371.1/104846407-104846597 Monodelphis domestica chromosome 4, whole genome shotgun sequence.  ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCCGAGGACAATATATTAAATGGATTTTTGAAGCAGGGAGTCGGAATAGGAGCTTGCTCCGTCCACTCCACGCATCGACCTGGTATTGCAGTACTTCCAGGAACGGTGCACCTCCC
>CM000371.1/104773987-104774177 Monodelphis domestica chromosome 4, whole genome shotgun sequence.  ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCCGAGGACAATATATTAAATGGATTTTTGAAACAGGGAGTCGGAATAGGAGCTTGCTCCGTCCACTCCACGCATCGACCTGGTATTGCAGTACTTCCAGGAACGGTGCACTTCCC
>BK006936.2/681858-681747 TPA: Saccharomyces cerevisiae S288c chromosome II, complete sequence. ATCTCTTTGCCTTTTGGCTTAGATCAAGTGTAGTATCTGTTCTTTTCAGTGTAACAACTGAAATGACCTCAATGAGGCTCATTACCTTTTAATTTGTTACAATACACATTTT

I want to grep lines from input file2 which has matched from input file1 and count them to get total number of times the line in input file1 occurs in input file2

example of output

Equus caballus 10
Monodelphis domestica 5
Saccharomyces cerevisiae S288c 1

and so on.

I have used this to extract from input2 file matched lines in file1

grep -Fwf input1 input2

How can I count how many times each line in input1 occurs in input2?

4 Answers


Rakesh Sharma 07/13/2018.
perl -lne '
    $h{"$_"}=$h[@h]=$_,next if @ARGV && !exists $h{$_};
    for my $h (@h) { 1+index(s/\h+/ /rg, " $h chromosome ") && $s{$h}++; }
    }{print "$_ $s{$_}" for @h;
' file1 file2

Output:

Equus caballus 10
Monodelphis domestica 5
Saccharomyces cerevisiae S288c 1

Explanation:

  • -n will invoke Perl read file line by line AND no printing unless asked to.
  • -l will make RS = ORS = \n
  • Data structures involved:

    • hash %h will have keys as genes read from file1.
    • array @h will have genes (non-dup) in the order they were encountered while reading from file1.
    • hash %s shall have keys have genes and values as number of times this gene was seen in file2.
  • Working:

    • @ARGV shall have contents of 1 file when reading the first argument(file1) and empty when reading the second argument(file2). Hence, the first line will apply to file only and will populate the hash %h and array @h.
    • The second line will apply to lines read from file2 and will update the hash %s to the number of times a particular gene was found on a given line.
      • index(str, substr) function will return the postion of the substring in the string if found, otw a -1 is returned on failure.
    • The 3rd line will run after the file2 has been read, and the contents of the hash %s are printed based on the order of keys set by the array @h.

Jesse_b 07/11/2018.

You can accomplish this with the following:

grep -Fowf input1 input2 | sort | uniq -c

This will produce the output backwards from the way you want it, but if you need them in that order you can do:

grep -Fowf input1 input2 | sort | uniq -c | awk '{c = $1; $1 = ""; print $0,c}'

steeldriver 07/11/2018.

You could do this using arrays in Awk:

awk '
  NR==FNR {a[$0]==1; next} 
  {for(x in a) c[x] += $0 ~ x} 
  END {for (x in a) print x, c[x]}
' input1 input2
Equus caballus 10
Saccharomyces cerevisiae S288c 1
Monodelphis domestica 5

(It could be done with a single array but IMHO it's cleaner with separate arrays for the index set and the count.)


igal 07/11/2018.

Here's a Python script that does what you want:

#!/usr/bin/env python
"""Count the occurrences of the lines in file1 inside file2."""

import sys

path1 = sys.argv[1]
path2 = sys.argv[2]

counts = dict()
with open(path1, 'r') as file1:
    for line1 in file1.read().splitlines():
        counts[line1] = 0
        with open(path2, 'r') as file2:
            for line2 in file2.read().splitlines():
                if line1 in line2:
                    counts[line1] += 1

for key, value in counts.iteritems():
    print("{}: {}".format(key, value))

You could run it like this:

python count.py file1 file2

On your given example data it gives the following output:

Monodelphis domestica: 5
Saccharomyces cerevisiae S288c: 1
Equus caballus: 10

And here's a Bash script that does the same thing:

#!/bin/bash

file1 = "$1"
file2 = "$2"

while read line; do
    count="$(grep -F "${line}" file2 | wc -l)"; echo "${line}: ${count}";
done < file1

HighResolutionMusic.com - Download Hi-Res Songs

1 The Chainsmokers

Beach House flac

The Chainsmokers. 2018. Writer: Andrew Taggart.
2 (G)I-DLE

POP/STARS flac

(G)I-DLE. 2018. Writer: Riot Music Team;Harloe.
3 Anne-Marie

Rewrite The Stars flac

Anne-Marie. 2018. Writer: Benj Pasek;Justin Paul.
4 Ariana Grande

​Thank U, Next flac

Ariana Grande. 2018. Writer: Crazy Mike;Scootie;Victoria Monét;Tayla Parx;TBHits;Ariana Grande.
5 Clean Bandit

Baby flac

Clean Bandit. 2018. Writer: Jack Patterson;Kamille;Jason Evigan;Matthew Knott;Marina;Luis Fonsi.
6 BTS

Waste It On Me flac

BTS. 2018. Writer: Steve Aoki;Jeff Halavacs;Ryan Ogren;Michael Gazzo;Nate Cyphert;Sean Foreman;RM.
7 Imagine Dragons

Bad Liar flac

Imagine Dragons. 2018. Writer: Jorgen Odegard;Daniel Platzman;Ben McKee;Wayne Sermon;Aja Volkman;Dan Reynolds.
8 Nicki Minaj

No Candle No Light flac

Nicki Minaj. 2018. Writer: Denisia “Blu June” Andrews;Kathryn Ostenberg;Brittany "Chi" Coney;Brian Lee;TJ Routon;Tushar Apte;ZAYN;Nicki Minaj.
9 Brooks

Limbo flac

Brooks. 2018.
10 BlackPink

Kiss And Make Up flac

BlackPink. 2018. Writer: Soke;Kny Factory;Billboard;Chelcee Grimes;Teddy Park;Marc Vincent;Dua Lipa.
11 Diplo

Close To Me flac

Diplo. 2018. Writer: Ellie Goulding;Savan Kotecha;Peter Svensson;Ilya;Swae Lee;Diplo.
12 Rita Ora

Velvet Rope flac

Rita Ora. 2018.
13 Fitz And The Tantrums

HandClap flac

Fitz And The Tantrums. 2017. Writer: Fitz And The Tantrums;Eric Frederic;Sam Hollander.
14 Little Mix

Woman Like Me flac

Little Mix. 2018. Writer: Nicki Minaj;Steve Mac;Ed Sheeran;Jess Glynne.
15 Billie Eilish

When The Party's Over flac

Billie Eilish. 2018. Writer: Billie Eilish;FINNEAS.
16 K-391

Mystery flac

K-391. 2018.
17 Cher Lloyd

None Of My Business flac

Cher Lloyd. 2018. Writer: ​iamBADDLUCK;Alexsej Vlasenko;Kate Morgan;Henrik Meinke;Jonas Kalisch;Jeremy Chacon.
18 Backstreet Boys

Chances flac

Backstreet Boys. 2018.
19 Lil Pump

Arms Around You flac

Lil Pump. 2018. Writer: Rio Santana;Lil Pump;Edgar Barrera;Mally Mall;Jon Fx;Skrillex;Maluma;Swae Lee;XXXTENTACION.
20 Kelly Clarkson

Never Enough flac

Kelly Clarkson. 2018. Writer: Benj Pasek;Justin Paul.

Related questions

Hot questions

Language

Popular Tags