Word frequencies from large body of scraped text
I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.
Here is a snippet from the original file:
1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środkach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.
My goal is to clean true word labels and remove noisy word labels (e.g. collocations of words concatenated through punctuation). This is what is achieved by the first part of the script. As you can see in the data sample above, several noisy entries belong to the same true label. Once cleaned, their frequencies should be added. This is what I try to achieve in the second part of my script.
Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:
# -*- coding: utf-8 -*-
import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1 ,num_batches +1):
    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'
    with io.open(infile_path, 'r', encoding = 'utf8') as infile, \
         io.open(outfile_path, 'w', encoding='utf8') as outfile:
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]
        data = pd.DataFrame({"word": [], "freq": []})
        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]
        freq_dict = dict()
        keys = np.unique(data['word'])
        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]
        for key in freq_dict.keys():
            outfile.write("%s,%s\n" % (key, freq_dict[key]))
The problem with this code is that it is either buggy (running into an infinite loop or something similar) or very slow, even for processing a single batch, to the point of being impractical. Are there ways to streamline this code to make it computationally tractable? In particular, can I achieve the same goal without using for loops, or by using a different data structure than a dictionary for the word-frequency lookup?
python performance dictionary lookup
edited Dec 23 '18 at 1:34
asked Dec 23 '18 at 0:05
Des Grieux

I've added the fixed code in one piece below. Thank you!
– Des Grieux
Dec 23 '18 at 0:22
3 Answers
for i in range(1 ,num_batches +1):
Your inter-token spacing here is a little wonky. I suggest running this code through a linter to get it to be PEP8-compliant.
This string:
r'input_batch_' + str(i) + r'.txt'
can be:
f'input_batch_{i}.txt'
This code:
entries_raw = infile.readlines()
entries_single = [x.strip() for x in entries_raw]
entries = [x.split('\t') for x in entries_single]
can also be simplified, to:
entries = [line.rstrip().split('\t') for line in infile]
Note a few things. You don't need to call readlines(); you can treat the file object itself as an iterator. Also, avoid calling a variable x, even if it's an intermediate variable; use meaningful names.
This is an antipattern inherited from C:
for j in range(len(entries)):
    data.loc[j] = entries[j][1], entries[j][0]
You should instead do:
for j, entry in enumerate(entries):
    data.loc[j] = entry[1], entry[0]
That also applies to your for x in range(len(data)): loop.
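Beyond style, a sketch of what is probably the bigger performance issue in this part: assigning data.loc[j] inside a loop grows the DataFrame one row at a time, which is very slow in pandas. Assuming entries holds (freq, word) pairs as parsed above, the whole frame can be built in one call:
import pandas as pd

# one constructor call instead of len(entries) single-row insertions;
# the file stores the frequency first, then the word
data = pd.DataFrame(entries, columns=['freq', 'word'])
data['freq'] = data['freq'].astype(int)  # frequencies are read as strings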
This:
freq_dict = dict()
should be:
freq_dict = {}
This:
if key in freq_dict:
    prior_freq = freq_dict.get(key)
    freq_dict[key] = prior_freq + data['freq'][x]
else:
    freq_dict[key] = data['freq'][x]
can be simplified to:
prior_freq = freq_dict.get(key)
freq_dict[key] = data['freq'][x]
if prior_freq is not None:
    freq_dict[key] += prior_freq
or even (courtesy @AlexHall):
freq_dict[key] = data['freq'][x] + freq_dict.get(key, 0)
Note a few things. First of all, you were inappropriately using get - either check for key presence and then use [], or use get and then check the return value (which is preferred, as it requires fewer key lookups).
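A further variant, sketched here, avoids the presence check entirely with collections.defaultdict, whose missing keys start at zero. It also makes the outer for key in keys: loop unnecessary, turning a quadratic scan into a single pass (assuming the frequencies have been converted to integers):
from collections import defaultdict

freq_dict = defaultdict(int)  # missing keys default to 0
for x in range(len(data)):
    freq_dict[data['word'][x]] += int(data['freq'][x])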
This loop:
for key in freq_dict.keys():
    outfile.write("%s,%s\n" % (key, freq_dict[key]))
needs adjustment in a few ways. Firstly, it won't run at all because its indentation is wrong. Also, rather than only iterating over keys, you should be iterating over items:
for key, freq in freq_dict.items():
    outfile.write(f'{key},{freq}\n')
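Finally, since the data already lives in pandas, the tallying can be pushed down into vectorized calls instead of Python-level loops; a minimal sketch, assuming tab-separated freq/word rows and no further cleaning of the labels:
import csv
import pandas as pd

# read the whole batch in one call; frequency comes first in the file
data = pd.read_csv(infile_path, sep='\t', header=None,
                   names=['freq', 'word'], quoting=csv.QUOTE_NONE,
                   encoding='utf8')
# sum the frequencies per word and write word,total rows
data.groupby('word')['freq'].sum().to_csv(outfile_path, header=False)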
edited Dec 24 '18 at 1:02
answered Dec 23 '18 at 1:35
Reinderien

The freq_dict code is wrong because it calls .get after assigning to that key. In any case it can be simplified more to freq_dict[key] = data['freq'][x] + freq_dict.get(key, 0).
– Alex Hall
Dec 23 '18 at 19:07
@AlexHall Good eyes. Edited.
– Reinderien
Dec 24 '18 at 1:04
Reinderien covered most of the other issues with your code. But you should know there's a built-in class for simplifying the task of tallying word frequencies:
from collections import Counter
yourListOfWords = [...]
frequencyOfEachWord = Counter(yourListOfWords)
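One caveat for this particular task: the input is (frequency, word) pairs rather than a flat list of words. Counter handles that too, since update() accepts a mapping of counts to add rather than replace; a small sketch:
from collections import Counter

counter = Counter()
counter.update({'środkach': 3})                 # adds 3 to 'środkach'
counter.update({'środkach': 2, 'środkami': 1})  # counts accumulate
print(counter['środkach'])  # 5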
answered Dec 23 '18 at 1:58
AleksandrH
To expand on the answer by @AleksandrH, this is how I would write it using collections.Counter:
import io
from collections import Counter
import regex as re  # the normal re module does not support \p{P}...

def read_file(file_name):
    """Reads a file into a Counter object.

    File contains rows with counts and words.
    Words can be multiple words separated by punctuation or whitespace.
    If that is the case, separate them.
    """
    counter = Counter()
    with io.open(file_name, 'r', encoding='utf8') as infile:
        for line in infile:
            if not line:
                continue
            freq, words = line.strip().split('\t')  # need to omit '\t' when testing, because SO replaces tabs with whitespace
            # split on punctuation and whitespace
            words = re.split(r'\p{P}|\s', words)
            # update all words
            for word in filter(None, words):  # filter out empty strings
                counter[word] += int(freq)
    return counter

def write_file(file_name, counter):
    with io.open(file_name, 'w', encoding='utf8') as outfile:
        outfile.writelines(f'{word},{freq}\n' for word, freq in counter.most_common())  # use `items` if order does not matter

if __name__ == "__main__":
    num_batches = 54
    for i in range(1, num_batches + 1):
        counter = read_file(f"input_batch_{i}.txt")
        write_file(f"output_batch_{i}.txt", counter)
This also has (the start of) a docstring describing what the read_file function does, functions in the first place in order to separate concerns, and an if __name__ == "__main__": guard to allow importing from this script without the main code running.
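Note that regex is a third-party package (pip install regex). If only the standard library is available, a rough stand-in for \p{P} is an explicit character class; a sketch covering ASCII punctuation plus the typographic quotes visible in the sample:
import re
import string

# approximate \p{P} with ASCII punctuation plus quote marks seen in the data
PUNCTUATION = string.punctuation + '„”“"«»…˝'
SPLITTER = re.compile('[' + re.escape(PUNCTUATION) + r']|\s')

print(SPLITTER.split('„środkach”'))  # ['', 'środkach', '']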
answered Dec 23 '18 at 11:12
Graipher