Word frequencies from large body of scraped text
I have a file with word frequency logs from a very messy corpus of scraped Polish text that I am trying to clean to get accurate word frequencies. Since this is a big text file, I divided it into batches.
Here is a snippet from the original file:
1 środka(byłe
1 środka.było
1 środkacccxli.
1 (środkach)
1 „środkach”
1 środkach
1 środkach...
1 środkach.",
1 środkach"
1 środkach".
1 środkachwzorem
1 środkach.życie
1 środkajak
1 "środkami"
1 (środkami)
1 „środkami”)
1 środkami!"
1 środkami”
1 środkami)?
1 środkami˝.
My goal is to clean true word labels and remove noisy word labels (e.g. collocations of words concatenated through punctuation). This is what is achieved by the first part of the script. As you can see in the data sample above, several noisy entries belong to the same true label. Once cleaned, their frequencies should be added. This is what I try to achieve in the second part of my script.
Here is the code in one piece with fixed indentation, in case you are able to reproduce my issues on your end:
# -*- coding: utf-8 -*-
import io
import pandas as pd
import numpy as np

num_batches = 54

for i in range(1 ,num_batches +1):
    infile_path = r'input_batch_' + str(i) + r'.txt'
    outfile_path = r'output_batch_' + str(i) + r'.txt'
    with io.open(infile_path, 'r', encoding = 'utf8') as infile, \
         io.open(outfile_path, 'w', encoding='utf8') as outfile:
        entries_raw = infile.readlines()
        entries_single = [x.strip() for x in entries_raw]
        entries = [x.split('\t') for x in entries_single]
        data = pd.DataFrame({"word": [], "freq": []})
        for j in range(len(entries)):
            data.loc[j] = entries[j][1], entries[j][0]
        freq_dict = dict()
        keys = np.unique(data['word'])
        for key in keys:
            for x in range(len(data)):
                if data['word'][x] == key:
                    if key in freq_dict:
                        prior_freq = freq_dict.get(key)
                        freq_dict[key] = prior_freq + data['freq'][x]
                    else:
                        freq_dict[key] = data['freq'][x]
        for key in freq_dict.keys():
            outfile.write("%s,%s\n" % (key, freq_dict[key]))
The problem with this code is that it is either buggy (running into an infinite loop or something similar) or very slow, even for processing a single batch, to the point of being impractical. Are there ways to streamline this code to make it computationally tractable? In particular, can I achieve the same goal without using for loops, or by using a different data structure than a dictionary for the word-frequency lookup?
python performance dictionary lookup
edited Dec 23 '18 at 1:34
asked Dec 23 '18 at 0:05
Des Grieux

I've added the fixed code in one piece below. Thank you!
– Des Grieux
Dec 23 '18 at 0:22
3 Answers
for i in range(1 ,num_batches +1):
Your inter-token spacing here is a little wonky. I suggest running this code through a linter to get it to be PEP8-compliant.
This string:
r'input_batch_' + str(i) + r'.txt'
can be:
f'input_batch_{i}.txt'
This code:
entries_raw = infile.readlines()
entries_single = [x.strip() for x in entries_raw]
entries = [x.split('\t') for x in entries_single]
can also be simplified, to:
entries = [line.rstrip().split('\t') for line in infile]
Note a few things. You don't need to call readlines(); you can treat the file object itself as an iterator. Also, avoid calling a variable x, even if it's an intermediate variable; use meaningful names.
This is an antipattern inherited from C:
for j in range(len(entries)):
    data.loc[j] = entries[j][1], entries[j][0]
You should instead do:
for j, entry in enumerate(entries):
    data.loc[j] = entry[1], entry[0]
That also applies to your for x in range(len(data)): loop.
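Beyond style, a sketch of what is probably the bigger performance issue in this part: assigning data.loc[j] inside a loop grows the DataFrame one row at a time, which is very slow in pandas. Assuming entries holds (freq, word) pairs as parsed above, the whole frame can be built in one call:
import pandas as pd

# one constructor call instead of len(entries) single-row insertions;
# the file stores the frequency first, then the word
data = pd.DataFrame(entries, columns=['freq', 'word'])
data['freq'] = data['freq'].astype(int)  # frequencies are read as strings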
This:
freq_dict = dict()
should be:
freq_dict = {}
This:
if key in freq_dict:
    prior_freq = freq_dict.get(key)
    freq_dict[key] = prior_freq + data['freq'][x]
else:
    freq_dict[key] = data['freq'][x]
can be simplified to:
prior_freq = freq_dict.get(key)
freq_dict[key] = data['freq'][x]
if prior_freq is not None:
    freq_dict[key] += prior_freq
or even (courtesy @AlexHall):
freq_dict[key] = data['freq'][x] + freq_dict.get(key, 0)
Note a few things. First of all, you were inappropriately using get - either check for key presence and then use [], or use get and then check the return value (which is preferred, as it requires fewer key lookups).
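A further variant, sketched here, avoids the presence check entirely with collections.defaultdict, whose missing keys start at zero. It also makes the outer for key in keys: loop unnecessary, turning a quadratic scan into a single pass (assuming the frequencies have been converted to integers):
from collections import defaultdict

freq_dict = defaultdict(int)  # missing keys default to 0
for x in range(len(data)):
    freq_dict[data['word'][x]] += int(data['freq'][x])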
This loop:
for key in freq_dict.keys():
    outfile.write("%s,%s\n" % (key, freq_dict[key]))
needs adjustment in a few ways. Firstly, it won't run at all because its indentation is wrong. Also, rather than only iterating over keys, you should be iterating over items:
for key, freq in freq_dict.items():
    outfile.write(f'{key},{freq}\n')
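Finally, since the data already lives in pandas, the tallying can be pushed down into vectorized calls instead of Python-level loops; a minimal sketch, assuming tab-separated freq/word rows and no further cleaning of the labels:
import csv
import pandas as pd

# read the whole batch in one call; frequency comes first in the file
data = pd.read_csv(infile_path, sep='\t', header=None,
                   names=['freq', 'word'], quoting=csv.QUOTE_NONE,
                   encoding='utf8')
# sum the frequencies per word and write word,total rows
data.groupby('word')['freq'].sum().to_csv(outfile_path, header=False)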
edited Dec 24 '18 at 1:02
answered Dec 23 '18 at 1:35
Reinderien

The freq_dict code is wrong because it calls .get after assigning to that key. In any case it can be simplified more to freq_dict[key] = data['freq'][x] + freq_dict.get(key, 0).
– Alex Hall
Dec 23 '18 at 19:07
@AlexHall Good eyes. Edited.
– Reinderien
Dec 24 '18 at 1:04
Reinderien covered most of the other issues with your code. But you should know there's a built-in class for simplifying the task of tallying word frequencies:
from collections import Counter
yourListOfWords = [...]
frequencyOfEachWord = Counter(yourListOfWords)
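One caveat for this particular task: the input is (frequency, word) pairs rather than a flat list of words. Counter handles that too, since update() accepts a mapping of counts to add rather than replace; a small sketch:
from collections import Counter

counter = Counter()
counter.update({'środkach': 3})                 # adds 3 to 'środkach'
counter.update({'środkach': 2, 'środkami': 1})  # counts accumulate
print(counter['środkach'])  # 5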
answered Dec 23 '18 at 1:58
AleksandrH
To expand on the answer by @AleksandrH, this is how I would write it using collections.Counter:
import io
from collections import Counter
import regex as re  # the normal re module does not support \p{P}...

def read_file(file_name):
    """Reads a file into a Counter object.

    File contains rows with counts and words.
    Words can be multiple words separated by punctuation or whitespace.
    If that is the case, separate them.
    """
    counter = Counter()
    with io.open(file_name, 'r', encoding='utf8') as infile:
        for line in infile:
            if not line:
                continue
            freq, words = line.strip().split('\t')  # need to omit '\t' when testing, because SO replaces tabs with whitespace
            # split on punctuation and whitespace
            words = re.split(r'\p{P}|\s', words)
            # update all words
            for word in filter(None, words):  # filter out empty strings
                counter[word] += int(freq)
    return counter

def write_file(file_name, counter):
    with io.open(file_name, 'w', encoding='utf8') as outfile:
        outfile.writelines(f'{word},{freq}\n' for word, freq in counter.most_common())  # use `items` if order does not matter

if __name__ == "__main__":
    num_batches = 54
    for i in range(1, num_batches + 1):
        counter = read_file(f"input_batch_{i}.txt")
        write_file(f"output_batch_{i}.txt", counter)
This also has (the start of) a docstring describing what the read_file function does, functions in the first place in order to separate concerns, and an if __name__ == "__main__": guard to allow importing from this script without the main code running.
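Note that regex is a third-party package (pip install regex). If only the standard library is available, a rough stand-in for \p{P} is an explicit character class; a sketch covering ASCII punctuation plus the typographic quotes visible in the sample:
import re
import string

# approximate \p{P} with ASCII punctuation plus quote marks seen in the data
PUNCTUATION = string.punctuation + '„”“"«»…˝'
SPLITTER = re.compile('[' + re.escape(PUNCTUATION) + r']|\s')

print(SPLITTER.split('„środkach”'))  # ['', 'środkach', '']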
answered Dec 23 '18 at 11:12
Graipher